System adaptivity and fault-tolerance in NoC-based MPSoCs: The MADNESS project approach

Meloni, Paolo; Tuveri, Giuseppe; Raffo, Luigi; Cannella, E; Stefanov, T; Derin, O; Fiorin, L; Sami, M.

doi:10.1109/DSD.2012.122

Modern embedded systems increasingly require adaptive run-time management. A system may adapt the mapping of its applications to accommodate current workload conditions, balance load for efficient resource utilization, meet quality-of-service agreements, avoid thermal hot-spots, and reduce power consumption. As run-time faults become an increasingly relevant concern with deep-sub-micron technology nodes, within the MADNESS project we focus particularly on graceful degradation through dynamic remapping in the presence of such faults. In this paper, we summarize the major results achieved so far in the MADNESS project regarding system adaptivity and fault-tolerant processing. We report the first results of integrating platform-level and middleware-level support for adaptivity and fault tolerance. A case study demonstrates the survivability of the system via a low-overhead process migration mechanism and a near-optimal online remapping heuristic.