Using Process-Level Redundancy to Exploit Multiple Cores for Transient Fault Tolerance
Alex Shye, Tipp Moseley, Vijay Janapa Reddi, Joseph Blomstedt, Daniel A. Connors.
Proceedings of the 2007 International Conference on Dependable Systems and Networks (DSN).
June 2007.
|
Transient faults are emerging as a critical concern in the reliability of
general-purpose microprocessors. As architectural trends point towards
multi-threaded multi-core designs, there is substantial interest in adapting
such parallel hardware resources for transient fault tolerance. This paper
proposes a software-based multi-core alternative for transient fault tolerance
using process-level redundancy (PLR). PLR creates a set of redundant
processes per application process and systematically compares the processes to
guarantee correct execution. Redundancy at the process level allows the
operating system to freely schedule the processes across all available hardware
resources. PLR's software-centric approach to transient fault
tolerance shifts the focus from ensuring correct hardware execution to ensuring
correct software execution. As a result, PLR ignores many benign faults that
do not propagate to affect program correctness. A real PLR prototype for
running single-threaded applications is presented and evaluated for fault
coverage and performance. On a 4-way SMP machine, PLR provides improved
performance over existing software transient fault tolerance techniques with
16.9% overhead for fault detection on a set of optimized SPEC2000
binaries.
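
The core idea of running redundant processes and comparing them can be illustrated with a minimal sketch. This is not the PLR prototype described in the paper: it compares only each replica's exit code and standard output at program exit, and the majority vote is an added assumption, since the abstract itself only reports fault detection. The names `run_redundant` and `replicas` are hypothetical and chosen for illustration.

```python
import subprocess
import sys
from collections import Counter


def run_redundant(cmd, replicas=3):
    """Run `cmd` in `replicas` independent processes and compare the results.

    Returns ((exit_code, stdout_bytes), mismatch_detected).
    Hypothetical helper for illustration; not the paper's implementation.
    """
    # Start every replica before waiting on any of them, so the operating
    # system is free to schedule them across the available cores.
    procs = [
        subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL)
        for _ in range(replicas)
    ]

    # The observable result of each replica: exit code plus bytes on stdout.
    results = []
    for proc in procs:
        out, _ = proc.communicate()
        results.append((proc.returncode, out))

    # Any disagreement among replicas counts as a detected fault.
    winner, votes = Counter(results).most_common(1)[0]
    mismatch_detected = votes < replicas

    # Assumption: a strict majority is needed to trust the winning result.
    if votes <= replicas // 2:
        raise RuntimeError("no majority among replicas")
    return winner, mismatch_detected


if __name__ == "__main__":
    command = [sys.executable, "-c", "print(sum(range(10)))"]
    (exit_code, output), mismatch = run_redundant(command)
    print(exit_code, output, mismatch)
```

Because the comparison looks only at externally visible output, a fault that corrupts internal state without changing that output goes unnoticed, which mirrors the abstract's point that benign faults not affecting program correctness can be ignored.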