PLR: A Software Approach to Transient Fault Tolerance for Multi-Core Architectures
Alex Shye, Joseph Blomstedt, Tipp Moseley, Vijay Janapa Reddi, Daniel A. Connors.
IEEE Transactions on Dependable and Secure Computing (TDSC)
December 2008.

Transient faults are emerging as a critical concern in the
reliability of general-purpose microprocessors. As architectural
trends point towards multi-core designs, there is substantial interest
in adapting such parallel hardware resources for transient
fault tolerance. This paper presents process-level redundancy (PLR),
a software technique for transient fault tolerance which
leverages multiple cores for low overhead. PLR creates a set of
redundant processes per application process, and systematically
compares the processes to guarantee correct execution. Redundancy at the
process level allows the operating system to freely schedule the
processes across all available hardware resources. PLR uses a
software-centric approach to transient fault tolerance which shifts the
focus from ensuring correct hardware execution to ensuring correct
software execution. As a result, many benign faults that do not
propagate to affect program correctness can be safely ignored. A real
prototype is presented that is designed to be transparent to the
application and can run on general-purpose single-threaded programs
without modifications to the program, operating system, or underlying
hardware. The system is evaluated for fault coverage and performance on
a 4-way SMP machine, and provides improved performance over
existing software transient fault tolerance techniques with a
16.9% overhead for fault detection on a set of optimized
SPEC2000 binaries.
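
The core idea of running redundant copies of a process and comparing them can be illustrated with a minimal Python sketch. This is not the PLR prototype: the abstract does not say where or how PLR compares its processes, so the sketch assumes the simplest possible check, comparing only each replica's exit status and standard output after it finishes. The function name `run_redundant` and its `replicas` parameter are hypothetical.

```python
import subprocess
import sys
from collections import Counter


def run_redundant(cmd, replicas=3):
    """Run `cmd` in `replicas` independent processes and vote on the result.

    Assumption: only (exit status, stdout) is compared, and only at exit.
    Returns (winning_result, mismatch_detected).
    """
    # Start every replica before waiting on any of them, so the operating
    # system is free to schedule the processes on different cores.
    procs = [
        subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL)
        for _ in range(replicas)
    ]

    # Collect each replica's observable behaviour.
    results = []
    for proc in procs:
        out, _ = proc.communicate()
        results.append((proc.returncode, out))

    # Majority vote: a replica that disagrees with the majority is treated
    # as having been hit by a fault that affected software-visible output.
    winner, votes = Counter(results).most_common(1)[0]
    if votes <= replicas // 2:
        raise RuntimeError("no majority among replicas; cannot pick a result")
    return winner, votes != replicas


if __name__ == "__main__":
    result, mismatch = run_redundant([sys.executable, "-c", "print(6 * 7)"])
    print(result, "mismatch detected:", mismatch)
```

Because the comparison looks only at software-visible output, a fault that never changes that output goes unnoticed, which matches the abstract's point that benign faults can be safely ignored. In this sketch, two replicas would suffice to detect a disagreement, while three allow a majority to be chosen.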