Transient Fault Tolerance via Dynamic Process Redundancy
Alex Shye, Vijay Janapa Reddi, Tipp Moseley, and Daniel A. Connors
Proceedings of the Workshop on Binary Instrumentation and Applications (WBIA),
October 2006.

Transient faults are emerging as a critical concern in the reliability of
microprocessors. While hardware reliability techniques are often employed
for transient fault tolerance, software techniques represent a more
cost-effective and flexible alternative. This paper proposes a software
approach to transient fault tolerance which utilizes a run-time system
to automatically apply process-level redundancy (PLR). PLR creates
a set of redundant processes per application process and compares
the processes at run time to guarantee correct execution. Redundancy at the process level allows
the operating system to freely schedule the processes across all
available hardware resources (extra threads or cores). PLR is a
software-centric approach in which the focus shifts from ensuring correct
hardware execution to ensuring correct software execution.
The software-centric approach is able to ignore many benign faults
that do not propagate to affect the program output. In addition,
the dynamic deployment creates a very flexible fault-tolerant system
that transparently applies PLR to any program, without prior modifications
to the application, shared libraries, or operating system. Experiments using
a PLR prototype demonstrate that PLR can effectively provide fault
tolerance with a slowdown of only 1.26x.
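
The core idea of process-level redundancy can be illustrated with a minimal sketch. This is not the authors' run-time system: the abstract describes comparison during execution, whereas this sketch compares only the final exit code and standard output of each replica. The function name `run_redundant`, the default of three replicas, and the majority-vote policy are all assumptions made for illustration.

```python
import subprocess
import sys
from collections import Counter


def run_redundant(argv, replicas=3):
    """Run `argv` as several independent processes and vote on the result.

    Hypothetical helper: replica count and majority voting are assumptions,
    not details taken from the paper.
    """
    # Start every replica before waiting on any of them, so the operating
    # system is free to schedule them across available threads or cores.
    procs = [
        subprocess.Popen(argv, stdout=subprocess.PIPE)
        for _ in range(replicas)
    ]

    # Treat (exit code, stdout bytes) as each replica's observable output.
    results = []
    for proc in procs:
        out, _ = proc.communicate()
        results.append((proc.returncode, out))

    # Accept a result only if a strict majority of replicas agree on it.
    winner, votes = Counter(results).most_common(1)[0]
    if votes <= replicas // 2:
        raise RuntimeError("no majority among replicas")
    return winner, votes


if __name__ == "__main__":
    (code, out), votes = run_redundant([sys.executable, "-c", "print(6 * 7)"])
    print(code, out.strip(), votes)
```

Because only the externally visible output is compared, a fault that corrupts internal state without changing what a replica prints or returns goes unnoticed in this sketch, which mirrors the software-centric view of ignoring benign faults.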