Transient Fault Tolerance via Dynamic Process Redundancy

Alex Shye, Vijay Janapa Reddi, Tipp Moseley, and Daniel A. Connors
Proceedings of the Workshop on Binary Instrumentation and Applications (WBIA). October, 2006.
Transient faults are emerging as a critical concern in the reliability of microprocessors. While hardware reliability techniques are often employed for transient fault tolerance, software techniques represent a more cost-effective and flexible alternative. This paper proposes a software approach to transient fault tolerance that uses a run-time system to automatically apply process-level redundancy (PLR). PLR creates a set of redundant processes per application process and compares the processes at run time to guarantee correct execution. Redundancy at the process level allows the operating system to freely schedule the processes across all available hardware resources (extra threads or cores). PLR is a software-centric approach in which the focus shifts from ensuring correct hardware execution to ensuring correct software execution. The software-centric approach is able to ignore many benign faults that do not propagate to affect the program output. In addition, the dynamic deployment creates a very flexible fault-tolerant system that transparently applies PLR to any program, without prior modifications to the application, shared libraries, or operating system. Experiments using a PLR prototype demonstrate that PLR can effectively provide fault tolerance with a slowdown of only 1.26x.
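
The core idea of process-level redundancy can be illustrated with a minimal sketch. This is not the PLR prototype described in the paper: it assumes a deterministic program, compares only the final standard output and exit code rather than comparing processes during execution, and the function name `run_redundant` and its parameters are hypothetical names chosen for illustration.

```python
import subprocess
import sys
from collections import Counter


def run_redundant(cmd, replicas=3, timeout=10.0):
    """Run `cmd` in several independent processes and majority-vote the result.

    Hypothetical helper for illustration only; it compares final output,
    not intermediate state, and assumes `cmd` is deterministic.
    """
    # Start every replica before waiting on any of them, so the operating
    # system is free to schedule them across the available cores.
    procs = [
        subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL)
        for _ in range(replicas)
    ]

    results = []
    for proc in procs:
        try:
            out, _ = proc.communicate(timeout=timeout)
            results.append((proc.returncode, out))
        except subprocess.TimeoutExpired:
            # A hung replica is treated as faulty and casts no vote.
            proc.kill()
            proc.communicate()
            results.append(None)

    votes = Counter(r for r in results if r is not None)
    if not votes:
        raise RuntimeError("all replicas timed out")

    winner, count = votes.most_common(1)[0]
    if count <= replicas // 2:
        raise RuntimeError("no majority among replicas")

    # Return the agreed (exit code, stdout) pair and the number of replicas
    # that disagreed with it or hung.
    return winner, replicas - count


if __name__ == "__main__":
    (code, out), faulty = run_redundant([sys.executable, "-c", "print(6 * 7)"])
    print(code, out.strip(), faulty)
```

With three replicas, a single replica that produces a different output or hangs is outvoted by the other two. Because only the externally visible result is compared, a fault that never changes the output goes unnoticed, which mirrors the software-centric view of ignoring benign faults.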
