Characterizing the Use of Program Vulnerability Factors for Studying Transient Fault Tolerance in Multi-core Architectures

Robert Kost, Daniel Connors, Sudeep Pasricha.
Proceedings of the 2009 International Conference on Dependable Systems and Networks (DSN) Workshop on Compiler and Architectural Techniques for Application Reliability and Security (CATARS), June 2009.
Semiconductor transient faults (soft errors) are a critical design concern in the reliability of computer systems. Most recent architecture research focuses on using performance models to provide Architecture Vulnerability Factor (AVF) estimates of processor reliability rather than deploying detailed fault injection into hardware RTL models. While AVF analysis supports the investigation of new fault-tolerant architecture techniques, program execution characteristics are largely missing from the determination of periods of soft-error susceptibility. The primary problem with AVF is that software periods of vulnerability differ substantially from micro-architecture periods of vulnerability. As research trends dictate finding ways to selectively enable software-based transient fault-tolerant mechanisms, run-time and off-line experimental techniques must be guided equally by program behavior and hardware. To address issues with AVF as well as the efficiency of fault injection studies, we examine elements of Program Vulnerability Factor (PVF) in the context of multi-core architectures. PVF was previously introduced to consider program behavior in the form of memory and register vulnerability; here, we explore static and profile-based techniques for extending that work. By leveraging PVF, we make three initial contributions to the area of computer architecture research. First, we demonstrate that a more efficient fault injection campaign can be constructed and that the outcome of fault injections in application execution can be accurately predicted. Second, compiler optimizations can be applied to better understand how the compiler affects fault susceptibility and program behavior. Finally, we motivate the need for developing a PVF metric for program data that is communicated between cores.
