Run-time Compilation, Profiling, and Optimization
Dynamic code transformation systems have the potential to impact the design and use of modern computer systems, since they can provide a number of services at run-time, such as instrumentation, optimization, translation, and security. These systems have an inherent advantage over static techniques: they can collect run-time execution characteristics and use them to adapt the execution of the target application. However, since the execution of current run-time systems is generally interleaved with the execution of the application, there is a substantial overhead penalty. In the context of performance analysis and program behavior tools, the overhead of a dynamic code transformation system may be acceptable. Nonetheless, the execution penalty is a major barrier to deploying run-time code transformation in computing systems, particularly in large-scale application environments. Prior work in run-time optimization has concentrated on the use of a single processor core to provide profiling, analysis, and optimization of an application. The DRACO group investigates the use of multi-core resources to perform concurrent profiling and optimization. These techniques overcome traditional barriers to applying run-time compilation techniques.
[ Run-time ]

Adaptive Multithreading
To date, methods of deploying multithreaded processor designs have fallen into three general domains of optimization: thread affinity scheduling, dedicated architecture techniques, and cooperative compilation techniques. These techniques have largely been directed using prior knowledge of program execution behavior. For instance, multithreaded machines benefit from job scheduling, the process of simultaneously scheduling application threads to share the same processor resources. Thread affinity is obtained when threads do not hurt the performance of other concurrent threads and are able to benefit from other threads' execution. As such, changes in program input and workload behavior have not been widely considered in proposed designs. Rather than adapting the system to the scenario, current designs attempt to encompass all possible scenarios in a rigid processor system. Dynamic program optimization techniques have been shown to dramatically impact system performance by integrating run-time execution behavior with run-time optimization software technology to make dynamic program transformations. We propose the Adaptive Multithreading (AMT) model of managing shared multithreading resources, which enables adaptive threads, rather than static threads, that change execution behavior to improve the efficiency of the multithreaded system. In this context, strategic compiler technology and run-time system initiatives are proposed to investigate run-time thread activation, optimization, and management.
[ AMT ]

Transient Fault Tolerance and Scientific Computing Fault Tolerance
Transient faults are emerging as a critical concern in the reliability of general-purpose computer systems. While hardware redundancy techniques may be effective, software approaches provide a more flexible and low-cost alternative. Our research investigates a transparent, replica-based, software-only system that leverages multiple cores for transient fault tolerance. The system improves the model of process replicas to maintain the semantics that users expect in general-purpose systems. Our approach augments the implementation of process replicas with a run-time optimizer (RTO) that provides instruction-level control over the running application. With the fine-grained control of an RTO, novel software-only techniques for deterministically handling asynchronous interrupts, shared memory access, and device I/O can be created. In addition, the system supports both active and passive replicas, including run-time switching between the two replica modes. Initial results show the trade-off between system load, performance, and fault coverage when switching between active and passive replicas.
[ Fault-Tolerance ]

FPGA Architecture Design and Optimization
We investigate hardware/software partitioning on Xilinx Virtex-II Pro devices for application-specific multiprocessors and reconfigurable computing. We are currently investigating the use of hybrid (processor in FPGA) systems for spacecraft fault tolerance, efficient pattern search, and video processing. Our interest is in accelerating automated video analysis algorithms that aim to extract, efficiently describe, and organize information (at run-time) regarding the state or state transition of individuals (identity, emotional state, activity, position and pose, etc.), interactions between individuals (dialogue, gestures, engagement in collaborative or competitive activities such as sports), physical characteristics of humans (anthropometric characteristics, 3D head/body models), and so forth.
[ FPGA ]

Multi-threaded Multi-core Simulation Acceleration Techniques
Detailed cycle-accurate simulation remains a vital component of the processor design process. However, with the increasing complexity of modern processors and application workloads, full detailed simulation is prohibitively slow, often taking several months of simulation time. Sampled simulation seeks to overcome this problem by simulating in detail only a very small but representative subset of the overall execution. Two popular sampling techniques that have been shown to be effective and accurate are phase-based simulation and small-sample simulation. However, both of these techniques are derived using the same benchmark suite and promote the same sampling method for every application being studied. To achieve the most efficient and accurate simulation acceleration, a sampling-based simulation technique must instead adapt to the unique characteristics of the individual application being simulated. DRACO investigates simulation techniques coupled with hardware performance counters in real hardware to get accurate views of the co-phase execution behavior of multiple simultaneous application threads in a processor system.
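The idea behind small-sample simulation can be illustrated with a minimal sketch: a handful of short intervals are simulated in detail, and the whole-program CPI is estimated from their mean together with a confidence interval. The function name `estimate_cpi`, the sample values, and the normal-approximation factor `z` are assumptions made for this illustration; they are not taken from DRACO's tools.

```python
import math
import statistics


def estimate_cpi(sample_cpis, z=1.96):
    """Estimate overall CPI from a few detailed-simulation samples.

    sample_cpis: CPI measured in each sampled interval (hypothetical data).
    z: normal-approximation factor; 1.96 corresponds to roughly 95%.
    Returns (mean CPI, half-width of the confidence interval).
    """
    if len(sample_cpis) < 2:
        raise ValueError("need at least two samples to estimate variance")
    mean = statistics.mean(sample_cpis)
    # Standard error of the mean, using the sample standard deviation.
    stderr = statistics.stdev(sample_cpis) / math.sqrt(len(sample_cpis))
    return mean, z * stderr


if __name__ == "__main__":
    mean, half_width = estimate_cpi([1.0, 1.2, 0.8, 1.0])
    print(f"CPI is about {mean:.2f} +/- {half_width:.2f}")
```

An application whose samples vary widely produces a wide interval, which is one signal that it needs more samples or a different sampling method than an application with stable behavior.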
[ Fast-Sim ]

Dynamic Load-Scheduling in Multi-processor Clusters
Dynamic load balancing applied to scientific applications is generally performed through a measurable run-time quantity. An alternative approach is to use a physical property that is produced by or required of a scientific model and that directly or indirectly results in the desired representation of the reality that the model portrays. This is the definition of a model-based load index (MBLI); examples include the mass of an atom in a molecular dynamics (MD) code or rainfall amounts in a climate simulation. Where an MBLI exists for a particular application, it can be used at run-time with less overhead and higher accuracy than a measured run-time quantity. DRACO investigates DMBLI (Dynamic MBLI), which dynamically gathers model data to balance the load-scheduling algorithms found in several high-performance scientific applications. These programs include: LAMMPS - a molecular dynamics simulation designed to model ensembles of particles in a liquid, solid, or gaseous state, which can model atomic, polymeric, biological, metallic, or granular systems using a variety of force fields and boundary conditions; Community Climate System Model and Community Atmosphere Model (CCSM/CAM3) - a global climate model that provides state-of-the-art computer simulations of the Earth's past, present, and future climate states; BLAST - an analysis system that compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches; and NAMD2 - a molecular dynamics code. We evaluate the results on IBM Power5, IBM BlueGene/L, and Pentium 4 Xeon systems.
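As a rough illustration of balancing on a model property instead of a measured time, the sketch below assigns work units to workers using a per-unit model quantity as the load index. The choice of "atoms per spatial cell" as the index, the greedy largest-first heuristic, and the name `balance_by_model_index` are all assumptions made for this example; the DMBLI algorithms themselves are not described here.

```python
import heapq


def balance_by_model_index(cell_loads, num_workers):
    """Greedy largest-first assignment driven by a model-based load index.

    cell_loads: maps a work unit (e.g. a spatial cell) to its model-derived
                load, such as the number of atoms it holds (hypothetical).
    Returns (assignment, totals), both keyed by worker id.
    """
    # Min-heap of (accumulated load, worker id): the lightest worker pops first.
    heap = [(0.0, w) for w in range(num_workers)]
    heapq.heapify(heap)
    assignment = {w: [] for w in range(num_workers)}

    # Place the heaviest cells first so small cells can even out the totals.
    for cell, load in sorted(cell_loads.items(), key=lambda kv: kv[1], reverse=True):
        total, worker = heapq.heappop(heap)
        assignment[worker].append(cell)
        heapq.heappush(heap, (total + load, worker))

    totals = {w: sum(cell_loads[c] for c in cells) for w, cells in assignment.items()}
    return assignment, totals


if __name__ == "__main__":
    atoms_per_cell = {"a": 8, "b": 7, "c": 6, "d": 5, "e": 4}
    assignment, totals = balance_by_model_index(atoms_per_cell, num_workers=2)
    print(assignment, totals)
```

Because the index comes from the model's own data, no timing instrumentation is needed to compute it; whether it tracks real cost well enough depends on the application.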
[ Load-Balance ]

Compiler-Guided Memory System Prefetching
The growing disparity between processor speed and memory latency has made the cache miss penalty the primary obstacle to achieving higher levels of processor performance. Prefetching mechanisms can be used to address this problem, but traditional approaches attempt to predict future memory references based upon regularity in the memory reference stream. Several techniques have been proposed to prefetch dynamic data structures, although many of these approaches do not successfully hide memory latency. The DRACO group proposes Compiler-Directed Content-Aware Prefetching (CDCAP), a novel integrated compiler and hardware approach to prefetching dynamic data structures. The approach coordinates compiler-directed prefetch instructions with an intelligent hardware prefetching engine (HPE) resident in the lower levels of the memory system. The inserted prefetches carry information about the static attributes of the data structure, which is communicated to the HPE. The compiler-inserted hints can invoke the HPE to generate prefetches based on the current program state and the contents of the prefetched cache lines. The technique overcomes the shortcomings of software-only techniques: it eliminates the need to transform the data structure, avoids excessive prefetch instructions, and does not require prior execution of the traversed data structure path.
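The interplay between a compiler hint and a content-aware engine can be sketched in software. In the toy model below, the hint is the byte offset of the `next` pointer within a list node, and the engine reads that pointer out of each fetched line to decide what to prefetch next. The memory layout, the 8-byte little-endian pointer format, and the name `hpe_prefetch_chain` are assumptions for illustration only and do not describe the actual CDCAP hardware.

```python
import struct


def hpe_prefetch_chain(memory, start_addr, next_offset, depth):
    """Toy model of a hint-driven, content-aware prefetch engine.

    memory: maps a line address to its bytes (hypothetical simulated memory).
    next_offset: compiler-supplied hint, the byte offset of the pointer field.
    depth: maximum number of prefetches to issue along the chain.
    Returns the list of addresses the engine would prefetch.
    """
    issued = []
    addr = start_addr
    for _ in range(depth):
        line = memory.get(addr)
        if line is None:
            break  # line not present in this toy memory
        # Read the pointer from the fetched line's contents.
        (next_addr,) = struct.unpack_from("<Q", line, next_offset)
        if next_addr == 0:
            break  # null pointer marks the end of the structure
        issued.append(next_addr)
        addr = next_addr
    return issued


if __name__ == "__main__":
    # Three list nodes, each laid out as (value: u64, next: u64).
    memory = {
        0x1000: struct.pack("<QQ", 1, 0x2000),
        0x2000: struct.pack("<QQ", 2, 0x3000),
        0x3000: struct.pack("<QQ", 3, 0),
    }
    print([hex(a) for a in hpe_prefetch_chain(memory, 0x1000, next_offset=8, depth=4)])
```

A single hint drives the whole chain in this model, which mirrors the point that the program does not need one prefetch instruction per node.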
[ PREFETCH ]