|
|
Run-time Compilation, Profiling, and Optimization
|
|
Dynamic code transformation systems have the potential to impact the
design and use of modern computer systems since they can provide a
number of services at run-time, such as instrumentation, optimization,
translation and security. These systems have an inherent advantage
over static techniques, as they can collect and exploit run-time
execution characteristics. The information collected can be used to
adapt the execution of the target application. However, since the
execution of current run-time systems is generally interleaved with the
execution of the application, there is a substantial overhead
penalty. In the context of performance analysis and program behavior
tools, the overhead of a dynamic code transformation system may be
acceptable. Nonetheless, this execution penalty is a major barrier to
deploying run-time code transformation in computing systems,
particularly in large-scale application environments.
Prior work in run-time optimization has concentrated on the use of a
single processor core to provide profiling, analysis, and optimization
of an application.
The DRACO group investigates the use of multi-core resources to perform
concurrent profiling and optimization. These techniques overcome
traditional barriers to applying run-time compilation techniques.
|
[ Run-time ]
|
Adaptive Multithreading
|
|
To date, methods of deploying multithreaded processor designs have
included three general
domains of optimization: thread affinity scheduling, dedicated
architecture techniques, and cooperative compilation techniques.
These techniques have largely been directed using prior knowledge of
program execution behavior. For instance, multithreaded machines benefit from
job scheduling, the process of simultaneously scheduling application
threads to share the same processor resources. Thread affinity is
obtained when threads do not hurt the performance of other concurrent
threads and are able to benefit from other threads' execution. As
such, changes in program input and workload behavior have not been
widely considered in proposed designs. Rather than adapting the
system to the scenario, current designs attempt to encompass all
possible scenarios in a rigid processor system.
Dynamic program optimization techniques have been shown to
dramatically impact system performance by integrating run-time
execution behavior with run-time optimization software technology to
make dynamic program transformations. We propose the Adaptive
Multithreading (AMT) model of managing multithreading shared resources
which enables adaptive threads, rather than static threads, that
change execution behavior to improve multithreaded system efficiency. In
this context, strategic compiler technology and run-time system
initiatives are proposed to investigate run-time thread activation,
optimization, and management.
|
[ AMT ]
|
Transient Fault Tolerance and Scientific Computing Fault Tolerance
|
|
Transient faults are emerging as a critical concern in the reliability
of general-purpose computer systems. While hardware redundancy
techniques may be effective, software approaches provide a more
flexible and low-cost alternative. Our research investigates a
transparent, replica-based, software-only system that leverages
multiple cores for transient fault tolerance. The system improves the
model of process replicas to maintain the semantics that users expect
in general-purpose systems. Our approach augments the implementation
of process replicas with a run-time optimizer (RTO) that is able
to provide instruction-level control over the running application.
With the fine-grained control of an RTO, novel
software-only techniques for deterministically handling asynchronous
interrupts, shared memory access, and device I/O can be created. In
addition, the system
supports both active and passive replicas, including run-time
switching between the different replica modes. Initial results
are presented which show the trade-off between system load,
performance, and fault coverage when switching between active and
passive replicas.
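
As a rough illustration of the replica idea, the sketch below runs the same
computation in several replicas and compares their outputs at a
synchronization point. It is not the DRACO implementation: the majority vote,
the function names (vote, run_replicated, checksum), and the injected
single-bit flip are all assumptions made for this example, and the replicas
run sequentially here rather than on separate cores.

```python
from collections import Counter


def vote(outputs):
    """Return (majority_value, indices_of_disagreeing_replicas)."""
    value, count = Counter(outputs).most_common(1)[0]
    if count <= len(outputs) // 2:
        raise RuntimeError("no majority among replicas")
    faulty = [i for i, out in enumerate(outputs) if out != value]
    return value, faulty


def run_replicated(step, state, n_replicas=3, inject=None):
    """Run `step` once per replica and vote on the integer results.

    `inject` (hypothetical test hook) names one replica whose output
    gets a single bit flipped to mimic a transient fault.
    """
    outputs = []
    for replica in range(n_replicas):
        out = step(state)
        if inject == replica:
            out ^= 1 << 3  # simulated single-bit upset in this replica
        outputs.append(out)
    return vote(outputs)


def checksum(data):
    """Toy deterministic workload: a 32-bit rolling checksum."""
    acc = 0
    for byte in data:
        acc = (acc * 31 + byte) & 0xFFFFFFFF
    return acc


clean_value, clean_faulty = run_replicated(checksum, b"draco")
voted_value, voted_faulty = run_replicated(checksum, b"draco", inject=1)
```

With three replicas, the vote masks the corrupted output and identifies which
replica disagreed; with only two replicas the same comparison can detect a
mismatch but cannot decide which copy is correct.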
|
[ Fault-Tolerance ]
|
FPGA Architecture Design and Optimization
|
|
We investigate the construction of hardware/software partitioning on
Xilinx Virtex-II Pro devices
for application-specific multiprocessors and reconfigurable computing.
We are currently investigating the use of hybrid (processor in FPGA)
systems for spacecraft fault tolerance, efficient pattern search,
and video processing.
Our interest is in accelerating automated video analysis algorithms that aim to
extract, efficiently describe, and organize information (at run-time) regarding the state or state transition of
individuals (identity, emotional state, activity, position and pose, etc.),
interactions between
individuals (dialogue, gestures, engagement into collaborative or
competitive activities like sports), physical characteristics of humans
(anthropometric characteristics, 3D head/body models), and so forth.
|
[ FPGA ]
|
Multi-threaded Multi-core Simulation Acceleration Techniques
|
|
Detailed cycle-accurate simulation remains a vital component of the
processor design process. However, with the increasing complexity of
modern processors and application workloads, full detailed simulation
is prohibitively slow, often taking several months of simulation time.
Sampled simulation seeks to overcome this problem by only simulating
in detail a very small but representative subset of the overall
execution. Two popular sampling techniques that have been shown to be
effective and accurate are phase-based simulation and small-sample
simulation. However, both of these techniques are derived using
the same benchmark suite and promote the same sampling method for
every application being studied. In fact, to achieve the most
efficient and accurate simulation acceleration, a sampling-based
simulation technique must adapt to the unique characteristics of the
individual application being simulated. DRACO investigates simulation
techniques coupled with hardware-performance counters in real
hardware to get accurate views of co-phase execution behavior
of multiple simultaneous application threads in a processor system.
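
The basic small-sample idea can be sketched as follows: measure a short unit
of instructions in detail at a fixed interval and estimate overall CPI from
those units. This is only an illustration under stated assumptions, not a
DRACO tool: the per-instruction CPI trace is synthetic, the names
(systematic_sample_cpi, interval, unit) are hypothetical, and the confidence
interval treats the samples as independent, which real workloads may violate.

```python
import math
import random


def systematic_sample_cpi(cpi_trace, interval, unit):
    """Estimate mean CPI from `unit`-long windows taken every `interval`.

    Returns (estimate, 95% half-width, number_of_samples). The half-width
    uses a normal approximation and assumes independent samples.
    """
    samples = []
    for start in range(0, len(cpi_trace) - unit + 1, interval):
        window = cpi_trace[start:start + unit]
        samples.append(sum(window) / unit)
    n = len(samples)
    mean = sum(samples) / n
    variance = sum((s - mean) ** 2 for s in samples) / (n - 1)
    return mean, 1.96 * math.sqrt(variance / n), n


# Synthetic two-phase workload (assumed values): a low-CPI phase
# followed by a high-CPI, memory-bound phase.
rng = random.Random(0)
trace = [1.0 + rng.uniform(-0.2, 0.2) for _ in range(60_000)]
trace += [2.5 + rng.uniform(-0.2, 0.2) for _ in range(40_000)]

true_cpi = sum(trace) / len(trace)
estimate, half_width, n_samples = systematic_sample_cpi(
    trace, interval=1000, unit=50
)
detail_fraction = n_samples * 50 / len(trace)  # share simulated in detail
```

Here only 5% of the trace is examined in detail. The fixed interval and unit
size are exactly the kind of one-size-fits-all parameters that an
application-adaptive technique would tune per workload.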
|
[ Fast-Sim ]
|
Dynamic Load-Scheduling in Multi-processor Clusters
|
|
Dynamic load balancing applied to scientific applications is generally
performed through a measurable run-time
quantity. An alternative approach is to use a physical property that is
produced by or required of a scientific model
that directly or indirectly results in the desired representation of
the reality that the model portrays. Such a property is called a
model-based load index (MBLI); examples include
the mass of an atom in a molecular
dynamics (MD) code or rainfall amounts in a climate simulation.
When an MBLI exists for a particular application, it can be used at
run-time with less overhead and higher accuracy than a
measured run-time quantity.
DRACO investigates DMBLI (Dynamic MBLI) which dynamically gathers
model data to balance load scheduling algorithms found in several
high-performance scientific applications.
These programs include: LAMMPS - a molecular dynamics
simulation designed to model ensembles of particles in liquid,
solid, or gaseous states, which can model atomic, polymeric, biological,
metallic, or granular systems using a variety of force fields and
boundary conditions; Community Climate System Model and Community
Atmosphere Model (CCSM/CAM3) - a global climate
model that provides state-of-the-art computer simulations of the
Earth's past, present, and future climate states;
BLAST - an analysis system that compares
nucleotide or protein sequences to sequence databases and calculates
the statistical significance of matches; and
NAMD2 - a molecular dynamics code.
We evaluate the results on IBM Power5, IBM BlueGene/L and Pentium 4 Xeon systems.
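
To make the MBLI idea concrete, the sketch below balances work using a
per-cell model value instead of a measured run time. Everything specific in
it is an assumption for illustration: the cell names and index values are
made up, and the greedy heaviest-first assignment is a standard heuristic
chosen for brevity, not the DMBLI algorithm itself.

```python
import heapq


def balance_by_model_index(cell_index, n_ranks):
    """Assign cells to ranks, heaviest model index first, always to the
    currently least-loaded rank (greedy longest-processing-time rule).

    `cell_index` maps a cell name to its model-based load index, e.g. a
    quantity the model already computes for each spatial cell.
    """
    heap = [(0.0, rank) for rank in range(n_ranks)]
    heapq.heapify(heap)
    assignment = {rank: [] for rank in range(n_ranks)}
    for cell, load in sorted(cell_index.items(), key=lambda kv: -kv[1]):
        total, rank = heapq.heappop(heap)
        assignment[rank].append(cell)
        heapq.heappush(heap, (total + load, rank))
    totals = {rank: total for total, rank in heap}
    return assignment, totals


# Hypothetical per-cell index values read directly from model data.
cells = {"c0": 900, "c1": 100, "c2": 400, "c3": 300, "c4": 200, "c5": 100}
assignment, totals = balance_by_model_index(cells, n_ranks=2)
```

Because the index comes from data the model already holds, no timing
instrumentation is needed to drive the assignment, which is the source of the
lower overhead described above.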
|
[ Load-Balance ]
|
Compiler-Guided Memory System Prefetching
|
|
The growing disparity between processor speed and memory latency has
made cache miss penalty the primary problem in achieving higher levels
of processor performance. Prefetching mechanisms can be used to
address this problem, but traditional approaches attempt to predict
future memory references based upon regularity in the memory reference
stream. Several techniques have been proposed to prefetch dynamic
data structures, although many of these approaches do not
successfully hide memory latency.
The DRACO group proposes Compiler-Directed Content-Aware Prefetching
(CDCAP), a novel integrated compiler and hardware approach to prefetch
dynamic data structures. The approach coordinates compiler-directed
prefetch instructions with an intelligent hardware prefetching engine
(HPE) resident at the lower levels of the memory system. The
inserted prefetches contain information about the static attributes of
the data structure, which is communicated to the HPE. The
compiler-inserted hints can invoke the HPE to generate prefetches
based on the current program state and the contents of the prefetched
cache lines. The technique overcomes the shortcomings of software-only
techniques: it eliminates the need to transform the data structure,
avoids excessive prefetch instructions, and does not
require prior execution of the traversed data structure path.
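
The interaction between a compiler hint and the HPE can be modeled in
software as below. This is a toy model under stated assumptions, not the
CDCAP hardware: the node layout (8-byte payload, then an 8-byte next
pointer), one node per 64-byte line, the hint fields (pointer offset and
depth), and the function names are all hypothetical.

```python
import struct

LINE_SIZE = 64  # assumed cache line size in bytes


def build_list(addresses):
    """Build a memory image holding one linked-list node per cache line.

    Assumed node layout: 8-byte payload at offset 0, 8-byte little-endian
    next pointer at offset 8 (0 marks the end of the list).
    """
    memory = {}
    for i, addr in enumerate(addresses):
        nxt = addresses[i + 1] if i + 1 < len(addresses) else 0
        line = bytearray(LINE_SIZE)
        struct.pack_into("<QQ", line, 0, i, nxt)
        memory[addr] = bytes(line)
    return memory


def hpe_prefetch(memory, line_addr, pointer_offset, depth):
    """Model of the engine: read the pointer at the compiler-supplied
    offset in each fetched line and chase it up to `depth` lines ahead.
    Returns the list of line addresses for which prefetches were issued.
    """
    issued = []
    addr = line_addr
    for _ in range(depth):
        line = memory.get(addr)
        if line is None:
            break
        (next_ptr,) = struct.unpack_from("<Q", line, pointer_offset)
        if next_ptr == 0 or next_ptr not in memory:
            break
        issued.append(next_ptr)
        addr = next_ptr
    return issued


# Nodes are scattered, so the address stream has no regular stride.
memory = build_list([0x1000, 0x2040, 0x1800, 0x3000])
ahead_two = hpe_prefetch(memory, 0x1000, pointer_offset=8, depth=2)
whole_list = hpe_prefetch(memory, 0x1000, pointer_offset=8, depth=10)
```

A single hinted access to the first node is enough for the engine to run
ahead along the list, without one prefetch instruction per node and without
the path having been traversed before.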
|
[ PREFETCH ]
|
|