Keynote: High Performance Computing on Columbia.
Jim Taft, NASA
The NASA Columbia system, located at NASA's Ames
Research Center, is the world's fastest production supercomputer. It
comprises twenty 512-processor SGI Altix shared-memory systems. The
Altix systems are based on Intel Itanium 2 processors (1.5 GHz),
resulting in an aggregate system peak performance of over 60 TFLOPS.
The system currently sits at number 2 on the LINPACK Top 500 list. The
proposed presentation provides an overview of the Columbia system
architecture, details on its bring up at NASA, as well as performance
results for a number of NASA mission critical codes. Some early
statistics on the system's stability in a production environment are
also provided.
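As a rough check on the aggregate peak figure, the arithmetic can be sketched from the configuration given in the abstract. The value of 4 floating-point operations per cycle is an assumption (two fused multiply-add units per Itanium 2 processor) and is not stated in the abstract.

```python
# Back-of-the-envelope peak performance for the configuration in the abstract.
NODES = 20             # Altix systems (from the abstract)
CPUS_PER_NODE = 512    # processors per Altix system (from the abstract)
CLOCK_HZ = 1.5e9       # 1.5 GHz Itanium 2 (from the abstract)
FLOPS_PER_CYCLE = 4    # assumption: two fused multiply-add units per processor


def peak_flops(nodes, cpus_per_node, clock_hz, flops_per_cycle):
    """Theoretical peak in floating-point operations per second."""
    return nodes * cpus_per_node * clock_hz * flops_per_cycle


total_cpus = NODES * CPUS_PER_NODE
peak = peak_flops(NODES, CPUS_PER_NODE, CLOCK_HZ, FLOPS_PER_CYCLE)
print(f"{total_cpus} processors, peak = {peak / 1e12:.2f} TFLOPS")
```

Under these assumptions the peak is 6 GFLOPS per processor, summed over 10,240 processors.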
Methodology and Results for Analyzing Huge Amounts of Data for IPF Applications.
Allan Knies, Intel
The performance
projection phase of modern processor design is largely an exercise in
building simulators and running traces. One of the major problems with
this approach is that industrial design teams have thousands and
thousands of important traces to analyze across tens of potential
designs. This means that qualitative analysis of individual trace
simulations is nearly impossible. The first component of this talk
describes methods developed to analyze large amounts of data. The
second component shows analysis of about 30 industrial HPC applications
and the CPU2000 integer and floating-point benchmarks. The analysis
shows where specific applications reach critical breaking points for
architectural features such as cache size, cache latency,
memory bandwidth, and memory latency for a wide range of hypothetical
variations of the Intel Itanium 2 processor.
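One simple way to locate a "breaking point" in simulation output is sketched below. It is not the method used in the talk. The function name, the 5% threshold, and the sample numbers are all hypothetical, chosen only to illustrate the idea of finding the smallest parameter value beyond which further increases no longer pay off.

```python
def breaking_point(points, tolerance=0.05):
    """Return the smallest parameter value beyond which no single step
    improves the metric by more than `tolerance` (relative improvement).

    points: list of (parameter_value, cycles) pairs; fewer cycles is better.
    """
    points = sorted(points)
    # Walk from the largest parameter value downwards while gains stay small.
    for i in range(len(points) - 1, 0, -1):
        prev_cycles = points[i - 1][1]
        cycles = points[i][1]
        gain = (prev_cycles - cycles) / prev_cycles
        if gain > tolerance:
            return points[i][0]
    return points[0][0]


# Hypothetical simulated cycle counts (in millions) per cache size in KB.
cache_sweep = [(256, 1000), (512, 800), (1024, 640), (2048, 630), (4096, 628)]
print(breaking_point(cache_sweep))
```

Applied per application and per architectural parameter, a summary of this kind reduces thousands of trace results to one number each, which is easier to compare across designs than individual simulations.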
Efficient and Transparent Instrumentation using Dynamic Compilation.
Robert Cohn, Intel
Software instrumentation inserts foreign code into an application for
program analysis tasks like profiling and bug checking. Pin provides a very
flexible and transparent facility for instrumentation for Itanium. When
inserting code, Pin must ensure that it does not disturb any of the
architectural state of the processor, such as register contents. On
processors that have a small register file, it is sufficient to save
and restore a small number of registers around the inserted code. On
Itanium, this is much more difficult. The instrumentation must manage
120 scratch registers, the register stack engine, advanced load table,
etc. Instrumentation can potentially modify all of this state. Simply
saving and restoring all the scratch state would be extremely
expensive, making instrumentation impractical. This talk describes the
problems and our solution.
Pin uses dynamic
compilation to generate efficient instrumentation by extensive use of
specialization. Instead of saving and restoring all the scratch
registers around instrumentation, it reallocates the registers in both
the application and the instrumentation to reduce the number of saves
and restores necessary for instrumentation. While instrumentation can
potentially modify a large number of registers, dynamically it often
touches a small subset. Pin creates specialized code to make the
executed paths fast. Pin also analyzes the use of rotating registers,
control and data speculation, and the register stack engine to reduce
the overhead. We describe the optimizations and evaluate their
effectiveness on SPECint binaries. Optimization reduces the overhead
for basic block counting by roughly a factor of 10, from 27x to 2.8x.
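The observation that instrumentation usually touches only a small subset of registers can be illustrated with a simplified liveness filter. This is not Pin's register reallocation scheme. It only shows why saving everything is wasteful. The register names and sets are hypothetical.

```python
def registers_to_save(app_live, instr_written):
    """Registers that must be preserved around one instrumentation call:
    those that hold live application values and are overwritten by the
    inserted code."""
    return app_live & instr_written


# Hypothetical scratch register set used only for this sketch.
ALL_SCRATCH = {f"r{i}" for i in range(8, 32)}

app_live = {"r8", "r9", "r14"}          # live in the application at the insertion point
instr_written = {"r14", "r15", "r16"}   # written by the instrumentation routine

naive = ALL_SCRATCH                     # save and restore every scratch register
specialized = registers_to_save(app_live, instr_written)

print(len(naive), "saves without analysis;", len(specialized), "with it")
```

Because the sets differ at each insertion point, the save and restore code has to be generated per site, which is where dynamic compilation and specialization come in.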
MAQAO: Modular Assembler Quality Analyzer and Optimizer for Itanium 2.
Jean-Thomas Acquaviva, Universite de Versailles
The quality of the code produced by compilers is essential for high
performance, so being able to assess code quality precisely is
extremely important. This issue can be tackled successfully with
performance counters and dynamic profiling. In this paper, we argue
that in many interesting cases, a careful static analysis of assembly
code can achieve similar results at a much lower cost and with better
accuracy.
The principles of an automatic tool (MAQAO) for performing such an
analysis are presented. Among its key advantages, MAQAO offers
versatility (the user can specify a particular analysis using SQL
formalism) and precise diagnosis capability which can be later used for
carefully driving the optimization process. Two case studies on real
codes are presented to illustrate the power of the tool: in each case,
MAQAO helped us locate performance problems easily and define an
optimization strategy leading to substantial code improvements (20 to
30% on the overall application execution time).
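To give a flavor of what specifying an analysis in SQL over assembly code might look like, here is a minimal sketch using Python's built-in sqlite3 module. The table layout, column names, and sample rows are hypothetical and are not MAQAO's actual schema. The query flags loops in which at least half of the instruction slots are nops.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one row per instruction slot in a loop body.
conn.execute("CREATE TABLE instr (loop_id INTEGER, opcode TEXT, is_nop INTEGER)")
rows = [
    (1, "ldfd", 0), (1, "fma", 0), (1, "nop", 1), (1, "nop", 1),
    (2, "ldfd", 0), (2, "fma", 0), (2, "stfd", 0), (2, "nop", 1),
]
conn.executemany("INSERT INTO instr VALUES (?, ?, ?)", rows)

# Loops whose bundles are padded with nops in at least half of their slots.
query = """
    SELECT loop_id, 1.0 * SUM(is_nop) / COUNT(*) AS nop_ratio
    FROM instr
    GROUP BY loop_id
    HAVING 1.0 * SUM(is_nop) / COUNT(*) >= 0.5
    ORDER BY loop_id
"""
suspects = conn.execute(query).fetchall()
print(suspects)
```

The appeal of this style is that a new diagnosis is a new query rather than a new tool pass.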
Keynote: Optimizing compilers for Elbrus-2000 (E2k) architecture.
Vladimir Volkonskiy
The Elbrus-2000 (E2k) microprocessor architecture, described in
Microprocessor Report, Vol.11, No.2, 1999, has several special
features: 1) explicit instruction-level parallelism based on a wide
instruction word (like VLIW or EPIC), 2) hardware support for full
compatibility with IA-32 based on transparent dynamic binary
translation, and 3) hardware support for secure implementation of any
high-level language. To exploit all these architectural features,
strong compiler technology is needed. We present the optimizing
compilers developed for the E2k architecture.
The E2k optimizing compiler for high-level languages was developed
along with the architecture. As a result, some architectural features
of E2k are better suited to compiler optimization than those of VLIW or
EPIC architectures. We consider some of them, such as branch
preparation instead of branch prediction, asynchronous array prefetch
in addition to (sometimes instead of) prefetching data into the cache,
and some specific features of speculation. As we are in the process of
retargeting the E2k optimizing compiler to Itanium 2, we present a
preliminary comparative analysis of the strong and weak points of both
architectures from the optimizing compiler's point of view. We also
present some important algorithms implemented in the compiler, such as
global interprocedural analysis and global scheduling.
The transparent dynamic binary translation system
was also developed along with the E2k architecture. It was necessary
for efficient execution of any IA-32 program including any operating
system. We present a hierarchical four-level binary translation system
with a strong region based optimizer on the highest level of the
system. Some specific details of the optimizer as well as the most
important features of hardware support for both compatibility and
optimization goals are discussed.
The major distinctive feature of the E2k compiler
technology is secure implementation of any high level language. It is
based on strong hardware checks of all operations on pointers. It also
separates and strongly protects the private data of each module. We
present secure C and C++ implemented in the compiler on the basis of a
secure Linux kernel. The secure semantic mode is implemented with
almost no language restrictions. It makes it possible to find very
sophisticated bugs in the program.
Presenter's Biography:
Vladimir Volkonskiy is a division chief at the
Russian company Elbrus-MCST. He received an M.S. degree in mathematics
from Moscow State University in 1972 and a Ph.D. degree in computer
science from the Moscow Institute of Precision Mechanics and Computer
Equipment in 1980. His main research interests and professional
activity include compilers, optimization algorithms, dynamic optimizing
systems, secure implementation of programming languages in compilers,
and computer architecture design supporting all these directions.
Currently he manages all compiler projects for Elbrus-2000 (E2k)
computer architecture running at Elbrus-MCST including optimizing
compilers from C, C++, Fortran, in both secure and regular semantic
modes, and optimizing binary translation system from IA-32.
Finding Parallelism for Future EPIC Machines.
Matthew Iyer, University of Colorado at Boulder
EPIC
architectures were designed to allow software to expose ILP to the
underlying processor. In this effort, the processors have been more or
less successful; however, the question of how to identify (or create)
additional parallelism remains. In this paper we explore how well the
compiler and EPIC ISA jointly create new ILP, versus simply allowing
existing hardware to better exploit it. To do this, we generate a trace
scheduler that computes the ideal ILP of applications (assuming perfect
memory disambiguation, an infinite issue window, and infinite
resources) and see how this ideal ILP changes as we vary compiler
aggressiveness. Changes in this ideal ILP represent creation of ILP
(both local and distant), versus better analyses and transformations
that simply permit exploitation of ILP by particular hardware. We also
explore how well the EPIC ISA allows for the creation of ILP by
comparing the aforementioned ideal ILP results with those for the Intel
80x86 architecture, whose ISA is far from ideal for representing ILP.
Our initial results show that for many applications
the EPIC platform does well compared to x86. We
also find cases where compiler transformations even
improve ideal ILP. For some applications, however, the x86 ISA
allows for greater ILP if false dependences due to stack manipulation
are ignored on x86. We hope that careful
analysis of the origins of these effects will reveal new opportunities
for EPIC compilers and new sources of parallelism that can be exploited
on future CMP machines.
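The notion of ideal ILP used here can be sketched as instruction count divided by the length of the longest chain of true dependences. The following is a minimal illustration, not the authors' scheduler. It assumes unit latency and a trace of register operations only, so memory dependences are not modeled. The trace format is hypothetical.

```python
def ideal_ilp(trace):
    """Ideal ILP of a dynamic trace under an idealized machine model.

    trace: list of (dest, sources) pairs in program order. Only true
    (read-after-write) dependences constrain the schedule, which models
    perfect renaming, an infinite issue window, and infinite resources.
    Every instruction is assumed to take one cycle.
    """
    ready = {}   # register name -> cycle at which its latest value is available
    depth = 0    # length of the longest dependence chain seen so far
    for dest, sources in trace:
        start = max((ready.get(s, 0) for s in sources), default=0)
        finish = start + 1
        if dest is not None:
            ready[dest] = finish
        depth = max(depth, finish)
    return len(trace) / depth if depth else 0.0


# Hypothetical six-instruction trace with two independent chains.
trace = [
    ("a", []), ("b", []), ("c", ["a", "b"]), ("d", ["c"]),
    ("e", []), ("f", ["e"]),
]
print(ideal_ilp(trace))
```

Running the same measurement on traces produced at different optimization levels is what separates ILP the compiler creates from ILP it merely makes easier for a given machine to exploit.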
Resource Aware Scheduling.
Kalyan Muthukumar, Intel
Currently, the acyclic schedulers in the Intel(r) Compiler for the
Itanium(r) processor (both the Global Code Scheduler (GCS) and the
post-pass schedulers (SCH)) compute scheduling priorities of
instructions primarily based on their dependence height in the
dependence Directed Acyclic Graph (DAG). We believe this to be true of
most production compilers today. These schedulers do not take into
account the resource requirements of the instructions in their
scheduling regions. As a result, we sometimes get sub-optimal schedules
for regions that are resource-bound (i.e., the schedule length based
on resources exceeds the schedule length based on dependence height)
rather than dependence-height-bound.
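The distinction between resource-bound and dependence-height-bound regions can be sketched with two lower bounds on schedule length. This illustrates the definition above only. It is not the paper's Resource Height computation, and the unit classes, unit counts, and instruction names are hypothetical.

```python
import math


def dependence_bound(latency, succs):
    """Length of the longest latency-weighted path through the dependence DAG."""
    memo = {}

    def height(n):
        if n not in memo:
            below = max((height(s) for s in succs.get(n, [])), default=0)
            memo[n] = latency[n] + below
        return memo[n]

    return max(height(n) for n in latency)


def resource_bound(unit_of, units_available):
    """Minimum cycles needed to issue all instructions on each unit class."""
    counts = {}
    for unit in unit_of.values():
        counts[unit] = counts.get(unit, 0) + 1
    return max(math.ceil(c / units_available[u]) for u, c in counts.items())


# Hypothetical region: six independent loads and one add fed by the first load.
loads = [f"ld{i}" for i in range(1, 7)]
latency = {name: 1 for name in loads + ["add"]}
succs = {"ld1": ["add"]}
unit_of = {**{name: "M" for name in loads}, "add": "I"}
units_available = {"M": 2, "I": 2}   # assumed: two memory and two integer units

dep_len = dependence_bound(latency, succs)
res_len = resource_bound(unit_of, units_available)
print("resource-bound" if res_len > dep_len else "dependence-height-bound")
```

In this example the dependence bound is 2 cycles but the memory units need 3, so priorities based on dependence height alone say nothing useful about which load to issue first.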
This paper presents a novel method for
resource-aware scheduling for acyclic schedulers. The key idea of our
method is to compute the Resource Height of every instruction in a
scheduling region. This, combined with the Dependence Height of an
instruction, helps us to compute the slack of an instruction. We can
use this information about Instruction Slack to get a better schedule
for the region.
Decoupled Software Pipelining: A Promising Technique to Exploit Thread Level Parallelism.
Guilherme de Lima Ottoni, Princeton University
Processor
manufacturers are moving to multi-core, multi-threaded designs because
of several factors such as cost, ease of design and scalability. As
most processors will be multi-threaded in the future, exposing
thread-level parallelism (TLP) is a problem of increasing importance.
Because the appropriate thread granularity depends on the
target architecture, and writing sequential applications is usually
more natural, the compiler plays an important role in performing the
mapping from applications to the appropriate multi-threaded code. In
spite of this, few general-purpose compilation techniques have been
proposed to assist in this task. In this paper, we propose Decoupled
Software Pipelining (DSWP) to extract thread-level parallelism. DSWP
can convert most application loops into a pipeline of loop threads.
This brings pipeline parallelism to most application loops including
those not targeted by traditional software pipelining. DSWP does not
rely on complex hardware speculation support since it is a
non-speculative transformation. This paper describes the DSWP
technique, discusses its implementation in a compiler, and presents
experimental results demonstrating that it is a promising technique to
extract TLP.
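The shape of a decoupled pipeline can be sketched with two threads and a queue. This is a hand-written illustration of the idea, not the compiler transformation itself. The linked-list loop, the function names, and the queue size are assumptions. In CPython the sketch shows the structure rather than a speedup.

```python
import queue
import threading

SENTINEL = object()   # marks the end of the stream between stages


class Node:
    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node


def traverse(head, out_q):
    """Stage 1: the pointer-chasing part of the loop (the loop-carried dependence)."""
    node = head
    while node is not None:
        out_q.put(node.value)
        node = node.next
    out_q.put(SENTINEL)


def compute(in_q, results):
    """Stage 2: the per-node work, which depends only on values from stage 1."""
    while True:
        value = in_q.get()
        if value is SENTINEL:
            break
        results.append(value * value)


def run_pipeline(values):
    head = None
    for v in reversed(values):
        head = Node(v, head)
    q = queue.Queue(maxsize=32)   # bounded buffer that decouples the two stages
    results = []
    stages = [
        threading.Thread(target=traverse, args=(head, q)),
        threading.Thread(target=compute, args=(q, results)),
    ]
    for t in stages:
        t.start()
    for t in stages:
        t.join()
    return results


print(run_pipeline([1, 2, 3, 4]))
```

Data flows in one direction only, from the traversal stage to the compute stage, so neither stage has to speculate. The queue lets the traversal run ahead when the compute stage stalls.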
Multipass Pipelining.
Ron Barnes, University of Illinois
Because of the growing disparity between processor logic and memory
speed, tolerating cache misses through dynamic scheduling has become an
almost ubiquitous characteristic of modern, non-EPIC processors. While
out-of-program-order execution can tolerate variable memory-instruction
latency, it adds hardware components that are problematic for
power-conscious design and whose complexity limits the practical
ability to reorder instructions. Unfortunately, the performance of
alternative architectures that rely solely on the compiler's
instruction arrangement has been found to suffer when cache misses
occur.
This paper introduces Multipass Pipelining, a new
processor organization that exploits both an EPIC compiler's meticulous
scheduling and advance execution beyond otherwise
in-order-stalled instructions. Unlike run-ahead execution schemes,
multipass pipelining provides for persistent advance execution,
increasing efficiency and facilitating further increases in
instruction-level parallelism. Multipass pipelining achieves speedups
as high as 1.65X while comparing favorably to achievable out-of-order
designs in terms of complexity and power.