Keynote: High Performance Computing on Columbia.
Jim Taft, NASA
    The NASA Columbia system, located at NASA's Ames Research Center, is the world's fastest production supercomputer. It comprises twenty 512-processor SGI Altix shared-memory systems. The Altix systems are based on Intel Itanium 2 processors (1.5 GHz), resulting in an aggregate system peak performance of over 60 TFLOPs. The system currently sits at number 2 on the LINPACK Top 500 list. The presentation provides an overview of the Columbia system architecture, details of its bring-up at NASA, and performance results for a number of NASA mission-critical codes. Some early statistics on the system's stability in a production environment are also provided.
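    The aggregate peak figure can be checked with a back-of-envelope calculation. The sketch below assumes each Itanium 2 processor completes 4 floating-point operations per cycle (two fused multiply-add units); that per-cycle figure is an assumption not stated in the abstract, and the function name is illustrative only.

```python
def peak_flops(systems, cpus_per_system, clock_hz, flops_per_cycle):
    """Theoretical peak: every processor issues its maximum FLOPs every cycle."""
    return systems * cpus_per_system * clock_hz * flops_per_cycle

# 20 Altix systems x 512 processors x 1.5 GHz x 4 FLOPs/cycle (assumed)
total = peak_flops(systems=20, cpus_per_system=512, clock_hz=1.5e9, flops_per_cycle=4)
print(f"{total / 1e12:.2f} TFLOPs")  # prints "61.44 TFLOPs"
```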
Methodology and Results for Analyzing Huge Amounts of Data for IPF Applications.
Allan Knies, Intel
    The performance projection phase of modern processor design is largely an exercise in building simulators and running traces. One of the major problems with this approach is that industrial design teams have thousands of important traces to analyze across tens of potential designs. This means that qualitative analysis of individual trace simulations is nearly impossible. The first component of this talk describes methods developed to analyze large amounts of data. The second component shows analysis of about 30 industrial HPC applications and the SPEC CPU2000 integer and floating-point benchmarks. The analysis shows where specific applications reach critical breaking points for architectural features such as cache size, cache latency, memory bandwidth, and memory latency across a wide range of hypothetical variations of the Intel Itanium 2 processor.

Efficient and Transparent Instrumentation using Dynamic Compilation.
Robert Cohn, Intel
    Software instrumentation inserts foreign code into an application for program analysis tasks like performance profiling and bug checking. Pin provides a very flexible and transparent instrumentation facility for Itanium. When inserting code, Pin must ensure that it does not disturb any of the architectural state of the processor, such as register contents. On processors that have a small register file, it is sufficient to save and restore a small number of registers around the inserted code. On Itanium, this is much more difficult. The instrumentation must manage 120 scratch registers, the register stack engine, the advanced load table, etc. Instrumentation can potentially modify all of this state. Simply saving and restoring all the scratch state would be extremely expensive, making instrumentation impractical. This talk describes the problems and our solution.
    Pin uses dynamic compilation to generate efficient instrumentation through extensive use of specialization. Instead of saving and restoring all the scratch registers around instrumentation, it reallocates the registers in both the application and the instrumentation to reduce the number of saves and restores that instrumentation requires. While instrumentation can potentially modify a large number of registers, dynamically it often touches only a small subset. Pin creates specialized code to make the executed paths fast. Pin also analyzes the use of rotating registers, control and data speculation, and the register stack engine to reduce the overhead. We describe the optimizations and evaluate their effectiveness on SPECint binaries. Optimization reduces the overhead for basic block counting by roughly a factor of 10, from 27x to 2.8x.
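    The benefit of specialization over a blanket save and restore can be illustrated with a small set computation. This is only a sketch of the general idea, not Pin's actual algorithm (which also reallocates registers); the function name, the register names, and the 18-register scratch set are hypothetical placeholders.

```python
def registers_to_save(clobbered_by_tool, live_in_app):
    """Only registers that the inserted code writes AND the application
    still needs at the insertion point must be preserved."""
    return sorted(clobbered_by_tool & live_in_app)

# Hypothetical scratch set, used only to show the size of the naive approach.
scratch = {f"r{i}" for i in range(14, 32)}
tool_writes = {"r14", "r15", "r16"}   # registers the instrumentation code writes
app_live = {"r15", "r20", "r21"}      # registers live in the application here

naive_saves = len(scratch)                                # save everything: 18
needed_saves = registers_to_save(tool_writes, app_live)   # ['r15']
print(naive_saves, needed_saves)
```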

MAQAO: Modular Assembler Quality Analyzer and Optimizer for Itanium 2.
Jean-Thomas Acquaviva, Universite de Versailles
    The quality of the code produced by compilers is essential for high performance. Being able to assess code quality precisely is therefore extremely important. This issue can be successfully tackled by using performance counters and dynamic profiling. In this paper, we argue that in many interesting cases, a careful static analysis of assembly code can achieve similar results at a much lower cost and with better accuracy. The principles of an automatic tool (MAQAO) for performing such an analysis are presented. Among its key advantages, MAQAO offers versatility (the user can specify a particular analysis using SQL formalism) and a precise diagnosis capability, which can later be used to drive the optimization process carefully. Two case studies on real codes are presented to illustrate the power of the tool: in each case, MAQAO helped us locate performance problems easily and define an optimization strategy leading to substantial code improvements (20 to 30% of the overall application execution time).
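    To give a flavor of what a static analysis expressed in SQL might look like, the sketch below loads a few assembly instructions into an in-memory SQLite table and ranks loops by their fraction of nop instructions. The table layout, column names, and the nop-ratio metric are assumptions made for illustration; the abstract does not describe MAQAO's actual schema or queries.

```python
import sqlite3

def nop_ratio_by_loop(rows):
    """rows: iterable of (loop_id, opcode). Returns [(loop_id, nop_ratio), ...]
    sorted with the most nop-heavy loop first."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE insn (loop_id INTEGER, opcode TEXT)")
    con.executemany("INSERT INTO insn VALUES (?, ?)", rows)
    query = """
        SELECT loop_id,
               1.0 * SUM(opcode = 'nop') / COUNT(*) AS nop_ratio
        FROM insn
        GROUP BY loop_id
        ORDER BY nop_ratio DESC
    """
    result = con.execute(query).fetchall()
    con.close()
    return result

# Hypothetical disassembly: loop 1 has 2 nops out of 4, loop 2 has 1 out of 3.
sample = [(1, "ld8"), (1, "nop"), (1, "nop"), (1, "fma"),
          (2, "ld8"), (2, "add"), (2, "nop")]
ranking = nop_ratio_by_loop(sample)
print(ranking)
```

    A user-defined analysis then amounts to swapping in a different query over the same table.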

Keynote: Optimizing compilers for Elbrus-2000 (E2k) architecture.
Vladimir Volkonskiy
    The Elbrus-2000 (E2k) microprocessor architecture, described in Microprocessor Report, Vol.11, No.2, 1999, has several special features: 1) explicit instruction-level parallelism on the basis of a wide instruction word (like VLIW or EPIC), 2) hardware support for full compatibility with IA-32 on the basis of transparent dynamic binary translation, and 3) hardware support for secure implementation of any high-level language. Utilizing all these architectural features requires strong compiler technology. We present optimizing compilers developed for the E2k architecture.
    The E2k optimizing compiler for high-level languages was developed along with the architecture. As a result, some architectural features of E2k are better suited to compiler optimization than those of VLIW or EPIC architectures. We consider some of them, such as branch preparation instead of branch prediction, asynchronous array prefetch in addition to (and sometimes instead of) prefetching data into cache, and some specific features of speculation. As we are in the process of retargeting the E2k optimizing compiler to Itanium 2, we present a preliminary comparative analysis of the strong and weak points of both architectures from the optimizing compiler's point of view. We also present some important algorithms implemented in the compiler, such as global interprocedural analysis and global scheduling.
    The transparent dynamic binary translation system was also developed along with the E2k architecture. It is necessary for efficient execution of any IA-32 program, including any operating system. We present a hierarchical four-level binary translation system with a strong region-based optimizer at the highest level. Some specific details of the optimizer, as well as the most important features of the hardware support for both compatibility and optimization goals, are discussed.
    The major distinctive feature of the E2k compiler technology is secure implementation of any high-level language. It is based on strong hardware checks of all operations on pointers. It also separates and strongly protects the private data of each module. We present secure C and C++ implemented in the compiler on the basis of a secure Linux kernel. The secure semantic mode is implemented almost without any language restrictions. It enables finding very sophisticated bugs in programs.
Presenter's Biography:
    Vladimir Volkonskiy is a division chief at the Russian company Elbrus-MCST. He received an M.S. degree in mathematics from Moscow State University in 1972 and a Ph.D. degree in computer science from the Moscow Institute of Precision Mechanics and Computer Equipment in 1980. His main research interests and professional activities include compilers, optimization algorithms, dynamic optimizing systems, secure implementation of programming languages in compilers, and computer architecture design supporting all these directions. He currently manages all compiler projects for the Elbrus-2000 (E2k) computer architecture at Elbrus-MCST, including optimizing compilers for C, C++, and Fortran in both secure and regular semantic modes, and an optimizing binary translation system from IA-32.

Finding Parallelism for Future EPIC Machines.
Matthew Iyer, University of Colorado at Boulder
    EPIC architectures were designed to allow software to expose ILP to the underlying processor. In this effort, the processors have been more or less successful; however, the question of how to identify (or create) additional parallelism remains. In this paper we explore how well the compiler and EPIC ISA jointly create new ILP, as opposed to simply allowing existing hardware to better exploit it. To do this, we build a trace scheduler that computes the ideal ILP of applications (assuming perfect memory disambiguation, an infinite issue window, and infinite resources) and see how this ideal ILP changes as we vary compiler aggressiveness. Changes in this ideal ILP represent creation of ILP (both local and distant), as opposed to better analyses and transformations that simply permit exploitation of ILP by particular hardware. We also explore how well the EPIC ISA allows for the creation of ILP by comparing the aforementioned ideal ILP results with those for the Intel 80x86 architecture, whose ISA is far from ideal for representing ILP.
    Our initial results show that for many applications the EPIC platform does well compared to x86. We also find cases where compiler transformations even improve ideal ILP. For some other applications, however, the x86 ISA allows for greater ILP if false dependences due to stack manipulation are ignored on x86. We hope that careful analysis of the origins of these effects will reveal new opportunities for EPIC compilers and new sources of parallelism that can be exploited on future CMP machines.
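    The notion of ideal ILP used above can be sketched as a dataflow-limit calculation over a trace: with infinite resources and an infinite issue window, each instruction issues as soon as its producers finish, so ideal ILP is the instruction count divided by the critical-path length. The sketch assumes unit latency and tracks only true (read-after-write) register dependences; the trace format and function name are hypothetical, not the authors' tool.

```python
def ideal_ilp(trace):
    """trace: list of (dest, sources) in program order, one cycle each.
    Returns instructions per cycle under a pure dataflow limit."""
    ready = {}   # register name -> cycle at which its latest value is available
    depth = 0    # length of the critical path seen so far
    for dest, sources in trace:
        # An instruction starts once all of its source values are ready.
        start = max((ready.get(s, 0) for s in sources), default=0)
        finish = start + 1
        if dest is not None:
            ready[dest] = finish
        depth = max(depth, finish)
    return len(trace) / depth if depth else 0.0

# Two independent chains: a,b -> c -> d (3 cycles) and e -> f (2 cycles).
trace = [("a", []), ("b", []), ("c", ["a", "b"]),
         ("d", ["c"]), ("e", []), ("f", ["e"])]
print(ideal_ilp(trace))  # 6 instructions / 3 cycles = 2.0
```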

Resource Aware Scheduling.
Kalyan Muthukumar, Intel
    Currently, the acyclic schedulers in the Intel(r) Compiler for the Itanium(r) processor (both the Global Code Scheduler (GCS) and the post-pass schedulers (SCH)) compute scheduling priorities of instructions primarily based on their dependence height in the dependence Directed Acyclic Graph (DAG); we believe this to be true of most production compilers today. They do not take into account the resource requirements of the instructions in their scheduling regions. As a result, we sometimes get sub-optimal schedules for regions that are resource-bound (i.e., the schedule length based on resources exceeds the schedule length based on dependence height) rather than dependence-height-bound.
    This paper presents a novel method for resource-aware scheduling for acyclic schedulers. The key idea of our method is to compute the Resource Height of every instruction in a scheduling region. This, combined with the Dependence Height of an instruction, helps us to compute the slack of an instruction. We can use this information about Instruction Slack to get a better schedule for the region.
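    The distinction between resource-bound and dependence-height-bound regions can be made concrete with a small calculation. The abstract does not define Resource Height, so the sketch below only computes the two region-level bounds and a conventional per-instruction slack (schedule-length bound minus the longest dependence path through the instruction). It assumes unit latencies and instructions listed in topological order; all names are hypothetical.

```python
from collections import Counter
from math import ceil

def analyze_region(instrs, preds, units):
    """instrs: {name: resource} in topological order; preds: {name: [producers]};
    units: {resource: units available per cycle}."""
    order = list(instrs)
    depth = {}                               # longest path from region entry
    for i in order:
        depth[i] = 1 + max((depth[p] for p in preds.get(i, [])), default=0)
    tail = {i: 1 for i in order}             # longest path to region exit
    for i in reversed(order):
        for p in preds.get(i, []):
            tail[p] = max(tail[p], tail[i] + 1)
    dep_bound = max(depth.values())
    usage = Counter(instrs.values())
    res_bound = max(ceil(n / units[r]) for r, n in usage.items())
    length = max(dep_bound, res_bound)
    slack = {i: length - (depth[i] + tail[i] - 1) for i in order}
    return dep_bound, res_bound, slack

# Six loads competing for two memory units, plus one add fed by two of the loads.
instrs = {"ld1": "M", "ld2": "M", "ld3": "M", "ld4": "M",
          "ld5": "M", "ld6": "M", "add": "I"}
preds = {"add": ["ld1", "ld2"]}
dep_bound, res_bound, slack = analyze_region(instrs, preds, {"M": 2, "I": 2})
print(dep_bound, res_bound, slack["ld1"], slack["ld3"])  # 2 3 1 2
```

    Here the region is resource-bound (3 cycles for the loads versus a dependence height of 2), so a priority based on dependence height alone says nothing about which loads should issue first.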

Decoupled Software Pipelining: A Promising Technique to Exploit Thread Level Parallelism.
Guilherme de Lima Ottoni, Princeton University
    Processor manufacturers are moving to multi-core, multi-threaded designs because of several factors such as cost, ease of design, and scalability. As most processors will be multi-threaded in the future, exposing thread-level parallelism (TLP) is a problem of increasing importance. Because the appropriate granularity of the threads depends on the target architecture, and writing sequential applications is usually more natural, the compiler plays an important role in mapping applications to the appropriate multi-threaded code. In spite of this, few general-purpose compilation techniques have been proposed to assist in this task. In this paper, we propose Decoupled Software Pipelining (DSWP) to extract thread-level parallelism. DSWP can convert most application loops into a pipeline of loop threads. This brings pipeline parallelism to most application loops, including those not targeted by traditional software pipelining. DSWP does not rely on complex hardware speculation support since it is a non-speculative transformation. This paper describes the DSWP technique, discusses its implementation in a compiler, and presents experimental results demonstrating that it is a promising technique to extract TLP.
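    The shape of a loop after a DSWP-style split can be sketched with two threads and a queue: one thread carries the pointer-chasing recurrence of a linked-list traversal, and the other does the per-node work, with data flowing in one direction only and no speculation. This is a hand-written illustration of the resulting structure, not the compiler transformation itself; the function names are hypothetical, and in CPython the global interpreter lock means it shows the structure rather than a speedup.

```python
import queue
import threading

def make_list(values):
    """Build a singly linked list of dicts from a Python list."""
    head = None
    for v in reversed(values):
        head = {"value": v, "next": head}
    return head

def dswp_sum_of_squares(head):
    q = queue.Queue()
    done = object()          # sentinel marking the end of the stream
    result = []

    def traverse():          # stage 1: loop-carried pointer chase
        node = head
        while node is not None:
            q.put(node["value"])
            node = node["next"]
        q.put(done)

    def compute():           # stage 2: per-node work, decoupled by the queue
        total = 0
        while True:
            v = q.get()
            if v is done:
                break
            total += v * v
        result.append(total)

    threads = [threading.Thread(target=traverse),
               threading.Thread(target=compute)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]

print(dswp_sum_of_squares(make_list([1, 2, 3])))  # 1 + 4 + 9 = 14
```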

Multipass Pipelining.
Ron Barnes, University of Illinois
    Because of the growing disparity between processor logic and memory speed, tolerating cache misses through dynamic scheduling has become an almost ubiquitous characteristic of modern, non-EPIC processors. While out-of-program-order execution can tolerate variable memory-instruction latency, it adds hardware components that are problematic for power-conscious design and whose complexity limits the practical ability to reorder instructions. Unfortunately, the performance of alternative architectures that rely solely on the compiler's instruction arrangement has been found to suffer when cache misses occur.
    This paper introduces Multipass Pipelining, a new processor organization that exploits both an EPIC compiler's meticulous scheduling and advance execution beyond otherwise in-order-stalled instructions. Unlike run-ahead execution schemes, multipass pipelining provides for persistent advance execution, increasing efficiency and facilitating further increases in instruction-level parallelism. Multipass pipelining achieves speedups as high as 1.65X while comparing favorably to achievable out-of-order designs in terms of complexity and power.