A Dynamically Reconfigurable Cache for Multithreaded Processors

Alex Settle, Daniel A. Connors, Enric Gibert, Antonio Gonzalez.
Journal of Embedded Computing: Special Issue on Single-Chip Multi-core Architectures. December, 2005.

In order to leverage increasingly many transistors, designers are moving toward including multiple processor cores on a single chip die, known as chip multi-processors (CMPs). These systems typically include multithreading support, such as simultaneous multithreading (SMT) and coarse-grain multithreading (CGMT), on each of the processor cores to enable cost-effective high-throughput execution. Such architectures are expected in the embedded domain, although their adoption requires that a number of constraints unique to embedded systems be addressed. Specifically, issues in the cache and memory system must be adequately resolved to eliminate the interference between co-active application threads in the system. Traditionally, the CMP cache hierarchy can either be shared across the cores or duplicated for each one. The decision to offer fully shared or fully distributed cache hierarchies is a design constraint that is driven by both power consumption and the chip area required for each of the supported cores. At the very least, though, cache hierarchies support sharing for the individual hardware contexts that run on each core. To date, the majority of design techniques for improving multithreaded processor execution have focused on enabling resource utilization through instruction scheduling and novel pipeline concepts. However, when independent applications share the cache memory system, severe performance penalties can result depending on the characteristics of the co-scheduled jobs. This penalty can be a major barrier to leveraging multi-core and multithreaded architectures for the embedded systems domain, since the interference of co-active applications can compromise the expected system characteristics (e.g., by causing missed real-time deadlines).
In particular, co-scheduled applications compete for cache resources and combine to create a collective set of memory requests that cannot be adequately served by a traditional cache system designed for a single-thread processor. To resolve these CMP issues for the embedded computing domain, adaptable hardware-based cache allocation systems are needed to balance the resource demands of each application and improve the overall throughput of the collective workload. For several different workloads of two co-scheduled applications, experimental results demonstrate speedups of up to 1.47X over a fully-shared two-level cache hierarchy and an average speedup of 1.10X over the leading cache partitioning model. Overall, dynamically managing cache storage for multiple application threads at runtime achieves sizable performance gains, which can give chip designers the opportunity to maintain high performance as cache size budgets become a concern in the CMP design space.
