Memory Latency: To Tolerate or to Reduce?

It has become a truism that the gap between processor speed and memory access latency continues to widen rapidly. This paper presents some of the architectural strategies used to bridge this gap. They are mostly of two kinds: memory-latency-reducing approaches, such as those employed in caches and HiDISC (Hierarchical Decoupled Architecture), and memory-latency-tolerating schemes, such as SMT (Simultaneous Multithreading) and ISSC (I-structure software cache). A third technique reduces latency by integrating the processor and DRAM on the same chip. Finally, algorithmic techniques that improve cache utilization and reduce the average memory access latency of traditional cache architectures are discussed.

Keywords: Memory Access Latency, Simultaneous Multithreading, Decoupled Architecture, Memory Bandwidth, Processing in Memory.
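As a rough illustration of what an algorithmic technique for improving cache utilization can look like, the sketch below contrasts a naive matrix multiplication with a blocked (tiled) variant. It is an assumption chosen for illustration, not a method taken from this paper: the function names `matmul_naive` and `matmul_blocked` and the `block` parameter are hypothetical. Pure Python will not show the speedup a compiled language would; the example only shows how the loop nest is restructured so that each small tile of the operands is reused before the computation moves on.

```python
def matmul_naive(a, b):
    """Reference triple loop: walks a full column of b for every c[i][j]."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_blocked(a, b, block=2):
    """Same result, computed tile by tile (hypothetical block size)."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    # The three outer loops pick a tile; the three inner loops stay inside it,
    # so the working set of each inner nest is about 3 * block * block elements.
    for ii in range(0, n, block):
        for kk in range(0, m, block):
            for jj in range(0, p, block):
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, m)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + block, p)):
                            c[i][j] += aik * b[k][j]
    return c


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6]]
    b = [[7, 8], [9, 10], [11, 12]]
    print(matmul_blocked(a, b, block=2))
```

In a compiled setting the block size would typically be chosen so that the tiles being worked on fit in the targeted cache level; the value 2 here is only for demonstration.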
