Optimal data layout for block-level random accesses to scratchpad

3D memory is an increasingly popular technology for closing the performance gap between memory and processors. It has led to new architectures with scratchpad memory that offer high bandwidth and user-controlled access. Ideally, such a scratchpad would deliver peak bandwidth for any random block access. However, 3D memories impose their own constraints: high bandwidth is guaranteed only for certain "ideal" access patterns, and the actual bandwidth is significantly lower for other patterns. In this paper, we address the challenge of achieving high bandwidth for random block accesses to 3D memory. We present an optimal data layout that achieves maximum bandwidth for each vault, irrespective of which block in the vault is accessed. Our data layout, expressed as a mapping function determined by the architecture parameters, exploits inter-layer pipelining to map the elements of each block across the layers of a vault in a specific pattern. By doing so, it absorbs the latency of accesses to banks in the same layer and, more importantly, hides the latency of accesses to different rows in the same bank, irrespective of the block being accessed. We compare the performance of our proposed data layout with that of an existing data layout using the PARSEC 2.0 benchmarks. Our experimental results demonstrate up to 56% improvement in access time over the existing data layout across various workloads.
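To make the idea of a layout expressed as a mapping function more concrete, the following is a minimal sketch in Python. It is not the mapping function from the paper, which the abstract does not give. It only illustrates one plausible layer-first interleaving, under the assumption that a vault is described by a number of layers, banks per layer, and rows per bank. All names (`map_element`, `layout_block`, `num_layers`, `banks_per_layer`, `rows_per_bank`) are hypothetical.

```python
def map_element(block_id, elem_idx, num_layers, banks_per_layer, rows_per_bank):
    """Return a hypothetical (layer, bank, row) placement for one block element.

    Assumption: this is an illustrative interleaving, not the paper's function.
    """
    # Consecutive elements land in different layers, so accesses to them
    # could overlap through inter-layer pipelining. The block_id skew
    # staggers the starting layer from block to block.
    layer = (elem_idx + block_id) % num_layers
    # After one pass over all layers, move on to the next bank.
    bank = (elem_idx // num_layers) % banks_per_layer
    # Assumption: each block occupies a single row index in every bank.
    row = block_id % rows_per_bank
    return layer, bank, row


def layout_block(block_id, block_size, num_layers, banks_per_layer, rows_per_bank):
    """Place every element of a block and return the list of placements."""
    return [
        map_element(block_id, i, num_layers, banks_per_layer, rows_per_bank)
        for i in range(block_size)
    ]


if __name__ == "__main__":
    # Hypothetical vault: 4 layers, 8 banks per layer, 1024 rows per bank.
    for placement in layout_block(5, 8, 4, 8, 1024):
        print(placement)
```

In this sketch, a block of `num_layers * banks_per_layer` elements touches every (layer, bank) pair exactly once, and no two consecutive elements share a layer. Whether such a pattern reaches peak bandwidth depends on timing parameters that the abstract does not state.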
