Performance Modeling of Matrix Multiplication on 3D Memory Integrated FPGA

Recent advances in three-dimensional integrated circuits have enabled vertical stacks of memory to be integrated with an FPGA layer. Such architectures provide high-bandwidth, low-latency access to memory, which benefits memory-intensive applications. We build a performance model of a representative 3D Memory Integrated FPGA architecture for matrix multiplication and derive the peak performance of the algorithm on this model in terms of throughput and energy efficiency. We evaluate the effect of different architecture parameters on performance and identify the critical bottlenecks. The parameters include the configuration of memory layers, vaults, and Through-Silicon Vias (TSVs). Our analysis indicates that memory is one of the major consumers of energy on such an architecture. We model memory activation scheduling on vaults for this application and show that it improves energy efficiency by 1.83× while maintaining a throughput of 200 GOPS/s. The 3D Memory Integrated FPGA model achieves a peak performance of 93 GOPS/J for a matrix of size 16K×16K. We also compare the peak performance of a 2D architecture with that of the 3D architecture and observe only a marginal improvement in both throughput and energy efficiency. Our analysis indicates that the FPGA, which dominates the total computation time and energy consumption, is the bottleneck. In addition to matrix multiplication, which requires O(m^3) computation work, we also analyze the class of applications that require O(m^2) work. In particular, for matrix transposition we find that the improvement is on the order of 3× in energy consumption and 7× in runtime. This indicates that the computation cost of an application must match its memory access time in order to exploit the large bandwidth of 3D memory.
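The contrast between O(m^3) and O(m^2) workloads can be illustrated with a rough back-of-the-envelope sketch. It is not the performance model described above: the function names, the operation and traffic counts (naive multiplication, each matrix moved exactly once), and the peak-rate and bandwidth parameters are all assumptions chosen purely for illustration.

```python
def work_and_traffic(m: int, kernel: str) -> tuple[int, int]:
    """Return (operations, matrix elements moved) for an m x m problem.

    Assumed counts, for illustration only:
      matmul:    2*m^3 operations (multiply + add), 3*m^2 elements
                 (read A and B once, write C once, i.e. ideal on-chip reuse)
      transpose: m^2 element moves, 2*m^2 elements (read A, write A^T)
    """
    if kernel == "matmul":
        return 2 * m**3, 3 * m**2
    if kernel == "transpose":
        return m**2, 2 * m**2
    raise ValueError(f"unknown kernel: {kernel}")


def ops_per_element(m: int, kernel: str) -> float:
    """Operations performed per matrix element moved to or from memory."""
    ops, elements = work_and_traffic(m, kernel)
    return ops / elements


def limiting_resource(m: int, kernel: str,
                      peak_ops_per_s: float,
                      bandwidth_elems_per_s: float) -> str:
    """Say whether compute time or memory time dominates.

    peak_ops_per_s and bandwidth_elems_per_s are hypothetical machine
    parameters, not values taken from the paper.
    """
    ops, elements = work_and_traffic(m, kernel)
    compute_time = ops / peak_ops_per_s
    memory_time = elements / bandwidth_elems_per_s
    return "compute" if compute_time >= memory_time else "memory"
```

Under these assumptions, `ops_per_element(m, "matmul")` grows linearly with m while `ops_per_element(m, "transpose")` stays constant, so for large matrices extra memory bandwidth helps the transpose-like kernel far more than it helps matrix multiplication, which is consistent with the trend reported above.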