Improving memory hierarchy performance using data reorganization

In state-of-the-art computing platforms with deep memory hierarchies, a mismatch between the data layout and the data access pattern of a computation results in expensive memory accesses, degrading overall performance. To improve memory hierarchy performance, we propose data reorganization, in which the data layout is reorganized to match the data access pattern. Our approach reduces the number of cache misses on uni-processor platforms and the number of remote memory accesses on multi-processor platforms. This dissertation presents three main contributions in using data reorganization.

Data reorganization on multi-processor platforms is called data redistribution. The redistribution itself, however, requires interprocessor communication. To minimize this overhead, we propose an efficient array redistribution algorithm between processor sets. The proposed algorithm minimizes the number of communication steps, eliminates node contention, and fully utilizes the network bandwidth in each communication step, thereby reducing the data transfer time. Furthermore, the time to compute the communication schedule and the index sets is minimized, which makes our algorithm suitable for runtime data redistribution.

For dense matrix applications, block data layout (BDL) has been proposed in conjunction with tiling. In this dissertation, this is called static data reorganization, since the data layout is reorganized before the computation begins and remains fixed during the computation. For generic matrix access patterns, we derive an asymptotic lower bound on the number of Translation Look-aside Buffer (TLB) misses for any data layout and show that BDL achieves this bound. We also show that, for tiled matrix multiplication, BDL reduces the number of TLB misses by a factor of O(B) compared with conventional data layouts, where B is the block size of BDL. Using our TLB and cache performance analysis, we also discuss the impact of the block size on memory hierarchy performance.

In some computations, the data access stride changes as the computation proceeds. If the data layout is fixed during the computation, this change results in poor cache performance. To improve cache performance, the data layout is dynamically reorganized between computation stages; we call this the dynamic data layout (DDL) approach. Using the DDL approach, we develop optimized packages for the Fast Fourier Transform (FFT) and the Walsh-Hadamard Transform (WHT). Simulation and experimental results show that our FFT and WHT packages outperform other FFT and WHT packages.
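To make the notion of a contention-free redistribution schedule concrete, the following is a minimal illustrative sketch, not the scheduling algorithm developed in this dissertation. It assumes, for simplicity, that the source and destination processor sets have the same size P and that the network is fully connected: in step k, processor p sends to processor (p + k) mod P, so every processor sends to exactly one destination and receives from exactly one source per step, and the exchange completes in P steps.

```c
/*
 * Illustrative contention-free communication schedule (assumption: equal
 * source/destination processor sets of size P, fully connected network).
 * In step k, processor p sends to processor (p + k) % P, so no node ever
 * receives from two senders in the same step.
 */
#include <stdio.h>

#define P 4  /* number of processors (illustrative value) */

int main(void) {
    for (int step = 0; step < P; step++) {
        printf("step %d:", step);
        for (int src = 0; src < P; src++) {
            int dst = (src + step) % P;   /* partner of src in this step */
            printf("  %d->%d", src, dst);
        }
        printf("\n");
    }
    return 0;
}
```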
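The index mapping under block data layout can be sketched as follows. This sketch assumes the block size B divides the matrix dimension N and that both the grid of blocks and the elements within each block are stored in row-major order; the actual layout used in the dissertation's experiments may differ in these details.

```c
/* Offset of element (i, j) of an N x N matrix stored in block data layout
 * (BDL) with block size B: the matrix is an (N/B) x (N/B) grid of B x B
 * blocks, blocks are laid out in row-major order, and elements within a
 * block are also row-major.  Assumes B divides N. */
#include <stddef.h>

size_t bdl_offset(size_t i, size_t j, size_t N, size_t B) {
    size_t bi = i / B, bj = j / B;        /* block coordinates        */
    size_t oi = i % B, oj = j % B;        /* offsets inside the block */
    size_t blocks_per_row = N / B;
    return (bi * blocks_per_row + bj) * (B * B)   /* start of the block */
         + oi * B + oj;                           /* within the block   */
}
```

With this mapping, the B x B elements touched by one tile of a tiled computation are contiguous in memory, which is what reduces the number of pages, and hence TLB entries, a tile touches.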
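The reorganization step of the DDL approach can likewise be illustrated with a small sketch. It assumes a length n1*n2 data vector that the next computation stage would otherwise access with stride n1; reorganizing (transposing) it between stages makes those accesses unit-stride. This only illustrates the layout change, not the FFT or WHT packages themselves.

```c
/* Illustrative DDL reorganization step: transpose an n1 x n2 view of the
 * data (here plain doubles for simplicity) so that accesses that would
 * have stride n1 in the next stage become unit-stride. */
#include <stddef.h>

void reorganize(const double *in, double *out, size_t n1, size_t n2) {
    for (size_t i = 0; i < n1; i++)
        for (size_t j = 0; j < n2; j++)
            out[j * n1 + i] = in[i * n2 + j];   /* transpose n1 x n2 -> n2 x n1 */
}
```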