On the design of multi-directional systolic arrays for Band and Generic Matrix-Matrix Multiplications

Gouveia Ergin, Leonel

Files

GouveiaErgin_21461600_2022_APPENDIX1.pdf

Open access
Adobe PDF
3.32 MB

GouveiaErgin_21461600_2022.pdf

Open access
Adobe PDF
7.17 MB

Details

Supervisors: Christian Pilato ; Stephanie Soldavini
Faculty: Ecole polytechnique de Louvain
Degree label: Master [120] in Electrical Engineering, professional focus
Abstract: For most of the 20th century, computers benefited from transistor scaling, which supported exponential performance improvement. As computer-chip features get smaller, ever larger proportions of a chip must be turned off during operation due to power budget limitations. This new obstacle calls for a paradigm shift in computer architecture. As a consequence, we have moved from an era of single-core design in the 20th century, through an era of homogeneous multi-core design at the beginning of the 21st century, to the now ever-expanding trend of heterogeneous multi-core architectures with custom accelerators. In this thesis, we explore multi-directional systolic accelerators for Band and Generic matrix-matrix multiplications (BMMM and GMMM). Starting from a systolic design introduced in the 1970s by Kung and Leiserson, we conceptualized changing the direction of some data paths so that the same array can perform more than one operation. We then implemented the design in Verilog and the necessary memory management hardware in C++, using HLS tools to compile it to hardware. To link the RTL and HLS designs together, we developed a hybrid design workflow using Xilinx tools. For comparison, we also implemented an equivalent fully-HLS kernel. Architecturally, our design achieves a 20x performance improvement for many streamed 16x16 GMMM operations and a 610x performance improvement for large BMMM matrices with a band size of 31, while using 30x more DSPs than our best HLS counterpart. Using our own fully-parametric data management hardware, we have achieved performance parity for GMMM and a 23x performance improvement for BMMM. Our benchmarks were performed on an Alveo U280 Data Center Card containing a Virtex UltraScale+ FPGA and High Bandwidth Memory.
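
To illustrate why band matrices get dedicated treatment, the following is a minimal software sketch of a band matrix-matrix multiplication that skips entries known to be zero. It is not the systolic design described in the thesis. The function names and the `half_bw` parameter (the number of nonzero diagonals on each side of the main diagonal) are assumptions introduced here for illustration only.

```python
def band_matmul(a, b, half_bw):
    """Multiply two n x n band matrices, touching only in-band entries.

    Assumes a[i][k] == 0 and b[i][k] == 0 whenever |i - k| > half_bw.
    Returns the product and the number of multiply-accumulates performed.
    """
    n = len(a)
    c = [[0] * n for _ in range(n)]
    mac_count = 0
    for i in range(n):
        # The product of two such matrices has half-bandwidth 2 * half_bw,
        # so columns further away from i are guaranteed to stay zero.
        j_lo = max(0, i - 2 * half_bw)
        j_hi = min(n - 1, i + 2 * half_bw)
        for j in range(j_lo, j_hi + 1):
            # Only indices k inside both bands can contribute to c[i][j].
            k_lo = max(0, i - half_bw, j - half_bw)
            k_hi = min(n - 1, i + half_bw, j + half_bw)
            for k in range(k_lo, k_hi + 1):
                c[i][j] += a[i][k] * b[k][j]
                mac_count += 1
    return c, mac_count


def dense_matmul(a, b):
    """Reference dense product, used only to check the band version."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def make_band(n, half_bw):
    """Build a hypothetical n x n test matrix with zeros outside the band."""
    return [[(i + 2 * j + 1) if abs(i - j) <= half_bw else 0
             for j in range(n)]
            for i in range(n)]
```

Calling `band_matmul(make_band(8, 1), make_band(8, 1), 1)` gives the same product as `dense_matmul` while performing far fewer than the 8^3 multiply-accumulates of the dense loop. This gap in useful work is the kind of structure a band-specific accelerator can exploit.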
