On the design of multi-directional systolic arrays for Band and Generic Matrix-Matrix Multiplications

(2023)

Files

GouveiaErgin_21461600_2022_APPENDIX1.pdf
  • Open access
  • Adobe PDF
  • 3.32 MB

GouveiaErgin_21461600_2022.pdf
  • Open access
  • Adobe PDF
  • 7.17 MB

Details

Abstract
For most of the 20th century, computers benefited from transistor scaling, which supported exponential performance improvement. As chip features shrink further, ever larger proportions of a chip must be turned off during operation because of power budget limitations. This obstacle calls for a paradigm shift in computer architecture. As a consequence, we have moved from an era of single-core design in the 20th century, through an era of homogeneous multi-core design at the beginning of the 21st century, to the now ever-expanding trend of heterogeneous multi-core architectures with custom accelerators. In this thesis, we explore multi-directional systolic accelerators for Band and Generic matrix-matrix multiplications (BMMM and GMMM). Starting from a systolic design introduced in the 1970s by Kung and Leiserson, we conceptualized changing the direction of some data paths so that a single design supports more than one operation. We then implemented the design in Verilog and the necessary memory management hardware in C++, using HLS tools to compile it to hardware. To link the RTL and HLS designs together, we developed a hybrid design workflow using Xilinx tools. For comparison, we also implemented an equivalent fully-HLS kernel. Architecturally, our design achieves a 20x performance improvement for many streamed 16x16 GMMM operations and a 610x performance improvement for BMMM on large matrices with a band size of 31, while using 30x more DSPs than our best HLS counterpart. Using our own fully-parametric data management hardware, we have achieved performance parity for GMMM and a 23x performance improvement for BMMM. Our benchmarks were performed on an Alveo U280 Data Center Card containing a Virtex UltraScale+ FPGA and High Bandwidth Memory.
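
To illustrate why band structure matters for BMMM, the following is a minimal software sketch, not the thesis's hardware design. It multiplies two band matrices while touching only in-band entries and counts the multiply-accumulate operations, so the count can be compared with the n^3 operations of a generic multiplication. The function names (`band_matmul`, `dense_matmul`, `make_band`) and the half-bandwidth parameters are hypothetical. The sketch also assumes that "band size" means the total number of non-zero diagonals, so a band size of 31 would correspond to a half-bandwidth of 15.

```python
def make_band(n, half_bw):
    """Build an n x n test matrix that is zero outside |i - j| <= half_bw."""
    return [
        [(i + 2 * j + 1) if abs(i - j) <= half_bw else 0 for j in range(n)]
        for i in range(n)
    ]


def band_matmul(a, b, half_bw_a, half_bw_b):
    """Multiply two n x n band matrices, visiting only in-band entries.

    Assumes a[i][k] == 0 when |i - k| > half_bw_a and
    b[k][j] == 0 when |k - j| > half_bw_b.
    Returns (product, number of multiply-accumulate operations).
    """
    n = len(a)
    c = [[0] * n for _ in range(n)]
    macs = 0
    for i in range(n):
        # Only columns of A inside the band of row i can be non-zero.
        for k in range(max(0, i - half_bw_a), min(n - 1, i + half_bw_a) + 1):
            # Only columns of B inside the band of row k can be non-zero.
            for j in range(max(0, k - half_bw_b), min(n - 1, k + half_bw_b) + 1):
                c[i][j] += a[i][k] * b[k][j]
                macs += 1
    return c, macs


def dense_matmul(a, b):
    """Reference generic multiplication; always performs n**3 operations."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    macs = 0
    for i in range(n):
        for k in range(n):
            for j in range(n):
                c[i][j] += a[i][k] * b[k][j]
                macs += 1
    return c, macs
```

Calling `band_matmul(make_band(64, 15), make_band(64, 15), 15, 15)` returns the same product as `dense_matmul` on the same inputs while performing fewer multiply-accumulate operations. For a fixed band size, the gap widens as the matrix dimension grows.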