
Curbing the roofline: A scalable and flexible architecture for CNNs on FPGA

Meloni, Paolo; Deriu, Gianfranco; Raffo, Luigi
2016-01-01

Abstract

Convolutional Neural Networks (CNNs) have reached outstanding results in several complex visual recognition tasks, such as classification and scene parsing. CNNs are composed of multiple filtering layers that perform 2D convolutions over input images. The intrinsic parallelism in such a computation kernel makes it suitable to be effectively accelerated on parallel hardware. In this paper we propose a highly flexible and scalable architectural template for acceleration of CNNs on FPGA devices, based on the cooperation between a set of software cores and a parallel convolution engine that communicate via a tightly coupled L1 shared scratchpad. Our accelerator structure, tested on a Xilinx Zynq XC-Z7045 device, delivers peak performance up to 80 GMAC/s, corresponding to 100 MMAC/s for each DSP slice in the programmable fabric. Thanks to the flexible architecture, convolution operations can be scheduled in order to reduce input/output bandwidth down to 8 bytes per cycle without degrading the performance of the accelerator in most of the meaningful use-cases.
2016
ISBN: 9781450341288
Accelerator; Convolutional Neural Network; FPGA; Software
Files in this product:
File: p376-meloni.pdf (archive managers only; request a copy)
Type: publisher's version (VoR)
Size: 182.35 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11584/177741
Citations
  • PMC: n/a
  • Scopus: 14
  • Web of Science (ISI): 12