UNICA IRIS Institutional Research Information System

Curbing the roofline: A scalable and flexible architecture for CNNs on FPGA

Meloni, Paolo; Deriu, Gianfranco; Conti, Francesco; Loi, Igor; Raffo, Luigi; Benini, Luca

2016-01-01

Abstract

Convolutional Neural Networks (CNNs) have reached outstanding results in several complex visual recognition tasks, such as classification and scene parsing. CNNs are composed of multiple filtering layers that perform 2D convolutions over input images. The intrinsic parallelism in such a computation kernel makes it suitable to be effectively accelerated on parallel hardware. In this paper we propose a highly flexible and scalable architectural template for acceleration of CNNs on FPGA devices, based on the cooperation between a set of software cores and a parallel convolution engine that communicate via a tightly coupled L1 shared scratchpad. Our accelerator structure, tested on a Xilinx Zynq XC-Z7045 device, delivers peak performance up to 80 GMAC/s, corresponding to 100 MMAC/s for each DSP slice in the programmable fabric. Thanks to the flexible architecture, convolution operations can be scheduled in order to reduce input/output bandwidth down to 8 bytes per cycle without degrading the performance of the accelerator in most of the meaningful use cases.
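
As a reading aid for the figures quoted in the abstract, the sketch below shows the kind of multiply-accumulate (MAC) kernel that a 2D convolution reduces to, and relates the two peak numbers to each other. It is not the authors' implementation: `conv2d_valid` and `ideal_seconds` are hypothetical helper names, the loop is a plain software reference (no padding, stride 1, cross-correlation form), and the DSP slice count is only what the two quoted rates imply when divided, not a number stated in this record.

```python
# Peak figures quoted in the abstract.
PEAK_MACS_PER_S = 80e9       # 80 GMAC/s for the whole accelerator
PER_DSP_MACS_PER_S = 100e6   # 100 MMAC/s per DSP slice

# Assumption: the peak rate is the per-slice rate times the number of slices in use.
implied_dsp_slices = PEAK_MACS_PER_S / PER_DSP_MACS_PER_S


def conv2d_valid(image, kernel):
    """Naive 2D convolution over a list-of-lists image; returns (output, MAC count)."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = ih - kh + 1, iw - kw + 1
    out = [[0] * ow for _ in range(oh)]
    macs = 0
    for y in range(oh):
        for x in range(ow):
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    # One multiply-accumulate: the unit counted in GMAC/s.
                    acc += image[y + i][x + j] * kernel[i][j]
                    macs += 1
            out[y][x] = acc
    return out, macs


def ideal_seconds(macs, peak=PEAK_MACS_PER_S):
    """Lower bound on run time if every MAC were issued at the quoted peak rate."""
    return macs / peak


if __name__ == "__main__":
    image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    kernel = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
    out, macs = conv2d_valid(image, kernel)
    print(out, macs, implied_dsp_slices, ideal_seconds(macs))
```

Every output pixel is independent of the others, which is the intrinsic parallelism the abstract refers to: the two outer loops can be distributed across hardware MAC units without changing the result.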

Year: 2016
ISBN: 9781450341288
Keywords: Accelerator; Convolutional Neural Network; FPGA; Software
Type: 4.1 Contribution in conference proceedings

Files in this item:

File	Access	Type	Size	Format
p376-meloni.pdf	Archive managers only	Publisher's version (VoR)	182.35 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11584/177741
