Convolutional Neural Networks (CNNs) have reached out-standing results in several complex visual recognition tasks, such as classification and scene parsing. CNNs are com-posed of multiple filtering layers that perform 2D convolu-Tions over input images. The intrinsic parallelism in such a computation kernel makes it suitable to be effectively ac-celerated on parallel hardware. In this paper we propose a highly flexible and scalable architectural template for accel-eration of CNNs on FPGA devices, based on the cooperation between a set of software cores and a parallel convolution engine that communicate via a tightly coupled L1 shared scratchpad. Our accelerator structure, tested on a Xilinx Zynq XC-Z7045 device, delivers peak performance up to 80 GMAC/s, corresponding to 100 MMAC/s for each DSP slice in the programmable fabric. Thanks to the flexible archi-Tecture, convolution operations can be scheduled in order to reduce input/output bandwidth down to 8 bytes per cy-cle without degrading the performance of the accelerator in most of the meaningful use-cases.
|Titolo:||Curbing the roofline: A scalable and flexible architecture for CNNs on FPGA|
|Data di pubblicazione:||2016|
|Tipologia:||4.1 Contributo in Atti di convegno|