Research Progress on FPGA-based Machine Learning Hardware Acceleration

Cited by: 0
Authors
Wang C. [1 ]
Wang T. [1 ]
Ma X. [1 ]
Zhou X.-H. [1 ]
Affiliations
[1] School of Computer Science and Technology, University of Science and Technology of China, Hefei
Source
Chinese Journal of Computers
Keywords
Accelerator; Big data; Field-Programmable Gate Array; Machine learning; Neural network accelerators
DOI
10.11897/SP.J.1016.2020.01161
Abstract
With the explosive growth of data and the widespread use of data mining applications in fields such as speech, language, image, and video, we have entered the era of big data. In this era, how to access data efficiently and reliably and how to speed up data mining applications have become key problems that academia and industry urgently need to solve. Machine learning algorithms, as the core components of data mining applications, are being applied in more and more fields and are attracting growing attention from researchers. Accelerating machine learning algorithms with existing hardware and software techniques has therefore become a research hotspot. In this research boom, current acceleration platforms can be grouped into four categories: custom logic circuits (such as FPGAs and ASICs), general-purpose graphics processing units (GPGPUs), cloud computing platforms, and heterogeneous computing platforms. These platforms offer different granularities of parallelism and suit different application scenarios; they can also be combined into heterogeneous systems that exploit the processing capabilities of several acceleration devices at once. Owing to its customizability and high energy efficiency, FPGA-based hardware acceleration has become a particularly popular choice for machine learning acceleration, so this paper focuses on FPGA-based accelerators for machine learning algorithms. Chapters 1 and 2 introduce machine learning algorithms and the relevant background knowledge. Chapter 3 reviews the current development of the field in three parts: acceleration methods, hardware acceleration platforms, and accelerator evaluation. It also gives an overview of four possible entry points for accelerator design: accelerating the computational kernels of an algorithm, abstracting features common to a family of algorithms, parallelizing the algorithm, and optimizing data communication. Chapter 4 then introduces the design and implementation of current mainstream accelerators through four kinds of examples, ordered by scope: accelerators for specific problems, for specific algorithms, for common features, and hardware templates. The chapter also classifies and summarizes accelerator architectures into two types, Stream and Single Engine. The Stream model optimizes each computation stage in its own hardware block and overlaps the stages in a pipeline to achieve high performance, whereas the Single Engine model focuses on the features common to all computation stages, so it can devote larger and more numerous hardware blocks to one reusable engine and thereby obtain both high computational performance and broad compatibility. Finally, the paper summarizes the field of hardware-accelerated machine learning and puts forward six points on future research directions and development trends. In summary, this paper reviews the current state of machine learning accelerator development and, through an analysis of existing acceleration methods, points out directions for future work. © 2020, Science Press. All rights reserved.
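To make the Stream versus Single Engine distinction in the abstract concrete, the sketch below contrasts the two styles on a toy two-layer network. It is an illustrative assumption, not code from the surveyed paper: it is written as HLS-flavoured C++, the #pragma lines are hints for an HLS tool and are ignored by an ordinary C++ compiler, and all sizes, names, and functions are invented for the example.

// Illustrative sketch (not from the paper): Stream vs. Single Engine styles.
#include <array>
#include <cstdio>

constexpr int N = 8;                                  // toy layer width

using Vec = std::array<float, N>;
using Mat = std::array<std::array<float, N>, N>;

// One dense layer with ReLU: the basic computation stage both styles build on.
static Vec dense_relu(const Mat& w, const Vec& x) {
    Vec y{};
    for (int i = 0; i < N; ++i) {
#pragma HLS PIPELINE II=1                             // aim for one output per cycle
        float acc = 0.0f;
        for (int j = 0; j < N; ++j) acc += w[i][j] * x[j];
        y[i] = acc > 0.0f ? acc : 0.0f;
    }
    return y;
}

// Stream style: every layer gets its own specialized hardware block; in a real
// HLS design the blocks would communicate through streams so that all layers
// work on different inputs concurrently.
static Vec stream_model(const Mat& w1, const Mat& w2, const Vec& x) {
#pragma HLS DATAFLOW
    Vec h = dense_relu(w1, x);                        // stage 1: its own block
    return dense_relu(w2, h);                         // stage 2: another block
}

// Single Engine style: one generic engine is instantiated once and reused for
// every layer in turn, trading stage-level concurrency for area and flexibility.
static Vec single_engine_model(const std::array<Mat, 2>& layers, Vec x) {
    for (const Mat& w : layers)                       // same block, iterated in time
        x = dense_relu(w, x);
    return x;
}

int main() {
    Mat w{};                                          // identity weights as toy data
    for (int i = 0; i < N; ++i) w[i][i] = 1.0f;
    Vec x{};
    for (int i = 0; i < N; ++i) x[i] = float(i) - 3.0f;

    Vec a = stream_model(w, w, x);
    Vec b = single_engine_model({w, w}, x);
    std::printf("stream[5]=%.1f  single-engine[5]=%.1f\n", a[5], b[5]);
    return 0;
}

As the abstract suggests, the Stream style fits designs whose computation structure is fixed at synthesis time, while the Single Engine style is the natural fit for the "common feature" and "hardware template" accelerators the survey groups together.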
Pages: 1161-1182
Page count: 21