A Survey on Data Collection for Machine Learning: A Big Data-AI Integration Perspective

Cited by: 360
Authors
Roh, Yuji [1]
Heo, Geon [1]
Whang, Steven Euijong [1]
Affiliations
[1] Korea Adv Inst Sci & Technol, Sch Elect Engn, Daejeon, South Korea
Funding
National Research Foundation of Singapore
Keywords
Machine learning; Data collection; Labeling; Data models; Data acquisition; Training data; Smart manufacturing; Data labeling; Challenges
DOI
10.1109/TKDE.2019.2946162
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Data collection is a major bottleneck in machine learning and an active research topic in multiple communities. There are largely two reasons data collection has recently become a critical issue. First, as machine learning is becoming more widely used, we are seeing new applications that do not necessarily have enough labeled data. Second, unlike traditional machine learning, deep learning techniques automatically generate features, which saves feature engineering costs, but in return may require larger amounts of labeled data. Interestingly, recent research in data collection comes not only from the machine learning, natural language, and computer vision communities, but also from the data management community due to the importance of handling large amounts of data. In this survey, we perform a comprehensive study of data collection from a data management point of view. Data collection largely consists of data acquisition, data labeling, and improvement of existing data or models. We provide a research landscape of these operations, provide guidelines on which technique to use when, and identify interesting research challenges. The integration of machine learning and data management for data collection is part of a larger trend of Big Data and Artificial Intelligence (AI) integration and opens many opportunities for new research.
Pages: 1328-1347
Page count: 20