Comparative study of term-weighting schemes for environmental big data using machine learning

Cited by: 4
Authors
Kim, JungJin [1]
Kim, Han-Ul [2]
Adamowski, Jan [3]
Hatami, Shadi [3]
Jeong, Hanseok [1, 4, 5]
Affiliations
[1] Seoul Natl Univ Sci & Technol, Inst Environm Technol, Seoul 01811, South Korea
[2] Seoul Natl Univ Sci & Technol, Dept Appl Artificial Intelligence, Seoul 01811, South Korea
[3] McGill Univ, Dept Bioresource Engn, Ste Anne De Bellevue, PQ, Canada
[4] Seoul Natl Univ Sci & Technol, Dept Environm Engn, Seoul 01811, South Korea
[5] 120-1 Chungun Hall 232 Gongneung ro, Seoul 01811, South Korea
Funding
National Research Foundation of Singapore
Keywords
Text classification; Environmental digital news; Term-weighting schemes; Feature selection; TEXT; CLASSIFICATION; FRAMEWORK
DOI
10.1016/j.envsoft.2022.105536
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
Widely used term-weighting schemes and machine learning (ML) classifiers with default parameter settings were assessed for their performance when applied to environmental big data analysis. Five term-weighting schemes [term frequency (TF), TF-inverse document frequency (TF-IDF), Best Match 25 (BM25), TF-inverse gravity moment (TF-IGM), and TF-IDF-inverse class frequency (TF-IDF-ICF)] and five ML classifiers [support vector machine (SVM), Naive Bayes (NB), logistic regression (LR), random forest (RF), and extreme gradient boosting (XGBoost)] were tested. The optimal term-weighting scheme and classifier were TF-IDF-ICF and LR, respectively. Based on the evaluation criteria, their combination gave the best performance of all scheme and classifier combinations for the full environmental data analysis. Category classification performance differed according to the environmental section (climate, air, water, or waste/garbage), with the best performance achieved for climate and the poorest for water. These results demonstrate the importance of selecting appropriate term-weighting schemes and ML classifiers in human-generated environmental big data analysis.
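To make the difference between three of the schemes named in the abstract concrete, the following is a minimal sketch that computes TF, TF-IDF, and TF-IDF-ICF weights for a toy corpus. The corpus, the function name `term_weights`, and the exact formulas are assumptions for illustration: this record does not state which variants the authors used. The sketch uses raw counts for TF, `1 + log(N / df)` for IDF, and `1 + log(C / cf)` for ICF, one commonly cited form of the class-frequency factor.

```python
import math
from collections import Counter

# Hypothetical toy corpus: (tokens, category) pairs, not data from the paper.
docs = [
    ("flood river water quality".split(), "water"),
    ("river water pollution".split(), "water"),
    ("carbon emissions climate warming".split(), "climate"),
    ("climate policy carbon tax".split(), "climate"),
    ("air pollution smog".split(), "air"),
]


def term_weights(docs):
    """Return, per document, a dict of term -> {tf, tf_idf, tf_idf_icf}."""
    n_docs = len(docs)
    n_classes = len({label for _, label in docs})

    doc_freq = Counter()   # number of documents containing each term
    term_classes = {}      # set of categories in which each term occurs
    for tokens, label in docs:
        for term in set(tokens):
            doc_freq[term] += 1
            term_classes.setdefault(term, set()).add(label)

    table = []
    for tokens, _ in docs:
        row = {}
        for term, tf in Counter(tokens).items():
            # Assumed variants: unsmoothed logs with a +1 offset.
            idf = 1 + math.log(n_docs / doc_freq[term])
            icf = 1 + math.log(n_classes / len(term_classes[term]))
            row[term] = {
                "tf": tf,
                "tf_idf": tf * idf,
                "tf_idf_icf": tf * idf * icf,
            }
        table.append(row)
    return table


table = term_weights(docs)
for term, w in sorted(table[4].items()):
    print(term, {k: round(v, 3) for k, v in w.items()})
```

In the last document, "smog" occurs in a single document and a single category, while "pollution" is shared with the water category, so the ICF factor widens the gap between them relative to TF-IDF alone. This is the intuition behind class-aware weighting; it is not a reproduction of the paper's experiment.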
Pages: 11