Comparative study of term-weighting schemes for environmental big data using machine learning

Cited by: 4
Authors
Kim, JungJin [1]
Kim, Han-Ul [2]
Adamowski, Jan [3]
Hatami, Shadi [3]
Jeong, Hanseok [1, 4, 5]
Affiliations
[1] Seoul Natl Univ Sci & Technol, Inst Environm Technol, Seoul 01811, South Korea
[2] Seoul Natl Univ Sci & Technol, Dept Appl Artificial Intelligence, Seoul 01811, South Korea
[3] McGill Univ, Dept Bioresource Engn, Ste Anne De Bellevue, PQ, Canada
[4] Seoul Natl Univ Sci & Technol, Dept Environm Engn, Seoul 01811, South Korea
[5] 120-1 Chungun Hall 232 Gongneung ro, Seoul 01811, South Korea
Funding
National Research Foundation, Singapore;
Keywords
Text classification; Environmental digital news; Term-weighting schemes; Feature selection; TEXT; CLASSIFICATION; FRAMEWORK;
DOI
10.1016/j.envsoft.2022.105536
Chinese Library Classification (CLC) number
TP39 [Computer applications];
Subject classification codes
081203; 0835;
Abstract
Widely used term-weighting schemes and machine learning (ML) classifiers with default parameter settings were assessed for their performance in environmental big data analysis. Five term-weighting schemes [term frequency (TF), TF-inverse document frequency (TF-IDF), Best Match 25 (BM25), TF-inverse gravity moment (TF-IGM), and TF-IDF-inverse class frequency (TF-IDF-ICF)] and five ML classifiers [support vector machine (SVM), Naive Bayes (NB), logistic regression (LR), random forest (RF), and extreme gradient boosting (XGBoost)] were tested. The optimal term-weighting scheme and classifier were TF-IDF-ICF and LR, respectively. Based on the evaluation criteria, their combination performed best of all scheme and classifier combinations for the full environmental data analysis. Category classification performance differed according to the environmental section (climate, air, water, or waste/garbage), with the best performance achieved for climate and the poorest for water. These results demonstrate the importance of selecting appropriate term-weighting schemes and ML classifiers in the analysis of human-generated environmental big data.
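To make the difference between three of the schemes named in the abstract concrete, the following is a minimal pure-Python sketch on a toy corpus. It assumes one commonly cited formulation, w = tf × (1 + log(D/df)) × (1 + log(C/cf)), where D is the number of documents, df the number of documents containing the term, C the number of classes, and cf the number of classes containing the term. The abstract does not state the exact variant used in the study, so the formula, the function name `term_weights`, and the toy documents are all illustrative assumptions.

```python
import math
from collections import Counter

# Hypothetical toy corpus: (tokens, class label). Not data from the study.
DOCS = [
    (["carbon", "emission", "warming"], "climate"),
    (["warming", "policy", "carbon"], "climate"),
    (["river", "pollution", "policy"], "water"),
    (["river", "flood", "pollution"], "water"),
]


def term_weights(docs):
    """Return, per document, the TF, TF-IDF and TF-IDF-ICF weight of each term.

    Assumed formulation (one common variant, not confirmed by the abstract):
      idf = 1 + log(D / df),  icf = 1 + log(C / cf)
    """
    n_docs = len(docs)
    n_classes = len({label for _, label in docs})

    doc_freq = Counter()   # df: number of documents containing the term
    class_sets = {}        # classes in which the term occurs (cf = set size)
    for tokens, label in docs:
        for term in set(tokens):
            doc_freq[term] += 1
            class_sets.setdefault(term, set()).add(label)

    weighted = []
    for tokens, _ in docs:
        row = {}
        for term, tf in Counter(tokens).items():
            idf = 1.0 + math.log(n_docs / doc_freq[term])
            icf = 1.0 + math.log(n_classes / len(class_sets[term]))
            row[term] = {
                "tf": tf,
                "tf_idf": tf * idf,
                "tf_idf_icf": tf * idf * icf,
            }
        weighted.append(row)
    return weighted


if __name__ == "__main__":
    for term, w in term_weights(DOCS)[1].items():
        print(term, w)
```

In the second toy document, "carbon" and "policy" have the same TF-IDF weight because each occurs in two documents. Under the assumed ICF factor, "carbon" is weighted higher because it occurs in only one class, whereas "policy" occurs in both.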
Pages: 11