Comparative study of term-weighting schemes for environmental big data using machine learning

Cited by: 4
Authors
Kim, JungJin [1]
Kim, Han-Ul [2]
Adamowski, Jan [3]
Hatami, Shadi [3]
Jeong, Hanseok [1, 4, 5]
Affiliations
[1] Seoul Natl Univ Sci & Technol, Inst Environm Technol, Seoul 01811, South Korea
[2] Seoul Natl Univ Sci & Technol, Dept Appl Artificial Intelligence, Seoul 01811, South Korea
[3] McGill Univ, Dept Bioresource Engn, Ste Anne De Bellevue, PQ, Canada
[4] Seoul Natl Univ Sci & Technol, Dept Environm Engn, Seoul 01811, South Korea
[5] 120-1 Chungun Hall 232 Gongneung ro, Seoul 01811, South Korea
Funding
National Research Foundation, Singapore
Keywords
Text classification; Environmental digital news; Term-weighting schemes; Feature selection; TEXT; CLASSIFICATION; FRAMEWORK
DOI
10.1016/j.envsoft.2022.105536
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
Widely-used term-weighting schemes and machine learning (ML) classifiers with default parameter settings were assessed for their performance when applied to environmental big data analysis. Five term-weighting schemes [term frequency (TF), TF-inverse document frequency (TF-IDF), Best Match 25 (BM25), TF-inverse gravity moment (TF-IGM), and TF-IDF-inverse class frequency (TF-IDF-ICF)] and five different ML classifiers [support vector machine (SVM), Naive Bayes (NB), logistic regression (LR), random forest (RF), and extreme gradient boosting (XGBoost)] were tested. The optimal text-classification scheme and classifier were TF-IDF-ICF and LR, respectively. Based on evaluation criteria, their combination resulted in the best performance of all scheme and classifier combinations for the full environmental data analysis. Category classification performance differed according to the environmental section (climate, air, water, or waste/garbage), with the best performance being achieved for climate, and the poorest for water. This demonstrated the importance of selecting term-weighting schemes and ML classifiers in human-generated environmental big data analysis.
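As a rough illustration of what a class-aware weighting scheme such as TF-IDF-ICF computes, the sketch below weights each term by its term frequency, an inverse document frequency factor, and an inverse class frequency factor. The formula used, tf × (1 + ln(D/df)) × (1 + ln(C/cf)), is one commonly cited formulation and is an assumption here, since the abstract does not state the exact variant the authors used. The function name `tf_idf_icf` and the toy documents and labels are hypothetical.

```python
import math
from collections import Counter


def tf_idf_icf(docs, labels):
    """Return one {term: weight} dict per document.

    Assumed formula (not taken from the paper):
        weight = tf * (1 + ln(D / df)) * (1 + ln(C / cf))
    where D is the number of documents, df the number of documents
    containing the term, C the number of classes, and cf the number
    of classes in which the term occurs.
    """
    n_docs = len(docs)
    n_classes = len(set(labels))

    doc_freq = Counter()   # term -> number of documents containing it
    term_classes = {}      # term -> set of class labels containing it
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            doc_freq[term] += 1
            term_classes.setdefault(term, set()).add(label)

    weighted = []
    for tokens in docs:
        tf = Counter(tokens)
        weighted.append({
            term: count
            * (1 + math.log(n_docs / doc_freq[term]))
            * (1 + math.log(n_classes / len(term_classes[term])))
            for term, count in tf.items()
        })
    return weighted


# Hypothetical toy corpus: two "water" items and two "air" items.
docs = [
    ["flood", "river", "water"],
    ["river", "pollution", "water"],
    ["smog", "pollution", "air"],
    ["air", "quality", "smog"],
]
labels = ["water", "water", "air", "air"]
weights = tf_idf_icf(docs, labels)
```

In this toy corpus, "river" and "pollution" each appear in two documents, but "river" is confined to a single class while "pollution" spans both, so the inverse class frequency factor gives "river" the higher weight. The resulting weight dictionaries could then be vectorized and passed to any of the classifiers named in the abstract.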
Pages: 11