Comparative study of term-weighting schemes for environmental big data using machine learning

Cited by: 4

Authors:

Kim, JungJin [1]

Kim, Han-Ul [2]

Adamowski, Jan [3]

Hatami, Shadi [3]

Jeong, Hanseok [1,4,5]

Affiliations:

[1] Seoul Natl Univ Sci & Technol, Inst Environm Technol, Seoul 01811, South Korea

[2] Seoul Natl Univ Sci & Technol, Dept Appl Artificial Intelligence, Seoul 01811, South Korea

[3] McGill Univ, Dept Bioresource Engn, Ste Anne De Bellevue, PQ, Canada

[4] Seoul Natl Univ Sci & Technol, Dept Environm Engn, Seoul 01811, South Korea

[5] 120-1 Chungun Hall 232 Gongneung ro, Seoul 01811, South Korea

Source:

ENVIRONMENTAL MODELLING & SOFTWARE | 2022 / Vol. 157

Funding:

National Research Foundation, Singapore;

Keywords:

Text classification; Environmental digital news; Term-weighting schemes; Feature selection; TEXT; CLASSIFICATION; FRAMEWORK;

DOI:

10.1016/j.envsoft.2022.105536

Chinese Library Classification (CLC):

TP39 [Computer applications];

Subject classification code:

081203 ; 0835 ;

Abstract:

Widely used term-weighting schemes and machine learning (ML) classifiers with default parameter settings were assessed for their performance when applied to environmental big data analysis. Five term-weighting schemes [term frequency (TF), TF-inverse document frequency (TF-IDF), Best Match 25 (BM25), TF-inverse gravity moment (TF-IGM), and TF-IDF-inverse class frequency (TF-IDF-ICF)] and five ML classifiers [support vector machine (SVM), Naive Bayes (NB), logistic regression (LR), random forest (RF), and extreme gradient boosting (XGBoost)] were tested. The optimal term-weighting scheme and classifier were TF-IDF-ICF and LR, respectively. Based on the evaluation criteria, their combination gave the best performance of all scheme and classifier combinations for the full environmental data analysis. Category classification performance differed according to the environmental section (climate, air, water, or waste/garbage), with the best performance achieved for climate and the poorest for water. This demonstrated the importance of selecting term-weighting schemes and ML classifiers in human-generated environmental big data analysis.
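To make the best-performing scheme named in the abstract more concrete, the following is a minimal sketch of TF-IDF-ICF weighting in plain Python. It assumes one commonly cited formulation, weight = tf × (1 + log(D/df)) × (1 + log(C/cf)), where D is the number of documents, df the number of documents containing the term, C the number of classes, and cf the number of classes in which the term occurs. The abstract does not state which variant the authors used, so the formula, the function name `tf_idf_icf`, and the toy documents are all illustrative assumptions, not the paper's implementation.

```python
import math
from collections import Counter


def tf_idf_icf(docs, labels):
    """Return one {term: weight} dict per document.

    docs   : list of token lists (hypothetical, pre-tokenized text)
    labels : list of class labels aligned with docs
    Assumed formula: tf * (1 + ln(D / df)) * (1 + ln(C / cf)).
    """
    n_docs = len(docs)
    n_classes = len(set(labels))

    doc_freq = Counter()   # number of documents containing each term
    term_classes = {}      # set of classes in which each term appears
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            doc_freq[term] += 1
            term_classes.setdefault(term, set()).add(label)

    weighted = []
    for tokens in docs:
        weights = {}
        for term, count in Counter(tokens).items():
            idf = 1.0 + math.log(n_docs / doc_freq[term])
            icf = 1.0 + math.log(n_classes / len(term_classes[term]))
            weights[term] = count * idf * icf
        weighted.append(weights)
    return weighted


# Toy corpus using the four environmental sections named in the abstract.
docs = [
    ["carbon", "emission", "climate"],
    ["smog", "emission", "air"],
    ["river", "pollution", "water"],
    ["landfill", "pollution", "waste"],
]
labels = ["climate", "air", "water", "waste"]

weights = tf_idf_icf(docs, labels)
```

In this toy corpus, a term confined to one class (such as "climate") receives a larger weight than a term shared across two classes (such as "emission"), which is the intended effect of the class-frequency factor. The resulting weight dictionaries could then be vectorized and passed to a classifier such as logistic regression.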


Pages: 11
