Comparative study of term-weighting schemes for environmental big data using machine learning

Cited by: 4

Authors:

Kim, JungJin [1]

Kim, Han-Ul [2]

Adamowski, Jan [3]

Hatami, Shadi [3]

Jeong, Hanseok [1,4,5]

Affiliations:

[1] Seoul Natl Univ Sci & Technol, Inst Environm Technol, Seoul 01811, South Korea

[2] Seoul Natl Univ Sci & Technol, Dept Appl Artificial Intelligence, Seoul 01811, South Korea

[3] McGill Univ, Dept Bioresource Engn, Ste Anne De Bellevue, PQ, Canada

[4] Seoul Natl Univ Sci & Technol, Dept Environm Engn, Seoul 01811, South Korea

[5] 120-1 Chungun Hall 232 Gongneung ro, Seoul 01811, South Korea

Source:

ENVIRONMENTAL MODELLING & SOFTWARE | 2022 / Volume 157

Funding:

National Research Foundation, Singapore;

Keywords:

Text classification; Environmental digital news; Term-weighting schemes; Feature selection; TEXT; CLASSIFICATION; FRAMEWORK;

DOI:

10.1016/j.envsoft.2022.105536

Chinese Library Classification (CLC) number:

TP39 [Computer applications];

Discipline classification code:

081203 ; 0835 ;

Abstract:

Widely-used term-weighting schemes and machine learning (ML) classifiers with default parameter settings were assessed for their performance when applied to environmental big data analysis. Five term-weighting schemes [term frequency (TF), TF-inverse document frequency (TF-IDF), Best Match 25 (BM25), TF-inverse gravity moment (TF-IGM), and TF-IDF-inverse class frequency (TF-IDF-ICF)] and five different ML classifiers [support vector machine (SVM), Naive Bayes (NB), logistic regression (LR), random forest (RF), and extreme gradient boosting (XGBoost)] were tested. The optimal text-classification scheme and classifier were TF-IDF-ICF and LR, respectively. Based on evaluation criteria, their combination resulted in the best performance of all scheme and classifier combinations for the full environmental data analysis. Category classification performance differed according to the environmental section (climate, air, water, or waste/garbage), with the best performance being achieved for climate, and the poorest for water. This demonstrated the importance of selecting term-weighting schemes and ML classifiers in human-generated environmental big data analysis.
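For orientation, the two simplest schemes named in the abstract (TF and TF-IDF) can be sketched as below. This is an illustrative sketch only, not the authors' implementation: the function name `tf_idf`, the toy documents, and the unsmoothed IDF form log(N / df) are all assumptions, since the abstract does not state which variant was used.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed weighting: raw term frequency times log(N / df),
    with no smoothing or length normalization.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical toy corpus, loosely echoing the air, water, and climate sections.
docs = [
    "air pollution dust air".split(),
    "water pollution river".split(),
    "climate warming carbon".split(),
]
weights = tf_idf(docs)
```

In this toy corpus, "air" occurs twice in a single document and therefore outweighs "pollution", which occurs in two of the three documents. Class-aware schemes such as TF-IGM and TF-IDF-ICF add a factor based on how a term is distributed across categories, which this sketch does not attempt.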


Pages: 11

