The Multiclass Classification of Newspaper Articles with Machine Learning: The Hybrid Binary Snowball Approach

Cited by: 16
Authors
Sebok, Miklos [1]
Kacsuk, Zoltan [1,2]
Affiliations
[1] Hungarian Acad Sci, Ctr Social Sci, Budapest, Hungary
[2] Hsch Medien, Stuttgart, Germany
Keywords
machine learning; statistical analysis of texts; Comparative Agendas Project; multiclass classification; automated content analysis
DOI
10.1017/pan.2020.27
Chinese Library Classification number
D0 [Political science, political theory]
Discipline classification code
0302; 030201
Abstract
In this article, we present a machine learning-based solution for matching the performance of the gold standard of double-blind human coding when it comes to content analysis in comparative politics. We combine a quantitative text analysis approach with supervised learning and limited human resources in order to classify the front-page articles of a leading Hungarian daily newspaper based on their full text. Our goal was to assign items in our dataset to one of 21 policy topics based on the codebook of the Comparative Agendas Project. The classification of the imbalanced classes of topics was handled by a hybrid binary snowball workflow. This relies on limited human resources as well as supervised learning; it simplifies the multiclass problem to one of binary choice; and it is based on a snowball approach as we augment the training set with machine-classified observations after each successful round and also between corpora. Our results show that our approach provided better precision results (of over 80% for most topic codes) than what is customary for human coders and most computer-assisted coding projects. Nevertheless, this high precision came at the expense of a relatively low, below 60%, share of labeled articles.
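The workflow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name the classifiers, features, or acceptance rule, so the word-count scorer, the `threshold` value, the "exactly one binary classifier says yes" rule, and all function names below are hypothetical stand-ins. The sketch only shows the three ideas the abstract states: one binary decision per topic, acceptance of confident machine labels only, and augmentation of the training set after each round.

```python
from collections import Counter


def tokenize(text):
    return text.lower().split()


def train_binary(labeled, topic):
    """Toy one-vs-rest model (an assumption, not the paper's classifier):
    each word scores +1 per in-topic document and -1 per other document."""
    pos, neg = Counter(), Counter()
    for text, label in labeled:
        (pos if label == topic else neg).update(set(tokenize(text)))
    return {w: pos[w] - neg[w] for w in pos.keys() | neg.keys()}


def score(model, text):
    """Average word score of a document under one binary topic model."""
    words = set(tokenize(text))
    return sum(model.get(w, 0) for w in words) / max(len(words), 1)


def snowball(labeled, unlabeled, topics, threshold=0.5, max_rounds=10):
    """Repeat: train one binary model per topic, accept only confident,
    unambiguous predictions, add them to the training set, and retrain."""
    labeled = list(labeled)
    remaining = list(unlabeled)
    assigned = {}
    for _ in range(max_rounds):
        models = {t: train_binary(labeled, t) for t in topics}
        newly = []
        for text in remaining:
            hits = [t for t in topics if score(models[t], text) >= threshold]
            if len(hits) == 1:  # hypothetical rule: exactly one "yes"
                newly.append((text, hits[0]))
        if not newly:
            break  # no confident labels this round: the snowball stops
        assigned.update(newly)
        labeled.extend(newly)  # machine-labeled items join the training set
        done = {text for text, _ in newly}
        remaining = [text for text in remaining if text not in done]
    return assigned, remaining


# Invented toy data; real input would be human-coded front-page articles.
seed = [("budget tax deficit", "economy"),
        ("hospital doctor patients", "health")]
pool = ["tax reform budget vote", "doctor strike hospital",
        "reform vote delayed", "football final tonight"]
assigned, remaining = snowball(seed, pool, ["economy", "health"])
```

In this toy run, "reform vote delayed" shares no words with the seed set and becomes classifiable only after the first round adds "tax reform budget vote" to the training data, which is the snowball effect. "football final tonight" is never labeled, mirroring the trade-off the abstract reports between high precision and a limited share of labeled articles.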
Pages: 236-249
Page count: 14