Automated Text Classification of News Articles: A Practical Guide

Cited by: 71
Authors
Barbera, Pablo [1]
Boydstun, Amber E. [2]
Linn, Suzanna [3]
McMahon, Ryan [4, 5]
Nagler, Jonathan [6, 7]
Affiliations
[1] Univ Southern Calif, Polit Sci & Int Relat, Los Angeles, CA 90089 USA
[2] Univ Calif Davis, Polit Sci, Davis, CA 95616 USA
[3] Penn State Univ, Dept Polit Sci, Polit Sci, University Pk, PA 16802 USA
[4] Penn State Univ, Dept Polit Sci, University Pk, PA 16802 USA
[5] Google, Mountain View, CA 94043 USA
[6] NYU, Polit, New York, NY 10012 USA
[7] NYU, Ctr Social Media & Polit, New York, NY 10012 USA
Funding
National Science Foundation (USA);
Keywords
statistical analysis of texts; automated content analysis; content analysis; ECONOMIC-NEWS; MEDIA; SENTIMENT; IMPACT; WORDS;
DOI
10.1017/pan.2020.8
Chinese Library Classification
D0 [Political Science, Political Theory];
Discipline classification code
0302; 030201;
Abstract
Automated text analysis methods have made possible the classification of large corpora of text by measures such as topic and tone. Here, we provide a guide to help researchers navigate the consequential decisions they need to make before any measure can be produced from the text. We consider, both theoretically and empirically, the effects of such choices using as a running example efforts to measure the tone of New York Times coverage of the economy. We show that two reasonable approaches to corpus selection yield radically different corpora and we advocate for the use of keyword searches rather than predefined subject categories provided by news archives. We demonstrate the benefits of coding using article segments instead of sentences as units of analysis. We show that, given a fixed number of codings, it is better to increase the number of unique documents coded rather than the number of coders for each document. Finally, we find that supervised machine learning algorithms outperform dictionaries on a number of criteria. Overall, we intend this guide to serve as a reminder to analysts that thoughtfulness and human validation are key to text-as-data methods, particularly in an age when it is all too easy to computationally classify texts without attending to the methodological choices therein.
Pages: 19-42
Page count: 24
Related papers
50 in total
  • [21] HNTSumm: Hybrid text summarization of transliterated news articles
    Muniraj P.
    Sabarmathi K.R.
    Leelavathi R.
    Balaji B S.
    International Journal of Intelligent Networks, 2023, 4 : 53 - 61
  • [22] Vectorization of Text Documents for Identifying Unifiable News Articles
    Singh, Anita Kumari
    Shashi, Mogalla
    INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2019, 10 (07) : 305 - 310
  • [23] Multidimensional Text Warehousing for Automated Text Classification
    Kim, Jiyun
    Kim, Han-joon
    JOURNAL OF INFORMATION TECHNOLOGY RESEARCH, 2018, 11 (02) : 168 - 183
  • [24] Automated Classification of Text Sentiment
    Dufourq, Emmanuel
    Bassett, Bruce A.
    SOUTH AFRICAN INSTITUTE OF COMPUTER SCIENTISTS AND INFORMATION TECHNOLOGISTS (SACSIT 2017), 2017, : 96 - +
  • [25] Hierarchical Multilabel Classification for Indonesian News Articles
    Irsan, Ivana Clairine
    Khodra, Masayu Leylia
    2016 INTERNATIONAL CONFERENCE ON ADVANCED INFORMATICS - CONCEPTS, THEORY AND APPLICATION (ICAICTA), 2016,
  • [26] Classification and skimming of articles for an effective news browsing
    Cho, J
    Jeong, S
    Choi, B
    KNOWLEDGE-BASED INTELLIGENT INFORMATION AND ENGINEERING SYSTEMS, PT 3, PROCEEDINGS, 2005, 3683 : 704 - 712
  • [27] Automatic Multilabel Classification for Indonesian News Articles
    Rahmawati, Dyah
    Khodra, Masayu Leylia
    2015 2ND INTERNATIONAL CONFERENCE ON ADVANCED INFORMATICS: CONCEPTS, THEORY AND APPLICATIONS ICAICTA, 2015,
  • [28] What is (automated) news? A content analysis of algorithm-written news articles
    Tandoc Jr, Edson C.
    Wu, Shangyuan
    Tan, Jessica
    Contreras-Yap, Sofia
    MEDIA & JORNALISMO, 2022, 22 (41) : 103 - 120
  • [29] Text Mining Analysis of News Articles Related to 'Space Hazard'
    Jo, Hoon
    Sohn, Jungjoo
    JOURNAL OF THE KOREAN EARTH SCIENCE SOCIETY, 2022, 43 (01) : 224 - 235
  • [30] Ontology-based text summarization for business news articles
    Wu, CW
    Liu, CL
    COMPUTERS AND THEIR APPLICATIONS, 2003, : 389 - 392