Automated Text Classification of News Articles: A Practical Guide

Cited by: 71
Authors
Barbera, Pablo [1]
Boydstun, Amber E. [2]
Linn, Suzanna [3]
McMahon, Ryan [4, 5]
Nagler, Jonathan [6, 7]
Affiliations
[1] Univ Southern Calif, Polit Sci & Int Relat, Los Angeles, CA 90089 USA
[2] Univ Calif Davis, Polit Sci, Davis, CA 95616 USA
[3] Penn State Univ, Dept Polit Sci, Polit Sci, University Pk, PA 16802 USA
[4] Penn State Univ, Dept Polit Sci, University Pk, PA 16802 USA
[5] Google, Mountain View, CA 94043 USA
[6] NYU, Polit, New York, NY 10012 USA
[7] NYU, Ctr Social Media & Polit, New York, NY 10012 USA
Funding
National Science Foundation (USA);
Keywords
statistical analysis of texts; automated content analysis; content analysis; ECONOMIC-NEWS; MEDIA; SENTIMENT; IMPACT; WORDS;
DOI
10.1017/pan.2020.8
Chinese Library Classification number
D0 [Political science, political theory];
Subject classification codes
0302; 030201;
Abstract
Automated text analysis methods have made possible the classification of large corpora of text by measures such as topic and tone. Here, we provide a guide to help researchers navigate the consequential decisions they need to make before any measure can be produced from the text. We consider, both theoretically and empirically, the effects of such choices using as a running example efforts to measure the tone of New York Times coverage of the economy. We show that two reasonable approaches to corpus selection yield radically different corpora and we advocate for the use of keyword searches rather than predefined subject categories provided by news archives. We demonstrate the benefits of coding using article segments instead of sentences as units of analysis. We show that, given a fixed number of codings, it is better to increase the number of unique documents coded rather than the number of coders for each document. Finally, we find that supervised machine learning algorithms outperform dictionaries on a number of criteria. Overall, we intend this guide to serve as a reminder to analysts that thoughtfulness and human validation are key to text-as-data methods, particularly in an age when it is all too easy to computationally classify texts without attending to the methodological choices therein.
Pages: 19-42
Page count: 24