Empath: Understanding Topic Signals in Large-Scale Text

Cited by: 181
Authors
Fast, Ethan [1 ]
Chen, Binbin [1 ]
Bernstein, Michael S. [1 ]
Affiliations
[1] Stanford Univ, Stanford, CA 94305 USA
Keywords
social computing; computational social science; fiction
DOI
10.1145/2858036.2858535
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Human language is colored by a broad range of topics, but existing text analysis tools only focus on a small number of them. We present Empath, a tool that can generate and validate new lexical categories on demand from a small set of seed terms (like "bleed" and "punch" to generate the category violence). Empath draws connotations between words and phrases by deep learning a neural embedding across more than 1.8 billion words of modern fiction. Given a small set of seed words that characterize a category, Empath uses its neural embedding to discover new related terms, then validates the category with a crowd-powered filter. Empath also analyzes text across 200 built-in, pre-validated categories we have generated from common topics in our web dataset, like neglect, government, and social media. We show that Empath's data-driven, human-validated categories are highly correlated (r=0.906) with similar categories in LIWC.
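The seed-expansion step described in the abstract can be illustrated with a minimal sketch. It is not Empath's implementation: the three-dimensional vectors in `TOY_VECTORS` are invented for illustration, the helper names `cosine` and `expand_category` are hypothetical, and ranking by cosine similarity to the averaged seed vector is an assumed stand-in for however the tool queries its learned embedding. The crowd-powered validation step is not modelled.

```python
import math

# Toy "embedding": invented 3-d vectors, NOT values learned from any corpus.
TOY_VECTORS = {
    "bleed": (0.9, 0.1, 0.0),
    "punch": (0.8, 0.2, 0.1),
    "stab": (0.85, 0.15, 0.05),
    "wound": (0.88, 0.05, 0.1),
    "senate": (0.05, 0.9, 0.1),
    "vote": (0.1, 0.85, 0.2),
    "tweet": (0.05, 0.2, 0.9),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def expand_category(seeds, vectors, top_n=2):
    """Return the top_n non-seed words closest to the mean seed vector.

    Hypothetical helper: averaging the seeds is an assumption made
    for this sketch, not a detail taken from the abstract.
    """
    dims = len(next(iter(vectors.values())))
    query = tuple(
        sum(vectors[s][i] for s in seeds) / len(seeds) for i in range(dims)
    )
    candidates = [w for w in vectors if w not in seeds]
    ranked = sorted(
        candidates, key=lambda w: cosine(query, vectors[w]), reverse=True
    )
    return ranked[:top_n]


# Seeds from the abstract's example for the category "violence".
violence = expand_category(["bleed", "punch"], TOY_VECTORS)
print(violence)
```

With these toy vectors the seeds "bleed" and "punch" pull in "stab" and "wound" ahead of the politics and social media terms. In the workflow the abstract describes, a candidate list like this would then pass through a human filter before being accepted into the category.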
Pages: 4647-4657
Page count: 11