Empath: Understanding Topic Signals in Large-Scale Text

Cited by: 181
Authors
Fast, Ethan [1]
Chen, Binbin [1]
Bernstein, Michael S. [1]
Affiliations
[1] Stanford Univ, Stanford, CA 94305 USA
Keywords
social computing; computational social science; fiction
DOI
10.1145/2858036.2858535
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Human language is colored by a broad range of topics, but existing text analysis tools only focus on a small number of them. We present Empath, a tool that can generate and validate new lexical categories on demand from a small set of seed terms (like "bleed" and "punch" to generate the category violence). Empath draws connotations between words and phrases by deep learning a neural embedding across more than 1.8 billion words of modern fiction. Given a small set of seed words that characterize a category, Empath uses its neural embedding to discover new related terms, then validates the category with a crowd-powered filter. Empath also analyzes text across 200 built-in, pre-validated categories we have generated from common topics in our web dataset, like neglect, government, and social media. We show that Empath's data-driven, human validated categories are highly correlated (r=0.906) with similar categories in LIWC.
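The seed-term expansion described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the three-dimensional vectors below are invented for illustration, the helper names `cosine` and `expand_category` are hypothetical, and the crowd-powered validation step is not modeled. The sketch only shows the general idea of ranking vocabulary words by cosine similarity to the average of the seed vectors.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only. A real
# embedding would be learned from a large corpus (the abstract mentions
# more than 1.8 billion words of fiction).
TOY_EMBEDDING = {
    "bleed":  [0.90, 0.10, 0.00],
    "punch":  [0.80, 0.20, 0.10],
    "stab":   [0.85, 0.15, 0.05],
    "wound":  [0.90, 0.05, 0.10],
    "senate": [0.05, 0.90, 0.10],
    "vote":   [0.10, 0.85, 0.20],
    "tweet":  [0.10, 0.20, 0.90],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def expand_category(seeds, embedding, top_k=2):
    """Return the top_k non-seed words closest to the seed centroid."""
    dims = len(next(iter(embedding.values())))
    # Average the seed vectors component by component.
    centroid = [
        sum(embedding[s][i] for s in seeds) / len(seeds) for i in range(dims)
    ]
    candidates = [w for w in embedding if w not in seeds]
    ranked = sorted(
        candidates,
        key=lambda w: cosine(embedding[w], centroid),
        reverse=True,
    )
    return ranked[:top_k]


# Seed terms taken from the abstract's "violence" example.
violence = expand_category(["bleed", "punch"], TOY_EMBEDDING)
print(violence)
```

With these toy vectors the words nearest the seeds are the other injury-related terms, while "senate", "vote", and "tweet" rank lower. In the tool described above, candidate terms found this way would then pass through a human validation filter before being added to the category.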
Pages: 4647-4657
Page count: 11