Scalable Topical Phrase Mining from Text Corpora

Cited by: 109
Authors
El-Kishky, Ahmed [1]
Song, Yanglei [1]
Wang, Chi [2]
Voss, Clare R. [3]
Han, Jiawei [1]
Affiliations
[1] Univ Illinois, Dept Comp Sci, Urbana, IL 61801 USA
[2] Microsoft Res, Redmond, WA USA
[3] Computat & Informat Sci Directorate Army Res Lab, Adelphi, MD USA
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2014 / Vol. 8 / Issue 03
Funding
National Science Foundation (USA)
DOI
10.14778/2735508.2735519
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
While most topic modeling algorithms model text corpora with unigrams, human interpretation often relies on inherent grouping of terms into phrases. As such, we consider the problem of discovering topical phrases of mixed lengths. Existing work either performs post-processing on the results of unigram-based topic models, or utilizes complex n-gram-discovery topic models. These methods generally produce low-quality topical phrases or suffer from poor scalability on even moderately sized data sets. We propose a different approach that is both computationally efficient and effective. Our solution combines a novel phrase mining framework to segment a document into single and multi-word phrases, and a new topic model that operates on the induced document partition. Our approach discovers high-quality topical phrases with negligible extra cost to the bag-of-words topic model in a variety of datasets including research publication titles, abstracts, reviews, and news articles.
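The two-stage idea in the abstract (first partition each document into single and multi-word phrases, then hand that partition to a topic model) can be illustrated with a toy sketch. This is not the authors' algorithm: the greedy longest-match rule, the `min_support` threshold, and the names `count_ngrams` and `segment` are all assumptions made for illustration. It only shows what a "bag of phrases" induced by frequency counts might look like.

```python
from collections import Counter


def count_ngrams(docs, max_len=3):
    """Count every contiguous n-gram (as a tuple) of up to max_len tokens."""
    counts = Counter()
    for tokens in docs:
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def segment(tokens, counts, min_support=2, max_len=3):
    """Greedy left-to-right partition (illustrative assumption, not the paper's method).

    At each position, take the longest n-gram whose corpus count reaches
    min_support; fall back to a single word when no longer n-gram qualifies.
    """
    phrases, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(tokens[i:i + n])
            if n == 1 or counts[candidate] >= min_support:
                phrases.append(" ".join(candidate))
                i += n
                break
    return phrases


# Tiny hypothetical corpus, already tokenized.
docs = [
    "scalable topic model for text".split(),
    "topic model inference on text corpora".split(),
    "mining text corpora".split(),
]
counts = count_ngrams(docs)
bags = [segment(d, counts) for d in docs]
for bag in bags:
    print(bag)
```

Each resulting list is a partition of the original token sequence, so a downstream topic model could treat every phrase as one unit in place of its individual words.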
Pages: 305-316
Page count: 12
Related papers
  • [1] Automated Phrase Mining from Massive Text Corpora
    Shang, Jingbo
    Liu, Jialu
    Jiang, Meng
    Ren, Xiang
    Voss, Clare R.
    Han, Jiawei
    [J]. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2018, 30 (10) : 1825 - 1837
  • [2] Automated Context-Aware Phrase Mining from Text Corpora
    Zhang, Xue
    Li, Qinghua
    Li, Cuiping
    Chen, Hong
    [J]. DATABASE SYSTEMS FOR ADVANCED APPLICATIONS (DASFAA 2021), PT II, 2021, 12682 : 20 - 36
  • [3] ConPhrase: Enhancing Context-Aware Phrase Mining From Text Corpora
    Zhang, Xue
    Li, Qinghua
    Li, Cuiping
    Chen, Hong
    [J]. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2023, 35 (07) : 6767 - 6783
  • [4] Automatic Entity Recognition and Typing from Massive Text Corpora: A Phrase and Network Mining Approach
    Ren, Xiang
    El-Kishky, Ahmed
    Wang, Chi
    Han, Jiawei
    [J]. KDD'15: PROCEEDINGS OF THE 21ST ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, 2015, : 2319 - 2320
  • [5] Mining Quality Phrases from Massive Text Corpora
    Liu, Jialu
    Shang, Jingbo
    Wang, Chi
    Ren, Xiang
    Han, Jiawei
    [J]. SIGMOD'15: PROCEEDINGS OF THE 2015 ACM SIGMOD INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA, 2015, : 1729 - 1744
  • [6] Text mining applied to multilingual corpora
    Neri, F
    Raffaelli, R
    [J]. Knowledge Mining, 2005, 185 : 123 - 131
  • [7] A Phrase Mining Framework for Recursive Construction of a Topical Hierarchy
    Wang, Chi
    Danilevsky, Marina
    Desai, Nihit
    Zhang, Yinan
    Nguyen, Phuong
    Taula, Thrivikrama
    Han, Jiawei
    [J]. 19TH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING (KDD'13), 2013, : 437 - 445
  • [8] On the Construction of Multilingual Corpora for Clinical Text Mining
    Villena, Fabian
    Eisenmann, Urs
    Knaup, Petra
    Dunstan, Jocelyn
    Ganzinger, Matthias
    [J]. DIGITAL PERSONALIZED HEALTH AND MEDICINE, 2020, 270 : 347 - 351
  • [9] Efficiently mining protein interaction dependencies from large text corpora
    Koester, Johannes
    Zamir, Eli
    Rahmann, Sven
    [J]. INTEGRATIVE BIOLOGY, 2012, 4 (07) : 805 - 812
  • [10] Towards Idea Mining: Problem-Solution Phrase Extraction from Text
    Liu, Haixia
    Brailsford, Tim
    Goulding, James
    Maul, Tomas
    Tan, Tao
    Chaudhuri, Debanjan
    [J]. ADVANCED DATA MINING AND APPLICATIONS, ADMA 2022, PT II, 2022, 13726 : 3 - 14