Mining significant associations in large scale text corpora

Cited by: 2
Authors
Raghavan, P
Tsaparas, P
DOI
10.1109/ICDM.2002.1183933
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Mining large-scale text corpora is an essential step in extracting the key themes in a corpus. We motivate a quantitative measure for significant associations through the distributions of pairs and triplets of co-occurring words. We consider the algorithmic problem of efficiently enumerating such significant associations and present pruning algorithms for these problems, with theoretical as well as empirical analyses. Our algorithms make use of two novel mining methods: (1) matrix mining, and (2) shortened documents. We present evidence from a diverse set of documents that our measure does in fact elicit interesting co-occurrences.
Pages: 402 - 409
Number of pages: 8
Related papers
50 in total
  • [1] Text Mining Methods for Social Representation Analysis in Large Corpora
    Chartier, Jean-Francois
    Meunier, Jean-Guy
    PAPERS ON SOCIAL REPRESENTATIONS, 2011, 20 (02)
  • [2] Efficiently mining protein interaction dependencies from large text corpora
    Koester, Johannes
    Zamir, Eli
    Rahmann, Sven
    INTEGRATIVE BIOLOGY, 2012, 4 (07) : 805 - 812
  • [3] Text mining applied to multilingual corpora
    Neri, F
    Raffaelli, R
    Knowledge Mining, 2005, 185 : 123 - 131
  • [4] Large-Scale Text Mining of Biomedical Literature
    Ginter, Filip
    ELECTRONIC PROCEEDINGS IN THEORETICAL COMPUTER SCIENCE, 2013, (116) : 43 - 44
  • [5] On the Construction of Multilingual Corpora for Clinical Text Mining
    Villena, Fabian
    Eisenmann, Urs
    Knaup, Petra
    Dunstan, Jocelyn
    Ganzinger, Matthias
    DIGITAL PERSONALIZED HEALTH AND MEDICINE, 2020, 270 : 347 - 351
  • [6] Manipulating large corpora for text classification
    Fukumoto, F
    Suzuki, Y
    PROCEEDINGS OF THE 2002 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, 2002 : 196 - 203
  • [7] Scalable Topical Phrase Mining from Text Corpora
    El-Kishky, Ahmed
    Song, Yanglei
    Wang, Chi
    Voss, Clare R.
    Han, Jiawei
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2014, 8 (03) : 305 - 316
  • [8] Automated Phrase Mining from Massive Text Corpora
    Shang, Jingbo
    Liu, Jialu
    Jiang, Meng
    Ren, Xiang
    Voss, Clare R.
    Han, Jiawei
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2018, 30 (10) : 1825 - 1837
  • [9] Mining Quality Phrases from Massive Text Corpora
    Liu, Jialu
    Shang, Jingbo
    Wang, Chi
    Ren, Xiang
    Han, Jiawei
    SIGMOD'15: PROCEEDINGS OF THE 2015 ACM SIGMOD INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA, 2015 : 1729 - 1744
  • [10] Portuguese Text Generation from Large Corpora
    de Novais, Eder M.
    Paraboni, Ivandre
    da Silva Junior, Douglas F. P.
    LREC 2012 - EIGHTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2012 : 4010 - 4014