N-gram over Context

Cited: 5
Authors
Kawamae, Noriaki [1]
Affiliation
[1] NTT Comware, Mihama Ku, 1-6 Nakase, Chiba 2610023, Japan
Keywords
Nonparametric models; Topic models; Latent variable models; Graphical models; N-gram topic model; MapReduce
DOI
10.1145/2872427.2882981
CLC classification
TP [Automation Technology, Computer Technology]
Subject classification
0812
Abstract
Our proposal, N-gram over Context (NOC), is a nonparametric topic model that aims to deepen our understanding of a given corpus and can be applied to many text mining applications. Like other topic models, NOC represents each document as a mixture of topics and generates each word from one topic. Unlike these models, NOC focuses on both a topic structure as an internal linguistic structure and N-grams as an external linguistic structure. To improve the quality of topic-specific N-grams, NOC reveals a tree of topics that captures the semantic relationships between topics in a given corpus as context, and forms N-grams by imposing power-law distributions over word frequencies on this topic tree. To obtain both linguistic structures efficiently, NOC learns them from a given corpus in a unified manner. By accessing this entire tree at the word level in the generative process of each document, NOC enables each document to maintain thematic coherence and to form N-grams over context. We develop a parallelizable inference algorithm, D-NOC, to support large data sets. Experiments on review articles, papers, and tweets show that NOC is useful as a generative model for discovering both the topic structure and the corresponding N-grams, and that it complements human experts and domain-specific knowledge well. D-NOC can process large data sets while preserving the full generative model's performance, with the help of an open-source distributed machine learning framework.
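The abstract's central idea of topic-specific N-grams can be illustrated with a minimal sketch. This is not the paper's inference algorithm: the function name `topic_ngrams`, the toy tokens, and the rule that an N-gram counts toward a topic only when all words in the window share that topic assignment are assumptions made for illustration; NOC itself infers per-word assignments nonparametrically over a learned topic tree.

```python
from collections import Counter, defaultdict

def topic_ngrams(tokens, topics, n=2):
    """Collect n-grams whose words all share one topic assignment.

    tokens: list of words; topics: parallel list of topic ids.
    Returns {topic_id: Counter of n-gram tuples}.
    Illustrative only -- NOC infers the assignments with a
    nonparametric model over a topic tree rather than taking
    them as given.
    """
    grams = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        window_topics = set(topics[i:i + n])
        if len(window_topics) == 1:  # all n words drawn from the same topic
            grams[window_topics.pop()][tuple(tokens[i:i + n])] += 1
    return dict(grams)

# Usage with hand-assigned topics (0 = "models", 1 = "events"):
result = topic_ngrams(["deep", "neural", "network", "wins", "award"],
                      [0, 0, 0, 1, 1])
```

Only the bigrams ("deep", "neural") and ("neural", "network") are credited to topic 0, and ("wins", "award") to topic 1; the topic-crossing bigram ("network", "wins") is discarded.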
Pages: 1045-1055
Page count: 11
Related papers
50 records in total
  • [41] DERIN: A data extraction information and n-gram
    Lopes Figueiredo, Leandro Neiva
    de Assis, Guilherme Tavares
    Ferreira, Anderson A.
    INFORMATION PROCESSING & MANAGEMENT, 2017, 53 (05) : 1120 - 1138
  • [42] Web as a Corpus: Going Beyond the n-gram
    Nakov, Preslav
    INFORMATION RETRIEVAL, RUSSIR 2014, 2015, 505 : 185 - 228
  • [43] Research of Affective Recognize Based on N-gram
    Xue Weimin
    Lin Benjing
    Yu Bing
    2008 IEEE INTERNATIONAL CONFERENCE ON GRANULAR COMPUTING, VOLS 1 AND 2, 2008, : 702 - +
  • [44] Applications of Boolean equations in n-gram analysis
    Marovac, Ulfeta
    ICIST '18: PROCEEDINGS OF THE 8TH INTERNATIONAL CONFERENCE ON INFORMATION SYSTEMS AND TECHNOLOGIES, 2018,
  • [45] Perplexity of n-Gram and Dependency Language Models
    Popel, Martin
    Marecek, David
    TEXT, SPEECH AND DIALOGUE, 2010, 6231 : 173 - 180
  • [46] ON THE USE OF N-GRAM TRANSDUCERS FOR DIALOGUE ANNOTATION
    Tamarit, Vicent
    Martinez-Hinarejos, Carlos-D.
    Benedi, Jose-Miguel
    SPOKEN DIALOGUE SYSTEMS: TECHNOLOGY AND DESIGN, 2011, : 255 - 276
  • [47] MIXTURE OF MIXTURE N-GRAM LANGUAGE MODELS
    Sak, Hasim
    Allauzen, Cyril
    Nakajima, Kaisuke
    Beaufays, Francoise
    2013 IEEE WORKSHOP ON AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING (ASRU), 2013, : 31 - 36
  • [48] Differentiable N-gram objective on abstractive summarization
    Zhu, Yunqi
    Yang, Xuebing
    Wu, Yuanyuan
    Zhu, Mingjin
    Zhang, Wensheng
    EXPERT SYSTEMS WITH APPLICATIONS, 2023, 215
  • [49] A variant of n-gram based language classification
    Tomovic, Andrija
    Janicic, Predrag
    AI*IA 2007: ARTIFICIAL INTELLIGENCE AND HUMAN-ORIENTED COMPUTING, 2007, 4733 : 410 - +
  • [50] Twitter n-gram corpus with demographic metadata
    Herdağdelen, Amaç
    Language Resources and Evaluation, 2013, 47 : 1127 - 1147