N-gram over Context

Cited by: 5
Authors
Kawamae, Noriaki [1]
Affiliation
[1] NTT Comware, Mihama Ku, 1-6 Nakase, Chiba 2610023, Japan
Keywords
Nonparametric models; Topic models; Latent variable models; Graphical models; N-gram topic model; MapReduce
DOI
10.1145/2872427.2882981
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Our proposal, N-gram over Context (NOC), is a nonparametric topic model that aims to aid understanding of a given corpus and can be applied to many text mining applications. Like other topic models, NOC represents each document as a mixture of topics and generates each word from one topic. Unlike these models, NOC focuses on both the topic structure, as an internal linguistic structure, and N-grams, as an external linguistic structure. To improve the quality of topic-specific N-grams, NOC reveals a tree of topics that captures the semantic relationships between topics in a given corpus as context, and forms N-grams by offering power-law distributions for word frequencies on this topic tree. To obtain both linguistic structures efficiently, NOC learns them from a given corpus in a unified manner. By accessing the entire tree at the word level in the generative process of each document, NOC enables each document to maintain thematic coherence and form N-grams over context. We develop a parallelizable inference algorithm, D-NOC, to support large data sets. Experiments on review articles, papers, and tweets show that NOC is useful as a generative model for discovering both the topic structure and the corresponding N-grams, and that it complements human experts and domain-specific knowledge well. D-NOC can process large data sets while preserving full generative model performance, with the help of an open-source distributed machine learning framework.
Pages: 1045-1055
Number of pages: 11
Related papers
50 in total
  • [31] Topic-dependent N-gram models based on Optimization of Context Lengths in LDA
    Nakamura, Akira
    Hayamizu, Satoru
    11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4, 2010, : 3066 - 3069
  • [32] A unified context-free grammar and n-gram model for spoken language processing
    Wang, YY
    Mahajan, M
    Huang, XD
    2000 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, PROCEEDINGS, VOLS I-VI, 2000, : 1639 - 1642
  • [33] RNA modeling by combining stochastic context-free grammars and n-gram models
    Salvador, I
    Benedí, JM
    INTERNATIONAL JOURNAL OF PATTERN RECOGNITION AND ARTIFICIAL INTELLIGENCE, 2002, 16 (03) : 309 - 315
  • [34] NOVEL TOPIC N-GRAM COUNT LM INCORPORATING DOCUMENT-BASED TOPIC DISTRIBUTIONS AND N-GRAM COUNTS
    Haidar, Md. Akmal
    O'Shaughnessy, Douglas
    2014 PROCEEDINGS OF THE 22ND EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO), 2014, : 2310 - 2314
  • [35] Text authorship detection using decision trees and association rules over N-gram
    Course of Information and Computer Sciences, Graduate School of Kanagawa Institute of Technology, 1030 Shimo-ogino, Atsugi-shi, Kanagawa 243-0292, Japan
    Proc. IADIS Int. Conf. Intelligent Syst. Agents, Proc. IADIS Eur. Conf. Data Min., Part MCCSIS, (167-170):
  • [36] Amyloidogenic motifs revealed by n-gram analysis
    Michał Burdukiewicz
    Piotr Sobczyk
    Stefan Rödiger
    Anna Duda-Madej
    Paweł Mackiewicz
    Małgorzata Kotulska
    Scientific Reports, 7
  • [37] A New Estimate of the n-gram Language Model
    Aouragh, Si Lhoussain
    Yousfi, Abdellah
    Laaroussi, Saida
    Gueddah, Hicham
    Nejja, Mohammed
    AI IN COMPUTATIONAL LINGUISTICS, 2021, 189 : 211 - 215
  • [38] Discriminative N-gram Selection for Dialect Recognition
    Richardson, F. S.
    Campbell, W. M.
    Torres-Carrasquillo, P. A.
    INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5, 2009, : 192 - 195
  • [39] Generalized N-gram measures for melodic similarity
    Frieler, Klaus
    Data Science and Classification, 2006, : 289 - 298
  • [40] N-gram feature selection for authorship identification
    Houvardas, John
    Stamatatos, Efstathios
    ARTIFICIAL INTELLIGENCE: METHODOLOGY, SYSTEMS, AND APPLICATIONS, PROCEEDINGS, 2006, 4183 : 77 - 86