N-gram over Context

Cited by: 5
Author
Kawamae, Noriaki [1 ]
Affiliation
[1] NTT Comware, Mihama Ku, 1-6 Nakase, Chiba 2610023, Japan
Keywords
Nonparametric models; Topic models; Latent variable models; Graphical models; N-gram topic model; MapReduce
DOI
10.1145/2872427.2882981
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Our proposal, N-gram over Context (NOC), is a nonparametric topic model that aims to help us understand a given corpus and can be applied to many text mining applications. Like other topic models, NOC represents each document as a mixture of topics and generates each word from one topic. Unlike these models, NOC focuses on both the topic structure, as an internal linguistic structure, and N-grams, as an external linguistic structure. To improve the quality of topic-specific N-grams, NOC reveals a tree of topics that captures the semantic relationships between topics in a given corpus as context, and forms N-grams by offering power-law distributions for word frequencies on this topic tree. To obtain both linguistic structures efficiently, NOC learns them from a given corpus in a unified manner. By accessing this entire tree at the word level in the generative process of each document, NOC enables each document to maintain thematic coherence and form N-grams over context. We develop a parallelizable inference algorithm, D-NOC, to support large data sets. Experiments on review articles, papers, and tweets show that NOC is useful as a generative model for discovering both the topic structure and the corresponding N-grams, and that it complements human experts and domain-specific knowledge well. D-NOC can process large data sets while preserving full generative model performance, with the help of an open-source distributed machine learning framework.
Pages: 1045-1055
Page count: 11