N-gram over Context

Cited by: 5
Authors
Kawamae, Noriaki [1]
Affiliations
[1] NTT Comware, Mihama Ku, 1-6 Nakase, Chiba 2610023, Japan
Keywords
Nonparametric models; Topic models; Latent variable models; Graphical models; N-gram topic model; MapReduce
DOI
10.1145/2872427.2882981
CLC classification number
TP [automation and computer technology]
Discipline classification code
0812
Abstract
Our proposal, N-gram over Context (NOC), is a nonparametric topic model that aims to aid understanding of a given corpus and to support many text mining applications. Like other topic models, NOC represents each document as a mixture of topics and generates each word from one topic. Unlike these models, NOC focuses both on topic structure as an internal linguistic structure and on N-grams as an external linguistic structure. To improve the quality of topic-specific N-grams, NOC reveals a tree of topics that captures the semantic relationships between topics in a given corpus as context, and forms N-grams by imposing power-law distributions on word frequencies over this topic tree. To obtain both linguistic structures efficiently, NOC learns them from a given corpus in a unified manner. By accessing this entire tree at the word level in the generative process of each document, NOC enables each document to maintain thematic coherence and to form N-grams over context. We develop a parallelizable inference algorithm, D-NOC, to support large data sets. Experiments on review articles, papers, and tweets show that NOC is useful as a generative model for discovering both the topic structure and the corresponding N-grams, and that it complements human experts and domain-specific knowledge well. D-NOC can process large data sets while preserving full generative model performance, with the help of an open-source distributed machine learning framework.
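The abstract describes NOC's generative story in prose only. As a rough illustration, the minimal Python sketch below mimics that story under loud simplifications: the learned topic tree and its power-law word-frequency machinery (e.g., Pitman-Yor-style priors) are replaced by a hard-coded three-node tree, uniform draws, and a fixed continuation probability. Every name in it (TREE, sample_node, generate_document, continue_prob) is hypothetical and not from the paper.

```python
import random

# Hypothetical toy illustration (not the paper's model): a tiny topic
# tree with per-node word lists standing in for learned topic-word
# distributions.
TREE = {
    "root": {"words": ["data", "model"]},
    "ml":   {"words": ["learning", "deep"]},
    "nlp":  {"words": ["language", "gram"]},
}

def sample_node():
    # NOC accesses the entire topic tree at the word level; a uniform
    # draw stands in for the learned tree-structured topic weights.
    return random.choice(list(TREE))

def generate_document(length=10, continue_prob=0.4):
    """Emit tokens; with probability `continue_prob` the next word is
    glued onto the previous one, forming a topic-specific N-gram."""
    tokens, current = [], []
    for _ in range(length):
        node = sample_node()
        word = random.choice(TREE[node]["words"])
        if current and random.random() < continue_prob:
            current.append(word)   # extend the running N-gram
        else:
            if current:
                tokens.append("_".join(current))
            current = [word]       # start a new unit
    if current:
        tokens.append("_".join(current))
    return tokens

if __name__ == "__main__":
    random.seed(0)
    print(generate_document())
```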
Pages: 1045 - 1055
Number of pages: 11
Related Papers
50 records in total
  • [1] N-gram and local context analysis for Persian text retrieval
    Aleahmad, Abolfazl
    Hakimian, Parsia
    Mahdikhani, Farzad
    Oroumchian, Farhad
    2007 9TH INTERNATIONAL SYMPOSIUM ON SIGNAL PROCESSING AND ITS APPLICATIONS, VOLS 1-3, 2007, : 284 - 287
  • [2] N-gram Insight
    Prans, George
    AMERICAN SCIENTIST, 2011, 99 (05) : 356 - 357
  • [3] A context evaluation approach for structural comparison of proteins using cross entropy over n-gram modelling
    Razmara, Jafar
    Deris, Safaai B.
    Parvizpour, Sepideh
    COMPUTERS IN BIOLOGY AND MEDICINE, 2013, 43 (10) : 1614 - 1621
  • [4] N-gram MalGAN: Evading machine learning detection via feature n-gram
    Zhu, Enmin
    Zhang, Jianjie
    Yan, Jijie
    Chen, Kongyang
    Gao, Chongzhi
    DIGITAL COMMUNICATIONS AND NETWORKS, 2022, 8 (04) : 485 - 491
  • [5] Pseudo-Conventional N-Gram Representation of the Discriminative N-Gram Model for LVCSR
    Zhou, Zhengyu
    Meng, Helen
    IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2010, 4 (06) : 943 - 952
  • [6] Pipilika N-gram Viewer: An Efficient Large Scale N-gram Model for Bengali
    Ahmad, Adnan
    Talha, Mahbubur Rub
    Amin, Md. Ruhul
    Chowdhury, Farida
    2018 INTERNATIONAL CONFERENCE ON BANGLA SPEECH AND LANGUAGE PROCESSING (ICBSLP), 2018,
  • [7] A survey of N-gram models
    Yin, Chen
    Wu, Min
    COMPUTER SYSTEMS & APPLICATIONS, 2018, 27 (10) : 33 - 38
  • [8] N-gram similarity and distance
    Kondrak, Grzegorz
    STRING PROCESSING AND INFORMATION RETRIEVAL, PROCEEDINGS, 2005, 3772 : 115 - 126
  • [9] BIGRAM VS N-GRAM
    HALPIN, P
    BYTE, 1988, 13 (08): : 26 - 26