N-gram over Context

Cited: 5
Authors
Kawamae, Noriaki [1]
Affiliation
[1] NTT Comware, Mihama Ku, 1-6 Nakase, Chiba 2610023, Japan
Keywords
Nonparametric models; Topic models; Latent variable models; Graphical models; N-gram topic model; MapReduce
DOI
10.1145/2872427.2882981
CLC classification
TP [Automation Technology, Computer Technology]
Subject classification
0812
Abstract
Our proposal, N-gram over Context (NOC), is a nonparametric topic model that aims to deepen our understanding of a given corpus and can be applied to many text mining applications. Like other topic models, NOC represents each document as a mixture of topics and generates each word from one topic. Unlike these models, NOC focuses on both a topic structure as an internal linguistic structure and N-grams as an external linguistic structure. To improve the quality of topic-specific N-grams, NOC reveals a tree of topics that captures the semantic relationships between topics in a given corpus as context, and forms N-grams by imposing power-law distributions over word frequencies on this topic tree. To obtain both linguistic structures efficiently, NOC learns them from a given corpus in a unified manner. By accessing this entire tree at the word level in the generative process of each document, NOC enables each document to maintain thematic coherence and to form N-grams over context. We develop a parallelizable inference algorithm, D-NOC, to support large data sets. Experiments on review articles, papers, and tweets show that NOC is useful as a generative model for discovering both the topic structure and the corresponding N-grams, and that it complements human experts and domain-specific knowledge well. D-NOC can process large data sets while preserving the full generative model's performance, with the help of an open-source distributed machine learning framework.
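The abstract's central idea of topic-specific N-grams can be illustrated with a minimal sketch. This is not the paper's inference algorithm: the function name `topic_ngrams`, the toy tokens, and the rule that an N-gram counts toward a topic only when all words in the window share that topic assignment are assumptions made for illustration; NOC itself infers per-word assignments nonparametrically over a learned topic tree.

```python
from collections import Counter, defaultdict

def topic_ngrams(tokens, topics, n=2):
    """Collect n-grams whose words all share one topic assignment.

    tokens: list of words; topics: parallel list of topic ids.
    Returns {topic_id: Counter of n-gram tuples}.
    Illustrative only -- NOC infers the assignments with a
    nonparametric model over a topic tree rather than taking
    them as given.
    """
    grams = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        window_topics = set(topics[i:i + n])
        if len(window_topics) == 1:  # all n words drawn from the same topic
            grams[window_topics.pop()][tuple(tokens[i:i + n])] += 1
    return dict(grams)

# Usage with hand-assigned topics (0 = "models", 1 = "events"):
result = topic_ngrams(["deep", "neural", "network", "wins", "award"],
                      [0, 0, 0, 1, 1])
```

Only the bigrams ("deep", "neural") and ("neural", "network") are credited to topic 0, and ("wins", "award") to topic 1; the topic-crossing bigram ("network", "wins") is discarded.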
Pages: 1045-1055
Page count: 11
Related papers
50 records in total
  • [41] DERIN: A data extraction information and n-gram
    Lopes Figueiredo, Leandro Neiva
    de Assis, Guilherme Tavares
    Ferreira, Anderson A.
    INFORMATION PROCESSING & MANAGEMENT, 2017, 53 (05) : 1120 - 1138
  • [42] Web as a Corpus: Going Beyond the n-gram
    Nakov, Preslav
    INFORMATION RETRIEVAL, RUSSIR 2014, 2015, 505 : 185 - 228
  • [43] Research of Affective Recognize Based on N-gram
    Xue Weimin
    Lin Benjing
    Yu Bing
    2008 IEEE INTERNATIONAL CONFERENCE ON GRANULAR COMPUTING, VOLS 1 AND 2, 2008, : 702 - +
  • [44] Applications of Boolean equations in n-gram analysis
    Marovac, Ulfeta
    ICIST '18: PROCEEDINGS OF THE 8TH INTERNATIONAL CONFERENCE ON INFORMATION SYSTEMS AND TECHNOLOGIES, 2018,
  • [45] Perplexity of n-Gram and Dependency Language Models
    Popel, Martin
    Marecek, David
    TEXT, SPEECH AND DIALOGUE, 2010, 6231 : 173 - 180
  • [46] ON THE USE OF N-GRAM TRANSDUCERS FOR DIALOGUE ANNOTATION
    Tamarit, Vicent
    Martinez-Hinarejos, Carlos-D.
    Benedi, Jose-Miguel
    SPOKEN DIALOGUE SYSTEMS: TECHNOLOGY AND DESIGN, 2011, : 255 - 276
  • [47] MIXTURE OF MIXTURE N-GRAM LANGUAGE MODELS
    Sak, Hasim
    Allauzen, Cyril
    Nakajima, Kaisuke
    Beaufays, Francoise
    2013 IEEE WORKSHOP ON AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING (ASRU), 2013, : 31 - 36
  • [48] Differentiable N-gram objective on abstractive summarization
    Zhu, Yunqi
    Yang, Xuebing
    Wu, Yuanyuan
    Zhu, Mingjin
    Zhang, Wensheng
    EXPERT SYSTEMS WITH APPLICATIONS, 2023, 215
  • [49] A variant of n-gram based language classification
    Tomovic, Andrija
    Janicic, Predrag
    AI*IA 2007: ARTIFICIAL INTELLIGENCE AND HUMAN-ORIENTED COMPUTING, 2007, 4733 : 410 - +
  • [50] Twitter n-gram corpus with demographic metadata
    Herdağdelen, Amaç
    Language Resources and Evaluation, 2013, 47 : 1127 - 1147