N-gram over Context

Cited by: 5

Authors:

Kawamae, Noriaki [1]

Affiliations:

[1] NTT Comware, Mihama Ku, 1-6 Nakase, Chiba 2610023, Japan

Source:

PROCEEDINGS OF THE 25TH INTERNATIONAL CONFERENCE ON WORLD WIDE WEB (WWW'16) | 2016

Keywords:

Nonparametric models; Topic models; Latent variable models; Graphical models; N-gram topic model; MapReduce

DOI:

10.1145/2872427.2882981

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology]

Discipline classification code:

0812

Abstract:

Our proposal, N-gram over Context (NOC), is a nonparametric topic model that aims to help us understand a given corpus and can be applied to many text mining applications. Like other topic models, NOC represents each document as a mixture of topics and generates each word from one topic. Unlike these models, NOC focuses on both a topic structure as an internal linguistic structure and N-grams as an external linguistic structure. To improve the quality of topic-specific N-grams, NOC reveals a tree of topics that captures the semantic relationships between topics in a given corpus as context, and forms N-grams by offering power-law distributions for word frequencies on this topic tree. To obtain both of these linguistic structures efficiently, NOC learns them from a given corpus in a unified manner. By accessing this entire tree at the word level in the generative process of each document, NOC enables each document to maintain thematic coherence and form N-grams over context. We develop a parallelizable inference algorithm, D-NOC, to support large data sets. Experiments on review articles, papers, and tweets show that NOC is useful as a generative model for discovering both the topic structure and the corresponding N-grams, and that it complements human experts and domain-specific knowledge well. D-NOC can process large data sets while preserving full generative model performance, with the help of an open-source distributed machine learning framework.
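The one assumption the abstract says NOC shares with other topic models (each document is a mixture of topics, and each word is generated from one topic) can be illustrated with a minimal sketch. This is not NOC itself: it has no topic tree, no N-gram formation, and no power-law prior. The function names, the vocabulary, and the topic-word probabilities below are hypothetical toy values chosen only for illustration.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw topic proportions from a Dirichlet via normalized gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word_probs, vocab, doc_length, alpha, rng):
    """Generic mixture-of-topics generation (an LDA-style assumption, not NOC)."""
    num_topics = len(topic_word_probs)
    # Per-document mixture over topics.
    theta = sample_dirichlet([alpha] * num_topics, rng)
    # Each word position is assigned exactly one topic drawn from the mixture.
    topics = rng.choices(range(num_topics), weights=theta, k=doc_length)
    # Each word is drawn from the word distribution of its assigned topic.
    words = [rng.choices(vocab, weights=topic_word_probs[z], k=1)[0] for z in topics]
    return theta, topics, words


# Hypothetical toy vocabulary and two made-up topics.
vocab = ["battery", "screen", "price", "plot", "actor", "scene"]
topic_word_probs = [
    [0.40, 0.30, 0.24, 0.02, 0.02, 0.02],  # toy "product review" topic
    [0.02, 0.02, 0.02, 0.34, 0.30, 0.30],  # toy "film review" topic
]

rng = random.Random(0)
theta, topics, words = generate_document(topic_word_probs, vocab, 20, 0.5, rng)
print(theta)
print(list(zip(topics, words)))
```

In this sketch every word depends only on its own topic, so word order carries no information; the abstract's stated contribution is to go beyond this by forming N-grams over a learned topic tree.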


Pages: 1045-1055

Number of pages: 11
