Supervised N-gram Topic Model

Cited by: 12
Author
Kawamae, Noriaki [1]
Affiliation
[1] NTT Comware, Mihama Ku, 1-6 Nakase, Chiba 2610023, Japan
Keywords
Nonparametric Bayes models; Nonparametric Dirichlet process; Topic models; Latent variable models; Graphical models; Sentiment analysis; N-gram topic model
DOI
10.1145/2556195.2559895
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
We propose a Bayesian nonparametric topic model that represents the relationships between given labels and the corresponding words and phrases, as found in supervised articles. Unlike existing supervised topic models, our proposal, the supervised N-gram topic model (SNT), addresses both the number of topics and the power-law distribution of word frequencies in topic-specific N-grams. To achieve this, SNT takes a Bayesian nonparametric approach to topic sampling: it assigns a topic to each token using a Chinese restaurant process (CRP), generates a word distribution jointly with the given variable in textual order, and then forms each N-gram word as a hierarchy of Pitman-Yor process (PYP) priors. The CRP helps SNT automatically estimate the appropriate number of topics, which affects the quality of the topic-specific words, N-grams, and observed value distribution. Since the PYP recovers the exact formulation of interpolated Kneser-Ney, one of the best smoothing approaches for N-gram language models, it allows SNT to generate more interpretable N-grams than the alternatives. Experiments on labeled text data show that SNT is useful as a generative model for discovering more phrases that better complement human experts than existing alternatives and provide more domain-specific knowledge. The results show that SNT can be applied to various tasks such as automatic annotation.
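The seating scheme behind the CRP and PYP priors named in the abstract can be illustrated with a minimal Python sketch. This is not the SNT inference procedure from the paper; it only simulates the generic two-parameter (Pitman-Yor) Chinese restaurant seating rule. The function name `sample_pitman_yor_seating` and its parameters `discount` and `concentration` are assumptions chosen for illustration.

```python
import random


def sample_pitman_yor_seating(n_customers, discount, concentration, seed=0):
    """Seat customers one at a time under the two-parameter CRP.

    Hypothetical helper for illustration only (not from the paper).
    Returns (assignments, table_sizes), where assignments[i] is the
    table index chosen by customer i.
    """
    if not 0.0 <= discount < 1.0:
        raise ValueError("discount must lie in [0, 1)")
    if concentration <= -discount:
        raise ValueError("concentration must exceed -discount")

    rng = random.Random(seed)
    table_sizes = []   # number of customers at each table
    assignments = []   # table index per customer

    for _ in range(n_customers):
        # An existing table k is chosen with weight (size_k - discount);
        # a new table is opened with weight
        # (concentration + discount * number_of_tables).
        weights = [size - discount for size in table_sizes]
        weights.append(concentration + discount * len(table_sizes))
        choice = rng.choices(range(len(weights)), weights=weights)[0]

        if choice == len(table_sizes):
            table_sizes.append(1)      # open a new table
        else:
            table_sizes[choice] += 1   # join an existing table
        assignments.append(choice)

    return assignments, table_sizes


if __name__ == "__main__":
    # discount = 0 reduces the rule to the one-parameter CRP.
    _, crp_tables = sample_pitman_yor_seating(1000, 0.0, 1.0)
    _, pyp_tables = sample_pitman_yor_seating(1000, 0.5, 1.0)
    print("CRP tables:", len(crp_tables))
    print("PYP tables:", len(pyp_tables))
```

Setting `discount` to 0 gives the ordinary CRP, in which the number of occupied tables (topics, in the abstract's setting) is not fixed in advance but grows with the data. A positive `discount` tends to produce many small tables alongside a few large ones, which is the power-law behaviour the abstract associates with PYP priors.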
Pages: 473-482
Number of pages: 10
Related papers
50 in total
  • [1] Semantic N-Gram Topic Modeling
    Kherwa, Pooja
    Bansal, Poonam
    [J]. EAI ENDORSED TRANSACTIONS ON SCALABLE INFORMATION SYSTEMS, 2020, 7 (26) : 1 - 12
  • [2] NOVEL TOPIC N-GRAM COUNT LM INCORPORATING DOCUMENT-BASED TOPIC DISTRIBUTIONS AND N-GRAM COUNTS
    Haidar, Md. Akmal
    O'Shaughnessy, Douglas
    [J]. 2014 PROCEEDINGS OF THE 22ND EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO), 2014, : 2310 - 2314
  • [3] Topic-Dependent-Class-Based n-Gram Language Model
    Naptali, Welly
    Tsuchiya, Masatoshi
    Nakagawa, Seiichi
    [J]. IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2012, 20 (05) : 1513 - 1525
  • [4] Extracting Mobile Behavioral Patterns with the Distant N-Gram Topic Model
    Farrahi, Katayoun
    Gatica-Perez, Daniel
    [J]. 2012 16TH INTERNATIONAL SYMPOSIUM ON WEARABLE COMPUTERS (ISWC), 2012, : 1 - 8
  • [5] TOPIC N-GRAM COUNT LANGUAGE MODEL ADAPTATION FOR SPEECH RECOGNITION
    Haidar, Md. Akmal
    O'Shaughnessy, Douglas
    [J]. 2012 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY (SLT 2012), 2012, : 165 - 169
  • [6] Arabic supervised learning method using N-gram
    Sanan, Majed
    Rammal, Mahmoud
    Zreik, Khaldoun
    [J]. INTERACTIVE TECHNOLOGY AND SMART EDUCATION, 2008, 5 (03) : 157 - +
  • [7] Recasting the discriminative n-gram model as a pseudo-conventional n-gram model for LVCSR
    Zhou, Zhengyu
    Meng, Helen
    [J]. 2008 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, VOLS 1-12, 2008, : 4933 - 4936
  • [8] Pseudo-Conventional N-Gram Representation of the Discriminative N-Gram Model for LVCSR
    Zhou, Zhengyu
    Meng, Helen
    [J]. IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, 2010, 4 (06) : 943 - 952
  • [9] Pipilika N-gram Viewer: An Efficient Large Scale N-gram Model for Bengali
    Ahmad, Adnan
    Talha, Mahbubur Rub
    Amin, Md. Ruhul
    Chowdhury, Farida
    [J]. 2018 INTERNATIONAL CONFERENCE ON BANGLA SPEECH AND LANGUAGE PROCESSING (ICBSLP), 2018,
  • [10] A Methodology to Identify Topic of Video via N-Gram Approach
    Pervaiz, Ramsha
    Aloufi, Khalid
    Zaidi, Syed Shabbar Raza
    Malik, Kaleem Razzaq
    [J]. INTERNATIONAL JOURNAL OF COMPUTER SCIENCE AND NETWORK SECURITY, 2020, 20 (01) : 79 - 94