Supervised N-gram Topic Model

Cited by: 12
Author
Kawamae, Noriaki [1]
Affiliation
[1] NTT Comware, Mihama Ku, 1-6 Nakase, Chiba 2610023, Japan
Keywords
Nonparametric Bayes models; Nonparametric Dirichlet process; Topic models; Latent variable models; Graphical models; Sentiment analysis; N-gram topic model;
DOI
10.1145/2556195.2559895
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
We propose a Bayesian nonparametric topic model that represents relationships between given labels and the corresponding words/phrases, as found in supervised articles. Unlike existing supervised topic models, our proposal, the supervised N-gram topic model (SNT), focuses on both the number of topics and the power-law distribution of word frequencies in topic-specific N-grams. To achieve this goal, SNT takes a Bayesian nonparametric approach to topic sampling: it assigns a topic to each token using a Chinese restaurant process (CRP), generates a word distribution jointly with the given variable in textual order, and then forms each N-gram word as a hierarchy of Pitman-Yor process (PYP) priors. The CRP helps SNT automatically estimate the appropriate number of topics, which affects the quality of topic-specific words, N-grams, and the observed value distribution. Since the PYP recovers the exact formulation of interpolated Kneser-Ney, one of the best smoothing approaches for N-gram language models, it allows SNT to generate more interpretable N-grams than the alternatives. Experiments on labeled text data show that SNT is useful as a generative model for discovering more phrases that better complement human experts than existing alternatives and provide more domain-specific knowledge. The results show that SNT can be applied to various tasks such as automatic annotation.
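The seating rule that underlies both the CRP and its Pitman-Yor generalization can be sketched in a few lines. This is only an illustration of the general two-parameter Chinese restaurant process, not the SNT sampler described in the paper; the function name `pitman_yor_crp` and the parameter names `theta` (concentration) and `discount` are hypothetical choices made here, and the sketch assumes `theta > 0` for simplicity.

```python
import random
from typing import List, Optional


def pitman_yor_crp(n_customers: int, theta: float, discount: float = 0.0,
                   rng: Optional[random.Random] = None) -> List[int]:
    """Seat customers one at a time and return the resulting table sizes.

    With K occupied tables of sizes c_1..c_K, the next customer joins
    table k with weight (c_k - discount) and opens a new table with
    weight (theta + discount * K). Hypothetical helper for illustration.
    """
    if not (0.0 <= discount < 1.0) or theta <= 0.0:
        raise ValueError("assumes 0 <= discount < 1 and theta > 0")
    rng = rng or random.Random()
    tables: List[int] = []
    for _ in range(n_customers):
        # One weight per existing table, plus a final weight for a new table.
        weights = [count - discount for count in tables]
        weights.append(theta + discount * len(tables))
        choice = rng.choices(range(len(weights)), weights=weights)[0]
        if choice == len(tables):
            tables.append(1)      # open a new table (a new topic or word type)
        else:
            tables[choice] += 1   # join an existing table
    return tables


if __name__ == "__main__":
    sizes = pitman_yor_crp(1000, theta=1.0, discount=0.5,
                           rng=random.Random(0))
    print(len(sizes), "tables; largest:", max(sizes))
```

Setting `discount=0` reduces the rule to the one-parameter CRP, where the number of occupied tables is not fixed in advance but grows with the data. A positive discount tends to produce many small tables alongside a few large ones, which is the power-law behavior the abstract attributes to the PYP.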
Pages: 473-482
Page count: 10