Latent Topic-based Subspace for Natural Language Processing

Citations: 0
Authors
Mohamed Morchid
Pierre-Michel Bousquet
Waad Ben Kheder
Killian Janod
Affiliations
[1] University of Avignon,Laboratoire Informatique d’Avignon (LIA)
Source
Journal of Signal Processing Systems for Signal Image and Video Technology, 2019, 91(8): 833-853
Keywords
Latent topic-based model; Deep neural networks; Author-topic model; Factor analysis; c-vector; 20-Newsgroups; DECODA
DOI
Not available
Abstract
Natural Language Processing (NLP) applications have difficulty dealing with automatically transcribed spoken documents recorded in noisy conditions, due to high Word Error Rates (WER), and with textual documents from the Internet, such as forums or micro-blogs, due to misspelled or truncated words and poor grammatical form. To improve robustness against document errors, previously proposed methods map these noisy documents into a latent space, using models such as Latent Dirichlet Allocation (LDA), supervised LDA and the author-topic (AT) model. In comparison to LDA, the AT model considers not only the document content (words) but also the class associated with the document. In addition to these high-level representation models, an original compact representation, called c-vector, has recently been introduced to avoid the tricky choice of the number of latent topics required by these topic-based representations. The main drawback of the c-vector space building process is the number of sub-tasks it requires. We recently proposed both to improve the performance of this compact c-vector representation of spoken documents and to reduce the number of required sub-tasks, using an original framework that builds a robust low-dimensional space of features from a set of AT models, called the "Latent Topic-based Subspace" (LTS). This paper goes further by comparing the original LTS-based representation with the c-vector technique, with a state-of-the-art compression approach based on encoder-decoder neural networks (autoencoders), and with classification methods based on deep neural networks (DNN) and long short-term memory (LSTM) networks, on two classification tasks: noisy documents taking the form of speech conversations, and textual documents from the 20-Newsgroups corpus.
Results show that the original LTS representation outperforms the best previous compact representations, with substantial gains of more than 2.1 and 3.3 points in correctly labeled documents compared to the c-vector and autoencoder neural network approaches, respectively. An optimization algorithm for the scoring model parameters is then proposed to improve both the robustness and the performance of the LTS-based approach. Finally, an automatic clustering approach based on the radial proximity between document classes is introduced and shows promising performance.
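To give an intuition for the topic-based representations the abstract discusses, the following toy sketch projects noisy bag-of-words documents onto a small "topic" space to obtain a compact, fixed-size feature vector. It is only an illustration of the general idea, not the paper's method: the vocabulary and topic-word weights below are hand-crafted for the example, whereas LDA or author-topic models would estimate them from data.

```python
# Toy illustration (not the paper's LTS method): projecting bag-of-words
# documents into a low-dimensional topic space, in the spirit of
# topic-based representations such as LDA or the author-topic model.
# VOCAB and TOPIC_WORD are hypothetical, hand-crafted for this example.

VOCAB = ["flight", "ticket", "refund", "goal", "match", "player"]

# Two hypothetical topics (rows are word weights summing to 1).
TOPIC_WORD = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # topic 0: travel
    [0.0, 0.0, 0.0, 0.3, 0.4, 0.3],  # topic 1: sports
]

def bow(tokens):
    """Bag-of-words count vector over VOCAB; out-of-vocabulary tokens
    (e.g. misspelled or truncated words) are simply ignored."""
    counts = [0] * len(VOCAB)
    for t in tokens:
        if t in VOCAB:
            counts[VOCAB.index(t)] += 1
    return counts

def topic_features(tokens):
    """Project a document onto the topic space and L1-normalize,
    yielding a compact fixed-size representation for a classifier."""
    counts = bow(tokens)
    raw = [sum(w * c for w, c in zip(topic, counts)) for topic in TOPIC_WORD]
    total = sum(raw) or 1.0
    return [r / total for r in raw]

doc = "please refund my flight ticket ticket".split()
print(topic_features(doc))  # dominated by the travel topic
```

Even with the out-of-vocabulary words ("please", "my") dropped, the document still lands firmly on the travel topic, which is the robustness-to-noise property that latent-space representations aim for.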
Pages: 833-853
Page count: 20