Unsupervised Language Filtering using the Latent Dirichlet Allocation

Times cited: 0
Authors
Zhang, Wei [1]
Clark, Robert A. J. [2]
Wang, Yongyuan [1]
Affiliations
[1] Ocean Univ China, Qingdao 266100, Peoples R China
[2] Univ Edinburgh, CSTR, Edinburgh EH8 9AB, Midlothian, Scotland
Keywords
Language Filtering; Language Purification; Language Identification; IDENTIFICATION; WEB;
DOI
Not available
Chinese Library Classification number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104; 0812; 0835; 1405;
Abstract
Automatically building from scratch the language processing component of a speech synthesis system for a new language requires a purified text corpus in which any words and phrases from other languages are clearly identified or excluded. When using found data, and where there is no inherent linguistic knowledge of the language or languages contained in the data, identifying the pure data is a difficult problem. We propose an unsupervised language identification approach based on Latent Dirichlet Allocation in which raw n-gram counts are taken as features without any smoothing, pruning, or interpolation. The Latent Dirichlet Allocation topic model is reformulated for the language identification task, and Collapsed Gibbs Sampling is used to train an unsupervised language identification model. We show that such a model is highly capable of identifying the primary language in a corpus and filtering out other languages present.
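The pipeline named in the abstract (raw n-gram counts as features, a topic model trained with collapsed Gibbs sampling, then filtering to the primary language) can be illustrated with a minimal sketch. This is not the authors' reformulated model, which the abstract does not specify. It is plain Latent Dirichlet Allocation in which each topic stands in for a language. The choice of character bigrams, the function names (`char_ngrams`, `gibbs_lda`, `filter_primary`), the hyperparameters `alpha` and `beta`, and the rule that the most frequent sentence label is the "primary" language are all assumptions made for illustration.

```python
import random
from collections import Counter


def char_ngrams(text, n=2):
    """Return the raw character n-grams of a string (no smoothing or pruning)."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def gibbs_lda(docs, num_topics=2, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for plain LDA; returns per-document topic counts."""
    rng = random.Random(seed)
    vocab_size = len({g for doc in docs for g in doc})
    doc_topic = [[0] * num_topics for _ in docs]          # n-gram tokens per (doc, topic)
    topic_word = [Counter() for _ in range(num_topics)]   # counts per (topic, n-gram)
    topic_total = [0] * num_topics                        # tokens per topic
    assign = []
    # Random initial topic assignment for every n-gram token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Conditional distribution over topics given all other assignments.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic


def filter_primary(sentences, **lda_args):
    """Keep sentences whose dominant topic is the most common one (assumed primary)."""
    docs = [char_ngrams(s) for s in sentences]
    doc_topic = gibbs_lda(docs, **lda_args)
    labels = [max(range(len(c)), key=c.__getitem__) for c in doc_topic]
    primary = Counter(labels).most_common(1)[0][0]
    kept = [s for s, label in zip(sentences, labels) if label == primary]
    return kept, labels
```

Calling `filter_primary(sentences, num_topics=2)` returns the retained sentences together with one topic label per sentence. Because the sampler is stochastic and unsupervised, the labels are arbitrary indices, and how well they separate languages on a given corpus is not guaranteed by this sketch.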
Pages: 1268-1272
Number of pages: 5