Unsupervised language identification based on Latent Dirichlet Allocation

Cited by: 11
Authors
Zhang, Wei [1, 2]
Clark, Robert A. J. [2]
Wang, Yongyuan [1]
Li, Wen [1]
Affiliations
[1] Ocean Univ China, Dept Comp Sci & Tech, Qingdao 266100, Peoples R China
[2] Univ Edinburgh, CSTR, Edinburgh EH89AB, Midlothian, Scotland
Source
Keywords
Language filtering; Language purifying; Language identification;
DOI
10.1016/j.csl.2016.02.001
CLC number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
To automatically build, from scratch, the language processing component for a speech synthesis system in a new language, a purified text corpus is needed in which any words and phrases from other languages are clearly identified or excluded. When using found data with no inherent linguistic knowledge of the language or languages it contains, identifying the pure data is a difficult problem. We propose an unsupervised language identification approach based on Latent Dirichlet Allocation (LDA) that takes raw n-gram counts as features without any smoothing, pruning or interpolation. The LDA topic model is reformulated for the language identification task, and Collapsed Gibbs Sampling is used to train an unsupervised language identification model. To find the number of languages present, we compared four kinds of measure and also the Hierarchical Dirichlet Process on several configurations of the ECI/MCI benchmark. Experiments on the ECI/MCI data and a Wikipedia-based Swahili corpus show that this LDA method, without any annotation, achieves precision, recall and F-scores comparable to state-of-the-art supervised language identification techniques. (C) 2016 Elsevier Ltd. All rights reserved.
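The core idea of the abstract (raw n-gram counts fed to an LDA model trained with collapsed Gibbs sampling, where each topic stands for one language) can be illustrated with a small sketch. This is not the authors' implementation: the use of character bigrams, the symmetric priors `alpha` and `beta`, the fixed number of languages, and the function names `char_ngrams` and `gibbs_lda` are all assumptions made for illustration.

```python
import random


def char_ngrams(text, n=2):
    """Return raw overlapping character n-grams (no smoothing or pruning)."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def gibbs_lda(docs, num_topics, iters=100, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over lists of n-gram tokens.

    Each topic is read as one language (an assumption of this sketch).
    Returns the dominant topic index for every document.
    """
    rng = random.Random(seed)
    vocab = {g: i for i, g in enumerate(sorted({g for d in docs for g in d}))}
    vocab_size = len(vocab)
    doc_ids = [[vocab[g] for g in d] for d in docs]

    n_dk = [[0] * num_topics for _ in doc_ids]          # document-topic counts
    n_kw = [[0] * vocab_size for _ in range(num_topics)]  # topic-n-gram counts
    n_k = [0] * num_topics                               # tokens per topic

    # Random initial topic assignment for every token.
    z = []
    for d, doc in enumerate(doc_ids):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(doc_ids):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts.
                k = z[d][i]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Unnormalised conditional p(z = t | all other assignments).
                weights = [
                    (n_dk[d][t] + alpha)
                    * (n_kw[t][w] + beta) / (n_k[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1

    # Label each document with its most frequent topic.
    return [row.index(max(row)) for row in n_dk]


sentences = [
    "the cat sat on the mat and looked at the dog",
    "der hund und die katze sitzen auf dem boden",
    "there is nothing in the house this morning",
    "ich habe heute morgen nichts im haus gefunden",
]
labels = gibbs_lda([char_ngrams(s) for s in sentences], num_topics=2)
print(labels)
```

The topic indices are arbitrary cluster labels, not language names, and on a toy input this small the grouping is not guaranteed to match the true languages. The sketch also fixes the number of topics in advance, whereas the abstract describes estimating the number of languages separately.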
Pages: 47-66
Page count: 20
Related papers
50 in total
  • [1] Unsupervised Language Filtering using the Latent Dirichlet Allocation
    Zhang, Wei
    Clark, Robert A. J.
    Wang, Yongyuan
    [J]. 15TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2014), VOLS 1-4, 2014, : 1268 - 1272
  • [2] UNSUPERVISED LANGUAGE MODEL ADAPTATION USING LATENT DIRICHLET ALLOCATION AND DYNAMIC MARGINALS
    Haidar, Md. Akmal
    O'Shaughnessy, Douglas
    [J]. 19TH EUROPEAN SIGNAL PROCESSING CONFERENCE (EUSIPCO-2011), 2011, : 1480 - 1484
  • [3] Unsupervised Object Localization with Latent Dirichlet Allocation
    Yang, Tong-feng
    Ma, Jun
    [J]. 2013 INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE AND ARTIFICIAL INTELLIGENCE (ICCSAI 2013), 2013, : 230 - 234
  • [4] Unsupervised Feature Selection for Latent Dirichlet Allocation
    Xu Weiran
    Du Gang
    Chen Guang
    Guo Jun
    Yang Jie
    [J]. CHINA COMMUNICATIONS, 2011, 8 (05) : 54 - 62
  • [5] Novel Weighting Scheme for Unsupervised Language Model Adaptation Using Latent Dirichlet Allocation
    Haidar, Md Akmal
    O'Shaughnessy, Douglas
    [J]. 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4, 2010, : 2438 - 2441
  • [7] Unsupervised segmentation of greenhouse plant images based on modified Latent Dirichlet Allocation
    Wang, Yi
    Xu, Lihong
    [J]. PEERJ, 2018, 6
  • [8] Language Model Adaptation Based on Topic Probability of Latent Dirichlet Allocation
    Jeon, Hyung-Bae
    Lee, Soo-Young
    [J]. ETRI JOURNAL, 2016, 38 (03) : 487 - 493
  • [9] Author Identification Using Latent Dirichlet Allocation
    Calvo, Hiram
    Hernandez-Castaneda, Angel
    Garcia-Flores, Jorge
    [J]. COMPUTATIONAL LINGUISTICS AND INTELLIGENT TEXT PROCESSING, CICLING 2017, PT II, 2018, 10762 : 303 - 312
  • [10] Latent Dirichlet Allocation for Unsupervised Activity Analysis on an Autonomous Mobile Robot
    Duckworth, Paul
    Alomari, Muhannad
    Charles, James
    Hogg, David C.
    Cohn, Anthony G.
    [J]. THIRTY-FIRST AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2017, : 3819 - 3826