Empirical Study of Unsupervised Chinese Word Segmentation Methods for SMT on Large-scale Corpora

Authors
Wang, Xiaolin [1 ]
Utiyama, Masao [1 ]
Finch, Andrew Michael [1 ]
Sumita, Eiichiro [1 ]
Affiliation
[1] Natl Inst Informat & Commun Technol, Koganei, Tokyo, Japan
Source
PROCEEDINGS OF THE 52ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 2 | 2014
Keywords
DOI
Not available
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification codes
081203 ; 0835 ;
Abstract
Unsupervised word segmentation (UWS) can provide domain-adaptive segmentation for statistical machine translation (SMT) without annotated data, and bilingual UWS can even optimize segmentation for alignment. Monolingual UWS approaches that explicitly model word probabilities through Dirichlet process (DP) or Pitman-Yor process (PYP) models have achieved high accuracy, but their bilingual counterparts have been applied only to small corpora, such as the Basic Travel Expression Corpus (BTEC), because of their computational complexity. This paper proposes an efficient, unified PYP-based method for monolingual and bilingual UWS. Experimental results show that the proposed method is comparable to supervised segmenters on the in-domain NIST OpenMT corpus and yields a 0.96 BLEU relative increase on the out-of-domain NTCIR PatentMT corpus.
Pages: 752-758
Number of pages: 7