Empirical Study of Unsupervised Chinese Word Segmentation Methods for SMT on Large-scale Corpora

Cited by: 0
Authors
Wang, Xiaolin [1]
Utiyama, Masao [1]
Finch, Andrew Michael [1]
Sumita, Eiichiro [1]
Affiliations
[1] Natl Inst Informat & Commun Technol, Koganei, Tokyo, Japan
Source
PROCEEDINGS OF THE 52ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 2 | 2014
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP39 [Computer applications];
Discipline classification code
081203; 0835;
Abstract
Unsupervised word segmentation (UWS) can provide domain-adaptive segmentation for statistical machine translation (SMT) without annotated data, and bilingual UWS can even optimize segmentation for alignment. Monolingual UWS approaches that explicitly model word probabilities through Dirichlet process (DP) or Pitman-Yor process (PYP) models have achieved high accuracy, but their bilingual counterparts have been applied only to small corpora, such as the Basic Travel Expression Corpus (BTEC), because of their computational complexity. This paper proposes an efficient, unified PYP-based method for monolingual and bilingual UWS. Experimental results show that the proposed method is comparable to supervised segmenters on the in-domain NIST OpenMT corpus, and yields a 0.96 BLEU relative increase on the out-of-domain NTCIR PatentMT corpus.
Pages: 752 - 758
Page count: 7