Empirical Study of Unsupervised Chinese Word Segmentation Methods for SMT on Large-scale Corpora

Cited by: 0

Authors:

Wang, Xiaolin [1]

Utiyama, Masao [1]

Finch, Andrew Michael [1]

Sumita, Eiichiro [1]

Affiliations:

[1] Natl Inst Informat & Commun Technol, Koganei, Tokyo, Japan

Source:

PROCEEDINGS OF THE 52ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 2 | 2014

Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TP39 [Computer applications];

Subject classification code:

081203 ; 0835 ;

Abstract:

Unsupervised word segmentation (UWS) can provide domain-adaptive segmentation for statistical machine translation (SMT) without annotated data, and bilingual UWS can even optimize segmentation for alignment. Monolingual UWS approaches that explicitly model word probabilities through Dirichlet process (DP) or Pitman-Yor process (PYP) models have achieved high accuracy, but their bilingual counterparts have been applied only to small corpora, such as the Basic Travel Expression Corpus (BTEC), because of their computational complexity. This paper proposes an efficient, unified PYP-based method for monolingual and bilingual UWS. Experimental results show that the proposed method is comparable to supervised segmenters on the in-domain NIST OpenMT corpus and yields a 0.96 BLEU relative increase on the out-of-domain NTCIR PatentMT corpus.


Pages: 752-758

Page count: 7
