Empirical Study of Unsupervised Chinese Word Segmentation Methods for SMT on Large-scale Corpora

Cited by: 0

Authors:

Wang, Xiaolin [1]

Utiyama, Masao [1]

Finch, Andrew Michael [1]

Sumita, Eiichiro [1]

Affiliation:

[1] Natl Inst Informat & Commun Technol, Koganei, Tokyo, Japan

Source:

PROCEEDINGS OF THE 52ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 2 | 2014

Keywords:

DOI:

Not available

Chinese Library Classification number:

TP39 [Computer applications];

Discipline classification code:

081203 ; 0835 ;

Abstract:

Unsupervised word segmentation (UWS) can provide domain-adaptive segmentation for statistical machine translation (SMT) without annotated data, and bilingual UWS can even optimize segmentation for alignment. Monolingual UWS approaches that explicitly model the probabilities of words through Dirichlet process (DP) models or Pitman-Yor process (PYP) models have achieved high accuracy, but their bilingual counterparts have only been applied to small corpora, such as the Basic Travel Expression Corpus (BTEC), because of their computational complexity. This paper proposes an efficient unified PYP-based monolingual and bilingual UWS method. Experimental results show that the proposed method is comparable to supervised segmenters on the in-domain NIST OpenMT corpus, and yields a 0.96 BLEU relative increase on the out-of-domain NTCIR PatentMT corpus.

Pages: 752-758

Page count: 7
