Empirical Study of Unsupervised Chinese Word Segmentation Methods for SMT on Large-scale Corpora

Cited by: 0
Authors
Wang, Xiaolin [1]
Utiyama, Masao [1]
Finch, Andrew Michael [1]
Sumita, Eiichiro [1]
Affiliations
[1] Natl Inst Informat & Commun Technol, Koganei, Tokyo, Japan
Source
PROCEEDINGS OF THE 52ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 2 | 2014
Keywords
DOI
Not available
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification code
081203; 0835;
Abstract
Unsupervised word segmentation (UWS) can provide domain-adaptive segmentation for statistical machine translation (SMT) without annotated data, and bilingual UWS can even optimize segmentation for alignment. Monolingual UWS approaches that explicitly model word probabilities through Dirichlet process (DP) or Pitman-Yor process (PYP) models have achieved high accuracy, but their bilingual counterparts have only been applied to small corpora, such as the Basic Travel Expression Corpus (BTEC), because of their computational complexity. This paper proposes an efficient, unified PYP-based method for both monolingual and bilingual UWS. Experimental results show that the proposed method is comparable to supervised segmenters on the in-domain NIST OpenMT corpus, and yields a relative increase of 0.96 BLEU on the out-of-domain NTCIR PatentMT corpus.
Pages: 752-758
Number of pages: 7