Exploration of N-gram Features for the Domain Adaptation of Chinese Word Segmentation

Cited by: 0
Authors
Guo, Zhen [1]
Zhang, Yujie [1]
Su, Chen [1]
Xu, Jinan [1]
Affiliations
[1] Beijing Jiaotong Univ, Sch Comp & Informat Technol, Beijing 100044, Peoples R China
Keywords
Chinese Word Segmentation; CRF; domain adaptation; n-gram feature
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
A key problem in Chinese Word Segmentation is that a system's performance degrades when it is applied to a different domain. We propose an approach that exploits n-gram features drawn from a large raw corpus to achieve domain adaptation for Chinese Word Segmentation. The n-gram features comprise an n-gram frequency feature and an AV feature. To verify the proposed method, we used a CRF model and a raw corpus of 1 million patent description sentences. For the test data, 300 patent description sentences were randomly selected and manually annotated. The results show that the proposed method improves Chinese Word Segmentation on the test data by 2.53%.
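The abstract does not spell out how the two n-gram features are computed, so the following is only a minimal sketch under stated assumptions. It assumes "AV" stands for accessor variety, taken here as the smaller of the number of distinct left neighbours and the number of distinct right neighbours of a character n-gram in the raw corpus. The function name `ngram_statistics`, the `max_n` parameter, the boundary markers, and the three toy sentences are all hypothetical and do not come from the paper.

```python
from collections import Counter, defaultdict


def ngram_statistics(sentences, max_n=3):
    """Count character n-grams and derive an accessor-variety score for each.

    Assumption: AV(gram) = min(#distinct left neighbours, #distinct right neighbours).
    """
    freq = Counter()
    left = defaultdict(set)   # distinct characters seen immediately before each n-gram
    right = defaultdict(set)  # distinct characters seen immediately after each n-gram
    for sent in sentences:
        for n in range(1, max_n + 1):
            for i in range(len(sent) - n + 1):
                gram = sent[i:i + n]
                freq[gram] += 1
                # Simplification: all sentence starts share one marker, as do all ends.
                left[gram].add(sent[i - 1] if i > 0 else "<s>")
                right[gram].add(sent[i + n] if i + n < len(sent) else "</s>")
    av = {gram: min(len(left[gram]), len(right[gram])) for gram in freq}
    return freq, av


# Hypothetical toy corpus standing in for unlabeled patent sentences.
raw_corpus = ["本发明涉及一种装置", "该装置包括壳体", "所述装置设有开关"]
freq, av = ngram_statistics(raw_corpus, max_n=2)
print(freq["装置"], av["装置"])
```

In a CRF segmenter, statistics like these would typically be bucketed into a few discrete levels and attached to each character as extra features. How the paper itself discretizes and uses them is not stated in the abstract.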
Pages: 121-131
Page count: 11
Related papers
50 in total
  • [1] A language independent n-gram model for word segmentation
    Kang, Seung-Shik
    Hwang, Kyu-Baek
    [J]. AI 2006: ADVANCES IN ARTIFICIAL INTELLIGENCE, PROCEEDINGS, 2006, 4304 : 557 - +
  • [2] A language independent n-gram model for word segmentation
    Kang, Seung-Shik
    Hwang, Kyu-Baek
    [J]. Lect. Notes Comput. Sci., 1600, (557-565):
  • [4] Unsupervised word sense disambiguation with N-gram features
    Preotiuc-Pietro, Daniel
    Hristea, Florentina
    [J]. ARTIFICIAL INTELLIGENCE REVIEW, 2014, 41 (02) : 241 - 260
  • [5] Unsupervised word sense disambiguation with N-gram features
    Daniel Preotiuc-Pietro
    Florentina Hristea
    [J]. Artificial Intelligence Review, 2014, 41 : 241 - 260
  • [6] Neural Domain Adaptation for Chinese Word Segmentation
    Bao, Zuyi
    Li, Si
    Xu, Weiran
    Gao, Sheng
    [J]. 2017 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP), 2017, : 131 - 134
  • [7] Chinese new word identification using N-gram and PPM Models
    Li, Dun
    Tu, Wei
    Shi, Lei
    [J]. EMERGING SYSTEMS FOR MATERIALS, MECHANICS AND MANUFACTURING, 2012, 109 : 612 - 616
  • [8] Learning Chinese word representation better by cascade morphological n-gram
    Zongyang Xiong
    Ke Qin
    Haobo Yang
    Guangchun Luo
    [J]. Neural Computing and Applications, 2021, 33 : 3757 - 3768
  • [9] Detection of Algorithmically Generated Malicious Domain Names with Feature Fusion of Meaningful Word Segmentation and N-Gram Sequences
    Chen, Shaojie
    Lang, Bo
    Chen, Yikai
    Xie, Chong
    [J]. APPLIED SCIENCES-BASEL, 2023, 13 (07):
  • [10] MiNgMatch-A Fast N-gram Model for Word Segmentation of the Ainu Language
    Nowakowski, Karol
    Ptaszynski, Michal
    Masui, Fumito
    [J]. INFORMATION, 2019, 10 (10)