Exploiting Unlabeled Internal Data in Conditional Random Fields to Reduce Word Segmentation Errors for Chinese Texts

Authors
Tsai, Richard Tzong-Han [1]
Hung, Hsi-Chuan [1]
Dai, Hong-Jie [1]
Hsu, Wen-Lian [1]
Affiliations
[1] Academia Sinica, Institute of Information Science, Taipei, Taiwan
Keywords
text-to-speech; Chinese word segmentation; segmentation errors; internal unlabeled data
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Text-to-speech (TTS) conversion has become widely used in recent years. Chinese TTS faces several unique difficulties, the most critical of which is the lack of word delimiters in written Chinese. Chinese word segmentation (CWS) must therefore be the first step in Chinese TTS. Unfortunately, because word boundaries in Chinese are ambiguous, even the best CWS systems make serious segmentation errors. The resulting misinterpretation of sentences causes TTS errors, which prevents wider use of TTS in applications such as automatic customer services or computer reader systems for the visually impaired. In this paper, we propose a novel method that exploits unlabeled internal data to reduce word segmentation errors without using external dictionaries. To demonstrate the generality of our method, we evaluate our system on the most widely recognized CWS evaluation tool, the SIGHAN bakeoff, which includes datasets in both traditional and simplified Chinese. These datasets are provided by four representative academic or industrial research institutes in Hong Kong, Taiwan, Mainland China, and the U.S. Our experimental results show that, using only internal data and unlabeled test data, our approach reduces segmentation errors by an average of 15% compared to the traditional approach. Moreover, our approach achieves performance comparable to the best CWS systems that use external resources. Further analysis shows that our method has the potential to become more accurate as the amount of test data increases.
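As a reading aid, the sketch below illustrates two ideas the abstract relies on. It is not the authors' implementation. It assumes the common practice of casting CWS as per-character tagging for a conditional random field, using a B/M/E/S scheme (begin, middle, end, single) that the abstract does not specify. It also shows how a relative error reduction such as the reported 15% is computed. The function names and the error rates in the usage lines are hypothetical and chosen only for illustration.

```python
def words_to_tags(words):
    """Map a list of words to one B/M/E/S tag per character (assumed scheme)."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # first character, any middle characters, last character
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Rebuild words from characters and their B/M/E/S tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word closes on E or S
            words.append(current)
            current = ""
    if current:  # tolerate a sequence that ends without E or S
        words.append(current)
    return words


def relative_error_reduction(baseline_error, new_error):
    """Fraction of the baseline's errors that the new system removes."""
    return (baseline_error - new_error) / baseline_error


# Hypothetical example sentence, already segmented into words.
words = ["中文", "分词", "很", "重要"]
tags = words_to_tags(words)
restored = tags_to_words("".join(words), tags)

# Hypothetical error rates, not figures from the paper.
reduction = relative_error_reduction(0.040, 0.034)
```

Under this framing, a segmenter only has to predict one tag per character, and a segmentation error corresponds to at least one wrong tag. A drop in error rate from 4.0% to 3.4% in the made-up numbers above would be a relative reduction of about 15%.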
Pages: 2944-2947
Page count: 4