Chinese text classification without automatic word segmentation

Cited by: 4
Authors
Liu, Wei [1]
Allison, Ben [1]
Guthrie, David [1]
Guthrie, Louise [1]
Affiliations
[1] Univ Sheffield, Dept Comp Sci, Sheffield S10 2TN, S Yorkshire, England
DOI
10.1109/ALPIT.2007.19
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812 ;
Abstract
Due to the lack of word boundaries in Asian systems of writing, machine processing of these languages often involves segmenting text into word units. This paper tests the assumption that this segmentation is a necessary step for authorship attribution and topic classification tasks in Chinese, and demonstrates that it is not. We show extensive results for both tasks, considering both single words and short phrases as features, and examining the effect of document length on classification accuracy. Our experiments show that a naive character bigram model of text performs as well as models generated using a state-of-the-art automatic segmenter.
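To make the idea of a character bigram representation concrete, the sketch below shows one way to turn unsegmented Chinese text into overlapping two-character features. It is an illustration only, not the authors' implementation: the function names `char_bigrams` and `bigram_counts` are hypothetical, dropping whitespace is an assumption made here, and the abstract does not describe the classifier that consumes the features.

```python
from collections import Counter


def char_bigrams(text):
    """Return overlapping character bigrams of `text`.

    Assumption for this sketch: whitespace is dropped before pairing,
    so bigrams never span a space or a line break.
    """
    chars = [c for c in text if not c.isspace()]
    # Pair each character with its successor: c0c1, c1c2, c2c3, ...
    return [a + b for a, b in zip(chars, chars[1:])]


def bigram_counts(text):
    """Count bigram occurrences, giving a bag-of-bigrams feature vector."""
    return Counter(char_bigrams(text))


if __name__ == "__main__":
    # No word segmenter is involved: features come straight from characters.
    print(char_bigrams("中文文本"))
    print(bigram_counts("中文中文"))
```

Counts of this kind could be passed to any standard text classifier in place of word counts produced by a segmenter, which is the comparison the abstract describes.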
Pages: 45 / +
Page count: 2