Chinese text classification without automatic word segmentation

Cited by: 4
Authors
Liu, Wei [1]
Allison, Ben [1]
Guthrie, David [1]
Guthrie, Louise [1]
Affiliations
[1] Univ Sheffield, Dept Comp Sci, Sheffield S10 2TN, S Yorkshire, England
DOI
10.1109/ALPIT.2007.19
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Because Asian writing systems lack explicit word boundaries, machine processing of these languages often involves segmenting text into word units. This paper tests the assumption that segmentation is a necessary step for authorship attribution and topic classification in Chinese, and demonstrates that it is not. We show extensive results for both tasks, considering both single words and short phrases as features, and examining the effect of document length on classification accuracy. Our experiments show that a naive character bigram model of text performs as well as models generated using a state-of-the-art automatic segmenter.
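To illustrate what a character bigram representation looks like, here is a minimal Python sketch that turns unsegmented Chinese text into overlapping two-character features. The abstract does not describe the paper's preprocessing or classifier, so the function names `char_bigrams` and `bigram_counts`, the whitespace handling, and the sample sentence are all assumptions made for illustration only.

```python
from collections import Counter


def char_bigrams(text: str) -> list[str]:
    """Return overlapping character bigrams from unsegmented text.

    Assumption: whitespace is dropped before pairing characters; the
    abstract does not say how the original experiments handled it.
    """
    chars = [c for c in text if not c.isspace()]
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


def bigram_counts(text: str) -> Counter:
    """Count bigram occurrences, giving a bag-of-bigrams feature vector."""
    return Counter(char_bigrams(text))


# Hypothetical sample sentence: "I like natural language processing".
sample = "我喜欢自然语言处理"
features = bigram_counts(sample)
print(char_bigrams(sample))
```

No word segmenter is involved: every adjacent pair of characters becomes a feature, so true words such as 自然 and 语言 appear alongside pairs that straddle word boundaries, such as 欢自. The resulting counts could then be fed to any standard text classifier.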
Pages: 45 / +
Page count: 2
Related papers
50 in total
  • [21] Chinese Text Classification Method Based on BERT Word Embedding
    Wang, Ziniu
    Huang, Zhilin
    Gao, Jianling
    2020 5TH INTERNATIONAL CONFERENCE ON MATHEMATICS AND ARTIFICIAL INTELLIGENCE (ICMAI 2020), 2020, : 66 - 71
  • [22] Word-character attention model for Chinese text classification
    Xue Qiao
    Chen Peng
    Zhen Liu
    Yanfeng Hu
    International Journal of Machine Learning and Cybernetics, 2019, 10 : 3521 - 3537
  • [23] Dynamically Jointing character and word embedding for Chinese text Classification
    Tang, Xuetao
    Hu, Xuegang
    Li, Peipei
    11TH IEEE INTERNATIONAL CONFERENCE ON KNOWLEDGE GRAPH (ICKG 2020), 2020, : 336 - 343
  • [24] More than Text: Multi-modal Chinese Word Segmentation
    Zhang, Dong
    Hu, Zheng
    Li, Shoushan
    Wu, Hanqian
    Zhu, Qiaoming
    Zhou, Guodong
    ACL-IJCNLP 2021: THE 59TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 11TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING, VOL 2, 2021, : 550 - 557
  • [25] Text knowledge management oriented adaptive Chinese word segmentation algorithms
    Feng, Yong
    He, Xun
    Tang, Li
    Chen, Xian-Yong
    Chen, Zhen
    Chongqing Daxue Xuebao/Journal of Chongqing University, 2010, 33 (10): : 110 - 117
  • [26] Term Extraction from Chinese Texts Without Word Segmentation
    Yu, Chuqiao
    Ma Pengyu
    Bessmertny, I. A.
    Platonov, A., V
    Poleschuk, E. A.
    2017 11TH IEEE INTERNATIONAL CONFERENCE ON APPLICATION OF INFORMATION AND COMMUNICATION TECHNOLOGIES (AICT 2017), 2017, : 124 - 127
  • [27] Words without Boundaries: Computational Approaches to Chinese Word Segmentation
    Huang, Chu-Ren
    Xue, Nianwen
    LANGUAGE AND LINGUISTICS COMPASS, 2012, 6 (08): : 494 - 505
  • [28] Method of automatic Chinese word segmentation suitable for automation keyword indexing
    Zhenming, Tang
    Cong, Jin
    Jinyu, Yang
    Yuanfu, Li
    Nanjing Li Gong Daxue Xuebao/Journal of Nanjing University of Science and Technology, 1995, 19 (05):
  • [29] Automatic Chinese Text Classification Based on NSVMDT-KNN
    Xu, QiNan
    Liu, Zhijng
    FIFTH INTERNATIONAL CONFERENCE ON FUZZY SYSTEMS AND KNOWLEDGE DISCOVERY, VOL 2, PROCEEDINGS, 2008, : 410 - 414
  • [30] Improvement of Automatic Chinese Text Classification by Combining Multiple Features
    Luo, Xi
    Ohyama, Wataru
    Wakabayashi, Tetsushi
    Kimura, Fumitaka
    IEEJ TRANSACTIONS ON ELECTRICAL AND ELECTRONIC ENGINEERING, 2015, 10 (02) : 166 - 174