The Research on Automatic Construction Techniques of Large-scale Corpus for Chinese Text Categorization

Cited by: 0
Authors
Hu, Yan [1]
Wu, Wei [1]
Miao, Miao [1]
Affiliation
[1] Wuhan Univ Technol, Sch Comp Sci & Technol, Wuhan 430070, Peoples R China
Keywords
Automatic Construction; Large-scale Corpus; Chinese Text Categorization;
DOI
10.1109/IEEC.2009.141
Chinese Library Classification
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Large-scale corpora contain abundant language phenomena. They reflect the general laws of language use and have drawn interest from many countries in both information technology and linguistics, making them a hot topic in natural language processing. However, Chinese corpora are currently scarce, and corpora for Chinese text categorization are especially rare. Text categorization has become the core and foundation of large-scale data processing applications, so the lag in corpus research has become an obstacle to the development of information technology. Therefore, by analyzing the characteristics of Chinese categorization corpora, drawing on the Internet, currently the largest knowledge base, and relying on the retrieval capability of search engines, this paper proposes and implements an algorithm for automatically constructing a large-scale corpus for Chinese text categorization. Experiments show that the corpus constructed by this algorithm performs well with various classifiers, which gives it a certain practical value.
Pages: 640-645
Number of pages: 6
Related papers
50 in total
  • [31] DuSQL: A Large-Scale and Pragmatic Chinese Text-to-SQL Dataset
    Wang, Lijie
    Zhang, Ao
    Wu, Kun
    Sun, Ke
    Li, Zhenghua
    Wu, Hua
    Zhang, Min
    Wang, Haifeng
    PROCEEDINGS OF THE 2020 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP), 2020, : 6923 - 6935
  • [32] A Novel Hybrid system for Large-Scale Chinese Text Classification Problem
    Gao, Zhong
    Lu, Guanming
    Gu, Daquan
    FCST: 2008 JAPAN-CHINA JOINT WORKSHOP ON FRONTIER OF COMPUTER SCIENCE AND TECHNOLOGY, PROCEEDINGS, 2008, : 121 - +
  • [33] Automatic Chinese Text Categorization System Based on Mutual Information
    Lu, Zhimao
    Shi, Hong
    Zhang, Qi
    Yuan, Chaoyue
    2009 IEEE INTERNATIONAL CONFERENCE ON MECHATRONICS AND AUTOMATION, VOLS 1-7, CONFERENCE PROCEEDINGS, 2009, : 4986 - 4990
  • [34] Automatic Category Structure Generation and Categorization of Chinese Text Documents
    Yang, Hsin-Chang
    Lee, Chung-Hong
    LECTURE NOTES IN COMPUTER SCIENCE <D>, 2000, 1910 : 673 - 678
  • [35] Construction of Adverbial-Verb Collocation Database Based on Large-Scale Corpus
    Xing, Dan
    Xun, Endong
    Wang, Chengwen
    Rao, Gaoqi
    Ma, Luyao
    CHINESE LEXICAL SEMANTICS (CLSW 2019), 2020, 11831 : 585 - 595
  • [36] Enabling Empirical Research: A Corpus of Large-Scale Python Systems
    Omari, Safwan
    Martinez, Gina
    PROCEEDINGS OF THE FUTURE TECHNOLOGIES CONFERENCE (FTC) 2019, VOL 2, 2020, 1070 : 661 - 669
  • [37] A Corpus for Large-Scale Phonetic Typology
    Salesky, Elizabeth
    Chodroff, Eleanor
    Pimentel, Tiago
    Wiesner, Matthew
    Cotterell, Ryan
    Black, Alan W.
    Eisner, Jason
    58TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2020), 2020, : 4526 - 4546
  • [38] A Large-Scale Corpus for Conversation Disentanglement
    Kummerfeld, Jonathan K.
    Athreya, Vignesh
    Patel, Siva Sankalp
    Gouravajhala, Sai R.
    Gunasekara, Chulaka
    Polymenakos, Lazaros
    Peper, Joseph J.
    Ganhotra, Jatin
    Lasecki, Walter S.
    57TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2019), 2019, : 3846 - 3856
  • [39] A Fully Semantic Approach to Large Scale Text Categorization
    Dessi, Nicoletta
    Dessi, Stefania
    Pes, Barbara
    INFORMATION SCIENCES AND SYSTEMS 2013, 2013, 264 : 149 - 157
  • [40] Research on the A* Algorithm for Automatic Guided Vehicles in Large-Scale Maps
    Chen, Yuandong
    Pang, Jinhao
    Gou, Yuchen
    Lin, Zhiming
    Zheng, Shaofeng
    Chen, Dewang
    Applied Sciences (Switzerland), 2024, 14 (22):