The Research on Automatic Construction Techniques of Large-scale Corpus for Chinese Text Categorization

Cited by: 0
Authors
Hu, Yan [1]
Wu, Wei [1]
Miao, Miao [1]
Affiliations
[1] Wuhan Univ Technol, Sch Comp Sci & Technol, Wuhan 430070, Peoples R China
Keywords
Automatic Construction; Large-scale Corpus; Chinese Text Categorization;
DOI
10.1109/IEEC.2009.141
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Large-scale corpora contain abundant linguistic phenomena. They can reflect the general laws of language use and have drawn interest from many countries in both information technology and linguistics, becoming a hot topic in natural language processing. However, Chinese corpora are currently scarce, and corpora for Chinese text categorization are especially rare. Text categorization has become the core and foundation of large-scale data processing applications, and the lag in corpus research has become an obstacle to the development of information technology. Therefore, by analyzing the characteristics of Chinese categorization corpora, treating the Internet as the largest knowledge base currently available, and relying on the search capability of search engines, this paper proposes and implements an algorithm for constructing a large-scale corpus for Chinese text categorization. Experiments show that the corpus constructed by this algorithm performs well with various classifiers and has practical value.
Pages: 640 - 645
Number of pages: 6