The Research on Automatic Construction Techniques of Large-scale Corpus for Chinese Text Categorization

Authors
Hu, Yan [1]
Wu, Wei [1]
Miao, Miao [1]
Affiliations
[1] Wuhan Univ Technol, Sch Comp Sci & Technol, Wuhan 430070, Peoples R China
Keywords
Automatic Construction; Large-scale Corpus; Chinese Text Categorization
DOI
10.1109/IEEC.2009.141
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Large-scale corpora contain abundant linguistic phenomena. They reflect general patterns of language use and have attracted interest in many countries from both the information technology and linguistics communities, becoming a hot topic in natural language processing. Chinese corpora, however, are currently scarce, and corpora for Chinese text categorization are especially rare. Text categorization has become the core and foundation of large-scale data processing applications, so the lag in corpus research now obstructs the development of information technology. This paper therefore analyzes the characteristics of Chinese categorization corpora and, drawing on the Internet as the largest knowledge base currently available and on the retrieval capability of search engines, proposes and implements an algorithm for automatically constructing a large-scale corpus for Chinese text categorization. Experiments show that the corpus constructed by this algorithm performs well with various classifiers and has practical value.
Pages: 640-645
Page count: 6