The Research on Automatic Construction Techniques of Large-scale Corpus for Chinese Text Categorization

Cited by: 0
Authors
Hu, Yan [1]
Wu, Wei [1]
Miao, Miao [1]
Affiliations
[1] Wuhan Univ Technol, Sch Comp Sci & Technol, Wuhan 430070, Peoples R China
Keywords
Automatic Construction; Large-scale Corpus; Chinese Text Categorization
DOI
10.1109/IEEC.2009.141
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
A large-scale corpus contains abundant language phenomena. It can reflect the universal laws of language use, has drawn interest from many countries in both information technology and linguistics, and has become a hot topic in natural language processing. Chinese corpora, however, are currently scarce, and Chinese corpora for text categorization are especially rare. Text categorization has meanwhile become the core and foundation of large-scale data processing applications, so lagging corpus research has become an obstacle to the development of information technology. Therefore, by analyzing the characteristics of Chinese categorization corpora, drawing on the Internet, currently the largest knowledge base, and relying on the search capability of search engines, this paper proposes and implements an algorithm for automatically constructing a large-scale corpus for Chinese text categorization. Experiments show that the corpus constructed by this algorithm performs well with various classifiers and has some practical value.
Pages: 640-645
Number of pages: 6