A New Method of Automatic Text Document Classification

Cited by: 2
Author
Yatsko, V. A. [1]
Affiliation
[1] Katanov Khakass State Univ, Abakan, Russia
Funding
Russian Foundation for Basic Research
Keywords
automatic text classification; methods and algorithms; Zipf distribution; reduction of text dimensionality; threshold levels; efficiency indices;
DOI
10.3103/S0005105521030080
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
This paper describes the procedures and specific features of applying a new method of automatic classification based on calculating the deviations of the stop-word distribution from the Zipfian score. To neutralize discrepancies in text lengths, the author describes and applies a text undersampling methodology. The concept of an iterative threshold level is introduced to reduce text dimensionality to several dozen units. To evaluate the method's efficiency, the author has developed indicators of discriminative and similarative power that underlie a generalized efficiency score. Fourteen tests, including a comparison with the cosine similarity measure, demonstrated the high efficiency of the proposed method for authorship attribution of fiction texts and clustering of political texts.
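The abstract does not give the paper's formulas, so the following is only a minimal sketch of the two ingredients it names: a deviation of ranked stop-word frequencies from a Zipf-style prediction, and the cosine similarity measure used as a baseline. The stop-word list, the function names, and the choice to anchor the Zipf prediction at the top-ranked frequency are all assumptions made for illustration, not the author's definitions.

```python
import math
from collections import Counter

# Hypothetical stop-word list; the list used in the paper is not given here.
STOP_WORDS = ["the", "of", "and", "to", "a", "in"]


def stop_word_counts(text, stop_words=STOP_WORDS):
    """Count each stop word in a lowercased, whitespace-tokenized text."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in stop_words]


def zipf_deviations(freqs):
    """Deviation of each ranked frequency from a Zipf prediction top / rank.

    Assumption: the prediction is anchored at the highest observed frequency.
    The paper's Zipfian score may be defined differently.
    """
    ranked = sorted(freqs, reverse=True)
    if not ranked or ranked[0] == 0:
        return [0.0] * len(ranked)
    top = ranked[0]
    return [f - top / rank for rank, f in enumerate(ranked, start=1)]


def cosine_similarity(u, v):
    """Standard cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

Under these assumptions, two texts could be compared either by applying `cosine_similarity` to their `stop_word_counts` vectors or by comparing their `zipf_deviations` profiles. How the paper actually combines deviations into a classification decision is not stated in the abstract.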
Pages: 122-133
Number of pages: 12