A New Method of Automatic Text Document Classification

Cited by: 2
Author
Yatsko, V. A. [1]
Affiliation
[1] Katanov Khakass State Univ, Abakan, Russia
Funding
Russian Foundation for Basic Research
Keywords
automatic text classification; methods and algorithms; Zipf distribution; reduction of text dimensionality; threshold levels; efficiency indices
DOI
10.3103/S0005105521030080
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Subject classification code
0812
Abstract
This paper describes the procedures and specific features of applying a new method of automatic classification based on calculating the deviations of the stop-word distribution from the Zipfian score. To neutralize discrepancies in text lengths, the author describes and applies a text undersampling methodology. The concept of an iterative threshold level is introduced to reduce text dimensionality to several dozen units. To evaluate the method's efficiency, the author has developed indicators of discriminative and similarative power that underlie a generalized efficiency score. Fourteen tests, including a comparison with the cosine similarity measure, demonstrated the high efficiency of the proposed method for authorship attribution of fiction texts and for clustering of political texts.
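The abstract names three building blocks: undersampling to equalize text lengths, deviations of stop-word frequencies from a Zipfian expectation, and cosine similarity as a baseline. The sketch below illustrates those ideas only in general terms and is not the author's procedure. The stop-word list, the truncation-based undersampling, the expected frequency `top / rank`, and all function names are assumptions made for illustration; the abstract specifies none of them.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list for illustration; the abstract does not give the paper's list.
STOP_WORDS = ["the", "of", "and", "to", "a", "in"]


def tokenize(text):
    """Lowercase a text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def undersample(token_lists):
    """Truncate every token list to the length of the shortest one.

    Assumption: simple truncation stands in for the paper's undersampling scheme.
    """
    n = min(len(tokens) for tokens in token_lists)
    return [tokens[:n] for tokens in token_lists]


def zipf_deviations(tokens, stop_words=STOP_WORDS):
    """Return observed minus Zipf-expected frequency for each stop word.

    Assumption: the expected frequency at rank r is the top frequency divided by r,
    which is the classic form of Zipf's law with exponent 1.
    """
    counts = Counter(t for t in tokens if t in stop_words)
    # Rank stop words by descending count; ties are broken alphabetically.
    ranked = sorted(stop_words, key=lambda w: (-counts[w], w))
    top = counts[ranked[0]] if ranked else 0
    return {w: counts[w] - top / rank for rank, w in enumerate(ranked, start=1)}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

Under these assumptions, two texts could be compared by undersampling their token lists, computing `zipf_deviations` for each, and comparing the resulting vectors (ordered by a fixed stop-word order) with `cosine_similarity` or another distance.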
Pages: 122-133
Number of pages: 12
Related papers
50 in total
  • [1] A New Method of Automatic Text Document Classification
    V. A. Yatsko
    [J]. Automatic Documentation and Mathematical Linguistics, 2021, 55 : 122 - 133
  • [2] The Problems and Methods of Automatic Text Document Classification
    V. A. Yatsko
    [J]. Automatic Documentation and Mathematical Linguistics, 2021, 55 : 274 - 285
  • [3] The Problems and Methods of Automatic Text Document Classification
    Yatsko, V. A.
    [J]. AUTOMATIC DOCUMENTATION AND MATHEMATICAL LINGUISTICS, 2021, 55 (06) : 274 - 285
  • [4] An Automatic Text Classification Method Based on Hierarchical Taxonomies, Neural Networks and Document Embedding: The NETHIC Tool
    Lomasto, Luigi
    Di Florio, Rosario
    Ciapetti, Andrea
    Miscione, Giuseppe
    Ruggiero, Giulia
    Toti, Daniele
    [J]. ENTERPRISE INFORMATION SYSTEMS (ICEIS 2019), 2020, 378 : 57 - 77
  • [5] Text Document Classification
    Novovicova, Jana
    [J]. ERCIM NEWS, 2005, (62) : 53 - 54
  • [6] A New Similarity Measure for Document Classification and Text Mining
    Eminagaoglu, Mete
    Goksen, Yilmaz
    [J]. ECONOMIES OF THE BALKAN AND EASTERN EUROPEAN COUNTRIES, 2020 : 353 - 366
  • [7] AUTOMATIC DOCUMENT CLASSIFICATION
    BORKO, H
    BERNICK, M
    [J]. JOURNAL OF THE ACM, 1963, 10 (02) : 151 - &
  • [8] Using the MF/MD method for automatic text classification
    de Mönnink, I
    Brom, N
    Oostdijk, N
    [J]. EXTENDING THE SCOPE OF CORPUS-BASED RESEARCH: NEW APPLICATIONS, NEW CHALLENGES, 2003, (48) : 15 - 25
  • [9] A combined weight method in automatic classification of Chinese text
    Liao, SS
    Jiang, MH
    [J]. PROCEEDINGS OF THE 2005 INTERNATIONAL CONFERENCE ON NEURAL NETWORKS AND BRAIN, VOLS 1-3, 2005 : 625 - 630
  • [10] A NEW FEATURE SELECTION METHOD BASED ON CONCEPT EXTRACTION IN AUTOMATIC CHINESE TEXT CLASSIFICATION
    Liao, Shasha
    Jiang, Minghu
    [J]. NEW MATHEMATICS AND NATURAL COMPUTATION, 2007, 3 (03) : 331 - 347