Categorizing paper documents - A generic system for domain and language independent text categorization

Cited by: 5
Authors
Bayer, T [1]
Kressel, U [1]
Mogg-Schneider, H [1]
Renz, I [1]
Affiliation
[1] Daimler Benz AG, Res & Technol, D-89081 Ulm, Germany
DOI
10.1006/cviu.1998.0687
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Text categorization assigns predefined categories to either electronically available texts or those resulting from document image analysis. A generic system for text categorization is presented which is based on statistical analysis of representative text corpora. Significant features are automatically derived from training texts by selecting substrings from actual word forms and applying statistical information and general linguistic knowledge. The dimension of the feature vectors is then reduced by linear transformation, keeping the essential information. The classification is a minimum least-squares approach based on polynomials. The described system can be efficiently adapted to new domains or different languages. In application, the adapted text categorizers are reliable, fast, and completely automatic. Two example categorization tasks achieve recognition scores of approximately 80% and are very robust against recognition or typing errors. (C) 1999 Academic Press.
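The pipeline outlined in the abstract (substring features, a dimension-reducing linear transformation, and a polynomial least-squares classifier) can be illustrated with a minimal sketch. It is not the authors' implementation, and every concrete choice here is an assumption: plain character trigrams stand in for the paper's statistically and linguistically selected substrings, a principal-component projection stands in for the unspecified linear transformation, and the polynomial degree is fixed at 2. All function names and the toy texts are hypothetical.

```python
import numpy as np
from itertools import combinations_with_replacement


def ngram_counts(text, n=3):
    """Count character n-grams of a lowercased, space-padded text."""
    padded = f" {text.lower()} "
    counts = {}
    for i in range(len(padded) - n + 1):
        gram = padded[i:i + n]
        counts[gram] = counts.get(gram, 0) + 1
    return counts


def build_vocab(texts, n=3):
    """Map every n-gram seen in the training texts to a column index."""
    grams = sorted({g for t in texts for g in ngram_counts(t, n)})
    return {g: i for i, g in enumerate(grams)}


def vectorize(texts, vocab, n=3):
    """Turn texts into count vectors; unseen n-grams are ignored."""
    X = np.zeros((len(texts), len(vocab)))
    for row, text in enumerate(texts):
        for gram, count in ngram_counts(text, n).items():
            if gram in vocab:
                X[row, vocab[gram]] = count
    return X


def fit_projection(X, k):
    """Assumed linear transformation: project onto the top-k principal axes."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k].T


def poly_expand(Z):
    """Constant, linear, and all degree-2 monomials of the reduced features."""
    k = Z.shape[1]
    cols = [np.ones(len(Z))] + [Z[:, i] for i in range(k)]
    cols += [Z[:, i] * Z[:, j]
             for i, j in combinations_with_replacement(range(k), 2)]
    return np.column_stack(cols)


def train(texts, labels, k=3, n=3):
    """Fit polynomial coefficients to one-hot targets by least squares."""
    classes = sorted(set(labels))
    vocab = build_vocab(texts, n)
    X = vectorize(texts, vocab, n)
    mean, proj = fit_projection(X, k)
    P = poly_expand((X - mean) @ proj)
    Y = np.array([[1.0 if lab == c else 0.0 for c in classes]
                  for lab in labels])
    W, *_ = np.linalg.lstsq(P, Y, rcond=None)
    return {"classes": classes, "vocab": vocab, "mean": mean,
            "proj": proj, "W": W, "n": n}


def predict(model, texts):
    """Assign each text the class whose polynomial score is largest."""
    X = vectorize(texts, model["vocab"], model["n"])
    P = poly_expand((X - model["mean"]) @ model["proj"])
    scores = P @ model["W"]
    return [model["classes"][i] for i in scores.argmax(axis=1)]


# Hypothetical toy corpus, for illustration only.
texts = ["invoice payment due amount", "payment of invoice overdue",
         "meeting schedule for monday", "schedule the project meeting"]
labels = ["finance", "finance", "planning", "planning"]
model = train(texts, labels)
print(predict(model, texts))
```

Substring features are one plausible reason for the robustness to recognition or typing errors mentioned in the abstract: a single wrong character alters only the few n-grams that overlap it, so most of the feature vector is unchanged.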
Pages: 299-306
Number of pages: 8
Related papers
18 in total
  • [1] Language and task independent text categorization with simple language models
    Peng, FC
    Schuurmans, D
    Wang, SJ
    [J]. HLT-NAACL 2003: HUMAN LANGUAGE TECHNOLOGY CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, PROCEEDINGS OF THE MAIN CONFERENCE, 2003, : 189 - 196
  • [2] GAUGING SIMILARITY WITH N-GRAMS - LANGUAGE-INDEPENDENT CATEGORIZATION OF TEXT
    DAMASHEK, M
    [J]. SCIENCE, 1995, 267 (5199) : 843 - 848
  • [3] A variant of n-gram based language-independent text categorization
    Graovac, Jelena
    [J]. INTELLIGENT DATA ANALYSIS, 2014, 18 (04) : 677 - 695
  • [4] CATEGORIZATION OF UNORGANIZED TEXT CORPORA FOR BETTER DOMAIN-SPECIFIC LANGUAGE MODELING
    Stas, Jan
    Zlacky, Daniel
    Hladek, Daniel
    Juhar, Jozef
    [J]. ADVANCES IN ELECTRICAL AND ELECTRONIC ENGINEERING, 2013, 11 (05) : 398 - 403
  • [5] A language-independent authorship attribution approach for author identification of text documents
    Ramezani, Reza
    [J]. EXPERT SYSTEMS WITH APPLICATIONS, 2021, 180
  • [6] Language-Independent Text-Line Extraction Algorithm for Handwritten Documents
    Ryu, Jewoong
    Koo, Hyung Il
    Cho, Nam Ik
    [J]. IEEE SIGNAL PROCESSING LETTERS, 2014, 21 (09) : 1115 - 1119
  • [7] Text Borrowings Detection System for Natural Language Structured Digital Documents
    Kuropiatnyk, Olen
    Shynkarenko, Viktor
    [J]. COMPUTATIONAL LINGUISTICS AND INTELLIGENT SYSTEMS (COLINS 2020), VOL I: MAIN CONFERENCE, 2020, 2604
  • [8] Text Independent Language Recognition System for Indic Languages With new Features
    Sadanandam, M.
    Nagesh, A.
    Prasad, V. Kamakshi
    Janaki, V.
    [J]. 2012 IEEE INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE AND COMPUTING RESEARCH (ICCIC), 2012, : 139 - 143
  • [9] Text-Independent Automatic Accent Identification System for Kannada Language
    Soorajkumar, R.
    Girish, G. N.
    Ramteke, Pravin B.
    Joshi, Shreyas S.
    Koolagudi, Shashidhar G.
    [J]. PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON DATA ENGINEERING AND COMMUNICATION TECHNOLOGY, ICDECT 2016, VOL 2, 2017, 469 : 411 - 418
  • [10] Text Independent Language Recognition System Using DHMM with new Features
    Sadanandam, M.
    Prasad, V. Kamakshi
    Janaki, V.
    Nagesh, A.
    [J]. PROCEEDINGS OF 2012 IEEE 11TH INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING (ICSP) VOLS 1-3, 2012, : 511 - +