Categorizing paper documents - A generic system for domain and language independent text categorization

Cited by: 5

Authors:

Bayer, T [1]

Kressel, U [1]

Mogg-Schneider, H [1]

Renz, I [1]

Affiliation:

[1] Daimler Benz AG, Res & Technol, D-89081 Ulm, Germany

Source:

COMPUTER VISION AND IMAGE UNDERSTANDING | 1998, Vol. 70, No. 3

DOI:

10.1006/cviu.1998.0687

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Text categorization assigns predefined categories to either electronically available texts or those resulting from document image analysis. A generic system for text categorization is presented which is based on statistical analysis of representative text corpora. Significant features are automatically derived from training texts by selecting substrings from actual word forms and applying statistical information and general linguistic knowledge. The dimension of the feature vectors is then reduced by linear transformation, keeping the essential information. The classification is a minimum least-squares approach based on polynomials. The described system can be efficiently adapted to new domains or different languages. In application, the adapted text categorizers are reliable, fast, and completely automatic. Two example categorization tasks achieve recognition scores of approximately 80% and are very robust against recognition or typing errors. © 1999 Academic Press.
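
The pipeline outlined in the abstract (substring features, linear dimension reduction, polynomial least-squares classification) can be illustrated with a minimal sketch. This is not the authors' implementation: the character n-gram features, the SVD-based projection, the degree-2 polynomial expansion, and every function name below (`ngram_counts`, `fit_reduction`, `poly2`, `train`, `predict`) are assumptions chosen only to make the idea concrete. The record does not specify the paper's actual feature selection or transformation.

```python
import numpy as np
from itertools import combinations_with_replacement


def ngram_counts(texts, n=3, vocab=None):
    """Count character n-grams (an assumed stand-in for the paper's substring features)."""
    grams = [[t[i:i + n] for i in range(len(t) - n + 1)] for t in texts]
    if vocab is None:
        vocab = {g: j for j, g in enumerate(sorted({g for gs in grams for g in gs}))}
    X = np.zeros((len(texts), len(vocab)))
    for row, gs in enumerate(grams):
        for g in gs:
            if g in vocab:  # n-grams unseen in training are ignored
                X[row, vocab[g]] += 1.0
    return X, vocab


def fit_reduction(X, k):
    """Linear dimension reduction: project onto the top-k right singular vectors."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k].T


def poly2(Z):
    """Expand reduced features into all monomials up to degree 2."""
    d = Z.shape[1]
    cols = [np.ones(len(Z))] + [Z[:, j] for j in range(d)]
    cols += [Z[:, a] * Z[:, b] for a, b in combinations_with_replacement(range(d), 2)]
    return np.column_stack(cols)


def train(texts, labels, k=2, n=3):
    """Fit polynomial coefficients by least squares against one-hot class targets."""
    classes = sorted(set(labels))
    X, vocab = ngram_counts(texts, n)
    mean, W = fit_reduction(X, k)
    P = poly2((X - mean) @ W)
    Y = np.array([[1.0 if lab == c else 0.0 for c in classes] for lab in labels])
    A, *_ = np.linalg.lstsq(P, Y, rcond=None)
    return {"vocab": vocab, "mean": mean, "W": W, "A": A, "classes": classes, "n": n}


def predict(model, texts):
    """Assign each text the class whose polynomial score is largest."""
    X, _ = ngram_counts(texts, model["n"], model["vocab"])
    P = poly2((X - model["mean"]) @ model["W"])
    scores = P @ model["A"]
    return [model["classes"][i] for i in scores.argmax(axis=1)]
```

A call such as `predict(train(texts, labels), new_texts)` returns one label per input text. Because the features are substrings rather than whole words, a single mistyped or misrecognized character changes only a few n-gram counts, which is one plausible reading of the robustness the abstract reports.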

Pages: 299-306

Page count: 8

