Effectiveness of document representation for classification

Authors
Chen, DY [1]
Li, X [1]
Dong, ZY [1]
Chen, X [1]
Affiliations
[1] Univ Queensland, Sch Informat Technol & Elect Engn, Brisbane, Qld 4072, Australia
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Conventionally, research on document classification focuses on improving the learning capabilities of classifiers. According to our observations, however, the effectiveness of classification is limited by the suitability of the document representation. Intuitively, the more features a representation uses, the more comprehensively it represents documents. However, if a representation contains too many irrelevant features, the classifier suffers not only from the curse of dimensionality but also from overfitting. To address this problem of representation suitability, we present a classifier-independent approach to measuring the effectiveness of document representations. Our approach utilises a labelled document corpus to estimate the distribution of documents in the feature space. Examining documents in this way lets us clearly identify the contributions that different features make to document classification. Experiments are reported to show how effectiveness is evaluated. Our approach can be used as a tool to assist feature selection, dimensionality reduction and document classification.
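The abstract does not specify how the effectiveness measure is computed, so the following is only a minimal sketch of the general idea of scoring features from a labelled corpus without training a classifier. It swaps in a standard Fisher-style ratio (between-class over within-class variance per feature) as a stand-in, which is an assumption and not the authors' measure. The function name `fisher_scores` and the toy term-count vectors are hypothetical.

```python
from collections import defaultdict


def fisher_scores(docs, labels):
    """Score each feature by between-class variance / within-class variance.

    Assumption: this Fisher-style ratio is an illustrative stand-in for a
    classifier-independent measure; it is not the method from the paper.
    docs is a list of equal-length numeric feature vectors, labels the
    matching class labels.
    """
    by_class = defaultdict(list)
    for vec, label in zip(docs, labels):
        by_class[label].append(vec)

    scores = []
    for j in range(len(docs[0])):
        overall_mean = sum(vec[j] for vec in docs) / len(docs)
        between = 0.0
        within = 0.0
        for members in by_class.values():
            class_mean = sum(vec[j] for vec in members) / len(members)
            # Spread of class means around the corpus mean.
            between += len(members) * (class_mean - overall_mean) ** 2
            # Spread of documents around their own class mean.
            within += sum((vec[j] - class_mean) ** 2 for vec in members)
        if within > 0:
            scores.append(between / within)
        elif between > 0:
            scores.append(float("inf"))  # classes perfectly separated
        else:
            scores.append(0.0)  # constant feature carries no information
    return scores


# Hypothetical toy corpus: three term-count features, two classes.
toy_docs = [[3, 1, 0], [4, 0, 0], [0, 1, 0], [1, 0, 0]]
toy_labels = ["sport", "sport", "finance", "finance"]
scores = fisher_scores(toy_docs, toy_labels)
print(scores)  # [9.0, 0.0, 0.0]
```

In this toy corpus only the first feature separates the two classes, so it receives the only non-zero score; a feature selection step could rank or threshold features on such scores.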
Pages: 368-377
Page count: 10