Temporal contexts: Effective text classification in evolving document collections

Cited by: 8
Authors
Rocha, Leonardo [1]
Mourao, Fernando [2]
Mota, Hilton [3]
Salles, Thiago [2]
Goncalves, Marcos Andre [2]
Meira, Wagner, Jr. [2]
Affiliations
[1] Univ Fed Sao Joao del Rei, Comp Sci Dept, Sao Joao Del Rei, Brazil
[2] Univ Fed Minas Gerais, Comp Sci Dept, Belo Horizonte, MG, Brazil
[3] Univ Fed Minas Gerais, Elect Engn Dept, Belo Horizonte, MG, Brazil
Keywords
Classification; Text mining; Temporal evolution; SIMILARITY;
DOI
10.1016/j.is.2012.11.001
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
The huge and growing amount of information available today makes Automatic Document Classification (ADC) both crucial and very challenging. Furthermore, the dynamics inherent to classification problems, mainly on the Web, make this task even more challenging. Despite this, the actual impact of such temporal evolution on ADC is still poorly understood in the literature. In this context, this work evaluates, characterizes, and exploits temporal evolution to improve ADC techniques. As a first contribution, we propose a pragmatic methodology for evaluating temporal evolution in ADC domains. Through this methodology, we can identify measurable factors associated with the degradation of ADC models over time. Going a step further, based on these analyses, we propose effective and efficient strategies to make current techniques more robust to natural shifts over time. We present a strategy, named temporal context selection, for selecting portions of the training set that minimize those factors. Our second contribution is a general algorithm, called Chronos, for determining such contexts. By instantiating Chronos, we are able to reduce uncertainty and improve the overall classification accuracy. Empirical evaluations of heuristic instantiations of the algorithm, named WindowsChronos and FilterChronos, on two real document collections demonstrate the usefulness of our proposal. Comparing them against state-of-the-art ADC algorithms shows that selecting temporal contexts improves classification accuracy by up to 10%. Finally, we highlight the applicability and generality of our proposal in practice, pointing to this study as a promising research direction. (C) 2012 Elsevier Ltd. All rights reserved.
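The abstract does not say how WindowsChronos or FilterChronos actually choose a temporal context, so the following is only an illustrative sketch of the general idea of selecting a portion of the training set by time. It assumes a fixed-radius window around the creation year of the document being classified. The names `Doc`, `select_temporal_window`, and `radius` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Doc:
    """A labeled training document with a creation year (hypothetical schema)."""
    text: str
    label: str
    year: int


def select_temporal_window(training: List[Doc], test_year: int, radius: int) -> List[Doc]:
    """Keep only training documents created within `radius` years of `test_year`.

    This is an assumed, simplified stand-in for temporal context selection,
    not the Chronos algorithm described in the paper.
    """
    if radius < 0:
        raise ValueError("radius must be non-negative")
    return [d for d in training if abs(d.year - test_year) <= radius]


if __name__ == "__main__":
    training = [
        Doc("neural networks for vision", "AI", 1995),
        Doc("support vector machines", "AI", 2003),
        Doc("deep learning for text", "AI", 2010),
    ]
    # Train only on documents close in time to a test document from 2004.
    context = select_temporal_window(training, test_year=2004, radius=2)
    print([d.text for d in context])
```

A classifier would then be trained on `context` instead of the full training set. The trade-off in this sketch is that a narrower window reduces exposure to outdated term and class distributions but also leaves fewer training examples.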
Pages: 388-409
Number of pages: 22