Graph vs. bag representation models for the topic classification of web documents

Cited by: 11
Authors
Papadakis, George [1]
Giannakopoulos, George [2]
Paliouras, Georgios [2]
Affiliations
[1] Univ Athens, Dept Informat & Telecommun, Athens 15784, Greece
[2] Natl Ctr Sci Res Demokritos, Patriarchou Grigoriou 27, Aghia Paraskevi 15310, Attica, Greece
Keywords
Text classification; N-gram graphs; Web document types
DOI
10.1007/s11280-015-0365-x
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Text classification constitutes a popular task in Web research with various applications that range from spam filtering to sentiment analysis. In this paper, we argue that its performance depends on the quality of Web documents, which varies significantly. For example, the curated content of news articles poses different challenges than the user-generated content of blog posts and Social Media messages. We experimentally verify our claim, quantifying the main factors that affect the performance of text classification. We also argue that the established bag-of-words representation models are inadequate for handling all document types, as they merely extract frequent, yet distinguishing terms from the textual content of the training set. Thus, they suffer from low robustness in the context of noisy or unseen content, unless they are enriched with contextual, application-specific information. In their place, we propose the use of n-gram graphs, a model that goes beyond the bag-of-words representation, transforming every document into a graph: its nodes correspond to character or word n-grams and the co-occurring ones are connected by weighted edges. Individual document graphs can be combined into class graphs, and graph similarities are employed to position documents in the vector space and classify them. This approach offers two advantages over bag models. First, classification accuracy increases due to the contextual information that is encapsulated in the edges of the n-gram graphs. Second, the search space is reduced to a limited set of robust, endogenous features that depend on the number of classes, rather than the size of the vocabulary. Our thorough experimental study over three large, real-world corpora confirms the superior performance of n-gram graphs across the main types of Web documents.
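The representation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the window size, the averaging used to merge document graphs, and the containment-style similarity are all assumptions chosen for brevity. It only shows the general idea that a document becomes a graph of co-occurring character n-grams and that its feature vector has one entry per class rather than one per vocabulary term.

```python
from collections import Counter


def ngram_graph(text, n=3, window=2):
    """Build a character n-gram graph.

    Returns a Counter mapping undirected edges (pairs of n-grams) to the
    number of times the two n-grams co-occur within `window` positions.
    Both `n` and `window` are illustrative defaults, not values from the paper.
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = Counter()
    for i, gram in enumerate(grams):
        for j in range(i + 1, min(i + 1 + window, len(grams))):
            edges[tuple(sorted((gram, grams[j])))] += 1
    return edges


def merge_graphs(graphs):
    """Combine document graphs into a class graph by averaging edge weights
    (an assumed merging rule)."""
    total = Counter()
    for graph in graphs:
        total.update(graph)
    return {edge: weight / len(graphs) for edge, weight in total.items()}


def containment_similarity(doc_graph, class_graph):
    """Fraction of the document's edges that also appear in the class graph
    (an assumed similarity measure)."""
    if not doc_graph:
        return 0.0
    shared = sum(1 for edge in doc_graph if edge in class_graph)
    return shared / len(doc_graph)


def featurize(text, class_graphs):
    """Map a document to one similarity value per class, in sorted label order."""
    graph = ngram_graph(text)
    return [containment_similarity(graph, class_graphs[label])
            for label in sorted(class_graphs)]


# Hypothetical toy training data: one document per class.
class_graphs = {
    "sports": merge_graphs([ngram_graph("the match ended in a draw")]),
    "tech": merge_graphs([ngram_graph("the new phone has a fast chip")]),
}
features = featurize("the match was a draw", class_graphs)
print(features)  # two values, one per class
```

The resulting vector can be passed to any standard classifier. Its length equals the number of classes, which reflects the reduced search space the abstract mentions.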
Pages: 887-920
Number of pages: 34