Graph vs. bag representation models for the topic classification of web documents

Cited by: 11
Authors
Papadakis, George [1]
Giannakopoulos, George [2]
Paliouras, Georgios [2]
Affiliations
[1] Univ Athens, Dept Informat & Telecommun, Athens 15784, Greece
[2] Natl Ctr Sci Res Demokritos, Patriarchou Grigoriou 27, Aghia Paraskevi 15310, Attica, Greece
Keywords
Text classification; N-gram graphs; Web document types
DOI
10.1007/s11280-015-0365-x
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Text classification constitutes a popular task in Web research with various applications that range from spam filtering to sentiment analysis. In this paper, we argue that its performance depends on the quality of Web documents, which varies significantly. For example, the curated content of news articles involves different challenges than the user-generated content of blog posts and Social Media messages. We experimentally verify our claim, quantifying the main factors that affect the performance of text classification. We also argue that the established bag-of-words representation models are inadequate for handling all document types, as they merely extract frequent, yet distinguishing terms from the textual content of the training set. Thus, they suffer from low robustness in the context of noisy or unseen content, unless they are enriched with contextual, application-specific information. In their place, we propose the use of n-gram graphs, a model that goes beyond the bag-of-words representation, transforming every document into a graph: its nodes correspond to character or word n-grams and the co-occurring ones are connected by weighted edges. Individual document graphs can be combined into class graphs and graph similarities are employed to position and classify documents into the vector space. This approach offers two advantages with respect to bag models: first, classification accuracy increases due to the contextual information that is encapsulated in the edges of the n-gram graphs. Second, it reduces the search space to a limited set of robust, endogenous features that depend on the number of classes, rather than the size of the vocabulary. Our thorough experimental study over three large, real-world corpora confirms the superior performance of n-gram graphs across the main types of Web documents.
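The n-gram graph model described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the default n-gram size and window, the averaging used to merge document graphs into a class graph, and the two similarity formulas are simplifying assumptions chosen for illustration. The final arg-max step is also only a stand-in for the classifier that the abstract says operates on the similarity-based vector space.

```python
from collections import Counter


def ngram_graph(text, n=3, window=3):
    """Build a character n-gram graph as a Counter of undirected edges.

    Nodes are the character n-grams of `text`; two n-grams are linked when
    they start within `window` positions of each other, and the edge weight
    counts how often that happens. `n` and `window` are illustrative defaults.
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = Counter()
    for i, gram in enumerate(grams):
        for neighbour in grams[i + 1:i + 1 + window]:
            edges[tuple(sorted((gram, neighbour)))] += 1
    return edges


def merge_graphs(graphs):
    """Combine document graphs into a class graph (assumed: mean edge weight)."""
    total = Counter()
    for graph in graphs:
        total.update(graph)
    return {edge: weight / len(graphs) for edge, weight in total.items()}


def containment_similarity(a, b):
    """Share of edges of the smaller graph that also occur in the other graph."""
    smaller = min(len(a), len(b))
    if smaller == 0:
        return 0.0
    return sum(1 for edge in a if edge in b) / smaller


def value_similarity(a, b):
    """Like containment, but each shared edge counts by its weight ratio."""
    larger = max(len(a), len(b))
    if larger == 0:
        return 0.0
    ratios = (min(w, b[e]) / max(w, b[e]) for e, w in a.items() if e in b)
    return sum(ratios) / larger


def class_features(doc_graph, class_graphs):
    """Feature vector whose length depends on the number of classes only."""
    features = []
    for label in sorted(class_graphs):
        features.append(containment_similarity(doc_graph, class_graphs[label]))
        features.append(value_similarity(doc_graph, class_graphs[label]))
    return features


def classify(text, class_graphs):
    """Stand-in classifier: pick the class graph with the highest value similarity."""
    doc_graph = ngram_graph(text)
    return max(class_graphs, key=lambda c: value_similarity(doc_graph, class_graphs[c]))


if __name__ == "__main__":
    # Toy training data, invented for this sketch.
    classes = {
        "sports": merge_graphs([ngram_graph("the team won the match")]),
        "tech": merge_graphs([ngram_graph("the new phone has a fast chip")]),
    }
    print(class_features(ngram_graph("the team won"), classes))
    print(classify("the team won the match", classes))
```

In this sketch each document is described by two similarity values per class, so the feature vector grows with the number of classes rather than with the vocabulary, which is the second advantage the abstract claims over bag models.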
Pages: 887-920
Page count: 34
Related papers
45 in total
  • [1] Graph vs. bag representation models for the topic classification of web documents
    George Papadakis
    George Giannakopoulos
    Georgios Paliouras
    World Wide Web, 2016, 19 : 887 - 920
  • [2] Simple classification into large topic ontology of Web documents
    Grobelnik, M
    Mladenic, D
    ITI 2005: Proceedings of the 27th International Conference on Information Technology Interfaces, 2005, : 201 - 206
  • [3] Classification of web documents using graph matching
    Schenker, A
    Last, M
    Bunke, H
    Kandel, A
    INTERNATIONAL JOURNAL OF PATTERN RECOGNITION AND ARTIFICIAL INTELLIGENCE, 2004, 18 (03) : 475 - 496
  • [4] Classification of web documents using a graph model
    Schenker, A
    Last, M
    Bunke, H
    Kandel, A
    SEVENTH INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION, VOLS I AND II, PROCEEDINGS, 2003, : 240 - 244
  • [5] Topic Models Vs. Unstructured Data
    Anthes, Gary
    COMMUNICATIONS OF THE ACM, 2010, 53 (12) : 16 - 18
  • [6] Automatic classification of web search results: Product review vs. non-review documents
    Thet, Tun Thura
    Na, Jin-Cheon
    Khoo, Christopher S. G.
    ASIAN DIGITAL LIBRARIES: LOOKING BACK 10 YEARS AND FORGING NEW FRONTIERS, PROCEEDINGS, 2007, 4822 : 65 - 74
  • [7] Bag-of-Words vs. Graph vs. Sequence in Text Classification: Questioning the Necessity of Text-Graphs and the Surprising Strength of a Wide MLP
    Galke, Lukas
    Scherp, Ansgar
    PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1: (LONG PAPERS), 2022, : 4038 - 4051
  • [8] Classification of Durian Characteristics for Semantic Representation from Web Documents
    Abu Bakar, Zainab
    Ismail, Khairul Nurmazianna
    2012 IEEE SYMPOSIUM ON E-LEARNING, E-MANAGEMENT AND E-SERVICES (IS3E 2012), 2012, : 111 - 115
  • [9] Deep Learning vs. Bag of Features in Machine Learning for Image Classification
    Loussaief, Sehla
    Abdelkrim, Afef
    2018 INTERNATIONAL CONFERENCE ON ADVANCED SYSTEMS AND ELECTRICAL TECHNOLOGIES (IC_ASET), 2017, : 6 - 10
  • [10] Complexity vs. Performance in Granular Embedding Spaces for Graph Classification
    Baldini, Luca
    Martino, Alessio
    Rizzi, Antonello
    PROCEEDINGS OF THE 12TH INTERNATIONAL JOINT CONFERENCE ON COMPUTATIONAL INTELLIGENCE (IJCCI), 2020, : 338 - 349