Graph vs. bag representation models for the topic classification of web documents

Cited by: 11
Authors
Papadakis, George [1]
Giannakopoulos, George [2]
Paliouras, Georgios [2]
Affiliations
[1] Univ Athens, Dept Informat & Telecommun, Athens 15784, Greece
[2] Natl Ctr Sci Res Demokritos, Patriarchou Grigoriou 27, Aghia Paraskevi 15310, Attica, Greece
Keywords
Text classification; N-gram graphs; Web document types
DOI
10.1007/s11280-015-0365-x
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
Text classification constitutes a popular task in Web research with various applications that range from spam filtering to sentiment analysis. In this paper, we argue that its performance depends on the quality of Web documents, which varies significantly. For example, the curated content of news articles involves different challenges than the user-generated content of blog posts and Social Media messages. We experimentally verify our claim, quantifying the main factors that affect the performance of text classification. We also argue that the established bag-of-words representation models are inadequate for handling all document types, as they merely extract frequent, yet distinguishing terms from the textual content of the training set. Thus, they suffer from low robustness in the context of noisy or unseen content, unless they are enriched with contextual, application-specific information. In their place, we propose the use of n-gram graphs, a model that goes beyond the bag-of-words representation, transforming every document into a graph: its nodes correspond to character or word n-grams and the co-occurring ones are connected by weighted edges. Individual document graphs can be combined into class graphs and graph similarities are employed to position and classify documents into the vector space. This approach offers two advantages with respect to bag models: first, classification accuracy increases due to the contextual information that is encapsulated in the edges of the n-gram graphs. Second, it reduces the search space to a limited set of robust, endogenous features that depend on the number of classes, rather than the size of the vocabulary. Our thorough experimental study over three large, real-world corpora confirms the superior performance of n-gram graphs across the main types of Web documents.
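The pipeline described in the abstract (document graphs, class graphs, graph similarities as features) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names, the window size, the use of co-occurrence counts as edge weights, the averaging used to merge graphs, and the simple containment measure used as the graph similarity.

```python
from collections import Counter


def ngram_graph(text, n=3, window=3):
    """Build a character n-gram graph.

    Nodes are character n-grams; an undirected edge links two n-grams that
    start within `window` positions of each other. The edge weight is the
    co-occurrence count (an assumed weighting scheme).
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = Counter()
    for i, g in enumerate(grams):
        for h in grams[i + 1:i + 1 + window]:
            edges[tuple(sorted((g, h)))] += 1
    return edges


def merge_graphs(graphs):
    """Combine document graphs into one class graph by averaging edge
    weights over the documents (a hypothetical merge rule)."""
    total = Counter()
    for g in graphs:
        total.update(g)
    k = len(graphs)
    return {edge: weight / k for edge, weight in total.items()}


def containment_similarity(g1, g2):
    """Share of the smaller graph's edges that also occur in the other
    graph. Edge weights are ignored in this simplified measure."""
    if not g1 or not g2:
        return 0.0
    shared = sum(1 for edge in g1 if edge in g2)
    return shared / min(len(g1), len(g2))


def similarity_features(doc, class_graphs):
    """Map a document to one similarity value per class, so the feature
    vector length equals the number of classes, not the vocabulary size."""
    g = ngram_graph(doc)
    return [containment_similarity(g, class_graphs[c]) for c in sorted(class_graphs)]


if __name__ == "__main__":
    # Toy training texts, invented for this example.
    class_graphs = {
        "blog": merge_graphs([ngram_graph("lol my cat")]),
        "news": merge_graphs([ngram_graph("stock markets fall")]),
    }
    print(similarity_features("markets fall again", class_graphs))
```

The resulting feature vector has one entry per class, which reflects the abstract's point that the feature space depends on the number of classes rather than on the vocabulary. Such a vector could then be passed to any standard classifier.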
Pages: 887-920
Number of pages: 34