Link-based similarity measures for the classification of Web documents

Cited by: 54
Authors
Calado, P
Cristo, M
Gonçalves, MA
de Moura, ES
Ribeiro-Neto, B
Ziviani, N
Affiliations
[1] Univ Fed Minas Gerais, Dept Comp Sci, Belo Horizonte, MG, Brazil
[2] Virginia Tech, Dept Comp Sci, Blacksburg, VA USA
DOI
10.1002/asi.20266
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Traditional text-based document classifiers tend to perform poorly on the Web. Text in Web documents is usually noisy and often does not contain enough information to determine their topic. However, the Web provides a different source that can be useful to document classification: its hyperlink structure. In this work, the authors evaluate how the link structure of the Web can be used to determine a measure of similarity appropriate for document classification. They experiment with five different similarity measures and determine their adequacy for predicting the topic of a Web page. Tests performed on a Web directory show that link information alone allows classifying documents with an average precision of 86%. Further, when combined with a traditional text-based classifier, precision increases to values of up to 90%, representing gains that range from 63 to 132% over the use of text-based classification alone. Because the measures proposed in this article are straightforward to compute, they provide a practical and effective solution for Web classification and related information retrieval tasks. Further, the authors provide an important set of guidelines on how link structure can be used effectively to classify Web documents.
Pages: 208-221
Page count: 14
Related papers
50 in total
  • [41] Link-based web spam detection using weight properties
    Kwang Leng Goh
    Ravi Kumar Patchmuthu
    Ashutosh Kumar Singh
    Journal of Intelligent Information Systems, 2014, 43 : 129 - 145
  • [42] A Link-Based Similarity for Improving Community Detection Based on Label Propagation Algorithm
    BERAHMAND Kamal
    BOUYER Asgarali
    Journal of Systems Science & Complexity, 2019, 32 (03) : 737 - 758
  • [43] Using Link-Based Content Analysis to Measure Document Similarity Effectively
    Li, Pei
    Li, Zhixu
    Liu, Hongyan
    He, Jun
    Du, Xiaoyong
    ADVANCES IN DATA AND WEB MANAGEMENT, PROCEEDINGS, 2009, 5446 : 455 - 467
  • [44] Web link-based relationships among top European universities
    Figuerola, Carlos G.
    Alonso Berrocal, Jose L.
    JOURNAL OF INFORMATION SCIENCE, 2013, 39 (05) : 629 - 642
  • [45] Link-based Markov model prefetching algorithm on Web cache
    Wang, Z
    Guo, CC
    Yan, PL
    DCABES 2004, PROCEEDINGS, VOLS 1 AND 2, 2004 : 530 - 534
  • [46] Use link-based clustering to improve Web search results
    Wang, YT
    Kitsuregawa, M
    SECOND INTERNATIONAL CONFERENCE ON WEB INFORMATION SYSTEMS ENGINEERING, VOL I, PROCEEDINGS, 2002 : 115 - 124
  • [47] A Bayesian Graph Embedding Model for Link-Based Classification Problems
    Zhang, Yichao
    Zhuang, Huangxin
    Liu, Tiantian
    Chen, Bowei
    Cao, Zhiwei
    Fu, Yun
    Fan, Zhijie
    Chen, Guanrong
    IEEE TRANSACTIONS ON NETWORK SCIENCE AND ENGINEERING, 2022, 9 (02) : 716 - 727
  • [48] Revisiting Link-Based Cluster Ensembles for Microarray Data Classification
    Iam-On, Natthakan
    Boongoen, Tossapon
    2013 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN, AND CYBERNETICS (SMC 2013), 2013 : 4543 - 4548
  • [49] Rank-Stability and Rank-Similarity of Link-Based Web Ranking Algorithms in Authority-Connected Graphs
    R. Lempel
    S. Moran
    Information Retrieval, 2005, 8 : 245 - 264
  • [50] Learning contextual dependency network models for link-based classification
    Tian, Yonghong
    Yang, Qiang
    Huang, Tiejun
    Ling, Charles X.
    Gao, Wen
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2006, 18 (11) : 1482 - 1496