Link-based similarity measures for the classification of Web documents

Cited by: 54
Authors
Calado, P
Cristo, M
Gonçalves, MA
de Moura, ES
Ribeiro-Neto, B
Ziviani, N
Affiliations
[1] Univ Fed Minas Gerais, Dept Comp Sci, Belo Horizonte, MG, Brazil
[2] Virginia Tech, Dept Comp Sci, Blacksburg, VA USA
DOI
10.1002/asi.20266
Chinese Library Classification (CLC) code
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Traditional text-based document classifiers tend to perform poorly on the Web. Text in Web documents is usually noisy and often does not contain enough information to determine their topic. However, the Web provides a different source that can be useful to document classification: its hyperlink structure. In this work, the authors evaluate how the link structure of the Web can be used to determine a measure of similarity appropriate for document classification. They experiment with five different similarity measures and determine their adequacy for predicting the topic of a Web page. Tests performed on a Web directory show that link information alone allows classifying documents with an average precision of 86%. Further, when combined with a traditional text-based classifier, precision increases to values of up to 90%, representing gains that range from 63 to 132% over the use of text-based classification alone. Because the measures proposed in this article are straightforward to compute, they provide a practical and effective solution for Web classification and related information retrieval tasks. Further, the authors provide an important set of guidelines on how link structure can be used effectively to classify Web documents.
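As a rough illustration of how a link-based similarity measure can drive classification, the sketch below scores two pages by the in-linking pages they share and assigns a new page the class of its most similar labeled neighbors. The abstract does not name the five measures evaluated, so the co-citation-style measure, the Jaccard normalization, the nearest-neighbor voting scheme, and all names and data here (`cocitation`, `classify`, `inlinks`, `labels`) are assumptions chosen for illustration, not the authors' method.

```python
from collections import Counter


def cocitation(parents_a, parents_b):
    """Share of in-linking pages that two documents have in common.

    Assumption: a Jaccard-normalized co-citation score, used here only
    as a familiar example of a link-based similarity measure.
    """
    a, b = set(parents_a), set(parents_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify(target, inlinks, labels, k=3):
    """Similarity-weighted vote among the k labeled pages closest to `target`.

    Returns None when no labeled page shares any in-link with the target.
    """
    scored = sorted(
        (
            (cocitation(inlinks.get(target, ()), inlinks.get(doc, ())), doc)
            for doc in labels
            if doc != target
        ),
        reverse=True,
    )
    votes = Counter()
    for sim, doc in scored[:k]:
        if sim > 0:
            votes[labels[doc]] += sim
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical toy data: each page maps to the set of pages linking to it.
inlinks = {
    "p1": {"hubA", "hubB"},
    "p2": {"hubA", "hubB", "hubC"},
    "p3": {"hubX"},
    "new": {"hubA", "hubC"},
}
labels = {"p1": "sports", "p2": "sports", "p3": "finance"}

predicted = classify("new", inlinks, labels)
print(predicted)
```

A page with no in-links in common with any labeled page gets no prediction, which is one reason a link-based score might be combined with a text-based classifier, as the abstract describes.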
Pages: 208-221
Page count: 14