Link-based similarity measures for the classification of Web documents

Cited by: 54
Authors
Calado, P
Cristo, M
Gonçalves, MA
de Moura, ES
Ribeiro-Neto, B
Ziviani, N
Affiliations
[1] Univ Fed Minas Gerais, Dept Comp Sci, Belo Horizonte, MG, Brazil
[2] Virginia Tech, Dept Comp Sci, Blacksburg, VA USA
DOI
10.1002/asi.20266
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Traditional text-based document classifiers tend to perform poorly on the Web. Text in Web documents is usually noisy and often does not contain enough information to determine their topic. However, the Web provides a different source that can be useful to document classification: its hyperlink structure. In this work, the authors evaluate how the link structure of the Web can be used to determine a measure of similarity appropriate for document classification. They experiment with five different similarity measures and determine their adequacy for predicting the topic of a Web page. Tests performed on a Web directory show that link information alone allows classifying documents with an average precision of 86%. Further, when combined with a traditional text-based classifier, precision increases to values of up to 90%, representing gains that range from 63 to 132% over the use of text-based classification alone. Because the measures proposed in this article are straightforward to compute, they provide a practical and effective solution for Web classification and related information retrieval tasks. Further, the authors provide an important set of guidelines on how link structure can be used effectively to classify Web documents.
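The abstract does not name the five similarity measures, so the following is only a minimal sketch of the general idea, under stated assumptions: co-citation (two pages are similar when the same pages link to both) stands in as one classic link-based measure, and a k-nearest-neighbour vote stands in as the classifier. The function names, the Jaccard-style normalization, and the toy link data are all hypothetical and are not taken from the article.

```python
from collections import Counter


def cocitation(a, b, inlinks):
    """Jaccard-style co-citation: share of pages linking to a or b that link to both."""
    pa, pb = inlinks.get(a, set()), inlinks.get(b, set())
    union = pa | pb
    return len(pa & pb) / len(union) if union else 0.0


def knn_classify(page, labeled, inlinks, k=3):
    """Majority vote among the k labeled pages most similar to `page`.

    Pages with zero similarity do not vote; returns None if nobody votes.
    """
    scored = sorted(
        ((cocitation(page, other, inlinks), label) for other, label in labeled.items()),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for score, label in scored[:k] if score > 0)
    return votes.most_common(1)[0][0] if votes else None


# Hypothetical toy data: page -> set of pages that link to it.
inlinks = {
    "new": {"p1", "p2", "p3"},
    "a": {"p1", "p2"},
    "b": {"p2", "p3"},
    "c": {"p9"},
}
labeled = {"a": "sports", "b": "sports", "c": "arts"}

print(knn_classify("new", labeled, inlinks))
```

In this toy graph the unlabeled page shares in-links only with the two pages labeled "sports", so those are the only neighbours that vote. A combination with a text-based classifier, as the abstract describes, would need an additional scoring step that is not sketched here.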
Pages: 208-221
Page count: 14
Related papers
50 in total
  • [31] Evaluating and Extending Latent Methods for Link-Based Classification
    McDowell, Luke K.
    Fleming, Aaron
    Markel, Zane
    FORMALISMS FOR REUSE AND SYSTEMS INTEGRATION, 2015, 346 : 227 - 256
  • [32] Link-Based Text Classification Using Bayesian Networks
    de Campos, Luis M.
    Fernandez-Luna, Juan M.
    Huete, Juan F.
    Masegosa, Andres R.
    Romero, Alfonso E.
    FOCUSED RETRIEVAL AND EVALUATION, 2010, 6203 : 397 - 406
  • [33] Link-based multi-verse optimizer for text documents clustering
    Abasi, Ammar Kamal
    Khader, Ahamad Tajudin
    Al-Betar, Mohammed Azmi
    Naim, Syibrah
    Makhadmeh, Sharif Naser
    Alyasseri, Zaid Abdi Alkareem
    APPLIED SOFT COMPUTING, 2020, 87
  • [34] Link-based ranking of the web with source-centric collaboration
    Caverlee, James
    Liu, Ling
    Rouse, William B.
    2006 INTERNATIONAL CONFERENCE ON COLLABORATIVE COMPUTING: NETWORKING, APPLICATIONS AND WORKSHARING, 2006, : 61 - +
  • [35] A Link-Based Similarity for Improving Community Detection Based on Label Propagation Algorithm
    Kamal Berahmand
    Asgarali Bouyer
    Journal of Systems Science and Complexity, 2019, 32 : 737 - 758
  • [36] Link-based web spam detection using weight properties
    Goh, Kwang Leng
    Patchmuthu, Ravi Kumar
    Singh, Ashutosh Kumar
    JOURNAL OF INTELLIGENT INFORMATION SYSTEMS, 2014, 43 (01) : 129 - 145
  • [37] An exploration of link-based knowledge map in academic web space
    Bo Yang
    Ying Sun
    Scientometrics, 2013, 96 : 239 - 253
  • [38] Efficient Algorithm for Computing Link-based Similarity in Real World Networks
    Cai, Yuanzhe
    Cong, Gao
    Jia, Xu
    Liu, Hongyan
    He, Jun
    Lu, Jiaheng
    Du, Xiaoyong
    2009 9TH IEEE INTERNATIONAL CONFERENCE ON DATA MINING, 2009, : 734 - 739
  • [39] A Link-Based Similarity for Improving Community Detection Based on Label Propagation Algorithm
    Berahmand, Kamal
    Bouyer, Asgarali
    JOURNAL OF SYSTEMS SCIENCE & COMPLEXITY, 2019, 32 (03) : 737 - 758
  • [40] An exploration of link-based knowledge map in academic web space
    Yang, Bo
    Sun, Ying
    SCIENTOMETRICS, 2013, 96 (01) : 239 - 253