Link-based similarity measures for the classification of Web documents

Cited by: 54
Authors
Calado, P
Cristo, M
Gonçalves, MA
de Moura, ES
Ribeiro-Neto, B
Ziviani, N
Affiliations
[1] Univ Fed Minas Gerais, Dept Comp Sci, Belo Horizonte, MG, Brazil
[2] Virginia Tech, Dept Comp Sci, Blacksburg, VA USA
DOI
10.1002/asi.20266
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Traditional text-based document classifiers tend to perform poorly on the Web. Text in Web documents is usually noisy and often does not contain enough information to determine their topic. However, the Web provides a different source that can be useful to document classification: its hyperlink structure. In this work, the authors evaluate how the link structure of the Web can be used to determine a measure of similarity appropriate for document classification. They experiment with five different similarity measures and determine their adequacy for predicting the topic of a Web page. Tests performed on a Web directory show that link information alone allows classifying documents with an average precision of 86%. Further, when combined with a traditional text-based classifier, precision increases to values of up to 90%, representing gains that range from 63 to 132% over the use of text-based classification alone. Because the measures proposed in this article are straightforward to compute, they provide a practical and effective solution for Web classification and related information retrieval tasks. Further, the authors provide an important set of guidelines on how link structure can be used effectively to classify Web documents.
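As a rough illustration of the idea in the abstract, the sketch below scores the similarity of two pages from their shared in-links and assigns a topic by a similarity-weighted nearest-neighbour vote. The abstract does not name the five measures or the classifier, so the co-citation measure, the kNN vote, and all names and data here (`cocitation`, `knn_classify`, `inlinks`, `labels`) are assumptions chosen for illustration, not the authors' implementation.

```python
from collections import defaultdict


def cocitation(parents_a, parents_b):
    """Co-citation similarity: shared in-linking pages over all in-linking pages."""
    union = parents_a | parents_b
    if not union:
        return 0.0
    return len(parents_a & parents_b) / len(union)


def knn_classify(target, inlinks, labels, k=3):
    """Pick the class with the largest summed similarity among the k nearest labelled pages."""
    target_parents = inlinks.get(target, set())
    scored = [
        (cocitation(target_parents, inlinks.get(doc, set())), doc)
        for doc in labels
        if doc != target
    ]
    # Highest similarity first; ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = defaultdict(float)
    for sim, doc in scored[:k]:
        votes[labels[doc]] += sim
    if not votes or max(votes.values()) == 0.0:
        return None  # no link evidence at all for this page
    return max(votes, key=votes.get)


# Hypothetical toy data: each page maps to the set of pages that link to it.
inlinks = {
    "x": {"p1", "p2"},
    "a": {"p1", "p2", "p3"},
    "b": {"p2"},
    "c": {"p9"},
}
labels = {"a": "sports", "b": "sports", "c": "news"}

predicted = knn_classify("x", inlinks, labels, k=2)
print(predicted)
```

A page with no in-links in common with any labelled page gets no prediction in this sketch; that is the case in which a text-based classifier would have to supply the decision, which is one way to read the combination described in the abstract.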
Pages: 208-221
Page count: 14