Text categorization based on k-nearest neighbor approach for Web site classification

Cited by: 79
Authors
Kwon, OW [1]
Lee, JH [1]
Affiliations
[1] Pohang Univ Sci & Technol, Dept Comp Sci & Engn, Div Elect & Comp Engn, Pohang 790784, South Korea
Keywords
text categorization; Web site classification; Web page classification; k-nearest neighbor approach; machine learning;
DOI
10.1016/S0306-4573(02)00022-5
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Automatic categorization is a viable method to deal with the scaling problem on the World Wide Web. For Web site classification, this paper proposes using the Web pages linked from the home page, in contrast to the sole use of home pages in previous research. To implement the proposed method, we derive a scheme for Web site classification based on the k-nearest neighbor (k-NN) approach. It consists of three phases: Web page selection (connectivity analysis), Web page classification, and Web site classification. Given a Web site, the Web page selection phase chooses several representative Web pages using connectivity analysis. The k-NN classifier then classifies each of the selected Web pages. Finally, the classifications of the individual Web pages are extended to a classification of the entire Web site. To improve performance, we supplement the k-NN approach with a feature selection method and a term weighting scheme using markup tags, and also reform its document-document similarity measure. In our experiments on a Korean commercial Web directory, the proposed system, using both a home page and its linked pages, improved the micro-averaged breakeven point by 30.02% compared with an ordinary classification that uses the home page only. © 2002 Elsevier Science Ltd. All rights reserved.
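The last two phases described in the abstract (page-level k-NN classification, then extension to a site-level label) can be illustrated with a minimal sketch. This is not the authors' implementation: the whitespace tokenizer, plain term-frequency vectors, cosine similarity, and the rule of summing page scores per category are all assumptions made for illustration. The connectivity-based page selection, the feature selection method, and the markup-tag term weighting are omitted, so the pages of a site are assumed to be already selected. The names `tf_vector`, `knn_page_scores`, `classify_site`, and the toy `TRAINING` data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tf_vector(text):
    """Bag-of-words term-frequency vector (assumed whitespace tokenization)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_page_scores(page_text, training, k=3):
    """Score categories for one page: sum similarities of its k nearest training documents."""
    query = tf_vector(page_text)
    ranked = sorted(
        ((cosine(query, tf_vector(doc)), label) for doc, label in training),
        reverse=True,
    )
    scores = defaultdict(float)
    for sim, label in ranked[:k]:
        scores[label] += sim
    return dict(scores)


def classify_site(pages, training, k=3):
    """Assumed aggregation rule: sum page-level scores and return the top category."""
    totals = defaultdict(float)
    for page in pages:
        for label, score in knn_page_scores(page, training, k).items():
            totals[label] += score
    return max(totals, key=totals.get) if totals else None


# Toy labeled documents (hypothetical data, not from the paper).
TRAINING = [
    ("buy cheap laptops and phones online", "shopping"),
    ("discount shoes store free shipping", "shopping"),
    ("football league scores and match results", "sports"),
    ("basketball team season schedule", "sports"),
]

# Home page plus two linked pages, assumed to be already selected.
site_pages = [
    "welcome to our store",
    "laptops and phones on discount",
    "free shipping on shoes",
]

print(classify_site(site_pages, TRAINING, k=2))
```

In this toy setup the home page alone shares only one term with the training data, while the linked pages contribute most of the evidence, which is the intuition behind using linked pages in addition to the home page.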
Pages: 25-44
Page count: 20
Related papers
50 in total
  • [1] Text Categorization with K-Nearest Neighbor Approach
    Manne, Suneetha
    Kotha, Sita Kumari
    Fatima, S. Sameen
    [J]. PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON INFORMATION SYSTEMS DESIGN AND INTELLIGENT APPLICATIONS 2012 (INDIA 2012), 2012, 132 : 413 - +
  • [2] Binary k-nearest neighbor for text categorization
    Tan, SB
    [J]. ONLINE INFORMATION REVIEW, 2005, 29 (04) : 391 - 399
  • [3] Novel text classification based on K-nearest neighbor
    Yu, Xiao-Peng
    Yu, Xiao-Gao
    [J]. PROCEEDINGS OF 2007 INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND CYBERNETICS, VOLS 1-7, 2007, : 3425 - +
  • [4] K-Nearest Neighbor Algorithm Optimization in Text Categorization
    Chen, Shufeng
    [J]. 2017 3RD INTERNATIONAL CONFERENCE ON ENVIRONMENTAL SCIENCE AND MATERIAL APPLICATION (ESMA2017), VOLS 1-4, 2018, 108
  • [5] IMPROVING K-NEAREST NEIGHBOR EFFICIENCY FOR TEXT CATEGORIZATION
    Barigou, F.
    [J]. NEURAL NETWORK WORLD, 2016, 26 (01) : 45 - 65
  • [6] Modular k-nearest neighbor classification method for massively parallel text categorization
    Zhao, H
    Lu, BL
    [J]. COMPUTATIONAL AND INFORMATION SCIENCE, PROCEEDINGS, 2004, 3314 : 867 - 872
  • [7] A Review of a Text Classification Technique: K-Nearest Neighbor
    Zhou, R. S.
    Wang, Z. J.
    [J]. PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON COMPUTER INFORMATION SYSTEMS AND INDUSTRIAL APPLICATIONS (CISIA 2015), 2015, 18 : 453 - 455
  • [8] Research on the Improvement of K-Nearest Neighbor Classifier for Imbalanced Text Categorization
    Yang Yanmei
    Xu Linying
    [J]. 2018 EIGHTH INTERNATIONAL CONFERENCE ON INSTRUMENTATION AND MEASUREMENT, COMPUTER, COMMUNICATION AND CONTROL (IMCCC 2018), 2018, : 968 - 972
  • [9] Application of k-Nearest Neighbor on feature projections classifier to text categorization
    Yavuz, T
    Guvenir, HA
    [J]. ADVANCES IN COMPUTER AND INFORMATION SCIENCES '98, 1998, 53 : 135 - 142
  • [10] Feature Extraction based Text Classification using K-Nearest Neighbor Algorithm
    Azam, Muhammad
    Ahmed, Tanvir
    Sabah, Fahad
    Hussain, Muhammad Iftikhar
    [J]. INTERNATIONAL JOURNAL OF COMPUTER SCIENCE AND NETWORK SECURITY, 2018, 18 (12) : 95 - 101