Weakly supervised learning for an effective focused web crawler

Cited by: 2
Authors
Dhanith, P. R. Joe [1]
Saeed, Khalid [2, 3]
Rohith, G. [4]
Raja, S. P. [5]
Affiliations
[1] Vellore Inst Technol, Sch Comp Sci & Engn SCOPE, Chennai, India
[2] Bialystok Tech Univ, Dept Comp Sci, Bialystok, Poland
[3] Univ La Costa, Dept Computat Sci & Elect, Barranquilla, Colombia
[4] Vellore Inst Technol, Sch Elect Engn SENSE, Chennai, India
[5] Vellore Inst Technol, Sch Comp Sci & Engn, Vellore, Tamil Nadu, India
Keywords
Focused web crawler; Global vectors for word representation; Manhattan distance rule; Semantic vectors; Weakly supervised gated recurrent unit; Semantic similarity
DOI
10.1016/j.engappai.2024.107944
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
A focused crawler traverses the Web to collect only pages that are relevant to a particular topic, and is increasingly seen as a way around the scalability issues of current general-purpose search engines. However, the diversity of data on the Web confronts these crawlers with three significant problems: (i) inconsistency, (ii) ubiquity, and (iii) ambiguity, which misguide crawling. To handle these issues, this paper proposes a weakly supervised Gated Recurrent Unit (GRU) mechanism for an adaptive focused web crawler framework that semantically matches topics and webpage content. The weakly supervised GRU model accepts the vector forms of the topic and the fetched webpage as input, produces meaningful semantic vectors, and applies the Manhattan distance rule to compute the topical relevance of the webpage. The proposed mechanism guides the focused crawler toward downloading more relevant web pages by following hyperlinks that are relevant to the topic and omitting those that are not. It helps the focused crawler semantically find, arrange, and index web pages in a relatively narrow segment of the Web, addressing the inconsistency, ubiquity, and ambiguity problems of focused crawlers. The experimental results indicate that the proposed technique outperforms state-of-the-art approaches in terms of harvest rate, precision, recall, harmonic mean, and irrelevance ratio. In summary, the proposed strategy performs well and is relevant to the design of focused crawlers.
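The relevance step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: it assumes the GRU has already produced fixed-length semantic vectors for the topic and for each candidate page, and it assumes the Manhattan distance d is converted to a similarity with exp(-d), a common choice in Siamese recurrent models. The function names, the threshold value, and the example vectors are all hypothetical.

```python
import heapq
import math


def manhattan_distance(u, v):
    """L1 distance between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return sum(abs(a - b) for a, b in zip(u, v))


def relevance(topic_vec, page_vec):
    """Map the Manhattan distance into (0, 1]; 1.0 means identical vectors.

    Assumption: exp(-distance) is used as the similarity. The abstract only
    states that a Manhattan distance rule is applied.
    """
    return math.exp(-manhattan_distance(topic_vec, page_vec))


def rank_links(topic_vec, candidates, threshold=0.5):
    """Keep candidate URLs whose score meets the threshold, best first.

    candidates: iterable of (url, semantic_vector) pairs.
    threshold: hypothetical cut-off below which a link is skipped.
    """
    heap = []
    for url, vec in candidates:
        score = relevance(topic_vec, vec)
        if score >= threshold:
            # heapq is a min-heap, so push the negated score.
            heapq.heappush(heap, (-score, url))
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]


if __name__ == "__main__":
    topic = [0.2, 0.4]
    pages = [
        ("https://example.org/far", [2.0, 2.0]),
        ("https://example.org/near", [0.3, 0.5]),
        ("https://example.org/same", [0.2, 0.4]),
    ]
    for url in rank_links(topic, pages):
        print(url)
```

In a crawler, the returned order would drive the frontier: links scoring below the threshold are dropped, and the rest are fetched most relevant first.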
Pages: 15
Related papers
50 items in total
  • [41] Visual Recognition by Learning From Web Data via Weakly Supervised Domain Generalization
    Niu, Li
    Li, Wen
    Xu, Dong
    Cai, Jianfei
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2017, 28 (09) : 1985 - 1999
  • [42] A novel focused crawler combining Web space evolution and domain ontology
    Liu, Jingfa
    Li, Xin
    Zhang, Qiansheng
    Zhong, Guo
    KNOWLEDGE-BASED SYSTEMS, 2022, 243
  • [43] An ontology-supported web focused-crawler for Java programs
    Dept. of Computer and Communication Engineering, St. John's University, Taiwan
    Unknown
    IEEE Int. Conf. Ubi-Media Comput., U-Media, (266-271):
  • [44] Focused crawler for events
    Farag, Mohamed M. G.
    Lee, Sunshin
    Fox, Edward A.
    INTERNATIONAL JOURNAL ON DIGITAL LIBRARIES, 2018, 19 (01) : 3 - 19
  • [45] Weakly supervised label learning flows
    Lu, You
    Song, Wenzhuo
    Arachie, Chidubem
    Huang, Bert
    NEURAL NETWORKS, 2025, 182
  • [46] Weakly Supervised Deep Learning in Radiology
    Misera, Leo
    Mueller-Franzes, Gustav
    Truhn, Daniel
    Kather, Jakob Nikolas
    RADIOLOGY, 2024, 312 (01)
  • [47] Special issue on weakly supervised learning
    Zhang, Luming
    Ji, Rongrong
    Yi, Zhen
    Lin, Weisi
    Snoek, Cees
    JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2016, 37 : 1 - 2
  • [48] Unlearning From Weakly Supervised Learning
    Tang, Yi
    Gao, Yi
    Luo, Yong-Gang
    Yang, Ju-Cheng
    Xu, Miao
    Zhang, Min-Ling
    PROCEEDINGS OF THE THIRTY-THIRD INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2024, 2024 : 5000 - 5008
  • [49] A brief introduction to weakly supervised learning
    Zhou, Zhi-Hua
    NATIONAL SCIENCE REVIEW, 2018, 5 (01) : 44 - 53
  • [50] Towards Safe Weakly Supervised Learning
    Li, Yu-Feng
    Guo, Lan-Zhe
    Zhou, Zhi-Hua
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2021, 43 (01) : 334 - 346