Weakly supervised learning for an effective focused web crawler

Cited by: 2
Authors
Dhanith, P. R. Joe [1]
Saeed, Khalid [2, 3]
Rohith, G. [4]
Raja, S. P. [5]
Affiliations
[1] Vellore Inst Technol, Sch Comp Sci & Engn SCOPE, Chennai, India
[2] Bialystok Tech Univ, Dept Comp Sci, Bialystok, Poland
[3] Univ La Costa, Dept Computat Sci & Elect, Barranquilla, Colombia
[4] Vellore Inst Technol, Sch Elect Engn SENSE, Chennai, India
[5] Vellore Inst Technol, Sch Comp Sci & Engn, Vellore, Tamil Nadu, India
Keywords
Focused web crawler; Global vectors for word representation; Manhattan distance Rule; Semantic vectors; Weakly supervised gated recurrent unit; SEMANTIC SIMILARITY;
DOI
10.1016/j.engappai.2024.107944
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
A focused crawler traverses the Web to collect only pages that are relevant to a particular topic, and is increasingly seen as a way around the scalability issues of current general-purpose search engines. However, the diversity of data on the Web forces these crawlers to face three significant problems: (i) inconsistency, (ii) ubiquity, and (iii) ambiguity, which misguide crawling. To handle these issues, this paper proposes a weakly supervised Gated Recurrent Unit (GRU) mechanism for an adaptive focused web crawler framework that semantically matches topics with webpage content. The weakly supervised GRU model accepts the vector forms of the topic and the fetched webpage as input, produces meaningful semantic vectors, and applies the Manhattan distance rule to compute the topical relevance of the webpage. The proposed mechanism guides the focused crawler toward downloading more relevant web pages by identifying hyperlinks that are relevant to the topic and omitting those that are not. The method helps the focused crawler semantically find, arrange, and index web pages in a relatively narrow segment of the Web, addressing the inconsistency, ubiquity, and ambiguity problems of focused crawlers. The experimental results indicate that the proposed technique outperforms state-of-the-art approaches in terms of harvest rate, precision, recall, harmonic mean, and irrelevance ratio. In summary, the strategy described here works well and is important for focused crawlers.
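The relevance step named in the abstract (comparing a topic vector with a page vector by Manhattan distance) can be illustrated with a minimal sketch. This is not the paper's implementation. The vectors stand in for the semantic vectors a GRU would produce, the `exp(-distance)` mapping to a score in (0, 1] is an assumption borrowed from common Siamese recurrent models, and the function names, the 0.5 threshold, and the example URLs are all hypothetical.

```python
import math
from typing import Dict, List, Sequence


def manhattan_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Return the L1 (Manhattan) distance between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(u, v))


def topical_relevance(topic_vec: Sequence[float], page_vec: Sequence[float]) -> float:
    """Map the L1 distance to a score in (0, 1]; identical vectors score 1.0.

    Assumption: the exp(-distance) mapping is illustrative and is not
    stated in the abstract.
    """
    return math.exp(-manhattan_distance(topic_vec, page_vec))


def filter_links(
    topic_vec: Sequence[float],
    candidates: Dict[str, Sequence[float]],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> List[str]:
    """Keep the URLs whose page vector is close enough to the topic vector."""
    return [
        url
        for url, vec in candidates.items()
        if topical_relevance(topic_vec, vec) >= threshold
    ]


# Placeholder vectors; in the described framework these would be GRU outputs.
topic = [0.0, 0.0]
pages = {
    "https://example.org/on-topic": [0.1, 0.1],
    "https://example.org/off-topic": [2.0, 2.0],
}
kept = filter_links(topic, pages)
```

With these placeholder values only the first URL passes the threshold, since its L1 distance to the topic vector is 0.2 while the second is 4.0.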
Pages: 15