Weakly supervised learning for an effective focused web crawler

Cited by: 2
Authors
Dhanith, P. R. Joe [1]
Saeed, Khalid [2, 3]
Rohith, G. [4]
Raja, S. P. [5]
Affiliations
[1] Vellore Inst Technol, Sch Comp Sci & Engn SCOPE, Chennai, India
[2] Bialystok Tech Univ, Dept Comp Sci, Bialystok, Poland
[3] Univ La Costa, Dept Computat Sci & Elect, Barranquilla, Colombia
[4] Vellore Inst Technol, Sch Elect Engn SENSE, Chennai, India
[5] Vellore Inst Technol, Sch Comp Sci & Engn, Vellore, Tamil Nadu, India
Keywords
Focused web crawler; Global vectors for word representation; Manhattan distance Rule; Semantic vectors; Weakly supervised gated recurrent unit; SEMANTIC SIMILARITY;
DOI
10.1016/j.engappai.2024.107944
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
A focused crawler traverses the Web to collect only pages that are relevant to a particular topic, and is increasingly considered a way around the scalability issues of current general-purpose search engines. However, the diversity of data on the Web forces these crawlers to face three significant problems, (i) inconsistency, (ii) ubiquity, and (iii) ambiguity, which misguide crawling. To handle these issues, this paper proposes a weakly supervised Gated Recurrent Unit (GRU) mechanism for an adaptive focused web crawler framework that semantically matches topics with webpage content. The weakly supervised GRU model accepts the vector forms of the topic and the fetched webpage as input, produces meaningful semantic vectors, and applies the Manhattan distance rule to compute the topical relevance of the webpage. The proposed mechanism guides the focused crawler toward downloading more relevant web pages by following the hyperlinks relevant to the topic and omitting the irrelevant ones. The proposed method helps the focused crawler semantically find, arrange, and index the web pages in a relatively narrow segment of the web, addressing the inconsistency, ubiquity, and ambiguity problems of focused crawlers. The experimental results indicate that the proposed technique outperforms the state-of-the-art approaches in terms of harvest rate, precision, recall, harmonic mean, and irrelevance ratio. In summary, the strategy described here works well and is important for focused crawlers.
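The abstract does not give the exact form of the Manhattan distance rule, so the following is only a minimal sketch of how such a relevance score might be computed. It assumes the common exp(-L1 distance) similarity used in Siamese recurrent models, and it uses plain lists of floats as stand-ins for the semantic vectors the GRU would produce. The function names `manhattan_relevance` and `filter_links` and the `threshold` parameter are hypothetical and do not come from the paper.

```python
import math


def manhattan_relevance(topic_vec, page_vec):
    """Map the L1 (Manhattan) distance between two vectors to a score in (0, 1].

    Assumption: similarity = exp(-L1 distance); the paper may use a different rule.
    """
    if len(topic_vec) != len(page_vec):
        raise ValueError("vectors must have the same length")
    l1 = sum(abs(t - p) for t, p in zip(topic_vec, page_vec))
    return math.exp(-l1)


def filter_links(topic_vec, candidates, threshold=0.5):
    """Return URLs scoring at or above the threshold, most relevant first.

    `candidates` maps each URL to a stand-in semantic vector (hypothetical input).
    """
    scored = [(manhattan_relevance(topic_vec, vec), url)
              for url, vec in candidates.items()]
    return [url for score, url in sorted(scored, reverse=True)
            if score >= threshold]
```

Identical vectors score 1.0 and the score decays toward 0 as the vectors move apart, so a crawler could use a threshold on this value to decide which hyperlinks to follow and which to omit.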
Pages: 15
Related papers
50 in total
  • [21] A Semantic Focused Web Crawler Based on a Knowledge Representation Schema
    Hernandez, Julio
    Marin-Castro, Heidy M.
    Morales-Sandoval, Miguel
    APPLIED SCIENCES-BASEL, 2020, 10 (11):
  • [22] Template-Driven Semantic Parsing for Focused Web Crawler
    Blinkiewicz, Michal
    Galler, Mariusz
    Szwabe, Andrzej
    SEMANTIC TECHNOLOGY (JIST 2014), 2015, 8943 : 351 - 358
  • [23] A novel incremental parallel web crawler based on focused crawling
    Huang, Qiuyan
    Li, Qingzhong
    Yan, Zhongmin
    Fu, Hong
    Journal of Computational Information Systems, 2013, 9 (06) : 2461 - 2469
  • [24] Weakly Supervised Learning of Object Segmentations from Web-Scale Video
    Hartmann, Glenn
    Grundmann, Matthias
    Hoffman, Judy
    Tsai, David
    Kwatra, Vivek
    Madani, Omid
    Vijayanarasimhan, Sudheendra
    Essa, Irfan
    Rehg, James
    Sukthankar, Rahul
    COMPUTER VISION - ECCV 2012: WORKSHOPS AND DEMONSTRATIONS, PT I, 2012, 7583 : 198 - 208
  • [25] CurriculumNet: Weakly Supervised Learning from Large-Scale Web Images
    Guo, Sheng
    Huang, Weilin
    Zhang, Haozhi
    Zhuang, Chenfan
    Dong, Dengke
    Scott, Matthew R.
    Huang, Dinglong
    COMPUTER VISION - ECCV 2018, PT X, 2018, 11214 : 139 - 154
  • [26] Weakly Supervised Correspondence Learning
    Wang, Zihan
    Cao, Zhangjie
    Hao, Yilun
    Sadigh, Dorsa
    2022 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION (ICRA 2022), 2022,
  • [27] Weakly supervised machine learning
    Ren, Zeyu
    Wang, Shuihua
    Zhang, Yudong
    CAAI TRANSACTIONS ON INTELLIGENCE TECHNOLOGY, 2023, 8 (03) : 549 - 580
  • [28] Weakly Supervised Dictionary Learning
    You, Zeyu
    Raich, Raviv
    Fern, Xiaoli Z.
    Kim, Jinsub
    IEEE TRANSACTIONS ON SIGNAL PROCESSING, 2018, 66 (10) : 2527 - 2541
  • [29] Safe Weakly Supervised Learning
    Li, Yu-Feng
    PROCEEDINGS OF THE THIRTIETH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2021, 2021, : 4951 - 4955
  • [30] Learning from the Web: Language Drives Weakly-Supervised Incremental Learning for Semantic Segmentation
    Liu, Chang
    Rizzoli, Giulia
    Zanuttigh, Pietro
    Li, Fu
    Niu, Yi
    COMPUTER VISION - ECCV 2024, PT XVII, 2025, 15075 : 352 - 369