Weakly supervised learning for an effective focused web crawler

Cited by: 2
Authors
Dhanith, P. R. Joe [1]
Saeed, Khalid [2, 3]
Rohith, G. [4]
Raja, S. P. [5]
Affiliations
[1] Vellore Inst Technol, Sch Comp Sci & Engn SCOPE, Chennai, India
[2] Bialystok Tech Univ, Dept Comp Sci, Bialystok, Poland
[3] Univ La Costa, Dept Computat Sci & Elect, Barranquilla, Colombia
[4] Vellore Inst Technol, Sch Elect Engn SENSE, Chennai, India
[5] Vellore Inst Technol, Sch Comp Sci & Engn, Vellore, Tamil Nadu, India
Keywords
Focused web crawler; Global vectors for word representation; Manhattan distance rule; Semantic vectors; Weakly supervised gated recurrent unit; Semantic similarity
DOI
10.1016/j.engappai.2024.107944
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
A focused crawler traverses the Web to collect only pages that are relevant to a particular topic, and is increasingly regarded as a way around the scalability issues of current general-purpose search engines. However, the diversity of data on the Web forces these crawlers to face three significant problems: (i) inconsistency, (ii) ubiquity, and (iii) ambiguity, all of which misguide crawling. To handle these issues, this paper proposes a weakly supervised Gated Recurrent Unit (GRU) mechanism for an adaptive focused web crawler framework that matches semantically relevant topics and webpage content. The weakly supervised GRU model accepts the vector form of the topic and the fetched webpage as input, produces meaningful semantic vectors, and applies the Manhattan distance rule to compute the topical relevance of the webpage. The proposed mechanism guides the focused crawler toward downloading more relevant web pages by identifying hyperlinks relevant to the topic and omitting irrelevant ones. The method helps the focused crawler semantically find, arrange, and index web pages in a relatively narrow segment of the Web, addressing the inconsistency, ubiquity, and ambiguity problems of focused crawlers. The experimental results indicate that the proposed technique outperforms state-of-the-art approaches in terms of harvest rate, precision, recall, harmonic mean, and irrelevance ratio. In summary, the described strategy is effective and well suited to focused crawlers.
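The relevance step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state how the Manhattan distance is turned into a relevance score, so the mapping exp(-L1 distance) used here is an assumption, and the function names `manhattan_relevance` and `rank_links`, the `threshold` parameter, and the example vectors are all hypothetical. The input vectors stand in for the semantic vectors that the GRU would produce; the GRU itself is not modeled.

```python
import math
from typing import List, Sequence, Tuple


def manhattan_relevance(topic_vec: Sequence[float], page_vec: Sequence[float]) -> float:
    """Map the Manhattan (L1) distance between two vectors to a score in (0, 1].

    Assumption: score = exp(-L1 distance). Identical vectors score 1.0 and the
    score decays toward 0 as the vectors move apart.
    """
    if len(topic_vec) != len(page_vec):
        raise ValueError("vectors must have the same length")
    l1_distance = sum(abs(t - p) for t, p in zip(topic_vec, page_vec))
    return math.exp(-l1_distance)


def rank_links(
    topic_vec: Sequence[float],
    candidates: Sequence[Tuple[str, Sequence[float]]],
    threshold: float = 0.5,
) -> List[Tuple[str, float]]:
    """Keep candidate links scoring at least `threshold`, most relevant first.

    `candidates` pairs each URL with a (hypothetical) semantic vector for the
    page or anchor context it points to.
    """
    scored = [(url, manhattan_relevance(topic_vec, vec)) for url, vec in candidates]
    kept = [(url, score) for url, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    topic = [0.2, 0.8, 0.1]
    links = [
        ("https://example.org/on-topic", [0.25, 0.75, 0.1]),
        ("https://example.org/off-topic", [0.9, 0.0, 0.9]),
    ]
    for url, score in rank_links(topic, links):
        print(f"{score:.3f}  {url}")
```

In this sketch a crawler would enqueue only the links returned by `rank_links`, which mirrors the abstract's idea of following relevant hyperlinks and omitting irrelevant ones; the threshold value is arbitrary and would need tuning.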
Pages: 15