An Improved Focused Crawler Based on Text Keyword Extraction

Cited by: 0
Authors
Zheng, Zhang [1]
Qian, Du [2]
Affiliations
[1] Wuhan Univ Technol, Dept Informat Technol, Wuhan, Hubei, Peoples R China
[2] Wuhan Univ Technol, Affiliat Dept Informat Technol, Wuhan, Hubei, Peoples R China
Keywords
focused crawler; keyword extract; TF-IDF; syntactic dependency analysis
DOI
Not available
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
To address the shortcomings of traditional focused crawlers, this paper proposes an improved focused crawling method based on syntactic dependency analysis. The method first generates a collection of candidate words from the input text using the TF-IDF algorithm and a collection of candidate phrases using syntactic dependency analysis. It then evaluates the words and phrases together to select the keyword set of the text. Next, the keyword set is submitted to a general-purpose search engine, and part of the search results are used as seed links for the focused crawler. The crawler follows a best-first search policy, in which the priority of a link is evaluated from the similarity between the keywords and the link's anchor text combined with its surrounding context. Because the extracted keyword set includes phrases as well as single words, the relevance of the seeds and of the crawled links is improved. Experimental results show that the similarity between the crawled pages and the input text is 14.3 percent higher with this method than with manually chosen keywords. The method performs well for focused crawlers that take text as input, for vertical search engines, and in other application fields.
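The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `tfidf_keywords`, `similarity`, and `Frontier` are hypothetical, the smoothed IDF formula and the cosine similarity measure are assumptions (the abstract does not specify either), and the phrase extraction by syntactic dependency analysis and the search-engine seeding step are omitted. The sketch only shows word-level TF-IDF keyword selection and a best-first frontier that ranks links by the similarity between the keywords and the anchor text plus context.

```python
import heapq
import math
from collections import Counter


def tfidf_keywords(doc_tokens, corpus, top_k=3):
    """Return the top_k words of one document ranked by TF-IDF.

    corpus is a list of token lists. The smoothed IDF used here is an
    assumption; the abstract does not state the exact formula.
    """
    tf = Counter(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for word, count in tf.items():
        df = sum(1 for doc in corpus if word in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        scores[word] = (count / len(doc_tokens)) * idf
    # Ties are broken alphabetically so the ranking is deterministic.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]


def similarity(keywords, text_tokens):
    """Cosine similarity between a keyword set and a bag of words (assumed measure)."""
    counts = Counter(text_tokens)
    if not keywords or not counts:
        return 0.0
    dot = sum(counts[k] for k in keywords)
    norm_kw = math.sqrt(len(keywords))
    norm_text = math.sqrt(sum(c * c for c in counts.values()))
    return dot / (norm_kw * norm_text)


class Frontier:
    """Best-first crawl frontier: the highest-scoring link is popped first."""

    def __init__(self, keywords):
        self.keywords = set(keywords)
        self._heap = []
        self._seen = set()
        self._order = 0  # insertion counter, breaks ties between equal scores

    def push(self, url, anchor_text, context=""):
        if url in self._seen:
            return
        self._seen.add(url)
        # Anchor text and surrounding context are scored together.
        tokens = (anchor_text + " " + context).lower().split()
        score = similarity(self.keywords, tokens)
        # heapq is a min-heap, so the score is negated.
        heapq.heappush(self._heap, (-score, self._order, url))
        self._order += 1

    def pop(self):
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score


# Toy usage with made-up documents and example.com URLs.
doc = "focused crawler ranks links focused crawler uses keywords".split()
corpus = [doc, "search engine ranks pages".split(), "the engine uses links".split()]
keywords = tfidf_keywords(doc, corpus)

frontier = Frontier(keywords)
frontier.push("http://example.com/b", "cooking recipes")
frontier.push("http://example.com/a", "focused crawler tutorial", context="keywords")
first_url, first_score = frontier.pop()
```

In a real crawler the seed links would come from the search-engine results for the keyword set, and each fetched page would push its outgoing links back into the frontier. Matching multi-word phrases would also require something richer than the whitespace tokenization used here.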
Pages: 386-390
Page count: 5
Related papers
50 in total
  • [1] Keyword query based focused Web crawler
    Kumar, Manish
    Bindal, Ankit
    Gautam, Robin
    Bhatia, Rajesh
    [J]. 6TH INTERNATIONAL CONFERENCE ON SMART COMPUTING AND COMMUNICATIONS, 2018, 125 : 584 - 590
  • [2] Keyword Focused Web Crawler
    Agre, Gunjan H.
    Mahajan, Nikita V.
    [J]. 2015 2ND INTERNATIONAL CONFERENCE ON ELECTRONICS AND COMMUNICATION SYSTEMS (ICECS), 2015, : 1089 - 1092
  • [3] Research on Text Mining Algorithm Based on Focused Crawler
    Zhang, Qiusheng
    Lin, Mingyu
    Jun, Jianping
    Zhang, Xingyun
    [J]. 2017 12TH INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE AND EDUCATION (ICCSE 2017), 2017, : 454 - 457
  • [4] Designing Focused Crawler Based On Improved Genetic Algorithm
    Yan, Wei
    Pan, Li
    [J]. PROCEEDINGS OF 2018 TENTH INTERNATIONAL CONFERENCE ON ADVANCED COMPUTATIONAL INTELLIGENCE (ICACI), 2018, : 319 - 323
  • [5] An improved focused web crawler based on hybrid similarity
    Shang, Songtao
    Wu, Huaiguang
    Ma, Jiangtao
    [J]. International Journal of Performability Engineering, 2019, 15 (10) : 2645 - 2656
  • [6] Design of Focused Crawler Based On Feature Extraction, Classification and Term Extraction
    Gupta, Shilpi
    [J]. PROCEEDINGS OF THE 10TH INDIACOM - 2016 3RD INTERNATIONAL CONFERENCE ON COMPUTING FOR SUSTAINABLE GLOBAL DEVELOPMENT, 2016, : 3430 - 3434
  • [7] An application of improved PageRank in focused crawler
    Zhang, Yulian
    Yin, Chunxia
    Yuan, Fuyong
    [J]. FOURTH INTERNATIONAL CONFERENCE ON FUZZY SYSTEMS AND KNOWLEDGE DISCOVERY, VOL 2, PROCEEDINGS, 2007, : 331 - 335
  • [8] Design of improved focused web crawler by analyzing semantic nature of URL and anchor text
    Dahiwale, Prashant
    Raghuwanshi, M. M.
    Malik, Latesh
    [J]. 2014 9TH INTERNATIONAL CONFERENCE ON INDUSTRIAL AND INFORMATION SYSTEMS (ICIIS), 2014, : 483 - +
  • [9] Chinese Automatic Text Summarization Based on Keyword Extraction
    Jiang Xiao-yu
    [J]. FIRST INTERNATIONAL WORKSHOP ON DATABASE TECHNOLOGY AND APPLICATIONS, PROCEEDINGS, 2009, : 225 - 228
  • [10] Keyword extraction for text categorization
    An, JY
    Chen, YPP
    [J]. PROCEEDINGS OF THE 2005 INTERNATIONAL CONFERENCE ON ACTIVE MEDIA TECHNOLOGY (AMT 2005), 2005, : 556 - 561