An Improved Focused Crawler Based on Text Keyword Extraction

Cited by: 0
Authors
Zheng, Zhang [1]
Qian, Du [2]
Affiliations
[1] Wuhan Univ Technol, Dept Informat Technol, Wuhan, Hubei, Peoples R China
[2] Wuhan Univ Technol, Affiliat Dept Informat Technol, Wuhan, Hubei, Peoples R China
Keywords
focused crawler; keyword extract; TF-IDF; syntactic dependency analysis
DOI
Not available
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
To address the shortcomings of traditional focused crawlers, this paper proposes an improved focused crawling method based on syntactic dependency analysis. The method first generates a collection of words from the input text using the TF-IDF algorithm and a collection of phrases using syntactic dependency analysis. It then evaluates the words and phrases to select a set of keywords for the text. Next, the keyword set is submitted to a general-purpose search engine, and part of the search results are used as seed links for the focused crawler. The crawler follows a best-first search policy, which uses the similarity between the keywords and a link's anchor text to assign priority. Because the proposed keyword extraction method, which combines TF-IDF with syntactic dependency analysis, yields phrases as well as single words, the relevance of the seeds and links is improved. Link priority is evaluated by combining a link's anchor text with its context. Experimental results show that the similarity between the crawled pages and the input text is 14.3 percent higher with this method than with manually chosen keywords. The method performs well for focused crawlers that take text as input, for vertical search engines, and in other application fields.
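The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`tokenize`, `tfidf_keywords`, `cosine_similarity`, `best_first_order`), the smoothed IDF formula, the whitespace-style tokenizer, and the use of cosine similarity are all assumptions made for illustration. The phrase-extraction step based on syntactic dependency analysis is omitted, so only single-word keywords are produced here.

```python
import heapq
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split on non-alphanumeric characters (a simplification)."""
    tokens, current = [], []
    for ch in text.lower():
        if ch.isalnum():
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


def tfidf_keywords(doc, corpus, top_k=3):
    """Rank the words of `doc` by TF-IDF against a small background corpus."""
    tf = Counter(tokenize(doc))
    total = sum(tf.values())
    background = [set(tokenize(d)) for d in corpus]
    scores = {}
    for word, count in tf.items():
        df = sum(1 for d in background if word in d)
        # Smoothed IDF (an assumed variant; the paper's exact formula is not given).
        idf = math.log((1 + len(background)) / (1 + df)) + 1.0
        scores[word] = (count / total) * idf
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]


def cosine_similarity(keywords, text):
    """Cosine similarity between a keyword list and the bag of words of `text`."""
    a, b = Counter(keywords), Counter(tokenize(text))
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def best_first_order(keywords, links):
    """Return URLs in best-first order.

    `links` is a list of (url, anchor_text, context) tuples; priority is the
    similarity between the keywords and the anchor text combined with its context.
    """
    frontier = []
    for index, (url, anchor, context) in enumerate(links):
        score = cosine_similarity(keywords, anchor + " " + context)
        # heapq is a min-heap, so the score is negated; index breaks ties.
        heapq.heappush(frontier, (-score, index, url))
    order = []
    while frontier:
        _, _, url = heapq.heappop(frontier)
        order.append(url)
    return order
```

In a real crawler the frontier would be refilled with newly discovered links after each page is fetched; the sketch only orders a fixed set of candidate links to show how the priority is computed.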
Pages: 386-390
Number of pages: 5