DTM- Extracting Data Records from Search Engine Results Page using Tree Matching Algorithm

Authors
Hong, Jer Lang [1]
Siew, Eugene [1]
Egerton, Simon [1]
Affiliations
[1] Monash Univ, Selangor Darul Ehsan 46150, Malaysia
Keywords
Information Extraction; Wrapper Generation; Search Engine
DOI
10.1109/SoCPaR.2009.40
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
In this paper, we develop a non-visual automatic wrapper for extracting data records from search engine results pages. The novel techniques in our wrapper are (1) filtering rules to detect and filter out irrelevant data records, (2) a tree matching algorithm that uses frequency measures to increase the speed of data extraction, and (3) an algorithm that calculates the number and size of the components of data records to detect the correct data region. Results show that our wrapper is as robust as, and in many cases outperforms, state-of-the-art wrappers such as ViNT and DEPTA. This wrapper could have significant speed advantages when processing large volumes of web site data, which could be helpful in meta search engine development.
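The abstract does not spell out the tree matching algorithm itself. As a rough illustration of the general idea only, the Python sketch below implements the classic Simple Tree Matching recurrence on ordered tag trees. It is not the frequency-based variant the authors describe, and the Node class, function names, and sample records are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A hypothetical minimal DOM node: a tag name plus ordered children."""
    tag: str
    children: list = field(default_factory=list)


def tree_size(node: Node) -> int:
    """Count the nodes in the subtree rooted at `node`."""
    return 1 + sum(tree_size(child) for child in node.children)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Return the number of matched node pairs in a maximum ordered matching."""
    if a.tag != b.tag:
        return 0  # roots differ, so nothing below them is matched
    m, n = len(a.children), len(b.children)
    # dp[i][j]: best score using the first i subtrees of a and first j of b
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = simple_tree_matching(a.children[i - 1], b.children[j - 1])
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1] + pair)
    return dp[m][n] + 1  # +1 for the matched roots


def similarity(a: Node, b: Node) -> float:
    """Normalize the match count by the average size of the two trees."""
    return 2 * simple_tree_matching(a, b) / (tree_size(a) + tree_size(b))


# Two made-up search result records with slightly different layouts.
record_1 = Node("div", [Node("a"), Node("span"), Node("p")])
record_2 = Node("div", [Node("a"), Node("p")])
advert = Node("table", [Node("tr")])

print(simple_tree_matching(record_1, record_2))
print(simple_tree_matching(record_1, advert))
```

In a wrapper of this kind, a high similarity score between sibling subtrees would suggest that they are instances of the same data record template, while a low score (as for the made-up `advert` tree) would suggest an unrelated region. The threshold for "high" is a tuning choice that this sketch leaves open.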
Pages: 149-154
Page count: 6