Using clustering and edit distance techniques for automatic web data extraction

Cited by: 0
Authors
Alvarez, Manuel [1]
Pan, Alberto [1]
Raposo, Juan [1]
Bellas, Fernando [1]
Cacheda, Fidel [1]
Affiliations
[1] Univ A Coruna, Dept Informat & Commun Technol, La Coruna 15071, Spain
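Source
WEB INFORMATION SYSTEMS ENGINEERING - WISE 2007, PROCEEDINGS | 2007 / Vol. 4831
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract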
Many web sources provide access to an underlying database containing structured data. These data can usually be accessed only in HTML form, which makes it difficult for software programs to obtain them in structured form. Nevertheless, web sources usually encode data records using a consistent template or layout, and the implicit regularities in the template can be used to automatically infer the structure and extract the data. In this paper, we propose a set of novel techniques to address this problem. Several previous works have addressed the same problem, but most of them require multiple input pages, whereas our method requires only one. In addition, previous methods make assumptions about how data records are encoded into web pages that do not always hold in real websites. Finally, we have tested our techniques with a large number of real web sources and found them to be very effective.
Pages: 212-224
Page count: 13