Using clustering and edit distance techniques for automatic web data extraction

Cited by: 0
Authors
Alvarez, Manuel [1]
Pan, Alberto [1]
Raposo, Juan [1]
Bellas, Fernando [1]
Cacheda, Fidel [1]
Affiliations
[1] Univ A Coruna, Dept Informat & Commun Technol, La Coruna 15071, Spain
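Source
WEB INFORMATION SYSTEMS ENGINEERING - WISE 2007, PROCEEDINGS | 2007 / Vol. 4831
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract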
Many web sources provide access to an underlying database containing structured data. These data can usually be accessed only in HTML form, which makes it difficult for software programs to obtain them in structured form. Nevertheless, web sources usually encode data records using a consistent template or layout, and the implicit regularities in the template can be used to automatically infer the structure and extract the data. In this paper, we propose a set of novel techniques to address this problem. While several previous works have addressed the same problem, most of them require multiple input pages, whereas our method requires only one. In addition, previous methods make some assumptions about how data records are encoded into web pages, and these assumptions do not always hold in real websites. Finally, we have tested our techniques on a large number of real web sources and have found them to be very effective.
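The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea suggested by the title: records rendered from one template tend to have similar tag sequences, which edit distance can measure and a simple clustering step can group. The function names, the greedy grouping strategy, the 0.7 threshold, and the sample tag sequences are all assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Edit distance normalized to [0, 1]; 1.0 means identical sequences."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def cluster(sequences: List[Sequence[str]], threshold: float = 0.7) -> List[List[int]]:
    """Greedy grouping (an assumed strategy): each sequence joins the first
    cluster whose first member is similar enough, else starts a new cluster."""
    clusters: List[List[int]] = []
    for idx, seq in enumerate(sequences):
        for group in clusters:
            if similarity(seq, sequences[group[0]]) >= threshold:
                group.append(idx)
                break
        else:
            clusters.append([idx])
    return clusters


# Hypothetical tag sequences for four page fragments.
records = [
    ["tr", "td", "a", "td", "span"],
    ["tr", "td", "a", "td", "span", "img"],
    ["div", "h2", "p"],
    ["tr", "td", "a", "td"],
]
print(cluster(records))  # [[0, 1, 3], [2]]
```

Here the three table-row fragments differ by a single tag each and fall into one group, while the unrelated `div` fragment forms its own. A real extractor would need to choose the fragments and the threshold far more carefully.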
Pages: 212 - 224
Number of pages: 13
Related papers
50 items in total
  • [41] Data quality system using reference dictionaries and edit distance algorithms
    Karbarz, Radoslaw
    Mulawka, Jan
    PHOTONICS APPLICATIONS IN ASTRONOMY, COMMUNICATIONS, INDUSTRY, AND HIGH-ENERGY PHYSICS EXPERIMENTS 2015, 2015, 9662
  • [42] An Automatic Semantic Extraction Method for Web Data Interchange
    Yao, Yuangang
    Liu, Hui
    Yi, Jin
    Chen, Haiqiang
    Zhao, Xianghui
    Ma, Xiaoyu
    2014 6TH INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE AND INFORMATION TECHNOLOGY (CSIT), 2014, : 148 - 152
  • [43] Automatic Extraction of Structured Web Data with Domain Knowledge
    Derouiche, Nora
    Cautis, Bogdan
    Abdessalem, Talel
    2012 IEEE 28TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE), 2012, : 726 - 737
  • [44] Understanding Cloud Data Using Approximate String Matching and Edit Distance
    Jupin, Joseph
    Shi, Justin Y.
    Obradovic, Zoran
    2012 SC COMPANION: HIGH PERFORMANCE COMPUTING, NETWORKING, STORAGE AND ANALYSIS (SCC), 2012, : 1234 - 1243
  • [45] Using the levenshtein edit distance for automatic lemmatization: A case study for Modern Greek and English
    Lyras, Dimitrios P.
    Sgarbas, Kyriakos N.
    Fakotakis, Nikolaos D.
    19TH IEEE INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE, VOL II, PROCEEDINGS, 2007, : 428 - 435
  • [46] Efficient Approximate Entity Extraction with Edit Distance Constraints
    Wang, Wei
    Xiao, Chuan
    Lin, Xuemin
    Zhang, Chengqi
    ACM SIGMOD/PODS 2009 CONFERENCE, 2009, : 759 - 770
  • [47] Error Classification Using Automatic Measures Based on n-grams and Edit Distance
    Benko, L'ubomir
    Benkova, Lucia
    Munkova, Dasa
    Munk, Michal
    Shulzenko, Danylo
    ADVANCED RESEARCH IN TECHNOLOGIES, INFORMATION, INNOVATION AND SUSTAINABILITY, ARTIIS 2022, PT I, 2022, 1675 : 345 - 356
  • [48] Contour Regularity Extraction Based on String Edit Distance
    Salas, Jose Ignacio Abreu
    Ramon Rico-Juan, Juan
    PATTERN RECOGNITION AND IMAGE ANALYSIS, PROCEEDINGS, 2009, 5524 : 160 - +
  • [49] Near-Duplicate Web Video Retrieval and Localization Using Improved Edit Distance
    Liu, Hao
    Zhao, Qingjie
    Wang, Hao
    Zhang, Cong
    WEB TECHNOLOGIES AND APPLICATIONS, PT I, 2016, 9931 : 141 - 152
  • [50] Automatic data extraction from data-rich web pages
    Hu, DD
    Meng, XF
    DATABASE SYSTEMS FOR ADVANCED APPLICATIONS, PROCEEDINGS, 2005, 3453 : 828 - 839