Using clustering and edit distance techniques for automatic web data extraction

Cited by: 0
Authors
Alvarez, Manuel [1]
Pan, Alberto [1]
Raposo, Juan [1]
Bellas, Fernando [1]
Cacheda, Fidel [1]
Affiliations
[1] Univ A Coruna, Dept Informat & Commun Technol, La Coruna 15071, Spain
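Source
WEB INFORMATION SYSTEMS ENGINEERING - WISE 2007, PROCEEDINGS | 2007 / Vol. 4831
Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812 ;
Abstract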
Many web sources provide access to an underlying database containing structured data. These data can usually be accessed only in HTML form, which makes it difficult for software programs to obtain them in structured form. Nevertheless, web sources usually encode data records using a consistent template or layout, and the implicit regularities in the template can be used to automatically infer the structure and extract the data. In this paper, we propose a set of novel techniques to address this problem. While several previous works have addressed the same problem, most of them require multiple input pages, whereas our method requires only one. In addition, previous methods make assumptions about how data records are encoded into web pages that do not always hold in real websites. Finally, we have tested our techniques on a large number of real web sources and found them to be very effective.
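The general idea named in the title, grouping template-generated page fragments by edit distance, can be sketched as follows. This is only an illustration under stated assumptions and not the authors' algorithm: each candidate block is assumed to be already flattened into a sequence of tag names, and the names `edit_distance`, `similarity`, `cluster`, the greedy single-pass grouping, and the `0.7` threshold are all hypothetical choices made for the example.

```python
from typing import List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Edit distance normalized to [0, 1]; 1.0 means identical sequences."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def cluster(seqs: List[Sequence[str]], threshold: float = 0.7) -> List[List[int]]:
    """Greedy clustering: join the first cluster whose representative
    (its first member) is similar enough, otherwise start a new cluster.
    The threshold is an assumed value, not one taken from the paper."""
    clusters: List[List[int]] = []
    for idx, seq in enumerate(seqs):
        for members in clusters:
            if similarity(seqs[members[0]], seq) >= threshold:
                members.append(idx)
                break
        else:
            clusters.append([idx])
    return clusters


# Hypothetical tag sequences for four sibling blocks on one page.
blocks = [
    ["div", "a", "span", "span"],
    ["div", "a", "span"],
    ["ul", "li", "li", "li"],
    ["div", "a", "span", "span"],
]
groups = cluster(blocks)
print(groups)  # [[0, 1, 3], [2]]
```

In this toy input, blocks 0, 1 and 3 share a layout and end up in one group, which is the kind of regularity a record extractor could then exploit; the navigation-like list in block 2 is kept apart.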
Pages: 212-224
Page count: 13
Related papers
50 in total
  • [1] An Approach of Automatic Web Data Record Extraction Using Clustering Techniques
    Dong, YongQuan
    Li, QingZhong
    2009 INTERNATIONAL SYMPOSIUM ON INTELLIGENT INFORMATION SYSTEMS AND APPLICATIONS, PROCEEDINGS, 2009, : 441 - 444
  • [2] STAVIES: A system for information extraction from unknown Web data sources through automatic Web wrapper generation using clustering techniques
    Papadakis, NK
    Skoutas, D
    Raftopoulos, K
    Varvarigou, TA
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2005, 17 (12) : 1638 - 1652
  • [3] Web document clustering by using automatic keyphrase extraction
    Flan, Juhyun
    Kim, Taehwan
    Choi, Joongmin
    PROCEEDING OF THE 2007 IEEE/WIC/ACM INTERNATIONAL CONFERENCE ON WEB INTELLIGENCE AND INTELLIGENT AGENT TECHNOLOGY, WORKSHOPS, 2007, : 56 - 59
  • [4] Clustering of Synthetic Routes Using Tree Edit Distance
    Genheden, Samuel
    Engkvist, Ola
    Bjerrum, Esben
    JOURNAL OF CHEMICAL INFORMATION AND MODELING, 2021, 61 (08) : 3899 - 3907
  • [5] A natural language interface for automatic generation of data flow diagram using web extraction techniques
    Cheema, Sehrish Munawar
    Tariq, Saman
    Pires, Ivan Miguel
    JOURNAL OF KING SAUD UNIVERSITY-COMPUTER AND INFORMATION SCIENCES, 2023, 35 (02) : 626 - 640
  • [6] An automatic web wrapper for extracting information from web sources, using clustering techniques
    Papadakis, N
    Skoutas, D
    Raftopoulos, K
    Varvarigou, T
    2005 SYMPOSIUM ON APPLICATIONS AND THE INTERNET, PROCEEDINGS, 2005, : 24 - 30
  • [7] Paradigm Clustering with Weighted Edit Distance
    Gerlach, Andrew
    Wiemerslage, Adam
    Kann, Katharina
    SIGMORPHON 2021: 18TH SIGMORPHON WORKSHOP ON COMPUTATIONAL RESEARCH IN PHONETICS, PHONOLOGY, AND MORPHOLOGY, 2021, : 107 - 114
  • [8] Determining the Similarity of Two Web Applications Using the Edit Distance
    Popescu, Doru Anastasiu
    Nicolae, Dragos
    SOFT COMPUTING APPLICATIONS, (SOFA 2014), VOL 1, 2016, 356 : 681 - 690
  • [9] Effective techniques for automatic extraction of Web publications
    Fong, ACM
    Hui, SC
    Vu, HL
    ONLINE INFORMATION REVIEW, 2002, 26 (01) : 4 - 18
  • [10] Automatic Extraction of Complex Web Data
    Zhang, Ming
    Zhou, Ying
    Patrick, Jon
    PACIFIC ASIA CONFERENCE ON INFORMATION SYSTEMS 2006, SECTIONS 1-8, 2006, : 1436 - 1449