Automatic data record detection in Web Pages

Authors
Gao, Xiaoying [1]
Vuong, Le Phong Bao [1]
Zhang, Mengjie [1]
Affiliation
[1] Victoria Univ Wellington, Sch Math Stat & Comp Sci, POB 600, Wellington, New Zealand
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Wrapper induction is currently the main technology for extracting data from semi-structured Web pages. However, it requires training Web pages, and the information extraction process is complex, involving pattern induction, data extraction and data transformation. This paper introduces a new approach that achieves automatic data extraction by applying clustering to detect similar text tokens, labelling text tokens with a new method that captures the hierarchical structure of HTML pages, and transforming the labelled text tokens to XML with a new algorithm. The approach is evaluated and compared with a number of existing wrapper induction systems on three different sets of Web pages. The results suggest that the new approach is effective for data extraction and that it outperforms the existing approaches on these Web sites. The approach requires no training and has no explicit pattern induction or data extraction processes, which simplifies the whole process.
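
The abstract gives no algorithmic detail, so the following is only a rough illustration of the general idea (group similar text tokens, label them, emit XML), not the authors' method. It assumes a much simpler notion of similarity than clustering: two text tokens are treated as similar when they share the same HTML tag path. The names `TokenCollector`, `tokens_to_xml`, `record` and `fieldN` are hypothetical and chosen for this sketch only.

```python
from collections import defaultdict
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class TokenCollector(HTMLParser):
    """Collect (tag_path, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.tokens = []  # (tag_path, text) in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.tokens.append(("/".join(self.stack), text))


def tokens_to_xml(html):
    """Group text tokens by tag path and emit the repeated ones as XML records."""
    collector = TokenCollector()
    collector.feed(html)
    collector.close()

    # Assumption: tokens with an identical tag path belong to the same group.
    groups = defaultdict(list)
    for path, text in collector.tokens:
        groups[path].append(text)

    # Only paths that repeat are treated as record fields; label them in
    # first-seen order as field1, field2, ...
    repeated = [path for path in groups if len(groups[path]) > 1]
    labels = {path: f"field{i}" for i, path in enumerate(repeated, 1)}

    root = ET.Element("records")
    record = None
    for path, text in collector.tokens:
        if path not in labels:
            continue  # one-off text such as a page title is skipped
        # Assumption: a new record starts whenever the first field recurs.
        if record is None or labels[path] == "field1":
            record = ET.SubElement(root, "record")
        ET.SubElement(record, labels[path]).text = text
    return ET.tostring(root, encoding="unicode")
```

Calling `tokens_to_xml` on a page whose list items each contain a `<b>` name and an `<i>` price would yield one `<record>` per item with `<field1>` and `<field2>` children. Exact tag-path matching is brittle on real pages, which is presumably why the paper relies on clustering rather than exact equality.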
Pages: 349 / +
Page count: 3