Automatic data record detection in Web Pages

Cited by: 0

Authors:

Gao, Xiaoying [1]

Vuong, Le Phong Bao [1]

Zhang, Mengjie [1]

Affiliation:

[1] Victoria Univ Wellington, Sch Math Stat & Comp Sci, POB 600, Wellington, New Zealand

Source:

KNOWLEDGE SCIENCE, ENGINEERING AND MANAGEMENT | 2007 / Vol. 4798

Keywords:

DOI:

Not available

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification code:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Wrapper induction is currently the main technology for data extraction from semi-structured web pages. However, wrapper induction requires training web pages, and the information extraction process is quite complex, involving pattern induction, data extraction and data transformation. This paper introduces a new approach that achieves automatic data extraction by applying clustering to detect similar text tokens, developing a new method to label text tokens so as to capture the hierarchical structure of HTML pages, and developing an algorithm for transforming labelled text tokens to XML. The approach is examined and compared with a number of existing wrapper induction systems on three different sets of web pages. The results suggest that the new approach is effective for data extraction and that it outperforms existing approaches on these web sites. The approach requires no training and has no explicit processes for pattern induction or data extraction, so the whole process is simplified.


Pages: 349 / +

Number of pages: 3

Total: 50 items

[1] Automatic template detection for structured web pages
Lo, Lawrence
Ng, Vincent To-Yee
Ng, Patrick
Chan, Stephen C. F.
[J]. 2006 10TH INTERNATIONAL CONFERENCE ON COMPUTER SUPPORTED COOPERATIVE WORK IN DESIGN, PROCEEDINGS, VOLS 1 AND 2, 2006, : 708 - 713
[2] Automatic data extraction from data-rich web pages
Hu, DD
Meng, XF
[J]. DATABASE SYSTEMS FOR ADVANCED APPLICATIONS, PROCEEDINGS, 2005, 3453 : 828 - 839
[3] Automatic Role Detection of Visual Elements of Web Pages for Automatic Accessibility Evaluation
Duarte, Carlos
Salvado, Ana
Akpinar, M. Elgin
Yesilada, Yeliz
Carrico, Luis
[J]. 15TH INTERNATIONAL WEB FOR ALL CONFERENCE (W4A) 2018, 2018,
[4] Towards automatic semantic annotation of data rich Web pages
Jellouli, Ismail
El Mohajir, Mohammed
[J]. RCIS 2009: PROCEEDINGS OF THE IEEE INTERNATIONAL CONFERENCE ON RESEARCH CHALLENGES IN INFORMATION SCIENCE, 2009, : 139 - 142
[5] Automatic data extraction from template generated web pages
Ma, L
Goharian, N
Chowdhury, A
[J]. PDPTA'03: PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED PROCESSING TECHNIQUES AND APPLICATIONS, VOLS 1-4, 2003, : 642 - 648
[6] Automatic Pornographic Detection in Web Pages Based on Images and Text Data Using Support Vector Machine
Sharma, Jayash
Pathak, Vinay Kumar
[J]. PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON SOFT COMPUTING FOR PROBLEM SOLVING (SOCPROS 2011), VOL 2, 2012, 131 : 473 - +
[7] A robust approach of automatic web data record extraction
School of Computer Science and Technology, Shandong University, Jinan, China
Unknown
[J]. J. Comput. Inf. Syst., 2009, 6 (1757-1766):
[8] Automatic fragment detection in dynamic Web pages and its impact on caching
Ramaswamy, L
Iyengar, A
Liu, L
Douglis, F
[J]. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2005, 17 (06) : 859 - 874
[9] Automatic Data Extraction from Lists in Web Pages Based on XML
Xin, Zhou
Hao, Wang
[J]. ADVANCED TECHNOLOGY IN TEACHING - PROCEEDINGS OF THE 2009 3RD INTERNATIONAL CONFERENCE ON TEACHING AND COMPUTATIONAL SCIENCE (WTCS 2009), VOL 2: EDUCATION, PSYCHOLOGY AND COMPUTER SCIENCE, 2012, 117 : 915 - 921
[10] Automatic generation of agents for collecting hidden Web pages for data extraction
Lage, JP
da Silva, AS
Golgher, PB
Laender, AHF
[J]. DATA & KNOWLEDGE ENGINEERING, 2004, 49 (02) : 177 - 196
