Structured and semantic data extraction from Web pages

Cited by: 0

Authors:

Gan, Y [1]

Zhang, SZ [1]

Affiliations:

[1] Xian Jiaotong Univ, Sch Elect & Informat Engn, Xian 710049, Peoples R China

Source:

PROCEEDINGS OF THE 2004 INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND CYBERNETICS, VOLS 1-7 | 2004

Keywords:

data integration; data extraction; wrapper; Web source;

DOI:

Not available

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

With the development of the Internet, the Web has become an invaluable information source. To use this information for more than human browsing, Web pages in HTML must be converted into a format meaningful to software programs. Wrappers are a useful technique for converting HTML documents into semantically meaningful XML files. In this paper, we propose a data extraction approach based on a user-predefined schema, which automatically generates a wrapper that extracts data from an HTML document and produces an XML document conforming to a given DTD. After the user defines the extraction schema in the form of a DTD, the wrapper is generated automatically by an induction and learning algorithm. Experiments indicate that the approach can extract the required data from the source document with high accuracy.
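To make the idea of a wrapper concrete, the following is a minimal sketch of what one might do once it exists: read an HTML table and emit XML shaped by a target schema. It is not the paper's method. The abstract does not describe the induction and learning algorithm, so the extraction rules here are written by hand rather than learned. The DTD, the `RULES` mapping, the `RowWrapper` class, the `html_to_xml` function, and the sample markup are all hypothetical, and the output is not validated against the DTD.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical target schema, shown as a DTD for reference only:
#   <!ELEMENT books (book*)>
#   <!ELEMENT book (title, price)>
# Hand-written (not induced) rules: HTML class attribute -> XML element name.
RULES = {"t": "title", "p": "price"}
RECORD_TAG = "tr"  # assumption: each table row holds one record


class RowWrapper(HTMLParser):
    """Collects one dict per record, keyed by XML element name."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._field = None  # XML element name for the text being read

    def handle_starttag(self, tag, attrs):
        if tag == RECORD_TAG:
            self.records.append({})
        elif self.records:
            # None when the tag has no class or the class has no rule.
            self._field = RULES.get(dict(attrs).get("class"))

    def handle_data(self, data):
        if self._field and data.strip():
            self.records[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        self._field = None


def html_to_xml(html):
    """Apply the wrapper to HTML text and serialize the records as XML."""
    parser = RowWrapper()
    parser.feed(html)
    root = ET.Element("books")
    for rec in parser.records:
        book = ET.SubElement(root, "book")
        for name in ("title", "price"):  # element order follows the DTD
            ET.SubElement(book, name).text = rec.get(name, "")
    return ET.tostring(root, encoding="unicode")


sample = '<table><tr><td class="t">XML Basics</td><td class="p">10</td></tr></table>'
print(html_to_xml(sample))
```

In the approach the abstract describes, the user supplies only the DTD and the rules are generated automatically; the hand-written `RULES` table stands in for that generated part.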


Pages: 2930-2935

Page count: 6
