Research of Extracting Data from HTML Web Pages Automatically

Cited by: 1

Authors:

王茹

宋瀚涛

陆玉昌

Affiliations:

[1] Department of Computer Science and Engineering, School of Information Science and Technology, Beijing Institute of Technology, Beijing 100081, China

[2] State Key Laboratory of Intelligent Technology and System, Tsinghua University, Beijing 100084, China

Source:

Journal of Beijing Institute of Technology | 2003, Issue S1

Keywords:

information extraction; data transformation; wrapper; HTML page

DOI:

10.15918/j.jbit1004-0579.2003.s1.023

Chinese Library Classification (CLC) number:

TP393.092

Discipline classification code:

080402

Abstract:

To make use of the data available on the Internet, it is necessary to extract data from web pages. An HTT tree model for representing HTML pages is presented. Based on the HTT model, a wrapper generation algorithm, AGW, is proposed. The AGW algorithm uses a comparing-and-correcting technique to generate the wrapper, exploiting the native characteristics of the HTT tree structure. The AGW algorithm not only generates the wrapper automatically but also makes it easy to rebuild the data schema and reduces computational complexity.
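
The abstract does not give the details of the HTT model or of AGW, so the following is only an illustrative sketch of the general idea of generating a wrapper by comparing tree representations of pages, not the authors' algorithm. It assumes two well-formed pages built from the same template (every tag explicitly closed), parses each into a simple tag tree, and treats nodes whose text differs as data slots. The names `Node`, `TreeBuilder`, `compare`, and `extract` are hypothetical and chosen for this example only.

```python
from html.parser import HTMLParser


class Node:
    """One node of a simple tag tree: tag name, text, and child nodes."""

    def __init__(self, tag):
        self.tag = tag
        self.text = ""
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a tag tree; assumes every opened tag is explicitly closed."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        self.stack[-1].text += data.strip()


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def compare(a, b, path=()):
    """Return child-index paths of nodes whose text differs between two trees."""
    slots = []
    if a.text != b.text:
        slots.append(path)
    for i, (child_a, child_b) in enumerate(zip(a.children, b.children)):
        if child_a.tag == child_b.tag:
            slots.extend(compare(child_a, child_b, path + (i,)))
    return slots


def extract(tree, slots):
    """Apply the wrapper (a list of paths) to a parsed page."""
    values = []
    for path in slots:
        node = tree
        for i in path:
            node = node.children[i]
        values.append(node.text)
    return values


page1 = "<html><body><h1>Books</h1><p><b>Title A</b><i>10</i></p></body></html>"
page2 = "<html><body><h1>Books</h1><p><b>Title B</b><i>25</i></p></body></html>"
page3 = "<html><body><h1>Books</h1><p><b>Title C</b><i>7</i></p></body></html>"

# Text that is identical across pages (the heading) is treated as template;
# text that differs (title, price) becomes a data slot in the wrapper.
wrapper = compare(parse(page1), parse(page2))
print(extract(parse(page3), wrapper))  # ['Title C', '7']
```

A wrapper in this sketch is just a list of paths into the tree, so applying it to a new page from the same template is a simple walk. Real pages with optional or repeated elements would need the kind of correction step the abstract mentions, which this example does not attempt.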


Pages: 104-108

Page count: 5
