Research of Extracting Data from HTML Web Pages Automatically

Cited by: 1
Authors
王茹
宋瀚涛
陆玉昌
Affiliations
[1] Department of Computer Science and Engineering, School of Information Science and Technology, Beijing Institute of Technology, Beijing 100081, China
[2] State Key Laboratory of Intelligent Technology and System, Tsinghua University, Beijing 100084, China
Keywords
information extraction; data transformation; wrapper; HTML page
DOI
10.15918/j.jbit1004-0579.2003.s1.023
Chinese Library Classification number
TP393.092
Discipline classification code
080402
Abstract
To make use of data on the Internet, it is necessary to extract data from web pages. An HTT tree model for representing HTML pages is presented. Based on the HTT model, a wrapper generation algorithm, AGW, is proposed. The AGW algorithm uses a comparing-and-correcting technique to generate the wrapper, exploiting the native characteristics of the HTT tree structure. The AGW algorithm can not only generate the wrapper automatically, but also rebuild the data schema easily and reduce computational complexity.
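The abstract does not define the HTT model or the steps of AGW, so the following is only a rough sketch of the general idea of tree-based wrapper induction, not the authors' algorithm. It assumes that two pages generated from the same template share a tag tree, and that text nodes which differ between them are data fields. The names `Node`, `TreeBuilder`, `parse`, `data_paths`, and `extract` are hypothetical, and the sample pages are invented for illustration.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    """One node of a simple tag tree (an assumed stand-in, not the HTT model)."""

    def __init__(self, tag, text=""):
        self.tag = tag
        self.text = text
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a tag tree from HTML using the standard-library parser."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.stack[-1].children.append(Node("#text", text))


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def data_paths(a, b, path=()):
    """Compare two trees and return child-index paths of differing text nodes."""
    if a.tag != b.tag:
        return []  # structural mismatch: this sketch skips the subtree
    if a.tag == "#text":
        return [path] if a.text != b.text else []
    found = []
    for i, (child_a, child_b) in enumerate(zip(a.children, b.children)):
        found.extend(data_paths(child_a, child_b, path + (i,)))
    return found


def extract(tree, paths):
    """Apply the learned paths (the 'wrapper') to a new page's tree."""
    values = []
    for path in paths:
        node = tree
        for i in path:
            node = node.children[i]
        values.append(node.text)
    return values


TEMPLATE = ("<html><body><h1>Books</h1><p>Title: <b>{}</b></p>"
            "<p>Price: <b>{}</b></p></body></html>")
wrapper = data_paths(parse(TEMPLATE.format("Dune", "9.99")),
                     parse(TEMPLATE.format("Emma", "4.50")))
print(extract(parse(TEMPLATE.format("Ulysses", "12.00")), wrapper))
# ['Ulysses', '12.00']
```

The sketch covers only the comparing step. It has no correcting step, so pages whose structure differs (for example, a variable number of records) are not handled.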
Pages: 104-108
Page count: 5