Structrued and semantic data extraction from Web pages

被引:0
|
作者
Gan, Y [1 ]
Zhang, SZ [1 ]
机构
[1] Xian Jiaotong Univ, Sch Elect & Informat Engn, Xian 710049, Peoples R China
关键词
data integration; data extraction; wrapper; Web source;
D O I
暂无
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
With the development of the Internet, the Web has become an invaluable information source. In order to use this information for more than human browsing, web pages in HTML must be converted into a format meaningful to software programs. Wrappers have been a useful technique to convert HTML documents into semantically meaningful XML files. In this paper, we propose a data extraction approach based on the user pre-defined schema which generates automatically a wrapper to extract data from an HTML document, and produce an XML document conforming to given DTD. After the user define extraction data schema in the form of DTD, the wrapper is generated automatically with the induction and leaning algorithm. The experiment indicates that the approach can extract the required data from the source document with high accuracy.
引用
收藏
页码:2930 / 2935
页数:6
相关论文
共 50 条
  • [1] Data extraction from Deep Web pages
    Yang, Jufeng
    Shi, Guangshun
    Zheng, Yan
    Wang, Qingren
    [J]. CIS: 2007 INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE AND SECURITY, PROCEEDINGS, 2007, : 237 - 241
  • [2] Keyphrase extraction from Chinese news web pages based on semantic relations
    Xie, Fei
    Wu, Xindong
    Hu, Xue-Gang
    Wang, Fei-Yue
    [J]. INTELLIGENCE AND SECURITY INFORMATICS, PROCEEDINGS, 2008, 5075 : 490 - +
  • [3] Automatic data extraction from data-rich web pages
    Hu, DD
    Meng, XF
    [J]. DATABASE SYSTEMS FOR ADVANCED APPLICATIONS, PROCEEDINGS, 2005, 3453 : 828 - 839
  • [4] Schema Inference and Data Extraction from Templatized Web Pages
    Krishna, Shinde Santaji
    Dattatraya, Joshi Shashank
    [J]. 2015 INTERNATIONAL CONFERENCE ON PERVASIVE COMPUTING (ICPC), 2015,
  • [5] Keyphrase extraction from Chinese news web pages based on semantic relations
    Xie, Fei
    Wu, Xindong
    Hu, Xue-Gang
    Wang, Fei-Yue
    [J]. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2008, 5075 : 490 - 495
  • [6] Extraction of flat and nested data records from web pages
    Algur, Siddu P.
    Hiremath, P.S.
    [J]. Conferences in Research and Practice in Information Technology Series, 2006, 61 : 163 - 168
  • [7] Automatic data extraction from template generated web pages
    Ma, L
    Goharian, N
    Chowdhury, A
    [J]. PDPTA'03: PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON PARALLEL AND DISTRIBUTED PROCESSING TECHNIQUES AND APPLICATIONS, VOLS 1-4, 2003, : 642 - 648
  • [8] Information Extraction from Web pages
    Novotny, Robert
    Vojtas, Peter
    Maruscak, Dusan
    [J]. 2009 IEEE/WIC/ACM INTERNATIONAL JOINT CONFERENCES ON WEB INTELLIGENCE (WI) AND INTELLIGENT AGENT TECHNOLOGIES (IAT), VOL 3, 2009, : 121 - +
  • [9] Data extraction and annotation for dynamic web pages
    Song, H
    Giri, S
    Ma, FY
    [J]. 2004 IEEE INTERNATIONAL CONFERNECE ON E-TECHNOLOGY, E-COMMERE AND E-SERVICE, PROCEEDINGS, 2004, : 499 - 502
  • [10] Automatic Data Extraction from Lists in Web Pages Based on XML
    Xin, Zhou
    Hao, Wang
    [J]. ADVANCED TECHNOLOGY IN TEACHING - PROCEEDINGS OF THE 2009 3RD INTERNATIONAL CONFERENCE ON TEACHING AND COMPUTATIONAL SCIENCE (WTCS 2009), VOL 2: EDUCATION, PSYCHOLOGY AND COMPUTER SCIENCE, 2012, 117 : 915 - 921