Structured and semantic data extraction from Web pages

Cited by: 0
Authors
Gan, Y [1 ]
Zhang, SZ [1 ]
Affiliations
[1] Xian Jiaotong Univ, Sch Elect & Informat Engn, Xian 710049, Peoples R China
Keywords
data integration; data extraction; wrapper; Web source;
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
With the development of the Internet, the Web has become an invaluable information source. To use this information for more than human browsing, Web pages in HTML must be converted into a format meaningful to software programs. Wrappers are a useful technique for converting HTML documents into semantically meaningful XML files. In this paper, we propose a data extraction approach based on a user-predefined schema: it automatically generates a wrapper that extracts data from an HTML document and produces an XML document conforming to a given DTD. After the user defines the extraction schema in the form of a DTD, the wrapper is generated automatically by an induction and learning algorithm. Experiments indicate that the approach extracts the required data from the source document with high accuracy.
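As a rough illustration of what a wrapper does (turning HTML into XML that follows a fixed target structure), here is a minimal Python sketch. It is not the induction and learning algorithm of the paper: the extraction rules are written by hand, and the names `FIELD_RULES`, `RECORD_CLASS`, `SimpleWrapper`, `html_to_xml`, the `catalog`/`item` element names, and the sample HTML classes are all hypothetical.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical hand-written rules: HTML class attribute -> target XML element.
# In the paper these rules are induced from the user's DTD; here they are fixed.
FIELD_RULES = {"title": "title", "price": "price"}
RECORD_CLASS = "item"  # hypothetical class that marks one record in the page


class SimpleWrapper(HTMLParser):
    """Collects one dict per record, filled from elements matching FIELD_RULES."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._field = None  # target field waiting for its text, if any

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls == RECORD_CLASS:
            self.records.append({})
        elif cls in FIELD_RULES and self.records:
            self._field = FIELD_RULES[cls]

    def handle_data(self, data):
        text = data.strip()
        # Take the first non-empty text that follows a matching start tag.
        if self._field is not None and text:
            self.records[-1][self._field] = text
            self._field = None


def html_to_xml(html):
    """Extract records from HTML and serialize them as an XML string."""
    parser = SimpleWrapper()
    parser.feed(html)
    parser.close()
    root = ET.Element("catalog")
    for rec in parser.records:
        item = ET.SubElement(root, "item")
        for name in FIELD_RULES.values():
            ET.SubElement(item, name).text = rec.get(name, "")
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    page = (
        '<ul><li class="item"><span class="title">Book A</span>'
        '<span class="price">12</span></li></ul>'
    )
    print(html_to_xml(page))
```

The output structure here corresponds to a DTD such as `<!ELEMENT catalog (item*)>` with `<!ELEMENT item (title, price)>`; the sketch does not validate against a DTD, since the Python standard library has no DTD validator.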
Pages: 2930-2935
Number of pages: 6
Related papers
50 in total
  • [41] A novel alignment algorithm for effective web data extraction from singleton-item pages
    Oviliani Yenty Yuliana
    Chia-Hui Chang
    [J]. Applied Intelligence, 2018, 48 : 4355 - 4370
  • [42] Extraction of core web content from web pages using noise elimination
    Saravanan A.
    Bama S.S.
    [J]. Journal of Engineering Science and Technology Review, 2020, 13 (04) : 173 - 187
  • [43] Extraction of web news from web pages using a ternary tree approach
    Laishram, Debina
    Sebastian, Merin
    [J]. 2015 SECOND INTERNATIONAL CONFERENCE ON ADVANCES IN COMPUTING AND COMMUNICATION ENGINEERING ICACCE 2015, 2015, : 628 - 633
  • [44] Applying semantic links for classifying Web pages
    Choi, B
    Guo, Q
    [J]. DEVELOPMENTS IN APPLIED ARTIFICIAL INTELLIGENCE, 2003, 2718 : 148 - 153
  • [45] Domain patterns and semantic annotation of web pages
    Kudelka, Milos
    Snasel, Vaclav
    El-Qawasmeh, Eyas
    Lehecka, Ondrej
    Tesarik, Jiri
    [J]. 2006 1ST INTERNATIONAL CONFERENCE ON DIGITAL INFORMATION MANAGEMENT, 2006, : 504 - +
  • [46] Zero-shot Entity Extraction from Web Pages
    Pasupat, Panupong
    Liang, Percy
    [J]. PROCEEDINGS OF THE 52ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1, 2014, : 391 - 401
  • [47] Person Attribute Extraction from the Textual Parts of Web Pages
    Istvan, Nagy T.
    [J]. ACTA CYBERNETICA, 2012, 20 (03) : 419 - 440
  • [48] Automatic Extraction of Textual Elements from News Web Pages
    Ibrahim, Hossam
    Darwish, Kareem
    Abdel-sabor, Abdel-Rahim
    [J]. SIXTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, LREC 2008, 2008, : 1600 - 1603
  • [49] TEXT: Automatic Template Extraction from Heterogeneous Web Pages
    Kim, Chulyun
    Shim, Kyuseok
    [J]. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2011, 23 (04) : 612 - 626
  • [50] Unsupervised Keyphrase Extraction for Web Pages
    Haarman, Tim
    Zijlema, Bastiaan
    Wiering, Marco
    [J]. MULTIMODAL TECHNOLOGIES AND INTERACTION, 2019, 3 (03)