Structured and semantic data extraction from Web pages

Cited by: 0

Authors:

Gan, Y [1]

Zhang, SZ [1]

Affiliations:

[1] Xian Jiaotong Univ, Sch Elect & Informat Engn, Xian 710049, Peoples R China

Source:

PROCEEDINGS OF THE 2004 INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND CYBERNETICS, VOLS 1-7 | 2004

Keywords:

data integration; data extraction; wrapper; Web source;

DOI:

Not available

Chinese Library Classification:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

With the development of the Internet, the Web has become an invaluable information source. To use this information for more than human browsing, Web pages in HTML must be converted into a format meaningful to software programs. Wrappers are a useful technique for converting HTML documents into semantically meaningful XML files. In this paper, we propose a data extraction approach based on a user-defined schema that automatically generates a wrapper to extract data from an HTML document and produce an XML document conforming to a given DTD. After the user defines the extraction schema in the form of a DTD, the wrapper is generated automatically by an induction and learning algorithm. Experiments indicate that the approach extracts the required data from the source document with high accuracy.
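To make the idea of a wrapper concrete, the following is a minimal sketch of a hand-written one in Python. It is not the induction and learning algorithm described in the abstract, whose details are not given here; the schema (`books`, `book`, `title`, `price`), the `RULES` mapping from HTML class attributes to XML fields, and the sample page are all hypothetical, chosen only to show HTML going in and schema-shaped XML coming out.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical target schema, standing in for a user-supplied DTD such as:
#   <!ELEMENT books (book*)>  <!ELEMENT book (title, price)>
RECORD_TAG = "book"
# Hypothetical extraction rules: HTML class attribute -> XML field name.
# In the paper these rules would be learned; here they are written by hand.
RULES = {"t": "title", "p": "price"}


class TableWrapper(HTMLParser):
    """Collects one record per <tr> and one field per <td> whose class is in RULES."""

    def __init__(self):
        super().__init__()
        self.records = []    # list of dicts, one per table row
        self._field = None   # XML field name of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.records.append({})
        elif tag == "td" and self.records:
            self._field = RULES.get(dict(attrs).get("class"))

    def handle_endtag(self, tag):
        if tag == "td":
            self._field = None

    def handle_data(self, data):
        if self._field and data.strip():
            self.records[-1][self._field] = data.strip()


def html_to_xml(html: str) -> str:
    """Run the wrapper over an HTML string and serialize the records as XML."""
    wrapper = TableWrapper()
    wrapper.feed(html)
    root = ET.Element(RECORD_TAG + "s")
    for rec in wrapper.records:
        node = ET.SubElement(root, RECORD_TAG)
        for field in RULES.values():  # emit fields in schema order
            ET.SubElement(node, field).text = rec.get(field, "")
    return ET.tostring(root, encoding="unicode")


page = '<table><tr><td class="t">XML Basics</td><td class="p">30</td></tr></table>'
xml_out = html_to_xml(page)
print(xml_out)
```

The sketch does not validate its output against a DTD; the element order is simply fixed by the order of `RULES`. A wrapper generator of the kind the abstract describes would produce such rules automatically from the schema and sample pages instead of relying on hand-picked class names.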

Pages: 2930-2935

Page count: 6
