Flexible Approach for Web Information Extraction Based on HTMLParser

Cited by: 0

Authors:

Shan, Lin [1]

Qun, Zhang [1]

Affiliations:

[1] Hubei Univ Technol, Sch Comp Sci, Wuhan, Peoples R China

Source:

PROCEEDINGS OF 2012 7TH INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE & EDUCATION, VOLS I-VI | 2012

Keywords:

information extraction; Web crawler; HTMLParser; filter; visitor; custom tags

DOI:

Not available

Chinese Library Classification number:

TP301 [Theory, Methods]

Discipline classification code:

081202

Abstract:

The Internet now presents users with a huge amount of information, so extracting information quickly and effectively from various sources has become very important. Web information extraction is a key element not only of Web crawlers and search engines but also of many specialized services such as competitive intelligence tools. This paper proposes a flexible, high-performance approach to Web information extraction. HTMLParser is a parsing library used mainly to transform HTML Web pages or extract information from them. It represents an HTML page with Node, Abstract Node, and Tag. It extracts information in two main ways: filters and visitors. With HTMLParser, hyperlinks, email addresses, titles, and similar items can be extracted conveniently. The paper also extends HTMLParser to extract custom tags in certain Web pages, widening its area of application. Experimental results confirm the feasibility of the approach.
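
As a rough illustration of the filter and visitor styles named in the abstract, the sketch below uses Python's standard-library html.parser module, not the Java HTMLParser library the paper describes. The class names LinkAndTitleVisitor and TagFilter, the helper functions extract and filter_tags, and the sample page (including its custom price tag) are hypothetical and are not taken from the paper.

```python
# Illustrative sketch only: Python's standard-library html.parser,
# NOT the Java HTMLParser library discussed in the abstract.
from html.parser import HTMLParser

# Hypothetical sample page containing a title, a hyperlink, an email link,
# and a non-standard (custom) tag.
SAMPLE_PAGE = """<html><head><title>Demo page</title></head>
<body><a href="http://example.com/a">A</a>
<a href="mailto:someone@example.com">Mail</a>
<price currency="USD">9.99</price></body></html>"""


class LinkAndTitleVisitor(HTMLParser):
    """Visitor style: callbacks fire for each node as the page is walked."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.emails = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if not href:
                return
            if href.startswith("mailto:"):
                # Treat mailto links as email addresses.
                self.emails.append(href[len("mailto:"):])
            else:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Collect text only while inside the <title> element.
        if self._in_title:
            self.title += data


class TagFilter(HTMLParser):
    """Filter style: keep only start tags whose name matches tag_name."""

    def __init__(self, tag_name):
        super().__init__()
        self.tag_name = tag_name.lower()  # html.parser lowercases tag names
        self.matches = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag_name:
            self.matches.append(dict(attrs))


def extract(html):
    """Walk the page once and return the visitor holding what it collected."""
    visitor = LinkAndTitleVisitor()
    visitor.feed(html)
    visitor.close()
    return visitor


def filter_tags(html, tag_name):
    """Return the attribute dictionaries of every start tag named tag_name."""
    tag_filter = TagFilter(tag_name)
    tag_filter.feed(html)
    tag_filter.close()
    return tag_filter.matches
```

Calling extract(SAMPLE_PAGE) collects the title, hyperlinks, and email addresses in one pass, while filter_tags(SAMPLE_PAGE, "price") picks out only the custom tag. Because the parser reports unknown tag names through the same callback as standard ones, custom tags need no special handling in this sketch; how the paper extends HTMLParser for custom tags is not described in the abstract.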

Pages: 683-686

Page count: 4
