Flexible Approach for Web Information Extraction Based on HTMLParser

Authors
Shan, Lin [1]
Qun, Zhang [1]
Affiliations
[1] Hubei Univ Technol, Sch Comp Sci, Wuhan, Peoples R China
Keywords
information extraction; Web crawler; HTMLParser; filter; visitor; custom tags
DOI
Not available
Chinese Library Classification (CLC) number
TP301 [Theory, methods]
Discipline classification code
081202
Abstract
The Internet now presents users with a huge amount of information, so extracting information quickly and effectively from various sources has become very important. Web information extraction is a key element not only of Web crawlers and search engines but also of many specialized services such as competitive intelligence tools. This paper recommends a flexible, high-performance approach to Web information extraction based on HTMLParser, a parsing library mainly used to transform or extract information from HTML pages. HTMLParser uses Node, Abstract Node, and Tag to represent an HTML page, and it extracts information mainly in two ways: filters and visitors. With HTMLParser, we can conveniently extract hyperlinks, email addresses, titles, etc. We also extend HTMLParser to extract custom tags in certain web pages, which expands its application area. Experimental results confirm the feasibility of the approach.
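
The two extraction styles named in the abstract can be illustrated with a minimal sketch. It does not use the HTMLParser library the paper describes; as an assumption for illustration it swaps in Python's standard-library html.parser module, and the names LinkAndTitleVisitor, TagCollector, filter_tags, extract, and SAMPLE (including its custom <price> tag) are hypothetical. The visitor reacts to tags through callbacks while the page is scanned; the filter collects tags first and then keeps those matching a predicate.

```python
from html.parser import HTMLParser


class LinkAndTitleVisitor(HTMLParser):
    """Visitor style: callbacks fire for each tag as the page is scanned."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.emails = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if not href:
                return
            if href.startswith("mailto:"):
                # Email addresses are taken from mailto: hyperlinks only.
                self.emails.append(href[len("mailto:"):])
            else:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


class TagCollector(HTMLParser):
    """Records every start tag so that a filter can be applied afterwards."""

    def __init__(self):
        super().__init__()
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        self.nodes.append((tag, dict(attrs)))


def filter_tags(html, predicate):
    """Filter style: keep only the (tag, attrs) pairs accepted by predicate."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return [node for node in collector.nodes if predicate(node)]


def extract(html):
    """Run the visitor over a page and return (title, links, emails)."""
    visitor = LinkAndTitleVisitor()
    visitor.feed(html)
    visitor.close()
    return visitor.title, visitor.links, visitor.emails


# Hypothetical page; <price> stands in for a site-specific custom tag.
SAMPLE = """<html><head><title>Demo page</title></head>
<body>
<a href="http://example.com/a">A</a>
<a href="mailto:info@example.com">Mail</a>
<price currency="USD">9.99</price>
</body></html>"""

if __name__ == "__main__":
    print(extract(SAMPLE))
    print(filter_tags(SAMPLE, lambda node: node[0] == "price"))
```

Because html.parser reports unknown tag names through the same callbacks as standard ones, the custom <price> tag needs no extra registration in this sketch; the library discussed in the paper may handle custom tags differently.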
Pages: 683 - 686
Page count: 4
Related papers
  • [1] FLEXIBLE WEB INFORMATION EXTRACTION WITH HTMLPARSER
    Shan, Lin
    [J]. 3RD INTERNATIONAL CONFERENCE ON INFORMATION TECHNOLOGY AND COMPUTER SCIENCE (ITCS 2011), PROCEEDINGS, 2011, : 295 - 298
  • [2] A Web Information Extraction method Based on HTML Parser
    Zhang, Zhiming
    Huang, Shuaishuai
    Li, Ping
    [J]. ADVANCED TECHNOLOGIES IN MANUFACTURING, ENGINEERING AND MATERIALS, PTS 1-3, 2013, 774-776 : 1802 - 1806
  • [3] Layout based information extraction from HTML documents
    Burget, Radek
    [J]. ICDAR 2007: NINTH INTERNATIONAL CONFERENCE ON DOCUMENT ANALYSIS AND RECOGNITION, VOLS I AND II, PROCEEDINGS, 2007, : 624 - 628
  • [4] Automated information mediator for HTML and XML based web information delivery service
    Park, SS
    Kim, YS
    Park, GC
    Kang, BH
    Compton, P
    [J]. AI 2005: ADVANCES IN ARTIFICIAL INTELLIGENCE, 2005, 3809 : 401 - 404
  • [5] Information extraction from HTML: Application of a general machine learning approach
    Freitag, D
    [J]. FIFTEENTH NATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE (AAAI-98) AND TENTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE (IAAI-98) - PROCEEDINGS, 1998, : 517 - 523
  • [6] HTML-LSTM: Information Extraction from HTML Tables in Web Pages Using Tree-Structured LSTM
    Kawamura, Kazuki
    Yamamoto, Akihiro
    [J]. DISCOVERY SCIENCE (DS 2021), 2021, 12986 : 29 - 43
  • [7] Towards Flexible Mashup of Web Applications Based on Information Extraction and Transfer
    Guo, Junxia
    Han, Hao
    Tokuda, Takehiro
    [J]. WEB INFORMATION SYSTEM ENGINEERING-WISE 2010, 2010, 6488 : 602 - +
  • [8] Extraction and integration information in HTML tables
    Li, SJ
    Peng, ZY
    Liu, MC
    [J]. FOURTH INTERNATIONAL CONFERENCE ON COMPUTER AND INFORMATION TECHNOLOGY, PROCEEDINGS, 2004, : 315 - 320
  • [9] A HTML to WML Translating Model Based on Information Extraction for Mobile Commerce
    Song, Mingqiu
    Yu, Bo
    [J]. 2008 4TH INTERNATIONAL CONFERENCE ON WIRELESS COMMUNICATIONS, NETWORKING AND MOBILE COMPUTING, VOLS 1-31, 2008, : 9166 - 9169
  • [10] Knowledge driven processing of HTML-based information for intellectual spaces on Web
    Maikevich, NV
    Khoroshevsky, VF
    [J]. KNOWLEDGE-BASED SOFTWARE ENGINEERING, 1998, 48 : 241 - 249