Building web information extraction tasks

Cited by: 1
Authors
Habegger, B [1]
Quafafou, M [1]
Affiliation
[1] Lab Informat Nantes Atlantique, F-44322 Nantes 3, France
DOI
10.1109/WI.2004.10116
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Most recent research in the field of information extraction from the Web has concentrated on the task of extracting the underlying content of a set of similarly structured web pages. However, this is not sufficient for building real-world web information extraction applications. Indeed, building such applications requires fully automating access to web sources, which involves more than extracting the data from web pages. The necessary infrastructure must be set up to query a source, retrieve the result pages, extract the results from these pages, and filter out the unwanted results. In this paper we show how such an infrastructure can be set up. We propose to build a web information extraction application by decomposing it into sub-tasks and describing it in an XML-based language named WetDL. Each sub-task consists of applying a web information extraction specific operation to its input, one of these operators being the application of an extractor. By connecting such operations together, it is possible to define complex applications simply. This is shown in the paper by applying the approach to real-world information extraction tasks, such as extracting DVD listings from Amazon.com and extracting addresses from online telephone directories such as superpages.com.
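The decomposition described in the abstract (query a source, retrieve result pages, extract results, filter unwanted ones) can be illustrated with a minimal Python sketch. This is not WetDL and does not reproduce the authors' operators: every function name, the example.test URL, the page markup, and the regular expression are assumptions made for illustration only. Page retrieval is replaced by an in-memory dictionary so the sketch runs offline.

```python
import re
from typing import Callable, Iterable, Iterator
from urllib.parse import urlencode

# Hypothetical operator type: each sub-task maps a stream of inputs
# to a stream of outputs, so sub-tasks can be connected in sequence.
Operator = Callable[[Iterable], Iterator]


def build_query(base_url: str) -> Operator:
    """Turn each keyword into a query URL for the source."""
    def op(keywords):
        for kw in keywords:
            yield f"{base_url}?{urlencode({'q': kw})}"
    return op


def fetch(get_page: Callable[[str], str]) -> Operator:
    """Retrieve the result page for each URL via an injected function."""
    def op(urls):
        for url in urls:
            yield get_page(url)
    return op


def extract(pattern: str) -> Operator:
    """Apply an extractor (here, a regular expression) to each page."""
    regex = re.compile(pattern)
    def op(pages):
        for page in pages:
            for m in regex.finditer(page):
                yield {"title": m.group("title"),
                       "price": float(m.group("price"))}
    return op


def keep(predicate: Callable[[dict], bool]) -> Operator:
    """Filter out the unwanted results."""
    def op(records):
        return (r for r in records if predicate(r))
    return op


def pipeline(*ops: Operator) -> Operator:
    """Connect operators so each one consumes the previous one's output."""
    def run(items):
        for op in ops:
            items = op(items)
        return items
    return run


# Assumed stand-in for a web source: URL -> result page.
FAKE_WEB = {
    "http://example.test/search?q=matrix":
        "<ul><li>The Matrix - $9.99</li>"
        "<li>The Matrix Reloaded - $14.50</li></ul>",
}

task = pipeline(
    build_query("http://example.test/search"),
    fetch(lambda url: FAKE_WEB.get(url, "")),
    extract(r"<li>(?P<title>.*?) - \$(?P<price>\d+\.\d+)</li>"),
    keep(lambda r: r["price"] < 10),
)

results = list(task(["matrix"]))
print(results)
```

Each operator is a generator over its input, so swapping the fake source for a real HTTP client or the regular expression for a learned extractor would only change the arguments passed to `fetch` and `extract`, not the way the sub-tasks are connected.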
Pages: 349-355
Number of pages: 7