Schema-based Web wrapping

Cited by: 8
Authors
Fazzinga, Bettina [1]
Flesca, Sergio [1]
Tagarelli, Andrea [1]
Affiliations
[1] Univ Calabria, Dept Elect Comp & Syst Sci, I-87036 Arcavacata Di Rende, CS, Italy
Keywords
Web wrapping; Extraction schema; Wrapper generalization; XML; XPath extraction rules; DATA EXTRACTION; INFORMATION EXTRACTION; INDUCTION;
DOI
10.1007/s10115-009-0275-2
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Wrappers are an effective way to automate information extraction from Web pages. A wrapper associates a Web page with an XML document that represents part of the information in that page in a machine-readable format. Most existing wrapping approaches have focused on how to generate extraction rules, while ignoring the potential benefits of using the schema of the extracted information during wrapper evaluation. In this paper, we investigate how the schema of extracted information can be effectively used in both the design and evaluation of a Web wrapper. We define a clean declarative semantics for schema-based wrappers by introducing the notion of (preferred) extraction model, which is essential to compute a valid XML document containing the information extracted from a Web page. We developed the SCRAP (SChema-based wRAPper for web data) system for the proposed schema-based wrapping approach, which also provides visual support tools to the wrapper designer. Moreover, we present a wrapper generalization framework to speed up the design of schema-based wrappers. Experimental evaluation has shown that SCRAP wrappers not only extract the required data successfully but are also robust to changes that may occur in the source Web pages.
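As a rough illustration of the general idea in the abstract (XPath extraction rules that map a page to an XML document, with a schema constraining what counts as a valid result), the following is a minimal sketch. It is not the SCRAP system and does not implement the paper's (preferred) extraction model semantics. The page fragment, the `SCHEMA` and `RULES` dictionaries, and the `wrap` function are all hypothetical names invented for this example, and the "schema" is reduced to a list of required child elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed XHTML fragment standing in for a source Web page.
PAGE = """
<html><body>
  <div class="book"><h2>Title A</h2><span class="price">10.50</span></div>
  <div class="book"><h2>Title B</h2><span class="price">7.00</span></div>
  <div class="book"><h2>Title C</h2></div>
</body></html>
"""

# Toy extraction schema (assumption): every record must contain
# exactly these child elements, in this order.
SCHEMA = {"root": "books", "record": "book", "fields": ["title", "price"]}

# One XPath extraction rule per schema element (assumption).
# Field rules are evaluated relative to a matched record node.
RULES = {
    "record": ".//div[@class='book']",
    "title": "./h2",
    "price": "./span[@class='price']",
}


def wrap(page_source):
    """Apply the extraction rules and keep only schema-conforming records."""
    page = ET.fromstring(page_source.strip())
    out = ET.Element(SCHEMA["root"])
    for node in page.findall(RULES["record"]):
        record = ET.Element(SCHEMA["record"])
        for field in SCHEMA["fields"]:
            hit = node.find(RULES[field])
            if hit is None or not (hit.text or "").strip():
                break  # a required field is missing: discard this record
            ET.SubElement(record, field).text = hit.text.strip()
        else:
            out.append(record)  # all required fields were extracted
    return out


result = wrap(PAGE)
xml_text = ET.tostring(result, encoding="unicode")
```

Here the third `div` has no price, so the toy schema check discards it and `result` holds two `book` elements. Using the schema at evaluation time, rather than only at rule-generation time, is the point the abstract emphasizes; how the paper actually does so is defined by its extraction model, not by this sketch.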
Pages: 127-173
Number of pages: 47