Integrating XML data sources using approximate joins

Cited by: 14
Authors
Guha, Sudipto [1]
Jagadish, H. V.
Koudas, Nick
Srivastava, Divesh
Yu, Ting
Affiliations
[1] Univ Penn, Philadelphia, PA 19104 USA
[2] Univ Michigan, Ann Arbor, MI 48109 USA
[3] Univ Toronto, Toronto, ON, Canada
[4] AT&T Labs Res, Middletown, NJ 07748 USA
[5] N Carolina State Univ, Raleigh, NC 27695 USA
Source
ACM TRANSACTIONS ON DATABASE SYSTEMS | 2006 / Vol. 31 / No. 01
Keywords
algorithms; experimentation; performance; theory; data integration; tree edit distance; XML; joins; approximate joins
DOI
10.1145/1132863.1132868
Chinese Library Classification
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
XML is widely recognized as the data interchange standard of tomorrow because of its ability to represent data from a variety of sources. Hence, XML is likely to be the format through which data from multiple sources is integrated. In this article, we study the problem of integrating XML data sources through correlations realized as join operations. A challenging aspect of this operation is the XML document structure. Two documents might convey approximately or exactly the same information but may be quite different in structure. Consequently, an approximate match in structure, in addition to content, has to be folded into the join operation. We quantify an approximate match in structure and content for pairs of XML documents using well-defined notions of distance. We show how notions of distance that have metric properties can be incorporated in a framework for joins between XML data sources and introduce the idea of reference sets to facilitate this operation. Intuitively, a reference set consists of data elements used to project the data space. We characterize what constitutes a good choice of a reference set, and we propose sampling-based algorithms to identify them. We then instantiate our join framework using the tree edit distance between a pair of trees. We next turn our attention to utilizing well-known index structures to improve the performance of approximate XML join operations. We present a methodology enabling adaptation of index structures for this problem, and we instantiate it in terms of the R-tree. We demonstrate the practical utility of our solutions using large collections of real and synthetic XML data sets, varying parameters of interest, and highlighting the performance benefits of our approach.
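The reference-set idea in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm. It assumes a string edit distance as a stand-in for tree edit distance (both are metrics), a hand-picked reference set instead of a sampled one, and hypothetical function names (`project`, `approximate_join`). Each item is projected to its vector of distances to the reference elements, and the triangle inequality then gives a lower bound on the true distance that lets many pairs be discarded without an exact comparison.

```python
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: a metric, used here as a stand-in for tree edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def project(items, refs, dist):
    """Map each item to its vector of distances to the reference set (hypothetical helper)."""
    return [[dist(x, r) for r in refs] for x in items]


def approximate_join(left, right, refs, tau, dist=edit_distance):
    """Return (pairs, verified): all (x, y) with dist(x, y) <= tau, plus the
    number of candidate pairs that needed an exact distance computation."""
    lv = project(left, refs, dist)
    rv = project(right, refs, dist)
    pairs, verified = [], 0
    for (x, vx), (y, vy) in product(zip(left, lv), zip(right, rv)):
        # Triangle inequality: |d(x, r) - d(y, r)| <= d(x, y) for every reference r,
        # so the largest such gap is a lower bound on the true distance.
        lower = max((abs(a - b) for a, b in zip(vx, vy)), default=0)
        if lower > tau:
            continue  # cannot be within tau; pruned without computing d(x, y)
        verified += 1
        if dist(x, y) <= tau:
            pairs.append((x, y))
    return pairs, verified


if __name__ == "__main__":
    left = ["book", "back"]
    right = ["books", "banana"]
    print(approximate_join(left, right, refs=["book"], tau=1))
```

The filter never discards a true match, because the bound only holds when the distance is a metric. How many pairs it prunes depends on the choice of reference elements, which is the question the abstract's sampling-based selection addresses.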
Pages: 161-207
Page count: 47