Integrating XML data sources using approximate joins

Cited by: 14
Authors
Guha, Sudipto [1]
Jagadish, H. V.
Koudas, Nick
Srivastava, Divesh
Yu, Ting
Affiliations
[1] Univ Penn, Philadelphia, PA 19104 USA
[2] Univ Michigan, Ann Arbor, MI 48109 USA
[3] Univ Toronto, Toronto, ON, Canada
[4] AT&T Labs Res, Middletown, NJ 07748 USA
[5] N Carolina State Univ, Raleigh, NC 27695 USA
Source
ACM TRANSACTIONS ON DATABASE SYSTEMS | 2006 / Vol. 31 / Issue 01
Keywords
algorithms; experimentation; performance; theory; data integration; tree edit distance; XML; joins; approximate joins;
DOI
10.1145/1132863.1132868
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
XML is widely recognized as the data interchange standard of tomorrow because of its ability to represent data from a variety of sources. Hence, XML is likely to be the format through which data from multiple sources is integrated. In this article, we study the problem of integrating XML data sources through correlations realized as join operations. A challenging aspect of this operation is the XML document structure. Two documents might convey approximately or exactly the same information but may be quite different in structure. Consequently, an approximate match in structure, in addition to content, has to be folded into the join operation. We quantify an approximate match in structure and content for pairs of XML documents using well-defined notions of distance. We show how notions of distance that have metric properties can be incorporated in a framework for joins between XML data sources and introduce the idea of reference sets to facilitate this operation. Intuitively, a reference set consists of data elements used to project the data space. We characterize what constitutes a good choice of a reference set, and we propose sampling-based algorithms to identify such sets. We then instantiate our join framework using the tree edit distance between a pair of trees. We next turn our attention to utilizing well-known index structures to improve the performance of approximate XML join operations. We present a methodology enabling adaptation of index structures for this problem, and we instantiate it in terms of the R-tree. We demonstrate the practical utility of our solutions using large collections of real and synthetic XML data sets, varying parameters of interest, and highlighting the performance benefits of our approach.
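The abstract's idea of projecting data onto a reference set under a metric distance can be illustrated with a small sketch. This is not the article's algorithm: it is one generic way a metric's triangle inequality can prune pairs in a threshold join. The function names (`edit_distance`, `approximate_join`) and the threshold `tau` are hypothetical, and unit-cost string edit distance is swapped in for tree edit distance to keep the example short.

```python
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance, a metric.

    Assumption: used here only as a stand-in for tree edit distance.
    """
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def approximate_join(left, right, refs, tau, dist=edit_distance):
    """Return pairs (x, y) with dist(x, y) <= tau, plus the number of
    pairs for which the full distance had to be computed."""
    # Project every item onto its vector of distances to the reference set.
    lvecs = [[dist(x, r) for r in refs] for x in left]
    rvecs = [[dist(y, r) for r in refs] for y in right]

    results, evaluated = [], 0
    for (i, x), (j, y) in product(enumerate(left), enumerate(right)):
        # Triangle inequality: |d(x, r) - d(y, r)| <= d(x, y) for every r,
        # so the largest such gap is a lower bound on d(x, y).
        lower = max((abs(a - b) for a, b in zip(lvecs[i], rvecs[j])), default=0)
        if lower > tau:
            continue  # safely pruned without computing dist(x, y)
        evaluated += 1
        if dist(x, y) <= tau:
            results.append((x, y))
    return results, evaluated


left = ["abc", "abd", "xyzxyz"]
right = ["abc", "xyzxyy"]
pairs, evaluated = approximate_join(left, right, refs=["abc"], tau=1)
print(pairs, evaluated)
```

In this toy input, a single reference element already separates the two clusters, so only half of the six candidate pairs need a full distance computation. The pruning is lossless because the lower bound never exceeds the true distance.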
Pages: 161 - 207
Page count: 47
Related papers
50 in total
  • [1] Approximate joins for data-centric XML
    Augsten, Nikolaus
    Boehlen, Michael
    Dyreson, Curtis
    Gamper, Johann
    2008 IEEE 24TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING, VOLS 1-3, 2008, : 814 - +
  • [2] Approximate Joins for XML Using g-String
    Li, Fei
    Wang, Hongzhi
    Zhang, Cheng
    Hao, Liang
    Li, Jianzhong
    Gao, Hong
    DATABASE AND XML TECHNOLOGIES, 2010, 6309 : 3 - 17
  • [3] Integrating XML sources into a data warehouse
    Vrdoljak, Boris
    Banek, Marko
    Skocir, Zoran
    DATA ENGINEERING ISSUES IN E-COMMERCE AND SERVICES, PROCEEDINGS, 2006, 4055 : 133 - 142
  • [4] Approximate joins for XML at label level
    Li, Fei
    Wang, Hongzhi
    Hao, Liang
    Li, Jianzhong
    Gao, Hong
    INFORMATION SCIENCES, 2014, 282 : 237 - 249
  • [5] Integrating heterogeneous data sources with XML and XQuery
    Gardarin, G
    Mensch, A
    Dang-Ngoc, TT
    Smit, L
    13TH INTERNATIONAL WORKSHOP ON DATABASE AND EXPERT SYSTEMS APPLICATIONS, PROCEEDINGS, 2002, : 839 - 844
  • [6] Index-based approximate XML joins
    Guha, S
    Koudas, N
    Srivastava, D
    Yu, T
    19TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING, PROCEEDINGS, 2003, : 708 - 710
  • [7] Windowed pq-grams for approximate joins of data-centric XML
    Nikolaus Augsten
    Michael Böhlen
    Curtis Dyreson
    Johann Gamper
    The VLDB Journal, 2012, 21 : 463 - 488
  • [8] Windowed pq-grams for approximate joins of data-centric XML
    Augsten, Nikolaus
    Boehlen, Michael
    Dyreson, Curtis
    Gamper, Johann
    VLDB JOURNAL, 2012, 21 (04): : 463 - 488
  • [9] pq-Hash: An Efficient Method for Approximate XML Joins
    Li, Fei
    Wang, Hongzhi
    Hao, Liang
    Li, Jianzhong
    Gao, Hong
    WEB-AGE INFORMATION MANAGEMENT, 2010, 6185 : 125 - 134
  • [10] Integrating and exchanging XML data using ontologies
    Xiao, Huiyong
    Cruz, Isabel F.
    JOURNAL ON DATA SEMANTICS VI, 2006, 4090 : 67 - 89