TASM: Top-k Approximate Subtree Matching

Cited by: 15
Authors
Augsten, Nikolaus [1]
Barbosa, Denilson [2]
Boehlen, Michael [1]
Palpanas, Themis [3]
Affiliations
[1] Free Univ Bozen Bolzano, Fac Comp Sci, Bolzano, Italy
[2] Univ Alberta, Dept Comp Sci, Edmonton, AB, Canada
[3] Univ Trent, Dept Informat Engn & Comp Sci, Trento, Italy
Funding
Natural Sciences and Engineering Research Council of Canada
Keywords
ALGORITHM
DOI
10.1109/ICDE.2010.5447905
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
We consider the Top-k Approximate Subtree Matching (TASM) problem: finding the k best matches of a small query tree, e.g., a DBLP article with 15 nodes, in a large document tree, e.g., DBLP with 26M nodes, using the canonical tree edit distance as a similarity measure between subtrees. Evaluating the tree edit distance for large XML trees is difficult: the best known algorithms have cubic runtime and quadratic space complexity and thus do not scale. Our solution is TASM-postorder, a memory-efficient and scalable TASM algorithm. We prove an upper bound on the maximum subtree size for which the tree edit distance needs to be evaluated. The upper bound depends on the query and is independent of the document size and structure. A core problem is to efficiently prune subtrees that are above this size threshold. We develop an algorithm based on the prefix ring buffer that allows us to prune all subtrees above the threshold in a single postorder scan of the document. The size of the prefix ring buffer is linear in the threshold. As a result, the space complexity of TASM-postorder depends only on k and the query size, and the runtime of TASM-postorder is linear in the size of the document. Our experimental evaluation on large synthetic and real XML documents confirms our analytic results.
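The pruning step described in the abstract can be illustrated with a minimal sketch. It is not the paper's prefix ring buffer; it is a simplified stand-in that assumes the document arrives as a postorder stream of `(label, subtree_size)` pairs and that the size threshold `tau` is already known. The function name `candidate_subtrees`, the input encoding, and the output tuples are all hypothetical choices made for illustration. The sketch reports the maximal subtrees of size at most `tau` in one pass while buffering at most `tau` entries.

```python
from collections import deque
from typing import Iterable, Iterator, Tuple

Node = Tuple[str, int]             # (label, size of the subtree rooted here)
Candidate = Tuple[int, int, str]   # (postorder position of root, size, label)


def candidate_subtrees(postorder: Iterable[Node], tau: int) -> Iterator[Candidate]:
    """Yield the maximal subtrees with at most `tau` nodes in one postorder scan.

    Simplified illustration (an assumption, not the published algorithm).
    In postorder, the subtree rooted at position i with size s occupies
    positions i - s + 1 .. i, so a subtree is tracked as an interval.
    """
    pending = deque()  # disjoint intervals (start, end, label), oldest first
    for i, (label, size) in enumerate(postorder):
        # Any later node that contains position `start` has size >= i - start + 1.
        # Once that exceeds tau, the interval can no longer be absorbed by a
        # small ancestor, so it is maximal and can leave the buffer.
        while pending and pending[0][0] <= i - tau:
            start, end, lab = pending.popleft()
            yield (end, end - start + 1, lab)
        if size <= tau:
            start = i - size + 1
            # The new small subtree absorbs the child subtrees it contains.
            while pending and pending[-1][0] >= start:
                pending.pop()
            pending.append((start, i, label))
        # Nodes with size > tau are pruned: they are never buffered.
    # End of stream: everything still buffered is maximal.
    while pending:
        start, end, lab = pending.popleft()
        yield (end, end - start + 1, lab)
```

After the early flush, every buffered interval starts within the last `tau` positions, so the buffer holds at most `tau` entries regardless of document size. A full TASM implementation would also keep the buffered nodes themselves and rank each candidate against the query by tree edit distance, which this sketch leaves out.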
Pages: 353-364
Page count: 12
Related papers
50 in total
  • [1] Efficient Top-k Approximate Subtree Matching in Small Memory
    Augsten, Nikolaus
    Barbosa, Denilson
    Boehlen, Michael M.
    Palpanas, Themis
    [J]. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2011, 23 (08) : 1123 - 1137
  • [2] Fast Algorithms for Top-k Approximate String Matching
    Yang, Zhenglu
    Yu, Jianjun
    Kitsuregawa, Masaru
    [J]. PROCEEDINGS OF THE TWENTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE (AAAI-10), 2010, : 1467 - 1473
  • [3] A Scalable Index for Top-k Subtree Similarity Queries
    Kocher, Daniel
    Augsten, Nikolaus
    [J]. SIGMOD '19: PROCEEDINGS OF THE 2019 INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA, 2019, : 1624 - 1641
  • [4] Approximate distributed top-k queries
    Boaz Patt-Shamir
    Allon Shafrir
    [J]. Distributed Computing, 2008, 21 : 1 - 22
  • [5] Approximate distributed top-k queries
    Patt-Shamir, Boaz
    Shafrir, Allon
    [J]. DISTRIBUTED COMPUTING, 2008, 21 (01) : 1 - 22
  • [6] Fast, Expressive Top-k Matching
    Culhane, William
    Jayaram, K. R.
    Eugster, Patrick
    [J]. ACM/IFIP/USENIX MIDDLEWARE 2014, 2014, : 73 - 84
  • [7] Approximate top-k queries in sensor networks
    Patt-Shamir, Boaz
    Shafrir, Allon
    [J]. STRUCTURAL INFORMATION AND COMMUNICATION COMPLEXITY, PROCEEDINGS, 2006, 4056 : 319 - +
  • [8] Lightweight Approximate Top-k for Distributed Settings
    Deolalikar, Vinay
    Eshghi, Kave
    [J]. 2014 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2014, : 835 - 844
  • [9] Diversified Top-k Graph Pattern Matching
    Fan, Wenfei
    Wang, Xin
    Wu, Yinghui
    [J]. PROCEEDINGS OF THE VLDB ENDOWMENT, 2013, 6 (13) : 1510 - 1521
  • [10] Diversified Top-k Spatial Pattern Matching
    Xie, Jiahua
    Chen, Hongmei
    Wang, Lizhen
    [J]. SPATIAL DATA AND INTELLIGENCE, SPATIALDI 2022, 2022, 13614 : 87 - 98