A Learned Approach to Design Compressed Rank/Select Data Structures

Cited by: 9
Authors
Boffa, Antonio [1]
Ferragina, Paolo [1]
Vinciguerra, Giorgio [1]
Affiliations
[1] Univ Pisa, Largo Bruno Pontecorvo 3, I-56127 Pisa, Italy
Keywords
Compressed data structures; rank/select dictionaries; piecewise linear approximations; high order entropy; algorithm engineering; RANK; REPRESENTATION; RETRIEVAL; STORAGE
DOI
10.1145/3524060
Chinese Library Classification
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
We address the problem of designing, implementing, and experimenting with compressed data structures that support rank and select queries over a dictionary of integers. We shine a new light on this classical problem by showing a connection between the input integers and the geometry of a set of points in a Cartesian plane suitably derived from them. We then build upon some results in computational geometry to introduce the first compressed rank/select dictionary based on the idea of "learning" the distribution of such points via proper linear approximations (LA). We therefore call this novel data structure the la_vector. We prove time and space complexities of the la_vector in several scenarios: in the worst case, in the case of input distributions with finite mean and variance, and taking into account the kth-order entropy of some of its building blocks. We also discuss improved hybrid data structures, namely, ones that suitably orchestrate known compressed rank/select dictionaries with the la_vector. We corroborate our theoretical results with a large set of experiments over datasets originating from a variety of applications (Web search, DNA sequencing, information retrieval, and natural language processing) and show that our approach provides new interesting space-time tradeoffs with respect to many well-established compressed rank/select dictionary implementations. In particular, we show that our select is the fastest, and our rank is on the space-time Pareto frontier.
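The idea in the abstract can be illustrated with a small Python sketch. This is not the authors' la_vector implementation: the class name `LearnedRankSelect`, the helper `build_segments`, the error bound `eps`, and the greedy "shrinking cone" segmentation are all assumptions made for illustration, and a real structure would bit-pack the corrections. The sketch maps the sorted integers to points (i, x_i), covers them with line segments that miss each point by roughly at most `eps`, and stores a small correction per point so that select stays exact. Here select(i) is 0-based and rank(x) counts the stored integers that are at most x.

```python
import math
from bisect import bisect_right


def build_segments(values, eps):
    """Greedily cover the points (i, values[i]) with line segments.

    Each segment starts at a point and keeps a range of admissible slopes
    (a "shrinking cone"); it is closed when no single slope can stay
    within eps of every point seen so far. Illustrative only.
    """
    segments = []  # tuples: (start_index, base_value, slope, corrections)
    n = len(values)
    start = 0
    while start < n:
        lo, hi = -math.inf, math.inf
        end = start + 1
        while end < n:
            dx = end - start
            new_lo = max(lo, (values[end] - eps - values[start]) / dx)
            new_hi = min(hi, (values[end] + eps - values[start]) / dx)
            if new_lo > new_hi:
                break  # the cone is empty: close the segment here
            lo, hi = new_lo, new_hi
            end += 1
        # A one-point segment never narrows the cone; use a flat line.
        slope = 0.0 if math.isinf(hi) else (lo + hi) / 2
        base = values[start]
        # Corrections make the prediction exact; they stay close to eps.
        corrections = [
            values[i] - (base + math.floor(slope * (i - start)))
            for i in range(start, end)
        ]
        segments.append((start, base, slope, corrections))
        start = end
    return segments


class LearnedRankSelect:
    """Toy rank/select dictionary over a sorted list of distinct integers."""

    def __init__(self, values, eps=4):
        self.n = len(values)
        self.segments = build_segments(values, eps)
        self.starts = [seg[0] for seg in self.segments]

    def select(self, i):
        """Return the i-th smallest stored integer (0-based)."""
        if not 0 <= i < self.n:
            raise IndexError(i)
        k = bisect_right(self.starts, i) - 1  # segment covering position i
        start, base, slope, corrections = self.segments[k]
        return base + math.floor(slope * (i - start)) + corrections[i - start]

    def rank(self, x):
        """Return how many stored integers are <= x (binary search on select)."""
        lo, hi = 0, self.n
        while lo < hi:
            mid = (lo + hi) // 2
            if self.select(mid) <= x:
                lo = mid + 1
            else:
                hi = mid
        return lo
```

For example, `LearnedRankSelect([3, 7, 8, 15], eps=2).select(2)` returns 8 and `.rank(7)` returns 2. The space saving in such a scheme comes from storing one slope per segment plus corrections of a few bits each, rather than every integer in full.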
Pages: 28