LakeCompass: An End-to-End System for Data Maintenance, Search and Analysis in Data Lakes

Cited by: 0
Authors
Chai, Chengliang [1]
Deng, Yuhao [1]
Zhan, Yutong [1]
Cao, Ziqi [1]
Zhang, Yuanfang [1]
Cao, Lei [2]
Wang, Yuping [1]
Zhang, Zhiwei [1]
Yuan, Ye [1]
Wang, Guoren [1]
Tang, Nan [3]
Affiliations
[1] Beijing Inst Technol, Beijing, Peoples R China
[2] Univ Arizona, MIT, Tempe, AZ USA
[3] HKUST, Guangzhou, Peoples R China
Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2024 / Vol. 17 / No. 12
Funding
National Key Research and Development Program of China
DOI
10.14778/3685800.3685880
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Searching for tables in poorly maintained data lakes has long been recognized as a formidable challenge in data management. There are three pivotal tasks: keyword-based, joinable, and unionable table search, which form the backbone of applications that aim to make sense of diverse datasets, such as machine learning. In this demo, we propose LakeCompass, an end-to-end prototype system that maintains abundant tabular data, supports all of the above search tasks with high efficacy, and serves downstream ML modeling well. Specifically, LakeCompass manages numerous real tables, over which diverse types of indexes are built to support efficient search based on different user requirements. In particular, LakeCompass can automatically integrate the discovered tables to improve downstream model performance iteratively. Finally, we provide both Python APIs and a Web interface to facilitate flexible user interaction.
Pages: 4381-4384
Page count: 4
Related papers
50 in total
  • [21] RNASequest: An End-to-End Reproducible RNAseq Data Analysis and Publishing Framework
    Zhu, Jing
    Sun, Yu H.
    Ouyang, Zhengyu
    Li, Kejie
    Negi, Soumya
    Piya, Sarbottam
    Hu, Wenxing
    Zavodszky, Maria I.
    Yalamanchili, Hima
    Chen, Yirui
    Zhang, Xinmin
    Casey, Fergal
    Zhang, Baohong
    JOURNAL OF MOLECULAR BIOLOGY, 2023, 435 (14)
  • [22] aPEAch: Automated Pipeline for End-to-End Analysis of Epigenomic and Transcriptomic Data
    Xiropotamos, Panagiotis
    Papageorgiou, Foteini
    Manousaki, Haris
    Sinnis, Charalampos
    Antonatos, Charalabos
    Vasilopoulos, Yiannis
    Georgakilas, Georgios K.
    BIOLOGY-BASEL, 2024, 13 (07)
  • [23] End-to-end online performance data capture and analysis for scientific workflows
    Papadimitriou, George
    Wang, Cong
    Vahi, Karan
    da Silva, Rafael Ferreira
    Mandal, Anirban
    Liu, Zhengchun
    Mayani, Rajiv
    Rynge, Mats
    Kiran, Mariam
    Lynch, Vickie E.
    Kettimuthu, Rajkumar
    Deelman, Ewa
    Vetter, Jeffrey S.
    Foster, Ian
    FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2021, 117 : 387 - 400
  • [24] End-to-end service data analysis: Efficiencies achieved across the enterprise
    Herger, L. M.
    Rippon, W. J.
    Fonseca, C. A.
    Pointer, W.
    Belgodere, B. M.
    Cornejo, W. H.
    Frissora, M. J.
    IBM JOURNAL OF RESEARCH AND DEVELOPMENT, 2017, 61 (01) : 5 - 16
  • [26] ENACT: End-to-End Analysis of Visium High Definition (HD) Data
    Kamel, Mena
    Song, Yiwen
    Solbas, Ana
    Villordo, Sergio
    Sarangi, Amrut
    Senin, Pavel
    Sunaal, Mathew
    Ayestas, Luis Cano
    Levin, Clement
    Wang, Seqian
    Classe, Marion
    Bar-Joseph, Ziv
    Planas, Albert Pla
    BIOINFORMATICS, 2025, 41 (03)
  • [27] Data science education through education data: an end-to-end perspective
    Rao, A. Ravishankar
    Desai, Yashvi
    Mishra, Kavita
    2019 9TH IEEE INTEGRATED STEM EDUCATION CONFERENCE (ISEC), 2019, : 300 - 307
  • [28] The NOAO data products program: Developing an end-to-end data management system in support of the virtual observatory
    Smith, R. Chris
    Boroson, Todd
    Seaman, Robert
    ASTRONOMICAL DATA ANALYSIS SOFTWARE AND SYSTEMS XVI, 2007, 376 : 707 - +
  • [29] Ensuring Data Continuity through NISAR's End-to-End Information System
    Sirohi, Richa
    Bottom, Heather
    Krasner, Sandford M.
    Turnbull, James
    2024 IEEE AEROSPACE CONFERENCE, 2024
  • [30] CloudDRN: A Lightweight, End-to-End System for Sharing Distributed Research Data in the Cloud
    Humphrey, Marty
    Steele, Jacob
    Kim, In Kee
    Kahn, Michael G.
    Bondy, Jessica
    Ames, Michael
    2013 IEEE 9TH INTERNATIONAL CONFERENCE ON E-SCIENCE (E-SCIENCE), 2013, : 254 - 261