A Method for Automatically Generating Join Queries Based on Relations-Attributes Distance Matrix over Data Lakes

Authors
Zhang, Caicai [1 ]
Lu, Chenglang [1 ]
Mei, Zhuolin [2 ]
Wu, Bin [2 ]
Yu, Jing [2 ]
Affiliations
[1] Zhejiang Inst Mech & Elect Engn, 528 Binwen Rd, Hangzhou 310053, Zhejiang, Peoples R China
[2] Jiujiang Univ, Sch Comp & Big Data Sci, 551 Qianjin East Rd, Jiujiang 332005, Jiangxi, Peoples R China
Source
TEHNICKI VJESNIK-TECHNICAL GAZETTE | 2023 / Vol. 30 / Issue 05
Funding
National Natural Science Foundation of China;
Keywords
data integration; data lakes; distance matrix; join queries;
DOI
10.17559/TV-20230402000493
Chinese Library Classification number
T [Industrial Technology];
Discipline classification code
08 ;
Abstract
Techniques for identifying joinable or unionable tables in data lakes can yield valuable information for data scientists. However, more than half of their working time is spent familiarizing themselves with the metadata and correlations of datasets. Simplifying the use of information in data lakes is crucial for enhancing their utilization. The existing solution of integrating correlated relations into a single large data table via full disjunction requires integration updating when either data or metadata changes, complicating data maintenance. This paper proposes a method for automatically generating join queries based on the distance matrix of relations and attributes in data lakes. The distance matrix only requires updating when metadata changes, simplifying data maintenance. Experimental results demonstrate that once the distance matrix is generated, the time required to generate the join queries is negligible. Compared to the existing solution, the time cost for executing join queries over correlated tables is nearly identical to that of selection queries over integrated tables. The results of these two queries are also the same, showcasing the effectiveness and efficiency of our method.
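The abstract does not spell out how the distance matrix is built or traversed, so the following is only an illustrative sketch of the general idea, not the authors' algorithm. It assumes a hypothetical three-table schema (`customers`, `orders`, `products`), a hypothetical map of joinable column pairs (`JOIN_EDGES`), and defines the distance from a relation to an attribute as the fewest joins needed to reach a relation that holds it. The function names `build_distance_matrix` and `generate_join_query` are invented for this example.

```python
from collections import deque

# Hypothetical metadata (not from the paper): attributes held by each relation,
# and the shared column on which each pair of relations can be joined.
RELATIONS = {
    "customers": ["customer_id", "name"],
    "orders": ["order_id", "customer_id", "product_id"],
    "products": ["product_id", "title"],
}
JOIN_EDGES = {
    ("customers", "orders"): "customer_id",
    ("orders", "products"): "product_id",
}


def neighbors(rel):
    """Yield (relation, join column) for every relation joinable with rel."""
    for (a, b), col in JOIN_EDGES.items():
        if a == rel:
            yield b, col
        elif b == rel:
            yield a, col


def build_distance_matrix():
    """dist[r][a] = fewest joins needed to reach attribute a from relation r.

    Assumed definition; it depends only on metadata, never on row data.
    """
    dist = {}
    for start in RELATIONS:
        hops = {start: 0}
        queue = deque([start])
        while queue:  # breadth-first search over the join graph
            cur = queue.popleft()
            for nxt, _ in neighbors(cur):
                if nxt not in hops:
                    hops[nxt] = hops[cur] + 1
                    queue.append(nxt)
        row = {}
        for rel, d in hops.items():
            for attr in RELATIONS[rel]:
                row[attr] = min(d, row.get(attr, d))
        dist[start] = row
    return dist


def generate_join_query(start, attrs, dist):
    """Walk from start toward each requested attribute, always stepping to a
    neighbor that is strictly closer to it, and emit the joins along the way.
    Raises KeyError if an attribute is unreachable from start."""
    joined = [start]
    clauses = []
    select = []
    for attr in attrs:
        cur = start
        while dist[cur][attr] > 0:
            nxt, col = next(
                (n, c) for n, c in neighbors(cur)
                if dist[n].get(attr, float("inf")) < dist[cur][attr]
            )
            if nxt not in joined:
                joined.append(nxt)
                clauses.append(f"JOIN {nxt} ON {cur}.{col} = {nxt}.{col}")
            cur = nxt
        select.append(f"{cur}.{attr}")
    return " ".join([f"SELECT {', '.join(select)}", f"FROM {start}", *clauses])


dist = build_distance_matrix()
print(generate_join_query("customers", ["name", "title"], dist))
```

In this sketch the matrix is computed from schema information alone, which mirrors the abstract's claim that it needs updating only when metadata changes; once it exists, producing a query is a short lookup-driven walk.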
Pages: 1539 - 1546
Page count: 8