A Near Real-Time Big Data Provenance Generation Method Based on the Conjoint Analysis of Heterogeneous Logs

Authors
Gao, Yuanzhao [1]
Chen, Xingyuan [1,2]
Li, Binglong [1]
Du, Xuehui [1]
Affiliations
[1] Zhengzhou Sci & Technol Inst, Zhengzhou 450000, Peoples R China
[2] State Key Lab Cryptol, Beijing 100878, Peoples R China
Source
IEEE Access | 2023 / Vol. 11
Keywords
Big data provenance; provenance generation; multi-log conjoint analysis; Hadoop
DOI
10.1109/ACCESS.2023.3300844
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Data provenance is an effective approach to data security supervision. In a distributed, multi-user, multi-layer big data system, only a provenance generation method that leverages the information logged at both the application level and the operating system level can obtain all of the provenance information required for data usage supervision. However, current research on the conjoint analysis of multiple logs is inadequate, and existing methods struggle to integrate the provenance information extracted from different logs, especially in big data scenarios. To achieve near real-time provenance generation based on the analysis of multiple heterogeneous logs, this paper takes a Hadoop-based big data system as the research object and proposes a parallel log analysis method based on auxiliary data structures and multi-threading. For efficient conjoint analysis of multiple logs, five auxiliary data structures are constructed as the medium for correlating and fusing log information, and a multi-threading method is presented to parallelize the lookup of provenance information. To cope with complex log record generation rules, log analysis methods for nondeterministic records, non-instantaneous operations, and instantaneous batch operations are proposed so that provenance information is generated correctly. In addition, a provenance generation framework is established to implement the proposed log analysis method. The experimental results show that the log collection time overhead incurred when processing files at or above the MB level is less than 0.1%. The proposed method can analyze logs in near real time and generate provenance information correctly.
Pages: 80806-80821
Page count: 16