Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark

Cited by: 105
Authors
Armbrust, Michael [1]
Das, Tathagata [1]
Torres, Joseph [1]
Yavuz, Burak [1]
Zhu, Shixiong [1]
Xin, Reynold [1]
Ghodsi, Ali [1]
Stoica, Ion [1]
Zaharia, Matei [1, 2]
Affiliations
[1] Databricks Inc, San Francisco, CA 94105 USA
[2] Stanford Univ, Stanford, CA 94305 USA
DOI
10.1145/3183713.3190664
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
With the ubiquity of real-time data, organizations need streaming systems that are scalable, easy to use, and easy to integrate into business applications. Structured Streaming is a new high-level streaming API in Apache Spark based on our experience with Spark Streaming. Structured Streaming differs from other recent streaming APIs, such as Google Dataflow, in two main ways. First, it is a purely declarative API based on automatically incrementalizing a static relational query (expressed using SQL or DataFrames), in contrast to APIs that ask the user to build a DAG of physical operators. Second, Structured Streaming aims to support end-to-end real-time applications that integrate streaming with batch and interactive analysis. We found that this integration was often a key challenge in practice. Structured Streaming achieves high performance via Spark SQL's code generation engine and can outperform Apache Flink by up to 2x and Apache Kafka Streams by 90x. It also offers rich operational features such as rollbacks, code updates, and mixed streaming/batch execution. We describe the system's design and use cases from several hundred production deployments on Databricks, the largest of which process over 1 PB of data per month.
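The idea of incrementalizing a static query can be illustrated with a toy sketch in plain Python. This is not Spark's implementation or API: the `Event` record shape, `batch_query`, and `IncrementalQuery` are hypothetical names introduced only for illustration. The sketch assumes a simple grouped sum and shows that folding micro-batches into running state yields the same answer as running the static query over all the data at once.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Event = Tuple[str, int]  # hypothetical record shape: (key, value)


def batch_query(rows: Iterable[Event]) -> Dict[str, int]:
    """Static query over a complete table: sum of value grouped by key."""
    totals: Counter = Counter()
    for key, value in rows:
        totals[key] += value
    return dict(totals)


class IncrementalQuery:
    """The same grouped sum, evaluated incrementally over arriving batches."""

    def __init__(self) -> None:
        # Running aggregate carried across micro-batches.
        self._state: Counter = Counter()

    def process(self, new_rows: Iterable[Event]) -> Dict[str, int]:
        # Fold only the newly arrived rows into the existing state.
        for key, value in new_rows:
            self._state[key] += value
        return dict(self._state)


# Hypothetical input stream, split into three micro-batches.
micro_batches = [
    [("a", 1), ("b", 2)],
    [("a", 3)],
    [("c", 5), ("b", 1)],
]

query = IncrementalQuery()
result: Dict[str, int] = {}
for batch in micro_batches:
    result = query.process(batch)

# The incremental result matches the static query over all rows seen so far.
all_rows = [row for batch in micro_batches for row in batch]
assert result == batch_query(all_rows)
```

In the system the abstract describes, the user writes only the static query and the engine derives the incremental plan automatically; here the two are written by hand to make the equivalence visible.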
Pages: 601 - 613
Page count: 13
Related papers
50 in total
  • [1] Real-Time Heart Arrhythmia Detection Using Apache Spark Structured Streaming
    Ilbeigipour, Sadegh
    Albadvi, Amir
    Akhondzadeh Noughabi, Elham
    [J]. JOURNAL OF HEALTHCARE ENGINEERING, 2021, 2021
  • [2] KORDI: A Framework for Real-Time Performance and Cost Optimization of Apache Spark Streaming
    Kordelas, Athanasios
    Spyrou, Thanasis
    Voulgaris, Spyros
    Megalooikonomou, Vasileios
    Deligiannis, Nikos
    [J]. 2023 IEEE INTERNATIONAL SYMPOSIUM ON PERFORMANCE ANALYSIS OF SYSTEMS AND SOFTWARE, ISPASS, 2023, : 337 - 339
  • [3] Real-time Data Streaming using Apache Spark on Fully Configured Hadoop Cluster
    Prasad, Kashi Sai
    Pasupathy, S.
    [J]. JOURNAL OF MECHANICS OF CONTINUA AND MATHEMATICAL SCIENCES, 2018, 13 (05): : 164 - 176
  • [4] Real-Time Regex Matching With Apache Spark
    Deaton, Sean
    Brownfield, David
    Kosta, Leonard
    Zhu, Zhaozhong
    Matthews, Suzanne J.
    [J]. 2017 IEEE HIGH PERFORMANCE EXTREME COMPUTING CONFERENCE (HPEC), 2017,
  • [5] Adaptive API for Real-Time Streaming Analytics as a Service
    Inibhunu, Catherine
    Jalali, Roozbeh
    Doyle, Ian
    Gates, Aaron
    Madill, John
    McGregor, Carolyn
    [J]. 2019 41ST ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY (EMBC), 2019, : 3472 - 3477
  • [6] Real-time incremental recommendation for streaming data based on apache flink
    Tang, Zhuo
    Liu, Zeyu
    Li, Kenli
    Li, Keqin
    [J]. INTELLIGENT DATA ANALYSIS, 2019, 23 (06) : 1421 - 1437
  • [7] Efficient topic partitioning of Apache Kafka for high-reliability real-time data streaming applications
    Raptis, Theofanis P.
    Cicconetti, Claudio
    Passarella, Andrea
    [J]. FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2024, 154 : 173 - 188
  • [8] Real-time user clickstream behavior analysis based on apache storm streaming
    Pal, Gautam
    Atkinson, Katie
    Li, Gangmin
    [J]. ELECTRONIC COMMERCE RESEARCH, 2023, 23 (03) : 1829 - 1859
  • [9] Real-time user clickstream behavior analysis based on apache storm streaming
    Gautam Pal
    Katie Atkinson
    Gangmin Li
    [J]. Electronic Commerce Research, 2023, 23 : 1829 - 1859
  • [10] Real-time Pattern Detection in IP Flow Data using Apache Spark
    Cermak, Milan
    Lastovicka, Martin
    Jirsik, Tomas
    [J]. 2019 IFIP/IEEE SYMPOSIUM ON INTEGRATED NETWORK AND SERVICE MANAGEMENT (IM), 2019, : 521 - 526