Structured streaming in Spark

Structured Streaming
Spark Streaming 2.0
https://hadoopist.wordpress.com
Giri R Varatharajan
https://www.linkedin.com/in/girivaratharajan

What is
Structured
Streaming in
Apache Spark
● Continuous Data Flow Programming Model in
Spark introduced in 2.0
● Low Tolerance & High Throughput System
● Exactly Once Semantic - No Duplicates
● Stateful Aggregation over the Time, Event,
Window, Record.
● A Streaming platform built on top of Spark SQL
● Express your the computational code as your
batch computational code in Spark SQL
Dataframes
● Alpha Release released with Spark 2.0
● Supports HDFS, S3 now and support for Kafka,
Kinesis and Other Sources very soon.

Spark
Streaming
< 2.0
Behavior
● Micro Batching : streams are called as Discretized
Streams (DStreams)
● Running Aggregations needs to be specified with
a updateStateByKey method
● Requires careful construction of fault tolerance.
Micro
Batching

Streaming Model
● Live Data Streams Keep appending
to the Dataframe called Unbounded
table.
● Runs incremental aggregates on the
Unbounded table.

Spark
Streaming
2.0
Behavior
+
Demo
● Continuous Data Flow : Streams are appended in
an Unbounded Table with Dataframes APIs on it.
● No need to specify any method for running
aggregates over the time, window, or record.
● Look at the network socket wordcount program.
● Streaming is performed in Complete, Append,
Update Mode(s)
Continuous
Data Flow
Lines = Input Table
wordCounts = Result Table

Streaming Model
//Socket Stream - Read as and when it arrives in NetCat Channel
val lines = spark.readStream
.format("socket")
.option("host", "localhost")
.option("port", 9999)
.load()

Streaming Model
val windowedCounts = words.groupBy(
window($"timestamp", windowDuration,
slideDuration), $"word"
).count().orderBy("window")

Create/
Read Streams
SparkSession.readStream()
● File Source (HDFS, S3, Text, Parquet, Csv,
Json,etc.)
● Socket Stream (NetCat)
● Kafka, Kinesis and Other Input Sources are Under
Research so cross your fingers.
● DataStreamReader API
(http://spark.apache.org/docs/latest/api/scala/index
.html#org.apache.spark.sql.streaming.DataStream
Reader)

Outputting
Streams
SparkSession.writeStream()
Output Sink Types:
● Parquet Sink - HDFS, S3, Parquet
● Console Sink - Terminal
● Memory Sink - In memory table that can be queried over time interactively
● Foreach Sink
● DataStreamWriter
API(http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.st
reaming.DataStreamWriter)
Output Modes:
● Append Mode(Default)
○ New rows only appended
○ Applicable only for Non Aggregated Queries (select,where,filter,join,etc)
● Complete Mode
■ Output the whole result to any Sink
■ Applicable only for aggregated Queries (groupBy, etc)
● Update Mode
○ Updates on any of the row attributes will get appended to the output sink.

CheckPointing ● In case of Failure recover the previous progress
and state of a previous query, and continue where
it left off.
● Configure a CheckPoint location in writeStream
method of DataStreamWriter
● Must be configured for Parquet Sink, File Sink.

Unsupported
Operations yet
● Sort, Limit of First N rows, Distinct on Input
Streams
● Joins bt two streaming datasets
● Outer Joins (FO, LO, RO) bt two streaming
datasets.
● ds.count() ⇒ Use ds.groupBy.count() instead

Key Takeaways
● Structured Streaming is still experimental but please try it out.
● Streaming Events are gathered and appended to a infinite
dataframe series (Unbounded Table) and queries are running on
top of that.
● Development is very similar to the development of Spark for
Static Dataframe/DataSets APIs.
● Execute Ad-hoc Queries, Run aggregates, update DBs, track
session data, prepare dashboards,etc.
● readStream() - Schema of the Streaming Dataframes are
checked only at run time hence it’s untyped.
● writeStream() with various Output Modes, Output Sinks are
available. Always remember when to use what types of Output
Mode.
● Kafka, Kinesis, MLib Integrations, Sessionizations, WaterMarks
are the upcoming features and are being developed at the open
source community.
● Structured Streaming is not recommended for Production
workloads at this point even if it’s a File Streaming, Socket
Streaming.

Thank You Spark Code is available in my github:
https://github.com/vgiri2015/Spark2.0-and-greater/tree/master
/src/main/scala/structStreaming
Other Spark related repositories:
https://github.com/vgiri2015/spark-latest-v1
My blogs and Learning in Spark:
https://hadoopist.wordpress.com/category/apache-spark/

Structured streaming in Spark

More Related Content

What's hot

What's hot (20)

Similar to Structured streaming in Spark

Similar to Structured streaming in Spark (20)

Recently uploaded

Recently uploaded (20)

Structured streaming in Spark