Apache Spark 2.0 set the architectural foundations of structure in Spark: unified high-level APIs, Structured Streaming, and the underlying performant components like the Catalyst optimizer and the Tungsten engine. Since then, the Spark community has continued to build new features and fix numerous issues in the Spark 2.1 and 2.2 releases.
Continuing forward in that spirit, the upcoming release of Apache Spark 2.3 has made similar strides, introducing new features and resolving over 1,300 JIRA issues. In this talk, we want to share with the community some salient aspects of the soon-to-be-released Spark 2.3 features (a minimal sketch of the new continuous processing trigger follows the feature list):
• New deployment mode: Kubernetes scheduler backend
• PySpark performance and enhancements
• New structured streaming execution engine: continuous processing
• Data source v2 APIs for both structured streaming and Spark SQL
• ML on structured streaming
• Image reader
• Stable codegen engine
• Spark History Server V2
• Native ORC support
• Vectorized ORC and SQL cache readers
• Stream-stream Join
• UDF enhancements
• Various SQL enhancements
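To make one of these concrete: continuous processing replaces micro-batches with long-running tasks for millisecond-level latency. A minimal PySpark sketch of opting into it (assuming Spark 2.3+; the checkpoint path is a placeholder, and in 2.3 only map-like operations such as projections and filters are supported):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("continuous-demo").getOrCreate()

# The built-in "rate" source emits (timestamp, value) rows and, like Kafka,
# supports the continuous execution mode.
events = spark.readStream.format("rate").option("rowsPerSecond", "10").load()

# trigger(continuous=...) opts into the new engine; the interval is how
# often progress is checkpointed, not a batch interval.
query = (events.filter("value % 2 = 0")
         .writeStream
         .format("console")
         .option("checkpointLocation", "/tmp/continuous-demo")  # placeholder path
         .trigger(continuous="1 second")
         .start())
```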
Speakers
Xiao Li, Software Engineer, Databricks
Wenchen Fan, Software Engineer, Databricks
Sharing metadata across the data lake and streams (DataWorks Summit)
Traditionally systems have stored and managed their own metadata, just as they traditionally stored and managed their own data. A revolutionary feature of big data tools such as Apache Hadoop and Apache Kafka is the ability to store all data together, where users can bring the tools of their choice to process it.
Apache Hive's metastore can be used to share the metadata in the same way. It is already used by many SQL and SQL-like systems beyond Hive (e.g. Apache Spark, Presto, Apache Impala, and via HCatalog, Apache Pig). As data processing changes from only data in the cluster to include data in streams, the metastore needs to expand and grow to meet these use cases as well. There is work going on in the Hive community to separate out the metastore, so it can continue to serve Hive but also be used by a more diverse set of tools. This talk will discuss that work, with particular focus on adding support for storing schemas for Kafka messages.
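As a concrete illustration of that sharing, a Spark session pointed at the shared metastore sees the same tables that Hive, Presto, or Impala defined. A minimal PySpark sketch (assuming a hive-site.xml on the classpath; the table name is hypothetical):

```python
from pyspark.sql import SparkSession

# enableHiveSupport() wires Spark's catalog to the shared Hive metastore,
# so table definitions created by other engines resolve here by name.
spark = (SparkSession.builder
         .appName("shared-metastore-demo")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("SHOW TABLES IN default").show()

# 'clicks' stands in for any table another engine registered in the metastore.
spark.table("default.clicks").groupBy("user_id").count().show()
```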
Speaker
Alan Gates, Co-Founder, Hortonworks
The document describes Big Data Ready Enterprise (BDRE), an open source product that addresses common challenges in implementing and operating big data solutions at large scale. It provides out-of-the-box features to accelerate implementations using pluggable architecture, community support, and distribution compatibility. The document outlines BDRE's key benefits and capabilities for data ingestion, workflow automation, operational metadata management, and more. It also provides examples of BDRE implementations and screenshots of the product's interface.
Quick! Quick! Exploration!: A framework for searching a predictive model on A... (DataWorks Summit)
This document summarizes a framework for automating predictive modeling on Apache Spark. The framework allows for scalable searching of predictive models across large parameter spaces. It addresses challenges of high scalability and easy integration of new machine learning implementations. An evaluation shows the framework achieves a 13x speedup over Spark MLlib and reduces the amount of code needed to add new machine learning algorithms. The framework automates predictive modeling tasks in a scalable and plug-and-play manner.
Innovation in the Enterprise Rent-A-Car Data Warehouse (DataWorks Summit)
Big Data adoption is a journey. Depending on the business the process can take weeks, months, or even years. With any transformative technology the challenges have less to do with the technology and more to do with how a company adapts itself to a new way of thinking about data. Building a Center of Excellence is one way for IT to help drive success.
This talk will explore Enterprise Holdings Inc. (which operates the Enterprise Rent-A-Car, National Car Rental, and Alamo Rent A Car brands) and its experience with big data. EHI's journey started in 2013 with Hadoop as a POC, and today the company is working to create the next-generation data warehouse in Microsoft's Azure cloud utilizing a lambda architecture.
We'll discuss the Center of Excellence and the roles in the new world, share the things which worked well, and rant about those which didn't.
No deep Hadoop knowledge is necessary; the content is aimed at the architect or executive level.
Apache Hadoop YARN is the modern distributed operating system for big data applications. It morphed the Hadoop compute layer to be a common resource management platform that can host a wide variety of applications. Many organizations leverage YARN in building their applications on top of Hadoop without themselves repeatedly worrying about resource management, isolation, multi-tenancy issues, etc.
In this talk, we’ll start with the current status of Apache Hadoop YARN—how it is used today in deployments large and small. We'll then move on to the exciting present and future of YARN—features that are further strengthening YARN as the first-class resource management platform for data centers running enterprise Hadoop.
We’ll discuss the current status as well as the future promise of features and initiatives like: powerful container placement, global scheduling, support for machine learning and deep learning workloads through GPU and FPGA support, extreme scale with YARN federation, containerized apps on YARN, support for long running services (alongside applications) natively without any changes, seamless application upgrades, powerful scheduling features like application priorities, intra-queue preemption across applications, and operational enhancements including insights through Timeline Service V2, a new web UI, and better queue management.
Speakers
Wangda Tan, Staff Software Engineer, Hortonworks
Billie Rinaldi, Principal Software Engineer I, Hortonworks
Apache Hive is a rapidly evolving project that continues to enjoy broad adoption in the big data ecosystem. Although Hive started primarily as a batch ingestion and reporting tool, the community is hard at work improving it along many different dimensions and use cases. This talk will provide an overview of the latest and greatest features and optimizations that have landed in the project over the last year. Materialized views, micro-managed tables, and workload management are some noteworthy features.
I will deep dive into some optimizations that promise to provide major performance gains. Support for ACID tables has also improved considerably. Although some of these features and enhancements are not novel and have existed for years in other database systems, implementing them in Hive poses unique challenges and yields lessons that are broadly applicable in many other contexts. I will also provide a glimpse of what is expected to come in the near future.
Speaker: Ashutosh Chauhan, Engineering Manager, Hortonworks
Logical Data Warehouse: How to Build a Virtualized Data Services Layer (DataWorks Summit)
The document discusses the emergence of logical data warehouses in response to big data. It describes how a logical data warehouse uses virtualization, distributed processing, and other techniques to provide a unified view of data across different repositories like Hadoop, relational databases and NoSQL stores. It also discusses how organizations can optimize resources by offloading analytical workloads from their enterprise data warehouse to Hadoop clusters to reduce costs while still using existing code and applications.
Avoiding Log Data Overload in a CI/CD System While Streaming 190 Billion Even... (DataWorks Summit)
Learn how Pure Storage engineering manages streaming 190B log events per day and makes use of that deluge of data in our continuous integration (CI) pipeline. Our test infrastructure runs over 70,000 tests per day, creating a triage problem that would require at least 20 triage engineers. Instead, Spark's flexible computing platform allows us to write a single application for both streaming and batch jobs to understand the state of our CI pipeline with a team of 3 triage engineers. Using encoded patterns, Spark indexes log data for real-time reporting (streaming), uses machine learning for performance modeling and prediction (batch job), and finds previous matches for newly encoded patterns (batch job). Resource allocation in this mixed environment can be challenging; a containerized Spark cluster deployment and disaggregated compute and storage layers allow us to programmatically shift compute resources between the streaming and batch applications. This talk will go over design decisions to meet the SLAs of streaming and batching in hardware, data layout, access patterns, and container strategy. We will also go over the challenges, lessons learned, and best practices for similar data pipelines.
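To make the single-application pattern concrete, here is a minimal PySpark sketch in that spirit (not Pure Storage's actual code; paths, schema, and patterns are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("ci-log-pipeline").getOrCreate()

schema = "ts TIMESTAMP, test_id STRING, line STRING"  # hypothetical log schema

# Streaming job: index error events as new log files arrive.
logs = spark.readStream.format("json").schema(schema).load("/data/ci/logs")
stream = (logs.filter(col("line").contains("ERROR"))
          .writeStream
          .format("parquet")
          .option("path", "/data/ci/error-index")
          .option("checkpointLocation", "/data/ci/checkpoints")
          .start())

# Batch job over the same data: scan full history for a newly encoded pattern.
history = spark.read.format("json").schema(schema).load("/data/ci/logs")
history.filter(col("line").rlike("segfault|core dumped")) \
       .groupBy("test_id").count().show()
```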
Speaker
Joshua Robinson, Founding Engineer, Pure Storage
This presentation will describe the analytics-to-cloud migration initiative underway at Fannie Mae. The goal of this effort is threefold: (1) build a sustainable process for data lake hydration in the cloud, (2) modernize the Fannie Mae enterprise data warehouse infrastructure, and (3) retire Netezza.
Fannie Mae partnered with Impetus for modernization of its Netezza legacy analytics platform. This involved the use of the Impetus Workload Migration solution—a sophisticated translation engine that automated the migration of their complex Netezza stored procedures, shell and scheduler scripts to Apache Spark compatible scripts. This delivered substantial savings in time, effort and cost, while reducing overall project risk.
Included in the scope of the automation project was an automated assessment capability to perform detailed profiling of the current workloads. The output from the assessment stage was a data-driven offloading blueprint and roadmap for which workloads to migrate. A hybrid cloud-based big data solution was designed based on that. In addition to fulfilling the essential requirement of historical (and incremental) data migration and automated logic translation, the solution also recommends optimal storage formats for the data in the cloud, performing SCD Type 1 and Type 2 for mission-critical parameters and reloading the transformed data back for reporting/analytical consumption.
This will include the following topics:
i. Fannie Mae analytics overview
ii. Why cloud migration for analytics?
iii. Approach, major challenges, lessons learned
Speaker
Kevin Bates, Vice President for Enterprise Data Strategy Execution, Fannie Mae
Streaming Analytics Manager (SAM) simplifies the development and reduces the delivery time of analytic applications geared towards data in motion. Using a drag-and-drop interface, application developers can create complex streaming analytics apps for event correlation, context enrichment, complex pattern matching, and analytical aggregations, eliminating the need for specialized skill sets. SAM also allows users to easily define the streaming engine and environments their application will use for execution and a streaming operations view to give users insight into their application’s performance during runtime.
In this talk we will cover the key features of the Streaming Analytics Manager. We will then go over the new features recently added to SAM around ease of debugging and troubleshooting, log search, event sampling, the metrics view, test simulation mode, and more.
With SAM as an analytics solution, users get a rich experience for building and managing streaming analytics applications and bringing these applications to market considerably faster.
Speaker
Arun Iyer, Software Engineer, Hortonworks
Lessons learned running a container cloud on YARN (DataWorks Summit)
Apache Hadoop YARN is the resource and application manager for Apache Hadoop. In the past, YARN only supported launching containers as processes. However, as containerization has become extremely popular, more and more users wanted support for launching Docker containers. With recent changes, YARN now supports running Docker containers alongside process containers. Coupled with the newly added support for long-running services on YARN, this allows a host of new possibilities.
In this talk, we'll present how to run a container cloud on YARN. Leveraging the support in YARN for Docker and long-running services, we can allow users to easily spin up sets of Docker containers for their applications. These containers can be self contained or wired up to form more complex applications. We will go over some of the lessons we learned as part of our experiences handling issues such as resource management, debugging application failures, running Docker, service discovery, etc.
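For flavor, a service definition for the YARN service framework looks roughly like the sketch below (field values are illustrative and the exact spec varies by Hadoop release; consult your version's YARN Services API documentation):

```json
{
  "name": "redis-service",
  "version": "1.0.0",
  "components": [
    {
      "name": "redis",
      "number_of_containers": 2,
      "artifact": { "id": "library/redis", "type": "DOCKER" },
      "launch_command": "redis-server",
      "resource": { "cpus": 1, "memory": "512" }
    }
  ]
}
```

On a cluster configured for the Docker container runtime, a spec like this can be launched with `yarn app -launch redis-service redis.json`.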
Speaker
Billie Rinaldi, Principal Software Engineer I, Hortonworks
This document discusses data management trends and Oracle's unified data management solution. It provides a high-level comparison of HDFS, NoSQL, and RDBMS databases. It then describes Oracle's Big Data SQL which allows SQL queries to be run across data stored in Hadoop. Oracle Big Data SQL aims to provide easy access to data across sources using SQL, unified security, and fast performance through smart scans.
1. The document discusses Microsoft's SCOPE analytics platform running on Apache Tez and YARN. It describes how Graphene was designed to integrate SCOPE with Tez to enable SCOPE jobs to run as Tez DAGs on YARN clusters.
2. Key components of Graphene include a DAG converter, Application Master, and tooling integration. The Application Master manages task execution and communicates with SCOPE engines running in containers.
3. Initial experience running SCOPE on Tez has been positive though challenges remain around scaling to very large workloads with over 15,000 parallel tasks and optimizing for opportunistic containers and Application Master recovery.
The Future of Data Warehousing, Data Science and Machine Learning (ModusOptimum)
Watch the on-demand recording here:
https://event.on24.com/wcc/r/1632072/803744C924E8BFD688BD117C6B4B949B
Evolution of Big Data and the Role of Analytics | Hybrid Data Management
IBM: Driving the future hybrid data warehouse with the IBM Integrated Analytics System.
Using LLVM to accelerate processing of data in Apache Arrow (DataWorks Summit)
Most query engines follow an interpreter-based approach where a SQL query is translated into a tree of relational algebra operations then fed through a conventional tuple-based iterator model to execute the query. We will explore the overhead associated with this approach and how the performance of query execution on columnar data can be improved using run-time code generation via LLVM.
Generally speaking, the best case for optimal query execution performance is a hand-written query plan that does exactly what is needed by the query for the exact same data types and format. Vectorized query processing models amortize the cost of function calls. However, research has shown that hand-written code for a given query plan has the potential to outperform the optimizations associated with a vectorized query processing model.
Over the last decade, the LLVM compiler framework has seen significant development. Furthermore, the database community has realized the potential of LLVM to boost query performance by implementing JIT query compilation frameworks. With LLVM, a SQL query is translated into a portable intermediary representation (IR) which is subsequently converted into machine code for the desired target architecture.
Dremio is built on top of Apache Arrow’s in-memory columnar vector format. The in-memory vectors map directly to the vector type in LLVM, which makes our job easier when writing the query processing algorithms in LLVM. We will talk about how Dremio implemented query processing logic in LLVM for operators like FILTER and PROJECT. We will also discuss the performance benefits of LLVM-based vectorized query execution over other methods.
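The LLVM code generation itself lives in Dremio and Arrow's native layers, but the shape of vectorized FILTER and PROJECT over Arrow columnar data can be sketched with pyarrow's compute kernels (an illustration of the execution model, assuming a recent pyarrow; not Dremio's implementation):

```python
import pyarrow as pa
import pyarrow.compute as pc

table = pa.table({"price": [10.0, 25.5, 7.2, 42.0],
                  "qty":   [3, 1, 10, 2]})

# FILTER: evaluate the predicate over whole column vectors at once.
mask = pc.greater(table["price"], 10.0)
filtered = table.filter(mask)

# PROJECT: compute a derived column vector-at-a-time, no per-row interpreter.
revenue = pc.multiply(filtered["price"],
                      pc.cast(filtered["qty"], pa.float64()))
print(filtered.append_column("revenue", revenue))
```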
Speaker
Siddharth Teotia, Dremio, Software Engineer
Big data security challenges differ somewhat from those of traditional client-server applications: big data systems are distributed in nature, which introduces unique security vulnerabilities. The Cloud Security Alliance (CSA) has categorized the different security and privacy challenges into four aspects of the big data ecosystem: infrastructure security, data privacy, data management, and integrity and reactive security. Each of these aspects is further divided into the following security challenges:
1. Infrastructure security
a. Secure distributed processing of data
b. Security best practices for non-relational data stores
2. Data privacy
a. Privacy-preserving analytics
b. Cryptographic technologies for big data
c. Granular access control
3. Data management
a. Secure data storage and transaction logs
b. Granular audits
c. Data provenance
4. Integrity and reactive security
a. Endpoint input validation/filtering
b. Real-time security/compliance monitoring
In this talk, we are going to refer to the above classification and identify existing security controls, best practices, and guidelines. We will also paint a big picture of how the collective usage of all the discussed security controls (Kerberos, TDE, LDAP, SSO, SSL/TLS, Apache Knox, Apache Ranger, Apache Atlas, Ambari Infra, etc.) can address the fundamental security and privacy challenges that encompass the entire Hadoop ecosystem. We will also briefly discuss recent security incidents involving Hadoop systems.
Speakers
Krishna Pandey, Staff Software Engineer, Hortonworks
Kunal Rajguru, Premier Support Engineer, Hortonworks
Apache Eagle is a distributed real-time monitoring and alerting engine for Hadoop that was created by eBay and later open sourced as an Apache Incubator project. It provides security for Hadoop systems by instantly identifying access to sensitive data, recognizing attacks/malicious activity, and blocking access in real time through complex policy definitions and stream processing. Eagle was designed to handle the huge volume of metrics and logs generated by large-scale Hadoop deployments through its distributed architecture and use of technologies like Apache Storm and Kafka.
Insights into Real-world Data Management Challenges (DataWorks Summit)
Oracle began with the belief that the foundation of IT is managing information. The Oracle Cloud Platform for Big Data is a natural extension of our belief in the power of data. Oracle's Integrated Cloud is one cloud for the entire business, meeting everyone's needs. It's about connecting people to information through tools that help you combine and aggregate data from any source.
This session will explore how organizations can transition to the cloud by delivering fully managed and elastic Hadoop and real-time streaming cloud services to build robust offerings that provide measurable value to the business. We will explore key data management trends and dive deeper into pain points we are hearing about from our customer base.
At ING we needed a way to take data science models from exploration into production. I will give this talk from my experience as a senior ops engineer on the exploration and production Hadoop environments. For this we use OpenShift to run Docker containers that connect to the big data Hadoop environment.
During this talk I will explain why we need this and how it is done at ING, including how to set up a Docker container running a data science model using Hive, Python, and Spark. I'll explain how to use Dockerfiles to build Docker images, add all the needed components inside the Docker image, and run different versions of software in different containers.
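A hedged sketch of what such a Dockerfile can look like (base image, versions, and paths are illustrative, not ING's actual build):

```dockerfile
FROM python:3.6-slim

# Java is required by PySpark; pinning versions keeps model runs reproducible.
RUN apt-get update \
    && apt-get install -y --no-install-recommends openjdk-8-jre-headless \
    && rm -rf /var/lib/apt/lists/*

RUN pip install --no-cache-dir pyspark==2.3.1 pandas scikit-learn

# Bundle the model code and its scoring entry point into the image.
COPY model/ /app/model/
WORKDIR /app

CMD ["python", "model/score.py"]
```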
In the end I will also give a demo of how it runs and is automated, using Git with a webhook connecting to Jenkins to start the Docker service that connects to the big data Hadoop environment.
This is going to be a great technical talk for engineers and data scientists.
Speaker
Lennard Cornelis, Ops Engineer, ING
The document discusses how machine data from various sources such as IoT devices, industrial systems, mobile devices, and other systems can be collected and analyzed using Splunk software. Splunk provides capabilities for data ingestion, indexing, searching, analyzing, and visualizing large amounts of machine data. It also discusses how Splunk has been used by companies in various industries to gain insights from their machine data to improve operations, security, customer experience, and business outcomes. Specific use cases highlighted include predictive maintenance, anomaly detection, supply chain optimization, and understanding customer behavior.
Apache Spark 2.4 comes packed with a lot of new functionalities and improvements, including the new barrier execution mode, flexible streaming sink, the native AVRO data source, PySpark’s eager evaluation mode, Kubernetes support, higher-order functions, Scala 2.12 support, and more.
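As a taste of the new higher-order functions, which let SQL operate on array columns directly, a minimal PySpark sketch (assuming Spark 2.4+):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hof-demo").getOrCreate()
df = spark.createDataFrame([(1, [1, 2, 3]), (2, [4, 5])], ["id", "values"])
df.createOrReplaceTempView("t")

# transform and aggregate take lambda expressions over array elements.
spark.sql("""
  SELECT id,
         transform(values, x -> x + 1)             AS incremented,
         aggregate(values, 0, (acc, x) -> acc + x) AS total
  FROM t
""").show()
```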
Apache Spark 2.0 set the architectural foundations of Structure in Spark, Unified high-level APIs, Structured Streaming, and the underlying performant components like Catalyst Optimizer and Tungsten Engine. Since then the Spark community has continued to build new features and fix numerous issues in releases Spark 2.1 and 2.2.
Continuing forward in that spirit, the upcoming release of Apache Spark 2.3 has made similar strides too, introducing new features and resolving over 1300 JIRA issues. In this talk, we want to share with the community some salient aspects of the soon-to-be-released Spark 2.3 features:
• Kubernetes Scheduler Backend
• PySpark Performance and Enhancements
• Continuous Structured Streaming Processing
• DataSource v2 APIs
• Structured Streaming v2 APIs
This document summarizes new features in Apache Spark 2.3, including continuous processing mode for structured streaming, stream-stream joins, running Spark applications on Kubernetes, improved PySpark performance through vectorized UDFs and Pandas integration, and Databricks Delta for reliability and performance in data lakes. The author, an Apache Spark committer and PMC member, provides overviews and code examples of these features.
Apache CarbonData & Spark Meetup
Apache Spark™ is a unified analytics engine for large-scale data processing.
CarbonData is a high-performance data solution that supports various data analytics scenarios, including BI analysis, ad hoc SQL queries, fast filter lookup on detail records, streaming analytics, and so on. CarbonData has been deployed in many enterprise production environments; in one of the largest scenarios it supports queries on a single table with 3 PB of data (more than 5 trillion records) with response times of less than 3 seconds!
Overview of Apache Spark 2.3: What’s New? with Sameer Agarwal (Databricks)
Apache Spark 2.0 set the architectural foundations of Structure in Spark, Unified high-level APIs, Structured Streaming, and the underlying performant components like Catalyst Optimizer and Tungsten Engine. Since then the Spark community contributors have continued to build new features and fix numerous issues in releases Spark 2.1 and 2.2.
Continuing forward in that spirit, Apache Spark 2.3 has made similar strides too, introducing new features and resolving over 1300 JIRA issues. In this talk, we want to share with the community some salient aspects of Spark 2.3 features:
Kubernetes Scheduler Backend
PySpark Performance and Enhancements
Continuous Structured Streaming Processing
DataSource v2 APIs
Spark History Server Performance Enhancements
The document summarizes the major new features in Apache Spark 2.3, including continuous processing for low-latency streaming, Spark running on Kubernetes, improved PySpark performance using Pandas UDFs, machine learning capabilities on streaming data, and image reading support. Some key updates are continuous processing for streaming with latency of ~1 ms and at-least-once semantics, Spark's ability to run natively on Kubernetes clusters, and Pandas UDFs in PySpark providing a 3x to 100x performance boost over row-at-a-time UDFs. The speaker is the Spark 2.3 release manager and discusses these topics at the Spark Summit on June 6, 2018.
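The Pandas UDF improvement is easy to see in code; a minimal sketch using the Spark 2.3 syntax (requires pyarrow; the column name and computation are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("pandas-udf-demo").getOrCreate()
df = spark.createDataFrame([(1, 10.0), (2, 25.0), (3, 7.5)], ["id", "price"])

# A vectorized (Pandas) UDF receives whole pandas.Series batches instead of
# one Python object per row, which is where the 3x-100x speedup comes from.
@pandas_udf(DoubleType(), PandasUDFType.SCALAR)
def add_tax(price):
    return price * 1.21

df.withColumn("price_with_tax", add_tax("price")).show()
```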
Spark + AI Summit 2020 had over 35,000 attendees from 125 countries. The majority of participants were data engineers and data scientists. Apache Spark is now widely used with Python and SQL. Spark 3.0 includes improvements like adaptive query execution that accelerate queries by 2-18x. Delta Engine is a new high performance query engine for data lakes built on Spark 3.0.
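Adaptive query execution is opt-in in Spark 3.0 and is enabled with a configuration flag (a minimal sketch):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("aqe-demo").getOrCreate()

# AQE re-optimizes the plan at runtime using actual shuffle statistics,
# e.g. coalescing shuffle partitions and switching join strategies.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
```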
Solving Enterprise Data Challenges with Apache Arrow (Wes McKinney)
This document discusses Apache Arrow, an open-source library that enables fast and efficient data interchange and processing. It summarizes the growth of Arrow and its ecosystem, including new features like the Arrow C++ query engine and Arrow Rust DataFusion. It also highlights how enterprises are using Arrow to solve challenges around data interoperability, access speed, query performance, and embeddable analytics. Case studies describe how companies like Microsoft, Google Cloud, Snowflake, and Meta leverage Arrow in their products and platforms. The presenter promotes Voltron Data's enterprise subscription and upcoming conference to support business use of Apache Arrow.
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0 (Databricks)
Apache Spark 2.0 was released this summer and is already being widely adopted. In this presentation Matei talks about how changes in the API have made it easier to write batch, streaming and realtime applications. The Dataset API, which is now integrated with DataFrames, makes it possible to benefit from powerful optimizations such as pushing queries into data sources, while the Structured Streaming extension to this API makes it possible to run many of the same computations in a streaming fashion automatically.
Apache® Spark™ 1.6 presented by Databricks co-founder Patrick Wendell (Databricks)
In this webcast, Patrick Wendell from Databricks will be speaking about Apache Spark's new 1.6 release.
Spark 1.6 will include (but is not limited to) a type-safe API called Dataset on top of DataFrames that leverages all the work in Project Tungsten to provide more robust and efficient execution (including memory management, code generation, and query optimization) [SPARK-9999], adaptive query execution [SPARK-9850], and unified memory management by consolidating cache and execution memory [SPARK-10000].
The structured streaming upgrade to Apache Spark and how enterprises can bene... (Impetus Technologies)
The adoption of Apache Spark to analyze data in real-time is increasing with its ability to handle sophisticated analytical requirements and a common framework for streaming and batch. However, most organizations are also looking for "true streaming" features like lower latency and the ability to process out-of-order data.
Structured Streaming, a new high-level API, introduced in Apache Spark 2.0 promises these and other enhancements to the Spark approach to streaming data processing.
In this webinar, Anand Venugopal (Product Head) and other technical experts from StreamAnalytix, speak about the promising developments in Apache Spark 2.0 and how organizations can leverage structured streaming to make timely and accurate decisions and stay competitive.
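One of those "true streaming" features, handling out-of-order data via event-time watermarks, looks roughly like this in Structured Streaming (a minimal PySpark sketch; the Kafka broker, topic, and column names are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("watermark-demo").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical
          .option("subscribe", "events")
          .load()
          .selectExpr("CAST(value AS STRING) AS word",
                      "timestamp AS event_time"))

# The watermark tells Spark how late data may arrive: events more than
# 10 minutes behind the max observed event time are dropped, letting the
# engine finalize windows and bound its state.
counts = (events
          .withWatermark("event_time", "10 minutes")
          .groupBy(window("event_time", "5 minutes"), "word")
          .count())

query = counts.writeStream.outputMode("update").format("console").start()
```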
An Insider’s Guide to Maximizing Spark SQL Performance (Takuya UESHIN)
This document provides an overview of optimizing Spark SQL performance. It begins with introducing the speaker and their background with Spark. It then discusses reading query plans, interpreting them to understand optimizations, and tuning plans by pushing down filters, avoiding implicit casts, and other techniques. It emphasizes tracking query execution through the Spark UI to analyze jobs, stages and tasks for bottlenecks. The document aims to help understand how to maximize Spark SQL performance.
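A minimal sketch of that workflow: read the physical plan and check whether a filter was pushed into the scan (the dataset path and column names are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("explain-demo").getOrCreate()
df = spark.read.parquet("/data/events")  # hypothetical dataset

# Comparing a column to a literal of the matching type lets the optimizer
# push the filter into the Parquet scan (look for PushedFilters in the plan).
df.filter(col("user_id") == 42).explain(True)

# An explicit or implicit cast on the column typically blocks pushdown;
# the predicate then runs after the scan instead of inside it.
df.filter(col("user_id").cast("string") == "42").explain(True)
```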
The document introduces Apache Kafka's Streams API for stream processing. Some key points covered include:
- The Streams API allows building stream processing applications without needing a separate cluster, providing an elastic, scalable, and fault-tolerant processing engine.
- It integrates with existing Kafka deployments and supports both stateful and stateless computations on data in Kafka topics.
- Applications built with the Streams API are standard Java applications that run on client machines and leverage Kafka for distributed, parallel processing and fault tolerance via state stores in Kafka.
Simplifying Big Data Applications with Apache Spark 2.0 (Spark Summit)
Apache Spark 2.0 is a major new release that simplifies the Spark API and improves performance. Some key points:
1) It remains highly compatible with Spark 1.x while building on lessons learned to simplify the API with over 2000 patches from 280 contributors.
2) It introduces structured APIs like DataFrames that allow Spark to optimize queries via whole-stage code generation, providing up to 10x performance gains.
3) It launches a new higher-level streaming API called Structured Streaming that allows developers to write streaming jobs that behave like batch jobs and integrate easily with static data and batch jobs.
Slides presented during the Strata SF 2019 conference. Explaining how Lyft is building a multi-cluster solution for running Apache Spark on kubernetes at scale to support diverse workloads and overcome challenges.
The annual review session by the AMIS team on their findings, interpretations and opinions regarding news, trends, announcements and roadmaps around Oracle's product portfolio.
Custom application development, according to Oracle, is primarily relevant for extending SaaS applications and creating customer experiences. The currently recommended approach for building graphical user interfaces (on web and mobile) is through low-code Visual Builder with high-code JET injections when required. An alternative low-code stack is available from Oracle in the form of APEX. This slide set discusses the above as well as ADF and Forms. It then introduces Digital Assistant, talks about the state and future of Java, and concludes with CI/CD and DevOps. As presented on November 5th, 2018 at AMIS HQ, Nieuwegein, The Netherlands.
This introductory workshop is aimed at data analysts and data engineers new to Apache Spark and shows them how to analyze big data with Spark SQL and DataFrames (a minimal DataFrame sketch follows the topic list).
In these partly instructor-led, partly self-paced labs, we will cover Spark concepts and you'll do labs for Spark SQL and DataFrames in Databricks Community Edition.
Toward the end, you'll get a glimpse into the newly minted Databricks Developer Certification for Apache Spark: what to expect & how to prepare for it.
* Apache Spark Basics & Architecture
* Spark SQL
* DataFrames
* Brief Overview of Databricks Certified Developer for Apache Spark
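The promised minimal sketch, asking the same question through both the DataFrame API and Spark SQL (the data is a stand-in for a real dataset):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-df-intro").getOrCreate()

# A tiny DataFrame standing in for a real dataset (names are hypothetical).
df = spark.createDataFrame(
    [("alice", 34), ("bob", 36), ("carol", 29)], ["name", "age"])

# DataFrame API.
df.filter(df.age > 30).select("name").show()

# The equivalent Spark SQL, run against a temporary view of the same data.
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()
```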
Introduction: This workshop will provide a hands-on introduction to Machine Learning (ML) with an overview of Deep Learning (DL).
Format: An introductory lecture on several supervised and unsupervised ML techniques, followed by a light introduction to DL and a short discussion of the current state of the art. Several Python code samples using the scikit-learn library will be introduced that users will be able to run in the Cloudera Data Science Workbench (CDSW).
Objective: To provide a quick and short hands-on introduction to ML with python’s scikit-learn library. The environment in CDSW is interactive and the step-by-step guide will walk you through setting up your environment, to exploring datasets, training and evaluating models on popular datasets. By the end of the crash course, attendees will have a high-level understanding of popular ML algorithms and the current state of DL, what problems they can solve, and walk away with basic hands-on experience training and evaluating ML models.
Prerequisites: For the hands-on portion, registrants must bring a laptop with a Chrome or Firefox web browser. These labs will be done in the cloud; no installation is needed. Everyone will be able to register and start using CDSW after the introductory lecture concludes (about 1 hr in). Basic knowledge of Python is highly recommended.
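In the spirit of the hands-on labs, a minimal scikit-learn sketch of training and evaluating a model on a popular dataset (the model choice and split are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a classic dataset and hold out a test split for evaluation.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```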
Floating on a RAFT: HBase Durability with Apache Ratis (DataWorks Summit)
In a world with a myriad of distributed storage systems to choose from, the majority of Apache HBase clusters still rely on Apache HDFS. Theoretically, any distributed file system could be used by HBase. One major reason HDFS is predominantly used is the specific durability requirements of HBase's write-ahead log (WAL), which HDFS guarantees correctly. However, HBase's use of HDFS for WALs can be replaced with sufficient effort.
This talk will cover the design of a "Log Service" which can be embedded inside of HBase that provides a sufficient level of durability that HBase requires for WALs. Apache Ratis (incubating) is a library-implementation of the RAFT consensus protocol in Java and is used to build this Log Service. We will cover the design choices of the Ratis Log Service, comparing and contrasting it to other log-based systems that exist today. Next, we'll cover how the Log Service "fits" into HBase and the necessary changes to HBase which enable this. Finally, we'll discuss how the Log Service can simplify the operational burden of HBase.
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi (DataWorks Summit)
Utilizing Apache NiFi, we read various open data REST APIs and camera feeds to ingest crime and related data in real time, streaming it into HBase and Phoenix tables. HBase makes an excellent storage option for our real-time time series data sources. We can immediately query our data utilizing Apache Zeppelin against Phoenix tables, as well as Hive external tables over HBase.
Apache Phoenix tables also make a great option since we can easily put microservices on top of them for application usage. I have an example Spring Boot application that reads from our Philadelphia crime table for front-end web applications as well as RESTful APIs.
Apache NiFi makes it easy to push records with schemas to HBase and insert into Phoenix SQL tables.
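Beyond Spring Boot, Phoenix tables are also easy to query from Python through the Phoenix Query Server (a minimal sketch using the phoenixdb driver; the URL and table name are hypothetical):

```python
import phoenixdb

# Connect to the Phoenix Query Server fronting the HBase cluster.
conn = phoenixdb.connect("http://localhost:8765/", autocommit=True)
cursor = conn.cursor()

# 'philly_crime' is a hypothetical Phoenix table fed by the NiFi flow.
cursor.execute(
    "SELECT dc_dist, COUNT(*) FROM philly_crime GROUP BY dc_dist")
for district, n in cursor.fetchall():
    print(district, n)
```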
Resources:
https://community.hortonworks.com/articles/54947/reading-opendata-json-and-storing-into-phoenix-tab.html
https://community.hortonworks.com/articles/56642/creating-a-spring-boot-java-8-microservice-to-read.html
https://community.hortonworks.com/articles/64122/incrementally-streaming-rdbms-data-to-your-hadoop.html
HBase Tales From the Trenches - Short stories about most common HBase operati... (DataWorks Summit)
While HBase is the most logical answer for use cases requiring random, real-time read/write access to big data, it is not trivial to design applications that make the most of it, nor is it the simplest system to operate. Because it depends on and integrates with other components from the Hadoop ecosystem (ZooKeeper, HDFS, Spark, Hive, etc.) and with external systems (Kerberos, LDAP), and because its distributed nature requires a "Swiss clockwork" infrastructure, many variables must be considered when investigating anomalies or even outages. Adding to the equation, HBase is still an evolving product, with different release versions in current use, some of which carry genuine software bugs. In this presentation, we'll go through the most common HBase issues faced by different organisations, describing the identified causes and resolution actions from my last 5 years supporting HBase for our heterogeneous customer base.
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac... (DataWorks Summit)
LocationTech GeoMesa enables spatial and spatiotemporal indexing and queries for HBase and Accumulo. In this talk, after an overview of GeoMesa’s capabilities in the Cloudera ecosystem, we will dive into how GeoMesa leverages Accumulo’s Iterator interface and HBase’s Filter and Coprocessor interfaces. The goal will be to discuss both what spatial operations can be pushed down into the distributed database and also how the GeoMesa codebase is organized to allow for consistent use across the two database systems.
OCLC has been using HBase since 2012 to enable single-search-box access to over a billion items from your library and the world’s library collection. This talk will provide an overview of how HBase is structured to provide this information, some of the challenges encountered in scaling to support the world catalog, and how they have been overcome.
Many individuals/organizations have a desire to utilize NoSQL technology, but often lack an understanding of how the underlying functional bits can be utilized to enable their use case. This situation can result in drastic increases in the desire to put the SQL back in NoSQL.
Since the initial commit, Apache Accumulo has provided a number of examples to help jumpstart comprehension of how some of these bits function as well as potentially help tease out an understanding of how they might be applied to a NoSQL friendly use case. One very relatable example demonstrates how Accumulo could be used to emulate a filesystem (dirlist).
In this session we will walk through the dirlist implementation. Attendees should come away with an understanding of the supporting table designs, a simple text search supporting a single wildcard (on file/directory names), and how the dirlist elements work together to accomplish its feature set. Attendees should (hopefully) also come away with a justification for sometimes keeping the SQL out of NoSQL.
HBase Global Indexing to support large-scale data ingestion at Uber (DataWorks Summit)
Danny Chen presented on Uber's use of HBase for global indexing to support large-scale data ingestion. Uber uses HBase to provide a global view of datasets ingested from Kafka and other data sources. To generate indexes, Spark jobs are used to transform data into HFiles, which are loaded into HBase tables. Given the large volumes of data, techniques like throttling HBase access and explicit serialization are used. The global indexing solution supports requirements for high throughput, strong consistency and horizontal scalability across Uber's data lake.
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix (DataWorks Summit)
Recently, Apache Phoenix has been integrated with the Apache Omid (incubating) transaction processing service to provide ultra-high system throughput with ultra-low latency overhead. Phoenix has been shown to scale beyond 0.5M transactions per second with sub-5 ms latency for short transactions on industry-standard hardware. On the other hand, Omid has been extended to support secondary indexes, multi-snapshot SQL queries, and massive-write transactions.
These innovative features make Phoenix an excellent choice for translytics applications, which allow converged transaction processing and analytics. We share the story of building the next-gen data tier for advertising platforms at Verizon Media that exploits Phoenix and Omid to support multi-feed real-time ingestion and AI pipelines in one place, and discuss the lessons learned.
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi (DataWorks Summit)
This document discusses using Apache NiFi to build a high-speed cyber security data pipeline. It outlines the challenges of ingesting, transforming, and routing large volumes of security data from various sources to stakeholders like security operations centers, data scientists, and executives. It proposes using NiFi as a centralized data gateway to ingest data from multiple sources using a single entry point, transform the data according to destination needs, and reliably deliver the data while avoiding issues like network traffic and data duplication. The document provides an example NiFi flow and discusses metrics from processing over 20 billion events through 100+ production flows and 1000+ transformations.
Supporting Apache HBase: Troubleshooting and Supportability Improvements (DataWorks Summit)
This document discusses supporting Apache HBase and improving troubleshooting and supportability. It introduces two Cloudera employees who work on HBase support and provides an overview of typical troubleshooting scenarios for HBase like performance degradation, process crashes, and inconsistencies. The agenda covers using existing tools like logs and metrics to troubleshoot HBase performance issues with a general approach, and introduces htop as a real-time monitoring tool for HBase.
In the healthcare sector, data security, governance, and quality are crucial for maintaining patient privacy and ensuring the highest standards of care. At Florida Blue, the leading health insurer of Florida serving over five million members, there is a multifaceted network of care providers, business users, sales agents, and other divisions relying on the same datasets to derive critical information for multiple applications across the enterprise. However, maintaining consistent data governance and security for protected health information and other extended data attributes has always been a complex challenge that did not easily accommodate the wide range of needs for Florida Blue’s many business units. Using Apache Ranger, we developed a federated Identity & Access Management (IAM) approach that allows each tenant to have their own IAM mechanism. All user groups and roles are propagated across the federation in order to determine users’ data entitlement and access authorization; this applies to all stages of the system, from the broadest tenant levels down to specific data rows and columns. We also enabled audit attributes to ensure data quality by documenting data sources, reasons for data collection, date and time of data collection, and more. In this discussion, we will outline our implementation approach, review the results, and highlight our “lessons learned.”
Presto: Optimizing Performance of SQL-on-Anything Engine (DataWorks Summit)
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Bloomberg, Comcast, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores.
With the ever-growing list of connectors to new data sources such as Azure Blob Storage, Elasticsearch, Netflix Iceberg, Apache Kudu, and Apache Pulsar, recently introduced Cost-Based Optimizer in Presto must account for heterogeneous inputs with differing and often incomplete data statistics. This talk will explore this topic in detail as well as discuss best use cases for Presto across several industries. In addition, we will present recent Presto advancements such as Geospatial analytics at scale and the project roadmap going forward.
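To give a feel for querying Presto from an application, a minimal sketch with the presto-python-client (host, catalog, and table names are hypothetical):

```python
import prestodb

conn = prestodb.dbapi.connect(
    host="presto-coordinator", port=8080,
    user="analyst", catalog="hive", schema="default")
cursor = conn.cursor()

# A single Presto query can span connectors, e.g. joining a Hive table
# with a relational or NoSQL catalog; here a simple aggregation suffices.
cursor.execute("SELECT region, COUNT(*) FROM events GROUP BY region")
for region, n in cursor.fetchall():
    print(region, n)
```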
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl... (DataWorks Summit)
Specialized tools for machine learning development and model governance are becoming essential. MlFlow is an open source platform for managing the machine learning lifecycle. Just by adding a few lines of code in the function or script that trains their model, data scientists can log parameters, metrics, artifacts (plots, miscellaneous files, etc.) and a deployable packaging of the ML model. Every time that function or script is run, the results will be logged automatically as a byproduct of those lines of code being added, even if the party doing the training run makes no special effort to record the results. MLflow application programming interfaces (APIs) are available for the Python, R and Java programming languages, and MLflow sports a language-agnostic REST API as well. Over a relatively short time period, MLflow has garnered more than 3,300 stars on GitHub , almost 500,000 monthly downloads and 80 contributors from more than 40 companies. Most significantly, more than 200 companies are now using MLflow. We will demo MlFlow Tracking , Project and Model components with Azure Machine Learning (AML) Services and show you how easy it is to get started with MlFlow on-prem or in the cloud.
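Those "few lines of code" look roughly like this (a minimal sketch using the MLflow tracking API; the parameter and metric choices are illustrative):

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

with mlflow.start_run():
    C = 0.5
    model = LogisticRegression(C=C, max_iter=200).fit(X_train, y_train)

    # Each run's parameters, metrics, and packaged model are logged to the
    # tracking server (or to ./mlruns when none is configured).
    mlflow.log_param("C", C)
    mlflow.log_metric("accuracy",
                      accuracy_score(y_test, model.predict(X_test)))
    mlflow.sklearn.log_model(model, "model")
```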
Extending Twitter's Data Platform to Google Cloud (DataWorks Summit)
Twitter's data platform is built using multiple complex open source and in-house projects to support data analytics on hundreds of petabytes of data. Our platform supports storage, compute, data ingestion, discovery, and management, along with various tools and libraries that help users with both batch and real-time analytics. Our data platform operates on multiple clusters across different data centers to help thousands of users discover valuable insights. As we scaled our data platform to multiple clusters, we also evaluated various cloud vendors to support use cases outside of our data centers. In this talk we share our architecture and how we extend our data platform to use the cloud as another data center. We walk through our evaluation process and the challenges we faced supporting data analytics at Twitter scale in the cloud, and present our current solution. Extending Twitter's data platform to the cloud was a complex task, which we dive deep into in this presentation.
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi (DataWorks Summit)
At Comcast, our team has been architecting a customer experience platform which is able to react to near-real-time events and interactions and deliver appropriate and timely communications to customers. By combining the low latency capabilities of Apache Flink and the dataflow capabilities of Apache NiFi we are able to process events at high volume to trigger, enrich, filter, and act/communicate to enhance customer experiences. Apache Flink and Apache NiFi complement each other with their strengths in event streaming and correlation, state management, command-and-control, parallelism, development methodology, and interoperability with surrounding technologies. We will trace our journey from starting with Apache NiFi over three years ago and our more recent introduction of Apache Flink into our platform stack to handle more complex scenarios. In this presentation we will compare and contrast which business and technical use cases are best suited to which platform and explore different ways to integrate the two platforms into a single solution.
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger (DataWorks Summit)
Companies are increasingly moving to the cloud to store and process data. One of the challenges companies have is in securing data across hybrid environments with easy way to centrally manage policies. In this session, we will talk through how companies can use Apache Ranger to protect access to data both in on-premise as well as in cloud environments. We will go into details into the challenges of hybrid environment and how Ranger can solve it. We will also talk through how companies can further enhance the security by leveraging Ranger to anonymize or tokenize data while moving into the cloud and de-anonymize dynamically using Apache Hive, Apache Spark or when accessing data from cloud storage systems. We will also deep dive into the Ranger’s integration with AWS S3, AWS Redshift and other cloud native systems. We will wrap it up with an end to end demo showing how policies can be created in Ranger and used to manage access to data in different systems, anonymize or de-anonymize data and track where data is flowing.
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory... (DataWorks Summit)
Advanced Big Data Processing frameworks have been proposed to harness the fast data transmission capability of Remote Direct Memory Access (RDMA) over high-speed networks such as InfiniBand, RoCEv1, RoCEv2, iWARP, and OmniPath. However, with the introduction of the Non-Volatile Memory (NVM) and NVM express (NVMe) based SSD, these designs along with the default Big Data processing models need to be re-assessed to discover the possibilities of further enhanced performance. In this talk, we will present, NRCIO, a high-performance communication runtime for non-volatile memory over modern network interconnects that can be leveraged by existing Big Data processing middleware. We will show the performance of non-volatile memory-aware RDMA communication protocols using our proposed runtime and demonstrate its benefits by incorporating it into a high-performance in-memory key-value store, Apache Hadoop, Tez, Spark, and TensorFlow. Evaluation results illustrate that NRCIO can achieve up to 3.65x performance improvement for representative Big Data processing workloads on modern data centers.
Background: Some early applications of Computer Vision in Retail arose from e-commerce use cases - but increasingly, it is being used in physical stores in a variety of new and exciting ways, such as:
● Optimizing merchandising execution, in-stocks and sell-thru
● Enhancing operational efficiencies, enable real-time customer engagement
● Enhancing loss prevention capabilities, response time
● Creating frictionless experiences for shoppers
Abstract: This talk will cover the use of Computer Vision in Retail, the implications to the broader Consumer Goods industry and share business drivers, use cases and benefits that are unfolding as an integral component in the remaking of an age-old industry.
We will also take a ‘peek under the hood’ of Computer Vision and Deep Learning, sharing technology design principles and skill set profiles to consider before starting your CV journey.
Deep learning has matured considerably in the past few years to produce human or superhuman abilities in a variety of computer vision paradigms. We will discuss ways to recognize these paradigms in retail settings, collect and organize data to create actionable outcomes with the new insights and applications that deep learning enables.
We will cover the basics of object detection, then move into the advanced processing of images, describing possible ways that a retail store of the near future could operate: identifying various storefront situations by attaching a deep learning system to a camera stream, such as identifying item stocks on shelves, a shelf in need of organization, or perhaps a wandering customer in need of assistance.
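A hedged sketch of the object detection building block, using a pretrained torchvision detector (the model choice, file name, and threshold are illustrative; a retail deployment would fine-tune on store-specific classes):

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# A detector pretrained on COCO; shelves, products, and people would come
# from fine-tuning in a real retail system.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

image = Image.open("shelf_camera_frame.jpg")  # hypothetical camera frame
with torch.no_grad():
    predictions = model([to_tensor(image)])[0]

for box, label, score in zip(predictions["boxes"],
                             predictions["labels"],
                             predictions["scores"]):
    if score > 0.8:  # illustrative confidence threshold
        print(label.item(), score.item(), box.tolist())
```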
We will also cover how to use a computer vision system to automatically track customer purchases to enable a streamlined checkout process, and how deep learning can power plausible wardrobe suggestions based on what a customer is currently wearing or purchasing.
Finally, we will cover the various technologies that are powering these applications today: deep learning tools for research and development, production tools to distribute that intelligence to an entire inventory of cameras situated around a retail location, and tools for exploring and understanding the new data streams produced by the computer vision systems.
By the end of this talk, attendees should understand the impact Computer Vision and Deep Learning are having in the Consumer Goods industry, key use cases, techniques and key considerations leaders are exploring and implementing today.
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark (DataWorks Summit)
Whole genome shotgun based next generation transcriptomics and metagenomics studies often generate 100 to 1000 gigabytes (GB) sequence data derived from tens of thousands of different genes or microbial species. De novo assembling these data requires an ideal solution that both scales with data size and optimizes for individual gene or genomes. Here we developed an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions the reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomics and metagenomics test datasets from both short read and long read sequencing technologies. It achieved a near linear scalability with respect to input data size and number of compute nodes. SpaRC can run on different cloud computing environments without modifications while delivering similar performance. In summary, our results suggest SpaRC provides a scalable solution for clustering billions of reads from the next-generation sequencing experiments, and Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar big data genomics problems.
Keynote: AI & Future Of Offensive Security (Priyanka Aash)
In the presentation, the focus is on the transformative impact of artificial intelligence (AI) in cybersecurity, particularly in the context of malware generation and adversarial attacks. AI promises to revolutionize the field by enabling scalable solutions to historically challenging problems such as continuous threat simulation, autonomous attack path generation, and the creation of sophisticated attack payloads. The discussions underscore how AI-powered tools like AI-based penetration testing can outpace traditional methods, enhancing security posture by efficiently identifying and mitigating vulnerabilities across complex attack surfaces. The use of AI in red teaming further amplifies these capabilities, allowing organizations to validate security controls effectively against diverse adversarial scenarios. These advancements not only streamline testing processes but also bolster defense strategies, ensuring readiness against evolving cyber threats.
Generative AI technology is a fascinating field that focuses on creating comp... (Nohoax Kanont)
Generative AI technology is a fascinating field that focuses on creating computer models capable of generating new, original content. It leverages the power of large language models, neural networks, and machine learning to produce content that can mimic human creativity. This technology has seen a surge in innovation and adoption since the introduction of ChatGPT in 2022, leading to significant productivity benefits across various industries. With its ability to generate text, images, video, and audio, generative AI is transforming how we interact with technology and the types of tasks that can be automated.
TrustArc Webinar - Innovating with TRUSTe Responsible AI Certification (TrustArc)
In a landmark year marked by significant AI advancements, it’s vital to prioritize transparency, accountability, and respect for privacy rights with your AI innovation.
Learn how to navigate the shifting AI landscape with our innovative solution, TRUSTe Responsible AI Certification, the first AI certification designed for data protection and privacy. Crafted by a team that has issued 10,000+ privacy certifications, this framework integrates industry standards and laws for responsible AI governance.
This webinar will review:
- How compliance can play a role in the development and deployment of AI systems
- How to model trust and transparency across products and services
- How to save time and work smarter in understanding regulatory obligations, including AI
- How to operationalize and deploy AI governance best practices in your organization
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an... (Zilliz)
Enterprises have traditionally prioritized data quantity, assuming more is better for AI performance. However, a new reality is setting in: high-quality data, not just volume, is the key. This shift exposes a critical gap – many organizations struggle to understand their existing data and lack effective curation strategies and tools. This talk dives into these data challenges and explores the methods of automating data curation.
Keynote: Presentation on SASE Technology (Priyanka Aash)
Secure Access Service Edge (SASE) solutions are revolutionizing enterprise networks by integrating SD-WAN with comprehensive security services. Traditionally, enterprises managed multiple point solutions for network and security needs, leading to complexity and resource-intensive operations. SASE, as defined by Gartner, consolidates these functions into a unified cloud-based service, offering SD-WAN capabilities alongside advanced security features like secure web gateways, CASB, and remote browser isolation. This convergence not only simplifies management but also enhances security posture and application performance across global networks and cloud environments. Discover how adopting SASE can streamline operations and fortify your enterprise's digital transformation strategy.
How UiPath Discovery Suite supports identification of Agentic Process Automat... (DianaGray10)
📚 Understand the basics of the newly persona-based LLM-powered Agentic Process Automation and discover how existing UiPath Discovery Suite products like Communication Mining, Process Mining, and Task Mining can be leveraged to identify APA candidates.
Topics Covered:
💡 Idea Behind APA: Explore the innovative concept of Agentic Process Automation and its significance in modern workflows.
🔄 How APA is Different from RPA: Learn the key differences between Agentic Process Automation and Robotic Process Automation.
🚀 Discover the Advantages of APA: Uncover the unique benefits of implementing APA in your organization.
🔍 Identifying APA Candidates with UiPath Discovery Products: See how UiPath's Communication Mining, Process Mining, and Task Mining tools can help pinpoint potential APA candidates.
🔮 Discussion on Expected Future Impacts: Engage in a discussion on the potential future impacts of APA on various industries and business processes.
Enhance your knowledge on the forefront of automation technology and stay ahead with Agentic Process Automation. 🧠💼✨
Speakers:
Arun Kumar Asokan, Delivery Director (US) @ qBotica and UiPath MVP
Naveen Chatlapalli, Solution Architect @ Ashling Partners and UiPath MVP
UiPath Community Day Amsterdam: Code, Collaborate, Connect (UiPathCommunity)
Welcome to our third live UiPath Community Day Amsterdam! Come join us for a half-day of networking and UiPath Platform deep-dives, for devs and non-devs alike, in the middle of summer ☀.
📕 Agenda:
12:30 Welcome Coffee/Light Lunch ☕
13:00 Event opening speech
Ebert Knol, Managing Partner, Tacstone Technology
Jonathan Smith, UiPath MVP, RPA Lead, Ciphix
Cristina Vidu, Senior Marketing Manager, UiPath Community EMEA
Dion Mes, Principal Sales Engineer, UiPath
13:15 ASML: RPA as Tactical Automation
Tactical robotic process automation for solving short-term challenges, while establishing standard and re-usable interfaces that fit IT's long-term goals and objectives.
Yannic Suurmeijer, System Architect, ASML
13:30 PostNL: an insight into RPA at PostNL
Showcasing the solutions our automations have provided, the challenges we’ve faced, and the best practices we’ve developed to support our logistics operations.
Leonard Renne, RPA Developer, PostNL
13:45 Break (30')
14:15 Breakout Sessions: Round 1
Modern Document Understanding in the cloud platform: AI-driven UiPath Document Understanding
Mike Bos, Senior Automation Developer, Tacstone Technology
Process Orchestration: scale up and have your Robots work in harmony
Jon Smith, UiPath MVP, RPA Lead, Ciphix
UiPath Integration Service: connect applications, leverage prebuilt connectors, and set up customer connectors
Johans Brink, CTO, MvR digital workforce
15:00 Breakout Sessions: Round 2
Automation, and GenAI: practical use cases for value generation
Thomas Janssen, UiPath MVP, Senior Automation Developer, Automation Heroes
Human in the Loop/Action Center
Dion Mes, Principal Sales Engineer @UiPath
Improving development with coded workflows
Idris Janszen, Technical Consultant, Ilionx
15:45 End remarks
16:00 Community fun games, sharing knowledge, drinks, and bites 🍻
The Zaitechno Handheld Raman Spectrometer is a powerful and portable tool for rapid, non-destructive chemical analysis. It utilizes Raman spectroscopy, a technique that analyzes the vibrational fingerprint of molecules to identify their chemical composition. This handheld instrument allows for on-site analysis of materials, making it ideal for a variety of applications, including:
Material identification: Identify unknown materials, minerals, and contaminants.
Quality control: Ensure the quality and consistency of raw materials and finished products.
Pharmaceutical analysis: Verify the identity and purity of pharmaceutical compounds.
Food safety testing: Detect contaminants and adulterants in food products.
Field analysis: Analyze materials in the field, such as during environmental monitoring or forensic investigations.
The spectrometer is easy to use, with a user-friendly interface, and its compact, lightweight design suits it to field work. Its rapid analysis capabilities can improve efficiency and productivity in research and quality-control workflows.
What’s new in Apache Spark 2.3
1. What's New in Apache Spark 2.3?
Xiao Li & Wenchen Fan
DataWorks Summit | SJ | Jun 2018
2. About Us
• Software Engineers at Databricks
• Apache Spark Committers and PMC Members
Xiao Li (GitHub: gatorsmile) and Wenchen Fan (GitHub: cloud-fan)
3. Databricks' Unified Analytics Platform
[Platform diagram: Collaborative Notebooks on top of the Databricks Runtime (Delta, SQL, Streaming), delivered as a cloud-native service and powered by Apache Spark]
• Unifies data engineers and data scientists
• Unifies data and AI technologies
• Eliminates infrastructure complexity
4. Major Features on Spark 2.3
Continuous Processing · Data Source API V2 · Stream-stream Join · Spark on Kubernetes · History Server V2 · UDF Enhancements · Various SQL Features · PySpark Performance · Native ORC Support · Stable Codegen · Image Reader · ML on Streaming
Around 1400 issues resolved!
5. Major Features on Spark 2.3 (section divider repeating the feature list above)
6. Structured Streaming
Introduced in Spark 2.0. Among Databricks customers:
• 10x more usage than DStream
• 100+ trillion records processed in production
Michael Armbrust, Tathagata Das, Joseph Torres, Burak Yavuz, Shixiong Zhu, Reynold Xin, Ali Ghodsi, Ion Stoica, and Matei Zaharia. "Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark." SIGMOD '18.
7. Structured Streaming
Stream processing on the Spark SQL engine:
• fast, scalable, fault-tolerant
• rich, unified, high-level APIs that deal with complex data and complex workloads
• rich ecosystem of data sources, integrating with many storage systems
10. Continuous Processing
The only change you need!
Blog: "Introducing Low-latency Continuous Processing Mode in Structured Streaming in Apache Spark 2.3" (http://ow.ly/e7lS30kob7X)
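The slide's code did not survive the transcript, but the change it refers to is the trigger. A minimal sketch (not from the slides) using the built-in rate source and console sink, which are among the sources/sinks supported in continuous mode in 2.3; everything else in the query is the same as in micro-batch mode:

import org.apache.spark.sql.streaming.Trigger

val query = spark.readStream
  .format("rate")                            // built-in test source, 1 row/sec
  .load()
  .writeStream
  .format("console")
  .trigger(Trigger.Continuous("1 second"))   // the only change you need
  .start()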
12. Stream-stream Joins
Example: ad monetization. Join a stream of ad impressions with another stream of their corresponding user clicks.
Blog: "Introducing Stream-Stream Joins in Apache Spark 2.3" (ow.ly/oxpv30jbybJ)
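A minimal sketch of the pattern (impressions and clicks stand for the two streaming DataFrames; the column names impressionAdId, clickAdId, impressionTime, and clickTime are assumptions). The watermarks bound how long the engine buffers each side's state while waiting for a match:

import org.apache.spark.sql.functions.expr

val impressionsWithWatermark = impressions.withWatermark("impressionTime", "2 hours")
val clicksWithWatermark     = clicks.withWatermark("clickTime", "3 hours")

// Keep clicks that follow their impression within one hour.
val joined = impressionsWithWatermark.join(
  clicksWithWatermark,
  expr("""
    clickAdId = impressionAdId AND
    clickTime >= impressionTime AND
    clickTime <= impressionTime + interval 1 hour
  """))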
13. Major Features on Spark 2.3 (section divider)
14. ML on Streaming
Model transformation/prediction on batch and streaming data with a unified API. After fitting a model or Pipeline, you can deploy it in a streaming job.
val streamOutput = transformer.transform(streamDF)
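Expanding the slide's one-liner into a runnable shape (the model path, input schema, source directory, and query name are hypothetical examples):

import org.apache.spark.ml.PipelineModel

val model = PipelineModel.load("/models/my-pipeline")   // previously fitted pipeline
val streamDF = spark.readStream
  .schema(inputSchema)                                  // assumed StructType of the input
  .parquet("/data/incoming")                            // hypothetical streaming source
val scored = model.transform(streamDF)                  // same call as in batch
scored.writeStream
  .format("memory")
  .queryName("predictions")
  .start()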
16. Image Support in Spark
Spark image data source [SPARK-21866]:
• Defines a standard API in Spark for reading images into DataFrames
• Deep learning frameworks can rely on it
val df = ImageSchema.readImages("/data/images")
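For reference, readImages lives in org.apache.spark.ml.image and returns a DataFrame with a single "image" struct column (origin, height, width, nChannels, mode, data); the directory path below is the slide's own example:

import org.apache.spark.ml.image.ImageSchema

val df = ImageSchema.readImages("/data/images")
df.select("image.origin", "image.height", "image.width").show()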
17. Major Features on Spark 2.3 (section divider)
18. PySpark
• Introduced in Spark 0.7 (~2013); became a first-class citizen in the DataFrame API in Spark 1.3 (~2015)
19. PySpark Performance
For single-node analytics, Spark offers faster runtime and greater scalability than PyData tooling (e.g., Pandas, NumPy), thanks to:
• multi-core parallelism
• lower memory consumption
• a better-pipelined execution engine
Blog: "Benchmarking Apache Spark on a single-node machine" (ow.ly/p1J530jORLw)
20. PySpark Performance
Python UDFs are much slower than Scala/Java UDFs, due to serialization costs and the Python interpreter. Spark 2.3 adds fast data serialization and execution using vectorized (Arrow-based) formats [SPARK-22216] [SPARK-21187]:
• Conversion from/to Pandas: df.toPandas() and createDataFrame(pandas_df), accelerated when spark.sql.execution.arrow.enabled is set to true (off by default in 2.3)
• Pandas UDFs: UDFs that use Pandas to process data, in two flavors: Scalar Pandas UDFs and Grouped Map Pandas UDFs
22. Major Features on Spark 2.3 (section divider)
24. Native Spark App in K8s
• New Spark scheduler backend
• The driver runs in a Kubernetes pod created by the submission client and creates pods that run the executors, in response to requests from the Spark scheduler [K8S-34377] [SPARK-18278]
• Makes direct use of Kubernetes clusters for multi-tenancy and sharing through Namespaces and Quotas, as well as administrative features such as pluggable authorization and logging
25. Spark on Kubernetes
Features in Apache Spark 2.3:
• Supports Kubernetes 1.6 and up
• Supports cluster mode only
• Static resource allocation only
• Supports Java and Scala applications
• Can use container-local and remote dependencies that are downloadable
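A typical submission, adapted from the Spark 2.3 documentation (the API server address and container image are placeholders to fill in for your cluster):

bin/spark-submit \
  --master k8s://https://<k8s-apiserver-host>:<port> \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=<spark-image> \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.3.0.jar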
26. Major Features on Spark 2.3 (section divider)
27. Major Features on Spark 2.3 (section divider)
28. History Server Using a K-V Store
Stateless and non-scalable History Server V1:
• Requires parsing the event logs (that means: slow!)
• Requires holding app lists and UIs in memory (and then: OOM!)
[SPARK-18085] K-V store-based History Server V2:
• Stores app lists and UIs in a persistent K-V store (LevelDB)
• spark.history.store.path: once specified, LevelDB is used; otherwise an in-memory K-V store (still stateless, like V1)
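A minimal spark-defaults.conf sketch for the host running the history server (both paths are hypothetical examples):

# Where the event logs are read from, and where the LevelDB cache lives.
spark.history.fs.logDirectory   hdfs:///spark-event-logs
spark.history.store.path        /var/lib/spark/history-kvstore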
29. Major Features on Spark 2.3 (section divider)
30. What's Wrong With V1?
• Leaks upper-level APIs into the data source (RDD/SQLContext)
• Hard to extend the Data Source API for more optimizations
• Zero transaction guarantees in the write APIs
• Batch only
31. Features in Data Source V2
• Columnar scan support
• Flexible operator push-down framework
• Can report basic statistics and data partitioning
• Transactional write API
• Unified batch and streaming interfaces
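To make the shape of the new API concrete, here is a minimal read-only source sketched against the 2.3-era interfaces (the class names SimpleSource/SimpleFactory and the five-row data are invented for illustration; note these interfaces were still evolving and changed again in later releases):

import java.util.{List => JList}
import scala.collection.JavaConverters._
import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.v2.{DataSourceOptions, DataSourceV2, ReadSupport}
import org.apache.spark.sql.sources.v2.reader.{DataReader, DataReaderFactory, DataSourceReader}
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

class SimpleSource extends DataSourceV2 with ReadSupport {
  override def createReader(options: DataSourceOptions): DataSourceReader =
    new DataSourceReader {
      // The engine asks the source for its schema...
      override def readSchema(): StructType = StructType(Seq(StructField("i", IntegerType)))
      // ...and for one reader factory per partition (a single partition here).
      override def createDataReaderFactories(): JList[DataReaderFactory[Row]] =
        Seq[DataReaderFactory[Row]](new SimpleFactory).asJava
    }
}

class SimpleFactory extends DataReaderFactory[Row] {
  override def createDataReader(): DataReader[Row] = new DataReader[Row] {
    private var i = 0
    override def next(): Boolean = { i += 1; i <= 5 }  // emits five rows: 1..5
    override def get(): Row = Row(i)
    override def close(): Unit = ()
  }
}

With this in place, spark.read.format(classOf[SimpleSource].getName).load() returns a five-row DataFrame; the same reader interfaces back the unified streaming path.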
32. Major Features on Spark 2.3 (section divider)
33. UDF Enhancements
• [SPARK-19285] Implement UDF0 (a SQL UDF that takes 0 arguments)
• [SPARK-22945] Add Java UDF APIs in the functions object
• [SPARK-21499] Support creating SQL functions for Spark UDAFs (UserDefinedAggregateFunction)
• [SPARK-20586] [SPARK-20416] [SPARK-20668] Annotate UDFs with name, nullability, and determinism
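As one concrete example of the determinism annotation (the UDF bodies are invented), marking a UDF nondeterministic tells the optimizer not to collapse, reorder, or re-execute it:

import org.apache.spark.sql.functions.{col, udf}

val plusOne  = udf((x: Long) => x + 1)                                      // ordinary UDF
val randomId = udf(() => scala.util.Random.nextLong()).asNondeterministic() // new in 2.3

spark.range(5)
  .withColumn("y", plusOne(col("id")))
  .withColumn("rid", randomId())
  .show()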
34. Java UDF and UDAF in PySpark
• Register a Java UDF or UDAF as a SQL function and use it from PySpark.
35. Major Features on Spark 2.3 (section divider)
36. Stable Codegen
• [SPARK-22510] [SPARK-22692] Stabilize the codegen framework to avoid hitting the 64KB JVM bytecode limit on a single Java method and the Java compiler's constant-pool limit.
• [SPARK-21871] Turn off whole-stage codegen when the bytecode of the generated Java function is larger than spark.sql.codegen.hugeMethodLimit. (HotSpot's limit on method bytecode eligible for JIT compilation is 8K.)
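For illustration, the limit can be lowered at runtime; 8000 here is just an example value matching the HotSpot threshold mentioned above, not a recommended setting:

// Fall back to interpreted execution for generated functions above ~8K bytecode.
spark.conf.set("spark.sql.codegen.hugeMethodLimit", 8000)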
37. Major Features on Spark 2.3 (section divider)
38. Vectorized ORC Reader
• [SPARK-20682] New ORCFileFormat based on ORC 1.4.1: spark.sql.orc.impl = native / hive (default)
• [SPARK-16060] Vectorized ORC reader: spark.sql.orc.enableVectorizedReader = true (default) / false
• Suggestion: enable filter pushdown for ORC files when using the native reader
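Putting those settings together (the file path is a hypothetical example):

spark.conf.set("spark.sql.orc.impl", "native")                 // opt into the new reader
spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")
spark.conf.set("spark.sql.orc.filterPushdown", "true")         // the suggested pushdown
val orcDF = spark.read.orc("/data/events.orc")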
39. Major Features on Spark 2.3 (section divider)
40. Performance
• [SPARK-21975] Histogram support in the cost-based optimizer
• Enhancements in the rule-based optimizer and planner (e.g., constant propagation): [SPARK-22489] [SPARK-22916] [SPARK-22895] [SPARK-20758] [SPARK-22266] [SPARK-19122] [SPARK-22662] [SPARK-21652]
• [SPARK-20331] Broaden support for partition-pruning predicate pushdown (e.g., date = 20161011 or date = 20161014)
• [SPARK-20822] Vectorized reader for the table cache
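A sketch of how the histogram support is exercised (the table and column names are invented; both flags are off by default in 2.3). Histograms are collected by ANALYZE and consumed by the cost-based optimizer:

spark.conf.set("spark.sql.cbo.enabled", "true")
spark.conf.set("spark.sql.statistics.histogram.enabled", "true")
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS price, region")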
41. API
• Improved ANSI SQL compliance and Dataset/DataFrame APIs
• More built-in functions [SPARK-20746]
• Better Hive compatibility:
  • [SPARK-20236] Support dynamic partition overwrite for data source tables
  • [SPARK-17729] Enable creating Hive bucketed tables
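For the partition-overwrite change, a small sketch (df and the table name are hypothetical): with the dynamic mode, overwrite replaces only the partitions present in df, rather than truncating the whole table first:

spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
df.write.mode("overwrite").insertInto("events_by_date")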
42. Major Features on Spark 2.3 (recap of the feature list)
Around 1400 issues resolved!
44. Apache Spark 2.4+
• Built-in higher-order functions: transform, arrays_zip, array_remove, arrays_overlap, …
• Adaptive query planning: dynamically adjust the query plan according to the real (shuffled) data
• Deep learning integration: gang scheduling, barrier sync, fast data exchange, …
• ML pipelines in SparkR: port the Pipeline API to R
• Build improvements: Scala 2.12, Java 9, Hadoop 3.0, …