What's New in Apache Spark 2.3?
Xiao Li & Wenchen Fan
DataWorks Summit | SJ | Jun 2018
About Us
• Software Engineers at Databricks
• Apache Spark Committers and PMC Members
Xiao Li (Github: gatorsmile) Wenchen Fan (Github: cloud-fan)
Databricks’ Unified Analytics Platform
• Databricks Runtime: Delta, SQL, Streaming
• Collaborative notebooks for data engineers and data scientists
• Cloud-native service
• Unifies data engineers and data scientists
• Unifies data and AI technologies
• Eliminates infrastructure complexity
Major Features on Spark 2.3
• Continuous Processing
• Data Source API V2
• Stream-stream Join
• Spark on Kubernetes
• History Server V2
• UDF Enhancements
• Various SQL Features
• PySpark Performance
• Native ORC Support
• Stable Codegen
• Image Reader
• ML on Streaming
Around 1400 issues resolved!
Structured Streaming
Introduced in Spark 2.0
Among Databricks customers:
- 10X more usage than DStream
- Processed 100+ trillion records in production
Michael Armbrust, Tathagata Das, Joseph Torres, Burak Yavuz,
Shixiong Zhu, Reynold Xin, Ali Ghodsi, Ion Stoica, and Matei Zaharia.
Structured Streaming: A Declarative API for Real-Time Applications in
Apache Spark. SIGMOD '18
Structured Streaming
stream processing on Spark SQL engine
fast, scalable, fault-tolerant
rich, unified, high level APIs
deal with complex data and complex workloads
rich ecosystem of data sources
integrate with many storage systems
Continuous Processing Execution Mode
Micro-batch Processing (since 2.0 release)
• End-to-end latencies as low as ~100 ms
• Exactly-once fault-tolerance guarantees
Continuous Processing (since 2.3 release) [SPARK-20928]
• A new streaming execution mode (experimental)
• Low (~1 ms) end-to-end latency
• At-least-once guarantees
Continuous Processing
The only change you need! [code screenshot not preserved]
Blog: "Introducing Low-latency Continuous Processing Mode in Structured Streaming in Apache Spark 2.3" http://ow.ly/e7lS30kob7X
Continuous Processing
Supported Operations:
• Map-like Dataset operations: projections, selections
• All SQL functions except current_timestamp(), current_date() and aggregation functions
Supported Sources:
• Kafka source
• Rate source
Supported Sinks:
• Kafka sink
• Memory sink
• Console sink
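A minimal sketch of a continuous-mode query that stays within the supported sources, sinks and operations listed above. The broker address, topic names and checkpoint path are hypothetical, and the job is assumed to be submitted with the spark-sql-kafka-0-10 connector on the classpath. Relative to a micro-batch query, the only line that differs is the trigger.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object ContinuousTriggerSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("continuous-trigger-sketch")
      .getOrCreate()

    // Kafka source (hypothetical broker and topic).
    val input = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:9092")
      .option("subscribe", "events-in")
      .load()

    // Map-like operations only: a projection and a selection, no aggregation.
    val output = input
      .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
      .where("value IS NOT NULL")

    val query = output.writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:9092")
      .option("topic", "events-out")
      .option("checkpointLocation", "/tmp/checkpoints/continuous-sketch")
      // Continuous mode; "1 second" is the checkpoint interval, not a batch interval.
      .trigger(Trigger.Continuous("1 second"))
      .start()

    query.awaitTermination()
  }
}
```

Removing the trigger line (or replacing it with Trigger.ProcessingTime) would run the same query in micro-batch mode.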
Stream-stream Joins
Example: Ad Monetization
Join a stream of ad impressions with another stream of their corresponding user clicks
Blog: "Introducing Stream-Stream Joins in Apache Spark 2.3"
ow.ly/oxpv30jbybJ
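The ad monetization join can be sketched roughly as follows. This is an illustration under assumptions, not the code from the slide: the two streams are stand-ins built from the built-in rate source, and the column names, watermark delays and the one-hour click window are hypothetical choices.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

object StreamStreamJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("stream-stream-join-sketch")
      .getOrCreate()

    // Stand-in streams: the rate source emits (timestamp, value) rows.
    val impressions = spark.readStream.format("rate").option("rowsPerSecond", "5").load()
      .selectExpr("value AS impressionAdId", "timestamp AS impressionTime")
    val clicks = spark.readStream.format("rate").option("rowsPerSecond", "5").load()
      .selectExpr("value AS clickAdId", "timestamp AS clickTime")

    // Watermarks bound how late events may arrive, which lets Spark
    // discard join state that can no longer produce matches.
    val impressionsWithWatermark = impressions.withWatermark("impressionTime", "2 hours")
    val clicksWithWatermark = clicks.withWatermark("clickTime", "3 hours")

    // Inner join: same ad id, and the click happens within one hour of the impression.
    val joined = impressionsWithWatermark.join(
      clicksWithWatermark,
      expr("""
        clickAdId = impressionAdId AND
        clickTime >= impressionTime AND
        clickTime <= impressionTime + interval 1 hour
      """))

    joined.writeStream
      .format("console")
      .outputMode("append")
      .start()
      .awaitTermination()
  }
}
```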
ML on Streaming
Model transformation/prediction on batch and streaming
data with unified API.
After fitting a model or Pipeline, you can deploy it in a
streaming job.
val streamOutput = transformer.transform(streamDF)
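A slightly fuller sketch of the same idea, assuming a hypothetical Parquet training set with numeric columns x1 and x2 and a 0/1 label column. The paths, column names and the choice of logistic regression are illustrative assumptions only.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object StreamingScoringSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("streaming-scoring-sketch")
      .getOrCreate()

    // Hypothetical batch training data with columns x1, x2 (numeric) and label.
    val trainDF = spark.read.parquet("/data/train")

    // Fit the Pipeline on batch data.
    val assembler = new VectorAssembler()
      .setInputCols(Array("x1", "x2"))
      .setOutputCol("features")
    val lr = new LogisticRegression()
      .setFeaturesCol("features")
      .setLabelCol("label")
    val transformer = new Pipeline().setStages(Array(assembler, lr)).fit(trainDF)

    // File-based streams need an explicit schema; reuse the training schema here.
    val streamDF = spark.readStream.schema(trainDF.schema).parquet("/data/incoming")

    // The same transform() call used for batch scoring is applied to the stream.
    val streamOutput = transformer.transform(streamDF)

    streamOutput.select("x1", "x2", "prediction")
      .writeStream
      .format("console")
      .outputMode("append")
      .start()
      .awaitTermination()
  }
}
```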
ML on Streaming
Notebook: https://docs.databricks.com/spark/latest/mllib/mllib-pipelines-and-stuctured-streaming.html
SPARK+AI Summit 2018: "Deploying MLlib for Scoring in Structured Streaming"
Image Support in Spark
Spark Image data source [SPARK-21866]:
• Defined a standard API in Spark for reading images into
DataFrames
• Deep learning frameworks can rely on this.
import org.apache.spark.ml.image.ImageSchema
val df = ImageSchema.readImages("/data/images")
PySpark
• Introduced in Spark 0.7 (~2013); became a first-class citizen in the DataFrame API in Spark 1.3 (~2015)
PySpark Performance
For single-node analytics, Spark offers faster runtime and greater scalability than PyData tooling (e.g. pandas, NumPy):
• multi-core parallelism
• less memory consumption
• better-pipelined execution engine
Blog: "Benchmarking Apache Spark on a single-node machine" ow.ly/p1J530jORLw
PySpark Performance
PySpark is much slower than Scala/Java when using user-defined functions (UDFs), due to serialization overhead and the Python interpreter.
Fast data serialization and execution using vectorized formats [SPARK-22216] [SPARK-21187]
• Conversion from/to Pandas: df.toPandas() and createDataFrame(pandas_df)
• Pandas UDFs (UDFs that use Pandas to process data): Scalar Pandas UDFs, Grouped Map Pandas UDFs
Blog: "Introducing Vectorized UDFs for PySpark"
Native Spark App in K8S
• New Spark scheduler backend
• Driver runs in a Kubernetes pod created by the submission client and creates pods that run the executors in response to requests from the Spark scheduler. [K8S-34377] [SPARK-18278]
• Make direct use of Kubernetes clusters for multi-tenancy and sharing through Namespaces and Quotas, as well as administrative features such as Pluggable Authorization, and Logging.
Spark on Kubernetes
Features in Apache Spark 2.3:
• Supports Kubernetes 1.6 and up
• Supports cluster mode only
• Static resource allocation only
• Supports Java and Scala applications
• Can use container-local and remote dependencies that are downloadable
History Server Using K-V Store
History Server V1 is stateless and does not scale:
• Requires parsing the event logs, which is slow
• Requires storing app lists and UI in memory, which can cause OOM
[SPARK-18085] K-V store-based History Server V2:
• Stores app lists and UI in a persistent K-V store (LevelDB)
• spark.history.store.path: once specified, LevelDB is used; otherwise an in-memory K-V store is used (still stateless like V1)
What’s Wrong With V1?
• Leaks upper-level APIs into the data source (RDD/SQLContext)
• Hard to extend the Data Source API for more optimizations
• No transaction guarantee in the write APIs
• Batch only
Features in Data Source V2
• Columnar scan support.
• Flexible operator pushdown framework.
• Can report basic statistics and data partitioning.
• Transactional write API.
• Unified batch and streaming interfaces.
UDF Enhancements
• [SPARK-19285] Implement UDF0 (SQL UDF that has 0 arguments)
• [SPARK-22945] Add Java UDF APIs in the functions object
• [SPARK-21499] Support creating SQL function for Spark UDAF (UserDefinedAggregateFunction)
• [SPARK-20586][SPARK-20416][SPARK-20668] Annotate UDF with Name, Nullability and Determinism
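The name, nullability and determinism annotations can be sketched on Scala UDFs as below. This uses Scala closures rather than the Java UDF0 interface, and the function names and bodies are hypothetical examples chosen only to show the three annotation calls.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

object UdfAnnotationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("udf-annotation-sketch")
      .getOrCreate()

    // Zero-argument UDF: given a name and marked nondeterministic, so the
    // optimizer does not assume that repeated calls return the same value.
    val randomId = udf(() => scala.util.Random.nextInt(1000))
      .withName("random_id")
      .asNondeterministic()

    // One-argument UDF: given a name and declared as never returning null.
    val userTag = udf((id: Long) => s"user-$id")
      .withName("user_tag")
      .asNonNullable()

    spark.range(3)
      .select(col("id"), userTag(col("id")).as("tag"), randomId().as("rid"))
      .show()

    // The same UDF object can also be registered for use from SQL.
    spark.udf.register("random_id", randomId)
    spark.sql("SELECT random_id() AS rid FROM range(3)").show()

    spark.stop()
  }
}
```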
Java UDF and UDAF in PySpark
• Register Java UDF and UDAF as a SQL function and use them in PySpark.
Stable Codegen
• [SPARK-22510] [SPARK-22692] Stabilize the codegen framework to avoid hitting the 64KB JVM bytecode limit on the Java method and the Java compiler constant pool limit.
• [SPARK-21871] Turn off whole-stage codegen when the bytecode of the generated Java function is larger than spark.sql.codegen.hugeMethodLimit. The limit of method bytecode for JIT optimization on HotSpot is 8K.
Vectorized ORC Reader
• [SPARK-20682] Add new ORCFileFormat based on ORC 1.4.1. spark.sql.orc.impl = native / hive (default)
• [SPARK-16060] Vectorized ORC reader. spark.sql.orc.enableVectorizedReader = true (default) / false
• Suggestion: enable filter pushdown for ORC files when using the native reader
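A minimal sketch of opting in to the native ORC reader together with filter pushdown, as suggested above. The output path and the toy data are hypothetical; the filter pushdown setting is assumed to be spark.sql.orc.filterPushdown.

```scala
import org.apache.spark.sql.SparkSession

object NativeOrcSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("native-orc-sketch")
      // Use the new ORC 1.4-based implementation instead of the Hive one.
      .config("spark.sql.orc.impl", "native")
      // Vectorized reading (already the default when the native reader is used).
      .config("spark.sql.orc.enableVectorizedReader", "true")
      // Push filters down into the ORC reader.
      .config("spark.sql.orc.filterPushdown", "true")
      .getOrCreate()

    val path = "/tmp/native-orc-sketch" // hypothetical location

    // Write a small ORC dataset, then read it back with a filter.
    spark.range(0, 1000)
      .selectExpr("id", "id % 10 AS bucket")
      .write.mode("overwrite").orc(path)

    val filtered = spark.read.orc(path).where("bucket = 3")
    filtered.explain() // the physical plan lists the filters pushed to the scan
    filtered.show(5)

    spark.stop()
  }
}
```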
Performance
• [SPARK-21975] Histogram support in cost-based optimizer
• Enhancements in rule-based optimizer and planner [SPARK-22489] [SPARK-22916] [SPARK-22895] [SPARK-20758] [SPARK-22266] [SPARK-19122] [SPARK-22662] [SPARK-21652] (e.g., constant propagation)
• [SPARK-20331] Broaden support for partition pruning predicate pushdown (e.g. date = 20161011 or date = 20161014)
• [SPARK-20822] Vectorized reader for table cache
API
• Improved ANSI SQL compliance and Dataset/DataFrame APIs
• More built-in functions [SPARK-20746]
• Better Hive compatibility
• [SPARK-20236] Support Dynamic Partition Overwrite for Data Source Tables
• [SPARK-17729] Enable Creating Hive Bucketed Tables
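Dynamic partition overwrite [SPARK-20236] can be sketched as follows. This is a hedged illustration: it assumes the feature is switched on through spark.sql.sources.partitionOverwriteMode, and it uses a path-based Parquet write with hypothetical data and a hypothetical path rather than a catalog table.

```scala
import org.apache.spark.sql.SparkSession

object DynamicPartitionOverwriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("dynamic-partition-overwrite-sketch")
      .getOrCreate()
    import spark.implicits._

    val path = "/tmp/dynamic-overwrite-sketch" // hypothetical location

    // Initial load: two partitions, date=20161011 and date=20161014.
    Seq((1, 20161011), (2, 20161014)).toDF("id", "date")
      .write.mode("overwrite").partitionBy("date").parquet(path)

    // Switch from the default static mode to dynamic mode.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    // Overwrite with data for a single partition. In dynamic mode only the
    // partitions present in the incoming data are replaced; in static mode
    // the whole output directory would be replaced.
    Seq((3, 20161014)).toDF("id", "date")
      .write.mode("overwrite").partitionBy("date").parquet(path)

    spark.read.parquet(path).orderBy("date", "id").show()

    spark.stop()
  }
}
```

With dynamic mode enabled, the date=20161011 partition from the first write is expected to remain in place after the second write.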
What’s Next?
Apache Spark 2.4 +
• Built-in higher-order functions: transform, arrays_zip, array_remove, arrays_overlap, …
• Adaptive query planning: dynamically adjust the query plan according to the real data (shuffled data).
• Deep learning integration: gang scheduling, barrier sync, fast data exchange, ...
• ML pipeline in SparkR: port the pipeline API to R.
• Build improvements: Scala 2.12, Java 9, Hadoop 3.0, …
• Pandas UDF improvements: window function, partial aggregate, …
• K8S integration improvements: PySpark support, dynamic allocation, external shuffle service, …
• Continuous streaming improvements: shuffle support, ...
• Data Source V2 improvements: Catalog integration, more operator push down, …
Thank you
Xiao Li (lixiao@databricks.com)
Wenchen Fan (wenchen@databricks.com)

Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
DataWorks Summit
 

What’s new in Apache Spark 2.3

  • 1. What's New in Apache Spark 2.3? Xiao Li & Wenchen Fan DataWorks Summit | SJ | Jun 2018
  • 2. About Us: Software Engineers at Databricks; Apache Spark Committers and PMC Members. Xiao Li (GitHub: gatorsmile), Wenchen Fan (GitHub: cloud-fan)
  • 3. Databricks’ Unified Analytics Platform: DATABRICKS RUNTIME; COLLABORATIVE NOTEBOOKS; Delta, SQL, Streaming; Data Engineers, Data Scientists; CLOUD NATIVE SERVICE. Unifies Data Engineers and Data Scientists. Unifies Data and AI Technologies. Eliminates infrastructure complexity.
  • 4. Major Features on Spark 2.3: Continuous Processing, Data Source API V2, Stream-stream Join, Spark on Kubernetes, History Server V2, UDF Enhancements, Various SQL Features, PySpark Performance, Native ORC Support, Stable Codegen, Image Reader, ML on Streaming. Around 1400 issues resolved!
  • 6. Structured Streaming: Introduced in Spark 2.0. Among Databricks customers: 10X more usage than DStream; processed 100+ trillion records in production. Reference: Michael Armbrust, Tathagata Das, Joseph Torres, Burak Yavuz, Shixiong Zhu, Reynold Xin, Ali Ghodsi, Ion Stoica, and Matei Zaharia. Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark. SIGMOD '18.
  • 7. Structured Streaming: stream processing on the Spark SQL engine (fast, scalable, fault-tolerant); rich, unified, high-level APIs (deal with complex data and complex workloads); rich ecosystem of data sources (integrate with many storage systems).
  • 8. Continuous Processing Execution Mode. Micro-batch Processing (since the 2.0 release): end-to-end latencies as low as ~100 ms; exactly-once fault-tolerance guarantees. Continuous Processing (since the 2.3 release) [SPARK-20928]: a new, experimental streaming execution mode; low (~1 ms) end-to-end latency; at-least-once guarantees.
  • 10. Continuous Processing: The only change you need! Blog: "Introducing Low-latency Continuous Processing Mode in Structured Streaming in Apache Spark 2.3", http://ow.ly/e7lS30kob7X
  • 11. Continuous Processing. Supported operations: map-like Dataset operations (projections, selections); all SQL functions except current_timestamp(), current_date() and aggregation functions. Supported sources: Kafka source, Rate source. Supported sinks: Kafka sink, Memory sink, Console sink.
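The slide text does not include the code behind "the only change you need". As a minimal sketch, assuming the change is the trigger setting on the streaming writer: PySpark's `DataStreamWriter.trigger` accepts a `continuous` keyword from Spark 2.3 on. The helper names below are hypothetical, and the allow-lists simply restate the sources and sinks listed above.

```python
# Sources and sinks the slides list as supported in continuous mode.
SUPPORTED_SOURCES = {"kafka", "rate"}
SUPPORTED_SINKS = {"kafka", "memory", "console"}


def check_continuous_support(source_format, sink_format):
    """Raise ValueError if the source or sink is not on the slide's list."""
    if source_format not in SUPPORTED_SOURCES:
        raise ValueError("unsupported continuous source: %s" % source_format)
    if sink_format not in SUPPORTED_SINKS:
        raise ValueError("unsupported continuous sink: %s" % sink_format)


def use_continuous_trigger(writer, checkpoint_interval="1 second"):
    """Switch a streaming writer to continuous mode.

    `writer` is expected to behave like PySpark's DataStreamWriter;
    the interval is the checkpoint interval, not a batch interval.
    """
    return writer.trigger(continuous=checkpoint_interval)


# Hypothetical usage with a real streaming DataFrame `df`:
#   check_continuous_support("rate", "console")
#   query = use_continuous_trigger(df.writeStream.format("console")).start()
```

Keeping the trigger in one helper makes it easy to switch a query between micro-batch and continuous mode without touching the rest of the pipeline.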
  • 12. Stream-stream Joins. Example: Ad Monetization. Join a stream of ad impressions with another stream of their corresponding user clicks. Blog: "Introducing Stream-Stream Joins in Apache Spark 2.3", ow.ly/oxpv30jbybJ
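To make the ad-monetization example concrete, here is a plain-Python model of a stream-stream inner join. It is a conceptual sketch, not Spark's implementation: both sides are buffered as state, and each arriving event is matched against the buffered events of the other side. The event layout and the `max_delay` bound are assumptions for illustration. Spark additionally uses watermarks to evict old state, which this sketch omits.

```python
from collections import defaultdict


def join_impressions_with_clicks(events, max_delay=3600):
    """Join impressions with clicks that follow within max_delay seconds.

    events: iterable of (kind, ad_id, timestamp) in arrival order,
            where kind is "impression" or "click".
    Returns a list of (ad_id, impression_ts, click_ts) matches.
    """
    impressions = defaultdict(list)  # buffered state for the left side
    clicks = defaultdict(list)       # buffered state for the right side
    matches = []
    for kind, ad_id, ts in events:
        if kind == "impression":
            impressions[ad_id].append(ts)
            # A click may arrive before its impression, so check the buffer.
            for click_ts in clicks[ad_id]:
                if ts <= click_ts <= ts + max_delay:
                    matches.append((ad_id, ts, click_ts))
        else:
            clicks[ad_id].append(ts)
            for imp_ts in impressions[ad_id]:
                if imp_ts <= ts <= imp_ts + max_delay:
                    matches.append((ad_id, imp_ts, ts))
    return matches


events = [
    ("impression", "a", 0),
    ("click", "a", 10),
    ("click", "b", 5),        # arrives before its impression
    ("impression", "b", 0),
    ("click", "a", 5000),     # too late, outside the time bound
]
matches = join_impressions_with_clicks(events)
```

Without a time bound and state eviction, both buffers would grow without limit, which is why a time-range condition matters in a streaming join.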
  • 14. ML on Streaming: Model transformation/prediction on batch and streaming data with a unified API. After fitting a model or Pipeline, you can deploy it in a streaming job. val streamOutput = transformer.transform(streamDF)
  • 15. ML on Streaming. Notebook: https://docs.databricks.com/spark/latest/mllib/mllib-pipelines-and-stuctured-streaming.html SPARK+AI Summit 2018: "Deploying MLlib for Scoring in Structured Streaming"
  • 16. Image Support in Spark. Spark Image data source [SPARK-21866]: defines a standard API in Spark for reading images into DataFrames; deep learning frameworks can rely on this. val df = ImageSchema.readImages("/data/images")
  • 18. PySpark: Introduced in Spark 0.7 (~2013); became a first-class citizen in the DataFrame API in Spark 1.3 (~2015).
  • 19. PySpark Performance: For single-node analytics, Spark offers faster runtime and greater scalability than PyData tooling, e.g. Pandas, NumPy: multi-core parallelism; less memory consumption; better-pipelined execution engine. Blog: "Benchmarking Apache Spark on a single-node machine", ow.ly/p1J530jORLw
  • 20. PySpark Performance: PySpark is much slower than Scala/Java with user-defined functions (UDFs), due to serialization and Python interpreter overhead. Spark 2.3 adds fast data serialization and execution using vectorized formats [SPARK-22216] [SPARK-21187]: conversion from/to Pandas, i.e. df.toPandas() and createDataFrame(pandas_df); Pandas UDFs (UDFs that use Pandas to process data), namely Scalar Pandas UDFs and Grouped Map Pandas UDFs.
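The gain from vectorized UDFs comes largely from calling Python once per batch of values instead of once per row. The sketch below models only that call pattern in plain Python. It is not the Pandas UDF API: in Spark the batch would be a pandas Series, and the function names and batch size here are hypothetical.

```python
def apply_row_at_a_time(values, func):
    """Invoke func once per value, the way a plain Python UDF is called."""
    calls = 0
    out = []
    for value in values:
        out.append(func(value))
        calls += 1
    return out, calls


def apply_in_batches(values, batch_func, batch_size=4):
    """Invoke batch_func once per batch, the way a vectorized UDF is called."""
    calls = 0
    out = []
    for start in range(0, len(values), batch_size):
        out.extend(batch_func(values[start:start + batch_size]))
        calls += 1
    return out, calls


values = list(range(10))
row_out, row_calls = apply_row_at_a_time(values, lambda v: v + 1)
batch_out, batch_calls = apply_in_batches(values, lambda b: [v + 1 for v in b])
```

Both paths produce the same result; the batched path makes far fewer crossings between the engine and the user function, which is where the per-row overhead sits.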
  • 24. Native Spark App in K8S: New Spark scheduler backend. The driver runs in a Kubernetes pod created by the submission client and creates pods that run the executors in response to requests from the Spark scheduler. [K8S-34377] [SPARK-18278] Makes direct use of Kubernetes clusters for multi-tenancy and sharing through Namespaces and Quotas, as well as administrative features such as Pluggable Authorization and Logging.
  • 25. Spark on Kubernetes. Features in Apache Spark 2.3: supports Kubernetes 1.6 and up; supports cluster mode only; static resource allocation only; supports Java and Scala applications; can use container-local and remote dependencies that are downloadable.
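The constraints above map directly onto the `spark-submit` command line. The helper below is a hypothetical sketch that assembles such a command. The `k8s://` master prefix, `--deploy-mode cluster`, `spark.executor.instances`, and `spark.kubernetes.container.image` follow the Spark 2.3 Kubernetes documentation; the server address, image name, class, and jar path are placeholders.

```python
def k8s_submit_args(api_server, image, main_class, app_jar, executors=2):
    """Build a spark-submit argument list for Spark 2.3 on Kubernetes."""
    return [
        "spark-submit",
        "--master", "k8s://" + api_server,
        # Spark 2.3 supports cluster mode only on Kubernetes.
        "--deploy-mode", "cluster",
        "--class", main_class,
        # Static allocation only: the executor count is fixed up front.
        "--conf", "spark.executor.instances=%d" % executors,
        "--conf", "spark.kubernetes.container.image=" + image,
        # A local:// URI refers to a file already inside the container image.
        app_jar,
    ]


args = k8s_submit_args(
    api_server="https://kube.example.com:6443",      # placeholder
    image="example/spark:2.3.0",                     # placeholder
    main_class="org.apache.spark.examples.SparkPi",
    app_jar="local:///opt/spark/examples/jars/app.jar",  # placeholder
    executors=5,
)
```

The resulting list can be passed to `subprocess.run(args)` on a machine that has a Spark 2.3 distribution and cluster credentials.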
  • 28. History Server Using K-V Store. Stateless and non-scalable History Server V1: requires parsing the event logs (which is slow); requires storing app lists and UI in memory (which leads to OOM). [SPARK-18085] K-V store-based History Server V2: stores app lists and UI in a persistent K-V store (LevelDB). spark.history.store.path: once specified, LevelDB is used; otherwise an in-memory K-V store is used (still stateless, like V1).
  • 30. What’s Wrong With V1? Leaks upper-level APIs into the data source (RDD/SQLContext); hard to extend the Data Source API for more optimizations; no transaction guarantee in the write APIs; batch only.
  • 31. Features in Data Source V2: columnar scan support; flexible operator pushdown framework; can report basic statistics and data partitioning; transactional write API; unified batch and streaming interfaces.
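What a "transactional write API" buys can be shown with a small stage-then-publish sketch: each task writes to a staging file, and output only becomes visible if every task succeeds. This is an illustration of the idea in plain Python, not the Data Source V2 interfaces; the function name, file naming, and `fail_on` switch are hypothetical.

```python
import os
import tempfile


def write_job(out_dir, partitions, fail_on=None):
    """Write each partition to a staging file, then publish all or nothing.

    fail_on: index of a partition whose task should fail (for demonstration).
    Returns True if the job committed, False if it aborted.
    """
    staged = []
    try:
        for idx, rows in enumerate(partitions):
            if idx == fail_on:
                raise RuntimeError("task %d failed" % idx)
            tmp_path = os.path.join(out_dir, "_tmp-part-%05d" % idx)
            with open(tmp_path, "w") as handle:
                handle.writelines(row + "\n" for row in rows)
            staged.append((tmp_path, os.path.join(out_dir, "part-%05d" % idx)))
    except RuntimeError:
        # Abort: remove everything that was staged, leaving no partial output.
        for tmp_path, _ in staged:
            os.remove(tmp_path)
        return False
    # Commit: publish the staged files under their final names.
    for tmp_path, final_path in staged:
        os.rename(tmp_path, final_path)
    return True


ok_dir = tempfile.mkdtemp()
committed = write_job(ok_dir, [["a", "b"], ["c"]])

failed_dir = tempfile.mkdtemp()
aborted = write_job(failed_dir, [["a", "b"], ["c"]], fail_on=1)
```

The publish loop here is not atomic across files; a real sink would need its own mechanism (for example a single manifest or directory rename) to make the final step atomic.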
  • 33. UDF Enhancement: [SPARK-19285] Implement UDF0 (SQL UDF that has 0 arguments). [SPARK-22945] Add Java UDF APIs in the functions object. [SPARK-21499] Support creating SQL functions for Spark UDAF (UserDefinedAggregateFunction). [SPARK-20586] [SPARK-20416] [SPARK-20668] Annotate UDF with Name, Nullability and Determinism.
  • 34. Java UDF and UDAF in PySpark: Register Java UDFs and UDAFs as SQL functions and use them in PySpark.
  • 36. Stable Codegen: [SPARK-22510] [SPARK-22692] Stabilize the codegen framework to avoid hitting the 64KB JVM bytecode limit on Java methods and the Java compiler's constant pool limit. [SPARK-21871] Turn off whole-stage codegen when the bytecode of the generated Java function is larger than spark.sql.codegen.hugeMethodLimit. The limit of method bytecode for JIT optimization on HotSpot is 8K.
  • 38. Vectorized ORC Reader: [SPARK-20682] Add new ORCFileFormat based on ORC 1.4.1; spark.sql.orc.impl = native / hive (default). [SPARK-16060] Vectorized ORC reader; spark.sql.orc.enableVectorizedReader = true (default) / false. Suggestion: enable filter pushdown for ORC files when using the native reader.
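The settings above can be collected into one set of `--conf` flags. In this sketch, `spark.sql.orc.impl` and `spark.sql.orc.enableVectorizedReader` come from the slide; `spark.sql.orc.filterPushdown` is the Spark SQL setting assumed to correspond to the "enable filter pushdown" suggestion, since the slide does not name it. The helper names are hypothetical.

```python
def native_orc_conf(filter_pushdown=True):
    """Return Spark SQL settings that select the native, vectorized ORC reader."""
    conf = {
        "spark.sql.orc.impl": "native",                  # default is "hive" in 2.3
        "spark.sql.orc.enableVectorizedReader": "true",  # already the default
    }
    if filter_pushdown:
        # Assumed config key for the slide's filter pushdown suggestion.
        conf["spark.sql.orc.filterPushdown"] = "true"
    return conf


def to_submit_flags(conf):
    """Render a settings dict as spark-submit --conf arguments, sorted by key."""
    flags = []
    for key in sorted(conf):
        flags += ["--conf", "%s=%s" % (key, conf[key])]
    return flags


flags = to_submit_flags(native_orc_conf())
```

The same key/value pairs can also be set at runtime on an existing session, for example with `spark.conf.set(key, value)`.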
  • 40. Performance: [SPARK-21975] Histogram support in cost-based optimizer. Enhancements in rule-based optimizer and planner [SPARK-22489] [SPARK-22916] [SPARK-22895] [SPARK-20758] [SPARK-22266] [SPARK-19122] [SPARK-22662] [SPARK-21652] (e.g., constant propagation). [SPARK-20331] Broaden support for partition pruning predicate pushdown (e.g. date = 20161011 or date = 20161014). [SPARK-20822] Vectorized reader for table cache.
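The effect of pushing a predicate such as `date = 20161011 or date = 20161014` down to partition pruning can be modeled in a few lines: only partitions whose value satisfies the predicate are kept for scanning. This is a conceptual sketch over hypothetical directory names, not Spark's planner code.

```python
def prune_partitions(partition_dirs, wanted_dates):
    """Keep only directories whose date=<value> segment is in wanted_dates."""
    selected = []
    for path in partition_dirs:
        for segment in path.split("/"):
            if segment.startswith("date="):
                if segment[len("date="):] in wanted_dates:
                    selected.append(path)
                break  # only the first date= segment is considered
    return selected


# Hypothetical table layout, partitioned by date.
dirs = [
    "/warehouse/t/date=20161010",
    "/warehouse/t/date=20161011",
    "/warehouse/t/date=20161012",
    "/warehouse/t/date=20161014",
]
kept = prune_partitions(dirs, {"20161011", "20161014"})
```

When the OR predicate cannot be pushed down, every partition has to be listed before filtering; pushing it down means only the two matching partitions are touched.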
  • 41. API: Improved ANSI SQL compliance and Dataset/DataFrame APIs. More built-in functions [SPARK-20746]. Better Hive compatibility. [SPARK-20236] Support Dynamic Partition Overwrite for Data Source Tables. [SPARK-17729] Enable Creating Hive Bucketed Tables.
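Dynamic partition overwrite [SPARK-20236] is easiest to see next to the default behaviour. The sketch below models a partitioned table as a dict and compares the two modes, which in Spark are selected with `spark.sql.sources.partitionOverwriteMode` (`static` or `dynamic`). The function and the sample data are hypothetical, and the static case models an overwrite with no static partition spec.

```python
def insert_overwrite(table, new_rows, mode="static"):
    """Model INSERT OVERWRITE on a partitioned table.

    table, new_rows: dicts mapping partition value -> list of rows.
    static:  all existing partitions are dropped before writing.
    dynamic: only partitions present in new_rows are replaced.
    """
    if mode == "static":
        result = {}
    elif mode == "dynamic":
        result = {part: list(rows) for part, rows in table.items()
                  if part not in new_rows}
    else:
        raise ValueError("mode must be 'static' or 'dynamic'")
    for part, rows in new_rows.items():
        result[part] = list(rows)
    return result


table = {"2018-01-01": ["a"], "2018-01-02": ["b"]}
update = {"2018-01-02": ["c"]}

static_result = insert_overwrite(table, update, mode="static")
dynamic_result = insert_overwrite(table, update, mode="dynamic")
```

In static mode the untouched `2018-01-01` partition is lost, while dynamic mode keeps it; that difference is the reason to opt in for incremental partition reloads.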
  • 44. Apache Spark 2.4+: Built-in higher-order functions: transform, arrays_zip, array_remove, arrays_overlap, … Adaptive query planning: dynamically adjust the query plan according to the real data (shuffled data). Deep learning integration: gang scheduling, barrier sync, fast data exchange, ... ML pipeline in SparkR: port the pipeline API to R. Build improvements: Scala 2.12, Java 9, Hadoop 3.0, …
  • 45. Apache Spark 2.4+: Pandas UDF improvements: window function, partial aggregate, … K8S integration improvements: PySpark support, dynamic allocation, external shuffle service, … Continuous streaming improvements: shuffle support, ... Data Source V2 improvements: Catalog integration, more operator pushdown, …
  • 46. Thank you. Xiao Li (lixiao@databricks.com), Wenchen Fan (wenchen@databricks.com)