SlideShare a Scribd company logo
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based FPGA Accelerators
Accelerating SPARK SQL to 50X
Performance with Apache Arrow based
FPGA accelerators
Calvin Hung
calvin.hung@wasaitech.com
Wasai Technology, Inc.
Weiting Chen
weiting.chen@intel.com
Intel Corp.
AGENDA
Background
Problem Definition
Solutions
Performance
Summary
Q&A
ABOUT US
Calvin Hung @wasaitech
Calvin is the co-founder, CEO and CTO of WASAI
Technology, which is specialized in FPGA-based
datacenter accelerations for Apache Spark,
Apache Hadoop and Genomics Analysis
applications. He has more than 15 years of
experience in software and hardware architecture
co-design and performance optimization.
Weiting Chen(William) @intel
Weiting is a senior software engineer at Intel
Software. He has worked for Big Data and Cloud
Solutions including Spark, Hadoop, OpenStack, and
Kubernetes for more than 6 years.
MOTIVATION
▪ CPU support SIMD instructions such as SSE, AVX2, AVX512, …etc. We
would like to unleash the power in SPARK.
▪ Many accelerators such as FPGA, GPU, ASIC, …etc in the world can
help CPU to offload functions and speed up the performance.
PROBLEM DEFINITION
▪ How to avoid row and columnar convert overhead during data
processing?
▪ How SPARK can leverage AVX support?
▪ How to coordinate the accelerators (e.g. FPGA, GPU, …etc) to work
with CPU in SPARK3.0?
▪ How FPGA can help to speed up SPARK?
▪ How to minimize data copy and serialization overhead when copying
from host to device?
▪ How to enhance the performance during DMA transfer?
SOLUTION: SPARK + ARROW + FPGA
A better way to run SPARK with AVX and accelerators support
- Apache Arrow
- SPARK 3.0 New Features
- OAP Native SQL Engine (Intel)
- FPGA Accelerators (WASAI)
THE GOALS
End-to-End Columnar-to-Columnar Data Processing:
To avoid columnar-to-row or row-to-columnar overhead when processing data.
Support AVX(via OAP Native SQL):
Columnar based Reader -> Columnar based Data Processing(w/ AVX) -> Columnar based Writer Result
With FPGA Integration and acceleration:
Columnar based Reader -> Columnar based Data Processing(w/ AVX) -> Columnar based Data Copy to
Device -> Columnar based Data Processing(on FPGA) -> Columnar
APACHE ARROW
• Each system has its own internal memory format
• 70-80% computation wasted on serialization and
deserialization
• Similar functionality implemented in multiple
projects
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality
Reference: https://arrow.apache.org
NEW FEATURES in SPARK3.0
SPARK-27396 Public APIs for extended Columnar Processing Support
https://issues.apache.org/jira/browse/SPARK-27396
▪ An interface to extend columnar processing API
▪ Provide an opportunity to create a custom API for columnar data processing with
OAP Native SQL Engine and FPGA support
▪ Advanced user can define a new interface to communication with accelerators
such as GPU or FPGA
SPARK-24615 Accelerator-aware task scheduling for SPARK
https://issues.apache.org/jira/browse/SPARK-24615
▪ An interface for SPARK to allocate accelerators in task level
▪ Make SPARK task to be aware accelerators such as GPU, FPGA, …etc
▪ Currently only support GPU
▪ FPGA can be supported in the same way (vendor specific)
OAP Native SQL Engine Plugin
Intel Optimized Analytics Package(OAP): Native SQL Engine
https://github.com/Intel-bigdata/OAP/
An End-to-End SPARK Columnar based data processing with Intel AVX support
Apache Arrow
Arrow Data Source Arrow Data Processing
Intel CPU Other Accelerators (FPGA, GPU, …)
Columnar Shuffle
SPARK SQL
SPARK Catalyst
FPGA Acceleration
END TO END, COLUMNAR TO COLUMNAR DATA PROCESSING
SPARK3.0
FileScan
FileWriter
Parquet Writer
Parquet Reader
ColumnarVector
ColumnarVector
Columnar-to-Row
Row-to-Columnar
InternalRow
Whole Stage Codegen
Row based Operator
Row based Operator
InternalRow
InternalRow
ColumnarVector
SPARK3.0 + OAP Native SQL Engine
FileScan
FileWriter
Parquet Writer
Parquet Reader
Arrow
Arrow
Columnar based
Operator
Arrow
Columnar based
Operators
Arrow
FPGA Templates
Arrow
FPGA Operators
Aggregation/GroupBy/…
Arrow
OAP NATIVE SQL ENGINE HIGHLIGHT
https://github.com/Intel-bigdata/OAP/
▪ An Open Source Columnar based Data Processing for SPARK
▪ Apache Arrow based Solution
▪ Enable AVX Support with SIMD instruction acceleration
▪ Leverage SPARK3.0 Support
▪ Communicate with 3rd Party Accelerators
▪ Support Data Source Parsing, SQL Operators, and Columnar Shuffle
▪ Common SQL Operators Support such as filter, join, groupby,
aggregate, …etc.
SPARK + FPGA + ARROW
Why SPARK SQL + FPGA
- SPARK SQL essentially processes structured row-based dataset at once with
single query of a bunch of SQL operators. The operators can be simple while the
dataset could be extremely large.
- FPGA with highly specialized IPs can deal with such multiple-instruction, single-
dataset analysis faster, more power and resource-efficiently than CPU and GPU
under the same total-cost-of-ownership.
Why Arrow
- In order to offload SPARK SQL workload from Java runtime to FPGA, leveraging a
new WholeStageCodegen to invoke native function calls to process data with
FPGA can be messy. Apache Arrow can hold Columnar Batch data inside native
memory and manage its memory reference inside Spark.
SPARK SQL FPGA ACCELERATION
▪ SQL Operators (Aggregation, GroupBy,
Filter, Sort, Join, …etc)
▪ Using Apache Arrow for data transfer
between Java runtime and FPGA to
reduce data traffic
▪ Next step will be leveraging
Arrow::RecordBatch
SPARK(CPU ONLY)
SQL
RDD
Dataset
DataFrame
Catalyst
Optimization
Code
Generation
RDD
Dataset
DataFrame
Code
Execution
Spark Executor
Data Processed
By CPU
Input
SPARK(CPU + FPGA)
SQL
RDD
Dataset
DataFrame
Catalyst
Optimization
Code
Generation
RDD
Dataset
DataFrame
Code
Execution
Spark Executor
1. Get Mem Page for Input / Output 5. Free Memory Pages
FPGA Java Wrapper
3. Start FPGA Engine
Data Block
2. Fill the Input
Data Block
Adaptor / JNI /
Driver
Array[Byte]
4. Iterating the
data by Spark
API
(Schema Specified)
DMA Engine Aggregation Engine GroupBy EngineFPGA
Input
Data Processed
By FPGA
Re-generate
Physical Plan
for FPGA
SPARK SQL + ARROW PERFORMANCE
▪ A simple query with 300GB dataset from TPC-DS Q55
▪ With Apache Arrow, performance boost can be up to 33% and CPU is
obviously offloaded.
SELECT ss_sold_date_sk, sum(ss_ext_sales_price) FROM store_sales
WHERE ss_item_sk = 3175 GROUP BY ss_sold_date_sk
0
2
4
6
8
10
12
14
32 90 300
Minutes
Arrow-boosted Original
Intel® Xeon® Gold 6120 CPU x2
DDR4 256GB
Intel PAC Arria10 x1
(GB)
SYSTEM STACK
Storage
Storage/Data Format
JSON Parquet
Distributed
Execution
Big Data Cores
MapReduce
Spark SQL
Engine Spark RDD/DFOS
OS Core System
CentOS RHEL Ubuntu
FPGA
Accelerator
Accelerators
MapReduce
Accelerator
Spark SQL
Accelerator
Spark RDD/DF
Accelerator
Data
Decoder
WASAI
System Lib & Drivers
WASAI
IOBooster
WASAI
EvoCores
Compressor
OTHER ACCELERATORS
Solution Description Workloads Result
Spark RDD groupByKey, foldByKey, etc microbench
80%~3x performance boost. Shuffle size 90%
to 99% reduction.
General Sort
General TimSort for both
Hadoop & Spark
TeraSort
microbench
20% performance boost
Compression
Compression
encoding/decoding
microbench Ongoing
Erasure Coding EC codec microbench Reach maximum throughput of PCIe
Input Format
Parsing
JSON, CSV, Parquet format
parser
microbench 2X~7.8X performance boost
Intel® Xeon® Gold 6120 CPU x2
DDR4 256GB
Intel PAC Arria10 x1
KEY TAKEAWAYS
▪ End-to-End Columnar Data Processing can optimize the performance
in CPU, FPGA and other Accelerators in native layer.
▪ FPGA can help to accelerate SPARK in many cases that involved heavy
CPU-intensive operations.
▪ Last but not least, with SPARK3.0 support, many new opportunities
can be done in the future.
NEXT …
▪ More features in OAP Native SQL Engine
▪ OAP Native SQL Engine + FPGA integration
OAP Native SQL Engine Plugin
Apache Arrow
Arrow Data Source Arrow Data Processing
Intel CPU WASAI FPGA Accelerators
Columnar Shuffle WASAI
CodeGen
CALL TO ACTION
We encourage you to try OAP Native SQL Engine for SPARK in
https://github.com/Intel-bigdata/OAP/
Wasai SPARK SQL + FPGA Solution
https://www.wasaitech.com/
Please contact
Intel: weiting.chen@intel.com
Wasai: calvin.hung@wasaitech.com
Q & A
Feedback
Your feedback is important to us.
Don’t forget to rate and
review the sessions.
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based FPGA Accelerators

More Related Content

What's hot

Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
Databricks
 
Presto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation EnginesPresto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation Engines
Databricks
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
Databricks
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
colorant
 
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
Databricks
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
Databricks
 
Cosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceCosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle Service
Databricks
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Databricks
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Spark Summit
 
Spark SQL Join Improvement at Facebook
Spark SQL Join Improvement at FacebookSpark SQL Join Improvement at Facebook
Spark SQL Join Improvement at Facebook
Databricks
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
Databricks
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Noritaka Sekiyama
 
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
 Best Practice of Compression/Decompression Codes in Apache Spark with Sophia... Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
Databricks
 
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the HoodRadical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
Databricks
 
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Edureka!
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
Databricks
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache Spark
Databricks
 
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
Databricks
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
Databricks
 
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Databricks
 

What's hot (20)

Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 
Presto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation EnginesPresto on Apache Spark: A Tale of Two Computation Engines
Presto on Apache Spark: A Tale of Two Computation Engines
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
 
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
From Query Plan to Query Performance: Supercharging your Apache Spark Queries...
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
 
Cosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceCosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle Service
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
 
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...
 
Spark SQL Join Improvement at Facebook
Spark SQL Join Improvement at FacebookSpark SQL Join Improvement at Facebook
Spark SQL Join Improvement at Facebook
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
 Best Practice of Compression/Decompression Codes in Apache Spark with Sophia... Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
 
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the HoodRadical Speed for SQL Queries on Databricks: Photon Under the Hood
Radical Speed for SQL Queries on Databricks: Photon Under the Hood
 
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ...
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache Spark
 
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update...
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
 
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ...
 

Similar to Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based FPGA Accelerators

Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Databricks
 
Představení produktové řady Oracle SPARC S7
Představení produktové řady Oracle SPARC S7Představení produktové řady Oracle SPARC S7
Představení produktové řady Oracle SPARC S7
MarketingArrowECS_CZ
 
Healthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache SparkHealthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache Spark
Databricks
 
The Apache Spark config behind the indsutry's first 100TB Spark SQL benchmark
The Apache Spark config behind the indsutry's first 100TB Spark SQL benchmarkThe Apache Spark config behind the indsutry's first 100TB Spark SQL benchmark
The Apache Spark config behind the indsutry's first 100TB Spark SQL benchmark
Lenovo Data Center
 
Deep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.xDeep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.x
Databricks
 
Advancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
Advancing GPU Analytics with RAPIDS Accelerator for Spark and AlluxioAdvancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
Advancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
Alluxio, Inc.
 
SFBigAnalytics_SparkRapid_20220622.pdf
SFBigAnalytics_SparkRapid_20220622.pdfSFBigAnalytics_SparkRapid_20220622.pdf
SFBigAnalytics_SparkRapid_20220622.pdf
Chester Chen
 
Collaborate07kmohiuddin
Collaborate07kmohiuddinCollaborate07kmohiuddin
Collaborate07kmohiuddin
Sal Marcus
 
Ceph Day Taipei - Accelerate Ceph via SPDK
Ceph Day Taipei - Accelerate Ceph via SPDK Ceph Day Taipei - Accelerate Ceph via SPDK
Ceph Day Taipei - Accelerate Ceph via SPDK
Ceph Community
 
Orcl siebel-sun-s282213-oow2006
Orcl siebel-sun-s282213-oow2006Orcl siebel-sun-s282213-oow2006
Orcl siebel-sun-s282213-oow2006
Sal Marcus
 
Presentation oracle super cluster t5-8 technical deep dive
Presentation   oracle super cluster t5-8 technical deep divePresentation   oracle super cluster t5-8 technical deep dive
Presentation oracle super cluster t5-8 technical deep dive
solarisyougood
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
Michael Rys
 
New Generation of SPARC Processors Boosting Oracle S/W Angelo Rajadurai
New Generation of SPARC Processors Boosting Oracle S/W Angelo RajaduraiNew Generation of SPARC Processors Boosting Oracle S/W Angelo Rajadurai
New Generation of SPARC Processors Boosting Oracle S/W Angelo Rajadurai
Orgad Kimchi
 
xPatterns - Spark Summit 2014
xPatterns - Spark Summit   2014xPatterns - Spark Summit   2014
xPatterns - Spark Summit 2014
Claudiu Barbura
 
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Anant Corporation
 
Yet another intro to Apache Spark
Yet another intro to Apache SparkYet another intro to Apache Spark
Yet another intro to Apache Spark
Simon Lia-Jonassen
 
Security a SPARC M7 CPU
Security a SPARC M7 CPUSecurity a SPARC M7 CPU
Security a SPARC M7 CPU
MarketingArrowECS_CZ
 
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
Yahoo Developer Network
 
VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld 2013: Virtualizing Databases: Doing IT Right VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld
 
xPatterns on Spark, Shark, Mesos, Tachyon
xPatterns on Spark, Shark, Mesos, TachyonxPatterns on Spark, Shark, Mesos, Tachyon
xPatterns on Spark, Shark, Mesos, Tachyon
Claudiu Barbura
 

Similar to Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based FPGA Accelerators (20)

Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
 
Představení produktové řady Oracle SPARC S7
Představení produktové řady Oracle SPARC S7Představení produktové řady Oracle SPARC S7
Představení produktové řady Oracle SPARC S7
 
Healthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache SparkHealthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache Spark
 
The Apache Spark config behind the indsutry's first 100TB Spark SQL benchmark
The Apache Spark config behind the indsutry's first 100TB Spark SQL benchmarkThe Apache Spark config behind the indsutry's first 100TB Spark SQL benchmark
The Apache Spark config behind the indsutry's first 100TB Spark SQL benchmark
 
Deep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.xDeep Dive into GPU Support in Apache Spark 3.x
Deep Dive into GPU Support in Apache Spark 3.x
 
Advancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
Advancing GPU Analytics with RAPIDS Accelerator for Spark and AlluxioAdvancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
Advancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio
 
SFBigAnalytics_SparkRapid_20220622.pdf
SFBigAnalytics_SparkRapid_20220622.pdfSFBigAnalytics_SparkRapid_20220622.pdf
SFBigAnalytics_SparkRapid_20220622.pdf
 
Collaborate07kmohiuddin
Collaborate07kmohiuddinCollaborate07kmohiuddin
Collaborate07kmohiuddin
 
Ceph Day Taipei - Accelerate Ceph via SPDK
Ceph Day Taipei - Accelerate Ceph via SPDK Ceph Day Taipei - Accelerate Ceph via SPDK
Ceph Day Taipei - Accelerate Ceph via SPDK
 
Orcl siebel-sun-s282213-oow2006
Orcl siebel-sun-s282213-oow2006Orcl siebel-sun-s282213-oow2006
Orcl siebel-sun-s282213-oow2006
 
Presentation oracle super cluster t5-8 technical deep dive
Presentation   oracle super cluster t5-8 technical deep divePresentation   oracle super cluster t5-8 technical deep dive
Presentation oracle super cluster t5-8 technical deep dive
 
Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)Big Data Processing with .NET and Spark (SQLBits 2020)
Big Data Processing with .NET and Spark (SQLBits 2020)
 
New Generation of SPARC Processors Boosting Oracle S/W Angelo Rajadurai
New Generation of SPARC Processors Boosting Oracle S/W Angelo RajaduraiNew Generation of SPARC Processors Boosting Oracle S/W Angelo Rajadurai
New Generation of SPARC Processors Boosting Oracle S/W Angelo Rajadurai
 
xPatterns - Spark Summit 2014
xPatterns - Spark Summit   2014xPatterns - Spark Summit   2014
xPatterns - Spark Summit 2014
 
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
 
Yet another intro to Apache Spark
Yet another intro to Apache SparkYet another intro to Apache Spark
Yet another intro to Apache Spark
 
Security a SPARC M7 CPU
Security a SPARC M7 CPUSecurity a SPARC M7 CPU
Security a SPARC M7 CPU
 
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
 
VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld 2013: Virtualizing Databases: Doing IT Right VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld 2013: Virtualizing Databases: Doing IT Right
 
xPatterns on Spark, Shark, Mesos, Tachyon
xPatterns on Spark, Shark, Mesos, TachyonxPatterns on Spark, Shark, Mesos, Tachyon
xPatterns on Spark, Shark, Mesos, Tachyon
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

Data Storytelling Final Project for MBA 635
Data Storytelling Final Project for MBA 635Data Storytelling Final Project for MBA 635
Data Storytelling Final Project for MBA 635
HeidiLivengood
 
future-of-asset-management-future-of-asset-management
future-of-asset-management-future-of-asset-managementfuture-of-asset-management-future-of-asset-management
future-of-asset-management-future-of-asset-management
Aadee4
 
Parcel Delivery - Intel Segmentation and Last Mile Opt.pptx
Parcel Delivery - Intel Segmentation and Last Mile Opt.pptxParcel Delivery - Intel Segmentation and Last Mile Opt.pptx
Parcel Delivery - Intel Segmentation and Last Mile Opt.pptx
AltanAtabarut
 
CT AnGIOGRAPHY of pulmonary embolism.pptx
CT AnGIOGRAPHY of pulmonary embolism.pptxCT AnGIOGRAPHY of pulmonary embolism.pptx
CT AnGIOGRAPHY of pulmonary embolism.pptx
RejoJohn2
 
Dataguard Switchover Best Practices using DGMGRL (Dataguard Broker Command Line)
Dataguard Switchover Best Practices using DGMGRL (Dataguard Broker Command Line)Dataguard Switchover Best Practices using DGMGRL (Dataguard Broker Command Line)
Dataguard Switchover Best Practices using DGMGRL (Dataguard Broker Command Line)
Alireza Kamrani
 
Histology of Muscle types histology o.ppt
Histology of Muscle types histology o.pptHistology of Muscle types histology o.ppt
Histology of Muscle types histology o.ppt
SamanArshad11
 
Getting Started with Interactive Brokers API and Python.pdf
Getting Started with Interactive Brokers API and Python.pdfGetting Started with Interactive Brokers API and Python.pdf
Getting Started with Interactive Brokers API and Python.pdf
Riya Sen
 
Technology used in Ott data analysis project
Technology used in Ott data analysis  projectTechnology used in Ott data analysis  project
Technology used in Ott data analysis project
49AkshitYadav
 
Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B...
Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B...Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B...
Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B...
rightmanforbloodline
 
From Signals to Solutions: Effective Strategies for CDR Analysis in Fraud Det...
From Signals to Solutions: Effective Strategies for CDR Analysis in Fraud Det...From Signals to Solutions: Effective Strategies for CDR Analysis in Fraud Det...
From Signals to Solutions: Effective Strategies for CDR Analysis in Fraud Det...
Milind Agarwal
 
Big Data and Analytics Shaping the future of Payments
Big Data and Analytics Shaping the future of PaymentsBig Data and Analytics Shaping the future of Payments
Big Data and Analytics Shaping the future of Payments
RuchiRathor2
 
Full Disclosure Board Policy.docx BRGY LICUMA
Full  Disclosure Board Policy.docx BRGY LICUMAFull  Disclosure Board Policy.docx BRGY LICUMA
Full Disclosure Board Policy.docx BRGY LICUMA
brgylicumaormoccity
 
SFBA Splunk Usergroup meeting July 17, 2024
SFBA Splunk Usergroup meeting July 17, 2024SFBA Splunk Usergroup meeting July 17, 2024
SFBA Splunk Usergroup meeting July 17, 2024
Becky Burwell
 
Unit 1 Introduction to DATA SCIENCE .pptx
Unit 1 Introduction to DATA SCIENCE .pptxUnit 1 Introduction to DATA SCIENCE .pptx
Unit 1 Introduction to DATA SCIENCE .pptx
Priyanka Jadhav
 
PRODUCT | RESEARCH-PRESENTATION-1.1.pptx
PRODUCT | RESEARCH-PRESENTATION-1.1.pptxPRODUCT | RESEARCH-PRESENTATION-1.1.pptx
PRODUCT | RESEARCH-PRESENTATION-1.1.pptx
amazenolmedojeruel
 
Selcuk Topal Arbitrum Scientific Report.pdf
Selcuk Topal Arbitrum Scientific Report.pdfSelcuk Topal Arbitrum Scientific Report.pdf
Selcuk Topal Arbitrum Scientific Report.pdf
SelcukTOPAL2
 
Vrinda store data analysis project using Excel
Vrinda store data analysis project using ExcelVrinda store data analysis project using Excel
Vrinda store data analysis project using Excel
SantuJana12
 
SOFTWARE ENGINEERING-UNIT-1SOFTWARE ENGINEERING
SOFTWARE ENGINEERING-UNIT-1SOFTWARE ENGINEERINGSOFTWARE ENGINEERING-UNIT-1SOFTWARE ENGINEERING
SOFTWARE ENGINEERING-UNIT-1SOFTWARE ENGINEERING
PrabhuB33
 
Acid Base Practice Test 4- KEY.pdfkkjkjk
Acid Base Practice Test 4- KEY.pdfkkjkjkAcid Base Practice Test 4- KEY.pdfkkjkjk
Acid Base Practice Test 4- KEY.pdfkkjkjk
talha2khan2k
 
Cal Girls The Lalit Jaipur 8445551418 Khusi Top Class Girls Call Jaipur Avail...
Cal Girls The Lalit Jaipur 8445551418 Khusi Top Class Girls Call Jaipur Avail...Cal Girls The Lalit Jaipur 8445551418 Khusi Top Class Girls Call Jaipur Avail...
Cal Girls The Lalit Jaipur 8445551418 Khusi Top Class Girls Call Jaipur Avail...
deepikakumaridk25
 

Recently uploaded (20)

Data Storytelling Final Project for MBA 635
Data Storytelling Final Project for MBA 635Data Storytelling Final Project for MBA 635
Data Storytelling Final Project for MBA 635
 
future-of-asset-management-future-of-asset-management
future-of-asset-management-future-of-asset-managementfuture-of-asset-management-future-of-asset-management
future-of-asset-management-future-of-asset-management
 
Parcel Delivery - Intel Segmentation and Last Mile Opt.pptx
Parcel Delivery - Intel Segmentation and Last Mile Opt.pptxParcel Delivery - Intel Segmentation and Last Mile Opt.pptx
Parcel Delivery - Intel Segmentation and Last Mile Opt.pptx
 
CT AnGIOGRAPHY of pulmonary embolism.pptx
CT AnGIOGRAPHY of pulmonary embolism.pptxCT AnGIOGRAPHY of pulmonary embolism.pptx
CT AnGIOGRAPHY of pulmonary embolism.pptx
 
Dataguard Switchover Best Practices using DGMGRL (Dataguard Broker Command Line)
Dataguard Switchover Best Practices using DGMGRL (Dataguard Broker Command Line)Dataguard Switchover Best Practices using DGMGRL (Dataguard Broker Command Line)
Dataguard Switchover Best Practices using DGMGRL (Dataguard Broker Command Line)
 
Histology of Muscle types histology o.ppt
Histology of Muscle types histology o.pptHistology of Muscle types histology o.ppt
Histology of Muscle types histology o.ppt
 
Getting Started with Interactive Brokers API and Python.pdf
Getting Started with Interactive Brokers API and Python.pdfGetting Started with Interactive Brokers API and Python.pdf
Getting Started with Interactive Brokers API and Python.pdf
 
Technology used in Ott data analysis project
Technology used in Ott data analysis  projectTechnology used in Ott data analysis  project
Technology used in Ott data analysis project
 
Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B...
Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B...Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B...
Solution Manual for First Course in Abstract Algebra A, 8th Edition by John B...
 
From Signals to Solutions: Effective Strategies for CDR Analysis in Fraud Det...
From Signals to Solutions: Effective Strategies for CDR Analysis in Fraud Det...From Signals to Solutions: Effective Strategies for CDR Analysis in Fraud Det...
From Signals to Solutions: Effective Strategies for CDR Analysis in Fraud Det...
 
Big Data and Analytics Shaping the future of Payments
Big Data and Analytics Shaping the future of PaymentsBig Data and Analytics Shaping the future of Payments
Big Data and Analytics Shaping the future of Payments
 
Full Disclosure Board Policy.docx BRGY LICUMA
Full  Disclosure Board Policy.docx BRGY LICUMAFull  Disclosure Board Policy.docx BRGY LICUMA
Full Disclosure Board Policy.docx BRGY LICUMA
 
SFBA Splunk Usergroup meeting July 17, 2024
SFBA Splunk Usergroup meeting July 17, 2024SFBA Splunk Usergroup meeting July 17, 2024
SFBA Splunk Usergroup meeting July 17, 2024
 
Unit 1 Introduction to DATA SCIENCE .pptx
Unit 1 Introduction to DATA SCIENCE .pptxUnit 1 Introduction to DATA SCIENCE .pptx
Unit 1 Introduction to DATA SCIENCE .pptx
 
PRODUCT | RESEARCH-PRESENTATION-1.1.pptx
PRODUCT | RESEARCH-PRESENTATION-1.1.pptxPRODUCT | RESEARCH-PRESENTATION-1.1.pptx
PRODUCT | RESEARCH-PRESENTATION-1.1.pptx
 
Selcuk Topal Arbitrum Scientific Report.pdf
Selcuk Topal Arbitrum Scientific Report.pdfSelcuk Topal Arbitrum Scientific Report.pdf
Selcuk Topal Arbitrum Scientific Report.pdf
 
Vrinda store data analysis project using Excel
Vrinda store data analysis project using ExcelVrinda store data analysis project using Excel
Vrinda store data analysis project using Excel
 
SOFTWARE ENGINEERING-UNIT-1SOFTWARE ENGINEERING
SOFTWARE ENGINEERING-UNIT-1SOFTWARE ENGINEERINGSOFTWARE ENGINEERING-UNIT-1SOFTWARE ENGINEERING
SOFTWARE ENGINEERING-UNIT-1SOFTWARE ENGINEERING
 
Acid Base Practice Test 4- KEY.pdfkkjkjk
Acid Base Practice Test 4- KEY.pdfkkjkjkAcid Base Practice Test 4- KEY.pdfkkjkjk
Acid Base Practice Test 4- KEY.pdfkkjkjk
 
Cal Girls The Lalit Jaipur 8445551418 Khusi Top Class Girls Call Jaipur Avail...
Cal Girls The Lalit Jaipur 8445551418 Khusi Top Class Girls Call Jaipur Avail...Cal Girls The Lalit Jaipur 8445551418 Khusi Top Class Girls Call Jaipur Avail...
Cal Girls The Lalit Jaipur 8445551418 Khusi Top Class Girls Call Jaipur Avail...
 

Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based FPGA Accelerators

  • 2. Accelerating SPARK SQL to 50X Performance with Apache Arrow based FPGA accelerators Calvin Hung calvin.hung@wasaitech.com Wasai Technology, Inc. Weiting Chen weiting.chen@intel.com Intel Corp.
  • 4. ABOUT US Calvin Hung @wasaitech Calvin is the co-founder, CEO and CTO of WASAI Technology, which is specialized in FPGA-based datacenter accelerations for Apache Spark, Apache Hadoop and Genomics Analysis applications. He has more than 15 years of experience in software and hardware architecture co-design and performance optimization. Weiting Chen(William) @intel Weiting is a senior software engineer at Intel Software. He has worked for Big Data and Cloud Solutions including Spark, Hadoop, OpenStack, and Kubernetes for more than 6 years.
  • 5. MOTIVATION ▪ CPU support SIMD instructions such as SSE, AVX2, AVX512, …etc. We would like to unleash the power in SPARK. ▪ Many accelerators such as FPGA, GPU, ASIC, …etc in the world can help CPU to offload functions and speed up the performance.
  • 6. PROBLEM DEFINITION ▪ How to avoid row and columnar convert overhead during data processing? ▪ How SPARK can leverage AVX support? ▪ How to coordinate the accelerators (e.g. FPGA, GPU, …etc) to work with CPU in SPARK3.0? ▪ How FPGA can help to speed up SPARK? ▪ How to minimize data copy and serialization overhead when copying from host to device? ▪ How to enhance the performance during DMA transfer?
  • 7. SOLUTION: SPARK + ARROW + FPGA A better way to run SPARK with AVX and accelerators support - Apache Arrow - SPARK 3.0 New Features - OAP Native SQL Engine (Intel) - FPGA Accelerators (WASAI)
  • 8. THE GOALS End-to-End Columnar-to-Columnar Data Processing: To avoid columnar-to-row or row-to-columnar overhead when processing data. Support AVX(via OAP Native SQL): Columnar based Reader -> Columnar based Data Processing(w/ AVX) -> Columnar based Writer Result With FPGA Integration and acceleration: Columnar based Reader -> Columnar based Data Processing(w/ AVX) -> Columnar based Data Copy to Device -> Columnar based Data Processing(on FPGA) -> Columnar
  • 9. APACHE ARROW • Each system has its own internal memory format • 70-80% computation wasted on serialization and deserialization • Similar functionality implemented in multiple projects • All systems utilize the same memory format • No overhead for cross-system communication • Projects can share functionality Reference: https://arrow.apache.org
  • 10. NEW FEATURES in SPARK3.0 SPARK-27396 Public APIs for extended Columnar Processing Support https://issues.apache.org/jira/browse/SPARK-27396 ▪ An interface to extend columnar processing API ▪ Provide an opportunity to create a custom API for columnar data processing with OAP Native SQL Engine and FPGA support ▪ Advanced user can define a new interface to communication with accelerators such as GPU or FPGA SPARK-24615 Accelerator-aware task scheduling for SPARK https://issues.apache.org/jira/browse/SPARK-24615 ▪ An interface for SPARK to allocate accelerators in task level ▪ Make SPARK task to be aware accelerators such as GPU, FPGA, …etc ▪ Currently only support GPU ▪ FPGA can be supported in the same way (vendor specific)
  • 11. OAP Native SQL Engine Plugin Intel Optimized Analytics Package(OAP): Native SQL Engine https://github.com/Intel-bigdata/OAP/ An End-to-End SPARK Columnar based data processing with Intel AVX support Apache Arrow Arrow Data Source Arrow Data Processing Intel CPU Other Accelerators (FPGA, GPU, …) Columnar Shuffle SPARK SQL SPARK Catalyst
  • 12. FPGA Acceleration END TO END, COLUMNAR TO COLUMNAR DATA PROCESSING SPARK3.0 FileScan FileWriter Parquet Writer Parquet Reader ColumnarVector ColumnarVector Columnar-to-Row Row-to-Columnar InternalRow Whole Stage Codegen Row based Operator Row based Operator InternalRow InternalRow ColumnarVector SPARK3.0 + OAP Native SQL Engine FileScan FileWriter Parquet Writer Parquet Reader Arrow Arrow Columnar based Operator Arrow Columnar based Operators Arrow FPGA Templates Arrow FPGA Operators Aggregation/GroupBy/… Arrow
  • 13. OAP NATIVE SQL ENGINE HIGHLIGHT https://github.com/Intel-bigdata/OAP/ ▪ An Open Source Columnar based Data Processing for SPARK ▪ Apache Arrow based Solution ▪ Enable AVX Support with SIMD instruction acceleration ▪ Leverage SPARK3.0 Support ▪ Communicate with 3rd Party Accelerators ▪ Support Data Source Parsing, SQL Operators, and Columnar Shuffle ▪ Common SQL Operators Support such as filter, join, groupby, aggregate, …etc.
  • 14. SPARK + FPGA + ARROW Why SPARK SQL + FPGA - SPARK SQL essentially processes structured row-based dataset at once with single query of a bunch of SQL operators. The operators can be simple while the dataset could be extremely large. - FPGA with highly specialized IPs can deal with such multiple-instruction, single- dataset analysis faster, more power and resource-efficiently than CPU and GPU under the same total-cost-of-ownership. Why Arrow - In order to offload SPARK SQL workload from Java runtime to FPGA, leveraging a new WholeStageCodegen to invoke native function calls to process data with FPGA can be messy. Apache Arrow can hold Columnar Batch data inside native memory and manage its memory reference inside Spark.
  • 15. SPARK SQL FPGA ACCELERATION ▪ SQL Operators (Aggregation, GroupBy, Filter, Sort, Join, …etc) ▪ Using Apache Arrow for data transfer between Java runtime and FPGA to reduce data traffic ▪ Next step will be leveraging Arrow::RecordBatch
  • 17. SPARK(CPU + FPGA) SQL RDD Dataset DataFrame Catalyst Optimization Code Generation RDD Dataset DataFrame Code Execution Spark Executor 1. Get Mem Page for Input / Output 5. Free Memory Pages FPGA Java Wrapper 3. Start FPGA Engine Data Block 2. Fill the Input Data Block Adaptor / JNI / Driver Array[Byte] 4. Iterating the data by Spark API (Schema Specified) DMA Engine Aggregation Engine GroupBy EngineFPGA Input Data Processed By FPGA Re-generate Physical Plan for FPGA
  • 18. SPARK SQL + ARROW PERFORMANCE ▪ A simple query with 300GB dataset from TPC-DS Q55 ▪ With Apache Arrow, performance boost can be up to 33% and CPU is obviously offloaded. SELECT ss_sold_date_sk, sum(ss_ext_sales_price) FROM store_sales WHERE ss_item_sk = 3175 GROUP BY ss_sold_date_sk 0 2 4 6 8 10 12 14 32 90 300 Minutes Arrow-boosted Original Intel® Xeon® Gold 6120 CPU x2 DDR4 256GB Intel PAC Arria10 x1 (GB)
  • 19. SYSTEM STACK Storage Storage/Data Format JSON Parquet Distributed Execution Big Data Cores MapReduce Spark SQL Engine Spark RDD/DFOS OS Core System CentOS RHEL Ubuntu FPGA Accelerator Accelerators MapReduce Accelerator Spark SQL Accelerator Spark RDD/DF Accelerator Data Decoder WASAI System Lib & Drivers WASAI IOBooster WASAI EvoCores Compressor
  • 20. OTHER ACCELERATORS Solution Description Workloads Result Spark RDD groupByKey, foldByKey, etc microbench 80%~3x performance boost. Shuffle size 90% to 99% reduction. General Sort General TimSort for both Hadoop & Spark TeraSort microbench 20% performance boost Compression Compression encoding/decoding microbench Ongoing Erasure Coding EC codec microbench Reach maximum throughput of PCIe Input Format Parsing JSON, CSV, Parquet format parser microbench 2X~7.8X performance boost Intel® Xeon® Gold 6120 CPU x2 DDR4 256GB Intel PAC Arria10 x1
  • 21. KEY TAKEAWAYS ▪ End-to-End Columnar Data Processing can optimize the performance in CPU, FPGA and other Accelerators in native layer. ▪ FPGA can help to accelerate SPARK in many cases that involved heavy CPU-intensive operations. ▪ Last but not least, with SPARK3.0 support, many new opportunities can be done in the future.
  • 22. NEXT … ▪ More features in OAP Native SQL Engine ▪ OAP Native SQL Engine + FPGA integration OAP Native SQL Engine Plugin Apache Arrow Arrow Data Source Arrow Data Processing Intel CPU WASAI FPGA Accelerators Columnar Shuffle WASAI CodeGen
  • 23. CALL TO ACTION We encourage you to try OAP Native SQL Engine for SPARK in https://github.com/Intel-bigdata/OAP/ Wasai SPARK SQL + FPGA Solution https://www.wasaitech.com/ Please contact Intel: weiting.chen@intel.com Wasai: calvin.hung@wasaitech.com
  • 24. Q & A
  • 25. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.