Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote Persistent Memory Pools
Jian Zhang, Software Engineering Manager, Intel
Chendi Xue, Software Engineer, Intel
June, 2020
Agenda
▪ Motivation
▪ Remote Persistent Memory
▪ Remote Persistent Memory for Spark Shuffle
▪ Remote persistent memory extension for Spark shuffle
▪ Remote persistent memory pool based fully disaggregated shuffle
▪ Summary
Motivation
Motivation - Challenges of Spark shuffle
▪ Data center infrastructure evolution
▪ Compute and storage disaggregation has become a key trend, and diskless environments are increasingly common
▪ The modern datacenter is evolving: high-speed networks between compute and disaggregated storage, plus tiered storage architectures, make local storage less attractive
▪ New storage technologies are emerging, e.g., storage class memory (or PMem)
▪ Spark shuffle problems
▪ Uneven resource utilization of CPU and memory
▪ Out-of-memory issues and GC pressure
▪ Disk I/O is too slow
▪ Data spill degrades performance
▪ Shuffle I/O grows quadratically with data
▪ Local SSDs wear out under frequent intermediate data writes
▪ Unaffordable re-compute cost
▪ Other related work
▪ Intel disaggregated shuffle w/ DAOS1, Facebook Cosco2, Baidu DCE shuffle3, JD.com & MemVerge RSS4, etc.
1. https://www.slideshare.net/databricks/improving-apache-spark-by-taking-advantage-of-disaggregated-architecture
2. https://databricks.com/session/cosco-an-efficient-facebook-scale-shuffle-service
3. http://apache-spark-developers-list.1001551.n3.nabble.com/Enabling-fully-disaggregated-shuffle-on-Spark-td28329.html
4. https://databricks.com/session/optimizing-performance-and-computing-resource-efficiency-of-in-memory-big-data-analytics-with-disaggregated-persistent-memory
Re-cap of Shuffle
[Diagram: map tasks load an input HDFS file and sort it; each map's output forms the intermediate data, which is shuffled (random partition) to reducers, which sort and produce an output HDFS file. Intermediate data is compressed on write and decompressed on read. Shuffle writes go to local storage (optionally cached by the shuffle service); shuffle reads are remote over the network.]
Spark Shuffle Bottlenecks
▪ Spark Shuffle (nWeight – a graph computation workload)
▪ Context: an iterative graph-parallel algorithm, implemented with GraphX, that computes the association between two vertices that are 2-3 hops apart in the graph (e.g., recommending a video to my friends' friends)
[Chart: Spark worker node CPU utilization over time, broken down into averages of %idle, %steal, %iowait, %nice, %system and %user.]
▪ Spark Shuffle (TeraSort)
▪ Context: TeraSort samples the input data and uses map/reduce to sort the data into a total order.
Remote Persistent Memory
PMem - A New Memory Tier
▪ IDC reports indicate that data is growing very fast
▪ Global datasphere growth rate (CAGR) of 27%**
▪ But DRAM density scaling is slowing: from 4X/3yr to 2X/3yr to 2X/4yr*
▪ A new memory tier will be needed to meet the data growth of new use cases
▪ PMem: a new category that sits between memory and storage
▪ Delivers a unique combination of affordable large capacity and support for data persistence
▪ Two operational modes
▪ Memory Mode: enlarges system memory size
▪ App Direct Mode: exposes two sets of independent memory resources to the OS and applications (see the sketch below)
**Source: Data Age 2025, sponsored by Seagate with data from IDC Global DataSphere, Nov 2018
*Source: ”3D NAND Technology – Implications for Enterprise Storage Applications” by J.Yoon (IBM), 2015 Flash Memory Summit
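Concretely, App Direct Mode lets an application memory-map a file on a DAX filesystem backed by PMem and make stores durable with user-space cache flushes. A minimal sketch using PMDK's libpmem; the mount path and buffer size here are illustrative assumptions:

```cpp
// Minimal App Direct Mode sketch using PMDK's libpmem.
// Build: g++ pmem_demo.cpp -lpmem   (path/size are illustrative)
#include <libpmem.h>
#include <cstring>
#include <cstdio>

int main() {
    size_t mapped_len;
    int is_pmem;
    // Map (creating if needed) a 4 KiB persistent buffer on a DAX filesystem.
    void *addr = pmem_map_file("/mnt/pmem0/demo.buf", 4096,
                               PMEM_FILE_CREATE, 0666, &mapped_len, &is_pmem);
    if (addr == nullptr) { perror("pmem_map_file"); return 1; }

    const char msg[] = "hello persistent memory";
    std::memcpy(addr, msg, sizeof(msg));

    // Flush CPU caches so the store reaches the power-fail-safe (ADR) domain.
    if (is_pmem)
        pmem_persist(addr, sizeof(msg));   // user-space flush, kernel bypass
    else
        pmem_msync(addr, sizeof(msg));     // fall back to msync on non-PMem media

    pmem_unmap(addr, mapped_len);
    return 0;
}
```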
Remote Persistent Memory Usage
▪ High availability / data replication
▪ Replicate data in local PM across the fabric and store it in remote PM
▪ For backup
▪ Remote PM
▪ Extend on-node memory capacity (with or without persistency) in a disaggregated architecture to enlarge compute node memory
▪ e.g., IMDB
▪ Shared remote PM
▪ PM holds SHARED data among distributed applications
▪ e.g., remote shuffle service, IMDB
https://www.snia.org/sites/default/files/PM-Summit/2018/presentations/05_PM_Summit_Grun_PM_%20Final_Post_CORRECTED.pdf
Access Remote Persistent Memory over RDMA
Remote Persistent Memory offers:
• Remote persistence, without losing any of the characteristics of memory
• PM is really fast: it needs ultra-low-latency networking
• PM has very high bandwidth: it needs an ultra-efficient protocol, transport offload, and high network bandwidth
• Remote access must not add significant latency
• Network switches & adapters deliver predictability, fairness, and zero packet loss
RDMA offers:
• Zero-copy data movement between two systems' volatile DRAM, offloading data movement from the CPU to the NIC
• Low latency: sub-microsecond latencies
• High BW: 200Gb/s, 400Gb/s, zero-copy, kernel bypass, hardware-offloaded one-sided memory-to-memory operations
• Reliable, credit-based data and control delivery in hardware
• Network resiliency, scale-out
RPMem over fabric adds complexity:
• To guarantee written data is durable on the target node, CPU caches need to be bypassed or flushed to get the data into the ADR power-fail-safe domain
• Writes to PMem need a synchronous acknowledgement once they reach the durability domain, but current RDMA Write semantics provide no such acknowledgement
RPMem Durability
• RDMA
• Guarantees that data has been successfully received and accepted for execution by the remote HCA
• Doesn't guarantee data has reached remote host memory – needs ADR
• Doesn't guarantee the data is visible/durable to other consumers (other connections, the host processor)
• A small RDMA read can be used to force preceding write data to PMem, as sketched after the diagrams below
• New transport operation – RDMA FLUSH
• A new RDMA command opcode
• Flushes all previous writes, or specific regions
• Provides a memory placement guarantee to upper-layer software
• RDMA FLUSH forces previous RDMA Write data to the durability domain
• It makes PM operations with RDMA more efficient!
[Diagrams: (1) RDMA Write followed by a flushing RDMA Read – the application's writes flow through Peer A's RNIC to Peer B's RNIC and memory controller as non-allocating posted writes, and the Read ACK confirms placement in Peer B's PMEM; (2) the same flow with the proposed RDMA Flush replacing the flushing read.]
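Until RDMA FLUSH is available, the "write + flushing read" flow in the first diagram can be sketched with libibverbs. Queue-pair setup is omitted; qp, cq, mr and the remote address/rkey are assumed to be established out of band:

```cpp
// Sketch: "write + flushing read" durability pattern with libibverbs.
// Assumes a connected RC queue pair (qp), a registered local buffer (mr),
// and the peer's remote_addr/rkey exchanged out of band.
#include <infiniband/verbs.h>
#include <cstdint>

// Push `len` bytes to remote PMem-backed memory, then force them out of the
// RNIC/PCIe path with a 1-byte RDMA READ on the same region.
bool write_then_flush(ibv_qp *qp, ibv_cq *cq, ibv_mr *mr,
                      uint64_t remote_addr, uint32_t rkey, uint32_t len) {
    ibv_sge sge = {};
    sge.addr   = (uintptr_t)mr->addr;
    sge.length = len;
    sge.lkey   = mr->lkey;

    ibv_send_wr wr = {}, *bad = nullptr;
    wr.opcode              = IBV_WR_RDMA_WRITE;   // one-sided, no remote CPU
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;
    if (ibv_post_send(qp, &wr, &bad)) return false;

    ibv_sge rsge = sge;
    rsge.length = 1;                              // tiny "flushing" read
    ibv_send_wr rwr = {}, *rbad = nullptr;
    rwr.opcode              = IBV_WR_RDMA_READ;
    rwr.sg_list             = &rsge;
    rwr.num_sge             = 1;
    rwr.send_flags          = IBV_SEND_SIGNALED;  // completion = writes pushed
    rwr.wr.rdma.remote_addr = remote_addr;
    rwr.wr.rdma.rkey        = rkey;
    if (ibv_post_send(qp, &rwr, &rbad)) return false;

    // Wait for the read completion; PCIe ordering places the earlier write
    // data ahead of the read response.
    ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0) { /* spin */ }
    return wc.status == IBV_WC_SUCCESS;
}
```

Note that the read completion only guarantees placement into the target's ADR domain; RDMA FLUSH would give the same guarantee in one operation instead of an extra round trip.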
Remote persistent memory pool for Spark shuffle
Re-cap: Remote Persistent Memory Extension for Spark shuffle Design
▪ Vanilla Spark shuffle:
▪ 1. Serialize obj to off-heap memory
▪ 2. Write to the local shuffle dir (spark.local.dir on SSD/HDD)
▪ 3. Read from the local shuffle dir
▪ 4. Send to the remote reader through TCP/IP
▪ Lots of context switches
▪ POSIX buffered read/write on the shuffle disk
▪ TCP/IP-based socket send for remote shuffle reads
▪ PMoF shuffle:
▪ 1. Serialize obj to off-heap memory
▪ 2. Persist to PMEM
▪ 3. Read from remote PMEM through RDMA; PMEM is used as the RDMA memory buffer
▪ No context switches
▪ Efficient read/write on PMEM
▪ RDMA read for remote shuffle reads
[Diagram: executor JVMs with heap/off-heap byte buffers; the vanilla path crosses into the kernel to shuffle files on SSD/HDD and out through the NIC, while the PMoF path goes from the new shuffle writer/reader straight to PMEM and the RDMA NIC via user-space drivers.]
Spark PMoF: https://github.com/intel-bigdata/spark-pmof
Strata-ca-2019: https://conferences.oreilly.com/strata/strata-ca-2019/public/schedule/detail/72992
Spark-PMoF End-to-End Time Evaluation – TeraSort Workload
§ TeraSort
§ End-to-end time gain vs. vanilla Spark: 22.7x
§ 1.29x speedup over 4x NVMe
§ PMoF dramatically shortens remote read latency
§ Read-blocked time for HDD, NVMe & PMem (from the Spark UI): 8.3 min vs. 11 s vs. 7 ms
§ PMem provides higher write/read bandwidth per node than HDD & NVMe, and higher endurance
§ Decision support workload
§ Less I/O intensive than TeraSort
§ 3.2x speedup for the total execution time of the 99 queries
§ I/O-intensive workloads benefit more from the PMoF performance improvement.
Performance results are based on testing as of 12/06/2019 and may not reflect all publicly available security updates. See configuration disclosure for details. No product can be absolutely secure. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks. Configurations: see the benchmark configuration slide.
[Chart: execution time of the 99 queries, Spark-PMoF vs. vanilla Spark.]
[Chart: Spark 550GB TeraSort end-to-end time in seconds, log scale, lower is better – terasort-hdd 12277.2, terasort-nvme 695, terasort-pmof 540.5.]
Extending to a fully disaggregated shuffle solution
▪ Remote Persistent Memory demonstrated good results, but what's more?
▪ Real production environments bring more challenges
▪ Disaggregated, diskless environments
▪ Scale shuffle/compute independently
▪ CPU/memory imbalance
▪ Some jobs run for a long time; the stage recompute cost is intolerable in case of shuffle failure
▪ Elastic deployment with compute and storage disaggregation requires an independent shuffle solution
▪ Decoupling shuffle I/O from specific network/storage hardware makes it possible to deliver a dedicated SLA for critical applications
▪ Fault tolerance in case of shuffle failure, with no need to recompute
▪ Offload spill as well, reducing compute memory requirements
▪ Balanced resource utilization
▪ Leverage state-of-the-art storage media
▪ To provide a high-performance, high-endurance storage backend
▪ This drives the intent to build an RPMem-based, fully disaggregated shuffle solution!
RPMP Architecture
▪ Remote Persistent Memory Pool for Spark (Spark RPMP): a new fully disaggregated shuffle solution that leverages state-of-the-art hardware technologies, including persistent memory and RDMA.
▪ A new pluggable shuffle manager
▪ A persistent memory based distributed storage system
▪ An RDMA-powered network library, and an innovative approach that uses persistent memory both as the shuffle media and as the RDMA memory region, to avoid additional memory copies and context switches.
▪ Features
▪ Provides allocate/free/read/write APIs on pooled PMem resources
▪ Data is replicated to multiple nodes for high availability
▪ Can be extended to other usage scenarios, such as a PMem-based database, data store, or cache store
▪ Benefits
▪ Improved Spark scalability, by disaggregating Spark shuffle from the compute nodes to a high-performance distributed storage
▪ Improved Spark shuffle performance, with high-speed persistent memory and a low-latency RDMA network
▪ Improved reliability, by providing a manageable and highly available shuffle service that supports shuffle data replication and fault tolerance.
[Diagram: compute nodes running SQL, transactions, streaming and machine learning keep DRAM-based data caches locally, while shuffle (plus S3 and K/V remote shuffle interfaces) is offloaded over the RNIC to RPMP storage nodes built from DRAM, PMem and a proxy.]
Remote Persistent Memory Pool overview
[Diagram: mappers and reducers run the PMoF shuffle manager with shuffle writers/readers, issuing RDMA reads and writes against RPMP nodes. Each RPMP node runs the RPMP core (RPMP proxy, network layer, controller layer, storage layer) behind a global memory address space, with heartbeat and replication between nodes.]
§ Shuffle cares most about write performance (latency/bandwidth): a 100Gb NIC is needed, since theoretically 8x PMEM on a single node provides 10GB+ of write bandwidth.
§ The RPMP storage node is chosen using consistent hashing to avoid a single point of failure, as sketched below.
§ A timely ActiveNodeMap is maintained using heartbeats.
§ Data is replicated from the primary node to a worker node over RDMA.
§ If the primary node goes down, the worker node remains writable and readable.
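A minimal sketch of that consistent-hash node selection, using virtual nodes to smooth the distribution; std::hash stands in for a production hash, and the node and block names are illustrative:

```cpp
// Sketch: consistent-hash ring for picking an RPMP storage node per block.
// Virtual nodes smooth the distribution; std::hash is a stand-in hash.
#include <cstdint>
#include <functional>
#include <map>
#include <string>

class HashRing {
    std::map<uint64_t, std::string> ring_;  // hash point -> node name
    static uint64_t h(const std::string &s) { return std::hash<std::string>{}(s); }
public:
    void addNode(const std::string &node, int vnodes = 64) {
        for (int i = 0; i < vnodes; ++i)
            ring_[h(node + "#" + std::to_string(i))] = node;
    }
    void removeNode(const std::string &node, int vnodes = 64) {  // on missed heartbeat
        for (int i = 0; i < vnodes; ++i)
            ring_.erase(h(node + "#" + std::to_string(i)));
    }
    // First ring point clockwise from the key's hash owns the block.
    // Assumes at least one active node is registered.
    const std::string &nodeFor(const std::string &blockId) const {
        auto it = ring_.lower_bound(h(blockId));
        if (it == ring_.end()) it = ring_.begin();  // wrap around the ring
        return it->second;
    }
};

// Usage: the heartbeat-maintained ActiveNodeMap adds/removes nodes, and
// shuffle blocks then map onto nodes with no single point of failure.
// HashRing ring; ring.addNode("rpmp-node-1"); ring.addNode("rpmp-node-2");
// const std::string &owner = ring.nodeFor("shuffle_0_42_0");
```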
RPMP CORE architecture details
§ RPMP Client
§ The RPMP client provides transactional read/write/allocate/free and object put/get interfaces to users (a usage sketch follows the diagram below)
§ Both C++ and Java APIs are provided
§ Data is transferred via HPNL (RDMA) between the selected server nodes and the client.
§ RPMP Server
§ The RPMP proxy maintains a unified ActiveNodeMap.
§ The network layer is based on HPNL to provide RDMA data transfer.
§ The controller layer is responsible for global address management, transaction processing, etc.
§ The storage layer is responsible for PMem management, using the high-performance PMDK libraries
[Diagram: RPMP server layers – storage layer (PmemAllocator over /dev/dax0.0–/dev/dax2.1 devices), network layer (HPNL, encode/decode, buffer management), controller layer (scheduler, transactions, global address management, checksum) – alongside the RPMP proxy and accelerator; RPMP client layers – interface (tx_alloc/tx_free/tx_read/tx_write/put/get), network layer (HPNL, encode/decode, buffer management), storage proxy.]
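To make the client interface concrete, here is a hypothetical usage sketch modeled on the tx_alloc/tx_free/tx_read/tx_write/put/get interfaces listed above. The class shape and the in-memory method bodies are illustrative stand-ins so the sketch runs; they are not the real Spark-PMoF signatures, whose real paths go over HPNL/RDMA to pooled PMem:

```cpp
// Hypothetical RPMP client facade; bodies are in-memory stand-ins.
#include <cstdint>
#include <cstring>
#include <map>
#include <string>
#include <vector>

class RpmpClient {
    std::map<uint64_t, std::vector<char>> mem_;    // stand-in: global PMem space
    std::map<std::string, std::vector<char>> kv_;  // stand-in: libpmemobj KV store
    uint64_t next_ = 1;
public:
    uint64_t tx_alloc(size_t size) { mem_[next_] = std::vector<char>(size); return next_++; }
    void tx_free(uint64_t addr) { mem_.erase(addr); }
    void tx_write(uint64_t addr, const void *buf, size_t len) {
        std::memcpy(mem_.at(addr).data(), buf, len);  // real path: RDMA read + flush
    }
    void tx_read(uint64_t addr, void *buf, size_t len) {
        std::memcpy(buf, mem_.at(addr).data(), len);  // real path: one-sided RDMA
    }
    void put(const std::string &k, const void *buf, size_t len) {
        kv_[k].assign((const char *)buf, (const char *)buf + len);
    }
};

int main() {
    RpmpClient client;
    const char block[] = "serialized shuffle partition";
    uint64_t addr = client.tx_alloc(sizeof(block));     // reserve pooled PMem
    client.tx_write(addr, block, sizeof(block));        // map side: persist block
    char dst[sizeof(block)];
    client.tx_read(addr, dst, sizeof(block));           // reduce side: fetch block
    client.put("shuffle_0_42_0", block, sizeof(block)); // KV path, illustrative key
    client.tx_free(addr);
}
```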
Spark RPMP optimization features
▪ Optimized RDMA communication
▪ Leverages HPNL as a high-performance, protocol-agnostic network messenger
▪ The server handles all write operations; clients implement read-only operations using one-sided RDMA reads.
▪ Controller accelerator layer
▪ Partition merge
▪ Aggregates small partitions into larger blocks to accelerate reduce and cut the number of reducer connections
▪ Sort
▪ Sorts the shuffle data on the fly, so there is no compute in the reduce phase, reducing compute node CPU utilization
▪ Provides controllable, fine-grained control of resource utilization when compute node CPU resources are limited
▪ Storage
▪ A global address space accessible with memory-like APIs
▪ A transactional key-value store based on libpmemobj
▪ An allocator manages the PMem; a storage proxy directs requests to different allocators
▪ Checksums (xxHash*, sketched below)
* https://github.com/Cyan4973/xxHash
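The checksum the footnote points at can be sketched as hashing each block at write time and re-verifying on read. XXH64 is the actual xxHash entry point; the surrounding helper and its use for shuffle blocks are illustrative:

```cpp
// Sketch: block integrity check with xxHash (footnoted above).
// Build: g++ checksum_demo.cpp -lxxhash
#include <xxhash.h>
#include <cstdint>
#include <cstdio>

// Hash a shuffle block before it is written; store the digest with the
// block's metadata and recompute on read to detect corruption.
static uint64_t block_checksum(const void *data, size_t len) {
    return XXH64(data, len, /*seed=*/0);
}

int main() {
    const char block[] = "serialized shuffle partition";
    uint64_t stored = block_checksum(block, sizeof(block));     // at write time
    // ... block travels through RDMA / PMem ...
    bool ok = block_checksum(block, sizeof(block)) == stored;   // at read time
    std::printf("checksum %016llx ok=%d\n", (unsigned long long)stored, ok);
}
```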
RPMP Workflow
Write (single node):
1. The client writes data to a specific address.
2. The server issues an RDMA read (client DRAM → server DRAM).
3. Flush (DRAM → PMEM).
4. The request is ACKed.
Read (single node):
1. The client reads data from a specific address.
2. The server issues an RDMA write (server PMEM → client DRAM).
3. The request is ACKed.
Write (with replication):
1. The client writes data to a specific address.
2. RDMA read (client DRAM → primary node DRAM), then replication (primary node DRAM → secondary node DRAM) via the proxy.
3. Flush (DRAM → PMEM) on each replica.
4. The request is ACKed.
Read (with replication):
1. The client reads data from a specific address.
2. RDMA write (server PMEM → client DRAM).
3. The request is ACKed.
[Diagrams: client/server DRAM-to-PMem flows for each of the four cases.]
PMEM Based Shuffle Write optimization
§ Map
§ Provision the PMem namespace in advance
§ Leverage a circular buffer to build unidirectional channels for RDMA primitives
§ Serialized data is written to an off-heap buffer; once it hits the threshold (4MB by default), a block is created via libpmemobj on the PMem device with a memcpy, as sketched below
§ Append-only write; data is written only once
§ No index file: mapping info is stored in the PMem object metadata
§ libpmem-based, kernel bypass
§ Reduce
§ Reduce uses memcpy to read the data
§ Reduce reads directly from PMem memory through RDMA
[Diagram: a map task's KV data in JVM memory flows through the RPMP client circular buffer in DRAM and, via JNI into native PMDK (libpmemobj C), into partitions [0..n] on the persistent memory device (devdax/fsdax mode); N reducers read the partitions.]
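The threshold-triggered block write above can be sketched with libpmemobj's C API (part of the PMDK named in the diagram); the pool path, layout name and sizes are illustrative assumptions:

```cpp
// Sketch: appending a shuffle block to a libpmemobj pool once the off-heap
// buffer reaches the threshold (4MB by default). Path/layout are illustrative.
// Build: g++ pmemobj_demo.cpp -lpmemobj
#include <libpmemobj.h>
#include <cstdio>
#include <vector>

static const size_t kBlockSize = 4 << 20;   // 4MB flush threshold

int main() {
    // Open (or create) a pool on a PMem device in fsdax mode.
    PMEMobjpool *pop = pmemobj_create("/mnt/pmem0/shuffle.pool", "shuffle",
                                      64ULL << 20, 0666);
    if (!pop) pop = pmemobj_open("/mnt/pmem0/shuffle.pool", "shuffle");
    if (!pop) { perror("pmemobj"); return 1; }

    std::vector<char> staged(kBlockSize, 'x');  // stands in for the off-heap buffer

    // Allocate a persistent block; its OID doubles as the mapping info that
    // replaces a separate index file.
    PMEMoid oid;
    if (pmemobj_alloc(pop, &oid, kBlockSize, /*type_num=*/0, nullptr, nullptr)) {
        perror("pmemobj_alloc");
        return 1;
    }

    // Append-write once: memcpy + persist, all in user space, kernel bypass.
    pmemobj_memcpy_persist(pop, pmemobj_direct(oid), staged.data(), kBlockSize);

    pmemobj_close(pop);
    return 0;
}
```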
RPMP integration to Spark Shuffle
[Diagram: the PMoFShuffleManager extends the SortShuffleManager flow – registerShuffle() returns a BaseShuffleHandle, and getWriter() returns a PmemShuffleWriter in place of the default writer. On the write path (Java): PmemShuffleWriter → PmemBlockObjectWriter → PmemOutputStream → PmemBuffer, with a PMEM ShuffleBlockResolver and PmemExternalSorter, backed by the PMDK persistent memory pool instead of the filesystem. On the read path: PmemShuffleReader → RDMAShuffleReader (replacing the NettyReader), with PmemManagedBuffer and PmemInputStream, reaching RPMP (C++) over RDMA to another executor.]
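Because the PMoF shuffle manager is pluggable, enabling it is a configuration change. Below is a hedged sketch of spark-defaults.conf entries: spark.shuffle.manager is a standard Spark property, but the class and jar names are assumptions to be checked against the Spark-PMoF repository's README:

```
# Hedged sketch -- confirm class/jar names against the Spark-PMoF README.
spark.shuffle.manager         org.apache.spark.shuffle.pmof.PmofShuffleManager
spark.driver.extraClassPath   /opt/spark-pmof/spark-pmof.jar
spark.executor.extraClassPath /opt/spark-pmof/spark-pmof.jar
```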
Performance Evaluation
§ Configuration (RPMP micro-benchmark)
§ 2 nodes: one as the RPMem client and the other as the RPMem server.
§ 40Gb RDMA NIC; 4x PMem on the RPMem server.
§ Tested the remote_allocate, remote_free, remote_write and remote_read interfaces with a single client.
§ Performance: remote read maxes out the NIC bandwidth
§ allocate: 3.7 GB/s – expect higher performance with more clients
§ remote_write: 2.9 GB/s – expect higher performance with more clients
§ remote_read: 4.9 GB/s – limited by the 40Gb NIC
§ Configuration (Spark integration)
§ 2 nodes, HDFS on HDD
§ Baseline: shuffle on HDD
§ RPMem: 2x 128GB PMem on Node 2 for shuffle and external sort
§ Workload: TeraSort 100GB
§ Performance: 1.98x speedup
§ Vanilla Spark 100G: 416 s; RPMem Shuffle: 210 s
Performance results are based on testing as of 5/30/2020 and may not reflect all publicly available security updates. See configuration disclosure for details. No product can be absolutely secure. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks. Configurations: see the benchmark configuration slide.
Summary
Summary
• Spark shuffle poses many challenges in large-scale production environments
• Remote Persistent Memory extends PMem to new usage scenarios, with RDMA being the most accepted technology for remote persistent memory access
• A remote persistent memory pool for Spark shuffle enables a fully disaggregated, high-performance, low-latency shuffle solution that accelerates Spark shuffle:
▪ Improved Spark scalability, by disaggregating Spark shuffle from the compute nodes to a high-performance distributed storage
▪ Improved Spark shuffle performance, with high-speed persistent memory and a low-latency RDMA network
▪ Improved reliability, by providing a manageable and highly available shuffle service that supports shuffle data replication and fault tolerance.
Call to action
Accelerate Your Data Analytics & AI Journey with Intel
• Optimized ML/DL libraries & tools: Intel Distribution for Python, Intel-optimized frameworks
• Optimized cloud platforms: Amazon Web Services, Microsoft Azure, Google Cloud Platform, Baidu Cloud & more
• Analytics & AI hardware: CPU (multi-purpose analytics/AI foundation, high-performance in-memory analytics), GPU (AI, HPC, media & graphics), FPGA (real-time & multi-use DL inference), data center DL inference (Goya*) and DL training (Gaudi*), edge DL inference
• Intel-optimized end-to-end data analytics & AI pipeline: Discovery (of possibilities & next steps) → Data (setup, ingestion & cleaning) → Develop (models using analytics/AI) → Deploy (into production & iterate)
* In development
Intel.com/AI | software.intel.com | intel.com/yourdataonintel
Intel OAP
https://github.com/Intel-bigdata/OAP/
Optimized Analytics Packages for Spark: an end-to-end columnar data processing stack with Intel AVX support.
• The OAP Native SQL Engine plugin sits on Spark SQL / Spark Catalyst and is built on Apache Arrow: Arrow data source, Arrow data processing, columnar shuffle
• Runs on Intel CPUs and other accelerators (FPGA, GPU, …)
• Shuffle components: Remote Shuffle, Remote Persistent Memory Shuffle, Remote Persistent Memory Pool
Feedback
Your feedback is important to us.
Don’t forget to rate and
review the sessions.
Legal Information: Benchmark and Performance
Disclaimers
▪ Performance results are based on testing as of Feb. 2019 and may not reflect all publicly available security updates. See
configuration disclosure for details. No product can be absolutely secure.
▪ Software and workloads used in performance tests may have been optimized for performance only on Intel
microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems,
components, software, operations and functions. Any change to any of those factors may cause the results to vary. You
should consult other information and performance tests to assist you in fully evaluating your contemplated purchases,
including the performance of that product when combined with other products. For more information, see Performance
Benchmark Test Disclosure.
▪ Configurations: see performance benchmark test configurations.
Notices and Disclaimers
▪ No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
▪ Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness
for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or
usage in trade.
▪ This document contains information on products, services and/or processes in development. All information provided here is
subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and
roadmaps.
▪ The products and services described may contain defects or errors known as errata which may cause deviations from published
specifications. Current characterized errata are available on request.
▪ Intel, the Intel logo, Xeon, Optane, Optane DC Persistent Memory are trademarks of Intel Corporation in the U.S. and/or other
countries.
▪ *Other names and brands may be claimed as the property of others
▪ © Intel Corporation.
Benchmark configuration
Workloads
Terasort 600GB
• hibench.spark.master yarn-client
• hibench.yarn.executor.num 12
• yarn.executor.num 12
• hibench.yarn.executor.cores 8
• yarn.executor.cores 8
• spark.shuffle.compress false
• spark.shuffle.spill.compress false
• spark.executor.memory 60g
• spark.executor.memoryoverhead 10G
• spark.driver.memory 80g
• spark.eventLog.compress = false
• spark.executor.extraJavaOptions=-XX:+UseG1GC
• spark.hadoop.yarn.timeline-service.enabled false
• spark.serializer org.apache.spark.serializer.KryoSerializer
• hibench.default.map.parallelism 200
• hibench.default.shuffle.parallelism 1000
3-node cluster
Hardware:
• Intel® Xeon™ Gold 6240 CPU @ 2.60GHz, 384GB memory (12x 32GB 2666 MT/s)
• 1x Mellanox ConnectX-4 40Gb NIC
• Shuffle devices:
• 1x HDD for shuffle
• 4x 128GB persistent memory for shuffle
• 4x 1TB NVMe for HDFS
Software:
• Hadoop 2.7
• Spark 2.3
• Fedora 27 with WW26 BKC
Topology: one node runs the Hadoop NameNode and Spark driver; three nodes run Hadoop DataNodes and Spark workers, each with 1x 40Gb NIC, 4x NVMe, 1x HDD and 4x PMem.
The Rise of Python in Finance,Automating Trading Strategies: _.pdfThe Rise of Python in Finance,Automating Trading Strategies: _.pdf
The Rise of Python in Finance,Automating Trading Strategies: _.pdf
 

*Source: "3D NAND Technology – Implications for Enterprise Storage Applications" by J. Yoon (IBM), 2015 Flash Memory Summit
Remote Persistent Memory Usage
▪ High Availability Data Replication
▪ Replicate data in local PM across the fabric and store it in remote PM
▪ For backup
▪ Remote PM
▪ Extend on-node memory capacity (w/ or w/o persistency) in a disaggregated architecture to enlarge compute-node memory
▪ e.g., IMDB
▪ Shared Remote PM
▪ PM holds shared data among distributed applications
▪ e.g., remote shuffle service, IMDB
https://www.snia.org/sites/default/files/PM-Summit/2018/presentations/05_PM_Summit_Grun_PM_%20Final_Post_CORRECTED.pdf
Access Remote Persistent Memory over RDMA
▪ Remote Persistent Memory offers
▪ Remote persistence without losing any of the characteristics of memory
▪ PM is really fast: needs ultra-low-latency networking
▪ PM has very high bandwidth: needs an ultra-efficient protocol, transport offload, high BW
▪ Remote access must not add significant latency
▪ Network switches & adapters deliver predictability, fairness, zero packet loss
▪ RDMA offers
▪ Zero-copy data movement between two systems' volatile DRAM, offloading data movement from CPU to NIC
▪ Low latency: microsecond-scale
▪ High BW: 200Gb/s, 400Gb/s, zero-copy, kernel bypass, HW-offloaded one-sided memory-to-remote-memory operations
▪ Reliable, credit-based data and control delivery in HW
▪ Network resiliency, scale-out
▪ RPMem over Fabric adds complexity:
▪ To guarantee written data is durable on the target node, CPU caches need to be bypassed or flushed to get the data into the ADR power-fail-safe domain
▪ Writing to PMem needs a synchronous acknowledgement once writes have reached the durability domain, but current RDMA Write semantics do not provide such a write acknowledgement
RPMem Durability
▪ RDMA
▪ Guarantees that data has been successfully received and accepted for execution by the remote HCA
▪ Doesn't guarantee the data has reached remote host memory – needs ADR
▪ Doesn't guarantee the data is visible/durable for other consumers' accesses (other connections, host processor)
▪ A small RDMA read can be used to force preceding write data to PMem (a "flushing read"; see the sketch below)
▪ New transport operation – RDMA FLUSH
▪ A new RDMA command opcode
▪ Flushes all previous writes, or specific regions
▪ Provides a memory-placement guarantee to the upper-layer software
▪ RDMA FLUSH forces previous RDMA Write data to the durability domain
▪ It makes PM operations with RDMA more efficient!
[Diagram: ladder charts comparing RDMA Write + flushing RDMA Read vs. RDMA Write + RDMA FLUSH flowing from Peer A's RNIC through Peer B's RNIC and memory controller to Peer B's PMem; writes are posted non-allocating]
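Below is a minimal sketch of the flushing-read pattern over libibverbs. It assumes an already-connected RC queue pair, a registered local buffer, and a target platform with ADR and non-allocating writes; all connection setup is omitted, and the function name is illustrative rather than taken from any Intel library.

```cpp
// Sketch: "flushing read" durability pattern over ibverbs (assumptions above).
#include <infiniband/verbs.h>
#include <cstdint>

int post_write_then_flushing_read(ibv_qp *qp, ibv_mr *mr, void *laddr,
                                  size_t len, uint64_t raddr, uint32_t rkey) {
    // 1) RDMA Write: pushes data toward the remote node, but its completion
    //    only means the remote HCA accepted it, not that it is durable.
    ibv_sge wsge{};
    wsge.addr = reinterpret_cast<uintptr_t>(laddr);
    wsge.length = static_cast<uint32_t>(len);
    wsge.lkey = mr->lkey;

    ibv_send_wr write{}, *bad = nullptr;
    write.wr_id = 1;
    write.opcode = IBV_WR_RDMA_WRITE;
    write.sg_list = &wsge;
    write.num_sge = 1;
    write.wr.rdma.remote_addr = raddr;
    write.wr.rdma.rkey = rkey;

    // 2) Small RDMA Read of the same region: transport ordering forces the
    //    preceding write out of the RNIC/PCIe path, so the read completion
    //    implies the write reached the target's durability domain.
    ibv_sge rsge = wsge;
    rsge.length = 1;                      // 1 byte is enough to flush

    ibv_send_wr read{};
    read.wr_id = 2;
    read.opcode = IBV_WR_RDMA_READ;
    read.send_flags = IBV_SEND_SIGNALED;  // only poll for the read's CQE
    read.sg_list = &rsge;
    read.num_sge = 1;
    read.wr.rdma.remote_addr = raddr;
    read.wr.rdma.rkey = rkey;
    write.next = &read;                   // post both with one doorbell

    if (ibv_post_send(qp, &write, &bad)) return -1;

    // 3) Poll for the read's completion: data is now considered durable.
    ibv_wc wc{};
    while (ibv_poll_cq(qp->send_cq, 1, &wc) == 0) { /* spin */ }
    return (wc.status == IBV_WC_SUCCESS && wc.wr_id == 2) ? 0 : -1;
}
```

This is exactly the inefficiency RDMA FLUSH is meant to remove: the extra read round-trip is replaced by a single flush opcode with a placement guarantee.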
Remote persistent memory pool for Spark shuffle
Re-cap: Remote Persistent Memory Extension for Spark Shuffle Design
▪ Vanilla Spark shuffle:
▪ 1. Serialize obj to off-heap memory
▪ 2. Write to the local shuffle dir
▪ 3. Read from the local shuffle dir
▪ 4. Send to the remote reader over TCP/IP
▪ Lots of context switches; POSIX buffered read/write on the shuffle disk; TCP/IP socket send for remote shuffle read
▪ PMoF shuffle:
▪ 1. Serialize obj to off-heap memory
▪ 2. Persist to PMem
▪ 3. Read from remote PMem through RDMA; PMem is used directly as the RDMA memory buffer (a sketch of this follows below)
▪ No context switch; efficient read/write on PMem; RDMA read for remote shuffle read
[Diagram: shuffle write/read paths through the executor JVM, user/kernel boundary, and storage (SSD/HDD vs. PMem + RDMA NIC) for both designs]
Spark PMoF: https://github.com/intel-bigdata/spark-pmof
Strata CA 2019: https://conferences.oreilly.com/strata/strata-ca-2019/public/schedule/detail/72992
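To make the "PMem as RDMA buffer" idea concrete, here is a hedged sketch that maps a PMem file with libpmem and registers the same mapping as an RDMA memory region, so remote readers can pull shuffle data straight out of persistent memory with no bounce copy. The pool path and size are made up for illustration, error handling is abbreviated, and this is not the Spark-PMoF code itself.

```cpp
// Sketch: one mapping serves as both the persistence target and the RDMA MR.
#include <libpmem.h>
#include <infiniband/verbs.h>
#include <cstdio>

int main() {
    size_t mapped_len = 0;
    int is_pmem = 0;
    // Map a shuffle block pool on an fsdax mount (path is an example).
    void *base = pmem_map_file("/mnt/pmem0/shuffle_pool", 1ull << 30,
                               PMEM_FILE_CREATE, 0600, &mapped_len, &is_pmem);
    if (!base) { perror("pmem_map_file"); return 1; }

    // Register the *same* mapping as an RDMA memory region: remote readers
    // can then RDMA-read shuffle data directly from persistent memory.
    ibv_device **devs = ibv_get_device_list(nullptr);
    ibv_context *ctx = ibv_open_device(devs[0]);
    ibv_pd *pd = ibv_alloc_pd(ctx);
    ibv_mr *mr = ibv_reg_mr(pd, base, mapped_len,
                            IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ);
    if (!mr) { perror("ibv_reg_mr"); return 1; }
    std::printf("pmem=%d mapped=%zu rkey=%u\n", is_pmem, mapped_len, mr->rkey);

    // A writer persists serialized records in place; pmem_memcpy_persist
    // flushes CPU caches so data is durable before readers are told about it.
    pmem_memcpy_persist(base, "example-shuffle-bytes", 21);

    ibv_dereg_mr(mr); ibv_dealloc_pd(pd); ibv_close_device(ctx);
    ibv_free_device_list(devs);
    pmem_unmap(base, mapped_len);
    return 0;
}
```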
Spark-PMoF End-to-End Time Evaluation – TeraSort Workload
▪ TeraSort
▪ End-to-end time gain vs. vanilla Spark: 22.7x
▪ 1.29x speedup over 4x NVMe
▪ PMoF dramatically shortens remote read latency
▪ Read-blocked time for HDD, NVMe & PMem (from the Spark UI): 8.3 min vs. 11 s vs. 7 ms
▪ PMem provides higher write/read bandwidth per node than HDD & NVMe, and higher endurance
▪ Decision support workload
▪ Less I/O intensive than TeraSort
▪ 3.2x speedup for the total execution time of the 99 queries
▪ I/O-intensive workloads benefit more from the PMoF performance improvement
[Charts: per-query execution time over the 99 queries, Spark-PMoF vs. vanilla Spark; Spark 550GB TeraSort end-to-end time (lower is better): terasort-hdd 12277.2 s, terasort-nvme 695 s, terasort-pmof 540.5 s]
Performance results are based on testing as of 12/06/2019 and may not reflect all publicly available security updates. See the configuration disclosure for details. No product can be absolutely secure. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks. Configurations: see the Benchmark configuration slide at the end.
Extending to a Fully Disaggregated Shuffle Solution
▪ Remote persistent memory demonstrated good results, but what's more?
▪ Real production environments pose more challenges
▪ Disaggregated, diskless environments
▪ The need to scale shuffle and compute independently
▪ Unbalanced CPU/memory utilization
▪ Some jobs last a long time; stage-recompute cost is intolerable in case of shuffle failure
▪ Elastic deployment with compute and storage disaggregation requires an independent shuffle solution
▪ Decoupling shuffle I/O from specific network/storage can deliver a dedicated SLA for critical applications
▪ Fault tolerant in case of shuffle failure, with no need to recompute
▪ Offloads spill as well, reducing compute-node memory requirements
▪ Balanced resource utilization
▪ Leverage state-of-the-art storage media
▪ To provide a high-performance, high-endurance storage backend
▪ This drives the intention to build an RPMem-based, fully disaggregated shuffle solution!
RPMP Architecture
▪ Remote Persistent Memory Pool for Spark (Spark RPMP): a new fully disaggregated shuffle solution that leverages state-of-the-art hardware technologies, including persistent memory and RDMA:
▪ A new pluggable shuffle manager
▪ A persistent-memory-based distributed storage system
▪ An RDMA-powered network library, and an innovative approach that uses persistent memory both as the shuffle media and as the RDMA memory region, removing extra memory copies and context switches
▪ Features
▪ Provides allocate/free/read/write APIs on pooled PMem resources
▪ Data is replicated to multiple nodes for high availability
▪ Can be extended to other usage scenarios such as PMem-based databases, data stores, and cache stores
▪ Benefits
▪ Improved Spark scalability by disaggregating shuffle from the compute nodes onto a high-performance distributed storage layer
▪ Improved shuffle performance with high-speed persistent memory and a low-latency RDMA network
▪ Improved reliability via a manageable, highly available shuffle service that supports shuffle data replication and fault tolerance
[Diagram: compute layer (SQL, transactions, streaming, machine learning) with DRAM data cache over an RPMP layer of PMem + RNIC nodes with a proxy, exposing S3, K/V, and remote shuffle interfaces]
Remote Persistent Memory Pool Overview
▪ Shuffle writers and readers on each mapper and reducer go through the PMoF shuffle manager and access RPMP nodes via RDMA read and write
▪ Write performance (latency/bandwidth) matters most; a 100Gb NIC is needed – theoretically, 8x PMem on a single node provides 10GB+ of write bandwidth
▪ The RPMP storage node is chosen by consistent hashing to avoid a single point of failure (see the sketch after this list)
▪ A timely ActiveNodeMap is maintained via heartbeats
▪ Data is replicated from the driver node to a worker node over RDMA
▪ If the driver node goes down, the worker node is still writable and readable
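As a rough illustration of the consistent-hash placement mentioned above, the sketch below builds a hash ring with virtual nodes. It is illustrative only, not the RPMP implementation; std::hash stands in for a production-grade hash function such as xxHash (referenced elsewhere in this deck).

```cpp
// Sketch: consistent-hash placement of shuffle blocks onto RPMP nodes.
#include <cstdint>
#include <functional>
#include <iostream>
#include <map>
#include <string>

class HashRing {
    std::map<uint64_t, std::string> ring_;   // hash point -> node id
    static uint64_t h(const std::string& s) { return std::hash<std::string>{}(s); }
public:
    // Each node is inserted at several virtual points for smoother balance.
    void add_node(const std::string& node, int vnodes = 64) {
        for (int i = 0; i < vnodes; ++i)
            ring_[h(node + "#" + std::to_string(i))] = node;
    }
    void remove_node(const std::string& node, int vnodes = 64) {
        for (int i = 0; i < vnodes; ++i)
            ring_.erase(h(node + "#" + std::to_string(i)));
    }
    // A block maps to the first node clockwise from its hash point, so
    // losing one node only remaps that node's share of the keys.
    const std::string& node_for(const std::string& block_id) const {
        auto it = ring_.lower_bound(h(block_id));
        if (it == ring_.end()) it = ring_.begin();   // wrap around the ring
        return it->second;
    }
};

int main() {
    HashRing ring;
    ring.add_node("rpmp-node-1");
    ring.add_node("rpmp-node-2");
    std::cout << ring.node_for("shuffle_0_map3_part7") << "\n";
}
```

The point of the ring is failure containment: when the heartbeat-driven ActiveNodeMap drops a node, only the keys that hashed to that node move, rather than the whole keyspace.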
RPMP Core Architecture Details
▪ RPMP client
▪ Provides transactional read/write/allocate/free and object put/get interfaces to users (see the sketch after this list)
▪ Both C++ and Java APIs are provided
▪ Data is then transferred via HPNL (RDMA) between the selected server nodes and the client
▪ RPMP server
▪ The RPMP proxy maintains a unified ActiveNodeMap
▪ The network layer is based on HPNL and provides RDMA data transfer
▪ The controller layer is responsible for global address management, transaction processing, etc.
▪ The storage layer is responsible for PMem management using the high-performance PMDK libraries
[Diagram: client (interface tx_alloc/tx_free/tx_read/tx_write/put/get, HPNL network layer, encode/decode, buffer management) talking to the server (RPMP proxy, controller layer with scheduler, transactions, global address management, checksum, and accelerator; storage layer with a PMem allocator over /dev/dax0.0–/dev/dax2.1)]
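The sketch below shows how the transactional client interface named above (tx_alloc/tx_write/tx_read/tx_free) might be driven from C++. RpmpClient here is a hypothetical in-memory stub so the example actually runs; the real client would route these calls to RPMP servers over HPNL/RDMA, and its signatures may differ.

```cpp
// Sketch: driving a tx_alloc/tx_write/tx_read/tx_free style interface.
#include <cstdint>
#include <cstring>
#include <iostream>
#include <map>
#include <vector>

class RpmpClient {                       // hypothetical stub, not the real API
    std::map<uint64_t, std::vector<char>> pool_;  // global addr -> bytes
    uint64_t next_ = 0x1000;
public:
    uint64_t tx_alloc(size_t n) { pool_[next_] = std::vector<char>(n); return next_++; }
    void tx_write(uint64_t g, const char* b, size_t n) {
        std::memcpy(pool_.at(g).data(), b, n);    // real impl: RDMA + durability ack
    }
    void tx_read(uint64_t g, char* b, size_t n) {
        std::memcpy(b, pool_.at(g).data(), n);    // real impl: one-sided RDMA read
    }
    void tx_free(uint64_t g) { pool_.erase(g); }
};

int main() {
    RpmpClient client;
    const char payload[] = "serialized shuffle partition bytes";

    // Allocate from the pooled PMem: the returned global address would be
    // resolvable from any client thanks to the global address space.
    uint64_t gaddr = client.tx_alloc(sizeof payload);
    client.tx_write(gaddr, payload, sizeof payload);

    // A reducer on another host could issue the same tx_read with this
    // global address and pull the bytes straight out of remote PMem.
    std::vector<char> out(sizeof payload);
    client.tx_read(gaddr, out.data(), out.size());
    std::cout << out.data() << "\n";

    client.tx_free(gaddr);
    return 0;
}
```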
Spark RPMP Optimization Features
▪ Optimized RDMA communication
▪ Leverages HPNL as a high-performance, protocol-agnostic network messenger
▪ The server handles all write operations; clients implement read-only operations using one-sided RDMA reads
▪ Controller accelerator layer
▪ Partition merge: aggregates small partitions into larger blocks to accelerate reduce and cut the number of reducer connections
▪ Sort: sorts shuffle data on the fly, so no compute is needed in the reduce phase, reducing compute-node CPU utilization
▪ Provides controllable, fine-grained control of resource utilization when compute-node CPU resources are limited
▪ Storage
▪ A global address space accessible with memory-like APIs
▪ A transactional key-value store based on libpmemobj (a sketch follows this list)
▪ Allocators manage PMem; a storage proxy directs requests to the different allocators
* https://github.com/Cyan4973/xxHash
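As a taste of the libpmemobj-based storage layer, here is a minimal transactional record store. The layout name, pool path, and single-root design are illustrative assumptions; the real RPMP storage layer adds allocators and a storage proxy on top of libpmemobj.

```cpp
// Sketch: a tiny transactional record store on PMDK's libpmemobj.
#include <libpmemobj.h>
#include <cstring>
#include <cstdio>

struct root {              // pool root: one fixed value slot per key id
    PMEMoid values[128];   // persistent pointers to variable-size blobs
};

int main() {
    PMEMobjpool *pop = pmemobj_create("/mnt/pmem0/kv.pool", "rpmp_kv",
                                      PMEMOBJ_MIN_POOL, 0600);
    if (!pop) { perror("pmemobj_create"); return 1; }
    root *r = (root *)pmemobj_direct(pmemobj_root(pop, sizeof(root)));

    const char val[] = "shuffle block metadata";
    TX_BEGIN(pop) {
        // Everything inside TX_BEGIN/TX_END is atomic: after a crash the
        // slot either points at a fully written blob or is untouched.
        pmemobj_tx_add_range_direct(&r->values[7], sizeof(PMEMoid));
        r->values[7] = pmemobj_tx_alloc(sizeof val, 0 /* type num */);
        std::memcpy(pmemobj_direct(r->values[7]), val, sizeof val);
    } TX_END

    std::printf("%s\n", (char *)pmemobj_direct(r->values[7]));
    pmemobj_close(pop);
    return 0;
}
```

Transactionality is what makes "no index file" safe: the object metadata and the data it points to commit together or not at all.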
RPMP Workflow
▪ Write (single node; see the server-side sketch below)
▪ 1. The client writes data to a specific address.
▪ 2. The server issues an RDMA read (client DRAM → server DRAM).
▪ 3. Flush (DRAM → PMem).
▪ 4. Request ACK.
▪ Write (replicated, via the proxy)
▪ 1. The client writes data to a specific address.
▪ 2. RDMA read (client DRAM → primary node DRAM), then replication (primary node DRAM → secondary node DRAM).
▪ 3. Flush (DRAM → PMem) on each replica.
▪ 4. Request ACK.
▪ Read
▪ 1. The client reads data from a specific address.
▪ 2. RDMA write (server PMem → client DRAM).
▪ 3. Request ACK.
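A compact sketch of the server side of the single-node write path follows: on a write request the server pulls the payload from the client's DRAM with a one-sided RDMA read, persists it DRAM → PMem, and then acks. The request struct, buffer setup, and the ack send are assumptions made for illustration, not the RPMP wire format.

```cpp
// Sketch: server-side write handler (steps 2-4 of the write workflow).
#include <infiniband/verbs.h>
#include <libpmem.h>
#include <cstdint>

struct WriteReq {              // hypothetical request message contents
    uint64_t client_addr; uint32_t rkey; uint32_t len; uint64_t pmem_off;
};

void handle_write(ibv_qp* qp, ibv_mr* bounce_mr, char* bounce,
                  char* pmem_base, const WriteReq& req) {
    // Step 2: RDMA read, client DRAM -> server DRAM bounce buffer.
    ibv_sge sge{ reinterpret_cast<uintptr_t>(bounce), req.len, bounce_mr->lkey };
    ibv_send_wr rd{}, *bad = nullptr;
    rd.opcode = IBV_WR_RDMA_READ;
    rd.send_flags = IBV_SEND_SIGNALED;
    rd.sg_list = &sge; rd.num_sge = 1;
    rd.wr.rdma.remote_addr = req.client_addr;
    rd.wr.rdma.rkey = req.rkey;
    ibv_post_send(qp, &rd, &bad);

    ibv_wc wc{};
    while (ibv_poll_cq(qp->send_cq, 1, &wc) == 0) { }   // wait for the read

    // Step 3: flush DRAM -> PMem; persist makes it durable before the ack.
    pmem_memcpy_persist(pmem_base + req.pmem_off, bounce, req.len);

    // Step 4: request ACK back to the client (transport-specific, omitted).
    // send_ack(qp, req);  // hypothetical helper
}
```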
PMem-Based Shuffle Write Optimization
▪ Map
▪ Provision the PMem namespace in advance
▪ Leverage a circular buffer to build unidirectional channels for RDMA primitives
▪ Serialized data is written to an off-heap buffer; once it hits a threshold (4MB by default), a block is created via libpmemobj on the PMem device with a memcpy (see the sketch after this list)
▪ Append-only writes; each byte is written only once
▪ No index file: the mapping info is stored in the PMem object metadata
▪ libpmem based, kernel bypass
▪ Reduce
▪ Reduce uses memcpy to read the data
▪ Reduce reads through PMem memory directly via RDMA
[Diagram: map task KV pairs in the JVM flow through an off-heap circular buffer (RPMP client) and JNI into PMDK (libpmemobj C), landing as partitions [0..n] on the persistent memory device (devdax/fsdax mode); N reducers read the partitions]
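The sketch below captures the threshold-based append path: records accumulate in a DRAM staging buffer and are flushed to a PMem mapping in one sequential, write-once copy when the 4MB default threshold is reached. It uses plain libpmem for brevity, whereas the slide describes block allocation via libpmemobj; the path and capacities are illustrative.

```cpp
// Sketch: 4MB-threshold, append-only shuffle partition writer on libpmem.
#include <libpmem.h>
#include <vector>

class PmemAppendWriter {
    char*  base_ = nullptr;   // PMem mapping for this partition
    size_t cap_, off_ = 0;
    std::vector<char> staging_;                      // DRAM staging buffer
    static constexpr size_t kThreshold = 4 << 20;    // 4MB, per the slide
public:
    PmemAppendWriter(const char* path, size_t cap) : cap_(cap) {
        size_t mapped; int is_pmem;
        base_ = (char*)pmem_map_file(path, cap, PMEM_FILE_CREATE, 0600,
                                     &mapped, &is_pmem);   // may be null: sketch
        staging_.reserve(kThreshold);
    }
    void write(const char* rec, size_t n) {
        staging_.insert(staging_.end(), rec, rec + n);
        if (staging_.size() >= kThreshold) flush();
    }
    void flush() {                                   // one sequential append
        if (!base_ || staging_.empty() || off_ + staging_.size() > cap_) return;
        // nodrain lets the copy stream; a single drain makes it durable.
        pmem_memcpy_nodrain(base_ + off_, staging_.data(), staging_.size());
        pmem_drain();
        off_ += staging_.size();                     // append-only, written once
        staging_.clear();
    }
    ~PmemAppendWriter() { flush(); if (base_) pmem_unmap(base_, cap_); }
};

int main() {
    PmemAppendWriter w("/mnt/pmem0/part_0", 64 << 20);
    for (int i = 0; i < 100000; ++i) w.write("record,", 7);
    return 0;
}
```

Batching into 4MB blocks keeps PMem writes sequential and amortizes the persist cost, which is why the writer never updates data in place.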
RPMP Integration with Spark Shuffle
▪ Vanilla Spark: SortShuffleManager → registerShuffle() → BaseShuffleHandle; getWriter() produces the shuffle writer, blocks are resolved via the ShuffleBlockResolver on the filesystem, and remote reads go through a Netty reader to another executor
▪ Spark-PMoF: PMoFShuffleManager's getWriter() returns PmemShuffleWriter, built on PmemBlockObjectWriter / PmemOutputStream / PmemBuffer and PmemExternalSorter over PMDK persistent memory pools; PmemShuffleReader / RDMAReader (with PmemManagedBuffer / PmemInputStream) serve remote reads over RDMA – Java on the Spark side, C++ in RPMP underneath
▪ Because the shuffle manager is pluggable, it is selected through Spark's spark.shuffle.manager configuration
Performance Evaluation
▪ Microbenchmark configuration
▪ 2 nodes: one RPMem client (node 1), one RPMem server (node 2)
▪ 40Gb RDMA NIC; 4x PMem on the RPMem server
▪ Tested the remote_allocate, remote_free, remote_write, and remote_read interfaces with a single client
▪ Microbenchmark performance (remote read maxes out the NIC bandwidth)
▪ allocate: 3.7 GB/s – expect higher performance with more clients
▪ remote_write: 2.9 GB/s – expect higher performance with more clients
▪ remote_read: 4.9 GB/s – limited by the 40Gb NIC
▪ TeraSort configuration
▪ 2 nodes; baseline: shuffle on HDD; RPMem: 2x 128GB PMem on node 2 for shuffle and external sort; HDFS on HDD
▪ Workload: TeraSort 100GB
▪ TeraSort performance: 1.98x speedup (vanilla Spark 100GB: 416 s; RPMem shuffle: 210 s)
Performance results are based on testing as of 5/30/2020 and may not reflect all publicly available security updates. See the configuration disclosure for details. No product can be absolutely secure. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks. Configurations: see the Benchmark configuration slide at the end.
Summary
▪ Spark shuffle poses many challenges in large-scale production environments
▪ Remote persistent memory extends PMem usage modes to new scenarios, with RDMA being the most widely accepted technology for remote persistent memory access
▪ The remote persistent memory pool for Spark shuffle enables a fully disaggregated, high-performance, low-latency shuffle solution:
▪ Improved Spark scalability by disaggregating shuffle from the compute nodes onto a high-performance distributed storage layer
▪ Improved shuffle performance with high-speed persistent memory and a low-latency RDMA network
▪ Improved reliability via a manageable, highly available shuffle service that supports shuffle data replication and fault tolerance
Accelerate Your Data Analytics & AI Journey with Intel
▪ Optimized ML/DL libraries & tools: Intel Distribution for Python, Intel-optimized frameworks
▪ Optimized cloud platforms: Amazon Web Services, Google Cloud Platform, Microsoft Azure, Baidu Cloud & more
▪ Intel hardware foundation: CPU (multi-purpose analytics/AI), GPU (AI, HPC, media & graphics), FPGA (real-time & multi-use DL inference), data center DL training (Gaudi) and DL inference (Goya), edge DL inference (* some products in development)
▪ High-performance in-memory analytics
▪ Intel-optimized end-to-end data analytics & AI pipeline: discovery of possibilities & next steps → data setup, ingestion & cleaning → develop models using analytics/AI → deploy into production & iterate
▪ More: Intel.com/AI, intel.com/yourdataonintel, Software.intel.com
Intel OAP – Optimized Analytics Packages for Spark
https://github.com/Intel-bigdata/OAP/
▪ Native SQL Engine plugin: end-to-end columnar data processing with Intel AVX support, built on Apache Arrow (Arrow data source, Arrow data processing, columnar shuffle) under Spark SQL / Catalyst
▪ Runs on Intel CPUs and other accelerators (FPGA, GPU, ...)
▪ Shuffle packages: remote shuffle, remote persistent memory shuffle, remote persistent memory pool
Feedback
Your feedback is important to us. Don't forget to rate and review the sessions.
Legal Information: Benchmark and Performance Disclaimers
▪ Performance results are based on testing as of Feb. 2019 and may not reflect all publicly available security updates. See configuration disclosure for details. No product can be absolutely secure.
▪ Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information, see Performance Benchmark Test Disclosure.
▪ Configurations: see performance benchmark test configurations.
Notices and Disclaimers
▪ No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
▪ Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.
▪ This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.
▪ The products and services described may contain defects or errors known as errata which may cause deviations from published specifications. Current characterized errata are available on request.
▪ Intel, the Intel logo, Xeon, Optane, Optane DC Persistent Memory are trademarks of Intel Corporation in the U.S. and/or other countries.
▪ *Other names and brands may be claimed as the property of others.
▪ © Intel Corporation.
Benchmark Configuration
▪ Workload: TeraSort 600GB
▪ hibench.spark.master yarn-client
▪ hibench.yarn.executor.num 12 / yarn.executor.num 12
▪ hibench.yarn.executor.cores 8 / yarn.executor.cores 8
▪ spark.shuffle.compress false
▪ spark.shuffle.spill.compress false
▪ spark.executor.memory 60g
▪ spark.executor.memoryOverhead 10g
▪ spark.driver.memory 80g
▪ spark.eventLog.compress false
▪ spark.executor.extraJavaOptions=-XX:+UseG1GC
▪ spark.hadoop.yarn.timeline-service.enabled false
▪ spark.serializer org.apache.spark.serializer.KryoSerializer
▪ hibench.default.map.parallelism 200
▪ hibench.default.shuffle.parallelism 1000
▪ 3-node cluster hardware
▪ Intel® Xeon® Gold 6240 CPU @ 2.60GHz, 384GB memory (12x 32GB 2666 MT/s)
▪ 1x Mellanox ConnectX-4 40Gb NIC
▪ Shuffle devices: 1x HDD for shuffle; 4x 128GB persistent memory for shuffle; 4x 1TB NVMe for HDFS
▪ Software
▪ Hadoop 2.7, Spark 2.3
▪ Fedora 27 with WW26 BKC
▪ Topology: Hadoop NN + Spark driver on one node; Hadoop DN + Spark worker on three nodes, each with 1x 40Gb NIC, 4x NVMe, 1x HDD, 4x PMem