The increasing challenge of serving ever-growing data volumes driven by AI and analytics workloads makes disaggregated storage and compute more attractive, as it enables companies to scale storage and compute capacity independently to match their respective growth rates. Cloud-based big data services are gaining momentum as they provide simplified management, elasticity, and a pay-as-you-go model.
In Spark SQL, the physical plan provides the fundamental information about how a query will execute. The objective of this talk is to build understanding and familiarity with query plans in Spark SQL, and to use that knowledge to achieve better performance from Apache Spark queries. We will walk you through the most common operators you might find in a query plan and explain the information they expose about the execution. If you understand the query plan, you can look for weak spots and rewrite the query to obtain a better plan that leads to more efficient execution.
The main content of this talk is based on the Spark source code, but it also reflects real-life queries that we run while processing data. We will show examples of query plans, explain how to interpret them and what information can be taken from them, and describe what happens under the hood when a plan is generated, focusing mainly on the physical planning phase. In short, we want to share what we have learned from both the Spark source code and the real-life queries we run in our daily data processing.
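As a minimal, hedged sketch of how to inspect a plan yourself (assuming a running SparkSession; the path and column names are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("plan-inspection").getOrCreate()

# Hypothetical input: a Parquet table of events
df = spark.read.parquet("/data/events")
agg = df.groupBy("user_id").count()

# Spark 3.0+: a readable operator tree plus per-operator details
agg.explain(mode="formatted")

# Older versions: print the parsed, analyzed and optimized logical plans
# plus the physical plan in one go
agg.explain(True)
```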
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc... (Databricks)
Spark SQL is a highly scalable and efficient relational processing engine with easy-to-use APIs and mid-query fault tolerance. It is a core module of Apache Spark. Spark SQL can process, integrate and analyze data from diverse data sources (e.g., Hive, Cassandra, Kafka and Oracle) and file formats (e.g., Parquet, ORC, CSV, and JSON). This talk will dive into the technical details of Spark SQL, spanning the entire lifecycle of a query execution. The audience will gain a deeper understanding of Spark SQL and learn how to tune its performance.
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia... (Databricks)
Nowadays, people are creating, sharing and storing data at a faster pace than ever before, so effective data compression and decompression can significantly reduce the cost of data usage. Apache Spark is a general distributed computing engine for big data analytics, and at runtime it stores and shuffles large amounts of data across the cluster, so the compression/decompression codecs can affect end-to-end application performance in many ways.
However, there is a trade-off between storage size and compression/decompression throughput (CPU computation). Balancing compression speed against ratio is an interesting topic, particularly while both software algorithms and CPU instruction sets keep evolving. Apache Spark provides a flexible compression codec interface with default implementations such as GZip, Snappy, LZ4 and ZSTD, and the Intel Big Data Technologies team has implemented additional codecs based on the latest Intel platforms, such as ISA-L (igzip), LZ4-IPP, Zlib-IPP and ZSTD. In this session, we compare the characteristics of those algorithms and implementations by running micro workloads as well as end-to-end workloads on different generations of Intel x86 platforms and disks.
This should help big data software engineers choose the proper compression/decompression codecs for their applications, and we also present methodologies for measuring and tuning the performance bottlenecks of typical Apache Spark workloads.
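In stock Spark, the codec used for shuffle, broadcast and spill data is a single configuration knob (the Intel codecs named above ship as separate plugins, not with upstream Spark); a minimal sketch:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("codec-tuning")
    # Codec for shuffle outputs, broadcast variables and spills;
    # upstream Spark ships lz4 (default), lzf, snappy and zstd
    .config("spark.io.compression.codec", "zstd")
    # Trade CPU for ratio: higher zstd levels compress harder
    .config("spark.io.compression.zstd.level", "3")
    # Columnar file compression is a separate knob
    .config("spark.sql.parquet.compression.codec", "snappy")
    .getOrCreate()
)
```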
This document discusses Spark shuffle, which is an expensive operation that involves data partitioning, serialization/deserialization, compression, and disk I/O. It provides an overview of how shuffle works in Spark and the history of optimizations like sort-based shuffle and an external shuffle service. Key concepts discussed include shuffle writers, readers, and the pluggable block transfer service that handles data transfer. The document also covers shuffle-related configuration options and potential future work.
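A hedged sketch of the kind of shuffle-related options such a document covers (these are standard Spark configuration keys; the values are illustrative, not recommendations):

```python
from pyspark import SparkConf

conf = (
    SparkConf()
    # External shuffle service: serve map outputs from a separate
    # process so executors can be reclaimed without losing shuffle files
    .set("spark.shuffle.service.enabled", "true")
    # Compress map outputs before they hit disk and the network
    .set("spark.shuffle.compress", "true")
    # Write buffer per shuffle file writer: fewer, larger disk writes
    .set("spark.shuffle.file.buffer", "1m")
    # Cap on in-flight remote fetch data per reduce task
    .set("spark.reducer.maxSizeInFlight", "96m")
)
```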
Parquet performance tuning: the missing guide (Ryan Blue)
Parquet performance tuning focuses on optimizing Parquet reads by leveraging columnar organization, encoding, and filtering techniques. Statistics and dictionary filtering can eliminate unnecessary data reads by filtering at the row group and page levels. However, these optimizations require columns to be sorted and fully dictionary encoded within files. Increasing dictionary size thresholds and decreasing row group sizes can help avoid dictionary encoding fallback and improve filtering effectiveness. Future work may include new encodings, compression algorithms like Brotli, and page-level filtering in the Parquet format.
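A sketch of how these knobs are commonly applied from Spark. Spark copies write options into the Hadoop configuration seen by the Parquet writer, but the exact behavior is version-dependent, and the DataFrame and path here are hypothetical:

```python
# Sort first so min/max statistics become selective and values cluster
# into fewer dictionary pages
(
    df.sort("event_type", "event_time")
      .write
      # Smaller row groups give finer-grained statistics filtering
      .option("parquet.block.size", str(64 * 1024 * 1024))
      # Larger dictionary pages delay fallback to plain encoding
      .option("parquet.dictionary.page.size", str(8 * 1024 * 1024))
      .option("parquet.enable.dictionary", "true")
      .parquet("/data/events_tuned")
)
```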
From DataFrames to Tungsten: A Peek into Spark's Future (Reynold Xin, Databri...) (Spark Summit)
The document discusses Spark's DataFrame API and the Tungsten project. DataFrames make Spark accessible to different users by providing a common API across languages like Python, R and Scala. Tungsten aims to improve Spark's performance for the next five years through techniques like runtime code generation and off-heap memory management. Initial results show Tungsten doubling performance. Together, DataFrames and Tungsten will help Spark scale to larger data and queries across different languages and execution backends.
Magnet Shuffle Service: Push-based Shuffle at LinkedIn (Databricks)
The number of daily Apache Spark applications at LinkedIn has increased by 3X in the past year. The shuffle process alone, which is one of the most costly operators in batch computation, is processing PBs of data and billions of blocks daily in our clusters. With such a rapid increase of Apache Spark workloads, we quickly realized that the shuffle process can become a severe bottleneck for both infrastructure scalability and workloads efficiency. In our production clusters, we have observed both reliability issues due to shuffle fetch connection failures and efficiency issues due to the random reads of small shuffle blocks on HDDs.
To tackle those challenges and optimize shuffle performance in Apache Spark, we have developed Magnet shuffle service, a push-based shuffle mechanism that works natively with Apache Spark. Our paper on Magnet has been accepted by VLDB 2020. In this talk, we will introduce how push-based shuffle can drastically increase shuffle efficiency when compared with the existing pull-based shuffle. In addition, by combining push-based shuffle and pull-based shuffle, we show how Magnet shuffle service helps to harden shuffle infrastructure at LinkedIn scale by both reducing shuffle related failures and removing scaling bottlenecks. Furthermore, we will share our experiences of productionizing Magnet at LinkedIn to process close to 10 PB of daily shuffle data.
Problems with PostgreSQL on Multi-core Systems with Multi-Terabyte Data (Jignesh Shah)
This document discusses PostgreSQL performance on multi-core systems with multi-terabyte data. It covers current market trends towards more cores and larger data sizes. Benchmark results show that PostgreSQL scales well on inserts up to a certain number of clients/cores but struggles with OLTP and TPC-E workloads due to lock contention. Issues are identified with sequential scans, index scans, and maintenance tasks like VACUUM as data sizes increase. The document proposes making PostgreSQL utilities and tools able to leverage multiple cores/processes to improve performance on modern hardware.
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag... (Databricks)
Structured Streaming provides stateful stream processing capabilities in Spark SQL through built-in operations like aggregations and joins as well as user-defined stateful transformations. It handles state automatically through watermarking to limit state size by dropping old data. For arbitrary stateful logic, MapGroupsWithState requires explicit state management by the user.
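A minimal sketch of the watermarking behavior described above (broker address, topic and column names are placeholders):

```python
from pyspark.sql.functions import col, window

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
    .option("subscribe", "events")                     # placeholder topic
    .load()
    .selectExpr("CAST(value AS STRING) AS key", "timestamp AS ts")
)

# The watermark bounds state size: windows more than 10 minutes behind
# the maximum observed event time are finalized and their state dropped
counts = (
    events.withWatermark("ts", "10 minutes")
          .groupBy(window(col("ts"), "5 minutes"), col("key"))
          .count()
)

query = counts.writeStream.outputMode("append").format("console").start()
```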
Netflix’s architecture involves thousands of microservices built to serve unique business needs. As this architecture grew, it became clear that the data storage and query needs were unique to each area; there is no one silver bullet which fits the data needs for all microservices. CDE (Cloud Database Engineering team) offers polyglot persistence, which promises to offer ideal matches between problem spaces and persistence solutions. In this meetup you will get a deep dive into the Self service platform, our solution to repairing Cassandra data reliably across different datacenters, Memcached Flash and cross region replication and Graph database evolution at Netflix.
How to Actually Tune Your Spark Jobs So They Work (Ilya Ganelin)
This document summarizes a USF Spark workshop that covers Spark internals and how to optimize Spark jobs. It discusses how Spark works with partitions, caching, serialization and shuffling data. It provides lessons on using less memory by partitioning wisely, avoiding shuffles, using the driver carefully, and caching strategically to speed up jobs. The workshop emphasizes understanding Spark and tuning configurations to improve performance and stability.
Cosco: An Efficient Facebook-Scale Shuffle Service (Databricks)
Cosco is an efficient shuffle-as-a-service that powers Spark (and Hive) jobs at Facebook warehouse scale. It is implemented as a scalable, reliable and maintainable distributed system. Cosco is based on the idea of partial in-memory aggregation across a shared pool of distributed memory. This provides vastly improved efficiency in disk usage compared to Spark's built-in shuffle. Long term, we believe the Cosco architecture will be key to efficiently supporting jobs at ever larger scale. In this talk we'll take a deep dive into the Cosco architecture and describe how it's deployed at Facebook. We will then describe how it's integrated to run shuffle for Spark, and contrast it with Spark's built-in sort-based shuffle mechanism and SOS (presented at Spark+AI Summit 2018).
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F... (Databricks)
In the big data field, Spark SQL is an important data processing module of Apache Spark that works with structured, row-based data in the majority of its operators. A field-programmable gate array (FPGA) with highly customized intellectual property (IP) can bring not only better performance but also lower power consumption when accelerating the CPU-intensive segments of an application.
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang (Databricks)
As a general computing engine, Spark can process data from various data management/storage systems, including HDFS, Hive, Cassandra and Kafka. For flexibility and high throughput, Spark defines the Data Source API, which is an abstraction of the storage layer. The Data Source API has two requirements.
1) Generality: support reading/writing most data management/storage systems.
2) Flexibility: customize and optimize the read and write paths for different systems based on their capabilities.
Data Source API V2 is one of the most important features coming with Spark 2.3. This talk will dive into the design and implementation of Data Source API V2, comparing it with Data Source API V1. We will also demonstrate how to implement a file-based data source using Data Source API V2 to show its generality and flexibility.
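The V2 interfaces themselves (TableProvider, ScanBuilder, and so on) are implemented in Scala or Java, but from the user's side any V2 source is addressed uniformly; a sketch with a hypothetical source name:

```python
# "com.example.logs" is a hypothetical Data Source V2 implementation on
# the classpath; options are passed through to the source itself
df = (
    spark.read.format("com.example.logs")
    .option("path", "/var/log/app")   # interpreted by the source
    .option("rotation", "hourly")     # hypothetical source option
    .load()
)

# A V2 source can expose a write path with its own capabilities
df.write.format("com.example.logs").mode("append").save()
```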
Building a SIMD Supported Vectorized Native Engine for Spark SQL (Databricks)
Spark SQL works very well with structured, row-based data, and its vectorized readers and writers for Parquet/ORC make I/O much faster. It also uses whole-stage code generation to improve performance through Java JIT code. However, the Java JIT usually does not utilize the latest SIMD instructions well under complicated queries. Apache Arrow provides a columnar in-memory layout and SIMD-optimized kernels, as well as Gandiva, an LLVM-based SQL engine. These native libraries can accelerate Spark SQL by reducing CPU usage for both I/O and execution.
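From the Python side, Spark's own route to columnar, SIMD-friendly execution is Arrow; a small sketch (the config key shown is for Spark 3.x):

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf

# Ship rows between the JVM and Python workers as Arrow columnar batches
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

@pandas_udf("double")
def to_celsius(f: pd.Series) -> pd.Series:
    # Vectorized: runs on whole batches, so pandas/NumPy can apply
    # SIMD-optimized kernels instead of per-row Python calls
    return (f - 32.0) * 5.0 / 9.0

df = spark.range(10).selectExpr("CAST(id AS DOUBLE) AS temp_f")
df.select(to_celsius("temp_f").alias("temp_c")).show()
```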
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015 (Chris Fregly)
The document summarizes a presentation given by Chris Fregly on Project Tungsten and optimizations in Apache Spark. It discusses techniques like using off-heap memory, minimizing cache misses, and saturating I/O to sort 100 terabytes of data in Spark. The presentation also covered a recap of the "100TB GraySort challenge" where custom data structures and algorithms were used to optimize sorting and shuffling of data.
"The common use cases of Spark SQL include ad hoc analysis, logical warehouse, query federation, and ETL processing. Spark SQL also powers the other Spark libraries, including structured streaming for stream processing, MLlib for machine learning, and GraphFrame for graph-parallel computation. For boosting the speed of your Spark applications, you can perform the optimization efforts on the queries prior employing to the production systems. Spark query plans and Spark UIs provide you insight on the performance of your queries. This talk discloses how to read and tune the query plans for enhanced performance. It will also cover the major related features in the recent and upcoming releases of Apache Spark.
"
Apache Spark presentation at HasGeek Fifth Elephant
https://fifthelephant.talkfunnel.com/2015/15-processing-large-data-with-apache-spark
Covering Big Data Overview, Spark Overview, Spark Internals and its supported libraries
"Spark Search" - In-memory, Distributed Search with Lucene, Spark, and Tachyo...Lucidworks
Spark Search is a personal project that integrates Lucene with Apache Spark for interactive search, analytics, and machine learning on big data. Experiments showed that indexing large datasets with Lucene directly was faster than using Solr or Elasticsearch on a single node with minimum parallelism, due to their additional overhead. Spark provides an in-memory distributed computing framework that can help address the challenges of indexing and searching big data with Lucene at scale more easily than traditional distributed search technologies. The presentation called for participation to help build out the Spark Search community and project.
Accelerate and Scale Big Data Analytics with Disaggregated Compute and Storage (Alluxio, Inc.)
Alluxio Tech Talk
Jul 17, 2019
Speakers:
Brien Porter, Intel
Alex Ma, Alluxio
The ever-increasing challenge of processing and extracting value from exploding data with AI and analytics workloads makes a memory-centric architecture with disaggregated storage and compute more attractive. This decoupled architecture enables users to innovate faster and scale on demand. Enterprises are also increasingly looking towards object stores to power their big data and machine learning workloads in a cost-effective way. However, object stores provide neither big-data-compatible APIs nor the required performance.
In this webinar, the Intel and Alluxio teams will present a proposed reference architecture using Alluxio as the in-memory accelerator for object stores to enable modern analytical workloads such as Spark, Presto, Tensorflow, and Hive. We will also present a technical overview of Alluxio.
How AI and ML are driving Memory Architecture changes (Danny Sabour)
Artificial intelligence and machine learning are fundamentally changing compute workloads in the cloud, at the edge, and in the IoT node. Memory architectures have changed to treat persistence as a primary need, ahead of speed and low power. MRAM, with its inherent persistence, low power and speed, is destined to become the next-generation memory of choice all the way from the IoT node to the edge and the cloud.
Flash Memory Summit enterprise update 2019 (Howard Marks)
The document provides an annual update on enterprise flash storage trends. It discusses how flash has become mainstream for primary storage due to declining costs. All-flash arrays now have a larger market share than hybrid arrays. Emerging technologies discussed include NVMe over Fabrics, which extends NVMe protocols over Ethernet and Fibre Channel, and Storage Class Memory using 3D XPoint, which provides faster storage than NAND flash. The document highlights several vendors that are adopting these technologies.
Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat... (Databricks)
The performance of modern big data frameworks such as Spark depends greatly on high-speed storage and shuffling, which impose a significant memory burden on production data centers. In many production situations, persistence- and shuffle-intensive applications suffer a major performance loss due to lack of memory, so the common practice is to over-allocate the memory assigned to data workers, which in turn reduces overall resource utilization. One efficient way to address this dilemma between performance and cost efficiency is data center computing resource disaggregation. This paper proposes and implements a system that combines the Apache Spark big data framework with a novel in-memory distributed file system to achieve memory disaggregation for data persistence and shuffling. We address the challenge of optimizing performance at low cost by co-designing the proposed in-memory distributed file system with large-volume DIMM-based persistent memory (PMEM) and RDMA technology. The disaggregation design allows each part of the system to be scaled independently, which is particularly suitable for cloud deployments. The proposed system is evaluated in a production-level cluster using real enterprise-level Spark production applications. The results of an empirical evaluation show that the system can achieve up to a 3.5-fold performance improvement for shuffle-intensive applications with the same amount of memory, compared to the default Spark setup. Moreover, by leveraging PMEM, we demonstrate that our system can effectively increase the memory capacity of the computing cluster by 66.5% at affordable cost, with a reasonable execution-time overhead with respect to using local DRAM only.
Healthcare Claim Reimbursement using Apache Spark (Databricks)
The document discusses rewriting a claims reimbursement system using Spark. It describes how Spark provides better performance, scalability and cost savings compared to the previous Oracle-based system. Key points include using Spark for ETL to load data into a Delta Lake data lake, implementing the business logic in a reusable Java library, and seeing significant increases in processing volumes and speeds compared to the prior system. Challenges and tips for adoption are also provided.
Use cases like high-performance computing (HPC), AI, and IoTA can generate a huge volume of data. Learn how Intel® Optane™ DC persistent memory can be an alternative to DRAM for applications that benefit from a very large volatile memory capacity.
Presented at Spark+AI Summit Europe 2019
https://databricks.com/session_eu19/apache-spark-at-scale-in-the-cloud
Using Apache Spark to analyze large datasets in the cloud presents a range of challenges. Different stages of your pipeline may be constrained by CPU, memory, disk and/or network IO. But what if all those stages have to run on the same cluster? In the cloud, you have limited control over the hardware your cluster runs on.
You may have even less control over the size and format of your raw input files. Performance tuning is an iterative and experimental process. It’s frustrating with very large datasets: what worked great with 30 billion rows may not work at all with 400 billion rows. But with strategic optimizations and compromises, 50+ TiB datasets can be no big deal.
Using the Spark UI and simple metrics, we explore how to diagnose and remedy issues on jobs (a configuration sketch follows this list):
Sizing the cluster based on your dataset (shuffle partitions)
Ingestion challenges – well begun is half done (globbing S3, small files)
Managing memory (sorting GC – when to go parallel, when to go G1, when offheap can help you)
Shuffle (give a little to get a lot – configs for better out of box shuffle) – Spill (partitioning for the win)
Scheduling (FAIR vs FIFO, is there a difference for your pipeline?)
Caching and persistence (it’s the cost of doing business, so what are your options?)
Fault tolerance (blacklisting, speculation, task reaping)
Making the best of a bad deal (skew joins, windowing, UDFs, very large query plans)
Writing to S3 (dealing with write partitions, HDFS and s3DistCp vs writing directly to S3)
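A hedged configuration sketch touching several of the items above (the keys are standard Spark configs; the values are illustrative and workload-dependent):

```python
from pyspark import SparkConf

conf = (
    SparkConf()
    # Size shuffle partitions to the dataset, not the default of 200
    .set("spark.sql.shuffle.partitions", "4000")
    # FAIR scheduling lets concurrent jobs in one app share executors
    .set("spark.scheduler.mode", "FAIR")
    # Re-run suspiciously slow tasks elsewhere (speculation/task reaping)
    .set("spark.speculation", "true")
    # G1 often behaves better than the parallel collector on large heaps
    .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
    # Move large sort/aggregation buffers off the JVM heap
    .set("spark.memory.offHeap.enabled", "true")
    .set("spark.memory.offHeap.size", "8g")
)
```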
Realizing Exabyte-scale PM Centric Architectures and Memory Fabrics (inside-BigData.com)
This presentation discusses the need for exabyte-scale persistent memory architectures and memory fabrics to support growing data and workload demands. It notes that current distributed systems rely on general-purpose CPUs and RDMA, but purpose-built architectures are needed. Achieving sub-microsecond latency across large memory pools spanning thousands of nodes requires innovations in memory technology, networking fabrics and protocols. P4 programmable switches could enable prototyping of new memory fabrics like Gen-Z that meet performance requirements.
Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval... (Spark Summit)
The opportunity in accelerating Spark by improving its network data transfer facilities has been under much debate in the last few years. RDMA (remote direct memory access) is a network acceleration technology that is very prominent in the HPC (high-performance computing) world, but has not yet made its way to mainstream Apache Spark. Proper implementation of RDMA in network-oriented applications can improve scalability, throughput, latency and CPU utilization. In this talk we are going to present a new RDMA solution for Apache Spark that shows amazing improvements in multiple Spark use cases. The solution is under development in our labs, and is going to be released to the public as an open-source plug-in.
Building a High Performance Analytics Platform (Santanu Dey)
The document discusses using flash memory to build a high performance data platform. It notes that flash memory is faster than disk storage and cheaper than RAM. The platform utilizes NVMe flash drives connected via PCIe for high speed performance. This allows it to provide in-memory database speeds at the cost and density of solid state drives. It can scale independently by adding compute nodes or storage nodes. The platform offers a unified database for both real-time and analytical workloads through common APIs.
Elastify Cloud-Native Spark Application with Persistent Memory (Databricks)
Cloud-native deployment has become one of the major trends for large-scale big data analytics. Compared to an on-premise data center, the cloud offers much stronger scalability and higher elasticity to big data applications. However, the cloud is also considered less performant than on-premise alternatives due to virtualization and cluster resource disaggregation. We present a new cloud-native Spark application architecture backed by persistent memory technology. The key ingredient of this architecture is a novel acceleration engine that uses Intel's 3D XPoint technology as external memory. We discuss how the performance of multiple aspects of data processing can be improved using this new architecture. As a key takeaway, the audience will gain an understanding of the benefits of the latest persistent memory technology and how it can be leveraged in cloud data processing architectures.
The document discusses in-memory computing and emerging technologies. It describes how in-memory applications are driving new storage class memory like 3D XPoint that has lower latency than NAND but higher capacity than DRAM. The document also discusses how in-memory solutions are using tiering of memory and storage like DRAM, 3D XPoint, NVM, and NAND to handle larger datasets. Emerging high speed fabrics and disaggregated storage are enabling more efficient scaling of memory and storage tiers independent of compute.
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ... (Databricks)
This document summarizes a presentation about using the Crail distributed storage system to improve Spark performance on high-performance computing clusters with RDMA networking and NVMe flash storage. The key points are:
1) Traditional Spark storage and networking APIs do not bypass the operating system kernel, limiting performance on modern hardware.
2) The Crail system provides user-level APIs for RDMA networking and NVMe flash to improve Spark shuffle, join, and sorting workloads by 2-10x on a 128-node cluster.
3) Crail allows Spark workloads to fully utilize high-speed networks and disaggregate memory and flash storage across nodes without performance penalties.
EVCache: Lowering Costs for a Low Latency Cache with RocksDB (Scott Mansfield)
EVCache is a distributed, sharded, replicated key-value store optimized for Netflix's use cases on AWS. It is based on Memcached but uses RocksDB for persistent storage, lowering costs compared to storing all data in memory. Moneta is the next generation EVCache server, using Rend and Mnemonic libraries to intelligently manage data placement in RAM and SSD. This provides high performance for both volatile and batch workloads while reducing costs by 70% compared to the original Memcached-based design.
Analytics, Big Data and Nonvolatile Memory Architectures – Why you Should Car... (StampedeCon)
This session will begin with an overview of current non-volatile memory (NVM, aka persistent memory) architectures and its relationship between several levels of memory and storage hierarchy, both near- and far-processor. A discussion on its significant impact on computing analytic workloads now and in the near future will ensue, including use cases and the concept of very large persistent memory surfaces as applied to both analytic computation and storage for big data workflows. The presentation will end with ‘why you should care’ about such technologies which inevitably will completely change the way we think about solving data-intensive problems.
Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro) (Ontico)
HighLoad++ 2017
Moscow Hall, November 7, 13:00
Abstract:
http://www.highload.ru/2017/abstracts/2909.html
OpenDataPlane (ODP, https://www.opendataplane.org) is an open-source API for network data plane applications that provides an abstraction layer between the network chip and the application. Vendors such as TI, Freescale and Cavium now ship SDKs with ODP support for their SoC chips. By analogy with the graphics stack, ODP can be compared to the OpenGL API, but in the domain of network programming.
...
DAOS - Scale-Out Software-Defined Storage for HPC/Big Data/AI Convergence (inside-BigData.com)
In this deck, Johann Lombardi from Intel presents: DAOS - Scale-Out Software-Defined Storage for HPC/Big Data/AI Convergence.
"Intel has been building an entirely open source software ecosystem for data-centric computing, fully optimized for Intel® architecture and non-volatile memory (NVM) technologies, including Intel Optane DC persistent memory and Intel Optane DC SSDs. Distributed Asynchronous Object Storage (DAOS) is the foundation of the Intel exascale storage stack. DAOS is an open source software-defined scale-out object store that provides high bandwidth, low latency, and high I/O operations per second (IOPS) storage containers to HPC applications. It enables next-generation data-centric workflows that combine simulation, data analytics, and AI."
Unlike traditional storage stacks that were primarily designed for rotating media, DAOS is architected from the ground up to make use of new NVM technologies, and it is extremely lightweight because it operates end-to-end in user space with full operating system bypass. DAOS offers a shift away from an I/O model designed for block-based, high-latency storage to one that inherently supports fine- grained data access and unlocks the performance of next- generation storage technologies.
Watch the video: https://youtu.be/wnGBW31yhLM
Learn more: https://www.intel.com/content/www/us/en/high-performance-computing/daos-high-performance-storage-brief.html
RedisConf18 - Re-architecting Redis-on-Flash with Intel 3D XPoint™ Memory (Redis Labs)
The document discusses re-architecting Redis-on-Flash with Intel 3D XPoint memory. It introduces 3D XPoint as a new type of memory that is persistent, offers high capacity (6 TB per system), and is cheaper than DRAM. Redis Labs and Intel are collaborating to build the next version of Redis-on-Flash on 3D XPoint memory to increase scalability through larger memory modules and to reduce costs compared to DRAM. The challenges include higher latency compared to DRAM and evolving standards.
Ceph Community Talk on High-Performance Solid State Ceph (Ceph Community)
The document summarizes a presentation given by representatives from various companies on optimizing Ceph for high-performance solid state drives. It discusses testing a real workload on a Ceph cluster with 50 SSD nodes that achieved over 280,000 read and write IOPS. Areas for further optimization were identified, such as reducing latency spikes and improving single-threaded performance. Various companies then described their contributions to Ceph performance, such as Intel providing hardware for testing and Samsung discussing SSD interface improvements.
Similar to Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote Persistent Memory Pools
The document discusses migrating a data warehouse to the Databricks Lakehouse Platform. It outlines why legacy data warehouses are struggling, how the Databricks Platform addresses these issues, and key considerations for modern analytics and data warehousing. The document then provides an overview of the migration methodology, approach, strategies, and key takeaways for moving to a lakehouse on Databricks.
Data Lakehouse Symposium | Day 1 | Part 1 (Databricks)
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
Data Lakehouse Symposium | Day 1 | Part 2 (Databricks)
The document discusses the challenges of modern data, analytics, and AI workloads. Most enterprises struggle with siloed data systems that make integration and productivity difficult. The future of data lies with a data lakehouse platform that can unify data engineering, analytics, data warehousing, and machine learning workloads on a single open platform. The Databricks Lakehouse platform aims to address these challenges with its open data lake approach and capabilities for data engineering, SQL analytics, governance, and machine learning.
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop (Databricks)
In this session, learn how to quickly supplement your on-premises Hadoop environment with a simple, open, and collaborative cloud architecture that enables you to generate greater value with scaled application of analytics and AI on all your data. You will also learn five critical steps for a successful migration to the Databricks Lakehouse Platform along with the resources available to help you begin to re-skill your data teams.
Democratizing Data Quality Through a Centralized Platform (Databricks)
Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale.
At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform's capabilities (a toy validation sketch follows the list below), including:
Giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal
Performing data quality validations using libraries built to work with spark
Dynamically generating pipelines that can be abstracted away from users
Flagging data that doesn’t meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers
Exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time
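Zillow's platform itself is proprietary, but a toy version of one Spark-backed validation (all names hypothetical) could look like:

```python
from pyspark.sql import DataFrame
from pyspark.sql.functions import col

def check_null_rate(df: DataFrame, column: str, max_null_rate: float) -> None:
    """Fail fast when a column's null rate breaks the producer's contract."""
    total = df.count()
    nulls = df.filter(col(column).isNull()).count()
    null_rate = nulls / total if total else 0.0
    if null_rate > max_null_rate:
        raise ValueError(
            f"{column}: null rate {null_rate:.2%} exceeds {max_null_rate:.2%}"
        )

# A producer-defined expectation, evaluated before downstream use
check_null_rate(spark.read.parquet("/data/listings"), "zip_code", 0.01)
```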
Learn to Use Databricks for Data Science (Databricks)
Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever: one that provides easier access and visibility into the data, reports and dashboards built against the data, reproducibility, and insights uncovered within the data. Join us to hear how Databricks' open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale, all on one unified platform.
Why APM Is Not the Same As ML Monitoring (Databricks)
Application performance monitoring (APM) has become the cornerstone of software engineering allowing engineering teams to quickly identify and remedy production issues. However, as the world moves to intelligent software applications that are built using machine learning, traditional APM quickly becomes insufficient to identify and remedy production issues encountered in these modern software applications.
As a lead software engineer at NewRelic, my team built high-performance monitoring systems including Insights, Mobile, and SixthSense. As I transitioned to building ML Monitoring software, I found the architectural principles and design choices underlying APM to not be a good fit for this brand new world. In fact, blindly following APM designs led us down paths that would have been better left unexplored.
In this talk, I draw upon my (and my team’s) experience building an ML Monitoring system from the ground up and deploying it on customer workloads running large-scale ML training with Spark as well as real-time inference systems. I will highlight how the key principles and architectural choices of APM don’t apply to ML monitoring. You’ll learn why, understand what ML Monitoring can successfully borrow from APM, and hear what is required to build a scalable, robust ML Monitoring architecture.
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix (Databricks)
Autonomy and ownership are core to working at Stitch Fix, particularly on the Algorithms team. We enable data scientists to deploy and operate their models independently, with minimal need for handoffs or gatekeeping. By writing a simple function and calling out to an intuitive API, data scientists can harness a suite of platform-provided tooling meant to make ML operations easy. In this talk, we will dive into the abstractions the Data Platform team has built to enable this. We will go over the interface data scientists use to specify a model and what that hooks into, including online deployment, batch execution on Spark, and metrics tracking and visualization.
Stage Level Scheduling Improving Big Data and AI Integration (Databricks)
In this talk, I will dive into the stage level scheduling feature added to Apache Spark 3.1. Stage level scheduling extends upon Project Hydrogen by improving big data ETL and AI integration and also enables multiple other use cases. It is beneficial any time the user wants to change container resources between stages in a single Apache Spark application, whether those resources are CPU, Memory or GPUs. One of the most popular use cases is enabling end-to-end scalable Deep Learning and AI to efficiently use GPU resources. In this type of use case, users read from a distributed file system, do data manipulation and filtering to get the data into a format that the Deep Learning algorithm needs for training or inference and then sends the data into a Deep Learning algorithm. Using stage level scheduling combined with accelerator aware scheduling enables users to seamlessly go from ETL to Deep Learning running on the GPU by adjusting the container requirements for different stages in Spark within the same application. This makes writing these applications easier and can help with hardware utilization and costs.
There are other ETL use cases where users want to change CPU and memory resources between stages, for instance there is data skew or perhaps the data size is much larger in certain stages of the application. In this talk, I will go over the feature details, cluster requirements, the API and use cases. I will demo how the stage level scheduling API can be used by Horovod to seamlessly go from data preparation to training using the Tensorflow Keras API using GPUs.
The talk will also touch on other new Apache Spark 3.1 functionality, such as pluggable caching, which can be used to enable faster dataframe access when operating from GPUs.
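A minimal sketch of the Spark 3.1+ stage-level scheduling API (it applies at the RDD level and assumes a cluster with dynamic allocation and GPU-aware scheduling configured; the ETL output and training function are hypothetical):

```python
from pyspark.resource import (ExecutorResourceRequests, ResourceProfileBuilder,
                              TaskResourceRequests)

# ETL stages run on the default executors; for the training stage,
# request executors that each carry a GPU
ereqs = ExecutorResourceRequests().cores(4).memory("16g").resource("gpu", 1)
treqs = TaskResourceRequests().cpus(1).resource("gpu", 1)
profile = ResourceProfileBuilder().require(ereqs).require(treqs).build

# etl_df is the hypothetical output of the data preparation stages
gpu_stage = etl_df.rdd.withResources(profile)
trained = gpu_stage.mapPartitions(train_partition)  # hypothetical trainer
```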
Simplify Data Conversion from Spark to TensorFlow and PyTorch (Databricks)
In this talk, I would like to introduce an open-source tool built by our team that simplifies the data conversion from Apache Spark to deep learning frameworks.
Imagine you have a large dataset, say 20 GBs, and you want to use it to train a TensorFlow model. Before feeding the data to the model, you need to clean and preprocess your data using Spark. Now you have your dataset in a Spark DataFrame. When it comes to the training part, you may have the problem: How can I convert my Spark DataFrame to some format recognized by my TensorFlow model?
The existing data conversion process can be tedious. For example, to convert an Apache Spark DataFrame to a TensorFlow Dataset file format, you need to either save the Apache Spark DataFrame on a distributed filesystem in parquet format and load the converted data with third-party tools such as Petastorm, or save it directly in TFRecord files with spark-tensorflow-connector and load it back using TFRecordDataset. Both approaches take more than 20 lines of code to manage the intermediate data files, rely on different parsing syntax, and require extra attention for handling vector columns in the Spark DataFrames. In short, all these engineering frictions greatly reduced the data scientists’ productivity.
The Databricks Machine Learning team contributed a new Spark Dataset Converter API to Petastorm to simplify these tedious data conversion process steps. With the new API, it takes a few lines of code to convert a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader with default parameters.
In the talk, I will use an example to show how to use the Spark Dataset Converter to train a Tensorflow model and how simple it is to go from single-node training to distributed training on Databricks.
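A sketch of the converter API the talk describes, as documented in Petastorm (the cache directory, DataFrame and model are placeholders):

```python
from petastorm.spark import SparkDatasetConverter, make_spark_converter

# Where the converter materializes the intermediate Parquet cache
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
               "file:///tmp/petastorm_cache")

converter = make_spark_converter(preprocessed_df)  # hypothetical DataFrame

# TensorFlow: yields a tf.data.Dataset
with converter.make_tf_dataset() as dataset:
    model.fit(dataset, steps_per_epoch=100, epochs=3)  # hypothetical model

# PyTorch: yields a DataLoader-like object
with converter.make_torch_dataloader() as dataloader:
    for batch in dataloader:
        pass  # hypothetical training step
```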
Scaling your Data Pipelines with Apache Spark on Kubernetes (Databricks)
There is no doubt Kubernetes has emerged as the next generation of cloud native infrastructure to support a wide variety of distributed workloads. Apache Spark has evolved to run both Machine Learning and large scale analytics workloads. There is growing interest in running Apache Spark natively on Kubernetes. By combining the flexibility of Kubernetes and scalable data processing with Apache Spark, you can run any data and machine pipelines on this infrastructure while effectively utilizing resources at disposal.
In this talk, Rajesh Thallam and Sougata Biswas will share how to effectively run your Apache Spark applications on Google Kubernetes Engine (GKE) and Google Cloud Dataproc, and how to orchestrate data and machine learning pipelines with managed Apache Airflow on GKE (Google Cloud Composer). The following topics will be covered:
Understanding key traits of Apache Spark on Kubernetes
Things to know when running Apache Spark on Kubernetes, such as autoscaling
Demonstrating analytics pipelines running on Apache Spark, orchestrated with Apache Airflow on a Kubernetes cluster
Scaling and Unifying SciKit Learn and Apache Spark Pipelines (Databricks)
Pipelines have become ubiquitous, as the need for stringing multiple functions to compose applications has gained adoption and popularity. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python Scikit-Learn and Apache Spark.
Scaling pipelines at the level of simple functions is desirable for many AI applications; however, it is not directly supported by Ray's parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray's compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations.
Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.
Sawtooth Windows for Feature Aggregations (Databricks)
In this talk about Zipline, we will introduce a new type of windowing construct called a sawtooth window. We will describe various properties of sawtooth windows that we utilize to achieve online-offline consistency, while still maintaining high throughput, low read latency and tunable write latency for serving machine learning features. We will also talk about a simple deployment strategy for correcting feature drift due to operations that are not abelian groups and that operate over change data.
We want to present multiple anti-patterns that use Redis in unconventional ways to get the maximum out of Apache Spark. All examples presented are tried and tested in production at scale at Adobe. The most common integration is spark-redis, which interfaces with Redis as a DataFrame backing store or as an upstream for Structured Streaming. We deviate from the common use cases to explore where Redis can plug gaps while scaling out high-throughput applications in Spark.
Niche 1: Long-Running Spark Batch Job – Dispatch New Jobs by Polling a Redis Queue
· Why?
o Custom queries on top of a table; we load the data once and query N times
· Why not Structured Streaming?
· Working solution using Redis
Niche 2: Distributed Counters (see the sketch after this list)
· Problems with Spark accumulators
· Utilize Redis hashes as distributed counters
· Precautions for retries and speculative execution
· Pipelining to improve performance
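A sketch of the distributed-counter niche with redis-py (host and key names are placeholders; note the retry caveat from the list above):

```python
import redis

def count_rows(partition):
    seen = 0
    for row in partition:
        seen += 1
        yield row
    # One connection and one atomic HINCRBY per partition, not per row
    r = redis.Redis(host="redis-host", port=6379)  # placeholder address
    r.hincrby("job:42:counters", "rows_processed", seen)

# Counters update as the result is materialized; with retries or
# speculative execution enabled, key the field by task attempt to
# avoid double counting
processed = df.rdd.mapPartitions(count_rows)
```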
Re-imagine Data Monitoring with whylogs and Spark (Databricks)
In the era of microservices, decentralized ML architectures and complex data pipelines, data quality has become a bigger challenge than ever. When data is involved in complex business processes and decisions, bad data can, and will, affect the bottom line. As a result, ensuring data quality across the entire ML pipeline is both costly, and cumbersome while data monitoring is often fragmented and performed ad hoc. To address these challenges, we built whylogs, an open source standard for data logging. It is a lightweight data profiling library that enables end-to-end data profiling across the entire software stack. The library implements a language and platform agnostic approach to data quality and data monitoring. It can work with different modes of data operations, including streaming, batch and IoT data.
In this talk, we will provide an overview of the whylogs architecture, including its lightweight statistical data collection approach and various integrations. We will demonstrate how the whylogs integration with Apache Spark achieves large scale data profiling, and we will show how users can apply this integration into existing data and ML pipelines.
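As a rough illustration only: whylogs v1 profiles pandas DataFrames directly, so the simplest (if lossy) route from Spark is a sampled conversion; the native pyspark integration mentioned in the talk has its own API, which we do not reproduce here:

```python
import whylogs as why

# spark_df is a hypothetical Spark DataFrame; sampling keeps the
# toPandas() conversion cheap at the cost of approximate statistics
sample_pdf = spark_df.sample(0.01).toPandas()

results = why.log(sample_pdf)      # lightweight statistical profile
profile_view = results.view()      # mergeable, serializable summary
print(profile_view.to_pandas())    # per-column statistics
```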
Raven: End-to-end Optimization of ML Prediction Queries (Databricks)
Machine learning (ML) models are typically part of prediction queries that consist of a data processing part (e.g., for joining, filtering, cleaning, featurization) and an ML part invoking one or more trained models. In this presentation, we identify significant and unexplored opportunities for optimization. To the best of our knowledge, this is the first effort to look at prediction queries holistically, optimizing across both the ML and SQL components.
We will present Raven, an end-to-end optimizer for prediction queries. Raven relies on a unified intermediate representation that captures both data processing and ML operators in a single graph structure.
This allows us to introduce optimization rules that
(i) reduce unnecessary computations by passing information between the data processing and ML operators,
(ii) leverage operator transformations (e.g., turning a decision tree into a SQL expression or an equivalent neural network) to map operators to the right execution engine, and
(iii) integrate compiler techniques to take advantage of the most efficient hardware backend (e.g., CPU, GPU) for each operator.
We have implemented Raven as an extension to Spark’s Catalyst optimizer to enable the optimization of SparkSQL prediction queries. Our implementation also allows the optimization of prediction queries in SQL Server. As we will show, Raven is capable of improving prediction query performance on Apache Spark and SQL Server by up to 13.1x and 330x, respectively. For complex models, where GPU acceleration is beneficial, Raven provides up to 8x speedup compared to state-of-the-art systems. As part of the presentation, we will also give a demo showcasing Raven in action.
Processing Large Datasets for ADAS Applications using Apache Spark (Databricks)
Semantic segmentation is the classification of every pixel in an image/video. The segmentation partitions a digital image into multiple objects to simplify/change the representation of the image into something that is more meaningful and easier to analyze [1][2]. The technique has a wide variety of applications ranging from perception in autonomous driving scenarios to cancer cell segmentation for medical diagnosis.
Exponential growth in the datasets that require such segmentation is driven by improvements in the accuracy and quality of the sensors generating the data extending to 3D point cloud data. This growth is further compounded by exponential advances in cloud technologies enabling the storage and compute available for such applications. The need for semantically segmented datasets is a key requirement to improve the accuracy of inference engines that are built upon them.
Streamlining the accuracy and efficiency of these systems directly affects the value of the business outcome for organizations that are developing such functionalities as a part of their AI strategy.
This presentation details workflows for labeling, preprocessing, modeling, and evaluating performance/accuracy. Scientists and engineers leverage domain-specific features/tools that support the entire workflow from labeling the ground truth, handling data from a wide variety of sources/formats, developing models and finally deploying these models. Users can scale their deployments optimally on GPU-based cloud infrastructure to build accelerated training and inference pipelines while working with big datasets. These environments are optimized for engineers to develop such functionality with ease and then scale against large datasets with Spark-based clusters on the cloud.
Massive Data Processing in Adobe Using Delta Lake (Databricks)
At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile offering. At the heart of this is complex ingestion of a mix of normalized and denormalized data with various linkage scenarios, powered by a central Identity Linking Graph. This powers various marketing scenarios that are activated in multiple platforms and channels like email, advertisements, etc. We will go over how we built a cost-effective and scalable data pipeline using Apache Spark and Delta Lake and share our experiences (a minimal merge sketch follows the agenda below).
What are we storing?
Multi Source – Multi Channel Problem
Data Representation and Nested Schema Evolution
Performance Trade Offs with Various formats
Go over anti-patterns used
(String FTW)
Data Manipulation using UDFs
Writer Worries and How to Wipe them Away
Staging Tables FTW
Datalake Replication Lag Tracking
Performance Time!
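A minimal sketch of the Delta Lake upsert pattern behind such a pipeline (paths, table and join keys are hypothetical):

```python
from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "/delta/profiles")  # hypothetical path

# Upsert a staged micro-batch of normalized/denormalized records
(
    target.alias("t")
    .merge(updates.alias("s"), "t.profile_id = s.profile_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

# Nested schema evolution on append
(
    updates.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("/delta/profiles")
)
```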
5. Motivation - Challenges of Spark shuffle
▪ Data Center Infrastructure evolution
▪ Compute and storage disaggregation has become a key trend, and diskless environments are increasingly popular
▪ Modern datacenters are evolving: high-speed networks between compute and disaggregated storage, together with tiered storage architectures, make local storage less attractive
▪ New storage technologies are emerging, e.g., storage class memory (or PMem)
▪ Spark shuffle problems
▪ Uneven resource utilization of CPU and Memory
▪ Out of memory issues and GC
▪ Disk I/O too slow
▪ Data spill degrades performance
▪ Shuffle I/O grows quadratically with data
▪ Local SSDs wear out by frequent intermediate data writes
▪ Unaffordable re-compute cost
▪ Other related works
▪ Intel disaggregated shuffle with DAOS [1], Facebook Cosco [2], Baidu DCE shuffle [3], JD.com & MemVerge RSS [4], etc.
1. https://www.slideshare.net/databricks/improving-apache-spark-by-taking-advantage-of-disaggregated-architecture
2. https://databricks.com/session/cosco-an-efficient-facebook-scale-shuffle-service
3. http://apache-spark-developers-list.1001551.n3.nabble.com/Enabling-fully-disaggregated-shuffle-on-Spark-td28329.html
4. https://databricks.com/session/optimizing-performance-and-computing-resource-efficiency-of-in-memory-big-data-analytics-with-disaggregated-persistent-memory
6. Re-cap of Shuffle
[Diagram: each map task loads its split of an input HDFS file and sorts it; each map's output becomes intermediate data that the shuffle randomly partitions across reducers; reducers sort their partitions and write an output HDFS file. Data is compressed on write and decompressed on read at each hop. Shuffle writes go to local storage (the shuffle service can cache the data); shuffle reads fetch remote data via the network.]
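To make the recap concrete, here is a minimal, illustrative sketch (in C++, not Spark's actual Scala internals) of what each map task does on the write side: hash-partition records by key across the reducers, then sort within each partition before it would be compressed and written out locally.

```cpp
// Minimal sketch of a map task's shuffle-write step, under the assumption
// of simple (key, value) records; real Spark adds serialization, spilling,
// and compression on top of this.
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

using Record = std::pair<int, std::string>;  // (key, value)

std::vector<std::vector<Record>> shuffleWrite(const std::vector<Record>& mapOutput,
                                              std::size_t numReducers) {
  std::vector<std::vector<Record>> partitions(numReducers);
  for (const auto& rec : mapOutput) {
    // Hash-partition: records with the same key always land in the same bucket.
    std::size_t p = std::hash<int>{}(rec.first) % numReducers;
    partitions[p].push_back(rec);
  }
  // Sort within each partition so the reduce side can merge cheaply.
  for (auto& part : partitions) {
    std::sort(part.begin(), part.end(),
              [](const Record& a, const Record& b) { return a.first < b.first; });
  }
  return partitions;  // each partition is then compressed and written locally
}
```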
7. Spark Shuffle Bottlenecks
▪ Spark Shuffle (nWeight – a Graph Computation Workload)
▪ Context: an iterative graph-parallel algorithm, implemented with GraphX, that computes the association between two vertices 2-3 hops apart in the graph (e.g., recommending a video to my friends' friends)
[Chart: Spark worker node CPU utilization over elapsed time, stacked by average %idle, %steal, %iowait, %nice, %system, and %user.]
▪ Spark Shuffle (TeraSort)
▪ Context: TeraSort samples the input data and uses map/reduce to sort the data into a total order.
9. PMem - A New Memory Tier
▪ IDC reports indicate that data is growing very fast
▪ The global datasphere growth rate (CAGR) is 27%**
▪ But DRAM density scaling is slowing: from 4X/3yr to 2X/3yr to 2X/4yr*
▪ A new memory system will be needed to meet the data growth needs of new use cases
▪ PMem: a new category that sits between memory and storage
▪ Delivers a unique combination of affordable large capacity and support for data persistence
▪ Two operational modes
▪ Memory Mode: enlarges system memory size
▪ App Direct Mode: exposes two sets of independent memory resources to the OS and applications
**Source: Data Age 2025, sponsored by Seagate with data from IDC Global DataSphere, Nov 2018
*Source: "3D NAND Technology – Implications for Enterprise Storage Applications" by J. Yoon (IBM), 2015 Flash Memory Summit
10. Remote Persistent Memory Usage
High Availability / Data Replication
• Replicate data in local PM across the fabric and store it in remote PM
• For backup
Remote PM
• Extend on-node memory capacity (with or without persistence) in a disaggregated architecture to enlarge compute-node memory
• e.g., IMDB
Shared Remote PM
• PM holds SHARED data among distributed applications
• e.g., remote shuffle service, IMDB
https://www.snia.org/sites/default/files/PM-Summit/2018/presentations/05_PM_Summit_Grun_PM_%20Final_Post_CORRECTED.pdf
11. Access Remote Persistent Memory over RDMA
Remote Persistent Memory offers:
• Remote persistence, without losing any of the characteristics of memory
• PM is really fast, so it needs ultra-low-latency networking
• PM has very high bandwidth, so it needs an ultra-efficient protocol, transport offload, and high network bandwidth
• Remote access must not add significant latency
RDMA offers:
• Network switches & adapters that deliver predictability, fairness, and zero packet loss
• Zero-copy data movement between two systems with volatile DRAM, offloading data movement from the CPU to the NIC
• Low latency (< µsec)
• High bandwidth: 200 Gb/s and 400 Gb/s links, zero-copy, kernel bypass, HW-offloaded one-sided memory-to-remote-memory operations
• Reliable, credit-based data and control delivery implemented in hardware
• Network resiliency and scale-out
RPMem over Fabric adds complexity:
• To guarantee written data is durable on the target node, CPU caches need to be bypassed or flushed to get the data into the ADR power-fail-safe domain
• When writing to PMem, a synchronous acknowledgement is needed once writes have reached the durability domain, but current RDMA Write semantics do not provide such an acknowledgement
12. RPMem Durability
• RDMA
• Guarantees that data has been successfully received and accepted for execution by the remote HCA
• Doesn't guarantee the data has reached remote host memory – ADR is needed for that
• Doesn't guarantee the data is visible/durable for other consumers' accesses (other connections, the host processor)
• A small RDMA read can be used to force written data to PMem
• New transport operation – RDMA FLUSH
• A new RDMA command opcode
• Flushes all previous writes, or specific regions
• Provides a memory placement guarantee to the upper-layer software
• RDMA Flush forces previous RDMA Write data into the durability domain
• It makes PM operations with RDMA more efficient!
[Diagrams: two ladder charts between the application on Peer A, both peers' RNICs, and Peer B's memory controller and PMem. In the first, RDMA Writes arrive as non-allocating posted writes, and a trailing flushing RDMA Read (with its read ACK) forces the preceding writes to PMem. In the second, the new RDMA Flush command replaces the flushing read: the flush propagates through the RNICs and the memory controller down to PMem before being acknowledged.]
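The flushing-read pattern in the first diagram can be sketched with libibverbs. This is a hedged illustration, not Intel's code: it assumes an already-connected reliable-connection queue pair, a registered memory region, and an out-of-band exchange of the peer's remote address and rkey; setup and error handling are omitted.

```cpp
// Sketch of RDMA Write followed by a small "flushing" RDMA Read.
// Assumes qp is a connected RC queue pair and mr covers local_buf.
#include <infiniband/verbs.h>
#include <cstdint>

int write_then_flushing_read(ibv_qp* qp, ibv_mr* mr, void* local_buf,
                             uint32_t len, uint64_t remote_addr, uint32_t rkey) {
  ibv_sge sge{};
  sge.addr = reinterpret_cast<uintptr_t>(local_buf);
  sge.length = len;
  sge.lkey = mr->lkey;

  // 1. RDMA WRITE: completes when the remote HCA accepts the data, which
  //    does NOT guarantee it reached remote host memory or PMem.
  ibv_send_wr write_wr{}, *bad = nullptr;
  write_wr.opcode = IBV_WR_RDMA_WRITE;
  write_wr.sg_list = &sge;
  write_wr.num_sge = 1;
  write_wr.wr.rdma.remote_addr = remote_addr;
  write_wr.wr.rdma.rkey = rkey;
  if (ibv_post_send(qp, &write_wr, &bad)) return -1;

  // 2. Small RDMA READ of the just-written region: it acts as a barrier
  //    that pushes the preceding write toward the ADR durability domain.
  ibv_sge read_sge = sge;
  read_sge.length = 1;  // one byte is enough for the flush effect
  ibv_send_wr read_wr{};
  read_wr.opcode = IBV_WR_RDMA_READ;
  read_wr.send_flags = IBV_SEND_SIGNALED;  // poll the CQ for this one
  read_wr.sg_list = &read_sge;
  read_wr.num_sge = 1;
  read_wr.wr.rdma.remote_addr = remote_addr;
  read_wr.wr.rdma.rkey = rkey;
  if (ibv_post_send(qp, &read_wr, &bad)) return -1;

  // 3. Wait for the read's completion: earlier writes are ordered ahead
  //    of it, so its arrival implies they have left the NIC/PCIe path.
  ibv_wc wc{};
  while (ibv_poll_cq(qp->send_cq, 1, &wc) == 0) { /* spin */ }
  return (wc.status == IBV_WC_SUCCESS) ? 0 : -1;
}
```

The one-byte read completes only after the HCA has ordered the earlier writes ahead of it, which, combined with ADR (or non-allocating writes), supplies the durability guarantee that a plain RDMA Write completion lacks; RDMA Flush replaces this workaround with a first-class opcode.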
14. Re-cap: Remote Persistent Memory Extension for Spark shuffle Design
▪ 1. Serialize obj to off-heap memory
▪ 2. Write to the local shuffle dir
▪ 3. Read from the local shuffle dir
▪ 4. Send to the remote reader over TCP/IP
Ø Lots of context switches
Ø POSIX buffered read/write on the shuffle disk
Ø TCP/IP-based socket send for remote shuffle reads
[Diagram: in the vanilla design, the executor JVM in user space writes shuffle files into spark.local.dir on SSD/HDD through the kernel (step 2), reads them back (step 3), and sends them to remote readers over the network (step 4).]
1. Serialize obj to off-heap memory
2. Persist to PMEM
3. Read from remote PMEM through RDMA; PMEM is used as the RDMA memory buffer
Ø No context switches
Ø Efficient read/write on PMEM
Ø RDMA read for remote shuffle reads
[Diagram: in the PMoF design, the executor JVM's new Shuffle Writer serializes objects from the heap into an off-heap bytebuffer (step 1), persists them to PMEM through user-space PMEM drivers (step 2), and the new Shuffle Reader reads remote PMEM directly through the RDMA NIC (step 3), with no kernel involvement.]
Spark PMoF: https://github.com/intel-bigdata/spark-pmof
Strata-ca-2019: https://conferences.oreilly.com/strata/strata-ca-2019/public/schedule/detail/72992
15. Spark-PMoF End-to-End Time Evaluation – TeraSort Workload
§ Terasort
§ End-to-end time gain vs. vanilla Spark: 22.7x
§ 1.29x speedup over 4x NVMe
§ PMoF shortens remote read latency dramatically
§ Read-blocked time for HDD, NVMe & PMem (from the Spark UI): 8.3 min vs. 11 s vs. 7 ms
§ PMem provides higher write/read bandwidth per node than HDD & NVMe, and higher endurance
§ Decision support workload
§ Less I/O-intensive than TeraSort
§ 3.2x speedup in total execution time across the 99 queries
§ I/O-intensive workloads benefit more from the PMoF performance improvement
Performance results are based on testing as of 12/06/2019 and may not reflect all publicly available security updates. See configuration disclosure on slide for details. No product
can be absolutely secure. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks. Configurations refer to page 34
[Chart: execution time of the 99 queries, Spark-PMoF vs. vanilla Spark.]
[Chart: Spark 550 GB TeraSort end-to-end time in seconds, log scale, lower is better: terasort-hdd 12,277.2; terasort-nvme 695; terasort-pmof 540.5.]
16. Extending to fully Disaggregated Shuffle solution
▪ Remote persistent memory demonstrated good results, but what more is needed?
▪ Real production environments pose more challenges
▪ Disaggregated, diskless environments
▪ Scaling shuffle and compute independently
▪ CPU/memory imbalance issues
▪ Some jobs run for a long time; the stage-recompute cost is intolerable in case of shuffle failure
▪ Elastic deployment with compute and storage disaggregation requires an independent shuffle solution
▪ Decoupling shuffle I/O from specific network/storage hardware makes it possible to deliver a dedicated SLA for critical applications
▪ Fault tolerance in case of shuffle failure, with no need to recompute
▪ Offloading spill as well reduces compute-side memory requirements
▪ Balanced resource utilization
▪ Leverage state-of-the-art storage media
▪ To provide a high-performance, high-endurance storage backend
▪ This drives the intent to build an RPMem-based, fully disaggregated shuffle solution!
17. RPMP Architecture
▪ Remote Persistent Memory Pool for Spark (Spark RPMP): a new, fully disaggregated shuffle solution that leverages state-of-the-art hardware technologies, including persistent memory and RDMA. It comprises:
▪ A new pluggable shuffle manager
▪ A persistent-memory-based distributed storage system
▪ An RDMA-powered network library, and an innovative approach that uses persistent memory both as the shuffle media and as the RDMA memory region, reducing extra memory copies and context switches
▪ Features
▪ Provides allocate/free/read/write APIs on pooled PMem resources
▪ Data is replicated to multiple nodes for high availability
▪ Can be extended to other usage scenarios such as a PMem-based database, data store, or cache store
▪ Benefits
▪ Improved Spark scalability by disaggregating Spark shuffle from the compute nodes to a high-performance distributed storage
▪ Improved Spark shuffle performance with high-speed persistent memory and a low-latency RDMA network
▪ Improved reliability by providing a manageable, highly available shuffle service that supports shuffle data replication and fault tolerance
[Diagram: compute nodes running SQL, transactions, streaming, and machine learning workloads over DRAM connect through RNICs to the RPMP storage tier, where a proxy fronts DRAM data caches and PMem serving remote shuffle, S3, and K/V usages.]
18. Remote Persistent Memory Pool overview
[Diagram: Mapper1..MapperN and Reducer1..ReducerN executors, each running the PMoF Shuffle Manager with its Shuffle Writer and Shuffle Reader, issue RDMA reads and writes against the RPMP nodes.]
§ Shuffle cares most about write performance (latency/bandwidth)
§ A 100Gb NIC is needed; theoretically, 8x PMEM on a single node provides 10 GB+ of write bandwidth
[Diagram: RPMP node 1 and RPMP node 2, each consisting of an RPMP Proxy and an RPMP Core with network, controller, and storage layers, share a global memory address space and exchange heartbeat and replication traffic.]
§ The RPMP storage node is chosen by consistent hashing to avoid a single point of failure (see the sketch below)
§ A timely ActiveNodeMap is maintained via heartbeats
§ Data is replicated from the driver node to a worker node over RDMA
§ If the driver node goes down, the worker node remains writable and readable
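A minimal sketch of the consistent-hashing idea, assuming virtual nodes on a sorted ring; the class and method names are illustrative rather than the actual RPMP code, and std::hash stands in for the xxHash function referenced later in the deck.

```cpp
// Illustrative consistent-hash ring for picking an RPMP storage node.
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <string>

class ConsistentHashRing {
 public:
  // Place each node at several virtual points to smooth the key distribution.
  void addNode(const std::string& node, int vnodes = 64) {
    for (int i = 0; i < vnodes; ++i)
      ring_[hash_(node + "#" + std::to_string(i))] = node;
  }
  // Drop a node when the heartbeat-maintained ActiveNodeMap marks it dead;
  // only keys owned by its virtual points move, so no full reshuffle occurs.
  void removeNode(const std::string& node, int vnodes = 64) {
    for (int i = 0; i < vnodes; ++i)
      ring_.erase(hash_(node + "#" + std::to_string(i)));
  }
  // The first virtual point clockwise from the block's hash owns the block.
  // Assumes the ring is non-empty.
  std::string nodeFor(uint64_t blockId) const {
    auto it = ring_.lower_bound(hash_(std::to_string(blockId)));
    return (it == ring_.end()) ? ring_.begin()->second : it->second;  // wrap
  }
 private:
  std::hash<std::string> hash_;
  std::map<std::size_t, std::string> ring_;  // ring position -> node
};
```

A mapper would ask nodeFor(shuffleBlockId) for the primary node and, for replication, walk to the next distinct node on the ring.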
19. RPMP CORE architecture details
§ RPMP Client
§ The RPMP client provides transactional read/write/allocate/free and object put/get interfaces to users
§ Both C++ and Java APIs are provided
§ Data is then transferred via HPNL (RDMA) between the selected server nodes and the client
§ RPMP Server
§ The RPMP proxy maintains a unified ActiveNodeMap
§ The network layer is based on HPNL and provides RDMA data transfer
§ The controller layer is responsible for global address management, transaction processing, etc.
§ The storage layer is responsible for PMem management using the high-performance PMDK libraries
[Diagram: the RPMP server stack – a storage layer with a PmemAllocator over /dev/dax0.0-/dev/dax2.1 devices; a network layer with HPNL, encode/decode, buffer management, and checksumming; and a controller layer with a scheduler, transactions, global address management, and an accelerator – alongside the RPMP client stack, which exposes the tx_alloc/tx_free/tx_read/tx_write/put/get interface over the same HPNL network layer, plus a storage proxy.]
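To show how the interface above is meant to be used, here is a hypothetical round trip through tx_alloc/tx_write/tx_read/tx_free. The class below is a tiny in-memory stub standing in for the RDMA-backed client so the calling pattern compiles and runs; the names and signatures are assumptions, not the actual spark-pmof API.

```cpp
// Hypothetical round trip through RPMP-style client interfaces.
#include <cstdint>
#include <cstring>
#include <map>
#include <vector>

class RpmpClientStub {  // stand-in for the RDMA-backed RPMP client
 public:
  uint64_t tx_alloc(uint64_t size) {           // returns a global address
    uint64_t addr = next_;
    next_ += size;
    store_[addr].resize(size);
    return addr;
  }
  void tx_write(uint64_t addr, const void* buf, uint64_t n) {
    std::memcpy(store_[addr].data(), buf, n);  // real client: RDMA to pooled PMem
  }
  void tx_read(uint64_t addr, void* buf, uint64_t n) {
    std::memcpy(buf, store_[addr].data(), n);  // real client: one-sided RDMA read
  }
  void tx_free(uint64_t addr) { store_.erase(addr); }
 private:
  uint64_t next_ = 0x1000;
  std::map<uint64_t, std::vector<char>> store_;
};

int main() {
  RpmpClientStub client;
  const uint64_t kBlock = 4ull << 20;          // one 4 MB shuffle block
  std::vector<char> out(kBlock, 'x'), in(kBlock);
  uint64_t addr = client.tx_alloc(kBlock);     // reserve pooled space
  client.tx_write(addr, out.data(), kBlock);   // transactional write
  client.tx_read(addr, in.data(), kBlock);     // read it back
  client.tx_free(addr);                        // return the region to the pool
  return out == in ? 0 : 1;
}
```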
20. Spark RPMP optimization features
▪ Optimized RDMA communication
▪ Leverages HPNL as a high-performance, protocol-agnostic network messenger
▪ The server handles all write operations; clients implement read-only operations using one-sided RDMA reads
▪ Controller accelerator layer
▪ Partition Merge
▪ Aggregates small partitions into larger blocks to accelerate the reduce phase and cut the number of reducer connections
▪ Sort
▪ Sorts the shuffle data on the fly, so no sorting compute is needed in the reduce phase, reducing compute-node CPU utilization
▪ Provides controllable, fine-grained control over resource utilization when compute-node CPU resources are limited
▪ Storage
▪ A global address space accessible with memory-like APIs
▪ A transactional key-value store based on libpmemobj
▪ An allocator manages the PMem; the storage proxy directs requests to the different allocators
* https://github.com/Cyan4973/xxHash
21. RPMP Workflow
Write (single node):
1. Write data to a specific address.
2. The server issues an RDMA read (client DRAM -> server DRAM).
3. Flush (DRAM -> PMEM).
4. Request ACK.
Read (single node):
1. Read data from a specific address.
2. RDMA write (server PMEM -> client DRAM).
3. Request ACK.
Write (replicated via proxy):
1. Write data to a specific address.
2. RDMA read (client DRAM -> server DRAM; secondary node DRAM -> primary node DRAM).
3. Flush (DRAM -> PMEM).
4. Request ACK.
Read (replicated): same as the single-node read.
[Diagrams: client/server ladder charts for each flow, showing DRAM and PMem on the server(s) and, in the replicated case, the proxy plus primary and secondary servers.]
22. PMEM Based Shuffle Write optimization
§ Map
§ Provision the PMem namespace in advance
§ Leverage a circular buffer to build unidirectional channels for the RDMA primitives
§ Serialized data is written to an off-heap buffer; once it hits the threshold (4 MB by default), a block is created via libpmemobj on the PMem device with a memcopy (see the sketch after this slide's diagram)
§ Append-only writes; each block is written only once
§ No index file: the mapping info is stored in the PMem object metadata
§ libpmem-based, kernel bypass
§ Reduce
§ Reduce uses memcopy to read the data
§ Reduce reads PMem memory directly through RDMA
[Diagram: a map task's KV data in the JVM flows through the PMEM Shuffle Writer and JNI into PMDK (libpmemobj, C), which lays out Partition[0]..Partition[n] on the persistent memory device (devdax/fsdax mode); the RPMP client circular buffer sits in native DRAM, and N reduce tasks read the partitions.]
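A hedged sketch of the map-side write path just described, using PMDK's libpmemobj C API from C++: stage serialized records in a buffer, and once the 4 MB threshold is hit, allocate a persistent block and copy it in with a persisting memcpy. The pool path, layout name, sizes, and class are illustrative; the real writer is driven over JNI from the JVM and also records the partition-mapping metadata, which is omitted here.

```cpp
// Sketch of threshold-triggered block writes to PMem via libpmemobj.
#include <libpmemobj.h>
#include <cstring>
#include <stdexcept>
#include <vector>

constexpr size_t kBlockSize = 4u << 20;  // 4 MB spill threshold from the slide

class PmemBlockWriter {
 public:
  explicit PmemBlockWriter(const char* pool_path) {
    pop_ = pmemobj_create(pool_path, "shuffle", 256u << 20, 0666);
    if (!pop_) pop_ = pmemobj_open(pool_path, "shuffle");  // pool already exists
    if (!pop_) throw std::runtime_error(pmemobj_errormsg());
    staging_.reserve(kBlockSize);
  }
  ~PmemBlockWriter() {
    flush();  // persist any tail smaller than the threshold
    pmemobj_close(pop_);
  }
  // Append one serialized record; spill a block once the threshold is hit.
  void append(const void* data, size_t len) {
    const char* p = static_cast<const char*>(data);
    staging_.insert(staging_.end(), p, p + len);
    if (staging_.size() >= kBlockSize) flush();
  }

 private:
  void flush() {
    if (staging_.empty()) return;
    PMEMoid oid;
    // Allocate a persistent object sized to the staged data (append-only:
    // every block is written exactly once).
    if (pmemobj_zalloc(pop_, &oid, staging_.size(), /*type_num=*/1))
      throw std::runtime_error(pmemobj_errormsg());
    // Copy and persist in one call, entirely in user space (kernel bypass).
    pmemobj_memcpy_persist(pop_, pmemobj_direct(oid),
                           staging_.data(), staging_.size());
    staging_.clear();
  }
  PMEMobjpool* pop_;
  std::vector<char> staging_;
};
```

pmemobj_memcpy_persist copies and flushes without entering the kernel, which is the kernel-bypass property the slide calls out.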
24. Performance Evaluation
§ Configuration
§ 2 nodes: one as the RPMEM client and the other as the RPMEM server
§ 40 Gb RDMA NIC; 4x PMem modules on the RPMEM server
§ Tested the remote_allocate, remote_free, remote_write, and remote_read interfaces with a single client
§ Performance
§ Remote read maxes out the NIC bandwidth
Interface | Throughput | Note
allocate | 3.7 GB/s | Expect higher performance with more clients
remote_write | 2.9 GB/s | Expect higher performance with more clients
remote_read | 4.9 GB/s | Limited by the 40 Gb NIC
[Diagram: RPMEM client (node 1) connected to the RPMEM server (node 2) holding 4x PMem.]
[Diagram: two Spark nodes with HDFS on HDDs; node 2 additionally holds 2x PMem for shuffle.]
§ Configuration
§ 2 nodes
§ Baseline: shuffle on HDD
§ RPMem: 2x 128 GB PMem on node 2 for shuffle and external sort
§ Workload: TeraSort, 100 GB
§ Performance
§ 1.98x speedup
Execution time: vanilla Spark (100 GB) 416 s; RPMem shuffle 210 s
Performance results are based on testing as of 5/30/2020 and may not reflect all publicly available security updates. See configuration disclosure on slide for details. No product
can be absolutely secure. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks. Configurations refer to page 34
26. Summary
• Spark shuffle poses many challenges in large-scale production environments
• Remote persistent memory extends PM usage modes to new scenarios, with RDMA being the most widely accepted technology for remote persistent memory access
• A remote persistent memory pool for Spark shuffle enables a fully disaggregated, high-performance, low-latency shuffle solution that accelerates Spark shuffle
▪ Improved Spark scalability by disaggregating Spark shuffle from the compute nodes to a high-performance distributed storage
▪ Improved Spark shuffle performance with high-speed persistent memory and a low-latency RDMA network
▪ Improved reliability by providing a manageable, highly available shuffle service that supports shuffle data replication and fault tolerance
28. Accelerate Your Data Analytics & AI Journey with Intel
[Slide: Intel's analytics/AI portfolio – optimized ML/DL libraries & tools (Intel Distribution for Python, Intel-optimized frameworks); optimized cloud platforms (Amazon Web Services, Google Cloud Platform, Microsoft Azure, Baidu Cloud & more); and hardware spanning CPU (multi-purpose analytics/AI foundation, high-performance in-memory analytics), GPU (AI, HPC, media & graphics; in development), FPGA (real-time & multi-use DL inference), data center DL inference (Goya) and DL training (Gaudi), and edge DL inference. Also shown: the Intel-optimized end-to-end data analytics & AI pipeline – discovery of possibilities & next steps; data setup, ingestion & cleaning; developing models using analytics/AI; deploying into production & iterating. See intel.com/AI, software.intel.com, intel.com/yourdataonintel.]
29. Intel OAP
https://github.com/Intel-bigdata/OAP/
End-to-end columnar data processing with Intel AVX support
[Diagram: Optimized Analytics Packages for Spark – the OAP Native SQL Engine plugin sits between Spark SQL/Catalyst and Apache Arrow, providing an Arrow data source, Arrow data processing, and columnar shuffle on Intel CPUs and other accelerators (FPGA, GPU, ...); the shuffle stack extends from local shuffle to remote shuffle to remote persistent memory shuffle, backed by the Remote Persistent Memory Pool.]
32. Legal Information: Benchmark and Performance Disclaimers
▪ Performance results are based on testing as of Feb. 2019 and may not reflect all publicly available security updates. See
configuration disclosure for details. No product can be absolutely secure.
▪ Software and workloads used in performance tests may have been optimized for performance only on Intel
microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems,
components, software, operations and functions. Any change to any of those factors may cause the results to vary. You
should consult other information and performance tests to assist you in fully evaluating your contemplated purchases,
including the performance of that product when combined with other products. For more information, see Performance
Benchmark Test Disclosure.
▪ Configurations: see performance benchmark test configurations.