This document discusses hybrid transactional/analytical processing (HTAP) with Apache Spark and in-memory data grids. After introducing the speaker and GigaSpaces, it explains why modern applications need both online transaction processing and real-time operational intelligence, with examples from retail and IoT, and frames the goals of minimizing application latency while maximizing data-analytics locality. It surveys in-memory computing options, describes how GigaSpaces combines an in-memory data grid with Spark to achieve HTAP, and covers deployment topologies, data grid RDDs, and pushing predicates down to the grid. It closes with how this was productized as InsightEdge, along with further innovations and reference architectures.
Dr. Christian Kurze from Denodo, "Data Virtualization: Fulfilling the Promise..." – Dataconomy Media
This document discusses data virtualization and how it can help organizations leverage data lakes to access all their data from disparate sources through a single interface. It addresses how data virtualization can help avoid data swamps, prevent physical data lakes from becoming silos, and support use cases like IoT, operational data stores, and offloading. The document outlines the benefits of a logical data lake created through data virtualization and provides examples of common use cases.
Building an Enterprise Data Platform with Azure Databricks to Enable Machine ... – Databricks
This document summarizes Walmart's transition to building an enterprise data platform on Azure Databricks to enable machine learning and data science at scale. Previously, Walmart had a complex and slow legacy technology stack. The new platform goals were to centralize data in the cloud, increase productivity with data science tools, and reduce costs. Key aspects of the new platform included using Azure and Databricks for data processing and machine learning, Airflow for orchestration, and building several machine learning models for applications like fraud detection and product recommendations. Challenges in the transition included optimizing performance and managing resources across the platforms.
Unlocking Geospatial Analytics Use Cases with CARTO and Databricks – Databricks
Many companies need to analyze large datasets that include location information. To be able to derive business insights from these datasets you need a solution that provides geospatial analysis functionalities and can scale to manage large volumes of information. The combination of CARTO and Databricks allows you to solve this kind of large scale geospatial analytics problems. CARTO provides a location intelligence platform to discover and predict key insights through location data. In this session we will see how we can integrate CARTO and Databricks and how we can take advantage of this combination to solve specific problems for industries such as logistics, telecommunications or financial services.
LendingClub RealTime BigData Platform with Oracle GoldenGate – Rajit Saha
LendingClub RealTime BigData Platform with Oracle GoldenGate BigData Adapter. This was presented at Oracle Open World 2017 in San Francisco.
Speakers:
Rajit Saha
Vengata Guruswami
Counting Unique Users in Real-Time: Here's a Challenge for You! – DataWorks Summit
Finding the number of unique users out of 10 billion events per day is challenging. At this session, we're going to describe how re-architecting our data infrastructure, relying on Druid and ThetaSketch, enables our customers to obtain these insights in real-time.
To put things into context, at NMC (Nielsen Marketing Cloud) we provide our customers (marketers and publishers) real-time analytics tools to profile their target audiences. Specifically, we provide them with the ability to see the number of unique users who meet a given criterion.
Historically, we have used Elasticsearch to answer these types of questions; however, we encountered major scaling and stability issues.
In this presentation we will detail the journey of rebuilding our data infrastructure, including researching, benchmarking and productionizing a new technology, Druid, with ThetaSketch, to overcome the limitations we were facing.
We will also provide guidelines and best practices with regards to Druid.
Topics include:
* The need and possible solutions
* Intro to Druid and ThetaSketch (see the sketch after this list)
* How we use Druid
* Guidelines and pitfalls
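Not from the talk itself, but to make the ThetaSketch idea concrete: below is a minimal Scala sketch using the Apache DataSketches Java library (assuming the org.apache.datasketches.theta package; check the artifact version you depend on). A theta sketch estimates distinct counts in bounded memory, which is what makes unique-user counting over billions of events tractable in real time.

import org.apache.datasketches.theta.UpdateSketch

object UniqueUsers {
  def main(args: Array[String]): Unit = {
    // A theta sketch retains a small set of hashed IDs, not the IDs themselves.
    val sketch = UpdateSketch.builder().build()

    // Feed user IDs; duplicates are absorbed rather than recounted.
    Seq("user-1", "user-2", "user-1", "user-3").foreach(id => sketch.update(id))

    // getEstimate returns the approximate number of distinct IDs seen.
    println(f"estimated unique users: ${sketch.getEstimate}%.0f")
  }
}

Druid stores such sketches per segment and merges them at query time, which is what keeps set operations (union, intersection) across dimensions cheap.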
Bloor Research & DataStax: How graph databases solve previously unsolvable bu... – DataStax
This webinar covered graph databases and how they can solve problems that were previously difficult for traditional databases. It included presentations on why graph databases are useful, common use cases like recommendations and network analysis, different types of graph databases, and a demonstration of the DataStax Enterprise graph database. There was also a question and answer session where attendees could ask about graph databases and DataStax Enterprise graph.
Big Data in the Cloud with Azure Marketplace Images – Mark Kromer
The document discusses strategies for modern data warehousing and analytics on Azure including using Hadoop for ETL/ELT, integrating streaming data engines, and using lambda and hybrid architectures. It also describes using data lakes on Azure to collect and analyze large amounts of data from various sources. Additionally, it covers performing real-time stream analytics, machine learning, and statistical analysis on the data and discusses how Azure provides scalability, speed of deployment, and support for polyglot environments that incorporate many data processing and storage options.
Scalability and Graph Analytics with Neo4j - Stefan Kolmar, Neo4j – Neo4j
The document discusses Neo4j's capabilities for scalability and graph analytics. It describes how Neo4j provides unbounded scalability through features like causal clustering and multi-tenancy. This allows large datasets to be queried efficiently across distributed databases. Neo4j also includes tools for graph analytics and data science through its graph data science library, which supports algorithms and analytics on graphs with billions of nodes. These capabilities enable use cases for telecommunications companies to perform scalable analytics on large, connected datasets.
Pouring the Foundation: Data Management in the Energy Industry – DataWorks Summit
At CenterPoint Energy, both structured and unstructured data are continuing to grow at a rapid pace. This growth presents many opportunities to deliver business value and many challenges to control costs. To maximize the value of this data while controlling costs, CenterPoint Energy created a data lake using SAP HANA and Hadoop. During this presentation, CenterPoint will discuss their journey of moving smart meter data to Hadoop, how Hadoop is allowing CenterPoint to derive value from big data and their future use case road map.
BICube is a machine learning platform for big data. It provides tools for ingesting, processing, analyzing and visualizing large datasets using techniques like Apache Spark, Hadoop, and machine learning algorithms. The platform includes modules for tasks like document clustering, topic modeling, image analysis, recommendation systems and more. It aims to allow users to build customized machine learning workflows and solutions.
The document discusses Intuit's vision to transform customers' lives by unleashing the power of data. It describes Intuit's Analytics Cloud (IAC), which provides a data platform and foundational services to derive value from data. The IAC allows for real-time and batch data ingestion from various sources and provides services like business lookups, unified customer profiles, and personalization. An example use case of using tax data to personalize the tax preparation experience is also mentioned. The document outlines Intuit's journey to building the IAC, including initially lifting existing systems to the cloud and now focusing on real-time streaming capabilities. Key practices for planning, deploying and managing the IAC are also listed.
This document outlines the big data landscape in 2016, including key components like data lakes, data warehouses, ingestion, processing, data science, analytics, and data sources. It also discusses related microservices, algorithms, data storage technologies, data workflows, stream processing systems, SQL and NoSQL databases, and specialized databases for time series, graphs, and other data types. The goal is to provide an overview of the different technologies and approaches for working with large and diverse datasets.
This document summarizes a presentation about big data analytics solutions from Think Big Analytics and Infochimps. It discusses using their platforms together to power applications with next-generation big data stacks. It highlights case studies, architecture diagrams, and polls to demonstrate how their services can accelerate time to value through a combination of data science, engineering, strategy, and hands-on training and education.
My other computer is a datacentre - 2012 edition – Steve Loughran
An updated version of the "my other computer is a datacentre" talk, presented at the Bristol University HPC talk.
Because it is targeted at universities, it emphasises some of the interesting problems: the classic CS ones of scheduling, new ones of availability and failure handling within what is now a single computer, and emergent problems of power and heterogeneity. It also includes references, all of which are worth reading and, being mostly Google and Microsoft papers, are free to download without needing ACM or IEEE library access.
Comments welcome.
Journey to Creating a 360 View of the Customer: Implementing Big Data Strateg... – Databricks
"The modernization of the tobacco industry is resulting in a shift towards a more data-driven approach to trade, operations and the consumer. The need to scale while maintaining margins is paramount, and today’s consumer requires more personalized engagement and value at every interaction to drive sales and revenue.
At Altria, we're at the forefront of this evolution, leveraging hundreds of terabytes of big data (such as point-of-sale, clickstream, mobile data, and more) and machine learning to improve our ability to make smarter decisions and outpace the competition. This talk recaps our big data journey from a legacy data infrastructure (Teradata), isolated data systems, and a lack of resources that prevented our ability to move quickly and scale, to our current state, where we have successfully implemented, architected, and on-boarded tools and processes across the stages of data acquisition, storage, preparation, and business intelligence with Azure Data Lake, Azure Databricks, Azure Data Factory, API Management, and streaming and hosting technologies to provide a data analytics platform.
We’ll discuss the roadblocks we came across, how we overcame them, and how we employed a unified approach to big data and analytics through the fully managed Azure Databricks platform and the Azure suite of tools which allowed us to streamline workflows, improve operational performance, and ultimately introduce new customer experiences that drive engagement and revenue."
Application-level Disaster Recovery on OpenStack – Ali Hodroj
This document discusses architecting high availability and disaster recovery solutions on OpenStack using Cloudify. It begins with an overview of key concepts like regions, availability zones, and single points of failure. It then covers challenges like deployment complexity and cost of redundancy. The document presents Cloudify's principles of automation, decoupling applications from infrastructure, and plugging in different clouds. Finally, it shares case studies of using Cloudify for operationally critical cold DR, business critical cross-region DR, and mission critical in-memory WAN replication across regions.
"Applications, programming languages, and libraries that leverage sophisticated network hardware capabilities have a natural advantage when used in today’s and tomorrow’s high-performance and data center computer environments. Modern RDMA based network interconnects provides incredibly rich functionality (RDMA, Atomics, OS-bypass, etc.) that enable low-latency and high-bandwidth communication services. The functionality is supported by a variety of interconnect technologies such as InfiniBand, RoCE, iWARP, Intel OPA, Cray’s Aries/Gemini, and others. OFA organization and LinuxRDMA community have been playing a predominant role in the enablement efficient and vendor agnostic software stack for those interconnects. Over the last decade, the community has developed variety user/kernel level protocols and libraries that enable a variety of applications over RDMA including MPI, SHMEM, NFS over RDMA, IPoIB, and many others."
"With the emerging availability server platforms based on ARM CPU architecture, it is important to understand ARM integrates with RDMA hardware and software eco-system. In this talk, we will overview ARM architecture and system software stack. We will discuss how ARM CPU interacts with network devices and accelerators. In addition, we will share our experience in enabling RDMA software stack (OFED/MOFED Verbs) and one-sided communication libraries (Open UCX, OpenSHMEM/SHMEM) on ARM and share preliminary evaluation results."
Watch the video presentation: http://wp.me/p3RLHQ-gyO
Learn more: https://www.openfabrics.org/index.php/abstracts-agenda.html
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Exascale Computing Project - Driving a HUGE Change in a Changing World – inside-BigData.com
In this video from the OpenFabrics Workshop in Austin, Al Geist from ORNL presents: Exascale Computing Project - Driving a HUGE Change in a Changing World.
"In this keynote, Mr. Geist will discuss the need for future Department of Energy supercomputers to solve emerging data science and machine learning problems in addition to running traditional modeling and simulation applications. In August 2016, the Exascale Computing Project (ECP) was approved to support a huge lift in the trajectory of U.S. High Performance Computing (HPC). The ECP goals are intended to enable the delivery of capable exascale computers in 2022 and one early exascale system in 2021, which will foster a rich exascale ecosystem and work toward ensuring continued U.S. leadership in HPC. He will also share how the ECP plans to achieve these goals and the potential positive impacts for OFA."
Learn more: https://exascaleproject.org/
and
https://www.openfabrics.org/index.php/abstracts-agenda.html
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Accelerating Hadoop, Spark, and Memcached with HPC Technologies – inside-BigData.com
DK Panda from Ohio State University presented this deck at the OpenFabrics Workshop.
"Modern HPC clusters are having many advanced features, such as multi-/many-core architectures, highperformance RDMA-enabled interconnects, SSD-based storage devices, burst-buffers and parallel file systems. However, current generation Big Data processing middleware (such as Hadoop, Spark, and Memcached) have not fully exploited the benefits of the advanced features on modern HPC clusters. This talk will present RDMA-based designs using OpenFabrics Verbs and heterogeneous storage architectures to accelerate multiple components of Hadoop (HDFS, MapReduce, RPC, and HBase), Spark and Memcached. An overview of the associated RDMA-enabled software libraries (being designed and publicly distributed as a part of the HiBD project for Apache Hadoop (integrated and plug-ins for Apache, HDP, and Cloudera distributions), Apache Spark and Memcached will be presented. The talk will also address the need for designing benchmarks using a multi-layered and systematic approach, which can be used to evaluate the performance of these Big Data processing middleware."
Watch the video presentation: http://wp.me/p3RLHQ-gzg
Learn more: http://hibd.cse.ohio-state.edu/
and
https://www.openfabrics.org/index.php/abstracts-agenda.html
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Susan Coulter from LANL presented this deck at the OpenFabrics Workshop.
"The OpenFabrics Alliance (OFA) is an open source-based organization that develops, tests, licenses, supports and distributes OpenFabrics Software (OFS). The Alliance’s mission is to develop and promote software that enables maximum application efficiency by delivering wire-speed messaging, ultra-low latencies and maximum bandwidth directly to applications with minimal CPU overhead.
Founded in June 2004 as the OpenIB Alliance, the Alliance was originally focused on developing a vendor-independent, Linux-based InfiniBand software stack. In 2005, the Alliance committed itself to supporting Windows, a move that would make the Alliance’s software stack truly cross-platform. In 2006, the organization again expanded its charter to include support for iWARP and in 2010 it added support for RoCE (RDMA over Converged Ethernet), both for delivering high-performance RDMA and kernel bypass solutions over Ethernet. In 2014 the Alliance expanded again with the creation of the OpenFabrics Interfaces working group to investigate and incorporate support for other high performance networks."
Watch the video presentation: http://wp.me/p3RLHQ-gzo
Learn more: https://www.openfabrics.org/index.php/abstracts-agenda.html
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Video of the presentation can be seen here: https://www.youtube.com/watch?v=uxuLRiNoDio
The Data Source API in Spark is a convenient feature that enables developers to write libraries to connect to data stored in various sources with Spark. Equipped with the Data Source API, users can load/save data from/to different data formats and systems with minimal setup and configuration. In this talk, we introduce the Data Source API and the unified load/save functions built on top of it. Then, we show examples to demonstrate how to build a data source library.
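As a quick illustration of the unified load/save functions mentioned above (written against the modern SparkSession entry point; the file paths are made up):

import org.apache.spark.sql.SparkSession

object LoadSave {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("load-save").master("local[*]").getOrCreate()

    // Unified load: the format name selects the data source library.
    val events = spark.read.format("json").load("events.json")

    // Unified save: same API shape, different format and options.
    events.write.format("parquet").mode("overwrite").save("events.parquet")

    spark.stop()
  }
}

Any library that implements the Data Source API plugs into these same read/write calls through its format name.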
In this era of ever-growing data, the need to analyze it for meaningful business insights becomes more and more significant. There are different Big Data processing alternatives like Hadoop, Spark, Storm, etc. Spark, however, is unique in providing both batch and streaming capabilities, making it a preferred choice for lightning-fast Big Data analysis platforms.
Spark is an open-source cluster computing framework that uses in-memory processing to allow data sharing across jobs for faster iterative queries and interactive analytics. It uses Resilient Distributed Datasets (RDDs) that can survive failures through lineage tracking, and it supports programming in Scala, Java, and Python for batch, streaming, and machine learning workloads.
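A small sketch of the lineage idea: each transformation records how its output derives from its parent, so a lost partition is recomputed from that recipe rather than restored from a replica.

import org.apache.spark.sql.SparkSession

object Lineage {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("lineage").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Each step below adds a node to the lineage graph.
    val words  = sc.parallelize(Seq("spark", "grid", "spark"))
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

    counts.collect().foreach(println)
    // toDebugString prints the lineage Spark would replay after a failure.
    println(counts.toDebugString)

    spark.stop()
  }
}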
Apache Spark presentation at HasGeek Fifth Elephant
https://fifthelephant.talkfunnel.com/2015/15-processing-large-data-with-apache-spark
Covering Big Data Overview, Spark Overview, Spark Internals and its supported libraries
Microsoft Project Olympus AI Accelerator Chassis (HGX-1) – inside-BigData.com
In this video from the Open Compute Summit, Siamak Tavallaei from Microsoft presents an overview of the Microsoft Project Olympus AI Accelerator Chassis, also known as the HGX-1.
Watch the presentation video: http://wp.me/p3RLHQ-guX
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Apache Spark is an in-memory data processing solution that can work with existing data sources like HDFS and can make use of your existing computation infrastructure like YARN/Mesos. This talk covers a basic introduction to Apache Spark and its various components, such as MLlib, Shark, and GraphX, with a few examples.
Sharding is a technique for partitioning and distributing data across multiple servers to enable scaling to large data volumes and workloads. It involves defining a shard key to partition data into chunks that are distributed across shards. The document discusses different types of sharding strategies like range, hash, and tag-aware sharding and how they apply to different use cases around scale, geo-distribution, and hardware optimization. It also covers best practices for building a sharded cluster like pre-splitting data, capacity planning, and using tools like MongoDB Management Service for production operations.
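As a generic illustration of the routing idea (plain Scala, not MongoDB's actual implementation): hash sharding derives the owning shard from a hash of the shard key, while range sharding compares the key against chunk boundaries.

object ShardRouting {
  val shards = Vector("shard0", "shard1", "shard2")

  // Hash sharding: spreads writes evenly, but loses key ordering.
  def hashShard(shardKey: String): String =
    shards(math.floorMod(shardKey.hashCode, shards.size))

  // Range sharding: chunks are [min, "g"), ["g", "p"), ["p", max);
  // ordered keys keep range queries on few shards.
  val boundaries = Vector("g", "p")
  def rangeShard(shardKey: String): String =
    boundaries.indexWhere(shardKey < _) match {
      case -1 => shards.last
      case i  => shards(i)
    }

  def main(args: Array[String]): Unit = {
    println(hashShard("user42"))
    println(rangeShard("mary")) // falls in ["g", "p") -> shard1
  }
}

Hash keys spread load evenly; range keys keep related documents together, which is the trade-off the strategies above navigate.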
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming – Paco Nathan
London Spark Meetup 2014-11-11 @Skimlinks
http://www.meetup.com/Spark-London/events/217362972/
To paraphrase the immortal crooner Don Ho: "Tiny Batches, in the wine, make me happy, make me feel fine." http://youtu.be/mlCiDEXuxxA
Apache Spark provides support for streaming use cases, such as real-time analytics on log files, by leveraging a model called discretized streams (D-Streams). These "micro-batch" computations operate on small time intervals, generally from 500 milliseconds up. One major innovation of Spark Streaming is that it leverages a unified engine. In other words, the same business logic can be used across multiple use cases: streaming, but also interactive, iterative, machine learning, etc.
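A minimal D-Streams sketch (the socket source, host, and port are placeholders) with the batch interval set to the 500 ms lower end mentioned above:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Milliseconds, StreamingContext}

object MicroBatches {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("dstreams").setMaster("local[2]")
    // Every 500 ms, arriving records are grouped into one small RDD.
    val ssc = new StreamingContext(conf, Milliseconds(500))

    // Hypothetical source: log lines arriving on a local socket.
    val lines  = ssc.socketTextStream("localhost", 9999)
    val errors = lines.filter(_.contains("ERROR")).count()
    errors.print()

    ssc.start()
    ssc.awaitTermination()
  }
}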
This talk will compare case studies for production deployments of Spark Streaming, emerging design patterns for integration with popular complementary OSS frameworks, plus some of the more advanced features such as approximation algorithms, and take a look at what's ahead — including the new Python support for Spark Streaming that will be in the upcoming 1.2 release.
Also, let's chat a bit about the new Databricks + O'Reilly developer certification for Apache Spark…
Making Hadoop Realtime by Dr. William Bain of ScaleOut Software – Data Con LA
Hadoop has been widely embraced for its ability to economically store and analyze large data sets. Using parallel computing techniques like MapReduce, Hadoop can reduce long computation times to hours or minutes. This works well for mining large volumes of historical data stored on disk, but it is not suitable for gaining real-time insights from live operational data. Still, the idea of using Hadoop for real-time data analytics on live data is appealing because it leverages existing programming skills and infrastructure – and the parallel architecture of Hadoop itself. This presentation will describe how real-time analytics using Hadoop can be performed by combining an in-memory data grid (IMDG) with an integrated, stand-alone Hadoop MapReduce execution engine. This new technology delivers fast results for live data and also accelerates the analysis of large, static data sets.
L'architettura di Classe Enterprise di Nuova Generazione ("The Next-Generation Enterprise-Class Architecture") – MongoDB
This document discusses using MongoDB as part of an enterprise data management architecture. It begins by describing the rise of data lakes to manage growing and diverse data volumes. Traditional EDWs struggle with this new data variety and volume. The document then provides an overview of MongoDB's features like flexible schemas, secondary indexes, and aggregation capabilities that make it suitable for building different layers of an EDM pipeline for tasks like raw data storage, transformation, analysis, and serving data to downstream systems. Example use cases are presented for building a single customer view and for replacing Oracle with MongoDB.
Managing data analytics in a hybrid cloud – Karan Singh
Managing Data Analytics in a Hybrid Cloud discusses challenges with traditional analytics approaches and proposes using shared data lakes with dynamic compute clusters. Common challenges include explosive analytics team growth leading to resource contention, and duplicating large datasets for each cluster. The proposed approach uses shared object storage to hold unified datasets accessed by multiple ephemeral analytics clusters provisioned on-demand. This allows teams independent resources while avoiding duplicate storage costs and improving agility. The document outlines example architectures and benefits of this shared data lake approach when implemented on a private or public cloud.
Oscon 2019 - Optimizing analytical queries on Cassandra by 100x – Shradha Ambekar
This document summarizes the profile of Shradha Ambekar, a software engineer at Intuit. She will be speaking at Strata New York 2019 about solving data pipeline mysteries. She is also the technical lead for Intuit's Real-Time Analytics and Lineage Framework and contributes to the spark-cassandra-connector project. Her LinkedIn and Twitter profiles are provided.
GOAI: GPU-Accelerated Data Science DataSciCon 2017 – Joshua Patterson
The GPU Open Analytics Initiative, GOAI, is accelerating data science like never before. CPUs are not improving at the same rate as networking and storage, and by leveraging GPUs, data scientists can analyze more data than ever with less hardware. Learn more about how GPUs are accelerating data science (not just deep learning), and how to get started.
This document discusses high performance spatial-temporal trajectory analysis using Spark. It covers the background of analyzing mobile signaling data to enable smarter urban planning. The solution architecture includes data sources, distributed file system, computation engine, and visualization. Technical designs address the big data platform, data governance, algorithm models, and Spark spatial computing. Example scenarios are presented for population heatmaps, commute routes, and office-residence imbalance analysis.
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture – DATAVERSITY
Whether to take data ingestion cycles off the ETL tool and the data warehouse or to facilitate competitive Data Science and building algorithms in the organization, the data lake – a place for unmodeled and vast data – will be provisioned widely in 2020.
Though it doesn’t have to be complicated, the data lake has a few key design points that are critical, and it does need to follow some principles for success. Avoid building the data swamp, but not the data lake! The tool ecosystem is building up around the data lake and soon many will have a robust lake and data warehouse. We will discuss policy to keep them straight, send data to its best platform, and keep users’ confidence up in their data platforms.
Data lakes will be built in cloud object storage. We’ll discuss the options there as well.
Get this data point for your data lake journey.
Neo4j: What's Under the Hood & How Knowing This Can Help You – Neo4j
Neo4j provides a concise summary of how graph databases have evolved and their advantages over traditional databases. Specifically, graph databases can handle billions of connections between data points and enable queries that can traverse thousands of relationships between nodes, providing answers in milliseconds rather than minutes. This level of connected data insight allows for real-time fraud detection, recommendations, knowledge graphs, and other applications that require understanding relationships in large, dynamic datasets.
This document discusses trends in high performance computing (HPC) and big data analytics. It notes that while HPC and big data have different resource needs and programming models traditionally, they are converging as big data workloads require more real-time processing and HPC workloads incorporate more data-driven analytics. The document outlines challenges in both HPC and big data such as system bottlenecks, energy efficiency, and barriers to wider usage. It advocates for more integrated solutions that combine storage, networking, processing and memory to address these challenges.
This document discusses applying Apache Spark to data science challenges in media and entertainment. It introduces Spark as a unifying framework for content personalization using recommendation systems and streaming data, as well as social media analytics using GraphFrames. Specific use cases discussed include content personalization with recommendations, churn analysis, analyzing social networks with GraphFrames, sentiment analysis, and viewership prediction using topic modeling. The document also discusses continuous applications with Spark Streaming, and how Spark ML can be used for machine learning workflows and optimization.
Apache Spark 2.0 set the architectural foundations of Structure in Spark, Unified high-level APIs, Structured Streaming, and the underlying performant components like Catalyst Optimizer and Tungsten Engine. Since then the Spark community has continued to build new features and fix numerous issues in releases Spark 2.1 and 2.2.
Continuing forward in that spirit, the upcoming release of Apache Spark 2.3 has made similar strides, introducing new features and resolving over 1300 JIRA issues. In this talk, we want to share with the community some salient aspects of the soon-to-be-released Spark 2.3 features:
• Kubernetes Scheduler Backend
• PySpark Performance and Enhancements
• Continuous Structured Streaming Processing (see the sketch after this list)
• DataSource v2 APIs
• Structured Streaming v2 APIs
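For the continuous processing item above, a small sketch of what the new mode looks like in user code (using the built-in rate source for testing; whether a query can run continuously depends on its operators and sink):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object ContinuousDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("continuous").master("local[*]").getOrCreate()

    // The "rate" source emits a timestamp/value stream for experiments.
    val stream = spark.readStream.format("rate").load()

    // Trigger.Continuous switches from micro-batches to continuous execution;
    // the interval is the checkpoint frequency, not a batch size.
    val query = stream.writeStream
      .format("console")
      .trigger(Trigger.Continuous("1 second"))
      .start()

    query.awaitTermination()
  }
}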
Spark is a general purpose computational framework that provides more flexibility than MapReduce. It leverages distributed memory and uses directed acyclic graphs for data parallel computations while retaining MapReduce properties like scalability, fault tolerance, and data locality. Cloudera has embraced Spark and is working to integrate it into their Hadoop ecosystem through projects like Hive on Spark and optimizations in Spark Core, MLlib, and Spark Streaming. Cloudera positions Spark as the future general purpose framework for Hadoop, while other specialized frameworks may still be needed for tasks like SQL, search, and graphs.
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... – Databricks
As Apache Spark applications move to a containerized environment, there are many questions about how to best configure server systems in the container world. In this talk we will demonstrate a set of tools to better monitor performance and identify optimal configuration settings. We will demonstrate how Prometheus, a project that is now part of the Cloud Native Computing Foundation (CNCF: https://www.cncf.io/projects/), can be applied to monitor and archive system performance data in a containerized spark environment.
In our examples, we will gather spark metric output through Prometheus and present the data with Grafana dashboards. We will use our examples to demonstrate how performance can be enhanced through different tuned configuration settings. Our demo will show how to configure settings across the cluster as well as within each node.
The document discusses graph data science and Neo4j's Graph Data Science (GDS) framework. GDS allows running graph algorithms and machine learning models at scale on large graph datasets. It discusses key aspects of GDS including architecture, data import, algorithm selection, and case studies of customers using GDS on graphs with billions of nodes and relationships. GDS runs on dedicated instances and supports features like enterprise graph compression, unlimited parallelization, and named graphs to optimize performance on large datasets.
Real-time analysis using an in-memory data grid - Cloud Expo 2013 – ScaleOut Software
ScaleOut technical session at Cloud Expo 2013 in NY. Covers the use of in-memory data grids for real-time analysis of fast-changing data. Includes a financial services example.
If you're like most of the world, you're on an aggressive race to implement machine learning applications and on a path to get to deep learning. If you can give better service at a lower cost, you will be the winners in 2030. But infrastructure is a key challenge to getting there. What does the technology infrastructure look like over the next decade as you move from Petabytes to Exabytes? How are you budgeting for more colossal data growth over the next decade? How do your data scientists share data today and will it scale for 5-10 years? Do you have the appropriate security, governance, back-up and archiving processes in place? This session will address these issues and discuss strategies for customers as they ramp up their AI journey with a long term view.
2. 2
About me
• Vice President, Products and Strategy @ GigaSpaces
• (ex) Director of Solutions Architecture
• Blogging at http://blog.gigaspaces.com
• @ahodroj
• Email: ali@gigaspaces.com
• Slides at http://slideshare.com/ahodroj
4. 4
Do we need to bridge online transaction processing with real-time operational intelligence?
5. 5
Modern applications: the line is blurred between…
• Transactional: essential to operate the business
• Analytical: turning data into value: insights, diagnosis, decision making
14. In-Memory Computing 101
Three options: Distributed Cache (partitioned cache nodes), In-Memory Data Grid (scale-out system of record), In-Memory Database (scale-up system of record).
Distributed Cache:
• Increased capacity
• No support for write-heavy scenarios
• Limited to ID-based reads
• Reads are the only low-latency path
15. In-Memory Computing 101
In-Memory Data Grid (scale-out system of record):
• Heavy read/write – sharded/partitioned architecture
• Horizontally scalable on commodity hardware (or cloud)
• Serves as system of record with querying & transaction semantics
• Requires modifying your application's data access layer
16. In-Memory Computing 101
In-Memory Database (scale-up system of record):
• Read/write scalability
• Drop-in SQL database replacement
• Often lacks horizontal scalability (joins)
• Requires replacing your database
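To make the "limited to ID-based reads" point concrete, here is a plain-Scala illustration (not any vendor's API): a partitioned cache behaves like a distributed key-value map, so anything beyond get-by-key degenerates into a scan, which is exactly what a data grid's query semantics and indexes are there to avoid.

import scala.collection.concurrent.TrieMap

case class Order(id: String, city: String, amount: Double)

object CacheVsGrid {
  // Stand-in for a partitioned cache: strictly key -> value.
  val cache = TrieMap[String, Order]()

  def main(args: Array[String]): Unit = {
    cache.put("o1", Order("o1", "NY", 120.0))
    cache.put("o2", Order("o2", "SF", 80.0))

    // Cache-style access: fast, but only by primary key.
    val byId = cache.get("o1")

    // "Query-style" access over a plain cache is a full scan; a grid
    // would index city and evaluate the filter inside each partition.
    val nyTotal = cache.values.filter(_.city == "NY").map(_.amount).sum

    println(s"$byId, NY total = $nyTotal")
  }
}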
30. 30
Approach: just an IMDB thing… shove it all in one "Big Iron"?
Challenge:
● Nope: your data sources and applications are often distributed.
● In-memory or not, these databases aren't built for horizontal scale-out.
31. 31
Approach: one large In-Memory Data Grid to rule them all?
Challenge:
● Not when your apps require polyglot analytics
● Unless you want to write ML algorithms, MDX engines, etc. from scratch
32. 32
What we needed:
• A low-latency, scale-out in-memory data grid
• A large-scale distributed analytics framework
• Maximize data-analytics locality
• Minimize application latency
33. 33
Our approach to HTAP: a low-latency, scale-out in-memory data grid + a large-scale distributed analytics framework
41. 41
Data Grid RDD: resilient distributed dataset
• List of parent RDDs – empty
• An array of partitions the dataset is divided into – an IMDG distributed query returns the partitions and their hosts
• A compute function that runs on each partition – an iterator over that partition's portion of the data
• Optional preferred locations, i.e. hosts for a partition where the data will be loaded – the hosts from the distributed query
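The four ingredients above map directly onto Spark's RDD extension points. Below is a simplified, hypothetical skeleton (not the actual GigaSpaces implementation; gridLookup and readGridPartition stand in for calls into the grid's client API):

import scala.reflect.ClassTag

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// One Spark partition per grid partition, remembering the grid host.
case class GridPartition(index: Int, host: String) extends Partition

class DataGridRDD[T: ClassTag](
    sc: SparkContext,
    gridLookup: () => Seq[(Int, String)],         // partition id -> host
    readGridPartition: Int => Iterator[T])        // data for one partition
  extends RDD[T](sc, Nil) {                       // Nil: no parent RDDs

  // Partitions come from a distributed query against the grid.
  override protected def getPartitions: Array[Partition] =
    gridLookup().map { case (id, host) => GridPartition(id, host) }.toArray

  // Compute: iterate over the slice of grid data backing this partition.
  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    readGridPartition(split.index)

  // Preferred location: the node hosting the grid partition (data locality).
  override protected def getPreferredLocations(split: Partition): Seq[String] =
    Seq(split.asInstanceOf[GridPartition].host)
}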
42. 42
Data Grid RDD: one-to-one partition mapping
[Diagram: on each of three nodes, a Spark executor holds Spark Partition #N with a direct connection to Grid Partition #N on the same node.]
Simple, but not enough parallelism for Spark.
44. 44
Grid DataFrames: predicate pushdown & column pruning
SELECT SUM(amount) FROM order WHERE city = 'NY' AND year > 2012
The aggregation runs in Spark; the filtering and column pruning run in the data grid.
Spark SQL architecture (implementing the DataSource API):
• Pushing down predicates to the data grid
• Leveraging indexes
• Transparent to the user
• Enabling support for other languages - Python/R
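On the DataFrame side, the Data Source API (the Spark 1.x-era sources API) exposes exactly these hooks. A sketch of the pushdown contract, with gridScan standing in for a hypothetical call into the grid:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.StructType

// Spark calls buildScan with only the columns the query needs (pruning)
// and the WHERE predicates it is able to push down (filters).
class GridRelation(
    override val sqlContext: SQLContext,
    override val schema: StructType,
    gridScan: (Array[String], Array[Filter]) => RDD[Row]) // hypothetical grid call
  extends BaseRelation with PrunedFilteredScan {

  override def buildScan(requiredColumns: Array[String],
                         filters: Array[Filter]): RDD[Row] =
    // The grid evaluates the filters (ideally against its indexes) and
    // returns only the requested columns; Spark then runs the aggregation.
    gridScan(requiredColumns, filters)
}

For the query above, Spark would invoke buildScan with requiredColumns = Array("amount") and filters containing EqualTo("city", "NY") and GreaterThan("year", 2012); the SUM itself still runs in Spark.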
54. 54
In-Process HTAP
• Read any POJO, JSON document, or transaction as a DataFrame or RDD
• Web services/apps can read any DataFrame as a POJO
• True closed-loop analytics data pipeline

@SpaceClass
public class Product {
    private String name;
    private String brand;
    private Integer quantity;
    // …
}
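A usage sketch of that closed loop (the "grid" format name and the collection option are illustrative assumptions, not the actual InsightEdge API):

import org.apache.spark.sql.SparkSession

object GridDataFrames {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("grid-df").master("local[*]").getOrCreate()

    // Hypothetical: a data source registered under the short name "grid",
    // pointed at the space holding the Product objects above.
    val products = spark.read
      .format("grid")                   // assumed format name
      .option("collection", "Product")  // assumed option
      .load()

    // Writing results back to the grid closes the loop: web services
    // reading the same space see them as plain POJOs.
    products.groupBy("brand").sum("quantity")
      .write.format("grid").option("collection", "BrandTotals").save()

    spark.stop()
  }
}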
55. 55
Point-of-Decision HTAP: XAP + InsightEdge deployed on different grid clusters with bi-directional real-time data replication.
[Diagram: a Transactions grid and an Analytics grid linked by real-time replication; the analytics side runs scoring models, triggers actions, and emits events.]
56. 56
Case Study: Fleet Geo-analytics
Challenge:
• Stream data from thousands of taxis
• Actively monitor and generate real-time notifications
• Real-time route optimization and geo-fencing
Solution:
• Leverage a unified in-memory data fabric as middleware for geo-spatial analytics
• Elastically scale stream processing and transactional apps together
• Location-based tracking, geo-fencing
[Diagram: edge components feeding data sources into the platform.]
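To illustrate the geo-fencing step, here is a minimal, self-contained sketch (illustrative only, not the production pipeline) that flags taxis outside a circular fence using the haversine distance:

object GeoFence {
  case class Ping(taxiId: String, lat: Double, lon: Double)

  // Great-circle distance in kilometers (haversine formula).
  def distanceKm(lat1: Double, lon1: Double, lat2: Double, lon2: Double): Double = {
    val earthRadiusKm = 6371.0
    val dLat = math.toRadians(lat2 - lat1)
    val dLon = math.toRadians(lon2 - lon1)
    val a = math.pow(math.sin(dLat / 2), 2) +
      math.cos(math.toRadians(lat1)) * math.cos(math.toRadians(lat2)) *
        math.pow(math.sin(dLon / 2), 2)
    2 * earthRadiusKm * math.asin(math.sqrt(a))
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical fence: 2 km around Times Square.
    val (fenceLat, fenceLon, radiusKm) = (40.7580, -73.9855, 2.0)
    val pings = Seq(Ping("taxi-7", 40.7614, -73.9776), Ping("taxi-9", 40.6413, -73.7781))

    pings.foreach { p =>
      if (distanceKm(p.lat, p.lon, fenceLat, fenceLon) > radiusKm)
        println(s"ALERT: ${p.taxiId} left the fence") // would raise a grid event
    }
  }
}

In an architecture like the one described, such a check could run as an event handler co-located with the grid partition holding the taxi's state, so notifications fire without an extra network hop.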