Hybrid Transactional/Analytics Processing
with Spark and In-Memory Data Grids
Copyright © GigaSpaces 2016. All rights reserved.
Ali Hodroj
VP, Products and Strategy
About me
• Vice President, Products and Strategy @ GigaSpaces
• (ex) Director of Solutions Architecture
• @ahodroj
About GigaSpaces
Direct customers
New York, NY
Do we need to bridge online transaction processing with real-time operational intelligence?
Modern applications: the line is blurred between…
Transactional Analytical
Essential to operate the
Turning data into value:
insights, diagnosis, decision
Stories from the real #enterprise
Omni-Channel Retail
Internet of Things
Minimize Latency + Strong Consistency
Maximize Data-Analytics Locality
There’s a name for it...
Data Store?
In-Memory Computing 101
Distributed Cache (partitioned cache)
• Increased capacity
• No support for write-heavy scenarios
• Limited to ID-based reads
• Reads are the only part of the latency path
In-Memory Data Grid (scale-out system of record)
• Heavy read/write: sharded/partitioned architecture
• Horizontally scalable on commodity HW (or cloud)
• Serves as system of record with querying & transactions
• Requires modifying your application’s data access layer
In-Memory Database (scale-up system of record)
• Read/write scalability
• Drop-in SQL database replacement
• Often lacks horizontal scalability (joins)
• Requires replacing your database
IMDG Data Models
public class Product {
    private String name;
    private String brand;
    private Integer quantity;
    // …
}
IMDG Data Placement – Fixed Hashing
hash(key) % #nodes
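The routing rule on this slide can be sketched in a few lines of Java. This is only an illustration of `hash(key) % #nodes`, not GigaSpaces code: the class name `FixedHashRouter` and the use of `hashCode()` as the hash function are assumptions made for the example.

```java
import java.util.List;

// Illustrative sketch of fixed-hash routing: partition = hash(key) % #nodes.
// FixedHashRouter is a hypothetical name, not part of any product API.
final class FixedHashRouter {
    private final int partitionCount;

    FixedHashRouter(int partitionCount) {
        if (partitionCount <= 0) {
            throw new IllegalArgumentException("partitionCount must be positive");
        }
        this.partitionCount = partitionCount;
    }

    // Math.floorMod keeps the result in [0, partitionCount) even when
    // hashCode() is negative, which the plain % operator would not.
    int partitionFor(Object routingKey) {
        return Math.floorMod(routingKey.hashCode(), partitionCount);
    }

    public static void main(String[] args) {
        FixedHashRouter router = new FixedHashRouter(3);
        for (String key : List.of("product-1", "product-2", "product-3")) {
            System.out.println(key + " -> partition " + router.partitionFor(key));
        }
    }
}
```

Because the divisor is fixed, the same key always lands on the same partition, which is what lets reads and writes for one key go straight to a single node.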
IMDG Fixed Hashing - HA
hash(key) % #nodes
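One way to picture the HA variant is a table that maps each partition to a primary node and a backup node. The sketch below is an assumption-laden model: the class `HaPartitionTable`, the round-robin placement of backups on the next node, and the use of `-1` for "no backup" are all invented for illustration. It shows that key routing is unaffected by a node failure, since only the node hosting the partition changes.

```java
// Illustrative model of fixed hashing with one hot backup per partition.
// All names and the placement policy are hypothetical.
final class HaPartitionTable {
    private final int[] primaryNode;
    private final int[] backupNode;

    HaPartitionTable(int partitions, int nodes) {
        if (partitions <= 0 || nodes < 2) {
            throw new IllegalArgumentException("need >= 1 partition and >= 2 nodes");
        }
        primaryNode = new int[partitions];
        backupNode = new int[partitions];
        for (int p = 0; p < partitions; p++) {
            primaryNode[p] = p % nodes;
            backupNode[p] = (p + 1) % nodes; // differs from the primary when nodes >= 2
        }
    }

    // Routing depends only on the fixed partition count, never on node health.
    int partitionFor(Object key) {
        return Math.floorMod(key.hashCode(), primaryNode.length);
    }

    int primaryOf(int partition) { return primaryNode[partition]; }

    int backupOf(int partition) { return backupNode[partition]; }

    // Simulate a node failure: promote the backup of every partition whose
    // primary was on the failed node; -1 marks a partition left without backup.
    void failNode(int failedNode) {
        for (int p = 0; p < primaryNode.length; p++) {
            if (primaryNode[p] == failedNode) {
                primaryNode[p] = backupNode[p];
                backupNode[p] = -1;
            } else if (backupNode[p] == failedNode) {
                backupNode[p] = -1;
            }
        }
    }

    public static void main(String[] args) {
        HaPartitionTable table = new HaPartitionTable(3, 3);
        table.failNode(1);
        for (int p = 0; p < 3; p++) {
            System.out.println("partition " + p + ": primary=" + table.primaryOf(p)
                    + " backup=" + table.backupOf(p));
        }
    }
}
```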
In-Memory Data Grids: How it works
• The database goes to the background
• Partition your data and store it in memory
• Partitioned, co-located in-memory
• Business logic, data & messaging co-located & partitioned into processing units
• Hot backup for each partition for high availability
• Host your web application on the XAP infrastructure
• Auto-scale out & in based on real-time performance & load
Host: Cisco UCS Server
CPU: Intel 16-core 2.9GHz
Concurrent Threads: 2
Throughput: 200, 400, 800 ops/sec
Approach: Just an IMDB thing… Shove it all in one “Big Iron”?
Challenge:
● Nope: your data sources and applications are often distributed.
● In-memory or not, these databases aren’t built for horizontal scale-out.
Approach: One large In-Memory Data Grid to rule them all
Challenge:
● Not when your apps require polyglot
● Unless you want to write ML algorithms, MDX engines, etc. from scratch
What we needed
Low-latency Scale-Out In-Memory Data Grid
Large-scale distributed analytics framework
Maximize Data-Analytics Locality
Application Latency
Our approach to HTAP
Low-latency Scale-Out In-Memory Data Grid
Large-scale distributed analytics framework
So why did we bet on Spark?
• Unified & Concise API
• Highly Flexible Data Store
• Massive Community and
But Spark is
Spark is caching over <insert your data store>, not an in-memory system of record
How does
In-Memory Data Grid
In-Memory Store(RAM) Flash, SSD, Off-Heap Store
Spark Spark SQL
Machine Learning
InsightEdge Core
Building out the driver
Transactional Tier
Strong Consistency
Analytics Tier
Data Grid + Spark Deployment Layout
node 1: Spark master
node 2: Spark worker
node 3: Spark worker
Data Grid RDD: resilient distributed dataset
• List of parent RDDs – Empty
• An array of partitions that a dataset is divided into – IMDG Distributed Query to get partitions and their hosts
• A compute function to do a computation on partitions – Iterator over portion of data
• Optional preferred locations, i.e. hosts for a partition where the data will be loaded – hosts from Distributed Query
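The four ingredients listed for the Data Grid RDD can be modelled in plain Java to make their roles concrete. This is a standalone sketch, not Spark or InsightEdge code. In Spark itself these roles correspond to the `RDD` members `getDependencies`, `getPartitions`, `compute` and `getPreferredLocations`. The classes `GridPartition` and `GridRddModel`, and the in-memory map standing in for the grid, are assumptions for illustration.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// A partition descriptor: which slice of the grid, and which host serves it.
final class GridPartition {
    final int index;
    final String host;

    GridPartition(int index, String host) {
        this.index = index;
        this.host = host;
    }
}

// Hypothetical plain-Java model of the four RDD ingredients.
final class GridRddModel {
    // Stand-in for the data grid: partition index -> rows held by that partition.
    private final Map<Integer, List<String>> rowsByPartition;
    // In the deck this list would come from a distributed query against the grid.
    private final List<GridPartition> partitions;

    GridRddModel(Map<Integer, List<String>> rowsByPartition, List<GridPartition> partitions) {
        this.rowsByPartition = rowsByPartition;
        this.partitions = partitions;
    }

    // 1. Parent RDDs: empty, because the data comes straight from the grid.
    List<Object> parents() {
        return List.of();
    }

    // 2. The partitions the dataset is divided into.
    List<GridPartition> partitions() {
        return partitions;
    }

    // 3. Compute: an iterator over the portion of data behind one partition.
    Iterator<String> compute(GridPartition split) {
        return rowsByPartition.getOrDefault(split.index, List.of()).iterator();
    }

    // 4. Preferred locations: the host that already holds the partition's data.
    List<String> preferredLocations(GridPartition split) {
        return List.of(split.host);
    }

    public static void main(String[] args) {
        GridRddModel rdd = new GridRddModel(
                Map.of(0, List.of("a", "b"), 1, List.of("c")),
                List.of(new GridPartition(0, "node1"), new GridPartition(1, "node2")));
        for (GridPartition p : rdd.partitions()) {
            System.out.println("partition " + p.index + " prefers " + rdd.preferredLocations(p));
        }
    }
}
```

Returning the grid host as the preferred location is what gives the scheduler a chance to run each task next to its data.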
Data Grid RDD: one-to-one partition
node 1: Spark executor, Partition #1
node 2: Spark executor, Partition #2
node 3: Spark executor, Partition #3
Simple, but not enough for Spark
Data Grid RDD: with bucketing
node 1: Spark Executor, Grid Primary #1
Partition #1, Partition #2
1 Spark partition = M grid buckets
1 Grid partition = N Spark partitions
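The bucketing relation (1 Spark partition = M grid buckets, 1 grid partition = N Spark partitions) can be illustrated by cutting each grid partition's bucket ids into N contiguous ranges. The sketch below is a guess at the shape of such a split, not the InsightEdge implementation. `BucketRange`, `BucketedSplitter` and the idea of numbering buckets `0..B-1` inside every grid partition are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// One Spark partition: a contiguous range of bucket ids inside one grid partition.
final class BucketRange {
    final int gridPartition;
    final int fromBucket;        // inclusive
    final int toBucketExclusive; // exclusive

    BucketRange(int gridPartition, int fromBucket, int toBucketExclusive) {
        this.gridPartition = gridPartition;
        this.fromBucket = fromBucket;
        this.toBucketExclusive = toBucketExclusive;
    }
}

// Hypothetical helper that derives Spark partitions from grid partitions.
final class BucketedSplitter {
    static List<BucketRange> split(int gridPartitions, int bucketsPerGridPartition,
                                   int sparkPartitionsPerGridPartition) {
        int b = bucketsPerGridPartition;
        int n = sparkPartitionsPerGridPartition;
        if (gridPartitions <= 0 || n <= 0 || b < n) {
            throw new IllegalArgumentException("need gridPartitions > 0 and 0 < N <= buckets");
        }
        List<BucketRange> result = new ArrayList<>();
        for (int g = 0; g < gridPartitions; g++) {
            for (int s = 0; s < n; s++) {
                // Integer arithmetic spreads any remainder across the ranges,
                // so every bucket is covered exactly once.
                int from = (int) ((long) s * b / n);
                int to = (int) ((long) (s + 1) * b / n);
                result.add(new BucketRange(g, from, to));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // 2 grid partitions, 128 buckets each, 4 Spark partitions per grid partition.
        for (BucketRange r : split(2, 128, 4)) {
            System.out.println("grid " + r.gridPartition + ": buckets ["
                    + r.fromBucket + ", " + r.toBucketExclusive + ")");
        }
    }
}
```

With this layout the number of Spark tasks can exceed the number of grid partitions, which addresses the "simple, but not enough for Spark" limitation of the one-to-one mapping.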
Grid DataFrames: predicate pushdown & column pruning
Aggregation in Spark
Filtering and column pruning in Data Grid
SELECT SUM(amount)
FROM order
WHERE city = 'NY' AND year > 2012
Spark SQL architecture:
• Pushing down predicates to Data Grid
• Leveraging indexes
• Transparent to user
• Enabling support for other languages - Python/R
Implementing DataSource API
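The division of labour for the example query (filter and prune in the grid, aggregate in Spark) can be shown with a small standalone model. It is not the InsightEdge data source. In Spark the hook for this is the `PrunedFilteredScan` interface of the Data Source API, whose `buildScan` receives the required columns and the filters. Everything named below (`GridScanSketch`, `scan`, `sumAmount`, `order`, the `customer` column) is hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical model of grid-side filtering/pruning plus Spark-side aggregation.
final class GridScanSketch {
    // "Grid side": apply the pushed-down predicate and keep only the requested
    // columns, so less data crosses the wire to the analytics tier.
    static List<Map<String, Object>> scan(List<Map<String, Object>> rows,
                                          Predicate<Map<String, Object>> filter,
                                          List<String> requiredColumns) {
        List<Map<String, Object>> shipped = new ArrayList<>();
        for (Map<String, Object> row : rows) {
            if (!filter.test(row)) {
                continue;
            }
            Map<String, Object> pruned = new LinkedHashMap<>();
            for (String column : requiredColumns) {
                pruned.put(column, row.get(column));
            }
            shipped.add(pruned);
        }
        return shipped;
    }

    // "Spark side": the aggregation runs on the already reduced rows.
    static double sumAmount(List<Map<String, Object>> rows) {
        double total = 0.0;
        for (Map<String, Object> row : rows) {
            total += ((Number) row.get("amount")).doubleValue();
        }
        return total;
    }

    // Convenience builder for a sample order row with one unused column.
    static Map<String, Object> order(String city, int year, double amount) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("city", city);
        row.put("year", year);
        row.put("amount", amount);
        row.put("customer", "n/a");
        return row;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> orders = List.of(
                order("NY", 2013, 10.0), order("NY", 2011, 5.0), order("SF", 2014, 7.0));
        // WHERE city = 'NY' AND year > 2012, selecting only the amount column
        List<Map<String, Object>> shipped = scan(orders,
                r -> "NY".equals(r.get("city")) && ((Integer) r.get("year")) > 2012,
                List.of("amount"));
        System.out.println("rows shipped: " + shipped.size() + ", sum: " + sumAmount(shipped));
    }
}
```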
Traditional Spark filtering of 7MM records
Grid-side + Spark filtering of 7MM records
Eventually, we productized this as an open source Spark distribution
Apache 2 License
GigaSpaces InsightEdge
High Performance Spark with OLTP Capabilities
Spark GeoSpatial SQL and Data Frames
• Multi-Spark Replication / Federated Clusters
In-Memory Replication across local or wide area networks
Upcoming: Spark RDD/DF native read/save on Off-Heap (SSD/Flash/Direct Buffers)
In Memory Data Grid
Spark worker Spark worker
• Significant RAM TCO reduction in Spark clusters
• Direct RDD/DataFrame read/write from SSD/Flash device
• Avoid filesystem hops and write amplification
In-Process HTAP
Read any POJO, JSON Document, or Transaction as a DataFrame or RDD
Web services/apps can read any DataFrame as POJO
True closed-loop analytics data pipeline
public class Product {
    private String name;
    private String brand;
    private Integer quantity;
    // …
}
In-Memory Data Grid
Realtime Replication
• Scoring models
• Trigger actions
• Events
Transactions Analytics
Point of Decision HTAP
XAP + InsightEdge deployed on different grid clusters with bi-directional real-time data replication
Case Study: Fleet Geo-analytics
• Stream data from 1,000s of Taxis
• Actively monitor and generate real-time notifications
• Real-time Route Optimization and Geo-Fencing
• Leverage unified in-memory data fabric as middleware for geo-spatial analytics
• Elastically scale stream processing and transactional apps
• Location-based tracking, Geo-fencing
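As a rough idea of what a geo-fencing check involves, the sketch below tests whether a vehicle position falls inside a circular fence using the haversine great-circle distance. The deck does not say how the case study implemented geo-fencing, so the circular fence shape, the class name `GeoFence` and the sample coordinates are all assumptions.

```java
// Hypothetical circular geo-fence check based on the haversine distance.
final class GeoFence {
    private static final double EARTH_RADIUS_KM = 6371.0; // mean Earth radius

    private final double centerLat;
    private final double centerLon;
    private final double radiusKm;

    GeoFence(double centerLat, double centerLon, double radiusKm) {
        this.centerLat = centerLat;
        this.centerLon = centerLon;
        this.radiusKm = radiusKm;
    }

    // Great-circle distance in kilometres between two points given in degrees.
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    // True when the position lies within the fence radius of the centre.
    boolean contains(double lat, double lon) {
        return haversineKm(centerLat, centerLon, lat, lon) <= radiusKm;
    }

    public static void main(String[] args) {
        // Sample fence: 5 km around a point in midtown Manhattan (illustrative only).
        GeoFence fence = new GeoFence(40.7580, -73.9855, 5.0);
        System.out.println("nearby taxi inside: " + fence.contains(40.7590, -73.9845));
        System.out.println("airport taxi inside: " + fence.contains(40.6413, -73.7781));
    }
}
```

In a streaming setup, a check like this would run per position update, with a notification emitted when a vehicle crosses the fence boundary.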
Edge components
Data Sources

BitLocker Data Recovery | BLR Tools Data Recovery Solutions
BitLocker Data Recovery | BLR Tools Data Recovery SolutionsBitLocker Data Recovery | BLR Tools Data Recovery Solutions
BitLocker Data Recovery | BLR Tools Data Recovery Solutions
Mastering MicroStation DGN: How to Integrate CAD and GIS
Mastering MicroStation DGN: How to Integrate CAD and GISMastering MicroStation DGN: How to Integrate CAD and GIS
Mastering MicroStation DGN: How to Integrate CAD and GIS
04. Ruby Operators Slides - Ruby Core Teaching
04. Ruby Operators Slides - Ruby Core Teaching04. Ruby Operators Slides - Ruby Core Teaching
04. Ruby Operators Slides - Ruby Core Teaching
Learning Rust with Advent of Code 2023 - Princeton
Learning Rust with Advent of Code 2023 - PrincetonLearning Rust with Advent of Code 2023 - Princeton
Learning Rust with Advent of Code 2023 - Princeton
240717 ProPILE - Probing Privacy Leakage in Large Language Models.pdf
240717 ProPILE - Probing Privacy Leakage in Large Language Models.pdf240717 ProPILE - Probing Privacy Leakage in Large Language Models.pdf
240717 ProPILE - Probing Privacy Leakage in Large Language Models.pdf
05. Ruby Control Structures - Ruby Core Teaching
05. Ruby Control Structures - Ruby Core Teaching05. Ruby Control Structures - Ruby Core Teaching
05. Ruby Control Structures - Ruby Core Teaching
Tube Magic Software | Youtube Software | Best AI Tool For Growing Youtube Cha...
Tube Magic Software | Youtube Software | Best AI Tool For Growing Youtube Cha...Tube Magic Software | Youtube Software | Best AI Tool For Growing Youtube Cha...
Tube Magic Software | Youtube Software | Best AI Tool For Growing Youtube Cha...
How to Secure Your Kubernetes Software Supply Chain at Scale
How to Secure Your Kubernetes Software Supply Chain at ScaleHow to Secure Your Kubernetes Software Supply Chain at Scale
How to Secure Your Kubernetes Software Supply Chain at Scale
How Generative AI is Shaping the Future of Software Application Development
How Generative AI is Shaping the Future of Software Application DevelopmentHow Generative AI is Shaping the Future of Software Application Development
How Generative AI is Shaping the Future of Software Application Development
BDRSuite - #1 Cost effective Data Backup and Recovery Solution
BDRSuite - #1 Cost effective Data Backup and Recovery SolutionBDRSuite - #1 Cost effective Data Backup and Recovery Solution
BDRSuite - #1 Cost effective Data Backup and Recovery Solution
New York University degree Cert offer diploma Transcripta
New York University degree Cert offer diploma Transcripta New York University degree Cert offer diploma Transcripta
New York University degree Cert offer diploma Transcripta
Crowd Strike\Windows Update Issue: Overview and Current Status
Crowd Strike\Windows Update Issue: Overview and Current StatusCrowd Strike\Windows Update Issue: Overview and Current Status
Crowd Strike\Windows Update Issue: Overview and Current Status
Fix Production Bugs Quickly - The Power of Structured Logging in Ruby on Rail...
Fix Production Bugs Quickly - The Power of Structured Logging in Ruby on Rail...Fix Production Bugs Quickly - The Power of Structured Logging in Ruby on Rail...
Fix Production Bugs Quickly - The Power of Structured Logging in Ruby on Rail...
01. Ruby Introduction - Ruby Core Teaching
01. Ruby Introduction - Ruby Core Teaching01. Ruby Introduction - Ruby Core Teaching
01. Ruby Introduction - Ruby Core Teaching
08. Ruby Enumerable - Ruby Core Teaching
08. Ruby Enumerable - Ruby Core Teaching08. Ruby Enumerable - Ruby Core Teaching
08. Ruby Enumerable - Ruby Core Teaching
09. Ruby Object Oriented Programming - Ruby Core Teaching
09. Ruby Object Oriented Programming - Ruby Core Teaching09. Ruby Object Oriented Programming - Ruby Core Teaching
09. Ruby Object Oriented Programming - Ruby Core Teaching
06. Ruby Array & Hash - Ruby Core Teaching
06. Ruby Array & Hash - Ruby Core Teaching06. Ruby Array & Hash - Ruby Core Teaching
06. Ruby Array & Hash - Ruby Core Teaching
UW Cert degree offer diploma
UW Cert degree offer diploma UW Cert degree offer diploma
UW Cert degree offer diploma

Hybrid Transactional/Analytics Processing with Spark and IMDGs

  • 1. Hybrid Transactional/Analytics Processing with Spark and In-Memory Data Grids. Ali Hodroj, VP, Products and Strategy. Copyright © GigaSpaces 2016. All rights reserved.
  • 2. About me • Vice President, Products and Strategy @ GigaSpaces • (ex) Director of Solutions Architecture • Blogging at • @ahodroj • Email: • Slides at
  • 4. Do we need to bridge online transaction processing with real-time operational intelligence?
  • 5. Modern applications: the line is blurred between… Transactional (essential to operate the business) & Analytical (turning data into value: insights, diagnosis, decision making)
  • 6. Stories from the real #enterprise world...
  • 10. Goal: Minimize Latency + Strong Consistency; Maximize Data-Analytics Locality
  • 11. There’s a name for it...
  • 14. In-Memory Computing 101. Distributed Cache (partitioned cache nodes): increased capacity; no support for write-heavy scenarios; limited to ID-based reads; reads are the only part of the latency path it addresses. In-Memory Data Grid: scale-out system of record. In-Memory Database: scale-up system of record.
  • 15. In-Memory Computing 101. In-Memory Data Grid (scale-out system of record): heavy read/write on a sharded/partitioned architecture; horizontally scalable on commodity HW (or cloud); serves as system of record with querying & transaction semantics; requires modifying your application’s data access layer. Distributed Cache: partitioned cache nodes. In-Memory Database: scale-up system of record.
  • 16. In-Memory Computing 101. In-Memory Database (scale-up system of record): read/write scalability; drop-in SQL database replacement; often lacks horizontal scalability (joins); requires replacing your database. Distributed Cache: partitioned cache nodes. In-Memory Data Grid: scale-out system of record.
  • 17. IMDG Data Models @SpaceClass public class Product { private String name; private String brand; private Integer quantity; // … }
  • 18. IMDG Data Placement – Fixed Hashing hash(key) % #nodes
  • 19. IMDG Fixed Hashing - HA hash(key) % #nodes
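The fixed-hashing rule on slides 18 and 19 can be sketched in a few lines of plain Java. This is a minimal illustration, not the XAP routing code: the class name `FixedHashRouter` and the "backup lives on the next node" placement are assumptions made here only to show how `hash(key) % #nodes` picks a primary and how a hot backup can sit on a different node.

```java
/** Sketch of fixed-hash routing: partition = hash(key) % #nodes. */
final class FixedHashRouter {
    private final int partitionCount;

    FixedHashRouter(int partitionCount) {
        if (partitionCount <= 0) {
            throw new IllegalArgumentException("partitionCount must be positive");
        }
        this.partitionCount = partitionCount;
    }

    /** Primary partition (0-based) for a routing key; floorMod keeps the result non-negative. */
    int primaryFor(Object key) {
        return Math.floorMod(key.hashCode(), partitionCount);
    }

    /**
     * Hypothetical backup placement (an assumption, not from the slides):
     * the backup of a partition is hosted on the next node in the ring.
     * With a single node the backup necessarily coincides with the primary.
     */
    int backupFor(Object key) {
        return (primaryFor(key) + 1) % partitionCount;
    }

    public static void main(String[] args) {
        FixedHashRouter router = new FixedHashRouter(3);
        String[] keys = {"product-1", "product-2", "product-3"};
        for (String key : keys) {
            System.out.println(key + " -> primary " + router.primaryFor(key)
                    + ", backup " + router.backupFor(key));
        }
    }
}
```

Because the node count is part of the formula, changing the number of partitions remaps most keys, which is why the count is treated as fixed in this scheme.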
  • 20. In-Memory Data Grids: How it works
  • 21. In-Memory Data Grids: How it works. The database goes to the background; partition your data and store it in memory
  • 22. In-Memory Data Grids: How it works. Partitioned, co-located in-memory messaging
  • 23. In-Memory Data Grids: How it works. Business logic, data & messaging co-located & partitioned into processing units
  • 24. In-Memory Data Grids: How it works. Hot backup for each partition for high availability
  • 25. In-Memory Data Grids: How it works. Host your web application on the XAP infrastructure
  • 26. In-Memory Data Grids: How it works. Auto-scale out & in based on real-time performance & load
  • 27. In-Memory Data Grids: How it works
  • 28. Host: Cisco UCS Server. CPU: Intel 16-core 2.9GHz. Concurrent Threads: 2. Throughput: 200, 400, 800 ops/sec
  • 30. Approach: Just an IMDB thing… Shove it all in one “Big Iron”? Challenge: ● Nope: your data sources and applications are often distributed. ● In-memory or not, these databases aren’t built for horizontal scale-out
  • 31. Approach: One large In-Memory Data Grid to rule them all? Challenge: ● Not when your apps require polyglot analytics ● Unless you want to write ML algorithms, MDX engines, etc. from scratch
  • 32. What we needed: Low-latency Scale-Out In-Memory Data Grid; Large-scale distributed analytics framework; Maximize Data-Analytics Locality; Minimize Application Latency
  • 33. Our approach to HTAP: Low-latency Scale-Out In-Memory Data Grid + Large-scale distributed analytics framework
  • 34. So why did we bet on Spark?
  • 35. • Unified & Concise API • Highly Flexible Data Store Integration • Massive Community and Adoption
  • 37. Spark is caching over <insert your data store>, not an in-memory system of record
  • 38. How does Apache Spark fit into this?
  • 39. Building out the driver. Analytics Tier: Spark (Spark SQL, Spark Streaming, Machine Learning). InsightEdge Core. Transactional Tier (ACID-compliant, strong consistency): In-Memory Data Grid with In-Memory Store (RAM) and Flash, SSD, Off-Heap Store. High Availability; Security & Management
  • 40. Data Grid + Spark Deployment Layout. node 1: Spark master, Grid master. node 2: Spark worker, Grid worker. node 3: Spark worker, Grid worker
  • 41. Data Grid RDD: resilient distributed dataset • List of parent RDDs – Empty • An array of partitions that a dataset is divided into – IMDG Distributed Query to get partitions and their hosts • A compute function to do a computation on partitions – Iterator over portion of data • Optional preferred locations, i.e. hosts for a partition where the data will be loaded – hosts from Distributed Query
  • 42. Data Grid RDD: one-to-one partition. node 1: Spark executor, Spark Partition #1, direct connection to Grid Partition #1. node 2: Spark executor, Spark Partition #2, Grid Partition #2. node 3: Spark executor, Spark Partition #3, Grid Partition #3. Simple, but not enough parallelism for Spark
  • 43. Data Grid RDD: with bucketing. node 1: Spark Executor, Grid Primary #1 divided into buckets 0 to 1023, read by Spark Partition #1 and Spark Partition #2. 1 Spark partition = M grid buckets; 1 Grid partition = N Spark partitions
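The bucketing relationship on slide 43 (1 Spark partition = M grid buckets, 1 grid partition = N Spark partitions) can be made concrete with a small planner. This is a sketch under assumptions, not the InsightEdge implementation: the names `BucketPlanner` and `SparkSplit`, and the choice of contiguous half-open bucket ranges, are hypothetical. Grid partitions are numbered from 0 here, whereas the slides number them from #1.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: split each grid partition's bucket space into several Spark partitions. */
final class BucketPlanner {

    /** One Spark partition: a half-open bucket range [fromBucket, toBucket) on one grid partition. */
    static final class SparkSplit {
        final int gridPartition;
        final int fromBucket;
        final int toBucket;

        SparkSplit(int gridPartition, int fromBucket, int toBucket) {
            this.gridPartition = gridPartition;
            this.fromBucket = fromBucket;
            this.toBucket = toBucket;
        }

        @Override
        public String toString() {
            return "grid#" + gridPartition + " buckets [" + fromBucket + ", " + toBucket + ")";
        }
    }

    /** Returns N splits per grid partition, each covering roughly bucketCount / N buckets. */
    static List<SparkSplit> plan(int gridPartitions, int splitsPerGridPartition, int bucketCount) {
        if (gridPartitions <= 0 || splitsPerGridPartition <= 0
                || splitsPerGridPartition > bucketCount) {
            throw new IllegalArgumentException(
                    "need at least one grid partition and 0 < splits <= buckets");
        }
        List<SparkSplit> splits = new ArrayList<>();
        for (int g = 0; g < gridPartitions; g++) {
            for (int s = 0; s < splitsPerGridPartition; s++) {
                // Integer division spreads the remainder, so range sizes differ by at most one bucket.
                int from = (int) ((long) s * bucketCount / splitsPerGridPartition);
                int to = (int) ((long) (s + 1) * bucketCount / splitsPerGridPartition);
                splits.add(new SparkSplit(g, from, to));
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        // 3 grid partitions, 4 Spark partitions each, 1024 buckets per grid partition.
        for (SparkSplit split : plan(3, 4, 1024)) {
            System.out.println(split);
        }
    }
}
```

Each split would translate into a grid query restricted to its bucket range, with the host of the grid partition reported as the preferred location, so Spark gets N times more parallelism than the one-to-one layout without giving up locality.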
  • 44. Grid DataFrames: predicate pushdown & column pruning, by implementing the DataSource API. Example: SELECT SUM(amount) FROM order WHERE city = 'NY' AND year > 2012 (aggregation in Spark; filtering and column pruning in the Data Grid). Spark SQL architecture: • Pushing down predicates to Data Grid • Leveraging indexes • Transparent to user • Enabling support for other languages (Python/R)
  • 45. Push-down predicates performance: traditional Spark filtering of 7MM records, 31 seconds, vs. grid-side + Spark filtering of 7MM records, 1 second
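The division of labor on slide 44 can be illustrated without Spark at all. The sketch below is a plain-Java model, assuming a hypothetical `Order` type and in-memory lists standing in for grid partitions; it does not use the Spark DataSource API. The grid side evaluates the WHERE clause and returns only the `amount` column, and the analytics side only sums what it receives.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch of: SELECT SUM(amount) FROM order WHERE city = 'NY' AND year > 2012. */
final class PushdownSketch {

    /** Hypothetical row type; the field names mirror the columns in the slide's query. */
    static final class Order {
        final String city;
        final int year;
        final double amount;

        Order(String city, int year, double amount) {
            this.city = city;
            this.year = year;
            this.amount = amount;
        }
    }

    /** Grid side: evaluate the predicate locally and ship back only the amount column. */
    static List<Double> scanPartition(List<Order> partition, String city, int afterYear) {
        List<Double> amounts = new ArrayList<>();
        for (Order order : partition) {
            if (order.city.equals(city) && order.year > afterYear) {
                amounts.add(order.amount);
            }
        }
        return amounts;
    }

    /** Analytics side: aggregate whatever each partition returned. */
    static double sumAmounts(List<List<Order>> partitions, String city, int afterYear) {
        double total = 0.0;
        for (List<Order> partition : partitions) {
            for (double amount : scanPartition(partition, city, afterYear)) {
                total += amount;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<Order> partition1 = Arrays.asList(
                new Order("NY", 2013, 10.0),
                new Order("NY", 2012, 5.0),
                new Order("LA", 2014, 7.0));
        List<Order> partition2 = Arrays.asList(new Order("NY", 2015, 2.5));
        System.out.println(sumAmounts(Arrays.asList(partition1, partition2), "NY", 2012));
    }
}
```

In this toy data set only two of the four rows, and one of the three columns, ever leave the grid side. That reduction in data shipped to the analytics tier is the effect the 31-second versus 1-second comparison on slide 45 points at.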
  • 46. Eventually, we productized this as an open source Spark distribution
  • 47. @InsightEdgeIO Apache 2 License
  • 50. Spark GeoSpatial SQL and Data Frames
  • 51. Multi-Spark Replication / Federated Clusters: In-Memory Replication across local or wide area networks
  • 52. Upcoming: Spark RDD/DF native read/save on Off-Heap (SSD/Flash/Direct Buffers) • Significant RAM TCO reduction in Spark clusters • Direct RDD/DataFrame read/write from SSD/Flash device • Avoid filesystem hops and write amplification. Diagram labels: Application Processing; Spark workers; In-Memory Data Grid; Primary instances; Backup instances; Sync Replication; Storage Arrays
  • 54. In-Process HTAP: Read any POJO, JSON Document, or Transaction as a DataFrame or RDD. Web services/apps can read any DataFrame as POJO. True closed-loop analytics data pipeline. @SpaceClass public class Product { private String name; private String brand; private Integer quantity; // … }
  • 55. Point of Decision HTAP: XAP + InsightEdge deployed on different grid clusters with bi-directional real-time data replication. Diagram labels: Transactions; Analytics; In-Memory Data Grid; Realtime Replication • Scoring models • Trigger actions • Events
  • 56. Case Study: Fleet Geo-analytics. Challenge: • Stream data from 1,000s of taxis • Actively monitor and generate real-time notifications • Real-time route optimization and geo-fencing. Solution: • Leverage unified in-memory data fabric as middleware for geo-spatial analytics • Elastically scale stream processing and transactional apps together • Location-based tracking, geo-fencing. Edge components Data Sources