Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Corp.) | C* Summit 2016

Jim Peregord, Venu Palvai
Element Fleet Management
Building a Pluggable Analytics Stack with Cassandra as the Foundation

1 Background on Element Fleet Management
2 Key Use Cases Supported
3 Architecture
4 Our Journey
5 Lessons Learned
© DataStax, All Rights Reserved.

A Little About Us
Jim Peregord
VP – Analytics, BI, Data Mgt
jperegord@elementcorp.com
Venu Palvai
Lead Architect
vpalvai@elementcorp.com

Background on Element Fleet Management
• Full lifecycle of fleet management services
• Data consolidation and advanced analytics services
• Maximize customer ROI on fleet assets via data and advanced analytics
• 2,600 employees
• 1+ million vehicles managed
• $18 billion in total finance assets
• 2 billion rows of data and growing

Greenfield Opportunity to Build Analytics Platform
• Element acquired GE Fleet Management September 1, 2015
• Now the largest publicly held fleet management company in the world
• Pre-acquisition Element had limited data warehouse and Big Data tech
• Greenfield Opportunity to build next gen BI and Advanced Analytics platform
High-level Options Considered
#1 – Build a separate data warehouse and Big Data/Advanced Analytics platform
#2 – Build a single, unified architecture that supports both
Our Decision
#2 – Build a single, unified platform using DataStax

Key Use Cases Supported on New Platform
• High availability out of the box
• Linear and elastic scalability
• High concurrency and low latency
• Real-time ingestion of data streams: Vehicle (location, diagnostics), weather, traffic
• Expose data and analytics via RESTful APIs
• Advanced Analytics (Predictive, Prescriptive, Streaming)
• Data warehouse and traditional reporting

Advanced Analytics Hardware Architecture
• Purpose-built hardware for advanced analytics
• NUMA/NVMe hardware is not commodity; it is highly specialized for very high performance (tens of millions of IOPS)
• Architected to scale 10x or even 100x current capacity, a must for telematics and IoT data
• H/W specs: 256GB, 4 x 2 TB SSD, dedicated C*/Spark instance per SSD
• Active-active clustering means very high availability
• C* / Spark / SOLR / FiloDB / DSE Graph + NUMA: high-performance analytics platform
Cluster sizes:
• Cassandra + Spark: 32 nodes
• Cassandra + SOLR: 8 nodes

Analytics Logical Architecture
[Diagram labels: Events; Streaming Sources (Amazon SQS, Kafka); Batch Sources (Internal, External); FiloDB; Thrift Server (Spark SQL); Job Server (RESTful); Packages (PySpark); MLlib; Consumers]

Pluggable Architecture - Overview
Element’s pluggable Analytics stack gives us the ability to plug into multiple analytics tools
and choose the right tool depending on the questions we are asking. This gives us the
ability to add new analytics capabilities on top of Cassandra as they become available.
• FiloDB: Columnar Data, Fast Reads
• Spark: SQL, Streaming Analytics, PySpark
• Lucene: Search, Custom Dictionaries
• DSE Graph: Graph-based Analytics
• Future Tools: TBD

Pluggable Architecture - FiloDB
• FiloDB uses Cassandra for storage and Spark for computation
• Optimized for:
  • Low latency queries and streaming
  • Interactive ad-hoc analysis on Big Data
  • Complex analytics and machine learning
• Efficient Columnar Storage (20-40X less storage)
• All queries are distributed and run in parallel in Spark
• Integrates with existing BI tools via JDBC/ODBC
• Horizontally scalable, fault tolerant
• Future enhancements include Geo Spatial Analysis
Recent blog post by Evan Chan, renowned C* / Spark expert:
www.planetcassandra.org/blog/achieving-sub-second-sql-joins-and-building-a-data-warehouse-using-spark-cassandra-and-filodb

Pluggable Architecture – Apache Spark
Spark SQL
• In-memory, fast SQL processing
• Easily blend data from multiple sources
• Connect to BI tools
Spark Streaming
• Ingest streaming data sources like telematics, weather, engine diagnostics, etc.
Spark MLlib
• Library of machine learning algorithms for advanced analytics

Pluggable Architecture – Lucene / SOLR
• Powerful search algorithms
• Geospatial indexing and geo-queries
• Custom dictionaries
• Efficient metric calculations

Pluggable Architecture – DSE Enterprise Graph
• Graph databases store data as a network of relationships
• Provides optimized analytics for any data where relationships are most important
• Can improve query/analytics performance 1000X
Example use cases:
• IOT time series on streaming data
• Vehicle routing
• Visualize clusters of well-performing and under-performing assets
• Recommend optimal actions
• Fraud detection

Pluggable Architecture – Cassandra
• High performance NoSQL database
• Flexible schema allows new data attributes to be easily added
• Peer-to-peer, distributed architecture with no single point of failure, unlike traditional databases
• Elastic scalability to add more servers as workload increases

What our Platform Means to Customers
INFRASTRUCTURE IMPROVEMENTS
• 20x CPU Speed
• 10x Memory
• 70x Disk Performance
ALL RUNNING ON
Cassandra database framework has been adopted by companies running some of the world’s largest and most sophisticated real-time analytics
Data
• Maintenance history
• Fuel purchases
• Miles driven
• GPS location
• Points of Interest
• Weather
• Traffic
• Online repair reviews
• Fuel price geo-indexing
Insights
• Predict Operating Costs
• Fraud Detection
• Business Rule Exceptions
• Accident Predictors
• Optimal Replacement
• High risk DTC codes
• Repair sentiment analysis
Action
• Vehicle Replacement Schedule
• Fraud actions
• Safe driving interventions
• Non-standard maintenance schedule
• Recommend fueling and maintenance facilities
Sifting through the data “noise” must be as fast as possible in order to create actionable recommendations

Journey to Build a Unified BI and Analytics Platform
• Creating flexible data models that work for both BI and analytics
• Achieving high concurrency and low latency required for enterprise reporting platforms
• Optimizing software installation and configuration for performance
• Workload management

Dimensional Modeling for BI and Analytics
• BI tools are designed to work with dimensional models
• Dimensional models are proven and easy to understand and model
• Dimensional models are flexible and can answer many questions
• OLAP use cases require slicing and dicing data across multiple dimensions
• JOIN capability is critical for achieving data models that can answer various questions
[Diagram: star schema with a central Fact table surrounded by four Dim tables]
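To make the slicing-and-dicing point concrete, here is a minimal sketch of a star schema queried along two different dimensions. It uses Python's built-in sqlite3 purely as a stand-in for the Cassandra + Spark SQL stack, and the table and column names (fact_fuel, dim_vehicle, dim_region, gallons) are hypothetical, not taken from Element's model.

```python
import sqlite3

def build_star_schema():
    """Create a tiny in-memory star schema: one fact table, two dimensions."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dim_vehicle (vehicle_id INTEGER PRIMARY KEY, make TEXT);
        CREATE TABLE dim_region  (region_id  INTEGER PRIMARY KEY, region TEXT);
        CREATE TABLE fact_fuel   (vehicle_id INTEGER, region_id INTEGER, gallons REAL);
        INSERT INTO dim_vehicle VALUES (1, 'Ford'), (2, 'Toyota');
        INSERT INTO dim_region  VALUES (10, 'East'), (20, 'West');
        INSERT INTO fact_fuel   VALUES (1, 10, 12.0), (1, 20, 8.0),
                                       (2, 10, 10.0), (2, 10, 5.0);
    """)
    return conn

# The same fact table answers different questions depending on which
# dimension it is joined to and grouped by.
QUERIES = {
    "make": """SELECT v.make, SUM(f.gallons)
               FROM fact_fuel f JOIN dim_vehicle v ON f.vehicle_id = v.vehicle_id
               GROUP BY v.make ORDER BY v.make""",
    "region": """SELECT r.region, SUM(f.gallons)
                 FROM fact_fuel f JOIN dim_region r ON f.region_id = r.region_id
                 GROUP BY r.region ORDER BY r.region""",
}

def gallons_by(conn, dimension):
    """Total gallons sliced by the requested dimension ('make' or 'region')."""
    return conn.execute(QUERIES[dimension]).fetchall()
```

Calling gallons_by(conn, "make") and gallons_by(conn, "region") on the same connection shows why JOIN support matters: each new question is just another join against an existing dimension.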

Limitations of Spark SQL
• Cassandra + Spark cluster provides JOIN functionality
• Spark SQL does not propagate a filter applied to one table to the other table in a join, even when the tables are joined on the filtered column
• Predicate pushdown does not work for outer JOINs
• Pushing predicates down to Cassandra (the data source) gives better performance
[Figure: sample DAG plan for a JOIN query across 5 tables]
SQL Example:
SELECT c.customer_id, c.customer_name, i.invoice_amount
FROM customer c, invoice i
WHERE c.customer_id = i.customer_id
AND c.customer_id = 123;
Spark splits the above SQL into:
SELECT c.customer_id, c.customer_name
FROM customer c
WHERE c.customer_id = 123;
SELECT i.customer_id, i.invoice_amount
FROM invoice i;

Custom Thrift Server to Optimize SQL Statements
• Adds predicates to joining tables based on matching join columns
• Converts IN conditions to = conditions whenever the IN list has only one value
• Adds an IN predicate on the partition column based on the range predicates supplied on non-partition key columns
Example
Before:
From customer c,
invoice i
And c.customer_id IN (123)
After:
From customer c,
invoice i
And c.customer_id = 123
And i.customer_id = 123
Custom Thrift Server flow:
Spark thrift server with Custom Hive Context -> Inspect Logical Plan -> Modify Logical Plan (if needed) -> Submit plan for Execution
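The first two rewrite rules can be illustrated with a small sketch. This is not the custom Thrift server itself, which works on Spark's logical plan; it is a simplified model in plain Python where a predicate is a hypothetical (operator, column, value) tuple and a join condition is a pair of column names. The third rule (deriving an IN predicate on the partition column from range predicates) depends on table-specific knowledge and is not covered here.

```python
def simplify_in(pred):
    """Rewrite a single-value IN predicate as an equality predicate."""
    op, col, val = pred
    if op == "IN" and len(val) == 1:
        return ("=", col, val[0])
    return pred

def propagate_equalities(preds, join_pairs):
    """Copy each equality filter onto every column joined to its column.

    Repeats until nothing new is added, so a filter also travels along
    a chain of joins (a = b, b = c).
    """
    result = list(preds)
    changed = True
    while changed:
        changed = False
        for op, col, val in list(result):
            if op != "=":
                continue
            for left, right in join_pairs:
                if col == left:
                    other = right
                elif col == right:
                    other = left
                else:
                    continue
                new_pred = ("=", other, val)
                if new_pred not in result:
                    result.append(new_pred)
                    changed = True
    return result

def rewrite(preds, join_pairs):
    """Apply both rules: simplify IN lists, then propagate equalities."""
    return propagate_equalities([simplify_in(p) for p in preds], join_pairs)
```

Applied to the slide's example, rewrite([("IN", "c.customer_id", [123])], [("c.customer_id", "i.customer_id")]) yields an equality filter on both c.customer_id and i.customer_id, so each table scan can push its own filter down to Cassandra.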

FiloDB
• Cassandra 2.1 has several restrictions on predicate pushdowns
• FiloDB is a true columnar store
• Provides ~20-30 times compression over Cassandra
• Very efficient for single and multiple partition scans
• Partial predicate pushdown support
• Provides ~20-30 times better read performance over straight Cassandra
[Diagram: Cassandra storage of FiloDB data. Rows of data get converted to compressed columnar chunks (labels Ck1, Ck2).]
Dimensional Data Modeling for Cassandra + Spark
• Simple STAR schema models as much as possible (eliminate snowflakes, outer joins, etc.)
• De-normalized dimensions and facts (avoid duplicating dimensions into facts)
• Minimize the number of tables involved in joins
• Common partitioning strategy across dimensions and facts (easy predicate handling)
• Limit max partition sizes to ~1 GB
• Reduce the number of partitions for efficient Spark execution; limit partition sizes for efficient Cassandra read operations
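One common way to keep partitions under a size cap is to add a bucket component to the partition key. The sketch below assumes that approach; the slides do not say how Element bounded partition sizes, and the names (bucket_count, partition_key, customer_id, row_id) are hypothetical.

```python
import math
import zlib

TARGET_PARTITION_BYTES = 1024 ** 3  # ~1 GB cap mentioned on the slide

def bucket_count(estimated_bytes, target_bytes=TARGET_PARTITION_BYTES):
    """Number of buckets needed so each partition stays under the target."""
    return max(1, math.ceil(estimated_bytes / target_bytes))

def partition_key(customer_id, row_id, buckets):
    """Composite partition key: natural key plus a stable hash bucket."""
    # crc32 is deterministic across runs, unlike Python's built-in hash().
    bucket = zlib.crc32(str(row_id).encode("utf-8")) % buckets
    return (customer_id, bucket)
```

The trade-off matches the last bullet above: more buckets keep each Cassandra read small, but reading everything for one customer now means visiting every bucket, so the bucket count should be no larger than the size cap requires.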

Example: ETL Incremental Load Strategy
[Diagram labels: SPIN; JDBC; ETL - Talend; ODS (reload), ADS (incremental) and FiloDB (incremental) on Cassandra/Spark; C*; Spark; Thrift; SSRS; Power BI]
• ODS is truncate/load daily.
• ADS is a complete replica of the source system, loaded with an incremental ETL strategy.
• ODS tables are used to load FiloDB tables (incremental) using Spark jobs.
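The difference between the daily truncate/load and the incremental loads can be sketched in a few lines. This is a simplified model, not the Talend or Spark jobs themselves: it assumes each source row carries a hypothetical updated_at value that can serve as a watermark, and it uses a plain dict as the target table.

```python
def full_reload(source_rows, target):
    """Truncate/load: discard the target and copy every source row."""
    target.clear()
    for row in source_rows:
        target[row["id"]] = row

def incremental_load(source_rows, target, last_watermark):
    """Upsert only rows changed since the last run; return the new watermark."""
    new_watermark = last_watermark
    for row in source_rows:
        if row["updated_at"] > last_watermark:
            target[row["id"]] = row  # insert or overwrite by key
            new_watermark = max(new_watermark, row["updated_at"])
    return new_watermark
```

Each incremental run stores the returned watermark and passes it to the next run, so unchanged rows are never rewritten.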

Results & Opportunities
• Successfully completed a 300-concurrent-user load test from Business Objects
• <1 second response from thrift servers for 90% of queries
• Average of 50 columns and 50 to 500k rows returned
• Single partition and multi-partition scans, with joins involving 5-10 FiloDB tables per query
Opportunities
• Limitations on the maximum result size that can be collected using Spark SQL
• Limitations on the total concurrent result size requested from the Spark thrift server
• These limitations are tunable

Lessons Learned
• Cassandra has limitations for fast analytics; working around them may require custom development
• Have a strategy to handle growth of Cassandra partitions
• Throttle read and write workloads for the size of the cluster
• Tombstone management
• Pick the right ETL tool for the job
• Turn off the NUMAD service
• Lack of monitoring tools on Spark
• Spark’s lazy evaluation makes debugging very difficult
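For the throttling lesson, one widely used technique is a token bucket placed in front of the read or write path. The slides do not say how Element throttled its workloads, so the sketch below is only an assumption of how it could be done; the class name and parameters are hypothetical.

```python
import time

class TokenBucket:
    """Allow at most rate_per_sec operations on average, with bursts up to capacity."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock          # injectable so the logic can be tested
        self.tokens = float(capacity)
        self.last = clock()

    def try_acquire(self, n=1):
        """Take n tokens if available; return False so the caller can back off."""
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False
```

A loader would call try_acquire() before each batch write and sleep briefly when it returns False, with the rate chosen according to the size of the cluster.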
