Mastering Customer Data on Apache Spark

Mastering Customer
Data on Apache Spark
BIG DATA WAREHOUSING
MEETUP
AWS LOFT
APRIL 7, 2016

Agenda
6:30 Networking
Grab some food and drink... Make some friends.
6:50 Joe Caserta
President
Caserta Concepts
Welcome + Intro to BDW Meetup
About the Meetup. Why MDM needs Graph now.
7:15 Kevin Rasmussen
Big Data Engineer
Caserta Concepts
Deeper Dive into Spark/MDM/Graph
Deep dive into DUNBAR technology and how we
came up with it for Customer Data Integration
7:45 Vida Ha,
Lead Solutions Engineer
Databricks
Using Apache Spark
Intro to the different components of Spark:
MLLib, GraphX, SQL, Streaming, Python, ETL
8:30 Q&A Ask Questions, Share your experience

About Caserta Concepts
• Consulting Data Innovation and Modern Data Engineering
• Award-winning company
• Internationally recognized work force
• Strategy, Architecture, Implementation, Governance
• Innovation Partner
• Strategic Consulting
• Advanced Architecture
• Build & Deploy
• Leader in Enterprise Data Solutions
• Big Data Analytics
• Data Warehousing
• Business Intelligence
• Data Science
• Cloud Computing
• Data Governance
Amazon Best Sellers
Most popular products based on sales.
Updated hourly.

Client Portfolio
Retail/eCommerce
& Manufacturing
Digital Media/AdTech
Education & Services
Finance. Healthcare
& Insurance

This Meetup
CIL - Caserta
Innovations Lab
Experience
Big Data Warehousing Meetup
• Established in 2012 in NYC
• Meet monthly to share data best
practices, experiences
• 3,700+ Members
http://www.meetup.com/Big-Data-Warehousing/
http://www.slideshare.net/CasertaConcepts/
Examples of Topics
• Data Governance, Compliance &
Security in Hadoop w/Cloudera
• Real Time Trade Data
Monitoring with Storm &
Cassandra
• Predictive Analytics
• Exploring Big Data Analytics
Techniques w/Datameer
• Using a Graph DB for MDM &
Relationship Mgmt
• Data Science w/ Revolution
Analytics
• Processing 1.4 Trillion Events
in Hadoop
• Building a Relevance Engine
using Hadoop, Mahout & Pig
• Big Data 2.0 – YARN Distributed
ETL & SQL w/Hadoop
• Intro to NoSQL w/10GEN

Informational
Master Data
MDM Information Ecosystem
8
Operational
Master Data
Holistic Master
Data Service
Leads
Policies
Claims
Enrolls
Sales
Finance
DW
Dimensions &
Cross-References
Marketing
Insights

What is wrong with traditional approach to MDM
• Conceptually problems with “enterprise” approach
• Long, complex implementations  low ROI
• Complex data model
• Too much human interaction
• Deliverable???
• Challenges with big data
• Data volumes
• Evolving data sources
• Need to further remove humans out of the process

 Hierarchical relationships are never rigid
 Relational models with tables and
columns not flexible enough
 Neo4j is the leading graph database
 Many MDM systems are going graph:
 Pitney Bowes - Spectrum MDM
 Reltio - Worry-Free Data for Life Sciences.
Graph Databases to the Rescue

How does a Graph DB help MDM
• Data is stored in it’s natural form  no mismatch between
requirements and data model
• Both Nodes and Relationships can have properties  supports sparse
and evolving data
• MDM for analytics  your MDM solution now delivers new
enablement, not just a back office system
• Relationship science

Open source and commercial
Gelphi
Tom Sawyer
linkurio.us

Caserta Innovation Lab (CIL)
• Internal laboratory established to test & develop solution concepts and ideas
• Used to accelerate client projects
• Examples:
• Search (SOLR) based BI
• Big Data Governance Toolkit / Data Quality Sub-System
• Text Analytics on Social Network Data
• Continuous Integration / End-to-end streaming
• Recommendation Engine Optimization
• Relationship Intelligence / Spark Graph / CDI (Dunbar)
• CIL is hosted on

Introducing Dunbar (Relationship Intelligence / CDI )
Kevin Rasmussen
Big Data Engineer, Caserta Concepts
kevin@casertaconcepts.com

Anthropologists say it’s 150…
MAX

What if we could increase this number?
Opportunities
Closing
Revenue
Yeah,butthat’sCRM101… Howcanweimprovethis?
Expand Data Sources
Explore Relationships
Enhanced Insight

Project Dunbar - Internal
Developed in:
• Python
• Neo4j Database
• Build a social graph based on internal and external data
• Run pathing algorithms to understand strategic opportunity advantages

Whoa… Not so FAST!
Throwing a bunch of unrelated points in a graph will not
give us a useable solution.
We need to
MASTER
our contact data…
• We need to clean and normalize our
incoming interaction and relationship
data (edges)
• Clean normalize and match our
entities (vertexes)

Mastering Customer Data
Customer Data Integration (CDI):
is the process of consolidating and managing customer information from all
available sources.
In other words…
We need to figure out how to
LINK people across systems!

Steps required
Standardization
Matching
Survivorship
Validation

Traditional Standardization and Matching
Matching:
join based on combinations
of cleansed and standardized
data to create match results
Cleanse and Parse:
• Names
• Resolve nicknames
• Create deterministic hash,
phonetic representation
• Addresses
• Emails
• Phone Numbers

Great – But the NEW data is different
Reveal
• Wait for the customer to “reveal” themselves
• Create a link between anonymous self and known profile
Vector
• May need behavioral statistical profiling
• Compare use vectors
Rebuild
• Recluster all prior activities
• Rebuild the Graph

..and then things got bigger
One of our customers wanted us to build it for them:
• Much larger dataset
• 6 million customers > 30% duplication rate
• 100’s of millions of customer interactions
• Few direct links across channels

Why Spark
“Big Box” MDM tools vs ROI?
• Prohibitively expensive  limited by licensing $$$
• Typically limited to the scalability of a single server
We Spark!
• Development local or distributed is similar
• Beautiful high level API’s
• Databricks cloud is soo Easy
• Full universe of Python modules
• Open source and Free**
• Blazing fast!
Spark has become our
default processing engine
for a myriad of
engineering problems

Spark map operations
Cleansing, transformation, and standardization of both interaction and
customer data
Amazing universe of Python modules:
• Address Parsing: usaddress, postal-address, etc
• Name Hashing: fuzzy, etc
• Genderization: sexmachine, etc
And all the goodies of the standard library!
We can now parallelize our workload against a large number of machines:

Matching process
• We now have clean standardized, linkable data
• We need to resolve our links between our customer
• Large table self joins
• We can even use SQL:

Matching process
The matching process output gives us the relationships between customers:
Great, but it’s not very useable, you need to traverse the dataset to find out 1234
and 1235 are the same person (and this is a trivial case)
And we need to cluster and identify our survivors (vertex)
xid yid match_type
1234 4849 phone
4849 5499 email
5499 1235 address
4849 7788 cookie
5499 7788 cookie
4849 1234 phone

Graphx to the rescue
1234 4849
5499
7788
We just need to import our
edges into a graph and “dump”
out communities
Don’t think table…
think Graph!
These matches
are actually
communities
1235

Connected components
Connected Components algorithm labels each connected
component of the graph with the ID of its lowest-numbered
vertex
This lowest number vertex can serve as our “survivor” (not
field survivorship)
Is it possible to write less code?

Field level survivorship rules
We now need to survive fields to our survivor to make it a “best record”.
Depending on the attribution we choose:
• Latest reported
• Most frequently used
We then do some simple ranking
(thanks windowed functions)

What you really want to know...
With a customer dataset of approximately
6 million customers and 100’s of millions of
data points.
… when directly compared to traditional
“big box” enterprise MDM software
Was Spark faster….

Map operations like cleansing and
standardization saw a
10x improvement
on a 4 node r3.2xlarge cluster
But the real winner was the “Graph Trick”…
Survivorship step reduced from
2 hours 5 minutes
Connected Components
< 1 minute
**loading RDD’s from S3, building the graph, and running connected components

Next Steps…
• Graph Frames
• In development by Berkley Amp Labs
• Combines Full Graph algorithm nature of
GraphX with Relationship mining of Neo4j

Thank You / Q&A
Kevin Rasmussen
Big Data Engineer, Caserta Concepts
kevin@casertaconcepts.com

Mastering Customer Data on Apache Spark

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to Mastering Customer Data on Apache Spark

Similar to Mastering Customer Data on Apache Spark (20)

More from Caserta

More from Caserta (20)

Recently uploaded

Recently uploaded (20)

Mastering Customer Data on Apache Spark