SlideShare a Scribd company logo
Mastering Customer
Data on Apache Spark
BIG DATA WAREHOUSING
MEETUP
AWS LOFT
APRIL 7, 2016
Agenda
6:30 Networking
Grab some food and drink... Make some friends.
6:50 Joe Caserta
President
Caserta Concepts
Welcome + Intro to BDW Meetup
About the Meetup. Why MDM needs Graph now.
7:15 Kevin Rasmussen
Big Data Engineer
Caserta Concepts
Deeper Dive into Spark/MDM/Graph
Deep dive into DUNBAR technology and how we
came up with it for Customer Data Integration
7:45 Vida Ha,
Lead Solutions Engineer
Databricks
Using Apache Spark
Intro to the different components of Spark:
MLLib, GraphX, SQL, Streaming, Python, ETL
8:30 Q&A Ask Questions, Share your experience
About Caserta Concepts
• Consulting Data Innovation and Modern Data Engineering
• Award-winning company
• Internationally recognized work force
• Strategy, Architecture, Implementation, Governance
• Innovation Partner
• Strategic Consulting
• Advanced Architecture
• Build & Deploy
• Leader in Enterprise Data Solutions
• Big Data Analytics
• Data Warehousing
• Business Intelligence
• Data Science
• Cloud Computing
• Data Governance
Amazon Best Sellers
Most popular products based on sales.
Updated hourly.
Client Portfolio
Retail/eCommerce
& Manufacturing
Digital Media/AdTech
Education & Services
Finance. Healthcare
& Insurance
Partners
Awards & Recognition
This Meetup
CIL - Caserta
Innovations Lab
Experience
Big Data Warehousing Meetup
• Established in 2012 in NYC
• Meet monthly to share data best
practices, experiences
• 3,700+ Members
http://www.meetup.com/Big-Data-Warehousing/
http://www.slideshare.net/CasertaConcepts/
Examples of Topics
• Data Governance, Compliance &
Security in Hadoop w/Cloudera
• Real Time Trade Data
Monitoring with Storm &
Cassandra
• Predictive Analytics
• Exploring Big Data Analytics
Techniques w/Datameer
• Using a Graph DB for MDM &
Relationship Mgmt
• Data Science w/ Revolution
Analytics
• Processing 1.4 Trillion Events
in Hadoop
• Building a Relevance Engine
using Hadoop, Mahout & Pig
• Big Data 2.0 – YARN Distributed
ETL & SQL w/Hadoop
• Intro to NoSQL w/10GEN
Informational
Master Data
MDM Information Ecosystem
8
Operational
Master Data
Holistic Master
Data Service
Leads
Policies
Claims
Enrolls
Sales
Finance
DW
Dimensions &
Cross-References
Marketing
Insights
What is wrong with traditional approach to MDM
• Conceptually problems with “enterprise” approach
• Long, complex implementations  low ROI
• Complex data model
• Too much human interaction
• Deliverable???
• Challenges with big data
• Data volumes
• Evolving data sources
• Need to further remove humans out of the process
 Hierarchical relationships are never rigid
 Relational models with tables and
columns not flexible enough
 Neo4j is the leading graph database
 Many MDM systems are going graph:
 Pitney Bowes - Spectrum MDM
 Reltio - Worry-Free Data for Life Sciences.
Graph Databases to the Rescue
How does a Graph DB help MDM
• Data is stored in it’s natural form  no mismatch between
requirements and data model
• Both Nodes and Relationships can have properties  supports sparse
and evolving data
• MDM for analytics  your MDM solution now delivers new
enablement, not just a back office system
• Relationship science
Open source and commercial
Gelphi
Tom Sawyer
linkurio.us
Caserta Innovation Lab (CIL)
• Internal laboratory established to test & develop solution concepts and ideas
• Used to accelerate client projects
• Examples:
• Search (SOLR) based BI
• Big Data Governance Toolkit / Data Quality Sub-System
• Text Analytics on Social Network Data
• Continuous Integration / End-to-end streaming
• Recommendation Engine Optimization
• Relationship Intelligence / Spark Graph / CDI (Dunbar)
• CIL is hosted on
Introducing Dunbar (Relationship Intelligence / CDI )
Kevin Rasmussen
Big Data Engineer, Caserta Concepts
kevin@casertaconcepts.com
How many people do you know??
Anthropologists say it’s 150…
MAX
What if we could increase this number?
Opportunities
Closing
Revenue
Yeah,butthat’sCRM101… Howcanweimprovethis?
Expand Data Sources
Explore Relationships
Enhanced Insight
Project Dunbar - Internal
Developed in:
• Python
• Neo4j Database
• Build a social graph based on internal and external data
• Run pathing algorithms to understand strategic opportunity advantages
Whoa… Not so FAST!
Throwing a bunch of unrelated points in a graph will not
give us a useable solution.
We need to
MASTER
our contact data…
• We need to clean and normalize our
incoming interaction and relationship
data (edges)
• Clean normalize and match our
entities (vertexes)
Mastering Customer Data
Customer Data Integration (CDI):
is the process of consolidating and managing customer information from all
available sources.
In other words…
We need to figure out how to
LINK people across systems!
Steps required
Standardization
Matching
Survivorship
Validation
Traditional Standardization and Matching
Matching:
join based on combinations
of cleansed and standardized
data to create match results
Cleanse and Parse:
• Names
• Resolve nicknames
• Create deterministic hash,
phonetic representation
• Addresses
• Emails
• Phone Numbers
Great – But the NEW data is different
Reveal
• Wait for the customer to “reveal” themselves
• Create a link between anonymous self and known profile
Vector
• May need behavioral statistical profiling
• Compare use vectors
Rebuild
• Recluster all prior activities
• Rebuild the Graph
..and then things got bigger
One of our customers wanted us to build it for them:
• Much larger dataset
• 6 million customers > 30% duplication rate
• 100’s of millions of customer interactions
• Few direct links across channels
Why Spark
“Big Box” MDM tools vs ROI?
• Prohibitively expensive  limited by licensing $$$
• Typically limited to the scalability of a single server
We Spark!
• Development local or distributed is similar
• Beautiful high level API’s
• Databricks cloud is soo Easy
• Full universe of Python modules
• Open source and Free**
• Blazing fast!
Spark has become our
default processing engine
for a myriad of
engineering problems
Spark map operations
Cleansing, transformation, and standardization of both interaction and
customer data
Amazing universe of Python modules:
• Address Parsing: usaddress, postal-address, etc
• Name Hashing: fuzzy, etc
• Genderization: sexmachine, etc
And all the goodies of the standard library!
We can now parallelize our workload against a large number of machines:
Matching process
• We now have clean standardized, linkable data
• We need to resolve our links between our customer
• Large table self joins
• We can even use SQL:
Matching process
The matching process output gives us the relationships between customers:
Great, but it’s not very useable, you need to traverse the dataset to find out 1234
and 1235 are the same person (and this is a trivial case)
And we need to cluster and identify our survivors (vertex)
xid yid match_type
1234 4849 phone
4849 5499 email
5499 1235 address
4849 7788 cookie
5499 7788 cookie
4849 1234 phone
Graphx to the rescue
1234 4849
5499
7788
We just need to import our
edges into a graph and “dump”
out communities
Don’t think table…
think Graph!
These matches
are actually
communities
1235
Connected components
Connected Components algorithm labels each connected
component of the graph with the ID of its lowest-numbered
vertex
This lowest number vertex can serve as our “survivor” (not
field survivorship)
Is it possible to write less code?
Field level survivorship rules
We now need to survive fields to our survivor to make it a “best record”.
Depending on the attribution we choose:
• Latest reported
• Most frequently used
We then do some simple ranking
(thanks windowed functions)
What you really want to know...
With a customer dataset of approximately
6 million customers and 100’s of millions of
data points.
… when directly compared to traditional
“big box” enterprise MDM software
Was Spark faster….
Map operations like cleansing and
standardization saw a
10x improvement
on a 4 node r3.2xlarge cluster
But the real winner was the “Graph Trick”…
Survivorship step reduced from
2 hours 5 minutes
Connected Components
< 1 minute
**loading RDD’s from S3, building the graph, and running connected components
Next Steps…
• Graph Frames
• In development by Berkley Amp Labs
• Combines Full Graph algorithm nature of
GraphX with Relationship mining of Neo4j
Thank You / Q&A
Kevin Rasmussen
Big Data Engineer, Caserta Concepts
kevin@casertaconcepts.com

More Related Content

What's hot

Data Warehouse (ETL) testing process
Data Warehouse (ETL) testing processData Warehouse (ETL) testing process
Data Warehouse (ETL) testing process
Rakesh Hansalia
 
Etl And Data Test Guidelines For Large Applications
Etl And Data Test Guidelines For Large ApplicationsEtl And Data Test Guidelines For Large Applications
Etl And Data Test Guidelines For Large Applications
Wayne Yaddow
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
2023 COSCUP - Whats new in PostgreSQL 16
2023 COSCUP - Whats new in PostgreSQL 162023 COSCUP - Whats new in PostgreSQL 16
2023 COSCUP - Whats new in PostgreSQL 16
José Lin
 
Family tree of data – provenance and neo4j
Family tree of data – provenance and neo4jFamily tree of data – provenance and neo4j
Family tree of data – provenance and neo4j
M. David Allen
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Adam Doyle
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
Databricks
 
Optimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloadsOptimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloads
datamantra
 
Mastering Your Customer Data on Apache Spark by Elliott Cordo
Mastering Your Customer Data on Apache Spark by Elliott CordoMastering Your Customer Data on Apache Spark by Elliott Cordo
Mastering Your Customer Data on Apache Spark by Elliott Cordo
Spark Summit
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
Databricks
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)
Databricks
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka Streams
Guido Schmutz
 
Data Leadership - Stop Talking About Data and Start Making an Impact!
Data Leadership - Stop Talking About Data and Start Making an Impact!Data Leadership - Stop Talking About Data and Start Making an Impact!
Data Leadership - Stop Talking About Data and Start Making an Impact!
DATAVERSITY
 
Introduction to ETL and Data Integration
Introduction to ETL and Data IntegrationIntroduction to ETL and Data Integration
Introduction to ETL and Data Integration
CloverDX (formerly known as CloverETL)
 
Achieving a Single View of Business – Critical Data with Master Data Management
Achieving a Single View of Business – Critical Data with Master Data ManagementAchieving a Single View of Business – Critical Data with Master Data Management
Achieving a Single View of Business – Critical Data with Master Data Management
DATAVERSITY
 
Big Data & Data Lakes Building Blocks
Big Data & Data Lakes Building BlocksBig Data & Data Lakes Building Blocks
Big Data & Data Lakes Building Blocks
Amazon Web Services
 
Introduction to DAX Language
Introduction to DAX LanguageIntroduction to DAX Language
Introduction to DAX Language
Antonios Chatzipavlis
 
How to Design Indexes, Really
How to Design Indexes, ReallyHow to Design Indexes, Really
How to Design Indexes, Really
Karwin Software Solutions LLC
 
Build data quality rules and data cleansing into your data pipelines
Build data quality rules and data cleansing into your data pipelinesBuild data quality rules and data cleansing into your data pipelines
Build data quality rules and data cleansing into your data pipelines
Mark Kromer
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
Nishith Agarwal
 

What's hot (20)

Data Warehouse (ETL) testing process
Data Warehouse (ETL) testing processData Warehouse (ETL) testing process
Data Warehouse (ETL) testing process
 
Etl And Data Test Guidelines For Large Applications
Etl And Data Test Guidelines For Large ApplicationsEtl And Data Test Guidelines For Large Applications
Etl And Data Test Guidelines For Large Applications
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
2023 COSCUP - Whats new in PostgreSQL 16
2023 COSCUP - Whats new in PostgreSQL 162023 COSCUP - Whats new in PostgreSQL 16
2023 COSCUP - Whats new in PostgreSQL 16
 
Family tree of data – provenance and neo4j
Family tree of data – provenance and neo4jFamily tree of data – provenance and neo4j
Family tree of data – provenance and neo4j
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEA
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Optimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloadsOptimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloads
 
Mastering Your Customer Data on Apache Spark by Elliott Cordo
Mastering Your Customer Data on Apache Spark by Elliott CordoMastering Your Customer Data on Apache Spark by Elliott Cordo
Mastering Your Customer Data on Apache Spark by Elliott Cordo
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
 
Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)Introduction to Spark (Intern Event Presentation)
Introduction to Spark (Intern Event Presentation)
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka Streams
 
Data Leadership - Stop Talking About Data and Start Making an Impact!
Data Leadership - Stop Talking About Data and Start Making an Impact!Data Leadership - Stop Talking About Data and Start Making an Impact!
Data Leadership - Stop Talking About Data and Start Making an Impact!
 
Introduction to ETL and Data Integration
Introduction to ETL and Data IntegrationIntroduction to ETL and Data Integration
Introduction to ETL and Data Integration
 
Achieving a Single View of Business – Critical Data with Master Data Management
Achieving a Single View of Business – Critical Data with Master Data ManagementAchieving a Single View of Business – Critical Data with Master Data Management
Achieving a Single View of Business – Critical Data with Master Data Management
 
Big Data & Data Lakes Building Blocks
Big Data & Data Lakes Building BlocksBig Data & Data Lakes Building Blocks
Big Data & Data Lakes Building Blocks
 
Introduction to DAX Language
Introduction to DAX LanguageIntroduction to DAX Language
Introduction to DAX Language
 
How to Design Indexes, Really
How to Design Indexes, ReallyHow to Design Indexes, Really
How to Design Indexes, Really
 
Build data quality rules and data cleansing into your data pipelines
Build data quality rules and data cleansing into your data pipelinesBuild data quality rules and data cleansing into your data pipelines
Build data quality rules and data cleansing into your data pipelines
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
 

Viewers also liked

Big MDM Part 2: Using a Graph Database for MDM and Relationship Management
Big MDM Part 2: Using a Graph Database for MDM and Relationship ManagementBig MDM Part 2: Using a Graph Database for MDM and Relationship Management
Big MDM Part 2: Using a Graph Database for MDM and Relationship Management
Caserta
 
Graph Databases for Master Data Management
Graph Databases for Master Data ManagementGraph Databases for Master Data Management
Graph Databases for Master Data Management
Neo4j
 
Using Hadoop as a platform for Master Data Management
Using Hadoop as a platform for Master Data ManagementUsing Hadoop as a platform for Master Data Management
Using Hadoop as a platform for Master Data Management
DataWorks Summit
 
Using a Graph Database for Next-Gen MDM
Using a Graph Database for Next-Gen MDMUsing a Graph Database for Next-Gen MDM
Using a Graph Database for Next-Gen MDM
Neo4j
 
Data Quality in the Data Hub with RedPointGlobal
Data Quality in the Data Hub with RedPointGlobalData Quality in the Data Hub with RedPointGlobal
Data Quality in the Data Hub with RedPointGlobal
Caserta
 
Using neo4j for enterprise metadata requirements
Using neo4j for enterprise metadata requirementsUsing neo4j for enterprise metadata requirements
Using neo4j for enterprise metadata requirements
Neo4j
 
Neo4j Solutions - Master Data Management
Neo4j Solutions - Master Data ManagementNeo4j Solutions - Master Data Management
Neo4j Solutions - Master Data Management
Caserta
 
Using MDM to Lay the Foundation for Big Data and Analytics in Healthcare
Using MDM to Lay the Foundation for Big Data and Analytics in HealthcareUsing MDM to Lay the Foundation for Big Data and Analytics in Healthcare
Using MDM to Lay the Foundation for Big Data and Analytics in Healthcare
Perficient, Inc.
 
Graph db
Graph dbGraph db
Graph db
Gagan Agrawal
 
Introduction to Dublin Core Metadata
Introduction to Dublin Core MetadataIntroduction to Dublin Core Metadata
Introduction to Dublin Core Metadata
Hannes Ebner
 
Real-World Data Governance: Master Data Management & Data Governance
Real-World Data Governance: Master Data Management & Data GovernanceReal-World Data Governance: Master Data Management & Data Governance
Real-World Data Governance: Master Data Management & Data Governance
DATAVERSITY
 
Metadata an overview
Metadata an overviewMetadata an overview
Metadata an overview
robin fay
 
Seven building blocks for MDM
Seven building blocks for MDMSeven building blocks for MDM
Seven building blocks for MDM
Kousik Mukherjee
 
Neo4j PartnerDay Amsterdam 2017
Neo4j PartnerDay Amsterdam 2017Neo4j PartnerDay Amsterdam 2017
Neo4j PartnerDay Amsterdam 2017
Neo4j
 
raph Databases with Neo4j – Emil Eifrem
raph Databases with Neo4j – Emil Eifremraph Databases with Neo4j – Emil Eifrem
raph Databases with Neo4j – Emil Eifrem
buildacloud
 
Museo Torino - un esempio reale d'uso di NOSQL-GraphDB, Linked Data e Web Sem...
Museo Torino - un esempio reale d'uso di NOSQL-GraphDB, Linked Data e Web Sem...Museo Torino - un esempio reale d'uso di NOSQL-GraphDB, Linked Data e Web Sem...
Museo Torino - un esempio reale d'uso di NOSQL-GraphDB, Linked Data e Web Sem...
Codemotion
 
Presentatie Marktonderzoek
Presentatie MarktonderzoekPresentatie Marktonderzoek
Presentatie Marktonderzoek
Responsum
 
The Data Lake - Balancing Data Governance and Innovation
The Data Lake - Balancing Data Governance and Innovation The Data Lake - Balancing Data Governance and Innovation
The Data Lake - Balancing Data Governance and Innovation
Caserta
 
New trends in data analysis and visualization on the web
New trends in data analysis and visualization on the webNew trends in data analysis and visualization on the web
New trends in data analysis and visualization on the web
Javier de la Torre
 
Presentatie data visualisatie, interactive storytelling
Presentatie data visualisatie, interactive storytellingPresentatie data visualisatie, interactive storytelling
Presentatie data visualisatie, interactive storytelling
Wilbert Baan
 

Viewers also liked (20)

Big MDM Part 2: Using a Graph Database for MDM and Relationship Management
Big MDM Part 2: Using a Graph Database for MDM and Relationship ManagementBig MDM Part 2: Using a Graph Database for MDM and Relationship Management
Big MDM Part 2: Using a Graph Database for MDM and Relationship Management
 
Graph Databases for Master Data Management
Graph Databases for Master Data ManagementGraph Databases for Master Data Management
Graph Databases for Master Data Management
 
Using Hadoop as a platform for Master Data Management
Using Hadoop as a platform for Master Data ManagementUsing Hadoop as a platform for Master Data Management
Using Hadoop as a platform for Master Data Management
 
Using a Graph Database for Next-Gen MDM
Using a Graph Database for Next-Gen MDMUsing a Graph Database for Next-Gen MDM
Using a Graph Database for Next-Gen MDM
 
Data Quality in the Data Hub with RedPointGlobal
Data Quality in the Data Hub with RedPointGlobalData Quality in the Data Hub with RedPointGlobal
Data Quality in the Data Hub with RedPointGlobal
 
Using neo4j for enterprise metadata requirements
Using neo4j for enterprise metadata requirementsUsing neo4j for enterprise metadata requirements
Using neo4j for enterprise metadata requirements
 
Neo4j Solutions - Master Data Management
Neo4j Solutions - Master Data ManagementNeo4j Solutions - Master Data Management
Neo4j Solutions - Master Data Management
 
Using MDM to Lay the Foundation for Big Data and Analytics in Healthcare
Using MDM to Lay the Foundation for Big Data and Analytics in HealthcareUsing MDM to Lay the Foundation for Big Data and Analytics in Healthcare
Using MDM to Lay the Foundation for Big Data and Analytics in Healthcare
 
Graph db
Graph dbGraph db
Graph db
 
Introduction to Dublin Core Metadata
Introduction to Dublin Core MetadataIntroduction to Dublin Core Metadata
Introduction to Dublin Core Metadata
 
Real-World Data Governance: Master Data Management & Data Governance
Real-World Data Governance: Master Data Management & Data GovernanceReal-World Data Governance: Master Data Management & Data Governance
Real-World Data Governance: Master Data Management & Data Governance
 
Metadata an overview
Metadata an overviewMetadata an overview
Metadata an overview
 
Seven building blocks for MDM
Seven building blocks for MDMSeven building blocks for MDM
Seven building blocks for MDM
 
Neo4j PartnerDay Amsterdam 2017
Neo4j PartnerDay Amsterdam 2017Neo4j PartnerDay Amsterdam 2017
Neo4j PartnerDay Amsterdam 2017
 
raph Databases with Neo4j – Emil Eifrem
raph Databases with Neo4j – Emil Eifremraph Databases with Neo4j – Emil Eifrem
raph Databases with Neo4j – Emil Eifrem
 
Museo Torino - un esempio reale d'uso di NOSQL-GraphDB, Linked Data e Web Sem...
Museo Torino - un esempio reale d'uso di NOSQL-GraphDB, Linked Data e Web Sem...Museo Torino - un esempio reale d'uso di NOSQL-GraphDB, Linked Data e Web Sem...
Museo Torino - un esempio reale d'uso di NOSQL-GraphDB, Linked Data e Web Sem...
 
Presentatie Marktonderzoek
Presentatie MarktonderzoekPresentatie Marktonderzoek
Presentatie Marktonderzoek
 
The Data Lake - Balancing Data Governance and Innovation
The Data Lake - Balancing Data Governance and Innovation The Data Lake - Balancing Data Governance and Innovation
The Data Lake - Balancing Data Governance and Innovation
 
New trends in data analysis and visualization on the web
New trends in data analysis and visualization on the webNew trends in data analysis and visualization on the web
New trends in data analysis and visualization on the web
 
Presentatie data visualisatie, interactive storytelling
Presentatie data visualisatie, interactive storytellingPresentatie data visualisatie, interactive storytelling
Presentatie data visualisatie, interactive storytelling
 

Similar to Mastering Customer Data on Apache Spark

Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics
Caserta
 
Introduction: Relational to Graphs
Introduction: Relational to GraphsIntroduction: Relational to Graphs
Introduction: Relational to Graphs
Neo4j
 
Introducing Neo4j
Introducing Neo4jIntroducing Neo4j
Introducing Neo4j
Neo4j
 
Enable Better Decision Making with Power BI Visualizations & Modern Data Estate
Enable Better Decision Making with Power BI Visualizations & Modern Data EstateEnable Better Decision Making with Power BI Visualizations & Modern Data Estate
Enable Better Decision Making with Power BI Visualizations & Modern Data Estate
CCG
 
Graph all the things - PRathle
Graph all the things - PRathleGraph all the things - PRathle
Graph all the things - PRathle
Neo4j
 
Power to the People: A Stack to Empower Every User to Make Data-Driven Decisions
Power to the People: A Stack to Empower Every User to Make Data-Driven DecisionsPower to the People: A Stack to Empower Every User to Make Data-Driven Decisions
Power to the People: A Stack to Empower Every User to Make Data-Driven Decisions
Looker
 
Data In Action: Business Value of Data
Data In Action: Business Value of DataData In Action: Business Value of Data
Data In Action: Business Value of Data
Matt Turner
 
GraphTalks Rome - Introducing Neo4j
GraphTalks Rome - Introducing Neo4jGraphTalks Rome - Introducing Neo4j
GraphTalks Rome - Introducing Neo4j
Neo4j
 
Roadmap for Enterprise Graph Strategy
Roadmap for Enterprise Graph StrategyRoadmap for Enterprise Graph Strategy
Roadmap for Enterprise Graph Strategy
Neo4j
 
Neo4j GraphDay Seattle- Sept19- Connected data imperative
Neo4j GraphDay Seattle- Sept19- Connected data imperativeNeo4j GraphDay Seattle- Sept19- Connected data imperative
Neo4j GraphDay Seattle- Sept19- Connected data imperative
Neo4j
 
GraphTalks Rome - Selecting the right Technology
GraphTalks Rome - Selecting the right TechnologyGraphTalks Rome - Selecting the right Technology
GraphTalks Rome - Selecting the right Technology
Neo4j
 
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
Caserta
 
Architecting for Big Data: Trends, Tips, and Deployment Options
Architecting for Big Data: Trends, Tips, and Deployment OptionsArchitecting for Big Data: Trends, Tips, and Deployment Options
Architecting for Big Data: Trends, Tips, and Deployment Options
Caserta
 
SPSChicagoBurbs 2019 - What is CDM and CDS?
SPSChicagoBurbs 2019 - What is CDM and CDS?SPSChicagoBurbs 2019 - What is CDM and CDS?
SPSChicagoBurbs 2019 - What is CDM and CDS?
Nicolas Georgeault
 
Neo4j in Depth
Neo4j in DepthNeo4j in Depth
Neo4j in Depth
Max De Marzi
 
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing Keynote
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing KeynoteArchitecting Data For The Modern Enterprise - Data Summit 2017, Closing Keynote
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing Keynote
Caserta
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyYour Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph Strategy
Neo4j
 
Get Started with the Most Advanced Edition Yet of Neo4j Graph Data Science
Get Started with the Most Advanced Edition Yet of Neo4j Graph Data ScienceGet Started with the Most Advanced Edition Yet of Neo4j Graph Data Science
Get Started with the Most Advanced Edition Yet of Neo4j Graph Data Science
Neo4j
 
Big Data Analytics with Microsoft
Big Data Analytics with MicrosoftBig Data Analytics with Microsoft
Big Data Analytics with Microsoft
Caserta
 
Big Data Expo 2015 - Barnsten Why Data Modelling is Essential
Big Data Expo 2015 - Barnsten Why Data Modelling is EssentialBig Data Expo 2015 - Barnsten Why Data Modelling is Essential
Big Data Expo 2015 - Barnsten Why Data Modelling is Essential
BigDataExpo
 

Similar to Mastering Customer Data on Apache Spark (20)

Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics
 
Introduction: Relational to Graphs
Introduction: Relational to GraphsIntroduction: Relational to Graphs
Introduction: Relational to Graphs
 
Introducing Neo4j
Introducing Neo4jIntroducing Neo4j
Introducing Neo4j
 
Enable Better Decision Making with Power BI Visualizations & Modern Data Estate
Enable Better Decision Making with Power BI Visualizations & Modern Data EstateEnable Better Decision Making with Power BI Visualizations & Modern Data Estate
Enable Better Decision Making with Power BI Visualizations & Modern Data Estate
 
Graph all the things - PRathle
Graph all the things - PRathleGraph all the things - PRathle
Graph all the things - PRathle
 
Power to the People: A Stack to Empower Every User to Make Data-Driven Decisions
Power to the People: A Stack to Empower Every User to Make Data-Driven DecisionsPower to the People: A Stack to Empower Every User to Make Data-Driven Decisions
Power to the People: A Stack to Empower Every User to Make Data-Driven Decisions
 
Data In Action: Business Value of Data
Data In Action: Business Value of DataData In Action: Business Value of Data
Data In Action: Business Value of Data
 
GraphTalks Rome - Introducing Neo4j
GraphTalks Rome - Introducing Neo4jGraphTalks Rome - Introducing Neo4j
GraphTalks Rome - Introducing Neo4j
 
Roadmap for Enterprise Graph Strategy
Roadmap for Enterprise Graph StrategyRoadmap for Enterprise Graph Strategy
Roadmap for Enterprise Graph Strategy
 
Neo4j GraphDay Seattle- Sept19- Connected data imperative
Neo4j GraphDay Seattle- Sept19- Connected data imperativeNeo4j GraphDay Seattle- Sept19- Connected data imperative
Neo4j GraphDay Seattle- Sept19- Connected data imperative
 
GraphTalks Rome - Selecting the right Technology
GraphTalks Rome - Selecting the right TechnologyGraphTalks Rome - Selecting the right Technology
GraphTalks Rome - Selecting the right Technology
 
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
Building New Data Ecosystem for Customer Analytics, Strata + Hadoop World, 2016
 
Architecting for Big Data: Trends, Tips, and Deployment Options
Architecting for Big Data: Trends, Tips, and Deployment OptionsArchitecting for Big Data: Trends, Tips, and Deployment Options
Architecting for Big Data: Trends, Tips, and Deployment Options
 
SPSChicagoBurbs 2019 - What is CDM and CDS?
SPSChicagoBurbs 2019 - What is CDM and CDS?SPSChicagoBurbs 2019 - What is CDM and CDS?
SPSChicagoBurbs 2019 - What is CDM and CDS?
 
Neo4j in Depth
Neo4j in DepthNeo4j in Depth
Neo4j in Depth
 
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing Keynote
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing KeynoteArchitecting Data For The Modern Enterprise - Data Summit 2017, Closing Keynote
Architecting Data For The Modern Enterprise - Data Summit 2017, Closing Keynote
 
Your Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph StrategyYour Roadmap for An Enterprise Graph Strategy
Your Roadmap for An Enterprise Graph Strategy
 
Get Started with the Most Advanced Edition Yet of Neo4j Graph Data Science
Get Started with the Most Advanced Edition Yet of Neo4j Graph Data ScienceGet Started with the Most Advanced Edition Yet of Neo4j Graph Data Science
Get Started with the Most Advanced Edition Yet of Neo4j Graph Data Science
 
Big Data Analytics with Microsoft
Big Data Analytics with MicrosoftBig Data Analytics with Microsoft
Big Data Analytics with Microsoft
 
Big Data Expo 2015 - Barnsten Why Data Modelling is Essential
Big Data Expo 2015 - Barnsten Why Data Modelling is EssentialBig Data Expo 2015 - Barnsten Why Data Modelling is Essential
Big Data Expo 2015 - Barnsten Why Data Modelling is Essential
 

More from Caserta

Using Machine Learning & Spark to Power Data-Driven Marketing
Using Machine Learning & Spark to Power Data-Driven MarketingUsing Machine Learning & Spark to Power Data-Driven Marketing
Using Machine Learning & Spark to Power Data-Driven Marketing
Caserta
 
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...
Caserta
 
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017
Caserta
 
General Data Protection Regulation - BDW Meetup, October 11th, 2017
General Data Protection Regulation - BDW Meetup, October 11th, 2017General Data Protection Regulation - BDW Meetup, October 11th, 2017
General Data Protection Regulation - BDW Meetup, October 11th, 2017
Caserta
 
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
Caserta
 
Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)
Caserta
 
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
Caserta
 
The Rise of the CDO in Today's Enterprise
The Rise of the CDO in Today's EnterpriseThe Rise of the CDO in Today's Enterprise
The Rise of the CDO in Today's Enterprise
Caserta
 
You're the New CDO, Now What?
You're the New CDO, Now What?You're the New CDO, Now What?
You're the New CDO, Now What?
Caserta
 
Making Big Data Easy for Everyone
Making Big Data Easy for EveryoneMaking Big Data Easy for Everyone
Making Big Data Easy for Everyone
Caserta
 
Benefits of the Azure Cloud
Benefits of the Azure CloudBenefits of the Azure Cloud
Benefits of the Azure Cloud
Caserta
 
Big Data Analytics on the Cloud
Big Data Analytics on the CloudBig Data Analytics on the Cloud
Big Data Analytics on the Cloud
Caserta
 
Intro to Data Science on Hadoop
Intro to Data Science on HadoopIntro to Data Science on Hadoop
Intro to Data Science on Hadoop
Caserta
 
The Emerging Role of the Data Lake
The Emerging Role of the Data LakeThe Emerging Role of the Data Lake
The Emerging Role of the Data Lake
Caserta
 
Not Your Father's Database by Databricks
Not Your Father's Database by DatabricksNot Your Father's Database by Databricks
Not Your Father's Database by Databricks
Caserta
 
Moving Past Infrastructure Limitations
Moving Past Infrastructure LimitationsMoving Past Infrastructure Limitations
Moving Past Infrastructure Limitations
Caserta
 
Balancing Data Governance and Innovation
Balancing Data Governance and InnovationBalancing Data Governance and Innovation
Balancing Data Governance and Innovation
Caserta
 
Introducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupIntroducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing Meetup
Caserta
 
Balancing Data Governance and Innovation
Balancing Data Governance and InnovationBalancing Data Governance and Innovation
Balancing Data Governance and Innovation
Caserta
 
What Data Do You Have and Where is It?
What Data Do You Have and Where is It? What Data Do You Have and Where is It?
What Data Do You Have and Where is It?
Caserta
 

More from Caserta (20)

Using Machine Learning & Spark to Power Data-Driven Marketing
Using Machine Learning & Spark to Power Data-Driven MarketingUsing Machine Learning & Spark to Power Data-Driven Marketing
Using Machine Learning & Spark to Power Data-Driven Marketing
 
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...
 
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017
 
General Data Protection Regulation - BDW Meetup, October 11th, 2017
General Data Protection Regulation - BDW Meetup, October 11th, 2017General Data Protection Regulation - BDW Meetup, October 11th, 2017
General Data Protection Regulation - BDW Meetup, October 11th, 2017
 
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
 
Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)
 
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
Looker Data Modeling in the Age of Cloud - BDW Meetup May 2, 2017
 
The Rise of the CDO in Today's Enterprise
The Rise of the CDO in Today's EnterpriseThe Rise of the CDO in Today's Enterprise
The Rise of the CDO in Today's Enterprise
 
You're the New CDO, Now What?
You're the New CDO, Now What?You're the New CDO, Now What?
You're the New CDO, Now What?
 
Making Big Data Easy for Everyone
Making Big Data Easy for EveryoneMaking Big Data Easy for Everyone
Making Big Data Easy for Everyone
 
Benefits of the Azure Cloud
Benefits of the Azure CloudBenefits of the Azure Cloud
Benefits of the Azure Cloud
 
Big Data Analytics on the Cloud
Big Data Analytics on the CloudBig Data Analytics on the Cloud
Big Data Analytics on the Cloud
 
Intro to Data Science on Hadoop
Intro to Data Science on HadoopIntro to Data Science on Hadoop
Intro to Data Science on Hadoop
 
The Emerging Role of the Data Lake
The Emerging Role of the Data LakeThe Emerging Role of the Data Lake
The Emerging Role of the Data Lake
 
Not Your Father's Database by Databricks
Not Your Father's Database by DatabricksNot Your Father's Database by Databricks
Not Your Father's Database by Databricks
 
Moving Past Infrastructure Limitations
Moving Past Infrastructure LimitationsMoving Past Infrastructure Limitations
Moving Past Infrastructure Limitations
 
Balancing Data Governance and Innovation
Balancing Data Governance and InnovationBalancing Data Governance and Innovation
Balancing Data Governance and Innovation
 
Introducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing MeetupIntroducing Kudu, Big Data Warehousing Meetup
Introducing Kudu, Big Data Warehousing Meetup
 
Balancing Data Governance and Innovation
Balancing Data Governance and InnovationBalancing Data Governance and Innovation
Balancing Data Governance and Innovation
 
What Data Do You Have and Where is It?
What Data Do You Have and Where is It? What Data Do You Have and Where is It?
What Data Do You Have and Where is It?
 

Recently uploaded

Welcome to Cyberbiosecurity. Because regular cybersecurity wasn't complicated...
Welcome to Cyberbiosecurity. Because regular cybersecurity wasn't complicated...Welcome to Cyberbiosecurity. Because regular cybersecurity wasn't complicated...
Welcome to Cyberbiosecurity. Because regular cybersecurity wasn't complicated...
Snarky Security
 
DefCamp_2016_Chemerkin_Yury_--_publish.pdf
DefCamp_2016_Chemerkin_Yury_--_publish.pdfDefCamp_2016_Chemerkin_Yury_--_publish.pdf
DefCamp_2016_Chemerkin_Yury_--_publish.pdf
Yury Chemerkin
 
Self-Healing Test Automation Framework - Healenium
Self-Healing Test Automation Framework - HealeniumSelf-Healing Test Automation Framework - Healenium
Self-Healing Test Automation Framework - Healenium
Knoldus Inc.
 
FIDO Munich Seminar Introduction to FIDO.pptx
FIDO Munich Seminar Introduction to FIDO.pptxFIDO Munich Seminar Introduction to FIDO.pptx
FIDO Munich Seminar Introduction to FIDO.pptx
FIDO Alliance
 
Generative AI Reasoning Tech Talk - July 2024
Generative AI Reasoning Tech Talk - July 2024Generative AI Reasoning Tech Talk - July 2024
Generative AI Reasoning Tech Talk - July 2024
siddu769252
 
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...
Zilliz
 
How UiPath Discovery Suite supports identification of Agentic Process Automat...
How UiPath Discovery Suite supports identification of Agentic Process Automat...How UiPath Discovery Suite supports identification of Agentic Process Automat...
How UiPath Discovery Suite supports identification of Agentic Process Automat...
DianaGray10
 
Scaling Vector Search: How Milvus Handles Billions+
Scaling Vector Search: How Milvus Handles Billions+Scaling Vector Search: How Milvus Handles Billions+
Scaling Vector Search: How Milvus Handles Billions+
Zilliz
 
Increase Quality with User Access Policies - July 2024
Increase Quality with User Access Policies - July 2024Increase Quality with User Access Policies - July 2024
Increase Quality with User Access Policies - July 2024
Peter Caitens
 
It's your unstructured data: How to get your GenAI app to production (and spe...
It's your unstructured data: How to get your GenAI app to production (and spe...It's your unstructured data: How to get your GenAI app to production (and spe...
It's your unstructured data: How to get your GenAI app to production (and spe...
Zilliz
 
Choosing the Best Outlook OST to PST Converter: Key Features and Considerations
Choosing the Best Outlook OST to PST Converter: Key Features and ConsiderationsChoosing the Best Outlook OST to PST Converter: Key Features and Considerations
Choosing the Best Outlook OST to PST Converter: Key Features and Considerations
webbyacad software
 
The Path to General-Purpose Robots - Coatue
The Path to General-Purpose Robots - CoatueThe Path to General-Purpose Robots - Coatue
The Path to General-Purpose Robots - Coatue
Razin Mustafiz
 
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan..."Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...
Fwdays
 
Camunda Chapter NY Meetup July 2024.pptx
Camunda Chapter NY Meetup July 2024.pptxCamunda Chapter NY Meetup July 2024.pptx
Camunda Chapter NY Meetup July 2024.pptx
ZachWylie3
 
Perth MuleSoft Meetup July 2024
Perth MuleSoft Meetup July 2024Perth MuleSoft Meetup July 2024
Perth MuleSoft Meetup July 2024
Michael Price
 
FIDO Munich Seminar Blueprint for In-Vehicle Payment Standard.pptx
FIDO Munich Seminar Blueprint for In-Vehicle Payment Standard.pptxFIDO Munich Seminar Blueprint for In-Vehicle Payment Standard.pptx
FIDO Munich Seminar Blueprint for In-Vehicle Payment Standard.pptx
FIDO Alliance
 
FIDO Munich Seminar: Securing Smart Car.pptx
FIDO Munich Seminar: Securing Smart Car.pptxFIDO Munich Seminar: Securing Smart Car.pptx
FIDO Munich Seminar: Securing Smart Car.pptx
FIDO Alliance
 
The History of Embeddings & Multimodal Embeddings
The History of Embeddings & Multimodal EmbeddingsThe History of Embeddings & Multimodal Embeddings
The History of Embeddings & Multimodal Embeddings
Zilliz
 
UiPath Community Day Amsterdam: Code, Collaborate, Connect
UiPath Community Day Amsterdam: Code, Collaborate, ConnectUiPath Community Day Amsterdam: Code, Collaborate, Connect
UiPath Community Day Amsterdam: Code, Collaborate, Connect
UiPathCommunity
 
Mastering Board Best Practices: Essential Skills for Effective Non-profit Lea...
Mastering Board Best Practices: Essential Skills for Effective Non-profit Lea...Mastering Board Best Practices: Essential Skills for Effective Non-profit Lea...
Mastering Board Best Practices: Essential Skills for Effective Non-profit Lea...
OnBoard
 

Recently uploaded (20)

Welcome to Cyberbiosecurity. Because regular cybersecurity wasn't complicated...
Welcome to Cyberbiosecurity. Because regular cybersecurity wasn't complicated...Welcome to Cyberbiosecurity. Because regular cybersecurity wasn't complicated...
Welcome to Cyberbiosecurity. Because regular cybersecurity wasn't complicated...
 
DefCamp_2016_Chemerkin_Yury_--_publish.pdf
DefCamp_2016_Chemerkin_Yury_--_publish.pdfDefCamp_2016_Chemerkin_Yury_--_publish.pdf
DefCamp_2016_Chemerkin_Yury_--_publish.pdf
 
Self-Healing Test Automation Framework - Healenium
Self-Healing Test Automation Framework - HealeniumSelf-Healing Test Automation Framework - Healenium
Self-Healing Test Automation Framework - Healenium
 
FIDO Munich Seminar Introduction to FIDO.pptx
FIDO Munich Seminar Introduction to FIDO.pptxFIDO Munich Seminar Introduction to FIDO.pptx
FIDO Munich Seminar Introduction to FIDO.pptx
 
Generative AI Reasoning Tech Talk - July 2024
Generative AI Reasoning Tech Talk - July 2024Generative AI Reasoning Tech Talk - July 2024
Generative AI Reasoning Tech Talk - July 2024
 
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...
 
How UiPath Discovery Suite supports identification of Agentic Process Automat...
How UiPath Discovery Suite supports identification of Agentic Process Automat...How UiPath Discovery Suite supports identification of Agentic Process Automat...
How UiPath Discovery Suite supports identification of Agentic Process Automat...
 
Scaling Vector Search: How Milvus Handles Billions+
Scaling Vector Search: How Milvus Handles Billions+Scaling Vector Search: How Milvus Handles Billions+
Scaling Vector Search: How Milvus Handles Billions+
 
Increase Quality with User Access Policies - July 2024
Increase Quality with User Access Policies - July 2024Increase Quality with User Access Policies - July 2024
Increase Quality with User Access Policies - July 2024
 
It's your unstructured data: How to get your GenAI app to production (and spe...
It's your unstructured data: How to get your GenAI app to production (and spe...It's your unstructured data: How to get your GenAI app to production (and spe...
It's your unstructured data: How to get your GenAI app to production (and spe...
 
Choosing the Best Outlook OST to PST Converter: Key Features and Considerations
Choosing the Best Outlook OST to PST Converter: Key Features and ConsiderationsChoosing the Best Outlook OST to PST Converter: Key Features and Considerations
Choosing the Best Outlook OST to PST Converter: Key Features and Considerations
 
The Path to General-Purpose Robots - Coatue
The Path to General-Purpose Robots - CoatueThe Path to General-Purpose Robots - Coatue
The Path to General-Purpose Robots - Coatue
 
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan..."Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...
 
Camunda Chapter NY Meetup July 2024.pptx
Camunda Chapter NY Meetup July 2024.pptxCamunda Chapter NY Meetup July 2024.pptx
Camunda Chapter NY Meetup July 2024.pptx
 
Perth MuleSoft Meetup July 2024
Perth MuleSoft Meetup July 2024Perth MuleSoft Meetup July 2024
Perth MuleSoft Meetup July 2024
 
FIDO Munich Seminar Blueprint for In-Vehicle Payment Standard.pptx
FIDO Munich Seminar Blueprint for In-Vehicle Payment Standard.pptxFIDO Munich Seminar Blueprint for In-Vehicle Payment Standard.pptx
FIDO Munich Seminar Blueprint for In-Vehicle Payment Standard.pptx
 
FIDO Munich Seminar: Securing Smart Car.pptx
FIDO Munich Seminar: Securing Smart Car.pptxFIDO Munich Seminar: Securing Smart Car.pptx
FIDO Munich Seminar: Securing Smart Car.pptx
 
The History of Embeddings & Multimodal Embeddings
The History of Embeddings & Multimodal EmbeddingsThe History of Embeddings & Multimodal Embeddings
The History of Embeddings & Multimodal Embeddings
 
UiPath Community Day Amsterdam: Code, Collaborate, Connect
UiPath Community Day Amsterdam: Code, Collaborate, ConnectUiPath Community Day Amsterdam: Code, Collaborate, Connect
UiPath Community Day Amsterdam: Code, Collaborate, Connect
 
Mastering Board Best Practices: Essential Skills for Effective Non-profit Lea...
Mastering Board Best Practices: Essential Skills for Effective Non-profit Lea...Mastering Board Best Practices: Essential Skills for Effective Non-profit Lea...
Mastering Board Best Practices: Essential Skills for Effective Non-profit Lea...
 

Mastering Customer Data on Apache Spark

  • 1. Mastering Customer Data on Apache Spark BIG DATA WAREHOUSING MEETUP AWS LOFT APRIL 7, 2016
  • 2. Agenda 6:30 Networking Grab some food and drink... Make some friends. 6:50 Joe Caserta President Caserta Concepts Welcome + Intro to BDW Meetup About the Meetup. Why MDM needs Graph now. 7:15 Kevin Rasmussen Big Data Engineer Caserta Concepts Deeper Dive into Spark/MDM/Graph Deep dive into DUNBAR technology and how we came up with it for Customer Data Integration 7:45 Vida Ha, Lead Solutions Engineer Databricks Using Apache Spark Intro to the different components of Spark: MLLib, GraphX, SQL, Streaming, Python, ETL 8:30 Q&A Ask Questions, Share your experience
  • 3. About Caserta Concepts • Consulting Data Innovation and Modern Data Engineering • Award-winning company • Internationally recognized work force • Strategy, Architecture, Implementation, Governance • Innovation Partner • Strategic Consulting • Advanced Architecture • Build & Deploy • Leader in Enterprise Data Solutions • Big Data Analytics • Data Warehousing • Business Intelligence • Data Science • Cloud Computing • Data Governance Amazon Best Sellers Most popular products based on sales. Updated hourly.
  • 4. Client Portfolio Retail/eCommerce & Manufacturing Digital Media/AdTech Education & Services Finance. Healthcare & Insurance
  • 7. This Meetup CIL - Caserta Innovations Lab Experience Big Data Warehousing Meetup • Established in 2012 in NYC • Meet monthly to share data best practices, experiences • 3,700+ Members http://www.meetup.com/Big-Data-Warehousing/ http://www.slideshare.net/CasertaConcepts/ Examples of Topics • Data Governance, Compliance & Security in Hadoop w/Cloudera • Real Time Trade Data Monitoring with Storm & Cassandra • Predictive Analytics • Exploring Big Data Analytics Techniques w/Datameer • Using a Graph DB for MDM & Relationship Mgmt • Data Science w/ Revolution Analytics • Processing 1.4 Trillion Events in Hadoop • Building a Relevance Engine using Hadoop, Mahout & Pig • Big Data 2.0 – YARN Distributed ETL & SQL w/Hadoop • Intro to NoSQL w/10GEN
  • 8. Informational Master Data MDM Information Ecosystem 8 Operational Master Data Holistic Master Data Service Leads Policies Claims Enrolls Sales Finance DW Dimensions & Cross-References Marketing Insights
  • 9. What is wrong with traditional approach to MDM • Conceptually problems with “enterprise” approach • Long, complex implementations  low ROI • Complex data model • Too much human interaction • Deliverable??? • Challenges with big data • Data volumes • Evolving data sources • Need to further remove humans out of the process
  • 10.  Hierarchical relationships are never rigid  Relational models with tables and columns not flexible enough  Neo4j is the leading graph database  Many MDM systems are going graph:  Pitney Bowes - Spectrum MDM  Reltio - Worry-Free Data for Life Sciences. Graph Databases to the Rescue
  • 11. How does a Graph DB help MDM • Data is stored in it’s natural form  no mismatch between requirements and data model • Both Nodes and Relationships can have properties  supports sparse and evolving data • MDM for analytics  your MDM solution now delivers new enablement, not just a back office system • Relationship science
  • 12. Open source and commercial Gelphi Tom Sawyer linkurio.us
  • 13. Caserta Innovation Lab (CIL) • Internal laboratory established to test & develop solution concepts and ideas • Used to accelerate client projects • Examples: • Search (SOLR) based BI • Big Data Governance Toolkit / Data Quality Sub-System • Text Analytics on Social Network Data • Continuous Integration / End-to-end streaming • Recommendation Engine Optimization • Relationship Intelligence / Spark Graph / CDI (Dunbar) • CIL is hosted on
  • 14. Introducing Dunbar (Relationship Intelligence / CDI ) Kevin Rasmussen Big Data Engineer, Caserta Concepts kevin@casertaconcepts.com
  • 15. How many people do you know??
  • 17. What if we could increase this number? Opportunities Closing Revenue Yeah,butthat’sCRM101… Howcanweimprovethis? Expand Data Sources Explore Relationships Enhanced Insight
  • 18. Project Dunbar - Internal Developed in: • Python • Neo4j Database • Build a social graph based on internal and external data • Run pathing algorithms to understand strategic opportunity advantages
  • 19. Whoa… Not so FAST! Throwing a bunch of unrelated points in a graph will not give us a useable solution. We need to MASTER our contact data… • We need to clean and normalize our incoming interaction and relationship data (edges) • Clean normalize and match our entities (vertexes)
  • 20. Mastering Customer Data Customer Data Integration (CDI): is the process of consolidating and managing customer information from all available sources. In other words… We need to figure out how to LINK people across systems!
  • 22. Traditional Standardization and Matching Matching: join based on combinations of cleansed and standardized data to create match results Cleanse and Parse: • Names • Resolve nicknames • Create deterministic hash, phonetic representation • Addresses • Emails • Phone Numbers
  • 23. Great – But the NEW data is different Reveal • Wait for the customer to “reveal” themselves • Create a link between anonymous self and known profile Vector • May need behavioral statistical profiling • Compare use vectors Rebuild • Recluster all prior activities • Rebuild the Graph
  • 24. ..and then things got bigger One of our customers wanted us to build it for them: • Much larger dataset • 6 million customers > 30% duplication rate • 100’s of millions of customer interactions • Few direct links across channels
  • 25. Why Spark “Big Box” MDM tools vs ROI? • Prohibitively expensive  limited by licensing $$$ • Typically limited to the scalability of a single server We Spark! • Development local or distributed is similar • Beautiful high level API’s • Databricks cloud is soo Easy • Full universe of Python modules • Open source and Free** • Blazing fast! Spark has become our default processing engine for a myriad of engineering problems
  • 26. Spark map operations Cleansing, transformation, and standardization of both interaction and customer data Amazing universe of Python modules: • Address Parsing: usaddress, postal-address, etc • Name Hashing: fuzzy, etc • Genderization: sexmachine, etc And all the goodies of the standard library! We can now parallelize our workload against a large number of machines:
  • 27. Matching process • We now have clean standardized, linkable data • We need to resolve our links between our customer • Large table self joins • We can even use SQL:
  • 28. Matching process The matching process output gives us the relationships between customers: Great, but it’s not very useable, you need to traverse the dataset to find out 1234 and 1235 are the same person (and this is a trivial case) And we need to cluster and identify our survivors (vertex) xid yid match_type 1234 4849 phone 4849 5499 email 5499 1235 address 4849 7788 cookie 5499 7788 cookie 4849 1234 phone
  • 29. Graphx to the rescue 1234 4849 5499 7788 We just need to import our edges into a graph and “dump” out communities Don’t think table… think Graph! These matches are actually communities 1235
  • 30. Connected components Connected Components algorithm labels each connected component of the graph with the ID of its lowest-numbered vertex This lowest number vertex can serve as our “survivor” (not field survivorship) Is it possible to write less code?
  • 31. Field level survivorship rules We now need to survive fields to our survivor to make it a “best record”. Depending on the attribution we choose: • Latest reported • Most frequently used We then do some simple ranking (thanks windowed functions)
  • 32. What you really want to know... With a customer dataset of approximately 6 million customers and 100’s of millions of data points. … when directly compared to traditional “big box” enterprise MDM software Was Spark faster….
  • 33. Map operations like cleansing and standardization saw a 10x improvement on a 4 node r3.2xlarge cluster But the real winner was the “Graph Trick”… Survivorship step reduced from 2 hours 5 minutes Connected Components < 1 minute **loading RDD’s from S3, building the graph, and running connected components
  • 34. Next Steps… • Graph Frames • In development by Berkley Amp Labs • Combines Full Graph algorithm nature of GraphX with Relationship mining of Neo4j
  • 35. Thank You / Q&A Kevin Rasmussen Big Data Engineer, Caserta Concepts kevin@casertaconcepts.com