Introduction to Data Science
On Hadoop
Joe Caserta
Caserta Concepts
Caserta Timeline
Launched Big Data practice Co-author, with Ralph Kimball, The Data
Warehouse ETL Toolkit (Wiley)
Data Analysis, Data Warehousing and Business
Intelligence since 1996
Began consulting database programming and data
modeling 25+ years hands-on experience building database
Founded Caserta Concepts in NYC
Web log analytics solution published in Intelligent
Launched Data Science, Data Interaction and
Cloud practices Laser focus on extending Data Analytics with Big
Data solutions
Dedicated to Data Governance Techniques on Big
Data (Innovation)
Awarded Top 20 Big Data
Companies 2016
Top 20 Most Powerful
Big Data consulting firms
Launched Big Data Warehousing (BDW) Meetup
NYC: 2,000+ Members
2016 Awarded Fastest Growing Big
Data Companies 2016
Established best practices for big data ecosystem
About Caserta Concepts
• Technology services company with expertise in data analysis:
• Big Data Solutions
• Data Warehousing
• Business Intelligence
• Core focus in the following industries:
• eCommerce / Retail / Marketing
• Financial Services / Insurance
• Healthcare / Ad Tech / Higher Ed
• Established in 2001:
• Increased growth year-over-year
• Industry recognized work force
• Strategy and Implementation
• Data Science & Analytics
• Data on the Cloud
• Data Interaction & Visualization
• Why we care about Big Data
• Challenges of working with Big Data
• Governing Big Data for Data Science
• Introducing the Data Pyramid
• Why Data Science is Cool?
• What does a Data Scientist do?
• Standards for Data Science
• Business Objective
• Data Discovery
• Preparation
• Models
• Evaluation
• Deployment
• Q & A
Ad-Hoc Query
Horizontally Scalable Environment - Optimized for Analytics
Big Data Lake
Canned Reporting
Big Data Analytics
Traditional BI
Spark MapReduce Pig/Hive
N1 N2 N3 N4 N5
Hadoop Distributed File System (HDFS)
Today’s business environment requires Big Data
Data Science
•Data is coming in so
fast, how do we
monitor it?
•Real real-time
•What does
“complete” mean
•Dealing with sparse,
incomplete, volatile,
and highly
manufactured data.
How do you certify
sentiment analysis?
•Wider breadth of
datasets and sources
in scope requires
larger data
governance support
•Data governance
cannot start at the
data warehouse
•Data volume is higher
so the process must
be more reliant on
•Less people/process
Volume Variety
The Challenges Building a Data Lake
What’s Old is New Again
 Before Data Warehousing Governance
 Users trying to produce reports from raw source data
 No Data Conformance
 No Master Data Management
 No Data Quality processes
 No Trust: Two analysts were almost guaranteed to come up with two
different sets of numbers!
 Before Data Lake Governance
 We can put “anything” in Hadoop
 We can analyze anything
 We’re scientists, we don’t need IT, we make the rules
 Rule #1: Dumping data into Hadoop with no repeatable process, procedure, or data governance
will create a mess
 Rule #2: Information harvested from ungoverned systems will take us back to the old days: No
Trust = Not Actionable
Making it Right
 The promise is an “agile” data culture where communities of users are encouraged to explore
new datasets in new ways
 New tools
 External data
 Data blending
 Decentralization
 With all the V’s, data scientists, new tools, new data we must rely LESS on HUMANS
 We need more systemic administration
 We need systems, tools to help with big data governance
 This space is EXTREMELY immature!
 Steps towards Data Governance for the Data Lake
1. Establish difference between traditional data and big data governance
2. Establish basic rules for where new data governance can be applied
3. Establish processes for graduating the products of data science to governance
4. Establish a set of tools to make governing Big Data feasible
Process Architecture
Value Proposition
Enterprise Data
Data Integrity
Control Mechanisms
Principles and
Information Usability
BDG provides vision, oversight and accountability for leveraging corporate
information assets to create competitive advantage, and accelerate the vision
of integrated delivery.
Value Creation
• Acts on Requirements
Build Capabilities
• Does the Work
• Responsible for adherence
Data Stewards
Project Teams
Enterprise Data
• Executive Oversight
• Prioritizes work
Drives change
Accountable for results
Data Governance for the Data Lake
• Organization: This is the ‘people’ part. Establishing Enterprise Data Council, Data Stewards, etc.
• Definitions, lineage (where does this data come from), business definitions, technical
• Privacy/Security: Identify and control sensitive data, regulatory compliance
• Data Quality and Monitoring: Data must be complete and correct. Measure, improve, certify
• Business Process Integration: Policies around data frequency, source availability, etc.
• Master Data Management: Ensure consistent business-critical data, e.g. Members, Providers, Agents, etc.
• Information Lifecycle Management (ILM): Data retention, purge schedule, storage/archiving
Components of Data Governance
• Add Big Data to overall framework and assign responsibility
• Add data scientists to the Stewardship program
• Assign stewards to new data sets (twitter, call center logs, etc.)
• Graph databases are more flexible than relational
• Lower latency service required
• Distributed data quality and matching algorithms
• Data Quality and Monitoring (probably home grown, drools?)
• Quality checks not only SQL: machine learning, Pig and Map Reduce
• Acting on large dataset quality checks may require distribution
• Larger scale
• New datatypes
• Integrate with Hive Metastore, HCatalog, home grown tables
• Secure and mask multiple data types (not just tabular)
• Deletes are less common (unless there is a regulatory requirement)
• Take advantage of compression and archiving (like AWS Glacier)
• Data detection and masking on unstructured data upon ingest
• Near-zero latency, DevOps, Core component of business operations
For Big Data
Data Lake Governance Realities
 Full data governance can only be applied to “Structured” data
 The data must have a known and well documented schema
 This can include materialized endpoints such as files or tables OR projections
such as a Hive table
 Governed structured data must have:
 A known schema with Metadata
 A known and certified lineage
 A monitored, quality test, managed process for ingestion and transformation
 A governed usage  Data isn’t just for enterprise BI tools anymore
 We talk about unstructured data in Hadoop, but more often it is
semi-structured or structured data with a definable schema.
 Even in the case of unstructured data, structure must be extracted/applied in
just about every case imaginable before analysis can be performed.
The Data Scientists Can Help!
 Data Science to Big Data Warehouse mapping
 Full Data Governance Requirements
 Provide full process lineage
 Data certification process by data stewards and business owners
 Ongoing Data Quality monitoring that includes Quality Checks
 Provide requirements for Data Lake
 Proper metadata established:
 Catalog
 Data Definitions
 Lineage
 Quality monitoring
 Know and validate data completeness
Data Science Workspace
Data Lake
Landing Area
The Big Data Analytics Pyramid
Metadata  Catalog
ILM  who has access, how long do
we “manage it”
Raw machine data
collection, collect
Data is ready to be turned into
information: organized, well defined,
Agile business insight through data-munging,
machine learning, blending with external data,
development of to-be BDW facts
Metadata  Catalog
ILM  who has access, how long do we “manage it”
Data Quality and Monitoring 
Monitoring of completeness of data
Metadata  Catalog
ILM  who has access, how long do we “manage it”
Data Quality and Monitoring  Monitoring of
completeness of data
 Hadoop has different governance demands at each tier.
 Only top tier of the pyramid is fully governed.
 We refer to this as the Trusted tier of the Big Data Warehouse.
Fully Data Governed ( trusted)User community arbitrary queries and reporting
Usage Pattern Data Governance
What does a Data Scientist Do, Anyway?
 Writes really cool and sophisticated
algorithms that impact the way the business
 Much of the time of a Data Scientist is spent:
 Searching for the data they need
 Making sense of the data
 Figuring out why the data looks the way it does and assessing its validity
 Cleaning up all the garbage within the data so it represents true business
 Combining events with Reference data to give it context
 Correlating event data with other events
 Finally, they write algorithms to perform mining, clustering and predictive analytics
Why Data Science?
Prescriptive Analytics
Why did it
What will
How can we make
It happen?
Data Analytics Sophistication
Source: Gartner
The Data Scientist Winning Trifecta
Modern Data
Easier to Find Than an Awesome Data Scientist
Modern Data Engineering
Which Visualization, When?
Advanced Mathematics / Statistics
Domain and Outcome Sensibility
Are there Standards?
CRISP-DM: Cross Industry Standard Process for Data Mining
1. Business Understanding
• Solve a single business problem
2. Data Understanding
• Discovery
• Data Munging
• Cleansing Requirements
3. Data Preparation
4. Modeling
• Evaluate various models
• Iterative experimentation
5. Evaluation
• Does the model achieve business objectives?
6. Deployment
• PMML; application integration; data platform; Excel
1. Business Understanding
In this initial phase of the project we will need to speak to humans.
• It would be premature to jump into the data, or begin selection of
the appropriate model(s) or algorithm
• Understand the project objective
• Review the business requirements
• The output of this phase will be conversion of business requirements
into a preliminary technical design (decision model) and plan.
Since this is an iterative process, this phase will be revisited throughout
the entire process.
Business Stakeholders
Business Stakeholders
Business Stakeholders
Interview notes
Requirement Document
Models / Insights
Gathering Requirements
Data Science Scrum Team
Efficient Inclusive
2. Data Understanding
• Data Discovery  understand where the data you need comes from
• Data Profiling  interrogate the data at the entity level,
understand key entities and fields that are relevant to the
• Cleansing Requirements  understand data quality, data
density, skew, etc
• Data Munging  collocate, blend and analyze data for early
insights! Valuable information can be achieved from simple
group-by, aggregate queries, and even more with SQL Jujitsu!
Significant iteration between Business Understanding and Data
Understanding phases.
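The profiling and munging steps above can be sketched in a few lines of plain Python. This is only an illustration: the `orders` records, their field names, and the `profile` and `group_sum` helpers are hypothetical and not part of the original deck. On Hadoop the same idea would more likely be expressed in Hive, Pig, or Spark.

```python
from collections import Counter, defaultdict

# Hypothetical sample records; in practice these would come from the data lake.
orders = [
    {"customer": "A", "state": "NY", "amount": 120.0},
    {"customer": "B", "state": "NY", "amount": None},
    {"customer": "A", "state": "NJ", "amount": 80.0},
    {"customer": "C", "state": None, "amount": 45.5},
]

def profile(rows):
    """Per-field density (share of non-null values), distinct count, most common value."""
    fields = sorted({key for row in rows for key in row})
    report = {}
    for field in fields:
        present = [row.get(field) for row in rows if row.get(field) is not None]
        report[field] = {
            "density": len(present) / len(rows),
            "distinct": len(set(present)),
            "top": Counter(present).most_common(1),
        }
    return report

def group_sum(rows, key, measure):
    """Simple group-by aggregate: sum of `measure` per `key`, skipping nulls."""
    totals = defaultdict(float)
    for row in rows:
        if row[key] is not None and row[measure] is not None:
            totals[row[key]] += row[measure]
    return dict(totals)

report = profile(orders)
totals_by_state = group_sum(orders, "state", "amount")
print(report["amount"]["density"])
print(totals_by_state)
```

Even a crude report like this surfaces sparse fields and skew before any modeling starts.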
Exploration tools
for Hadoop:
Trifacta, Paxata,
Spark, Python,
Pig, Hive,
Data Exploration in Hadoop - Avoid low level coding
Start by evaluating DSL’s
Core or
Will a
Custom UDF
Use Streaming or
Native MR
Practical to
express in
Data Science Data Quality Priorities
Be Fast
Be Thorough
Data Science Data Quality Priorities
Data Quality
Raw Refined
Does Data munging in a data science
lab need the same restrictive
governance and enterprise reporting?
3. Data Preparation
ETL (Extract Transform Load)
90+% of a Data Scientist's time goes into Data Preparation!
• Select required entities/fields
• Address Data Quality issues: missing or incomplete values,
whitespace, bad data-points
• Join/Enrich disparate datasets
• Transform/Aggregate data for intended use:
• Sample
• Aggregate
• Pivot
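A minimal sketch of the preparation steps listed above (select fields, fix whitespace and bad data points, aggregate), using only the Python standard library. The `raw` records and the `cleanse` and `to_float` helpers are hypothetical examples, not the deck's actual pipeline. At scale the same logic would typically run in Spark.

```python
from collections import defaultdict

# Hypothetical raw records with typical quality problems.
raw = [
    {"id": "1", "name": "  Alice ", "region": "east", "sales": "100"},
    {"id": "2", "name": "Bob", "region": "WEST ", "sales": ""},
    {"id": "3", "name": "", "region": "East", "sales": "n/a"},
    {"id": "4", "name": "Dana", "region": "west", "sales": "250.5"},
]

def to_float(text):
    """Parse a number, returning None for missing or bad data points."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None

def cleanse(row):
    """Select the required fields, trim whitespace, and standardize case."""
    return {
        "id": int(row["id"]),
        "name": row["name"].strip() or None,
        "region": row["region"].strip().lower(),
        "sales": to_float(row["sales"]),
    }

clean = [cleanse(row) for row in raw]

# Aggregate: total sales per region, ignoring rows with no usable value.
sales_by_region = defaultdict(float)
for row in clean:
    if row["sales"] is not None:
        sales_by_region[row["region"]] += row["sales"]
sales_by_region = dict(sales_by_region)
print(sales_by_region)
```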
Data Preparation
• We love Spark!
• ETL can be done in Scala,
Python or SQL
• Cleansing, transformation,
and standardization
• Address Parsing:
usaddress, postal-address,
• Name Hashing: fuzzy, etc
• Genderization:
sexmachine, etc
• And all the goodies of the
standard Python library!
• Parallelize workload
against a large number of
machines in Hadoop
Data Quality and Monitoring
• BUILD a robust data quality subsystem:
• Metadata and error event facts
• Orchestration
• Based on Data Warehouse ETL Toolkit
• Each error instance of each data quality
check is captured
• Implemented as sub-system after
• Each fact stores unique identifier of the
defective source row
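One way to picture the error event idea above: every failed check produces one record that carries the identifier of the defective source row. The sketch below is an assumption-laden illustration. The `ErrorEvent` fields, the two checks, and the sample rows are hypothetical and are not taken from HAMBot or from the book.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class ErrorEvent:
    """One row of a hypothetical error event fact table."""
    check_name: str      # which data quality check fired
    source_row_id: str   # unique identifier of the defective source row
    detail: str
    detected_at: datetime

# Each check returns an error message for a bad row, or None if the row passes.
def check_not_null_email(row):
    return None if row.get("email") else "email is missing"

def check_age_range(row):
    age = row.get("age")
    if age is None or not (0 <= age <= 120):
        return f"age out of range: {age!r}"
    return None

CHECKS = {"not_null_email": check_not_null_email, "age_range": check_age_range}

def run_checks(rows):
    """Run every check against every row; capture each failure as its own event."""
    events = []
    for row in rows:
        for name, check in CHECKS.items():
            problem = check(row)
            if problem is not None:
                events.append(
                    ErrorEvent(name, row["row_id"], problem, datetime.now(timezone.utc))
                )
    return events

rows = [
    {"row_id": "r1", "email": "a@example.com", "age": 34},
    {"row_id": "r2", "email": "", "age": 150},
    {"row_id": "r3", "email": None, "age": 28},
]
events = run_checks(rows)
for event in events:
    print(event.check_name, event.source_row_id, event.detail)
```

Because each event keeps the source row identifier, quality can be trended per check and traced back to the offending records.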
HAMBot: ‘open
source’ project
created in Caserta
Innovation Lab
4. Modeling
Do you love algebra & stats?
• Evaluate various models/algorithms
• Classification
• Clustering
• Regression
• Many others…..
• Tune parameters
• Iterative experimentation
• Different models may require different data preparation
techniques (e.g. Sparse Vector Format)
• Additionally we may discover the need for additional data points,
or uncover additional data quality issues!
Modeling in Hadoop
• Spark works well
• SAS, SPSS, Etc. not
native on Hadoop
• R and Python
becoming new
• PMML can be used,
but approach with
Machine Learning
The goal of machine learning is to get software to make decisions and learn
from data without being explicitly programmed to do so
Machine Learning algorithms are broadly broken out into two groups:
• Supervised learning  inferring functions based on labeled training data
• Unsupervised learning  finding hidden structure/patterns within data, no
training data is supplied
We will review some popular, easy to understand machine learning
What to use When?
Supervised Learning
Name Weight Color Cat_or_Dog
Susie 9lbs Orange Cat
Fido 25lbs Brown Dog
Sparkles 6lbs Black Cat
Fido 9lbs Black Dog
Name Weight Color Cat_or_Dog
Misty 5lbs Orange ?
The training set is used to generate a function so we can predict whether we have a cat or a dog!
Category or Values?
There are several classes of algorithms depending on whether the prediction is a
category (like cat or dog) or a value, like the value of a home.
Classification algorithms are generally well fit for categorization, while algorithms
like Regression and Decision Trees are well suited for predicting values.
Linear Regression
• Understanding the relationship between a given set of dependent variables
and independent variables
• Typically regression is used to predict the output of a dependent variable
based on variations in independent variables
• Very popular for prediction and forecasting
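For a single independent variable, the regression idea described here reduces to fitting a straight line by ordinary least squares. The sketch below uses made-up home sizes and prices (hypothetical data, chosen only to echo the "value of a home" example) and a hypothetical `fit_line` helper.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one independent variable: y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: home size (square feet) vs. price (thousands of dollars).
sizes = [1000, 1500, 2000, 2500]
prices = [200, 290, 410, 500]

slope, intercept = fit_line(sizes, prices)
predicted = slope * 1800 + intercept  # predicted price of an 1,800 sq ft home
print(round(slope, 3), round(intercept, 1), round(predicted, 1))
```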
Decision Trees
• A method for predicting outcomes based on the features of data
• Model is represented as an easy-to-understand tree structure of if-else statements
Weight > 10lbs
color = orange
name = fido
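Written out as if-else statements, a tree using the three tests above might look like the sketch below. The slide only names the tests, so the branch order and the leaf labels are assumptions, chosen so that the tree agrees with the four training rows from the cat-or-dog table.

```python
def classify(name, weight_lbs, color):
    """Hand-written decision tree; branch order and leaves are assumptions."""
    if weight_lbs > 10:
        return "Dog"
    if color.lower() == "orange":
        return "Cat"
    if name.lower() == "fido":
        return "Dog"
    return "Cat"

# Training rows from the supervised learning slide: (name, weight in lbs, color, label).
training = [
    ("Susie", 9, "Orange", "Cat"),
    ("Fido", 25, "Brown", "Dog"),
    ("Sparkles", 6, "Black", "Cat"),
    ("Fido", 9, "Black", "Dog"),
]

correct = sum(classify(n, w, c) == label for n, w, c, label in training)
misty = classify("Misty", 5, "Orange")
print(f"{correct}/{len(training)} training rows classified correctly")
print("Misty ->", misty)
```

In practice the tree would be learned from the training set rather than written by hand. The "name = fido" branch also illustrates how a tree can overfit to incidental features.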
Unsupervised K-Means
• Treats items as coordinates
• Places a number of random “centroids”
and assigns the nearest items
• Moves the centroids around based on
average location
• Process repeats until the assignments
stop changing
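The loop described above can be sketched as follows. To keep the example reproducible it starts from fixed centroids instead of random ones. That choice, the sample points, and the `kmeans` helper name are assumptions made for illustration.

```python
import math

def kmeans(points, initial_centroids, max_iterations=100):
    """Basic k-means loop: returns (centroids, assignments)."""
    centroids = list(initial_centroids)
    assignments = None
    for _ in range(max_iterations):
        # Assign each item to its nearest centroid.
        new_assignments = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # assignments stopped changing
        assignments = new_assignments
        # Move each centroid to the average location of its items.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, assignments

# Hypothetical 2-D items forming two obvious groups.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
centroids, assignments = kmeans(points, [(0, 0), (10, 10)])
print(assignments)
print(centroids)
```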
Clustering of items into logical groups based on natural patterns in data
• Cluster Analysis
• Classification
• Content Filtering
Collaborative Filtering
• A hybrid of Supervised and Unsupervised Learning (Model Based vs. Memory Based)
• Leveraging collaboration between multiple agents to filter, project, or detect
• Popular in recommender systems for projecting the “taste” of specific
individuals for items on which they have not yet expressed a preference.
• A popular and simple memory-based collaborative filtering algorithm
• Projects preference based on item similarity (based on ratings):
for every item i that u has no preference for yet
for every item j that u has a preference for
compute a similarity s between i and j
add u's preference for j, weighted by s, to a running average
return the top items, ranked by weighted average
• First a matrix of Item to Item similarity is calculated based on user rating
• Then recommendations are created by producing a weighted sum of top items,
based on the user's previously rated items
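The pseudocode above translates fairly directly into Python. The deck does not say which similarity measure is used, so the sketch assumes cosine similarity over users who rated both items. The `ratings` data and function names are hypothetical.

```python
import math

# Hypothetical ratings: user -> {item: rating}.
ratings = {
    "ann":  {"A": 5, "B": 4, "C": 1},
    "bob":  {"A": 4, "B": 5, "D": 2},
    "cara": {"A": 1, "C": 5, "D": 4},
    "dave": {"A": 5, "B": 3},
}

def similarity(i, j):
    """Cosine similarity between items i and j over users who rated both (assumed measure)."""
    common = [u for u, r in ratings.items() if i in r and j in r]
    if not common:
        return 0.0
    dot = sum(ratings[u][i] * ratings[u][j] for u in common)
    norm_i = math.sqrt(sum(ratings[u][i] ** 2 for u in common))
    norm_j = math.sqrt(sum(ratings[u][j] ** 2 for u in common))
    return dot / (norm_i * norm_j)

def recommend(user, top_n=3):
    """Rank unrated items by the similarity-weighted average of the user's own ratings."""
    rated = ratings[user]
    all_items = {item for r in ratings.values() for item in r}
    scores = {}
    for i in sorted(all_items - rated.keys()):      # items u has no preference for yet
        weighted = total = 0.0
        for j, preference in rated.items():         # items u has a preference for
            s = similarity(i, j)
            weighted += s * preference
            total += s
        if total > 0:
            scores[i] = weighted / total
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

recommendations = recommend("dave")
print(recommendations)
```

With so few co-raters the similarities are crude (a single shared rater yields a similarity of 1.0), which is one reason production systems add minimum-overlap thresholds or other similarity measures.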
5. Evaluation
What problem are we trying to solve again?
• Our final solution needs to be evaluated against original
Business Understanding
• Did we meet our objectives?
• Did we address all issues?
6. Deployment
Engineering Time!
• It’s time for the work products of data science to “graduate” from “new
insights” to real applications.
• Processes must be hardened, repeatable, and generally perform well too!
• Data Governance applied
• PMML (Predictive Model Markup Language): XML based interchange format
My Favorite Data Science Project
• Recommendation Engines
Project Objective
• Create a functional recommendation engine to surface relevant
product recommendations to customers.
• Improve Customer Experience
• Increase Customer Retention
• Increase Customer Purchase Activity
• Accurately suggest relevant products to customers based on their peer
• Your customers expect them
• Good recommendations make life easier
• Help them find information, products, and services they might not have
thought of
• What makes a good recommendation?
• Relevant but not obvious
• Sense of “surprise”
23” LED TV 24” LED TV 25” LED TV
23” LED TV
Blu-Ray Home Theater HDMI Cables
Where do we use recommendations?
• Applications can be found in a wide variety of industries and applications:
• Travel
• Financial Services
• Music/Online radio
• TV and Video
• Online Publications
• Retail
..and countless others
Our Example: Movies
The Goal of the Recommender
• Create a powerful, scalable recommendation engine with minimal development
• Make recommendations to users as they are browsing movie titles -
• Recommendation must have context to the movie they are currently viewing.
OOPS! – too much surprise!
Recommender Tools & Techniques
Hadoop – distributed file system and processing platform
Spark – low-latency computing
MLlib – Library of Machine Learning Algorithms
We leverage two algorithms:
• Content-Based Filtering – how similar is this particular movie to other movies based on
• Collaborative Filtering – predict an individual's preference based on their peers' ratings.
Spark MLlib implements a collaborative filtering algorithm called Alternating Least Squares
• Both algorithms only require a simple dataset of 3 fields:
“User ID” , “Item ID”, “Rating”
Content-Based Filtering
“People who liked this movie liked these as well”
• Content Based Filter builds a matrix of items to other items and calculates
similarity (based on user rating)
• The most similar items are then output as a list:
• Item ID, Similar Item ID, Similarity Score
• Items with the highest score are most similar
• In this example users who liked “Twelve Monkeys” (7) also liked “Fargo” (100)
7 100 0.690951001800917
7 50 0.653299445638532
7 117 0.643701303640083
Collaborative Filtering
“People with similar taste to you liked these movies”
• Collaborative filtering applies weights based on “peer” user preference.
• Essentially it determines the best movie critics for you to follow
• The items with the highest recommendation score are then output as tuples
• User ID [Item ID1:Score,…., Item IDn:Score]
• Items with the highest recommendation score are the most relevant to this user
• For user “Johny Sisklebert” (572), the two most highly recommended movies are “Seven” and
“Donnie Brasco”
572 [11:5.0,293:4.70718,8:4.688335,273:4.687676,427:4.685926,234:4.683155,168:4.669672,89:4.66959,4:4.65515]
573 [487:4.54397,1203:4.5291,616:4.51644,605:4.49344,709:4.3406,502:4.33706,152:4.32263,503:4.20515,432:4.26455,611:4.22019]
574 [1:5.0,902:5.0,546:5.0,13:5.0,534:5.0,533:5.0,531:5.0,1082:5.0,1631:5.0,515:5.0]
Recommendation Store
• Serving recommendations needs to be instantaneous
• The core to this solution is two reference tables:
• When called to make recommendations we query our store
• Rec_Item_Similarity based on the Item_ID they are viewing
• Rec_User_Item_Base based on their User_ID
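The two lookups can be pictured with an in-memory SQLite database standing in for the real serving store. The table names come from the slide. The column names, indexes, and `lookup` helper are assumptions, and the sample rows reuse the scores shown earlier for item 7 and user 572.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a low-latency serving store
conn.executescript("""
    CREATE TABLE Rec_Item_Similarity (item_id INTEGER, similar_item_id INTEGER, score REAL);
    CREATE TABLE Rec_User_Item_Base (user_id INTEGER, item_id INTEGER, score REAL);
    CREATE INDEX idx_item ON Rec_Item_Similarity (item_id);
    CREATE INDEX idx_user ON Rec_User_Item_Base (user_id);
""")
conn.executemany(
    "INSERT INTO Rec_Item_Similarity VALUES (?, ?, ?)",
    [(7, 100, 0.690951001800917), (7, 50, 0.653299445638532), (7, 117, 0.643701303640083)],
)
conn.executemany(
    "INSERT INTO Rec_User_Item_Base VALUES (?, ?, ?)",
    [(572, 11, 5.0), (572, 293, 4.70718), (572, 8, 4.688335)],
)

def lookup(viewing_item_id, user_id, limit=10):
    """Fetch precomputed recommendations for the item being viewed and for the user."""
    similar = conn.execute(
        "SELECT similar_item_id, score FROM Rec_Item_Similarity "
        "WHERE item_id = ? ORDER BY score DESC LIMIT ?",
        (viewing_item_id, limit),
    ).fetchall()
    peer = conn.execute(
        "SELECT item_id, score FROM Rec_User_Item_Base "
        "WHERE user_id = ? ORDER BY score DESC LIMIT ?",
        (user_id, limit),
    ).fetchall()
    return similar, peer

similar, peer = lookup(7, 572)
print(similar)
print(peer)
```

Because the heavy computation happens offline, serving a recommendation is just two indexed key lookups.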
Delivering Recommendations
Peers like these
Item Similarity Raw Score Score
Fargo 0.691 1.000
Star Wars 0.653 0.946
Rock, The 0.644 0.932
Pulp Fiction 0.628 0.909
Return of the Jedi 0.627 0.908
Independence Day 0.618 0.894
Willy Wonka 0.603 0.872
Mission: Impossible 0.597 0.864
Silence of the Lambs, The 0.596 0.863
Star Trek: First Contact 0.594 0.859
Raiders of the Lost Ark 0.584 0.845
Terminator, The 0.574 0.831
Blade Runner 0.571 0.826
Usual Suspects, The 0.569 0.823
Seven (Se7en) 0.569 0.823
Item-Base (Peer) Raw Score Score
Seven 5.000 1.000
Donnie Brasco 4.707 0.941
Babe 4.688 0.938
Heat 4.688 0.938
To Kill a Mockingbird 4.686 0.937
Jaws 4.683 0.937
Monty Python, Holy Grail 4.670 0.934
Blade Runner 4.670 0.934
Get Shorty 4.655 0.931
Top 10 Recommendations
So if Johny is viewing “12 Monkeys” we query our recommendation store and present the results
Seven (Se7en) 1.823
Blade Runner 1.760
Fargo 1.000
Star Wars 0.946
Donnie Brasco 0.941
Babe 0.938
Heat 0.938
To Kill a Mockingbird 0.937
Jaws 0.937
Monty Python, Holy Grail 0.934
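The combined list appears consistent with simply adding the two normalized scores per title (for example, 0.823 + 1.000 = 1.823 for Seven). The sketch below reproduces the list under that assumption. The `blend` helper is hypothetical, and the titles "Seven" and "Seven (Se7en)" are treated as the same movie.

```python
# Normalized scores copied from the two tables above (user 572 viewing "12 Monkeys").
item_similarity = {
    "Fargo": 1.000, "Star Wars": 0.946, "Rock, The": 0.932, "Pulp Fiction": 0.909,
    "Return of the Jedi": 0.908, "Independence Day": 0.894, "Willy Wonka": 0.872,
    "Mission: Impossible": 0.864, "Silence of the Lambs, The": 0.863,
    "Star Trek: First Contact": 0.859, "Raiders of the Lost Ark": 0.845,
    "Terminator, The": 0.831, "Blade Runner": 0.826, "Usual Suspects, The": 0.823,
    "Seven (Se7en)": 0.823,
}
peer_based = {
    "Seven (Se7en)": 1.000, "Donnie Brasco": 0.941, "Babe": 0.938, "Heat": 0.938,
    "To Kill a Mockingbird": 0.937, "Jaws": 0.937, "Monty Python, Holy Grail": 0.934,
    "Blade Runner": 0.934, "Get Shorty": 0.931,
}

def blend(*score_tables):
    """Sum the scores per title across tables and rank from highest to lowest."""
    totals = {}
    for table in score_tables:
        for title, score in table.items():
            totals[title] = totals.get(title, 0.0) + score
    # sorted() is stable, so tied titles keep the order in which they were first seen.
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [(title, round(score, 3)) for title, score in ranked]

top_10 = blend(item_similarity, peer_based)[:10]
for title, score in top_10:
    print(f"{title:<28}{score:.3f}")
```

Titles recommended by both algorithms rise to the top because their scores add, which matches the later point that movies recommended by more than one algorithm rank highest.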
From Good to Great Recommendations
• Note that the first 5 recommendations look pretty good
…but the 6th result would have been “Babe” the children's movie
• Tuning the algorithms might help: parameter changes, similarity measures.
• How else can we make it better?
1. Delivery filters
2. Introduce additional algorithms such as K-Means
Additional Algorithm – K-Means
We would use the major attributes of the Movie to create coordinate points.
• Categories
• Actors
• Director
• Synopsis Text
“These movies are similar based on their attributes”
Delivery Scoring and Filters
• One or more categories must match
• Only children's movies will be recommended for children's movies.
Action Adventure Children's Comedy Crime Drama Film-Noir Horror Romance Sci-Fi Thriller
Twelve Monkeys 0 0 0 0 0 1 0 0 0 1 0
Babe 0 0 1 1 0 1 0 0 0 0 0
Seven (Se7en) 0 0 0 0 1 1 0 0 0 0 1
Star Wars 1 1 0 0 0 0 0 0 1 1 0
Blade Runner 0 0 0 0 0 0 1 0 0 1 0
Fargo 0 0 0 0 1 1 0 0 0 0 1
Willy Wonka 0 1 1 1 0 0 0 0 0 0 0
Monty Python 0 0 0 1 0 0 0 0 0 0 0
Jaws 1 0 0 0 0 0 0 1 0 0 0
Heat 1 0 0 0 1 0 0 0 0 0 1
Donnie Brasco 0 0 0 0 1 1 0 0 0 0 0
To Kill a Mockingbird 0 0 0 0 0 1 0 0 0 0 0
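A sketch of how the two delivery filters could be applied to the top-10 list for a user viewing "Twelve Monkeys", using the genre flags from the table above. Applying the children's rule in both directions (so a children's movie is also withheld from a non-children's title) is an assumption, as are the helper names.

```python
GENRES = ["Action", "Adventure", "Children's", "Comedy", "Crime", "Drama",
          "Film-Noir", "Horror", "Romance", "Sci-Fi", "Thriller"]

# Genre flags copied from the table above, one character per genre column.
FLAGS = {
    "Twelve Monkeys":        "00000100010",
    "Babe":                  "00110100000",
    "Seven (Se7en)":         "00001100001",
    "Star Wars":             "11000000110",
    "Blade Runner":          "00000010010",
    "Fargo":                 "00001100001",
    "Willy Wonka":           "01110000000",
    "Monty Python":          "00010000000",
    "Jaws":                  "10000001000",
    "Heat":                  "10001000001",
    "Donnie Brasco":         "00001100000",
    "To Kill a Mockingbird": "00000100000",
}

def genres_of(title):
    """Set of genre names flagged for a title."""
    return {genre for genre, flag in zip(GENRES, FLAGS[title]) if flag == "1"}

def passes_filters(viewing, candidate):
    viewing_genres, candidate_genres = genres_of(viewing), genres_of(candidate)
    # Filter 1: one or more categories must match.
    if not viewing_genres & candidate_genres:
        return False
    # Filter 2 (assumed symmetric): the children's flag must agree on both titles.
    return ("Children's" in viewing_genres) == ("Children's" in candidate_genres)

candidates = ["Seven (Se7en)", "Blade Runner", "Fargo", "Star Wars", "Donnie Brasco",
              "Babe", "Heat", "To Kill a Mockingbird", "Jaws", "Monty Python"]
filtered = [title for title in candidates if passes_filters("Twelve Monkeys", title)]
print(filtered)
```

Under these rules "Babe" drops out on the children's rule, while "Heat", "Jaws", and "Monty Python" drop out because they share no category with "Twelve Monkeys".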
Apply assumptions to control the results of collaborative filtering
Similar logic could be applied to promote more favorable options
• New Releases
• Retail Case: Items that are on-sale, overstock
Integrating K-Means into the process
Collaborative Filter K-Means:
Content Filter
Movies recommended by more than 1 algorithm are the most highly rated
Sophisticated Recommendation Model
What items are we
promoting at time
of sale?
What items are
being promoted by
the Store or
What are people
with similar
Peer Based
What items have
you bought in the
What did people
who ordered
these items also
The solution
allows balancing
of algorithms to
attain the most
• Hadoop and Spark can provide a relatively low cost and extremely scalable platform
for Data Science
• Hadoop offers great scalability and speed to value without the overhead of
structuring data
• Spark, with MLlib offers a great library of established Machine Learning algorithms,
reducing development efforts
• Python and SQL are the tools of choice for Data Science on Hadoop
• Go Agile and follow Best Practices (CRISP-DM)
• Employ Data Pyramid concepts to ensure data has just enough governance
Some Thoughts – Enable the Future
 Data Science requires the convergence of data
quality, advanced math, data engineering,
visualization, and business smarts
 Make sure your data can be trusted and people can
be held accountable for impact caused by low data
 Good data scientists are rare: It will take a village
to achieve all the tasks required for effective data
 Get good!
 Be great!
 Blaze new trails!
Data Science Training:
Thank You / Q&A
Joe Caserta
President, Caserta Concepts

More Related Content

What's hot

Benefits of the Azure Cloud
Benefits of the Azure CloudBenefits of the Azure Cloud
Benefits of the Azure Cloud
Balancing Data Governance and Innovation
Balancing Data Governance and InnovationBalancing Data Governance and Innovation
Balancing Data Governance and Innovation
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
Setting Up the Data Lake
Setting Up the Data LakeSetting Up the Data Lake
Setting Up the Data Lake
Mastering Customer Data on Apache Spark
Mastering Customer Data on Apache SparkMastering Customer Data on Apache Spark
Mastering Customer Data on Apache Spark
Big Data's Impact on the Enterprise
Big Data's Impact on the EnterpriseBig Data's Impact on the Enterprise
Big Data's Impact on the Enterprise
A modern, flexible approach to Hadoop implementation incorporating innovation...
A modern, flexible approach to Hadoop implementation incorporating innovation...A modern, flexible approach to Hadoop implementation incorporating innovation...
A modern, flexible approach to Hadoop implementation incorporating innovation...
DataWorks Summit
What Data Do You Have and Where is It?
What Data Do You Have and Where is It? What Data Do You Have and Where is It?
What Data Do You Have and Where is It?
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017
Defining and Applying Data Governance in Today’s Business Environment
Defining and Applying Data Governance in Today’s Business EnvironmentDefining and Applying Data Governance in Today’s Business Environment
Defining and Applying Data Governance in Today’s Business Environment
The Emerging Data Lake IT Strategy
The Emerging Data Lake IT StrategyThe Emerging Data Lake IT Strategy
The Emerging Data Lake IT Strategy
Thomas Kelly, PMP
Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics Building a New Platform for Customer Analytics
Building a New Platform for Customer Analytics
Using Machine Learning & Spark to Power Data-Driven Marketing
Using Machine Learning & Spark to Power Data-Driven MarketingUsing Machine Learning & Spark to Power Data-Driven Marketing
Using Machine Learning & Spark to Power Data-Driven Marketing
Balancing Data Governance and Innovation
Balancing Data Governance and InnovationBalancing Data Governance and Innovation
Balancing Data Governance and Innovation
General Data Protection Regulation - BDW Meetup, October 11th, 2017
General Data Protection Regulation - BDW Meetup, October 11th, 2017General Data Protection Regulation - BDW Meetup, October 11th, 2017
General Data Protection Regulation - BDW Meetup, October 11th, 2017
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...
Data Intelligence: How the Amalgamation of Data, Science, and Technology is C...
Journey to Cloud Analytics
Journey to Cloud Analytics Journey to Cloud Analytics
Journey to Cloud Analytics
The Data Lake - Balancing Data Governance and Innovation
The Data Lake - Balancing Data Governance and Innovation The Data Lake - Balancing Data Governance and Innovation
The Data Lake - Balancing Data Governance and Innovation
MapR Enterprise Data Hub Webinar w/ Mike Ferguson
MapR Enterprise Data Hub Webinar w/ Mike FergusonMapR Enterprise Data Hub Webinar w/ Mike Ferguson
MapR Enterprise Data Hub Webinar w/ Mike Ferguson
MapR Technologies
Big Data Boom
Big Data BoomBig Data Boom

What's hot (20)

Benefits of the Azure Cloud
Benefits of the Azure CloudBenefits of the Azure Cloud
Benefits of the Azure Cloud
Balancing Data Governance and Innovation
Balancing Data Governance and InnovationBalancing Data Governance and Innovation
Balancing Data Governance and Innovation
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
Integrating the CDO Role Into Your Organization; Managing the Disruption (MIT...
Setting Up the Data Lake
Setting Up the Data LakeSetting Up the Data Lake
Setting Up the Data Lake
Mastering Customer Data on Apache Spark
Mastering Customer Data on Apache SparkMastering Customer Data on Apache Spark
Mastering Customer Data on Apache Spark
Big Data's Impact on the Enterprise
Big Data's Impact on the EnterpriseBig Data's Impact on the Enterprise
Big Data's Impact on the Enterprise
A modern, flexible approach to Hadoop implementation incorporating innovation...
A modern, flexible approach to Hadoop implementation incorporating innovation...A modern, flexible approach to Hadoop implementation incorporating innovation...
A modern, flexible approach to Hadoop implementation incorporating innovation...
What Data Do You Have and Where is It?
What Data Do You Have and Where is It? What Data Do You Have and Where is It?
What Data Do You Have and Where is It?
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017
Creating a DevOps Practice for Analytics -- Strata Data, September 28, 2017
Defining and Applying Data Governance in Today’s Business Environment
Intro to Data Science on Hadoop

  • 1. @joe_Caserta #DataSummit Introduction to Data Science on Hadoop. Joe Caserta, President, Caserta Concepts
  • 2. Caserta Timeline (years on the timeline, as extracted: 1986, 2004, 1996, 2009, 2001, 2013, 2012, 2014)
      • Launched Big Data practice
      • Co-author, with Ralph Kimball, of The Data Warehouse ETL Toolkit (Wiley)
      • Data Analysis, Data Warehousing and Business Intelligence since 1996
      • Began consulting in database programming and data modeling
      • 25+ years of hands-on experience building database solutions
      • Founded Caserta Concepts in NYC
      • Web log analytics solution published in Intelligent Enterprise
      • Launched Data Science, Data Interaction and Cloud practices
      • Laser focus on extending Data Analytics with Big Data solutions
      • Dedicated to Data Governance Techniques on Big Data (Innovation)
      • Awarded Top 20 Big Data Companies 2016
      • Top 20 Most Powerful Big Data consulting firms
      • Launched Big Data Warehousing (BDW) Meetup NYC: 2,000+ members
      • 2016: Awarded Fastest Growing Big Data Companies 2016
      • Established best practices for big data ecosystem implementations
  • 3. About Caserta Concepts
      • Technology services company with expertise in data analysis:
          • Big Data Solutions
          • Data Warehousing
          • Business Intelligence
      • Core focus in the following industries:
          • eCommerce / Retail / Marketing
          • Financial Services / Insurance
          • Healthcare / Ad Tech / Higher Ed
      • Established in 2001:
          • Increased growth year-over-year
          • Industry recognized work force
      • Strategy and Implementation
      • Data Science & Analytics
      • Data on the Cloud
      • Data Interaction & Visualization
  • 4. Agenda
      • Why we care about Big Data
      • Challenges of working with Big Data
      • Governing Big Data for Data Science
      • Introducing the Data Pyramid
      • Why Data Science is Cool?
      • What does a Data Scientist do?
      • Standards for Data Science
          • Business Objective
          • Data Discovery
          • Preparation
          • Models
          • Evaluation
          • Deployment
      • Q & A
  • 5. Today’s business environment requires Big Data (architecture diagram; labels as extracted)
      • Sources: Enrollments, Claims, Finance, Others…
      • Traditional side: ETL, Traditional EDW, Traditional BI, Ad-Hoc Query, Canned Reporting
      • Big Data side: ETL, Big Data Lake, NoSQL Databases, Big Data Analytics, Data Science, Ad-Hoc/Canned Reporting
      • Horizontally Scalable Environment - Optimized for Analytics: Spark, MapReduce, Pig/Hive on the Hadoop Distributed File System (HDFS), nodes N1, N2, N3, N4, N5
  • 6. The Challenges Building a Data Lake (quadrant labels: Volume, Variety, Velocity, Veracity)
      • Data is coming in so fast, how do we monitor it?
      • Real real-time analytics
      • What does “complete” mean?
      • Dealing with sparse, incomplete, volatile, and highly manufactured data. How do you certify sentiment analysis?
      • Wider breadth of datasets and sources in scope requires larger data governance support
      • Data governance cannot start at the data warehouse
      • Data volume is higher so the process must be more reliant on programmatic administration
      • Less people/process dependence
  • 7. What’s Old is New Again
      • Before Data Warehousing Governance
          • Users trying to produce reports from raw source data
          • No Data Conformance
          • No Master Data Management
          • No Data Quality processes
          • No Trust: Two analysts were almost guaranteed to come up with two different sets of numbers!
      • Before Data Lake Governance
          • We can put “anything” in Hadoop
          • We can analyze anything
          • We’re scientists, we don’t need IT, we make the rules
      • Rule #1: Dumping data into Hadoop with no repeatable process, procedure, or data governance will create a mess
      • Rule #2: Information harvested from an ungoverned system will take us back to the old days: No Trust = Not Actionable
  • 8. Making it Right
      • The promise is an “agile” data culture where communities of users are encouraged to explore new datasets in new ways
          • New tools
          • External data
          • Data blending
          • Decentralization
      • With all the V’s, data scientists, new tools, new data we must rely LESS on HUMANS
          • We need more systemic administration
          • We need systems, tools to help with big data governance
          • This space is EXTREMELY immature!
      • Steps towards Data Governance for the Data Lake
          1. Establish difference between traditional data and big data governance
          2. Establish basic rules for where new data governance can be applied
          3. Establish processes for graduating the products of data science to governance
          4. Establish a set of tools to make governing Big Data feasible
  • 9. Data Governance for the Data Lake (framework diagram; labels as extracted)
      • BDG provides vision, oversight and accountability for leveraging corporate information assets to create competitive advantage, and accelerate the vision of integrated delivery.
      • Framework areas: Organization, Process, Architecture, Communication
      • Framework elements: IFP Governance, Administration, Compliance Reporting, Standards, Value Proposition, Risk/Reward, Information Accountabilities, Stewardship, Architecture, Enterprise Data Council, Data Integrity Metrics, Control Mechanisms, Principles and Standards, Information Usability, Communication, Definitions, Value Creation
      • Roles: Enterprise Data Council, Governance Committees, Data Stewards, Project Teams
      • Responsibilities shown against the roles: Executive Oversight; Prioritizes work; Drives change; Accountable for results; Acts on Requirements; Build Capabilities; Does the Work; Responsible for adherence
  • 10. Components of Data Governance
      • Organization: This is the ‘people’ part. Establishing Enterprise Data Council, Data Stewards, etc.
      • Metadata: Definitions, lineage (where does this data come from), business definitions, technical metadata
      • Privacy/Security: Identify and control sensitive data, regulatory compliance
      • Data Quality and Monitoring: Data must be complete and correct. Measure, improve, certify
      • Business Process Integration: Policies around data frequency, source availability, etc.
      • Master Data Management: Ensure consistent business critical data, e.g. Members, Providers, Agents, etc.
      • Information Lifecycle Management (ILM): Data retention, purge schedule, storage/archiving
  • 11. Components of Data Governance: For Big Data (the same seven components as the previous slide, with these additions)
      • Add Big Data to overall framework and assign responsibility
      • Add data scientists to the Stewardship program
      • Assign stewards to new data sets (Twitter, call center logs, etc.)
      • Graph databases are more flexible than relational
      • Lower latency service required
      • Distributed data quality and matching algorithms
      • Data Quality and Monitoring (probably home grown, Drools?)
      • Quality checks not only SQL: machine learning, Pig and MapReduce
      • Acting on large dataset quality checks may require distribution
      • Larger scale
      • New datatypes
      • Integrate with Hive Metastore, HCatalog, home grown tables
      • Secure and mask multiple data types (not just tabular)
      • Deletes are less common (unless there is a regulatory requirement)
      • Take advantage of compression and archiving (like AWS Glacier)
      • Data detection and masking on unstructured data upon ingest
      • Near-zero latency, DevOps, core component of business operations
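
The "data detection and masking on unstructured data upon ingest" item can be illustrated with a minimal Python sketch. The two patterns, the placeholder tokens and the function name `mask_on_ingest` are assumptions made for illustration and do not come from the slides; a real pipeline would need a much broader, validated rule set.

```python
import re

# Hypothetical detection patterns (assumptions for this sketch only).
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
US_SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def mask_on_ingest(text: str) -> tuple[str, dict[str, int]]:
    """Mask sensitive tokens in free text and count what was detected."""
    masked, n_email = EMAIL.subn("<EMAIL>", text)
    masked, n_ssn = US_SSN.subn("<SSN>", masked)
    # The counts can feed a monitoring table so stewards see what was found.
    return masked, {"email": n_email, "ssn": n_ssn}
```

Calling `mask_on_ingest` on each call center note before it lands in the lake would store only the masked text, while the counts give data stewards a simple signal about how much sensitive data each source carries.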
  • 12. Data Lake Governance Realities
      • Full data governance can only be applied to “Structured” data
          • The data must have a known and well documented schema
          • This can include materialized endpoints such as files or tables OR projections such as a Hive table
      • Governed structured data must have:
          • A known schema with Metadata
          • A known and certified lineage
          • A monitored, quality-tested, managed process for ingestion and transformation
          • A governed usage
      • Data isn’t just for enterprise BI tools anymore
      • We talk about unstructured data in Hadoop, but more often it is semi-structured or structured data with a definable schema.
      • Even in the case of unstructured data, structure must be extracted/applied in just about every case imaginable before analysis can be performed.
  • 13. The Data Scientists Can Help!
      • Data Science to Big Data Warehouse mapping
      • Full Data Governance Requirements
          • Provide full process lineage
          • Data certification process by data stewards and business owners
          • Ongoing Data Quality monitoring that includes Quality Checks
      • Provide requirements for Data Lake
          • Proper metadata established: Catalog, Data Definitions, Lineage, Quality monitoring
          • Know and validate data completeness
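
"Know and validate data completeness" can be made concrete with a small Python sketch. The field names, the 0.95 default threshold and both function names are hypothetical choices for illustration; the slides do not prescribe a particular check.

```python
from typing import Iterable, Mapping


def completeness(rows: Iterable[Mapping[str, object]], fields: list[str]) -> dict[str, float]:
    """Return, per field, the fraction of rows holding a non-empty value."""
    rows = list(rows)
    if not rows:
        return {f: 0.0 for f in fields}
    scores = {}
    for f in fields:
        # None and the empty string count as missing; 0 is a valid value.
        filled = sum(1 for r in rows if r.get(f) not in (None, ""))
        scores[f] = filled / len(rows)
    return scores


def failing_fields(scores: Mapping[str, float], threshold: float = 0.95) -> list[str]:
    """List the fields whose completeness falls below the (assumed) threshold."""
    return sorted(f for f, s in scores.items() if s < threshold)
```

A steward could run `completeness` on each ingested batch and refuse to certify the batch when `failing_fields` returns anything, which keeps the check programmatic rather than dependent on people.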
  • 14. The Big Data Analytics Pyramid (diagram; columns: Usage Pattern, Data Governance)
      • Big Data Warehouse: user community arbitrary queries and reporting. Fully Data Governed (trusted)
      • Data Science Workspace: agile business insight through data-munging, machine learning, blending with external data, development of to-be BDW facts
      • Data Lake: data is ready to be turned into information: organized, well defined, complete
      • Landing Area: raw machine data collection, collect everything
      • Governance listed against the lower tiers:
          • Metadata: Catalog
          • ILM: who has access, how long do we “manage it”
          • Data Quality and Monitoring: monitoring of completeness of data
      • Hadoop has different governance demands at each tier.
      • Only the top tier of the pyramid is fully governed.
      • We refer to this as the Trusted tier of the Big Data Warehouse.
  • 15. @joe_Caserta#DataSummit What does a Data Scientist Do, Anyway?  Searching for the data they need  Making sense of the data  Figuring out why the data looks the way it does and assessing its validity  Cleaning up all the garbage within the data so it represents true business  Combining events with Reference data to give it context  Correlating event data with other events  Finally, they write algorithms to perform mining, clustering and predictive analytics  Writes really cool and sophisticated algorithms that impact the way the business runs.  Much of the time of a Data Scientist is spent:  NOT
  • 16. @joe_Caserta#DataSummit Why Data Science? Descriptive Analytics Diagnostic Analytics Predictive Analytics Prescriptive Analytics What happened? Why did it happen? What will happen? How can we make It happen? Data Analytics Sophistication BusinessValue Source: Gartner
  • 17. @joe_Caserta#DataSummit The Data Scientist Winning Trifecta Modern Data Engineering/Data Preparation Domain Knowledge/Business Expertise Advanced Mathematics/Statistics
  • 18. @joe_Caserta#DataSummit Easier to Find Than an Awesome Data Scientist
  • 23. @joe_Caserta#DataSummit Are there Standards? CRISP-DM: Cross Industry Standard Process for Data Mining 1. Business Understanding • Solve a single business problem 2. Data Understanding • Discovery • Data Munging • Cleansing Requirements 3. Data Preparation • ETL 4. Modeling • Evaluate various models • Iterative experimentation 5. Evaluation • Does the model achieve business objectives? 6. Deployment • PMML; application integration; data platform; Excel
  • 24. @joe_Caserta#DataSummit 1. Business Understanding In this initial phase of the project we will need to speak to humans. • It would be premature to jump into the data, or begin selection of the appropriate model(s) or algorithm(s) • Understand the project objective • Review the business requirements • The output of this phase will be conversion of business requirements into a preliminary technical design (decision model) and plan. Since this is an iterative process, this phase will be revisited throughout the entire process.
  • 25. @joe_Caserta#DataSummit Data Scientist Business Analyst Business Stakeholders Business Stakeholders Business Stakeholders Interview notes Requirement Document Models / Insights Gathering Requirements
  • 26. @joe_Caserta#DataSummit Data Science Scrum Team Data Scientist Business Stakeholders Data Engineer Efficient Inclusive Effective Interactive Data Analyst
  • 27. @joe_Caserta#DataSummit 2. Data Understanding • Data Discovery  understand where the data you need comes from • Data Profiling  interrogate the data at the entity level, understand key entities and fields that are relevant to the analysis. • Cleansing Requirements  understand data quality, data density, skew, etc • Data Munging  collocate, blend and analyze data for early insights! Valuable information can be achieved from simple group-by, aggregate queries, and even more with SQL Jujitsu! Significant iteration between Business Understanding and Data Understanding phases. Sample Exploration tools for Hadoop: Trifacta, Paxata, Spark, Python, Pig, Hive, Waterline, Elasticsearch
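The "simple group-by, aggregate queries" mentioned on the Data Understanding slide can be sketched in plain Python. This is only an illustration: the event records, the field names and the `profile_by` helper are hypothetical, and on a real cluster the same profiling would more likely be written in Hive or Spark SQL.

```python
from collections import defaultdict

# Hypothetical web-log events; field names are illustrative only.
events = [
    {"user": "u1", "page": "home", "ms": 120},
    {"user": "u2", "page": "home", "ms": 80},
    {"user": "u1", "page": "cart", "ms": 300},
    {"user": "u3", "page": "home", "ms": None},  # missing measurement
]


def profile_by(rows, key, measure):
    """Group rows by `key`; report row count, null count and mean of `measure`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[measure])

    summary = {}
    for group, values in groups.items():
        present = [v for v in values if v is not None]
        summary[group] = {
            "rows": len(values),
            "nulls": len(values) - len(present),
            "mean": sum(present) / len(present) if present else None,
        }
    return summary


page_profile = profile_by(events, "page", "ms")
```

Even a profile this small surfaces density (how many nulls per group) and skew (how rows distribute across groups), which feed the cleansing requirements.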
  • 28. @joe_Caserta#DataSummit Data Exploration in Hadoop - Avoid low level coding Start by evaluating DSLs Structured/tabular Hive Pig Core or Extended Libraries Will a Custom UDF help? Use Streaming or Native MR Yes Yes No No Yes Practical to express in SQL Yes No No Spark
  • 29. @joe_Caserta#DataSummit Data Science Data Quality Priorities Be Corrective Be Fast Be Transparent Be Thorough
  • 30. @joe_Caserta#DataSummit Data Science Data Quality Priorities Data Quality Speed to Value Fast Slow Raw Refined Does Data munging in a data science lab need the same restrictive governance as enterprise reporting?
  • 31. @joe_Caserta#DataSummit 3. Data Preparation ETL (Extract Transform Load) 90+% of a Data Scientist's time goes into Data Preparation! • Select required entities/fields • Address Data Quality issues: missing or incomplete values, whitespace, bad data-points • Join/Enrich disparate datasets • Transform/Aggregate data for intended use: • Sample • Aggregate • Pivot
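The preparation steps listed above (address quality issues, join/enrich, aggregate) can be sketched as a small pipeline. This is a minimal plain-Python illustration rather than the deck's Spark implementation; the order rows, the customer reference table and all function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical raw order rows and a customer reference table.
raw_orders = [
    {"customer_id": " c1 ", "amount": "10.50"},
    {"customer_id": "c2", "amount": ""},       # missing value
    {"customer_id": "c1", "amount": " 4.50"},
    {"customer_id": "c3", "amount": "oops"},   # bad data point
]
customers = {"c1": "Retail", "c2": "Finance", "c3": "Retail"}


def clean(rows):
    """Trim whitespace and drop rows whose amount is missing or not numeric."""
    for row in rows:
        try:
            amount = float(row["amount"].strip())
        except ValueError:
            continue
        yield {"customer_id": row["customer_id"].strip(), "amount": amount}


def enrich(rows, reference):
    """Join each row to its customer segment from the reference data."""
    for row in rows:
        yield {**row, "segment": reference.get(row["customer_id"], "Unknown")}


def total_by_segment(rows):
    """Aggregate the cleaned, enriched rows for the intended use."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["segment"]] += row["amount"]
    return dict(totals)


segment_totals = total_by_segment(enrich(clean(raw_orders), customers))
```

Dropping bad rows silently is a simplification made here for brevity; a governed pipeline would record them as data quality events instead.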
  • 32. @joe_Caserta#DataSummit Data Preparation • We love Spark! • ETL can be done in Scala, Python or SQL • Cleansing, transformation, and standardization • Address Parsing: usaddress, postal-address, etc • Name Hashing: fuzzy, etc • Genderization: sexmachine, etc • And all the goodies of the standard Python library! • Parallelize workload against a large number of machines in Hadoop cluster
  • 33. @joe_Caserta#DataSummit Data Quality and Monitoring • BUILD a robust data quality subsystem: • Metadata and error event facts • Orchestration • Based on Data Warehouse ETL Toolkit • Each error instance of each data quality check is captured • Implemented as sub-system after ingestion • Each fact stores unique identifier of the defective source row HAMBot: ‘open source’ project created in Caserta Innovation Lab (CIL)
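A rough sketch of the error event idea on this slide: every failed data quality check on every row produces one fact that stores the identifier of the defective source row. The check names, the row layout and the `run_checks` helper are assumptions made for illustration and are not taken from HAMBot.

```python
from datetime import datetime, timezone

# Hypothetical checks: name -> predicate that returns True when the row passes.
CHECKS = {
    "amount_not_null": lambda row: row.get("amount") is not None,
    "amount_non_negative": lambda row: row.get("amount") is None or row["amount"] >= 0,
}


def run_checks(rows, checks, batch_id):
    """Return one error event fact per failed check per row."""
    error_events = []
    for row in rows:
        for check_name, passes in checks.items():
            if not passes(row):
                error_events.append({
                    "batch_id": batch_id,
                    "check": check_name,
                    "source_row_id": row["row_id"],  # identifies the defective source row
                    "event_time": datetime.now(timezone.utc).isoformat(),
                })
    return error_events


rows = [
    {"row_id": 1, "amount": 10.0},
    {"row_id": 2, "amount": None},
    {"row_id": 3, "amount": -5.0},
]
error_events = run_checks(rows, CHECKS, batch_id="batch-001")
```

Because each event carries the check name and the source row identifier, quality can be monitored over time by aggregating the facts per check and per batch.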
  • 34. @joe_Caserta#DataSummit 4. Modeling Do you love algebra & stats? • Evaluate various models/algorithms • Classification • Clustering • Regression • Many others….. • Tune parameters • Iterative experimentation • Different models may require different data preparation techniques (ie. Sparse Vector Format) • Additionally we may discover the need for additional data points, or uncover additional data quality issues!
  • 35. @joe_Caserta#DataSummit Modeling in Hadoop • Spark works well • SAS, SPSS, Etc. not native on Hadoop • R and Python becoming new standard • PMML can be used, but approach with caution
  • 36. @joe_Caserta#DataSummit Machine Learning The goal of machine learning is to get software to make decisions and learn from data without being programmed explicitly to do so Machine Learning algorithms are broadly broken out into two groups: • Supervised learning  inferring functions based on labeled training data • Unsupervised learning  finding hidden structure/patterns within data, no training data is supplied We will review some popular, easy to understand machine learning algorithms
  • 38. @joe_Caserta#DataSummit Supervised Learning
      Training set:
      Name      Weight  Color   Cat_or_Dog
      Susie     9lbs    Orange  Cat
      Fido      25lbs   Brown   Dog
      Sparkles  6lbs    Black   Cat
      Fido      9lbs    Black   Dog
      To predict:
      Name      Weight  Color   Cat_or_Dog
      Misty     5lbs    Orange  ?
      The training set is used to generate a function so we can predict whether we have a cat or a dog!
  • 39. @joe_Caserta#DataSummit Category or Values? There are several classes of algorithms depending on whether the prediction is a category (like cat or dog) or a value, like the value of a home. Classification algorithms are generally well fit for categorization, while algorithms like Regression and Decision Trees are well suited for predicting values.
  • 40. @joe_Caserta#DataSummit Regression • Understanding the relationship between a given set of dependent variables and independent variables • Typically regression is used to predict the output of a dependent variable based on variations in independent variables • Very popular for prediction and forecasting Linear Regression
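As a small worked example of linear regression, the sketch below fits a straight line by ordinary least squares and uses it to predict a value. The home sizes and prices are made-up numbers chosen to be exactly linear, and `fit_line` is a hypothetical helper; in practice a library such as Spark MLlib would do the fitting.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one independent variable: y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: home size in square feet vs. price in thousands of dollars.
sizes = [1000, 1500, 2000, 2500]
prices = [250, 350, 450, 550]

slope, intercept = fit_line(sizes, prices)
predicted_price = slope * 3000 + intercept  # forecast for an unseen 3000 sq ft home
```

Here price is the dependent variable and size the independent variable; the fitted slope says how much the predicted price changes per additional square foot.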
  • 41. @joe_Caserta#DataSummit Decision Trees • A method for predicting outcomes based on the features of data • Model is represented as an easy-to-understand tree structure of if-else statements Weight > 10lbs? yes: dog; no: color = orange? yes: cat; no: name = fido? yes: dog; no: cat
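The tree on this slide can be written directly as if-else statements. The branch order below is one reading of the flattened diagram (weight first, then color, then name), so treat it as an assumption; it does reproduce all four labels in the training set from the Supervised Learning slide.

```python
def cat_or_dog(name, weight_lbs, color):
    """Hand-written decision tree; branch order is an assumed reading of the slide."""
    if weight_lbs > 10:
        return "Dog"
    if color.lower() == "orange":
        return "Cat"
    if name.lower() == "fido":
        return "Dog"
    return "Cat"


# Training rows from the Supervised Learning slide: (name, weight, color, label).
training_set = [
    ("Susie", 9, "Orange", "Cat"),
    ("Fido", 25, "Brown", "Dog"),
    ("Sparkles", 6, "Black", "Cat"),
    ("Fido", 9, "Black", "Dog"),
]

misty = cat_or_dog("Misty", 5, "Orange")
```

A tree learned by an algorithm would be derived from the data rather than typed in by hand, but the resulting model has the same shape.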
  • 42. @joe_Caserta#DataSummit Unsupervised K-Means • Treats items as coordinates • Places a number of random “centroids” and assigns the nearest items • Moves the centroids around based on average location • Process repeats until the assignments stop changing Clustering of items into logical groups based on natural patterns in data Uses: • Cluster Analysis • Classification • Content Filtering
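The four K-Means steps above can be sketched in a few lines of plain Python. One deliberate difference from the slide: the initial centroids here are the first k points rather than random ones, purely so the example is repeatable. The sample points are hypothetical.

```python
import math


def kmeans(points, k, max_iter=100):
    """Basic K-Means: assign to nearest centroid, move centroids, repeat until stable."""
    # Assumption: deterministic start (first k points) instead of random centroids.
    centroids = [points[i] for i in range(k)]
    assignments = None
    for _ in range(max_iter):
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # assignments stopped changing
        assignments = new_assignments
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                # Move the centroid to the average location of its members.
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments


# Hypothetical items already expressed as 2-D coordinates.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.0, 0.5), (8.5, 9.0)]
centroids, assignments = kmeans(points, k=2)
```

With these points the loop settles after a few passes, grouping the three items near the origin together and the three items near (8.5, 9) together.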
  • 43. @joe_Caserta#DataSummit Collaborative Filtering • A hybrid of Supervised and Unsupervised Learning (Model Based vs. Memory Based) • Leveraging collaboration between multiple agents to filter, project, or detect patterns • Popular in recommender systems for projecting the “taste” of specific individuals for items on which they have not yet expressed a preference.
  • 44. @joe_Caserta#DataSummit Item-based • A popular and simple memory-based collaborative filtering algorithm • Projects preference based on item similarity (based on ratings):
      for every item i that u has no preference for yet
        for every item j that u has a preference for
          compute a similarity s between i and j
          add u's preference for j, weighted by s, to a running average
      return the top items, ranked by weighted average
    • First a matrix of Item to Item similarity is calculated based on user rating • Then recommendations are created by producing a weighted sum of top items, based on the user's previously rated items
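The item-based pseudocode above translates almost line for line into Python. The ratings are hypothetical, and cosine similarity over users who rated both items is an assumed choice of similarity measure; the slide does not say which measure was used.

```python
import math

# Hypothetical ratings: user -> {item: rating}.
ratings = {
    "ann": {"A": 5, "B": 4, "C": 1},
    "bob": {"A": 4, "B": 5, "C": 2},
    "cat": {"A": 1, "B": 2, "C": 5},
    "dan": {"A": 5, "C": 1},  # no preference for B yet
}


def item_vector(item):
    """Ratings given to one item, keyed by user."""
    return {user: prefs[item] for user, prefs in ratings.items() if item in prefs}


def similarity(i, j):
    """Cosine similarity over users who rated both items (an assumed measure)."""
    vi, vj = item_vector(i), item_vector(j)
    common = set(vi) & set(vj)
    if not common:
        return 0.0
    dot = sum(vi[u] * vj[u] for u in common)
    norm_i = math.sqrt(sum(vi[u] ** 2 for u in common))
    norm_j = math.sqrt(sum(vj[u] ** 2 for u in common))
    return dot / (norm_i * norm_j)


def recommend(user, top_n=3):
    """Rank unrated items by the similarity-weighted average of the user's ratings."""
    prefs = ratings[user]
    all_items = {item for user_prefs in ratings.values() for item in user_prefs}
    scores = {}
    for i in all_items - set(prefs):          # items u has no preference for yet
        weighted = total = 0.0
        for j, pref in prefs.items():         # items u has a preference for
            s = similarity(i, j)
            weighted += s * pref
            total += s
        if total > 0:
            scores[i] = weighted / total
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


dan_recs = recommend("dan")
```

For "dan", item B is scored from his ratings of A and C; because B is rated much like A by the other users, his high rating of A dominates the weighted average.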
  • 45. @joe_Caserta#DataSummit 5. Evaluation What problem are we trying to solve again? • Our final solution needs to be evaluated against original Business Understanding • Did we meet our objectives? • Did we address all issues?
  • 46. @joe_Caserta#DataSummit 6. Deployment Engineering Time! • It’s time for the work products of data science to “graduate” from “new insights” to real applications. • Processes must be hardened, repeatable, and generally perform well too! • Data Governance applied • PMML (Predictive Model Markup Language): XML based interchange format Big Data Warehouse Data Science Workspace Data Lake – Integrated Sandbox Landing Area – Source Data in “Full Fidelity” New Data New Insights Governance Refinery
  • 47. @joe_Caserta#DataSummit My Favorite Data Science Project • Recommendation Engines
  • 48. @joe_Caserta#DataSummit Project Objective • Create a functional recommendation engine to surface relevant product recommendations to customers. • Improve Customer Experience • Increase Customer Retention • Increase Customer Purchase Activity • Accurately suggest relevant products to customers based on their peer behavior.
  • 49. @joe_Caserta#DataSummit Recommendations • Your customers expect them • Good recommendations make life easier • Help them find information, products, and services they might not have thought of • What makes a good recommendation? • Relevant but not obvious • Sense of “surprise” 23” LED TV 24” LED TV 25” LED TV 23” LED TV SOLD!! Blu-Ray Home Theater HDMI Cables
  • 50. @joe_Caserta#DataSummit Where do we use recommendations? • Applications can be found in a wide variety of industries and applications: • Travel • Financial Service • Music/Online radio • TV and Video • Online Publications • Retail ..and countless others Our Example: Movies
  • 51. @joe_Caserta#DataSummit The Goal of the Recommender • Create a powerful, scalable recommendation engine with minimal development • Make recommendations to users as they are browsing movie titles - instantaneously • Recommendation must have context to the movie they are currently viewing. OOPS! – too much surprise!
  • 52. @joe_Caserta#DataSummit Recommender Tools & Techniques Hadoop – distributed file system and processing platform Spark – low-latency computing MLlib – Library of Machine Learning Algorithms We leverage two algorithms: • Content-Based Filtering – how similar is this particular movie to other movies based on usage. • Collaborative Filtering – predict an individual's preference based on their peers' ratings. Spark MLlib implements a collaborative filtering algorithm called Alternating Least Squares (ALS) • Both algorithms only require a simple dataset of 3 fields: “User ID” , “Item ID”, “Rating”
  • 53. @joe_Caserta#DataSummit Content-Based Filtering “People who liked this movie liked these as well” • Content Based Filter builds a matrix of items to other items and calculates similarity (based on user rating) • The most similar items are then output as a list: • Item ID, Similar Item ID, Similarity Score • Items with the highest score are most similar • In this example users who liked “Twelve Monkeys” (7) also like “Fargo” (100) 7 100 0.690951001800917 7 50 0.653299445638532 7 117 0.643701303640083
  • 54. @joe_Caserta#DataSummit Collaborative Filtering “People with similar taste to you liked these movies” • Collaborative filtering applies weights based on “peer” user preference. • Essentially it determines the best movie critics for you to follow • The items with the highest recommendation score are then output as tuples • User ID [Item ID1:Score,…., Item IDn:Score] • Items with the highest recommendation score are the most relevant to this user • For user “Johny Sisklebert” (572), the two most highly recommended movies are “Seven” and “Donnie Brasco” 572 [11:5.0,293:4.70718,8:4.688335,273:4.687676,427:4.685926,234:4.683155,168:4.669672,89:4.66959,4:4.65515] 573 [487:4.54397,1203:4.5291,616:4.51644,605:4.49344,709:4.3406,502:4.33706,152:4.32263,503:4.20515,432:4.26455,611:4.22019] 574 [1:5.0,902:5.0,546:5.0,13:5.0,534:5.0,533:5.0,531:5.0,1082:5.0,1631:5.0,515:5.0]
  • 55. @joe_Caserta#DataSummit Recommendation Store • Serving recommendations needs to be instantaneous • The core to this solution is two reference tables: • When called to make recommendations we query our store • Rec_Item_Similarity based on the Item_ID they are viewing • Rec_User_Item_Base based on their User_ID Rec_Item_Similarity Item_ID Similar_Item Similarity_Score Rec_User_Item_Base User_ID Item_ID Recommendation_Score
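A minimal sketch of the two-table recommendation store, using an in-memory SQLite database as a stand-in for whatever low-latency store is actually deployed (the slide does not name one). Table and column names follow the slide; the rows are taken, rounded, from the sample output on the two previous slides, and the helper functions are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real serving store
conn.executescript("""
CREATE TABLE Rec_Item_Similarity (
    Item_ID INTEGER, Similar_Item INTEGER, Similarity_Score REAL);
CREATE TABLE Rec_User_Item_Base (
    User_ID INTEGER, Item_ID INTEGER, Recommendation_Score REAL);
""")
conn.executemany(
    "INSERT INTO Rec_Item_Similarity VALUES (?, ?, ?)",
    [(7, 100, 0.690951), (7, 50, 0.653299), (7, 117, 0.643701)],
)
conn.executemany(
    "INSERT INTO Rec_User_Item_Base VALUES (?, ?, ?)",
    [(572, 11, 5.0), (572, 293, 4.70718), (572, 8, 4.688335)],
)


def similar_items(item_id, limit=10):
    """Items most similar to the one being viewed."""
    return conn.execute(
        "SELECT Similar_Item, Similarity_Score FROM Rec_Item_Similarity "
        "WHERE Item_ID = ? ORDER BY Similarity_Score DESC LIMIT ?",
        (item_id, limit),
    ).fetchall()


def user_items(user_id, limit=10):
    """Items with the highest peer-based score for this user."""
    return conn.execute(
        "SELECT Item_ID, Recommendation_Score FROM Rec_User_Item_Base "
        "WHERE User_ID = ? ORDER BY Recommendation_Score DESC LIMIT ?",
        (user_id, limit),
    ).fetchall()


viewing_recs = similar_items(7)   # user is viewing item 7
personal_recs = user_items(572)   # user 572's precomputed recommendations
```

Because both tables are precomputed in batch, serving a recommendation is just two keyed lookups, which is what makes the response effectively instantaneous.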
  • 56. @joe_Caserta#DataSummit Delivering Recommendations
      Item-Based: Peers like these Movies
      Item Similarity               Raw Score  Score
      Fargo                         0.691      1.000
      Star Wars                     0.653      0.946
      Rock, The                     0.644      0.932
      Pulp Fiction                  0.628      0.909
      Return of the Jedi            0.627      0.908
      Independence Day              0.618      0.894
      Willy Wonka                   0.603      0.872
      Mission: Impossible           0.597      0.864
      Silence of the Lambs, The     0.596      0.863
      Star Trek: First Contact      0.594      0.859
      Raiders of the Lost Ark       0.584      0.845
      Terminator, The               0.574      0.831
      Blade Runner                  0.571      0.826
      Usual Suspects, The           0.569      0.823
      Seven (Se7en)                 0.569      0.823

      Item-Base (Peer)              Raw Score  Score
      Seven                         5.000      1.000
      Donnie Brasco                 4.707      0.941
      Babe                          4.688      0.938
      Heat                          4.688      0.938
      To Kill a Mockingbird         4.686      0.937
      Jaws                          4.683      0.937
      Monty Python, Holy Grail      4.670      0.934
      Blade Runner                  4.670      0.934
      Get Shorty                    4.655      0.931

      Best Recommendations: Top 10 Recommendations
      Seven (Se7en)                 1.823
      Blade Runner                  1.760
      Fargo                         1.000
      Star Wars                     0.946
      Donnie Brasco                 0.941
      Babe                          0.938
      Heat                          0.938
      To Kill a Mockingbird         0.937
      Jaws                          0.937
      Monty Python, Holy Grail      0.934
      So if Johny is viewing “12 Monkeys” we query our recommendation store and present the results
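The "Score" columns on this slide appear to be each raw score divided by the highest raw score in its list, and the final ranking appears to be the sum of those normalized scores per movie. The sketch below reproduces that arithmetic for a handful of the movies; the inference about the formula, the function names, and the use of a single title spelling for "Seven" across both lists are assumptions.

```python
# Raw scores copied from the slide (subset of rows).
item_similarity = {
    "Fargo": 0.691,
    "Star Wars": 0.653,
    "Rock, The": 0.644,
    "Blade Runner": 0.571,
    "Seven (Se7en)": 0.569,
}
peer_scores = {
    "Seven (Se7en)": 5.000,
    "Donnie Brasco": 4.707,
    "Babe": 4.688,
    "Blade Runner": 4.670,
}


def normalize(scores):
    """Scale scores so the best item in the list gets 1.0."""
    top = max(scores.values())
    return {item: raw / top for item, raw in scores.items()}


def blend(*score_sets):
    """Sum normalized scores so items found by several algorithms rise to the top."""
    combined = {}
    for scores in score_sets:
        for item, score in normalize(scores).items():
            combined[item] = combined.get(item, 0.0) + score
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)


best = blend(item_similarity, peer_scores)
```

Movies that appear in both lists, such as "Seven" and "Blade Runner", end up with blended scores above 1.0, which is why they lead the Top 10 list on the slide.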
  • 57. @joe_Caserta#DataSummit From Good to Great Recommendations • Note that the first 5 recommendations look pretty good …but the 6th result would have been “Babe” the children's movie • Tuning the algorithms might help: parameter changes, similarity measures. • How else can we make it better? 1. Delivery filters 2. Introduce additional algorithms such as K-Means OOPS!
  • 58. @joe_Caserta#DataSummit Additional Algorithm – K-Means We would use the major attributes of the Movie to create coordinate points. • Categories • Actors • Director • Synopsis Text “These movies are similar based on their attributes”
  • 59. @joe_Caserta#DataSummit Delivery Scoring and Filters • One or more categories must match • Only children's movies will be recommended for children's movies.
      Movie                  Action  Adventure  Children's  Comedy  Crime  Drama  Film-Noir  Horror  Romance  Sci-Fi  Thriller
      Twelve Monkeys         0       0          0           0       0      1      0          0       0        1       0
      Babe                   0       0          1           1       0      1      0          0       0        0       0
      Seven (Se7en)          0       0          0           0       1      1      0          0       0        0       1
      Star Wars              1       1          0           0       0      0      0          0       1        1       0
      Blade Runner           0       0          0           0       0      0      1          0       0        1       0
      Fargo                  0       0          0           0       1      1      0          0       0        0       1
      Willy Wonka            0       1          1           1       0      0      0          0       0        0       0
      Monty Python           0       0          0           1       0      0      0          0       0        0       0
      Jaws                   1       0          0           0       0      0      0          1       0        0       0
      Heat                   1       0          0           0       1      0      0          0       0        0       1
      Donnie Brasco          0       0          0           0       1      1      0          0       0        0       0
      To Kill a Mockingbird  0       0          0           0       0      1      0          0       0        0       0
      Apply assumptions to control the results of collaborative filtering. Similar logic could be applied to promote more favorable options • New Releases • Retail Case: Items that are on-sale, overstock
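The two delivery rules can be sketched as a filter over the genre flags from this slide. The children's rule is interpreted here as "the Children's flag must agree between the movie being viewed and the candidate", which is an assumption about the slide's intent; the function names are hypothetical.

```python
# Genre sets derived from the 0/1 flags on the slide.
GENRES = {
    "Twelve Monkeys": {"Drama", "Sci-Fi"},
    "Babe": {"Children's", "Comedy", "Drama"},
    "Seven (Se7en)": {"Crime", "Drama", "Thriller"},
    "Star Wars": {"Action", "Adventure", "Romance", "Sci-Fi"},
    "Blade Runner": {"Film-Noir", "Sci-Fi"},
    "Fargo": {"Crime", "Drama", "Thriller"},
    "Willy Wonka": {"Adventure", "Children's", "Comedy"},
    "Monty Python": {"Comedy"},
    "Jaws": {"Action", "Horror"},
    "Heat": {"Action", "Crime", "Thriller"},
    "Donnie Brasco": {"Crime", "Drama"},
    "To Kill a Mockingbird": {"Drama"},
}


def passes_delivery_filter(viewing, candidate):
    """Apply both delivery rules to one candidate recommendation."""
    viewed, cand = GENRES[viewing], GENRES[candidate]
    if not viewed & cand:
        return False  # rule 1: one or more categories must match
    if ("Children's" in viewed) != ("Children's" in cand):
        return False  # rule 2 (assumed reading): children's flag must agree
    return True


def filter_recommendations(viewing, candidates):
    """Keep the ranked order, dropping candidates that fail the rules."""
    return [c for c in candidates if passes_delivery_filter(viewing, c)]


ranked = ["Seven (Se7en)", "Blade Runner", "Fargo", "Star Wars", "Donnie Brasco",
          "Babe", "Heat", "To Kill a Mockingbird", "Jaws", "Monty Python"]
filtered = filter_recommendations("Twelve Monkeys", ranked)
```

With these flags, "Babe" is dropped for a "Twelve Monkeys" viewer even though both are tagged Drama, because only the children's rule separates them.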
  • 60. @joe_Caserta#DataSummit Integrating K-Means into the process Collaborative Filter K-Means: Similar Content Filter Best Recommendations Movies recommended by more than 1 algorithm are the most highly rated
  • 61. @joe_Caserta#DataSummit Sophisticated Recommendation Model What items are we promoting at time of sale? What items are being promoted by the Store or Market? What are people with similar characteristics buying? Peer Based Item Clustering Corporate Deals/ Offers Customer Behavior Market/ Store Recommendation What items have you bought in the past? What did people who ordered these items also order? The solution allows balancing of algorithms to attain the most effective recommendation
  • 62. @joe_Caserta#DataSummit Summary • Hadoop and Spark can provide a relatively low cost and extremely scalable platform for Data Science • Hadoop offers great scalability and speed to value without the overhead of structuring data • Spark, with MLlib, offers a great library of established Machine Learning algorithms, reducing development efforts • Python and SQL are the tools of choice for Data Science on Hadoop • Go Agile and follow Best Practices (CRISP-DM) • Employ Data Pyramid concepts to ensure data has just enough governance
  • 63. @joe_Caserta#DataSummit Some Thoughts – Enable the Future  Data Science requires the convergence of data quality, advanced math, data engineering and visualization and business smarts  Make sure your data can be trusted and people can be held accountable for impact caused by low data quality.  Good data scientists are rare: It will take a village to achieve all the tasks required for effective data science  Get good!  Be great!  Blaze new trails! Data Science Training:
  • 64. @joe_Caserta#DataSummit Thank You / Q&A Joe Caserta President, Caserta Concepts @joe_Caserta