Introduction to Data Science
Joe Caserta Bill Walrond
@joe_Caserta @bill_walrond
• Why we care about Big Data
• Challenges of working with Big Data
• Governing Big Data for Data Science
• Introducing the Data Pyramid
• Why Data Science is Cool?
• What does a Data Scientist do?
• Standards for Data Science
• Business Objective
• Data Discovery
• Preparation
• Models
• Evaluation
• Deployment
• Q & A
Big Data Analysis: Timeline of Society Media
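1500s Printing Press
1840s Penny Post
1850s Telegraph
1850s Rural Free Post
1890s Telephone
1900s Radio
1950s TV
1970s PCs
1980s Internet
1990s Web
2000s Social Media, Mobile, Big Data, Cloud
Every 60 Seconds:
• 98,000+ Tweets
• 695,000 Status Updates
• 11 million instant messages
• 698,445 Google Searches
• 168 million+ emails sent
• 1,829 TB of data created
• 217 new mobile web users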
Data is your Differentiator
63% of organizations realize a positive return on
analytic investments within a year
69% of speed-driven analytics organizations
created a positive impact on business outcomes
74% of respondents anticipate that the speed at which
executives expect new data-driven insights will
continue to accelerate
Understanding the Customer Journey
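[Figure: customer journey stages (Awareness, Consideration, Purchase, Service, Loyalty, Expansion) mapped to physical and digital touchpoints, including PR, radio, TV, print, outdoor, word of mouth, direct mail, customer service, call center, agents, partners, search, paid content, email, website/landing pages, social media, community, chat, offers, mailings, surveys, loyalty programs, ads, mobile, 3rd party sites, and web self-service]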
Understanding Touchpoint Methods
Single Touch
• Assign the credit to the first or last exposure
• Example: Ad-Click 100%
- Last touch only
- Ignores bulk of customer journey
- Undervalues other interactions and influencers
Rules-Based
• Assign the credit to each interaction based on business rules
• Example: Mailing 33%, E-mail 33%, Ad-Click 33%
- Subjective
- Assigns arbitrary values to each interaction
- Lacks analytics rigor to determine weights
Statistically Driven
• Assign the credit to interactions based on a data-driven model
• Example: unequal shares (27%, 49%, 24%) across Mailing, E-mail, and Ad-Click touches
✓ Looks at full behavior patterns
✓ Considers all touch points
✓ Can apply different models for best results
✓ Uses data to find correlations between touch points (winning combinations)
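The first two attribution methods can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `single_touch` and `rules_based` and the three-touch `journey` are hypothetical and do not come from the slides. The statistically driven method needs a fitted model and is left out.

```python
from collections import defaultdict

def single_touch(path, position="last"):
    """Give 100% of the credit to the first or last touch in the path."""
    touch = path[-1] if position == "last" else path[0]
    return {touch: 1.0}

def rules_based(path, weights=None):
    """Split credit using business-rule weights; equal shares when none are given."""
    if weights is None:
        weights = [1.0] * len(path)
    total = sum(weights)
    credit = defaultdict(float)
    for touch, weight in zip(path, weights):
        credit[touch] += weight / total
    return dict(credit)

# Hypothetical customer journey, in the order the touches occurred.
journey = ["Mailing", "E-mail", "Ad-Click"]
last_touch_credit = single_touch(journey)   # all credit to the final Ad-Click
equal_credit = rules_based(journey)         # one third of the credit per touch
```

Passing explicit `weights` to `rules_based` stands in for whatever business rules an organization chooses, which is exactly where the subjectivity noted above enters.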
What is Data Science?
Big Data Analysis: The Ecosystem of the Future
[Figure: cloud-based data lake with stages Ingest, Persist, Analyze, and Deploy. Labels: Data Integration, Identity Resolution, Data Quality; Structured Data, Unstructured Data; SQL, NoSQL, Object Store; Discovery, Exploration; Machine Learning, Models Development; Reports / Dashboards, Applications, APIs; Find, Share, Collaborate. Users: Data Engineer, Data Scientist, Business Analyst, App Developer]
Business Value
Provides innovative and industry-leading technologies that can be
rapidly applied to the business without having to manage
compatibility and data complexity.
Technical Value
Provides an open framework to reduce the number of integration
points and testing environments needed to deliver business solutions.
Progression of Business Analytics to Data Science
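• Descriptive Analytics: What happened?
• Diagnostic Analytics: Why did it happen?
• Predictive Analytics: What will happen?
• Prescriptive Analytics: How can we make it happen?
• Cognitive Analytics: Influence what happens
[Figure: business value rises with data analytics sophistication, from hindsight to insight to foresight: Reports → Correlations → Predictions → Recommendations]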
Progression of Data Science Maturity
• Timeline
• Tools
• Available libraries
• Best practices
What are the Realities of the Data Scientist?
— Writes really cool and sophisticated algorithms that impact the way the business runs? NOT!
— Much of the time of a Data Scientist is spent:
— Searching for the data they need
— Making sense of the data
— Figuring out why the data looks the way it does and assessing its validity
— Cleaning up all the garbage within the data so it represents true business
— Combining events with reference data to give it context
— Correlating event data with other events
— Finally, they implement algorithms to perform mining, clustering and predictive analytics
Why Data Science now?
• Costs of compute and storage dramatically lower than just a few years ago
• Data generated by all aspects of society has dramatically increased
• Need to efficiently learn what there is to learn from our data
The Data Scientist Winning Trifecta
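• Modern Data Engineering / Data Preparation: computer science, programming/storage, data quality, visualization
• Advanced Mathematics / Statistics: algorithms, A/B testing
• Domain Knowledge / Business Expertise: data and outcome sensibility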
Modern Data Engineering
Which Visualization, When?
Advanced Mathematics / Statistics
Domain and Outcome Sensibility
Is Data Trying To Trick You?
Correlation: 99.26%
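A near-perfect correlation is easy to produce from two series that merely trend in the same direction. The sketch below computes a Pearson correlation by hand; both series are made-up numbers chosen for illustration (they are not the data behind the slide's 99.26% figure), and `pearson` is a hypothetical helper name.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Two hypothetical, unrelated measurements that both happen to grow year over year.
series_a = [1.0, 2.0, 3.0, 4.0, 5.0]
series_b = [2.1, 3.9, 6.2, 8.1, 9.8]
r = pearson(series_a, series_b)
```

A high `r` here says only that both series rise together; it says nothing about whether one drives the other.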
Are we Considering the Right Factors?
Are there Standards?
CRISP-DM: Cross Industry Standard Process for Data Mining
1. Business Understanding
• Solve a single business problem
2. Data Understanding
• Discovery
• Data Munging
• Cleansing Requirements
3. Data Preparation
• ETL
4. Modeling
• Evaluate various models
• Iterative experimentation
5. Evaluation
• Does the model achieve business objectives?
6. Deployment
• PMML; application integration; data platform; Excel
1. Business Understanding
In this initial phase of the project we will need to speak to humans.
• It would be premature to jump into the data or to begin selecting
the appropriate model(s) or algorithm(s)
• Understand the project objective
• Review the business requirements
• The output of this phase will be conversion of business requirements
into a preliminary technical design (decision model) and plan.
Since this is an iterative process, this phase will be revisited throughout
the entire process.
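Gathering Requirements
[Figure: a Business Analyst and a Data Scientist work with Business Stakeholders; labels: Interview notes, Requirement Document, Models / Insights]
Data Science Scrum Team
[Figure: Data Scientist, Data Engineer, Data Analyst, and Business Stakeholders; labels: Efficient, Inclusive, Effective, Interactive]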
2. Data Understanding
• Data Discovery: understand where the data you need comes from
• Data Profiling: interrogate the data at the entity level,
understand key entities and fields that are relevant to the analysis
• Cleansing Requirements: understand data quality, data
density, skew, etc.
• Data Munging: collocate, blend and analyze data for early
insights! Valuable information can be obtained from simple
group-by and aggregate queries, and even more with SQL Jujitsu!
Significant iteration between Business Understanding and Data
Understanding phases.
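A simple group-by query of the kind mentioned above can be tried with Python's built-in `sqlite3` module. This is a minimal sketch: the `incidents` table, its columns, and its four rows are hypothetical stand-ins for whatever source is being profiled.

```python
import sqlite3

# In-memory database with a tiny, made-up incident table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE incidents (category TEXT, district TEXT)")
conn.executemany(
    "INSERT INTO incidents VALUES (?, ?)",
    [
        ("THEFT", "MISSION"),
        ("THEFT", "CENTRAL"),
        ("ASSAULT", "MISSION"),
        ("THEFT", "MISSION"),
    ],
)

# Group-by aggregate: how many incidents fall into each category?
rows = conn.execute(
    "SELECT category, COUNT(*) AS n "
    "FROM incidents GROUP BY category ORDER BY n DESC, category"
).fetchall()
conn.close()
```

The same query shape runs largely unchanged on Spark SQL or Hive, which is why it is a cheap first step before any modeling.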
Sample exploration tools for Hadoop:
Trifacta, Paxata, Spark, Python, Waterline, Elasticsearch
Data Science Data Quality Priorities
• Be Corrective
• Be Fast
• Be Transparent
• Be Thorough
Data Scientist’s Tightrope
[Figure: data quality (raw to refined) plotted against speed to value (fast to slow)]
Does data munging in a data science lab need the same restrictive
governance as enterprise reporting?
3. Data Preparation
ETL (Extract Transform Load)
90+% of a Data Scientist's time goes into Data Preparation!
• Locating and acquiring valuable data sources
• Select required entities/fields
• Address Data Quality issues: missing or incomplete values,
whitespace, bad data-points
• Join/Enrich disparate datasets
• Derive behavioral features
• Transform/Aggregate data for intended use:
• Sample
• Aggregate
• Pivot
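The cleansing, aggregation, and pivot steps listed above can be sketched with the standard library alone. The field names and rows below are hypothetical, and dropping incomplete rows is just one possible policy; at scale the same steps would typically run in Spark or a DataFrame library.

```python
from collections import Counter, defaultdict

# Made-up raw rows with stray whitespace, mixed case, and a missing value.
raw_rows = [
    {"category": " theft ", "district": "Mission"},
    {"category": "THEFT", "district": "central "},
    {"category": "", "district": "Mission"},
    {"category": "Assault", "district": "Mission"},
]

def clean(row):
    """Trim whitespace and normalize case; return None if a required field is missing."""
    category = row.get("category", "").strip().upper()
    district = row.get("district", "").strip().upper()
    if not category or not district:
        return None
    return {"category": category, "district": district}

cleaned = [r for r in map(clean, raw_rows) if r is not None]

# Aggregate: incident counts per (district, category) pair.
counts = Counter((r["district"], r["category"]) for r in cleaned)

# Pivot: one entry per district, one key per category.
pivot = defaultdict(dict)
for (district, category), n in counts.items():
    pivot[district][category] = n
```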
We Love Spark
• Development local or distributed is identical
• Beautiful high-level APIs
• Full universe of Python modules
• Open source and free
• Blazing fast!
Spark has become our default processing engine for data engineering & data science
Data Preparation
• We love Spark!
• ETL can be done in Scala, Python or SQL
• Cleansing, transformation, and standardization
• Address Parsing: usaddress, postal-address, etc.
• Name Hashing: fuzzy, etc.
• Genderization: sexmachine, etc.
• And all the goodies of the standard Python library!
• Parallelize workload against a large number of machines in a Hadoop cluster
Data Quality and Monitoring
• BUILD a robust data quality subsystem:
• Metadata and error event facts
• Orchestration
• Based on Data Warehouse ETL Toolkit
• Each error instance of each data quality check is captured
• Implemented as a sub-system after ingestion
• Each fact stores the unique identifier of the defective source row
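The error-event idea can be illustrated as follows: every failed check on every row becomes one record that carries the identifier of the defective source row. This is a generic sketch, not HAMBot's actual design; `ErrorEvent`, `check_not_blank`, `run_checks`, and the `row_id` field are all hypothetical names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ErrorEvent:
    check_name: str      # which data quality check failed
    source_row_id: str   # unique identifier of the defective source row
    detail: str          # human-readable description of the problem

def check_not_blank(field):
    """Build a check that flags rows whose given field is missing or blank."""
    def check(row):
        return None if str(row.get(field, "")).strip() else f"{field} is blank"
    check.__name__ = f"not_blank_{field}"
    return check

def run_checks(rows, checks, id_field="row_id"):
    """Run every check against every row and capture each failure as an event."""
    events = []
    for row in rows:
        for check in checks:
            problem = check(row)
            if problem is not None:
                events.append(ErrorEvent(check.__name__, row[id_field], problem))
    return events

# Made-up rows: r2 has a blank category, r3 a blank district.
rows = [
    {"row_id": "r1", "category": "THEFT", "district": "MISSION"},
    {"row_id": "r2", "category": "", "district": "CENTRAL"},
    {"row_id": "r3", "category": "ASSAULT", "district": " "},
]
events = run_checks(rows, [check_not_blank("category"), check_not_blank("district")])
```

Because the events are kept separately from the data, defective rows can still flow through while quality is measured and trended over time.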
HAMBot: ‘open source’ project created in Caserta Innovation Lab (CIL)
Data Preparation Demonstration!
Follow along:
File: SanFranCrime.ipynb
4. Modeling
Do you love algebra & stats?
• Evaluate various models/algorithms
• Classification
• Clustering
• Regression
• Many others…..
• Tune parameters
• Iterative experimentation
• Different models may require different data preparation
techniques (e.g., sparse vector format)
• Additionally we may discover the need for additional data points,
or uncover additional data quality issues!
Modeling in Hadoop
• Spark works well
• SAS, SPSS, etc. are not native on Hadoop
• R and Python are becoming the new standard
• PMML can be used, but approach with caution
Machine Learning
The goal of machine learning is to get software to make decisions and learn
from data without being explicitly programmed to do so.
Machine Learning algorithms are broadly broken out into two groups:
• Supervised learning: inferring functions based on labeled training data
• Unsupervised learning: finding hidden structure/patterns within data; no
labeled training data is supplied
We will review some popular, easy-to-understand machine learning algorithms
What to use When?
Supervised Learning
Training set:
Name      Weight  Color   Cat_or_Dog
Susie     9lbs    Orange  Cat
Fido      25lbs   Brown   Dog
Sparkles  6lbs    Black   Cat
Fido      9lbs    Black   Dog
To predict:
Name      Weight  Color   Cat_or_Dog
Misty     5lbs    Orange  ?
The training set is used to generate a function with which we can predict whether we have a cat or a dog!
Category or Values?
There are several classes of algorithms depending on whether the prediction is a
category (like cat or dog) or a value, like the value of a home.
Classification algorithms are generally well fit for categorization, while
Regression is well suited for predicting “continuous” values. Decision Trees
can be used for either.
Regression
• Understanding the relationship between a dependent variable
and a given set of independent variables
• Typically regression is used to predict the output of a dependent variable
based on variations in independent variables
• Very popular for prediction and forecasting
Linear Regression
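As a concrete instance, simple linear regression with one independent variable has a closed-form least-squares fit. The sketch below assumes made-up home sizes and prices, echoing the home-value example above; `fit_line` is a hypothetical helper, and a real project would more likely use a library such as Spark MLlib.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: home size (thousands of sq ft) vs. price (thousands of dollars).
sizes = [1.0, 2.0, 3.0, 4.0]
prices = [150.0, 250.0, 350.0, 450.0]

slope, intercept = fit_line(sizes, prices)
predicted_price = slope * 2.5 + intercept  # prediction for an unseen home size
```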
Decision Trees
• A method for predicting outcomes based on the features of data
• Model is represented as an easy-to-understand tree structure of if-else statements
[Figure: decision tree with nodes “Weight > 10lbs”, “color = orange”, and “name = fido”; yes/no branches lead to cat or dog leaves]
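Written out as code, such a tree is just nested if-else tests. The branch order below is one plausible reading of the flattened figure, chosen because it reproduces the four training rows from the Supervised Learning slide; it is an assumption, not a confirmed transcription of the original diagram, and `classify` is a hypothetical function name.

```python
def classify(name, weight_lbs, color):
    """Walk the (assumed) tree: weight first, then color, then name."""
    if weight_lbs > 10:
        return "dog"
    if color.lower() == "orange":
        return "cat"
    if name.lower() == "fido":
        return "dog"
    return "cat"

# Training rows from the Supervised Learning slide: (name, weight, color, label).
training = [
    ("Susie", 9, "Orange", "cat"),
    ("Fido", 25, "Brown", "dog"),
    ("Sparkles", 6, "Black", "cat"),
    ("Fido", 9, "Black", "dog"),
]
training_predictions = [classify(n, w, c) for n, w, c, _ in training]

# The unlabeled row from the same slide.
misty = classify("Misty", 5, "Orange")
```

Splitting on the name "fido" fits the training rows perfectly but is unlikely to generalize, a small reminder that a tree can memorize quirks of its training set.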
Unsupervised K-Means
• Treats items as coordinates
• Places a number of random “centroids”
and assigns the nearest items
• Moves the centroids around based on
average location
• Process repeats until the assignments
stop changing
Clustering of items into logical groups based on natural patterns in data
Uses:
• Cluster Analysis
• Classification
• Content Filtering
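The K-Means loop described above (place centroids, assign the nearest items, move each centroid to the average location, repeat until assignments stop changing) can be sketched in plain Python. The points are made up, the starting centroids are drawn from the points themselves with a fixed seed, and `kmeans` is a hypothetical helper; a library implementation such as Spark MLlib's would be used in practice.

```python
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Minimal K-Means on tuples of coordinates; returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)   # random starting centroids
    assignments = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid (squared Euclidean distance).
        new_assignments = [
            min(range(k), key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])))
            for point in points
        ]
        if new_assignments == assignments:
            break                       # assignments stopped changing
        assignments = new_assignments
        # Move each centroid to the average location of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments

# Two visually obvious groups of hypothetical 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(points, k=2)
```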
Collaborative Filtering
• A hybrid of Supervised and Unsupervised Learning (Model Based vs. Memory Based)
• Leveraging collaboration between multiple agents to filter, project, or detect patterns
• Popular in recommender systems for projecting the “taste” of specific
individuals for items on which they have not yet expressed a preference
Item-based
• A popular and simple memory-based collaborative filtering algorithm
• Projects preference based on item similarity (based on ratings):
    for every item i that u has no preference for yet
        for every item j that u has a preference for
            compute a similarity s between i and j
            add u's preference for j, weighted by s, to a running average
    return the top items, ranked by weighted average
• First, a matrix of item-to-item similarity is calculated based on user ratings
• Then recommendations are created by producing a weighted sum of top items,
based on the user's previously rated items
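The pseudocode above translates fairly directly into Python. This sketch assumes cosine similarity over user ratings as the item-to-item measure (the slides do not name one), and the `ratings` dictionary is made up for illustration.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating}.
ratings = {
    "ann": {"A": 5, "B": 3, "C": 4},
    "bob": {"A": 4, "B": 2, "D": 5},
    "cat": {"A": 5, "C": 5, "D": 4},
}

def item_vectors(ratings):
    """Invert user -> item ratings into item -> {user: rating}."""
    items = {}
    for user, prefs in ratings.items():
        for item, score in prefs.items():
            items.setdefault(item, {})[user] = score
    return items

def cosine(a, b):
    """Cosine similarity between two sparse rating vectors."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[u] * b[u] for u in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def recommend(user, ratings, top_n=3):
    """Rank items the user has not rated by similarity-weighted average rating."""
    items = item_vectors(ratings)
    prefs = ratings[user]
    scores = {}
    for i in items:
        if i in prefs:
            continue                    # only items u has no preference for yet
        weighted = total = 0.0
        for j, pref in prefs.items():
            s = cosine(items[i], items[j])
            weighted += s * pref        # u's preference for j, weighted by s
            total += s
        if total > 0:
            scores[i] = weighted / total
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

recs = recommend("bob", ratings)
```

In a production recommender the item-to-item similarity matrix would be computed once and reused across users rather than recomputed per request as it is here.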
Data Science Demonstration!
Follow along:
File: SanFranCrime_model_DataSummit.ipynb
5. Evaluation
What problem are we trying to solve again?
• Our final solution needs to be evaluated against original
Business Understanding
• Did we meet our objectives?
• Did we address all issues?
6. Deployment
Engineering Time!
• It’s time for the work products of data science to “graduate” from “new
insights” to real applications.
• Processes must be hardened, repeatable, and generally perform well too!
• Data Governance applied
• PMML (Predictive Model Markup Language): XML-based interchange format
Components of Data Governance
• Organization: This is the ‘people’ part. Establishing Enterprise Data Council, Data Stewards, etc.
• Metadata: Definitions, lineage (where does this data come from), business definitions, technical metadata
• Privacy/Security: Identify and control sensitive data, regulatory compliance
• Data Quality and Monitoring: Data must be complete and correct. Measure, improve, certify
• Business Process Integration: Policies around data frequency, source availability, etc.
• Master Data Management: Ensure consistent business-critical data, e.g., Members, Providers, Agents, etc.
• Information Lifecycle Management (ILM): Data retention, purge schedule, storage/archiving
For Data Science
• Add Big Data to overall framework and assign responsibility
• Add data scientists to the Stewardship program
• Assign stewards to new data sets (Twitter, call center logs, etc.)
• Graph databases are more flexible than relational
• Lower latency service required
• Distributed data quality and matching algorithms
• Data Quality and Monitoring (probably home grown, Drools?)
• Quality checks not only SQL: machine learning, Pig and MapReduce
• Acting on large dataset quality checks may require distribution
• Larger scale
• New datatypes
• Integrate with Hive Metastore, HCatalog, home grown tables
• Secure and mask multiple data types (not just tabular)
• Deletes are less common (unless there is a regulatory requirement)
• Take advantage of compression and archiving (like AWS Glacier)
• Data detection and masking on unstructured data upon ingest
• Near-zero latency, DevOps, core component of business operations
Corporate Data Pyramid (CDP)
[Figure: pyramid of data tiers, from base to top: Landing Area – Source Data in “Full Fidelity”; Data Lake – Integrated Sandbox; Data Science Workspace; Big Data Warehouse. Usage pattern labels: Ingest Raw Data; Organize, Define, Complete; Munging, Blending, Machine Learning; Arbitrary/Ad-hoc Queries and Reporting. Data governance labels: Metadata, ILM, Security; Data Catalog; Data Integration; Data Quality and Monitoring; Fully Governed (trusted)]
Chief Data Organization (Oversight)
Vertical Business Area
[Sales/Finance/Marketing/Operations/Customer Svc]
Product Owner
SCRUM Master
Development Team
Business Subject Matter Expertise
Data Librarian/Data Stewardship
Data Science/ Statistical Skills
Data Engineering / Architecture
Presentation/ BI Report Development Skills
Data Quality Assurance
IT Organization
Enterprise Data Architect
Solution Engineers
Data Integration Practice
User Experience Practice
QA Practice
Operations Practice
Advanced Analytics
Business Analysts
Data Analysts
Data Scientists
Data Engineers
Planning Organization
Project Managers
Data Organization
Data Gov Coordinator
Data Librarians
Data Stewards
Analytics-Driven Organization
Technologies & Techniques
• The Cloud and Spark can provide a relatively low cost and extremely scalable
platform for Data Science
• AWS S3 and Google Cloud Storage (GCS) offer great scalability and speed to value without the
overhead of structuring data
• Spark, with MLlib, offers a great library of established Machine Learning algorithms,
reducing development effort
• Python and SQL are choices for Data Science
• Go Agile and follow Best Practices (CRISP-DM)
• Employ Data Pyramid concepts to ensure data has just enough governance
Some Thoughts – Enable the Future
— Data Science requires the convergence of data quality, advanced math,
data engineering, visualization and business smarts
— Make sure your data can be trusted and people can
be held accountable for impact caused by low data quality.
— Good data scientists are rare: It will take a village
to achieve all the tasks required for effective data science
— Get good!
— Be great!
— Blaze new trails!
Data Science Training:
• Big Data Warehousing Meetup
• New York City
• 4,300+ members
• Knowledge sharing
Thank You / Q&A
Joe Caserta Bill Walrond
@joe_Caserta @bill_walrond

Introduction to Data Science (Data Summit, 2017)

  • 1. @CasertaConcepts#DataSummit Introduction to Data Science Joe Caserta Bill Walrond @joe_Caserta @bill_walrond @CasertaConcepts
  • 2. @CasertaConcepts#DataSummit About Caserta Concepts • Consulting Data Innovation and Modern Data Engineering • Award-winning company • Internationally recognized work force • Strategy, Architecture, Implementation, Governance • Innovation Partner • Strategic Consulting • Advanced Architecture • Build & Deploy • Leader in Enterprise Data Solutions • Big Data Analytics • Data Warehousing • Business Intelligence • Data Science • Cloud Computing • Data Governance
  • 3. @CasertaConcepts#DataSummit Caserta Client Portfolio Retail/eCommerce & Manufacturing Finance, Healthcare & Insurance Digital Media/AdTech Education & Services
  • 4. @CasertaConcepts#DataSummit Awards & Recognition Top 10 Fastest Growing Big Data Companies 2016
  • 6. @CasertaConcepts#DataSummit Agenda • Why we care about Big Data • Challenges of working with Big Data • Governing Big Data for Data Science • Introducing the Data Pyramid • Why Data Science is Cool? • What does a Data Scientist do? • Standards for Data Science • Business Objective • Data Discovery • Preparation • Models • Evaluation • Deployment • Q & A
  • 7. @CasertaConcepts#DataSummit Big Data Analysis: Timeline of Society Media 1500s Printing Press 1840s Penny Post 1850s Telegraph 1850s Rural Free Post 1890s Telephone 1900s Radio 1950s TV 1970s PCs 1980s Internet 1990s Web 2000s Social Media, Mobile, Big Data, Cloud 98,000+ Tweets 695,000 Status Updates 11 Million instant messages 698,445 Google Searches 168 million+ emails sent 1,829 TB of data created 217 new mobile web users Every 60 Seconds
  • 8. @CasertaConcepts#DataSummit Data is your Differentiator 63% of organizations realize a positive return on analytic investments within a year 69% of speed-driven analytics organizations created a positive impact on business outcomes 74% of respondents anticipate a speed at which executives expect new data-driven insights will continue to accelerate
  • 9. @CasertaConcepts#DataSummit Understanding the Customer Journey Awareness Consideration Purchase Service Loyalty Expansion PR Radio TV Print Outdoor Word of Mouth Direct Mail Customer Service Physical Touchpoints Digital Touchpoints Search Paid Content email Website/ Landing Pages Social Media Community Chat Social Media Call Center Offers Mailings Survey Loyalty Programs email Agents Partners Ads Website Mobile 3rd Party Sites Offers Web self-service
  • 10. @CasertaConcepts#DataSummit Type Comments Single Touch Rules-Based Statistically Driven Assign the credit to the first or last exposure Assign the credit to each interaction based on business rules Assign the credit to interactions based on data-driven model Ad-Click Mailing MailingE-mail E-mailAd-Click Ad-Click 100% 33% 33% 33% 27% 49% 24% - Last touch only - Ignores bulk of customer journey - Undervalues other interactions and influencers - Subjective - Assigns arbitrary values to each interaction - Lacks analytics rigor to determine weights ü Looks at full behavior patterns ü Consider all touch points ü Can apply different models for best results ü Use data to find correlations between touch points (winning combinations) Understanding Touchpoint Methods
  • 12. @CasertaConcepts#DataSummit Business Value Cloud-based Data Lake Big Data Analysis: The Ecosystem of the future Analyze Persist DeployIngest Data Integration Identity Resolution Data Quality Discovery Exploration Machine Learning Models Development Reports / Dashboards Applications APIs Structured Data Unstructured Data SQL, NoSQL, Object Store Find Share Collaborate Data Engineer Data Scientist Business Analyst App Developer Provides innovative and industry leading technologies to rapidly be applied to the business without having to manage compatibility and data complexity. Technical Value Provides an open framework to reduce the number of integration points and testing environments to deliver business solutions.
  • 13. @CasertaConcepts#DataSummit Progression of Business Analytics to Data Science Descriptive Analytics Diagnostic Analytics Predictive Analytics Prescriptive Analytics What happened? Why did it happen? What will happen? How can we make It happen? Data Analytics Sophistication BusinessValue Hindsight Insight Foresight Information Optimization Cognitive Analytics Influence what happens Reports  Correlations  Predictions  Recommendations  Monetization Interactions Action
  • 14. @CasertaConcepts#DataSummit Progression of Data Science Maturity • Timeline • Tools • Available libraries • Best practices
  • 15. What are the Realities of the Data Scientist? — Writes really cool and sophisticated algorithms that impact the way the business runs? NOT! — Much of the time of a Data Scientist is spent: — Searching for the data they need — Making sense of the data — Figuring out why the data looks the way it does and assessing its validity — Cleaning up all the garbage within the data so it represents true business — Combining events with reference data to give it context — Correlating event data with other events — Finally, they implement algorithms to perform mining, clustering and predictive analytics
  • 16. @CasertaConcepts#DataSummit Why Data Science now? • Costs of compute and storage dramatically lower than just a few years ago • Data generated by all aspects of society has dramatically increased • Need to efficiently learn what there is to learn from our data
  • 17. The Data Scientist Winning Trifecta • Modern Data Engineering / Data Preparation: computer science, programming/storage, data quality, visualization • Advanced Mathematics / Statistics: algorithms, A/B testing • Domain Knowledge / Business Expertise: data and outcome sensibility
  • 22. @CasertaConcepts#DataSummit Is Data Trying To Trick You? Correlation: 99.26%
  • 24. @CasertaConcepts#DataSummit Are there Standards? CRISP-DM: Cross Industry Standard Process for Data Mining 1. Business Understanding • Solve a single business problem 2. Data Understanding • Discovery • Data Munging • Cleansing Requirements 3. Data Preparation • ETL 4. Modeling • Evaluate various models • Iterative experimentation 5. Evaluation • Does the model achieve business objectives? 6. Deployment • PMML; application integration; data platform; Excel
  • 25. @CasertaConcepts#DataSummit 1. Business Understanding In this initial phase of the project we will need to speak to humans. • It would be premature to jump in to the data, or begin selection of the appropriate model(s) or algorithm • Understand the project objective • Review the business requirements • The output of this phase will be conversion of business requirements into a preliminary technical design (decision model) and plan. Since this is an iterative process, this phase will be revisited throughout the entire process.
  • 26. @CasertaConcepts#DataSummit Data ScientistBusiness Analyst Business Stakeholders Business Stakeholders Business Stakeholders Interview notes Requirement Document Models / Insights Gathering Requirements
  • 27. @CasertaConcepts#DataSummit Data Science Scrum Team Data Scientist Business Stakeholders Data Engineer Efficient Inclusive EffectiveInteractive Data Analyst
  • 28. @CasertaConcepts#DataSummit 2. Data Understanding • Data Discovery  understand where the data you need comes from • Data Profiling  interrogate the data at the entity level, understand key entities and fields that are relevant to the analysis. • Cleansing Requirements  understand data quality, data density, skew, etc • Data Munging  collocate, blend and analyze data for early insights! Valuable information can be achieved from simple group-by, aggregate queries, and even more with SQL Jujitsu! Significant iteration between Business Understanding and Data Understanding phases. Sample Exploration tools for Hadoop: Trifacta, Paxata, Spark, Python, Waterline, Elasticsearch
  • 29. @CasertaConcepts#DataSummit Data Science Data Quality Priorities Be Corrective Be Fast Be Transparent Be Thorough
  • 30. @CasertaConcepts#DataSummit Data Science Data Quality Priorities Data Quality SpeedtoValue Fast Slow Raw Refined Data Scientist’s Tightrope Does Data munging in a data science lab need the same restrictive governance and enterprise reporting?
  • 31. @CasertaConcepts#DataSummit 3. Data Preparation ETL (Extract Transform Load) 90+% of a Data Scientists time goes into Data Preparation! • Locating and acquiring valuable data sources • Select required entities/fields • Address Data Quality issues: missing or incomplete values, whitespace, bad data-points • Join/Enrich disparate datasets • Derive behavioral features • Transform/Aggregate data for intended use: • Sample • Aggregate • Pivot
  • 32. @CasertaConcepts#DataSummit We Spark • Development local or distributed is identical • Beautiful high level API’s • Full universe of Python modules • Open source and Free • Blazing fast! Spark has become our default processing engine for a data engineering & science
  • 33. @CasertaConcepts#DataSummit Data Preparation • We love Spark! • ETL can be done in Scala, Python or SQL • Cleansing, transformation, and standardization • Address Parsing: usaddress, postal-address, etc • Name Hashing: fuzzy, etc • Genderization: sexmachine, etc • And all the goodies of the standard Python library! • Parallelize workload against a large number of machines in Hadoop cluster
  • 34. @CasertaConcepts#DataSummit Data Quality and Monitoring • BUILD a robust data quality subsystem: • Metadata and error event facts • Orchestration • Based on Data Warehouse ETL Toolkit • Each error instance of each data quality check is captured • Implemented as sub-system after ingestion • Each fact stores unique identifier of the defective source row HAMBot: ‘open source’ project created in Caserta Innovation Lab (CIL)
  • 35. @CasertaConcepts#DataSummit Data Preparation Demonstration! Wifi: Hilton Meeting Room Wifi infotoday2017 Follow along: File: SanFranCrime.ipynb
  • 36. @CasertaConcepts#DataSummit 4. Modeling Do you love algebra & stats? • Evaluate various models/algorithms • Classification • Clustering • Regression • Many others….. • Tune parameters • Iterative experimentation • Different models may require different data preparation techniques (ie. Sparse Vector Format) • Additionally we may discover the need for additional data points, or uncover additional data quality issues!
  • 37. @CasertaConcepts#DataSummit Modeling in Hadoop • Spark works well • SAS, SPSS, Etc. not native on Hadoop • R and Python becoming new standard • PMML can be used, but approach with caution
  • 38. Machine Learning The goal of machine learning is to get software to make decisions and learn from data without being explicitly programmed to do so. Machine Learning algorithms are broadly broken out into two groups: • Supervised learning: inferring functions based on labeled training data • Unsupervised learning: finding hidden structure/patterns within data; no labeled training data is supplied We will review some popular, easy-to-understand machine learning algorithms
  • 40. @CasertaConcepts#DataSummit Supervised Learning Name Weight Color Cat_or_Dog Susie 9lbs Orange Cat Fido 25lbs Brown Dog Sparkles 6lbs Black Cat Fido 9lbs Black Dog Name Weight Color Cat_or_Dog Misty 5lbs Orange ? The training set is used to generate a function we can predict if we have a cat or dog!
  • 41. @CasertaConcepts#DataSummit Category or Values? There are several classes of algorithms depending on whether the prediction is a category (like cat or dog) or a value, like the value of a home. Classification algorithms are generally well fit for categorization, while algorithms like Regression and Decision Trees are well suited for predicting “continuous” values.
  • 42. @CasertaConcepts#DataSummit Regression • Understanding the relationship between a given set of dependent variables and independent variables • Typically regression is used to predict the output of a dependent variable based on variations in independent variables • Very popular for prediction and forecasting Linear Regression
  • 43. Decision Trees • A method for predicting outcomes based on the features of data • Model is represented as an easy-to-understand tree structure of if-else statements [Figure: decision tree with nodes “Weight > 10lbs”, “color = orange”, and “name = fido”; yes/no branches lead to cat or dog leaves]
  • 44. @CasertaConcepts#DataSummit Unsupervised K-Means • Treats items as coordinates • Places a number of random “centroids” and assigns the nearest items • Moves the centroids around based on average location • Process repeats until the assignments stop changing Clustering of items into logical groups based on natural patterns in data Uses: • Cluster Analysis • Classification • Content Filtering
  • 45. Collaborative Filtering • A hybrid of Supervised and Unsupervised Learning (Model Based vs. Memory Based) • Leveraging collaboration between multiple agents to filter, project, or detect patterns • Popular in recommender systems for projecting the “taste” of specific individuals for items on which they have not yet expressed a preference
  • 46. @CasertaConcepts#DataSummit Item-based • A popular and simple memory-based collaborative filtering algorithm • Projects preference based on item similarity (based on ratings): for every item i that u has no preference for yet for every item j that u has a preference for compute a similarity s between i and j add u's preference for j, weighted by s, to a running average return the top items, ranked by weighted average • First a matrix of Item to Item similarity is calculated based on user rating • Then recommendations are created by producing a weighted sum of top items, based on the users previously rated items
  • 47. @CasertaConcepts#DataSummit Data Science Demonstration! Follow along: File: SanFranCrime_model_DataSummit.ipynb Wifi: Hilton Meeting Room Wifi infotoday2017
  • 48. @CasertaConcepts#DataSummit 5. Evaluation What problem are we trying to solve again? • Our final solution needs to be evaluated against original Business Understanding • Did we meet our objectives? • Did we address all issues?
  • 49. 6. Deployment Engineering Time! • It’s time for the work products of data science to “graduate” from “new insights” to real applications. • Processes must be hardened, repeatable, and generally perform well too! • Data Governance applied • PMML (Predictive Model Markup Language): XML-based interchange format
  • 50. @CasertaConcepts#DataSummit •This is the ‘people’ part. Establishing Enterprise Data Council, Data Stewards, etc. •Definitions, lineage (where does this data come from), business definitions, technical metadata Organization •Identify and control sensitive data, regulatory compliance Metadata •Data must be complete and correct. Measure, improve, certify Privacy/Security •Policies around data frequency, source availability, etc. Data Quality and Monitoring •Ensure consistent business critical data i.e. Members, Providers, Agents, etc. Business Process Integration •Data retention, purge schedule, storage/archiving Master Data Management Information Lifecycle Management (ILM) Components of Data Governance • Add Big Data to overall framework and assign responsibility • Add data scientists to the Stewardship program • Assign stewards to new data sets (twitter, call center logs, etc.) • Graph databases are more flexible than relational • Lower latency service required • Distributed data quality and matching algorithms • Data Quality and Monitoring (probably home grown, drools?) • Quality checks not only SQL: machine learning, Pig and Map Reduce • Acting on large dataset quality checks may require distribution • Larger scale • New datatypes • Integrate with Hive Metastore, HCatalog, home grown tables • Secure and mask multiple data types (not just tabular) • Deletes are more uncommon (unless there is regulatory requirement) • Take advantage of compression and archiving (like AWS Glacier) • Data detection and masking on unstructured data upon ingest • Near-zero latency, DevOps, Core component of business operations For Data Science
  • 51. @CasertaConcepts#DataSummit Ingest Raw Data Organize, Define, Complete Munging, Blending Machine Learning Data Quality and Monitoring Metadata, ILM , Security Data Catalog Data Integration Fully Governed ( trusted) Arbitrary/Ad-hoc Queries and Reporting Big Data Ware house Data Science Workspace Data Lake – Integrated Sandbox Landing Area – Source Data in “Full Fidelity” Usage Pattern Data Governance Metadata, ILM, Security Corporate Data Pyramid (CDP)
  • 52. @CasertaConcepts#DataSummit Chief Data Organization (Oversight) Vertical Business Area [Sales/Finance/Marketing/Operations/Customer Svc] Product Owner SCRUM Master Development Team Business Subject Matter Expertise Data Librarian/Data Stewardship Data Science/ Statistical Skills Data Engineering / Architecture Presentation/ BI Report Development Skills Data Quality Assurance DevOps IT Organization (Oversight) Enterprise Data Architect Solution Engineers Data Integration Practice User Experience Practice QA Practice Operations Practice Advanced Analytics Business Analysts Data Analysts Data Scientists Statisticians Data Engineers Planning Organization Project Managers Data Organization Data Gov Coordinator Data Librarians Data Stewards Analytics-Driven Organization
  • 53. @CasertaConcepts#DataSummit Technologies & Techniques • The Cloud and Spark can provide a relatively low cost and extremely scalable platform for Data Science • AWS S3 and Google GCS offers great scalability and speed to value without the overhead of structuring data • Spark, with MLlib offers a great library of established Machine Learning algorithms, reducing development efforts • Python and SQL are choices for Data Science • Go Agile and follow Best Practices (CRISP-DM) • Employ Data Pyramid concepts to ensure data has just enough governance
  • 54. @CasertaConcepts#DataSummit Some Thoughts – Enable the Future — Data Science requires the convergence of data quality, advanced math, data engineering and visualization and business smarts — Make sure your data can be trusted and people can be held accountable for impact caused by low data quality. — Good data scientists are rare: It will take a village to achieve all the tasks required for effective data science — Get good! — Be great! — Blaze new trails! Data Science Training: • Big Data Warehousing Meetup • New York City • 4,300+ members • Knowledge sharing
  • 55. @CasertaConcepts#DataSummit Thank You / Q&A Joe Caserta Bill Walrond @joe_Caserta @bill_walrond