@CasertaConcepts#DataSummit
Introduction to Data Science
Joe Caserta Bill Walrond
@joe_Caserta @bill_walrond
@CasertaConcepts
About Caserta Concepts
• Consulting Data Innovation and Modern Data Engineering
• Award-winning company
• Internationally recognized work force
• Strategy, Architecture, Implementation, Governance
• Innovation Partner
• Strategic Consulting
• Advanced Architecture
• Build & Deploy
• Leader in Enterprise Data Solutions
• Big Data Analytics
• Data Warehousing
• Business Intelligence
• Data Science
• Cloud Computing
• Data Governance
Caserta Client Portfolio
Retail/eCommerce
& Manufacturing
Finance, Healthcare
& Insurance
Digital Media/AdTech
Education & Services
Awards & Recognition
Top 10
Fastest Growing
Big Data Companies
2016
Our Partners
Agenda
• Why we care about Big Data
• Challenges of working with Big Data
• Governing Big Data for Data Science
• Introducing the Data Pyramid
• Why Data Science is Cool?
• What does a Data Scientist do?
• Standards for Data Science
• Business Objective
• Data Discovery
• Preparation
• Models
• Evaluation
• Deployment
• Q & A
Big Data Analysis: Timeline of Society Media
1500s: Printing Press
1840s: Penny Post
1850s: Telegraph
1850s: Rural Free Post
1890s: Telephone
1900s: Radio
1950s: TV
1970s: PCs
1980s: Internet
1990s: Web
2000s: Social Media, Mobile, Big Data, Cloud
Every 60 Seconds:
• 98,000+ Tweets
• 695,000 Status Updates
• 11 Million instant messages
• 698,445 Google Searches
• 168 million+ emails sent
• 1,829 TB of data created
• 217 new mobile web users
Data is your Differentiator
63% of organizations realize a positive return on analytic investments within a year
69% of speed-driven analytics organizations created a positive impact on business outcomes
74% of respondents anticipate that the speed at which executives expect new data-driven insights will continue to accelerate
Understanding the Customer Journey
Stages: Awareness → Consideration → Purchase → Service → Loyalty → Expansion
PR
Radio
TV
Print
Outdoor
Word of Mouth
Direct Mail
Customer Service
Physical Touchpoints
Digital Touchpoints
Search
Paid Content
email
Website/Landing Pages
Social Media
Community
Chat
Social Media
Call Center
Offers
Mailings
Survey
Loyalty Programs
email
Agents
Partners
Ads
Website
Mobile
3rd Party Sites
Offers
Web self-service
Understanding Touchpoint Methods
Type: Single Touch
• Assign the credit to the first or last exposure
• Example touchpoint: Ad-Click; credit: 100%
• Comments:
- Last touch only
- Ignores bulk of customer journey
- Undervalues other interactions and influencers
Type: Rules-Based
• Assign the credit to each interaction based on business rules
• Example touchpoints: Mailing, E-mail, Ad-Click; credit split: 33% / 33% / 33%
• Comments:
- Subjective
- Assigns arbitrary values to each interaction
- Lacks analytics rigor to determine weights
Type: Statistically Driven
• Assign the credit to interactions based on data-driven model
• Example touchpoints: Mailing, E-mail, Ad-Click; credit split: 27% / 49% / 24%
• Comments:
✓ Looks at full behavior patterns
✓ Considers all touch points
✓ Can apply different models for best results
✓ Uses data to find correlations between touch points (winning combinations)
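The first two attribution types can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' implementation: the function names, the example journey, and the equal-share default rule are all assumptions. The statistically driven type needs a model fitted to real journey data, so it is not shown.

```python
from collections import defaultdict

def single_touch(journey, position="last"):
    """Give 100% of the credit to the first or last touchpoint."""
    touch = journey[0] if position == "first" else journey[-1]
    return {touch: 1.0}

def rules_based(journey, weights=None):
    """Split credit by a business rule; the assumed default is an equal share."""
    if weights is None:
        weights = [1.0] * len(journey)
    total = sum(weights)
    credit = defaultdict(float)
    for touch, weight in zip(journey, weights):
        credit[touch] += weight / total
    return dict(credit)

# Hypothetical customer journey, ordered from first to last exposure.
journey = ["Mailing", "E-mail", "Ad-Click"]
print(single_touch(journey))
print(rules_based(journey))
```

Passing an explicit `weights` list (for example `[1, 1, 2]`) expresses a different business rule without changing the code.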
What is Data Science?
Big Data Analysis: The Ecosystem of the Future
Cloud-based Data Lake
Stages: Ingest, Persist, Analyze, Deploy
Capabilities: Data Integration, Identity Resolution, Data Quality; Discovery, Exploration, Machine Learning, Models Development; Reports / Dashboards, Applications, APIs
Storage: Structured Data, Unstructured Data; SQL, NoSQL, Object Store
Find, Share, Collaborate
Roles: Data Engineer, Data Scientist, Business Analyst, App Developer
Business Value: Provides innovative and industry leading technologies to rapidly be applied to the business without having to manage compatibility and data complexity.
Technical Value: Provides an open framework to reduce the number of integration points and testing environments to deliver business solutions.
Progression of Business Analytics to Data Science
Descriptive Analytics: What happened? (Hindsight)
Diagnostic Analytics: Why did it happen? (Insight)
Predictive Analytics: What will happen? (Foresight)
Prescriptive Analytics: How can we make it happen?
Cognitive Analytics: Influence what happens
Axes: Data Analytics Sophistication, Business Value
Reports → Correlations → Predictions → Recommendations
Other labels: Information, Optimization, Monetization, Interactions, Action
Progression of Data Science Maturity
• Timeline
• Tools
• Available libraries
• Best practices
What are the Realities of the Data Scientist
• Writes really cool and sophisticated algorithms that impact the way the business runs.
• NOT
• Much of the time of a Data Scientist is spent:
• Searching for the data they need
• Making sense of the data
• Figuring out why the data looks the way it does and assessing its validity
• Cleaning up all the garbage within the data so it represents true business
• Combining events with reference data to give it context
• Correlating event data with other events
• Finally, they implement algorithms to perform mining, clustering and predictive analytics
Why Data Science now?
• Costs of compute and storage dramatically lower than just a few years ago
• Data generated by all aspects of society has dramatically increased
• Need to efficiently learn what there is to learn from our data
The Data Scientist Winning Trifecta
Modern Data Engineering/Data Preparation
- Computer Science
- Programming/Storage
- Data Quality
- Visualization
Advanced Mathematics/Statistics
- Algorithms
- A/B Testing
Domain Knowledge/Business Expertise
- Data and Outcome Sensibility
Modern Data Engineering
Which Visualization, When?
Advanced Mathematics / Statistics
Domain and Outcome Sensibility
Is Data Trying To Trick You?
Correlation: 99.26%
Are we Considering the Right Factors?
Are there Standards?
CRISP-DM: Cross Industry Standard Process for Data Mining
1. Business Understanding
• Solve a single business problem
2. Data Understanding
• Discovery
• Data Munging
• Cleansing Requirements
3. Data Preparation
• ETL
4. Modeling
• Evaluate various models
• Iterative experimentation
5. Evaluation
• Does the model achieve business objectives?
6. Deployment
• PMML; application integration; data platform; Excel
1. Business Understanding
In this initial phase of the project we will need to speak to humans.
• It would be premature to jump into the data, or to begin selecting the appropriate model(s) or algorithm(s)
• Understand the project objective
• Review the business requirements
• The output of this phase will be conversion of business requirements
into a preliminary technical design (decision model) and plan.
Since this is an iterative process, this phase will be revisited throughout
the entire process.
Gathering Requirements
Participants: Business Stakeholders, Business Analyst, Data Scientist
Artifacts: Interview notes, Requirement Document, Models / Insights
Data Science Scrum Team
Team: Data Scientist, Data Engineer, Data Analyst, Business Stakeholders
Qualities: Efficient, Inclusive, Effective, Interactive
2. Data Understanding
• Data Discovery → understand where the data you need comes from
• Data Profiling → interrogate the data at the entity level, understand key entities and fields that are relevant to the analysis.
• Cleansing Requirements → understand data quality, data density, skew, etc.
• Data Munging → collocate, blend and analyze data for early insights! Valuable information can be obtained from simple group-by, aggregate queries, and even more with SQL Jujitsu!
Significant iteration between Business Understanding and Data Understanding phases.
Sample exploration tools for Hadoop: Trifacta, Paxata, Spark, Python, Waterline, Elasticsearch
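As a rough illustration of profiling and a simple group-by aggregate, the sketch below uses only the Python standard library. The records, field names, and helper names (`profile`, `group_sum`) are hypothetical; on real data the same ideas would typically be expressed in SQL or Spark.

```python
from collections import defaultdict

# Hypothetical sample records; field names are illustrative only.
rows = [
    {"customer": "a", "channel": "web", "amount": 20.0},
    {"customer": "b", "channel": "store", "amount": None},
    {"customer": "a", "channel": "web", "amount": 35.0},
    {"customer": "c", "channel": "", "amount": 12.5},
]

def profile(rows, field):
    """Count rows, missing values, and distinct non-missing values of a field."""
    values = [row.get(field) for row in rows]
    present = [v for v in values if v not in (None, "")]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
    }

def group_sum(rows, key, value):
    """Equivalent of SELECT key, SUM(value) ... GROUP BY key, skipping nulls."""
    totals = defaultdict(float)
    for row in rows:
        if row[value] is not None:
            totals[row[key]] += row[value]
    return dict(totals)

print(profile(rows, "amount"))
print(group_sum(rows, "channel", "amount"))
```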
Data Science Data Quality Priorities
Be Corrective
Be Fast
Be Transparent
Be Thorough
Data Science Data Quality Priorities
Data Scientist’s Tightrope
Axes: Data Quality (Raw to Refined), Speed to Value (Slow to Fast)
Does data munging in a data science lab need the same restrictive governance as enterprise reporting?
3. Data Preparation
ETL (Extract Transform Load)
90+% of a Data Scientist's time goes into Data Preparation!
• Locating and acquiring valuable data sources
• Select required entities/fields
• Address Data Quality issues: missing or incomplete values,
whitespace, bad data-points
• Join/Enrich disparate datasets
• Derive behavioral features
• Transform/Aggregate data for intended use:
• Sample
• Aggregate
• Pivot
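The preparation steps listed above (fix quality issues, join/enrich, aggregate) might look like the following in plain Python. This is a sketch under stated assumptions: the event records, the `customers` reference table, and the rule of dropping unusable rows are invented for illustration and are not taken from the demonstration notebook.

```python
from collections import defaultdict

# Hypothetical raw events with typical quality problems.
events = [
    {"customer_id": " 101 ", "amount": "25.00"},   # stray whitespace
    {"customer_id": "102", "amount": ""},          # missing value
    {"customer_id": "101", "amount": " 10.50"},
    {"customer_id": "103", "amount": "n/a"},       # bad data point
]
# Hypothetical reference data used to enrich each event.
customers = {"101": "Retail", "102": "Finance", "103": "Retail"}

def clean(event):
    """Trim whitespace and parse the amount; return None for unusable rows."""
    customer_id = event["customer_id"].strip()
    try:
        amount = float(event["amount"].strip())
    except ValueError:
        return None
    return {"customer_id": customer_id, "amount": amount}

def prepare(events, customers):
    """Clean, join to reference data, and aggregate amount by segment."""
    cleaned = [row for row in map(clean, events) if row is not None]
    totals = defaultdict(float)
    for row in cleaned:
        segment = customers.get(row["customer_id"], "Unknown")  # join/enrich
        totals[segment] += row["amount"]                        # aggregate
    return dict(totals)

print(prepare(events, customers))
```

Dropping bad rows is only one policy; imputing a default or routing the row to an error table are equally valid choices depending on the analysis.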
We Love Spark
• Development local or distributed is identical
• Beautiful high-level APIs
• Full universe of Python modules
• Open source and free
• Blazing fast!
Spark has become our default processing engine for data engineering & science
Data Preparation
• We love Spark!
• ETL can be done in Scala, Python or SQL
• Cleansing, transformation, and standardization
• Address Parsing: usaddress, postal-address, etc.
• Name Hashing: fuzzy, etc.
• Genderization: sexmachine, etc.
• And all the goodies of the standard Python library!
• Parallelize workload against a large number of machines in Hadoop cluster
Data Quality and Monitoring
• BUILD a robust data quality subsystem:
• Metadata and error event facts
• Orchestration
• Based on Data Warehouse ETL Toolkit
• Each error instance of each data quality check is captured
• Implemented as sub-system after ingestion
• Each fact stores unique identifier of the defective source row
HAMBot: ‘open source’ project created in Caserta Innovation Lab (CIL)
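The idea of capturing one error event fact per failed check, keyed by the defective row's identifier, can be sketched as follows. This is not HAMBot's design or API; the check names, row layout, and `run_checks` helper are assumptions made for illustration.

```python
from datetime import datetime, timezone

def not_null(field):
    """Build a check that passes when the field is present and non-empty."""
    return lambda row: row.get(field) not in (None, "")

def in_range(field, low, high):
    """Build a check that passes when the field is a number within [low, high]."""
    def check(row):
        value = row.get(field)
        return isinstance(value, (int, float)) and low <= value <= high
    return check

# Hypothetical registry of data quality checks (the "metadata").
CHECKS = {
    "customer_id_not_null": not_null("customer_id"),
    "age_in_range": in_range("age", 0, 120),
}

def run_checks(rows, checks, id_field="row_id"):
    """Return one error event fact per failed (row, check) pair."""
    error_events = []
    for row in rows:
        for name, check in checks.items():
            if not check(row):
                error_events.append({
                    "check": name,
                    "row_id": row[id_field],  # identifier of the defective row
                    "detected_at": datetime.now(timezone.utc).isoformat(),
                })
    return error_events

rows = [
    {"row_id": 1, "customer_id": "A1", "age": 34},
    {"row_id": 2, "customer_id": "", "age": 29},
    {"row_id": 3, "customer_id": "C3", "age": 250},
]
errors = run_checks(rows, CHECKS)
for event in errors:
    print(event["check"], event["row_id"])
```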
Data Preparation Demonstration!
Wifi:
Hilton Meeting Room Wifi
infotoday2017
Follow along: http://bit.ly/2r9ABcK
File: SanFranCrime.ipynb
4. Modeling
Do you love algebra & stats?
• Evaluate various models/algorithms
• Classification
• Clustering
• Regression
• Many others…
• Tune parameters
• Iterative experimentation
• Different models may require different data preparation techniques (e.g. Sparse Vector Format)
• Additionally, we may discover the need for additional data points, or uncover additional data quality issues!
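To make the sparse vector remark concrete, here is one possible representation: the vector's size plus a mapping of only the non-zero positions. The helper names and the color vocabulary are hypothetical; libraries such as Spark MLlib provide their own sparse vector types with a similar shape.

```python
def to_sparse(dense):
    """Return (size, {index: value}) keeping only the non-zero entries."""
    return len(dense), {i: v for i, v in enumerate(dense) if v != 0}

def one_hot_sparse(value, vocabulary):
    """One-hot encode a category as a sparse vector over a fixed vocabulary."""
    index = vocabulary.index(value)
    return len(vocabulary), {index: 1.0}

# Hypothetical vocabulary for a categorical "color" feature.
colors = ["Black", "Brown", "Orange"]
print(to_sparse([0, 0, 3.5, 0, 1]))
print(one_hot_sparse("Orange", colors))
```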
Modeling in Hadoop
• Spark works well
• SAS, SPSS, etc. not native on Hadoop
• R and Python becoming new standard
• PMML can be used, but approach with caution
Machine Learning
The goal of machine learning is to get software to make decisions and learn from data without being explicitly programmed to do so
Machine Learning algorithms are broadly broken out into two groups:
• Supervised learning → inferring functions based on labeled training data
• Unsupervised learning → finding hidden structure/patterns within data; no labels are supplied
We will review some popular, easy-to-understand machine learning algorithms
What to use When?
Supervised Learning
Training set:
Name      Weight  Color   Cat_or_Dog
Susie     9lbs    Orange  Cat
Fido      25lbs   Brown   Dog
Sparkles  6lbs    Black   Cat
Fido      9lbs    Black   Dog
To predict:
Name      Weight  Color   Cat_or_Dog
Misty     5lbs    Orange  ?
The training set is used to generate a function
...so we can predict if we have a cat or dog!
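One simple way to turn the training set into a predicting function is a nearest-neighbor rule. The sketch below is an illustration only: the slide does not say which algorithm is used, and the distance measure (weight gap plus a fixed penalty for a color mismatch) is an assumption.

```python
# Training rows from the slide: (name, weight in lbs, color, label).
training_set = [
    ("Susie", 9, "Orange", "Cat"),
    ("Fido", 25, "Brown", "Dog"),
    ("Sparkles", 6, "Black", "Cat"),
    ("Fido", 9, "Black", "Dog"),
]

COLOR_PENALTY = 5  # assumed cost of a color mismatch, expressed in pounds

def distance(a, b):
    """Distance between two (weight, color) pairs."""
    weight_gap = abs(a[0] - b[0])
    color_gap = 0 if a[1] == b[1] else COLOR_PENALTY
    return weight_gap + color_gap

def predict(weight, color):
    """Return the label of the closest training row (1-nearest neighbor)."""
    nearest = min(
        training_set,
        key=lambda row: distance((weight, color), (row[1], row[2])),
    )
    return nearest[3]

print(predict(5, "Orange"))  # Misty from the slide
```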
Category or Values?
There are several classes of algorithms depending on whether the prediction is a category (like cat or dog) or a value, like the value of a home.
Classification algorithms are generally well suited to predicting categories, while regression algorithms are well suited to predicting “continuous” values. Decision trees can be used for either.
Regression
• Understanding the relationship between a given set of dependent variables
and independent variables
• Typically regression is used to predict the output of a dependent variable
based on variations in independent variables
• Very popular for prediction and forecasting
Linear Regression
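For a single independent variable, linear regression reduces to fitting a slope and intercept by ordinary least squares. The sketch below shows that calculation on made-up numbers; the data and the `fit_line` helper are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up observations of an independent variable x and a dependent variable y.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)

def predict(x):
    """Predict the dependent variable for a new value of x."""
    return slope * x + intercept

print(slope, intercept, predict(5))
```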
Decision Trees
• A method for predicting outcomes based on the features of data
• Model is represented as an easy-to-understand tree structure of if-else statements
Weight > 10lbs?
  yes: dog
  no: color = orange?
    yes: cat
    no: name = fido?
      yes: dog
      no: cat
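Written as code, a tree like this is just nested if-else statements. The branch layout below is one reading of the flattened diagram (weight first, then color, then name), so treat it as an assumption; it does agree with all four rows of the earlier cat/dog training set.

```python
def classify(name, weight_lbs, color):
    """Walk the decision tree from the root to a leaf."""
    if weight_lbs > 10:
        return "dog"
    if color.lower() == "orange":
        return "cat"
    if name.lower() == "fido":
        return "dog"
    return "cat"

# Rows from the supervised-learning slide: (name, weight, color, label).
training_set = [
    ("Susie", 9, "Orange", "cat"),
    ("Fido", 25, "Brown", "dog"),
    ("Sparkles", 6, "Black", "cat"),
    ("Fido", 9, "Black", "dog"),
]
for name, weight, color, label in training_set:
    print(name, classify(name, weight, color), label)
```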
Unsupervised K-Means
Clustering of items into logical groups based on natural patterns in data
• Treats items as coordinates
• Places a number of random “centroids” and assigns the nearest items
• Moves the centroids around based on average location
• Process repeats until the assignments stop changing
Uses:
• Cluster Analysis
• Classification
• Content Filtering
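The four K-Means steps above map directly onto a short loop. This is a minimal standard-library sketch, assuming 2-D points, squared Euclidean distance, and initial centroids sampled from the data with a fixed seed; production work would normally use a library implementation such as Spark MLlib's.

```python
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Cluster points (tuples of coordinates) into k groups."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # place random centroids
    assignments = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        new_assignments = [
            min(range(k),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])))
            for point in points
        ]
        if new_assignments == assignments:     # assignments stopped changing
            break
        assignments = new_assignments
        # Move each centroid to the average location of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments

# Made-up points forming two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, assignments = kmeans(points, k=2)
print(centroids)
print(assignments)
```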
Collaborative Filtering
• A hybrid of Supervised and Unsupervised Learning (Model Based vs. Memory Based)
• Leveraging collaboration between multiple agents to filter, project, or detect patterns
• Popular in recommender systems for projecting the “taste” of specific individuals for items on which they have not yet expressed a preference.
Item-based
• A popular and simple memory-based collaborative filtering algorithm
• Projects preference based on item similarity (based on ratings):
for every item i that u has no preference for yet
    for every item j that u has a preference for
        compute a similarity s between i and j
        add u's preference for j, weighted by s, to a running average
return the top items, ranked by weighted average
• First, a matrix of item-to-item similarity is calculated based on user ratings
• Then recommendations are created by producing a weighted sum of top items, based on the user's previously rated items
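The pseudocode above can be made runnable with a small ratings dictionary. This sketch assumes cosine similarity computed over users who rated both items; the slide does not name a similarity measure, and the ratings data and function names are invented for illustration.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating on a 1-5 scale}.
ratings = {
    "ann": {"A": 5, "B": 1, "C": 5, "D": 1},
    "bob": {"A": 4, "B": 2, "C": 5, "D": 2},
    "cy":  {"A": 1, "B": 5, "C": 1, "D": 5},
    "eve": {"A": 5, "B": 1},
}

def similarity(i, j, ratings):
    """Cosine similarity between items i and j over users who rated both."""
    common = [(r[i], r[j]) for r in ratings.values() if i in r and j in r]
    if not common:
        return 0.0
    dot = sum(a * b for a, b in common)
    norm_i = sqrt(sum(a * a for a, _ in common))
    norm_j = sqrt(sum(b * b for _, b in common))
    return dot / (norm_i * norm_j)

def recommend(user, ratings, top_n=3):
    """Rank unrated items by the similarity-weighted average of the user's ratings."""
    prefs = ratings[user]
    all_items = {item for r in ratings.values() for item in r}
    scores = {}
    for i in all_items - set(prefs):           # items u has no preference for yet
        weighted = total_sim = 0.0
        for j, pref in prefs.items():          # items u has a preference for
            s = similarity(i, j, ratings)
            weighted += s * pref
            total_sim += s
        if total_sim > 0:
            scores[i] = weighted / total_sim   # weighted running average
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend("eve", ratings))
```

With these made-up ratings, item C is rated much like item A (which eve likes), so it ranks above D, which is rated like item B.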
Data Science Demonstration!
Follow along: http://bit.ly/2r9ABcK
File: SanFranCrime_model_DataSummit.ipynb
Wifi:
Hilton Meeting Room Wifi
infotoday2017
5. Evaluation
What problem are we trying to solve again?
• Our final solution needs to be evaluated against original
Business Understanding
• Did we meet our objectives?
• Did we address all issues?
6. Deployment
Engineering Time!
• It’s time for the work products of data science to “graduate” from “new insights” to real applications.
• Processes must be hardened, repeatable, and generally perform well too!
• Data Governance applied
• PMML (Predictive Model Markup Language): XML-based interchange format
Components of Data Governance
Organization
• This is the ‘people’ part. Establishing Enterprise Data Council, Data Stewards, etc.
Metadata
• Definitions, lineage (where does this data come from), business definitions, technical metadata
Privacy/Security
• Identify and control sensitive data, regulatory compliance
Data Quality and Monitoring
• Data must be complete and correct. Measure, improve, certify
Business Process Integration
• Policies around data frequency, source availability, etc.
Master Data Management
• Ensure consistent business critical data, e.g. Members, Providers, Agents, etc.
Information Lifecycle Management (ILM)
• Data retention, purge schedule, storage/archiving
For Data Science
• Add Big Data to overall framework and assign responsibility
• Add data scientists to the Stewardship program
• Assign stewards to new data sets (Twitter, call center logs, etc.)
• Graph databases are more flexible than relational
• Lower latency service required
• Distributed data quality and matching algorithms
• Data Quality and Monitoring (probably home grown, Drools?)
• Quality checks not only SQL: machine learning, Pig and MapReduce
• Acting on large dataset quality checks may require distribution
• Larger scale
• New datatypes
• Integrate with Hive Metastore, HCatalog, home grown tables
• Secure and mask multiple data types (not just tabular)
• Deletes are less common (unless there is a regulatory requirement)
• Take advantage of compression and archiving (like AWS Glacier)
• Data detection and masking on unstructured data upon ingest
• Near-zero latency, DevOps, Core component of business operations
Corporate Data Pyramid (CDP)
Layers, top to bottom, with usage pattern:
• Big Data Warehouse: Arbitrary/Ad-hoc Queries and Reporting
• Data Science Workspace: Munging, Blending, Machine Learning
• Data Lake – Integrated Sandbox: Organize, Define, Complete
• Landing Area – Source Data in “Full Fidelity”: Ingest Raw Data
Data Governance labels: Fully Governed (trusted); Data Quality and Monitoring; Data Catalog; Data Integration; Metadata, ILM, Security
Chief Data Organization (Oversight)
Vertical Business Area
[Sales/Finance/Marketing/Operations/Customer Svc]
Product Owner
SCRUM Master
Development Team
Business Subject Matter Expertise
Data Librarian/Data Stewardship
Data Science/ Statistical Skills
Data Engineering / Architecture
Presentation/ BI Report Development Skills
Data Quality Assurance
DevOps
IT Organization (Oversight)
Enterprise Data Architect
Solution Engineers
Data Integration Practice
User Experience Practice
QA Practice
Operations Practice
Advanced Analytics
Business Analysts
Data Analysts
Data Scientists
Statisticians
Data Engineers
Planning Organization
Project Managers
Data Organization
Data Gov Coordinator
Data Librarians
Data Stewards
Analytics-Driven Organization
Technologies & Techniques
• The Cloud and Spark can provide a relatively low-cost and extremely scalable platform for Data Science
• AWS S3 and Google GCS offer great scalability and speed to value without the overhead of structuring data
• Spark, with MLlib, offers a great library of established Machine Learning algorithms, reducing development efforts
• Python and SQL are choices for Data Science
• Go Agile and follow Best Practices (CRISP-DM)
• Employ Data Pyramid concepts to ensure data has just enough governance
Some Thoughts – Enable the Future
• Data Science requires the convergence of data quality, advanced math, data engineering, visualization and business smarts
• Make sure your data can be trusted and people can be held accountable for impact caused by low data quality.
• Good data scientists are rare: It will take a village to achieve all the tasks required for effective data science
• Get good!
• Be great!
• Blaze new trails!
Data Science Training: https://explore-data-science.thisismetis.com
• Big Data Warehousing Meetup
• New York City
• 4,300+ members
• Knowledge sharing
Thank You / Q&A
Joe Caserta Bill Walrond
@joe_Caserta @bill_walrond

More Related Content

What's hot

Data science
Data scienceData science
Data science
SwapnilDahake2
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
Tharushi Ruwandika
 
Data science presentation
Data science presentationData science presentation
Data science presentation
MSDEVMTL
 
Introduction To Data Science
Introduction To Data ScienceIntroduction To Data Science
Introduction To Data Science
Spotle.ai
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
ANOOP V S
 
Data Science Training | Data Science Tutorial | Data Science Certification | ...
Data Science Training | Data Science Tutorial | Data Science Certification | ...Data Science Training | Data Science Tutorial | Data Science Certification | ...
Data Science Training | Data Science Tutorial | Data Science Certification | ...
Edureka!
 
Data science applications and usecases
Data science applications and usecasesData science applications and usecases
Data science applications and usecases
Sreenatha Reddy K R
 
Data Science Introduction
Data Science IntroductionData Science Introduction
Data Science Introduction
Gang Tao
 
Data science & data scientist
Data science & data scientistData science & data scientist
Data science & data scientist
VijayMohan Vasu
 
What Is Data Science? | Introduction to Data Science | Data Science For Begin...
What Is Data Science? | Introduction to Data Science | Data Science For Begin...What Is Data Science? | Introduction to Data Science | Data Science For Begin...
What Is Data Science? | Introduction to Data Science | Data Science For Begin...
Simplilearn
 
What’s The Difference Between Structured, Semi-Structured And Unstructured Data?
What’s The Difference Between Structured, Semi-Structured And Unstructured Data?What’s The Difference Between Structured, Semi-Structured And Unstructured Data?
What’s The Difference Between Structured, Semi-Structured And Unstructured Data?
Bernard Marr
 
Introduction To Predictive Analytics Part I
Introduction To Predictive Analytics   Part IIntroduction To Predictive Analytics   Part I
Introduction To Predictive Analytics Part I
jayroy
 
Big Data
Big DataBig Data
Big Data
Seminar Links
 
Data Science
Data ScienceData Science
Data Science
Emma Thompson
 
Data science
Data scienceData science
Data science
Mohamed Loey
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
Laguna State Polytechnic University
 
Big data
Big dataBig data
Big data
Pooja Shah
 
Introduction on Data Science
Introduction on Data ScienceIntroduction on Data Science
Introduction on Data Science
Edureka!
 
The Evolution of Data Science
The Evolution of Data ScienceThe Evolution of Data Science
The Evolution of Data Science
Kenny Daniel
 
Data Visualization in Exploratory Data Analysis
Data Visualization in Exploratory Data AnalysisData Visualization in Exploratory Data Analysis
Data Visualization in Exploratory Data Analysis
Eva Durall
 

What's hot (20)

Data science
Data scienceData science
Data science
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
 
Data science presentation
Data science presentationData science presentation
Data science presentation
 
Introduction To Data Science
Introduction To Data ScienceIntroduction To Data Science
Introduction To Data Science
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Data Science Training | Data Science Tutorial | Data Science Certification | ...
Data Science Training | Data Science Tutorial | Data Science Certification | ...Data Science Training | Data Science Tutorial | Data Science Certification | ...
Data Science Training | Data Science Tutorial | Data Science Certification | ...
 
Data science applications and usecases
Data science applications and usecasesData science applications and usecases
Data science applications and usecases
 
Data Science Introduction
Data Science IntroductionData Science Introduction
Data Science Introduction
 
Data science & data scientist
Data science & data scientistData science & data scientist
Data science & data scientist
 
What Is Data Science? | Introduction to Data Science | Data Science For Begin...
What Is Data Science? | Introduction to Data Science | Data Science For Begin...What Is Data Science? | Introduction to Data Science | Data Science For Begin...
What Is Data Science? | Introduction to Data Science | Data Science For Begin...
 
What’s The Difference Between Structured, Semi-Structured And Unstructured Data?
What’s The Difference Between Structured, Semi-Structured And Unstructured Data?What’s The Difference Between Structured, Semi-Structured And Unstructured Data?
What’s The Difference Between Structured, Semi-Structured And Unstructured Data?
 
Introduction To Predictive Analytics Part I
Introduction To Predictive Analytics   Part IIntroduction To Predictive Analytics   Part I
Introduction To Predictive Analytics Part I
 
Big Data
Big DataBig Data
Big Data
 
Data Science
Data ScienceData Science
Data Science
 
Data science
Data scienceData science
Data science
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Big data
Big dataBig data
Big data
 
Introduction on Data Science
Introduction on Data ScienceIntroduction on Data Science
Introduction on Data Science
 
The Evolution of Data Science
The Evolution of Data ScienceThe Evolution of Data Science
The Evolution of Data Science
 
Data Visualization in Exploratory Data Analysis
Data Visualization in Exploratory Data AnalysisData Visualization in Exploratory Data Analysis
Data Visualization in Exploratory Data Analysis
 

Viewers also liked

Modern Data Science
Modern Data ScienceModern Data Science
Modern Data Science
Alejandro Correa Bahnsen, PhD
 
Introduction to Mahout and Machine Learning
Introduction to Mahout and Machine LearningIntroduction to Mahout and Machine Learning
Introduction to Mahout and Machine Learning
Varad Meru
 
Tutorial on Deep learning and Applications
Tutorial on Deep learning and ApplicationsTutorial on Deep learning and Applications
Tutorial on Deep learning and Applications
NhatHai Phan
 
Hands-on Deep Learning in Python
Hands-on Deep Learning in PythonHands-on Deep Learning in Python
Hands-on Deep Learning in Python
Imry Kissos
 
Deep Learning for Natural Language Processing
Deep Learning for Natural Language ProcessingDeep Learning for Natural Language Processing
Deep Learning for Natural Language Processing
Devashish Shanker
 
10 R Packages to Win Kaggle Competitions
10 R Packages to Win Kaggle Competitions10 R Packages to Win Kaggle Competitions
10 R Packages to Win Kaggle Competitions
DataRobot
 
Data By The People, For The People
Data By The People, For The PeopleData By The People, For The People
Data By The People, For The People
Daniel Tunkelang
 
Hadoop and Machine Learning
Hadoop and Machine LearningHadoop and Machine Learning
Hadoop and Machine Learning
joshwills
 
Tips for data science competitions
Tips for data science competitionsTips for data science competitions
Tips for data science competitions
Owen Zhang
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
Caserta
 
10 Lessons Learned from Building Machine Learning Systems
10 Lessons Learned from Building Machine Learning Systems10 Lessons Learned from Building Machine Learning Systems
10 Lessons Learned from Building Machine Learning Systems
Xavier Amatriain
 
An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...
Sebastian Raschka
 
Machine Learning and Data Mining: 12 Classification Rules
Machine Learning and Data Mining: 12 Classification RulesMachine Learning and Data Mining: 12 Classification Rules
Machine Learning and Data Mining: 12 Classification Rules
Pier Luca Lanzi
 
How to Interview a Data Scientist
How to Interview a Data ScientistHow to Interview a Data Scientist
How to Interview a Data Scientist
Daniel Tunkelang
 
Deep neural networks
Deep neural networksDeep neural networks
Deep neural networks
Si Haem
 
How to Become a Data Scientist
How to Become a Data ScientistHow to Become a Data Scientist
How to Become a Data Scientist
ryanorban
 
Artificial neural network
Artificial neural networkArtificial neural network
Artificial neural network
DEEPASHRI HK
 
A Statistician's View on Big Data and Data Science (Version 1)
A Statistician's View on Big Data and Data Science (Version 1)A Statistician's View on Big Data and Data Science (Version 1)
A Statistician's View on Big Data and Data Science (Version 1)
Prof. Dr. Diego Kuonen
 
Myths and Mathemagical Superpowers of Data Scientists
Myths and Mathemagical Superpowers of Data ScientistsMyths and Mathemagical Superpowers of Data Scientists
Myths and Mathemagical Superpowers of Data Scientists
David Pittman
 
A tutorial on deep learning at icml 2013
A tutorial on deep learning at icml 2013A tutorial on deep learning at icml 2013
A tutorial on deep learning at icml 2013
Philip Zheng
 

Viewers also liked (20)

Modern Data Science
Modern Data ScienceModern Data Science
Modern Data Science
 
Introduction to Mahout and Machine Learning
Introduction to Mahout and Machine LearningIntroduction to Mahout and Machine Learning
Introduction to Mahout and Machine Learning
 
Tutorial on Deep learning and Applications
Tutorial on Deep learning and ApplicationsTutorial on Deep learning and Applications
Tutorial on Deep learning and Applications
 
Hands-on Deep Learning in Python
Hands-on Deep Learning in PythonHands-on Deep Learning in Python
Hands-on Deep Learning in Python
 
Deep Learning for Natural Language Processing
Deep Learning for Natural Language ProcessingDeep Learning for Natural Language Processing
Deep Learning for Natural Language Processing
 
10 R Packages to Win Kaggle Competitions
10 R Packages to Win Kaggle Competitions10 R Packages to Win Kaggle Competitions
10 R Packages to Win Kaggle Competitions
 
Data By The People, For The People
Data By The People, For The PeopleData By The People, For The People
Data By The People, For The People
 
Hadoop and Machine Learning
Hadoop and Machine LearningHadoop and Machine Learning
Hadoop and Machine Learning
 
Tips for data science competitions
Tips for data science competitionsTips for data science competitions
Tips for data science competitions
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
10 Lessons Learned from Building Machine Learning Systems
10 Lessons Learned from Building Machine Learning Systems10 Lessons Learned from Building Machine Learning Systems
10 Lessons Learned from Building Machine Learning Systems
 
An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...
 
Machine Learning and Data Mining: 12 Classification Rules
Machine Learning and Data Mining: 12 Classification RulesMachine Learning and Data Mining: 12 Classification Rules
Machine Learning and Data Mining: 12 Classification Rules
 
How to Interview a Data Scientist
How to Interview a Data ScientistHow to Interview a Data Scientist
How to Interview a Data Scientist
 
Deep neural networks
Deep neural networksDeep neural networks
Deep neural networks
 
How to Become a Data Scientist
How to Become a Data ScientistHow to Become a Data Scientist
How to Become a Data Scientist
 
Artificial neural network
Artificial neural networkArtificial neural network
Artificial neural network
 
A Statistician's View on Big Data and Data Science (Version 1)
A Statistician's View on Big Data and Data Science (Version 1)A Statistician's View on Big Data and Data Science (Version 1)
A Statistician's View on Big Data and Data Science (Version 1)
 
Myths and Mathemagical Superpowers of Data Scientists
Myths and Mathemagical Superpowers of Data ScientistsMyths and Mathemagical Superpowers of Data Scientists
Myths and Mathemagical Superpowers of Data Scientists
 
A tutorial on deep learning at icml 2013
A tutorial on deep learning at icml 2013A tutorial on deep learning at icml 2013
A tutorial on deep learning at icml 2013
 


Introduction to Data Science (Data Summit, 2017)

  • 1. @CasertaConcepts#DataSummit Introduction to Data Science Joe Caserta Bill Walrond @joe_Caserta @bill_walrond @CasertaConcepts
  • 2. About Caserta Concepts • Consulting Data Innovation and Modern Data Engineering • Award-winning company • Internationally recognized work force • Strategy, Architecture, Implementation, Governance • Innovation Partner • Strategic Consulting • Advanced Architecture • Build & Deploy • Leader in Enterprise Data Solutions • Big Data Analytics • Data Warehousing • Business Intelligence • Data Science • Cloud Computing • Data Governance
  • 3. Caserta Client Portfolio Retail/eCommerce & Manufacturing Finance, Healthcare & Insurance Digital Media/AdTech Education & Services
  • 4. Awards & Recognition Top 10 Fastest Growing Big Data Companies 2016
  • 6. Agenda • Why we care about Big Data • Challenges of working with Big Data • Governing Big Data for Data Science • Introducing the Data Pyramid • Why Data Science is Cool? • What does a Data Scientist do? • Standards for Data Science • Business Objective • Data Discovery • Preparation • Models • Evaluation • Deployment • Q & A
  • 7. Big Data Analysis: Timeline of Society Media 1500s Printing Press 1840s Penny Post 1850s Telegraph 1850s Rural Free Post 1890s Telephone 1900s Radio 1950s TV 1970s PCs 1980s Internet 1990s Web 2000s Social Media, Mobile, Big Data, Cloud 98,000+ Tweets 695,000 Status Updates 11 Million instant messages 698,445 Google Searches 168 million+ emails sent 1,829 TB of data created 217 new mobile web users Every 60 Seconds
  • 8. Data is your Differentiator • 63% of organizations realize a positive return on analytic investments within a year • 69% of speed-driven analytics organizations created a positive impact on business outcomes • 74% of respondents anticipate that the speed at which executives expect new data-driven insights will continue to accelerate
  • 9. Understanding the Customer Journey Awareness Consideration Purchase Service Loyalty Expansion PR Radio TV Print Outdoor Word of Mouth Direct Mail Customer Service Physical Touchpoints Digital Touchpoints Search Paid Content email Website/ Landing Pages Social Media Community Chat Social Media Call Center Offers Mailings Survey Loyalty Programs email Agents Partners Ads Website Mobile 3rd Party Sites Offers Web self-service
  • 10. Understanding Touchpoint Methods (example interactions: Ad-Click, Mailing, E-mail) • Single Touch: Assign the credit to the first or last exposure (credit: 100%). Comments: - Last touch only - Ignores bulk of customer journey - Undervalues other interactions and influencers • Rules-Based: Assign the credit to each interaction based on business rules (credit: 33% / 33% / 33%). Comments: - Subjective - Assigns arbitrary values to each interaction - Lacks analytics rigor to determine weights • Statistically Driven: Assign the credit to interactions based on a data-driven model (credit: 27% / 49% / 24%). Comments: ✓ Looks at full behavior patterns ✓ Considers all touch points ✓ Can apply different models for best results ✓ Uses data to find correlations between touch points (winning combinations)
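
The three touchpoint methods on slide 10 can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' implementation: the function names, the sample journey, and the mapping of the slide's 27% / 49% / 24% figures to specific channels are all assumptions (the slide text does not say which channel received which share). In a real statistically driven approach the weights would be estimated from data rather than passed in by hand.

```python
from collections import defaultdict

def last_touch(journey):
    """Single touch: all credit goes to the final interaction."""
    return {journey[-1]: 1.0}

def equal_weight(journey):
    """Rules-based (one possible rule): split credit evenly across interactions."""
    credit = defaultdict(float)
    share = 1.0 / len(journey)
    for touch in journey:
        credit[touch] += share
    return dict(credit)

def weighted(journey, weights):
    """Statistically driven (stand-in): split credit in proportion to weights.

    The weights are supplied by the caller here; a real model would
    estimate them from observed conversions.
    """
    total = sum(weights[touch] for touch in journey)
    credit = defaultdict(float)
    for touch in journey:
        credit[touch] += weights[touch] / total
    return dict(credit)

# Hypothetical customer journey, ordered from first to last interaction.
journey = ["Ad-Click", "Mailing", "E-mail"]

# Assumed channel-to-weight mapping, for illustration only.
assumed_weights = {"Ad-Click": 0.27, "Mailing": 0.49, "E-mail": 0.24}

print(last_touch(journey))
print(equal_weight(journey))
print(weighted(journey, assumed_weights))
```

Under each method the credit shares sum to 1; the methods differ only in how that total is divided, which is the contrast the slide draws.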
  • 12. Big Data Analysis: The Ecosystem of the Future • Cloud-based Data Lake • Ingest, Persist, Analyze, Deploy • Data Integration, Identity Resolution, Data Quality • Discovery, Exploration, Machine Learning Models Development • Reports / Dashboards, Applications, APIs • Structured Data, Unstructured Data • SQL, NoSQL, Object Store • Find, Share, Collaborate • Data Engineer, Data Scientist, Business Analyst, App Developer • Business Value: Provides innovative and industry-leading technologies that can be applied rapidly to the business without having to manage compatibility and data complexity. • Technical Value: Provides an open framework to reduce the number of integration points and testing environments needed to deliver business solutions.
  • 13. Progression of Business Analytics to Data Science • Descriptive Analytics: What happened? • Diagnostic Analytics: Why did it happen? • Predictive Analytics: What will happen? • Prescriptive Analytics: How can we make it happen? • Cognitive Analytics: Influence what happens • Axes: Data Analytics Sophistication, Business Value • Hindsight, Insight, Foresight • Information, Optimization • Reports → Correlations → Predictions → Recommendations • Monetization, Interactions, Action
  • 14. Progression of Data Science Maturity • Timeline • Tools • Available libraries • Best practices
  • 15. What are the Realities of the Data Scientist? Much of the time of a Data Scientist is spent: • Searching for the data they need • Making sense of the data • Figuring out why the data looks the way it does and assessing its validity • Cleaning up all the garbage within the data so it represents true business • Combining events with reference data to give them context • Correlating event data with other events • Finally, implementing algorithms to perform mining, clustering and predictive analytics. NOT: • Writing really cool and sophisticated algorithms that impact the way the business runs.
  • 16. Why Data Science now? • Costs of compute and storage dramatically lower than just a few years ago • Data generated by all aspects of society has dramatically increased • Need to efficiently learn what there is to learn from our data
  • 17. @CasertaConcepts#DataSummit The Data Scientist Winning Trifecta: Modern Data Engineering/Data Preparation; Domain Knowledge/Business Expertise; Advanced Mathematics/Statistics. Skills: Computer Science, Programming/Storage, Data Quality, Visualization, Algorithms, A/B Testing, Data and Outcome Sensibility
  • 22. @CasertaConcepts#DataSummit Is Data Trying To Trick You? Correlation: 99.26%
  • 24. @CasertaConcepts#DataSummit Are there Standards? CRISP-DM: Cross Industry Standard Process for Data Mining 1. Business Understanding • Solve a single business problem 2. Data Understanding • Discovery • Data Munging • Cleansing Requirements 3. Data Preparation • ETL 4. Modeling • Evaluate various models • Iterative experimentation 5. Evaluation • Does the model achieve business objectives? 6. Deployment • PMML; application integration; data platform; Excel
  • 25. @CasertaConcepts#DataSummit 1. Business Understanding In this initial phase of the project we will need to speak to humans. • It would be premature to jump into the data, or begin selection of the appropriate model(s) or algorithm(s) • Understand the project objective • Review the business requirements • The output of this phase will be the conversion of business requirements into a preliminary technical design (decision model) and plan. Since this is an iterative process, this phase will be revisited throughout the entire process.
  • 26. @CasertaConcepts#DataSummit Gathering Requirements: Business Stakeholders, Business Analyst, Data Scientist. Artifacts: Interview notes, Requirement Document, Models / Insights
  • 27. @CasertaConcepts#DataSummit Data Science Scrum Team: Data Scientist, Data Engineer, Data Analyst, Business Stakeholders. Efficient, Inclusive, Effective, Interactive
  • 28. @CasertaConcepts#DataSummit 2. Data Understanding • Data Discovery: understand where the data you need comes from • Data Profiling: interrogate the data at the entity level, understand key entities and fields that are relevant to the analysis • Cleansing Requirements: understand data quality, data density, skew, etc. • Data Munging: collocate, blend and analyze data for early insights! Valuable insight can be gained from simple group-by and aggregate queries, and even more with SQL Jujitsu! Expect significant iteration between the Business Understanding and Data Understanding phases. Sample exploration tools for Hadoop: Trifacta, Paxata, Spark, Python, Waterline, Elasticsearch
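As a rough illustration of the "simple group-by, aggregate queries" mentioned on the Data Understanding slide, here is a minimal sketch using Python's built-in `sqlite3` module. The `incidents` table, its columns, and the sample rows are hypothetical and are not taken from the presenters' notebook.

```python
import sqlite3

# Hypothetical incident records: (category, district, resolved flag).
ROWS = [
    ("THEFT", "MISSION", 0),
    ("THEFT", "MISSION", 1),
    ("THEFT", "RICHMOND", 0),
    ("ASSAULT", "MISSION", 1),
    ("ASSAULT", "RICHMOND", 1),
]


def profile_by_category(rows):
    """Return (category, row count, resolved rate) per category, largest first."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE incidents (category TEXT, district TEXT, resolved INTEGER)"
    )
    conn.executemany("INSERT INTO incidents VALUES (?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT category, COUNT(*) AS n, AVG(resolved) AS resolved_rate
        FROM incidents
        GROUP BY category
        ORDER BY n DESC, category
        """
    ).fetchall()
    conn.close()
    return result


summary = profile_by_category(ROWS)
for category, n, rate in summary:
    print(f"{category:8s} rows={n} resolved_rate={rate:.2f}")
```

Even a query this small surfaces skew (which categories dominate) and density (how often a field is populated), which is the kind of early insight the slide describes.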
  • 29. @CasertaConcepts#DataSummit Data Science Data Quality Priorities Be Corrective Be Fast Be Transparent Be Thorough
  • 30. @CasertaConcepts#DataSummit Data Science Data Quality Priorities. Axes: Data Quality (Raw to Refined), Speed to Value (Fast to Slow). The Data Scientist’s Tightrope: Does data munging in a data science lab need the same restrictive governance as enterprise reporting?
  • 31. @CasertaConcepts#DataSummit 3. Data Preparation ETL (Extract, Transform, Load) 90+% of a Data Scientist's time goes into Data Preparation! • Locating and acquiring valuable data sources • Select required entities/fields • Address Data Quality issues: missing or incomplete values, whitespace, bad data points • Join/Enrich disparate datasets • Derive behavioral features • Transform/Aggregate data for intended use: • Sample • Aggregate • Pivot
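The preparation steps listed on the Data Preparation slide (trim whitespace, drop bad data points, join/enrich, aggregate/pivot) can be sketched in plain Python. This is only an illustration under assumed data: the event fields, the `CUSTOMERS` lookup, and the choice to drop unusable rows rather than impute them are all hypothetical.

```python
from collections import defaultdict

# Hypothetical raw events and a reference table used for enrichment.
RAW_EVENTS = [
    {"customer_id": " c1 ", "amount": "10.5", "channel": "web"},
    {"customer_id": "c2", "amount": "", "channel": "store"},     # missing value
    {"customer_id": "c1", "amount": "4.5", "channel": " store "},
    {"customer_id": "c3", "amount": "2.0", "channel": "web"},    # no reference row
    {"customer_id": "c2", "amount": "n/a", "channel": "web"},    # bad data point
]
CUSTOMERS = {"c1": "NY", "c2": "NJ"}  # customer_id -> state


def clean(event):
    """Trim whitespace and parse the amount; return None for unusable rows."""
    try:
        amount = float(event["amount"])
    except ValueError:
        return None
    return {
        "customer_id": event["customer_id"].strip(),
        "amount": amount,
        "channel": event["channel"].strip(),
    }


def prepare(events, customers):
    """Clean, enrich with the customer's state, and pivot spend by state and channel."""
    pivot = defaultdict(lambda: defaultdict(float))
    for event in events:
        row = clean(event)
        if row is None:
            continue  # assumption: unusable rows are dropped, not imputed
        state = customers.get(row["customer_id"], "UNKNOWN")
        pivot[state][row["channel"]] += row["amount"]
    return {state: dict(channels) for state, channels in pivot.items()}


features = prepare(RAW_EVENTS, CUSTOMERS)
print(features)
```

In practice the same shape of logic would run in Spark or SQL over far larger data; the point here is only the order of operations.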
  • 32. @CasertaConcepts#DataSummit We Love Spark • Development local or distributed is identical • Beautiful high-level APIs • Full universe of Python modules • Open source and free • Blazing fast! Spark has become our default processing engine for data engineering & science
  • 33. @CasertaConcepts#DataSummit Data Preparation • We love Spark! • ETL can be done in Scala, Python or SQL • Cleansing, transformation, and standardization • Address Parsing: usaddress, postal-address, etc. • Name Hashing: fuzzy, etc. • Genderization: sexmachine, etc. • And all the goodies of the standard Python library! • Parallelize the workload across a large number of machines in a Hadoop cluster
  • 34. @CasertaConcepts#DataSummit Data Quality and Monitoring • BUILD a robust data quality subsystem: • Metadata and error event facts • Orchestration • Based on Data Warehouse ETL Toolkit • Each error instance of each data quality check is captured • Implemented as sub-system after ingestion • Each fact stores unique identifier of the defective source row HAMBot: ‘open source’ project created in Caserta Innovation Lab (CIL)
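To make the idea of "error event facts" on the Data Quality and Monitoring slide concrete, here is a minimal sketch in which every failed check on every defective row produces one record that carries the source row's identifier. It is not HAMBot and not the exact schema from the Data Warehouse ETL Toolkit; the `ErrorEvent` fields and the two checks are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class ErrorEvent:
    """One fact row per failed check per defective source row."""
    check_name: str
    source_row_id: str
    detail: str
    detected_at: datetime


# Hypothetical checks: name -> predicate returning True when the row passes.
CHECKS = {
    "amount_not_negative": lambda row: row["amount"] >= 0,
    "email_has_at_sign": lambda row: "@" in row["email"],
}


def run_checks(rows, checks):
    """Apply every check to every row and record each failure as an ErrorEvent."""
    events = []
    for row in rows:
        for name, passes in checks.items():
            if not passes(row):
                events.append(
                    ErrorEvent(name, row["id"], repr(row), datetime.now(timezone.utc))
                )
    return events


ROWS = [
    {"id": "r1", "amount": 12.0, "email": "a@example.com"},
    {"id": "r2", "amount": -3.0, "email": "b.example.com"},
]
error_events = run_checks(ROWS, CHECKS)
for event in error_events:
    print(event.check_name, event.source_row_id)
```

Keeping failures as rows rather than log lines means data quality itself can be queried and trended like any other fact table.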
  • 35. @CasertaConcepts#DataSummit Data Preparation Demonstration! Wifi: Hilton Meeting Room Wifi infotoday2017 Follow along: http://bit.ly/2r9ABcK File: SanFranCrime.ipynb
  • 36. @CasertaConcepts#DataSummit 4. Modeling Do you love algebra & stats? • Evaluate various models/algorithms • Classification • Clustering • Regression • Many others… • Tune parameters • Iterative experimentation • Different models may require different data preparation techniques (e.g., Sparse Vector Format) • Additionally, we may discover the need for additional data points, or uncover additional data quality issues!
  • 37. @CasertaConcepts#DataSummit Modeling in Hadoop • Spark works well • SAS, SPSS, etc. are not native on Hadoop • R and Python are becoming the new standard • PMML can be used, but approach with caution
  • 38. @CasertaConcepts#DataSummit Machine Learning The goal of machine learning is to get software to make decisions and learn from data without being explicitly programmed to do so. Machine Learning algorithms are broadly broken out into two groups: • Supervised learning: inferring functions based on labeled training data • Unsupervised learning: finding hidden structure/patterns within data; no training data is supplied. We will review some popular, easy-to-understand machine learning algorithms
  • 40. @CasertaConcepts#DataSummit Supervised Learning. Training set (Name, Weight, Color, Cat_or_Dog): Susie, 9lbs, Orange, Cat; Fido, 25lbs, Brown, Dog; Sparkles, 6lbs, Black, Cat; Fido, 9lbs, Black, Dog. To predict (Name, Weight, Color, Cat_or_Dog): Misty, 5lbs, Orange, ?. The training set is used to generate a function, so we can predict whether we have a cat or a dog!
  • 41. @CasertaConcepts#DataSummit Category or Values? There are several classes of algorithms depending on whether the prediction is a category (like cat or dog) or a value, like the value of a home. Classification algorithms are generally well fit for categorization, while algorithms like Regression and Decision Trees are well suited for predicting “continuous” values.
  • 42. @CasertaConcepts#DataSummit Regression • Understanding the relationship between a given set of dependent variables and independent variables • Typically regression is used to predict the output of a dependent variable based on variations in independent variables • Very popular for prediction and forecasting Linear Regression
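For the Linear Regression slide, a minimal sketch of fitting a single-variable line by ordinary least squares in plain Python may help. The home sizes and prices below are made-up numbers chosen only to illustrate predicting a continuous value; a real project would more likely use a library such as Spark MLlib.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one independent variable)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data: home size in square feet vs. price in $1000s.
SIZES = [1000, 1500, 2000, 2500]
PRICES = [200, 290, 410, 500]

slope, intercept = fit_line(SIZES, PRICES)


def predict(size):
    """Predict the dependent variable (price) from the independent variable (size)."""
    return slope * size + intercept


print(f"slope={slope:.3f} intercept={intercept:.1f} predicted={predict(1800):.1f}")
```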
  • 43. @CasertaConcepts#DataSummit Decision Trees • A method for predicting outcomes based on the features of data • The model is represented as an easy-to-understand tree structure of if-else statements. Example tree: Weight > 10lbs? yes: dog; no: color = orange? yes: cat; no: name = fido? yes: dog; no: cat
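The Decision Trees slide describes the model as a tree of if-else statements, so it translates almost directly into code. The sketch below encodes one reading of the flattened diagram (weight first, then color, then name); that branch order is an assumption, checked only against the four training rows from the Supervised Learning slide.

```python
def classify(name, weight_lbs, color):
    """Hand-written tree; the branch order is an assumed reading of the slide's diagram."""
    if weight_lbs > 10:
        return "Dog"
    if color.lower() == "orange":
        return "Cat"
    if name.lower() == "fido":
        return "Dog"
    return "Cat"


# Training rows from the Supervised Learning slide: (name, weight, color, label).
TRAINING = [
    ("Susie", 9, "Orange", "Cat"),
    ("Fido", 25, "Brown", "Dog"),
    ("Sparkles", 6, "Black", "Cat"),
    ("Fido", 9, "Black", "Dog"),
]

for name, weight, color, label in TRAINING:
    print(name, weight, color, "->", classify(name, weight, color), "expected", label)

misty = classify("Misty", 5, "Orange")
print("Misty ->", misty)
```

Note that splitting on the name is a sign of overfitting to a tiny training set; a learned tree on real data would normally avoid identifier-like features.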
  • 44. @CasertaConcepts#DataSummit Unsupervised K-Means • Treats items as coordinates • Places a number of random “centroids” and assigns the nearest items • Moves the centroids around based on average location • Process repeats until the assignments stop changing Clustering of items into logical groups based on natural patterns in data Uses: • Cluster Analysis • Classification • Content Filtering
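The four K-Means steps on the slide (treat items as coordinates, place centroids, move them to the average location, repeat until assignments stop changing) can be sketched in plain Python as follows. The sample points, the seed, and the choice to pick initial centroids from the data are assumptions made for illustration.

```python
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Plain k-means: returns (centroids, assignments) for a list of coordinate tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: start from k randomly chosen points
    assignments = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid (squared Euclidean distance).
        new_assignments = [
            min(
                range(k),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])),
            )
            for point in points
        ]
        if new_assignments == assignments:
            break  # assignments stopped changing
        assignments = new_assignments
        # Move each centroid to the average location of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments


# Hypothetical 2-D items forming two visibly separate groups.
POINTS = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
centroids, assignments = kmeans(POINTS, k=2)
print(centroids, assignments)
```

Cluster numbering depends on the random start, so downstream code should compare group membership rather than specific label values.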
  • 45. @CasertaConcepts#DataSummit Collaborative Filtering • A hybrid of Supervised and Unsupervised Learning (Model Based vs. Memory Based) • Leveraging collaboration between multiple agents to filter, project, or detect patterns • Popular in recommender systems for projecting the “taste” of specific individuals for items on which they have not yet expressed a preference.
  • 46. @CasertaConcepts#DataSummit Item-based • A popular and simple memory-based collaborative filtering algorithm • Projects preference based on item similarity (based on ratings):
      for every item i that u has no preference for yet
        for every item j that u has a preference for
          compute a similarity s between i and j
          add u's preference for j, weighted by s, to a running average
      return the top items, ranked by weighted average
    • First, a matrix of item-to-item similarity is calculated based on user ratings • Then recommendations are created by producing a weighted sum of top items, based on the user's previously rated items
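The item-based pseudocode on this slide can be turned into a small runnable sketch. The slide does not say which similarity measure to use, so cosine similarity over users who rated both items is an assumption here, and the `RATINGS` data is hypothetical.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating on a 1-5 scale}.
RATINGS = {
    "ann": {"A": 5, "B": 4, "C": 1},
    "bob": {"A": 4, "B": 5, "D": 2},
    "cat": {"A": 1, "C": 5, "D": 4},
    "dan": {"A": 5, "C": 1},
}


def item_similarity(ratings, i, j):
    """Cosine similarity between items i and j over users who rated both (assumed metric)."""
    common = [u for u, r in ratings.items() if i in r and j in r]
    if not common:
        return 0.0
    dot = sum(ratings[u][i] * ratings[u][j] for u in common)
    norm_i = sqrt(sum(ratings[u][i] ** 2 for u in common))
    norm_j = sqrt(sum(ratings[u][j] ** 2 for u in common))
    return dot / (norm_i * norm_j)


def recommend(ratings, user, top_n=2):
    """Score each unrated item by the similarity-weighted average of the user's ratings."""
    rated = ratings[user]
    all_items = {item for r in ratings.values() for item in r}
    scores = {}
    for i in all_items - set(rated):          # every item i the user has no preference for
        weighted_sum = 0.0
        weight_total = 0.0
        for j, preference in rated.items():   # every item j the user has a preference for
            s = item_similarity(ratings, i, j)
            weighted_sum += s * preference
            weight_total += s
        if weight_total > 0:
            scores[i] = weighted_sum / weight_total
    # Top items ranked by weighted average; ties broken by item name.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


recommendations = recommend(RATINGS, "dan")
print(recommendations)
```

With so few co-raters, cosine similarity is easily inflated (a single shared user yields a similarity of 1), which is one reason production recommenders precompute the item-to-item matrix over much larger rating sets.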
  • 47. @CasertaConcepts#DataSummit Data Science Demonstration! Follow along: http://bit.ly/2r9ABcK File: SanFranCrime_model_DataSummit.ipynb Wifi: Hilton Meeting Room Wifi infotoday2017
  • 48. @CasertaConcepts#DataSummit 5. Evaluation What problem are we trying to solve again? • Our final solution needs to be evaluated against the original Business Understanding • Did we meet our objectives? • Did we address all issues?
  • 49. @CasertaConcepts#DataSummit 6. Deployment Engineering Time! • It’s time for the work products of data science to “graduate” from “new insights” to real applications. • Processes must be hardened, repeatable, and generally perform well too! • Data Governance applied • PMML (Predictive Model Markup Language): XML-based interchange format
  • 50. @CasertaConcepts#DataSummit Components of Data Governance • Organization: This is the ‘people’ part. Establishing Enterprise Data Council, Data Stewards, etc. • Metadata: Definitions, lineage (where does this data come from), business definitions, technical metadata • Privacy/Security: Identify and control sensitive data, regulatory compliance • Data Quality and Monitoring: Data must be complete and correct. Measure, improve, certify • Business Process Integration: Policies around data frequency, source availability, etc. • Master Data Management: Ensure consistent business-critical data, e.g. Members, Providers, Agents, etc. • Information Lifecycle Management (ILM): Data retention, purge schedule, storage/archiving. For Data Science • Add Big Data to overall framework and assign responsibility • Add data scientists to the Stewardship program • Assign stewards to new data sets (Twitter, call center logs, etc.) • Graph databases are more flexible than relational • Lower latency service required • Distributed data quality and matching algorithms • Data Quality and Monitoring (probably home grown, Drools?) • Quality checks not only SQL: machine learning, Pig and MapReduce • Acting on large dataset quality checks may require distribution • Larger scale • New datatypes • Integrate with Hive Metastore, HCatalog, home grown tables • Secure and mask multiple data types (not just tabular) • Deletes are less common (unless there is a regulatory requirement) • Take advantage of compression and archiving (like AWS Glacier) • Data detection and masking on unstructured data upon ingest • Near-zero latency, DevOps, core component of business operations
  • 51. @CasertaConcepts#DataSummit Corporate Data Pyramid (CDP). Tiers (top to bottom): Big Data Warehouse; Data Science Workspace; Data Lake – Integrated Sandbox; Landing Area – Source Data in “Full Fidelity”. Usage Pattern: Arbitrary/Ad-hoc Queries and Reporting; Munging, Blending, Machine Learning; Organize, Define, Complete; Ingest Raw Data. Data Governance: Fully Governed (trusted); Data Quality and Monitoring; Data Catalog; Data Integration; Metadata, ILM, Security.
  • 52. @CasertaConcepts#DataSummit Chief Data Organization (Oversight) Vertical Business Area [Sales/Finance/Marketing/Operations/Customer Svc] Product Owner SCRUM Master Development Team Business Subject Matter Expertise Data Librarian/Data Stewardship Data Science/ Statistical Skills Data Engineering / Architecture Presentation/ BI Report Development Skills Data Quality Assurance DevOps IT Organization (Oversight) Enterprise Data Architect Solution Engineers Data Integration Practice User Experience Practice QA Practice Operations Practice Advanced Analytics Business Analysts Data Analysts Data Scientists Statisticians Data Engineers Planning Organization Project Managers Data Organization Data Gov Coordinator Data Librarians Data Stewards Analytics-Driven Organization
  • 53. @CasertaConcepts#DataSummit Technologies & Techniques • The Cloud and Spark can provide a relatively low-cost and extremely scalable platform for Data Science • AWS S3 and Google GCS offer great scalability and speed to value without the overhead of structuring data • Spark with MLlib offers a great library of established Machine Learning algorithms, reducing development effort • Python and SQL are good choices for Data Science • Go Agile and follow Best Practices (CRISP-DM) • Employ Data Pyramid concepts to ensure data has just enough governance
  • 54. @CasertaConcepts#DataSummit Some Thoughts – Enable the Future • Data Science requires the convergence of data quality, advanced math, data engineering, visualization, and business smarts • Make sure your data can be trusted and people can be held accountable for the impact caused by low data quality • Good data scientists are rare: it will take a village to achieve all the tasks required for effective data science • Get good! • Be great! • Blaze new trails! Data Science Training: https://explore-data-science.thisismetis.com • Big Data Warehousing Meetup • New York City • 4,300+ members • Knowledge sharing
  • 55. @CasertaConcepts#DataSummit Thank You / Q&A Joe Caserta Bill Walrond @joe_Caserta @bill_walrond