Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Free Data Pipelines, Jon Gray, CEO, Cask Data

Building Data Pipelines with Cask Hydrator
Jon Gray
CEO, Cask
July 9th, 2016

PROPRIETARY & CONFIDENTIAL
Web Analytics and Reporting Use Case
✦Hadoop ETL pipeline stitched together using hard-to-maintain, brittle scripts 
✦Not enough personnel with expertise in all the Hadoop components (HDFS,
MapReduce, Spark, YARN, HBase, Kafka) or lack of expertise 
✦Hard to debug and validate, resulting in frequent failures in production environment 
 
Transform web log data from S3 every hour to Hadoop cluster for backup, as well as,
perform analytics and enable realtime reporting of metrics such as number of
successful/failure responses, most popular webpage etc.
The Challenge —

Demo Example
Load Log Files from S3 to
HDFS and perform
aggregations/analysis
•Start with web access logs stored in
Amazon S3
•Store the raw logs into HDFS Avro Files
•Parse the access log lines into individual
ﬁelds
•Calculate the total number of requests by
IP and status code
•Find out IPs which received maximum
successful status code and error codes
69.181.160.120 - - [08/Feb/2015:04:36:40 +0000] "GET /ajax/planStatusHistory HTTP/1.1" 200 508 "http://builds.cask.co/log" "Mozilla/5.0
(Macintosh; Intel Mac OS X 10_10_1) AppleWebKit Chrome/38.0.2125.122 Safari/537.36"
Fields: IP Address, Timestamp, Http Method, URI, Http Status, Response Size, URI, Client Info
Sample Web access log (Combined Log Format):

INGEST
any data from any source
in real-time and batch
BUILD
drag-and-drop ETL/ELT
pipelines that run on Hadoop
EGRESS
any data to any destination
in real-time and batch
Data Pipeline
provides the ability to automate complex workﬂows that involves fetching
data, possibly from multiple data sources, combining, performing non-trivial
transformations on the data, writing it to one more data sinks and deriving/

Stack of Data Enablers

Hydrator Studio
✦Drag-and-drop GUI for visual Data
Pipeline creation 
✦Rich library of pre-built sources,
transforms, sinks for data ingestion
and ETL use cases 
✦Separation of pipeline creation from
execution framework - MapReduce,
Spark, Spark Streaming etc. 
✦Hadoop-native and Hadoop Distro
agnostic

Hydrator Data Pipeline
✦ Captures Metadata, Audit,
Lineage info and visualized using
Cask Tracker 
✦ Notiﬁcation, centralized metrics
and log collection for ease of
operability 
✦ Simple Java API to build your
own source, transforms, sinks
with complete class loading
isolation 
✦ SparkML based plugins, Python
transforms for data scientists

✦ ElasticSearch, SFTP, Cassandra, Kafka, JMS and many more sources and
sinks 
Out of the box Integrations

✦ Implement your own batch (or realtime) source, transform, sink plugins using simple
Java API
Custom Plugins

Pipeline Implementation
Logical
Physical
MR/Spark Executions
Planner
CDAP
✦ Planner converts logical pipeline to a physical
execution plan 
✦ Optimizes and bundles functions into one or
more MR/Spark jobs 
✦ CDAP is the runtime environment where all the
components of the data pipeline are executed 
✦ CDAP provides centralized log and metrics
collection, transaction, lineage and audit
information

Pipeline Implementation

CASK DATA APPLICATION PLATFORM
Integrated Framework for Building and
Running Data Applications on Hadoop
Integrates the Latest
Big Data Technologies
Supports All Major
Hadoop Distributions
Fully Open Source
and Highly Extensible

Hadoop ecosystem, 50 different projects
Top 6 Hadoop distributions

Abstraction and Integration Layer
Data Lake
Fraud
Detection
Recommendation
Engine
Sensor Data
Analytics
Customer
360
Hydrator Tracker

Data Lake
Fraud
Detection
Recommendation
Engine
Sensor Data
Analytics
Customer
360
Hydrator Tracker
CASK DATA APP PLATFORM

Self-Service Data Ingestion
and ETL for Data Lakes
Built for Production
on CDAP
Rich Drag-and-Drop
User Interface
Open Source &
Highly Extensible

✦ Join across multiple data sources (CDAP-5588) 
✦ Macro substitutions 
✦ Pre-Actions in pipelines similar to post run
notiﬁcations 
✦ Spark streaming support for Realtime pipelines
Hydrator Roadmap

Thank You!
cdap-user@googlegroups.com
Twitter @CaskData
Questions?

Data Lake
Enterprise-wide data management platforms
for analyzing disparate sources of data in its
native format - Gartner
Data
Lake
1
0
1
0
0
01
1
0
1
Hydrating your Data Lake
Hydrator
Self-service, hadoop-native, drag-and-
drop open source framework to
develop, run and operate data

Manual processes requiring
hand-coding and reliance on 
command-line tools
Hard to find data and 
it’s lineage for data 
discovery and exploration
Coupling of ingestion and
processing drives 
architecture decisions
Operationalizing processes 
for production and to 
maintain SLAs
Ensuring data is in canonical
forms with a shared schema
usable by others
Coding or filing tickets often
required to perform new 
ingestion and processing tasks
Multiple architectures and
technologies used by different
teams on different clusters
Guaranteeing compliance in a
system that is designed for
schema-on-read and raw data
Sharing infrastructure in a 
multi-tenant environment 
without low-level QoS support
Data
Reservoir
1
0
1
0
0
0
1
Data
Pond
1
0
1
0
1 0
Data
Lake
1
0
1
0
1
0
Data Lake Challenges

Hydrator framework with
templates and plugins enables
production workflows in minutes
Never lose data by ensuring all
ingested data is tracked with 
metadata and lineage
Separation of ingestion 
and processing to support 
any type, format and rate
Operationalize workflows using 
scheduling and SLA monitoring 
with time / partition awareness
Using common transformations
and a shared system for 
defining and exposing schema
Reference architecture ensures
a common platform across
teams, orgs, ops and security
Multi-tenant namespacing
provides data and app isolation,
tying together infrastructure
Ensure compliance by 
requiring the use of specific
transformations and validation
Self-service access through
Cask Hydrator for the discovery,
ingest and exploration of data
Data
Reservoir
1
0
1
0
0
0
1
Data
Pond
1
0
1
0
1 0
Data
Lake
1
0
1
0
1
0
Data Lakes on CDAP

Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Free Data Pipelines, Jon Gray, CEO, Cask Data

Related slideshows

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Free Data Pipelines, Jon Gray, CEO, Cask Data

Similar to Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Free Data Pipelines, Jon Gray, CEO, Cask Data (20)

More from Data Con LA

More from Data Con LA (20)

Recently uploaded

Recently uploaded (20)

Big Data Day LA 2016/ Use Case Driven track - Hydrator: Open Source, Code-Free Data Pipelines, Jon Gray, CEO, Cask Data