This document discusses improving the reliability and availability of Hadoop clusters. It notes that while Hadoop is taking on more database-like features, the uptime of many Hadoop clusters is still an afterthought and SLAs are often lacking. It proposes separating compute and storage to improve availability, as cloud Hadoop offerings do. It also suggests building KPIs and monitoring around Hadoop clusters, similar to how many companies monitor data warehouses. Centralizing Hadoop infrastructure management into a "Big Data as a Service" model is presented as another way to improve reliability.
The Hadoop Distributed File System is the foundational storage layer in typical Hadoop deployments. Performance and stability of HDFS are crucial to the correct functioning of applications at higher layers in the Hadoop stack. This session is a technical deep dive into recent enhancements committed to HDFS by the entire Apache contributor community. We describe real-world incidents that motivated these changes and how the enhancements prevent those problems from reoccurring. Attendees will leave this session with a deeper understanding of the implementation challenges in a distributed file system and identify helpful new metrics to monitor in their own clusters.
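As one hedged example of monitoring such metrics, the sketch below polls the NameNode's JMX servlet over HTTP; the host, port (9870 is the default HTTP port in Hadoop 3.x), and the particular bean attributes are assumptions that vary by Hadoop version and deployment:

```python
import json
import urllib.request

# The NameNode exposes metrics as JSON via its JMX servlet.
# Host/port and the bean/attribute names below are assumptions;
# check your own cluster's version and configuration.
NAMENODE_JMX = "http://namenode.example.com:9870/jmx"

def fetch_bean(query):
    """Fetch a single JMX bean as a dict."""
    with urllib.request.urlopen(f"{NAMENODE_JMX}?qry={query}") as resp:
        return json.load(resp)["beans"][0]

fs = fetch_bean("Hadoop:service=NameNode,name=FSNamesystem")
print("Under-replicated blocks:", fs.get("UnderReplicatedBlocks"))
print("Total load (xceivers):  ", fs.get("TotalLoad"))
```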
This document discusses hybrid analytics and the movement of data between on-premise and cloud environments. It notes that IT infrastructures now require both traditional and cloud-native approaches. Moving forward, hybrid analytics using both on-premise and cloud resources will become more common. Specifically, automatically moving data between on-premise storage and the cloud based on policies will help break down data silos and enable greater analysis. The document also explores using Dell EMC's OneFS software to enable this type of policy-based data movement to and from the cloud.
Dancing elephants - efficiently working with object stores from Apache Spark ... (DataWorks Summit)
As Hadoop applications move into cloud deployments, object stores become more and more the source and destination of data. But object stores are not filesystems: sometimes they are slower, security is different, and they break assumptions that filesystem code relies on.
What are the secret settings to get maximum performance from queries against data living in cloud object stores? They lie at the filesystem client, the file format, and the query engine layers; it's even about how you lay out the files: the directory structure and the names you give them.
We know these things from our work in all these layers, from the benchmarking we've done, and from the support calls we get when people have problems. And now we'll show you.
This talk will start from the ground up with the "why isn't an object store a filesystem?" issue, showing how that breaks fundamental assumptions in code, and so causes performance issues which you don't get when working with HDFS. We'll look at ways to get Apache Hive and Spark to work better, looking at optimizations which have been done to enable this, and at what work is ongoing. Finally, we'll consider what your own code needs to do in order to adapt to cloud execution.
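To make the "not a filesystem" point concrete, here is a minimal sketch assuming boto3 and S3: object stores have no rename primitive, so a "rename" is a copy plus a delete per object, which is slow and non-atomic (bucket and key names are placeholders):

```python
import boto3

# S3 has no rename: a "rename" is a server-side copy followed by a
# delete. A filesystem client renaming a directory must do this once
# per object: O(files * bytes), and never atomic.
s3 = boto3.client("s3")

def fake_rename(bucket, src_key, dst_key):
    s3.copy_object(Bucket=bucket,
                   CopySource={"Bucket": bucket, "Key": src_key},
                   Key=dst_key)
    s3.delete_object(Bucket=bucket, Key=src_key)

fake_rename("my-bucket", "tmp/part-0000", "output/part-0000")
```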
Practice of large Hadoop cluster in China Mobile (DataWorks Summit)
China Mobile Limited is the leading telecommunications services provider in China, with more than 800 million active users. In China Mobile, distributed big data clusters are built by branch companies in each province for their unique requirements. Meanwhile, we have built a centralized Hadoop cluster with scale more than 1600 nodes, on which we collect data from dozens of distributed clusters and make analysis for our business.
In this session, we will introduce the architecture of the centralized Hadoop cluster and experience of constructing and tuning this large scale Hadoop cluster. Key points are as follows:
1. About Ambari: we improve Ambari with features like supporting HDFS Federation and Ambari HA, improving its performance and enabling it to support up to 1600 nodes.
2. About HDFS: we build a large HDFS cluster with data up to 60PB, using federation, ViewFS, FairCallQueue. Our best practice of cluster operation and management will also be included.
3. About Flume: we use a modified Flume to collect as much as 200TB of data per day.
Speakers
Yuxuan Pan, Software Engineer, China Mobile Software Technology
Duan Yunfeng, Chief Designer of China Mobile's big data system, China Mobile Communications Corporation
Supporting Apache HBase: Troubleshooting and Supportability Improvements (DataWorks Summit)
HBase has been in production in hundreds of clusters across the CDH/HDP customer base, and Cloudera/Hortonworks have supported it for many years.
In this talk, based on our support experience, we aim to introduce useful information to troubleshoot HBase clusters efficiently. First off, we (Daisuke at Cloudera support) are going to talk about typical log messages and web UI info which we can use for troubleshooting (especially when struggling with performance issues). Since their meanings have been changing across versions, we would like to show the differences and improvements as well (e.g. HBASE-20232 for memstore flush, HBASE-16972 for slow scanner, HBASE-18469 for request counter, and also HBASE-21207 for sorting in web UI). We (Toshihiro at Cloudera, a former Hortonworks employee) will also cover some new tools (e.g. HBASE-21926 Profiler Servlet, HBASE-11062 htop, etc.), which should also be useful for performance troubleshooting.
While you might be tempted to assume that data is already safe in a single Hadoop cluster, in practice you have to plan for more. Questions like "What happens if the entire datacenter fails?" or "How do I recover into a consistent state of data, so that applications can continue to run?" are not at all trivial to answer for Hadoop. Did you know that HDFS snapshots do not treat open files as immutable? Or that HBase snapshots are executed asynchronously across servers and therefore cannot guarantee atomicity for cross-region updates (which includes tables)? There is no unified and coherent data backup strategy, nor is there tooling available for many of the included components to build such a strategy. The Hadoop distributions largely avoid this topic, as most customers are still in the "single use-case" or PoC phase, where data governance as far as backup and disaster recovery (BDR) is concerned is not (yet) important. This talk first introduces the overarching issues and difficulties of backup and data safety, looking at each of the many components in Hadoop, including HDFS, HBase, YARN, Oozie, the management components and so on, and finally shows you a viable approach using built-in tools. You will also learn not to take this topic lightly and what is needed to implement and guarantee continuous operation of Hadoop cluster based solutions.
This document discusses enterprise-grade big data solutions from HPE. It outlines HPE's reference architecture for big data workloads including components like data lakes, data warehouses, archival storage, event processing, and in-memory analytics. It also discusses HPE's investments in Hortonworks and collaboration to optimize Hadoop for performance. The document promotes attending an HPE session at the Hadoop Summit on modernizing data warehouses and visiting the HPE booth for demos and a trivia game.
Apache Hadoop YARN is the modern Distributed Operating System. It enables the Hadoop compute layer to be a common resource-management platform that can host a wide variety of applications. Multiple organizations are able to leverage YARN in building their applications on top of Hadoop without themselves repeatedly worrying about resource management, isolation, multi-tenancy issues etc.
In this talk, we'll first hit the ground with the current status of Apache Hadoop YARN: how it is faring today in deployments large and small. We will cover different types of YARN deployments, in different environments and at different scales.
We'll then move on to the exciting present and future of YARN: features that are further strengthening YARN as the first-class resource-management platform for datacenters running enterprise Hadoop. We'll discuss the current status as well as the future promise of features and initiatives such as 10x scheduler throughput improvements, Docker container support on YARN, native support for long-running services (alongside applications) without any changes, seamless application upgrades, fine-grained isolation for multi-tenancy using CGroups on disk and network resources, powerful scheduling features like application priorities and intra-queue preemption across applications, and operational enhancements including insights through Timeline Service V2, a new web UI, and better queue management.
This document discusses an advanced visualization tool for Spark and Flink jobs. It collects fine-grained data about task execution, including data characteristics and block fetch information. This information is exposed through a REST API and used to visualize the physical execution plan, detect issues like data skew, and help developers optimize their applications. The tool aims to help understand distributed data processing systems and guide testing of adaptive partitioning techniques. It has been extended to support Flink visualization as well. Future plans include open-sourcing the framework and adding more visualization features and metrics.
High Speed Continuous & Reliable Data Ingest into Hadoop (DataWorks Summit)
This talk will explore the area of real-time data ingest into Hadoop and present the architectural trade-offs, as well as demonstrate alternative implementations that strike the appropriate balance across the following common challenges (a sketch of one such building block follows the list):
• Decentralized writes (multiple data centers and collectors)
• Continuous availability, high reliability
• No loss of data
• Elasticity of introducing more writers
• Bursts in speed per syslog emitter
• Continuous, real-time collection
• Flexible write targets (local FS, HDFS, etc.)
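A minimal sketch of one common building block for these requirements is a replicated log in front of Hadoop; the broker addresses and topic name below are assumptions, and kafka-python stands in here for whichever ingest layer a given implementation actually uses:

```python
from kafka import KafkaProducer

# acks="all" makes the broker confirm replication before acking,
# trading latency for "no loss of data"; retries absorb transient
# broker failures so bursty syslog emitters are not dropped.
producer = KafkaProducer(
    bootstrap_servers=["broker1:9092", "broker2:9092"],  # assumed hosts
    acks="all",
    retries=5,
    linger_ms=50,  # small batching window to smooth bursts
)

for line in open("/var/log/syslog", "rb"):
    producer.send("syslog-ingest", line)  # assumed topic name
producer.flush()
```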
This document provides an overview of Apache Flink and discusses why it is suitable for real-world streaming analytics. The document contains an agenda that covers how Flink is a multi-purpose big data analytics framework, why streaming analytics are emerging, why Flink is suitable for real-world streaming analytics, novel use cases enabled by Flink, who is using Flink, and where to go from here. Key points include Flink innovations like custom memory management, its DataSet API, rich windowing semantics, and native iterative processing. Flink's streaming features that make it suitable for real-world use include its pipelined processing engine, stream abstraction, performance, windowing support, fault tolerance, and integration with Hadoop.
Hadoop clusters are operated on an ephemeral basis in the cloud by Qubole, processing over 300 petabytes of data per month across over 100 customers. Qubole addresses challenges of ephemeral clusters through auto-scaling of resources using YARN, optimizing performance for cloud storage, and storing job history remotely. Volatile low-cost nodes are leveraged through policies that ensure data replication despite potential node failures.
Flexible and Real-Time Stream Processing with Apache Flink (DataWorks Summit)
This document provides an overview of stream processing with Apache Flink. It discusses the rise of stream processing and how it enables low-latency applications and real-time analysis. It then describes Flink's stream processing capabilities, including pipelining of data, fault tolerance through checkpointing and recovery, and integration with batch processing. The document also summarizes Flink's programming model, state management, and roadmap for further development.
This document discusses streaming data ingestion and processing options. It provides an overview of common streaming architectures including Kafka as an ingestion hub and various streaming engines. Spark Streaming is highlighted as a popular and full-featured option for processing streaming data due to its support for SQL, machine learning, and ease of transition from batch workflows. The document also briefly profiles StreamSets Data Collector as a higher-level tool for building streaming data pipelines.
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad... (DataWorks Summit)
Businesses often have to interact with different data sources to get a unified view of the business or to resolve discrepancies. These EDW data repositories are often large and complex, are business critical, and cannot afford downtime. This session will share best practices and lessons learned for building a Data Fabric on Spark / Hadoop / Hive / NoSQL that provides a unified view, enables simplified access to the data repositories, resolves technical challenges, and adds business value.
The document discusses Apache Hive and Apache Druid for fast SQL on big data. It provides performance benchmarks showing Hive LLAP is faster than Presto and Spark SQL for TPC-DS queries. It describes features of Hive LLAP including in-memory caching, query result caching, and metadata caching. It also discusses new Hive 3 features like materialized views and optimizer improvements. The document then provides an overview of Apache Druid's capabilities for real-time ingestion and querying of streaming data before discussing how Hive and Druid can work together, with Hive able to push down queries to Druid.
The document summarizes recommendations for efficiently and effectively managing Apache Hadoop based on observations from analyzing over 1,000 customer bundles. It covers common operational mistakes like inconsistent operating system configurations involving locale, transparent huge pages, NTP, and legacy kernel issues. It also provides recommendations for optimizing configurations involving HDFS name node and data node settings, YARN resource manager and node manager memory settings, and YARN ATS timeline storage. The presentation encourages adopting recommendations built into the SmartSense analytics product to improve cluster operations and prevent issues.
1. The document discusses Microsoft's SCOPE analytics platform running on Apache Tez and YARN. It describes how Graphene was designed to integrate SCOPE with Tez to enable SCOPE jobs to run as Tez DAGs on YARN clusters.
2. Key components of Graphene include a DAG converter, Application Master, and tooling integration. The Application Master manages task execution and communicates with SCOPE engines running in containers.
3. Initial experience running SCOPE on Tez has been positive though challenges remain around scaling to very large workloads with over 15,000 parallel tasks and optimizing for opportunistic containers and Application Master recovery.
The document discusses the Stinger Initiative from Hortonworks to improve the performance and capabilities of interactive queries in Hive. The initiative takes a two-pronged approach, focusing on improvements to the query engine and the introduction of a new optimized column store file format called ORCFile. A new Tez execution engine is also introduced to avoid bottlenecks in MapReduce and enable lower latency queries. The goal is to extend Hive's ability to handle interactive queries with response times measured in seconds rather than minutes.
HDFS Tiered Storage: Mounting Object Stores in HDFS (DataWorks Summit)
Most users know HDFS as the reliable store of record for big data analytics. HDFS is also used to store transient and operational data when working with cloud object stores, as in Azure HDInsight and Amazon EMR. In these settings, but also in more traditional, on-premises deployments, applications often manage data stored in multiple storage systems or clusters, requiring a complex workflow for synchronizing data between filesystems to achieve goals for durability, performance, and coordination.
Building on existing heterogeneous storage support, we add a storage tier to HDFS to work with external stores, allowing remote namespaces to be "mounted" in HDFS. This capability not only supports transparent caching of remote data as HDFS blocks, it also supports synchronous writes to remote clusters for business continuity planning (BCP) and supports hybrid cloud architectures.
This idea was presented at last year's Summit in San Jose. Lots of progress has been made since then and the feature is in active development at the Apache Software Foundation on branch HDFS-9806, driven by Microsoft and Western Digital. We will discuss the refined design & implementation and present how end-users and admins will be able to use this powerful functionality.
Presentation from physical to virtual to cloud emc (xKinAnx)
The document discusses three paradigm shifts in information technology: 1) From physical to virtual computing as virtualization becomes mainstream, 2) The network becoming the computer through network-centric architectures, 3) Storage evolving from a server-centric to a virtual, flexible model. These shifts are creating an industrialized "cloud computing" platform for intelligent, on-demand delivery of IT services.
ENGAGE2015: How is EMC Transforming Employee Communications? - Kevin Close, EMC (GuideSpark)
ENGAGE2015: Kevin Close, Senior Vice President of Compensation and Benefits at EMC, on approaching and innovating employee communications.
For more on the conference, visit http://www.guidespark.com/engage2015/. Follow GuideSpark on Twitter, LinkedIn and Facebook for up-to-date news.
This document summarizes EMC's position in the data storage market and strategies for capitalizing on emerging trends in big data and cloud computing. EMC is a global leader in data storage and management, with over $21 billion in annual revenues and over 50,000 employees. It has established leadership positions in various storage segments through both organic research and acquisitions. EMC is pursuing a dual innovation strategy of internal R&D and technology acquisitions to expand its portfolio and lead the transition to cloud and big data solutions. This includes over $24 billion invested in R&D and acquisitions from 2003 to present. EMC aims to help customers address the challenges of rapidly growing data volumes and IT budget constraints by transforming their infrastructure for cloud and big data.
The document summarizes how the BI team at Big Fish Games pitched their initial investment, structured their team, and approached their initial build out of BI capabilities. To pitch the initial investment, they focused on compelling business deliverables and iterating over key business problems. For their team structure, they brought in experienced engineers, paired people, and learned through real projects. In their initial build out, they focused on incremental delivery through business projects, gradually transitioned users, and leveraged their vendor(s).
This document provides a beginner's guide to contributing to open source projects. It discusses why people contribute (e.g. to expand knowledge), what organizations gain from contributions (e.g. business enablement), and how to get started. The guide recommends starting with documentation, answering questions, reporting bugs precisely, and eventually writing code as skills are built. Contributing helps individuals and moves projects forward for the benefit of all.
The document discusses EMC's Elastic Cloud Storage (ECS) product. It notes that unstructured data is growing rapidly and new applications require scalable, geo-distributed storage with cloud-like access. ECS provides hyper-scale object storage that can scale to billions of objects, with secure access from anywhere using multiple protocols. It offers an active-active geo-distributed architecture and compelling economics for both appliance and software-only deployments.
EMC Documentum - xCP 2.x Installation and Deployment (Haytham Ghandour)
This document provides guidance on installing and deploying the EMC xCP application. It outlines the key components that must be installed, such as the JDK, Content Server, xPlore, and xMS agent. It also describes how to configure the application server and set up the xMS environment, including importing templates, creating hosts and services, and synchronizing the environment. Finally, it discusses some common deployment issues like incompatible versions of xPlore, Tomcat role configuration errors, and repository name issues. Logs and performance tuning tips are also presented to help troubleshoot failures.
This document provides an overview of the past, present, and future of Apache Hadoop YARN. It discusses how YARN has evolved from Apache Hadoop 2.6/2.7 to now support 2.8 with features like dynamic resource configuration, container resizing, and Docker support. Upcoming work includes support for arbitrary resource types, federation of multiple YARN clusters, and a new ResourceManager UI. The future of YARN scheduling may include distributed scheduling, intra-queue preemption, and scheduling based on actual resource usage.
Centrica implemented a Hadoop data platform to gain insights from large and diverse data sources. This provided a single customer view and enabled new applications and dashboards to improve customer service. The previous data infrastructure was complicated and could not scale to handle growing IoT and smart meter data. The Hadoop implementation followed agile and DevOps practices and has been successful, winning industry awards. Centrica aims to further collaboration and leverage cloud to reduce costs as big data adoption continues.
This document discusses best practices for running Spark in production. It begins with introductions from the presenters and an overview of Spark deployment modes on YARN. The main topics covered are Spark security using Kerberos authentication and authorization, communication channels and encryption in YARN cluster mode, common issues, and performance tuning. For performance, it recommends choosing executor and task sizes to balance efficiency and overhead, and increasing task parallelism to mitigate data skew problems. The goal is to understand workload patterns and monitor behavior to effectively tune Spark for different situations.
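To make the executor-sizing and parallelism advice concrete, here is a minimal PySpark sketch; the specific values are placeholders showing which knobs the talk refers to, not recommendations:

```python
from pyspark.sql import SparkSession

# Mid-sized executors balance GC overhead (too big) against
# per-executor overhead (too small); extra shuffle partitions
# raise task parallelism, which helps dilute skewed keys.
spark = (
    SparkSession.builder
    .appName("tuning-sketch")
    .config("spark.executor.memory", "8g")        # placeholder value
    .config("spark.executor.cores", "4")          # placeholder value
    .config("spark.executor.instances", "20")     # placeholder value
    .config("spark.sql.shuffle.partitions", "400")
    .getOrCreate()
)
```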
This document discusses Symantec's journey towards enabling self-service analytics clusters using Cloudbreak and Ambari. It describes how Symantec built a self-service analytics platform using Ambari to automate the deployment of Hadoop clusters on their private OpenStack cloud. However, they later needed a solution that could deploy clusters across different cloud providers. They adopted Cloudbreak to deploy clusters on AWS and contributed extensions like Keystone v3 support to enable Cloudbreak to work with their OpenStack cloud as well. This allows them to deploy analytics clusters across different clouds through a single tool and interface.
Impetus provides expert consulting services around Hadoop implementations, including R&D, assessment, deployment (on private and public clouds), and optimizations for enhanced static shared data implementations.
This presentation covers Advanced Hadoop Tuning and Optimisation.
The document discusses EMC's Elastic Cloud Storage (ECS) product. It provides examples of how ECS has been used by customers for applications such as global content repositories, modern application platforms, geo-scale big data analytics, cold archives, internet of things storage platforms, and analytics requiring data in place. It also outlines new features and integrations for ECS around monitoring, availability, performance, and deployment simplicity.
MT47 Modernize infrastructure for a modern data center (Dell EMC World)
Today's businesses need speed, efficiency and agility to deliver services back to their stakeholders, all at an affordable price. In the Modern Data Center, Flash, along with Scale-out, software-defined solutions, help to automate a modern infrastructure, the foundation of the modern data center. This session will show you how Dell EMC's industry leading storage portfolio can transform your company's infrastructure and drive your success. In addition, learn how to protect your modern data center with Dell EMC's comprehensive data protection portfolio.
Follow us at @DellEMCStorage
Learn more about Dell EMC All-Flash Solutions at DellEMC.com/All-flash.
This document discusses EMC Isilon scale-out NAS storage solutions. It provides an overview of EMC Isilon's market leadership in scale-out NAS, key trends in unstructured data growth, and how Isilon addresses next-generation workloads. The document also outlines Isilon's hardware and software features like its OneFS operating system, data protection and management tools, and product family which scales from high transactional to high density platforms.
This document discusses Dell's solutions for big data and analytics workloads. It describes Dell's portfolio for unstructured analytics including storage, servers, and reference architectures. It also outlines Dell's vision for a unified streaming and batch analytics platform called Project Nautilus that would integrate Isilon storage with real-time stream processing.
EMC Symmetrix VMAX: An Introduction to Enterprise Storage: Brian Boyd, Varrow... (Brian Boyd)
This session gives an overview of the EMC Symmetrix VMAX enterprise storage array. We will discuss the appropriate time to start looking at enterprise storage in your datacenter, the benefits and differences in technology between VMAX and other storage arrays, and give specific examples of how VMAX has helped our customers in their environments.
The document discusses using IBM Flash and solutions to gain enhanced business insights from data. It describes how unstructured data is growing exponentially and how analytics is critical for businesses to gain insights. It then outlines IBM's flash storage portfolio, including all-flash arrays like FlashSystem and DeepFlash, a new class of flash optimized for big data workloads. It also discusses data protection schemes, shared storage versus shared-nothing architectures, and IBM tools for analytics, data management and security like Spectrum Scale, Spectrum Control and the Security Key Lifecycle Manager.
ITsubbotnik Spring 2017: Dmitriy Yatsyuk "A turnkey, integrated infrastructure..." (epamspb)
This document discusses a confidential proposal from EPAM to provide big data solutions and services for a client. It outlines EPAM's experience with Hadoop, AWS, data engineering, ETL, analytics dashboards, and security implementations. The proposal describes setting up production and staging environments with Hadoop, Zabbix, Jenkins, Chef, Tableau, and integrating them with the client's existing infrastructure. It highlights EPAM's big data competency center and capabilities in data strategy, architecture, analytics, and platform support.
The document discusses disaggregated Hadoop stacks and data storage models. It compares tightly coupled and disaggregated models, with disaggregated providing more flexibility and variable costs. Examples are given showing how disaggregation can reduce data center space and costs when scaling to petabytes and exabytes of data. Performance tests show disaggregated storage on EMC Isilon outperforming direct-attached storage for Hadoop workloads. The document argues disaggregation allows choice of tools, data availability across locations, and flexibility.
This document discusses EMC's Isilon scale-out NAS architecture and how it can be used as the data lake foundation for analytics workloads like Hadoop. It provides an overview of how Isilon implements the HDFS protocol to allow analytics jobs to run directly against data stored in the Isilon cluster. It also highlights the performance benefits of using shared storage on Isilon versus local disks, including up to 4x faster ingest and 1.5x faster job runtimes. Finally, it discusses how Isilon supports features like data tiering, multi-tier architectures with different node types, and integration with object storage for archiving.
The document discusses IBM Spectrum Scale, a software-defined storage solution from IBM. It provides:
1) A family of software-defined storage products including IBM Spectrum Control, IBM Spectrum Protect, IBM Spectrum Archive, IBM Spectrum Virtualize, IBM Spectrum Accelerate, and IBM Spectrum Scale.
2) IBM Spectrum Scale allows storing data everywhere and running applications anywhere. It provides highly scalable, high-performance storage for files, objects, and analytics workloads.
3) The document provides an overview of the IBM Spectrum Scale product and its capabilities for optimizing storage costs, improving data protection, enabling global collaboration, and ensuring data availability, integrity and security.
This document summarizes a presentation about scale-out converged solutions for analytics. The presentation covers the history of analytic infrastructure, why scale-out converged solutions are beneficial, an analytic workflow enabled by EMC Isilon storage and Hadoop, test results showing performance benefits, customer use cases, and next steps. It includes an agenda, diagrams demonstrating analytic workflows, performance comparisons, and descriptions of enterprise features provided by using EMC Isilon with Hadoop.
This document provides an overview of IBM's Hadoop solution on Power Systems, including:
- The basic architecture of IBM's Hadoop solution using Power Systems servers and GPFS storage.
- Considerations for sizing a Hadoop cluster, such as compression rates and space for shuffle/sort data.
- The IBM Solution for Hadoop POWER System edition and IBM Data Engine for Analytics solutions.
- Networking recommendations for Hadoop clusters including appropriate switches and cabling.
EMC presented an overview of SQL Server 2012 and how it can help organizations unlock insights from data, improve performance of mission critical applications, and create business solutions across on-premises and cloud environments. EMC positions itself as the leader in mission critical infrastructure and discusses how its storage solutions like VNX, VMAX, and FAST cache can boost the performance of SQL Server workloads by 3-4x while improving reliability, availability, backup speeds and reducing storage needs. The presentation provides best practices for optimizing SQL Server deployments and highlights EMC's management and data protection tools for SQL Server.
How the Development Bank of Singapore solves on-prem compute capacity challen... (Alluxio, Inc.)
The Development Bank of Singapore (DBS) has evolved its data platforms over three generations to address big data challenges and the explosion of data. It now uses a hybrid cloud model with Alluxio to provide a unified namespace across on-prem and cloud storage for analytics workloads. Alluxio enables "zero-copy" cloud bursting by caching hot data and orchestrating analytics jobs between on-prem and cloud resources like AWS EMR and Google Dataproc. This provides dynamic scaling of compute capacity while retaining data locality. Alluxio also offers intelligent data tiering and policy-driven data migration to cloud storage over time for cost efficiency and management.
The Transformation of your Data in modern IT (Presented by DellEMC) (Cloudera, Inc.)
Organizations have a wealth of data contained within their existing infrastructures. At DellEMC we're helping customers remove the barriers of legacy datastores and transforming the customer experience in the modern datacentre. Learn how to unshackle the valuable data inside your existing data warehouse and leverage new techniques, applications, and technology to enhance the financial impact of all your data sources.
Highly Available, Highly Scalable: Enterprise Manager 12c for Large Enterprises discusses using Oracle Enterprise Manager (EM) 12c to monitor a large enterprise environment with thousands of database instances, application servers, and other targets across multiple platforms and versions. It describes how EM 12c provides highly available monitoring with redundancy and disaster recovery, and how it addresses challenges of managing and reporting at large scale. Key points covered include building a highly available EM infrastructure, managing targets and alerts in bulk, leveraging the metric framework and reporting capabilities, and performing regular maintenance tasks to keep the EM environment healthy.
This document discusses Dell EMC ScaleIO software-defined block storage. It provides an overview of ScaleIO and its benefits, including massive scalability from 3 to over 1,000 nodes, extreme performance with tens of millions of IOPS, unparalleled flexibility to deploy on any hardware and choice of configurations, supreme elasticity to scale on the fly without downtime, and compelling economics with lower TCO. Case studies show how ScaleIO has helped customers drastically reduce costs, improve performance, and scale their storage infrastructure elastically.
The document provides an overview of EMC's big data solutions. It discusses the challenges of big data for IT in terms of complexity from multiple Hadoop distributions, costs of acquisition and operations, and security and governance challenges. It then introduces EMC's Hadoop starter kit which provides a simple and cost-effective way for customers to get started with Hadoop deployments on their existing EMC infrastructure. The starter kit includes deployment guides for various Hadoop distributions including Cloudera, Hortonworks, PivotalHD and Apache. It has seen over 1500 deployments worldwide.
ADV Slides: Platforming Your Data for Success - Databases, Hadoop, Managed Ha... (DATAVERSITY)
Thirty years is a long time for a technology foundation to be as active as relational databases. Are their replacements here? In this webinar, we say no.
Databases have not sat around while Hadoop emerged. The Hadoop era generated a ton of interest and confusion, but is it still relevant as organizations are deploying cloud storage like a kid in a candy store? We'll discuss what platforms to use for what data. This is a critical decision that can dictate two to five times additional work effort if it's a bad fit.
Drop the herd mentality. In reality, there is no "one size fits all" right now. We need to make our platform decisions amidst this backdrop.
This webinar will distinguish these analytic deployment options and help you platform 2020 and beyond for success.
The document discusses trends in data and analytics, including the growth of digital data and devices. It summarizes predictions that by 2020 there will be over 30 billion connected devices, 7 billion people, and over 1 million new businesses. The document also discusses how analytics is converging databases and Hadoop to enable querying both structured and unstructured data, and how this will impact industries and skills. It focuses on trends like machine learning and the increasing importance of outcomes over specific technologies like Hadoop.
This document discusses running Apache Spark and Apache Zeppelin in production. It begins by introducing the author and their background. It then covers security best practices for Spark deployments, including authentication using Kerberos, authorization using Ranger/Sentry, encryption, and audit logging. Different Spark deployment modes like Spark on YARN are explained. The document also discusses optimizing Spark performance by tuning executor size and multi-tenancy. Finally, it covers security features for Apache Zeppelin like authentication, authorization, and credential management.
This document discusses Spark security and provides an overview of authentication, authorization, encryption, and auditing in Spark. It describes how Spark leverages Kerberos for authentication and uses services like Ranger and Sentry for authorization. It also outlines how communication channels in Spark are encrypted and some common issues to watch out for related to Spark security.
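As a hedged illustration of the authentication and encryption switches such a talk covers, the configuration keys below exist in recent Spark releases, though which ones apply depends on your deployment (Kerberos credentials are normally supplied via spark-submit and are omitted here):

```python
from pyspark.sql import SparkSession

# Illustrative security switches only; verify key names and
# semantics against the Spark version actually deployed.
spark = (
    SparkSession.builder
    .config("spark.authenticate", "true")            # SASL auth between daemons
    .config("spark.network.crypto.enabled", "true")  # encrypt RPC channels
    .config("spark.io.encryption.enabled", "true")   # encrypt shuffle spills
    .getOrCreate()
)
```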
The document discusses the Virtual Data Connector project which aims to leverage Apache Atlas and Apache Ranger to provide unified metadata and access governance across data sources. Key points include:
- The project aims to address challenges of understanding, governing, and controlling access to distributed data through a centralized metadata catalog and policies.
- Apache Atlas provides a scalable metadata repository while Apache Ranger enables centralized access governance. The project will integrate these using a virtualization layer.
- Enhancements to Atlas and Ranger are proposed to better support the project's goals around a unified open metadata platform and metadata-driven governance.
- An initial minimum viable product will be built this year, with the goal of an open, collaborative ecosystem around shared metadata.
This document discusses using a data science platform to enable digital diagnostics in healthcare. It provides an overview of healthcare data sources and Yale/YNHH's data science platform. It then describes the data science journey process using a clinical laboratory use case as an example. The goal is to use big data and machine learning to improve diagnostic reproducibility, throughput, turnaround time, and accuracy for laboratory testing by developing a machine learning algorithm and real-time data processing pipeline.
This document discusses using Apache Spark and MLlib for text mining on big data. It outlines common text mining applications, describes how Spark and MLlib enable scalable machine learning on large datasets, and provides examples of text mining workflows and pipelines that can be built with Spark MLlib algorithms and components like tokenization, feature extraction, and modeling. It also discusses customizing ML pipelines and the Zeppelin notebook platform for collaborative data science work.
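A pipeline of the kind described (tokenization, feature extraction, then a model) can be sketched with Spark MLlib's standard components; the toy corpus and column names here are invented for illustration:

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("text-mining-sketch").getOrCreate()

# Toy labeled corpus; a real job would read from HDFS or object storage.
df = spark.createDataFrame(
    [("spark makes big data simple", 1.0),
     ("the weather is nice today", 0.0)],
    ["text", "label"],
)

# Chain tokenization -> term frequencies -> IDF weighting -> classifier.
pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),
    HashingTF(inputCol="words", outputCol="tf"),
    IDF(inputCol="tf", outputCol="features"),
    LogisticRegression(maxIter=10),
])
model = pipeline.fit(df)
```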
This document compares the performance of Hive and Spark when running the BigBench benchmark. It outlines the structure and use cases of the BigBench benchmark, which aims to cover common Big Data analytical properties. It then describes sequential performance tests of Hive+Tez and Spark on queries from the benchmark using a HDInsight PaaS cluster, finding variations in performance between the systems. Concurrency tests are also run by executing multiple query streams in parallel to analyze throughput.
The document discusses modern data applications and architectures. It introduces Apache Hadoop, an open-source software framework for distributed storage and processing of large datasets across clusters of commodity hardware. Hadoop provides massive scalability and easy data access for applications. The document outlines the key components of Hadoop, including its distributed storage, processing framework, and ecosystem of tools for data access, management, analytics and more. It argues that Hadoop enables organizations to innovate with all types and sources of data at lower costs.
This document provides an overview of data science and machine learning. It discusses what data science and machine learning are, including extracting insights from data and computers learning without being explicitly programmed. It also covers Apache Spark, which is an open source framework for large-scale data processing. Finally, it discusses common machine learning algorithms like regression, classification, clustering, and dimensionality reduction.
This document provides an overview of Apache Spark, including its capabilities and components. Spark is an open-source cluster computing framework that allows distributed processing of large datasets across clusters of machines. It supports various data processing workloads including streaming, SQL, machine learning and graph analytics. The document discusses Spark's APIs like DataFrames and its libraries like Spark SQL, Spark Streaming, MLlib and GraphX. It also provides examples of using Spark for tasks like linear regression modeling.
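The linear regression example mentioned above can be sketched in a few lines of PySpark; the toy data and column names are invented for illustration:

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("linreg-sketch").getOrCreate()

# Toy data where y is roughly 2*x; real input would come from a DataFrame source.
df = spark.createDataFrame([(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)], ["x", "y"])
train = (VectorAssembler(inputCols=["x"], outputCol="features")
         .transform(df)
         .withColumnRenamed("y", "label"))

lr_model = LinearRegression(maxIter=10).fit(train)
print(lr_model.coefficients, lr_model.intercept)
```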
This document provides an overview of Apache NiFi and dataflow. It begins with an introduction to the challenges of moving data effectively within and between systems. It then discusses Apache NiFi's key features for addressing these challenges, including guaranteed delivery, data buffering, prioritized queuing, and data provenance. The document outlines NiFi's architecture and components like repositories and extension points. It also previews a live demo and invites attendees to further discuss Apache NiFi at a Birds of a Feather session.
Many organizations currently process various types of data in different formats. Most often this data is in free form. As the consumers of this data grow, it's imperative that this free-flowing data adhere to a schema. It helps data consumers to have an expectation about the type of data they are getting, and it lets them avoid immediate impact if the upstream source changes its format. Having a uniform schema representation also gives the data pipeline an easy way to integrate and support various systems that use different data formats.
SchemaRegistry is a central repository for storing and evolving schemas. It provides an API and tooling to help developers and users register a schema and consume that schema without any impact if the schema changes. Users can tag different schemas and versions, register for notifications of schema changes with versions, etc.
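The register/consume flow is easy to sketch against such a REST API. The following is illustrative only: the endpoint path, port, and payload shape are assumptions modeled on a typical schema-registry API, not the documented interface of any particular registry version:

```python
import json
import urllib.request

# Hypothetical registry endpoint; path and payload shape are
# assumptions and may differ from your registry's actual API.
REGISTRY = "http://registry.example.com:9090/api/v1/schemaregistry"

avro_schema = {
    "type": "record",
    "name": "TruckEvent",
    "fields": [{"name": "driver_id", "type": "long"}],
}

# Register a new version of the (assumed) "truck-events" schema.
req = urllib.request.Request(
    f"{REGISTRY}/schemas/truck-events/versions",
    data=json.dumps({"schemaText": json.dumps(avro_schema)}).encode(),
    headers={"Content-Type": "application/json"},
    method="POST",
)
print(urllib.request.urlopen(req).status)
```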
In this talk, we will go through the need for a schema registry and schema evolution and showcase the integration with Apache NiFi, Apache Kafka, Apache Storm.
There is increasing need for large-scale recommendation systems. Typical solutions rely on periodically retrained batch algorithms, but for massive amounts of data, training a new model could take hours. This is a problem when the model needs to be more up-to-date. For example, when recommending TV programs while they are being transmitted the model should take into consideration users who watch a program at that time.
The promise of online recommendation systems is fast adaptation to changes, but methods of online machine learning from streams is commonly believed to be more restricted and hence less accurate than batch trained models. Combining batch and online learning could lead to a quickly adapting recommendation system with increased accuracy. However, designing a scalable data system for uniting batch and online recommendation algorithms is a challenging task. In this talk we present our experiences in creating such a recommendation engine with Apache Flink and Apache Spark.
DeepLearning is not just hype - it outperforms state-of-the-art ML algorithms, one by one. In this talk we will show how deep learning can be used for detecting anomalies on IoT sensor data streams at high speed, using DeepLearning4J on top of different big data engines like Apache Spark and Apache Flink. Key to this talk is the absence of any large training corpus, since we are using unsupervised machine learning - a domain that current deep learning research treats step-motherly. As we can see in this demo, LSTM networks can learn very complex system behavior - in this case, data coming from a physical model simulating bearing vibration data. One drawback of deep learning is that normally a very large labeled training data set is required. This is particularly interesting because we can show how unsupervised machine learning can be used in conjunction with deep learning - no labeled data set is necessary. We are able to detect anomalies and predict breaking bearings with 10-fold confidence. All examples and all code will be made publicly available and open source. Only open source components are used.
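The talk's implementation uses DeepLearning4J; as a plainly swapped-in Python illustration of the same unsupervised idea, here is a minimal Keras sketch of an LSTM autoencoder that flags windows it reconstructs poorly (shapes, data, and threshold are placeholders):

```python
import numpy as np
from tensorflow import keras

# Unsupervised idea: train an LSTM autoencoder on "normal" sensor
# windows only; high reconstruction error at inference time is
# treated as an anomaly. All sizes below are placeholders.
timesteps, features = 50, 3
x_normal = np.random.randn(1000, timesteps, features).astype("float32")

model = keras.Sequential([
    keras.layers.LSTM(32, input_shape=(timesteps, features)),
    keras.layers.RepeatVector(timesteps),
    keras.layers.LSTM(32, return_sequences=True),
    keras.layers.TimeDistributed(keras.layers.Dense(features)),
])
model.compile(optimizer="adam", loss="mse")
model.fit(x_normal, x_normal, epochs=5, batch_size=64)

def is_anomaly(window, threshold=1.0):
    """Flag a (timesteps, features) window whose reconstruction error is high."""
    err = np.mean((model.predict(window[None]) - window) ** 2)
    return err > threshold
```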
QE automation for large systems is a great step forward in increasing system reliability. In the big-data world, multiple components have to come together to provide end-users with business outcomes. This means that QE automation scenarios need to be detailed around actual use cases, cutting across components. The system tests potentially generate large amounts of data on a recurring basis, and verifying it is a tedious job. Given the multiple levels of indirection, the rate of false positives among reported defects is higher, and these are generally wasteful.
At Hortonworks, we've designed and implemented an Automated Log Analysis System (Mool), using statistical data science and ML. The work in progress currently has a batch data pipeline followed by an ensemble ML pipeline that feeds into the recommendation engine. The system identifies the root cause of test failures by correlating the failing test cases with current and historical error records, identifying the root cause of errors across multiple components. The system works in unsupervised mode, with no perfect model, stable build, or source-code version to refer to. In addition, the system provides limited recommendations to file new or reopen past tickets and compares run profiles with past runs.
Improving business performance is never easy! The Natixis Pack is like Rugby. Working together is key to scrum success. Our data journey would undoubtedly have been so much more difficult if we had not made the move together.
This session is the story of how "The Natixis Pack" has driven change in its current IT architecture so that legacy systems can leverage some of the many components in Hortonworks Data Platform in order to improve the performance of business applications. During this session, you will hear:
• How and why the business and IT requirements originated
• How we leverage the platform to fulfill security and production requirements
• How we organize a community to:
o Guard all the players, no one gets left on the ground!
o Use the platform appropriately (not every problem is eligible for Big Data, and standard databases are not dead)
• What are the most usable, the most interesting and the most promising technologies in the Apache Hadoop community
We will finish the story of a successful rugby team with insight into the special skills needed from each player to win the match!
DETAILS
This session is part business, part technical. We will talk about infrastructure, security and project management as well as the industrial usage of Hive, HBase, Kafka, and Spark within an industrial Corporate and Investment Bank environment, framed by regulatory constraints.
HBase is a distributed, column-oriented database that stores data in tables divided into rows and columns. It is optimized for random, real-time read/write access to big data. The document discusses HBase's key concepts like tables, regions, and column families. It also covers performance tuning aspects like cluster configuration, compaction strategies, and intelligent key design to spread load evenly. Different use cases are suitable for HBase depending on access patterns, such as time series data, messages, or serving random lookups and short scans from large datasets. Proper data modeling and tuning are necessary to maximize HBase's performance.
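As one concrete instance of "intelligent key design", a common pattern is salting: prefixing row keys with a hash bucket so monotonically increasing keys spread across regions instead of hot-spotting one server. A minimal sketch, with the bucket count and key layout as assumptions:

```python
import hashlib

NUM_BUCKETS = 16  # illustrative; roughly match the number of regions

def salted_key(natural_key: str) -> bytes:
    """Prefix a stable hash bucket so sequential keys (timestamps,
    counters) scatter across regions instead of hitting one server."""
    digest = hashlib.md5(natural_key.encode()).digest()
    bucket = digest[0] % NUM_BUCKETS
    return f"{bucket:02d}|{natural_key}".encode()

# e.g. events keyed by timestamp land in 16 different key ranges:
print(salted_key("2016-06-30T12:00:01|sensor-42"))
```

The trade-off is that simple range scans over the natural key now require one scan per bucket, which is why the bucket count should stay small and stable.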
There has been an explosion of data digitising our physical world: from cameras, environmental sensors and embedded devices, right down to the phones in our pockets. This means that companies now have new ways to transform their businesses, both operationally and through their products and services, by leveraging this data and applying fresh analytical techniques to make sense of it. But are they ready? The answer is "no" in most cases.
In this session, we'll be discussing the challenges facing companies trying to embrace the Analytics of Things, and how Teradata has helped customers work through and turn those challenges to their advantage.
In this talk, we will present a new distribution of Hadoop, Hops, that can scale the Hadoop Filesystem (HDFS) by 16X, from 70K ops/s to 1.2 million ops/s on Spotify's industrial Hadoop workload. Hops is an open-source distribution of Apache Hadoop that supports distributed metadata for HDFS (HopsFS) and the ResourceManager in Apache YARN. HopsFS is the first production-grade distributed hierarchical filesystem to store its metadata normalized in an in-memory, shared-nothing database. For YARN, we will discuss optimizations that enable 2X throughput increases for the Capacity scheduler, enabling scalability to clusters with >20K nodes. We will discuss the journey of how we reached this milestone, including some of the challenges involved in efficiently and safely mapping hierarchical filesystem metadata state and operations onto a shared-nothing, in-memory database. We will also discuss the key database features needed for extreme scaling, such as multi-partition transactions, partition-pruned index scans, distribution-aware transactions, and the streaming changelog API. Hops (www.hops.io) is Apache-licensed open source and supports a pluggable database backend for distributed metadata, although it currently only supports MySQL Cluster as a backend. Hops opens up the potential for new directions for Hadoop when metadata is available for tinkering in a mature relational database.
In high-risk manufacturing industries, regulatory bodies stipulate continuous monitoring and documentation of critical product attributes and process parameters. On the other hand, sensor data coming from production processes can be used to gain deeper insights into optimization potentials. By establishing a central production data lake based on Hadoop and using Talend Data Fabric as a basis for a unified architecture, the German pharmaceutical company HERMES Arzneimittel was able to cater to compliance requirements as well as unlock new business opportunities, enabling use cases like predictive maintenance, predictive quality assurance or open world analytics. Learn how the Talend Data Fabric enabled HERMES Arzneimittel to become data-driven and transform Big Data projects from challenging, hard to maintain hand-coding jobs to repeatable, future-proof integration designs.
Talend Data Fabric combines Talend products into a common set of powerful, easy-to-use tools for any integration style: real-time or batch, big data or master data management, on-premises or in the cloud.
The Hadoop Distributed File System (HDFS) is evolving from a MapReduce-centric storage system into a generic, cost-effective storage infrastructure where HDFS stores all of an organization's data. The new use case presents a new set of challenges to the original HDFS architecture. One challenge is to scale the storage management of HDFS: the centralized scheme within the NameNode becomes a main bottleneck which limits the total number of files stored. Although a typical large HDFS cluster is able to store several hundred petabytes of data, it is inefficient at handling large amounts of small files under the current architecture.
In this talk, we introduce our new design and in-progress work that re-architects HDFS to attack this limitation. The storage management is enhanced to a distributed scheme. A new concept of storage container is introduced for storing objects. HDFS blocks are stored and managed as objects in the storage containers instead of being tracked only by NameNode. Storage containers are replicated across DataNodes using a newly-developed high-throughput protocol based on the Raft consensus algorithm. Our current prototype shows that under the new architecture the storage management of HDFS scales 10x better, demonstrating that HDFS is capable of storing billions of files.
Self-Healing Test Automation Framework - Healenium (Knoldus Inc.)
Revolutionize your test automation with Healenium's self-healing framework. Automate test maintenance, reduce flakes, and increase efficiency. Learn how to build a robust test automation foundation. Discover the power of self-healing tests. Transform your testing experience.
This PDF delves into the aspects of information security from a forensic perspective, focusing on privacy leaks. It provides insights into the methods and tools used in forensic investigations to uncover and mitigate privacy breaches in mobile and cloud environments.
Increase Quality with User Access Policies - July 2024 (Peter Caitens)
Increase Quality with User Access Policies, presented by Peter Caitens and Adam Best of Salesforce. View the slides from this session to hear all about "User Access Policies" and how they can help you onboard users faster with greater quality.
Top 12 AI Technology Trends For 2024.pdf (Marrie Morris)
Technology has become an irreplaceable component of our daily lives. The role of AI in technology revolutionizes our lives for the betterment of the future. In this article, we will learn about the top 12 AI technology trends for 2024.
Discovery Series - Zero to Hero - Task Mining Session 1 (DianaGray10)
This session is focused on providing you with an introduction to task mining. We will go over different types of task mining and provide you with a real-world demo on each type of task mining in detail.
Choosing the Best Outlook OST to PST Converter: Key Features and Considerations (webbyacad software)
When looking for a good software utility to convert Outlook OST files to PST format, it is important to find one that is easy to use and has useful features. WebbyAcad OST to PST Converter Tool is a great choice because it is simple to use for anyone, whether you are tech-savvy or not. It can smoothly change your files to PST while keeping all your data safe and secure. Plus, it can handle large amounts of data and convert multiple files at once, which can save you a lot of time. It even comes with 24*7 technical support assistance and a free trial, so you can try it out before making a decision. Whether you need to recover, move, or back up your data, Webbyacad OST to PST Converter is a reliable option that gives you all the support you need to manage your Outlook data effectively.
It's your unstructured data: How to get your GenAI app to production (and spe... (Zilliz)
So you've successfully built a GenAI app POC for your company -- now comes the hard part: bringing it to production. Aparavi addresses the challenges of AI projects while addressing data privacy and PII. Our Service for RAG helps AI developers and data scientists scale their app to thousands or even millions of users using corporate unstructured data. Aparavi's AI Data Loader cleans, prepares and then loads only the relevant unstructured data for each AI project/app, enabling you to operationalize the creation of GenAI apps easily and accurately while giving you the time to focus on what you really want to do - building a great AI application with useful and relevant context. All within your environment, and never having to share private corporate data with anyone - not even Aparavi.
The Challenge of Interpretability in Generative AI Models.pdf (Sara Kroft)
Navigating the intricacies of generative AI models reveals a pressing challenge: interpretability. Our blog delves into the complexities of understanding how these advanced models make decisions, shedding light on the mechanisms behind their outputs. Explore the latest research, practical implications, and ethical considerations, as we unravel the opaque processes that drive generative AI. Join us in this insightful journey to demystify the black box of artificial intelligence.
Dive into the complexities of generative AI with our blog on interpretability. Find out why making AI models understandable is key to trust and ethical use and discover current efforts to tackle this big challenge.
The Zaitechno Handheld Raman Spectrometer is a powerful and portable tool for rapid, non-destructive chemical analysis. It utilizes Raman spectroscopy, a technique that analyzes the vibrational fingerprint of molecules to identify their chemical composition. This handheld instrument allows for on-site analysis of materials, making it ideal for a variety of applications, including:
Material identification: Identify unknown materials, minerals, and contaminants.
Quality control: Ensure the quality and consistency of raw materials and finished products.
Pharmaceutical analysis: Verify the identity and purity of pharmaceutical compounds.
Food safety testing: Detect contaminants and adulterants in food products.
Field analysis: Analyze materials in the field, such as during environmental monitoring or forensic investigations.
The Zaitechno Handheld Raman Spectrometer is easy to use and features a user-friendly interface. It is compact and lightweight, making it ideal for field applications. With its rapid analysis capabilities, the Zaitechno Handheld Raman Spectrometer can help you improve efficiency and productivity in your research or quality control workflows.
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...Fwdays
.NET 8 brought a lot of improvements for developers and maturity to the Azure serverless container ecosystem. So, this talk will cover these changes and explain how you can apply them to your projects. Another reason for this talk is the re-invention of Serverless from a DevOps perspective as a Platform Engineering trend with Backstage and the recent Radius project from Microsoft. So now is the perfect time to look at developer productivity tooling and serverless apps from Microsoft's perspective.
Demystifying Neural Networks And Building Cybersecurity ApplicationsPriyanka Aash
In today's rapidly evolving technological landscape, Artificial Neural Networks (ANNs) have emerged as a cornerstone of artificial intelligence, revolutionizing various fields including cybersecurity. Inspired by the intricacies of the human brain, ANNs have a rich history and a complex structure that enables them to learn and make decisions. This blog aims to unravel the mysteries of neural networks, explore their mathematical foundations, and demonstrate their practical applications, particularly in building robust malware detection systems using Convolutional Neural Networks (CNNs).
How UiPath Discovery Suite supports identification of Agentic Process Automat...DianaGray10
š Understand the basics of the newly persona-based LLM-powered Agentic Process Automation and discover how existing UiPath Discovery Suite products like Communication Mining, Process Mining, and Task Mining can be leveraged to identify APA candidates.
Topics Covered:
š” Idea Behind APA: Explore the innovative concept of Agentic Process Automation and its significance in modern workflows.
š How APA is Different from RPA: Learn the key differences between Agentic Process Automation and Robotic Process Automation.
š Discover the Advantages of APA: Uncover the unique benefits of implementing APA in your organization.
š Identifying APA Candidates with UiPath Discovery Products: See how UiPath's Communication Mining, Process Mining, and Task Mining tools can help pinpoint potential APA candidates.
š® Discussion on Expected Future Impacts: Engage in a discussion on the potential future impacts of APA on various industries and business processes.
Enhance your knowledge on the forefront of automation technology and stay ahead with Agentic Process Automation. š§ š¼āØ
Speakers:
Arun Kumar Asokan, Delivery Director (US) @ qBotica and UiPath MVP
Naveen Chatlapalli, Solution Architect @ Ashling Partners and UiPath MVP
DefCamp_2016_Chemerkin_Yury-publish.pdf - Presentation by Yury Chemerkin at DefCamp 2016 discussing mobile app vulnerabilities, data protection issues, and analysis of security levels across different types of mobile applications.
Retrieval Augmented Generation Evaluation with RagasZilliz
Retrieval Augmented Generation (RAG) enhances chatbots by incorporating custom data in the prompt. Using large language models (LLMs) as judge has gained prominence in modern RAG systems. This talk will demo Ragas, an open-source automation tool for RAG evaluations. Christy will talk about and demo evaluating a RAG pipeline using Milvus and RAG metrics like context F1-score and answer correctness.
UiPath Community Day Amsterdam: Code, Collaborate, ConnectUiPathCommunity
Welcome to our third live UiPath Community Day Amsterdam! Come join us for a half-day of networking and UiPath Platform deep-dives, for devs and non-devs alike, in the middle of summer ā.
š Agenda:
12:30 Welcome Coffee/Light Lunch ā
13:00 Event opening speech
Ebert Knol, Managing Partner, Tacstone Technology
Jonathan Smith, UiPath MVP, RPA Lead, Ciphix
Cristina Vidu, Senior Marketing Manager, UiPath Community EMEA
Dion Mes, Principal Sales Engineer, UiPath
13:15 ASML: RPA as Tactical Automation
Tactical robotic process automation for solving short-term challenges, while establishing standard and re-usable interfaces that fit IT's long-term goals and objectives.
Yannic Suurmeijer, System Architect, ASML
13:30 PostNL: an insight into RPA at PostNL
Showcasing the solutions our automations have provided, the challenges weāve faced, and the best practices weāve developed to support our logistics operations.
Leonard Renne, RPA Developer, PostNL
13:45 Break (30')
14:15 Breakout Sessions: Round 1
Modern Document Understanding in the cloud platform: AI-driven UiPath Document Understanding
Mike Bos, Senior Automation Developer, Tacstone Technology
Process Orchestration: scale up and have your Robots work in harmony
Jon Smith, UiPath MVP, RPA Lead, Ciphix
UiPath Integration Service: connect applications, leverage prebuilt connectors, and set up customer connectors
Johans Brink, CTO, MvR digital workforce
15:00 Breakout Sessions: Round 2
Automation, and GenAI: practical use cases for value generation
Thomas Janssen, UiPath MVP, Senior Automation Developer, Automation Heroes
Human in the Loop/Action Center
Dion Mes, Principal Sales Engineer @UiPath
Improving development with coded workflows
Idris Janszen, Technical Consultant, Ilionx
15:45 End remarks
16:00 Community fun games, sharing knowledge, drinks, and bites š»
2. Welcome!
Dr. Stefan Radtke
CTO Isilon, EMEA
EMC Emerging Technology Division
- 1995-2011: 17 years at IBM in various technical roles
- 2011: Joined EMC
- 2012-today: CTO, EMEA for EMC Isilon
Phone: +49-176-34434460
E-Mail: Stefan.Radtke@emc.com
Linkedin: http://de.linkedin.com/in/drstefanradtke
Blog: http://stefanradtke.blogspot.com
3. System Availability
Uptime Downtime (per year)
99.999% (AKA 5 nines) 5.26 minutes
99.99% (AKA 4 nines) 52.6 minutes
99.5% 1.83 days
99% (AKA 2 nines) 3.65 days
95% 18.25 days
What is your Data Warehouse's uptime SLA?
What is your Hadoop uptime SLA?
Why are they different?
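Each row is simple arithmetic on a 365-day year, which makes the gap between "five nines" and 95% concrete. A minimal Python sketch that reproduces the table:

```python
# Downtime per year implied by an availability percentage (365-day year).
MINUTES_PER_YEAR = 365 * 24 * 60

for availability in (99.999, 99.99, 99.5, 99.0, 95.0):
    downtime_min = MINUTES_PER_YEAR * (1 - availability / 100)
    if downtime_min < 120:
        print(f"{availability}% uptime -> {downtime_min:.2f} minutes of downtime/year")
    else:
        print(f"{availability}% uptime -> {downtime_min / (24 * 60):.2f} days of downtime/year")
```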
4. We have good Hadoop Outcomes
• Smart Grid
  – Fraud / broken devices & grid traffic projections
• Fraud
• Healthcare research
  – Genomes and healthcare (BRCA)
• Connected Car – Tesla
5. Hadoop takes on DB-like Features
• Newly added features in Hadoop 3.0
  – Erasure Coding (HDFS-EC / HDFS-7485) is being introduced to Hadoop
  – Additional standby NameNodes for increased resiliency (HDFS-6440)
• Future features
  – Random read support from an indexed NameNode (HDFS-8555)
  – Disaster Recovery (HDFS-5442)
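The appeal of erasure coding is the storage math. A minimal sketch comparing the raw-capacity overhead of classic 3x replication against a Reed-Solomon (6,3) layout (one commonly cited HDFS-EC policy; the 100 TB figure is just an example):

```python
# Raw capacity consumed per byte of user data, under replication vs. erasure coding.
def replication_overhead(replicas: int) -> float:
    return float(replicas)

def ec_overhead(data_blocks: int, parity_blocks: int) -> float:
    return (data_blocks + parity_blocks) / data_blocks

user_tb = 100
print(f"3x replication: {user_tb * replication_overhead(3):.0f} TB raw for {user_tb} TB of data")
print(f"RS(6,3) EC:     {user_tb * ec_overhead(6, 3):.0f} TB raw for {user_tb} TB of data")
```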
6. So...
• IF Hadoop is the Modern Database
AND
• IF Hadoop is taking on more Modern Database features
AND
• Successful outcomes are becoming more prolific...
Why do Hadoop operations and uptime SLAs seem like such an afterthought on most clusters?
7. KPIs
• Why do companies with VERY successful Data Warehouses, ETL processes, and KPI dashboards have so few of those for the Hadoop instance that now generates all of their Machine Learning and Data & Analytics?
8. What can go wrong?
• Forbes: "...haven't taken into account some long-term or ongoing cost associated with the project..."
• Information Week: "...unanticipated problems beyond the big data technology..."
• Computerworld: "...there are enterprises that underestimated the paradigm shift..."
9. An Intervention
• Why does the concept of 99.99% availability seem bad for a production Hadoop system?
• Why do solid KPIs around data collection and capture sound absurd?
• Since when did a backup copy of your primary analytics data become unnecessary?
• Is this just because Hadoop is about standing up cheap hardware?
• Why do companies need a catalyst before these things seem common again?
10. Why wouldn't you want:
• Two fully addressable clusters with data replication, located in separate geographies
• Data re-silvering when additional capacity is added
• Complete fault tolerance in the environment, not just data/node redundancy, to allow four-nines availability
• Operational scale that allows 24x7 support
[Diagram: full and empty nodes re-silver to a balanced state after capacity is added]
11. What is my Idea - 1
• Separation of compute and storage
  – Why do you think cloud Hadoop can offer better SLAs than on-premise Hadoop? It isn't because of a ton of single-point-of-failure compute boxes: they separate compute and storage.
• Look at Infrastructure / Big Data as a Service centralization
  – Instead of trying to staff 25 Hadoop clusters for 24x7 support, centralize the team and provide QoS back to the applications
12. Data Gravity
• Data sets get bigger over time, and moving them becomes increasingly difficult
  – This leads to switching costs & lock-in
• Data is a strategic asset to enterprises with digital strategies
• Data becomes central: build around it
  – Applications tend to migrate toward the data
  – Apply advanced analytics to the data "in place"
14. THE PROBLEM OF DATA MOVEMENT
• To get statistically relevant results, a typical minimal required data set is about 100 TB.
• That's also the recommended minimal Hadoop cluster size.
• To copy 100 TB over a dedicated 10 GbE link takes about 24 hours.
You need a Data Lake that understands POSIX/Windows and HDFS to avoid data movement (= in-place analytics).
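The 24-hour figure is back-of-the-envelope arithmetic. A minimal sketch, assuming an ideal, fully saturated link with no protocol overhead:

```python
# Hours needed to move a data set across a dedicated network link at line rate.
def transfer_hours(data_tb: float, link_gbps: float) -> float:
    bits = data_tb * 1e12 * 8            # terabytes -> bits
    return bits / (link_gbps * 1e9) / 3600

print(f"{transfer_hours(100, 10):.1f} hours")  # ~22 h ideal; ~24 h with real-world overhead
```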
15. EMC DATA LAKE
[Diagram: departmental server silos (Finance, Marketing, Operations, Sales) running ERP, CRM, and SCM applications consolidate onto an Isilon-based Data Lake, which also serves analytics and mobile applications]
17. Isilon Data Lake Architecture
[Diagram: clients connect over a GB/10GB Ethernet LAN to a scale-out Data Lake of Isilon nodes; the nodes use SAS-attached disk and are interconnected via InfiniBand]
• OneFS integrates RAID, volume manager, and filesystem
• Uses internal disk and spans a single filesystem across disks
• Development started in the 2000s
• Extremely mature, based on FreeBSD
• Supports many access protocols
18. HDFS Implementation as a Protocol
• Multi-threaded daemon runs on all nodes
  – Services both NN and DN protocols
  – Translates HDFS RPCs to POSIX system calls
  – Stateless; the underlying FS handles coherency
[Diagram: on a OneFS node, the isi_hdfs_d daemon handles each request on a thread, issues a syscall through the VFS into OneFS, and sends the response back]
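To make "translates HDFS RPCs to POSIX system calls" concrete, here is a hypothetical Python sketch of the dispatch idea. The real isi_hdfs_d is a native daemon inside OneFS; the operation names and mappings below are illustrative only:

```python
import os

# Hypothetical mapping of a few HDFS NameNode RPCs onto POSIX calls.
# Illustrative only: the real daemon is stateless native code inside OneFS.
def handle_rpc(op: str, path: str):
    if op == "getFileInfo":
        st = os.stat(path)                 # HDFS getFileInfo -> stat(2)
        return {"length": st.st_size, "mtime": st.st_mtime}
    if op == "getListing":
        return os.listdir(path)            # HDFS getListing -> readdir(3)
    if op == "mkdirs":
        os.makedirs(path, exist_ok=True)   # HDFS mkdirs -> mkdir(2)
        return True
    if op == "delete":
        os.unlink(path)                    # HDFS delete -> unlink(2)
        return True
    raise NotImplementedError(op)

print(handle_rpc("getListing", "/tmp"))
```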
19. HDFS IMPLEMENTED LIKE A NAS PROTOCOL
OneFS runs a daemon that speaks NameNode and DataNode natively.
[Diagram: each node in the OneFS clustered filesystem presents both a NameNode and a DataNode; a Hadoop node's DFSClient 1) sends Request("/file") to any node, 2) receives block locations in the response, and 3) issues GetBlock(block) to read the data]
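The three-step exchange in the diagram can be sketched end to end. Everything below is a hypothetical in-memory stand-in (the real client is Hadoop's DFSClient speaking its RPC and streaming protocols); it only illustrates the message order and the fact that every node can play both roles:

```python
from dataclasses import dataclass

# Hypothetical stand-ins for OneFS nodes; every node answers both
# NameNode-style metadata requests and DataNode-style block reads.
@dataclass
class Block:
    block_id: int
    payload: bytes

class OneFSNode:
    def __init__(self, blocks):
        self.blocks = {b.block_id: b for b in blocks}

    def get_block_locations(self, path):       # NameNode role (steps 1 and 2)
        return [(self, bid) for bid in sorted(self.blocks)]

    def get_block(self, block_id):             # DataNode role (step 3)
        return self.blocks[block_id].payload

def dfs_read(any_node, path):
    data = b""
    for node, block_id in any_node.get_block_locations(path):
        data += node.get_block(block_id)
    return data

node = OneFSNode([Block(0, b"hello "), Block(1, b"world")])
print(dfs_read(node, "/file"))  # b'hello world'
```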
23. CLOUDPOOLS
[Diagram: a cloud-enabled Data Lake in which CloudPools tiers data between the data center and a cloud provider based on access time, transparently to apps & users]
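CloudPools is policy-driven tiering, with access time as the trigger in this diagram. A minimal sketch of such a policy (hypothetical code, not the OneFS implementation; the 90-day threshold is an arbitrary example):

```python
import os
import time

# Hypothetical access-time policy: files untouched for `days` get tiered out.
def files_to_tier(root: str, days: int = 90):
    cutoff = time.time() - days * 86400
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            if os.stat(path).st_atime < cutoff:
                yield path  # candidate for movement to the cloud tier

for path in files_to_tier("/data"):
    print("tier to cloud:", path)
```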
24. Parallel Replication
• Designed ground-up for scale-out storage
• Aggregate throughput scales with capacity
• Maintain consistent RPO over growing data sets
• Underlying FS knowledge
  – Snapshot integration
  – Block-level deltas
  – Rich metadata transfer
• Automated data failover/failback
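"Maintain consistent RPO over growing data sets" is the key claim: if aggregate replication throughput scales with the cluster, the time to ship a day's worth of change stays flat as nodes are added. A quick illustration (all figures hypothetical):

```python
# Hypothetical: per-node replication throughput and per-node daily churn are
# both constant, so catch-up time (and hence RPO) stays flat as nodes grow.
PER_NODE_GBPS = 2.0            # replication throughput contributed per node
DAILY_CHANGE_TB_PER_NODE = 1   # data churn per node per day

for nodes in (4, 16, 64):
    delta_bits = nodes * DAILY_CHANGE_TB_PER_NODE * 1e12 * 8
    throughput = nodes * PER_NODE_GBPS * 1e9
    print(f"{nodes} nodes: {delta_bits / throughput / 3600:.1f} h to replicate the daily delta")
```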
25. Storage Considerations

STANDARD HADOOP CLUSTER
• 100 nodes, compute + DAS, 24 TB per node
• Divide by 3 for Hadoop copies: 800 TB usable, but rarely achieved
• 5+ cabinets
• Spill space needed for ingestion and extraction

HADOOP USING EMC ISILON DATA LAKE
• 20 compute nodes + 800 TB Isilon
• Single copy with erasure coding: 800 TB usable
• 1 cabinet
• It is NAS
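The capacity arithmetic behind the comparison, as a minimal sketch using the node counts and sizes from the slide:

```python
# Standard cluster: raw DAS capacity divided by the 3x replication factor.
nodes, tb_per_node, replicas = 100, 24, 3
raw = nodes * tb_per_node
print(f"DAS cluster: {raw} TB raw / {replicas} copies = {raw / replicas:.0f} TB usable")

# Isilon: one erasure-coded copy, so usable capacity tracks provisioned capacity.
print("Isilon data lake: 800 TB provisioned -> 800 TB usable")
```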
26. What is my Idea - 2
• Build a fully functioning cost model that includes all the items you think are "free" but whose costs stop when you change the architecture.
  – Project-based funding is great until you want to centralize. Centralization models (BDaaS) work when you consider all the sundry costs typically excluded by project-based funding (i.e., 24x7 support for each cluster, all-in costs that appear free but are sunk).
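A toy version of such a cost model. Every figure below is a placeholder, loudly so; the structural point is that per-cluster 24x7 staffing scales linearly with cluster count while a centralized BDaaS team does not:

```python
# Hypothetical all-in cost model: N independent clusters vs. one centralized
# BDaaS team. All dollar figures are placeholders for illustration.
CLUSTERS = 25
SUPPORT_PER_CLUSTER = 500_000   # 24x7 on-call staffing per cluster, per year
CENTRAL_TEAM = 3_000_000        # one centralized 24x7 team, per year
HW_PER_CLUSTER = 400_000        # assume hardware cost is the same either way

distributed = CLUSTERS * (SUPPORT_PER_CLUSTER + HW_PER_CLUSTER)
centralized = CENTRAL_TEAM + CLUSTERS * HW_PER_CLUSTER
print(f"project-funded clusters: ${distributed:,}/yr")
print(f"centralized BDaaS:       ${centralized:,}/yr")
```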
27. What is my Idea - 3
• Think about "build it all yourself" vs. "buy"
• Focus on analytics rather than infrastructure implementation, software dependencies, testing, etc.
• That has all been done already with EMC Big Data Systems and Big Data Solutions
• Using pre-validated, installed, and tested solutions reduces complexity and increases reliability.
28. EMC BIG DATA PORTFOLIO
• Data Lake
• Data Lake Extensions
• Cloud Enabled
• Vblock
• VxRack
• VxRail
• Federation Business Data Lake
29. HIGH PERFORMANCE, PREDICTABLE LOW LATENCY
[Diagram: I/O path comparison. Traditional Hadoop walks HDFS through the filesystem, buffer cache, device driver, and SATA controller to disk at roughly 10 ms per HDD access; a PCIe SSD behind the same kernel stack lands at roughly 1000-2000 µs; the DSSD Hadoop plugin bypasses the kernel stack and accesses flash directly over PCIe at under 100 µs]
• 10X throughput
• 1/13th latency
• No application changes required
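Those headline numbers follow from the latencies in the diagram; a quick check (taking a PCIe SSD midpoint, since "1/13th" lines up with roughly 1300 µs):

```python
# Latency ratios implied by the diagram (microseconds).
hdd_us, pcie_ssd_us, dssd_us = 10_000, 1_300, 100
print(f"vs HDD:      {hdd_us / dssd_us:.0f}x lower latency")       # 100x
print(f"vs PCIe SSD: {pcie_ssd_us / dssd_us:.0f}x lower latency")  # ~13x
```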
30. EMC Business Data Lake
[Diagram: the Pivotal Big Data Suite running on VMware vCloud Suite, over an EMC Data Lake Foundation of Isilon + ECS with VCE Vblock | XtremIO | Data Domain. Layers include an open analytics toolbox, a data and analytics catalog, advanced analytics applications at scale, and data processing with Greenplum Database, HAWQ, Spring XD, Pivotal HD, Spark, Redis, RabbitMQ, and GemFire; BDS on Pivotal Cloud Foundry; Hadoop; plus Platform Manager, Data Governor, Data Manager, Ingest Manager, and Analytics Manager components]
See demos at http://www.fbdldemo.com/
31. Thursday, April 14th, 15:00 UTC
Watch out for:
• Hadoop Everywhere: Geo-Distributed Storage for Big Data
Presenters:
• Nikhil Joshi, EMC
• Vishrut Shah, EMC
33. A Remark on Data Locality
• U.C. Berkeley's AMPLab declared data locality dead in 2011
• Cloudera has declared data locality dead in Hadoop 3.0 with HDFS-EC
• Gartner has declared Hadoop dead due to its limits
• Yet Hadoop will only grow, and more will depend on it going forward
• A catalyst may mean that the next time I see you, uptime for Hadoop is your main concern
34. Isilon Scale-Out NAS
Simple to manage
• Single file system, single volume, global namespace
Massively scalable
• Scales from 16 TB to over 50 PB in a single cluster
• 200 GB/s throughput, 3.75M IOPS
Unmatched efficiency
• Over 80% storage utilization, automated tiering, and SmartDedupe
Enterprise data protection
• Efficient backup and disaster recovery, and N+1 thru N+4 redundancy
Robust security and compliance options
• RBAC, Access Zones, WORM data security, File System Auditing
• Data At Rest Encryption with SEDs, STIG hardening
• CAC/PIV smartcard authentication, FIPS OpenSSL support
Operational flexibility
• Multi-protocol support including NFS, SMB, HTTP, FTP, and HDFS
• Object and cloud computing including OpenStack Swift
35. Elastic Cloud Storage (ECS)
Geo-scale
• Geo-replicated and distributed to multiple locations
Massively scalable
• Scales to billions of objects in a single namespace
Support for all file sizes
• Supports individual files of any size
Multi-tenant
HDFS compatible
• Hortonworks-certified HDFS-compatible file system
Swift compatible
• Natively supports OpenStack storage
Native cloud interface
• Natively works with existing cloud protocols like S3 and Azure