1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Hadoop 3.0 in a Nutshell
Munich, Apr. 2017
Sanjay Radia, Junping Du
About Speakers
Sanjay Radia
⬢ Chief Architect, Founder, Hortonworks
⬢ Part of the original Hadoop team at Yahoo! since 2007
– Chief Architect of Hadoop Core at Yahoo!
– Apache Hadoop PMC and Committer
⬢ Prior
– Data center automation, virtualization, Java, HA, OSs, File Systems
– Startup, Sun Microsystems, Inria …
– Ph.D., University of Waterloo
Junping Du
– Apache Hadoop Committer & PMC member
– Lead Software Engineer @ Hortonworks YARN Core Team
– 10+ years developing enterprise software (5+ years as a “Hadooper”)
Why Hadoop 3.0
The Driving Reasons
⬢ A lot of content in trunk that did not make it into the 2.x branch
⬢ JDK upgrade – does not truly require bumping the major version number
⬢ Hadoop command scripts rewrite
⬢ Big features that need a stabilizing major release – erasure codes
Some features taking advantage of 3.0
⬢ YARN: long running services
⬢ Ephemeral ports (incompatible)
Apache Hadoop 3.0
Key Takeaways
⬢ HDFS: Erasure codes
⬢ YARN:
– Long running services
– Scheduler enhancements
– Isolation & Docker
– UI
⬢ Lots of trunk content
⬢ JDK 8 and newer dependencies
Release Timeline
⬢ 3.0.0-alpha1 - Sep/3/2016
⬢ Alpha2 - Jan/25/2017
⬢ Alpha3 - Q2 2017 (Estimated)
⬢ Beta/GA - Q3/Q4 2017 (Estimated)
⬢ Hadoop 3.0 Basics - Major changes you should know before upgrading
– JDK upgrade
– Dependency upgrade
– Change of default ports for daemons/services
– Shell script rewrite
⬢ Features
– Hadoop Common
•Client-Side Classpath Isolation
– HDFS
•Erasure Coding
•Support for more than 2 NameNodes
– YARN
•Support for long running services
•Scheduling enhancements: App / Queue priorities, global scheduling, placement strategies
•New UI
•ATS v2
– MapReduce
•Task-level native optimization HADOOP-11264
Hadoop Operation - JDK Upgrade
⬢ Minimum JDK for Hadoop 3.0.x is JDK 8 (HADOOP-11858)
– Oracle JDK 7 reached EoL in April 2015
⬢ Moving forward to use new features of JDK 8
– Lambda expressions – starting to use these
– Stream API
– Security enhancements
– Performance enhancements for HashMaps, IO/NIO, etc.
⬢ Hadoop’s evolution with JDK upgrades
– Hadoop 2.6.x - JDK 6, 7, 8 or later
– Hadoop 2.7.x/2.8.x/2.9.x - JDK 7, 8 or later
– Hadoop 3.0.x - JDK 8 or later
Dependency Upgrade
⬢ Jersey: 1.9 to 1.19
– A root element whose content is an empty collection is now an empty object ({}) instead of null.
⬢ Grizzly-http-servlet: 2.1.2 to 2.2.21
⬢ Guice: 3.0 to 4.0
⬢ cglib: 2.2 to 3.2.0
⬢ asm: 3.2 to 5.0.4
⬢ netty-all: 4.0.23 to 4.1.x (in discussion)
⬢ Protocol Buffer: 2.5 to 3.x (in discussion)
Change of Default Ports for Hadoop Services
⬢ Previously, the default ports of multiple Hadoop services were in the Linux ephemeral port range (32768-61000)
– Can conflict with other apps running on the same node
⬢ New ports:
– NameNode ports: 50470 → 9871, 50070 → 9870, 8020 → 9820
– Secondary NN ports: 50091 → 9869, 50090 → 9868
– DataNode ports: 50020 → 9867, 50010 → 9866, 50475 → 9865, 50075 → 9864
⬢ KMS service port: 16000 → 9600
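The port changes listed on this slide can be captured in a small lookup table, which may help when auditing firewall rules or client configs before an upgrade. This is only a sketch: the names `PORT_CHANGES`, `in_ephemeral_range` and `new_port` are hypothetical helpers, not part of Hadoop, and the numbers are taken directly from the slide.

```python
# Old default -> new default, as listed on the slide (not read from Hadoop itself).
PORT_CHANGES = {
    "NameNode": {50470: 9871, 50070: 9870, 8020: 9820},
    "Secondary NameNode": {50091: 9869, 50090: 9868},
    "DataNode": {50020: 9867, 50010: 9866, 50475: 9865, 50075: 9864},
    "KMS": {16000: 9600},
}

# Linux ephemeral port range quoted on the slide (inclusive bounds).
EPHEMERAL_RANGE = range(32768, 61001)


def in_ephemeral_range(port: int) -> bool:
    """True if the port falls inside the ephemeral range quoted above."""
    return port in EPHEMERAL_RANGE


def new_port(old_port: int) -> int:
    """Return the new default for an old default port (hypothetical helper)."""
    for mapping in PORT_CHANGES.values():
        if old_port in mapping:
            return mapping[old_port]
    raise KeyError(old_port)


if __name__ == "__main__":
    for service, mapping in PORT_CHANGES.items():
        for old, new in mapping.items():
            flag = "ephemeral" if in_ephemeral_range(old) else "not ephemeral"
            print(f"{service}: {old} ({flag}) -> {new}")
```

Running the script prints each old port, whether it sat in the ephemeral range, and its replacement; none of the new defaults fall inside that range.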
Hadoop Common
⬢ Client-Side Classpath Isolation
Client-side classpath isolation
⬢ Problem
– Application code’s dependencies (including those of Apache Hive or other dependent projects) can conflict with Hadoop’s dependencies
⬢ Solution
– Separate the server-side jar from the client-side jar
•Like hbase-client, dependencies are shaded
[Figure labels: User code, Single Jar File]
⬢ Support for Three NameNodes for HA
⬢ Erasure coding
Current (2.x) HDFS Replication Strategy
⬢ Three replicas by default
– 1st replica on the local node, local rack or a random node
– 2nd and 3rd replicas on the same remote rack
– Reliability: tolerates 2 failures
⬢ Good data locality, local shortcut
⬢ Multiple copies => parallel IO for parallel compute
⬢ Very fast block recovery and node recovery
– Parallel recovery - the bigger the cluster, the faster
– 10 TB node recovery: 30 sec to a few hours
⬢ 3x storage overhead vs. 1.4x-1.6x for erasure coding
– Remember that Hadoop’s JBOD is much, much cheaper
– 1/10 - 1/20 of SANs
– 1/10 - 1/5 of NFS
[Figure: replica placement across Rack I and Rack II]
Erasure Coding
⬢ k data blocks + m parity blocks (k + m)
– Example: Reed-Solomon 6+3
⬢ Reliability: tolerates m failures
⬢ Saves disk space
⬢ Saves I/O bandwidth on the write path
[Figure: 6 data blocks (b1-b6) and 3 parity blocks (P1-P3)]
• 1.5x storage overhead
• Tolerates any 3 failures

                           3-replication    (6, 3) Reed-Solomon
Maximum fault tolerance    2                3
Disk usage                 3N               1.5N
(N bytes of data)
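The numbers in the comparison above follow from simple arithmetic: replication stores every byte once per replica, while a (k, m) code stores k + m blocks for every k blocks of data. A minimal sketch, with hypothetical function names chosen for this example:

```python
def replication_overhead(replicas: int) -> float:
    """Bytes stored per byte of user data with n-way replication."""
    return float(replicas)


def replication_fault_tolerance(replicas: int) -> int:
    """Replicas that can be lost while one copy still survives."""
    return replicas - 1


def ec_overhead(k: int, m: int) -> float:
    """Bytes stored per byte of user data with k data + m parity blocks."""
    return (k + m) / k


def ec_fault_tolerance(k: int, m: int) -> int:
    """A (k, m) code tolerates the loss of any m blocks, as the slide states."""
    return m


if __name__ == "__main__":
    print("3-replication:", replication_overhead(3), replication_fault_tolerance(3))
    print("RS(6, 3):     ", ec_overhead(6, 3), ec_fault_tolerance(6, 3))
```

For N bytes of data this gives 3N versus 1.5N on disk, with tolerance of 2 versus 3 failures, matching the table.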
Block Reconstruction
⬢ Block reconstruction overhead
– Higher network bandwidth cost
– Extra CPU overhead
• Local Reconstruction Codes (LRC), Hitchhiker
[Figure: blocks spread across three racks, with parity blocks P1 and P2]
Huang et al. Erasure Coding in Windows Azure Storage. USENIX ATC'12.
Sathiamoorthy et al. XORing elephants: novel erasure codes for big data. VLDB 2013.
Rashmi et al. A "Hitchhiker's" Guide to Fast and Efficient Data Reconstruction in Erasure-coded Data Centers. SIGCOMM'14.
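To see why reconstruction costs more network bandwidth than re-copying a replica, consider the simplest possible parity code. The sketch below uses a single XOR parity block rather than Reed-Solomon (a deliberate simplification: XOR tolerates only one lost block), but it shows the same effect: rebuilding one block requires reading all the other surviving blocks. The function names are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct(surviving_blocks, parity):
    """Rebuild the single missing data block from all survivors plus the parity."""
    return xor_blocks(list(surviving_blocks) + [parity])


if __name__ == "__main__":
    data = [b"abcd", b"efgh", b"ijkl"]
    parity = xor_blocks(data)
    # Lose the middle block, then rebuild it by reading every other block.
    rebuilt = reconstruct([data[0], data[2]], parity)
    print(rebuilt == data[1])
```

With (6, 3) Reed-Solomon the same principle means reading 6 blocks to rebuild 1, whereas 3-replication only copies 1 block.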
Erasure Coding on Contiguous/Striped Blocks
Two Approaches
⬢ EC on contiguous blocks
– Pros: Better for locality
– Cons: Small files cannot be handled
[Figure: files f1, f2 and f3 stored as data blocks b1-b6, with parity blocks P1-P3]
⬢ EC on striped blocks
– Pros: Leverages multiple disks in parallel
– Pros: Works for small files
– Cons: No data locality for readers
[Figure: cells C1-C6 with parity cells PC1-PC3 form stripe 1, cells C7-C12 with PC4-PC6 form stripe 2, and so on to stripe n, laid out across 6 data blocks (b1-b6) and 3 parity blocks (P1-P3)]
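The striped layout can be made concrete by mapping a logical file offset to a stripe, a data block and an offset inside that block. This is a sketch of round-robin striping in general, not HDFS's actual code; `locate`, `CellLocation` and the 64 KB cell size are assumptions made for the example.

```python
from typing import NamedTuple


class CellLocation(NamedTuple):
    stripe: int           # which stripe (row of cells) the byte falls in
    block_index: int      # which of the k data blocks holds the cell
    offset_in_block: int  # byte offset inside that data block


def locate(offset: int, k: int = 6, cell_size: int = 64 * 1024) -> CellLocation:
    """Map a logical file offset to its position in a round-robin striped layout."""
    cell = offset // cell_size   # global cell number (C1 is cell 0)
    stripe = cell // k           # k data cells per stripe
    block_index = cell % k       # cells rotate across the k data blocks
    offset_in_block = stripe * cell_size + offset % cell_size
    return CellLocation(stripe, block_index, offset_in_block)


if __name__ == "__main__":
    for off in (0, 64 * 1024, 6 * 64 * 1024 + 10):
        print(off, locate(off))
```

Consecutive cells land on different blocks, which is why even a small file is spread over several disks and why a reader rarely finds all of its data on the local node.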
Apache Hadoop’s decision
⬢ Start with striping, to deal with smaller files
⬢ Hadoop 3.0.0 implements Phase 1.1 and Phase 1.2
Erasure Coding Zone
⬢ Create a zone on an empty directory
– Shell command: hdfs erasurecode -createZone [-s <schemaName>] <path>
⬢ All the files under a zone directory are automatically erasure coded
– Renames across zones with different EC schemas are disallowed
Write Pipeline for Replicated Files
⬢ Write pipeline to datanodes
⬢ Durability
– Uses 3 replicas to tolerate a maximum of 2 failures
⬢ Visibility
– Reads are supported for files that are being written
– Data can be made visible by hflush/hsync
⬢ Consistency
– Client can start reading from any replica and fail over to any other replica to read the same data
⬢ Appendable
– Files can be reopened for append
[Figure: data flowing through the write pipeline; DN = DataNode]
Parallel Write for EC Files
⬢ Parallel write
– Client writes to a group of 9 datanodes at the same time
– Parity bits are calculated at the client side, at write time
⬢ Durability
– (6, 3)-Reed-Solomon can tolerate a maximum of 3 failures
⬢ Visibility (same as replicated files)
– Reads are supported for files that are being written
– Data can be made visible by hflush/hsync
⬢ Consistency
– Client can start reading from any 6 of the 9 replicas
– When reading from a datanode fails, the client can fail over to any other remaining replica to read the same data.
⬢ Appendable (same as replicated files)
– Files can be reopened for append
Stripe size: 1 MB
EC: Write Failure Handling
⬢ Datanode failure
– Client ignores the failed datanode and continues writing.
– Able to tolerate 3 failures.
– Require at least 6 datanodes.
– Missing blocks will be reconstructed later.
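The failure rule on this slide reduces to a single comparison: with a group of k + m datanodes, the write can proceed as long as at least k of them are still healthy. A hedged sketch (the function name is hypothetical and this is not the HDFS client's logic, only the arithmetic behind it):

```python
def can_continue_writing(failed: int, k: int = 6, m: int = 3) -> bool:
    """True while at least k of the k + m datanodes in the group are still alive."""
    alive = (k + m) - failed
    return alive >= k


if __name__ == "__main__":
    for failed in range(5):
        print(failed, "failed ->", can_continue_writing(failed))
```

For (6, 3) this allows up to 3 failed datanodes; the blocks they would have held are the ones reconstructed later.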
Slow Writers & Replace Datanode on Failure
⬢ Write pipeline for replicated files
– Datanode can be replaced in case of failure.
⬢ Slow writers
– A write pipeline may last for a long time
– The probability of datanode failures increases over time.
– Need to replace datanode on failure.
⬢ EC files
– Replace-datanode-on-failure is not supported.
– Slow-writer handling is improved
Reading with Parity Blocks
⬢ Parallel read
– Read from 6 Datanodes with data blocks
– Support both stateful read and pread
⬢ Block reconstruction
– Read parity blocks to reconstruct missing blocks
Network traffic – Need good network bandwidth
⬢ Pros
– Low latency because of parallel write/read
– Good for small-size files
⬢ Cons
– Requires high network bandwidth between client and server
– Higher reconstruction cost
– A dead DataNode implies high network traffic and reconstruction time

Workload        3-replication         (6, 3) Reed-Solomon
Read 1 block    1 LN                  1/6 LN + 5/6 RR
Write           1 LN + 1 LR + 1 RR    1/6 LN + 1/6 LR + 7/6 RR

LN: Local Node
LR: Local Rack
RR: Remote Rack
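The Reed-Solomon column of the table can be reproduced with exact fractions. The sketch assumes, as the table implies, that of the k + m blocks in a group one sits on the local node, one on another node in the local rack, and the rest on remote racks; the function names are hypothetical.

```python
from fractions import Fraction


def ec_read_traffic(k: int = 6) -> dict:
    """Traffic to read one block's worth of data: 1/k is local, the rest is remote."""
    unit = Fraction(1, k)
    return {"LN": unit, "RR": (k - 1) * unit}


def ec_write_traffic(k: int = 6, m: int = 3) -> dict:
    """Traffic to write one block's worth of data, in units of one block."""
    unit = Fraction(1, k)  # each of the k + m blocks receives 1/k of the data volume
    return {"LN": unit, "LR": unit, "RR": (k + m - 2) * unit}


if __name__ == "__main__":
    print("read :", ec_read_traffic())
    print("write:", ec_write_traffic(), "total", sum(ec_write_traffic().values()))
```

The write total is 3/2 of a block, compared with 3 blocks for 3-replication, but most of it crosses racks, which is why the slide stresses network bandwidth.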
⬢ YARN Scheduling Enhancements
⬢ Support for Long Running Services
⬢ Re-architecture for YARN Timeline Service - ATS v2
⬢ Better elasticity and resource utilization
⬢ Better resource isolation and Docker!!
⬢ Better User Experiences
⬢ Other Enhancements
Scheduling Enhancements
⬢ Application priorities within a queue: YARN-1963
– In Queue A, App1 > App2
⬢ Inter-queue priorities
– Q1 > Q2 irrespective of demand / capacity
– Previously based on unconsumed capacity
⬢ Affinity / anti-affinity: YARN-1042
– More constraints on placement locations
⬢ Global scheduling: YARN-5139
– Gets rid of scheduling triggered on node heartbeat
– Replaced with a global scheduler that has parallel threads
• Globally optimal placement
• Critical for long running services – they stick to the allocation, so it had better be a good one
• Enhanced container scheduling throughput (6x)
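Application priorities within a queue can be pictured as a simple ordering rule. The sketch below is not YARN's scheduler; it assumes that a larger integer means higher priority and that ties are broken by submission order, and `scheduling_order` is a hypothetical name.

```python
def scheduling_order(apps):
    """Order apps for scheduling within one queue.

    apps: list of (name, priority) tuples in submission order.
    Assumption: larger priority values are served first; ties keep FIFO order.
    """
    indexed = list(enumerate(apps))
    indexed.sort(key=lambda item: (-item[1][1], item[0]))
    return [name for _, (name, _priority) in indexed]


if __name__ == "__main__":
    queue_a = [("App2", 1), ("App1", 5), ("App3", 1)]
    print(scheduling_order(queue_a))
```

In the slide's example, App1 has the higher priority in Queue A, so it is considered before App2 regardless of submission order.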
Key Drivers for Long Running Services
⬢ Consolidation of infrastructure
– Hadoop clusters have a lot of compute and storage resources (some unused)
– Can’t I use Hadoop’s resources for non-Hadoop load?
– OpenStack is hard to run, can I use YARN?
– But does it support Docker? – yes, we heard you
⬢ Hadoop-related data services that run outside a Hadoop cluster
– Why can’t I run them in the Hadoop cluster?
⬢ Run Hadoop services (Hive, HBase) on YARN
– Run multiple instances
– Benefit from YARN’s elasticity and resource management
Built-in support for long running services in YARN
⬢ A native YARN framework: YARN-4692
– Abstracts a common framework (similar to Slider) to support long running services
– Simplified API (to manage the service lifecycle)
⬢ Better support for long running services
– Recognition of long running services
• Affects the policies for preemption, container reservation, etc.
– Auto-restart of containers
• Containers for long running services are retried on the same node when they have local state
– Service/application upgrade support – YARN-4726
• In general, services are expected to run long enough to cross versions
– Dynamic container configuration
• Ask only for just enough resources, then adjust them at runtime (memory is harder)
Discovery services in YARN
⬢ Services can run on any YARN node; how do I get a service's IP?
– It can also move due to node failure
⬢ YARN Service Discovery via DNS: YARN-4757
– Expose existing service information in the YARN registry via DNS
• Current YARN service registry records will be converted into DNS entries
– Discovery of container IP and service port via standard DNS lookups.
• Application
– ->
• Container
– Container ->
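Because discovery goes through standard DNS, a client needs nothing YARN-specific to find a service: an ordinary resolver call is enough. A minimal sketch using Python's standard library; the function name and the example hostname are hypothetical, and the actual naming scheme for YARN DNS records is not shown in this transcript.

```python
import socket


def resolve_service(hostname: str, port: int):
    """Return the sorted set of IP addresses a DNS name resolves to."""
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # "localhost" stands in for a service name published through the YARN registry.
    print(resolve_service("localhost", 8080))
```

If a container moves after a node failure, the same lookup would return the new address once the DNS record is updated.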
A More Powerful YARN
⬢ Elastic Resource Model
–Dynamic Resource Configuration
•Allows tuning an NM’s resources up or down at runtime
–Graceful decommissioning of NodeManagers
•Drains a node that’s being decommissioned to allow running containers to
⬢ Efficient Resource Utilization
–Support for container resizing
•Allows applications to change the size of an existing container
More Powerful YARN (Contd.)
⬢ Resource Isolation
–Resource isolation support for disk and network
•YARN-2619 (disk), YARN-2140 (network)
•Containers get a fair share of disk and network resources using Cgroups
–Docker support in LinuxContainerExecutor
•Support to launch Docker containers alongside process
•Packaging and resource isolation
• Complements YARN’s support for long running services
Docker on YARN & YARN on YARN - YCloud
[Figure: Hadoop apps (MR, Tez, Spark) and TensorFlow on YARN; a YARN instance on YARN running MR, Tez and Spark]
Can use YARN to test Hadoop!!
YARN New UI (YARN-3368)
Timeline Service Revolution – Why ATS v2
⬢ Scalability & Performance
v1 limitation:
–Single global instance of writer/reader
–Local disk based LevelDB storage
⬢ Usability
–Handle flows as first-class concepts and
model aggregation
–Add configuration and metrics as first-class
–Better support for queries
⬢ Reliability
v1 limitation:
–Data is stored in a local disk
–Single point of failure (SPOF) for timeline
⬢ Flexibility
–Data model is more descriptive
–Extensible with app-specific information
Core Design for ATS v2
⬢ Distributed write path
– Logical per app collector + physical per
node writer
– Collector/Writer launched as an auxiliary
service in NM.
– Standalone writers will be added later.
⬢ Pluggable backend storage
– Built in with a scalable and reliable
implementation (HBase)
⬢ Enhanced data model
– Entity (bi-directional relation) with flow,
queue, etc.
– Configuration, Metric, Event, etc.
⬢ Separate reader instances
⬢ Aggregation & Accumulation
– Aggregation: rolling up the metric values to the
•Online aggregation for apps and flow runs
•Offline aggregation for users, flows and
– Accumulation: rolling up the metric values
across time interval
•Accumulated resource consumption for app,
flow, etc.
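The difference between aggregation and accumulation is easiest to see on toy data. The sketch below is not ATS v2 code: `aggregate` rolls container-level values up to their application, and `accumulate` integrates one metric over time (value multiplied by how long it was held). Both names and the sample data are assumptions for illustration.

```python
from collections import defaultdict


def aggregate(container_metrics):
    """Roll up (app_id, value) pairs from containers into per-app totals."""
    totals = defaultdict(int)
    for app_id, value in container_metrics:
        totals[app_id] += value
    return dict(totals)


def accumulate(samples):
    """Integrate a metric over time.

    samples: list of (timestamp_seconds, value), sorted by time. Each value is
    assumed to hold until the next sample, so the result is in value-seconds.
    """
    total = 0
    for (t0, v0), (t1, _next_value) in zip(samples, samples[1:]):
        total += v0 * (t1 - t0)
    return total


if __name__ == "__main__":
    print(aggregate([("app1", 2), ("app1", 3), ("app2", 1)]))
    # Memory in MB sampled at t=0s, 10s and 30s gives MB-seconds consumed.
    print(accumulate([(0, 1024), (10, 2048), (30, 0)]))
```

Aggregation answers "how much is this app using right now", while accumulation answers "how much has it consumed so far".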
Other YARN work planned in Hadoop 3.X
⬢ Resource profiles
–Users can specify resource profile name instead of individual resources
–Resource types read via a config file
⬢ YARN federation
–Allows YARN to scale out to tens of thousands of nodes
–Cluster of clusters which appear as a single cluster to an end user
⬢ Gang Scheduling
Thank you!
Reminder: BoFs on Thursday at 5:50pm

More Related Content

What's hot

File Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & ParquetFile Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & Parquet
DataWorks Summit/Hadoop Summit
Apache Eagle - Monitor Hadoop in Real Time
Apache Eagle - Monitor Hadoop in Real TimeApache Eagle - Monitor Hadoop in Real Time
Apache Eagle - Monitor Hadoop in Real Time
DataWorks Summit/Hadoop Summit
Lessons learned from running Spark on Docker
Lessons learned from running Spark on DockerLessons learned from running Spark on Docker
Lessons learned from running Spark on Docker
DataWorks Summit
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
DataWorks Summit/Hadoop Summit
From Zero to Data Flow in Hours with Apache NiFi
From Zero to Data Flow in Hours with Apache NiFiFrom Zero to Data Flow in Hours with Apache NiFi
From Zero to Data Flow in Hours with Apache NiFi
DataWorks Summit/Hadoop Summit
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
DataWorks Summit/Hadoop Summit
Row/Column- Level Security in SQL for Apache Spark
Row/Column- Level Security in SQL for Apache SparkRow/Column- Level Security in SQL for Apache Spark
Row/Column- Level Security in SQL for Apache Spark
DataWorks Summit/Hadoop Summit
Evolving HDFS to a Generalized Storage Subsystem
Evolving HDFS to a Generalized Storage SubsystemEvolving HDFS to a Generalized Storage Subsystem
Evolving HDFS to a Generalized Storage Subsystem
DataWorks Summit/Hadoop Summit
Schema Registry - Set Your Data Free
Schema Registry - Set Your Data FreeSchema Registry - Set Your Data Free
Schema Registry - Set Your Data Free
DataWorks Summit
Data Highway Rainbow - Petabyte Scale Event Collection, Transport & Delivery ...
Data Highway Rainbow - Petabyte Scale Event Collection, Transport & Delivery ...Data Highway Rainbow - Petabyte Scale Event Collection, Transport & Delivery ...
Data Highway Rainbow - Petabyte Scale Event Collection, Transport & Delivery ...
DataWorks Summit
Real-time Hadoop: The Ideal Messaging System for Hadoop
Real-time Hadoop: The Ideal Messaging System for Hadoop Real-time Hadoop: The Ideal Messaging System for Hadoop
Real-time Hadoop: The Ideal Messaging System for Hadoop
DataWorks Summit/Hadoop Summit
Apache Hadoop 3.0 What's new in YARN and MapReduce
Apache Hadoop 3.0 What's new in YARN and MapReduceApache Hadoop 3.0 What's new in YARN and MapReduce
Apache Hadoop 3.0 What's new in YARN and MapReduce
DataWorks Summit/Hadoop Summit
Curb your insecurity with HDP
Curb your insecurity with HDPCurb your insecurity with HDP
Curb your insecurity with HDP
DataWorks Summit/Hadoop Summit
Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community Update
DataWorks Summit
YARN Federation
YARN Federation YARN Federation
YARN Ready: Apache Spark
YARN Ready: Apache Spark YARN Ready: Apache Spark
YARN Ready: Apache Spark
Interactive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidInteractive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using Druid
DataWorks Summit/Hadoop Summit
Efficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and ArrowEfficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and Arrow
DataWorks Summit/Hadoop Summit
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
DataWorks Summit/Hadoop Summit
Streamline Hadoop DevOps with Apache Ambari
Streamline Hadoop DevOps with Apache AmbariStreamline Hadoop DevOps with Apache Ambari
Streamline Hadoop DevOps with Apache Ambari
DataWorks Summit/Hadoop Summit

What's hot (20)

File Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & ParquetFile Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & Parquet
Apache Eagle - Monitor Hadoop in Real Time
Apache Eagle - Monitor Hadoop in Real TimeApache Eagle - Monitor Hadoop in Real Time
Apache Eagle - Monitor Hadoop in Real Time
Lessons learned from running Spark on Docker
Lessons learned from running Spark on DockerLessons learned from running Spark on Docker
Lessons learned from running Spark on Docker
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
From Zero to Data Flow in Hours with Apache NiFi
From Zero to Data Flow in Hours with Apache NiFiFrom Zero to Data Flow in Hours with Apache NiFi
From Zero to Data Flow in Hours with Apache NiFi
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
Near Real-Time Network Anomaly Detection and Traffic Analysis using Spark bas...
Row/Column- Level Security in SQL for Apache Spark
Row/Column- Level Security in SQL for Apache SparkRow/Column- Level Security in SQL for Apache Spark
Row/Column- Level Security in SQL for Apache Spark
Evolving HDFS to a Generalized Storage Subsystem
Evolving HDFS to a Generalized Storage SubsystemEvolving HDFS to a Generalized Storage Subsystem
Evolving HDFS to a Generalized Storage Subsystem
Schema Registry - Set Your Data Free
Schema Registry - Set Your Data FreeSchema Registry - Set Your Data Free
Schema Registry - Set Your Data Free
Data Highway Rainbow - Petabyte Scale Event Collection, Transport & Delivery ...
Data Highway Rainbow - Petabyte Scale Event Collection, Transport & Delivery ...Data Highway Rainbow - Petabyte Scale Event Collection, Transport & Delivery ...
Data Highway Rainbow - Petabyte Scale Event Collection, Transport & Delivery ...
Real-time Hadoop: The Ideal Messaging System for Hadoop
Real-time Hadoop: The Ideal Messaging System for Hadoop Real-time Hadoop: The Ideal Messaging System for Hadoop
Real-time Hadoop: The Ideal Messaging System for Hadoop
Apache Hadoop 3.0 What's new in YARN and MapReduce
Apache Hadoop 3.0 What's new in YARN and MapReduceApache Hadoop 3.0 What's new in YARN and MapReduce
Apache Hadoop 3.0 What's new in YARN and MapReduce
Curb your insecurity with HDP
Curb your insecurity with HDPCurb your insecurity with HDP
Curb your insecurity with HDP
Apache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community UpdateApache Hadoop 3.0 Community Update
Apache Hadoop 3.0 Community Update
YARN Federation
YARN Federation YARN Federation
YARN Federation
YARN Ready: Apache Spark
YARN Ready: Apache Spark YARN Ready: Apache Spark
YARN Ready: Apache Spark
Interactive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidInteractive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using Druid
Efficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and ArrowEfficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and Arrow
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
Streamline Hadoop DevOps with Apache Ambari
Streamline Hadoop DevOps with Apache AmbariStreamline Hadoop DevOps with Apache Ambari
Streamline Hadoop DevOps with Apache Ambari

Viewers also liked

Supporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big DataSupporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big Data
WANdisco Plc
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Spark Summit
Cassandra and Spark: Optimizing for Data Locality
Cassandra and Spark: Optimizing for Data LocalityCassandra and Spark: Optimizing for Data Locality
Cassandra and Spark: Optimizing for Data Locality
Russell Spitzer
Udo Seidel
Performance comparison of Distributed File Systems on 1Gbit networks
Performance comparison of Distributed File Systems on 1Gbit networksPerformance comparison of Distributed File Systems on 1Gbit networks
Performance comparison of Distributed File Systems on 1Gbit networks
Marian Marinov
Automatic Detection, Classification and Authorization of Sensitive Personal D...
Automatic Detection, Classification and Authorization of Sensitive Personal D...Automatic Detection, Classification and Authorization of Sensitive Personal D...
Automatic Detection, Classification and Authorization of Sensitive Personal D...
DataWorks Summit/Hadoop Summit
Big Data in Azure
Big Data in AzureBig Data in Azure
Apache Hadoop YARN: Past, Present and Future
Apache Hadoop YARN: Past, Present and FutureApache Hadoop YARN: Past, Present and Future
Apache Hadoop YARN: Past, Present and Future
DataWorks Summit/Hadoop Summit
MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron
MaaS (Model as a Service): Modern Streaming Data Science with Apache MetronMaaS (Model as a Service): Modern Streaming Data Science with Apache Metron
MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron
DataWorks Summit
Solving Cyber at Scale
Solving Cyber at ScaleSolving Cyber at Scale
Solving Cyber at Scale
DataWorks Summit/Hadoop Summit
Best Practices for Enterprise User Management in Hadoop Environment
Best Practices for Enterprise User Management in Hadoop EnvironmentBest Practices for Enterprise User Management in Hadoop Environment
Best Practices for Enterprise User Management in Hadoop Environment
DataWorks Summit/Hadoop Summit
File Format Benchmark - Avro, JSON, ORC and Parquet
File Format Benchmark - Avro, JSON, ORC and ParquetFile Format Benchmark - Avro, JSON, ORC and Parquet
File Format Benchmark - Avro, JSON, ORC and Parquet
DataWorks Summit/Hadoop Summit
Dancing Elephants - Efficiently Working with Object Stories from Apache Spark...
Dancing Elephants - Efficiently Working with Object Stories from Apache Spark...Dancing Elephants - Efficiently Working with Object Stories from Apache Spark...
Dancing Elephants - Efficiently Working with Object Stories from Apache Spark...
DataWorks Summit/Hadoop Summit
Hadoop scalability
Hadoop scalabilityHadoop scalability
Hadoop scalability
WANdisco Plc
Running Services on YARN
Running Services on YARNRunning Services on YARN
Running Services on YARN
DataWorks Summit/Hadoop Summit
Apache Metron: Community Driven Cyber Security
Apache Metron: Community Driven Cyber Security Apache Metron: Community Driven Cyber Security
Apache Metron: Community Driven Cyber Security
DataWorks Summit/Hadoop Summit
Bringing it All Together: Apache Metron (Incubating) as a Case Study of a Mod...
Bringing it All Together: Apache Metron (Incubating) as a Case Study of a Mod...Bringing it All Together: Apache Metron (Incubating) as a Case Study of a Mod...
Bringing it All Together: Apache Metron (Incubating) as a Case Study of a Mod...
DataWorks Summit
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
DataWorks Summit/Hadoop Summit
Apache Kafka Best Practices
Apache Kafka Best PracticesApache Kafka Best Practices
Apache Kafka Best Practices
DataWorks Summit/Hadoop Summit

Viewers also liked (19)

Supporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big DataSupporting Financial Services with a More Flexible Approach to Big Data
Supporting Financial Services with a More Flexible Approach to Big Data
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Cassandra and Spark: Optimizing for Data Locality-(Russell Spitzer, DataStax)
Cassandra and Spark: Optimizing for Data Locality
Cassandra and Spark: Optimizing for Data LocalityCassandra and Spark: Optimizing for Data Locality
Cassandra and Spark: Optimizing for Data Locality
Performance comparison of Distributed File Systems on 1Gbit networks
Performance comparison of Distributed File Systems on 1Gbit networksPerformance comparison of Distributed File Systems on 1Gbit networks
Performance comparison of Distributed File Systems on 1Gbit networks
Automatic Detection, Classification and Authorization of Sensitive Personal D...
Automatic Detection, Classification and Authorization of Sensitive Personal D...Automatic Detection, Classification and Authorization of Sensitive Personal D...
Automatic Detection, Classification and Authorization of Sensitive Personal D...
Big Data in Azure
Big Data in AzureBig Data in Azure
Big Data in Azure
Apache Hadoop YARN: Past, Present and Future
Apache Hadoop YARN: Past, Present and FutureApache Hadoop YARN: Past, Present and Future
Apache Hadoop YARN: Past, Present and Future
MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron
MaaS (Model as a Service): Modern Streaming Data Science with Apache MetronMaaS (Model as a Service): Modern Streaming Data Science with Apache Metron
MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron
Solving Cyber at Scale
Solving Cyber at ScaleSolving Cyber at Scale
Solving Cyber at Scale
Best Practices for Enterprise User Management in Hadoop Environment
Best Practices for Enterprise User Management in Hadoop EnvironmentBest Practices for Enterprise User Management in Hadoop Environment
Best Practices for Enterprise User Management in Hadoop Environment
File Format Benchmark - Avro, JSON, ORC and Parquet
File Format Benchmark - Avro, JSON, ORC and ParquetFile Format Benchmark - Avro, JSON, ORC and Parquet
File Format Benchmark - Avro, JSON, ORC and Parquet
Dancing Elephants - Efficiently Working with Object Stories from Apache Spark...
Dancing Elephants - Efficiently Working with Object Stories from Apache Spark...Dancing Elephants - Efficiently Working with Object Stories from Apache Spark...
Dancing Elephants - Efficiently Working with Object Stories from Apache Spark...
Hadoop scalability
Hadoop scalabilityHadoop scalability
Hadoop scalability
Running Services on YARN
Running Services on YARNRunning Services on YARN
Running Services on YARN
Apache Metron: Community Driven Cyber Security
Apache Metron: Community Driven Cyber Security Apache Metron: Community Driven Cyber Security
Apache Metron: Community Driven Cyber Security
Bringing it All Together: Apache Metron (Incubating) as a Case Study of a Mod...
Bringing it All Together: Apache Metron (Incubating) as a Case Study of a Mod...Bringing it All Together: Apache Metron (Incubating) as a Case Study of a Mod...
Bringing it All Together: Apache Metron (Incubating) as a Case Study of a Mod...
Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
Apache Kafka Best Practices
Apache Kafka Best PracticesApache Kafka Best Practices
Apache Kafka Best Practices

Hadoop 3 in a Nutshell

  • 1. Apache Hadoop 3.0 in a Nutshell. Munich, Apr. 2017. Sanjay Radia, Junping Du. © Hortonworks Inc. 2011 – 2016. All Rights Reserved
  • 2. About Speakers
      Sanjay Radia
      ⬢ Chief Architect, Founder, Hortonworks
      ⬢ Part of the original Hadoop team at Yahoo! since 2007
        – Chief Architect of Hadoop Core at Yahoo!
        – Apache Hadoop PMC and Committer
      ⬢ Prior
        – Data center automation, virtualization, Java, HA, OSs, File Systems
        – Startup, Sun Microsystems, Inria …
        – Ph.D., University of Waterloo
      Junping Du
        – Apache Hadoop Committer & PMC member
        – Lead Software Engineer @ Hortonworks YARN Core Team
        – 10+ years developing enterprise software (5+ years as a “Hadooper”)
  • 3. Why Hadoop 3.0
      (Slide columns: “The Driving Reasons” and “Some features taking advantage of 3.0”)
      ⬢ A lot of content in trunk that did not make it into the 2.x branch
      ⬢ JDK upgrade – does not truly require bumping the major number
      ⬢ Hadoop command scripts rewrite (incompatible)
      ⬢ Big features that need a stabilizing major release – Erasure codes
      ⬢ YARN: long running services
      ⬢ Ephemeral ports (incompatible)
  • 4. Apache Hadoop 3.0
      Key Takeaways
      ⬢ HDFS: Erasure codes
      ⬢ YARN:
        – Long running services
        – Scheduler enhancements
        – Isolation & Docker
        – UI
      ⬢ Lots of trunk content
      ⬢ JDK8 and newer dependent libraries
      Release Timeline
      ⬢ 3.0.0-alpha1 - Sep/3/2016
      ⬢ Alpha2 - Jan/25/2017
      ⬢ Alpha3 - Q2 2017 (Estimated)
      ⬢ Beta/GA - Q3/Q4 2017 (Estimated)
  • 5. Agenda
      ⬢ Hadoop 3.0 Basis - Major changes you should know before upgrade
        – JDK upgrade
        – Dependency upgrade
        – Change on default port for daemon/services
        – Shell script rewrite
      ⬢ Features
        – Hadoop Common
          •Client-Side Classpath Isolation
        – HDFS
          •Erasure Coding
          •Support for more than 2 NameNodes
        – YARN
          •Support for long running services
          •Scheduling enhancements: App / Queue Priorities, global scheduling, placement strategies
          •New UI
          •ATS v2
        – MAPREDUCE
          •Task-level native optimization HADOOP-11264
  • 6. Hadoop Operation - JDK Upgrade
      ⬢ Minimum JDK for Hadoop 3.0.x is JDK8 (HADOOP-11858)
        – Oracle JDK 7 reached EoL in April 2015
      ⬢ Moving forward to use new features of JDK8
        – Lambda expressions – starting to use these
        – Stream API
        – Security enhancements
        – Performance enhancements for HashMaps, IO/NIO, etc.
      ⬢ Hadoop’s evolution with JDK upgrades
        – Hadoop 2.6.x - JDK 6, 7, 8 or later
        – Hadoop 2.7.x/2.8.x/2.9.x - JDK 7, 8 or later
        – Hadoop 3.0.x - JDK 8 or later
  • 7. Dependency Upgrade
      ⬢ Jersey: 1.9 to 1.19
        – A root element whose content is an empty collection changes from null to an empty object ({})
      ⬢ Grizzly-http-servlet: 2.1.2 to 2.2.21
      ⬢ Guice: 3.0 to 4.0
      ⬢ cglib: 2.2 to 3.2.0
      ⬢ asm: 3.2 to 5.0.4
      ⬢ netty-all: 4.0.23 to 4.1.x (in discussion)
      ⬢ Protocol Buffer: 2.5 to 3.x (in discussion)
  • 8. Change of Default Ports for Hadoop Services
      ⬢ Previously, the default ports of multiple Hadoop services were in the Linux ephemeral port range (32768-61000)
        – Can conflict with other apps running on the same node
      ⬢ New ports:
        – NameNode ports: 50470 → 9871, 50070 → 9870, 8020 → 9820
        – Secondary NN ports: 50091 → 9869, 50090 → 9868
        – DataNode ports: 50020 → 9867, 50010 → 9866, 50475 → 9865, 50075 → 9864
      ⬢ KMS service port: 16000 → 9600
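The port changes on this slide can be captured as a small lookup table, which may help when auditing firewall rules or configs during an upgrade. The following is a minimal sketch, not a Hadoop tool: the port numbers and the ephemeral range are copied from the slide, while the names `PORT_CHANGES`, `in_ephemeral_range`, and `stale_ports` are hypothetical.

```python
# Linux ephemeral port range as quoted on the slide (inclusive).
EPHEMERAL_RANGE = range(32768, 61001)

# Old default port -> new default port, copied from the slide.
PORT_CHANGES = {
    "NameNode": {50470: 9871, 50070: 9870, 8020: 9820},
    "Secondary NameNode": {50091: 9869, 50090: 9868},
    "DataNode": {50020: 9867, 50010: 9866, 50475: 9865, 50075: 9864},
    "KMS": {16000: 9600},
}


def in_ephemeral_range(port):
    """True if the port falls inside the ephemeral range quoted above."""
    return port in EPHEMERAL_RANGE


def stale_ports(ports_in_use):
    """Return (service, old_port, new_port) for each old default found in ports_in_use."""
    found = []
    for service, mapping in PORT_CHANGES.items():
        for old, new in mapping.items():
            if old in ports_in_use:
                found.append((service, old, new))
    return found


if __name__ == "__main__":
    # Hypothetical list of ports referenced by an existing firewall rule set.
    for service, old, new in stale_ports({50070, 50010, 22}):
        print(f"{service}: {old} -> {new}")
```

Every new port in the table sits below the ephemeral range, which is the point of the change.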
  • 9. Hadoop Common
      ⬢ Client-Side Classpath Isolation
  • 10. Client-side classpath isolation (HADOOP-11656/HADOOP-13070)
      ⬢ Problem
        – Application code’s dependencies (including Apache Hive or dependent projects) can conflict with Hadoop’s dependencies
      ⬢ Solution
        – Separate the server-side jar from the client-side jar
          •Like hbase-client, dependencies are shaded
      [Diagram: with a single jar file, the Hadoop client and server carry older commons, which conflicts with user code that uses newer commons. With the shaded hadoop-client, the server’s older commons and the user code’s newer commons can coexist.]
  • 11. HDFS
      ⬢ Support for Three NameNodes for HA
      ⬢ Erasure coding
  • 12. Current (2.x) HDFS Replication Strategy
      ⬢ Three replicas by default
        – 1st replica on local node, local rack or random node
        – 2nd and 3rd replicas on the same remote rack
        – Reliability: tolerate 2 failures
      ⬢ Good data locality, local shortcut
      ⬢ Multiple copies => parallel IO for parallel compute
      ⬢ Very fast block recovery and node recovery
        – Parallel recovery - the bigger the cluster, the faster
        – 10TB node recovery: 30 sec to a few hours
      ⬢ 3x storage overhead vs 1.4x-1.6x for erasure coding
        – Remember that Hadoop’s JBOD is much, much cheaper
        – 1/10 - 1/20 of SANs
        – 1/10 – 1/5 of NFS
      [Diagram: replica r1 on a DataNode in Rack I; replicas r2 and r3 on DataNodes in Rack II]
  • 13. Erasure Coding ⬢ k data blocks + m parity blocks (k + m) – Example: Reed-Solomon 6+3 ⬢ Reliability: tolerate m failures ⬢ Save disk space ⬢ Save I/O bandwidth on the write path [Diagram: 6 data blocks b1-b6 and 3 parity blocks P1-P3; 1.5x storage overhead; tolerates any 3 failures] Comparison: 3-replication has maximum fault tolerance 2 and disk usage 3N for N bytes of data; (6, 3) Reed-Solomon has maximum fault tolerance 3 and disk usage 1.5N.
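The comparison on this slide follows directly from the k + m layout: an erasure-coded group stores k + m blocks for every k blocks of data and survives the loss of any m of them, while n-way replication stores n copies and survives n - 1 losses. A minimal sketch of that arithmetic, with hypothetical function names:

```python
from fractions import Fraction


def replication_profile(replicas):
    """Fault tolerance and storage factor for plain n-way replication."""
    return {
        "max_failures": replicas - 1,
        "storage_factor": Fraction(replicas),
    }


def erasure_coding_profile(k, m):
    """Fault tolerance and storage factor for k data + m parity blocks."""
    return {
        "max_failures": m,
        "storage_factor": Fraction(k + m, k),
    }


if __name__ == "__main__":
    print("3-replication:", replication_profile(3))
    print("(6, 3) Reed-Solomon:", erasure_coding_profile(6, 3))
```

With k = 6 and m = 3 the storage factor is 9/6 = 1.5, which matches the 1.5N disk usage quoted on the slide.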
  • 14. Block Reconstruction ⬢ Block reconstruction overhead – Higher network bandwidth cost – Extra CPU overhead • Local Reconstruction Codes (LRC), Hitchhiker [Diagram: data blocks b1-b6 and parity blocks P1-P3, each placed on a separate rack] References: Huang et al. Erasure Coding in Windows Azure Storage. USENIX ATC'12. Sathiamoorthy et al. XORing elephants: novel erasure codes for big data. VLDB 2013. Rashmi et al. A "Hitchhiker's" Guide to Fast and Efficient Data Reconstruction in Erasure-coded Data Centers. SIGCOMM'14.
  • 15. Erasure Coding on Contiguous/Striped Blocks: Two Approaches ⬢ EC on striped blocks – Pros: Leverages multiple disks in parallel – Pros: Works for small files – Cons: No data locality for readers ⬢ EC on contiguous blocks – Pros: Better for locality – Cons: Small files cannot be handled [Diagram, striped: cells C1-C6 with parity cells PC1-PC3 form stripe 1, cells C7-C12 with PC4-PC6 form stripe 2, and so on up to stripe n, laid across 6 data blocks b1-b6 and 3 parity blocks P1-P3] [Diagram, contiguous: data blocks b1-b6 holding files f1, f2 and f3, with parity blocks P1-P3]
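The striped diagram (C1-C6 in stripe 1, C7-C12 in stripe 2) implies that a file's bytes are cut into fixed-size cells and dealt round-robin across the data blocks. The sketch below maps a logical file offset to its stripe, data block and offset inside that block under that assumption. The function name and the `cell_size` parameter are hypothetical, and the cell size used in the example is chosen only to keep the numbers small.

```python
def locate_in_striped_layout(offset, cell_size, data_blocks=6):
    """Map a logical byte offset to (stripe, data block index, offset in block).

    Assumes cells are placed round-robin across the data blocks, as in the
    slide's diagram; indexes are zero-based.
    """
    if offset < 0 or cell_size <= 0 or data_blocks <= 0:
        raise ValueError("offset must be >= 0; cell_size and data_blocks > 0")
    cell_index = offset // cell_size       # which cell holds this byte
    stripe = cell_index // data_blocks     # which row of cells
    block = cell_index % data_blocks       # which data block in the group
    offset_in_block = stripe * cell_size + offset % cell_size
    return stripe, block, offset_in_block


if __name__ == "__main__":
    # Toy cell size of 4 bytes: byte 27 lives in the 7th cell (C7),
    # which is the first cell of the second stripe, on the first data block.
    print(locate_in_striped_layout(27, cell_size=4))
```

Because consecutive cells land on different blocks, even a small file is spread over several DataNodes, which is why striping works for small files but gives up reader locality.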
  • 16. Apache Hadoop's decision ⬢ Starting from striping to deal with smaller files ⬢ Hadoop 3.0.0 implements Phase 1.1 and Phase 1.2
  • 17. Erasure Coding Zone ⬢ Create a zone on an empty directory – Shell command: hdfs erasurecode -createZone [-s <schemaName>] <path> ⬢ All the files under a zone directory are automatically erasure coded – Renames across zones with different EC schemas are disallowed
  • 18. Write Pipeline for Replicated Files ⬢ Write pipeline to datanodes ⬢ Durability – Use 3 replicas to tolerate a maximum of 2 failures ⬢ Visibility – Reading is supported for files that are still being written – Data can be made visible by hflush/hsync ⬢ Consistency – Client can start reading from any replica and fail over to any other replica to read the same data ⬢ Appendable – Files can be reopened for append [Diagram: Writer sends data through the pipeline DN1, DN2, DN3, and acks flow back to the Writer; DN = DataNode]
  • 19. Parallel Write for EC Files ⬢ Parallel write – Client writes to a group of 9 datanodes at the same time – Parity bits are calculated on the client side, at write time ⬢ Durability – (6, 3) Reed-Solomon can tolerate a maximum of 3 failures ⬢ Visibility (same as replicated files) – Reading is supported for files that are still being written – Data can be made visible by hflush/hsync ⬢ Consistency – Client can start reading from any 6 of the 9 replicas – When reading from a datanode fails, the client can fail over to any other remaining replica to read the same data. ⬢ Appendable (same as replicated files) – Files can be reopened for append [Diagram: Writer sends data to DN1 … DN6 and parity to DN7 … DN9 in parallel, each returning an ack; stripe size 1MB]
  • 20. EC: Write Failure Handling ⬢ Datanode failure – Client ignores the failed datanode and continues writing. – Able to tolerate 3 failures. – Requires at least 6 datanodes. – Missing blocks will be reconstructed later. [Diagram: Writer sends data to DN1 … DN6 and parity to DN7 … DN9, each returning an ack]
  • 21. Replication: Slow Writers & Replace Datanode on Failure ⬢ Write pipeline for replicated files – Datanode can be replaced in case of failure. ⬢ Slow writers – A write pipeline may last for a long time – The probability of datanode failures increases over time. – Need to replace datanode on failure. ⬢ EC files – Do not support replace-datanode-on-failure. – Slow writer improved [Diagram: Writer and datanodes DN1-DN4 with data and ack flows]
  • 22. Reading with Parity Blocks ⬢ Parallel read – Read from 6 datanodes with data blocks – Supports both stateful read and pread ⬢ Block reconstruction – Read parity blocks to reconstruct missing blocks [Diagram: Reader reading Block1-Block6 from DN1-DN6, with Parity1 on DN7 used to reconstruct Block3]
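To make "read parity blocks to reconstruct missing blocks" concrete, here is a toy reconstruction using a single XOR parity block. This is deliberately not Reed-Solomon: XOR parity can rebuild only one missing block, whereas the (6, 3) Reed-Solomon code in these slides tolerates three. The function names are hypothetical and the sketch only illustrates the general idea of recovering data from survivors plus parity.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct_missing(blocks, parity):
    """Rebuild the single missing block (marked None) from survivors and parity.

    Returns (index of the missing block, rebuilt bytes).
    """
    missing = [i for i, block in enumerate(blocks) if block is None]
    if len(missing) != 1:
        raise ValueError("XOR parity can rebuild exactly one missing block")
    survivors = [block for block in blocks if block is not None]
    return missing[0], xor_blocks(survivors + [parity])


if __name__ == "__main__":
    data = [b"bk1!", b"bk2!", b"bk3!", b"bk4!", b"bk5!", b"bk6!"]
    parity = xor_blocks(data)      # computed by the writer
    damaged = list(data)
    damaged[2] = None              # pretend Block3 is unreadable
    index, rebuilt = reconstruct_missing(damaged, parity)
    print(index, rebuilt == data[2])
```

Note that the reader has to fetch every surviving block plus parity to rebuild one block, which is the source of the higher network and CPU cost mentioned for block reconstruction.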
  • 23. Network traffic – Need good network bandwidth ⬢ Pros – Low latency because of parallel write/read – Good for small-size files ⬢ Cons – Requires high network bandwidth between client and server – Higher reconstruction cost – Dead DataNode implies high network traffic and reconstruction time ⬢ Workload comparison – Read 1 block: 3-replication 1 LN; (6, 3) Reed-Solomon 1/6 LN + 5/6 RR – Write: 3-replication 1 LN + 1 LR + 1 RR; (6, 3) Reed-Solomon 1/6 LN + 1/6 LR + 7/6 RR – LN: Local Node, LR: Local Rack, RR: Remote Rack
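The workload table can be checked with a little arithmetic: the fractions for each scheme should add up to the total number of blocks moved, and the remote-rack share shows why EC needs good network bandwidth. The figures below are copied from the slide; the dictionary layout and helper names are hypothetical.

```python
from fractions import Fraction as F

# Blocks transferred per block read/written, split by destination.
# LN = local node, LR = local rack, RR = remote rack (values from the slide).
TRAFFIC = {
    "3-replication": {
        "read": {"LN": F(1)},
        "write": {"LN": F(1), "LR": F(1), "RR": F(1)},
    },
    "(6, 3) Reed-Solomon": {
        "read": {"LN": F(1, 6), "RR": F(5, 6)},
        "write": {"LN": F(1, 6), "LR": F(1, 6), "RR": F(7, 6)},
    },
}


def total_blocks(scheme, workload):
    """Total blocks moved for one block of user data."""
    return sum(TRAFFIC[scheme][workload].values())


def remote_rack_share(scheme, workload):
    """Fraction of the moved data that crosses to a remote rack."""
    parts = TRAFFIC[scheme][workload]
    return parts.get("RR", F(0)) / total_blocks(scheme, workload)


if __name__ == "__main__":
    for scheme in TRAFFIC:
        for workload in ("read", "write"):
            print(scheme, workload,
                  total_blocks(scheme, workload),
                  remote_rack_share(scheme, workload))
```

The write totals come out to 3 blocks for 3-replication and 1.5 for (6, 3) Reed-Solomon, consistent with the 3N versus 1.5N disk usage, but a much larger share of the EC traffic leaves the local node.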
  • 24. YARN ⬢ YARN Scheduling Enhancements ⬢ Support for Long Running Services ⬢ Re-architecture for YARN Timeline Service - ATS v2 ⬢ Better elasticity and resource utilization ⬢ Better resource isolation and Docker!! ⬢ Better User Experiences ⬢ Other Enhancements
  • 25. Scheduling Enhancements ⬢ Application priorities within a queue: YARN-1963 – In Queue A, App1 > App2 ⬢ Inter-queue priorities – Q1 > Q2 irrespective of demand / capacity – Previously based on unconsumed capacity ⬢ Affinity / anti-affinity: YARN-1042 – More constraints on locations ⬢ Global Scheduling: YARN-5139 – Get rid of scheduling triggered on node heartbeat – Replaced with a global scheduler that has parallel threads • Globally optimal placement • Critical for long running services: they stick to the allocation, so it had better be a good one • Enhanced container scheduling throughput (6x)
  • 26. Key Drivers for Long Running Services ⬢ Consolidation of infrastructure – Hadoop clusters have a lot of compute and storage resources (some unused) – Can't I use Hadoop's resources for non-Hadoop load? – OpenStack is hard to run, can I use YARN? – But does it support Docker? Yes, we heard you ⬢ Hadoop-related data services that run outside a Hadoop cluster – Why can't I run them in the Hadoop cluster? ⬢ Run Hadoop services (Hive, HBase) on YARN – Run multiple instances – Benefit from YARN's elasticity and resource management
  • 27. Built-in support for long running services in YARN ⬢ A native YARN framework: YARN-4692 – Abstract common framework (similar to Slider) to support long running services – Simpler API (to manage the service lifecycle) ⬢ Better support for long running services – Recognition of long running services • Affects the policy for preemption, container reservation, etc. • Auto-restart of containers • Containers for long running services are retried on the same node when they have local state – Service/application upgrade support: YARN-4726 • In general, services are expected to run long enough to cross versions – Dynamic container configuration • Ask only for just enough resources, but adjust them at runtime (memory is harder)
  • 28. Discovery services in YARN ⬢ Services can run on any YARN node; how do you get its IP? – It can also move due to node failure ⬢ YARN Service Discovery via DNS: YARN-4757 – Expose existing service information in the YARN registry via DNS • Current YARN service registry records will be converted into DNS entries – Discovery of container IP and service port via standard DNS lookups. • Application – -> • Container – Container ->
  • 29. A More Powerful YARN ⬢ Elastic Resource Model – Dynamic Resource Configuration • YARN-291 • Allows tuning an NM's resources down/up at runtime – Graceful decommissioning of NodeManagers • YARN-914 • Drains a node that's being decommissioned to allow running containers to finish ⬢ Efficient Resource Utilization – Support for container resizing • YARN-1197 • Allows applications to change the size of an existing container
  • 30. More Powerful YARN (Contd.) ⬢ Resource Isolation – Resource isolation support for disk and network • YARN-2619 (disk), YARN-2140 (network) • Containers get a fair share of disk and network resources using cgroups – Docker support in LinuxContainerExecutor • YARN-3611 • Support for launching Docker containers alongside processes • Packaging and resource isolation • Complements YARN's support for long running services
  • 31. Docker on YARN & YARN on YARN: YCloud [Diagram: Hadoop apps (MR, Tez, Spark, TensorFlow) running on YARN, alongside a nested YARN running MR, Tez and Spark] Can use YARN to test Hadoop!!
  • 32. YARN New UI (YARN-3368)
  • 33. Timeline Service Revolution – Why ATS v2 ⬢ Scalability & Performance (v1 limitations) – Single global instance of writer/reader – Local disk based LevelDB storage ⬢ Usability – Handle flows as first-class concepts and model aggregation – Add configuration and metrics as first-class members – Better support for queries ⬢ Reliability (v1 limitations) – Data is stored on a local disk – Single point of failure (SPOF) for the timeline server ⬢ Flexibility – Data model is more descriptive – Can be extended with more app-specific information
  • 34. Core Design for ATS v2 ⬢ Distributed write path – Logical per-app collector + physical per-node writer – Collector/writer launched as an auxiliary service in the NM. – Standalone writers will be added later. ⬢ Pluggable backend storage – Built in with a scalable and reliable implementation (HBase) ⬢ Enhanced data model – Entity (bi-directional relation) with flow, queue, etc. – Configuration, Metric, Event, etc. ⬢ Separate reader instances ⬢ Aggregation & Accumulation – Aggregation: rolling up the metric values to the parent • Online aggregation for apps and flow runs • Offline aggregation for users, flows and queues – Accumulation: rolling up the metric values across a time interval • Accumulated resource consumption for app, flow, etc.
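The difference between aggregation and accumulation is easy to blur, so here is a small illustration of one possible reading. Aggregation sums a metric across children (apps) into their parent (for example a flow run); accumulation sums one entity's metric across time, weighting each sample by how long it lasted. The data shapes, function names and the memory-seconds interpretation are assumptions made for this sketch and do not reflect the actual ATS v2 API or schema.

```python
from collections import defaultdict


def aggregate_to_parent(app_metrics, parent_of):
    """Aggregation: roll each app's metric values up to its parent entity."""
    totals = defaultdict(lambda: defaultdict(int))
    for app, metrics in app_metrics.items():
        for name, value in metrics.items():
            totals[parent_of[app]][name] += value
    return {parent: dict(metrics) for parent, metrics in totals.items()}


def accumulate_over_time(samples):
    """Accumulation: roll one metric up across time.

    `samples` is a list of (duration_seconds, value) pairs; the result is the
    value integrated over time, e.g. MB-seconds of memory consumed.
    """
    return sum(duration * value for duration, value in samples)


if __name__ == "__main__":
    app_metrics = {
        "app_1": {"memory_mb": 1024},
        "app_2": {"memory_mb": 2048},
    }
    parent_of = {"app_1": "flow_run_1", "app_2": "flow_run_1"}
    print(aggregate_to_parent(app_metrics, parent_of))
    print(accumulate_over_time([(10, 1024), (5, 2048)]))
```

In this reading, aggregation answers "how much is the whole flow run using right now", while accumulation answers "how much has this app consumed so far".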
  • 35. Other YARN work planned in Hadoop 3.X ⬢ Resource profiles – YARN-3926 – Users can specify a resource profile name instead of individual resources – Resource types read via a config file ⬢ YARN federation – YARN-2915 – Allows YARN to scale out to tens of thousands of nodes – Cluster of clusters which appears as a single cluster to an end user ⬢ Gang Scheduling – YARN-624
  • 36. Thank you! Reminder: BoFs on Thursday at 5:50pm

Editor's Notes

  1. First, it enables online EC, which bypasses the conversion phase and immediately saves storage space; this is especially desirable in clusters with high-end networking. Second, it naturally distributes a small file to multiple DataNodes and eliminates the need to bundle multiple files into a single coding group.