Apache Tez: Accelerating Hadoop Query Processing

Arun C. Murthy, Founder & Architect, Hortonworks (@acmurthy)
Bikas Saha, Hortonworks (@bikassaha)
(@hortonworks)

© Hortonworks Inc. 2013
Hello!
• Founder/Architect at Hortonworks Inc.
–Lead - Map-Reduce/YARN/Tez
–Formerly, Architect Hadoop MapReduce, Yahoo
–Responsible for running Hadoop MapReduce as a service for all
of Yahoo (~50k nodes footprint)
• Apache Hadoop, ASF
–Former VP, Apache Hadoop, ASF (Chair of Apache Hadoop PMC)
–Long-term Committer/PMC member (full time for 7 years)
–Release Manager for hadoop-2.x

Once upon a time …
… long, long ago, there was a kingdom we shall call
Apache Hadoop
http://2.bp.blogspot.com/-hIp99urgxCk/UAsSFo4i8YI/AAAAAAAAAFg/IzjNDwrBBVg/s1600/magickingdo

Hadoop begat …
… a two-headed monster on every node in the kingdom;
each belonged to a different clan and answered to a
different master
http://4.bp.blogspot.com/_C7CsfdqySYc/TNSKvIwiFcI/AAAAAAAAAbs/2FSU2TV_rRA/s1600/Two-Headed+Monster+-+With+Identifiers+-+Jan+19,+2009_0.jpg

Knights of Bytes - HDFS
… stored data uncompromisingly in directories/files, nary a
care about contents
http://whoiscraigmoser.com/Images/identity/knight.png

Prince of Processing - MapReduce
He ruled with an iron fist by mapping,
and then by mercilessly reducing data
http://media.comicvine.com/uploads/14/144886/2868181-sauron.jpg

Peace Reigned
… for a while with the odd change in the direction of the wind
http://www.get-covers.com/wp-content/uploads/2012/07/Peace.jpg

Slowly, but surely …
Human beings define reality through misery and suffering.
- Agent Smith
http://api.ning.com/files/*oWmhl7LBlXuodD2itWUUtOautEVfD*pbBn57L8ThCyYIykiTuzkO4lJY1bwaNbJF7GecTDwsVj3EFHpDM-F1y-UW4b3Xsvh/matrix_revolutions_agent_smith_04.bmp

… people of the kingdom clamored for more.
A palpable sense of greed & expectation.
http://sidoxia.files.wordpress.com/2011/11/wall-st-greed-st1.jpg

Signs of Distress
SQL said some, others said Machine Learning,
still others said Real-Time Event Processing
http://www.truth-seeker.info/wp-content/uploads/2012/11/distress.jpg

A Meeting at the Summit
MapReduce is dead!
Err… not quite.
We need more options! We need more!
True…
http://4.bp.blogspot.com/-oqr1t6avx6g/TW55kUnmQvI/AAAAAAAAMMk/q9Jc87MSG4g/s400/arab%2Bleague%2Bround%2Btable%2B%2Bbig%2Bgood%2B2011.bmp

A Meeting at the Summit
A common thread YARN running through all applications…
Long live the King!
http://whipup.net/wp-content/images/2008/08/yarn.gif

The Edict
Henceforth, in the Kingdom of King YARN…
MapReduce has been relegated to the status
of, merely, one of the applications!
http://www.napavintners.org/images/winery_Labels/EdictWines-800HW.jpg

Reign of King YARN
King YARN came to the throne with promises to return power to all
applications equally, lower performance taxes and resource management…
http://images.fineartamerica.com/images-medium-large/the-coronation-the-crown-that-queen-everett.jpg

Oh the Shame!
Well, at least, Prince MapReduce still had powerful allies like
Highness Hive, Powerful Pig, Cheery Cascading…
http://www.gibbsmagazine.com/MPj03414090000%5B1%5D.jpg

Things get worse before better
Unfortunately, things got a lot worse for Prince MapReduce…
http://www.deviantart.com/download/144412184/Smile__Tomorrow_will_be_worse__by_daGrevis.jpg

Knight Tez
He did MapReduce, and so much more…
Smartly aligned himself to Kingdom YARN.
http://twomorrows.com/alterego/media/08shiningknight.gif

Knight Tez
Long-term alliances of MapReduce with
Hive, Pig, Cascading etc. broke up…
http://www.officialpsds.com/images/thumbs/broken-glass-psd44132.png
… they decided to throw in their
lot with Knight Tez!
http://informatica.upg-ploiesti.ro/62689/img/partners.jpg

Happily ever after…
(nothing cute to say)

On a more serious note…

Every season has a flavor…
SQL-on-Hadoop is the new black!
SQL-on-Hadoop will be solved within
the existing ecosystem

Looking ahead
What will it be next year?
Real-time event processing?
Machine Learning?

Play to our strengths
Invest in the Apache Hadoop platform
and the ecosystem (Hive et al).

Seriously…
Technical Details

Tez – Introduction
• Distributed execution
framework targeted towards
data-processing applications.
• Based on expressing a
computation as a dataflow
graph.
• Built on top of YARN – the
resource management
framework for Hadoop.
• Open source Apache incubator
project and Apache licensed.

Tez – Design Themes
• Empowering End Users
• Execution Performance

Tez – Empowering End Users
• Expressive dataflow definition APIs
• Flexible Input-Processor-Output runtime model
• Data type agnostic
• Simplifying deployment

• Expressive dataflow definition APIs
–Enable definition of complex data flow pipelines using simple
graph connection APIs. Tez expands the logical plan at runtime.
–Targeted towards data processing applications like Hive/Pig but
not limited to them. Hive/Pig query plans naturally map to Tez dataflow
graphs with no translation impedance.
[Figure: dataflow graph expanded into tasks TaskA-1, TaskA-2, TaskB-1, TaskB-2, TaskC-1, TaskC-2, TaskD-1, TaskD-2, TaskE-1, TaskE-2]
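
As a rough illustration of defining a dataflow with simple graph connection calls and then expanding the logical plan into tasks, here is a minimal Python sketch. The `Vertex` and `Dag` classes, their method names and the edge wiring are all assumptions made for this example; they are not the Tez API.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    """A logical processing step; `parallelism` is how many tasks it expands to."""
    name: str
    parallelism: int


@dataclass
class Dag:
    """A logical dataflow graph: named vertices plus directed edges."""
    vertices: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_vertex(self, vertex):
        self.vertices[vertex.name] = vertex
        return self

    def connect(self, producer, consumer):
        # Both endpoints must already be part of the graph.
        if producer not in self.vertices or consumer not in self.vertices:
            raise KeyError("connect() needs two known vertices")
        self.edges.append((producer, consumer))
        return self

    def expand(self):
        """Expand the logical plan into physical task names, e.g. TaskA-1."""
        return {
            name: [f"Task{name}-{i}" for i in range(1, v.parallelism + 1)]
            for name, v in self.vertices.items()
        }


dag = Dag()
for name in "ABCDE":
    dag.add_vertex(Vertex(name, parallelism=2))
# Hypothetical wiring: the edges of the slide's diagram are not recoverable.
dag.connect("A", "D").connect("B", "D").connect("C", "E")
tasks = dag.expand()
```

The user only names vertices and edges; the per-vertex task lists appear when the plan is expanded.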

Distributed Sort
[Figure: three stages (Preprocessor Stage, Partition Stage, Aggregate Stage), each with Task-1 and Task-2, plus a Sampler; edges are labeled Samples and Ranges]
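
The stage names in the distributed sort diagram suggest a sample-based range partitioning sort. The sketch below is one common way to implement that idea in plain Python; the function names and the way split points are chosen are assumptions, not something taken from the slides.

```python
import bisect
import random


def sample_stage(partitions, per_partition, rng):
    """Preprocessor stage: each task samples a few keys from its input split."""
    samples = []
    for part in partitions:
        samples.extend(rng.sample(part, min(per_partition, len(part))))
    return samples


def compute_ranges(samples, num_ranges):
    """Sampler: pick num_ranges - 1 split points from the sorted samples."""
    ordered = sorted(samples)
    step = len(ordered) / num_ranges
    return [ordered[int(step * i)] for i in range(1, num_ranges)]


def partition_stage(partitions, split_points):
    """Partition stage: route every key to the range that contains it."""
    buckets = [[] for _ in range(len(split_points) + 1)]
    for part in partitions:
        for key in part:
            buckets[bisect.bisect_right(split_points, key)].append(key)
    return buckets


def aggregate_stage(buckets):
    """Aggregate stage: sort each range locally; the ranges are already ordered."""
    return [sorted(bucket) for bucket in buckets]


rng = random.Random(0)
splits = [[rng.randrange(1000) for _ in range(50)] for _ in range(2)]
ranges = compute_ranges(sample_stage(splits, 10, rng), num_ranges=2)
sorted_ranges = aggregate_stage(partition_stage(splits, ranges))
flat = [key for bucket in sorted_ranges for key in bucket]
```

Because every key in one range is smaller than every key in the next, concatenating the locally sorted ranges gives a globally sorted result without a final merge.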

• Flexible Input-Processor-Output runtime model
–Construct physical runtime executors dynamically by connecting
different inputs, processors and outputs.
–End goal is to have a library of inputs, outputs and processors that
can be programmatically composed to generate useful operators.
[Figure: example task compositions]
IntermediateReduce: ShuffleInput → ReduceProcessor → FileSortedOutput
FinalReduce: ShuffleInput → ReduceProcessor → HDFSOutput
PairwiseJoin: Input1, Input2 → JoinProcessor → FileSortedOutput
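
To make the composition idea concrete, here is a small Python sketch in which a task is nothing more than inputs wired to a processor wired to an output. The classes are hypothetical stand-ins named after the roles in the diagram; they do not reproduce the Tez runtime interfaces.

```python
class ListInput:
    """Stand-in for an input such as ShuffleInput: yields (key, value) records."""
    def __init__(self, records):
        self.records = records

    def read(self):
        return iter(self.records)


class SumByKeyProcessor:
    """Stand-in for a reduce-style processor: sums values per key."""
    def run(self, inputs):
        totals = {}
        for source in inputs:
            for key, value in source.read():
                totals[key] = totals.get(key, 0) + value
        return totals.items()


class SortedListOutput:
    """Stand-in for a sorted output such as FileSortedOutput."""
    def __init__(self):
        self.written = []

    def write(self, records):
        self.written = sorted(records)


def run_task(inputs, processor, output):
    """Compose a task from interchangeable inputs, a processor and an output."""
    output.write(processor.run(inputs))
    return output


out = run_task(
    [ListInput([("a", 1), ("b", 2)]), ListInput([("a", 3)])],
    SumByKeyProcessor(),
    SortedListOutput(),
)
```

Swapping `SortedListOutput` for a different output class changes where results go without touching the processor, which is the point of keeping the three roles separate.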

• Data type agnostic
–Tez is only concerned with the movement of data: files and
streams of bytes.
–Does not impose any data format on the user application. An MR
application can use Key-Value pairs on top of Tez. Hive and Pig
can use tuple-oriented formats that are natural and native to them.
[Figure: a Tez Task moves Bytes in and Bytes out (File, Stream); User Code interprets them as Key-Value pairs or Tuples]
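
The sketch below illustrates the split of responsibilities: the framework side only moves opaque bytes, while user code chooses the record format. The two encodings and the `move_bytes` helper are invented for this example and say nothing about the formats Tez, Hive or Pig actually use.

```python
import struct


def encode_key_value(key, value):
    """MapReduce-style view: a length-prefixed UTF-8 key and value."""
    k, v = key.encode("utf-8"), value.encode("utf-8")
    return struct.pack(">II", len(k), len(v)) + k + v


def decode_key_value(payload):
    klen, vlen = struct.unpack(">II", payload[:8])
    key = payload[8:8 + klen].decode("utf-8")
    value = payload[8 + klen:8 + klen + vlen].decode("utf-8")
    return key, value


def encode_tuple(row):
    """Tuple-oriented view: three fixed-width signed 32-bit integers."""
    return struct.pack(">3i", *row)


def decode_tuple(payload):
    return struct.unpack(">3i", payload)


def move_bytes(payload):
    """The framework's part in this sketch: move opaque bytes, never parse them."""
    return bytes(payload)


kv = decode_key_value(move_bytes(encode_key_value("user", "42")))
row = decode_tuple(move_bytes(encode_tuple((1, 2, 3))))
```

Both formats pass through the same `move_bytes` function unchanged, so adding a third format would require no change on the transport side.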

• Simplifying deployment
–Tez is a completely client side application.
–No deployments to do. Simply upload to any accessible
FileSystem and change local Tez configuration to point to that.
–Enables running different versions concurrently. Easy to test new
functionality while keeping stable versions for production.
–Leverages YARN local resources and distributed cache.
[Figure: two Client Machines, each with its own TezClient, use different Tez libraries stored on HDFS (Tez Lib 1, Tez Lib 2); Node Managers host the TezTasks]

• Simplifying usage
With great power APIs come great responsibilities

Tez – Execution Performance
• Performance gains over Map Reduce
• Plan reconfiguration at runtime
• Optimal resource management
• Dynamic physical data flow decisions

• Performance gains over Map Reduce
–Eliminate replicated write barrier between successive
computations.
–Eliminate job launch overhead of workflow jobs.
–Eliminate extra stage of map reads in every workflow job.
–Eliminate queue and resource contention suffered by workflow
jobs that are started after a predecessor job completes.
[Figure: execution plans compared, Pig/Hive - MR versus Pig/Hive - Tez]
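
A back-of-the-envelope way to see where the savings come from is to count the overheads listed above for a plan that would otherwise run as a chain of MapReduce jobs. This is a deliberately simplified counting model written for illustration; it ignores queue contention and is not a performance measurement.

```python
def mapreduce_workflow_overheads(num_jobs):
    """Overheads of running a plan as a chain of num_jobs MapReduce jobs."""
    return {
        "job_launches": num_jobs,
        # Every job but the last writes its output to replicated storage
        # so that the next job can read it.
        "replicated_intermediate_writes": num_jobs - 1,
        # Every job but the first starts with a map stage that only re-reads
        # its predecessor's output.
        "extra_map_read_stages": num_jobs - 1,
    }


def single_dag_overheads():
    """The same plan submitted once as a single dataflow graph."""
    return {
        "job_launches": 1,
        "replicated_intermediate_writes": 0,
        "extra_map_read_stages": 0,
    }


chain = mapreduce_workflow_overheads(4)
dag = single_dag_overheads()
saved = {name: chain[name] - dag[name] for name in chain}
```

For a four-job chain the model reports three avoided launches, three avoided replicated writes and three avoided map read stages; the savings grow linearly with the length of the chain.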

• Plan reconfiguration at runtime
–Dynamic runtime concurrency control based on data size, user
operator resources, available cluster resources and locality.
–Advanced changes in dataflow graph structure.
–Progressive graph construction in concert with user optimizer.
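[Figure: inputs to the decision are HDFS Blocks and YARN Resources. As planned: Stage 1 with 50 maps and 100 partitions, Stage 2 with 100 reducers. At runtime, with only 10 GB of data: Stage 1 with 50 maps and 100 partitions, Stage 2 cut from 100 to 10 reducers.]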
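
The 100-to-10 reducer example can be sketched as a sizing rule applied once the real data volume is known. The function name and the target of 1 GiB per reducer are assumptions chosen so that the numbers match the slide; the slides do not state the actual heuristic.

```python
GIB = 1024 ** 3


def choose_reducer_count(planned, observed_bytes, bytes_per_reducer=GIB):
    """Shrink the planned reducer count to fit the data actually produced."""
    # Integer ceiling division; always keep at least one reducer.
    needed = max(1, (observed_bytes + bytes_per_reducer - 1) // bytes_per_reducer)
    # Never exceed the plan: the upstream stage produced only `planned` partitions.
    return min(planned, needed)


reducers = choose_reducer_count(planned=100, observed_bytes=10 * GIB)
```

With 100 planned reducers and 10 GiB observed, the rule settles on 10 reducers, each of which would then read several of the 100 upstream partitions.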

• Optimal resource management
–Reuse YARN containers to launch new tasks.
–Reuse YARN containers to enable shared objects across tasks.
[Figure: the Tez Application Master sends Start Task to a TezTask Host running in a YARN Container and receives Task Done, then sends Start Task again; TezTask1 and TezTask2 run in the same container and have access to SharedObjects]
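
Container reuse with shared objects can be pictured as a long-lived host that runs tasks one after another and keeps a cache alive between them. The `Container` class and the lookup-table example below are hypothetical; they only illustrate why reuse avoids repeated setup work.

```python
class Container:
    """Stand-in for a YARN container that hosts several tasks in turn."""
    def __init__(self, container_id):
        self.container_id = container_id
        self.shared_objects = {}   # survives across tasks in this container
        self.tasks_run = 0

    def run(self, task):
        self.tasks_run += 1
        return task(self.shared_objects)


def load_lookup_table():
    """Hypothetical expensive setup step that tasks would like to share."""
    load_lookup_table.calls += 1
    return {"us": "United States", "in": "India"}


load_lookup_table.calls = 0


def make_task(code):
    def task(shared):
        # Build the table on first use; later tasks in the container reuse it.
        if "lookup" not in shared:
            shared["lookup"] = load_lookup_table()
        return shared["lookup"][code]
    return task


container = Container("container_1")
results = [container.run(make_task(code)) for code in ("us", "in", "us")]
```

Three tasks run in the same container, but the expensive load happens only once; with one container per task it would happen three times.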

• Dynamic physical data flow decisions
–Decide the type of physical byte movement and storage on the fly.
–Store intermediate data on distributed store, local store or in-memory.
–Transfer bytes via blocking files or streaming and the spectrum in
between.
[Figure: decided At Runtime. A Producer with small-size output passes data In-Memory to the Consumer; otherwise the Producer writes a Local File that the Consumer reads.]
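
A runtime choice between in-memory, local file and distributed storage could look like the sketch below. The thresholds, the `must_survive_node_loss` flag and the function name are assumptions for illustration; the slides do not describe the actual decision logic.

```python
from enum import Enum


class Transport(Enum):
    IN_MEMORY = "in-memory"
    LOCAL_FILE = "local file"
    DISTRIBUTED_STORE = "distributed store"


MIB = 1024 ** 2


def choose_transport(output_bytes, free_memory_bytes,
                     must_survive_node_loss=False, memory_fraction=0.25):
    """Pick where a producer's output lives, based on what is known at runtime."""
    if must_survive_node_loss:
        # Only a replicated store protects the data if the producer's node dies.
        return Transport.DISTRIBUTED_STORE
    if output_bytes <= free_memory_bytes * memory_fraction:
        # Small outputs are handed to the consumer without touching disk.
        return Transport.IN_MEMORY
    return Transport.LOCAL_FILE


small = choose_transport(10 * MIB, free_memory_bytes=1024 * MIB)
large = choose_transport(512 * MIB, free_memory_bytes=1024 * MIB)
durable = choose_transport(10 * MIB, 1024 * MIB, must_survive_node_loss=True)
```

The decision depends on the observed output size, which is why it has to be made while the job runs and not when the plan is written.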

Tez – Current status
• Apache Incubator Project
–Rapid development. Over 270 JIRAs opened, over 170 resolved.
–Growing community.
• Focus on stability
–Testing and quality are highest priority.
–Code ready and deployed on multi-node clusters.
• DAG of MR processing is working
– Already functionally equivalent to Map Reduce. Existing Map
Reduce jobs can be executed on Tez with few or no changes.
– Working Hive prototype that can target Tez for execution of
queries.
–Work started on prototype of Pig that can target Tez.

Tez – Current status
[Figure: typical pattern in a TPC-DS query. The Fact Table is joined with Dimension Table 1, 2 and 3 through successive Joins, producing Result Table 1, 2 and 3. A second panel, labeled Optimization for small data sets, shows the Fact Table together with the dimension tables. Both can now run as a single Tez job.]
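
One plausible reading of the small-data-set optimization is that dimension tables small enough to fit in memory are joined against the fact table in a single pass, with no intermediate result tables. The sketch below shows that idea in plain Python; the table contents and column names are made up, and the slides do not confirm that this is how Hive on Tez implements it.

```python
def star_join(fact_rows, dimensions):
    """Join each fact row against every dimension table in one pass.

    `dimensions` maps a foreign-key column name to a {key: attributes} dict
    that is small enough to hold in memory.
    """
    joined = []
    for row in fact_rows:
        out = dict(row)
        for fk_column, table in dimensions.items():
            match = table.get(row[fk_column])
            if match is None:
                break            # inner join: drop rows with no match
            out.update(match)
        else:
            # Reached only when every dimension lookup succeeded.
            joined.append(out)
    return joined


facts = [
    {"item_id": 1, "store_id": 10, "amount": 5.0},
    {"item_id": 2, "store_id": 99, "amount": 7.5},   # store 99 is unknown
]
dims = {
    "item_id": {1: {"item": "book"}, 2: {"item": "pen"}},
    "store_id": {10: {"store": "Austin"}},
}
result = star_join(facts, dims)
```

The fact table is scanned once regardless of how many dimension tables are involved, in contrast to the chained pattern where each join produces a result table for the next one.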

Tez – Looking ahead
• Early adopters and contributors welcome
–Adopters to drive more scenarios. Contributors to make them
happen.
• Stay tuned for Tez meetups with deep dives on Tez
architecture and using Tez
• Useful links
–Work tracking: https://issues.apache.org/jira/browse/TEZ
–Code: https://github.com/apache/incubator-tez
–High level design document and API specification:
https://issues.apache.org/jira/browse/TEZ-65
–Developer list: dev@tez.incubator.apache.org
–User list: user@tez.incubator.apache.org
–Issues list: issues@tez.incubator.apache.org

Tez – Takeaways
• Distributed execution framework that works on
computations represented as dataflow graphs
• Naturally maps to execution plans produced by query
optimizers
• Execution architecture designed to enable dynamic
performance optimizations at runtime
• Open source Apache project – your use-cases and
code are welcome
• It works and is already being used by Hive

Tez
Thanks for your time and attention!
Questions?