This document provides an overview of the internals of Apache Flink. It discusses how Flink programs are compiled into execution plans by the Flink optimizer and executed in a pipelined fashion by the Flink runtime. The runtime represents data internally as serialized bytes, avoiding object overhead, and operates on it with tailored implementations of sorting and hashing. It also describes how Flink handles iterative programs and memory management. Overall, it explains how Flink hides complexity from users while providing high-performance distributed processing.
2. Welcome
Last talk: how to program PageRank in Flink, and the Flink programming model
This talk: how Flink works internally
Again, a big bravo to the Flink community
4. DataSet and transformations
[Diagram: Input → Operator X → First → Operator Y → Second]
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<String> input = env.readTextFile(inputPath);
DataSet<String> first = input
    .filter(str -> str.contains("Apache Flink"));
DataSet<String> second = first
    .filter(str -> str.length() > 40);
second.print();
env.execute();
6. Other API elements & tools
Accumulators and counters (see the sketch below)
• Int, Long, Double counters
• Histogram accumulator
• Define your own
Broadcast variables
Plan visualization
Local debugging/testing mode
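A minimal sketch of a custom counter, assuming the classic DataSet Java API (the accumulator name "matches" and the filter logic are illustrative):

import org.apache.flink.api.common.accumulators.IntCounter;
import org.apache.flink.api.common.functions.RichFilterFunction;
import org.apache.flink.configuration.Configuration;

public class CountingFilter extends RichFilterFunction<String> {
    private final IntCounter matches = new IntCounter();

    @Override
    public void open(Configuration parameters) {
        // register under a name; the client reads it from the JobExecutionResult
        getRuntimeContext().addAccumulator("matches", matches);
    }

    @Override
    public boolean filter(String value) {
        boolean hit = value.contains("Apache Flink");
        if (hit) {
            matches.add(1); // counted on every parallel instance, merged at the end
        }
        return hit;
    }
}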
7. Data types and grouping
public static class Access {
    public int userId;
    public String url;
    ...
}
public static class User {
    public int userId;
    public int region;
    public Date customerSince;
    ...
}
DataSet<Tuple2<Access, User>> campaign = access.join(users)
    .where("userId").equalTo("userId");
DataSet<Tuple3<Integer, String, String>> someLog;
someLog.groupBy(0, 1).reduceGroup(...);
Bean-style Java classes & field names
Tuples and position addressing
Any data type with key selector function (see the sketch below)
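Where neither field names nor positions fit, a key selector function can extract the key. A minimal sketch reusing the User type above (grouping by region is an illustrative choice):

import org.apache.flink.api.java.functions.KeySelector;

DataSet<User> users = ...
users.groupBy(new KeySelector<User, Integer>() {
    @Override
    public Integer getKey(User u) {
        return u.region; // any computable key works, not just a declared field
    }
}).reduceGroup(...);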
8. Other API elements
Hadoop compatibility
• Supports all Hadoop data types, input/output formats, Hadoop mappers and reducers
Data streaming API (see the sketch below)
• DataStream instead of DataSet
• Similar set of operators
• Currently in alpha but moving very fast
Scala and Java APIs (mirrored)
Graph API (Spargel)
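A minimal sketch of the streaming flavor of the earlier grep example, assuming the DataStream API (the socket source host and port are illustrative):

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<String> text = env.socketTextStream("localhost", 9999);
text.filter(str -> str.contains("Apache Flink"))
    .print();
env.execute("streaming grep");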
10. [Diagram: a WordCount program runs through the stack. The Flink Client & Optimizer submits the program to the JobManager, which deploys parallel tasks to TaskManagers; input splits such as "O Romeo, Romeo, wherefore art thou Romeo?" and "Nor arm, nor face, nor any other part" become counts such as "O, 1", "Romeo, 3", "wherefore, 1" and "nor, 3", "arm, 1", "face, 1".]
DataSet<String> text = env.readTextFile(input);
DataSet<Tuple2<String, Integer>> result = text
    .flatMap((str, out) -> {
        for (String token : str.split("\\W")) {
            out.collect(new Tuple2<>(token, 1));
        }
    })
    .groupBy(0)
    .aggregate(SUM, 1);
11. If you want to know one thing about Flink, it is that you don't need to know the internals of Flink.
12. Philosophy
Flink "hides" its internal workings from the user
This is good
• The user does not worry about how jobs are executed
• Internals can be changed without breaking changes
… and bad
• The execution model is more complicated to explain compared to MapReduce or Spark RDD
13. Recap: DataSet
[Diagram: Input → Operator X → First → Operator Y → Second]
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<String> input = env.readTextFile(inputPath);
DataSet<String> first = input
    .filter(str -> str.contains("Apache Flink"));
DataSet<String> second = first
    .filter(str -> str.length() > 40);
second.print();
env.execute();
14. Common misconception
[Diagram: Input → Operator X → First → Operator Y → Second]
Programs are not executed eagerly
Instead, the system compiles the program to an execution plan and executes that plan
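Because execution is deferred, the compiled plan can be inspected before anything runs. A small sketch (the sample data and output path are illustrative; getExecutionPlan() returns the plan as JSON):

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
env.fromElements("Apache Flink", "something else")
    .filter(str -> str.contains("Apache Flink"))
    .writeAsText("/tmp/grep-out");          // declares a sink; still nothing runs
System.out.println(env.getExecutionPlan()); // dumps the compiled execution plan
// only env.execute() would actually run the job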
15. DataSet<String>
Think of it as a PCollection<String>, or a Spark RDD[String]
With a major difference: it can be produced/recovered in several ways
• … like a Java collection
• … like an RDD
• … perhaps it is never fully materialized (because the program does not need it to be)
• … implicitly updated in an iteration
And this is transparent to the user
16. Example: grep
[Diagram: load the log ("Romeo, Romeo, where art thou Romeo?"), then run three searches over it (for str1, str2, str3), producing Grep 1, Grep 2, Grep 3.]
17. Staged (batch) execution
[Diagram: the same grep job executed as separate stages.]
Stage 1: Create/cache Log
Subsequent stages: Grep log for matches
Caching in-memory and disk if needed
18. Pipelined execution
[Diagram: the same grep job with all operators deployed at once and bytes (00110011) streaming through them. Note: the Log DataSet is never "created"!]
Stage 1: Deploy and start operators
Data transfer in-memory and disk if needed
19. Benefits of pipelining
25 node cluster
Grep log for 3 terms
Scale data size from 100GB to 1TB
[Chart: time to complete grep (sec) on the y-axis (0 to 2500) vs. data size (GB) on the x-axis (0 to 1000); the point where the data size exceeds the cluster memory is marked.]
21. Drawbacks of pipelining
Long pipelines may be active at the same time, leading to memory fragmentation
• FLINK-1101: Changes memory allocation from static to adaptive
Fault tolerance harder to get right
• FLINK-986: Adds intermediate data sets (similar to RDDs) as first-class citizens to the Flink runtime. Will lead to fine-grained fault tolerance, among other features.
23. Iterate by unrolling
[Diagram: the client drives Step → Step → Step → Step → Step]
A for/while loop in the client submits one job per iteration step
Data reuse by caching in memory and/or disk
24. Iterate natively
DataSet<Page> pages = ...
DataSet<Neighborhood> edges = ...
IterativeDataSet<Page> pagesIter = pages.iterate(maxIterations);
DataSet<Page> newRanks = update(pagesIter, edges);
DataSet<Page> result = pagesIter.closeWith(newRanks);
[Diagram: the step function (X, Y) transforms the partial solution, possibly joining other datasets; its output replaces the partial solution for the next iteration, starting from the initial solution and ending in the iteration result.]
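A self-contained sketch of the same API on a toy problem (the increment step is made up; the point is that the whole loop is submitted as one job, with no per-step scheduling):

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.IterativeDataSet;

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
IterativeDataSet<Integer> loop = env.fromElements(1, 2, 3).iterate(10);
DataSet<Integer> step = loop.map(new MapFunction<Integer, Integer>() {
    @Override
    public Integer map(Integer v) {
        return v + 1; // the "update" of this toy iteration
    }
});
DataSet<Integer> result = loop.closeWith(step); // feeds the step output back 10 times
result.print();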
25. Iterate natively with deltas
[Diagram: each step (A, B) reads the workset and the partial solution, possibly joining other datasets (X, Y); it emits a delta set that is merged into the solution and a new workset that replaces the old one, starting from the initial workset and initial solution and ending in the iteration result.]
DeltaIteration<...> pagesIter = pages.iterateDelta(initialDeltas, maxIterations, 0);
DataSet<...> newRanks = update(pagesIter, edges);
DataSet<...> deltas = ...
DataSet<...> result = pagesIter.closeWith(newRanks, deltas);
See http://data-artisans.com/data-analysis-with-flink.html
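A self-contained sketch of the delta-iteration API on a toy problem (the label-shrinking logic is made up; the shape to note is getWorkset/getSolutionSet/closeWith, and that the iteration stops early once the workset becomes empty):

import org.apache.flink.api.common.functions.FlatJoinFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.DeltaIteration;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<Tuple2<Long, Long>> state = env.fromElements(
        new Tuple2<>(1L, 5L), new Tuple2<>(2L, 3L), new Tuple2<>(3L, 1L));
// solution set keyed on field 0; the initial workset is the full state
DeltaIteration<Tuple2<Long, Long>, Tuple2<Long, Long>> it =
        state.iterateDelta(state, 10, 0);
// candidate updates: decrement each label, but never below 1
DataSet<Tuple2<Long, Long>> candidates = it.getWorkset()
        .map(new MapFunction<Tuple2<Long, Long>, Tuple2<Long, Long>>() {
            @Override
            public Tuple2<Long, Long> map(Tuple2<Long, Long> v) {
                return new Tuple2<>(v.f0, Math.max(v.f1 - 1L, 1L));
            }
        });
// keep only genuine improvements over the current solution set
DataSet<Tuple2<Long, Long>> deltas = candidates
        .join(it.getSolutionSet()).where(0).equalTo(0)
        .with(new FlatJoinFunction<Tuple2<Long, Long>, Tuple2<Long, Long>, Tuple2<Long, Long>>() {
            @Override
            public void join(Tuple2<Long, Long> cand, Tuple2<Long, Long> current,
                             Collector<Tuple2<Long, Long>> out) {
                if (cand.f1 < current.f1) {
                    out.collect(cand); // merged into the solution, fed to the next workset
                }
            }
        });
it.closeWith(deltas, deltas).print();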
29. Flink stack
[Diagram: the API layer (Scala API (batch), Java API (batch), Java API (streaming), Python API (upcoming), Graph API, Apache MRQL) sits on the Common API, which feeds the Flink Optimizer and Flink Stream Builder on top of the Flink Runtime / Flink Execution Engine; execution runs embedded (Java collections), locally, on YARN, EC2, or Apache Tez; data storage connectors: files, HDFS, S3, JDBC, Kafka, RabbitMQ, Redis, Azure tables, …]
30. Flink stack
[Diagram: the same API layer and Common API over the Flink Optimizer and Flink Stream Builder, now centered on the Flink Local Runtime; programs run in an embedded environment (Java collections), a local environment (for debugging), or a remote environment (regular cluster execution); single-node execution, a Flink cluster, YARN, and Apache Tez serve as backends; data storage connectors: files, HDFS, S3, JDBC, Kafka, RabbitMQ, Redis, Azure tables, …]
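A hedged sketch of how the environments above are chosen in code (host, port, and jar path are illustrative):

import org.apache.flink.api.java.ExecutionEnvironment;

ExecutionEnvironment auto = ExecutionEnvironment.getExecutionEnvironment();  // picks local or cluster from the context
ExecutionEnvironment local = ExecutionEnvironment.createLocalEnvironment();  // in-process, for debugging
ExecutionEnvironment remote = ExecutionEnvironment.createRemoteEnvironment(
        "jobmanager-host", 6123, "/path/to/program.jar");                    // regular cluster execution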
31. Program lifecycle
val source1 = …
val source2 = …
val maxed = source1
    .map(v => (v._1, v._2, math.max(v._1, v._2)))
val filtered = source2
    .filter(v => v._1 > 4)
val result = maxed
    .join(filtered).where(0).equalTo(0)
    .filter(_._1 > 3)
    .groupBy(0)
    .reduceGroup {……}
32. The optimizer is the component that selects an execution plan for a Common API program
Think of an AI system manipulating your program for you
But don't be scared – it works
• Relational databases have been doing this for decades – Flink ports the technology to API-based systems
34. Two execution plans
[Diagram: two physical plans for the same program, reading orders.tbl and lineitem.tbl through DataSource, Filter, and Map operators into a hybrid-hash join (buildHT/probe) and a sort-based GroupRed with a combiner. Plan 1 ships one input with broadcast/forward; plan 2 hash-partitions both inputs on [0] and reuses hash partitioning [0,1] before the GroupRed.]
The best plan depends on the relative sizes of the input files
35. Flink Local Runtime
The local runtime, not the distributed execution engine
Aka: what happens inside every parallel task
36. Flink runtime operators
Sorting and hashing data
• Necessary for grouping, aggregation, reduce, join, cogroup, delta iterations
Flink contains tailored implementations of hybrid hashing and external sorting in Java
• Scale well with both abundant and restricted memory sizes
37. Internal data representation
How is intermediate data internally represented?
[Diagram: in a map task's JVM heap, records such as "O Romeo, Romeo, wherefore art thou Romeo?" are serialized to bytes (00110011 …); the bytes go through network transfer to the reduce task's JVM heap, where a local sort runs over the serialized data, yielding records such as "art, 1", "O, 1", "Romeo, 1", "Romeo, 1".]
38. Internal data representation
Two options: Java objects or raw bytes
Java objects
• Easier to program
• Can suffer from GC overhead
• Hard to de-stage data to disk, may suffer from "out of memory" exceptions
Raw bytes
• Harder to program (custom serialization stack, more involved runtime operators)
• Solves most of the memory and GC problems
• Overhead from object (de)serialization
Flink follows the raw byte approach
39. Memory in Flink
public class WC {
    public String word;
    public int count;
}
[Diagram: the JVM heap is split into an unmanaged part holding user code objects, a managed part holding a pool of memory pages (empty pages are handed out for sorting, hashing, and caching), and network buffers for shuffling and broadcasts.]
40. Memory in Flink (2)
Internal memory management
• Flink initially allocates 70% of the free heap as byte[] segments
• Internal operators allocate() and release() these segments
Flink has its own serialization stack
• All accepted data types are serialized into data segments
Easy to reason about memory, (almost) no OutOfMemory errors, reduced pressure on the GC (smooth performance)
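A hedged sketch of the matching flink-conf.yaml knobs (key names from the TaskManager configuration of that era; the heap size is illustrative):

# total TaskManager JVM heap
taskmanager.heap.mb: 4096
# fraction of the free heap pre-allocated as managed byte[] pages (the 70% above)
taskmanager.memory.fraction: 0.7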
41. Operating on serialized data
Microbenchmark
Sorting 1GB worth of (long, double) tuples
67,108,864 elements
Simple quicksort
42. Flink Execution Engine
The distributed execution engine
Pipelined
• Same engine for Flink and Flink streaming
Pluggable
• Local runtime can be executed on other engines
• E.g., Java collections and Apache Tez
44. Summary
Flink decouples the API from the execution
• The same program can be executed in many different ways
• Hopefully users do not need to care about this and still get very good performance
Unique Flink internal features
• Pipelined execution, native iterations, optimizer, serialized data manipulation, good disk destaging
Very good performance
• Known issues currently being worked on actively
49. Flink in context
[Diagram: the ecosystem by layer. Applications: Hive, Mahout, Cascading, Pig, … Data processing engines: MapReduce, Flink, Spark, Storm, Tez. App and resource management: YARN, Mesos. Storage, streams: HDFS, HBase, Kafka, …]
50. Common API
The notion of "DataSet" is no longer present at this level
A program is a DAG of operators
[Diagram: two DataSources feed a MapOperator and a FilterOperator, which meet in a JoinOperator ending in a DataSink.]
51. Example: Joins in Flink
DataSet<Order> large = ...
DataSet<Lineitem> medium = ...
DataSet<Customer> small = ...
DataSet<Tuple...> joined1 = large.join(medium).where(3).equalTo(1)
    .with(new JoinFunction() { ... });
DataSet<Tuple...> joined2 = small.join(joined1).where(0).equalTo(2)
    .with(new JoinFunction() { ... });
DataSet<Tuple...> result = joined2.groupBy(3).aggregate(MAX, 2);
[Diagram: the join tree, an aggregation γ over (small ⋈ (large ⋈ medium)).]
Built-in strategies include partitioned join and replicated join with local sort-merge or hybrid-hash algorithms.
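A hedged sketch of steering these strategies from the program via join hints (JoinHint comes from the DataSet Java API; the tiny inputs are illustrative):

import org.apache.flink.api.common.operators.base.JoinOperatorBase.JoinHint;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<Tuple2<Integer, String>> big = env.fromElements(new Tuple2<>(1, "order"));
DataSet<Tuple2<Integer, String>> tiny = env.fromElements(new Tuple2<>(1, "customer"));
// replicate the (small) second input to every node instead of repartitioning both
DataSet<Tuple2<Tuple2<Integer, String>, Tuple2<Integer, String>>> joined =
        big.join(tiny, JoinHint.BROADCAST_HASH_SECOND)
           .where(0).equalTo(0);
joined.print();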
52. Optimizer Example
DataSet<Tuple...> large = env.readCsvFile(...);
DataSet<Tuple...> medium = env.readCsvFile(...);
DataSet<Tuple...> small = env.readCsvFile(...);
DataSet<Tuple...> joined1 = large.join(medium).where(3).equalTo(1)
    .with(new JoinFunction() { ... });
DataSet<Tuple...> joined2 = small.join(joined1).where(0).equalTo(2)
    .with(new JoinFunction() { ... });
DataSet<Tuple...> result = joined2.groupBy(3).aggregate(MAX, 2);
1) Partitioned hash-join
2) Broadcast hash-join
3) Grouping/Aggregation reuses the partitioning from step (1): no shuffle!
Partitioned ≈ reduce-side
Broadcast ≈ map-side
53. Operating on serialized data
[Comparison: Hadoop MapReduce serializes data every time (highly robust, never gives up on you); Spark works on objects, RDDs may be stored serialized (serialization considered slow, used only when needed); Flink makes serialization really cheap via partial deserialization and operates on the serialized form (efficient and robust!).]