This document provides an overview of the internals of Apache Flink. It discusses how Flink programs are compiled into execution plans by the Flink optimizer and executed in a pipelined fashion by the Flink runtime. The runtime represents data internally as serialized bytes, avoiding object overhead, and operates on it with tailored implementations of sorting and hashing. It also describes how Flink handles iterative programs and memory management. Overall, it explains how Flink hides complexity from users while providing high-performance distributed processing.
2. Welcome
Last talk: how to program PageRank in Flink, and the Flink programming model
This talk: how Flink works internally
Again, a big bravo to the Flink community
4. DataSet and transformations
[Diagram: Input → Operator X → First → Operator Y → Second]
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<String> input = env.readTextFile(inputPath);
DataSet<String> first = input
    .filter(str -> str.contains("Apache Flink"));
DataSet<String> second = first
    .filter(str -> str.length() > 40);
second.print();
env.execute();
6. Other API elements & tools
Accumulators and counters (see the sketch below)
• Int, Long, Double counters
• Histogram accumulator
• Define your own
Broadcast variables
Plan visualization
Local debugging/testing mode
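A minimal sketch of a custom counter, assuming the classic DataSet Java API (the accumulator name "matches" and the filter logic are illustrative):

import org.apache.flink.api.common.accumulators.IntCounter;
import org.apache.flink.api.common.functions.RichFilterFunction;
import org.apache.flink.configuration.Configuration;

public class CountingFilter extends RichFilterFunction<String> {
    private final IntCounter matches = new IntCounter();

    @Override
    public void open(Configuration parameters) {
        // register under a name; the client reads it from the JobExecutionResult
        getRuntimeContext().addAccumulator("matches", matches);
    }

    @Override
    public boolean filter(String value) {
        boolean hit = value.contains("Apache Flink");
        if (hit) {
            matches.add(1); // counted on every parallel instance, merged at the end
        }
        return hit;
    }
}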
7. Data types and grouping
public static class Access {
    public int userId;
    public String url;
    ...
}
public static class User {
    public int userId;
    public int region;
    public Date customerSince;
    ...
}
DataSet<Tuple2<Access, User>> campaign = access.join(users)
    .where("userId").equalTo("userId");
DataSet<Tuple3<Integer, String, String>> someLog;
someLog.groupBy(0, 1).reduceGroup(...);
Bean-style Java classes & field names
Tuples and position addressing
Any data type with key selector function (see the sketch below)
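Where neither field names nor positions fit, a key selector function can extract the key. A minimal sketch reusing the User type above (grouping by region is an illustrative choice):

import org.apache.flink.api.java.functions.KeySelector;

DataSet<User> users = ...
users.groupBy(new KeySelector<User, Integer>() {
    @Override
    public Integer getKey(User u) {
        return u.region; // any computable key works, not just a declared field
    }
}).reduceGroup(...);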
8. Other API elements
Hadoop compatibility
• Supports all Hadoop data types, input/output formats, Hadoop mappers and reducers
Data streaming API (see the sketch below)
• DataStream instead of DataSet
• Similar set of operators
• Currently in alpha but moving very fast
Scala and Java APIs (mirrored)
Graph API (Spargel)
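A minimal sketch of the streaming flavor of the earlier grep example, assuming the DataStream API (the socket source host and port are illustrative):

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<String> text = env.socketTextStream("localhost", 9999);
text.filter(str -> str.contains("Apache Flink"))
    .print();
env.execute("streaming grep");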
10. [Diagram: a WordCount program runs through the stack. The Flink Client & Optimizer submits the program to the JobManager, which deploys parallel tasks to TaskManagers; input splits such as "O Romeo, Romeo, wherefore art thou Romeo?" and "Nor arm, nor face, nor any other part" become counts such as "O, 1", "Romeo, 3", "wherefore, 1" and "nor, 3", "arm, 1", "face, 1".]
DataSet<String> text = env.readTextFile(input);
DataSet<Tuple2<String, Integer>> result = text
    .flatMap((str, out) -> {
        for (String token : str.split("\\W")) {
            out.collect(new Tuple2<>(token, 1));
        }
    })
    .groupBy(0)
    .aggregate(SUM, 1);
11. If you want to know one thing about Flink, it is that you don't need to know the internals of Flink.
12. Philosophy
Flink "hides" its internal workings from the user
This is good
• The user does not worry about how jobs are executed
• Internals can be changed without breaking changes
… and bad
• The execution model is more complicated to explain compared to MapReduce or Spark RDD
13. Recap: DataSet
[Diagram: Input → Operator X → First → Operator Y → Second]
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<String> input = env.readTextFile(inputPath);
DataSet<String> first = input
    .filter(str -> str.contains("Apache Flink"));
DataSet<String> second = first
    .filter(str -> str.length() > 40);
second.print();
env.execute();
14. Common misconception
[Diagram: Input → Operator X → First → Operator Y → Second]
Programs are not executed eagerly
Instead, the system compiles the program to an execution plan and executes that plan
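Because execution is deferred, the compiled plan can be inspected before anything runs. A small sketch (the sample data and output path are illustrative; getExecutionPlan() returns the plan as JSON):

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
env.fromElements("Apache Flink", "something else")
    .filter(str -> str.contains("Apache Flink"))
    .writeAsText("/tmp/grep-out");          // declares a sink; still nothing runs
System.out.println(env.getExecutionPlan()); // dumps the compiled execution plan
// only env.execute() would actually run the job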
15. DataSet<String>
Think of it as a PCollection<String>, or a Spark RDD[String]
With a major difference: it can be produced/recovered in several ways
• … like a Java collection
• … like an RDD
• … perhaps it is never fully materialized (because the program does not need it to be)
• … implicitly updated in an iteration
And this is transparent to the user
16. Example: grep
[Diagram: load the log ("Romeo, Romeo, where art thou Romeo?"), then run three searches over it (for str1, str2, str3), producing Grep 1, Grep 2, Grep 3.]
17. Staged (batch) execution
[Diagram: the same grep job executed as separate stages.]
Stage 1: Create/cache Log
Subsequent stages: Grep log for matches
Caching in-memory and disk if needed
18. Pipelined execution
[Diagram: the same grep job with all operators deployed at once and bytes (00110011) streaming through them. Note: the Log DataSet is never "created"!]
Stage 1: Deploy and start operators
Data transfer in-memory and disk if needed
19. Benefits of pipelining
25 node cluster
Grep log for 3 terms
Scale data size from 100GB to 1TB
[Chart: time to complete grep (sec) on the y-axis (0 to 2500) vs. data size (GB) on the x-axis (0 to 1000); the point where the data size exceeds the cluster memory is marked.]
21. Drawbacks of pipelining
Long pipelines may be active at the same time, leading to memory fragmentation
• FLINK-1101: Changes memory allocation from static to adaptive
Fault tolerance harder to get right
• FLINK-986: Adds intermediate data sets (similar to RDDs) as first-class citizens to the Flink runtime. Will lead to fine-grained fault tolerance, among other features.
23. Iterate by unrolling
[Diagram: the client drives Step → Step → Step → Step → Step]
A for/while loop in the client submits one job per iteration step
Data reuse by caching in memory and/or disk
24. Iterate natively
DataSet<Page> pages = ...
DataSet<Neighborhood> edges = ...
IterativeDataSet<Page> pagesIter = pages.iterate(maxIterations);
DataSet<Page> newRanks = update(pagesIter, edges);
DataSet<Page> result = pagesIter.closeWith(newRanks);
[Diagram: the step function (X, Y) transforms the partial solution, possibly joining other datasets; its output replaces the partial solution for the next iteration, starting from the initial solution and ending in the iteration result.]
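A self-contained sketch of the same API on a toy problem (the increment step is made up; the point is that the whole loop is submitted as one job, with no per-step scheduling):

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.IterativeDataSet;

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
IterativeDataSet<Integer> loop = env.fromElements(1, 2, 3).iterate(10);
DataSet<Integer> step = loop.map(new MapFunction<Integer, Integer>() {
    @Override
    public Integer map(Integer v) {
        return v + 1; // the "update" of this toy iteration
    }
});
DataSet<Integer> result = loop.closeWith(step); // feeds the step output back 10 times
result.print();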
25. Iterate natively with deltas
[Diagram: each step (A, B) reads the workset and the partial solution, possibly joining other datasets (X, Y); it emits a delta set that is merged into the solution and a new workset that replaces the old one, starting from the initial workset and initial solution and ending in the iteration result.]
DeltaIteration<...> pagesIter = pages.iterateDelta(initialDeltas, maxIterations, 0);
DataSet<...> newRanks = update(pagesIter, edges);
DataSet<...> deltas = ...
DataSet<...> result = pagesIter.closeWith(newRanks, deltas);
See http://data-artisans.com/data-analysis-with-flink.html
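A self-contained sketch of the delta-iteration API on a toy problem (the label-shrinking logic is made up; the shape to note is getWorkset/getSolutionSet/closeWith, and that the iteration stops early once the workset becomes empty):

import org.apache.flink.api.common.functions.FlatJoinFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.DeltaIteration;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<Tuple2<Long, Long>> state = env.fromElements(
        new Tuple2<>(1L, 5L), new Tuple2<>(2L, 3L), new Tuple2<>(3L, 1L));
// solution set keyed on field 0; the initial workset is the full state
DeltaIteration<Tuple2<Long, Long>, Tuple2<Long, Long>> it =
        state.iterateDelta(state, 10, 0);
// candidate updates: decrement each label, but never below 1
DataSet<Tuple2<Long, Long>> candidates = it.getWorkset()
        .map(new MapFunction<Tuple2<Long, Long>, Tuple2<Long, Long>>() {
            @Override
            public Tuple2<Long, Long> map(Tuple2<Long, Long> v) {
                return new Tuple2<>(v.f0, Math.max(v.f1 - 1L, 1L));
            }
        });
// keep only genuine improvements over the current solution set
DataSet<Tuple2<Long, Long>> deltas = candidates
        .join(it.getSolutionSet()).where(0).equalTo(0)
        .with(new FlatJoinFunction<Tuple2<Long, Long>, Tuple2<Long, Long>, Tuple2<Long, Long>>() {
            @Override
            public void join(Tuple2<Long, Long> cand, Tuple2<Long, Long> current,
                             Collector<Tuple2<Long, Long>> out) {
                if (cand.f1 < current.f1) {
                    out.collect(cand); // merged into the solution, fed to the next workset
                }
            }
        });
it.closeWith(deltas, deltas).print();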
29. Flink stack
[Diagram: the API layer (Scala API (batch), Java API (batch), Java API (streaming), Python API (upcoming), Graph API, Apache MRQL) sits on the Common API, which feeds the Flink Optimizer and Flink Stream Builder on top of the Flink Runtime / Flink Execution Engine; execution runs embedded (Java collections), locally, on YARN, EC2, or Apache Tez; data storage connectors: files, HDFS, S3, JDBC, Kafka, RabbitMQ, Redis, Azure tables, …]
30. Flink stack
[Diagram: the same API layer and Common API over the Flink Optimizer and Flink Stream Builder, now centered on the Flink Local Runtime; programs run in an embedded environment (Java collections), a local environment (for debugging), or a remote environment (regular cluster execution); single-node execution, a Flink cluster, YARN, and Apache Tez serve as backends; data storage connectors: files, HDFS, S3, JDBC, Kafka, RabbitMQ, Redis, Azure tables, …]
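A hedged sketch of how the environments above are chosen in code (host, port, and jar path are illustrative):

import org.apache.flink.api.java.ExecutionEnvironment;

ExecutionEnvironment auto = ExecutionEnvironment.getExecutionEnvironment();  // picks local or cluster from the context
ExecutionEnvironment local = ExecutionEnvironment.createLocalEnvironment();  // in-process, for debugging
ExecutionEnvironment remote = ExecutionEnvironment.createRemoteEnvironment(
        "jobmanager-host", 6123, "/path/to/program.jar");                    // regular cluster execution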
31. Program lifecycle
val source1 = …
val source2 = …
val maxed = source1
    .map(v => (v._1, v._2, math.max(v._1, v._2)))
val filtered = source2
    .filter(v => v._1 > 4)
val result = maxed
    .join(filtered).where(0).equalTo(0)
    .filter(_._1 > 3)
    .groupBy(0)
    .reduceGroup {……}
32. The optimizer is the component that selects an execution plan for a Common API program
Think of an AI system manipulating your program for you
But don't be scared – it works
• Relational databases have been doing this for decades – Flink ports the technology to API-based systems
34. Two execution plans
[Diagram: two physical plans for the same program, reading orders.tbl and lineitem.tbl through DataSource, Filter, and Map operators into a hybrid-hash join (buildHT/probe) and a sort-based GroupRed with a combiner. Plan 1 ships one input with broadcast/forward; plan 2 hash-partitions both inputs on [0] and reuses hash partitioning [0,1] before the GroupRed.]
The best plan depends on the relative sizes of the input files
35. Flink Local Runtime
The local runtime, not the distributed execution engine
Aka: what happens inside every parallel task
36. Flink runtime operators
Sorting and hashing data
• Necessary for grouping, aggregation, reduce, join, cogroup, delta iterations
Flink contains tailored implementations of hybrid hashing and external sorting in Java
• Scale well with both abundant and restricted memory sizes
37. Internal data representation
How is intermediate data internally represented?
[Diagram: in a map task's JVM heap, records such as "O Romeo, Romeo, wherefore art thou Romeo?" are serialized to bytes (00110011 …); the bytes go through network transfer to the reduce task's JVM heap, where a local sort runs over the serialized data, yielding records such as "art, 1", "O, 1", "Romeo, 1", "Romeo, 1".]
38. Internal data representation
Two options: Java objects or raw bytes
Java objects
• Easier to program
• Can suffer from GC overhead
• Hard to de-stage data to disk, may suffer from "out of memory" exceptions
Raw bytes
• Harder to program (custom serialization stack, more involved runtime operators)
• Solves most of the memory and GC problems
• Overhead from object (de)serialization
Flink follows the raw byte approach
39. Memory in Flink
public class WC {
    public String word;
    public int count;
}
[Diagram: the JVM heap is split into an unmanaged part holding user code objects, a managed part holding a pool of memory pages (empty pages are handed out for sorting, hashing, and caching), and network buffers for shuffling and broadcasts.]
40. Memory in Flink (2)
Internal memory management
• Flink initially allocates 70% of the free heap as byte[] segments
• Internal operators allocate() and release() these segments
Flink has its own serialization stack
• All accepted data types are serialized into data segments
Easy to reason about memory, (almost) no OutOfMemory errors, reduced pressure on the GC (smooth performance)
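A hedged sketch of the matching flink-conf.yaml knobs (key names from the TaskManager configuration of that era; the heap size is illustrative):

# total TaskManager JVM heap
taskmanager.heap.mb: 4096
# fraction of the free heap pre-allocated as managed byte[] pages (the 70% above)
taskmanager.memory.fraction: 0.7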
41. Operating on serialized data
Microbenchmark
Sorting 1GB worth of (long, double) tuples
67,108,864 elements
Simple quicksort
42. Flink Execution Engine
The distributed execution engine
Pipelined
• Same engine for Flink and Flink streaming
Pluggable
• Local runtime can be executed on other engines
• E.g., Java collections and Apache Tez
44. Summary
Flink decouples the API from the execution
• The same program can be executed in many different ways
• Hopefully users do not need to care about this and still get very good performance
Unique Flink internal features
• Pipelined execution, native iterations, optimizer, serialized data manipulation, good disk destaging
Very good performance
• Known issues currently being worked on actively
49. Flink in context
[Diagram: the ecosystem by layer. Applications: Hive, Mahout, Cascading, Pig, … Data processing engines: MapReduce, Flink, Spark, Storm, Tez. App and resource management: YARN, Mesos. Storage, streams: HDFS, HBase, Kafka, …]
50. Common API
The notion of "DataSet" is no longer present at this level
A program is a DAG of operators
[Diagram: two DataSources feed a MapOperator and a FilterOperator, which meet in a JoinOperator ending in a DataSink.]
51. Example: Joins in Flink
DataSet<Order> large = ...
DataSet<Lineitem> medium = ...
DataSet<Customer> small = ...
DataSet<Tuple...> joined1 = large.join(medium).where(3).equalTo(1)
    .with(new JoinFunction() { ... });
DataSet<Tuple...> joined2 = small.join(joined1).where(0).equalTo(2)
    .with(new JoinFunction() { ... });
DataSet<Tuple...> result = joined2.groupBy(3).aggregate(MAX, 2);
[Diagram: the join tree, an aggregation γ over (small ⋈ (large ⋈ medium)).]
Built-in strategies include partitioned join and replicated join with local sort-merge or hybrid-hash algorithms.
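A hedged sketch of steering these strategies from the program via join hints (JoinHint comes from the DataSet Java API; the tiny inputs are illustrative):

import org.apache.flink.api.common.operators.base.JoinOperatorBase.JoinHint;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<Tuple2<Integer, String>> big = env.fromElements(new Tuple2<>(1, "order"));
DataSet<Tuple2<Integer, String>> tiny = env.fromElements(new Tuple2<>(1, "customer"));
// replicate the (small) second input to every node instead of repartitioning both
DataSet<Tuple2<Tuple2<Integer, String>, Tuple2<Integer, String>>> joined =
        big.join(tiny, JoinHint.BROADCAST_HASH_SECOND)
           .where(0).equalTo(0);
joined.print();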
52. Optimizer Example
DataSet<Tuple...> large = env.readCsvFile(...);
DataSet<Tuple...> medium = env.readCsvFile(...);
DataSet<Tuple...> small = env.readCsvFile(...);
DataSet<Tuple...> joined1 = large.join(medium).where(3).equalTo(1)
    .with(new JoinFunction() { ... });
DataSet<Tuple...> joined2 = small.join(joined1).where(0).equalTo(2)
    .with(new JoinFunction() { ... });
DataSet<Tuple...> result = joined2.groupBy(3).aggregate(MAX, 2);
1) Partitioned hash-join
2) Broadcast hash-join
3) Grouping/Aggregation reuses the partitioning from step (1): no shuffle!
Partitioned ≈ reduce-side
Broadcast ≈ map-side
53. Operating on serialized data
[Comparison: Hadoop MapReduce serializes data every time (highly robust, never gives up on you); Spark works on objects, RDDs may be stored serialized (serialization considered slow, used only when needed); Flink makes serialization really cheap via partial deserialization and operates on the serialized form (efficient and robust!).]