Timo Walther
Apache Flink PMC
Flink Forward @ San Francisco - April 11th, 2017
Table & SQL API
unified APIs for batch and stream processing
DataStream API is great…
 Very expressive stream processing
• Transform data, update state, define windows, aggregate, etc.
 Highly customizable windowing logic
• Assigners, Triggers, Evictors, Lateness
 Asynchronous I/O
• Improve communication to external systems
 Low-level Operations
• ProcessFunction gives access to timestamps and timers
… but it is not for Everyone!
 Writing DataStream programs is not always easy
• Stream processing technology spreads rapidly
• New streaming concepts (time, state, windows, ...)
 Requires knowledge & skill
• Continuous applications have special requirements
• Programming experience (Java / Scala)
 Users want to focus on their business logic

Why not a Relational API?
 Relational API is declarative
• User says what is needed, system decides how to compute it
 Queries can be effectively optimized
• Fewer black boxes, well-researched field
 Queries are efficiently executed
• Let Flink handle state, time, and common mistakes
 “Everybody” knows and uses SQL!
Goals
 Easy, declarative, and concise relational API
 Tool for a wide range of use cases
 Relational API as a unifying layer
• Queries on batch tables terminate and produce a finite result
• Queries on streaming tables run continuously and produce a result stream
 Same syntax & semantics for both queries
Table API & SQL
 Flink features two relational APIs
• Table API: LINQ-style API for Java & Scala (since Flink 0.9.0)
• SQL: Standard SQL (since Flink 1.1.0)
[Figure: API stack. Table API and SQL sit on top of the DataSet API and the DataStream API, which run on the Flink Dataflow Runtime.]

Table API & SQL Example
val tEnv = TableEnvironment.getTableEnvironment(env)
// configure your data source
val customerSource = CsvTableSource.builder()
  .path("/path/to/customer_data.csv")
  .field("name", Types.STRING).field("prefs", Types.STRING)
  .build()
// register as a table
tEnv.registerTableSource("cust", customerSource)
// define your table program with the Table API ...
val table = tEnv.scan("cust").select('name.lowerCase(), myParser('prefs))
// ... or express the same program in SQL
val sqlTable = tEnv.sql("SELECT LOWER(name), myParser(prefs) FROM cust")
// convert
val ds: DataStream[Customer] = table.toDataStream[Customer]
Windowing in Table API
val sensorData: DataStream[(String, Long, Double)] = ???
// convert DataStream into Table
val sensorTable: Table = sensorData
  .toTable(tableEnv, 'location, 'rowtime, 'tempF)
// define query on Table: one-day tumbling windows per location
val avgTempCTable: Table = sensorTable
  .window(Tumble over 1.day on 'rowtime as 'w)
  .groupBy('location, 'w)
  .select('w.start as 'day, 'location,
    (('tempF.avg - 32) * 0.556) as 'avgTempC)
  .where('location like "room%")
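To make the semantics of the windowed query concrete, here is a minimal sketch in plain Scala collections, without any Flink dependency. It assumes timestamps are non-negative epoch milliseconds and that a one-day tumbling window starts at a multiple of 24 hours; the names `TumbleSketch`, `windowStart`, and `avgTempC` are illustrative and not part of the Table API.

```scala
object TumbleSketch {
  // One sensor reading: (location, event time in ms, temperature in Fahrenheit),
  // the same tuple layout as sensorData on the slide.
  type Reading = (String, Long, Double)

  val DayMillis: Long = 24L * 60 * 60 * 1000

  // Start of the one-day tumbling window that contains the timestamp
  // (assumption: windows are aligned to multiples of one day).
  def windowStart(ts: Long): Long = ts - (ts % DayMillis)

  // Group by (window start, location), average tempF, convert to Celsius with
  // the slide's factor 0.556, and keep only locations starting with "room".
  def avgTempC(readings: Seq[Reading]): Map[(Long, String), Double] =
    readings
      .filter { case (location, _, _) => location.startsWith("room") }
      .groupBy { case (location, ts, _) => (windowStart(ts), location) }
      .map { case (key, rs) =>
        val avgF = rs.map(_._3).sum / rs.size
        (key, (avgF - 32) * 0.556)
      }
}
```

The sketch filters before grouping; because `location` is a grouping key, this gives the same result as the slide's `where` after `select`.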
Windowing in SQL
val sensorData: DataStream[(String, Long, Double)] = ???
// register DataStream
tableEnv.registerDataStream(
  "sensorData", sensorData, 'location, 'rowtime, 'tempF)
// query registered Table
val avgTempCTable: Table = tableEnv.sql("""
  SELECT TUMBLE_START(rowtime, INTERVAL '1' DAY) AS day,
    location,
    AVG((tempF - 32) * 0.556) AS avgTempC
  FROM sensorData
  WHERE location LIKE 'room%'
  GROUP BY location, TUMBLE(rowtime, INTERVAL '1' DAY)
""")
Architecture
2 APIs [SQL, Table API]
* 2 backends [DataStream, DataSet]
= 4 different translation paths?

[Figure: architecture diagram, built up over four slides. Components: Table API, SQL API, Table API Validator, Calcite Parser & Validator, Calcite Catalog, Calcite Logical Plan, Calcite Optimizer, DataSet Rules, DataStream Rules, DataSet Plan, DataStream Plan, DataSet, DataStream, Table Sources.]

Translation to Logical Plan
sensorTable
  .window(Tumble over 1.day on 'rowtime as 'w)
  .groupBy('location, 'w)
  .select(
    'w.start as 'day, 'location,
    (('tempF.avg - 32) * 0.556) as 'avgTempC)
  .where('location like "room%")
[Figure: Table API validation turns the query into table nodes (Catalog Node → Window Aggregate → Project → Filter), which are translated into the Calcite logical plan (Logical Table Scan → Logical Window Aggregate → Logical Project → Logical Filter).]
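Translation to DataStream Plan
[Figure: the Calcite logical plan (Logical Table Scan → Logical Window Aggregate → Logical Project → Logical Filter) is optimized into the optimized plan (Logical Table Scan → Logical Window Aggregate → Logical Calc), which is transformed into the DataStream plan (DataStream Scan, DataStream Aggregate, DataStream Calc).]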
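The optimization step on this slide replaces a Logical Project followed by a Logical Filter with a single Logical Calc. The following is a toy sketch of such a rewrite over a hand-written plan type. It is not Calcite's API: the `Plan` hierarchy and `optimize` are hypothetical names, and the real optimizer applies many rules rather than one pattern match.

```scala
object PlanSketch {
  // A deliberately tiny plan algebra; real plan nodes also carry
  // expressions, field types, and traits, all omitted here.
  sealed trait Plan
  case class Scan(table: String) extends Plan
  case class WindowAggregate(input: Plan) extends Plan
  case class Project(input: Plan) extends Plan
  case class Filter(input: Plan) extends Plan
  // Calc stands for a combined projection and filter.
  case class Calc(input: Plan) extends Plan

  // Rewrite rule: a Filter directly on top of a Project collapses into one Calc.
  // All other nodes are kept and their inputs are optimized recursively.
  def optimize(plan: Plan): Plan = plan match {
    case Filter(Project(in)) => Calc(optimize(in))
    case Filter(in)          => Filter(optimize(in))
    case Project(in)         => Project(optimize(in))
    case WindowAggregate(in) => WindowAggregate(optimize(in))
    case Calc(in)            => Calc(optimize(in))
    case s: Scan             => s
  }
}
```

Applied to the plan on the slide, `Filter(Project(WindowAggregate(Scan("sensorData"))))` becomes `Calc(WindowAggregate(Scan("sensorData")))`, which matches the three-node optimized plan.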
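Translation to Flink Program
[Figure: the DataStream plan is translated and code-generated into a DataStream program. DataStream Scan becomes a forwarding step, DataStream Calc a FlatMap Function, and DataStream Aggregate an Aggregate & Window Function.]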
Current State (in master)
 Batch support
• Selection, Projection, Sort, Inner & Outer Joins, Set operations
• Group-Windows for Slide, Tumble, Session
 Streaming support
• Selection, Projection, Union
• Group-Windows for Slide, Tumble, Session
• Different SQL OVER-Windows (RANGE/ROWS)
 UDFs, UDTFs, custom rules

Use Cases for Streaming SQL
 Continuous ETL & Data Import
 Live Dashboards & Reports
Outlook: Dynamic Tables
Dynamic Tables Model
 Dynamic tables change over time
 Dynamic tables are treated like static batch tables
• Dynamic tables are queried with standard SQL / Table API
• Every query returns another Dynamic Table
 “Stream / Table Duality”
• Stream ←→ Dynamic Table conversions without information loss
Stream to Dynamic Table
 Append Mode:
 Update Mode:
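The figures for the two modes did not survive extraction, so the following sketch illustrates the distinction under a stated assumption: in append mode every stream record is inserted as a new row, while in update mode a record replaces the existing row with the same key. The names `StreamToTableSketch`, `appendMode`, and `updateMode` are hypothetical, and the tables are modelled with plain Scala collections.

```scala
object StreamToTableSketch {
  // A stream record: (key, value).
  type Record = (String, Int)

  // Append mode: each record becomes a new row; existing rows never change,
  // so the table grows by one row per record.
  def appendMode(stream: Seq[Record]): Vector[Record] = stream.toVector

  // Update mode: the table has a key; a record with a known key overwrites
  // that row, a record with a new key inserts a row.
  def updateMode(stream: Seq[Record]): Map[String, Int] =
    stream.foldLeft(Map.empty[String, Int]) {
      case (table, (key, value)) => table.updated(key, value)
    }
}
```

For the records `("a", 1), ("b", 2), ("a", 3)`, the append-mode table holds three rows, while the update-mode table holds two rows with `a` mapped to `3`.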

Querying Dynamic Tables
 Dynamic tables change over time
• A[t]: Table A at specific point in time t
 Dynamic tables are queried with relational semantics
• Result of a query changes as input table changes
• q(A[t]): Evaluate query q on table A at time t
 Query result is continuously updated as t progresses
• Similar to maintaining a materialized view
• t is current event time
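The notation q(A[t]) can be read as a function applied to successive snapshots of the table. The sketch below assumes an append-only table of (event time, key) rows and uses a per-key count as the query q; all names are illustrative. It recomputes the result for every t to keep the semantics obvious, whereas an engine maintaining a materialized view would update the result incrementally.

```scala
object ContinuousQuerySketch {
  // A row of table A: (event time, key).
  type Row = (Long, String)

  // A[t]: all rows whose event time is at most t.
  def snapshot(a: Seq[Row], t: Long): Seq[Row] = a.filter(_._1 <= t)

  // q: count rows per key, a simple grouped aggregation.
  def q(table: Seq[Row]): Map[String, Int] =
    table.groupBy(_._2).map { case (key, rows) => (key, rows.size) }

  // q(A[t]) for each requested point in time.
  def results(a: Seq[Row], times: Seq[Long]): Seq[(Long, Map[String, Int])] =
    times.map(t => (t, q(snapshot(a, t))))
}
```

With rows `(1, "x"), (2, "y"), (3, "x")`, the count for `x` is 1 at t = 1 and t = 2 and becomes 2 at t = 3, which is the sense in which the query result changes as the input table changes.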
Querying a Dynamic Table
 Can we run any query on Dynamic Tables? No!
 State may not grow infinitely as more data arrives
• Set clean-up timeout or key constraints.
 Input may only trigger partial re-computation
 Queries with possibly unbounded state or computation are rejected
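One way to picture the clean-up timeout mentioned above is per-key state that is dropped once a key has been idle for too long. The sketch below is a hypothetical illustration in plain Scala, not Flink's state API: `Entry`, `update`, and the `timeout` parameter are assumed names, and the per-key count stands in for any aggregate.

```scala
object StateCleanupSketch {
  // Per-key state: a running count plus the time of the last update.
  final case class Entry(count: Long, lastUpdate: Long)

  // Process one record for `key` at time `now`:
  // 1. purge every key that has been idle for longer than `timeout`,
  // 2. increment the count of the record's key (starting from 0 if absent).
  def update(state: Map[String, Entry], key: String, now: Long, timeout: Long): Map[String, Entry] = {
    val live = state.filter { case (_, e) => now - e.lastUpdate <= timeout }
    val count = live.get(key).map(_.count).getOrElse(0L) + 1
    live.updated(key, Entry(count, now))
  }
}
```

The trade-off in this sketch is that a key returning after it was purged starts again from zero, so bounded state is bought at the price of possibly inexact results for long-idle keys.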

Dynamic Table to Stream
 Convert Dynamic Table modifications into stream messages
 Similar to database logging techniques
• Undo: previous value of a modified element
• Redo: new value of a modified element
• Undo+Redo: old and the new value of a changed element
 For Dynamic Tables: Redo or Undo+Redo
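The two encodings can be sketched as follows, assuming a keyed table whose modifications arrive as (key, new value) pairs. The message types `Undo` and `Redo` and the functions `redo` and `undoRedo` are illustrative names rather than Flink classes. In the Redo variant the consumer is assumed to upsert by key; in the Undo+Redo variant each change first retracts the old value and then emits the new one.

```scala
object ChangelogSketch {
  sealed trait Msg
  case class Undo(key: String, value: Int) extends Msg // previous value of a modified element
  case class Redo(key: String, value: Int) extends Msg // new value of a modified element

  // Redo stream: emit only the new value for every modification.
  def redo(updates: Seq[(String, Int)]): Seq[Msg] =
    updates.map { case (key, value) => Redo(key, value) }

  // Undo+Redo stream: track the current table contents; when a key already
  // has a value, emit Undo(old) before Redo(new).
  def undoRedo(updates: Seq[(String, Int)]): Seq[Msg] = {
    val start = (Map.empty[String, Int], Vector.empty[Msg])
    val (_, out) = updates.foldLeft(start) {
      case ((table, msgs), (key, value)) =>
        val undo = table.get(key).map(old => Undo(key, old)).toVector
        (table.updated(key, value), (msgs ++ undo) :+ Redo(key, value))
    }
    out
  }
}
```

For the modifications `("a", 1), ("b", 2), ("a", 3)`, the Redo stream carries three messages, while the Undo+Redo stream carries four because the second change to `a` also emits `Undo("a", 1)`.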
Dynamic Table to Stream
 Undo+Redo Stream (because A is in Append Mode):
Dynamic Table to Stream
 Redo Stream (because A is in Update Mode):
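Result computation & refinement
[Figure: timeline of when a result is emitted and refined. Labels:]
• First result (end – x)
• Last result (end + x)
• State is purged.
• Late updates (on new data)
• Update rate (every x)
• Complete result (end + x)
• Complete result can be computed (end)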

Contributions welcome!
 Huge interest and many contributors
• Adding more window operators
• Introducing dynamic tables
 And there is a lot more to do
• New operators and features for streaming and batch
• Performance improvements
• Tooling and integration
 Try it out, give feedback, and start contributing!
Thank you!


