This document discusses Flink's Table and SQL APIs, which provide a unified way to write batch and streaming queries. It motivates the need for a relational API by explaining that while Flink's DataStream API is powerful, it demands more technical skill. The Table and SQL APIs let users focus on business logic by writing declarative queries. It describes how the APIs work, including how queries are translated into logical and execution plans, and how batch, streaming, and windowed queries are supported. Finally, it outlines the current capabilities and opportunities for contributors to help expand Flink's relational features.
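To make this concrete, here is a minimal sketch of such a declarative query in Java; the table name, schema, and datagen connector settings are illustrative, not taken from the document. The same query runs unchanged in batch mode by swapping inStreamingMode() for inBatchMode().

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;

public class UnifiedQuerySketch {
    public static void main(String[] args) {
        // The mode switch is the only difference between batch and streaming.
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // A source table backed by Flink's built-in datagen connector.
        tEnv.executeSql(
                "CREATE TABLE orders (product STRING, amount INT) " +
                "WITH ('connector' = 'datagen', 'rows-per-second' = '5')");

        // A declarative query: the planner translates it into a logical plan,
        // optimizes it, and generates the execution plan.
        Table revenue = tEnv.sqlQuery(
                "SELECT product, SUM(amount) AS total FROM orders GROUP BY product");
        revenue.execute().print();
    }
}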
Rethinking State Management in Cloud-Native Streaming Systems
Current 2022 talk.
Speaker: Yingjun Wu
Title: Rethinking State Management in Cloud-Native Streaming Systems.
Abstract:
Stream processing is becoming increasingly essential for extracting business value from data in real time. To achieve strict user-defined SLAs under constantly changing workloads, modern streaming systems have started taking advantage of the cloud for scalable and resilient resources. New demand opens new opportunities and challenges for state management, which is at the core of streaming systems. Existing approaches typically use embedded key-value storage so that each worker can access it locally to achieve high performance. However, this approach requires an external durable file system for checkpointing, makes state redistribution during scaling and migration complicated and time-consuming, and is prone to performance throttling. Therefore, we propose shared storage based on an LSM-tree. State is stored in cloud object storage, where it is seamlessly made durable, and the high bandwidth of cloud storage enables fast recovery. The location of a state partition is decoupled from compute nodes, making scaling straightforward and more efficient. Compaction in this shared LSM-tree is globally coordinated, with opportunistic serverless boosting, instead of relying on individual compute nodes. We design streaming-aware compaction and caching strategies to achieve smoother and better end-to-end performance.
Flink Forward San Francisco 2022.
Resource elasticity is a frequently requested feature in Apache Flink: users want to be able to easily adjust their clusters to changing workloads for resource-efficiency and cost-saving reasons. In Flink 1.13, the initial implementation of Reactive Mode was introduced, and later releases added more improvements to make the feature production-ready. In this talk, we'll explain scenarios for deploying Reactive Mode in various environments to achieve autoscaling and resource elasticity. We'll discuss the constraints to consider when planning to use this feature, as well as potential improvements from the Flink roadmap. For those interested in the internals of Flink, we'll also briefly explain how the feature is implemented, and, if time permits, conclude with a short demo.
by
Robert Metzger
Virtual Flink Forward 2020: A deep dive into Flink SQL - Jark Wu
During the last two major versions (1.9 & 1.10), the Apache Flink community spent a lot of effort improving the architecture for further unified batch & streaming processing. One example is that Flink SQL added the ability to support multiple SQL planners under the same API. This talk will first discuss the motivation behind these changes, but more importantly will take a deep dive into Flink SQL. The presentation shows the unified architecture for handling streaming and batch queries and explains how Flink translates queries into relational expressions, leverages Apache Calcite to optimize them, and generates efficient runtime code for execution. Besides, this talk will also describe the lifetime of a query in detail: how the optimizer improves the plan based on relational node patterns, how Flink leverages a binary data format for its basic data structures, and how certain operators work. This should give the audience a better understanding of Flink SQL internals.
Introducing the Apache Flink Kubernetes Operator
Flink Forward San Francisco 2022.
The Apache Flink Kubernetes Operator provides a consistent approach to managing Flink applications automatically, without any human interaction, by extending the Kubernetes API. Given the increasing adoption of Kubernetes-based Flink deployments, the community has been working on a Kubernetes-native solution as part of Flink that can benefit from the rich experience of community members and ultimately make Flink easier to adopt. In this talk we give a technical introduction to the Flink Kubernetes Operator and demonstrate the core features and use cases through in-depth examples.
by
Thomas Weise
Spring Batch is a framework for writing batch jobs that can run as scheduled or on-demand processes without user interaction. It provides reusable components for connecting to databases or other systems, processing and transforming data in chunks, and writing output. The basic architecture includes a job launcher, job made of steps, and components for reading input, processing it, and writing output in chunks. Spring Batch Admin provides a web-based interface for monitoring and managing batch jobs.
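To make the chunk model concrete, here is a minimal Spring Batch 5-style configuration sketch; the bean names, chunk size, and the trivial uppercase processor are illustrative assumptions, not taken from the document.

import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.job.builder.JobBuilder;
import org.springframework.batch.core.repository.JobRepository;
import org.springframework.batch.core.step.builder.StepBuilder;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.transaction.PlatformTransactionManager;

@Configuration
public class ImportJobConfig {

    // A chunk-oriented step: read items, transform them, and write them
    // in chunks of 100, each chunk committed in its own transaction.
    @Bean
    public Step importStep(JobRepository jobRepository,
                           PlatformTransactionManager txManager,
                           ItemReader<String> reader,
                           ItemWriter<String> writer) {
        return new StepBuilder("importStep", jobRepository)
                .<String, String>chunk(100, txManager)
                .reader(reader)
                .processor(item -> item.trim().toUpperCase()) // per-item transform
                .writer(writer)
                .build();
    }

    // A job is an ordered sequence of steps, triggered by the job launcher.
    @Bean
    public Job importJob(JobRepository jobRepository, Step importStep) {
        return new JobBuilder("importJob", jobRepository)
                .start(importStep)
                .build();
    }
}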
Arrow Flight is a proposed RPC layer for Apache Arrow that allows for efficient transfer of Arrow record batches between systems. It uses GRPC as the foundation to define streams of Arrow data that can be consumed in parallel across locations. Arrow Flight supports custom actions that can be used to build services on top of the generic API. By extending GRPC, Arrow Flight aims to simplify the creation of data applications while enabling high performance data transfer and locality awareness.
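As a client-side sketch of that flow against the Arrow Flight Java API (host, port, and the stream name are made up; a real server would hand back one endpoint per parallel partition):

import org.apache.arrow.flight.FlightClient;
import org.apache.arrow.flight.FlightDescriptor;
import org.apache.arrow.flight.FlightInfo;
import org.apache.arrow.flight.FlightStream;
import org.apache.arrow.flight.Location;
import org.apache.arrow.flight.Ticket;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;

public class FlightReadSketch {
    public static void main(String[] args) throws Exception {
        try (BufferAllocator allocator = new RootAllocator()) {
            Location location = Location.forGrpcInsecure("flight.example.com", 8815);
            try (FlightClient client = FlightClient.builder(allocator, location).build()) {
                // Ask the server where the stream named "trades" lives.
                FlightInfo info = client.getInfo(FlightDescriptor.path("trades"));
                // Each endpoint could be consumed in parallel; we read the first.
                Ticket ticket = info.getEndpoints().get(0).getTicket();
                try (FlightStream stream = client.getStream(ticket)) {
                    while (stream.next()) {
                        // Record batches arrive as Arrow VectorSchemaRoots.
                        System.out.println("rows: " + stream.getRoot().getRowCount());
                    }
                }
            }
        }
    }
}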
This document discusses Aspect Oriented Programming (AOP) using the Spring Framework. It defines AOP as a programming paradigm that extends OOP by enabling modularization of crosscutting concerns. It then discusses how AOP addresses common crosscutting concerns like logging, validation, caching, and transactions through aspects, pointcuts, and advice. It also compares Spring AOP and AspectJ, and shows how to implement AOP in Spring using annotations or XML.
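For example, the logging concern can be pulled out of business code into a single aspect. A minimal annotation-based sketch (the package in the pointcut is a placeholder):

import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;
import org.aspectj.lang.annotation.Pointcut;
import org.springframework.stereotype.Component;

@Aspect
@Component
public class LoggingAspect {

    // Pointcut: every public method in the (hypothetical) service package.
    @Pointcut("execution(public * com.example.service..*.*(..))")
    public void serviceMethods() {}

    // Around advice: log elapsed time without touching the advised code.
    @Around("serviceMethods()")
    public Object logTiming(ProceedingJoinPoint pjp) throws Throwable {
        long start = System.nanoTime();
        try {
            return pjp.proceed();
        } finally {
            System.out.println(pjp.getSignature() + " took "
                    + (System.nanoTime() - start) / 1_000_000 + " ms");
        }
    }
}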
Spark SQL Catalyst Code Optimization using Function Outlining with Kavana Bha...
The document discusses code optimization techniques in Spark SQL's Catalyst optimizer. It describes how function outlining can improve performance of generated Java code by splitting large routines into smaller ones. The document outlines a Spark SQL query optimization case study where outlining a 300+ line routine from Catalyst code generation improved query performance by up to 19% on a Power8 cluster. Overall, the document examines how function outlining and other code generation optimizations in Catalyst can help the Java JIT compiler better optimize Spark SQL queries.
How to Extend Apache Spark with Customized Optimizations
There is a growing set of optimization mechanisms that allow you to achieve competitive SQL performance. Spark has extension points that help third parties add customizations and optimizations without needing these optimizations to be merged into Apache Spark. This is very powerful and helps extensibility. We have added some enhancements to the existing extension-points framework to enable fine-grained control. This talk will be a deep dive into the extension points that are available in Spark today. We will also talk about the enhancements we developed to make this API more powerful. This talk will benefit developers who are looking to customize Spark in their deployments.
The document discusses Struts framework and internationalization (I18N) applications. Some key points:
1. Struts is an MVC framework that simplifies development of web applications. It provides components like ActionServlet and tag libraries.
2. I18N applications display output based on the user's locale/language. This is achieved using properties files with language-specific translations and the ResourceBundle class (see the sketch after this list).
3. In Struts, properties files are configured in struts-config.xml and accessed in JSPs using the <message> tag. Keys that are not found will result in errors unless null=false is specified.
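The ResourceBundle lookup behind point 2 works the same outside Struts; a plain-Java sketch, assuming MessageResources.properties files with a welcome.message key exist on the classpath:

import java.util.Locale;
import java.util.ResourceBundle;

public class I18nSketch {
    public static void main(String[] args) {
        // Picks MessageResources_en.properties or MessageResources_fr.properties.
        ResourceBundle en = ResourceBundle.getBundle("MessageResources", Locale.ENGLISH);
        ResourceBundle fr = ResourceBundle.getBundle("MessageResources", Locale.FRENCH);
        System.out.println(en.getString("welcome.message")); // e.g. "Welcome"
        System.out.println(fr.getString("welcome.message")); // e.g. "Bienvenue"
    }
}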
This document provides an overview and comparison of the algorithms, phases, and commonalities of modern concurrent garbage collectors in HotSpot, including G1, Shenandoah, and Z GC. It begins with laying the groundwork on stop-the-world vs concurrent collection and heap layout. It then introduces the key differences between the three collectors in their marking, barrier, and compaction approaches. The goal of the document is to provide a technical introduction and high-level differences between these concurrent garbage collectors.
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...
Flink Forward San Francisco 2022.
Being in the payments space, Stripe requires strict correctness and freshness guarantees. We rely on Flink as the natural solution for delivering on this in support of our Change Data Capture (CDC) infrastructure. We heavily rely on CDC as a tool for capturing data change streams from our databases without critically impacting database reliability, scalability, and maintainability. Data derived from these streams is used broadly across the business and powers many of our critical financial reporting systems totalling over $640 Billion in payment volume annually. We use many components of Flink’s flexible DataStream API to perform aggregations and abstract away the complexities of stream processing from our downstreams. In this talk, we’ll walk through our experience from the very beginning to what we have in production today. We’ll share stories around the technical details and trade-offs we encountered along the way.
by
Jeff Chao
Tutorial Videos: https://www.youtube.com/playlist?list=PLD8nQCAhR3tQ7KXnvIk_v_SLK-Fb2y_k_
Day 1 : Introduction to React, Babel and Webpack
Prerequisites for starting the workshop (basic understanding of Node & Express)
What is Virtual DOM?
What is React and why should we use it?
Install and set up React:
a. Using create-react-app
b. From scratch using Babel and Webpack (we will use Webpack Dev Server)
Day 2 : React Basic Concepts
Types of Components: Class-based and Functional Components
Use of JSX
Parent, Child, and Nested Components
Difference between State and Props
Create and Handle Routes
Component Lifecycle Methods
Creating a form and handling form inputs
Use of arrow functions and Spread Operator
Day 3: Advanced Concepts in React
Use of Refs
What are Higher Order Components (HOC)?
How to use HOC
Understanding Context in React
Flink powered stream processing platform at Pinterest
Flink Forward San Francisco 2022.
Pinterest is a visual discovery engine that serves over 433MM users. Stream processing allows us to unlock value from realtime data for pinners. At Pinterest, we adopted Flink as the unified stream processing engine. In this talk, we will share our journey in building a stream processing platform with Flink and how we onboarded critical use cases to the platform. Pinterest now supports 90+ near-realtime streaming applications. We will cover the problem statement, how we evaluated potential solutions, and our decision to build the framework.
by
Rainie Li & Kanchi Masalia
How We Optimize Spark SQL Jobs With parallel and sync IO
Although NVMe has become more and more popular in recent years, large numbers of HDDs are still widely used in super-large-scale big data clusters. In an EB-level data platform, IO cost (including decompression and decoding) contributes a large proportion of Spark jobs' cost. In other words, IO operations are worth optimizing.
At ByteDance, we implemented a series of IO optimizations to improve performance, including parallel reads and asynchronous shuffle. First, we implemented file-level parallel reads to improve performance when there are many small files. Second, we designed row-group-level parallel reads to accelerate queries in big-file scenarios. Third, we implemented asynchronous spill to improve job performance. Besides, we designed Parquet column families, which split a table into a few column families stored in different Parquet files. Different column families can be read in parallel, so read performance is much higher than with the existing approach. In our practice, end-to-end performance improved by 5% to 30%.
In this talk, I will illustrate how we implement these features and how they accelerate Apache Spark jobs.
Timo Walther - Table & SQL API - unified APIs for batch and stream processing
SQL is undoubtedly the most widely used language for data analytics. It is declarative and can be optimized and efficiently executed by most query processors. Therefore the community has made an effort to add relational APIs to Apache Flink: a standard SQL API and a language-integrated Table API.
Both APIs are semantically compatible and share the same optimization and execution path based on Apache Calcite. Since Flink supports both stream and batch processing and many use cases require both kinds of processing, we aim for a unified relational layer.
In this talk we will look at the current API capabilities, find out what's under the hood of Flink’s relational APIs, and give an outlook on future features such as dynamic tables, Flink's way of converting streams into tables and vice versa by leveraging the stream-table duality.
Apache Flink's Table & SQL API - unified APIs for batch and stream processing
SQL is undoubtedly the most widely used language for data analytics. It is declarative and can be optimized and efficiently executed by most query processors. Therefore the community has made an effort to add relational APIs to Apache Flink: a standard SQL API and a language-integrated Table API.
Both APIs are semantically compatible and share the same optimization and execution path based on Apache Calcite. Since Flink supports both stream and batch processing and many use cases require both kinds of processing, we aim for a unified relational layer.
In this talk we will look at the current API capabilities, find out what's under the hood of Flink’s relational APIs, and give an outlook on future features such as dynamic tables, Flink's way of converting streams into tables and vice versa by leveraging the stream-table duality.
SQL can be used to query both streaming and batch data. Apache Flink and Apache Calcite enable SQL queries on streaming data. Flink uses its Table API and integrates with Calcite to translate SQL queries into dataflow programs. This allows standard SQL to be used for both traditional batch analytics on finite datasets and stream analytics producing continuous results from infinite data streams. Queries are executed continuously by applying operators within windows to subsets of streaming data.
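A hedged sketch of such a windowed query in Flink SQL; the click stream is made up, and a processing-time attribute is used so the example runs standalone without watermark setup:

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class WindowedCountSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Hypothetical click stream; PROCTIME() attaches a time attribute.
        tEnv.executeSql(
                "CREATE TABLE clicks (user_id STRING, ts AS PROCTIME()) " +
                "WITH ('connector' = 'datagen', 'rows-per-second' = '10')");

        // A tumbling one-minute window turns the infinite stream into
        // finite per-window groups, producing one result row per window.
        tEnv.executeSql(
                "SELECT user_id, TUMBLE_END(ts, INTERVAL '1' MINUTE) AS window_end, " +
                "COUNT(*) AS clicks FROM clicks " +
                "GROUP BY user_id, TUMBLE(ts, INTERVAL '1' MINUTE)").print();
    }
}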
Fabian Hueske - Stream Analytics with SQL on Apache Flink
Fabian Hueske presented on stream analytics using SQL on Apache Flink. Flink provides a scalable platform for stream processing that is fast, accurate, and reliable. Its relational APIs allow querying both batch and streaming data using standard SQL or a LINQ-style Table API. Queries on streaming data produce continuously updating results. Windows can be used to compute aggregates over tumbling time intervals. The dynamic tables representing streaming data can be converted to output streams encoding updates as insertions and deletions. While not all queries can be supported, techniques like limiting state size allow bounding computational resources. Use cases like continuous ETL, dashboards, and event-driven architectures were discussed.
Stream Analytics with SQL on Apache Flink - Fabian Hueske
Apache Flink's DataStream API is very expressive and gives users precise control over time and state. However, many applications do not require this level of expressiveness and can be implemented more concisely and easily with a domain-specific API.
SQL is undoubtedly the most widely used language for data processing but is usually applied in the domain of batch processing. Apache Flink features two relational APIs for unified stream and batch processing: the Table API, a language-integrated relational query API for Scala and Java, and SQL. A Table API or SQL query computes the same result regardless of whether it is evaluated on a static file or on a Kafka topic. While Flink evaluates queries on batch input like a conventional query engine, queries on streaming input are continuously processed and their results constantly updated and refined.
In this talk we present Flink’s unified relational APIs, show how streaming SQL queries are processed, and discuss exciting new use-cases.
Flink Forward Berlin 2017: Fabian Hueske - Using Stream and Batch Processing ...
Apache Flink's DataStream API is very expressive and gives users precise control over time and state. However, many applications do not require this level of expressiveness and can be implemented more concisely and easily with a domain-specific API. SQL is undoubtedly the most widely used language for data processing but is usually applied in the domain of batch processing. Apache Flink features two relational APIs for unified stream and batch processing: the Table API, a language-integrated relational query API for Scala and Java, and SQL. A Table API or SQL query computes the same result regardless of whether it is evaluated on a static file or on a Kafka topic. While Flink evaluates queries on batch input like a conventional query engine, queries on streaming input are continuously processed and their results constantly updated and refined. In this talk we present Flink's unified relational APIs, show how streaming SQL queries are processed, and discuss exciting new use cases.
Building a REST Service in minutes with Spring Boot - Omri Spector
A walk through building a microservice using Spring Boot.
Deck presented at Java 2016
Source accompanying presentation can be found at https://github.com/ospector/sbdemo
Bucketing 2.0: Improve Spark SQL Performance by Removing Shuffle - Databricks
Bucketing is commonly used in Hive and Spark SQL to improve performance by eliminating shuffle in join or group-by-aggregate scenarios. This is ideal for a variety of write-once and read-many datasets at Bytedance.
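A small sketch of how bucketing is declared at write time in Spark (the path, table name, and bucket count are illustrative); two tables bucketed identically on the join key can later be joined without a shuffle:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class BucketedWriteSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("bucketing-sketch")
                .master("local[*]")
                .enableHiveSupport() // bucketed tables are persisted via the catalog
                .getOrCreate();

        Dataset<Row> orders = spark.read().parquet("/data/orders"); // hypothetical path

        // Pre-shuffle once at write time: data is hash-partitioned into 64
        // buckets by customer_id and sorted within each bucket.
        orders.write()
                .bucketBy(64, "customer_id")
                .sortBy("customer_id")
                .saveAsTable("orders_bucketed");
    }
}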
Maxim Fateev - Beyond the Watermark - On-Demand Backfilling in Flink - Flink Forward
This document discusses an on-demand backfilling solution for Flink pipelines that process incentive programs. It describes a pipeline that evaluates driver incentives based on their status changes. Some incentives are defined retroactively, so an on-demand source was created that can continuously backfill old data. This source emits state changes paired with the applicable incentives from the beginning of each period. It allows a single pipeline to process both new and retroactive incentives without needing separate pipelines. The document also suggests this approach could be generalized into a pipeline template.
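A highly simplified sketch of the idea, not the speaker's actual code: a single Flink source that first replays archived status changes and then tails the live feed, so one pipeline covers both retroactive and new incentives. The helpers and event encoding are hypothetical stand-ins.

import java.util.List;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

public class BackfillingSource implements SourceFunction<String> {
    private volatile boolean running = true;

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        // 1) Backfill: replay state changes from the beginning of the period.
        for (String change : loadHistoricalChanges()) {
            ctx.collect(change);
        }
        // 2) Tail: keep emitting live events until cancelled.
        while (running) {
            ctx.collect(pollLiveFeed());
            Thread.sleep(100);
        }
    }

    @Override
    public void cancel() {
        running = false;
    }

    // Hypothetical stand-ins for a real archive reader and a live consumer.
    private List<String> loadHistoricalChanges() {
        return List.of("driver-1:ONLINE@t0", "driver-1:OFFLINE@t1");
    }

    private String pollLiveFeed() {
        return "driver-1:ONLINE@now";
    }
}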
Apache Flink Overview at SF Spark and Friends - Stephan Ewen
Introductory presentation for Apache Flink, with a bias towards the streaming data analysis features in Flink. Shown at the San Francisco Spark and Friends Meetup.
Data all over the place! How SQL and Apache Calcite bring sanity to streaming... - Julian Hyde
The revolution has happened. We are living in the age of the deconstructed database. Modern enterprises are powered by data, and that data lives in many formats and locations, in flight and at rest, but somewhat surprisingly, the lingua franca for data remains SQL.
In this talk, Julian describes Apache Calcite, a toolkit for relational algebra that powers many systems including Apache Beam, Flink and Hive. He discusses some areas of development in Calcite: streaming SQL, materialized views, enabling spatial query on vanilla databases, and what a mash-up of all three might look like.
He also describes how SQL is being extended to handle streaming, and the challenges that will need to be solved if it is to become standard.
A talk given by Julian Hyde at Lyft, San Francisco, on 2018/06/27.
Streaming is necessary to handle data rates and latency, but SQL is unquestionably the lingua franca of data. Is it possible to combine SQL with streaming, and if so, what does the resulting language look like? Apache Calcite is extending SQL to include streaming, and Apache Apex is using Calcite to support streaming SQL. In this talk, Julian Hyde describes streaming SQL in detail and shows how you can use streaming SQL in your application. He also describes how Calcite’s planner optimizes queries for throughput and latency.
Julian Hyde gave this talk at Apex Big Data World, Mountain View, on April 4, 2017.
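For flavor, Calcite's documented streaming dialect adds a STREAM keyword to standard SQL. Shown here as a Java string constant; the Orders table and its rowtime column are illustrative names:

public class CalciteStreamingSqlSketch {
    // Hourly totals over a stream: the STREAM keyword asks for continuous
    // results, and TUMBLE assigns each row to a one-hour window.
    static final String HOURLY_ORDER_COUNTS =
            "SELECT STREAM TUMBLE_END(rowtime, INTERVAL '1' HOUR) AS rowtime, " +
            "       productId, COUNT(*) AS c " +
            "FROM Orders " +
            "GROUP BY TUMBLE(rowtime, INTERVAL '1' HOUR), productId";
}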
Distributed Real-Time Stream Processing: Why and How 2.0 - Petr Zapletal
The demand for stream processing is increasing a lot these days. Immense amounts of data have to be processed fast from a rapidly growing set of disparate data sources. This pushes the limits of traditional data processing infrastructures. These stream-based applications include trading, social networks, Internet of things, system monitoring, and many other examples.
In this talk we are going to discuss various state-of-the-art open-source distributed streaming frameworks, their similarities and differences, implementation trade-offs, and their intended use cases. Apart from that, I'm going to speak about Fast Data, the theory of streaming, framework evaluation, and so on. My goal is to provide a comprehensive overview of modern streaming frameworks and to help fellow developers pick the best possible one for their particular use case.
Data Stream Analytics - Why they are important - Paris Carbone
Streaming is cool and it can help us do quick analytics and make a profit, but what about tsunamis? This is a motivation talk presented at the SeRC Big Data Workshop in Sweden during spring 2016. It motivates the streaming paradigm and provides examples on Apache Flink.
Streaming SQL Foundations: Why I ❤ Streams+Tables - C4Media
Video and slides synchronized, mp3 and slide download available at URL https://bit.ly/2rtxaMm.
Tyler Akidau explores the relationship between the Beam Model and stream & table theory. He explains what is required to provide robust stream processing support in SQL and discusses concrete efforts that have been made in this area by the Apache Beam, Calcite, and Flink communities, comparing them to other offerings such as Apache Kafka's KSQL and Apache Spark's Structured Streaming. Filmed at qconlondon.com.
Tyler Akidau is a senior staff software engineer at Google, where he is the technical lead for the Data Processing Languages & Systems group, responsible for Google's Apache Beam efforts, Google Cloud Dataflow, and internal data processing tools like Google Flume, MapReduce, and MillWheel. He's also a founding member of the Apache Beam PMC.
"Apache Flink aims to make stream processing easy and accessible for everyone. It should come as no surprise that a high level of abstraction puts more load on the core. Flink's SQL engine is the workhorse behind many on-prem and managed SQL platforms. Yet very few users know what is really going on under the hood when submitting a SQL query.
In this talk, we take a deep look into the internals of Flink SQL. Let's take the stack apart! We start with some SQL text and go all the way down to Flink's streaming primitives. I will go through the individual optimizer phases. You will learn how event-time operations are tracked when declaring a watermark, how state is managed when using different kinds of joins, and how changelog modes and upsert keys travel through topology when reading from a Change Data Capture connector.
After this talk, you may not be able to write an optimizer rule, but you should, at least, get a feeling for the power of a simple streaming SQL query.
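For instance, the watermark declaration the talk mentions lives in DDL. A small runnable sketch, with the schema and the datagen connector as illustrative stand-ins for a real CDC source:

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class WatermarkDdlSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // The WATERMARK clause declares order_time as the event-time attribute;
        // the planner tracks it through every downstream operation.
        tEnv.executeSql(
                "CREATE TABLE orders (" +
                "  order_id STRING," +
                "  amount DECIMAL(10, 2)," +
                "  order_time TIMESTAMP(3)," +
                "  WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND" +
                ") WITH ('connector' = 'datagen')");

        tEnv.executeSql("DESCRIBE orders").print(); // order_time shows as *ROWTIME*
    }
}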
Fabian Hueske - Taking a look under the hood of Apache Flink’s relational APIs - Flink Forward
http://flink-forward.org/kb_sessions/taking-a-look-under-the-hood-of-apache-flinks-relational-apis/
Apache Flink features two APIs which are based on relational algebra, a SQL interface and the so-called Table API, which is a LINQ-style API available for Scala and Java. Relational APIs are interesting because they are easy to use and queries can be automatically optimized and translated into efficient runtime code. Flink offers both APIs for streaming and batch data sources. This talk will take a look under the hood of Flink’s relational APIs. We will show the unified architecture to handle streaming and batch queries and explain how Flink translates queries of both APIs into the same representation, leverages Apache Calcite to optimize them, and generates runtime code for efficient execution. Finally, we will discuss potential improvements and give an outlook for future extensions and features.
Taking a look under the hood of Apache Flink's relational APIs - Fabian Hueske
Apache Flink features two APIs which are based on relational algebra: a SQL interface and the so-called Table API, a LINQ-style API available for Scala and Java. Relational APIs are interesting because they are easy to use and queries can be automatically optimized and translated into efficient runtime code. Flink offers both APIs for streaming and batch data sources. This talk takes a look under the hood of Flink’s relational APIs. The presentation shows the unified architecture for handling streaming and batch queries and explains how Flink translates queries of both APIs into the same representation, leverages Apache Calcite to optimize them, and generates runtime code for efficient execution. Finally, the slides discuss potential improvements and give an outlook for future extensions and features.
Fast federated SQL with Apache Calcite - Chris Baynes
This document discusses Apache Calcite, an open source framework for federated SQL queries. It provides an introduction to Calcite and its components. It then evaluates Calcite's performance on single data sources through benchmarks. Lastly, it proposes a hybrid approach to enable efficient federated queries using Calcite and Spark.
Distributed Real-Time Stream Processing: Why and How: Spark Summit East talk ... - Spark Summit
The demand for stream processing is increasing a lot these days. Immense amounts of data have to be processed fast from a rapidly growing set of disparate data sources. This pushes the limits of traditional data processing infrastructures. These stream-based applications include trading, social networks, Internet of things, system monitoring, and many other examples.
A number of powerful, easy-to-use open source platforms have emerged to address this. But the same problem can be solved in different ways; various but sometimes overlapping use cases can be targeted, or different vocabularies can be used for similar concepts. This may lead to confusion, longer development time, or costly wrong decisions.
Distributed Stream Processing - Spark Summit East 2017 - Petr Zapletal
The document discusses distributed stream processing frameworks. It provides an overview of frameworks like Storm, Spark Streaming, Samza, Flink, and Kafka Streams. It compares aspects of the different frameworks such as programming models, delivery guarantees, fault tolerance, and state management. General guidelines are given for choosing a framework based on requirements such as latency and state management. Storm and Trident are recommended for low-latency tasks, while Spark Streaming and Flink are more full-featured but have higher latency. The document provides code examples for word count in different frameworks.
What's new with Apache Spark's Structured Streaming? - Miklos Christine
Structured Streaming in Apache Spark allows users to write streaming applications as batch-style queries on static or streaming data sources. It treats streams as continuous unbounded tables and allows batch queries written on DataFrames/Datasets to be automatically converted into incremental execution plans to process streaming data in micro-batches. This provides a simple yet powerful API for building robust stream processing applications with end-to-end fault tolerance guarantees and integration with various data sources and sinks.
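A minimal Java sketch of that batch-style-query-on-a-stream model, using Spark's built-in rate source so it runs without external systems:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;
import static org.apache.spark.sql.functions.window;

public class StructuredStreamingSketch {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("structured-streaming-sketch")
                .master("local[*]")
                .getOrCreate();

        // The rate source emits (timestamp, value) rows continuously; Spark
        // treats the stream as an unbounded table.
        Dataset<Row> rates = spark.readStream()
                .format("rate")
                .option("rowsPerSecond", "5")
                .load();

        // An ordinary aggregation, incrementally maintained per micro-batch.
        Dataset<Row> counts = rates
                .groupBy(window(rates.col("timestamp"), "10 seconds"))
                .count();

        StreamingQuery query = counts.writeStream()
                .outputMode("complete")
                .format("console")
                .start();
        query.awaitTermination();
    }
}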
Similar to Flink Forward SF 2017: Timo Walther - Table & SQL API – unified APIs for batch and stream processing
Building a fully managed stream processing platform on Flink at scale for Lin... - Flink Forward
Apache Flink is a distributed stream processing framework that allows users to process and analyze data in real time. At LinkedIn, we developed a fully managed stream processing platform on Flink running on K8s to power hundreds of stream processing pipelines in production. This platform is the backbone for other infra systems like Search, Espresso (an internal document store), and feature management. We provide a rich authoring and testing environment which allows users to create, test, and deploy their streaming jobs in a self-serve fashion within minutes. Users can focus on their business logic, leaving the Flink platform to take care of management aspects such as split deployment, resource provisioning, auto-scaling, job monitoring, alerting, failure recovery, and much more. In this talk, we will introduce the overall platform architecture, highlight the unique value propositions that it brings to stream processing at LinkedIn, and share the experiences and lessons we have learned.
Evening out the uneven: dealing with skew in Flink - Flink Forward
Flink Forward San Francisco 2022.
When running Flink jobs, skew is a common problem that results in wasted resources and limited scalability. In the past years, we have helped our customers and users solve various skew-related issues in their Flink jobs or clusters. In this talk, we will present the different types of skew that users often run into: data skew, key skew, event time skew, state skew, and scheduling skew, and discuss solutions for each of them. We hope this will serve as a guideline to help you reduce skew in your Flink environment.
by
Jun Qin & Karl Friedrich
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ... - Flink Forward
Flink Forward San Francisco 2022.
Probably everyone who has written stateful Apache Flink applications has used one of the fault-tolerant keyed state primitives ValueState, ListState, and MapState. With RocksDB, however, retrieving and updating items comes at an increased cost that you should be aware of. Sometimes, these costs may not be avoidable with the current API, e.g., for efficient event-time stream-sorting or streaming joins where you need to iterate one or two buffered streams in the right order. With FLIP-220, we are introducing a new state primitive: BinarySortedMultiMapState. This new form of state allows you to (a) efficiently store lists of values for a user-provided key, and (b) iterate keyed state in a well-defined sort order. Both features can be backed efficiently by RocksDB with a 2x performance improvement over the current workarounds. This talk will go into the details of the new API and its implementation, present how to use it in your application, and talk about the process of getting it into Flink.
by
Nico Kruber
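To make the current workaround concrete: the sketch below buffers events in MapState keyed by timestamp and drains them from an event-time timer. It illustrates the read-modify-write cost the talk describes with RocksDB; it is not the new BinarySortedMultiMapState API.

import java.util.ArrayList;
import java.util.List;
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Input: (timestamp, payload) tuples on a keyed stream.
public class EventTimeSorter
        extends KeyedProcessFunction<String, Tuple2<Long, String>, String> {

    private transient MapState<Long, List<String>> buffer;

    @Override
    public void open(Configuration parameters) {
        buffer = getRuntimeContext().getMapState(new MapStateDescriptor<>(
                "buffer", Types.LONG, Types.LIST(Types.STRING)));
    }

    @Override
    public void processElement(Tuple2<Long, String> event, Context ctx,
                               Collector<String> out) throws Exception {
        List<String> atTs = buffer.get(event.f0); // RocksDB: deserialize whole list
        if (atTs == null) {
            atTs = new ArrayList<>();
        }
        atTs.add(event.f1);
        buffer.put(event.f0, atTs);               // RocksDB: re-serialize whole list
        ctx.timerService().registerEventTimeTimer(event.f0);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out)
            throws Exception {
        // Emit everything buffered for this timestamp once the watermark passes.
        List<String> atTs = buffer.get(timestamp);
        if (atTs != null) {
            atTs.forEach(out::collect);
            buffer.remove(timestamp);
        }
    }
}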
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli... - Flink Forward
Flink Forward San Francisco 2022.
Flink consumers read from Kafka as a scalable, high throughput, and low latency data source. However, there are challenges in scaling out data streams where migration and multiple Kafka clusters are required. Thus, we introduced a new Kafka source to read sharded data across multiple Kafka clusters in a way that conforms well with elastic, dynamic, and reliable infrastructure. In this presentation, we will present the source design and how the solution increases application availability while reducing maintenance toil. Furthermore, we will describe how we extended the existing KafkaSource to provide mechanisms to read logical streams located on multiple clusters, to dynamically adapt to infrastructure changes, and to perform transparent cluster migrations and failover.
by
Mason Chen
One sink to rule them all: Introducing the new Async Sink - Flink Forward
Flink Forward San Francisco 2022.
Next time you want to integrate with a new destination for a demo, concept or production application, the Async Sink framework will bootstrap development, allowing you to move quickly without compromise. In Flink 1.15 we introduced the Async Sink base (FLIP-171), with the goal to encapsulate common logic and allow developers to focus on the key integration code. The new framework handles things like request batching, buffering records, applying backpressure, retry strategies, and at least once semantics. It allows you to focus on your business logic, rather than spending time integrating with your downstream consumers. During the session we will dive deep into the internals to uncover how it works, why it was designed this way, and how to use it. We will code up a new sink from scratch and demonstrate how to quickly push data to a destination. At the end of this talk you will be ready to start implementing your own Flink sink using the new Async Sink framework.
by
Steffen Hausmann & Danny Cranmer
Flink Forward San Francisco 2022.
This talk will take you on the long journey of Apache Flink into the cloud-native era. It started back when Hadoop and YARN were the standard way of deploying and operating data applications.
We're going to deep dive into the cloud-native set of principles and how they map to the Apache Flink internals and recent improvements. We'll cover fast checkpointing, fault tolerance, resource elasticity, minimal infrastructure dependencies, industry-standard tooling, ease of deployment and declarative APIs.
After this talk you'll get a broader understanding of the operational requirements for a modern streaming application and where the current limits are.
by
David Moravek
Using the New Apache Flink Kubernetes Operator in a Production Deployment - Flink Forward
Flink Forward San Francisco 2022.
Running natively on Kubernetes, using the new Apache Flink Kubernetes Operator is a great way to deploy and manage Flink application and session deployments. In this presentation, we provide:
- A brief overview of Kubernetes operators and their benefits
- An introduction to the five levels of the operator maturity model
- An introduction to the newly released Apache Flink Kubernetes Operator and FlinkDeployment CRs
- Dockerfile modifications you can make to swap out UBI images and Java of the underlying Flink Operator container
- Enhancements we're making in versioning/upgradeability/stability and security
- A demo of the Apache Flink Operator in action, with a technical preview of an upcoming product using the Flink Kubernetes Operator
- Lessons learned
- Q&A
by
James Busche & Ted Chang
Flink Forward San Francisco 2022.
The Table API is one of the most actively developed components of Flink in recent time. Inspired by databases and SQL, it encapsulates concepts many developers are familiar with. It can be used with both bounded and unbounded streams in a unified way. But from afar it can be difficult to keep track of what this API is capable of and how it relates to Flink's other APIs. In this talk, we will explore the current state of Table API. We will show how it can be used as a batch processor, a changelog processor, or a streaming ETL tool with many built-in functions and operators for deduplicating, joining, and aggregating data. By comparing it to the DataStream API we will highlight differences and elaborate on when to use which API. We will demonstrate hybrid pipelines in which both APIs interact with one another and contribute their unique strengths. Finally, we will take a look at some of the most recent additions as a first step to stateful upgrades.
by
David Anderson
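A small sketch of such a hybrid pipeline (element values are made up): a DataStream is handed to SQL for the relational part, and the result comes back as a changelog stream for further DataStream-level processing:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public class HybridPipelineSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // DataStream -> Table: an atomic String stream becomes a column "f0".
        DataStream<String> words = env.fromElements("flink", "table", "flink");
        tEnv.createTemporaryView("words", words);

        Table counts = tEnv.sqlQuery(
                "SELECT f0 AS word, COUNT(*) AS c FROM words GROUP BY f0");

        // Table -> DataStream: updates arrive as +I/-U/+U changelog rows.
        DataStream<Row> changelog = tEnv.toChangelogStream(counts);
        changelog.print();
        env.execute("hybrid-pipeline");
    }
}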
Flink Forward San Francisco 2022.
Based on the new Flink-Pulsar connector, we implemented Flink's Table API and Catalog to help users interact with the Pulsar cluster via Flink SQL easily. We would like to go through the design and implementation of the SQL connector in the following aspects:
1. Two different modes of using Pulsar as a metadata store
2. Data format transformation and management
3. SQL semantics support within Pulsar context
by
Sijie Guo & Neng Lu
Dynamic Rule-based Real-time Market Data Alerts - Flink Forward
Flink Forward San Francisco 2022.
At Bloomberg, we deal with high volumes of real-time market data. Our clients expect to be notified of any anomalies in this market data, which may indicate volatile movements in the markets, notable trades, forthcoming events, or system failures. The parameters for these alerts are always evolving and our clients can update them dynamically. In this talk, we'll cover how we utilized the open source Apache Flink and Siddhi SQL projects to build a distributed, scalable, low-latency and dynamic rule-based, real-time alerting system to solve our clients' needs. We'll also cover the lessons we learned along our journey.
by
Ajay Vyasapeetam & Madhuri Jain
Exactly-Once Financial Data Processing at Scale with Flink and Pinot - Flink Forward
Flink Forward San Francisco 2022.
At Stripe we have created a complete end-to-end exactly-once processing pipeline to process financial data at scale, by combining the exactly-once power of Flink, Kafka, and Pinot. The pipeline provides an exactly-once guarantee, end-to-end latency within a minute, deduplication against hundreds of billions of keys, and sub-second query latency against the whole dataset, which contains trillions of rows. In this session we will discuss the technical challenges of designing, optimizing, and operating the whole pipeline, including Flink, Kafka, and Pinot. We will also share our lessons learned and the benefits gained from exactly-once processing.
by
Xiang Zhang & Pratyush Sharma & Xiaoman Dong
Processing Semantically-Ordered Streams in Financial Services - Flink Forward
Flink Forward San Francisco 2022.
What if my data is already in order? Stream Processing has given us an elegant and powerful solution for running analytic queries and logic over high volumes of continuously arriving data. However, in both Apache Flink and Apache Beam, the notion of time-ordering is baked in at a very low level, making it difficult to express computations that are interested in a semantic-, rather than time-ordering of the data. In financial services, what often matters the most about the data moving between systems is not when the data was created, but in what order, to the extent that many institutions engineer a global sequencing over all data entering and produced by their systems to achieve complete determinism. How, then, can financial institutions and others best employ Stream Processing on streams of data that are already ordered? I will cover various techniques that can make this work, as well as seek input from the community on how Flink might be improved to better support these use-cases.
by
Patrick Lucas
Tame the small files problem and optimize data layout for streaming ingestion... - Flink Forward
Flink Forward San Francisco 2022.
In modern data platform architectures, stream processing engines such as Apache Flink are used to ingest continuous streams of data into data lakes such as Apache Iceberg. Streaming ingestion into Iceberg tables can suffer from two problems: (1) a small-files problem that can hurt read performance, and (2) poor data clustering that can make file pruning less effective. To address these two problems, we propose adding a shuffling stage to the Flink Iceberg streaming writer. The shuffling stage can intelligently group data via bin packing or range partitioning. This can reduce the number of concurrent files that every task writes. It can also improve data clustering. In this talk, we will explain the motivations in detail and dive into the design of the shuffling stage. We will also share evaluation results that demonstrate the effectiveness of smart shuffling.
by
Gang Ye & Steven Wu
Batch Processing at Scale with Flink & Iceberg
Flink Forward San Francisco 2022.
Goldman Sachs's Data Lake platform serves as the firm's centralized data platform, ingesting 140K (and growing!) batches per day of Datasets of varying shape and size. Powered by Flink and using metadata configured by platform users, ingestion applications are generated dynamically at runtime to extract, transform, and load data into centralized storage where it is then exported to warehousing solutions such as Sybase IQ, Snowflake, and Amazon Redshift. Data Latency is one of many key considerations as producers and consumers have their own commitments to satisfy. Consumers range from people/systems issuing queries, to applications using engines like Spark, Hive, and Presto to transform data into refined Datasets. Apache Iceberg allows our applications to not only benefit from consistency guarantees important when running on eventually consistent storage like S3, but also allows us the opportunity to improve our batch processing patterns with its scalability-focused features.
by
Andreas Hailu
Flink Forward San Francisco 2022.
At Flink Forward, we get to hear creative, unique use cases, often on the bleeding edge of some of the most exciting current technologies. This talk will give you a chance to look under the hood of our driven and innovative open source community. I will cover what our community has been working on this past year, and how this work relates to our (Ververica's) exciting new Flink engineering roadmap! I will also go through some best practices and upcoming opportunities for getting involved in this community!
by
Caito Scherr
Practical learnings from running thousands of Flink jobs
Flink Forward San Francisco 2022.
Task Managers constantly running out of memory? Flink job keeps restarting from cryptic Akka exceptions? Flink job running but doesn't seem to be processing any records? We share practical learnings from running thousands of Flink jobs for different use cases and take a look at common challenges we have experienced, such as out-of-memory errors, timeouts and job stability. We will cover memory tuning, S3 and Akka configurations to address common pitfalls and the approaches that we take on automating health monitoring and management of Flink jobs at scale.
by
Hong Teoh & Usamah Jassat
Extending Flink SQL for stream processing use cases
1. For streaming data, Flink SQL uses STREAMs for append-only queries and CHANGELOGs for upsert queries instead of tables.
2. Stateless queries on streaming data, such as projections and filters, result in new STREAMs or CHANGELOGs.
3. Stateful queries, such as aggregations, produce STREAMs or CHANGELOGs depending on whether they are windowed or not. Join queries between streaming sources also result in STREAM outputs (see the sketch below).
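In today's Flink SQL, this distinction surfaces when converting query results back to streams. A minimal sketch, assuming a recent Flink release (1.14+); the orders table and its datagen source are hypothetical:

import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api.bridge.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
val tableEnv = StreamTableEnvironment.create(env)

// hypothetical source table for illustration
tableEnv.executeSql("""
  CREATE TABLE orders (user_id STRING, amount INT)
  WITH ('connector' = 'datagen')
""")

// 1) Stateless query (projection/filter): the result stays append-only,
//    i.e., a STREAM in the terminology above.
val appendOnly = tableEnv.sqlQuery(
  "SELECT user_id, amount FROM orders WHERE amount > 100")
val inserts = tableEnv.toDataStream(appendOnly) // insert-only rows

// 2) Non-windowed aggregation: result rows are continuously updated,
//    i.e., a CHANGELOG carrying upsert/retract messages.
val updating = tableEnv.sqlQuery(
  "SELECT user_id, SUM(amount) AS total FROM orders GROUP BY user_id")
val changelog = tableEnv.toChangelogStream(updating)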
The top 3 challenges running multi-tenant Flink at scale
Apache Flink is the foundation for Decodable's real-time SaaS data platform. Flink runs critical data processing jobs with strong security requirements. In addition, Decodable has to scale to thousands of tenants, power various use cases, provide an intuitive user experience and maintain cost-efficiency. We've learned a lot of lessons while building and maintaining the platform. In this talk, I'll share the top 3 toughest challenges building and operating this platform with Flink, and how we solved them.
Changelog Stream Processing with Apache Flink
Flink Forward San Francisco 2022.
The world is constantly changing. Data is continuously produced and thus should be consumed in a similar fashion by enterprise systems. Only this enables real-time decisions at scale. Message logs such as Apache Kafka can be found in almost every architecture, while databases and other batch systems still provide the foundation. Change Data Capture (CDC) propagates changes downstream. In this talk, we will highlight what it means to be a general data processor and how Flink can act as an integration hub. We present the current state of Flink and how it can power various use cases on both finite and infinite streams. We demonstrate Flink's SQL engine as a changelog processor that is shipped with an ecosystem tailored to process CDC data and maintain materialized views. We will use Kafka as an upsert log, Debezium for connecting to databases, and enrich streams of various sources. Finally, we will combine Flink's Table API with DataStream API for event-driven applications beyond SQL.
by
Timo Walther
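The "Kafka as an upsert log" pattern mentioned in the abstract can be sketched with the upsert-kafka connector of recent Flink releases; the table, topic, fields, and server address below are hypothetical:

tableEnv.executeSql("""
  CREATE TABLE user_profiles (
    user_id STRING,
    region  STRING,
    PRIMARY KEY (user_id) NOT ENFORCED
  ) WITH (
    'connector' = 'upsert-kafka',
    'topic' = 'user_profiles',
    'properties.bootstrap.servers' = 'localhost:9092',
    'key.format' = 'json',
    'value.format' = 'json'
  )
""")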
Large Scale Real Time Fraudulent Web Behavior Detection
Flink Forward San Francisco 2022.
Neuro-ID analyzes web behavior at a large scale to determine visitors' intent on web pages, specifically in the online lending industry. When users interact with an online loan application, our software analyzes their behavior to determine if the applicant may be potentially fraudulent. Lenders can then request various scores describing the applicant's intentions in real-time to use to make decisions during the application flow. Flink gives our product the ability to observe behavior in a stateful manner. As an applicant interacts with an online loan application, a Flink application is used to compare earlier actions to later actions. This processing in Flink can determine the applicant's intent throughout the process of the application.
by
Jeff Niemann & Randy Hanak
Flink Forward SF 2017: Timo Walther - Table & SQL API – unified APIs for batch and stream processing
1. Table & SQL API – unified APIs for batch and stream processing
Timo Walther, Apache Flink PMC, @twalthr
Flink Forward @ San Francisco – April 11th, 2017
3. DataStream API is great…
Very expressive stream processing
• Transform data, update state, define windows, aggregate, etc.
Highly customizable windowing logic
• Assigners, Triggers, Evictors, Lateness
Asynchronous I/O
• Improve communication to external systems
Low-level Operations
• ProcessFunction gives access to timestamps and timers
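A minimal sketch of the timestamp/timer access mentioned in the last bullet, using a KeyedProcessFunction (added in a later Flink release; the original ProcessFunction exposes the same Context). The inactivity-alert logic is hypothetical and simplified:

import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.util.Collector

class InactivityAlert extends KeyedProcessFunction[String, (String, Double), String] {

  override def processElement(
      value: (String, Double),
      ctx: KeyedProcessFunction[String, (String, Double), String]#Context,
      out: Collector[String]): Unit = {
    // read the record's event timestamp and register a timer 60s later;
    // a real detector would also clear the previous timer for this key
    val ts = ctx.timestamp() // may be null if no timestamps are assigned
    if (ts != null) {
      ctx.timerService().registerEventTimeTimer(ts + 60000L)
    }
  }

  override def onTimer(
      timestamp: Long,
      ctx: KeyedProcessFunction[String, (String, Double), String]#OnTimerContext,
      out: Collector[String]): Unit = {
    out.collect(s"no events for key ${ctx.getCurrentKey} since ${timestamp - 60000L}")
  }
}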
4. … but it is not for Everyone!
Writing DataStream programs is not always easy
• Stream processing technology spreads rapidly
• New streaming concepts (time, state, windows, ...)
Requires knowledge & skill
• Continuous applications have special requirements
• Programming experience (Java / Scala)
Users want to focus on their business logic
5. Why not a Relational API?
Relational API is declarative
• User says what is needed, system decides how to compute it
Queries can be effectively optimized
• Less black-boxes, well-researched field
Queries are efficiently executed
• Let Flink handle state, time, and common mistakes
“Everybody” knows and uses SQL!
6. Goals
Easy, declarative, and concise relational API
Tool for a wide range of use cases
Relational API as a unifying layer
• Queries on batch tables terminate and produce a finite result
• Queries on streaming tables run continuously and produce a result stream
Same syntax & semantics for both queries
8. Table API & SQL
Flink features two relational APIs
• Table API: LINQ-style API for Java & Scala (since Flink 0.9.0)
• SQL: Standard SQL (since Flink 1.1.0)
[Diagram: the Table API and SQL sit on top of the DataSet API and DataStream API, which run on the Flink Dataflow Runtime.]
9. Table API & SQL Example
val tEnv = TableEnvironment.getTableEnvironment(env)
// configure your data source
val customerSource = CsvTableSource.builder()
.path("/path/to/customer_data.csv")
.field("name", Types.STRING).field("prefs", Types.STRING)
.build()
// register as a table
tEnv.registerTableSource("cust", customerSource)
// define your table program with the Table API ...
val table = tEnv.scan("cust").select('name.lowerCase(), myParser('prefs))
// ... or with the equivalent SQL query
val table = tEnv.sql("SELECT LOWER(name), myParser(prefs) FROM cust")
// convert
val ds: DataStream[Customer] = table.toDataStream[Customer]
10. Windowing in Table API
val sensorData: DataStream[(String, Long, Double)] = ???
// convert DataStream into Table
val sensorTable: Table = sensorData
.toTable(tableEnv, 'location, 'rowtime, 'tempF)
// define query on Table
val avgTempCTable: Table = sensorTable
.window(Tumble over 1.day on 'rowtime as 'w)
.groupBy('location, 'w)
.select('w.start as 'day,
        'location,
        (('tempF.avg - 32) * 0.556) as 'avgTempC)
.where('location like "room%")
11. Windowing in SQL
val sensorData: DataStream[(String, Long, Double)] = ???
// register DataStream
tableEnv.registerDataStream(
"sensorData", sensorData, 'location, 'rowtime, 'tempF)
// query registered Table
val avgTempCTable: Table = tableEnv.sql("""
  SELECT TUMBLE_START(rowtime, INTERVAL '1' DAY) AS day,
         location,
         AVG((tempF - 32) * 0.556) AS avgTempC
  FROM sensorData
  WHERE location LIKE 'room%'
  GROUP BY location, TUMBLE(rowtime, INTERVAL '1' DAY)
""")
13. Architecture
[Diagram: Table API queries pass through the Table API Validator; SQL queries pass through Calcite's Parser & Validator. Tables registered from DataSets, DataStreams, or TableSources are kept in the Calcite Catalog. Both APIs are translated into a Calcite Logical Plan, which the Calcite Optimizer rewrites with DataSet Rules or DataStream Rules into a DataSet Plan or a DataStream Plan.]
17. Translation to Logical Plan

sensorTable
  .window(Tumble over 1.day on 'rowtime as 'w)
  .groupBy('location, 'w)
  .select('w.start as 'day,
          'location,
          (('tempF.avg - 32) * 0.556) as 'avgTempC)
  .where('location like "room%")

[Diagram: after Table API validation, the query forms a tree of Table nodes (Catalog Node → Window Aggregate → Project → Filter), which is translated into a Calcite logical plan (LogicalTableScan → LogicalWindowAggregate → LogicalProject → LogicalFilter).]
18. Translation to DataStream Plan
[Diagram: the Calcite logical plan (LogicalTableScan → LogicalWindowAggregate → LogicalProject → LogicalFilter) is optimized into LogicalTableScan → LogicalWindowAggregate → LogicalCalc, which is then transformed into a DataStream plan (DataStreamScan → DataStreamCalc → DataStreamAggregate).]
19. Translation to Flink Program
[Diagram: the DataStream plan (DataStreamScan → DataStreamCalc → DataStreamAggregate) is translated and code-generated into a DataStream program: a (forwarding) scan, a FlatMap function, and an Aggregate & Window function.]
20. Current State (in master)
Batch support
• Selection, Projection, Sort, Inner & Outer Joins, Set operations
• Group-Windows for Slide, Tumble, Session
Streaming support
• Selection, Projection, Union
• Group-Windows for Slide, Tumble, Session
• Different SQL OVER-Windows (RANGE/ROWS)
UDFs, UDTFs, custom rules
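As an illustration of the UDF support listed above, here is a minimal sketch of what the myParser function from the earlier example could look like; the parsing logic is made up:

import org.apache.flink.table.functions.ScalarFunction

// hypothetical scalar UDF: extract the first preference from a CSV string
class MyParser extends ScalarFunction {
  def eval(prefs: String): String =
    prefs.split(",").headOption.getOrElse("")
}

// register it so that Table API and SQL queries can call myParser(...)
tEnv.registerFunction("myParser", new MyParser)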
21. Use Cases for Streaming SQL
Continuous ETL & Data Import
Live Dashboards & Reports
23. Dynamic Tables Model
Dynamic tables change over time
Dynamic tables are treated like static batch tables
• Dynamic tables are queried with standard SQL / Table API
• Every query returns another Dynamic Table
“Stream / Table Duality”
• Stream ←→ Dynamic Table conversions without information loss
25. Querying Dynamic Tables
Dynamic tables change over time
• A[t]: Table A at specific point in time t
Dynamic tables are queried with relational semantics
• Result of a query changes as input table changes
• q(A[t]): Evaluate query q on table A at time t
Query result is continuously updated as t progresses
• Similar to maintaining a materialized view
• t is current event time
28. Querying a Dynamic Table
Can we run any query on Dynamic Tables? No!
State may not grow infinitely as more data arrives
• Set a clean-up timeout or key constraints (see the sketch after this slide)
Input may only trigger partial re-computation
Queries with possibly unbounded state or computation are rejected
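The clean-up timeout mentioned above maps to the idle state retention setting of later Flink releases (1.13+); a one-line sketch:

import java.time.Duration

// per-key state that has not been accessed for 24h may be discarded
tableEnv.getConfig.setIdleStateRetention(Duration.ofHours(24))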
29. Dynamic Table to Stream
Convert Dynamic Table modifications into stream messages
Similar to database logging techniques
• Undo: previous value of a modified element
• Redo: new value of a modified element
• Undo+Redo: old and the new value of a changed element
For Dynamic Tables: Redo or Undo+Redo
30. Dynamic Table to Stream
Undo+Redo Stream (because A is in Append Mode):
31. Dynamic Table to Stream
Redo Stream (because A is in Update Mode):
32. Result computation & refinement
[Timeline figure: first (early) result at (end – x); result updated at a configurable rate (every x); the complete result can be computed at the window end (end) and is emitted at (end + x); late updates on new data; last result at (end + x), after which state is purged.]
33. Contributions welcome!
Huge interest and many contributors
• Adding more window operators
• Introducing dynamic tables
And there is a lot more to do
• New operators and features for streaming and batch
• Performance improvements
• Tooling and integration
Try it out, give feedback, and start contributing!
DATASTREAM: event-time semantics, stateful exactly-once processing, high throughput & low latency at the same time
compute exact and deterministic results in real-time
ASYNC: enrich stream events with data stored in a database; communication delays with the external system should not dominate the streaming application's total work
There is a talent gap.
SKILL: Memory-bound
Handling of timestamps and watermarks in ProcessFunctions
API which quickly solves 80% of their use cases where simple tasks can be defined using little code.
BUSINESS:
No null support, no timestamps, no common tools for string normalization etc.
Users do not specify implementation.
UDF: great for expressiveness, bad for optimization - need for manual tuning
ProcessFunctions implemented in Flink handle state and time.
SQL is the most widely used language for data analytics
SQL would make stream processing much more accessible
Flink is a platform for distributed stream and batch data processing
users only need to learn a single API
a query produces exactly the same result regardless whether its input is static batch data or streaming data
TABLE API:
- Language INtegrated Query (LINQ) API
- Queries are not embedded as Strings
- Centered around Table objects
- Operations are applied on Tables and return a Table
SQL:
- Standard SQL
- Queries are embedded as Strings into programs
- Referenced tables must be registered
- Queries return a Table object
- Integration with Table API
Equivalent feature set (at the moment)
Table API and SQL can be mixed
Both are tightly integrated with Flink’s core API
often referred to as the Table API
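A small sketch of mixing the two APIs (2017-era syntax, where tEnv.sql(...) returns a Table; later releases renamed the method to sqlQuery):

import org.apache.flink.table.api.scala._ // expression DSL ('symbols)

// a SQL query returns a Table, on which Table API operations can be applied
val result = tEnv
  .sql("SELECT name, prefs FROM cust")  // SQL part
  .select('name.upperCase() as 'uname)  // Table API part on the result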
Both work for batch and stream.
Maintain standard SQL compliance
Apache Calcite is a SQL parsing and query optimizer framework
Used by many other projects to parse and optimize SQL queries: Apache Drill, Apache Hive, Apache Kylin, Cascading, …
Tables, columns, and data types are stored in a central catalog
Tables can be created from
DataSets
DataStreams
TableSources (without going through DataSet/DataStream API)
Table API and SQL queries are translated into common logical plan representation.
Logical plans are translated and optimized depending on execution backend.
Plans are transformed into DataSet or DataStream programs.
API calls are translated into logical operators and immediately validated
API operators compose a tree. Before optimization, the API operator tree is translated into a logical Calcite plan.
Calcite provides many optimization rules
Custom rules to transform logical nodes into Flink nodes, i.e., into DataStream or DataSet operators
DataSet rules to translate batch queries
DataStream rules to translate streaming queries
Constant expression reduction
With continuous queries in mind!
Janino Compiler Framework
Arriving data is incrementally aggregated using the given aggregate function. This means that the window function typically has only a single value to process when called.
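A sketch of this incremental aggregation in the DataStream API, reusing the sensorData stream from the slides and assuming timestamps/watermarks are assigned; the average-temperature aggregate is illustrative:

import org.apache.flink.api.common.functions.AggregateFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

// the accumulator is updated per record, so the window holds one
// (sum, count) pair instead of buffering all elements
class AvgTemp extends AggregateFunction[(String, Long, Double), (Double, Long), Double] {
  override def createAccumulator(): (Double, Long) = (0.0, 0L)
  override def add(in: (String, Long, Double), acc: (Double, Long)): (Double, Long) =
    (acc._1 + in._3, acc._2 + 1)
  override def getResult(acc: (Double, Long)): Double = acc._1 / acc._2
  override def merge(a: (Double, Long), b: (Double, Long)): (Double, Long) =
    (a._1 + b._1, a._2 + b._2)
}

val avgPerLocation: DataStream[Double] = sensorData
  .keyBy(_._1) // key by location
  .window(TumblingEventTimeWindows.of(Time.days(1)))
  .aggregate(new AvgTemp)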
RANGE UNBOUNDED PRECEDING
ROWS BETWEEN 5 PRECEDING AND CURRENT ROW
RANGE BETWEEN INTERVAL '1' SECOND PRECEDING AND CURRENT ROW
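A complete OVER-window query combining one of the frames above with the sensorData table from the earlier slides; a sketch in the deck's 2017-era syntax:

val rowAvg = tableEnv.sql("""
  SELECT location,
         AVG(tempF) OVER (
           PARTITION BY location
           ORDER BY rowtime
           ROWS BETWEEN 5 PRECEDING AND CURRENT ROW
         ) AS avgTempF
  FROM sensorData
""")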
BUT:
relational processing of batch is well defined and understood
not all relational operators can be naively applied on streams
no widely accepted definition of the semantics for relational processing of streaming data
all supported operators have in common that they never update result records which have been emitted
not an issue for record-at-a-time operators such as projection and filter
affects operators that collect and process multiple records: emitted results cannot be updated, late events are discarded
emit data to log-style system where emitted data cannot be updated (results cannot be refined)
only append operations, no updates or delete
many streaming analytics use cases that need to update results
no discarding possible
early results needed; can be updated and refined; analyze and explore streaming data in a real-time fashion
vastly increase the scope of the APIs and the range of supported use cases
can be challenging to implement using the DataStream API
It is important to note that this is only the logical model and does not imply how the query is actually executed
RETURNS:
runs continuously and produces a table that is continuously updated -> Dynamic Table
result-updating queries
But: we must of course preserve the unified semantics for stream and batch inputs
we have to specify how the records of a stream modify the dynamic table
APPEND: each stream record is an insert modification
conceptually the dynamic table is ever-growing and infinite in size
UPDATE/REPLACE:
specify a unique key attribute
stream record can represent an insert, update, or delete modification
append mode is in fact a special case of update mode
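In recent Flink releases (1.13+), update mode is explicit: each stream record carries a RowKind, and fromChangelogStream interprets the stream as a dynamic table. A sketch with made-up values, assuming env and tableEnv as in the earlier sketches:

import org.apache.flink.api.common.typeinfo.{TypeInformation, Types}
import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api.bridge.scala._
import org.apache.flink.types.{Row, RowKind}

implicit val rowType: TypeInformation[Row] = Types.ROW(Types.STRING, Types.INT)

// an insert followed by an update of the same key
val changes: DataStream[Row] = env.fromElements(
  Row.ofKind(RowKind.INSERT,        "user1", Int.box(10)),
  Row.ofKind(RowKind.UPDATE_BEFORE, "user1", Int.box(10)),
  Row.ofKind(RowKind.UPDATE_AFTER,  "user1", Int.box(20))
)
val dynamicTable = tableEnv.fromChangelogStream(changes)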
Let’s imagine we take a snapshot of a dynamic table at a specific point in time. Snapshot is like a regular static batch table.
PROGRESS:
If we repeatedly compute the result of a query on snapshots for progressing points in time, we obtain many static result tables. They are changing over time and effectively constitute a dynamic table.
At each point in time t, the result table is equivalent to a batch query on the dynamic table A at time t.
UPDATE MODE:
query continuously updates result rows that it had previously emitted instead of only adding new rows
size of the result table depends on the number of distinct grouping keys of the input table
NOTE:
As long as they are not emitted, these changes are completely internal and not visible to a user. Changes materialize when a dynamic table is emitted.
In contrast to the result of the first example, the resulting table grows over time.
While the non-windowed query (mostly) updates rows of the result table, the windowed aggregation query only appends new rows to the result table.
only those that can be continuously, incrementally, and efficiently computed
Traditional database systems use logs to rebuild tables in case of failures and for replication.
UNDO logs contain the previous value of a modified element to revert incomplete transactions
REDO logs contain the new value of a modified element to redo lost changes of completed transactions
UNDO/REDO logs contain the old and the new value of a changed element
An insert modification is emitted as an insert message with the new row,
a delete modification is emitted as a delete message with the old row,
and an update modification is emitted as a delete message with the old row and an insert message with the new row
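In later Flink releases, this Undo+Redo encoding is exposed as a retract stream: each row is paired with a Boolean flag, true for insert/redo messages and false for delete/undo messages. A minimal sketch, where updatingTable stands for any result-updating query:

import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.scala.DataStream
import org.apache.flink.table.api.bridge.scala._
import org.apache.flink.types.Row

val retractStream: DataStream[(Boolean, Row)] =
  updatingTable.toRetractStream[Row]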
Current processing model is a subset of the dynamic table model.
downstream operators or data sinks need to be able to correctly handle both types of messages
only tables with unique keys can have update and delete modifications.
the downstream operators need to be able to access previous values by key.
val outputOpts = OutputOptions()
.firstResult(-15.minutes) // first result 15 mins early
.completeResult(+5.minutes) // complete result 5 mins late
.updateRate(3.minutes) // result is updated every 3 mins
.lateUpdates(true) // late result updates enabled
.lastResult(+15.minutes) // last result 15 mins late -> until state is purged
More TableSources and Sinks
Generated intermediate data types, serializers, comparators
Standalone SQL client
Code generated aggregate functions