This document provides an overview of Apache Flink internals. It begins with an introduction and a recap of Flink programming concepts, then discusses how Flink programs are compiled into execution plans and executed in a pipelined fashion, as opposed to being executed eagerly like regular code. It outlines Flink's architecture, including the optimizer, the runtime, and data storage integrations, and covers iterative processing, which Flink handles both by unrolling loops and with native iterative datasets.
Flink provides unified batch and stream processing. Through its layered architecture and its treatment of all computations as data streams, it natively supports streaming dataflows, long batch pipelines, machine learning algorithms, and graph analysis. Flink's optimizer selects efficient execution plans, choosing among shipping strategies and join algorithms, and the runtime caches loop-invariant data to speed up iterative algorithms and graph processing.
Apache Flink: Better, Faster & Uncut - Piotr Nowojski, data ArtisansEvention
This talk will start with brief introduction to streaming processing and Flink itself. Next, we will take a look at some of the most interesting recent improvements in Flink such as incremental checkpointing,
end-to-end exactly-once processing guarantee and network latency optimizations. We’ll discuss real problems that Flink’s users were facing and how they were addressed by the community and dataArtisans.
This document provides an overview of Google's Bigtable distributed storage system. It describes Bigtable's data model as a sparse, multidimensional sorted map indexed by row, column, and timestamp. Bigtable stores data across many tablet servers, with a single master server coordinating metadata operations like tablet assignment and load balancing. The master uses Chubby, a distributed lock service, to track which tablet servers are available and reassign tablets if servers become unreachable.
Operating and Supporting Delta Lake in ProductionDatabricks
The document discusses strategies for optimizing and managing metadata in Delta Lake. It provides an overview of optimize, auto-optimize, and optimize write strategies and how to choose the appropriate strategy based on factors like workload, data size, and cluster resources. It also discusses Delta Lake transaction logs, configurations like log retention duration, and tips for working with Delta Lake metadata.
Streaming machine learning is being integrated in Spark 2.1+, but you don’t need to wait. Holden Karau and Seth Hendrickson demonstrate how to do streaming machine learning using Spark’s new Structured Streaming and walk you through creating your own streaming model. By the end of this session, you’ll have a better understanding of Spark’s Structured Streaming API as well as how machine learning works in Spark.
Video: https://www.youtube.com/watch?v=kkOG_aJ9KjQ
This talk gives details about Spark internals and an explanation of the runtime behavior of a Spark application. It explains how high level user programs are compiled into physical execution plans in Spark. It then reviews common performance bottlenecks encountered by Spark users, along with tips for diagnosing performance problems in a production application.
Created at the University of Berkeley in California, Apache Spark combines a distributed computing system through computer clusters with a simple and elegant way of writing programs. Spark is considered the first open source software that makes distribution programming really accessible to data scientists. Here you can find an introduction and basic concepts.
This document summarizes an Apache Spark workshop that took place in September 2017 in Stockholm. It introduces the speaker's background and experience with Spark. It then provides an overview of the Spark ecosystem and core concepts like RDDs, DataFrames, and Spark Streaming. Finally, it discusses important Spark concepts like caching, checkpointing, broadcasting, and resilience.
Hadoop and HBase experiences in perf log projectMao Geng
This document discusses experiences using Hadoop and HBase in the Perf-Log project. It provides an overview of the Perf-Log data format and architecture, describes how Hadoop and HBase were configured, and gives examples of using MapReduce jobs and HBase APIs like Put and Scan to analyze log data. Key aspects covered include matching Hadoop and HBase versions, running MapReduce jobs, using column families in HBase, and filtering Scan results.
This talk gives details about Spark internals and an explanation of the runtime behavior of a Spark application. It explains how high level user programs are compiled into physical execution plans in Spark. It then reviews common performance bottlenecks encountered by Spark users, along with tips for diagnosing performance problems in a production application.
Unified Big Data Processing with Apache SparkC4Media
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1yNuLGF.
Matei Zaharia talks about the latest developments in Spark and shows examples of how it can combine processing algorithms to build rich data pipelines in just a few lines of code. Filmed at qconsf.com.
Matei Zaharia is an assistant professor of computer science at MIT, and CTO of Databricks, the company commercializing Apache Spark.
Ufuc Celebi – Stream & Batch Processing in one SystemFlink Forward
The document describes the architecture and execution model of Apache Flink. Flink uses a distributed dataflow model where a job is represented as a directed acyclic graph of operators. The client submits this graph to the JobManager, which schedules tasks across TaskManagers. Tasks communicate asynchronously through data channels to process bounded and unbounded data in a pipelined fashion.
As more workloads move to severless-like environments, the importance of properly handling downscaling increases. While recomputing the entire RDD makes sense for dealing with machine failure, if your nodes are more being removed frequently, you can end up in a seemingly loop-like scenario, where you scale down and need to recompute the expensive part of your computation, scale back up, and then need to scale back down again.
Even if you aren’t in a serverless-like environment, preemptable or spot instances can encounter similar issues with large decreases in workers, potentially triggering large recomputes. In this talk, we explore approaches for improving the scale-down experience on open source cluster managers, such as Yarn and Kubernetes-everything from how to schedule jobs to location of blocks and their impact (shuffle and otherwise).
Strata Singapore: GearpumpReal time DAG-Processing with Akka at ScaleSean Zhong
Gearpump is a Akka based realtime streaming engine, it use Actor to model everything. It has super performance and flexibility. It has performance of 18000000 messages/second and latency of 8ms on a cluster of 4 machines.
H2O Design and Infrastructure with Matt DowleSri Ambati
This document provides an overview of H2O, an open source machine learning platform that allows for distributed, in-memory analytics of large datasets. It discusses how H2O works, including how it uses a map-reduce style to parallelize machine learning algorithms across multiple nodes. The document demonstrates starting an 8-node H2O cluster on Amazon EC2 and importing a 23GB dataset in under a minute, significantly faster than with other tools. It also summarizes how H2O's distributed fork-join framework executes tasks across nodes and shares data through its distributed data structures.
At LinkedIn we run lots of Java services on Linux boxes. Java and Linux are a perfect pair. Except when they're not; then there's fireworks. This talk describes 5 situations we encountered where Java interacted with normal Linux behavior to create stunningly sub-optimal application behavior like minutes-long GC pauses. We'll deep dive to show What Java Got Wrong, why Linux behaves the way it does, and how the two can conspire to ruin your day. Finally we'll examine actual code samples showing how we fixed or hid the problems.
Remember the last time you tried to write a MapReduce job (obviously something non trivial than a word count)? It sure did the work, but has lot of pain points from getting an idea to implement it in terms of map reduce. Did you wonder how life will be much simple if you had to code like doing collection operations and hence being transparent* to its distributed nature? Did you want/hope for more performant/low latency jobs? Well, seems like you are in luck.
In this talk, we will be covering a different way to do MapReduce kind of operations without being just limited to map and reduce, yes, we will be talking about Apache Spark. We will compare and contrast Spark programming model with Map Reduce. We will see where it shines, and why to use it, how to use it. We’ll be covering aspects like testability, maintainability, conciseness of the code, and some features like iterative processing, optional in-memory caching and others. We will see how Spark, being just a cluster computing engine, abstracts the underlying distributed storage, and cluster management aspects, giving us a uniform interface to consume/process/query the data. We will explore the basic abstraction of RDD which gives us so many awesome features making Apache Spark a very good choice for your big data applications. We will see this through some non trivial code examples.
Session at the IndicThreads.com Confence held in Pune, India on 27-28 Feb 2015
http://www.indicthreads.com
http://pune15.indicthreads.com
Advanced Data Science on Spark-(Reza Zadeh, Stanford)Spark Summit
The document provides an overview of Spark and its machine learning library MLlib. It discusses how Spark uses resilient distributed datasets (RDDs) to perform distributed computing tasks across clusters in a fault-tolerant manner. It summarizes the key capabilities of MLlib, including its support for common machine learning algorithms and how MLlib can be used together with other Spark components like Spark Streaming, GraphX, and SQL. The document also briefly discusses future directions for MLlib, such as tighter integration with DataFrames and new optimization methods.
Debunking Common Myths in Stream ProcessingKostas Tzoumas
This document discusses stream processing with Apache Flink. It begins by defining streaming as the continuous processing of never-ending data streams. It then debunks four common myths about stream processing: 1) that there is always a throughput/latency tradeoff, showing that Flink can achieve high throughput and low latency; 2) that exactly-once processing is not possible, but Flink provides exactly-once state guarantees with checkpoints; 3) that streaming is only for real-time applications, whereas it can also be used for historical data; and 4) that streaming is too hard, whereas most data problems are actually streaming problems. The document concludes by discussing Flink's community and examples of companies using Flink in production.
Debunking Six Common Myths in Stream ProcessingKostas Tzoumas
This document discusses common myths about stream processing and Apache Flink. It debunks the myths that streaming requires batch processing (Lambda architecture), that there is always a throughput/latency tradeoff, that exactly-once processing is not possible, that streaming is only for real-time applications, that batching is different than buffering, and that streaming is inherently difficult. It also provides an overview of Apache Flink's features and the state of its development.
This talk is an application-driven walkthrough to modern stream processing, exemplified by Apache Flink, and how this enables new applications and makes old applications easier and more efficient. In this talk, we will walk through several real-world stream processing application scenarios of Apache Flink, highlighting unique features in Flink that make these applications possible. In particular, we will see (1) how support for handling out of order streams enables real-time monitoring of cloud infrastructure, (2) how the ability handle high-volume data streams with low latency SLAs enables real-time alerts in network equipment, (3) how the combination of high throughput and the ability to handle batch as a special case of streaming enables an architecture where the same exact program is used for real-time and historical data processing, and (4) how stateful stream processing can enable an architecture that eliminates the need for an external database store, leading to more than 100x performance speedup, among many other benefits.
This document discusses continuous counting on data streams using Apache Flink. It begins by introducing streaming data and how counting is an important but challenging problem. It then discusses issues with batch-oriented and lambda architectures for counting. The document presents Flink's streaming architecture and DataStream API as solutions. It discusses requirements for low-latency, high-efficiency counting on streams, as well as fault tolerance, accuracy, and queryability. Benchmark results show Flink achieving sub-second latencies and high throughput. The document closes by overviewing upcoming features in Flink like SQL and dynamic scaling.
This document introduces Apache Flink, an open-source stream processing framework. It discusses how Flink can be used for both streaming and batch data processing using common APIs. It also summarizes Flink's features like exactly-once stream processing, iterative algorithms, and libraries for machine learning, graphs, and SQL-like queries. The document promotes Flink as a high-performance stream processor that is easy to use and integrates streaming and batch workflows.
This document provides an overview of Apache Flink, an open-source stream processing framework. It discusses the rise of stream processing and how Flink enables low-latency applications through features like pipelining, operator state, fault tolerance using distributed snapshots, and integration with batch processing. The document also outlines Flink's roadmap, which includes graduating its DataStream API, fully managing windowing and state, and unifying batch and stream processing.
2. Welcome
§ Last talk: how to program PageRank in Flink, and Flink programming model
§ This talk: how Flink works internally
§ Again, a big bravo to the Flink community
4. DataSet and transformations
[Diagram: Input → Operator X → First → Operator Y → Second]

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<String> input = env.readTextFile(inputPath); // inputPath: path to the input file
DataSet<String> first = input
    .filter(str -> str.contains("Apache Flink"));
DataSet<String> second = first
    .filter(str -> str.length() > 40);
second.print();
env.execute();
6. Other API elements & tools
§ Accumulators and counters
• Int, Long, Double counters
• Histogram accumulator
• Define your own
§ Broadcast variables
§ Plan visualization
§ Local debugging/testing mode
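As a hedged sketch of how these tools fit together (not from the original slides; the names "num-lines" and "min-length" are illustrative), an Int counter and a broadcast variable in the Java DataSet API look roughly like this:

import java.util.List;
import org.apache.flink.api.common.JobExecutionResult;
import org.apache.flink.api.common.accumulators.IntCounter;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.io.DiscardingOutputFormat;
import org.apache.flink.configuration.Configuration;

public class AccumulatorAndBroadcast {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<String> lines = env.fromElements("O Romeo", "wherefore", "art thou");
        DataSet<Integer> thresholdDs = env.fromElements(5);       // dataset to broadcast

        DataSet<Integer> lengths = lines.map(new RichMapFunction<String, Integer>() {
            private final IntCounter numLines = new IntCounter(); // an Int counter
            private int threshold;

            @Override
            public void open(Configuration parameters) {
                getRuntimeContext().addAccumulator("num-lines", numLines);
                List<Integer> bc = getRuntimeContext().getBroadcastVariable("min-length");
                threshold = bc.get(0);  // broadcast variables arrive as a List on every task
            }

            @Override
            public Integer map(String line) {
                numLines.add(1);        // merged across all parallel instances
                return Math.max(line.length(), threshold);
            }
        }).withBroadcastSet(thresholdDs, "min-length");

        lengths.output(new DiscardingOutputFormat<Integer>());
        JobExecutionResult result = env.execute("accumulator sketch");
        System.out.println("lines seen: " + result.getAccumulatorResult("num-lines"));
    }
}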
7. Data types and grouping
public static class Access {
    public int userId;
    public String url;
    ...
}

public static class User {
    public int userId;
    public int region;
    public Date customerSince;
    ...
}

DataSet<Tuple2<Access, User>> campaign = access.join(users)
    .where("userId").equalTo("userId");

DataSet<Tuple3<Integer, String, String>> someLog;
someLog.groupBy(0, 1).reduceGroup(...);

§ Bean-style Java classes & field names
§ Tuples and position addressing
§ Any data type with key selector function (see the sketch below)
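As a hedged illustration of the last bullet (reusing the Access and User beans above; not from the original slides), the same join can be keyed by KeySelector functions, which works for any data type:

import org.apache.flink.api.java.functions.KeySelector;

DataSet<Tuple2<Access, User>> campaign = access.join(users)
    .where(new KeySelector<Access, Integer>() {
        @Override
        public Integer getKey(Access a) { return a.userId; }  // key of the first input
    })
    .equalTo(new KeySelector<User, Integer>() {
        @Override
        public Integer getKey(User u) { return u.userId; }    // key of the second input
    });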
8. Other API elements
§ Hadoop compatibility
• Supports all Hadoop data types, input/output formats, Hadoop mappers and reducers
§ Data streaming API
• DataStream instead of DataSet
• Similar set of operators
• Currently in alpha but moving very fast
§ Scala and Java APIs (mirrored)
§ Graph API (Spargel)
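As a small sketch of how the streaming API mirrors the batch API (not from the original slides; host and port are placeholders):

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment senv = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<String> text = senv.socketTextStream("localhost", 9999);  // placeholder source
text.filter(str -> str.contains("Apache Flink"))  // same operator style as the DataSet API
    .print();
senv.execute("streaming grep");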
10. Example: word count

DataSet<String> text = env.readTextFile(input);

DataSet<Tuple2<String, Integer>> result = text
    .flatMap((str, out) -> {
        for (String token : str.split("\\W")) {
            out.collect(new Tuple2<>(token, 1));
        }
    })
    .groupBy(0)
    .aggregate(SUM, 1);

[Diagram: the Flink client & optimizer submit the program to the Job Manager, which deploys it to the Task Managers. Sample input “O Romeo, Romeo, wherefore art thou Romeo?” yields counts such as (O, 1), (Romeo, 3), (wherefore, 1), (art, 1), (thou, 1); “Nor arm, nor face, nor any other part” yields (nor, 3), (arm, 1), (face, 1), (any, 1), (other, 1), (part, 1).]
11. If there is one thing to know about Flink, it is that you don’t need to know the internals of Flink.
12. Philosophy
§ Flink “hides” its internal workings from the user
§ This is good
• User does not worry about how jobs are executed
• Internals can be changed without breaking changes
§ … and bad
• Execution model more complicated to explain compared to MapReduce or Spark RDD
13. Recap: DataSet
[Diagram: Input → Operator X → First → Operator Y → Second]

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<String> input = env.readTextFile(inputPath); // inputPath: path to the input file
DataSet<String> first = input
    .filter(str -> str.contains("Apache Flink"));
DataSet<String> second = first
    .filter(str -> str.length() > 40);
second.print();
env.execute();
14. Common misconception
[Diagram: Input → X → First → Y → Second]
§ Programs are not executed eagerly
§ Instead, system compiles program to an execution plan and executes that plan
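To see the laziness for yourself, a sketch (not from the slides; paths are placeholders). `env.getExecutionPlan()` returns the compiled plan as JSON; note that in some Flink versions it should be called instead of, not before, `execute()` in the same run:

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<String> data = env.readTextFile("/tmp/input");             // nothing is read yet
DataSet<String> filtered = data.filter(str -> str.length() > 40);  // nothing runs yet
filtered.writeAsText("/tmp/output");                               // still only plan building

// Either inspect the compiled plan ...
// System.out.println(env.getExecutionPlan());
// ... or trigger execution of the whole plan:
env.execute("lazy example");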
15. DataSet<String>
§ Think of it as a PCollection<String>, or a
Spark RDD[String]
§ With a major difference: it can be produced/
recovered in several ways
• … like a Java collection
• … like an RDD
• … perhaps it is never fully materialized (because
the program does not need it to)
• … implicitly updated in an iteration
§ And this is transparent to the user
15
16. Example: grep
[Diagram: the log “Romeo, Romeo, where art thou Romeo?” is loaded once (“Load Log”) and searched for three terms in parallel (“Search for str1/str2/str3”), each search feeding a grep operator (“Grep 1/2/3”).]
17. Staged (batch) execution
[Diagram: staged (batch) execution of the grep job. Stage 1 creates and caches the log, in memory and on disk if needed; each subsequent stage greps the cached log for one search term.]
18. Pipelined execution
[Diagram: the same grep job executed as a single pipelined stage. Stage 1 deploys and starts all operators together, data transfer happens in memory (and on disk if needed), and the Log DataSet is never “created”!]
19. Benefits of pipelining
§ 25 node cluster
§ Grep log for 3 terms
§ Scale data size from 100GB to 1TB
[Chart: time to complete grep (sec) versus data size (GB), from 0 to 1000 GB, for pipelined execution with Flink; the curve continues past the point where cluster memory is exceeded.]
21. Drawbacks of pipelining
§ Long pipelines may be active at the same time, leading to memory fragmentation
• FLINK-1101: Changes memory allocation from static to adaptive
§ Fault-tolerance harder to get right
• FLINK-986: Adds intermediate data sets (similar to RDDs) as first-class citizens to the Flink runtime. Will lead to fine-grained fault-tolerance among other features.
23. Iterate by unrolling
[Diagram: the client drives a sequence of step jobs: Step → Step → Step → Step → Step.]
§ A for/while loop in the client submits one job per iteration step (sketched below)
§ Data reuse by caching in memory and/or disk
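A hedged sketch of the client-driven pattern (update(), Page, initialPages, and edges are hypothetical; each collect() submits and runs one job):

List<Page> current = initialPages.collect();            // one job to materialize the start point
for (int i = 0; i < maxIterations; i++) {
    DataSet<Page> step = update(env.fromCollection(current), edges);
    current = step.collect();                           // one submitted job per iteration step
}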
24. Iterate natively
[Diagram: a native iteration operator. The initial solution enters the loop as the partial solution; the step function combines it with other datasets and replaces the partial solution each round, producing the iteration result when the loop terminates.]

DataSet<Page> pages = ...
DataSet<Neighborhood> edges = ...

IterativeDataSet<Page> pagesIter = pages.iterate(maxIterations);
DataSet<Page> newRanks = update(pagesIter, edges);
DataSet<Page> result = pagesIter.closeWith(newRanks);
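As a tiny self-contained example of the same construct (not from the slides): ten increments of a counter, all inside a single job.

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.operators.IterativeDataSet;

ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
IterativeDataSet<Integer> loop = env.fromElements(0).iterate(10);  // at most 10 rounds
DataSet<Integer> next = loop.map(new MapFunction<Integer, Integer>() {
    @Override
    public Integer map(Integer i) { return i + 1; }     // the step function
});
DataSet<Integer> result = loop.closeWith(next);         // feed next back as the partial solution
result.print();  // 10 — in recent Flink versions print() itself triggers execution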
25. Iterate natively with deltas
[Diagram: a delta iteration maintains a solution set and a workset. The initial workset and initial solution enter the loop; each round, the step function computes a delta set and a new workset from the current workset, the partial solution, and other datasets. The deltas are merged into the solution set (“Replace”), which becomes the iteration result.]

DeltaIteration<...> pagesIter = pages.iterateDelta(initialDeltas, maxIterations, 0);
DataSet<...> newRanks = update(pagesIter, edges);
DataSet<...> deltas = ...
DataSet<...> result = pagesIter.closeWith(newRanks, deltas);

See http://data-artisans.com/data-analysis-with-flink.html
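Filled in with concrete types, the skeleton looks roughly like this (a sketch in the spirit of connected components, not from the slides; update(), verticesWithComponents, initialWorkset, and edges are hypothetical):

import org.apache.flink.api.common.functions.FlatJoinFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.operators.DeltaIteration;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

// Solution set and workset are both Tuple2<vertexId, componentId>; the key is field 0.
DeltaIteration<Tuple2<Long, Long>, Tuple2<Long, Long>> iteration =
    verticesWithComponents.iterateDelta(initialWorkset, maxIterations, 0);

DataSet<Tuple2<Long, Long>> candidates = update(iteration.getWorkset(), edges);

DataSet<Tuple2<Long, Long>> deltas = candidates
    .join(iteration.getSolutionSet()).where(0).equalTo(0)
    .with(new FlatJoinFunction<Tuple2<Long, Long>, Tuple2<Long, Long>, Tuple2<Long, Long>>() {
        @Override
        public void join(Tuple2<Long, Long> candidate, Tuple2<Long, Long> current,
                         Collector<Tuple2<Long, Long>> out) {
            if (candidate.f1 < current.f1) {
                out.collect(candidate);   // emit only elements that actually changed
            }
        }
    });

// Deltas are merged into the solution set and also become the next workset:
DataSet<Tuple2<Long, Long>> result = iteration.closeWith(deltas, deltas);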
29. The growing Flink stack
[Diagram: the Flink stack. APIs: Java API (batch), Scala API (batch), Java API (streaming), Python API (upcoming), Graph API, and Apache MRQL, sitting on the Common API, the Flink Optimizer, and the Flink Stream Builder. Underneath is the Flink Local Runtime, which runs in an embedded environment (Java collections), a local environment (for debugging), a remote environment (regular cluster execution), or on Apache Tez; deployment is single-node execution or a standalone or YARN cluster. Data storage: files, HDFS, S3, JDBC, Redis, RabbitMQ, Kafka, Azure tables, …]
30. Stack without Flink Streaming
[Diagram: the same stack with the streaming components removed, focusing on regular (batch) processing: the Scala and Java APIs over the Common API and Flink Optimizer, the Flink Local Runtime, and the same execution environments and data storage options.]
31. Program lifecycle
[Diagram: the stack annotated with the numbered steps (1–5) of a program's lifecycle, from writing the program against an API, through the Common API and the optimizer, to execution by the runtime.]

val source1 = …
val source2 = …
val maxed = source1
  .map(v => (v._1, v._2, math.max(v._1, v._2)))
val filtered = source2
  .filter(v => v._1 > 4)
val result = maxed
  .join(filtered).where(0).equalTo(0)
  .filter(_._1 > 3)
  .groupBy(0)
  .reduceGroup {……}
32. Flink Optimizer
[Diagram: the Flink Optimizer highlighted within the stack.]
§ The optimizer is the component that selects an execution plan for a Common API program
§ Think of an AI system manipulating your program for you ☺
§ But don’t be scared – it works
• Relational databases have been doing this for decades – Flink ports the technology to API-based systems
34. Two execution plans
[Diagram: two physical execution plans for the same program over orders.tbl and lineitem.tbl. Both plans filter and map the inputs and join them with a hybrid hash join (building the hash table on one side, probing with the other), followed by a sort-based group-reduce with a combiner. One plan broadcasts the build side and forwards the other input; the other hash-partitions both inputs on the join key [0]. The best plan depends on the relative sizes of the input files.]
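The choice between such plans is normally left to the optimizer, but the DataSet API also accepts hints. A hedged sketch (the JoinHint values are part of Flink's public API; the datasets are placeholders):

import org.apache.flink.api.common.operators.base.JoinOperatorBase.JoinHint;

// Hint that the first input is small enough to broadcast as the hash table build side:
orders.join(lineitems, JoinHint.BROADCAST_HASH_FIRST)
      .where(0).equalTo(0);

// Or ask for both inputs to be repartitioned on the join key:
orders.join(lineitems, JoinHint.REPARTITION_HASH_FIRST)
      .where(0).equalTo(0);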
35. Flink Local Runtime
[Diagram: the Flink Local Runtime highlighted within the stack.]
§ The local runtime, not the distributed execution engine
§ Aka: what happens inside every parallel task
36. Flink runtime operators
§ Sorting and hashing data
• Necessary for grouping, aggregation, reduce, join, cogroup, delta iterations
§ Flink contains tailored implementations of hybrid hashing and external sorting in Java
• Scale well with both abundant and restricted memory sizes
37. Internal data representation
How is intermediate data internally represented?
[Diagram: a map task serializes records such as “O Romeo, Romeo, wherefore art thou Romeo?” into raw bytes on its JVM heap; the bytes go through a network transfer to the reduce task, which sorts them locally, still in serialized form, into (art, 1), (O, 1), (Romeo, 1), (Romeo, 1).]
38. Internal data representation
§ Two options: Java objects or raw bytes
§ Java objects
• Easier to program
• Can suffer from GC overhead
• Hard to de-stage data to disk, may suffer from “out of memory” exceptions
§ Raw bytes
• Harder to program (custom serialization stack, more involved runtime operators)
• Solves most memory and GC problems
• Overhead from object (de)serialization
§ Flink follows the raw byte approach
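To illustrate why working on raw bytes helps (a generic Java sketch, not Flink's actual implementation): fixed-length records laid out back to back in one buffer can be compared, and therefore sorted, without ever deserializing them into objects.

import java.nio.ByteBuffer;

// Four 16-byte records: (long key, double value) stored contiguously.
final int RECORD = 16;
ByteBuffer buf = ByteBuffer.allocate(4 * RECORD);
for (long k : new long[] {42L, 7L, 99L, 7L}) {
    buf.putLong(k).putDouble(k * 0.5);
}

// Compare records i and j by key directly on the binary data; no objects, no GC pressure:
int i = 0, j = 1;
int cmp = Long.compare(buf.getLong(i * RECORD), buf.getLong(j * RECORD));
System.out.println(cmp > 0 ? "record 0 sorts after record 1" : "record 0 sorts first");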
39. Memory in Flink
public class WC {
    public String word;
    public int count;
}

[Diagram: the JVM heap is divided into the unmanaged heap holding user code objects, the managed heap holding a pool of memory pages (including empty pages) used for sorting, hashing, and caching, and the network buffers used for shuffling and broadcasts.]
40. Memory in Flink (2)
§ Internal memory management
• Flink initially allocates 70% of the free heap as byte[] segments
• Internal operators allocate() and release() these segments
§ Flink has its own serialization stack
• All accepted data types serialized to data segments
§ Easy to reason about memory, (almost) no OutOfMemory errors, reduces the pressure on the GC (smooth performance)
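A toy sketch of the allocate()/release() pattern described above (a hypothetical class, not Flink's actual MemorySegment machinery): pages are allocated once and recycled, so the GC never sees short-lived per-record objects.

import java.util.ArrayDeque;

class PagePool {
    private final ArrayDeque<byte[]> free = new ArrayDeque<>();

    PagePool(int numPages, int pageSize) {
        for (int i = 0; i < numPages; i++) {
            free.push(new byte[pageSize]);  // carve out the managed memory once, up front
        }
    }

    byte[] allocate() {
        if (free.isEmpty()) {
            // a real operator would start spilling to disk here instead of failing
            throw new IllegalStateException("managed memory exhausted");
        }
        return free.pop();                  // operators borrow pages ...
    }

    void release(byte[] page) {
        free.push(page);                    // ... and return them when done
    }
}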
41. Operating on serialized data
Microbenchmark
§ Sorting 1GB worth of (long, double) tuples
§ 67,108,864 elements
§ Simple quicksort
42. Flink distributed execution
[Diagram: the distributed execution layer highlighted within the stack.]
§ Pipelined
• Same engine for Flink and Flink streaming
§ Pluggable
• The local runtime can be executed on other engines
• E.g., Java collections and Apache Tez
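This pluggability is visible in the API through the different ExecutionEnvironment factories (a sketch; host, port, and jar path are placeholders):

// Local execution inside the current JVM (for debugging):
ExecutionEnvironment local = ExecutionEnvironment.createLocalEnvironment();

// Regular cluster execution against a remote JobManager:
ExecutionEnvironment remote =
    ExecutionEnvironment.createRemoteEnvironment("jobmanager-host", 6123, "/path/to/program.jar");

// getExecutionEnvironment() picks the right environment for wherever the program runs:
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();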
44. Summary
§ Flink decouples API from execution
• The same program can be executed in many different ways
• Hopefully users do not need to care about this and still get very good performance
§ Unique Flink internal features
• Pipelined execution, native iterations, optimizer, serialized data manipulation, good disk destaging
§ Very good performance
• Known issues are currently being worked on actively