This document provides instructions for deploying an Apache Flink cluster on Docker and Docker Compose. It describes setting up the necessary tools like VirtualBox and Ubuntu, installing Docker and Flink, building Docker images from the Flink source code, and running Flink containers locally. It then explains how to push the images to IBM Bluemix and run the Flink cluster within Bluemix containers, including creating the JobManager and TaskManager containers through the Bluemix CLI.
Flink SQL: The Challenges to Build a Streaming SQL Engine | HostedbyConfluent
Flink SQL is Apache Flink's streaming SQL engine that supports data movement, data warehousing, and event-driven scenarios. There are four main challenges in building a streaming SQL engine: late data with unbounded operators, retraction amplification in complex query graphs, maintaining event ordering across distributed systems, and dealing with nondeterminism from functions like random and timestamps. The document discusses how Flink SQL addresses these challenges and the state and storage solutions in Flink, including using local state, disaggregated state in external storage, and the Apache Paimon lake storage format, which can improve performance by 10x.
From cache to in-memory data grid. Introduction to Hazelcast. | Taras Matyashovsky
This presentation:
* covers basics of caching and popular cache types
* explains evolution from simple cache to distributed, and from distributed to IMDG
* does not describe the use of NoSQL solutions for caching
* is not intended for product comparison or for promotion of Hazelcast as the best solution
Kafka is an open source messaging system that can handle massive streams of data in real-time. It is fast, scalable, durable, and fault-tolerant. Kafka is commonly used for stream processing, website activity tracking, metrics collection, and log aggregation. It supports high throughput, reliable delivery, and horizontal scalability. Some examples of real-time use cases for Kafka include website monitoring, network monitoring, fraud detection, and IoT applications.
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013 | mumrah
Apache Kafka is a distributed publish-subscribe messaging system that allows both publishing and subscribing to streams of records. It uses a distributed commit log that provides low latency and high throughput for handling real-time data feeds. Key features include persistence, replication, partitioning, and clustering.
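The partitioned commit log idea can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions: the class `PartitionedLog` and its methods are hypothetical names invented for illustration, they are not Kafka's API, and persistence and replication are left out entirely.

```python
import zlib
from typing import List, Tuple


class PartitionedLog:
    """Toy model of a topic split into append-only partitions (not Kafka's API)."""

    def __init__(self, num_partitions: int) -> None:
        self.partitions: List[List[Tuple[str, str]]] = [
            [] for _ in range(num_partitions)
        ]

    def publish(self, key: str, value: str) -> Tuple[int, int]:
        """Append a record and return (partition, offset).

        Records with the same key always map to the same partition,
        so their relative order is preserved.
        """
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append((key, value))
        return partition, len(self.partitions[partition]) - 1

    def read(self, partition: int, offset: int) -> List[Tuple[str, str]]:
        """Return every record in a partition from the given offset onward."""
        return self.partitions[partition][offset:]
```

A subscriber in this model simply remembers the last offset it read per partition and calls `read` again from there, which is what makes re-reading old data cheap in a log-based design.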
Evening out the uneven: dealing with skew in Flink | Flink Forward
Flink Forward San Francisco 2022.
When running Flink jobs, skew is a common problem that results in wasted resources and limited scalability. In the past years, we have helped our customers and users solve various skew-related issues in their Flink jobs or clusters. In this talk, we will present the different types of skew that users often run into: data skew, key skew, event time skew, state skew, and scheduling skew, and discuss solutions for each of them. We hope this will serve as a guideline to help you reduce skew in your Flink environment.
by
Jun Qin & Karl Friedrich
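One widely used remedy for key skew is two-stage aggregation with salted keys. The abstract does not say which solutions the talk covers, so the following is a generic sketch rather than the speakers' method; `salted_count` and its parameters are hypothetical names.

```python
import random
from collections import Counter


def salted_count(keys, num_salts, rng=None):
    """Count keys in two stages so that one hot key is spread over several buckets.

    Stage 1 counts per (key, salt) pair, which a distributed engine could
    process in parallel. Stage 2 merges the partial counts per original key.
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed keeps the sketch reproducible

    partial = Counter()
    for key in keys:
        partial[(key, rng.randrange(num_salts))] += 1

    final = Counter()
    for (key, _salt), count in partial.items():
        final[key] += count
    return partial, final
```

The final counts are identical to a plain count; only the intermediate work is distributed more evenly, at the cost of a second aggregation step.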
Redis is an in-memory key-value store that is often used as a database, cache, and message broker. It supports various data structures like strings, hashes, lists, sets, and sorted sets. While data is stored in memory for fast access, Redis can also persist data to disk. It is widely used by companies like GitHub, Craigslist, and Engine Yard to power applications with high performance needs.
Apache Iceberg: An Architectural Look Under the Covers | ScyllaDB
Data lakes have been built with a desire to democratize data: to allow more and more people, tools, and applications to make use of data. A key capability needed to achieve this is hiding the complexity of underlying data structures and physical data storage from users. The de facto standard, the Hive table format, addresses some of these problems but falls short at data, user, and application scale. So what is the answer? Apache Iceberg.
Apache Iceberg table format is now in use and contributed to by many leading tech companies like Netflix, Apple, Airbnb, LinkedIn, Dremio, Expedia, and AWS.
Watch Alex Merced, Developer Advocate at Dremio, as he describes the open architecture and performance-oriented capabilities of Apache Iceberg.
You will learn:
• The issues that arise when using the Hive table format at scale, and why we need a new table format
• How a straightforward, elegant change in table format structure has enormous positive effects
• The underlying architecture of an Apache Iceberg table, how a query against an Iceberg table works, and how the table’s underlying structure changes as CRUD operations are done on it
• The resulting benefits of this architectural design
Using the New Apache Flink Kubernetes Operator in a Production Deployment | Flink Forward
Flink Forward San Francisco 2022.
Running natively on Kubernetes, the new Apache Flink Kubernetes Operator is a great way to deploy and manage Flink application and session deployments. In this presentation, we provide:
- A brief overview of Kubernetes operators and their benefits.
- An introduction to the five levels of the operator maturity model.
- An introduction to the newly released Apache Flink Kubernetes Operator and FlinkDeployment CRs.
- Dockerfile modifications you can make to swap out UBI images and Java of the underlying Flink Operator container.
- Enhancements we're making in:
  - Versioning/Upgradeability/Stability
  - Security
- A demo of the Apache Flink Operator in action, with a technical preview of an upcoming product using the Flink Kubernetes Operator.
- Lessons learned
- Q&A
by
James Busche & Ted Chang
HBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation Buffers | Cloudera, Inc.
Todd Lipcon presents a solution to avoid full garbage collections (GCs) in HBase by using MemStore-Local Allocation Buffers (MSLABs). The document outlines that write operations in HBase can cause fragmentation in the old generation heap, leading to long GC pauses. MSLABs address this by allocating each MemStore's data into contiguous 2MB chunks, eliminating fragmentation. When MemStores flush, the freed chunks are large and contiguous. With MSLABs enabled, the author saw basically zero full GCs during load testing. MSLABs improve performance and stability by preventing GC pauses caused by fragmentation.
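The chunk-based allocation described above can be sketched as a simple bump allocator. This is a simplified model, not HBase's implementation: `ChunkAllocator` and its methods are hypothetical names, and rejecting values larger than a chunk is an assumption made to keep the example short.

```python
CHUNK_SIZE = 2 * 1024 * 1024  # 2 MB, the chunk size quoted in the summary


class ChunkAllocator:
    """Copies small values into large contiguous chunks (illustrative model only)."""

    def __init__(self, chunk_size=CHUNK_SIZE):
        self.chunk_size = chunk_size
        self.chunks = [bytearray(chunk_size)]
        self.offset = 0  # next free position in the newest chunk

    def allocate(self, data):
        """Copy data into the current chunk and return (chunk_index, start, length)."""
        if len(data) > self.chunk_size:
            # Assumption for this sketch: oversized values are simply rejected.
            raise ValueError("value larger than a chunk")
        if self.offset + len(data) > self.chunk_size:
            # Current chunk is full: start a fresh one instead of fragmenting.
            self.chunks.append(bytearray(self.chunk_size))
            self.offset = 0
        index, start = len(self.chunks) - 1, self.offset
        self.chunks[index][start:start + len(data)] = data
        self.offset += len(data)
        return index, start, len(data)

    def read(self, ref):
        """Return the bytes stored at a reference produced by allocate()."""
        index, start, length = ref
        return bytes(self.chunks[index][start:start + length])

    def release_all(self):
        """Model a flush: every chunk is dropped at once. Returns the number freed."""
        freed = len(self.chunks)
        self.chunks = [bytearray(self.chunk_size)]
        self.offset = 0
        return freed
```

Because values from one store live together in a few large buffers, releasing them frees large contiguous regions rather than many small scattered objects, which is the property the summary credits for avoiding fragmentation.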
Apache Flink is an open source streaming dataflow engine that provides communication, fault tolerance, and data distribution for distributed computations over data streams. Flink is a top-level Apache project. It is a scalable data analytics framework that is fully compatible with Hadoop and can execute both stream processing and batch processing.
This document summarizes a presentation on optimizing Zabbix performance through tuning. It discusses identifying and fixing common problems like default templates and database settings. Next, it covers tuning Zabbix configuration by adjusting the number of server processes and monitoring internal stats. Additional optimizations include using proxies to distribute load, partitioning historical tables, and running Zabbix components on separate hardware. The summary emphasizes monitoring internal stats, tuning configurations and databases, disabling housekeeping, and reviewing additional reading on tuning MySQL, PostgreSQL and Zabbix internals.
ORC files were originally introduced in Hive, but have now migrated to an independent Apache project. This has sped up the development of ORC and simplified integrating ORC into other projects, such as Hadoop, Spark, Presto, and Nifi. There are also many new tools that are built on top of ORC, such as Hive’s ACID transactions and LLAP, which provides incredibly fast reads for your hot data. LLAP also provides strong security guarantees that allow each user to only see the rows and columns that they have permission for.
This talk will discuss the details of the ORC and Parquet formats and what the relevant tradeoffs are. In particular, it will discuss how to format your data and which options to use to maximize your read performance. We'll also discuss when and how to use ORC's schema evolution, bloom filters, and predicate push down. It will also show you how to use the tools to translate ORC files into human-readable formats, such as JSON, and display the rich metadata from the file, including the type in the file and min, max, and count for each column.
Introducing the Apache Flink Kubernetes Operator | Flink Forward
Flink Forward San Francisco 2022.
The Apache Flink Kubernetes Operator provides a consistent approach to manage Flink applications automatically, without any human interaction, by extending the Kubernetes API. Given the increasing adoption of Kubernetes-based Flink deployments, the community has been working on a Kubernetes-native solution as part of Flink that can benefit from the rich experience of community members and ultimately make Flink easier to adopt. In this talk we give a technical introduction to the Flink Kubernetes Operator and demonstrate the core features and use cases through in-depth examples.
by
Thomas Weise
The document discusses intra-cluster replication in Apache Kafka, including its architecture, in which partitions are replicated across brokers for high availability. Kafka uses a leader and in-sync replicas approach to provide strongly consistent replication while tolerating failures. Performance considerations in Kafka replication include latency and durability tradeoffs for producers and optimizing throughput for consumers.
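To make the leader and in-sync replica idea concrete, the sketch below computes which offsets count as committed in a simplified model: a record is treated as committed once every in-sync replica has it. The function names and the dictionary layout are hypothetical and chosen for illustration; real brokers track this state internally.

```python
def high_watermark(log_end_offsets, in_sync):
    """Return the first offset not yet replicated to every in-sync replica.

    log_end_offsets maps replica name -> next offset it would write.
    in_sync is the set of replica names currently considered in sync.
    """
    return min(log_end_offsets[replica] for replica in in_sync)


def is_committed(offset, log_end_offsets, in_sync):
    """A record is committed if its offset lies below the high watermark."""
    return offset < high_watermark(log_end_offsets, in_sync)
```

In this model a slow follower that has dropped out of the in-sync set no longer holds back commits, which illustrates the latency versus durability tradeoff mentioned above.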
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli... | Flink Forward
Flink Forward San Francisco 2022.
Flink consumers read from Kafka as a scalable, high throughput, and low latency data source. However, there are challenges in scaling out data streams where migration and multiple Kafka clusters are required. Thus, we introduced a new Kafka source to read sharded data across multiple Kafka clusters in a way that conforms well with elastic, dynamic, and reliable infrastructure. In this presentation, we will present the source design and how the solution increases application availability while reducing maintenance toil. Furthermore, we will describe how we extended the existing KafkaSource to provide mechanisms to read logical streams located on multiple clusters, to dynamically adapt to infrastructure changes, and to perform transparent cluster migrations and failover.
by
Mason Chen
HBase and HDFS: Understanding FileSystem Usage in HBase | enissoz
This document discusses file system usage in HBase. It provides an overview of the three main file types in HBase: write-ahead logs (WALs), data files, and reference files. It describes durability semantics, IO fencing techniques for region server recovery, and how HBase leverages data locality through short circuit reads, checksums, and block placement hints. The document is intended to help readers understand HBase's interactions with HDFS for tuning IO performance.
Flink Forward San Francisco 2019: How to Join Two Data Streams? - Piotr Nowojski | Flink Forward
Joins are one of the most common operations in SQL. However, it is far from trivial to express and execute them in a streaming environment with continuously running queries. During this talk we will first look into why join operations are more difficult on infinite data streams. Next we will look at a couple of different approaches to tackling this problem, such as Time Windowed Joins or a recent addition to Flink SQL: Temporal Joins. Temporal Tables and Temporal Joins are new concepts that provide an efficient solution to common problems such as data enrichment. Before Flink 1.7, data enrichment in SQL was often impossible to express using Windowed Joins or very inefficient when using Regular Joins. With Temporal Joins, Flink provides an interesting and ANSI SQL compliant alternative way to join two data streams.
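The core idea of a temporal join, enriching each event with the version of a table that was valid at the event's time, can be sketched in plain Python. This is a conceptual model rather than Flink SQL: `VersionedTable`, `as_of`, and `temporal_join` are hypothetical names, and the example assumes a currency-rate enrichment scenario.

```python
import bisect


class VersionedTable:
    """Keeps every version of a value per key, ordered by timestamp."""

    def __init__(self):
        self._times = {}   # key -> sorted list of version timestamps
        self._values = {}  # key -> values aligned with self._times[key]

    def upsert(self, key, timestamp, value):
        """Record that `key` has `value` from `timestamp` onward."""
        times = self._times.setdefault(key, [])
        values = self._values.setdefault(key, [])
        position = bisect.bisect_right(times, timestamp)
        times.insert(position, timestamp)
        values.insert(position, value)

    def as_of(self, key, timestamp):
        """Return the value valid at `timestamp`, or None if no version exists yet."""
        times = self._times.get(key, [])
        position = bisect.bisect_right(times, timestamp)
        return self._values[key][position - 1] if position > 0 else None


def temporal_join(orders, rates):
    """Enrich each (time, currency, amount) order with the rate valid at its time."""
    return [
        (time, currency, amount, rates.as_of(currency, time))
        for time, currency, amount in orders
    ]
```

Unlike a regular join, each order is matched against exactly one version of the rate, so earlier results never need to be updated when a newer rate arrives.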
Common issues with Apache Kafka® Producer | confluent
Badai Aqrandista, Confluent, Senior Technical Support Engineer
This session will be about a common issue in the Kafka Producer: producer batch expiry. We will be discussing the Kafka Producer internals, its common causes, such as a slow network or small batching, and how to overcome them. We will also be sharing some examples along the way!
https://www.meetup.com/apache-kafka-sydney/events/279651982/
Apache Flink: API, runtime, and project roadmap | Kostas Tzoumas
The document provides an overview of Apache Flink, an open source stream processing framework. It discusses Flink's programming model using DataSets and transformations, real-time stream processing capabilities, windowing functions, iterative processing, and visualization tools. It also provides details on Flink's runtime architecture, including its use of pipelined and staged execution, optimizations for iterative algorithms, and how the Flink optimizer selects execution plans.
Mohamed Amine Abdessemed – Real-time Data Integration with Apache Flink & Kafka | Flink Forward
This document summarizes a presentation about Bouygues Telecom's use of Apache Flink for real-time data integration and processing of mobile network event logs. Bouygues Telecom processes over 4 billion logs per day from their network equipment to calculate mobile quality of experience (QoE) indicators within 60 seconds for business intelligence, diagnostics and alerting. They were previously using Hadoop for batch processing but needed a real-time solution. After evaluating Apache Spark and Flink, they chose Flink for its true streaming capabilities, backpressure handling, and high performance on limited resources. Flink helped them process a day's worth of logs in under an hour from 10 Kafka partitions across 10 TaskManagers, each with only
Apache Flink Training: DataStream API Part 1 Basic | Flink Forward
The document provides an overview of Apache Flink's DataStream API for stream processing. It discusses key concepts like stream execution environments, data types (including tuples), transformations (such as map, filter, grouping), data sources (files, sockets, collections), sinks, and fault tolerance through checkpointing. The document also contains examples of a WordCount application using the DataStream API in Java.
This document provides an overview of a presentation comparing Apache Flink and Apache Spark. The presentation aims to address marketing claims, confusing statements, and outdated information regarding Flink vs Spark. It outlines key criteria to evaluate the two platforms, such as streaming capabilities, state management, and scalability. The document then directly compares some criteria, such as their support for iterative processing and streaming engines. The presenter hopes this evaluation framework will help others assess Flink and Spark for stream processing use cases.
This document discusses stateful stream processing. It provides examples of stateful streaming applications and describes several open source stream processors, including their programming models and approaches to fault tolerance. It also examines how different systems handle state in streaming programs and discusses the tradeoffs of various approaches.
Maximilian Michels – Google Cloud Dataflow on Top of Apache Flink | Flink Forward
This document discusses Google Cloud Dataflow and how it can be executed using Apache Flink. It provides an overview of Dataflow and its API, which is similar to batch and streaming concepts in Flink. It then describes how a Dataflow program is translated to an Abstract Syntax Tree (AST) and how the AST is converted to a Flink execution graph by implementing translators for specific Dataflow transforms like ParDo and Combine. Finally, it mentions the FlinkPipelineRunner that is available on GitHub to execute Dataflow pipelines using Flink.
Matthias J. Sax – A Tale of Squirrels and Storms | Flink Forward
The document discusses similarities and differences between Apache Flink and Apache Storm, two stream processing frameworks. It describes how Flink and Storm have similar capabilities as true stream processing engines with low latency. However, it notes that Flink has advantages like richer APIs, exactly-once processing, and higher throughput. The document also provides details on the system architectures, topology deployment strategies, and Storm compatibility features of Flink.
Jim Dowling – Interactive Flink analytics with HopsWorks and Zeppelin | Flink Forward
This document discusses interactive analytics with HopsWorks and Zeppelin. It summarizes HopsWorks, a frontend for Hops that supports true multi-tenancy, free-text search across metadata, and interactive analytics with Flink and Zeppelin. It also discusses how HopsFS and HopsYARN improve on HDFS and YARN architectures with metadata stored in a distributed database for consistency and global search.
Fabian Hueske – Juggling with Bits and Bytes | Flink Forward
This document discusses how Apache Flink operates on binary data. Flink adopts a database management system approach by serializing data objects into fixed memory segments for efficient in-memory and out-of-memory processing. This approach improves memory safety, reduces garbage collection overhead, and allows for efficient algorithms to operate directly on the binary data representations. It requires significant implementation effort compared to using generic Java collections, but provides benefits like predictable performance and resource usage.
Flink 0.10 @ Bay Area Meetup (October 2015) | Stephan Ewen
Flink 0.10 focuses on operational readiness with improvements to high availability, monitoring, and integration with other systems. It provides first-class support for event time processing and refines the DataStream API to be both easy to use and powerful for stream processing tasks.
Sebastian Schelter – Distributed Machine Learning with the Samsara DSL | Flink Forward
The document discusses Samsara, a domain specific language for distributed machine learning. It provides an algebraic expression language for linear algebra operations and optimizes distributed computations. An example of linear regression on a cereals dataset is presented to demonstrate how Samsara can be used to estimate regression coefficients in a distributed fashion. Key steps include loading data as a distributed row matrix, extracting feature and target matrices, computing the normal equations, and solving the linear system to estimate coefficients.
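The "normal equations" step mentioned above can be written out as follows. The symbols are assumptions chosen here for illustration and are not taken from the slides: $X$ is the feature matrix, $y$ the target vector, and $\hat{\beta}$ the estimated coefficients.

```latex
X^\top X \,\hat{\beta} = X^\top y
\quad\Longrightarrow\quad
\hat{\beta} = (X^\top X)^{-1} X^\top y
```

For $n$ rows and $k$ features, $X^\top X$ is only $k \times k$, so the expensive products can be computed over the distributed rows while the final small linear system is solved on a single machine.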
Cascading is a Java API for building batch data applications on Hadoop. Originally developed for Hadoop MapReduce, Cascading programs can now run on Apache Flink. When translated to Flink, a Cascading program runs as a single Flink job rather than multiple MapReduce jobs. This improves performance by allowing pipelined data exchange and avoiding writing intermediate results to HDFS. Benchmark results show the TF-IDF example program runs much faster on Flink than when translated to MapReduce, due to Flink's efficient in-memory operators and pipelined execution.
Flink allows users to run Hadoop MapReduce jobs without changing any code by wrapping Hadoop's APIs. It supports Hadoop data types, file systems, and functions like mappers and reducers. Specifically, Flink can run a WordCount example written using Hadoop APIs without modifications by utilizing Hadoop input/output formats and mapper/reducer functions. Going forward, Flink aims to allow injecting entire MapReduce jobs as a unit into a Flink program while supporting custom Hadoop partitioners and sorters.
Assaf Araki – Real Time Analytics at Scale | Flink Forward
1) The document discusses real-time analytics at scale for internet of things data using smart data pipes.
2) It describes Intel's big data analytics team and their goals of helping Intel gain a competitive advantage through operational excellence and helping win in the area of intelligent machines.
3) As an example, it outlines a use case for Parkinson's disease research that collects objective measures from patients to generate insights using big data analytics from clinical trials and population studies.
This document provides an overview of Apache Flink, an open-source stream processing framework. It discusses Flink's capabilities in supporting streaming, batch, and iterative processing natively through a streaming dataflow model. It also describes Flink's architecture including the client, job manager, task managers, and various execution setups like local, remote, YARN, and embedded. Finally, it compares Flink to other stream and batch processing systems in terms of their APIs, fault tolerance guarantees, and strengths.
Anwar Rizal – Streaming & Parallel Decision Tree in Flink | Flink Forward
This document outlines a streaming decision tree classifier for classifying data streams using Apache Flink. It discusses the need for a classifier that can learn from streaming data. The architecture uses Kafka streams to ingest a stream of labeled data points and broadcast the evolving decision tree model. The algorithm builds approximate histograms over data features to determine split points for the decision tree in a streaming fashion without needing to store all data. This allows the classifier to continuously learn and make predictions on streaming data.
Vyacheslav Zholudev – Flink, a Convenient Abstraction Layer for Yarn? | Flink Forward
Flink provides a convenient abstraction layer for YARN that simplifies distributing computational tasks across a cluster. It allows writing custom input formats and operators more easily than traditional approaches like MapReduce. This document discusses two examples - a MongoDB to Avro data conversion pipeline and a file copying job - that were simplified and made more efficient by implementing them in Flink rather than traditional MapReduce or custom YARN applications. Flink handles task parallelization and orchestration automatically.
Apache Flink Training: DataStream API Part 2 Advanced | Flink Forward
Flink can handle many data types and provides a type system to identify types for serialization and comparisons. Composite types like Tuples and POJOs can be used and fields within them can define keys. Windows provide a way to perform aggregations over finite slices of infinite streams. Connected streams allow correlating and joining multiple streams. Stateful functions have access to local and partitioned state for stateful stream processing. Kafka integration allows consuming from and producing to Kafka topics.
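The idea of aggregating over finite slices of an infinite stream can be shown with a keyed tumbling window in plain Python. This is a conceptual sketch, not Flink's DataStream API; `tumbling_window_sums` is a hypothetical helper that assumes integer timestamps and values.

```python
from collections import defaultdict


def tumbling_window_sums(events, window_size):
    """Sum values per (key, window_start) for fixed, non-overlapping windows.

    events is an iterable of (timestamp, key, value) tuples. Each event
    belongs to exactly one window: the one starting at the largest multiple
    of window_size that is not greater than its timestamp.
    """
    sums = defaultdict(int)
    for timestamp, key, value in events:
        window_start = timestamp - (timestamp % window_size)
        sums[(key, window_start)] += value
    return dict(sums)
```

For example, with a window size of 5, events at timestamps 1 and 4 fall into the window starting at 0, while an event at timestamp 5 opens the next window.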
S. Bartoli & F. Pompermaier – A Semantic Big Data Companion | Flink Forward
This document introduces Okkam, an Italian company that uses Apache Flink for large-scale data integration and semantic technologies. It discusses Okkam's use of Flink for domain reasoning, RDF data processing, duplicate detection, entity linkage, and telemetry analysis. The document also provides lessons learned from Okkam's Flink experiences and suggestions for improving Flink.
Kamal Hakimzadeh – Reproducible Distributed Experiments | Flink Forward
This document discusses reproducible distributed experiments. It motivates reproducibility in data science due to analytical vs empirical proofs and complex scheduling and fault tolerance. It defines reproducibility as infrastructure, software, experiments and data. It demos a word count experiment on Karamel, a framework for reproducibility across bare metal, VMs, and software defined in Chef Github. Karamel Engine uses a DSL service and cloud clients to orchestrate physical mapping. Orchestration follows a queuing model. Challenges include scalability, fault recovery, elasticity, instrumentation, and language support.
WebSphere Application Server Liberty Profile and Docker | David Currie
Docker is a tool that allows applications to be run in isolated containers. The document discusses Docker and its popularity, benefits including consistency and speed. It provides an overview of Docker concepts like images, containers and registries. It then discusses IBM's involvement with Docker including contributions to projects and products that support Docker. Finally, it covers using the WebSphere Application Server Liberty Profile with Docker, including building and running Docker images for Liberty.
Connect to Bluemix Docker container with PuTTY | Joseph Chang
Should I choose a VM or a container? Sometimes it makes no difference to you.
If you are new to Docker, you can start by accessing it the way you are already familiar with.
This document summarizes Docker concepts and provides steps for a local Docker development setup. It introduces Docker images, containers, and registries. It then outlines requirements for development and production configurations and provides examples of setting up a Node.js/Angular frontend and Django backend using Docker images. The document concludes with notes on continuous integration and architecture options.
Bring Continuous Integration to Your Laptop With the Drone CI Docker Extensio... | jemije2490
1. The Drone CI Docker extension allows developers to run continuous integration (CI) pipelines locally on their laptop using Docker Desktop.
2. With the extension, developers can import Drone CI pipelines into Docker Desktop and run specific steps of a pipeline to test and debug them.
3. The extension integrates Drone CI directly into Docker Desktop for streamlined management of CI/CD workflows during development.
Docker and containers - Presentation Slides by Priyadarshini Anand | PRIYADARSHINI ANAND
The document provides an overview of Docker containers and how to get started with Docker. It discusses what containers are, how Docker works, the differences between containers and VMs, and how to use basic Docker commands. It also covers creating Docker images using Dockerfiles and provides examples of common Dockerfile commands.
The document discusses Docker and container orchestration tools. It begins with an agenda on multi-machine Docker swarms and alternatives like Kubernetes and Mesos. It then covers setting up a multi-node Docker swarm across two virtual machines, deploying an application to the swarm, and accessing the clustered application. Moby Project is introduced as the new name for Docker's open source components to distinguish them from commercial Docker products. Tools like Kitematic, Docker's Universal Control Plane, and Panamax are also briefly mentioned.
The document discusses Docker containers and images. It explains that Docker containers allow applications to be packaged and run in isolation. Images contain the build files and metadata for containers. The document provides examples of creating, running, stopping, restarting, and removing Docker containers based on images. It also discusses viewing container logs and committing changed containers back to new images.
Currently, Bluemix VM doesn't provide a Windows VM image. The slides show you how to get a Windows VM image from Cloudbase and how to upload it to Bluemix VM.
Docker allows applications to be packaged into standardized units called containers that can run on any infrastructure. IBM Bluemix supports Docker containers and provides services for building, managing, and hosting containerized applications in a hybrid cloud environment. Key benefits of Docker containers include increased portability and efficiency in development and deployment across physical and cloud infrastructure.
This document provides an overview of the architecture of the Big Data Europe (BDE) Integrator platform. It discusses the goals of being open source, simple to use for big data, supporting various use cases, and integrating custom components. It describes the different user categories and the component lifecycle. It also provides information on developing Spark applications with Docker and BDE, the UI integrator application, using a reverse proxy, and links to code demos.
This document discusses Docker containers on Windows. It begins by explaining the difference between virtual machines and containers, and the options for container runtimes on Windows like Nano Server and Windows Server Core. It then provides an example of a simple Dockerfile and discusses strategies for reducing image sizes like using a multi-stage Dockerfile. The document also covers using Docker with Visual Studio 2017 and SQL Server, and concludes with contact information for the author.
Faster and Easier Software Development using Docker Platform | msyukor
Faster and Easier Software Development using Docker Platform presentation for Workshop with Open Source Community 1/2019 organized by MAMPU Malaysia under project Open Source Development and Capabilities Program (OSDeC) for Public Sector in Malaysia on January 29, 2019 at Port Dickson, Negeri Sembilan, Malaysia.
This presentation describes how to use Podman to replace Docker in the Alfresco 7.4.0 development process.
The Alfresco platform is built using containerization technology. Alfresco can utilize containerization platforms like Podman, which provide the necessary tools and infrastructure to create, manage, and run containers.
Podman is presented as an alternative to Docker. Both Docker and Podman can be used effectively for Alfresco development. So consider your familiarity with the tools, preferred workflow, ecosystem support, security requirements, and any specific performance considerations to make the best choice for your Alfresco development needs.
The first aim is to present Docker, its use cases, and some good usage practices.
The goal is to present Docker, how it works, and its ecosystem.
What it can offer and the pitfalls to avoid.
https://github.com/kanedafromparis/prez-fabric8-dmp
Environment isolation with Docker (Alex Medvedev, Alpari) | Symfoniacs
Docker can isolate application environments in software containers that are like virtual machines but more lightweight and faster. A Dockerfile defines the steps to build a container image. For example, a Dockerfile can create a container image for a Symfony PHP application that contains PHP-FPM and dependencies. The application code can be mounted into the container from the host machine. Nginx on the host can then serve the application using the container's PHP-FPM.
The document discusses Docker and container orchestration tools. It begins with an agenda on multi-machine Docker swarms and alternatives like Kubernetes and Mesos. It then provides step-by-step instructions for setting up a multi-node Docker swarm cluster on VirtualBox machines and deploying an application. The document also discusses the Moby Project for separating Docker's open source and commercial components, as well as other Docker tools for developers.
Container runtime and tooling has matured since Docker brought it to the mainstream a decade ago. There are multiple options for building and running containers available to the developers and system administrators. Oleg Chunikhin, CTO at Kublr, will provide a review and analysis of the popular options.
Similar to Simon Laws – Apache Flink Cluster Deployment on Docker and Docker-Compose (20)
Building a fully managed stream processing platform on Flink at scale for Lin... | Flink Forward
Apache Flink is a distributed stream processing framework that allows users to process and analyze data in real-time. At LinkedIn, we developed a fully managed stream processing platform on Flink running on K8s to power hundreds of stream processing pipelines in production. This platform is the backbone for other infra systems like Search, Espresso (internal document store) and feature management etc. We provide a rich authoring and testing environment which allows users to create, test, and deploy their streaming jobs in a self-serve fashion within minutes. Users can focus on their business logic, leaving the Flink platform to take care of management aspects such as split deployment, resource provisioning, auto-scaling, job monitoring, alerting, failure recovery and much more. In this talk, we will introduce the overall platform architecture, highlight the unique value propositions that it brings to stream processing at LinkedIn and share the experiences and lessons we have learned.
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...Flink Forward
Flink Forward San Francisco 2022.
To improve Amazon Alexa experiences and support machine learning inference at scale, we built an automated end-to-end solution for incremental model building or fine-tuning machine learning models through continuous learning, continual learning, and/or semi-supervised active learning. Customer privacy is our top concern at Alexa, and as we build solutions, we face unique challenges when operating at scale such as supporting multiple applications with tens of thousands of transactions per second with several dependencies including near-real time inference endpoints at low latencies. Apache Flink helps us transform and discover metrics in near-real time in our solution. In this talk, we will cover the challenges that we faced, how we scale the infrastructure to meet the needs of ML teams across Alexa, and go into how we enable specific use cases that use Apache Flink on Amazon Kinesis Data Analytics to improve Alexa experiences to delight our customers while preserving their privacy.
by
Aansh Shah
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Flink Forward
Flink Forward San Francisco 2022.
Probably everyone who has written stateful Apache Flink applications has used one of the fault-tolerant keyed state primitives ValueState, ListState, and MapState. With RocksDB, however, retrieving and updating items comes at an increased cost that you should be aware of. Sometimes, this cost cannot be avoided with the current API, e.g., for efficient event-time stream-sorting or streaming joins where you need to iterate one or two buffered streams in the right order. With FLIP-220, we are introducing a new state primitive: BinarySortedMultiMapState. This new form of state lets you (a) efficiently store lists of values for a user-provided key, and (b) iterate keyed state in a well-defined sort order. Both features can be backed efficiently by RocksDB with a 2x performance improvement over the current workarounds. This talk will go into the details of the new API and its implementation, present how to use it in your application, and talk about the process of getting it into Flink.
by
Nico Kruber
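The two capabilities named in the abstract above (several values per key, and iteration in a well-defined key order) can be pictured with a toy in-memory structure. This is only a sketch: the class `SortedMultiMap` and its methods `add` and `pop_up_to` are hypothetical names chosen for illustration and are not the FLIP-220 API, and nothing here is backed by RocksDB.

```python
import bisect
from collections import defaultdict


class SortedMultiMap:
    """Toy multimap: many values per key, drained in ascending key order."""

    def __init__(self):
        self._keys = []                   # sorted list of distinct keys
        self._values = defaultdict(list)  # key -> values in insertion order

    def add(self, key, value):
        # Insert the key into the sorted list only the first time it is seen.
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key].append(value)

    def pop_up_to(self, max_key):
        """Remove and return (key, values) pairs for all keys <= max_key."""
        end = bisect.bisect_right(self._keys, max_key)
        ready, self._keys = self._keys[:end], self._keys[end:]
        return [(key, self._values.pop(key)) for key in ready]


# Event-time sorting: buffer out-of-order events, then release them in
# timestamp order once a (hypothetical) watermark of 20 has been reached.
buffer = SortedMultiMap()
for timestamp, event in [(30, "c"), (10, "a"), (20, "b"), (10, "a2")]:
    buffer.add(timestamp, event)
released = buffer.pop_up_to(20)
print(released)  # [(10, ['a', 'a2']), (20, ['b'])]
```

The event with timestamp 30 stays buffered until a later watermark passes it, which is the kind of ordered iteration a stream-sorting or join operator would need.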
One sink to rule them all: Introducing the new Async SinkFlink Forward
Flink Forward San Francisco 2022.
Next time you want to integrate with a new destination for a demo, concept or production application, the Async Sink framework will bootstrap development, allowing you to move quickly without compromise. In Flink 1.15 we introduced the Async Sink base (FLIP-171), with the goal to encapsulate common logic and allow developers to focus on the key integration code. The new framework handles things like request batching, buffering records, applying backpressure, retry strategies, and at least once semantics. It allows you to focus on your business logic, rather than spending time integrating with your downstream consumers. During the session we will dive deep into the internals to uncover how it works, why it was designed this way, and how to use it. We will code up a new sink from scratch and demonstrate how to quickly push data to a destination. At the end of this talk you will be ready to start implementing your own Flink sink using the new Async Sink framework.
by
Steffen Hausmann & Danny Cranmer
Flink powered stream processing platform at PinterestFlink Forward
Flink Forward San Francisco 2022.
Pinterest is a visual discovery engine that serves over 433MM users. Stream processing allows us to unlock value from realtime data for pinners. At Pinterest, we adopt Flink as the unified streaming processing engine. In this talk, we will share our journey in building a stream processing platform with Flink and how we onboard critical use cases to the platform. Pinterest has supported 90+ near-realtime streaming applications. We will cover the problem statement, how we evaluate potential solutions, and our decision to build the framework.
by
Rainie Li & Kanchi Masalia
Flink Forward San Francisco 2022.
This talk will take you on the long journey of Apache Flink into the cloud-native era. It started all the way from where Hadoop and YARN were the standard way of deploying and operating data applications.
We're going to deep dive into the cloud-native set of principles and how they map to the Apache Flink internals and recent improvements. We'll cover fast checkpointing, fault tolerance, resource elasticity, minimal infrastructure dependencies, industry-standard tooling, ease of deployment and declarative APIs.
After this talk you'll get a broader understanding of the operational requirements for a modern streaming application and where the current limits are.
by
David Moravek
Where is my bottleneck? Performance troubleshooting in FlinkFlink Forward
Flink Forward San Francisco 2022.
In this talk, we will cover various topics around performance issues that can arise when running a Flink job and how to troubleshoot them. We’ll start with the basics, like understanding what the job is doing and what backpressure is. Next, we will see how to identify bottlenecks and which tools or metrics can be helpful in the process. Finally, we will also discuss potential performance issues during the checkpointing or recovery process, as well as some tips and Flink features that can speed up checkpointing and recovery times.
by
Piotr Nowojski
Flink Forward San Francisco 2022.
The Table API is one of the most actively developed components of Flink in recent times. Inspired by databases and SQL, it encapsulates concepts many developers are familiar with. It can be used with both bounded and unbounded streams in a unified way. But from afar it can be difficult to keep track of what this API is capable of and how it relates to Flink's other APIs. In this talk, we will explore the current state of the Table API. We will show how it can be used as a batch processor, a changelog processor, or a streaming ETL tool with many built-in functions and operators for deduplicating, joining, and aggregating data. By comparing it to the DataStream API we will highlight differences and elaborate on when to use which API. We will demonstrate hybrid pipelines in which both APIs interact with one another and contribute their unique strengths. Finally, we will take a look at some of the most recent additions as a first step to stateful upgrades.
by
David Anderson
Flink Forward San Francisco 2022.
Based on the new Flink-Pulsar connector, we implemented Flink's TableAPI and Catalog to help users to interact with the Pulsar cluster via Flink SQL easily. We would like to go through the design and implementation of the SQL connector in the following aspects:
1. Two different modes of using Pulsar as a metadata store
2. Data format transformation and management
3. SQL semantics support within Pulsar context
by
Sijie Guo & Neng Lu
Dynamic Rule-based Real-time Market Data AlertsFlink Forward
Flink Forward San Francisco 2022.
At Bloomberg, we deal with high volumes of real-time market data. Our clients expect to be notified of any anomalies in this market data, which may indicate volatile movements in the markets, notable trades, forthcoming events, or system failures. The parameters for these alerts are always evolving and our clients can update them dynamically. In this talk, we'll cover how we utilized the open source Apache Flink and Siddhi SQL projects to build a distributed, scalable, low-latency and dynamic rule-based, real-time alerting system to solve our clients' needs. We'll also cover the lessons we learned along our journey.
by
Ajay Vyasapeetam & Madhuri Jain
Exactly-Once Financial Data Processing at Scale with Flink and PinotFlink Forward
Flink Forward San Francisco 2022.
At Stripe we have created a complete end-to-end exactly-once processing pipeline to process financial data at scale, by combining the exactly-once power from Flink, Kafka, and Pinot together. The pipeline provides an exactly-once guarantee, end-to-end latency within a minute, deduplication against hundreds of billions of keys, and sub-second query latency against the whole dataset with trillions of rows. In this session we will discuss the technical challenges of designing, optimizing, and operating the whole pipeline, including Flink, Kafka, and Pinot. We will also share our lessons learned and the benefits gained from exactly-once processing.
by
Xiang Zhang & Pratyush Sharma & Xiaoman Dong
Processing Semantically-Ordered Streams in Financial ServicesFlink Forward
Flink Forward San Francisco 2022.
What if my data is already in order? Stream Processing has given us an elegant and powerful solution for running analytic queries and logic over high volumes of continuously arriving data. However, in both Apache Flink and Apache Beam, the notion of time-ordering is baked in at a very low level, making it difficult to express computations that are interested in a semantic-, rather than time-ordering of the data. In financial services, what often matters the most about the data moving between systems is not when the data was created, but in what order, to the extent that many institutions engineer a global sequencing over all data entering and produced by their systems to achieve complete determinism. How, then, can financial institutions and others best employ Stream Processing on streams of data that are already ordered? I will cover various techniques that can make this work, as well as seek input from the community on how Flink might be improved to better support these use-cases.
by
Patrick Lucas
Tame the small files problem and optimize data layout for streaming ingestion...Flink Forward
Flink Forward San Francisco 2022.
In modern data platform architectures, stream processing engines such as Apache Flink are used to ingest continuous streams of data into data lakes such as Apache Iceberg. Streaming ingestion to Iceberg tables can suffer from two problems: (1) the small files problem, which can hurt read performance, and (2) poor data clustering, which can make file pruning less effective. To address those two problems, we propose adding a shuffling stage to the Flink Iceberg streaming writer. The shuffling stage can intelligently group data via bin packing or range partitioning. This can reduce the number of concurrent files that every task writes. It can also improve data clustering. In this talk, we will explain the motivations in detail and dive into the design of the shuffling stage. We will also share the evaluation results that demonstrate the effectiveness of smart shuffling.
by
Gang Ye & Steven Wu
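The bin-packing idea mentioned in the abstract above can be sketched with a generic greedy heuristic: route each table partition to exactly one writer task, heaviest partition first, always choosing the task with the least load so far. This is an illustration under assumptions, not the design presented in the talk; the function name `assign_partitions`, the partition names, and the traffic weights are all hypothetical.

```python
import heapq


def assign_partitions(partition_weights, num_tasks):
    """Greedy bin packing: heaviest partition first, onto the least-loaded task."""
    heap = [(0, task) for task in range(num_tasks)]  # (current load, task id)
    heapq.heapify(heap)
    assignment = {}
    # Sort by descending weight; break ties by partition name for determinism.
    for partition, weight in sorted(
        partition_weights.items(), key=lambda kv: (-kv[1], kv[0])
    ):
        load, task = heapq.heappop(heap)
        assignment[partition] = task
        heapq.heappush(heap, (load + weight, task))
    return assignment


# Hypothetical traffic statistics per table partition.
weights = {"2024-01-01": 50, "2024-01-02": 30, "2024-01-03": 20, "2024-01-04": 10}
print(assign_partitions(weights, num_tasks=2))
# {'2024-01-01': 0, '2024-01-02': 1, '2024-01-03': 1, '2024-01-04': 0}
```

Because every partition lands on a single task, each task keeps fewer files open at once, while the greedy choice keeps the per-task load roughly balanced (60 versus 50 in this example).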
Batch Processing at Scale with Flink & IcebergFlink Forward
Flink Forward San Francisco 2022.
Goldman Sachs's Data Lake platform serves as the firm's centralized data platform, ingesting 140K (and growing!) batches per day of Datasets of varying shape and size. Powered by Flink and using metadata configured by platform users, ingestion applications are generated dynamically at runtime to extract, transform, and load data into centralized storage where it is then exported to warehousing solutions such as Sybase IQ, Snowflake, and Amazon Redshift. Data Latency is one of many key considerations as producers and consumers have their own commitments to satisfy. Consumers range from people/systems issuing queries, to applications using engines like Spark, Hive, and Presto to transform data into refined Datasets. Apache Iceberg allows our applications to not only benefit from consistency guarantees important when running on eventually consistent storage like S3, but also allows us the opportunity to improve our batch processing patterns with its scalability-focused features.
by
Andreas Hailu
Flink Forward San Francisco 2022.
At Flink Forward, we get to hear creative, unique use cases, often on the bleeding edge of some of the most exciting current technologies. This talk will give you a chance to open up the hood on our driven and innovative Open Source community. I will cover what our community has been working on this past year, and how this work relates to our (Ververica's) exciting new Flink engineering roadmap! I will also go through some best practices and upcoming opportunities for getting involved in this community!
by
Caito Scherr
Practical learnings from running thousands of Flink jobsFlink Forward
Flink Forward San Francisco 2022.
Task Managers constantly running out of memory? Flink job keeps restarting from cryptic Akka exceptions? Flink job running but doesn’t seem to be processing any records? We share practical learnings from running thousands of Flink jobs for different use cases and take a look at common challenges we have encountered, such as out-of-memory errors, timeouts and job stability. We will cover memory tuning, S3 and Akka configurations to address common pitfalls, and the approaches that we take to automate health monitoring and management of Flink jobs at scale.
by
Hong Teoh & Usamah Jassat
Extending Flink SQL for stream processing use casesFlink Forward
1. For streaming data, Flink SQL uses STREAMs for append-only queries and CHANGELOGs for upsert queries instead of tables.
2. Stateless queries on streaming data, such as projections and filters, result in new STREAMs or CHANGELOGs.
3. Stateful queries, such as aggregations, produce STREAMs or CHANGELOGs depending on whether they are windowed or not. Join queries between streaming sources also result in STREAM outputs.
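The distinction drawn in the three points above can be illustrated with plain Python lists. This is a conceptual sketch, not Flink SQL: the function names and the sample events are hypothetical, and it assumes that a non-windowed aggregation is the case that yields a changelog, which the summary above does not state explicitly.

```python
def filter_stream(events, predicate):
    """Stateless query: each input row produces at most one appended row."""
    return [event for event in events if predicate(event)]


def count_changelog(events, key):
    """Stateful, non-windowed aggregation: each input row emits an upsert
    that replaces the previous result for the same key."""
    counts = {}
    changelog = []
    for event in events:
        k = event[key]
        counts[k] = counts.get(k, 0) + 1
        changelog.append((k, counts[k]))  # upsert: latest value per key wins
    return changelog, counts


events = [
    {"user": "a", "amount": 5},
    {"user": "b", "amount": 50},
    {"user": "a", "amount": 70},
]

large = filter_stream(events, lambda e: e["amount"] > 10)
changelog, latest = count_changelog(events, "user")
print(large)      # [{'user': 'b', 'amount': 50}, {'user': 'a', 'amount': 70}]
print(changelog)  # [('a', 1), ('b', 1), ('a', 2)]
print(latest)     # {'a': 2, 'b': 1}
```

The filtered output only ever grows, so it can be read as an append-only stream. The count output revises earlier results (the entry for user "a" is updated from 1 to 2), so a consumer has to treat it as a changelog keyed by user.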
The top 3 challenges running multi-tenant Flink at scaleFlink Forward
Apache Flink is the foundation for Decodable's real-time SaaS data platform. Flink runs critical data processing jobs with strong security requirements. In addition, Decodable has to scale to thousands of tenants, power various use cases, provide an intuitive user experience and maintain cost-efficiency. We've learned a lot of lessons while building and maintaining the platform. In this talk, I'll share the top 3 toughest challenges building and operating this platform with Flink, and how we solved them.
Using Queryable State for Fun and ProfitFlink Forward
Flink Forward San Francisco 2022.
A particular feature in our system relies on a streaming 90-minute trailing window of 1-minute samples - implemented as a lookaside cache - to speed up a particular query, allowing our customers to rapidly see an overview of their estate. Across our entire customer base, there is a substantial amount of data flowing into this cache - ~1,000,000 entries/second, with the entire cache requiring ~600GB of RAM. The current implementation is simplistic but expensive. In this talk I describe a replacement implementation as a stateful streaming Flink application leveraging Queryable State. This Flink application reduces the net cost by ~90%. In this session, the implementation is described in detail, including windowing considerations, a sliding-window state buffer that avoids the sliding window replication penalty, and a comparison of queryable state and Redis queries. The talk concludes with a frank discussion of when this distinctive approach is, and is not, appropriate.
by
Ron Crocker
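The trailing window of 1-minute samples described in the abstract above can be pictured as one bounded buffer per key that is appended to once per minute and filtered at query time, rather than copying every sample into each overlapping sliding window. The sketch below is only an illustration of that idea: the class `TrailingWindow` and its methods are hypothetical, and it is not the Flink Queryable State implementation from the talk.

```python
from collections import deque


class TrailingWindow:
    """Keeps the most recent `size` one-minute samples for a single key."""

    def __init__(self, size=90):
        # A bounded deque drops the oldest sample automatically when full.
        self._samples = deque(maxlen=size)  # (minute, value) pairs

    def add(self, minute, value):
        self._samples.append((minute, value))

    def query(self, now_minute):
        """Return the values whose minute lies in (now - size, now]."""
        size = self._samples.maxlen
        return [
            value
            for minute, value in self._samples
            if now_minute - size < minute <= now_minute
        ]


# A 3-minute window keeps the example short; the abstract uses 90 minutes.
window = TrailingWindow(size=3)
for minute, value in [(1, 10), (2, 20), (3, 30), (4, 40), (5, 50)]:
    window.add(minute, value)
print(window.query(5))  # [30, 40, 50]
print(window.query(6))  # [40, 50]
```

Each sample is stored once, so memory per key is bounded by the window size regardless of how often the window slides.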
Changelog Stream Processing with Apache FlinkFlink Forward
Flink Forward San Francisco 2022.
The world is constantly changing. Data is continuously produced and thus should be consumed in a similar fashion by enterprise systems. Only this enables real-time decisions at scale. Message logs such as Apache Kafka can be found in almost every architecture, while databases and other batch systems still provide the foundation. Change Data Capture (CDC) propagates changes downstream. In this talk, we will highlight what it means to be a general data processor and how Flink can act as an integration hub. We present the current state of Flink and how it can power various use cases on both finite and infinite streams. We demonstrate Flink's SQL engine as a changelog processor that is shipped with an ecosystem tailored to process CDC data and maintain materialized views. We will use Kafka as an upsert log, Debezium for connecting to databases, and enrich streams of various sources. Finally, we will combine Flink's Table API with DataStream API for event-driven applications beyond SQL.
by
Timo Walther
Increase Quality with User Access Policies - July 2024Peter Caitens
⭐️ Increase Quality with User Access Policies ⭐️, presented by Peter Caitens and Adam Best of Salesforce. View the slides from this session to hear all about “User Access Policies” and how they can help you onboard users faster with greater quality.
Top 12 AI Technology Trends For 2024.pdfMarrie Morris
Technology has become an irreplaceable component of our daily lives. The role of AI in technology revolutionizes our lives for the betterment of the future. In this article, we will learn about the top 12 AI technology trends for 2024.
This PDF delves into the aspects of information security from a forensic perspective, focusing on privacy leaks. It provides insights into the methods and tools used in forensic investigations to uncover and mitigate privacy breaches in mobile and cloud environments.
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...Fwdays
.NET 8 brought a lot of improvements for developers and maturity to the Azure serverless container ecosystem. So, this talk will cover these changes and explain how you can apply them to your projects. Another reason for this talk is the re-invention of Serverless from a DevOps perspective as a Platform Engineering trend with Backstage and the recent Radius project from Microsoft. So now is the perfect time to look at developer productivity tooling and serverless apps from Microsoft's perspective.
Choosing the Best Outlook OST to PST Converter: Key Features and Considerationswebbyacad software
When looking for a good software utility to convert Outlook OST files to PST format, it is important to find one that is easy to use and has useful features. WebbyAcad OST to PST Converter Tool is a great choice because it is simple to use for anyone, whether you are tech-savvy or not. It can smoothly change your files to PST while keeping all your data safe and secure. Plus, it can handle large amounts of data and convert multiple files at once, which can save you a lot of time. It even comes with 24*7 technical support assistance and a free trial, so you can try it out before making a decision. Whether you need to recover, move, or back up your data, Webbyacad OST to PST Converter is a reliable option that gives you all the support you need to manage your Outlook data effectively.
Welcome to Cyberbiosecurity. Because regular cybersecurity wasn't complicated...Snarky Security
How wonderful it is that in our modern age, every bit of our biological data can be digitized, stored, and potentially pilfered by cyber thieves! Isn't it just splendid to think that while scientists are busy pushing the boundaries of biotechnology, hackers could be plotting the next big bio-data heist? This delightful scenario is brought to you by the ever-expanding digital landscape of biology and biotechnology, where the integration of computer science, engineering, and data science transforms our understanding and manipulation of biological systems.
While the fusion of technology and biology offers immense benefits, it also necessitates a careful consideration of the ethical, security, and associated social implications. But let's be honest, in the grand scheme of things, what's a little risk compared to potential scientific achievements? After all, progress in biotechnology waits for no one, and we're just along for the ride in this thrilling, slightly terrifying, adventure.
So, as we continue to navigate this complex landscape, let's not forget the importance of robust data protection measures and collaborative international efforts to safeguard sensitive biological information. After all, what could possibly go wrong?
-------------------------
This document provides a comprehensive analysis of the security implications of biological data use. The analysis explores various aspects of biological data security, including the vulnerabilities associated with data access, the potential for misuse by state and non-state actors, and the implications for national and transnational security. Key aspects considered include the impact of technological advancements on data security, the role of international policies in data governance, and the strategies for mitigating risks associated with unauthorized data access.
This view offers valuable insights for security professionals, policymakers, and industry leaders across various sectors, highlighting the importance of robust data protection measures and collaborative international efforts to safeguard sensitive biological information. The analysis serves as a crucial resource for understanding the complex dynamics at the intersection of biotechnology and security, providing actionable recommendations to enhance biosecurity in a digital and interconnected world.
The evolving landscape of biology and biotechnology, significantly influenced by advancements in computer science, engineering, and data science, is reshaping our understanding and manipulation of biological systems. The integration of these disciplines has led to the development of fields such as computational biology and synthetic biology, which utilize computational power and engineering principles to solve complex biological problems and innovate new biotechnological applications. This interdisciplinary approach has not only accelerated research and development but also introduced new capabilities such as gene editing and biomanufact
Cracking AI Black Box - Strategies for Customer-centric Enterprise ExcellenceQuentin Reul
The democratization of Generative AI is ushering in a new era of innovation for enterprises. Discover how you can harness this powerful technology to deliver unparalleled customer value and secure a formidable advantage in today's competitive market. In this session, you will learn how to:
- Identify high-impact customer needs with precision
- Harness the power of large language models to address specific customer needs effectively
- Implement AI responsibly to build trust and foster strong customer relationships
Whether you're at the early stages of your AI journey or looking to optimize existing initiatives, this session will provide you with actionable insights and strategies needed to leverage AI as a powerful catalyst for customer-driven enterprise success.
Retrieval Augmented Generation Evaluation with RagasZilliz
Retrieval Augmented Generation (RAG) enhances chatbots by incorporating custom data in the prompt. Using large language models (LLMs) as judges has gained prominence in modern RAG systems. This talk will demo Ragas, an open-source automation tool for RAG evaluations. Christy will talk about and demo evaluating a RAG pipeline using Milvus and RAG metrics like context F1-score and answer correctness.
The History of Embeddings & Multimodal EmbeddingsZilliz
Frank Liu will walk through the history of embeddings and how we got to the cool embedding models used today. He'll end with a demo on how multimodal RAG is used.
Finetuning GenAI For Hacking and DefendingPriyanka Aash
Generative AI, particularly through the lens of large language models (LLMs), represents a transformative leap in artificial intelligence. With advancements that have fundamentally altered our approach to AI, understanding and leveraging these technologies is crucial for innovators and practitioners alike. This comprehensive exploration delves into the intricacies of GenAI, from its foundational principles and historical evolution to its practical applications in security and beyond.
How UiPath Discovery Suite supports identification of Agentic Process Automat...DianaGray10
📚 Understand the basics of the new persona-based, LLM-powered Agentic Process Automation (APA) and discover how existing UiPath Discovery Suite products like Communication Mining, Process Mining, and Task Mining can be leveraged to identify APA candidates.
Topics Covered:
💡 Idea Behind APA: Explore the innovative concept of Agentic Process Automation and its significance in modern workflows.
🔄 How APA is Different from RPA: Learn the key differences between Agentic Process Automation and Robotic Process Automation.
🚀 Discover the Advantages of APA: Uncover the unique benefits of implementing APA in your organization.
🔍 Identifying APA Candidates with UiPath Discovery Products: See how UiPath's Communication Mining, Process Mining, and Task Mining tools can help pinpoint potential APA candidates.
🔮 Discussion on Expected Future Impacts: Engage in a discussion on the potential future impacts of APA on various industries and business processes.
Enhance your knowledge on the forefront of automation technology and stay ahead with Agentic Process Automation. 🧠💼✨
Speakers:
Arun Kumar Asokan, Delivery Director (US) @ qBotica and UiPath MVP
Naveen Chatlapalli, Solution Architect @ Ashling Partners and UiPath MVP
The Challenge of Interpretability in Generative AI Models.pdfSara Kroft
Navigating the intricacies of generative AI models reveals a pressing challenge: interpretability. Our blog delves into the complexities of understanding how these advanced models make decisions, shedding light on the mechanisms behind their outputs. Explore the latest research, practical implications, and ethical considerations, as we unravel the opaque processes that drive generative AI. Join us in this insightful journey to demystify the black box of artificial intelligence.
Dive into the complexities of generative AI with our blog on interpretability. Find out why making AI models understandable is key to trust and ethical use and discover current efforts to tackle this big challenge.