How Kafka and MemSQL Became the Dynamic Duo (Sarung Tripathi, MemSQL) | Hosted by Confluent
Kafka and MemSQL are the perfect combination of speed, scale, and power to take on the world’s most complex operational analytics challenges. In this session, you will learn how Kafka and MemSQL have become the dynamic duo, and how you can use them together to achieve ingest of tens of millions of records per second and enable highly concurrent, real-time analytics. In the last few months, Kafka and MemSQL have been hard at work, devising a plan to take on the world’s next set of streaming data challenges. So stay tuned: there may just be an announcement!
A stream processing platform is not an island unto itself; it must be connected to all of your existing data systems, applications, and sources. In this talk we will provide different options for integrating systems and applications with Apache Kafka, with a focus on the Kafka Connect framework and the ecosystem of Kafka connectors. We will discuss the intended use cases for Kafka Connect and share our experience and best practices for building large-scale data pipelines using Apache Kafka.
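Kafka Connect is driven entirely by configuration: a connector is registered by POSTing a small JSON document to a Connect worker's REST API. As a minimal sketch (the worker URL, file path, and topic name are placeholders, not taken from the talk), here is how the FileStreamSource connector bundled with Kafka could be registered from Python:

```python
import requests  # assumes a Kafka Connect worker is reachable at the URL below

# Hypothetical example: tail a local file into a Kafka topic.
# Connector name, file path, and topic are placeholders.
connector = {
    "name": "demo-file-source",
    "config": {
        "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
        "tasks.max": "1",
        "file": "/tmp/app.log",
        "topic": "app-logs",
    },
}

resp = requests.post("http://localhost:8083/connectors", json=connector)
resp.raise_for_status()
print(resp.json())  # Connect echoes back the created connector definition
```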
Kafka error handling patterns and best practices | Hemant Desale and Aruna Ka... | Hosted by Confluent
Transaction Banking from Goldman Sachs is a high-volume, latency-sensitive digital banking platform offering. We chose an event-driven architecture to build highly decoupled, independent microservices in a cloud-native manner, designed to meet the objectives of security, availability, latency, and scalability. Kafka was a natural choice: it decouples producers and consumers and scales easily for high-volume processing. However, certain aspects require careful consideration: handling errors and partial failures, managing consumer downtime, and securing communication between brokers and producers/consumers. In this session, we will present the patterns and best practices that helped us build robust event-driven applications. We will also present our solution approach, which has been reused across multiple application domains. We hope that by sharing our experience, we can establish a reference implementation that application developers can benefit from.
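The talk's own solution is not reproduced here, but a common shape for the error-handling patterns it mentions is a bounded retry loop with a dead-letter topic. A minimal sketch using the confluent-kafka Python client, with broker address, topic, and group names invented for illustration:

```python
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",   # placeholder cluster address
    "group.id": "payments-processor",        # hypothetical consumer group
    "enable.auto.commit": False,
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["payments"])             # hypothetical topic

MAX_RETRIES = 3

def process(msg):
    ...  # business logic; assume it raises on a partial failure

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    for attempt in range(MAX_RETRIES):
        try:
            process(msg)
            break
        except Exception:
            if attempt == MAX_RETRIES - 1:
                # retries exhausted: park the record on a dead-letter topic
                producer.produce("payments.dlq", key=msg.key(), value=msg.value())
                producer.flush()
    consumer.commit(message=msg)  # commit only after success or dead-lettering
```

Committing offsets only after a record has been processed or dead-lettered keeps partial failures from silently dropping messages.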
Paolo Castagna is a Senior Sales Engineer at Confluent. His background is in 'big data', and he has seen, first hand, the industry shift from batch to stream processing and from big data to fast data. His talk will introduce Kafka Streams and explain why Apache Kafka is a great option and simplification for stream processing.
Kafka at the core of an AIOps pipeline | Sunanda Kommula, Selector.ai and Ala... | Hosted by Confluent
Large networks consist of a diverse range of equipment, across private, public, hybrid clouds and partner networks. A hierarchical network has layers of infrastructure, catering to access, core, or distribution roles, managed by different organizations specialized to architect the right network hardware, software, and features for that network layer. The nature of data generated by each component can vary in type and form, including logs, events, metrics, or alarms.
The diversity of data generated by a large network is beyond human scale. Apache Kafka® is a critical hub in large networks, empowering AIOps to enhance decision making, improve analysis and insights by contextualizing large volumes of operational data. Kafka solved the big problem of collecting, processing, storing and normalizing data at scale, allowing us to focus on building the AIOps pipeline.
Our platform connects the dots across relevant operations data and provides operations teams with simple and powerful access to insights, from within increasingly popular collaboration environments like Slack and Microsoft Teams. The pipeline must also integrate with automation solutions.
This session will cover how large volumes of streaming messages can be received by parallel Kafka consumers, and turned into action by network operations teams, dramatically reducing downtime and improving performance.
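As a rough illustration of the parallel-consumer idea (not the speakers' pipeline), the sketch below starts several consumers that share one consumer group, so Kafka spreads the topic's partitions across them; broker, topic, and group names are placeholders:

```python
from multiprocessing import Process
from confluent_kafka import Consumer

def handle_event(worker_id: int, payload: bytes) -> None:
    # Stand-in for real work such as opening a ticket or posting to Slack.
    print(f"worker {worker_id}: {payload[:80]!r}")

def run_worker(worker_id: int) -> None:
    # All workers share one group.id, so Kafka balances the topic's
    # partitions across them automatically.
    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",  # placeholder
        "group.id": "aiops-alert-handlers",     # hypothetical group
        "auto.offset.reset": "latest",
    })
    consumer.subscribe(["network-events"])      # hypothetical topic
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        handle_event(worker_id, msg.value())

if __name__ == "__main__":
    procs = [Process(target=run_worker, args=(i,)) for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```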
Developing a custom Kafka connector? Make it shine! | Igor Buzatović, Porsche... | Hosted by Confluent
Some people see their cars just as a means to get them from point A to point B without breaking down halfway, but most of us want it also to be comfortable, performant, easy to drive, and of course - to look good.
We can think of Kafka Connect connectors in a similar way. While the main focus is on getting data from or writing data to the external target system, it is also relevant how easy a connector is to configure, whether it scales well, whether it provides the best possible data consistency, and whether it is resilient to both external system and Kafka cluster failures. This talk focuses on the aspects of connector plugin development that matter for achieving these goals. More specifically, we'll cover configuration definition and validation, external source partition and offset handling, achieving the desired delivery semantics, and more.
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming | Guozhang Wang
Spark Streaming makes it easy to build scalable, robust stream processing applications, but only once you've made your data accessible to the framework. Spark Streaming solves the real-time data processing problem, but to build a large-scale data pipeline we need to combine it with another tool that addresses data integration challenges. The Apache Kafka project recently introduced a new tool, Kafka Connect, to make data import/export to and from Kafka easier.
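The talk covers the DStream-based Spark Streaming of the time; as a hedged, modern-equivalent sketch, the Structured Streaming API can read a Kafka topic directly (broker address and topic name are placeholders, and the spark-sql-kafka package must be on the classpath):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Requires the spark-sql-kafka-0-10 package on the Spark classpath.
spark = SparkSession.builder.appName("kafka-stream-demo").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder
    .option("subscribe", "events")                         # hypothetical topic
    .load()
)

# Kafka records arrive as binary key/value; cast them and write to the console sink.
query = (
    events.select(col("key").cast("string"), col("value").cast("string"))
    .writeStream.format("console")
    .option("truncate", "false")
    .start()
)
query.awaitTermination()
```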
Real-time Data Streaming from Oracle to Apache Kafka | Confluent
Dbvisit is a New Zealand-based company with offices worldwide that provides software to replicate data from Oracle databases in real time to Apache Kafka. Their Dbvisit Replicate Connector is a plugin for Kafka Connect that allows minimal-impact replication of database table changes to Kafka topics. The connector also generates metadata topics. Dbvisit focuses only on Oracle databases and replication, has proprietary log mining technology, and supports Oracle back to version 9.2. They have over 1,300 customers globally and offer perpetual or term licensing models for their replication software along with support plans. Dbvisit is a good fit for organizations using Oracle that want to offload reporting, enable real-time analytics, and integrate data into Kafka in a cost-effective manner.
Azure Cosmos DB Kafka Connectors | Abinav Rameesh, Microsoft | Hosted by Confluent
The document discusses Kafka connectors for Cosmos DB that allow for seamless integration between the two services without requiring complex application code. It provides an overview of Kafka Connect and connectors, use cases for integrating Cosmos DB and Kafka, and the architecture of source and sink connectors that can read from and write to Cosmos DB and Kafka. It also previews a demo of the connectors and suggests ways to take integration further.
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streamin... | Hosted by Confluent
This document summarizes Activision Data's transition from a batch data pipeline to a real-time streaming data pipeline using Apache Kafka and Kafka Streams. Some key points:
- The new pipeline ingests, processes, and stores game telemetry data at over 200k messages per second, with over 5PB of data spanning 9 years of games.
- Kafka Streams is used to transform the raw streaming data through multiple microservices with low 10-second end-to-end latency, compared to 6-24 hours previously.
- Kafka Connect integrates the streaming data with data stores like AWS S3, Cassandra, and Elasticsearch.
- The new pipeline provides real-time and historical access to structured data.
Change data capture with MongoDB and Kafka | Dan Harvey
In any modern web platform you end up with a need to store different views of your data in many different datastores. I will cover how we have coped with doing this in a reliable way at State.com across a range of different languages, tools and datastores.
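This is not the State.com implementation, but one minimal way to fan database changes out to other datastores is to read a MongoDB change stream and republish each event to Kafka; database, collection, and topic names below are invented:

```python
import json

from confluent_kafka import Producer
from pymongo import MongoClient

# Change streams require MongoDB running as a replica set.
client = MongoClient("mongodb://localhost:27017")   # placeholder URI
producer = Producer({"bootstrap.servers": "localhost:9092"})
collection = client["app"]["users"]                  # hypothetical db/collection

# Forward every insert/update/delete as a JSON event onto a Kafka topic,
# keyed by document _id so changes to one document stay in order.
for change in collection.watch():
    doc_id = str(change["documentKey"]["_id"])
    producer.produce(
        "mongo.users.changes",                       # hypothetical topic
        key=doc_id,
        value=json.dumps(change, default=str),
    )
    producer.poll(0)  # serve delivery callbacks and keep the queue drained
```

Keying the Kafka record by the document _id keeps all changes to one document in a single partition, so downstream views see them in order.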
Enabling Insight to Support World-Class Supercomputing (Stefan Ceballos, Oak ... | Confluent
The Oak Ridge Leadership Facility (OLCF) in the National Center for Computational Sciences (NCCS) division at Oak Ridge National Laboratory (ORNL) houses world-class high-performance computing (HPC) resources and has a history of operating top-ranked supercomputers on the TOP500 list, including the world's current fastest, Summit, an IBM AC922 machine with a peak of 200 petaFLOPS. With the exascale era rapidly approaching, the need for a robust and scalable big data platform for operations data is more important than ever. In the past, when a new HPC resource was added to the facility, pipelines from data sources spanned multiple data sinks, which oftentimes resulted in data silos, slow operational data onboarding, and non-scalable data pipelines for batch processing. Using Apache Kafka as the message bus of the division's new big data platform has allowed for easier decoupling of scalable data pipelines, faster data onboarding, and stream processing, with the goal of continuously improving insight into the HPC resources and their supporting systems. This talk will focus on the NCCS division's transition to Apache Kafka over the past few years to enhance the OLCF's current capabilities and prepare for Frontier, OLCF's future exascale system, including the development and deployment of a full big data platform in a Kubernetes environment from both a technical and a cultural-shift perspective. This talk will also cover the mission of the OLCF, the operational data insights related to high-performance computing that the organization strives for, and several use cases that exist in production today.
This is the slide deck which was used for a talk 'Change Data Capture using Kafka' at Kafka Meetup at Linkedin (Bangalore) held on 11th June 2016.
The talk describes the need for CDC and why it's a good use case for Kafka.
Keeping Analytics Data Fresh in a Streaming Architecture | John Neal, Qlik | Hosted by Confluent
Qlik is an industry leader across its solution stack, both on the Data Integration side of things with Qlik Replicate (real-time CDC) and Qlik Compose (data warehouse and data lake automation), and on the Analytics side with Qlik Sense. These two “sides” of Qlik are coming together more frequently these days as the need for “always fresh” data increases across organizations.
When real-time streaming applications are the topic du jour, those companies are looking to Apache Kafka to provide the architectural backbone those applications require. Those same companies turn to Qlik Replicate to put the data from their enterprise database systems into motion at scale, whether that data resides in "legacy" mainframe databases; traditional relational databases such as Oracle, MySQL, or SQL Server; or applications such as SAP and Salesforce.
In this session we will look in depth at how Qlik Replicate can be used to continuously stream changes from a source database into Apache Kafka. From there, we will explore how a purpose-built consumer can be used to provide the bridge between Apache Kafka and an analytics application such as Qlik Sense.
Migrating from One Cloud Provider to Another (Without Losing Your Data or You... | Hosted by Confluent
If you’re considering -- or planning -- a cloud migration, you may be concerned about risks to your data and your mental health. Migrations at scale are fraught with risk. You absolutely can’t lose data, compromise its integrity, or suffer downtime, so you want to be slow and careful. On the other hand, you’re paying two providers for every day the migration goes on, so you need to move as fast as possible.
Unity Technologies accumulates lots of data. We recently moved our data infrastructure as part of a major cloud migration from Amazon Web Services (AWS) to Google Cloud Platform (GCP).
To minimize risk and costs our team used Apache Kafka and Confluent Platform, while engaging Confluent Platform Professional Services to help ensure a speedy and seamless migration. Kafka was already serving as the backbone to our data infrastructure, which handles over half a million events per second, and during the migration it also served as the bridge between AWS and GCP.
Join us at this session to learn about the processes and tools used, the challenges faced, and the lessons learned as we moved our operations and petabytes of data from AWS to GCP with zero downtime.
SingleStore & Kafka: Better Together to Power Modern Real-Time Data Architect... | Hosted by Confluent
To remain competitive, organizations need to democratize access to fast analytics, not only to gain real-time insights on their business but also to power smart apps that need to react in the moment. In this session, you will learn how Kafka and SingleStore enable a modern, yet simple data architecture to analyze both fast-paced incoming data and large historical datasets. In particular, you will understand why SingleStore is well suited to process data streams coming from Kafka.
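SingleStore's usual ingestion mechanism for Kafka is a Pipeline created with SQL. The sketch below issues such a statement over the MySQL wire protocol using pymysql; the table, topic, broker address, and the exact PIPELINE syntax are assumptions to be checked against the SingleStore documentation:

```python
import pymysql  # SingleStore speaks the MySQL wire protocol

conn = pymysql.connect(host="localhost", port=3306,
                       user="root", password="", database="analytics")  # placeholders
with conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS clicks (
            ts DATETIME, user_id BIGINT, url TEXT
        )
    """)
    # A Pipeline ingests directly from a Kafka topic into the table;
    # the statement shown is a sketch, not guaranteed verbatim syntax.
    cur.execute("""
        CREATE PIPELINE clicks_from_kafka AS
        LOAD DATA KAFKA 'kafka-broker:9092/clicks'
        INTO TABLE clicks
        FIELDS TERMINATED BY ','
    """)
    cur.execute("START PIPELINE clicks_from_kafka")
conn.commit()
```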
The workshop covers the HBase data model, architecture, and schema design principles.
Source code demo:
https://github.com/moisieienko-valerii/hbase-workshop
Understanding Kafka Produce and Fetch API calls for high throughput applicat... | Hosted by Confluent
The data team at Cloudflare uses Kafka to process tens of petabytes a day. All this data is moved using the two foundational Kafka API calls: Produce (API key 0) and Fetch (API key 1). Understanding the structure of these calls (and of the underlying RecordSet structure) is key to building high-throughput clients.
The talk describes the basics of the Kafka wire protocol (API keys, correlation id) and the structure of the Produce and Fetch calls. It shows how the asynchronous nature of the wire protocol can combine with the structure of the Produce and Fetch calls to increase latency and reduce client throughput; a solution is offered through the use of synchronous single-partition calls.
The RecordSet structure, which is used to encode and store sets (batches) of records, is described, and its implications for Fetch requests are discussed. The relationship between Fetch API calls and "consume" operations is discussed, as is the impact of offset alignment to RecordSet boundaries.
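The wire-protocol details themselves are not reproduced here, but the "synchronous single-partition calls" idea can be approximated at the client level by pinning each consumer instance to one partition with assign(), so every Fetch it issues covers a single partition; broker, group, and topic names are placeholders:

```python
from confluent_kafka import Consumer, TopicPartition

# Pin this consumer instance to exactly one partition; run one instance per
# partition to parallelise. This illustrates the idea only and is not the
# speakers' client implementation.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",   # placeholder
    "group.id": "single-partition-readers",  # hypothetical group
})
consumer.assign([TopicPartition("requests", 0)])  # hypothetical topic, partition 0

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    print(msg.partition(), msg.offset(), len(msg.value() or b""))
```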
PCAP Graphs for Cybersecurity and System Tuning | Dr. Mirko Kämpf
This document discusses analyzing network traffic patterns in Hadoop clusters. Packet capture data was collected from example Hadoop workloads and analyzed using Gephi. Initial results show the network structure and communication between nodes for batch processing (TeraSort) and real-time streaming (Twitter collection). Further analysis aims to classify components, understand dependencies, and identify anomalies over time to better understand typical and atypical workload behavior.
Kafka Summit SF 2017 - Query the Application, Not a Database: “Interactive Qu... | Confluent
Interactive Queries in Apache Kafka's Streams API allows users to query the local state of a Kafka Streams application without accessing an external database. It treats the Kafka Streams application as an embedded, lightweight database. The local state is fault-tolerant and can be sharded across tasks to scale horizontally. Users can discover other application instances and their state to perform queries on remote state if needed. Interactive Queries simplifies stateful stream processing by reducing moving parts compared to using an external database.
Frustration-Reduced PySpark: Data engineering with DataFrames | Ilya Ganelin
In this talk I discuss my recent experience working with Spark DataFrames in Python. For DataFrames, the focus will be on usability. Specifically, a lot of the documentation does not cover common use cases like the intricacies of creating data frames, adding or manipulating individual columns, and doing quick and dirty analytics.
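For reference, the kinds of "quick and dirty" operations the abstract alludes to look roughly like this in PySpark (the data and column names are invented):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dataframe-basics").getOrCreate()

# Creating a DataFrame from plain Python rows (schema inferred from the data).
df = spark.createDataFrame(
    [("alice", 34, 120.0), ("bob", 29, 75.5), ("carol", 41, 220.0)],
    ["name", "age", "spend"],
)

# Adding / manipulating individual columns.
df = df.withColumn("spend_per_year", F.col("spend") / F.col("age"))

# Quick and dirty analytics.
df.groupBy((F.col("age") > 30).alias("over_30")).agg(
    F.count("*").alias("n"),
    F.avg("spend").alias("avg_spend"),
).show()
```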
Ingesting Over Four Million Rows Per Second With QuestDB Timeseries Database ... | Javier Ramirez
How would you build a database to support sustained ingestion of several hundred thousand rows per second while running near real-time queries on top?
In this session I will go over some of the technical decisions and trade-offs we applied when building QuestDB, an open source time-series database developed mainly in Java, and how we can achieve over four million row writes per second on a single instance without blocking or slowing down the reads. There will be code and demos, of course.
We will also review some of the changes we have made over the past two years to deal with late and unordered data, non-blocking writes, read replicas, and faster batch ingestion.
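One simple way to push rows into QuestDB is the InfluxDB line protocol listener it exposes (port 9009 by default); the table and column names below are made up, and in practice the official QuestDB client libraries wrap this protocol:

```python
import socket
import time

# Build a few rows in InfluxDB line protocol: table,tag=value field=value timestamp_ns
# Table, symbol, and field names are invented for the example.
rows = []
for i in range(3):
    ts_ns = time.time_ns()
    rows.append(f"trades,symbol=BTC-USD price={30000 + i},size=0.1 {ts_ns}\n")

# Send them to the line protocol listener on the default port.
with socket.create_connection(("localhost", 9009)) as sock:
    sock.sendall("".join(rows).encode("utf-8"))
```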
The document provides an overview of Apache Cassandra, an open-source distributed database management system. It discusses Cassandra's peer-to-peer architecture that allows for scalability and availability. The key concepts covered include Cassandra's data model using columns, rows, column families and its distribution across nodes using consistent hashing of row keys. The document also briefly outlines Cassandra's basic read and write operations and how it handles replication and failure recovery.
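A minimal sketch with the DataStax Python driver, using an invented keyspace and table, shows the model in practice: the partition key in the PRIMARY KEY is what gets consistently hashed to place the row on nodes:

```python
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])          # placeholder contact point
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.users (
        user_id text PRIMARY KEY,   -- partition key: hashed to pick replica nodes
        name text,
        email text
    )
""")

# Basic write and read operations.
session.execute(
    "INSERT INTO demo.users (user_id, name, email) VALUES (%s, %s, %s)",
    ("u42", "Ada", "ada@example.com"),
)
row = session.execute(
    "SELECT name, email FROM demo.users WHERE user_id = %s", ("u42",)
).one()
print(row.name, row.email)
```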
Silicon Valley Code Camp: 2011 Introduction to MongoDB | Manish Pandit
This document provides an introduction and overview of MongoDB, a document-oriented NoSQL database. It discusses how MongoDB differs from relational databases, its support for schemaless documents and easy querying. Key concepts covered include collections, documents, inserting and querying data, and replication and scaling architectures like master-slave and replica sets. The document also touches on accessing MongoDB programmatically, object-document mappers, internal architecture details, administration, and comparisons to other NoSQL solutions.
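A small pymongo sketch (database, collection, and documents invented) illustrates the points above: schemaless documents, queries that reach into embedded documents and arrays, and a secondary index:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder URI
db = client["shop"]                                 # hypothetical database

# Documents are schemaless: each one can have its own shape.
db.products.insert_one({"name": "lamp", "price": 29.9, "tags": ["home", "light"]})
db.products.insert_one({"name": "desk", "price": 120, "dimensions": {"w": 80, "d": 60}})

# Querying, including into arrays and embedded documents.
print(db.products.find_one({"tags": "home"}))
print(db.products.find_one({"dimensions.w": {"$gte": 60}}))

# Secondary index to speed up range queries on price.
db.products.create_index("price")
```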
These days fast code needs to operate in harmony with its environment. At the deepest level this means working well with hardware: RAM, disks and SSDs. A unifying theme is treating memory access patterns in a uniform and predictable way that is sympathetic to the underlying hardware. For example writing to and reading from RAM and Hard Disks can be significantly sped up by operating sequentially on the device, rather than randomly accessing the data.
In this talk we’ll cover why access patterns are important, what kind of speed gain you can get and how you can write simple high level code which works well with these kind of patterns.
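A toy micro-benchmark along these lines, written in Python (which blunts the effect compared to native code but usually still shows the trend), compares walking the same data sequentially and in a random order:

```python
import random
import time

N = 10_000_000
data = list(range(N))
indices = list(range(N))
random.shuffle(indices)

def timed(label, fn):
    t0 = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - t0:.2f}s")

# The sequential walk touches memory in order; the random walk defeats
# prefetching and the cache hierarchy.
timed("sequential", lambda: sum(data[i] for i in range(N)))
timed("random    ", lambda: sum(data[i] for i in indices))
```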
Voldemort & Hadoop @ LinkedIn, Hadoop User Group Jan 2010 | Bhupesh Bansal
Jan 22nd, 2010 Hadoop meetup presentation on Project Voldemort and how it plays well with Hadoop at LinkedIn. The talk focuses on the LinkedIn Hadoop ecosystem: how LinkedIn manages complex workflows, data ETL, data storage, and online serving of 100GB to TBs of data.
The document discusses Project Voldemort, a distributed key-value storage system developed at LinkedIn. It provides an overview of Voldemort's motivation and features, including high availability, horizontal scalability, and consistency guarantees. It also describes LinkedIn's use of Voldemort and Hadoop for applications like event logging, online lookups, and batch processing of large datasets.
The Adventure: BlackRay as a Storage Engine | fschupp
- BlackRay is an in-memory relational database and search engine that supports SQL and fulltext search. It was originally developed in 2005 for a phone directory with over 80 million records.
- The presentation discusses BlackRay's architecture, data loading process, indexing performance, and support for transactions, clustering, and APIs. It also describes efforts to implement BlackRay as a MySQL storage engine.
- Going forward, the team aims to improve SQL support, add security features, and explore using BlackRay as a backend for other applications like LDAP directories.
MongoDB is a document-oriented, schema-free, scalable, high-performance, open-source database that bridges the gap between key-value stores and traditional relational databases. MongoDB uses a document-oriented data model where data is stored in documents that map to programming language data types, which reduces the need for joins. It provides high performance through an absence of joins and support for indexing of embedded documents and arrays.
The document discusses data partitioning and distribution across multiple machines in a cluster. It explains that data replication does not scale well, but data partitioning, where each record exists on only one machine, allows write latency to scale with the number of machines in the cluster. Coherence provides a distributed cache that partitions data and offers functions for server-side processing near the data through tools like entry processors.
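A stripped-down sketch of key-based partitioning (node names and keys invented) shows the core idea: hash the key, pick exactly one owner, and writes spread across the cluster:

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cluster members

def owner(key: str) -> str:
    # Each record lives on exactly one node, chosen by hashing its key,
    # so writes spread across the cluster instead of hitting every machine.
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return NODES[h % len(NODES)]

for k in ["order:1001", "order:1002", "user:7", "user:8"]:
    print(k, "->", owner(k))
```

A plain modulo scheme like this reshuffles most keys when the node list changes, which is why real systems such as Coherence use consistent hashing or fixed partition tables instead.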
1. The document discusses various technologies for building big data architectures, including NoSQL databases, distributed file systems, and data partitioning techniques.
2. Key-value stores, document databases, and graph databases are introduced as alternatives to relational databases for large, unstructured data.
3. The document also covers approaches for scaling databases horizontally, such as sharding, replication, and partitioning data across multiple servers.
ADO.NET & data persistence frameworks | Luis Goldster
The document discusses serialization, ADO.NET, data tier approaches, and persistence frameworks. Serialization allows persisting an object's state to storage and recreating it later. ADO.NET provides classes for connecting to and interacting with databases. Common data tier approaches include presenting data directly to the presentation layer, adding a business logic layer, or adding a service layer between business logic and data access. Persistence frameworks aim to simplify data access by encapsulating object persistence behaviors like reading, writing, and deleting objects from storage.
LucidDB is an open source column-oriented database management system that was developed by LucidEra as part of its business intelligence stack. It is faster than row-oriented databases for analytic queries on large datasets due to its use of columnar storage and compression techniques. Data can be extracted from transactional systems into LucidDB using SQL, and ETL tools like Pentaho Data Integration can be used to transform and load the data. Pre-aggregating data into dimensional aggregate tables in LucidDB further improves query performance. Future work will focus on incremental view maintenance and parallelism.
This document provides definitions and brief explanations of key terms related to Kognitio's analytical platform and database technologies. Some of the key terms defined include 10GbE networking, ACID compliance, Amazon Web Services, analytical platforms, analytical workloads, blade servers, cores, CPUs, data warehouses, database appliances, dimensions, disk storage, elastic block store, ETL processes, external scripting, external tables, in-memory databases, JDBC, latency, linear scalability, massively parallel processing, MDX, measures, memory, nodes, NoSQL databases, ODBC, OLAP, OLTP, parallel processing, persistence layers, private clouds, public clouds, and R language.
The document discusses stack-based buffer overflows. It explains that stack-based overflows can overwrite local variables, function pointers, or return addresses on the stack. Overwriting a return address allows an attacker to change the flow of execution to code of their choice, usually a shellcode-containing buffer. The document provides an example of a stack-based overflow overwriting a return address, which would cause execution to jump to the attacker's buffer after the function returns.
Frustration-Reduced Spark: DataFrames and the Spark Time-Series Library | Ilya Ganelin
In this talk I discuss my recent experience working with Spark DataFrames and the Spark TimeSeries library. For DataFrames, the focus will be on usability. Specifically, a lot of the documentation does not cover common use cases like the intricacies of creating data frames, adding or manipulating individual columns, and doing quick and dirty analytics. For the time series library, I dive into the kind of use cases it supports and why it's actually super useful.
MapReduce is a programming model used for processing and generating large data sets in a parallel, distributed manner. It involves three main steps: Map, Shuffle, and Reduce. In the Map step, data is processed by individual nodes. In the Shuffle step, data is redistributed based on keys. In the Reduce step, processed data with the same key is grouped and aggregated. Serialization is the process of converting data into a byte stream for storage or transmission. It allows data to be transferred between systems, and formats like JSON, XML, and binary are commonly used. Schema control is important for big data serialization to validate data structure.
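The three steps can be shown in a few lines of plain Python (the word-count data is invented); each phase only depends on its inputs, which is what makes the model easy to distribute:

```python
from collections import defaultdict

docs = ["the quick brown fox", "the lazy dog", "the quick dog"]

# Map: each input record becomes (key, value) pairs.
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle: group all values that share a key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: aggregate each key's values independently (hence parallelisable).
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # {'the': 3, 'quick': 2, ...}
```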
The document discusses erasure coding as an alternative to replication in distributed storage systems like HDFS. It notes that while replication provides high durability, it has high storage overhead, and erasure coding can provide similar durability with half the storage overhead but slower recovery. The document outlines how major companies like Facebook, Windows Azure Storage, and Google use erasure coding. It then provides details on HDFS-EC, including its architecture, use of hardware acceleration, and performance evaluation showing its benefits over replication.
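The storage argument is simple arithmetic; the small sketch below compares 3-way replication with a Reed-Solomon (6 data, 3 parity) layout of the kind used by HDFS erasure coding:

```python
def footprint(data_units: int, redundancy_units: int) -> float:
    # Total bytes stored per byte of user data.
    return (data_units + redundancy_units) / data_units

# 3-way replication stores every block three times.
print("replication x3:", footprint(1, 2))   # 3.0x
# Reed-Solomon with 6 data + 3 parity blocks gives comparable durability
# at half the footprint, at the cost of slower recovery.
print("RS(6,3)       :", footprint(6, 3))   # 1.5x
```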
2. Agenda
1. Column Advantage
2. Storage and Process
3. Hadoop Related
3. History
2001 PAX
Mike Stonebraker, Daniel Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, …: C-Store: A Column-Oriented DBMS.
D. J. Abadi et al.: Integrating Compression and Execution in Column-Oriented Database Systems. In SIGMOD, pages 671–682, 2006.
D. J. Abadi et al.: Materialization Strategies in a Column-Oriented DBMS. In ICDE, pages 466–475, 2007.
6. Columnar Store vs Row Store
● IO-1 (basic column store): Every storage block contains data from only ONE column.
● IO-2: Aggressive compression.
● IO-3: No record-ids.
● CPU-4: A column executor.
● CPU-5: Executor runs on compressed data.
● CPU-6: Executor can process columns that are key sequence or entry sequence.
7. Columnar Store advantage
● Compression: RLE, Bitmap, ...
● Predicate pushdown (PPD): reduces IO
● Late Materialization: less memory and CPU overhead
● Block Iteration (Vectorization): less CPU overhead
● Invisible Join: block as join key
8. Compression
● Encodings: Run-length Encoding, ENCODING DELTAVAL, Bit Vector Encoding, BLOCK_DICT (data skew, compound)
● Column selectivity: High selectivity (Gender, age); Mid selectivity (City, Category); Low selectivity (item_id, user_id, Price, quantity, comment)
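As an aside, run-length encoding (the first scheme listed) is easy to sketch in Python; the column values here are invented:

```python
from itertools import groupby

def rle_encode(values):
    # Run-length encoding: store (value, run length) instead of every cell.
    return [(v, len(list(run))) for v, run in groupby(values)]

def rle_decode(pairs):
    return [v for v, n in pairs for _ in range(n)]

gender = ["F", "F", "F", "M", "M", "F", "F", "F", "F"]  # made-up, mostly sorted column
encoded = rle_encode(gender)
print(encoded)                     # [('F', 3), ('M', 2), ('F', 4)]
assert rle_decode(encoded) == gender
```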
12. Late Materialization
Construct rows as late as possible
Apply filter + projection first
Project only the needed columns (also predicate pushdown)
Decode columns first, or wait until processing
Different compression schemes behave differently
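A toy sketch of the idea, with invented columns: the filter runs against a single column and only the qualifying positions are ever materialized into rows:

```python
# Columns stored separately (toy example); only rows passing the filter on
# `price` ever touch the other columns. Late materialization means carrying
# row positions, not whole rows, through the plan.
price   = [10, 250, 40, 999, 15]
item_id = [101, 102, 103, 104, 105]
comment = ["ok", "great", "meh", "wow", "fine"]

# 1. Filter on a single column, producing qualifying positions.
positions = [i for i, p in enumerate(price) if p > 100]

# 2. Materialize only the projected columns, only at those positions.
result = [(item_id[i], price[i], comment[i]) for i in positions]
print(result)   # [(102, 250, 'great'), (104, 999, 'wow')]
```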
15. Common Confusion: IO
Choosing more columns gets closer to a row store
IO <5%
Record-IDs
Row stores leave free space at the block tail
Variable-length fields
IO access pattern determines scalability
Hardware trends
Compression rate
16. Common Confusion: SerDe
Row or PAX SerDe
CPU cache misses
No columnar compression
Block iteration (construct tuple or row)
Java vs C/C++
C/C++: direct memory mapping
Java: fastutil
17. Index and MV
Reduce IO / Scalability
Avoid Sort / Storage cost
Index join / Complex design
Lookup / Hard to maintain
Pre-computation / High latency
Join / Slow down loading
Group by / Lost details
Query Rewrite
19. Hadoop Related
File Format
Trevni vs IBM CIF
Schema Evolution
Portable File Format
Bigger Block Size
IO Pattern
SerDe network influence
20. Hadoop Related
Storage Cost
NameNode
Fewer blocks
Bigger block size
Cold data even bigger
No Intermediate Level
JobTracker
Each job has fewer map and reduce tasks
DataNode
21. Hadoop Related
Real Data ingestion
HBase + Flume
Balanced data
Write the Avro file format first, then sort-merge
Reduce SerDe memory
Tuple structure, not rows
Batch Update+Delete+Insert
22. Hadoop Related
MR Performance Boost
Block Shuffle (3 times faster)
Skewed data has less overhead
Fewer maps and bigger spills
Reduce-side combine
Light compression codec (Snappy, not LZO)
Combiner or in-memory combiner deprecated
23. Hadoop Related
Easier Performance Tuning
mapred.min.split.size (deprecated)
mapred.child.java.opts
mapred.compress.map.output (deprecated)
io.sort.mb
io.sort.spill.percent (deprecated)
io.sort.factor
mapred.reduce.parallel.copies (deprecated)
Map and reduce counts are easier to estimate
Reduce algorithm will change
24. Hadoop Related
Easy Management
Fewer partitions or dynamic partitioning
Integrity constraints and referential integrity
Statistics make for a simpler query engine
Automatic merging of cold data
Trojan Layout vs Columnar Projections
Less design complexity
Map join vs Fat Table
Group by + Index