Flurry processes terabytes of mobile data in real time, having switched from a MapReduce framework to a Kafka-based pipeline. Kafka allows continuous, asynchronous processing without job startup delays. Flurry runs Kafka clusters whose topics are read in parallel by Data Log Consumers that process the streaming data and compute analytics metrics in real time, and it monitors Kafka and the consumers for failures and errors to ensure reliable processing.
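Flurry's consumer code is not included in the deck; as a rough illustration of the read-topics-in-parallel pattern, here is a minimal Kafka consumer loop. It uses the modern Java consumer API (the deck itself predates it), and the broker address, group id, topic name, and counted metric are all assumptions:

```scala
import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.jdk.CollectionConverters._

object DataLogConsumerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")   // illustrative broker address
    props.put("group.id", "metrics-consumers")        // consumers in one group share partitions
    props.put("key.deserializer",
      "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer",
      "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("device-logs")) // hypothetical topic name

    var eventCount = 0L
    while (true) {
      val records = consumer.poll(Duration.ofMillis(500))
      for (record <- records.asScala) {
        eventCount += 1 // stand-in for the real metric computation
        if (eventCount % 100000 == 0)
          println(s"processed $eventCount events, last offset ${record.offset()}")
      }
    }
  }
}
```

Running several instances with the same group id spreads the topic's partitions across them, which is the parallelism model described above.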
Spark Summit EU talk by Ruben Pulido and Behar Veliqi · Spark Summit
The document discusses IBM's transition from a single-tenant Hadoop architecture to a multi-tenant Apache Spark architecture for their Watson Analytics for Social Media product. The new architecture aggregates social media data from thousands of tenants into a single stream and uses Spark, Kafka and Zookeeper to provide robust real-time analytics with low latency switching between tenants. Key aspects of the new architecture include separating analytics into tenant-specific and language-specific components, and removing state from processing components.
Data Pipeline with Kafka. These slides include:
Kafka introduction, topics / partitions, producers / consumers, quick start, offset monitoring, example code, and Camus
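The deck's own example code is not reproduced in this summary; as a stand-in for the producer/consumer quick-start portion, here is a minimal producer sketch (topic name, key scheme, and broker address are illustrative):

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object ProducerQuickStart {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    // Keyed messages go to the partition chosen by hashing the key,
    // so all events for one user land on the same partition.
    (1 to 10).foreach { i =>
      producer.send(new ProducerRecord("events", s"user-$i", s"""{"click":$i}"""))
    }
    producer.flush()
    producer.close()
  }
}
```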
Building Value Within the Heavy Vehicle Industry Using Big Data and Streaming... · DataWorks Summit
More than 300,000 globally connected Scania vehicles (trucks and buses) continuously submit their GPS positions. In this presentation we will show how Scania analyses these positions to obtain valuable information about the operation of the vehicles. The algorithms have been developed in a research project, FUMA, that is run as a joint venture between Fraunhofer Chalmers Centre and Scania.
In the project we build a continuous delivery pipeline that enables us to iteratively improve our code, our algorithms and the data deliverables. The pipeline runs on a Hortonworks platform using Apache Spark Streaming. In the build pipeline we use Jenkins, Nexus and Ansible to test, deploy and run Apache Spark Streaming jobs and the results are pushed to Apache Kafka. We will highlight and present some of the steps we have taken in order to put a streaming big data application in production at a manufacturing company. We think that people with a general awareness of the challenges with big data, the possibilities of the streaming paradigm and the need for continuous delivery will find this talk very intriguing. In this presentation you will learn how we develop and run the code and how we ensure that we are creating value for Scania.
Unified Batch & Stream Processing with Apache Samza · DataWorks Summit
The traditional lambda architecture has been a popular solution for joining offline batch operations with real time operations. This setup incurs a lot of developer and operational overhead since it involves maintaining code that produces the same result in two, potentially different distributed systems. In order to alleviate these problems, we need a unified framework for processing and building data pipelines across batch and stream data sources.
Based on our experiences running and developing Apache Samza at LinkedIn, we have enhanced the framework to support: a) Pluggable data sources and sinks; b) A deployment model supporting different execution environments such as Yarn or VMs; c) A unified processing API for developers to work seamlessly with batch and stream data. In this talk, we will cover how these design choices in Apache Samza help tackle the overhead of lambda architecture. We will use some real production use-cases to elaborate how LinkedIn leverages Apache Samza to build unified data processing pipelines.
Speaker
Navina Ramesh, Sr. Software Engineer, LinkedIn
Espresso Database Replication with Kafka, Tom Quiggle · Confluent
This document discusses using Apache Kafka for database replication in LinkedIn's ESPRESSO database system. It provides an overview of ESPRESSO's architecture and transition from per-instance to per-partition replication using Kafka. Key aspects covered include Kafka configuration, the message protocol for ensuring in-order delivery, and checkpointing by the Kafka producer to allow resuming replication from the last committed transaction after failures.
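The exact producer and checkpoint protocol is Espresso-specific, but the core idea described above can be sketched as follows (class and parameter names are hypothetical; the durable checkpoint store is left abstract):

```scala
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// A minimal sketch of the idea: only advance a durable checkpoint after Kafka
// has acknowledged the write, so replication can resume from the last
// committed transaction after a failure. (Not Espresso's actual implementation.)
class CheckpointingReplicator(producer: KafkaProducer[String, String],
                              saveCheckpoint: Long => Unit) {

  def replicate(topic: String, partitionKey: String, txnId: Long, payload: String): Unit = {
    val record = new ProducerRecord(topic, partitionKey, payload)
    // Block on the ack; with max.in.flight.requests.per.connection=1 and
    // retries enabled this also preserves per-partition ordering.
    producer.send(record).get()
    saveCheckpoint(txnId) // advanced only after the broker acknowledged the send
  }
}
```

Blocking on each send trades throughput for the in-order, resumable delivery the summary describes; a real implementation would batch and pipeline more aggressively.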
Kafka is a high-throughput distributed messaging system with publish and subscribe capabilities. It provides persistence with replication to disk for fault tolerance. Kafka is simple to implement and runs efficiently on large clusters with low latency and high throughput. It was created at LinkedIn to process streaming data from the LinkedIn website and has since been open sourced.
This document summarizes the work done by Yahoo engineers to optimize performance of queries on a mobile analytics data mart hosted on Apache Hive. They implemented several techniques like using Tez, vectorized query execution, map-side aggregations, and ORC file format, which provided significant performance boosts. For high cardinality partitioned tables, they leveraged sketching which reduced query times from over 100 seconds to under 25 seconds. They also implemented a data mart in a box solution for easier setup of custom data marts and funnels analysis using UDFs.
Spark Summit EU talk by Kaarthik Sivashanmugam · Spark Summit
This document discusses Spark Streaming techniques used at Bing scale. It addresses challenges like processing billions of events per hour from multiple data centers in near real-time while handling issues like out of order events, delays, and state management. Techniques used include dynamically repartitioning Kafka partitions, running Kafka fetch jobs on time in separate threads to avoid delays, caching Kafka RDDs in parallel threads for querying, and using UpdateStateByKey to join streams while enforcing application time windows.
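As a rough illustration of the UpdateStateByKey technique mentioned above (not Bing's actual jobs; the socket source and per-key count are placeholders for their Kafka streams and state):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Keep a running count per key across micro-batches with updateStateByKey.
object StatefulCounts {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StatefulCounts").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("/tmp/state-checkpoints") // required for stateful operators

    val events = ssc.socketTextStream("localhost", 9999).map(key => (key, 1))

    val updateFn: (Seq[Int], Option[Int]) => Option[Int] =
      (newValues, state) => Some(newValues.sum + state.getOrElse(0))

    events.updateStateByKey(updateFn).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```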
HBaseConAsia2018 Track2-4: HTAP DB-System: ApsaraDB HBase, Phoenix, and Spark · Michael Stack
This document discusses using Phoenix and Spark with ApsaraDB HBase. It covers the architecture of Phoenix as a service over HBase, use cases like log and internet company scenarios, best practices for table properties and queries, challenges around availability and stability, and improvements being made. It also discusses how Spark can be used for analysis, bulk loading, real-time ETL, and to provide elastic compute resources. Example architectures show Spark SQL analyzing HBase and structured streaming incrementally loading data. Scenarios discussed include online reporting, complex analysis, log indexing and querying, and time series monitoring.
The document discusses enhancements made to Sqoop to improve importing data from relational databases to Hive. Key enhancements include a new Hive Merge tool for synchronizing incremental data updates, support for dynamic partitioning and external tables in Hive, and encrypting passwords in the Sqoop metastore. The presentation includes demos and discusses Apache Jiras where Expedia contributed patches related to these Sqoop enhancements.
Spark Summit EU talk by Debasish Das and Pramod Narasimha · Spark Summit
This document describes a system called DeviceAnalyzer that builds predictive models in near-real time using Apache Spark and Apache Lucene. It discusses:
1) Integrating Spark and Lucene to enable column search capabilities in Spark and add Spark operations to Lucene.
2) Representing Spark DataFrames as Lucene documents to build a distributed Lucene index from DataFrames.
3) Using the index for tasks like searching devices matching a query, generating statistical and predictive models on retrieved devices, and finding dimensions correlated with selected devices.
4) Architectural components like Trapezium for batch, streaming, and API services and a LuceneDAO for indexing DataFrames and querying the index.
Real time data viz with Spark Streaming, Kafka and D3.js · Ben Laird
This document discusses building a dynamic visualization of large streaming transaction data. It proposes using Apache Kafka to handle the transaction stream, Apache Spark Streaming to process and aggregate the data, MongoDB for intermediate storage, a Node.js server, and Socket.io for real-time updates. Visualization would use Crossfilter, DC.js and D3.js to enable interactive exploration of billions of records in the browser.
This document discusses data ingestion into Hadoop. It describes how data can be ingested in real-time or in batches. Common tools for ingesting data into Hadoop include Apache Flume, Apache NiFi, and Apache Sqoop. Flume is designed for streaming data ingestion and uses a source-channel-sink architecture to reliably move data into Hadoop. NiFi focuses on real-time data collection and processing capabilities. Sqoop can import and export structured data between Hadoop and relational databases.
Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters... · Databricks
At the end of the day, the only thing data scientists want is tabular data for their analysis. They do not want to spend hours or days preparing data. How does a data engineer handle the massive amount of data being streamed at them from IoT devices and apps, and at the same time add structure to it so that data scientists can focus on finding insights rather than preparing data? By the way, you need to do this within minutes (sometimes seconds). Oh… and there are a lot of other data sources that you need to ingest, and the current providers of data keep changing their structure.
GoPro has massive amounts of heterogeneous data being streamed from their consumer devices and applications, and they have developed the concept of “dynamic DDL” to structure their streamed data on the fly using Spark Streaming, Kafka, HBase, Hive and S3. The idea is simple: Add structure (schema) to the data as soon as possible; allow the providers of the data to dictate the structure; and automatically create event-based and state-based tables (DDL) for all data sources to allow data scientists to access the data via their lingua franca, SQL, within minutes.
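GoPro's implementation is not shown in the summary; here is a minimal sketch of the underlying idea, with Spark inferring the schema from the incoming JSON and that inferred schema driving the table definition (paths, table and column names are assumptions):

```scala
import org.apache.spark.sql.SparkSession

// A minimal sketch of the "dynamic DDL" idea (not GoPro's actual pipeline):
// the data provider dictates the structure, Spark infers it from the JSON,
// and the inferred schema becomes the table definition so the data is
// queryable in SQL almost immediately. Real schema evolution (new columns
// arriving later) needs extra handling.
object DynamicDdlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DynamicDdlSketch")
      .enableHiveSupport()
      .getOrCreate()

    // Schema is inferred from the JSON events themselves.
    val batch = spark.read.json("s3a://example-bucket/incoming/events/batch-0001/")
    batch.printSchema()

    // The first batch creates the table (the DDL); later batches append to it.
    batch.write
      .partitionBy("event_date") // hypothetical partition column
      .mode("append")
      .saveAsTable("events_raw")
  }
}
```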
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka · DataWorks Summit
At NMC (Nielsen Marketing Cloud) we provide our customers (marketers and publishers) real-time analytics tools to profile their target audiences.
To achieve that, we need to ingest billions of events per day into our big data stores, and we need to do it in a scalable yet cost-efficient manner.
In this session, we will discuss how we continuously transform our data infrastructure to support these goals.
Specifically, we will review how we went from CSV files and standalone Java applications all the way to multiple Kafka and Spark clusters, performing a mixture of Streaming and Batch ETLs, and supporting 10x data growth.
We will share our experience as early-adopters of Spark Streaming and Spark Structured Streaming, and how we overcame technical barriers (and there were plenty...).
We will present a rather unique solution of using Kafka to imitate streaming over our Data Lake, while significantly reducing our cloud services' costs.
Topics include:
* Kafka and Spark Streaming for stateless and stateful use-cases
* Spark Structured Streaming as a possible alternative
* Combining Spark Streaming with batch ETLs
* "Streaming" over Data Lake using Kafka
This document discusses Apache Kafka and Confluent's Kafka Connect tool for large-scale streaming data integration. Kafka Connect allows importing and exporting data from Kafka to other systems like HDFS, databases, search indexes, and more using reusable connectors. Connectors use converters to handle serialization between data formats. The document outlines some existing connectors and upcoming improvements to Kafka Connect.
The document discusses using Apache Kafka for event detection pipelines. It describes how Kafka can be used to decouple data pipelines and ingest events from various source systems in real-time. It then provides an example use case of using Kafka, Hadoop, and machine learning for fraud detection in consumer banking, describing the online and offline workflows. Finally, it covers some of the challenges of building such a system and considerations for deploying Kafka.
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase · Michael Stack
This document provides an introduction to JanusGraph, an open source distributed graph database that can be used with Apache HBase for storage. It begins with background on graph databases and their structures, such as vertices, edges, properties, and different storage models. It then discusses JanusGraph's architecture, support for the TinkerPop graph computing framework, and schema and data modeling capabilities. Details are given on partitioning graphs across servers and using different indexing approaches. The document concludes by explaining why HBase is a good storage backend for JanusGraph and providing examples of how the data model would be structured within HBase.
This document discusses Spotify's migration of data pipelines to Docker. It provides background on Spotify growing from 50 to 1000 engineers and the challenges of scaling their big data infrastructure. Spotify adopted Docker to help solve packaging and dependency issues, moving pipelines from cron jobs to a REST API and Docker images. Docker is allowing Spotify to transparently migrate their on-premise Hadoop cluster to Google Cloud, handling over 100 petabytes of data and growing.
Hadoop 2 introduces the YARN framework to provide a common platform for multiple data processing paradigms beyond just MapReduce. YARN splits cluster resource management from application execution, allowing different applications like MapReduce, Spark, Storm and others to run on the same Hadoop cluster. HDFS 2 improves HDFS with features like high availability, federation and snapshots. Apache Tez provides a new data processing engine that enables pipelining of jobs to improve performance over traditional MapReduce.
Have a lot of data? Using or considering using Apache HBase (part of the Hadoop family) to store your data? Want to have your cake and eat it too? Phoenix is an open source project put out by Salesforce. Join us to learn how you can continue to use SQL, but get the raw speed of native HBase usage through Phoenix.
Here's the second version of our big data landscape. Thoughts, questions, comments? We'd love to hear your feedback in the comments section here: http://wp.me/p2dLS7-6A
Apache Kafka 0.8 basic training - Verisign · Michael Noll
Apache Kafka 0.8 basic training (120 slides) covering:
1. Introducing Kafka: history, Kafka at LinkedIn, Kafka adoption in the industry, why Kafka
2. Kafka core concepts: topics, partitions, replicas, producers, consumers, brokers
3. Operating Kafka: architecture, hardware specs, deploying, monitoring, P&S tuning
4. Developing Kafka apps: writing to Kafka, reading from Kafka, testing, serialization, compression, example apps
5. Playing with Kafka using Wirbelsturm
Audience: developers, operations, architects
Created by Michael G. Noll, Data Architect, Verisign, https://www.verisigninc.com/
Verisign is a global leader in domain names and internet security.
Tools mentioned:
- Wirbelsturm (https://github.com/miguno/wirbelsturm)
- kafka-storm-starter (https://github.com/miguno/kafka-storm-starter)
Blog post at:
http://www.michael-noll.com/blog/2014/08/18/apache-kafka-training-deck-and-tutorial/
Many thanks to the LinkedIn Engineering team (the creators of Kafka) and the Apache Kafka open source community!
Best Strategy for Developing App Architecture and High Quality App · Flurry, Inc.
Yahoo has been developing several successful mobile apps in Taiwan. We're going to share our best strategy for developing mobile apps: learn how to use YDevelopKit to save development resources while using DevOps to maintain high quality.
Deep-Dive: Building Native iOS and Android Application with the AWS Mobile SDK · Amazon Web Services
This document provides an overview of building native mobile applications with AWS services using the AWS Mobile SDK. It discusses the benefits of native apps over web apps, and how to integrate the AWS Mobile SDK into iOS and Android applications. It also describes several AWS services that are commonly used for mobile backends, such as Cognito, S3, DynamoDB, API Gateway, Lambda, and Mobile Analytics. Finally, it discusses options for building hybrid mobile apps with Cordova and React Native that can leverage AWS services.
Microsoft and Hortonworks Delivers the Modern Data Architecture for Big Data · Hortonworks
Joint webinar with Microsoft and Hortonworks on the power of combining the Hortonworks Data Platform with Microsoft's ubiquitous Windows, Office, SQL Server, Parallel Data Warehouse, and Azure platform to build the Modern Data Architecture for Big Data.
Hortonworks Data In Motion Series Part 4 · Hortonworks
How real-world enterprises leverage Hortonworks DataFlow/Apache NiFi to create real-time data flows in record time, enabling new business opportunities, improving customer retention, and accelerating big data projects from months to minutes through increased efficiency and reduced costs.
On-Demand webinar: http://hortonworks.com/webinar/paradigm-shift-business-usual-real-time-dataflows-record-time/
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat... · Hortonworks
How do you turn data from many different sources into actionable insights and manufacture those insights into innovative information-based products and services?
Industry leaders are accomplishing this by adding Hadoop as a critical component in their modern data architecture to build a data lake. A data lake collects and stores data across a wide variety of channels including social media, clickstream data, server logs, customer transactions and interactions, videos, and sensor data from equipment in the field. A data lake cost-effectively scales to collect and retain massive amounts of data over time, and convert all this data into actionable information that can transform your business.
Join Hortonworks and Informatica as we discuss:
- What is a data lake?
- The modern data architecture for a data lake
- How Hadoop fits into the modern data architecture
- Innovative use-cases for a data lake
This document provides an overview of the Confluent streaming platform and Apache Kafka. It discusses how streaming platforms can be used to publish, subscribe and process streams of data in real-time. It also highlights challenges with traditional architectures and how the Confluent platform addresses them by allowing data to be ingested from many sources and processed using stream processing APIs. The document also summarizes key components of the Confluent platform like Kafka Connect for streaming data between systems, the Schema Registry for ensuring compatibility, and Control Center for monitoring the platform.
Santander Stream Processing with Apache Flink · Confluent
Flink is becoming the de facto standard for stream processing due to its scalability, performance, fault tolerance, and language flexibility. It supports stream processing, batch processing, and analytics through one unified system. Developers choose Flink for its robust feature set and ability to handle stream processing workloads at large scales efficiently.
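As a small, self-contained illustration of the kind of stream processing described (not Santander's code; the socket source and one-minute tumbling window are arbitrary choices):

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

// Count events per key over 1-minute tumbling windows.
object WindowedCounts {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    env.socketTextStream("localhost", 9999)
      .map(line => (line.trim, 1))
      .keyBy(_._1)
      .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
      .sum(1)
      .print()

    env.execute("WindowedCounts")
  }
}
```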
Streaming Data Ingest and Processing with Apache Kafka · Attunity
Apache™ Kafka is a fast, scalable, durable, and fault-tolerant publish-subscribe messaging system that offers high throughput, reliability, and replication. To manage growing data volumes, many companies are leveraging Kafka for streaming data ingest and processing.
Join experts from Confluent, the creators of Apache™ Kafka, and the experts at Attunity, a leader in data integration software, for a live webinar where you will learn how to:
-Realize the value of streaming data ingest with Kafka
-Turn databases into live feeds for streaming ingest and processing
-Accelerate data delivery to enable real-time analytics
-Reduce skill and training requirements for data ingest
The recorded webinar on slide 32 includes a demo using automation software (Attunity Replicate) to stream live changes from a database into Kafka and also includes a Q&A with our experts.
For more information, please go to www.attunity.com/kafka.
Hortonworks Get Started Building YARN Applications Dec. 2013. We cover YARN basics, benefits, getting started and roadmap. Actian shares their experience and recommendations on building their real-world YARN application.
Building data pipelines is pretty hard! Building a multi-datacenter active-active real time data pipeline for multiple classes of data with different durability, latency and availability guarantees is much harder. Real time infrastructure powers critical pieces of Uber (think Surge) and in this talk we will discuss our architecture, technical challenges, learnings and how a blend of open source infrastructure (Apache Kafka and Flink) and in-house technologies have helped Uber scale.
This document provides an overview of Apache Flink and discusses why it is suitable for real-world streaming analytics. The document contains an agenda that covers how Flink is a multi-purpose big data analytics framework, why streaming analytics are emerging, why Flink is suitable for real-world streaming analytics, novel use cases enabled by Flink, who is using Flink, and where to go from here. Key points include Flink innovations like custom memory management, its DataSet API, rich windowing semantics, and native iterative processing. Flink's streaming features that make it suitable for real-world use include its pipelined processing engine, stream abstraction, performance, windowing support, fault tolerance, and integration with Hadoop.
Overview of Apache Flink: The 4G of Big Data Analytics Frameworks · Slim Baltagi
Slides of my talk at the Hadoop Summit Europe in Dublin, Ireland on April 13th, 2016. The talk introduces Apache Flink as both a multi-purpose Big Data analytics framework and real-world streaming analytics framework. It is focusing on Flink's key differentiators and suitability for streaming analytics use cases. It also shows how Flink enables novel use cases such as distributed CEP (Complex Event Processing) and querying the state by behaving like a key value data store.
Beyond the Brokers: A Tour of the Kafka Ecosystem (Au delà des brokers, un tour de l'environnement Kafka) | Florent Ramière · Confluent
During the Confluent Streaming event in Paris, Florent Ramière, Technical Account Manager at Confluent, goes beyond brokers, introducing a whole new ecosystem with Kafka Streams, KSQL, Kafka Connect, Rest proxy, Schema Registry, MirrorMaker, etc.
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K... · Timothy Spann
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and Kafka
Apache NiFi, Apache Flink, Apache Kafka
https://budapestdata.hu/2023/en/speakers/timothy-spann/
Timothy Spann
Principal Developer Advocate
Cloudera (US)
LinkedIn · GitHub · datainmotion.dev
June 8 · Online · English talk
Building Modern Data Streaming Apps with NiFi, Flink and Kafka
In my session, I will show you some best practices I have discovered over the last 7 years in building data streaming applications including IoT, CDC, Logs, and more.
In my modern approach, we utilize several open-source frameworks to maximize the best features of all. We often start with Apache NiFi as the orchestrator of streams flowing into Apache Kafka. From there we build streaming ETL with Apache Flink SQL. We will stream data into Apache Iceberg.
We use the best streaming tools for the current applications with FLaNK. flankstack.dev
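As a sketch of the Kafka-to-Flink-SQL streaming ETL step described above (topic names, fields, broker address, and the filter predicate are all illustrative; an Iceberg sink table would be declared the same way with the Iceberg connector options):

```scala
import org.apache.flink.table.api.{EnvironmentSettings, TableEnvironment}

// Read JSON events from one Kafka topic with Flink SQL and write a filtered
// stream to another topic.
object FlinkSqlEtlSketch {
  def main(args: Array[String]): Unit = {
    val tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode())

    tEnv.executeSql(
      """CREATE TABLE sensor_events (
        |  device_id STRING,
        |  temperature DOUBLE,
        |  ts TIMESTAMP(3)
        |) WITH (
        |  'connector' = 'kafka',
        |  'topic' = 'sensor-events',
        |  'properties.bootstrap.servers' = 'broker1:9092',
        |  'properties.group.id' = 'flink-etl',
        |  'scan.startup.mode' = 'latest-offset',
        |  'format' = 'json'
        |)""".stripMargin)

    tEnv.executeSql(
      """CREATE TABLE hot_devices (
        |  device_id STRING,
        |  temperature DOUBLE,
        |  ts TIMESTAMP(3)
        |) WITH (
        |  'connector' = 'kafka',
        |  'topic' = 'hot-devices',
        |  'properties.bootstrap.servers' = 'broker1:9092',
        |  'format' = 'json'
        |)""".stripMargin)

    tEnv.executeSql(
      "INSERT INTO hot_devices SELECT device_id, temperature, ts FROM sensor_events WHERE temperature > 90")
  }
}
```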
BIO
Tim Spann is a Principal Developer Advocate in Data In Motion for Cloudera. He works with Apache NiFi, Apache Pulsar, Apache Kafka, Apache Flink, Flink SQL, Apache Pinot, Trino, Apache Iceberg, DeltaLake, Apache Spark, Big Data, IoT, Cloud, AI/DL, machine learning, and deep learning. Tim has over ten years of experience with the IoT, big data, distributed computing, messaging, streaming technologies, and Java programming.
Previously, he was a Developer Advocate at StreamNative, Principal DataFlow Field Engineer at Cloudera, a Senior Solutions Engineer at Hortonworks, a Senior Solutions Architect at AirisData, a Senior Field Engineer at Pivotal and a Team Leader at HPE. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton & NYC on Big Data, Cloud, IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as ApacheCon, DeveloperWeek, Pulsar Summit and many more. He holds a BS and MS in computer science.
Streaming Data and Stream Processing with Apache Kafka · Confluent
Apache Kafka is an open-source streaming platform that can be used to build real-time data pipelines and streaming applications. It addresses challenges with diverse data sets arriving at increasing rates. The document discusses how Apache Kafka can help with challenges around data integration, stream processing, and managing streaming platforms at scale. It also outlines key features of Apache Kafka like the Kafka Connect API for data integration, the Kafka Streams API for stream processing, and Confluent Control Center for monitoring and management.
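A minimal Kafka Streams sketch of the stream-processing side (application id, topic names, and the filter are assumptions, not Confluent's example):

```scala
import java.util.Properties
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}

// Filter one stream of page-view events into another topic; the processing
// runs inside the application itself rather than a separate cluster.
object PageViewFilter {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "pageview-filter")
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092")
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass)
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass)

    val builder = new StreamsBuilder()
    builder.stream[String, String]("page-views")
      .filter((_, value) => value.contains("\"country\":\"US\""))
      .to("page-views-us")

    val streams = new KafkaStreams(builder.build(), props)
    streams.start()
    sys.addShutdownHook(streams.close())
  }
}
```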
Hortonworks provides an open source Apache Hadoop distribution called Hortonworks Data Platform (HDP). Their mission is to enable modern data architectures through delivering enterprise Apache Hadoop. They have over 300 employees and are headquartered in Palo Alto, CA. Hortonworks focuses on driving innovation through the open source Apache community process, integrating Hadoop with existing technologies, and engineering Hadoop for enterprise reliability and support.
This document compares Apache Flume and Apache Kafka for use in data pipelines. It describes Conversant's evolution from a homegrown log collection system to using Flume and then integrating Kafka. Key points covered include how Flume and Kafka work, their capabilities for reliability, scalability, and ecosystems. The document also discusses customizing Flume for Conversant's needs, and how Conversant monitors and collects metrics from Flume and Kafka using tools like JMX, Grafana dashboards, and OpenTSDB.
Kafka for Real-Time Replication between Edge and Hybrid Cloud · Kai Wähner
Not all workloads allow cloud computing. Low latency, cybersecurity, and cost-efficiency require a suitable combination of edge computing and cloud integration.
This session explores architectures and design patterns for software and hardware considerations to deploy hybrid data streaming with Apache Kafka anywhere. A live demo shows data synchronization from the edge to the public cloud across continents with Kafka on Hivecell and Confluent Cloud.
This document describes Hopsworks, an end-to-end data platform for analytics and machine learning built by KTH and RISE SICS. It provides data ingestion, preparation, experimentation, model training, and deployment capabilities. The platform is built on Apache technologies like Apache Beam, Spark, Flink, Kafka, and uses Kubernetes for orchestration. It also includes a feature store for ML features. The document then discusses Apache Flink and its use for stream processing applications. It provides examples of using Flink's APIs like SQL, CEP, and machine learning. Finally, it introduces the concept of continuous deep analytics and the Arcon framework for unified analytics across streams, tensors, graphs and more through an intermediate representation.
Things fail. It’s a fact of life. But that doesn’t mean that your applications and services need to fail. In this talk, David Prinzing described a solution architecture that has been proven to deliver amazing performance at scale with continuous availability on Amazon Web Services. You can’t just move your application to the cloud and expect this – you need to design for it. Technology selections include Amazon Web Services, Ubuntu Linux, Apache Cassandra for the database, Dropwizard for providing RESTful web services, and AngularJS as the foundation for an HTML5 web application. Event: http://www.meetup.com/AWS-EASTBAY/events/225570266
PortoTechHub - Hail Hydrate! From Stream to Lake with Apache Pulsar and Friends · Timothy Spann
This document provides an overview and summary of Apache Pulsar, a distributed streaming and messaging platform. It discusses Pulsar's benefits like data durability, scalability, geo-replication and multi-tenancy. It outlines key use cases like message queuing and data streaming. The document also summarizes Pulsar's architecture, subscriptions modes, connectors, and integration with other technologies like Apache Flink, Apache NiFi and MQTT. It highlights real-world customer implementations and provides demos of ingesting IoT data via Pulsar.
Similar to Flurry Analytic Backend - Processing Terabytes of Data in Real-time (20)
Building Your Customer Data Platform with LEO CDP in Travel Industry.pdf · Trieu Nguyen
1. The document outlines the Chief Platform Engineer's background and introduces LEO CDP, a customer data platform for the travel industry.
2. It discusses 5 challenges companies face related to customer growth, journeys, data platforms, communication and understanding customers with big data.
3. A case study shows how LEO CDP can be used to create a customer journey map for a travel agency, including personalized promotions and offers sent via email.
How to track and improve Customer Experience with LEO CDP · Trieu Nguyen
This document discusses how to track and improve customer experience using LEO CDP. It begins by explaining why measuring customer experience is important, then introduces four key metrics: Customer Feedback Score, Customer Effort Score, Customer Satisfaction Score, and Net Promoter Score. It describes using journey maps to manage customer experience data and visualize the customer journey. Finally, it presents LEO CDP as a software solution for collecting customer experience data, building surveys, and generating reports to gain insights to improve products, services, and the overall customer experience.
[Notes] Customer 360 Analytics with LEO CDP · Trieu Nguyen
Part 1: Why should every business need to deploy a CDP ?
1. Big data is the reality of business today
2. What are technologies to manage customer data ?
3. The rise of first-party data and new technologies for Digital Marketing
4. How to apply USPA mindset to build your CDP for data-driven business
Part 2: How to use LEO CDP for your business
1. Core functions of LEO CDP for marketers and IT managers
2. Data Unification for Customer 360 Analytics
3. Data Segmentation
4. Customer Personalization
5. Customer Data Activation
Part 3: Case study in O2O Retail and Ecommerce
1. How to build customer journey map for ecommerce and retail
2. How to do customer analytics to find ideal customer profiles
The ideal customer profile in a B2B context
The ideal customer profile in a B2C context
3. Manage product catalog for customer personalization
4. Monitoring Data of Customer Experience (CX Analytics)
CX Data Flow
CX Rating plugin is embedded in the website, to collect feedback data
An overview of CX Report
A CX Report in a customer profile
5. Monitoring data with real-time event tracking reports
Event Data Flow
Summary Event Data Report
Event Data Report in a Customer Profile
Part 4: How to setup an instance of LEO CDP for free
1. Technical architecture
2. Server infrastructure
3. Setup middlewares: Nginx, ArangoDB, Redis, Java and Python
Network requirements
Software requirements for new server
ArangoDB
Nginx Proxy
SSL for Nginx Server
Java 8 JVM
Redis
Install Notes for Linux Server
Clone binary code for new server
Set DNS hosts for LEO CDP workers
4. Setup data for testing and system verification
Part 5: Summary all key ideas
Why should you invest in LEO CDP ?
Purpose: Big data and AI democracy for SMEs companies
Problem: Customer Analytics and Customer Personalization
Solutions: CDP + CX + Personalization Engine
Product demo: LEO CDP for Ecommerce and Fintech
Business model: Freemium → Ecosystem → Subscription
Market size: 20 billion USD in 2026 and CAGR 34.6%
Differentiation: cloud-native software
Go-to-market approach: Community → Free → Paid
Team: 1 full-stack dev, 1 data scientist and 12,000 fans of BigDataVietnam.org Community
Need 150,000 USD for scaling business (you get 20% share)
The document outlines new features and updates for 2022 from USPA Technology Company, including a new dedicated dashboard for CMOs, updated UI for Customer 360 Insights, and a focus on data-driven business processes and digital marketing in B2B through standardizing data-driven processes and focusing on customer insights.
LEO CDP deployment roadmap for the real estate industry · Trieu Nguyen
1) Understand the problem of digitizing the customer experience
2) Study the LEO CDP solution
3) Deployment roadmap
Develop / digitize customer touchpoints
Build the customer journey map
Define the important metrics and KPIs
Build the web portal and mobile data hub
Build the digital marketing plan
Deploy the CDP and marketing automation
Build an analytics team to analyze the data
From Dataism to Customer Data Platform · Trieu Nguyen
1) How to think in the age of Dataism with LEO CDP ?
2) Why is Dataism for human, business and society ?
3) How should LEO Customer Data Platform (LEO CDP) work ?
4) How to use LEO CDP for your business ?
Data collection, processing & organization with USPA framework · Trieu Nguyen
1) How to think in the age of Dataism with USPA framework ?
2) How to collect customer data
3) Data Segmentation Processing for flexibility and scalability
4) Data Organization for personalization and business activation
Part 1: Introduction to digital marketing technology · Trieu Nguyen
This document provides an overview of a mini-course on data-driven marketing using the USPA framework presented by Trieu Nguyen. It includes biographical information about Trieu Nguyen's background and experience in big data projects, machine learning, and digital marketing roles. The document also outlines the topics that will be covered in the mini-course, including digital media models, search engine marketing, social media marketing, advertising technology, customer data platforms, and case studies. Key terms like omnichannel strategy, customer experience strategy, and artificial intelligence strategies for marketing are also defined.
Transform your marketing and sales capabilities with Big Data and A.I
1) Why is Customer Data Platform (CDP) ?
Case study: Enhancing the revenue of your restaurant with CDP and mobile app marketing
Question: Why can CDP disrupt business model for restaurant industry (B2C) ?
2) How would CDP work in practice ?
Introducing USPA.tech as logical framework for implementing CDP in practice
How Can a Customer Data Platform Enhance Your Account-Based Marketing Strategy (B2B) ?
3) How can we implement CDP for business?
Introducing the CDP as customer-first marketing platform for all industries (my key idea in this slide)
How to build a Personalized News Recommendation Platform · Trieu Nguyen
This document discusses how to build a personalized news recommendation platform. It explains that recommendation systems are needed to retain users, increase traffic, and improve the content experience. It describes popular techniques like collaborative filtering, content-based filtering, and hybrid systems. Specifically, it outlines a case study using a USPA framework with real social news data. Key factors for a news recommendation system are discussed like novelty, user history, and location. The document also provides a simple example of building a recommendation engine with Apache Spark.
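In the spirit of the "simple example of building a recommendation engine with Apache Spark" mentioned above, here is a hedged collaborative-filtering sketch using Spark MLlib's ALS (file path, column names, and hyperparameters are illustrative, not the deck's actual code):

```scala
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

object NewsRecommenderSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("NewsRecommenderSketch").getOrCreate()

    // Expected columns: userId (int), articleId (int), rating (float, e.g. implicit clicks).
    val ratings = spark.read.option("header", "true").option("inferSchema", "true")
      .csv("/data/article-ratings.csv")

    val als = new ALS()
      .setUserCol("userId")
      .setItemCol("articleId")
      .setRatingCol("rating")
      .setRank(10)
      .setMaxIter(10)
      .setRegParam(0.1)

    val model = als.fit(ratings)
    // Top 5 article recommendations for every user.
    model.recommendForAllUsers(5).show(truncate = false)

    spark.stop()
  }
}
```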
How to grow your business in the age of digital marketing 4.0 · Trieu Nguyen
1. The document discusses how businesses can grow in the digital marketing age using technologies like cloud services, big data, AI, and headless CMS platforms.
2. It introduces LeoCloudCMS as a headless API CMS that is built for digital marketing 4.0 and can run scalably on cloud computing.
3. The key idea is to think of your entire business as a "box" and use LeoCloudCMS to attract internet users into the box and offer valuable services.
Video Ecosystem and some ideas about video big data · Trieu Nguyen
Introduction to Video Ecosystem Mind Map
Video Streaming Platform
Video Ad Tech Platform
Video Player Platform
Video Content Distribution Platform
Video Analytics Platform
Summary of key ideas
Q & A
Concepts, use cases and principles to build big data systems (1) · Trieu Nguyen
1) Introduction to the key Big Data concepts
1.1 The Origins of Big Data
1.2 What is Big Data ?
1.3 Why is Big Data So Important ?
1.4 How Is Big Data Used In Practice ?
2) Introduction to the key principles of Big Data Systems
2.1 How to design Data Pipeline in 6 steps
2.2 Using Lambda Architecture for big data processing
3) Practical case study : Chat bot with Video Recommendation Engine
4) FAQ for student
This document discusses open over-the-top (OTT) video content platforms. It defines OTT as streaming media distributed directly over the internet bypassing traditional distribution methods. The document then covers OTT market drivers and business models. It examines the most popular OTT platform in Vietnam and challenges for successful OTT platforms including scalability, content acquisition and management, audience engagement, and business models. Finally, it proposes a modular technical architecture for an open OTT video platform using open source technologies.
Apache Hadoop and Spark: Introduction and Use Cases for Data Analysis · Trieu Nguyen
This document provides an introduction to Apache Hadoop and Spark for data analysis. It discusses the growth of big data from sources like the internet, science, and IoT. Hadoop is introduced as providing scalability on commodity hardware to handle large, diverse data types with fault tolerance. Key Hadoop components are HDFS for storage, MapReduce for processing, and HBase for non-relational databases. Spark is presented as improving on MapReduce by using in-memory computing for iterative jobs like machine learning. Real-world use cases of Spark at companies like Uber, Pinterest, and Netflix are briefly described.
Introduction to Recommendation Systems (Vietnam Web Summit) · Trieu Nguyen
1) Why do we need recommendation systems ?
2) How can we think with recommendation systems ?
3) How can we implement a recommendation system with open source technologies ?
RFX framework https://github.com/rfxlab
Apache Kafka: https://kafka.apache.org
Apache Spark: https://spark.apache.org
Annex K RBF's The World Game pdf document · Steven McGee
Signals & Telemetry Annex K for RBF's The World Game / Trade Federations / USPTO 13/573,002 Heart Beacon Cycle Time - Space Time Chain meters, metrics, standards. Adaptive Procedural template framework structured data derived from DoD / NATO's system of systems engineering tech framework
1. Overview of statistical software such as ODK, SurveyCTO, and CSPro
2. Software installation(for computer, and tablet or mobile devices)
3. Create a data entry application
4. Create the data dictionary
5. Create the data entry forms
6. Enter data
7. Add Edits to the Data Entry Application
8. CAPI questions and texts
Getting Started with Interactive Brokers API and Python.pdf · Riya Sen
In the fast-paced world of finance, automation is key to staying ahead of the curve. Traders and investors are increasingly turning to programming languages like Python to streamline their strategies and enhance their decision-making processes. In this blog post, we will delve into the integration of Python with Interactive Brokers, one of the leading brokerage platforms, and explore how this dynamic duo can revolutionize your trading experience.
The Rise of Python in Finance, Automating Trading Strategies: _.pdf · Riya Sen
In the dynamic realm of finance, where every second counts, the integration of technology has become indispensable. Aspiring traders and seasoned investors alike are turning to coding as a powerful tool to unlock new avenues of financial success. In this blog, we delve into the world of Python live trading strategies, exploring how coding can be the key to navigating the complexities of the market and securing your path to prosperity.
Introduction to Data Science
1.1 What is data science; importance of data science
1.2 Big data and data science; the current scenario
1.3 Industry perspective; types of data: structured vs. unstructured data
1.4 Quantitative vs. categorical data
1.5 Big data vs. little data; the data science process
1.6 Role of the data scientist
DESIGN AND DEVELOPMENT OF AUTO OXYGEN CONCENTRATOR WITH SOS ALERT FOR HIKING ... · JeevanKp7
Long-term oxygen therapy (LTOT) and novel techniques of evaluating treatment efficacy have enhanced the quality of life and decreased healthcare expenses for COPD patients.
Because the cost of a pulmonary blood gas test is comparable to the cost of two days of oxygen therapy, and the cost of a hospital stay is equivalent to the cost of one month of oxygen therapy, long-term oxygen therapy (LTOT) is a cost-effective way of treating this disease.
A small number of clinical investigations on LTOT have shown that it improves the quality of life of COPD patients by reducing the loss of their respiratory capacity. A study of 8487 Danish patients found that LTOT for 1524 hours per day extended life expectancy from 1.07 to 1.40 years.
3. Flurry is a leading mobile advertising and analytics provider (Publisher · Advertiser · Audience)
• AppCircle – Applications: 10,000+; Devices/month: 300M; Conversions/month: 120M
• AppSpot – Applications: 2,500+; Devices/month: 250M; Impressions/month: 7.5B
• Analytics – Applications: 400,000; Devices/month: 1.2B; Data points/month: 1.9T
4. Topics – The Path to Real-Time Processing
• Why Flurry switched from a MapReduce framework to pipeline processing
• How Flurry uses Kafka in data processing
• Tuning Kafka to work in Flurry's environment
• Monitoring and error handling of streams
15. Why Kafka for Flurry
[Chart comparing startup time for device reports: MapReduce (jobs) vs. Kafka]
16. Introducing the Data Log Consumer (DLC)
[Architecture diagram – web layer: Jetty servers receiving binary-encoded data over HTTP, feeding Kafka; metrics processing: the Data Log Consumer and a Metrics Computer on Hadoop/HBase (Hadoop Map/Reduce), with HDFS, an HBase metrics table (cube), HBase normalized data storage, and MySQL user profile data; Agent Portal and Developer Portal on top.]
17. Tuning Kafka for Flurry – Challenges
• Zookeeper timeouts
• Completely async service
• Default fsync interval
• Commit threshold from local environments
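For reference, a hedged sketch of the 0.8-era configuration knobs behind those challenges (the values are placeholders, not Flurry's production settings, and broker options normally live in server.properties rather than code):

```scala
import java.util.Properties

// Illustrative knobs: ZooKeeper timeouts and offset auto-commit on the
// consumer, fsync behaviour on the broker, batching for the async producer.
object KafkaTuningSketch {

  // Consumer side: longer ZooKeeper timeouts to ride out pauses,
  // plus an explicit offset auto-commit interval ("commit threshold").
  val consumerProps: Properties = {
    val p = new Properties()
    p.put("zookeeper.connect", "zk1:2181,zk2:2181,zk3:2181")
    p.put("zookeeper.session.timeout.ms", "30000")
    p.put("zookeeper.connection.timeout.ms", "30000")
    p.put("auto.commit.interval.ms", "10000")
    p
  }

  // Broker side: how often log segments are flushed (fsynced) to disk.
  val brokerOverrides: Properties = {
    val p = new Properties()
    p.put("log.flush.interval.messages", "10000")
    p.put("log.flush.interval.ms", "1000")
    p
  }

  // Producer side: fully asynchronous sends with batching.
  val producerProps: Properties = {
    val p = new Properties()
    p.put("producer.type", "async")
    p.put("queue.buffering.max.ms", "500")
    p.put("batch.num.messages", "200")
    p
  }
}
```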
18. How Flurry Uses Kafka – Infrastructure and Setup
[Diagram: a consumer group of 325 consumers (C1…C325) reading from a Kafka cluster of three brokers (B1–B3) hosting a topic with 400 partitions (P1…P400)]
20. Next Steps: 0.8
[Diagram: Data Log Consumers reading from Kafka and writing to HDFS; a Kafka 0.8 cluster with replication, where Broker 1 holds partitions P0 and P2, Broker 2 holds P1 and P3, and replica partitions (P1', P3', P0', P2') are spread across the brokers]
21. Next Steps: Extended Pipeline
[Diagram: extended pipeline with input data, collectors, consumer/producer systems, a NoSQL datastore, MapReduce (jobs), and external actions, split into real-time and batch paths]
22. Next Steps: Topics and Consumer Groups – Infrastructure and Setup
[Diagram: multiple topics (Topic 1, Topic 2) each consumed by multiple consumer groups – Consumer Group 1 (C1…CN), Consumer Group 2 (C1'…CN'), and Consumer Group N (C1''…CN'')]