Flurry processes terabytes of mobile data in real time, having switched from a MapReduce framework to a Kafka-based pipeline. Kafka allows continuous, asynchronous processing without job startup delays. Flurry runs Kafka clusters whose topics are read in parallel by Data Log Consumers that process the streaming data and compute analytics metrics in real time, and it monitors Kafka and the consumers for failures and errors to ensure reliable processing.
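Flurry's consumer code is not included in the deck; as a rough illustration of the read-topics-in-parallel pattern, here is a minimal Kafka consumer loop. It uses the modern Java consumer API (the deck itself predates it), and the broker address, group id, topic name, and counted metric are all assumptions:

```scala
import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.jdk.CollectionConverters._

object DataLogConsumerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")   // illustrative broker address
    props.put("group.id", "metrics-consumers")        // consumers in one group share partitions
    props.put("key.deserializer",
      "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer",
      "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("device-logs")) // hypothetical topic name

    var eventCount = 0L
    while (true) {
      val records = consumer.poll(Duration.ofMillis(500))
      for (record <- records.asScala) {
        eventCount += 1 // stand-in for the real metric computation
        if (eventCount % 100000 == 0)
          println(s"processed $eventCount events, last offset ${record.offset()}")
      }
    }
  }
}
```

Running several instances with the same group id spreads the topic's partitions across them, which is the parallelism model described above.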
Spark Summit EU talk by Ruben Pulido and Behar Veliqi · Spark Summit
The document discusses IBM's transition from a single-tenant Hadoop architecture to a multi-tenant Apache Spark architecture for their Watson Analytics for Social Media product. The new architecture aggregates social media data from thousands of tenants into a single stream and uses Spark, Kafka and Zookeeper to provide robust real-time analytics with low latency switching between tenants. Key aspects of the new architecture include separating analytics into tenant-specific and language-specific components, and removing state from processing components.
Data Pipeline with Kafka. These slides include:
Kafka introduction, topics / partitions, producers / consumers, quick start, offset monitoring, example code, and Camus
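The deck's own example code is not reproduced in this summary; as a stand-in for the producer/consumer quick-start portion, here is a minimal producer sketch (topic name, key scheme, and broker address are illustrative):

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object ProducerQuickStart {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    // Keyed messages go to the partition chosen by hashing the key,
    // so all events for one user land on the same partition.
    (1 to 10).foreach { i =>
      producer.send(new ProducerRecord("events", s"user-$i", s"""{"click":$i}"""))
    }
    producer.flush()
    producer.close()
  }
}
```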
Building Value Within the Heavy Vehicle Industry Using Big Data and Streaming... · DataWorks Summit
More than 300,000 globally connected Scania vehicles (trucks and buses) continuously submit their GPS positions. In this presentation we will show how Scania analyses these positions to obtain valuable information about the operation of the vehicles. The algorithms have been developed in a research project, FUMA, that is run as a joint venture between Fraunhofer Chalmers Centre and Scania.
In the project we build a continuous delivery pipeline that enables us to iteratively improve our code, our algorithms and the data deliverables. The pipeline runs on a Hortonworks platform using Apache Spark Streaming. In the build pipeline we use Jenkins, Nexus and Ansible to test, deploy and run Apache Spark Streaming jobs and the results are pushed to Apache Kafka. We will highlight and present some of the steps we have taken in order to put a streaming big data application in production at a manufacturing company. We think that people with a general awareness of the challenges with big data, the possibilities of the streaming paradigm and the need for continuous delivery will find this talk very intriguing. In this presentation you will learn how we develop and run the code and how we ensure that we are creating value for Scania.
Unified Batch & Stream Processing with Apache Samza · DataWorks Summit
The traditional lambda architecture has been a popular solution for joining offline batch operations with real time operations. This setup incurs a lot of developer and operational overhead since it involves maintaining code that produces the same result in two, potentially different distributed systems. In order to alleviate these problems, we need a unified framework for processing and building data pipelines across batch and stream data sources.
Based on our experiences running and developing Apache Samza at LinkedIn, we have enhanced the framework to support: a) Pluggable data sources and sinks; b) A deployment model supporting different execution environments such as Yarn or VMs; c) A unified processing API for developers to work seamlessly with batch and stream data. In this talk, we will cover how these design choices in Apache Samza help tackle the overhead of lambda architecture. We will use some real production use-cases to elaborate how LinkedIn leverages Apache Samza to build unified data processing pipelines.
Speaker
Navina Ramesh, Sr. Software Engineer, LinkedIn
Espresso Database Replication with Kafka, Tom Quiggle · Confluent
This document discusses using Apache Kafka for database replication in LinkedIn's ESPRESSO database system. It provides an overview of ESPRESSO's architecture and transition from per-instance to per-partition replication using Kafka. Key aspects covered include Kafka configuration, the message protocol for ensuring in-order delivery, and checkpointing by the Kafka producer to allow resuming replication from the last committed transaction after failures.
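The exact producer and checkpoint protocol is Espresso-specific, but the core idea described above can be sketched as follows (class and parameter names are hypothetical; the durable checkpoint store is left abstract):

```scala
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// A minimal sketch of the idea: only advance a durable checkpoint after Kafka
// has acknowledged the write, so replication can resume from the last
// committed transaction after a failure. (Not Espresso's actual implementation.)
class CheckpointingReplicator(producer: KafkaProducer[String, String],
                              saveCheckpoint: Long => Unit) {

  def replicate(topic: String, partitionKey: String, txnId: Long, payload: String): Unit = {
    val record = new ProducerRecord(topic, partitionKey, payload)
    // Block on the ack; with max.in.flight.requests.per.connection=1 and
    // retries enabled this also preserves per-partition ordering.
    producer.send(record).get()
    saveCheckpoint(txnId) // advanced only after the broker acknowledged the send
  }
}
```

Blocking on each send trades throughput for the in-order, resumable delivery the summary describes; a real implementation would batch and pipeline more aggressively.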
Kafka is a high-throughput distributed messaging system with publish and subscribe capabilities. It provides persistence with replication to disk for fault tolerance. Kafka is simple to implement and runs efficiently on large clusters with low latency and high throughput. It was created at LinkedIn to process streaming data from the LinkedIn website and has since been open sourced.
This document summarizes the work done by Yahoo engineers to optimize performance of queries on a mobile analytics data mart hosted on Apache Hive. They implemented several techniques like using Tez, vectorized query execution, map-side aggregations, and ORC file format, which provided significant performance boosts. For high cardinality partitioned tables, they leveraged sketching which reduced query times from over 100 seconds to under 25 seconds. They also implemented a data mart in a box solution for easier setup of custom data marts and funnels analysis using UDFs.
Spark Summit EU talk by Kaarthik Sivashanmugam · Spark Summit
This document discusses Spark Streaming techniques used at Bing scale. It addresses challenges like processing billions of events per hour from multiple data centers in near real-time while handling issues like out of order events, delays, and state management. Techniques used include dynamically repartitioning Kafka partitions, running Kafka fetch jobs on time in separate threads to avoid delays, caching Kafka RDDs in parallel threads for querying, and using UpdateStateByKey to join streams while enforcing application time windows.
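As a rough illustration of the UpdateStateByKey technique mentioned above (not Bing's actual jobs; the socket source and per-key count are placeholders for their Kafka streams and state):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Keep a running count per key across micro-batches with updateStateByKey.
object StatefulCounts {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StatefulCounts").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("/tmp/state-checkpoints") // required for stateful operators

    val events = ssc.socketTextStream("localhost", 9999).map(key => (key, 1))

    val updateFn: (Seq[Int], Option[Int]) => Option[Int] =
      (newValues, state) => Some(newValues.sum + state.getOrElse(0))

    events.updateStateByKey(updateFn).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```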
HBaseConAsia2018 Track2-4: HTAP DB-System: ApsaraDB HBase, Phoenix, and Spark · Michael Stack
This document discusses using Phoenix and Spark with ApsaraDB HBase. It covers the architecture of Phoenix as a service over HBase, use cases like log and internet company scenarios, best practices for table properties and queries, challenges around availability and stability, and improvements being made. It also discusses how Spark can be used for analysis, bulk loading, real-time ETL, and to provide elastic compute resources. Example architectures show Spark SQL analyzing HBase and structured streaming incrementally loading data. Scenarios discussed include online reporting, complex analysis, log indexing and querying, and time series monitoring.
The document discusses enhancements made to Sqoop to improve importing data from relational databases to Hive. Key enhancements include a new Hive Merge tool for synchronizing incremental data updates, support for dynamic partitioning and external tables in Hive, and encrypting passwords in the Sqoop metastore. The presentation includes demos and discusses Apache Jiras where Expedia contributed patches related to these Sqoop enhancements.
Spark Summit EU talk by Debasish Das and Pramod Narasimha · Spark Summit
This document describes a system called DeviceAnalyzer that builds predictive models in near-real time using Apache Spark and Apache Lucene. It discusses:
1) Integrating Spark and Lucene to enable column search capabilities in Spark and add Spark operations to Lucene.
2) Representing Spark DataFrames as Lucene documents to build a distributed Lucene index from DataFrames.
3) Using the index for tasks like searching devices matching a query, generating statistical and predictive models on retrieved devices, and finding dimensions correlated with selected devices.
4) Architectural components like Trapezium for batch, streaming, and API services and a LuceneDAO for indexing DataFrames and querying the index.
Real time data viz with Spark Streaming, Kafka and D3.js · Ben Laird
This document discusses building a dynamic visualization of large streaming transaction data. It proposes using Apache Kafka to handle the transaction stream, Apache Spark Streaming to process and aggregate the data, MongoDB for intermediate storage, a Node.js server, and Socket.io for real-time updates. Visualization would use Crossfilter, DC.js and D3.js to enable interactive exploration of billions of records in the browser.
This document discusses data ingestion into Hadoop. It describes how data can be ingested in real-time or in batches. Common tools for ingesting data into Hadoop include Apache Flume, Apache NiFi, and Apache Sqoop. Flume is designed for streaming data ingestion and uses a source-channel-sink architecture to reliably move data into Hadoop. NiFi focuses on real-time data collection and processing capabilities. Sqoop can import and export structured data between Hadoop and relational databases.
Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters... · Databricks
At the end of the day, the only thing data scientists want is tabular data for their analysis. They do not want to spend hours or days preparing data. How does a data engineer handle the massive amount of data being streamed at them from IoT devices and apps, and at the same time add structure to it so that data scientists can focus on finding insights rather than preparing data? By the way, you need to do this within minutes (sometimes seconds). Oh… and there are a lot of other data sources that you need to ingest, and the current providers of data keep changing their structure.
GoPro has massive amounts of heterogeneous data being streamed from their consumer devices and applications, and they have developed the concept of “dynamic DDL” to structure their streamed data on the fly using Spark Streaming, Kafka, HBase, Hive and S3. The idea is simple: Add structure (schema) to the data as soon as possible; allow the providers of the data to dictate the structure; and automatically create event-based and state-based tables (DDL) for all data sources to allow data scientists to access the data via their lingua franca, SQL, within minutes.
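GoPro's implementation is not shown in the summary; here is a minimal sketch of the underlying idea, with Spark inferring the schema from the incoming JSON and that inferred schema driving the table definition (paths, table and column names are assumptions):

```scala
import org.apache.spark.sql.SparkSession

// A minimal sketch of the "dynamic DDL" idea (not GoPro's actual pipeline):
// the data provider dictates the structure, Spark infers it from the JSON,
// and the inferred schema becomes the table definition so the data is
// queryable in SQL almost immediately. Real schema evolution (new columns
// arriving later) needs extra handling.
object DynamicDdlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DynamicDdlSketch")
      .enableHiveSupport()
      .getOrCreate()

    // Schema is inferred from the JSON events themselves.
    val batch = spark.read.json("s3a://example-bucket/incoming/events/batch-0001/")
    batch.printSchema()

    // The first batch creates the table (the DDL); later batches append to it.
    batch.write
      .partitionBy("event_date") // hypothetical partition column
      .mode("append")
      .saveAsTable("events_raw")
  }
}
```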
Stream, Stream, Stream: Different Streaming Methods with Spark and Kafka · DataWorks Summit
At NMC (Nielsen Marketing Cloud) we provide our customers (marketers and publishers) real-time analytics tools to profile their target audiences.
To achieve that, we need to ingest billions of events per day into our big data stores, and we need to do it in a scalable yet cost-efficient manner.
In this session, we will discuss how we continuously transform our data infrastructure to support these goals.
Specifically, we will review how we went from CSV files and standalone Java applications all the way to multiple Kafka and Spark clusters, performing a mixture of Streaming and Batch ETLs, and supporting 10x data growth.
We will share our experience as early-adopters of Spark Streaming and Spark Structured Streaming, and how we overcame technical barriers (and there were plenty...).
We will present a rather unique solution of using Kafka to imitate streaming over our Data Lake, while significantly reducing our cloud services' costs.
Topics include:
* Kafka and Spark Streaming for stateless and stateful use-cases
* Spark Structured Streaming as a possible alternative
* Combining Spark Streaming with batch ETLs
* "Streaming" over Data Lake using Kafka
This document discusses Apache Kafka and Confluent's Kafka Connect tool for large-scale streaming data integration. Kafka Connect allows importing and exporting data from Kafka to other systems like HDFS, databases, search indexes, and more using reusable connectors. Connectors use converters to handle serialization between data formats. The document outlines some existing connectors and upcoming improvements to Kafka Connect.
The document discusses using Apache Kafka for event detection pipelines. It describes how Kafka can be used to decouple data pipelines and ingest events from various source systems in real-time. It then provides an example use case of using Kafka, Hadoop, and machine learning for fraud detection in consumer banking, describing the online and offline workflows. Finally, it covers some of the challenges of building such a system and considerations for deploying Kafka.
HBaseConAsia2018: Track2-5: JanusGraph-Distributed graph database with HBase · Michael Stack
This document provides an introduction to JanusGraph, an open source distributed graph database that can be used with Apache HBase for storage. It begins with background on graph databases and their structures, such as vertices, edges, properties, and different storage models. It then discusses JanusGraph's architecture, support for the TinkerPop graph computing framework, and schema and data modeling capabilities. Details are given on partitioning graphs across servers and using different indexing approaches. The document concludes by explaining why HBase is a good storage backend for JanusGraph and providing examples of how the data model would be structured within HBase.
This document discusses Spotify's migration of data pipelines to Docker. It provides background on Spotify growing from 50 to 1000 engineers and the challenges of scaling their big data infrastructure. Spotify adopted Docker to help solve packaging and dependency issues, moving pipelines from cron jobs to a REST API and Docker images. Docker is allowing Spotify to transparently migrate their on-premise Hadoop cluster to Google Cloud, handling over 100 petabytes of data and growing.
Hadoop 2 introduces the YARN framework to provide a common platform for multiple data processing paradigms beyond just MapReduce. YARN splits cluster resource management from application execution, allowing different applications like MapReduce, Spark, Storm and others to run on the same Hadoop cluster. HDFS 2 improves HDFS with features like high availability, federation and snapshots. Apache Tez provides a new data processing engine that enables pipelining of jobs to improve performance over traditional MapReduce.
Have a lot of data? Using or considering using Apache HBase (part of the Hadoop family) to store your data? Want to have your cake and eat it too? Phoenix is an open source project put out by Salesforce. Join us to learn how you can continue to use SQL, but get the raw speed of native HBase usage through Phoenix.
Here's the second version of our big data landscape. Thoughts, questions, comments? We'd love to hear your feedback in the comments section here: http://wp.me/p2dLS7-6A
Apache Kafka 0.8 basic training - Verisign · Michael Noll
Apache Kafka 0.8 basic training (120 slides) covering:
1. Introducing Kafka: history, Kafka at LinkedIn, Kafka adoption in the industry, why Kafka
2. Kafka core concepts: topics, partitions, replicas, producers, consumers, brokers
3. Operating Kafka: architecture, hardware specs, deploying, monitoring, P&S tuning
4. Developing Kafka apps: writing to Kafka, reading from Kafka, testing, serialization, compression, example apps
5. Playing with Kafka using Wirbelsturm
Audience: developers, operations, architects
Created by Michael G. Noll, Data Architect, Verisign, https://www.verisigninc.com/
Verisign is a global leader in domain names and internet security.
Tools mentioned:
- Wirbelsturm (https://github.com/miguno/wirbelsturm)
- kafka-storm-starter (https://github.com/miguno/kafka-storm-starter)
Blog post at:
http://www.michael-noll.com/blog/2014/08/18/apache-kafka-training-deck-and-tutorial/
Many thanks to the LinkedIn Engineering team (the creators of Kafka) and the Apache Kafka open source community!
Best Strategy for Developing App Architecture and High Quality App · Flurry, Inc.
Yahoo has been developing several successful mobile apps in Taiwan. We're going to share our best strategy for developing mobile apps: learn how to use YDevelopKit to save development resources while using DevOps to maintain high quality.
Deep-Dive: Building Native iOS and Android Application with the AWS Mobile SDK · Amazon Web Services
This document provides an overview of building native mobile applications with AWS services using the AWS Mobile SDK. It discusses the benefits of native apps over web apps, and how to integrate the AWS Mobile SDK into iOS and Android applications. It also describes several AWS services that are commonly used for mobile backends, such as Cognito, S3, DynamoDB, API Gateway, Lambda, and Mobile Analytics. Finally, it discusses options for building hybrid mobile apps with Cordova and React Native that can leverage AWS services.
Microsoft and Hortonworks Delivers the Modern Data Architecture for Big Data · Hortonworks
Joint webinar with Microsoft and Hortonworks on the power of combining the Hortonworks Data Platform with Microsoft's ubiquitous Windows, Office, SQL Server, Parallel Data Warehouse, and Azure platform to build the Modern Data Architecture for Big Data.
Hortonworks Data In Motion Series Part 4 · Hortonworks
How real-world enterprises leverage Hortonworks DataFlow/Apache NiFi to create real-time data flows in record time, enabling new business opportunities, improving customer retention, and accelerating big data projects from months to minutes through increased efficiency and reduced costs.
On-Demand webinar: http://hortonworks.com/webinar/paradigm-shift-business-usual-real-time-dataflows-record-time/
Modern Data Architecture for a Data Lake with Informatica and Hortonworks Dat... · Hortonworks
How do you turn data from many different sources into actionable insights and manufacture those insights into innovative information-based products and services?
Industry leaders are accomplishing this by adding Hadoop as a critical component in their modern data architecture to build a data lake. A data lake collects and stores data across a wide variety of channels including social media, clickstream data, server logs, customer transactions and interactions, videos, and sensor data from equipment in the field. A data lake cost-effectively scales to collect and retain massive amounts of data over time, and convert all this data into actionable information that can transform your business.
Join Hortonworks and Informatica as we discuss:
- What is a data lake?
- The modern data architecture for a data lake
- How Hadoop fits into the modern data architecture
- Innovative use-cases for a data lake
This document provides an overview of the Confluent streaming platform and Apache Kafka. It discusses how streaming platforms can be used to publish, subscribe and process streams of data in real-time. It also highlights challenges with traditional architectures and how the Confluent platform addresses them by allowing data to be ingested from many sources and processed using stream processing APIs. The document also summarizes key components of the Confluent platform like Kafka Connect for streaming data between systems, the Schema Registry for ensuring compatibility, and Control Center for monitoring the platform.
Santander Stream Processing with Apache Flink · Confluent
Flink is becoming the de facto standard for stream processing due to its scalability, performance, fault tolerance, and language flexibility. It supports stream processing, batch processing, and analytics through one unified system. Developers choose Flink for its robust feature set and ability to handle stream processing workloads at large scales efficiently.
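As a small, self-contained illustration of the kind of stream processing described (not Santander's code; the socket source and one-minute tumbling window are arbitrary choices):

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time

// Count events per key over 1-minute tumbling windows.
object WindowedCounts {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    env.socketTextStream("localhost", 9999)
      .map(line => (line.trim, 1))
      .keyBy(_._1)
      .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
      .sum(1)
      .print()

    env.execute("WindowedCounts")
  }
}
```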
Streaming Data Ingest and Processing with Apache Kafka · Attunity
Apache™ Kafka is a fast, scalable, durable, and fault-tolerant publish-subscribe messaging system that offers high throughput, reliability, and replication. To manage growing data volumes, many companies are leveraging Kafka for streaming data ingest and processing.
Join experts from Confluent, the creators of Apache™ Kafka, and the experts at Attunity, a leader in data integration software, for a live webinar where you will learn how to:
-Realize the value of streaming data ingest with Kafka
-Turn databases into live feeds for streaming ingest and processing
-Accelerate data delivery to enable real-time analytics
-Reduce skill and training requirements for data ingest
The recorded webinar on slide 32 includes a demo using automation software (Attunity Replicate) to stream live changes from a database into Kafka and also includes a Q&A with our experts.
For more information, please go to www.attunity.com/kafka.
Hortonworks Get Started Building YARN Applications Dec. 2013. We cover YARN basics, benefits, getting started and roadmap. Actian shares their experience and recommendations on building their real-world YARN application.
Building data pipelines is pretty hard! Building a multi-datacenter active-active real time data pipeline for multiple classes of data with different durability, latency and availability guarantees is much harder. Real time infrastructure powers critical pieces of Uber (think Surge) and in this talk we will discuss our architecture, technical challenges, learnings and how a blend of open source infrastructure (Apache Kafka and Flink) and in-house technologies have helped Uber scale.
This document provides an overview of Apache Flink and discusses why it is suitable for real-world streaming analytics. The document contains an agenda that covers how Flink is a multi-purpose big data analytics framework, why streaming analytics are emerging, why Flink is suitable for real-world streaming analytics, novel use cases enabled by Flink, who is using Flink, and where to go from here. Key points include Flink innovations like custom memory management, its DataSet API, rich windowing semantics, and native iterative processing. Flink's streaming features that make it suitable for real-world use include its pipelined processing engine, stream abstraction, performance, windowing support, fault tolerance, and integration with Hadoop.
Overview of Apache Flink: The 4G of Big Data Analytics Frameworks · Slim Baltagi
Slides of my talk at the Hadoop Summit Europe in Dublin, Ireland on April 13th, 2016. The talk introduces Apache Flink as both a multi-purpose Big Data analytics framework and real-world streaming analytics framework. It is focusing on Flink's key differentiators and suitability for streaming analytics use cases. It also shows how Flink enables novel use cases such as distributed CEP (Complex Event Processing) and querying the state by behaving like a key value data store.
Beyond the Brokers: A Tour of the Kafka Ecosystem (Au delà des brokers, un tour de l'environnement Kafka) | Florent Ramière · Confluent
During the Confluent Streaming event in Paris, Florent Ramière, Technical Account Manager at Confluent, goes beyond brokers, introducing a whole new ecosystem with Kafka Streams, KSQL, Kafka Connect, Rest proxy, Schema Registry, MirrorMaker, etc.
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K... · Timothy Spann
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and Kafka
Apache NiFi, Apache Flink, Apache Kafka
https://budapestdata.hu/2023/en/speakers/timothy-spann/
Timothy Spann
Principal Developer Advocate
Cloudera (US)
LinkedIn · GitHub · datainmotion.dev
June 8 · Online · English talk
Building Modern Data Streaming Apps with NiFi, Flink and Kafka
In my session, I will show you some best practices I have discovered over the last 7 years in building data streaming applications including IoT, CDC, Logs, and more.
In my modern approach, we utilize several open-source frameworks to maximize the best features of all. We often start with Apache NiFi as the orchestrator of streams flowing into Apache Kafka. From there we build streaming ETL with Apache Flink SQL. We will stream data into Apache Iceberg.
We use the best streaming tools for the current applications with FLaNK. flankstack.dev
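As a sketch of the Kafka-to-Flink-SQL streaming ETL step described above (topic names, fields, broker address, and the filter predicate are all illustrative; an Iceberg sink table would be declared the same way with the Iceberg connector options):

```scala
import org.apache.flink.table.api.{EnvironmentSettings, TableEnvironment}

// Read JSON events from one Kafka topic with Flink SQL and write a filtered
// stream to another topic.
object FlinkSqlEtlSketch {
  def main(args: Array[String]): Unit = {
    val tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode())

    tEnv.executeSql(
      """CREATE TABLE sensor_events (
        |  device_id STRING,
        |  temperature DOUBLE,
        |  ts TIMESTAMP(3)
        |) WITH (
        |  'connector' = 'kafka',
        |  'topic' = 'sensor-events',
        |  'properties.bootstrap.servers' = 'broker1:9092',
        |  'properties.group.id' = 'flink-etl',
        |  'scan.startup.mode' = 'latest-offset',
        |  'format' = 'json'
        |)""".stripMargin)

    tEnv.executeSql(
      """CREATE TABLE hot_devices (
        |  device_id STRING,
        |  temperature DOUBLE,
        |  ts TIMESTAMP(3)
        |) WITH (
        |  'connector' = 'kafka',
        |  'topic' = 'hot-devices',
        |  'properties.bootstrap.servers' = 'broker1:9092',
        |  'format' = 'json'
        |)""".stripMargin)

    tEnv.executeSql(
      "INSERT INTO hot_devices SELECT device_id, temperature, ts FROM sensor_events WHERE temperature > 90")
  }
}
```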
BIO
Tim Spann is a Principal Developer Advocate in Data In Motion for Cloudera. He works with Apache NiFi, Apache Pulsar, Apache Kafka, Apache Flink, Flink SQL, Apache Pinot, Trino, Apache Iceberg, DeltaLake, Apache Spark, Big Data, IoT, Cloud, AI/DL, machine learning, and deep learning. Tim has over ten years of experience with the IoT, big data, distributed computing, messaging, streaming technologies, and Java programming.
Previously, he was a Developer Advocate at StreamNative, Principal DataFlow Field Engineer at Cloudera, a Senior Solutions Engineer at Hortonworks, a Senior Solutions Architect at AirisData, a Senior Field Engineer at Pivotal and a Team Leader at HPE. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton & NYC on Big Data, Cloud, IoT, deep learning, streaming, NiFi, the blockchain, and Spark. Tim is a frequent speaker at conferences such as ApacheCon, DeveloperWeek, Pulsar Summit and many more. He holds a BS and MS in computer science.
Streaming Data and Stream Processing with Apache Kafka · Confluent
Apache Kafka is an open-source streaming platform that can be used to build real-time data pipelines and streaming applications. It addresses challenges with diverse data sets arriving at increasing rates. The document discusses how Apache Kafka can help with challenges around data integration, stream processing, and managing streaming platforms at scale. It also outlines key features of Apache Kafka like the Kafka Connect API for data integration, the Kafka Streams API for stream processing, and Confluent Control Center for monitoring and management.
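A minimal Kafka Streams sketch of the stream-processing side (application id, topic names, and the filter are assumptions, not Confluent's example):

```scala
import java.util.Properties
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}

// Filter one stream of page-view events into another topic; the processing
// runs inside the application itself rather than a separate cluster.
object PageViewFilter {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "pageview-filter")
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092")
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass)
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass)

    val builder = new StreamsBuilder()
    builder.stream[String, String]("page-views")
      .filter((_, value) => value.contains("\"country\":\"US\""))
      .to("page-views-us")

    val streams = new KafkaStreams(builder.build(), props)
    streams.start()
    sys.addShutdownHook(streams.close())
  }
}
```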
Hortonworks provides an open source Apache Hadoop distribution called Hortonworks Data Platform (HDP). Their mission is to enable modern data architectures through delivering enterprise Apache Hadoop. They have over 300 employees and are headquartered in Palo Alto, CA. Hortonworks focuses on driving innovation through the open source Apache community process, integrating Hadoop with existing technologies, and engineering Hadoop for enterprise reliability and support.
This document compares Apache Flume and Apache Kafka for use in data pipelines. It describes Conversant's evolution from a homegrown log collection system to using Flume and then integrating Kafka. Key points covered include how Flume and Kafka work, their capabilities for reliability, scalability, and ecosystems. The document also discusses customizing Flume for Conversant's needs, and how Conversant monitors and collects metrics from Flume and Kafka using tools like JMX, Grafana dashboards, and OpenTSDB.
Kafka for Real-Time Replication between Edge and Hybrid Cloud · Kai Wähner
Not all workloads allow cloud computing. Low latency, cybersecurity, and cost-efficiency require a suitable combination of edge computing and cloud integration.
This session explores architectures and design patterns for software and hardware considerations to deploy hybrid data streaming with Apache Kafka anywhere. A live demo shows data synchronization from the edge to the public cloud across continents with Kafka on Hivecell and Confluent Cloud.
This document describes Hopsworks, an end-to-end data platform for analytics and machine learning built by KTH and RISE SICS. It provides data ingestion, preparation, experimentation, model training, and deployment capabilities. The platform is built on Apache technologies like Apache Beam, Spark, Flink, Kafka, and uses Kubernetes for orchestration. It also includes a feature store for ML features. The document then discusses Apache Flink and its use for stream processing applications. It provides examples of using Flink's APIs like SQL, CEP, and machine learning. Finally, it introduces the concept of continuous deep analytics and the Arcon framework for unified analytics across streams, tensors, graphs and more through an intermediate representation.
Things fail. It’s a fact of life. But that doesn’t mean that your applications and services need to fail. In this talk, David Prinzing described a solution architecture that has been proven to deliver amazing performance at scale with continuous availability on Amazon Web Services. You can’t just move your application to the cloud and expect this – you need to design for it. Technology selections include Amazon Web Services, Ubuntu Linux, Apache Cassandra for the database, Dropwizard for providing RESTful web services, and AngularJS as the foundation for an HTML5 web application. Event: http://www.meetup.com/AWS-EASTBAY/events/225570266
PortoTechHub - Hail Hydrate! From Stream to Lake with Apache Pulsar and Friends · Timothy Spann
This document provides an overview and summary of Apache Pulsar, a distributed streaming and messaging platform. It discusses Pulsar's benefits like data durability, scalability, geo-replication and multi-tenancy. It outlines key use cases like message queuing and data streaming. The document also summarizes Pulsar's architecture, subscriptions modes, connectors, and integration with other technologies like Apache Flink, Apache NiFi and MQTT. It highlights real-world customer implementations and provides demos of ingesting IoT data via Pulsar.
Similar to Flurry Analytic Backend - Processing Terabytes of Data in Real-time (20)
Building Your Customer Data Platform with LEO CDP in Travel Industry.pdf · Trieu Nguyen
1. The document outlines the Chief Platform Engineer's background and introduces LEO CDP, a customer data platform for the travel industry.
2. It discusses 5 challenges companies face related to customer growth, journeys, data platforms, communication and understanding customers with big data.
3. A case study shows how LEO CDP can be used to create a customer journey map for a travel agency, including personalized promotions and offers sent via email.
How to track and improve Customer Experience with LEO CDP · Trieu Nguyen
This document discusses how to track and improve customer experience using LEO CDP. It begins by explaining why measuring customer experience is important, then introduces four key metrics: Customer Feedback Score, Customer Effort Score, Customer Satisfaction Score, and Net Promoter Score. It describes using journey maps to manage customer experience data and visualize the customer journey. Finally, it presents LEO CDP as a software solution for collecting customer experience data, building surveys, and generating reports to gain insights to improve products, services, and the overall customer experience.
[Notes] Customer 360 Analytics with LEO CDP · Trieu Nguyen
Part 1: Why should every business need to deploy a CDP ?
1. Big data is the reality of business today
2. What are technologies to manage customer data ?
3. The rise of first-party data and new technologies for Digital Marketing
4. How to apply USPA mindset to build your CDP for data-driven business
Part 2: How to use LEO CDP for your business
1. Core functions of LEO CDP for marketers and IT managers
2. Data Unification for Customer 360 Analytics
3. Data Segmentation
4. Customer Personalization
5. Customer Data Activation
Part 3: Case study in O2O Retail and Ecommerce
1. How to build customer journey map for ecommerce and retail
2. How to do customer analytics to find ideal customer profiles
The ideal customer profile in a B2B context
The ideal customer profile in a B2C context
3. Manage product catalog for customer personalization
4. Monitoring Data of Customer Experience (CX Analytics)
CX Data Flow
CX Rating plugin is embedded in the website, to collect feedback data
An overview of CX Report
A CX Report in a customer profile
5. Monitoring data with real-time event tracking reports
Event Data Flow
Summary Event Data Report
Event Data Report in a Customer Profile
Part 4: How to setup an instance of LEO CDP for free
1. Technical architecture
2. Server infrastructure
3. Setup middlewares: Nginx, ArangoDB, Redis, Java and Python
Network requirements
Software requirements for new server
ArangoDB
Nginx Proxy
SSL for Nginx Server
Java 8 JVM
Redis
Install Notes for Linux Server
Clone binary code for new server
Set DNS hosts for LEO CDP workers
4. Setup data for testing and system verification
Part 5: Summary all key ideas
Why should you invest in LEO CDP ?
Purpose: Big data and AI democracy for SMEs companies
Problem: Customer Analytics and Customer Personalization
Solutions: CDP + CX + Personalization Engine
Product demo: LEO CDP for Ecommerce and Fintech
Business model: Freemium → Ecosystem → Subscription
Market size: 20 billion USD in 2026 and CAGR 34.6%
Differentiation: cloud-native software
Go-to-market approach: Community → Free → Paid
Team: 1 full-stack dev, 1 data scientist and 12,000 fans of BigDataVietnam.org Community
Need 150,000 USD for scaling business (you get 20% share)
The document outlines new features and updates for 2022 from USPA Technology Company, including a new dedicated dashboard for CMOs, updated UI for Customer 360 Insights, and a focus on data-driven business processes and digital marketing in B2B through standardizing data-driven processes and focusing on customer insights.
LEO CDP deployment roadmap for the real estate industry · Trieu Nguyen
1) Understand the problem of digitizing the customer experience
2) Study the LEO CDP solution
3) Deployment roadmap
Develop / digitize customer touchpoints
Build the customer journey map
Define the important metrics and KPIs
Build the web portal and mobile data hub
Build the digital marketing plan
Deploy the CDP and marketing automation
Build an analytics team to analyze the data
From Dataism to Customer Data Platform · Trieu Nguyen
1) How to think in the age of Dataism with LEO CDP ?
2) Why is Dataism for human, business and society ?
3) How should LEO Customer Data Platform (LEO CDP) work ?
4) How to use LEO CDP for your business ?
Data collection, processing & organization with USPA framework · Trieu Nguyen
1) How to think in the age of Dataism with USPA framework ?
2) How to collect customer data
3) Data Segmentation Processing for flexibility and scalability
4) Data Organization for personalization and business activation
Part 1: Introduction to digital marketing technology · Trieu Nguyen
This document provides an overview of a mini-course on data-driven marketing using the USPA framework presented by Trieu Nguyen. It includes biographical information about Trieu Nguyen's background and experience in big data projects, machine learning, and digital marketing roles. The document also outlines the topics that will be covered in the mini-course, including digital media models, search engine marketing, social media marketing, advertising technology, customer data platforms, and case studies. Key terms like omnichannel strategy, customer experience strategy, and artificial intelligence strategies for marketing are also defined.
Transform your marketing and sales capabilities with Big Data and A.I
1) Why is Customer Data Platform (CDP) ?
Case study: Enhancing the revenue of your restaurant with CDP and mobile app marketing
Question: Why can CDP disrupt business model for restaurant industry (B2C) ?
2) How would CDP work in practice ?
Introducing USPA.tech as logical framework for implementing CDP in practice
How Can a Customer Data Platform Enhance Your Account-Based Marketing Strategy (B2B) ?
3) How can we implement CDP for business?
Introducing the CDP as customer-first marketing platform for all industries (my key idea in this slide)
How to build a Personalized News Recommendation Platform · Trieu Nguyen
This document discusses how to build a personalized news recommendation platform. It explains that recommendation systems are needed to retain users, increase traffic, and improve the content experience. It describes popular techniques like collaborative filtering, content-based filtering, and hybrid systems. Specifically, it outlines a case study using a USPA framework with real social news data. Key factors for a news recommendation system are discussed like novelty, user history, and location. The document also provides a simple example of building a recommendation engine with Apache Spark.
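In the spirit of the "simple example of building a recommendation engine with Apache Spark" mentioned above, here is a hedged collaborative-filtering sketch using Spark MLlib's ALS (file path, column names, and hyperparameters are illustrative, not the deck's actual code):

```scala
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

object NewsRecommenderSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("NewsRecommenderSketch").getOrCreate()

    // Expected columns: userId (int), articleId (int), rating (float, e.g. implicit clicks).
    val ratings = spark.read.option("header", "true").option("inferSchema", "true")
      .csv("/data/article-ratings.csv")

    val als = new ALS()
      .setUserCol("userId")
      .setItemCol("articleId")
      .setRatingCol("rating")
      .setRank(10)
      .setMaxIter(10)
      .setRegParam(0.1)

    val model = als.fit(ratings)
    // Top 5 article recommendations for every user.
    model.recommendForAllUsers(5).show(truncate = false)

    spark.stop()
  }
}
```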
How to grow your business in the age of digital marketing 4.0 · Trieu Nguyen
1. The document discusses how businesses can grow in the digital marketing age using technologies like cloud services, big data, AI, and headless CMS platforms.
2. It introduces LeoCloudCMS as a headless API CMS that is built for digital marketing 4.0 and can run scalably on cloud computing.
3. The key idea is to think of your entire business as a "box" and use LeoCloudCMS to attract internet users into the box and offer valuable services.
Video Ecosystem and some ideas about video big data · Trieu Nguyen
Introduction to Video Ecosystem Mind Map
Video Streaming Platform
Video Ad Tech Platform
Video Player Platform
Video Content Distribution Platform
Video Analytics Platform
Summary of key ideas
Q & A
Concepts, use cases and principles to build big data systems (1) · Trieu Nguyen
1) Introduction to the key Big Data concepts
1.1 The Origins of Big Data
1.2 What is Big Data ?
1.3 Why is Big Data So Important ?
1.4 How Is Big Data Used In Practice ?
2) Introduction to the key principles of Big Data Systems
2.1 How to design Data Pipeline in 6 steps
2.2 Using Lambda Architecture for big data processing
3) Practical case study : Chat bot with Video Recommendation Engine
4) FAQ for student
This document discusses open over-the-top (OTT) video content platforms. It defines OTT as streaming media distributed directly over the internet bypassing traditional distribution methods. The document then covers OTT market drivers and business models. It examines the most popular OTT platform in Vietnam and challenges for successful OTT platforms including scalability, content acquisition and management, audience engagement, and business models. Finally, it proposes a modular technical architecture for an open OTT video platform using open source technologies.
Apache Hadoop and Spark: Introduction and Use Cases for Data Analysis · Trieu Nguyen
This document provides an introduction to Apache Hadoop and Spark for data analysis. It discusses the growth of big data from sources like the internet, science, and IoT. Hadoop is introduced as providing scalability on commodity hardware to handle large, diverse data types with fault tolerance. Key Hadoop components are HDFS for storage, MapReduce for processing, and HBase for non-relational databases. Spark is presented as improving on MapReduce by using in-memory computing for iterative jobs like machine learning. Real-world use cases of Spark at companies like Uber, Pinterest, and Netflix are briefly described.
Introduction to Recommendation Systems (Vietnam Web Summit) · Trieu Nguyen
1) Why do we need recommendation systems ?
2) How can we think with recommendation systems ?
3) How can we implement a recommendation system with open source technologies ?
RFX framework https://github.com/rfxlab
Apache Kafka: https://kafka.apache.org
Apache Spark: https://spark.apache.org
Annex K RBF's The World Game pdf document · Steven McGee
Signals & Telemetry Annex K for RBF's The World Game / Trade Federations / USPTO 13/573,002 Heart Beacon Cycle Time - Space Time Chain meters, metrics, standards. Adaptive Procedural template framework structured data derived from DoD / NATO's system of systems engineering tech framework
1. Overview of statistical software such as ODK, SurveyCTO, and CSPro
2. Software installation(for computer, and tablet or mobile devices)
3. Create a data entry application
4. Create the data dictionary
5. Create the data entry forms
6. Enter data
7. Add Edits to the Data Entry Application
8. CAPI questions and texts
Getting Started with Interactive Brokers API and Python.pdf · Riya Sen
In the fast-paced world of finance, automation is key to staying ahead of the curve. Traders and investors are increasingly turning to programming languages like Python to streamline their strategies and enhance their decision-making processes. In this blog post, we will delve into the integration of Python with Interactive Brokers, one of the leading brokerage platforms, and explore how this dynamic duo can revolutionize your trading experience.
The Rise of Python in Finance, Automating Trading Strategies: _.pdf · Riya Sen
In the dynamic realm of finance, where every second counts, the integration of technology has become indispensable. Aspiring traders and seasoned investors alike are turning to coding as a powerful tool to unlock new avenues of financial success. In this blog, we delve into the world of Python live trading strategies, exploring how coding can be the key to navigating the complexities of the market and securing your path to prosperity.
Introduction to Data Science
1.1 What is data science; importance of data science
1.2 Big data and data science; the current scenario
1.3 Industry perspective; types of data: structured vs. unstructured data
1.4 Quantitative vs. categorical data
1.5 Big data vs. little data; the data science process
1.6 Role of the data scientist
DESIGN AND DEVELOPMENT OF AUTO OXYGEN CONCENTRATOR WITH SOS ALERT FOR HIKING ... · JeevanKp7
Long-term oxygen therapy (LTOT) and novel techniques of evaluating treatment efficacy have enhanced the quality of life and decreased healthcare expenses for COPD patients.
Because the cost of a pulmonary blood gas test is comparable to the cost of two days of oxygen therapy, and the cost of a hospital stay is equivalent to the cost of one month of oxygen therapy, long-term oxygen therapy (LTOT) is a cost-effective way of treating this disease.
A small number of clinical investigations on LTOT have shown that it improves the quality of life of COPD patients by reducing the loss of their respiratory capacity. A study of 8487 Danish patients found that LTOT for 1524 hours per day extended life expectancy from 1.07 to 1.40 years.
3. Flurry is a leading mobile advertising and analytics provider (Publisher · Advertiser · Audience)
• AppCircle – Applications: 10,000+; Devices/month: 300M; Conversions/month: 120M
• AppSpot – Applications: 2,500+; Devices/month: 250M; Impressions/month: 7.5B
• Analytics – Applications: 400,000; Devices/month: 1.2B; Data points/month: 1.9T
4. Topics – The Path to Real-Time Processing
• Why Flurry switched from a MapReduce framework to pipeline processing
• How Flurry uses Kafka in data processing
• Tuning Kafka to work in Flurry's environment
• Monitoring and error handling of streams
15. Why Kafka for Flurry
[Chart comparing startup time for device reports: MapReduce (jobs) vs. Kafka]
16. Introducing the Data Log Consumer (DLC)
[Architecture diagram – web layer: Jetty servers receiving binary-encoded data over HTTP, feeding Kafka; metrics processing: the Data Log Consumer and a Metrics Computer on Hadoop/HBase (Hadoop Map/Reduce), with HDFS, an HBase metrics table (cube), HBase normalized data storage, and MySQL user profile data; Agent Portal and Developer Portal on top.]
17. Tuning Kafka for Flurry – Challenges
• Zookeeper timeouts
• Completely async service
• Default fsync interval
• Commit threshold from local environments
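For reference, a hedged sketch of the 0.8-era configuration knobs behind those challenges (the values are placeholders, not Flurry's production settings, and broker options normally live in server.properties rather than code):

```scala
import java.util.Properties

// Illustrative knobs: ZooKeeper timeouts and offset auto-commit on the
// consumer, fsync behaviour on the broker, batching for the async producer.
object KafkaTuningSketch {

  // Consumer side: longer ZooKeeper timeouts to ride out pauses,
  // plus an explicit offset auto-commit interval ("commit threshold").
  val consumerProps: Properties = {
    val p = new Properties()
    p.put("zookeeper.connect", "zk1:2181,zk2:2181,zk3:2181")
    p.put("zookeeper.session.timeout.ms", "30000")
    p.put("zookeeper.connection.timeout.ms", "30000")
    p.put("auto.commit.interval.ms", "10000")
    p
  }

  // Broker side: how often log segments are flushed (fsynced) to disk.
  val brokerOverrides: Properties = {
    val p = new Properties()
    p.put("log.flush.interval.messages", "10000")
    p.put("log.flush.interval.ms", "1000")
    p
  }

  // Producer side: fully asynchronous sends with batching.
  val producerProps: Properties = {
    val p = new Properties()
    p.put("producer.type", "async")
    p.put("queue.buffering.max.ms", "500")
    p.put("batch.num.messages", "200")
    p
  }
}
```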
18. How Flurry Uses Kafka – Infrastructure and Setup
[Diagram: a consumer group of 325 consumers (C1…C325) reading from a Kafka cluster of three brokers (B1–B3) hosting a topic with 400 partitions (P1…P400)]
20. Next Steps: 0.8
[Diagram: Data Log Consumers reading from Kafka and writing to HDFS; a Kafka 0.8 cluster with replication, where Broker 1 holds partitions P0 and P2, Broker 2 holds P1 and P3, and replica partitions (P1', P3', P0', P2') are spread across the brokers]
21. Next Steps: Extended Pipeline
[Diagram: extended pipeline with input data, collectors, consumer/producer systems, a NoSQL datastore, MapReduce (jobs), and external actions, split into real-time and batch paths]
22. Next Steps: Topics and Consumer Groups – Infrastructure and Setup
[Diagram: multiple topics (Topic 1, Topic 2) each consumed by multiple consumer groups – Consumer Group 1 (C1…CN), Consumer Group 2 (C1'…CN'), and Consumer Group N (C1''…CN'')]