1) What is data-driven business?
2) What is Lambda Architecture 2.0, and why?
3) What problems did it solve for us?
4) Workshop with case study:
Building A/B testing tool for digital marketing with Lambda Architecture 2.0
This document discusses recommendations and personalization at Rakuten. It notes that Rakuten has over 100 million users and handles over 40 million item views per day. Recommendation challenges include dealing with different languages, user behaviors, business areas, and aggregating data across services. Rakuten uses a member-based business model that connects its various services through a common Rakuten ID. The document outlines Rakuten's business-to-business-to-consumer model and how recommendations must handle many shops, item references, and a global catalog. It also provides an overview of Rakuten's recommendation system and some of the challenges in generating and ranking recommendation candidates.
The More the Merrier: Scaling Model Building Infrastructure at Zendesk - Databricks
A significant amount of effort is required to transform a machine learning (ML) model into a useful machine learning product. Incorporating ML into real-world applications often feels like "1% algorithm and 99% perspiration". I will share my team's experience in building three ML products at Zendesk, and discuss some real-world problems and scaling complexities you may encounter when building these products at web scale. Close collaboration between product, engineering, and data science groups is imperative to strike the balance between model performance, scalability, and computational efficiency. The talk mainly focuses on scaling our model building infrastructure, with an aim to build at least 50,000 models a day, as part of our effort to deliver an ML product called Content Cues. In a nutshell, Content Cues summarizes text from customer support tickets into insightful topics. It combines multiple ML algorithms, including deep learning, clustering, and other natural language processing approaches, which are run over data from tens of thousands of eligible Zendesk customers every day. My talk will cover the following topics:
- How we implement a horizontally scalable model building and model serving pipeline by combining AWS EMR, AWS Batch, and Kubernetes
- How we tune the model building pipeline to optimize cost and efficiency without compromising resiliency
- Challenges in model monitoring, model versioning evolution, and capturing user feedback
Speaker: Wai Chee Yau
Data Apps with the Lambda Architecture - with Real Work Examples on Merging B... - Altan Khendup
The document discusses the Lambda architecture, which provides a common pattern for integrating real-time and batch processing systems. It describes the key components of Lambda - the batch layer, speed layer, and serving layer. The challenges of implementing Lambda are that it requires multiple systems and technologies to be coordinated. Real-world examples are needed to help practical application. The document also provides examples of medical and customer analytics use cases that could benefit from a Lambda approach.
Lambda architecture for real time big data - Trieu Nguyen
- The document discusses the Lambda Architecture, a system designed by Nathan Marz for building real-time big data applications. It is based on three principles: human fault-tolerance, data immutability, and recomputation.
- The document provides two case studies of applying Lambda Architecture - at Greengar Studios for API monitoring and statistics, and at eClick for real-time data analytics on streaming user event data.
- Key lessons discussed are keeping solutions simple, asking the right questions to enable deep analytics and profit, using reactive and functional approaches, and turning data into useful insights.
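The immutability and recomputation principles mentioned above can be illustrated with a minimal in-memory sketch (hypothetical names, not code from either case study): raw events are only ever appended, and any derived view is recomputed from the full log, so a bug in the view logic can always be fixed by correcting the function and recomputing.

```python
from collections import defaultdict

# Append-only "master dataset": raw events are never updated or deleted.
event_log = []

def record_event(user_id, action):
    """Append an immutable fact to the log."""
    event_log.append({"user": user_id, "action": action})

def recompute_view(log):
    """Derive a view (event counts per user) from scratch.
    Because the raw log is immutable, a buggy view can always be
    repaired by fixing this function and recomputing (human
    fault-tolerance in Marz's sense)."""
    view = defaultdict(int)
    for event in log:
        view[event["user"]] += 1
    return dict(view)

record_event("u1", "click")
record_event("u1", "view")
record_event("u2", "click")
print(recompute_view(event_log))  # {'u1': 2, 'u2': 1}
```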
How to design and implement a DataOps architecture with SDC and GCP - Joseph Arriola
Do you know how to use StreamSets Data Collector with Google Cloud Platform (GCP)? In this session we'll explain how YaloChat designed and implemented a streaming architecture that is sustainable, operable and scalable. Discover how we deployed Data Collector to integrate GCP components such as Pub / Sub and BigQuery to achieve DataOps in the cloud
"Lessons learned using Apache Spark for self-service data prep in SaaS world" - Pavel Hardak
This presentation discusses Workday's use of Apache Spark for self-service data preparation and analytics within its SaaS platform. It covers Workday's unified analytics platform powered by Spark, how Prism uses Spark for interactive data prep and publishing, and lessons learned in areas like nested SQL optimization, plan deduplication, broadcast join tuning, and case-insensitive string grouping. The presentation aims to share Workday's production experiences leveraging Spark for analytics in a multi-tenant SaaS environment.
Big data real-time architectures -
How do we do big data processing in real time?
What architectures are out there to support this paradigm?
Which one should we choose?
What advantages and pitfalls do they contain?
RFX - Full-Stack Technology for Real-time Big Data - Trieu Nguyen
RFX is a full-stack technology framework for real-time big data processing that was created in 2013 and is used by FPT for analytics tasks on websites like Vnexpress.net and eclick.vn. It is built from open source projects like Akka, Netty, Kafka, Spark, Redis and uses a reactive programming approach to optimize user experience through real-time data processing and business logic. RFX aims to provide a fast data intelligence platform for solving problems like analytics, user segmentation, and automatic optimization of user experiences.
The Lyft data platform: Now and in the future - markgrover
- Lyft has grown significantly in recent years, providing over 1 billion rides to 30.7 million riders through 1.9 million drivers in 2018 across North America.
- Data is core to Lyft's business decisions, from pricing and driver matching to analyzing performance and informing investments.
- Lyft's data platform supports data scientists, analysts, engineers and others through tools like Apache Superset, change data capture from operational stores, and streaming frameworks.
- Key focuses for the platform include business metric observability, streaming applications, and machine learning while addressing challenges of reliability, integration and scale.
This document discusses an approach to enterprise metadata integration using a multilayer metadata model. Key points include:
- Status dashboards provide facts from technical, operational, application, and quality metadata layers
- A graph database allows for context exploration across the entire cluster
- The integration of metadata from multiple sources provides a more holistic view of business knowledge
2016 Tableau in the Cloud - A Netflix Original (AWS Re:invent) - Albert Wong
Building a data platform doesn’t have to be like entering a portal to Stranger Things.
Join us in one hour for Tableau in the Cloud: A Netflix Original where Albert Wong, Netflix’s analytics expert, will show you how to simplify your data stack to deliver self-service analytics at scale.
Albert will discuss the details of connecting to big data, finding datasets, and discovering critical insights from visualizations. He will also share how Netflix is developing and growing their analytics ecosystem with Tableau, and how they prioritize sustaining their data culture of freedom and responsibility.
The document discusses designing scalable platforms for artificial intelligence (AI) and machine learning (ML). It outlines several challenges in developing AI applications, including technical debt, unpredictability, and different data and compute needs compared to traditional software. It then reviews existing commercial AI platforms and the common components of AI platforms, including data access, ML workflows, computing infrastructure, model management, and APIs. The rest of the document focuses on eBay's Krylov project as an example AI platform, outlining its architecture, the challenges of deploying platforms at scale, and the skill sets needed on the platform team.
At Netflix, we've spent a lot of time thinking about how we can make our analytics group move quickly. Netflix's Data Engineering & Analytics organization embraces the company's culture of "Freedom & Responsibility".
How does a company with a $40 billion market cap and $6 billion in annual revenue keep their data teams moving with the agility of a tiny company?
How do hundreds of data engineers and scientists make the best decisions for their projects independently, without the analytics environment devolving into chaos?
We'll talk about how Netflix equips its business intelligence and data engineers with:
- the freedom to leverage cloud-based data tools - Spark, Presto, Redshift, Tableau and others - in ways that solve our most difficult data problems
- the freedom to find and introduce the right software for the job - even if it isn't used anywhere else in-house
- the freedom to create and drop new tables in production without approval
- the freedom to choose when a question is a one-off, and when a question is asked often enough to require a self-service tool
- the freedom to retire analytics and data processes whose value doesn't justify their support costs
Speaker Bios
Monisha Kanoth is a Senior Data Architect at Netflix, and was one of the founding members of the current streaming Content Analytics team. She previously worked as a big data lead at Convertro (acquired by AOL) and as a data warehouse lead at MySpace.
Jason Flittner is a Senior Business Intelligence Engineer at Netflix, focusing on data transformation, analysis, and visualization as part of the Content Data Engineering & Analytics team. He previously led the EC2 Business Intelligence team at Amazon Web Services and was a business intelligence engineer with Cisco.
Chris Stephens is a Senior Data Engineer at Netflix. He previously served as the CTO at Deep 6 Analytics, a machine learning & content analytics company in Los Angeles, and on the data warehouse teams at the FOX Audience Network and Anheuser-Busch.
Hadoop can enable zero downtime app deployments by using microservices, continuous delivery, and real-time analytics. The presenters describe how Expedia saves $5M annually through zero downtime deployments. Their architecture uses microservices, continuous integration, deployment monitoring with Storm/Kafka/HDFS, and analytics in Solr/Hive to enable canary testing, fast feedback, and automated problem resolution. A live demo shows log processing, analytics, and using results to ensure smooth, high-quality deployments.
Lambda Architecture 2.0 Convergence between Real-Time Analytics, Context-awar... - Sabri Skhiri
At Huawei, we have developed a scalable Complex Event Processing engine with significantly improved expressiveness. In the scope of "context-aware" distributed systems, we need to define new architecture patterns; in this way, we open the door to new features and capabilities.
Bighead: Airbnb’s End-to-End Machine Learning Platform with Krishna Puttaswa... - Databricks
Bighead is Airbnb's machine learning infrastructure that was created to:
- Standardize and simplify the ML development workflow;
- Reduce the time and effort to build ML models from weeks/months to days/weeks; and
- Enable more teams at Airbnb to utilize ML.
It provides shared services and tools for data management, model training/inference, and model management to make the ML process more efficient and production-ready. This includes services like Zipline for feature storage, Redspot for notebook environments, Deep Thought for online inference, and the Bighead UI for model monitoring.
Applied Machine Learning for Ranking Products in an Ecommerce Setting - Databricks
As a leading e-commerce company in fashion in the Netherlands, Wehkamp dedicates itself to providing a better shopping experience for its customers. Using Spark, the data science team is able to develop various machine-learning projects for this purpose based on large-scale data about products and customers. A major topic for the data science team is ranking products: if a visitor enters a search phrase, what are the best products that fit the search phrase, and in what order should the products be shown? Ranking products is also important when a visitor enters a product overview page, where hundreds or even thousands of products of a certain article type are displayed.
In this project, Spark is used in the whole pipeline: retrieving and processing the search phrases and their results, making click models, creating feature sets, training and evaluating ranking models, pushing the models to production using ElasticSearch, and creating Tableau dashboards. In this talk, we are going to demonstrate how we use Spark to build up the whole pipeline for ranking products, and the challenges we faced along the way.
Big Data and Fast Data - Lambda Architecture in Action - Guido Schmutz
Big Data (volume) and real-time information processing (velocity) are two important aspects of Big Data systems. At first sight, these two aspects seem to be incompatible. Are traditional software architectures still the right choice? Do we need new, revolutionary architectures to tackle the requirements of Big Data?
This presentation discusses the idea of the so-called lambda architecture for Big Data, which is based on splitting data processing in two: in a batch phase, a large, temporally bounded dataset is processed, either through traditional ETL or MapReduce. In parallel, a real-time online process constantly computes values for the new data arriving during the batch phase. Combining the two results, batch and online, gives a constantly up-to-date view.
This talk presents how such an architecture can be implemented using Oracle products such as Oracle NoSQL, Hadoop and Oracle Event Processing as well as some selected products from the Open Source Software community. While this session mostly focuses on the software architecture of BigData and FastData systems, some lessons learned in the implementation of such a system are presented as well.
The presentation covers the lambda architecture and its implementation with Spark. We will discuss the components of the lambda architecture (the batch layer, speed layer, and serving layer) and its advantages and benefits when implemented with Spark.
A real time architecture using Hadoop and Storm @ FOSDEM 2013 - Nathan Bijnens
The document discusses a real-time architecture using Hadoop and Storm. It describes a layered architecture with a batch layer using Hadoop to store all data, a speed layer using Storm for stream processing of recent data, and a serving layer that merges views from the batch and speed layers. The batch layer generates immutable views from raw data, while the speed layer maintains incremental real-time views over a limited window. This architecture allows queries to be served with an eventual consistency guarantee.
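The serving-layer merge described above can be sketched in a few lines (hypothetical data, a minimal conceptual sketch rather than any of the systems in these talks): the batch view is complete but stale, the speed view covers only recent data, and a query sums the two.

```python
# Batch view: complete but stale counts, recomputed periodically
# from the full raw dataset (up to some time T).
batch_view = {"page_a": 1000, "page_b": 500}

# Speed view: incremental counts for events that arrived after T,
# maintained over a limited window.
speed_view = {"page_a": 12, "page_c": 3}

def serve_query(key):
    """Serving layer: merge the batch and real-time views.
    When the next batch run completes, the speed view for the
    covered window is discarded - hence eventual consistency."""
    return batch_view.get(key, 0) + speed_view.get(key, 0)

print(serve_query("page_a"))  # 1012
```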
Apache Kafka is a distributed publish-subscribe messaging system that was originally created by LinkedIn and contributed to the Apache Software Foundation. It is written in Scala and provides a multi-language API to publish and consume streams of records. Kafka is useful for both log aggregation and real-time messaging due to its high performance, scalability, and ability to serve as both a distributed messaging system and log storage system with a single unified architecture. To use Kafka, one runs Zookeeper for coordination, Kafka brokers to form a cluster, and then publishes and consumes messages with a producer API and consumer API.
Implementing the Lambda Architecture efficiently with Apache Spark - DataWorks Summit
This document discusses implementing the Lambda Architecture efficiently using Apache Spark. It provides an overview of the Lambda Architecture concept, which aims to provide low latency querying while supporting batch updates. The Lambda Architecture separates processing into batch and speed layers, with a serving layer that merges the results. Apache Spark is presented as an efficient way to implement the Lambda Architecture due to its unified processing engine, support for streaming and batch data, and ability to easily scale out. The document recommends resources for learning more about Spark and the Lambda Architecture.
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013 - mumrah
Apache Kafka is a distributed publish-subscribe messaging system that allows both publishing and subscribing to streams of records. It uses a distributed commit log that provides low latency and high throughput for handling real-time data feeds. Key features include persistence, replication, partitioning, and clustering.
The document provides an introduction and overview of Apache Kafka presented by Jeff Holoman. It begins with an agenda and background on the presenter. It then covers basic Kafka concepts like topics, partitions, producers, consumers and consumer groups. It discusses efficiency and delivery guarantees. Finally, it presents some use cases for Kafka and positioning around when it may or may not be a good fit compared to other technologies.
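The core Kafka concepts covered above (topics, partitions, offsets, per-key ordering) can be modeled with a toy in-memory class. This is a conceptual sketch only, not the real Kafka API; the class and method names are made up for illustration.

```python
class Topic:
    """Toy model of a Kafka topic: a set of partitions, each an
    append-only log addressed by offset."""
    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        # Keyed messages hash to a fixed partition, which preserves
        # per-key ordering (mirroring Kafka's default partitioner).
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def consume(self, partition, offset):
        # Consumers track their own offsets and can re-read at will;
        # the broker does not delete messages on consumption.
        return self.partitions[partition][offset:]

topic = Topic(num_partitions=3)
p, off = topic.produce("user-1", "clicked")
topic.produce("user-1", "purchased")
print(topic.consume(p, 0))  # both messages, in order
```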
Talks@Coursera - A/B Testing @ Internet Scale - courseratalks
This document discusses A/B testing at large internet companies. It describes how companies like Amazon, Microsoft, Google, and LinkedIn use A/B testing to evaluate new ideas, measure their impact, and gain customer feedback. It outlines best practices for A/B testing, such as running one experiment at a time, choosing appropriate metrics and statistical significance, properly powering experiments, and addressing issues like multiple testing. The document also describes the key components of a scalable A/B testing system, including experiment management, online infrastructure for traffic routing and data logging, and automated offline analysis.
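The statistical-significance step mentioned above is typically a two-proportion z-test; a textbook version in plain Python follows (standard statistics, not the system described in the talk; the numbers are made up).

```python
from math import sqrt, erf

def ab_test_z(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test for an A/B experiment:
    conv_* are conversion counts, n_* are sample sizes."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

z, p = ab_test_z(conv_a=200, n_a=10_000, conv_b=260, n_b=10_000)
print(f"z={z:.2f}, p={p:.4f}")  # significant at the usual 0.05 level
```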
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S... - Helena Edelson
Regardless of the meaning we are searching for over our vast amounts of data, whether we are in science, finance, technology, energy, health care…, we all share the same problems that must be solved: How do we achieve that? What technologies best support the requirements? This talk is about how to leverage fast access to historical data with real time streaming data for predictive modeling for lambda architecture with Spark Streaming, Kafka, Cassandra, Akka and Scala. Efficient Stream Computation, Composable Data Pipelines, Data Locality, Cassandra data model and low latency, Kafka producers and HTTP endpoints as akka actors...
Apache Spark is an in-memory data processing solution that can work with existing data sources like HDFS and can make use of your existing computation infrastructure, such as YARN or Mesos. This talk covers a basic introduction to Apache Spark and its various components, such as MLlib, Shark, and GraphX, with a few examples.
Apache Kafka 0.8 basic training - Verisign - Michael Noll
Apache Kafka 0.8 basic training (120 slides) covering:
1. Introducing Kafka: history, Kafka at LinkedIn, Kafka adoption in the industry, why Kafka
2. Kafka core concepts: topics, partitions, replicas, producers, consumers, brokers
3. Operating Kafka: architecture, hardware specs, deploying, monitoring, P&S tuning
4. Developing Kafka apps: writing to Kafka, reading from Kafka, testing, serialization, compression, example apps
5. Playing with Kafka using Wirbelsturm
Audience: developers, operations, architects
Created by Michael G. Noll, Data Architect, Verisign, https://www.verisigninc.com/
Verisign is a global leader in domain names and internet security.
Tools mentioned:
- Wirbelsturm (https://github.com/miguno/wirbelsturm)
- kafka-storm-starter (https://github.com/miguno/kafka-storm-starter)
Blog post at:
http://www.michael-noll.com/blog/2014/08/18/apache-kafka-training-deck-and-tutorial/
Many thanks to the LinkedIn Engineering team (the creators of Kafka) and the Apache Kafka open source community!
What happens when you start transitioning from a monolithic PHP app to Go services running on AWS Lambda? Good things! I'd like to share the problems encountered, decisions made and lessons learned along the way.
Data Engineer's Lunch 90: Migrating SQL Data with Arcion - Anant Corporation
In Data Engineer's Lunch 90, Eric Ramseur teaches our audience how to use Arcion.
From best practices to real-world examples, this talk will provide you with the knowledge and insights you need to ensure a successful migration of your SQL data. So whether you're new to data migration or looking to improve your existing process, join us and discover how Arcion can help you achieve your goals.
Real-time serverless analytics at Shedd – OLX data summit, Mar 2018, Barcelona - Dobo Radichkov
OLX Group presentation on real-time serverless analytics at the 2018 OLX internal data summit in Barcelona.
The presentation focuses on best practices in real-time data applications, including AWS technologies such as Kinesis, Lambda (with serverless framework) and ElastiCache.
Presentation examines case study of real-time product recommendations built on top of serverless architecture.
Big Data in the Cloud - Montreal April 2015 - Cindy Gross
slides:
Basic Big Data and Hadoop terminology
What projects fit well with Hadoop
Why Hadoop in the cloud is so Powerful
Sample end-to-end architecture
See: Data, Hadoop, Hive, Analytics, BI
Do: Data, Hadoop, Hive, Analytics, BI
How this tech solves your business problems
RedisGraph A Low Latency Graph DB: Pieter Cailliau - Redis Labs
This document summarizes a presentation about RedisGraph, a graph database that runs on Redis. The presentation discusses RedisGraph's capabilities, use cases where graph databases are useful, and what new features are upcoming for RedisGraph. Specific points mentioned include RedisGraph's support for the Cypher query language, improvements in performance and functionality since its general availability, and how the graph database can power features for IBM's Multicloud Manager product.
This document discusses moving machine learning models from prototype to production. It outlines some common problems with the current workflow where moving to production often requires redevelopment from scratch. Some proposed solutions include using notebooks as APIs and developing analytics that are accessed via an API. It also discusses different data science platforms and architectures for building end-to-end machine learning systems, focusing on flexibility, security, testing and scalability for production environments. The document recommends a custom backend integrated with Spark via APIs as the best approach for the current project.
This document provides an overview of Big Data and Hadoop concepts, architectures, and hands-on demonstrations using Microsoft Azure HDInsight. It begins with definitions of Big Data and Hadoop, then demonstrates sample end-to-end architectures using Azure services. Hands-on labs explore creating storage, streaming jobs, and querying data using HDInsight. The document emphasizes that Hadoop is well-suited for large-scale data exploration and analytics on unknown datasets. It shows how running Hadoop on Azure provides elasticity, low costs, and easier management compared to on-premises Hadoop deployments.
Get more than a cache back! The Microsoft Azure Redis Cache (NDC Oslo) - Maarten Balliauw
The document discusses Azure Cache and Redis. It provides an overview of Redis, including its data types, transactions, pub/sub capabilities, scripting, and sharding/partitioning. It then discusses common patterns for using Redis, such as caching, counting likes on Facebook, getting the latest reviews, rate limiting, and autocompletion. The document emphasizes that Redis is very flexible and can be used for more than just caching, acting as a general datastore. It concludes by recommending a Redis reference book for further learning.
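The rate-limiting pattern mentioned above is classically built on Redis with INCR plus EXPIRE on a per-window key. A minimal in-memory sketch of that fixed-window scheme follows (illustrative only; in Redis the counter dictionary would be INCR/EXPIRE calls on keys like `rate:{user}:{window}`).

```python
import time

class FixedWindowRateLimiter:
    """In-memory sketch of the classic Redis fixed-window
    rate-limiting pattern: one counter per (user, time window)."""
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.counters = {}

    def allow(self, user, now=None):
        now = time.time() if now is None else now
        # One counter per window; in Redis this key would expire
        # after the window elapses.
        key = (user, int(now // self.window))
        self.counters[key] = self.counters.get(key, 0) + 1
        return self.counters[key] <= self.limit

rl = FixedWindowRateLimiter(limit=3, window_seconds=60)
results = [rl.allow("alice", now=100) for _ in range(4)]
print(results)  # [True, True, True, False]
```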
Best Practices for Building and Deploying Data Pipelines in Apache Spark - Databricks
Many data pipelines share common characteristics and are often built in similar but bespoke ways, even within a single organisation. In this talk, we will outline the key considerations which need to be applied when building data pipelines, such as performance, idempotency, reproducibility, and tackling the small file problem. We’ll work towards describing a common Data Engineering toolkit which separates these concerns from business logic code, allowing non-Data-Engineers (e.g. Business Analysts and Data Scientists) to define data pipelines without worrying about the nitty-gritty production considerations.
We’ll then introduce an implementation of such a toolkit in the form of Waimak, our open-source library for Apache Spark (https://github.com/CoxAutomotiveDataSolutions/waimak), which has massively shortened our route from prototype to production. Finally, we’ll define new approaches and best practices about what we believe is the most overlooked aspect of Data Engineering: deploying data pipelines.
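One of the concerns named above, idempotency, usually means that re-running a pipeline step leaves the store in the same state instead of appending duplicates. A common way to get this is to overwrite whole partitions; a toy sketch (a hypothetical helper, not Waimak's API):

```python
def write_partition(store, partition_key, rows):
    """Idempotent write: replace the entire partition rather than
    appending, so a retry after a failure produces the same state."""
    store[partition_key] = list(rows)

store = {}
day = "2019-01-01"
write_partition(store, day, [{"id": 1}, {"id": 2}])
write_partition(store, day, [{"id": 1}, {"id": 2}])  # retried run
print(len(store[day]))  # 2 - no duplicates on re-run
```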
Building NLP applications with Transformers - Julien SIMON
The document discusses how transformer models and transfer learning (Deep Learning 2.0) have improved natural language processing by allowing researchers to easily apply pre-trained models to new tasks with limited data. It presents examples of how HuggingFace has used transformer models for tasks like translation and part-of-speech tagging. The document also discusses tools from HuggingFace that make it easier to train models on hardware accelerators and deploy them to production.
1) Learn about Myplanet's Headless CMS solution using Gatsby Preview and Contentful’s UI Extensions (https://www.contentful.com/resources/serverless/)
2) their Serverless project with IBM - using Apache OpenWhisk (https://www.ibm.com/cloud/functions)
3) how Myplanet got involved with AWS DeepRacer - a fun way to get started with Reinforcement Learning (RL), and their racing experience at re:Invent DeepRacer League (https://reinvent.awsevents.com/learn/deepracer/)
4) their Machine Learning (ML) research related to finding DeepRacer’s ideal line (https://medium.com/myplanet-musings/the-best-path-a-deepracer-can-learn-2a468a3f6d64).
BONUS: Two TED Talks referenced in the intro
5) When ideas have sex | Matt Ridley | Jul 14, 2010 https://www.ted.com/talks/matt_ridley_when_ideas_have_sex
6) Why The Best Leaders Make Love The Top Priority | Matt Tenney | Dec 5, 2019 https://www.youtube.com/watch?v=qCVoohdyI6I
VIDEO: https://youtu.be/ZH1xxmBNx5k
Building your data driven business with Reactive Marketing Technology - Trieu Nguyen
The document discusses data-driven business and reactive marketing technology. It begins with key questions about data-driven business and the benefits of analytics, and introduces the "9D" model for big data business. Tools for building reactive marketing technology are presented, including Apache Storm, Apache Kafka, Apache Spark, and the Hadoop ecosystem. A case study demonstrates how to build digital marketing software using open source big data tools. The philosophy and a lightweight lambda architecture for building a reactive system are described.
"What is serverless architecture, and how do you live with it?" Nikolay Markov, Aligned ... - it-people
The document discusses what serverless computing is and how it can be used for building applications. Serverless applications rely on third party services to manage server infrastructure and are event-triggered. Popular serverless frameworks like AWS Lambda, Google Cloud Functions, Microsoft Azure Functions, and Zappa allow developers to write code that runs in a serverless environment and handle events and triggers without having to manage servers.
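The event-triggered model described above boils down to writing a handler that is a pure function of the triggering event, with no server state to manage. A minimal AWS Lambda-style handler in Python (the event shape below mimics an API Gateway proxy request and is illustrative only):

```python
import json

def handler(event, context):
    """Minimal serverless handler: receives an event, returns a
    response; the platform manages all server infrastructure."""
    params = event.get("queryStringParameters") or {}
    name = params.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"hello, {name}"}),
    }

# Locally, the handler is just a function call with a fake event.
resp = handler({"queryStringParameters": {"name": "serverless"}}, None)
print(resp["body"])
```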
All in AI: LLM Landscape & RAG in 2024 with Mark Ryan (Google) & Jerry Liu (L... - Daniel Zivkovic
Serverless Toronto's 6th-anniversary event helps IT pros understand and prepare for the #GenAI tsunami ahead. You'll gain situational awareness of the LLM Landscape, receive condensed insights, and actionable advice about RAG in 2024 from Google AI Lead Mark Ryan and LlamaIndex creator Jerry Liu. We chose #RAG (Retrieval-Augmented Generation) because it is the predominant paradigm for building #LLM (Large Language Model) applications in enterprises today - and that's where the jobs will be shifting. Here is the recording: https://youtu.be/P5xd1ZjD-Os?si=iq8xibj5pJsJ62oW
The large O’Reilly survey on serverless adoption indicated that the majority of enterprises have not yet adopted serverless. They have cited the following concerns as main factors: security, the steep learning curve, vendor lock-in, integration/debugging and observability of serverless applications.
In this talk, I will share my views on these concerns and present how Waylay IO has addressed these challenges. Waylay IO’s mission is to finally unlock all promised benefits of serverless computation, with an intuitive and developer-friendly low-code platform.
DataTalks #4: A minimal toolset for building your own r... system - WG_ Events
The question of personalizing system behavior for each user grows more pressing every day. In this talk, Alexey reviews a set of tools with which you can build your own recommendation service with minimal time investment.
Alexey will cover not only the theory but also give practical advice on how to build a prototype without pulling developers into building a complete pipeline, with just one or two data scientists who can formulate the model idea and one Java/Scala developer capable of translating the model into code.
The talk will be useful for technical specialists responding to requests like: "we want to start with at least something, but don't know where to begin."
Interested in data analysis? Join our group on Facebook: https://www.facebook.com/groups/DataTalks/
Real time data viz with Spark Streaming, Kafka and D3.js - Ben Laird
This document discusses building a dynamic visualization of large streaming transaction data. It proposes using Apache Kafka to handle the transaction stream, Apache Spark Streaming to process and aggregate the data, MongoDB for intermediate storage, a Node.js server, and Socket.io for real-time updates. Visualization would use Crossfilter, DC.js and D3.js to enable interactive exploration of billions of records in the browser.
BigData Meets the Federal Data Center - an overview of nosql solutions to data challenges (e.g. Hadoop, Hbase, Mongodb, cassandra, redis etc). Also includes a vignette on Google Prediction API.
Machine learning applications are typically stitched together from hopes and dreams, shell scripts, cron jobs, home-grown schedulers, snippets of configuration clipped from multiple blog posts, thousands of hard-coded business rules, a.k.a. "our SQL corpus," and a few lines of training and testing code. Organizing all the moving parts into something maintainable and supportive of ongoing development is a challenge most teams have on their TODO list, roadmap, or tech debt pile. Getting ahead of the day-to-day demands and settling into a sane architecture often seems like an unattainable goal. The past several years have seen an explosion of tool-building in the data engineering and analytics area, including in Apache projects spanning the areas of search and information retrieval, job orchestration, file and stream formats, and machine learning libraries. In this talk we will cover our product and development teams' choices of architecture and tools, from data ingestion and storage, through transformations and processing, to presentation of results and publishing to web services, reports, and applications.
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion... - Codemotion
Once you start working with Big Data systems, you discover a whole bunch of problems you won't find in monolithic systems. Monitoring all of the components becomes a big data problem itself. In the talk, we'll cover the aspects you should take into consideration when monitoring a distributed system built with tools like web services, Spark, Cassandra, MongoDB, and AWS. Beyond the tools, what should you monitor about the actual data that flows through the system? We'll cover the simplest solution using your day-to-day open source tools, and the surprising thing is that it comes not from an Ops guy.
Lambda Architecture 2.0 for Reactive AB Testing
1. Lambda Architecture 2.0 for Data-Driven Business
Team:
Trieu Nguyen - http://nguyentantrieu.info
Truc Le - https://www.linkedin.com/pub/le-kien-truc/31/379/938
Data-driven + Lambda Architecture = growing business
mc2ads.com - Fast Data Labs
2. Key questions for us today
1. What if the business is not driven by data?
2. What and why is Lambda Architecture?
3. What problems did it solve for us?
Workshop with case study: Improving “Flappy bird” with an A/B Testing Tool and Lambda Architecture 2.0
3. Red bird vs. blue bird
Which bird will let you down sooner?
OK, let's play the game! Design it better with data.
9. Why Lambda Architecture 2.0?
It helps organize your data infrastructure into an understandable structure and lets you react quickly to context changes.
10. “Vision Without Execution Is Just Hallucination”
OK, cool ideas, but how do we build it? We are here.
11. Our goals
1. Understand the big picture
2. See the reality
3. Take action to make it happen
OK! Let's turn “Flappy bird” into “Happy bird”!
12. What is Lambda Architecture 2.0?
It's the architecture for data-driven business:
● for reacting to fast data
● for data mining and machine learning on Big Data
● for observable data
● for SQL querying (SQL is the true lambda language!?)
13. Case study: Improving “Flappy bird” with an A/B Testing Tool and Lambda Architecture 2.0
● A short introduction to A/B testing
● Set up a full open-source technology stack
● Run example code with Java and Python
16. How? One basic principle is “test our theory”: from the observable solutions, test them all to find the best one! More at http://en.wikipedia.org/wiki/A/B_testing
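Picking the best variant usually comes down to a standard two-proportion z-test on conversion (or retention) counts. Below is a minimal, self-contained Python sketch of that test; it illustrates the statistics only, not the Abba framework's API, and the numbers are invented:

```python
import math

def z_test_two_proportions(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test: is variant B's conversion rate
    significantly different from variant A's?"""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled proportion under the null hypothesis (no real difference)
    p = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Example: red bird vs. blue bird retention after one session
z, p = z_test_two_proportions(conv_a=120, n_a=1000, conv_b=150, n_b=1000)
print(round(z, 2), round(p, 4))
```

With 12% vs. 15% over 1,000 players each, z comes out around 1.96 and p around 0.05, right at the usual significance threshold, which is why sample size in step 5 below matters.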
17. Steps
1. Work with the A/B testing tool (using the Abba framework)
2. Let's play Flappy Bird 2.0!
3. Collect data → store it as a stream (Kafka)
4. Stream processing → real-time view processing (RFX)
5. Batch processing → sampling the A/B test (Spark)
6. Query processing → finding facts from the experiment (SQL over Phoenix / HBase)
7. Collect feedback data → Game Design Report
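Step 1 needs each player pinned to a variant. One common technique (an option for illustration, not necessarily how Abba assigns users) is deterministic hash bucketing: a returning player always lands in the same bucket, with no assignment table to store. The function and experiment names here are hypothetical:

```python
import hashlib

def assign_variant(user_id, experiment, variants=("red_bird", "blue_bird")):
    """Deterministically map a user to a variant: hashing the
    (experiment, user) pair gives a stable, roughly uniform bucket."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    bucket = int(hashlib.md5(key).hexdigest(), 16) % len(variants)
    return variants[bucket]

# A returning player gets the same bird across sessions
v1 = assign_variant("player-42", "flappy-color-test")
v2 = assign_variant("player-42", "flappy-color-test")
assert v1 == v2
print(v1)
```

Keying the hash on the experiment name as well as the user ID keeps assignments independent across experiments, so one test does not bias the next.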
18. For a simple demo, we use Abba, a simple self-hosted A/B testing framework.
19. Why a reactive view in Lambda Architecture 2.0?
UX is the key to successful product development, so we must react to bad UX quickly (with data).
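To make "react to bad UX quickly" concrete, here is a toy push-based stream in plain Python: subscribers react the moment each observation arrives. This sketches the idea behind Rx-style reactive views, not the actual RxJava/RxJS API, and the crash-rate threshold is made up:

```python
class MetricStream:
    """A tiny push-based stream: each emitted value is pushed to all
    subscribers immediately (the idea behind Rx, not its API)."""
    def __init__(self):
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def emit(self, value):
        for cb in self.subscribers:
            cb(value)

alerts = []
crash_rate = MetricStream()
# React immediately when the observed crash rate crosses a threshold
crash_rate.subscribe(lambda r: alerts.append(r) if r > 0.05 else None)

for rate in (0.01, 0.02, 0.09, 0.03):
    crash_rate.emit(rate)

print(alerts)  # → [0.09]: only the bad UX reading triggered an alert
```

The same push-based shape is what the real-time view in step 4 gives you: instead of polling a batch report, the experiment dashboard is notified as soon as a variant starts hurting UX.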
20. Technology stack (5D model)
1) Data collector (I/O networking)
● Netty for the event log collector and HTTP server (lambda2)
2) Data persistence (aka data storage)
● Kafka for distributed message storage (Apache Kafka)
● HBase for a scalable big table
3) Data processing
● RFX for fast data processing (RFX framework)
● Python for data sampling in A/B test experiments
● Rx(Java/JS) for reacting to data experiments (reactivex)
4) Data analysis
● measures of uncertainty (Python, Dempster-Shafer theory)
5) Data ad-hoc reporting
● SQL over Phoenix / HBase (http://phoenix.apache.org)
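The Dempster-Shafer step in 4) can be sketched in a few lines of Python. Below is Dempster's rule of combination over a tiny frame of discernment ({red, blue}); the evidence masses are invented for illustration, and real experiments would derive them from observed data:

```python
def combine(m1, m2):
    """Dempster's rule of combination. Each mass function is a dict
    mapping a frozenset of hypotheses to its belief mass."""
    combined = {}
    conflict = 0.0
    for b, mb in m1.items():
        for c, mc in m2.items():
            inter = b & c
            if inter:
                combined[inter] = combined.get(inter, 0.0) + mb * mc
            else:
                conflict += mb * mc  # mass that lands on the empty set
    if conflict >= 1.0:
        raise ValueError("total conflict: sources are incompatible")
    # Renormalize by the non-conflicting mass
    return {k: v / (1.0 - conflict) for k, v in combined.items()}

# Two sources of evidence about which variant is better (illustrative)
RED, BLUE = frozenset({"red"}), frozenset({"blue"})
EITHER = RED | BLUE
m1 = {RED: 0.6, EITHER: 0.4}             # source 1 leans red
m2 = {RED: 0.3, BLUE: 0.5, EITHER: 0.2}  # source 2 leans blue
m = combine(m1, m2)
print({tuple(sorted(k)): round(v, 3) for k, v in m.items()})
```

Unlike a single p-value, the combined mass keeps an explicit "either could be better" share (the mass on the full frame), which is a useful way to report uncertainty back into the Game Design Report.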