1) What is data-driven business?
2) What is Lambda Architecture 2.0, and why?
3) What problems did it solve for us?
4) Workshop with case study:
Building A/B testing tool for digital marketing with Lambda Architecture 2.0
This document discusses recommendations and personalization at Rakuten. It notes that Rakuten has over 100 million users and handles over 40 million item views per day. Recommendation challenges include dealing with different languages, user behaviors, business areas, and aggregating data across services. Rakuten uses a member-based business model that connects its various services through a common Rakuten ID. The document outlines Rakuten's business-to-business-to-consumer model and how recommendations must handle many shops, item references, and a global catalog. It also provides an overview of Rakuten's recommendation system and some of the challenges in generating and ranking recommendation candidates.
The More the Merrier: Scaling Model Building Infrastructure at Zendesk - Databricks
A significant amount of effort is required to transform a machine learning (ML) model into a useful machine learning product. Incorporating ML into real-world applications often feels like "1% algorithm and 99% perspiration". I will share my team's experience in building three ML products at Zendesk, and discuss some real-world problems and scaling complexities you may encounter when building these products at web scale. Close collaboration between product, engineering, and data science groups is imperative to strike the balance between model performance, scalability, and computational efficiency. The talk mainly focuses on scaling our model building infrastructure, with an aim to build at least 50,000 models a day, as part of our effort to deliver an ML product called Content Cues. In a nutshell, Content Cues summarizes text from customer support tickets into insightful topics. It combines multiple ML algorithms, including deep learning, clustering, and other natural language processing approaches, which are run over data from tens of thousands of eligible Zendesk customers every day. My talk will cover the following topics:
- How we implement a horizontally scalable model building and model serving pipeline by combining AWS EMR, AWS Batch, and Kubernetes
- How we tune the model building pipeline to optimize cost and efficiency without compromising resiliency
- Challenges in model monitoring, model versioning evolution, and capturing user feedback
Speaker: Wai Chee Yau
Data Apps with the Lambda Architecture - with Real Work Examples on Merging B... - Altan Khendup
The document discusses the Lambda architecture, which provides a common pattern for integrating real-time and batch processing systems. It describes the key components of Lambda - the batch layer, speed layer, and serving layer. The challenges of implementing Lambda are that it requires multiple systems and technologies to be coordinated. Real-world examples are needed to help practical application. The document also provides examples of medical and customer analytics use cases that could benefit from a Lambda approach.
Lambda architecture for real time big data - Trieu Nguyen
- The document discusses the Lambda Architecture, a system designed by Nathan Marz for building real-time big data applications. It is based on three principles: human fault-tolerance, data immutability, and recomputation.
- The document provides two case studies of applying Lambda Architecture - at Greengar Studios for API monitoring and statistics, and at eClick for real-time data analytics on streaming user event data.
- Key lessons discussed are keeping solutions simple, asking the right questions to enable deep analytics and profit, using reactive and functional approaches, and turning data into useful insights.
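The immutability and recomputation principles mentioned above can be illustrated with a minimal in-memory sketch (hypothetical names, not code from either case study): raw events are only ever appended, and any derived view is recomputed from the full log, so a bug in the view logic can always be fixed by correcting the function and recomputing.

```python
from collections import defaultdict

# Append-only "master dataset": raw events are never updated or deleted.
event_log = []

def record_event(user_id, action):
    """Append an immutable fact to the log."""
    event_log.append({"user": user_id, "action": action})

def recompute_view(log):
    """Derive a view (event counts per user) from scratch.
    Because the raw log is immutable, a buggy view can always be
    repaired by fixing this function and recomputing (human
    fault-tolerance in Marz's sense)."""
    view = defaultdict(int)
    for event in log:
        view[event["user"]] += 1
    return dict(view)

record_event("u1", "click")
record_event("u1", "view")
record_event("u2", "click")
print(recompute_view(event_log))  # {'u1': 2, 'u2': 1}
```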
How to design and implement a DataOps architecture with SDC and GCP - Joseph Arriola
Do you know how to use StreamSets Data Collector with Google Cloud Platform (GCP)? In this session we'll explain how YaloChat designed and implemented a streaming architecture that is sustainable, operable and scalable. Discover how we deployed Data Collector to integrate GCP components such as Pub / Sub and BigQuery to achieve DataOps in the cloud
"Lessons learned using Apache Spark for self-service data prep in SaaS world" - Pavel Hardak
This presentation discusses Workday's use of Apache Spark for self-service data preparation and analytics within its SaaS platform. It covers Workday's unified analytics platform powered by Spark, how Prism uses Spark for interactive data prep and publishing, and lessons learned in areas like nested SQL optimization, plan deduplication, broadcast join tuning, and case-insensitive string grouping. The presentation aims to share Workday's production experiences leveraging Spark for analytics in a multi-tenant SaaS environment.
Big data real-time architectures -
How do we do big data processing in real time?
What architectures are out there to support this paradigm?
Which one should we choose?
What advantages and pitfalls do they contain?
RFX - Full-Stack Technology for Real-time Big Data - Trieu Nguyen
RFX is a full-stack technology framework for real-time big data processing that was created in 2013 and is used by FPT for analytics tasks on websites like Vnexpress.net and eclick.vn. It is built from open source projects like Akka, Netty, Kafka, Spark, Redis and uses a reactive programming approach to optimize user experience through real-time data processing and business logic. RFX aims to provide a fast data intelligence platform for solving problems like analytics, user segmentation, and automatic optimization of user experiences.
The Lyft data platform: Now and in the future - markgrover
- Lyft has grown significantly in recent years, providing over 1 billion rides to 30.7 million riders through 1.9 million drivers in 2018 across North America.
- Data is core to Lyft's business decisions, from pricing and driver matching to analyzing performance and informing investments.
- Lyft's data platform supports data scientists, analysts, engineers and others through tools like Apache Superset, change data capture from operational stores, and streaming frameworks.
- Key focuses for the platform include business metric observability, streaming applications, and machine learning while addressing challenges of reliability, integration and scale.
This document discusses an approach to enterprise metadata integration using a multilayer metadata model. Key points include:
- Status dashboards provide facts from technical, operational, application, and quality metadata layers
- A graph database allows for context exploration across the entire cluster
- The integration of metadata from multiple sources provides a more holistic view of business knowledge
2016 Tableau in the Cloud - A Netflix Original (AWS Re:invent) - Albert Wong
Building a data platform doesn’t have to be like entering a portal to Stranger Things.
Join us in one hour for Tableau in the Cloud: A Netflix Original where Albert Wong, Netflix’s analytics expert, will show you how to simplify your data stack to deliver self-service analytics at scale.
Albert will discuss the details of connecting to big data, finding datasets, and discovering critical insights from visualizations. He will also share how Netflix is developing and growing their analytics ecosystem with Tableau, and how they prioritize sustaining their data culture of freedom and responsibility.
The document discusses designing scalable platforms for artificial intelligence (AI) and machine learning (ML). It outlines several challenges in developing AI applications, including technical debt, unpredictability, and different data and compute needs compared to traditional software. It then reviews existing commercial AI platforms and the common components of AI platforms, including data access, ML workflows, computing infrastructure, model management, and APIs. The rest of the document focuses on eBay's Krylov project as an example AI platform, outlining its architecture, the challenges of deploying platforms at scale, and the skill sets needed on the platform team.
At Netflix, we've spent a lot of time thinking about how we can make our analytics group move quickly. Netflix's Data Engineering & Analytics organization embraces the company's culture of "Freedom & Responsibility".
How does a company with a $40 billion market cap and $6 billion in annual revenue keep their data teams moving with the agility of a tiny company?
How do hundreds of data engineers and scientists make the best decisions for their projects independently, without the analytics environment devolving into chaos?
We'll talk about how Netflix equips its business intelligence and data engineers with:
- the freedom to leverage cloud-based data tools - Spark, Presto, Redshift, Tableau and others - in ways that solve our most difficult data problems
- the freedom to find and introduce the right software for the job - even if it isn't used anywhere else in-house
- the freedom to create and drop new tables in production without approval
- the freedom to choose when a question is a one-off, and when a question is asked often enough to require a self-service tool
- the freedom to retire analytics and data processes whose value doesn't justify their support costs
Speaker Bios
Monisha Kanoth is a Senior Data Architect at Netflix, and was one of the founding members of the current streaming Content Analytics team. She previously worked as a big data lead at Convertro (acquired by AOL) and as a data warehouse lead at MySpace.
Jason Flittner is a Senior Business Intelligence Engineer at Netflix, focusing on data transformation, analysis, and visualization as part of the Content Data Engineering & Analytics team. He previously led the EC2 Business Intelligence team at Amazon Web Services and was a business intelligence engineer with Cisco.
Chris Stephens is a Senior Data Engineer at Netflix. He previously served as the CTO at Deep 6 Analytics, a machine learning & content analytics company in Los Angeles, and on the data warehouse teams at the FOX Audience Network and Anheuser-Busch.
Hadoop can enable zero downtime app deployments by using microservices, continuous delivery, and real-time analytics. The presenters describe how Expedia saves $5M annually through zero downtime deployments. Their architecture uses microservices, continuous integration, deployment monitoring with Storm/Kafka/HDFS, and analytics in Solr/Hive to enable canary testing, fast feedback, and automated problem resolution. A live demo shows log processing, analytics, and using results to ensure smooth, high-quality deployments.
Lambda Architecture 2.0 Convergence between Real-Time Analytics, Context-awar... - Sabri Skhiri
At Huawei, we have developed a scalable Complex Event Processing engine with significantly improved expressiveness. In the scope of "context-aware" distributed systems, we need to define new architecture patterns; in this way, we open the door to new features and capabilities.
Bighead: Airbnb’s End-to-End Machine Learning Platform with Krishna Puttaswa... - Databricks
Bighead is Airbnb's machine learning infrastructure that was created to:
- Standardize and simplify the ML development workflow;
- Reduce the time and effort to build ML models from weeks/months to days/weeks; and
- Enable more teams at Airbnb to utilize ML.
It provides shared services and tools for data management, model training/inference, and model management to make the ML process more efficient and production-ready. This includes services like Zipline for feature storage, Redspot for notebook environments, Deep Thought for online inference, and the Bighead UI for model monitoring.
Applied Machine Learning for Ranking Products in an Ecommerce Setting - Databricks
As a leading e-commerce company in fashion in the Netherlands, Wehkamp dedicates itself to providing a better shopping experience for its customers. Using Spark, the data science team is able to develop various machine-learning projects for this purpose based on large-scale data about products and customers. A major topic for the data science team is ranking products: if a visitor enters a search phrase, what are the best products that fit the search phrase, and in what order should the products be shown? Ranking products is also important when a visitor enters a product overview page, where hundreds or even thousands of products of a certain article type are displayed.
In this project, Spark is used in the whole pipeline: retrieving and processing the search phrases and their results, making click models, creating feature sets, training and evaluating ranking models, pushing the models to production using ElasticSearch, and creating Tableau dashboards. In this talk, we are going to demonstrate how we use Spark to build up the whole pipeline for ranking products, and the challenges we faced along the way.
Big Data and Fast Data - Lambda Architecture in Action - Guido Schmutz
Big Data (volume) and real-time information processing (velocity) are two important aspects of Big Data systems. At first sight, these two aspects seem to be incompatible. Are traditional software architectures still the right choice? Do we need new, revolutionary architectures to tackle the requirements of Big Data?
This presentation discusses the idea of the so-called lambda architecture for Big Data, which is based on splitting data processing in two: in a batch phase, a large, temporally bounded dataset is processed, either through traditional ETL or MapReduce. In parallel, a real-time online process constantly computes values for the new data arriving during the batch phase. Combining the two results, batch and online, gives a constantly up-to-date view.
This talk presents how such an architecture can be implemented using Oracle products such as Oracle NoSQL, Hadoop and Oracle Event Processing as well as some selected products from the Open Source Software community. While this session mostly focuses on the software architecture of BigData and FastData systems, some lessons learned in the implementation of such a system are presented as well.
The presentation covers the lambda architecture and its implementation with Spark. We will discuss the components of the lambda architecture (the batch layer, speed layer, and serving layer) and its advantages and benefits when implemented with Spark.
A real time architecture using Hadoop and Storm @ FOSDEM 2013 - Nathan Bijnens
The document discusses a real-time architecture using Hadoop and Storm. It describes a layered architecture with a batch layer using Hadoop to store all data, a speed layer using Storm for stream processing of recent data, and a serving layer that merges views from the batch and speed layers. The batch layer generates immutable views from raw data, while the speed layer maintains incremental real-time views over a limited window. This architecture allows queries to be served with an eventual consistency guarantee.
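The serving-layer merge described above can be sketched in a few lines (hypothetical data, a minimal conceptual sketch rather than any of the systems in these talks): the batch view is complete but stale, the speed view covers only recent data, and a query sums the two.

```python
# Batch view: complete but stale counts, recomputed periodically
# from the full raw dataset (up to some time T).
batch_view = {"page_a": 1000, "page_b": 500}

# Speed view: incremental counts for events that arrived after T,
# maintained over a limited window.
speed_view = {"page_a": 12, "page_c": 3}

def serve_query(key):
    """Serving layer: merge the batch and real-time views.
    When the next batch run completes, the speed view for the
    covered window is discarded - hence eventual consistency."""
    return batch_view.get(key, 0) + speed_view.get(key, 0)

print(serve_query("page_a"))  # 1012
```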
Apache Kafka is a distributed publish-subscribe messaging system that was originally created by LinkedIn and contributed to the Apache Software Foundation. It is written in Scala and provides a multi-language API to publish and consume streams of records. Kafka is useful for both log aggregation and real-time messaging due to its high performance, scalability, and ability to serve as both a distributed messaging system and log storage system with a single unified architecture. To use Kafka, one runs Zookeeper for coordination, Kafka brokers to form a cluster, and then publishes and consumes messages with a producer API and consumer API.
Implementing the Lambda Architecture efficiently with Apache Spark - DataWorks Summit
This document discusses implementing the Lambda Architecture efficiently using Apache Spark. It provides an overview of the Lambda Architecture concept, which aims to provide low latency querying while supporting batch updates. The Lambda Architecture separates processing into batch and speed layers, with a serving layer that merges the results. Apache Spark is presented as an efficient way to implement the Lambda Architecture due to its unified processing engine, support for streaming and batch data, and ability to easily scale out. The document recommends resources for learning more about Spark and the Lambda Architecture.
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013 - mumrah
Apache Kafka is a distributed publish-subscribe messaging system that allows both publishing and subscribing to streams of records. It uses a distributed commit log that provides low latency and high throughput for handling real-time data feeds. Key features include persistence, replication, partitioning, and clustering.
The document provides an introduction and overview of Apache Kafka presented by Jeff Holoman. It begins with an agenda and background on the presenter. It then covers basic Kafka concepts like topics, partitions, producers, consumers and consumer groups. It discusses efficiency and delivery guarantees. Finally, it presents some use cases for Kafka and positioning around when it may or may not be a good fit compared to other technologies.
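The core Kafka concepts covered above (topics, partitions, offsets, per-key ordering) can be modeled with a toy in-memory class. This is a conceptual sketch only, not the real Kafka API; the class and method names are made up for illustration.

```python
class Topic:
    """Toy model of a Kafka topic: a set of partitions, each an
    append-only log addressed by offset."""
    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        # Keyed messages hash to a fixed partition, which preserves
        # per-key ordering (mirroring Kafka's default partitioner).
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def consume(self, partition, offset):
        # Consumers track their own offsets and can re-read at will;
        # the broker does not delete messages on consumption.
        return self.partitions[partition][offset:]

topic = Topic(num_partitions=3)
p, off = topic.produce("user-1", "clicked")
topic.produce("user-1", "purchased")
print(topic.consume(p, 0))  # both messages, in order
```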
Talks@Coursera - A/B Testing @ Internet Scale - courseratalks
This document discusses A/B testing at large internet companies. It describes how companies like Amazon, Microsoft, Google, and LinkedIn use A/B testing to evaluate new ideas, measure their impact, and gain customer feedback. It outlines best practices for A/B testing, such as running one experiment at a time, choosing appropriate metrics and statistical significance, properly powering experiments, and addressing issues like multiple testing. The document also describes the key components of a scalable A/B testing system, including experiment management, online infrastructure for traffic routing and data logging, and automated offline analysis.
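The statistical-significance step mentioned above is typically a two-proportion z-test; a textbook version in plain Python follows (standard statistics, not the system described in the talk; the numbers are made up).

```python
from math import sqrt, erf

def ab_test_z(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test for an A/B experiment:
    conv_* are conversion counts, n_* are sample sizes."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

z, p = ab_test_z(conv_a=200, n_a=10_000, conv_b=260, n_b=10_000)
print(f"z={z:.2f}, p={p:.4f}")  # significant at the usual 0.05 level
```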
Lambda Architecture with Spark, Spark Streaming, Kafka, Cassandra, Akka and S... - Helena Edelson
Regardless of the meaning we are searching for over our vast amounts of data, whether we are in science, finance, technology, energy, health care…, we all share the same problems that must be solved: How do we achieve that? What technologies best support the requirements? This talk is about how to leverage fast access to historical data with real time streaming data for predictive modeling for lambda architecture with Spark Streaming, Kafka, Cassandra, Akka and Scala. Efficient Stream Computation, Composable Data Pipelines, Data Locality, Cassandra data model and low latency, Kafka producers and HTTP endpoints as akka actors...
Apache Spark is an in-memory data processing solution that can work with existing data sources like HDFS and can make use of your existing computation infrastructure, such as YARN or Mesos. This talk covers a basic introduction to Apache Spark and its various components, such as MLlib, Shark, and GraphX, with a few examples.
Apache Kafka 0.8 basic training - Verisign - Michael Noll
Apache Kafka 0.8 basic training (120 slides) covering:
1. Introducing Kafka: history, Kafka at LinkedIn, Kafka adoption in the industry, why Kafka
2. Kafka core concepts: topics, partitions, replicas, producers, consumers, brokers
3. Operating Kafka: architecture, hardware specs, deploying, monitoring, P&S tuning
4. Developing Kafka apps: writing to Kafka, reading from Kafka, testing, serialization, compression, example apps
5. Playing with Kafka using Wirbelsturm
Audience: developers, operations, architects
Created by Michael G. Noll, Data Architect, Verisign, https://www.verisigninc.com/
Verisign is a global leader in domain names and internet security.
Tools mentioned:
- Wirbelsturm (https://github.com/miguno/wirbelsturm)
- kafka-storm-starter (https://github.com/miguno/kafka-storm-starter)
Blog post at:
http://www.michael-noll.com/blog/2014/08/18/apache-kafka-training-deck-and-tutorial/
Many thanks to the LinkedIn Engineering team (the creators of Kafka) and the Apache Kafka open source community!
What happens when you start transitioning from a monolithic PHP app to Go services running on AWS Lambda? Good things! I'd like to share the problems encountered, decisions made and lessons learned along the way.
Data Engineer's Lunch 90: Migrating SQL Data with Arcion - Anant Corporation
In Data Engineer's Lunch 90, Eric Ramseur teaches our audience how to use Arcion.
From best practices to real-world examples, this talk will provide you with the knowledge and insights you need to ensure a successful migration of your SQL data. So whether you're new to data migration or looking to improve your existing process, join us and discover how Arcion can help you achieve your goals.
Real-time serverless analytics at Shedd – OLX data summit, Mar 2018, Barcelona - Dobo Radichkov
OLX Group presentation on real-time serverless analytics at the 2018 OLX internal data summit in Barcelona.
The presentation focuses on best practices in real-time data applications, including AWS technologies such as Kinesis, Lambda (with serverless framework) and ElastiCache.
Presentation examines case study of real-time product recommendations built on top of serverless architecture.
Big Data in the Cloud - Montreal April 2015 - Cindy Gross
slides:
Basic Big Data and Hadoop terminology
What projects fit well with Hadoop
Why Hadoop in the cloud is so Powerful
Sample end-to-end architecture
See: Data, Hadoop, Hive, Analytics, BI
Do: Data, Hadoop, Hive, Analytics, BI
How this tech solves your business problems
RedisGraph A Low Latency Graph DB: Pieter Cailliau - Redis Labs
This document summarizes a presentation about RedisGraph, a graph database that runs on Redis. The presentation discusses RedisGraph's capabilities, use cases where graph databases are useful, and what new features are upcoming for RedisGraph. Specific points mentioned include RedisGraph's support for the Cypher query language, improvements in performance and functionality since its general availability, and how the graph database can power features for IBM's Multicloud Manager product.
This document discusses moving machine learning models from prototype to production. It outlines some common problems with the current workflow where moving to production often requires redevelopment from scratch. Some proposed solutions include using notebooks as APIs and developing analytics that are accessed via an API. It also discusses different data science platforms and architectures for building end-to-end machine learning systems, focusing on flexibility, security, testing and scalability for production environments. The document recommends a custom backend integrated with Spark via APIs as the best approach for the current project.
This document provides an overview of Big Data and Hadoop concepts, architectures, and hands-on demonstrations using Microsoft Azure HDInsight. It begins with definitions of Big Data and Hadoop, then demonstrates sample end-to-end architectures using Azure services. Hands-on labs explore creating storage, streaming jobs, and querying data using HDInsight. The document emphasizes that Hadoop is well-suited for large-scale data exploration and analytics on unknown datasets. It shows how running Hadoop on Azure provides elasticity, low costs, and easier management compared to on-premises Hadoop deployments.
Get more than a cache back! The Microsoft Azure Redis Cache (NDC Oslo) - Maarten Balliauw
The document discusses Azure Cache and Redis. It provides an overview of Redis, including its data types, transactions, pub/sub capabilities, scripting, and sharding/partitioning. It then discusses common patterns for using Redis, such as caching, counting likes on Facebook, getting the latest reviews, rate limiting, and autocompletion. The document emphasizes that Redis is very flexible and can be used for more than just caching, acting as a general datastore. It concludes by recommending a Redis reference book for further learning.
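The rate-limiting pattern mentioned above is classically built on Redis with INCR plus EXPIRE on a per-window key. A minimal in-memory sketch of that fixed-window scheme follows (illustrative only; in Redis the counter dictionary would be INCR/EXPIRE calls on keys like `rate:{user}:{window}`).

```python
import time

class FixedWindowRateLimiter:
    """In-memory sketch of the classic Redis fixed-window
    rate-limiting pattern: one counter per (user, time window)."""
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.counters = {}

    def allow(self, user, now=None):
        now = time.time() if now is None else now
        # One counter per window; in Redis this key would expire
        # after the window elapses.
        key = (user, int(now // self.window))
        self.counters[key] = self.counters.get(key, 0) + 1
        return self.counters[key] <= self.limit

rl = FixedWindowRateLimiter(limit=3, window_seconds=60)
results = [rl.allow("alice", now=100) for _ in range(4)]
print(results)  # [True, True, True, False]
```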
Best Practices for Building and Deploying Data Pipelines in Apache Spark - Databricks
Many data pipelines share common characteristics and are often built in similar but bespoke ways, even within a single organisation. In this talk, we will outline the key considerations which need to be applied when building data pipelines, such as performance, idempotency, reproducibility, and tackling the small file problem. We’ll work towards describing a common Data Engineering toolkit which separates these concerns from business logic code, allowing non-Data-Engineers (e.g. Business Analysts and Data Scientists) to define data pipelines without worrying about the nitty-gritty production considerations.
We’ll then introduce an implementation of such a toolkit in the form of Waimak, our open-source library for Apache Spark (https://github.com/CoxAutomotiveDataSolutions/waimak), which has massively shortened our route from prototype to production. Finally, we’ll define new approaches and best practices about what we believe is the most overlooked aspect of Data Engineering: deploying data pipelines.
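One of the concerns named above, idempotency, usually means that re-running a pipeline step leaves the store in the same state instead of appending duplicates. A common way to get this is to overwrite whole partitions; a toy sketch (a hypothetical helper, not Waimak's API):

```python
def write_partition(store, partition_key, rows):
    """Idempotent write: replace the entire partition rather than
    appending, so a retry after a failure produces the same state."""
    store[partition_key] = list(rows)

store = {}
day = "2019-01-01"
write_partition(store, day, [{"id": 1}, {"id": 2}])
write_partition(store, day, [{"id": 1}, {"id": 2}])  # retried run
print(len(store[day]))  # 2 - no duplicates on re-run
```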
Building NLP applications with Transformers - Julien SIMON
The document discusses how transformer models and transfer learning (Deep Learning 2.0) have improved natural language processing by allowing researchers to easily apply pre-trained models to new tasks with limited data. It presents examples of how HuggingFace has used transformer models for tasks like translation and part-of-speech tagging. The document also discusses tools from HuggingFace that make it easier to train models on hardware accelerators and deploy them to production.
1) Learn about Myplanet's Headless CMS solution using Gatsby Preview and Contentful’s UI Extensions (https://www.contentful.com/resources/serverless/)
2) their Serverless project with IBM - using Apache OpenWhisk (https://www.ibm.com/cloud/functions)
3) how Myplanet got involved with AWS DeepRacer - a fun way to get started with Reinforcement Learning (RL), and their racing experience at re:Invent DeepRacer League (https://reinvent.awsevents.com/learn/deepracer/)
4) their Machine Learning (ML) research related to finding DeepRacer’s ideal line (https://medium.com/myplanet-musings/the-best-path-a-deepracer-can-learn-2a468a3f6d64).
BONUS: Two TED Talks referenced in the intro
5) When ideas have sex | Matt Ridley | Jul 14, 2010 https://www.ted.com/talks/matt_ridley_when_ideas_have_sex
6) Why The Best Leaders Make Love The Top Priority | Matt Tenney | Dec 5, 2019 https://www.youtube.com/watch?v=qCVoohdyI6I
VIDEO: https://youtu.be/ZH1xxmBNx5k
Building your data driven business with Reactive Marketing Technology - Trieu Nguyen
The document discusses data-driven business and reactive marketing technology. It begins with key questions about data-driven business and the benefits of analytics, and introduces the "9D" model for big data business. Tools for building reactive marketing technology are presented, including Apache Storm, Apache Kafka, Apache Spark, and the Hadoop ecosystem. A case study demonstrates how to build digital marketing software using open source big data tools. The philosophy and a lightweight lambda architecture for building a reactive system are described.
"What is serverless architecture, and how do you live with it?" Nikolay Markov, Aligned ... - it-people
The document discusses what serverless computing is and how it can be used for building applications. Serverless applications rely on third party services to manage server infrastructure and are event-triggered. Popular serverless frameworks like AWS Lambda, Google Cloud Functions, Microsoft Azure Functions, and Zappa allow developers to write code that runs in a serverless environment and handle events and triggers without having to manage servers.
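The event-triggered model described above boils down to writing a handler that is a pure function of the triggering event, with no server state to manage. A minimal AWS Lambda-style handler in Python (the event shape below mimics an API Gateway proxy request and is illustrative only):

```python
import json

def handler(event, context):
    """Minimal serverless handler: receives an event, returns a
    response; the platform manages all server infrastructure."""
    params = event.get("queryStringParameters") or {}
    name = params.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"hello, {name}"}),
    }

# Locally, the handler is just a function call with a fake event.
resp = handler({"queryStringParameters": {"name": "serverless"}}, None)
print(resp["body"])
```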
All in AI: LLM Landscape & RAG in 2024 with Mark Ryan (Google) & Jerry Liu (L... - Daniel Zivkovic
Serverless Toronto's 6th-anniversary event helps IT pros understand and prepare for the #GenAI tsunami ahead. You'll gain situational awareness of the LLM Landscape, receive condensed insights, and actionable advice about RAG in 2024 from Google AI Lead Mark Ryan and LlamaIndex creator Jerry Liu. We chose #RAG (Retrieval-Augmented Generation) because it is the predominant paradigm for building #LLM (Large Language Model) applications in enterprises today - and that's where the jobs will be shifting. Here is the recording: https://youtu.be/P5xd1ZjD-Os?si=iq8xibj5pJsJ62oW
The large O’Reilly survey on serverless adoption indicated that the majority of enterprises have not yet adopted serverless. They have cited the following concerns as main factors: security, the steep learning curve, vendor lock-in, integration/debugging and observability of serverless applications.
In this talk, I will share my views on these concerns and present how Waylay IO has addressed these challenges. Waylay IO’s mission is to finally unlock all promised benefits of serverless computation, with an intuitive and developer-friendly low-code platform.
DataTalks #4: A minimal toolset for building your own r... system - WG_ Events
The question of personalizing system behavior for each user grows more pressing every day. In this talk, Alexey reviews a set of tools with which you can build your own recommendation service with minimal time investment.
Alexey will cover not only the theory but also give practical advice on how to build a prototype without pulling developers into building a complete pipeline, with just one or two data scientists who can formulate the model idea and one Java/Scala developer capable of translating the model into code.
The talk will be useful for technical specialists responding to requests like: "we want to start with at least something, but don't know where to begin."
Interested in data analysis? Join our group on Facebook: https://www.facebook.com/groups/DataTalks/
Real time data viz with Spark Streaming, Kafka and D3.js - Ben Laird
This document discusses building a dynamic visualization of large streaming transaction data. It proposes using Apache Kafka to handle the transaction stream, Apache Spark Streaming to process and aggregate the data, MongoDB for intermediate storage, a Node.js server, and Socket.io for real-time updates. Visualization would use Crossfilter, DC.js and D3.js to enable interactive exploration of billions of records in the browser.
BigData Meets the Federal Data Center - an overview of nosql solutions to data challenges (e.g. Hadoop, Hbase, Mongodb, cassandra, redis etc). Also includes a vignette on Google Prediction API.
Machine learning applications are typically stitched together from hopes and dreams, shell scripts, cron jobs, home-grown schedulers, snippets of configuration clipped from multiple blog posts, thousands of hard-coded business rules, a.k.a. "our SQL corpus," and a few lines of training and testing code. Organizing all the moving parts into something maintainable and supportive of ongoing development is a challenge most teams have on their TODO list, roadmap, or tech debt pile. Getting ahead of the day-to-day demands and settling into a sane architecture often seems like an unattainable goal. The past several years have seen an explosion of tool-building in the data engineering and analytics area, including in Apache projects spanning the areas of search and information retrieval, job orchestration, file and stream formats, and machine learning libraries. In this talk we will cover our product and development teams' choices of architecture and tools, from data ingestion and storage, through transformations and processing, to presentation of results and publishing to web services, reports, and applications.
Monitoring Big Data Systems Done "The Simple Way" - Demi Ben-Ari - Codemotion... - Codemotion
Once you start working with Big Data systems, you discover a whole bunch of problems you won't find in monolithic systems. Monitoring all of the components becomes a big data problem itself. In the talk, we'll cover the aspects you should take into consideration when monitoring a distributed system built with tools like web services, Spark, Cassandra, MongoDB, and AWS. Beyond the tools, what should you monitor about the actual data that flows through the system? We'll cover the simplest solution using your day-to-day open source tools, and the surprising thing is that it comes not from an Ops guy.
Lambda Architecture 2.0 for Reactive AB Testing
1. Lambda Architecture 2.0 for Data-Driven Business
Team:
Trieu Nguyen - http://nguyentantrieu.info
Truc Le - https://www.linkedin.com/pub/le-kien-truc/31/379/938
Data-driven + Lambda Architecture = growing business
mc2ads.com - Fast Data Labs
2. Key questions for us today
1. What if the business is not driven by data?
2. What and why is Lambda Architecture?
3. What problems did it solve for us?
Workshop with case study: Improving “Flappy bird” with an A/B Testing Tool and Lambda Architecture 2.0
3. Red bird vs. blue bird
Which bird will let you down sooner?
OK, let's play the game! Design it better with data.
9. Why Lambda Architecture 2.0?
It helps organize your data infrastructure into an understandable structure and lets you react quickly to context changes.
10. “Vision Without Execution Is Just Hallucination”
OK, cool ideas, but how do we build it? We are here.
11. Our goals
1. Understand the big picture
2. See the reality
3. Take action to make it happen
OK! Let's turn “Flappy bird” into “Happy bird”!
12. What is Lambda Architecture 2.0?
It's the architecture for data-driven business:
● for reacting to fast data
● for data mining and machine learning on Big Data
● for observable data
● for SQL querying (SQL is the true lambda language!?)
13. Case study: Improving “Flappy bird” with an A/B Testing Tool and Lambda Architecture 2.0
● A short introduction to A/B testing
● Set up a full open-source technology stack
● Run example code with Java and Python
16. How? One basic principle is “test our theory”: from the observable solutions, test them all to find the best one! More at http://en.wikipedia.org/wiki/A/B_testing
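Picking the best variant usually comes down to a standard two-proportion z-test on conversion (or retention) counts. Below is a minimal, self-contained Python sketch of that test; it illustrates the statistics only, not the Abba framework's API, and the numbers are invented:

```python
import math

def z_test_two_proportions(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test: is variant B's conversion rate
    significantly different from variant A's?"""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled proportion under the null hypothesis (no real difference)
    p = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Example: red bird vs. blue bird retention after one session
z, p = z_test_two_proportions(conv_a=120, n_a=1000, conv_b=150, n_b=1000)
print(round(z, 2), round(p, 4))
```

With 12% vs. 15% over 1,000 players each, z comes out around 1.96 and p around 0.05, right at the usual significance threshold, which is why sample size in step 5 below matters.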
17. Steps
1. Work with the A/B testing tool (using the Abba framework)
2. Let's play Flappy Bird 2.0!
3. Collect data → store it as a stream (Kafka)
4. Stream processing → real-time view processing (RFX)
5. Batch processing → sampling the A/B test (Spark)
6. Query processing → finding facts from the experiment (SQL over Phoenix / HBase)
7. Collect feedback data → Game Design Report
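Step 1 needs each player pinned to a variant. One common technique (an option for illustration, not necessarily how Abba assigns users) is deterministic hash bucketing: a returning player always lands in the same bucket, with no assignment table to store. The function and experiment names here are hypothetical:

```python
import hashlib

def assign_variant(user_id, experiment, variants=("red_bird", "blue_bird")):
    """Deterministically map a user to a variant: hashing the
    (experiment, user) pair gives a stable, roughly uniform bucket."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    bucket = int(hashlib.md5(key).hexdigest(), 16) % len(variants)
    return variants[bucket]

# A returning player gets the same bird across sessions
v1 = assign_variant("player-42", "flappy-color-test")
v2 = assign_variant("player-42", "flappy-color-test")
assert v1 == v2
print(v1)
```

Keying the hash on the experiment name as well as the user ID keeps assignments independent across experiments, so one test does not bias the next.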
18. For a simple demo, we use Abba, a simple self-hosted A/B testing framework.
19. Why a reactive view in Lambda Architecture 2.0?
UX is the key to successful product development, so we must react to bad UX quickly (with data).
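To make "react to bad UX quickly" concrete, here is a toy push-based stream in plain Python: subscribers react the moment each observation arrives. This sketches the idea behind Rx-style reactive views, not the actual RxJava/RxJS API, and the crash-rate threshold is made up:

```python
class MetricStream:
    """A tiny push-based stream: each emitted value is pushed to all
    subscribers immediately (the idea behind Rx, not its API)."""
    def __init__(self):
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def emit(self, value):
        for cb in self.subscribers:
            cb(value)

alerts = []
crash_rate = MetricStream()
# React immediately when the observed crash rate crosses a threshold
crash_rate.subscribe(lambda r: alerts.append(r) if r > 0.05 else None)

for rate in (0.01, 0.02, 0.09, 0.03):
    crash_rate.emit(rate)

print(alerts)  # → [0.09]: only the bad UX reading triggered an alert
```

The same push-based shape is what the real-time view in step 4 gives you: instead of polling a batch report, the experiment dashboard is notified as soon as a variant starts hurting UX.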
20. Technology stack (5D model)
1) Data collector (I/O networking)
● Netty for the event log collector and HTTP server (lambda2)
2) Data persistence (aka data storage)
● Kafka for distributed message storage (Apache Kafka)
● HBase for a scalable big table
3) Data processing
● RFX for fast data processing (RFX framework)
● Python for data sampling in A/B test experiments
● Rx(Java/JS) for reacting to data experiments (reactivex)
4) Data analysis
● measures of uncertainty (Python, Dempster-Shafer theory)
5) Data ad-hoc reporting
● SQL over Phoenix / HBase (http://phoenix.apache.org)
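The Dempster-Shafer step in 4) can be sketched in a few lines of Python. Below is Dempster's rule of combination over a tiny frame of discernment ({red, blue}); the evidence masses are invented for illustration, and real experiments would derive them from observed data:

```python
def combine(m1, m2):
    """Dempster's rule of combination. Each mass function is a dict
    mapping a frozenset of hypotheses to its belief mass."""
    combined = {}
    conflict = 0.0
    for b, mb in m1.items():
        for c, mc in m2.items():
            inter = b & c
            if inter:
                combined[inter] = combined.get(inter, 0.0) + mb * mc
            else:
                conflict += mb * mc  # mass that lands on the empty set
    if conflict >= 1.0:
        raise ValueError("total conflict: sources are incompatible")
    # Renormalize by the non-conflicting mass
    return {k: v / (1.0 - conflict) for k, v in combined.items()}

# Two sources of evidence about which variant is better (illustrative)
RED, BLUE = frozenset({"red"}), frozenset({"blue"})
EITHER = RED | BLUE
m1 = {RED: 0.6, EITHER: 0.4}             # source 1 leans red
m2 = {RED: 0.3, BLUE: 0.5, EITHER: 0.2}  # source 2 leans blue
m = combine(m1, m2)
print({tuple(sorted(k)): round(v, 3) for k, v in m.items()})
```

Unlike a single p-value, the combined mass keeps an explicit "either could be better" share (the mass on the full frame), which is a useful way to report uncertainty back into the Game Design Report.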