Tobias (Toby) Mao’s Post

Tobias (Toby) Mao

Co-Founder and CTO @ Tobiko Data

Working with video streaming data at Netflix was incredibly challenging. Two major technical factors made it difficult to work with:

1. Massive scale - Netflix logs a lot of data every time you watch something. The datasets are unimaginably big, which means jobs are expensive, take a long time, and may even crash without prior planning.

2. Late-arriving data - Netflix lets you watch shows while you're offline, so events aren't logged when they happen. They only get sent to the server once a device comes back online, which can be delayed by a couple of weeks.

Developing on these datasets was a huge pain. Testing a change involved a bunch of manual work. Creating dev environments meant tricky workarounds: selecting only a single partition of the data so you could iterate relatively quickly, and manually writing and running ALTER TABLE statements, since these tables are too big to fully refresh. When it came time to deploy those changes to production, you had to hold your breath, because mistakes are really tough to recover from. On top of that, you'd spend most of your time tracking down downstream consumers of the table to make sure you weren't breaking anything, or coordinating changes with them.

Late-arriving data is especially tricky because you need to avoid data leakage while still picking up the right amount of late data. Partition-based systems like Spark rely on "insert overwrite" to atomically replace data, but that means if you try to insert one row of data that is a year old, you could accidentally wipe out the existing dataset.

That's why we designed SQLMesh the way we did: so that all data engineers can instantly create dev environments without duplicating data and wasting precious time and money, and why we knew lookback/backfill/restatements had to be a first-class experience.

Even if you don't have Netflix-scale data, you're likely facing similar issues to the ones I did. Excited to see what you all do with SQLMesh. #bigdata #dataengineering
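[Editor's sketch] To make the insert-overwrite pitfall above concrete, here is a minimal Spark SQL example. The table and column names (playback_events, ds, late_arriving_rows) are hypothetical; the overwrite semantics are standard Spark behavior.

    -- Hypothetical partitioned table of playback events.
    CREATE TABLE playback_events (
      account_id BIGINT,
      title_id   BIGINT,
      ds         STRING  -- event date partition, e.g. '2023-01-15'
    )
    USING PARQUET
    PARTITIONED BY (ds);

    -- With Spark's default setting
    -- (spark.sql.sources.partitionOverwriteMode = static), an overwrite
    -- whose partition values are resolved dynamically first deletes every
    -- partition matching the spec (here, all of them), then writes back
    -- only the partitions present in the SELECT. If late_arriving_rows
    -- holds a single row from a year ago, the whole table is replaced by
    -- that one partition.
    INSERT OVERWRITE TABLE playback_events PARTITION (ds)
    SELECT account_id, title_id, ds
    FROM late_arriving_rows;

    -- 'dynamic' mode limits the overwrite to partitions that appear in
    -- the incoming data, but each touched partition is still fully
    -- replaced, so late rows must be unioned with the rows already in
    -- that partition before writing.
    SET spark.sql.sources.partitionOverwriteMode = dynamic;

Dynamic mode narrows the blast radius, but you still have to merge late rows with existing data yourself; that bookkeeping is what the post argues should be handled by the framework.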

Tobias (Toby) Mao

Co-Founder and CTO @ Tobiko Data

3w

A blog post I wrote last year about incremental data: https://tobikodata.com/correctly-loading-incremental-data-at-scale.html
Virtual data environments: https://tobikodata.com/virtual-data-environments.html
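[Editor's sketch] As a rough illustration of what making lookback first-class can look like in a SQLMesh incremental-by-time-range model; the model name, source table, columns, and the 14-day window here are all assumptions, not taken from the linked posts.

    MODEL (
      name analytics.playback_sessions,
      kind INCREMENTAL_BY_TIME_RANGE (
        time_column event_ds,
        -- Reprocess the trailing 14 days on every run so rows that
        -- arrive late (e.g. from devices that were offline) are folded
        -- in without a manual backfill.
        lookback 14
      ),
      cron '@daily'
    );

    SELECT
      account_id,
      title_id,
      event_ds
    FROM raw.playback_events
    -- SQLMesh fills in the interval being (re)computed, so only those
    -- partitions are replaced and a late row cannot wipe out unrelated
    -- history.
    WHERE event_ds BETWEEN @start_ds AND @end_ds

Alongside this, "sqlmesh plan dev" builds a virtual dev environment out of views rather than copies of the data, and a targeted restatement such as "sqlmesh plan --restate-model analytics.playback_sessions" re-runs a chosen range without touching the rest of the table.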

I love this clear example of where SQLMesh clearly outperforms the alternatives. I did once overwrite a table by mistake (I had enabled our legacy, deprecated dbt), but I recovered from it quite easily using the underlying SQLMesh tables.

Scott Robertson

Full Stack Cognitive Science Enthusiast | Senior Software Engineer at ATA, LLC | Servant Leader | Husband | Father | Brain Geek

3w

It sounds like an exciting environment to work in... also slightly terrifying. The fact that you only had to worry about movie data, with no one's life on the line, makes it more exciting and less terrifying... Kudos for taking lessons learned and converting them for the community.

Brian Greene

Platform Engineering for Data with NeuronSphere.io

3w

When dealing with surgical robotics data we had similar issues, except our clients produced video as well as massive tabular and streaming sets that required processing late-arriving data. Performing a highly selective backfill (“re-run this subset of transformations for data that looks like this and happened on weekdays”) gets even harder when some of the transformations are containers that process video…
