Tobias (Toby) Mao’s Post

Tobias (Toby) Mao

Co-Founder and CTO @ Tobiko Data

Working with video streaming data at Netflix was incredibly challenging. Two major technical factors made it difficult to work with:

1. Massive scale - Netflix logs a lot of data every time you watch something. The datasets are unimaginably big, which means jobs are expensive, take a long time, and may even crash without prior planning.

2. Late-arriving data - Netflix lets you watch shows while you're offline, so events aren't logged when they happen. They only get sent to the server once a device comes back online, which can be delayed by a couple of weeks.

Developing on these datasets was a huge pain. Testing a change involved a bunch of manual work. Creating dev environments meant tricky workarounds: selecting only a single partition of the data so you could iterate relatively quickly, and manually writing and running ALTER TABLE statements, since these tables are too big to fully refresh. When it came time to deploy those changes to production, you had to hold your breath, because mistakes are really tough to recover from. On top of that, you'd spend most of your time tracking down downstream consumers of the table to make sure you weren't breaking anything, or coordinating changes with them.

Late-arriving data is especially tricky because you need to avoid data leakage while still picking up the right amount of late data. Partition-based systems like Spark rely on "insert overwrite" to atomically replace data, but that means if you try to insert one row of data that is a year old, you could accidentally wipe out the existing dataset.

That's why we designed SQLMesh the way we did: so that all data engineers can instantly create dev environments without duplicating data and wasting precious time and money, and why we knew lookback/backfill/restatements had to be a first-class experience.

Even if you don't have Netflix-scale data, you're likely facing similar issues to the ones I did. Excited to see what you all do with SQLMesh. #bigdata #dataengineering
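[Editor's sketch] To make the insert-overwrite pitfall above concrete, here is a minimal Spark SQL example. The table and column names (playback_events, ds, late_arriving_rows) are hypothetical; the overwrite semantics are standard Spark behavior.

    -- Hypothetical partitioned table of playback events.
    CREATE TABLE playback_events (
      account_id BIGINT,
      title_id   BIGINT,
      ds         STRING  -- event date partition, e.g. '2023-01-15'
    )
    USING PARQUET
    PARTITIONED BY (ds);

    -- With Spark's default setting
    -- (spark.sql.sources.partitionOverwriteMode = static), an overwrite
    -- whose partition values are resolved dynamically first deletes every
    -- partition matching the spec (here, all of them), then writes back
    -- only the partitions present in the SELECT. If late_arriving_rows
    -- holds a single row from a year ago, the whole table is replaced by
    -- that one partition.
    INSERT OVERWRITE TABLE playback_events PARTITION (ds)
    SELECT account_id, title_id, ds
    FROM late_arriving_rows;

    -- 'dynamic' mode limits the overwrite to partitions that appear in
    -- the incoming data, but each touched partition is still fully
    -- replaced, so late rows must be unioned with the rows already in
    -- that partition before writing.
    SET spark.sql.sources.partitionOverwriteMode = dynamic;

Dynamic mode narrows the blast radius, but you still have to merge late rows with existing data yourself; that bookkeeping is what the post argues should be handled by the framework.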

Tobias (Toby) Mao

Co-Founder and CTO @ Tobiko Data

3w

A blog post I wrote last year about incremental data: https://tobikodata.com/correctly-loading-incremental-data-at-scale.html
Virtual data environments: https://tobikodata.com/virtual-data-environments.html
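[Editor's sketch] As a rough illustration of what making lookback first-class can look like in a SQLMesh incremental-by-time-range model; the model name, source table, columns, and the 14-day window here are all assumptions, not taken from the linked posts.

    MODEL (
      name analytics.playback_sessions,
      kind INCREMENTAL_BY_TIME_RANGE (
        time_column event_ds,
        -- Reprocess the trailing 14 days on every run so rows that
        -- arrive late (e.g. from devices that were offline) are folded
        -- in without a manual backfill.
        lookback 14
      ),
      cron '@daily'
    );

    SELECT
      account_id,
      title_id,
      event_ds
    FROM raw.playback_events
    -- SQLMesh fills in the interval being (re)computed, so only those
    -- partitions are replaced and a late row cannot wipe out unrelated
    -- history.
    WHERE event_ds BETWEEN @start_ds AND @end_ds

Alongside this, "sqlmesh plan dev" builds a virtual dev environment out of views rather than copies of the data, and a targeted restatement such as "sqlmesh plan --restate-model analytics.playback_sessions" re-runs a chosen range without touching the rest of the table.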

I love this clear example of where SQLMesh clearly outperforms the alternatives. I did once overwrite a table by mistake (I had enabled our legacy, deprecated dbt), but I recovered from it quite easily using the underlying SQLMesh tables.

Scott Robertson

Full Stack Cognitive Science Enthusiast | Senior Software Engineer at ATA, LLC | Servant Leader | Husband | Father | Brain Geek

3w

It sounds like an exciting environment to work in... also slightly terrifying. The fact that you only had to worry about movie data, with no one's life on the line, makes it more exciting and less terrifying... Kudos for taking lessons learned and converting them for the community.

Brian Greene

Platform Engineering for Data with NeuronSphere.io

3w

When dealing with surgical robotics data we had similar issues, except our clients produced video as well as massive tabular and streaming sets that required processing late-arriving data. Performing a highly selective backfill (“re-run this subset of transformations for data that looks like this and happened on weekdays”) gets even harder when some of the transformations are containers that process video…
