Dynamic DDL
Adding Structure to Streaming Data on the Fly
Hao Zou
Software Engineer
Data Science & Engineering
David Winters
Big Data Architect
Data Science & Engineering
• Background and Business
• GoPro Data Platform Architecture
• Old File-based Pipeline Architecture
• New Dynamic DDL Architecture
• Dynamic DDL Deep Dive
• Using Cloud-Based Services (Optional)
• Questions
Background and
Consumer Devices GoPro Apps &
Product Insight
CRM/Marketing &
• Variety of data - Hardware and Software products
• Software - Mobile and Desktop Apps
• Hardware - Cameras, Drones, Controllers, Accessories,
• External - CRM, ERP, OTT, E-Commerce, Web, Social, etc.
• Variety of data ingestion mechanisms - Lambda Architecture
• Real-time streaming pipeline - GoPro products
• Batch pipeline - External 3rd party systems
• Complex Transformations
• Data often stored in binary to conserve space in cameras
• Heterogeneous data formats (JSON, XML, and packed
• Seamless Data Aggregations
• Blend data between different sources, hardware, and
Data Platform
ETL Cluster
• Aggregations and
• Hive and Spark jobs
• Map/Reduce
• Airflow
Secure Data Mart
• End User Query
• Impala / Sentry
• Parquet
• Kerberos & LDAP
Analytics Apps
Real Time
• Log file streaming
• RESTful service
• Spark Streaming
Batch Induction
• Batch files
• Scheduled downloads
• Pre-processing
• Java App
• Airflow
• Rest API
• FTP downloads
• S3 sync
Pipeline for processing of streaming logs
To ETL Cluster
Hive Metastore
To SDM Cluster
From Realtime Cluster
Hive Metastore
Studio - Staging
SDM Cluster
From ETL Cluster
• Isolation of workloads
• Fast ingest
• Secure
• Fast delivery/queries
• Loosely coupled clusters
• Multiple copies of data
• Tightly coupled storage and compute
• Lack of elasticity
• Operational overhead of multiple clusters
Amazon S3
Real Time
State Ephemeral
Data Mart
Cluster #1
Data Mart
Cluster #2
Data Mart
Cluster #N
• Rest API
• FTP downloads
• S3 sync
Single copy of data
Separate storage from compute
Elastic clusters
Single long running cluster to maintain
Dynamic DDL Deep Dive
Streaming Cluster
Pipeline for processing of streaming logs
Centralized Hive
For each topic, dynamically add the table
structure and create the table or insert
data into the table if already exists
• What is Dynamic DDL?
• Dynamic DDL adds structure (a schema) to the data on the fly whenever the providers of the data change its structure.
• Why is Dynamic DDL needed?
• Providers of data change their structure constantly. Without Dynamic DDL, the table schema is hard-coded and has to be updated manually whenever the incoming data changes.
• All of the aggregation SQL would also have to be updated manually after each schema change.
• Faster turnaround for the data ingestion. Data can be ingested and made available within minutes (sometimes
• How did we do this?
• Using Spark SQL/DataFrame
• See example
• Example:
{"_data":{"record":{"id": "1", "first_name": "John", "last_name": "Fork", "state": "California", "city": "san Mateo"}, "log_ts":"2016-07-20T00:06:01Z"}}
Fixed schema
Dynamically generated
{"record_key":"city","record_value":"san Mateo","id":"1","log_ts":"2016-07-20T00:06:01Z"}
Flatten the data first
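The flattening step can be sketched in plain Python. This is a minimal illustration, not the pipeline's Spark code: the function name `flatten_record` is hypothetical, and treating `id` and `log_ts` as the fixed columns is an assumption read off the flattened example above.

```python
import json

def flatten_record(raw_line):
    """Turn one nested log line into (record_key, record_value, id, log_ts) rows."""
    data = json.loads(raw_line)["_data"]
    record = data["record"]
    rows = []
    for key, value in record.items():
        if key == "id":
            continue  # assumption: id is carried on every row instead of being pivoted
        rows.append({
            "record_key": key,
            "record_value": value,
            "id": record["id"],
            "log_ts": data["log_ts"],
        })
    return rows

line = (
    '{"_data":{"record":{"id": "1", "first_name": "John", "last_name": "Fork", '
    '"state": "California", "city": "san Mateo"}, "log_ts":"2016-07-20T00:06:01Z"}}'
)
rows = flatten_record(line)
for row in rows:
    print(json.dumps(row))
```

Each attribute becomes its own row, so a new attribute from a provider only adds rows and never changes the shape of the flattened data.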
SELECT MAX(CASE WHEN record_key = 'state' THEN record_value ELSE NULL END) AS data_record_state,
       MAX(CASE WHEN record_key = 'last_name' THEN record_value ELSE NULL END) AS data_record_last_name,
       MAX(CASE WHEN record_key = 'first_name' THEN record_value ELSE NULL END) AS data_record_first_name,
       MAX(CASE WHEN record_key = 'city' THEN record_value ELSE NULL END) AS data_record_city,
       id AS data_record_id,
       log_ts AS data_log_ts
FROM test
GROUP BY id, log_ts
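A query of this shape can be generated from whatever keys appear in the flattened data. The sketch below uses an in-memory SQLite database purely as a stand-in for Spark SQL so it runs anywhere; `build_pivot_sql` and the identifier check are illustrative assumptions, not the original implementation.

```python
import re
import sqlite3

_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def build_pivot_sql(keys, table="test"):
    """Build the MAX(CASE ...) pivot query for the given attribute keys."""
    cases = []
    for k in sorted(keys):
        if not _IDENT.fullmatch(k):
            # Keys are spliced into SQL text, so reject anything that is not a plain identifier.
            raise ValueError(f"unsafe attribute name: {k!r}")
        cases.append(
            f"MAX(CASE WHEN record_key = '{k}' THEN record_value ELSE NULL END) "
            f"AS data_record_{k}"
        )
    select_list = ",\n       ".join(
        cases + ["id AS data_record_id", "log_ts AS data_log_ts"]
    )
    return f"SELECT {select_list}\nFROM {table}\nGROUP BY id, log_ts"

# Load the flattened rows into a throwaway table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE test (record_key TEXT, record_value TEXT, id TEXT, log_ts TEXT)"
)
flat_rows = [
    ("first_name", "John", "1", "2016-07-20T00:06:01Z"),
    ("last_name", "Fork", "1", "2016-07-20T00:06:01Z"),
    ("state", "California", "1", "2016-07-20T00:06:01Z"),
    ("city", "san Mateo", "1", "2016-07-20T00:06:01Z"),
]
conn.executemany("INSERT INTO test VALUES (?, ?, ?, ?)", flat_rows)

# Discover the keys from the data, then pivot them into columns.
keys = [r[0] for r in conn.execute("SELECT DISTINCT record_key FROM test")]
sql = build_pivot_sql(keys)
cursor = conn.execute(sql)
columns = [d[0] for d in cursor.description]
result = dict(zip(columns, cursor.fetchone()))
print(result)
```

Because the column list is derived from the data, a new key in the incoming records produces a new column in the result without any change to the code.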
• Code snippet of Dynamic DDL transforming new JSON attributes into relational columns
Add the partition columns
Manually create the table due to a bug in
Add the new columns that exist in the incoming
data frame but do not exist yet in the destination
This syntax no longer works after upgrading to Spark 2.x
Three temporary workarounds for the problem in Spark 2.x:
• Launch a HiveServer2 service, then use JDBC to call Hive and alter the table
• Use Spark to connect directly to the Hive metastore, then update the
• Update the Spark source code to support the ALTER TABLE syntax and repackage it
Project all columns from the table
Append the data into the destination table
Add the new partition key
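The column-reconciliation steps annotated above can be sketched without Spark. This is a hedged illustration only: the helper names, the table name `events`, and typing every new column as `STRING` are assumptions, and the generated statement would still have to be executed through one of the workarounds listed above.

```python
def new_columns(incoming, existing):
    """Columns present in the incoming data but missing from the table, in arrival order."""
    known = {c.lower() for c in existing}  # compare case-insensitively
    return [c for c in incoming if c.lower() not in known]

def alter_table_sql(table, columns, col_type="STRING"):
    """Hive-style ALTER TABLE statement adding the columns, or None if there is nothing to add."""
    if not columns:
        return None
    col_defs = ", ".join(f"`{c}` {col_type}" for c in columns)
    return f"ALTER TABLE {table} ADD COLUMNS ({col_defs})"

def project_row(row, table_columns):
    """Order a row's values to match the table, filling absent columns with None."""
    return [row.get(c) for c in table_columns]

existing = ["data_record_id", "data_record_first_name", "data_log_ts"]
incoming = ["data_record_id", "data_record_first_name", "data_record_city", "data_log_ts"]

added = new_columns(incoming, existing)
ddl = alter_table_sql("events", added)  # "events" is a hypothetical table name
print(ddl)

# New columns are appended after the existing ones, so project in that order before appending data.
table_columns = existing + added
old_style_row = {"data_record_id": "2", "data_record_first_name": "Ann", "data_log_ts": "2016-07-21T00:00:00Z"}
projected = project_row(old_style_row, table_columns)
print(projected)
```

Projecting every row onto the full table column list lets records from providers that have not yet adopted a new attribute land in the same table, with the missing values stored as nulls.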
• Reprocessing the DDL table with a new partition key (tuning tips)
Choose the partition key wisely
Use coalesce if there are too many partitions
Use coalesce to control the number of job tasks
Use filter if the data is still too large
Using Cloud-Based Services
• S3 is not a file system.
• S3 is an object store. Similar to a key-value store.
• S3 objects are presented in a hierarchical view but are not stored in that manner.
• S3 objects are stored with a key derived from a “path”.
• The key is used to fan out the objects across shards.
• The path is for display purposes only. Only the first 3 to 4 characters are used for sharding.
• S3 does not have strong transactional semantics but instead has eventual consistency.
• S3 is not appropriate for realtime updates.
• S3 is suited for longer term storage.
• S3 has similar behaviors to HDFS but even more extreme.
• Larger latencies
• Larger files/writes – Think GBs
• Write and read latencies are higher, but the available bandwidth is much larger with S3.
• Thus throughput can be increased with parallel writers (same latency but more throughput through parallel operations).
• Partition your RDDs/DataFrames and increase your workers/executors to optimize the parallelism.
• Each write/read has more overhead due to the web service calls.
• So use larger buffers.
• Match the size of your HDFS buffers/blocks if reading/writing from/to HDFS.
• Collect data for longer durations before writing large buffers in parallel to S3.
• Retry logic – Writes to S3 can and will fail.
• Cannot stream to S3 – Complete files must be uploaded.
• Technically, you can simulate streaming with multipart upload.
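The buffering and retry advice above can be sketched as follows. Everything here is illustrative: `BufferedUploader`, its parameters, and the `upload(key, data)` callable are hypothetical stand-ins for a real S3 client, and the flaky uploader only simulates failed writes.

```python
import time

class BufferedUploader:
    """Collect records in memory and upload them as one large object per flush."""

    def __init__(self, upload, min_bytes=64 * 1024 * 1024,
                 max_attempts=5, base_delay=1.0, sleep=time.sleep):
        self.upload = upload            # hypothetical callable: upload(key, data)
        self.min_bytes = min_bytes      # flush only once this much data is buffered
        self.max_attempts = max_attempts
        self.base_delay = base_delay
        self.sleep = sleep
        self.chunks = []
        self.size = 0
        self.part = 0

    def add(self, record):
        self.chunks.append(record)
        self.size += len(record)
        if self.size >= self.min_bytes:
            self.flush()

    def flush(self):
        if not self.chunks:
            return
        data = b"".join(self.chunks)    # a complete object, since S3 cannot be appended to
        key = f"part-{self.part:05d}"
        for attempt in range(1, self.max_attempts + 1):
            try:
                self.upload(key, data)
                break
            except OSError:
                if attempt == self.max_attempts:
                    raise               # give up, keeping the buffer for the caller
                self.sleep(self.base_delay * 2 ** (attempt - 1))  # exponential backoff
        self.chunks, self.size = [], 0
        self.part += 1

# Simulated object store whose first two writes fail.
store = {}
failures = {"left": 2}

def flaky_upload(key, data):
    if failures["left"] > 0:
        failures["left"] -= 1
        raise OSError("simulated S3 write failure")
    store[key] = data

uploader = BufferedUploader(flaky_upload, min_bytes=10, base_delay=0.0)
for record in [b"aaaa\n", b"bbbb\n", b"cccc\n"]:
    uploader.add(record)
uploader.flush()  # write whatever is left at the end of the batch
print(sorted(store))
```

Several such uploaders, one per partition, could run in parallel to raise throughput, since each write pays the same latency regardless of how many run at once.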
• Tips for using S3 with HDFS
• Use the s3a scheme.
• Many optimizations including buffering options (disk-based, on-heap, or off-heap) and incremental parallel uploads (S3A Fast Upload).
• More here:
• Don’t use rename/move.
• Moves are great for HDFS to support better transactional semantics when streaming files.
• For S3, moves/renames are copy-and-delete operations, which can be very slow, especially given the eventual consistency.
• Other advanced S3 techniques:
• Hash object names to better shard the objects in a bucket.
• Use multiple buckets to increase bandwidth.
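Both techniques can be sketched with a hash of the logical path. This is an illustration under assumptions: the function names, the 4-character prefix length, the bucket names, and the choice of SHA-256 are all hypothetical, and nothing here calls S3.

```python
import hashlib

def sharded_key(path, prefix_len=4):
    """Prefix the logical path with a short hash so keys spread across shards."""
    digest = hashlib.sha256(path.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{path}"

def bucket_for(path, buckets):
    """Pick one of several buckets deterministically from the path's hash."""
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    return buckets[int.from_bytes(digest[:4], "big") % len(buckets)]

path = "logs/2016/07/20/part-00000.parquet"        # hypothetical object path
buckets = ["ingest-a", "ingest-b", "ingest-c"]      # hypothetical bucket names

key = sharded_key(path)
bucket = bucket_for(path, buckets)
print(bucket, key)
```

The hash is derived from the path alone, so readers can recompute the same key and bucket without a lookup table. The trade-off is that listing objects by date prefix no longer works directly.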
Q & A

Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark

Recently uploaded

Indian Privacy law & Infosec for Startups
Indian Privacy law & Infosec for StartupsIndian Privacy law & Infosec for Startups
Indian Privacy law & Infosec for Startups
AMol NAik
It's your unstructured data: How to get your GenAI app to production (and spe...
It's your unstructured data: How to get your GenAI app to production (and spe...It's your unstructured data: How to get your GenAI app to production (and spe...
It's your unstructured data: How to get your GenAI app to production (and spe...
Discovery Series - Zero to Hero - Task Mining Session 1
Discovery Series - Zero to Hero - Task Mining Session 1Discovery Series - Zero to Hero - Task Mining Session 1
Discovery Series - Zero to Hero - Task Mining Session 1
Zaitechno Handheld Raman Spectrometer.pdf
Zaitechno Handheld Raman Spectrometer.pdfZaitechno Handheld Raman Spectrometer.pdf
Zaitechno Handheld Raman Spectrometer.pdf
Top 12 AI Technology Trends For 2024.pdf
Top 12 AI Technology Trends For 2024.pdfTop 12 AI Technology Trends For 2024.pdf
Top 12 AI Technology Trends For 2024.pdf
Marrie Morris
Redefining Cybersecurity with AI Capabilities
Redefining Cybersecurity with AI CapabilitiesRedefining Cybersecurity with AI Capabilities
Redefining Cybersecurity with AI Capabilities
Priyanka Aash
History and Introduction for Generative AI ( GenAI )
History and Introduction for Generative AI ( GenAI )History and Introduction for Generative AI ( GenAI )
History and Introduction for Generative AI ( GenAI )
Retrieval Augmented Generation Evaluation with Ragas
Retrieval Augmented Generation Evaluation with RagasRetrieval Augmented Generation Evaluation with Ragas
Retrieval Augmented Generation Evaluation with Ragas
Keynote : AI & Future Of Offensive Security
Keynote : AI & Future Of Offensive SecurityKeynote : AI & Future Of Offensive Security
Keynote : AI & Future Of Offensive Security
Priyanka Aash
Camunda Chapter NY Meetup July 2024.pptx
Camunda Chapter NY Meetup July 2024.pptxCamunda Chapter NY Meetup July 2024.pptx
Camunda Chapter NY Meetup July 2024.pptx
Cracking AI Black Box - Strategies for Customer-centric Enterprise Excellence
Cracking AI Black Box - Strategies for Customer-centric Enterprise ExcellenceCracking AI Black Box - Strategies for Customer-centric Enterprise Excellence
Cracking AI Black Box - Strategies for Customer-centric Enterprise Excellence
Quentin Reul
Choosing the Best Outlook OST to PST Converter: Key Features and Considerations
Choosing the Best Outlook OST to PST Converter: Key Features and ConsiderationsChoosing the Best Outlook OST to PST Converter: Key Features and Considerations
Choosing the Best Outlook OST to PST Converter: Key Features and Considerations
webbyacad software
FIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptx
FIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptxFIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptx
FIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptx
FIDO Alliance
How UiPath Discovery Suite supports identification of Agentic Process Automat...
How UiPath Discovery Suite supports identification of Agentic Process Automat...How UiPath Discovery Suite supports identification of Agentic Process Automat...
How UiPath Discovery Suite supports identification of Agentic Process Automat...
FIDO Munich Seminar In-Vehicle Payment Trends.pptx
FIDO Munich Seminar In-Vehicle Payment Trends.pptxFIDO Munich Seminar In-Vehicle Payment Trends.pptx
FIDO Munich Seminar In-Vehicle Payment Trends.pptx
FIDO Alliance
Generative AI technology is a fascinating field that focuses on creating comp...
Generative AI technology is a fascinating field that focuses on creating comp...Generative AI technology is a fascinating field that focuses on creating comp...
Generative AI technology is a fascinating field that focuses on creating comp...
Nohoax Kanont
Yury Chemerkin
What's New in Copilot for Microsoft 365 June 2024.pptx
What's New in Copilot for Microsoft 365 June 2024.pptxWhat's New in Copilot for Microsoft 365 June 2024.pptx
What's New in Copilot for Microsoft 365 June 2024.pptx
Stephanie Beckett
"Making .NET Application Even Faster", Sergey Teplyakov.pptx
"Making .NET Application Even Faster", Sergey Teplyakov.pptx"Making .NET Application Even Faster", Sergey Teplyakov.pptx
"Making .NET Application Even Faster", Sergey Teplyakov.pptx
"Hands-on development experience using wasm Blazor", Furdak Vladyslav.pptx
"Hands-on development experience using wasm Blazor", Furdak Vladyslav.pptx"Hands-on development experience using wasm Blazor", Furdak Vladyslav.pptx
"Hands-on development experience using wasm Blazor", Furdak Vladyslav.pptx

Recently uploaded (20)

Indian Privacy law & Infosec for Startups
Indian Privacy law & Infosec for StartupsIndian Privacy law & Infosec for Startups
Indian Privacy law & Infosec for Startups
It's your unstructured data: How to get your GenAI app to production (and spe...
It's your unstructured data: How to get your GenAI app to production (and spe...It's your unstructured data: How to get your GenAI app to production (and spe...
It's your unstructured data: How to get your GenAI app to production (and spe...
Discovery Series - Zero to Hero - Task Mining Session 1
Discovery Series - Zero to Hero - Task Mining Session 1Discovery Series - Zero to Hero - Task Mining Session 1
Discovery Series - Zero to Hero - Task Mining Session 1
Zaitechno Handheld Raman Spectrometer.pdf
Zaitechno Handheld Raman Spectrometer.pdfZaitechno Handheld Raman Spectrometer.pdf
Zaitechno Handheld Raman Spectrometer.pdf
Top 12 AI Technology Trends For 2024.pdf
Top 12 AI Technology Trends For 2024.pdfTop 12 AI Technology Trends For 2024.pdf
Top 12 AI Technology Trends For 2024.pdf
Redefining Cybersecurity with AI Capabilities
Redefining Cybersecurity with AI CapabilitiesRedefining Cybersecurity with AI Capabilities
Redefining Cybersecurity with AI Capabilities
History and Introduction for Generative AI ( GenAI )
History and Introduction for Generative AI ( GenAI )History and Introduction for Generative AI ( GenAI )
History and Introduction for Generative AI ( GenAI )
Retrieval Augmented Generation Evaluation with Ragas
Retrieval Augmented Generation Evaluation with RagasRetrieval Augmented Generation Evaluation with Ragas
Retrieval Augmented Generation Evaluation with Ragas
Keynote : AI & Future Of Offensive Security
Keynote : AI & Future Of Offensive SecurityKeynote : AI & Future Of Offensive Security
Keynote : AI & Future Of Offensive Security
Camunda Chapter NY Meetup July 2024.pptx
Camunda Chapter NY Meetup July 2024.pptxCamunda Chapter NY Meetup July 2024.pptx
Camunda Chapter NY Meetup July 2024.pptx
Cracking AI Black Box - Strategies for Customer-centric Enterprise Excellence
Cracking AI Black Box - Strategies for Customer-centric Enterprise ExcellenceCracking AI Black Box - Strategies for Customer-centric Enterprise Excellence
Cracking AI Black Box - Strategies for Customer-centric Enterprise Excellence
Choosing the Best Outlook OST to PST Converter: Key Features and Considerations
Choosing the Best Outlook OST to PST Converter: Key Features and ConsiderationsChoosing the Best Outlook OST to PST Converter: Key Features and Considerations
Choosing the Best Outlook OST to PST Converter: Key Features and Considerations
FIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptx
FIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptxFIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptx
FIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptx
How UiPath Discovery Suite supports identification of Agentic Process Automat...
How UiPath Discovery Suite supports identification of Agentic Process Automat...How UiPath Discovery Suite supports identification of Agentic Process Automat...
How UiPath Discovery Suite supports identification of Agentic Process Automat...
FIDO Munich Seminar In-Vehicle Payment Trends.pptx
FIDO Munich Seminar In-Vehicle Payment Trends.pptxFIDO Munich Seminar In-Vehicle Payment Trends.pptx
FIDO Munich Seminar In-Vehicle Payment Trends.pptx
Generative AI technology is a fascinating field that focuses on creating comp...
Generative AI technology is a fascinating field that focuses on creating comp...Generative AI technology is a fascinating field that focuses on creating comp...
Generative AI technology is a fascinating field that focuses on creating comp...
What's New in Copilot for Microsoft 365 June 2024.pptx
What's New in Copilot for Microsoft 365 June 2024.pptxWhat's New in Copilot for Microsoft 365 June 2024.pptx
What's New in Copilot for Microsoft 365 June 2024.pptx
"Making .NET Application Even Faster", Sergey Teplyakov.pptx
"Making .NET Application Even Faster", Sergey Teplyakov.pptx"Making .NET Application Even Faster", Sergey Teplyakov.pptx
"Making .NET Application Even Faster", Sergey Teplyakov.pptx
"Hands-on development experience using wasm Blazor", Furdak Vladyslav.pptx
"Hands-on development experience using wasm Blazor", Furdak Vladyslav.pptx"Hands-on development experience using wasm Blazor", Furdak Vladyslav.pptx
"Hands-on development experience using wasm Blazor", Furdak Vladyslav.pptx

Dynamic DDL: Adding structure to streaming IoT data on the fly

  • 1. Dynamic DDL Adding Structure to Streaming Data on the Fly
  • 2. OUR SPEAKERS Hao Zou Software Engineer Data Science & Engineering GoPro David Winters Big Data Architect Data Science & Engineering GoPro
  • 3. TOPICS TO COVER • Background and Business • GoPro Data Platform Architecture • Old File-based Pipeline Architecture • New Dynamic DDL Architecture • Dynamic DDL Deep Dive • Using Cloud-Based Services (Optional) • Questions
  • 7. GoPro Data Analytics Platform Consumer Devices GoPro Apps & Cloud E-Commerce Social Media & OTT CRM Product Insight User CRM/Marketing & Personalization ERP Web Mobile
  • 8. DATA CHALLENGES AT GOPRO • Variety of data - Hardware and Software products • Software - Mobile and Desktop Apps • Hardware - Cameras, Drones, Controllers, Accessories, etc. • External - CRM, ERP, OTT, E-Commerce, Web, Social, etc. • Variety of data ingestion mechanisms - Lambda Architecture • Real-time streaming pipeline - GoPro products • Batch pipeline - External 3rd party systems • Complex Transformations • Data often stored in binary to conserve space in cameras • Heterogeneous data formats (JSON, XML, and packed binary) • Seamless Data Aggregations • Blend data between different sources, hardware, and
  • 10. OLD FILE-BASED PIPELINE ARCHITECTURE ETL Cluster • Aggregations and Joins • Hive and Spark jobs • Map/Reduce • Airflow Secure Data Mart Cluster • End User Query • Impala / Sentry • Parquet • Kerberos & LDAP Analytics Apps •Hue •Tableau •Plotly •Python •R Real Time Cluster •Log file streaming •RESTful service •Kafka •Spark Streaming •HBase Batch Induction Framework • Batch files • Scheduled downloads • Pre-processing • Java App • Airflow JSON JSON Parquet DDL • Rest API • FTP downloads • S3 sync Streaming Batch Download
  • 11. STREAMING ENDPOINT ELBHTTP Pipeline for processing of streaming logs To ETL Cluster events events state
  • 13. ETL PIPELINE HDFS Hive Metastore To SDM Cluster From Realtime Cluster Batch Induction Framework state snapshot
  • 15. PROS AND CONS OF OLD SYSTEM • Isolation of workloads • Fast ingest • Secure • Fast delivery/queries • Loosely coupled clusters • Multiple copies of data • Tightly coupled storage and compute • Lack of elasticity • Operational overhead of multiple clusters
  • 16. NEW DYNAMIC DDL ARCHITECTURE Amazon S3 Bucket Real Time Cluster Batch Induction Framework Hive Metastore Ephemeral ETL Cluster Parquet + DDL Aggregates Events + State Ephemeral Data Mart Cluster #1 Ephemeral Data Mart Cluster #2 Ephemeral Data Mart Cluster #N • Rest API • FTP downloads • S3 sync Streaming Batch Download •Notebooks •Tableau •Plotly •Python •R Improvements Single copy of data Separate storage from compute Elastic clusters Single long running cluster to maintain Parquet + DDL Dynamic DDL!
  • 18. NEW DYNAMIC DDL ARCHITECTURE Streaming Cluster ELBHTTP Pipeline for processing of streaming logs S3 HIVE METASTORE transition Centralized Hive MetaStore For each topic, dynamically add the table structure and create the table, or insert data into the table if it already exists
  • 19. DYNAMIC DDL • What is Dynamic DDL? • Dynamic DDL adds structure (a schema) to the data on the fly whenever the providers of the data change their structure. • Why is Dynamic DDL needed? • Providers of data change their structure constantly. Without Dynamic DDL, the table schema is hard-coded and has to be updated manually whenever the incoming data changes. • All of the aggregation SQL would have to be updated manually after each schema change. • Faster turnaround for data ingestion: data can be ingested and made available within minutes (sometimes seconds). • How did we do this? • Using Spark SQL/DataFrame • See example
  • 20. DYNAMIC DDL • Example: {"_data":{"record":{"id": "1", "first_name": "John", "last_name": "Fork", "state": "California", "city": "san Mateo"}, "log_ts":"2016-07-20T00:06:01Z"}} Fixed schema Dynamically generated schema {"record_key":"state","record_value":"California","id":"1","log_ts":"2016-07-20T00:06:01Z"} {"record_key":"last_name","record_value":"Fork","id":"1","log_ts":"2016-07-20T00:06:01Z"} {"record_key":"city","record_value":"san Mateo","id":"1","log_ts":"2016-07-20T00:06:01Z"} {"record_key":"first_name","record_value":"John","id":"1","log_ts":"2016-07-20T00:06:01Z"} Flatten the data first SELECT MAX(CASE WHEN record_key = 'state' THEN record_value ELSE NULL END) AS data_record_state, MAX(CASE WHEN record_key = 'last_name' THEN record_value ELSE NULL END) AS data_record_last_name, MAX(CASE WHEN record_key = 'first_name' THEN record_value ELSE NULL END) AS data_record_first_name, MAX(CASE WHEN record_key = 'city' THEN record_value ELSE NULL END) AS data_record_city, id AS data_record_id, log_ts AS data_log_ts FROM test GROUP BY id, log_ts
  • 21. DYNAMIC DDL USING SPARK SQL/DATAFRAME • Code snippet of Dynamic DDL transforming new JSON attributes into relational columns Add the partition columns Manually create the table due to a bug in Spark
  • 22. DYNAMIC DDL USING SPARK SQL/DATAFRAME Add the new columns that exist in the incoming data frame but do not yet exist in the destination table This syntax no longer works after upgrading to Spark 2.x
  • 23. DYNAMIC DDL USING SPARK SQL/DATAFRAME Three temporary ways to solve the problem in Spark 2.x: • Launch a HiveServer2 service, then use JDBC to call Hive to alter the table • Use Spark to connect directly to the Hive metastore, then update the metadata • Update the Spark source code to support the ALTER TABLE syntax and repackage it
  • 24. DYNAMIC DDL USING SPARK SQL/DATAFRAME Project all columns from the table Append the data into the destination table
  • 25. DYNAMIC DDL USING SPARK SQL/DATAFRAME Add the new partition key • Reprocessing the DDL table with a new partition key (tuning tips) • Choose the partition key wisely • Use coalesce if there are too many partitions • Use coalesce to control the number of job tasks • Use filter if the data is still too large
  • 27. USING S3: WHAT IS S3? • S3 is not a file system. • S3 is an object store. Similar to a key-value store. • S3 objects are presented in a hierarchical view but are not stored in that manner. • S3 objects are stored with a key derived from a “path”. • The key is used to fan out the objects across shards. • The path is for display purposes only. Only the first 3 to 4 characters are used for sharding. • S3 does not have strong transactional semantics but instead has eventual consistency. • S3 is not appropriate for realtime updates. • S3 is suited for longer term storage.
  • 28. USING S3: BEHAVIORS • S3 has similar behaviors to HDFS but even more extreme. • Larger latencies • Larger files/writes – Think GBs • Write and read latencies are larger but the bandwidth is much larger with S3. • Thus throughput can be increased with parallel writers (same latency but more throughput through parallel operations) • Partition your RDDs/DataFrames and increase your workers/executors to optimize the parallelism. • Each write/read has more overhead due to the web service calls. • So use larger buffers. • Match the size of your HDFS buffers/blocks if reading/writing from/to HDFS. • Collect data for longer durations before writing large buffers in parallel to S3. • Retry logic – Writes to S3 can and will fail. • Cannot stream to S3 – Complete files must be uploaded. • Technically, you can simulate streaming with multipart upload.
  • 29. USING S3: TIPS • Tips for using S3 with HDFS • Use the s3a scheme. • Many optimizations including buffering options (disk-based, on-heap, or off-heap) and incremental parallel uploads (S3A Fast Upload). • More here: aws/tools/hadoop-aws/index.html#S3A • Don’t use rename/move. • Moves are great for HDFS to support better transactional semantics when streaming files. • For S3, moves/renames are copy and delete operations which can be very slow especially due to the eventual consistency. • Other advanced S3 techniques: • Hash object names to better shard the objects in a bucket. • Use multiple buckets to increase bandwidth.
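
Two of the S3 tips above (retry logic for failed writes, and hashing object names to spread them across shards) can be sketched in a few lines. This is a minimal illustration rather than the code used in the talk: `hashed_key`, `with_retries`, the 4-character prefix length, and the backoff parameters are all hypothetical choices, and `operation` stands in for whatever upload call a real client library provides.

```python
import hashlib
import time


def hashed_key(path, prefix_len=4):
    """Prefix an object key with a short hash of its path.

    The slides note that only the first few characters of a key are used
    for sharding, so a hash prefix spreads similar paths across shards.
    The prefix length of 4 is an assumption for illustration.
    """
    digest = hashlib.md5(path.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{path}"


def with_retries(operation, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call operation() until it succeeds, with exponential backoff.

    `operation` is a hypothetical zero-argument callable wrapping a write;
    it is expected to raise IOError (OSError) when the write fails.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except IOError:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            sleep(base_delay * 2 ** (attempt - 1))


if __name__ == "__main__":
    # Simulate an upload that fails twice before succeeding.
    state = {"calls": 0}

    def flaky_upload():
        state["calls"] += 1
        if state["calls"] < 3:
            raise IOError("simulated S3 write failure")
        return hashed_key("events/2016/07/20/part-0000.json")

    # A no-op sleep keeps the demo instant.
    print(with_retries(flaky_upload, sleep=lambda seconds: None))
```

Passing `sleep` as a parameter keeps the backoff testable without real delays; in production the default `time.sleep` would apply.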

Editor's Notes

  1. GoPro is the ultimate accessory for active people with smartphones. Products and services include: core products (cameras); advanced solutions (Karma, stabilization, and VR); accessories and mounts; and a software suite that ties it all together. GoPro has become the ultimate, end-to-end storytelling solution.
  2. This is what we wanted to build.
  3. These were the challenges to solve.
  4. High-level architecture of the data platform: isolation of workloads into 3 clusters (ingest, ETL, delivery); Lambda architecture; input and output data formats; cadence of clusters. A word about data sources. IoT data: logs from devices, applications (desktop and mobile), external systems and services, ERP, web/email marketing, etc.; some raw and gzip, some binary and JSON; some streaming and some batch. Batch data: web marketing, campaigns, social media, ERP, CRM. Lambda architecture: both batch and stream processing. Basic needs/workloads in a data platform: high-throughput ingestion; transformations (joins, aggregations, etc.); fast queries. Today, we have 3 clusters to isolate these workloads. We started with one cluster, ETL, and everything ran there: ingest (Flume), batch (framework), ETL (Hive), and analytical (Impala). There was lots of resource contention (I/O, memory, cores). To alleviate the resource contention, we opted for 3 clusters to isolate the workloads. Ingest cluster for near-real-time streaming: Kafka and Spark Streaming (Cloudera Parcels); input: logs, output: JSON; minutes cadence, moving towards more real-time in seconds; induction framework for scheduled batch ingestion. ETL cluster for heavy-duty aggregation: input: JSON flat files, output: aggregated Parquet files; Hive (Map/Reduce); hourly cadence. Secure Data Mart: Kerberos, LDAP, Active Directory, Apache Sentry (Cloudera committers); input: compressed Parquet files; analytical SQL engine for Tableau, ad-hoc queries (Hue), data wrangling (Trifacta), and data science (Jupyter Notebooks and RStudio). With all that said, we will examine the newer technologies that will enable us to simplify our architecture and merge clusters in the future. Kudu is one possible new technology that could help us to consolidate some of the clusters.
  5. Let’s take a deeper dive into our streaming ingestion. Logs are streamed from devices and software applications (desktop and mobile) to a web service endpoint. The endpoint is an elastic pool of Tomcat servers sitting behind an ELB in AWS. A custom servlet pushes logs into Kafka topics by environment. A series of Spark Streaming jobs processes the logs from Kafka. The landing place in the ingestion cluster is HDFS, with JSON flat files. Rationalization of tech stacks. Why Kafka? Unrivaled write throughput for a queue (traditional queue throughput: 100K writes/sec on the biggest box you can buy; Kafka throughput: 1M writes/sec on 3-4 commodity servers); strong ordering policy of messages; distributed; fault-tolerant through replication; supports synchronous and asynchronous writes; pairs nicely with Spark Streaming for simpler scaling out (Kafka topic partitions map directly to Spark RDD partitions/tasks). Why Spark Streaming? Strong transactional semantics ("exactly once" processing); leverages Spark technology for both data ingest and analytics; horizontally scalable, with high throughput for micro-batching; large open source community.
  6. Keyword: impedance mismatch. As previously stated, logs are streamed from devices and software applications (desktop and mobile) to the web service endpoint. Logs are diverse: gzipped, raw, binary, JSON, batched events, streamed single events. They vary significantly in size, from < 1 KB to > 1 MB. Logs are redirected based on data category and routed to the appropriate Kafka topic and respective Spark Streaming job. Logs move from Kafka topic to Kafka topic, with each Kafka topic having a Spark Streaming job that consumes the log, processes the log, and writes the log to another topic. The jobs form a tree-like structure, with more generic logic towards the root of the tree and more specialized logic towards the leaf nodes. There are generic jobs/services and specialized jobs/services. Generic services include PII removal and hashing, IP-to-geo lookups, and batched writing to HDFS. We perform batched HDFS writing since Kafka likes small messages (1 KB ideal) and HDFS likes large files (100+ MB). Specialized services contain business logic. Finally, the logs are written into HDFS as JSON flat files (which are sometimes compressed depending on the type of data). Scheduled ETL jobs perform a distributed copy (distcp) to move the data to the ETL cluster for further, heavier aggregations.
  7. A few things: two flows of data (streaming and batch); join data sources; aggregate data sources; convert to a compressed columnar format (gzipped Parquet files). On the ETL cluster, here’s where we do our heavy lifting. It is almost entirely Hive Map/Reduce jobs, with some Impala to make the really big, gnarly aggregations more performant. Previously, we had a custom Java Map/Reduce job for sessionization of events; this has been replaced with a Spark Streaming job on the ingestion cluster. In the future, we want to push as much of the ETL processing as possible back into the ingestion cluster for more real-time processing. We also have a custom Java induction framework which ingests data from external services that only make data available on slower schedules (daily, twice daily, etc.). The output from the ETL cluster is Parquet files that are added to partitioned managed tables in the Hive metastore. The Parquet files are then copied via distcp to the Secure Data Mart.
  8. Parquet files are copied from the ETL cluster and added to partitioned managed tables in the Hive Metastore of the Secure Data Mart. The Secure Data Mart is protected with Apache Sentry. Kerberos is used for authentication (corporate standard). Active Directory stores the groups (corporate standard). Access control is role based and the roles are assigned with Sentry. Hue has a Sentry UI app to manage authorization.
  9. Store data in one place: data (S3) + structure (Hive Metastore). Separate compute nodes from storage nodes. Elasticity: size of clusters and number of clusters. Lower operational overhead of maintaining HDFS storage nodes. Redirect batch ingest into stream ingest (pump batch data into Kafka). RESULT: one codebase for both stream and batch ingestion.
  10. Even though the fixed schema resolves the problem of the data provider changing the data structure frequently, it becomes really difficult for the analysts and data scientists to analyze the data: (1) how would they know that these two rows come from one event? (2) They have to use the id to associate the two rows.
  11. The table does not exist, so create it. We create it manually due to a bug in Spark where its metadata gets out of sync with Hive's metadata if we let Spark automatically create the table. So we create the table first and later alter the table and add columns to it. See below for more details:
  12. The table does not exist, so create it. We create it manually due to a bug in Spark where its metadata gets out of sync with Hive's metadata if we let Spark automatically create the table and we later alter the table and add columns to it. See below for more details:
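
The notes above describe creating the table first and later altering it to add columns as new attributes arrive. The slides' actual Spark SQL/DataFrame code is not reproduced in this transcript, so the following is only a plain-Python sketch of the idea under stated assumptions: the helper names (`flatten`, `new_columns`, `alter_table_ddl`, `project`) are hypothetical, every new column is assumed to be typed `STRING`, and the table name `test` is borrowed from the slide 20 example.

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into one level, joining keys with underscores.

    A leading underscore is dropped so that "_data" becomes "data",
    matching the column names on slide 20 (e.g. data_record_state).
    """
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}".lstrip("_")
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}_"))
        else:
            flat[name] = value
    return flat


def new_columns(flat_row, table_columns):
    """Columns present in the incoming row but missing from the table."""
    return sorted(set(flat_row) - set(table_columns))


def alter_table_ddl(table, columns):
    """Build a Hive-style ALTER TABLE statement (all columns assumed STRING)."""
    cols = ", ".join(f"`{c}` STRING" for c in columns)
    return f"ALTER TABLE {table} ADD COLUMNS ({cols})"


def project(flat_row, table_columns):
    """Order a row by the table's columns, filling missing values with None."""
    return [flat_row.get(c) for c in table_columns]


if __name__ == "__main__":
    event = {"_data": {"record": {"id": "1", "first_name": "John",
                                  "last_name": "Fork", "state": "California",
                                  "city": "san Mateo"},
                       "log_ts": "2016-07-20T00:06:01Z"}}
    existing = ["data_record_id", "data_log_ts"]  # hypothetical current schema
    row = flatten(event)
    added = new_columns(row, existing)
    print(alter_table_ddl("test", added))
    print(project(row, existing + added))
```

Detecting the column difference before appending keeps the destination table a superset of every incoming record, which is what lets the aggregation SQL keep working as providers change their structure.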