At the end of the day, the only thing data scientists want is tabular data for their analysis. They do not want to spend hours or days preparing data. How does a data engineer handle the massive amount of data being streamed at them from IoT devices and apps, and at the same time add structure to it so that data scientists can focus on finding insights rather than preparing data? By the way, you need to do this within minutes (sometimes seconds). Oh... and there are a bunch more data sources that you need to ingest, and the current providers of data are changing their structure.
At GoPro, we have massive amounts of heterogeneous data being streamed at us from our consumer devices
and applications, and we have developed a concept of "dynamic DDL" to structure our streamed data on the fly using
Spark Streaming, Kafka, HBase, Hive, and S3. The idea is simple. Add structure (schema) to the data as soon as possible.
Allow the providers of the data to dictate the structure. And automatically create event-based and state-based tables (DDL)
for all data sources to allow data scientists to access the data via their lingua franca, SQL, within minutes.
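To make the idea concrete, here is a minimal PySpark sketch of the "add schema as early as possible" pattern - not GoPro's actual pipeline; the Kafka topic, broker, bucket paths, and sample event are hypothetical, and it assumes the spark-sql-kafka package is available:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json

spark = SparkSession.builder.appName("dynamic-ddl-sketch").getOrCreate()

# Read raw events from Kafka (topic and broker are placeholders).
raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "device-events")
    .load()
)

# Let the data provider dictate the structure: infer the schema from a
# sample event instead of hard-coding it.
sample_event = '{"device_id": "cam-001", "ts": "2017-06-01T12:00:00Z", "battery": 87}'
event_schema = spark.read.json(
    spark.sparkContext.parallelize([sample_event])
).schema

# Apply that schema to the stream so it becomes tabular immediately.
events = (
    raw.selectExpr("CAST(value AS STRING) AS json")
    .select(from_json("json", event_schema).alias("event"))
    .select("event.*")
)

# Persist to a queryable location so analysts can reach it via SQL quickly.
query = (
    events.writeStream.format("parquet")
    .option("path", "s3a://example-bucket/events/")
    .option("checkpointLocation", "s3a://example-bucket/checkpoints/events/")
    .outputMode("append")
    .start()
)
```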
Bridle your Flying Islands and Castles in the Sky: Built-in Governance and Se...DataWorks Summit
Today enterprises desire to move more and more of their data lakes to the cloud to help them execute faster, increase productivity, drive innovation while leveraging the scale and flexibility of the cloud. However, such gains come with risks and challenges in the areas of data security, privacy, and governance. In this talk we cover how enterprises can overcome governance and security obstacles to leverage these new advances that the cloud can provide to ease the management of their data lakes in the cloud. We will also show how the enterprise can have consistent governance and security controls in the cloud for their ephemeral analytic workloads in a multi-cluster cloud environment without sacrificing any of the data security and privacy/compliance needs that their business context demands. Additionally, we will outline some use cases and patterns as well as best practices to rationally manage such a multi-cluster data lake infrastructure in the cloud.
Bring Your SAP and Enterprise Data to Hadoop, Kafka, and the CloudDataWorks Summit
This document discusses how organizations can leverage data and analytics to power their business models. It provides examples of Fortune 100 companies that are using Attunity products to build data lakes and ingest data from SAP and other sources into Hadoop, Apache Kafka, and the cloud in order to perform real-time analytics. The document outlines the benefits of Attunity's data replication tools for extracting, transforming, and loading SAP and other enterprise data into data lakes and data warehouses.
Cloudy with a chance of Hadoop - real world considerationsDataWorks Summit
Over the last eighteen months, we have seen significant adoption of Hadoop eco-system centric big data processing in Microsoft Azure and Amazon AWS. In this talk we present some of the lessons learned and architectural considerations for cloud-based deployments including security, fault tolerance and auto-scaling.
We look at how Hortonworks Data Cloud and Cloudbreak can automate that scaling of Hadoop clusters, showing how it can react dynamically to workloads, and what that can deliver in cost-effective Hadoop-in-cloud deployments.
Apache Ignite vs Alluxio: Memory Speed Big Data AnalyticsDataWorks Summit
Apache Ignite vs Alluxio: Memory Speed Big Data Analytics - Apache Spark's in-memory capabilities catapulted it to become the premier processing framework for Hadoop. Apache Ignite and Alluxio, both high-performance, integrated, and distributed in-memory platforms, take Apache Spark to the next level by providing an even more powerful, faster, and more scalable platform for the most demanding data processing and analytic environments.
Speaker
Irfan Elahi, Consultant, Deloitte
Hadoop clusters can store nearly everything in your data lake cheaply and blazingly fast. Answering questions and gaining insights from this ever-growing stream becomes the decisive part for many businesses. Increasingly, data has a natural structure as a graph, with vertices linked by edges, and many questions about the data involve graph traversals or other complex queries for which one does not have an a priori bound on the length of paths.
Spark with GraphX is great for answering relatively simple graph questions that are worth starting a Spark job for, because they essentially involve the whole graph. But does it make sense to start one for every ad-hoc query, and is it suitable for complex real-time queries?
In this talk I will introduce an alternative solution that adds those features to an existing Hadoop/Spark setup and enables real-time insights. I will address the following topics:
* Challenges in gaining deeper insights from large amounts of graph data
* Benefits and limitations of graph analysis with Spark
* Introduction to ArangoDB SmartGraphs
* Deployment of Hadoop, Spark and ArangoDB using DC/OS
* Performing complex queries on billions of nodes and edges leveraging ArangoDB SmartGraphs (Live Demo)
Treat your enterprise data lake indigestion: Enterprise ready security and go...DataWorks Summit
Most enterprises with large data lakes today are flying blind when it comes to understanding how the data in their data lakes is organized, accessed, and utilized to create real business value. Coupled with the need to democratize data, enterprises often realize they have created a data swamp loaded with all kinds of data assets, without any curation and without appropriate security controls, hoping that developers and analysts can responsibly collaborate to generate insights. In this talk we will provide a broad overview of how organizations can use open source frameworks such as Apache Ranger and Apache Knox to secure their data lakes and Apache Atlas to effectively provide open metadata and governance services for the Hadoop ecosystem. We will provide an overview of the new features that have been added in each of these Apache projects recently and how enterprises can leverage these new features to build a robust security and governance model for their data lakes.
Speaker
Owen O'Malley, Co-Founder & Technical Fellow, Hortonworks
The document discusses how EMC Isilon scale-out NAS storage improves Hadoop resiliency and operational efficiency. It analyzes the impact of DataNode and TaskTracker failures on Hadoop jobs. EMC Isilon provides high availability, independent scalability of storage and compute, data protection features, and support for multiple Hadoop distributions and protocols like HDFS, NFS, SMB. This allows using existing data for analysis without replication and reduces time-to-results for Hadoop jobs.
This document discusses Hadoop integration with cloud storage. It describes the Hadoop-compatible file system architecture, which allows Hadoop applications to work with both HDFS and cloud storage transparently. Recent enhancements to the S3A file system connector for Amazon S3 are discussed, including performance improvements and support for encryption. Benchmark results show significant performance gains for Hive queries with S3A compared to earlier versions. Upcoming work on output committers, object store abstraction, and consistency is outlined.
The challenge of computing big data for evolving digital business processes demands a variety of computation techniques and engines (SQL, OLAP, time-series, graph, document store) working in a unified framework. A simple architecture for data transformations, together with security, governance, and operational administration, is a necessary critical component for enterprise production environments supporting day-to-day business processes. In this session, you will learn about best practices and critical components to ensure business value from the latest production deployments. Hear how existing customers are using SAP Vora and the value they have achieved so far with this in-memory engine for distributed data processing. The session provides you with a clear understanding of how SAP Vora and open source components like Apache Hadoop and Apache Spark offer an architecture that supports a wide variety of use cases and industries. You will also receive very useful insight into where to find development resources, test-drive demos, and general documentation.
The document discusses deploying Hadoop in the cloud. Some key benefits of using Hadoop in the cloud include scalability, automated failover of replicated data, and cost efficiency through distributed processing and storage. Microsoft's Azure HDInsight offering provides a fully managed Hadoop and Spark service in the cloud that allows clusters to be provisioned in minutes and is optimized for analytics workloads. The Cortana Intelligence Suite integrates big data technologies like HDInsight with machine learning and data processing tools.
Today enterprises desire to move more and more of their data lakes to the cloud to help them execute faster, increase productivity, drive innovation while leveraging the scale and flexibility of the cloud. However, such gains come with risks and challenges in the areas of data security, privacy, and governance. In this talk we cover how enterprises can overcome governance and security obstacles to leverage these new advances that the cloud can provide to ease the management of their data lakes in the cloud. We will also show how the enterprise can have consistent governance and security controls in the cloud for their ephemeral analytic workloads in a multi-cluster cloud environment without sacrificing any of the data security and privacy/compliance needs that their business context demands. Additionally, we will outline some use cases and patterns as well as best practices to rationally manage such a multi-cluster data lake infrastructure in the cloud.
Speaker:
Jeff Sposetti, Product Management, Hortonworks
Big SQL: Powerful SQL Optimization - Re-Imagined for open sourceDataWorks Summit
Let's be honest - there are some pretty amazing capabilities locked in proprietary SQL engines which have had decades of R&D baked into them. In this session, learn how IBM, working with the Apache community, has unlocked the value of its SQL optimizer for Hive, HBase, ObjectStore, and Spark - helping customers avoid lock-in while providing the best performance, concurrency, and scalability for complex, analytical SQL workloads. You'll also learn how the SQL engine was extended and integrated with Ambari, Ranger, YARN/Slider, and HBase. We share the results of this project, which has enabled running all 99 TPC-DS queries at a world-record-breaking 100 TB scale factor.
1. HAWQ is an open source MPP database for Hadoop that provides SQL querying capabilities and integration with data in HDFS and other sources.
2. It uses a master-segment architecture with dynamic resource management through YARN to enable high performance SQL queries across large datasets.
3. The document discusses HAWQ's architecture, performance advantages, extensions for querying external data through PXF, and integration with Hive through different connectors and a unified catalog.
Protecting your Critical Hadoop Clusters Against DisastersDataWorks Summit
Our enterprise customers are deploying business-critical applications on Hadoop clusters and now want a business continuity solution that will protect against disasters and cover both processed and unstructured data with varying recovery point objective (RPO) requirements. Our customers are also asking for backup and restore of select unstructured data and databases, in case of accidental deletion by users. They are asking us to automagically tier and move data that becomes less frequently accessed over time to high-density, slower media or to the cloud. We will unveil a product suite that is going to solve those customer pain points in phases, starting with disaster recovery of the Hadoop ecosystem with single-source-of-truth enforcement. We will also cover the deep-dive architecture that required extensive changes in Hive, HDFS, Ranger, and Atlas (more in the pipeline) and demonstrate the end-to-end functioning of our data lifecycle management.
Speakers:
Jeff Sposetti, Product Management, Hortonworks
Venkat Ranganathan, Director of Engineering, Hortonworks
This document discusses enterprise-grade big data solutions from HPE. It outlines HPE's reference architecture for big data workloads including components like data lakes, data warehouses, archival storage, event processing, and in-memory analytics. It also discusses HPE's investments in Hortonworks and collaboration to optimize Hadoop for performance. The document promotes attending an HPE session at the Hadoop Summit on modernizing data warehouses and visiting the HPE booth for demos and a trivia game.
Format Wars: from VHS and Beta to Avro and ParquetDataWorks Summit
The document discusses different data storage formats such as text, Avro, Parquet, and their suitability for writing and reading data. It provides examples of how to choose a format based on factors like query needs, data types, and whether schemas need to evolve. The document also demonstrates how Avro can handle schema evolution by adding or changing fields while still reading existing data.
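As a small illustration of the schema-evolution point, here is a hedged Python sketch using the fastavro library (the record and field names are hypothetical): a record written with an old schema is read back with a newer schema that adds a defaulted field.

```python
import io
import fastavro

# Writer's (old) schema.
writer_schema = {
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "device_id", "type": "string"},
        {"name": "battery", "type": "int"},
    ],
}

# Reader's (new) schema adds a field with a default, so old records still resolve.
reader_schema = {
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "device_id", "type": "string"},
        {"name": "battery", "type": "int"},
        {"name": "firmware", "type": "string", "default": "unknown"},
    ],
}

buf = io.BytesIO()
fastavro.writer(buf, writer_schema, [{"device_id": "cam-001", "battery": 87}])
buf.seek(0)

# Old data read with the new schema: the missing field takes its default value.
for record in fastavro.reader(buf, reader_schema=reader_schema):
    print(record)  # {'device_id': 'cam-001', 'battery': 87, 'firmware': 'unknown'}
```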
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...DataWorks Summit
This document discusses challenges and solutions for using object storage with Apache Spark and Hive. It covers:
- Eventual consistency issues in object storage and lack of atomic operations
- Improving performance of object storage connectors through caching, optimized metadata operations, and consistency guarantees
- Techniques like S3Guard and committers that address consistency and correctness problems with output commits in object storage
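As a rough illustration of the committer point, the sketch below configures a PySpark session to use an S3A staging committer instead of rename-based commits; it assumes Hadoop 3.1+ S3A committers and Spark's optional spark-hadoop-cloud module are on the classpath, and the bucket path is hypothetical.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-committer-sketch")
    # Use the "directory" staging committer rather than rename-based commits.
    .config("spark.hadoop.fs.s3a.committer.name", "directory")
    # Bind Spark's commit protocol to the S3A committer factory.
    .config("spark.sql.sources.commitProtocolClass",
            "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
    .config("spark.sql.parquet.output.committer.class",
            "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
    .getOrCreate()
)

df = spark.range(1000).withColumnRenamed("id", "event_id")
# Output is committed via multipart uploads, avoiding slow and non-atomic
# directory renames on the object store.
df.write.mode("overwrite").parquet("s3a://example-bucket/events/")
```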
This document discusses designing a new big data platform to replace an existing complex and outdated one. It analyzes challenges with the current platform, including inability to keep up with business needs. The proposed new platform called Dredge would use abstraction layers to integrate big data tools in a loosely coupled and scalable way. This would simplify development and maintenance while supporting business goals. Key aspects of Dredge include declarative configuration, logical workflows, and plug-and-play integration of tools like HDFS, Hive, HBase, Kafka and Spark in a reusable and event-driven manner. The new platform aims to improve scalability, reduce costs and better support analytics needs over time.
Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...DataWorks Summit
Back in 2014, our team set out to change the way the world exchanges and collaborates with data. Our vision was to build a single tenant environment for multiple organisations to securely share and consume data. And we did just that, leveraging multiple Hadoop technologies to help our infrastructure scale quickly and securely.
Today Data Republic’s technology delivers a trusted platform for hundreds of enterprise level companies to securely exchange, commercialise and collaborate with large datasets.
Join Head of Engineering, Juan Delard de Rigoulières and Senior Solutions Architect, Amin Abbaspour as they share key lessons from their team’s journey with Hadoop:
* How a startup leveraged a clever combination of Hadoop technologies to build a secure data exchange platform
* How Hadoop technologies helped us deliver key solutions around governance, security and controls of data and metadata
* An evaluation of the maturity and usefulness of some Hadoop technologies in our environment: Hive, HDFS, Spark, Ranger, Atlas, Knox, Kylin - we've used them all extensively.
* Our bold approach of exposing APIs directly to end users, as well as the challenges, learnings, and code we created in the process
* Learnings from the front-line: How our team coped with code changes, performance tuning, issues and solutions while building our data exchange
Whether you're an enterprise-level business or a start-up looking to scale, this case study discussion offers behind-the-scenes lessons and key tips for using Hadoop technologies to manage data governance and collaboration in the cloud.
Speakers:
Juan Delard De Rigoulieres, Head of Engineering, Data Republic Pty Ltd
Amin Abbaspour, Senior Solutions Architect, Data Republic
The document provides an overview of new features in Apache Ambari 2.1, including rolling upgrades, alerts, metrics, an enhanced dashboard, smart configurations, views, Kerberos automation, and blueprints. Key highlights include the ability to perform rolling upgrades of Hadoop clusters without downtime by managing different software versions side-by-side, new alert types and a user interface for viewing and customizing alerts, integration of a metrics service for collecting and querying metrics from Hadoop services, customizable service dashboards with new widget types, smart configurations that provide recommended values and validate configurations based on cluster attributes and dependencies, and automated Kerberos configuration.
Dynamic DDL: Adding Structure to Streaming Data on the Fly with David Winters...Databricks
At the end of day, the only thing that data scientists want is tabular data for their analysis. They do not want to spend hours or days preparing data. How does a data engineer handle the massive amount of data that is being streamed at them from IoT devices and apps, and at the same time add structure to it so that data scientists can focus on finding insights and not preparing data? By the way, you need to do this within minutes (sometimes seconds). Oh… and there are a lot of other data sources that you need to ingest, and the current providers of data are changing their structure.
GoPro has massive amounts of heterogeneous data being streamed from their consumer devices and applications, and they have developed the concept of “dynamic DDL” to structure their streamed data on the fly using Spark Streaming, Kafka, HBase, Hive and S3. The idea is simple: Add structure (schema) to the data as soon as possible; allow the providers of the data to dictate the structure; and automatically create event-based and state-based tables (DDL) for all data sources to allow data scientists to access the data via their lingua franca, SQL, within minutes.
Adding structure to your streaming pipelines: moving from Spark streaming to ...DataWorks Summit
How do you go from a strictly typed object-based streaming pipeline with simple operations to a structured streaming pipeline with higher order complex relational operations? This is what the Data Engineering team did at GoPro to scale up the development of streaming pipelines for the rapidly growing number of devices and applications.
When big data frameworks such as Hadoop first came into existence, developers were happy because they could finally process large amounts of data without writing complex multi-threaded code or, worse yet, complicated distributed code. Unfortunately, only very simple operations were available, such as map and reduce. Almost immediately, higher-level operations were desired, similar to relational operations. And so Hive and dozens (hundreds?) of SQL-based big data tools became available for more developer-efficient batch processing of massive amounts of data.
In recent years, big data has moved from batch processing to stream-based processing since no one wants to wait hours or days to gain insights. Dozens of stream processing frameworks exist today and the same trend that occurred in the batch-based big data processing realm has taken place in the streaming world, so that nearly every streaming framework now supports higher level relational operations.
In this talk, we will discuss in a very hands-on manner how the streaming data pipelines for GoPro devices and apps have moved from the original Spark streaming with its simple RDD-based operations in Spark 1.x to Spark's structured streaming with its higher level relational operations in Spark 2.x. We will talk about the differences, advantages, and necessary pain points that must be addressed in order to scale relational-based streaming pipelines for massive IoT streams. We will also talk about moving from “hand built” Hadoop/Spark clusters running in the cloud to using a Spark-based cloud service. DAVID WINTERS, Big Data Architect, GoPro and HAO ZOU, Senior Software Engineer, GoPro
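For readers unfamiliar with the two APIs, here is a minimal, hedged sketch (not GoPro's code; topic and broker names are hypothetical) contrasting the Spark 1.x DStream style, shown in comments, with an equivalent Spark 2.x structured streaming aggregation:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, get_json_object, window

spark = SparkSession.builder.appName("structured-streaming-sketch").getOrCreate()

# Spark 1.x style (RDD-based DStreams), for comparison only:
#   stream = KafkaUtils.createDirectStream(ssc, ["device-events"], kafka_params)
#   counts = (stream.map(lambda msg: (extract_device_id(msg), 1))
#                   .reduceByKey(lambda a, b: a + b))

# Spark 2.x structured streaming: the same pipeline expressed as relational
# operations on an unbounded DataFrame, including event-time windows.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "device-events")
    .load()
    .selectExpr("CAST(value AS STRING) AS json", "timestamp")
    .withColumn("device_id", get_json_object(col("json"), "$.device_id"))
)

counts = events.groupBy(
    window(col("timestamp"), "10 minutes"), col("device_id")
).count()

query = counts.writeStream.outputMode("update").format("console").start()
```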
In this presentation, you will get a look under the covers of Amazon Redshift, a fast, fully managed, petabyte-scale data warehouse service for less than $1,000 per TB per year. Learn how Amazon Redshift uses columnar technology, optimized hardware, and massively parallel processing to deliver fast query performance on data sets ranging in size from hundreds of gigabytes to a petabyte or more. We'll also walk through techniques for optimizing performance, and you'll hear from a specific customer about their use case, taking advantage of fast performance on enormous datasets and leveraging economies of scale on the AWS platform.
Speakers:
Ian Meyers, AWS Solutions Architect
Toby Moore, Chief Technology Officer, Space Ape
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
Spark SQL is a highly scalable and efficient relational processing engine with easy-to-use APIs and mid-query fault tolerance. It is a core module of Apache Spark. Spark SQL can process, integrate and analyze data from diverse data sources (e.g., Hive, Cassandra, Kafka and Oracle) and file formats (e.g., Parquet, ORC, CSV, and JSON). This talk will dive into the technical details of Spark SQL spanning the entire lifecycle of a query execution. The audience will get a deeper understanding of Spark SQL and learn how to tune Spark SQL performance.
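As a tiny illustration of the kind of tuning the talk covers, the hedged sketch below (table contents are synthetic) inspects a query plan produced by Catalyst, forces a broadcast join for a small dimension table, and adjusts shuffle parallelism:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("spark-sql-tuning-sketch").getOrCreate()

# Fewer shuffle partitions for a modest dataset avoids many tiny tasks.
spark.conf.set("spark.sql.shuffle.partitions", "64")

facts = spark.range(10_000_000).withColumnRenamed("id", "user_id")
dims = spark.range(1_000).withColumnRenamed("id", "user_id")

# Hint a broadcast hash join for the small side instead of a shuffle join.
joined = facts.join(broadcast(dims), "user_id")

# explain(True) prints the parsed, analyzed, optimized, and physical plans.
joined.explain(True)
```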
Apache CarbonData+Spark to realize data convergence and Unified high performa...Tech Triveni
Challenges in Data Analytics:
Different application scenarios need different storage solutions: HBase is ideal for point-query scenarios but unsuitable for multi-dimensional queries. MPP is suitable for data warehouse scenarios, but the engine and data are coupled together, which hampers scalability. OLAP stores used in BI applications perform best for aggregate queries, but full-scan queries perform sub-optimally. Moreover, they are not suitable for real-time analysis. These distinct systems lead to low resource sharing and need different pipelines for data and application management.
At wetter.com we build analytical B2B data products and heavily use Spark and AWS technologies for data processing and analytics. I explain why we moved from AWS EMR to Databricks and Delta and share our experiences from different angles like architecture, application logic and user experience. We will look at how security, cluster configuration, resource consumption and workflow changed by using Databricks clusters, as well as how using Delta tables simplified our application logic and data operations.
A summarized version of a presentation on Big Data architecture, covering everything from the Big Data concept to Hadoop and tools like Hive, Pig and Cassandra
This document discusses logging scenarios using DynamoDB and Elastic MapReduce. It covers collecting log data in real-time using tools like Fluentd and storing it in DynamoDB. It then describes using EMR to perform ETL processes on the data, extracting from DynamoDB, transforming the data across EC2 instances, and loading to S3 or DynamoDB. Finally, it discusses analyzing the data using Redshift for queries or CloudSearch for search capabilities.
Spark and Couchbase: Augmenting the Operational Database with SparkSpark Summit
The document discusses integrating Couchbase NoSQL with Apache Spark for augmenting operational databases with analytics. It outlines architectural alignment between Couchbase and Spark, including automatic data sharding and locality, data streaming replication from Couchbase to Spark, predicate pushdown to Couchbase global indexes from Spark, and flexible schemas. Integration points discussed include using the Couchbase data locality hints in Spark, limitations on predicate pushdown for Couchbase views and N1QL, and using the Couchbase change data capture protocol for low-latency data streaming into Spark Streaming.
Technologies for Data Analytics PlatformN Masahiro
This document discusses building a data analytics platform and summarizes various technologies that can be used. It begins by outlining reasons for analyzing data like reporting, monitoring, and exploratory analysis. It then discusses using relational databases, parallel databases, Hadoop, and columnar storage to store and process large volumes of data. Streaming technologies like Storm, Kafka, and services like Redshift, BigQuery, and Treasure Data are also summarized as options for a complete analytics platform.
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon RedshiftAmazon Web Services
Nasdaq has extended its use of Amazon Redshift to include Amazon EMR and Amazon S3 in order to better manage storage and compute resources separately. Data is ingested into Redshift and then transformed and unloaded to S3. EMR is then used to convert the data to Parquet format and write it to S3 partitioned by date. The data in S3 is accessed using Presto with encryption at rest. Hive is used to manage schemas and partitions across data sources. Tools were developed to help with encryption, schema management, and data migrations between systems while maintaining security and performance.
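A hedged sketch of the "unload, then convert to partitioned Parquet" step is shown below; the bucket paths, delimiter, and column names are hypothetical, not Nasdaq's actual schema:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date

spark = SparkSession.builder.appName("unload-to-parquet-sketch").getOrCreate()

# Files previously produced by a Redshift UNLOAD ... TO 's3://...' statement.
trades = (
    spark.read.option("delimiter", "|")
    .csv("s3a://example-bucket/unload/trades/")
    .toDF("trade_id", "symbol", "price", "executed_at")
)

# Convert to columnar Parquet, partitioned by date so downstream engines
# (Presto, Hive) can prune partitions.
(
    trades.withColumn("trade_date", to_date(col("executed_at")))
    .write.mode("append")
    .partitionBy("trade_date")
    .parquet("s3a://example-bucket/warehouse/trades_parquet/")
)
```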
AWS re:Invent 2016| DAT318 | Migrating from RDBMS to NoSQL: How Sony Moved fr...Amazon Web Services
In this session, you will learn the key differences between a relational database management service (RDBMS) and non-relational (NoSQL) databases like Amazon DynamoDB. You will learn about suitable and unsuitable use cases for NoSQL databases. You'll learn strategies for migrating from an RDBMS to DynamoDB through a 5-phase, iterative approach. See how Sony migrated an on-premises MySQL database to the cloud with Amazon DynamoDB, and see the results of this migration.
Serverless Cloud Data Lake with Spark for Serving Weather Data
1) The document discusses using a serverless architecture with IBM Cloud services like SQL Query powered by Spark, Cloud Object Storage, and Cloud Functions to build a cost-effective cloud data lake for serving historical weather data on demand.
2) It describes how data skipping techniques and geospatial indexes in SQL Query can accelerate queries by an order of magnitude by pruning irrelevant data.
3) The new serverless solution provides unlimited storage, global coverage, and supports large queries for machine learning and analytics at an order of magnitude lower cost than the previous implementation.
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Precisely
This document discusses engineering machine learning data pipelines and addresses five big challenges: 1) scattered and difficult to access data, 2) data cleansing at scale, 3) entity resolution, 4) tracking data lineage, and 5) ongoing real-time changed data capture and streaming. It presents DMX Change Data Capture as a solution to capture changes from various data sources and replicate them in real-time to targets like Kafka, HDFS, databases and data lakes to feed machine learning models. Case studies demonstrate how DMX-h has helped customers like a global hotel chain and insurance and healthcare companies build scalable data pipelines.
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureDATAVERSITY
Whether to take data ingestion cycles off the ETL tool and the data warehouse or to facilitate competitive Data Science and building algorithms in the organization, the data lake – a place for unmodeled and vast data – will be provisioned widely in 2020.
Though it doesn’t have to be complicated, the data lake has a few key design points that are critical, and it does need to follow some principles for success. Avoid building the data swamp, but not the data lake! The tool ecosystem is building up around the data lake and soon many will have a robust lake and data warehouse. We will discuss policy to keep them straight, send data to its best platform, and keep users’ confidence up in their data platforms.
Data lakes will be built in cloud object storage. We’ll discuss the options there as well.
Get this data point for your data lake journey.
Similar to Dynamic DDL: Adding structure to streaming IoT data on the fly (20)
Introduction: This workshop will provide a hands-on introduction to Machine Learning (ML) with an overview of Deep Learning (DL).
Format: An introductory lecture on several supervised and unsupervised ML techniques, followed by a light introduction to DL and a short discussion of the current state of the art. Several Python code samples using the scikit-learn library will be introduced that users will be able to run in the Cloudera Data Science Workbench (CDSW).
Objective: To provide a quick, hands-on introduction to ML with Python's scikit-learn library. The environment in CDSW is interactive, and the step-by-step guide will walk you through everything from setting up your environment to exploring datasets and training and evaluating models on popular datasets. By the end of the crash course, attendees will have a high-level understanding of popular ML algorithms and the current state of DL, know what problems they can solve, and walk away with basic hands-on experience training and evaluating ML models.
Prerequisites: For the hands-on portion, registrants must bring a laptop with a Chrome or Firefox web browser. These labs will be done in the cloud, no installation needed. Everyone will be able to register and start using CDSW after the introductory lecture concludes (about 1hr in). Basic knowledge of Python is highly recommended.
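In the spirit of the workshop exercises (though not the actual lab code), a minimal scikit-learn train-and-evaluate example looks like this:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a popular dataset and hold out a test split.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train a classifier and evaluate it on unseen data.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```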
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
In a world with a myriad of distributed storage systems to choose from, the majority of Apache HBase clusters still rely on Apache HDFS. Theoretically, any distributed file system could be used by HBase. One major reason HDFS is predominantly used is the specific durability requirement of HBase's write-ahead log (WAL), which HDFS guarantees correctly. However, HBase's use of HDFS for WALs can be replaced with sufficient effort.
This talk will cover the design of a "Log Service" that can be embedded inside HBase and provides the level of durability that HBase requires for WALs. Apache Ratis (incubating) is a library implementation of the RAFT consensus protocol in Java and is used to build this Log Service. We will cover the design choices of the Ratis Log Service, comparing and contrasting it to other log-based systems that exist today. Next, we'll cover how the Log Service "fits" into HBase and the necessary changes to HBase which enable this. Finally, we'll discuss how the Log Service can simplify the operational burden of HBase.
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
Utilizing Apache NiFi we read various open data REST APIs and camera feeds to ingest crime and related data real-time streaming it into HBase and Phoenix tables. HBase makes an excellent storage option for our real-time time series data sources. We can immediately query our data utilizing Apache Zeppelin against Phoenix tables as well as Hive external tables to HBase.
Apache Phoenix tables also make a great option since we can easily put microservices on top of them for application usage. I have an example Spring Boot application that reads from our Philadelphia crime table for front-end web applications as well as RESTful APIs.
Apache NiFi makes it easy to push records with schemas to HBase and insert into Phoenix SQL tables.
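As a rough illustration of querying such a Phoenix table from Python (not the actual Spring Boot service; the query-server URL, table, and column names are hypothetical), one could go through the Phoenix Query Server with the phoenixdb driver:

```python
import phoenixdb

# Connect to a Phoenix Query Server endpoint (placeholder URL).
conn = phoenixdb.connect("http://phoenix-query-server:8765/", autocommit=True)
cursor = conn.cursor()

# Hypothetical crime-events table and columns.
cursor.execute(
    "SELECT incident_id, offense_type, occurred_at "
    "FROM CRIME_EVENTS WHERE occurred_at > CURRENT_DATE() - 1 LIMIT 10"
)
for row in cursor.fetchall():
    print(row)

conn.close()
```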
Resources:
https://community.hortonworks.com/articles/54947/reading-opendata-json-and-storing-into-phoenix-tab.html
https://community.hortonworks.com/articles/56642/creating-a-spring-boot-java-8-microservice-to-read.html
https://community.hortonworks.com/articles/64122/incrementally-streaming-rdbms-data-to-your-hadoop.html
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
Whilst HBase is the most logical answer for use cases requiring random, realtime read/write access to Big Data, it may not be trivial to design applications that make the most of it, nor is it the simplest to operate. As it depends on and integrates with other components from the Hadoop ecosystem (ZooKeeper, HDFS, Spark, Hive, etc.) or external systems (Kerberos, LDAP), and its distributed nature requires a "Swiss clockwork" infrastructure, many variables must be considered when observing anomalies or even outages. Adding to the equation, HBase is still an evolving product, with different release versions currently in use, some of which carry genuine software bugs. In this presentation, we'll go through the most common HBase issues faced by different organisations, describing the identified causes and resolution actions from my last 5 years supporting HBase for our heterogeneous customer base.
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
LocationTech GeoMesa enables spatial and spatiotemporal indexing and queries for HBase and Accumulo. In this talk, after an overview of GeoMesa’s capabilities in the Cloudera ecosystem, we will dive into how GeoMesa leverages Accumulo’s Iterator interface and HBase’s Filter and Coprocessor interfaces. The goal will be to discuss both what spatial operations can be pushed down into the distributed database and also how the GeoMesa codebase is organized to allow for consistent use across the two database systems.
OCLC has been using HBase since 2012 to enable single-search-box access to over a billion items from your library and the world's library collection. This talk will provide an overview of how HBase is structured to provide this information, some of the challenges they have encountered in scaling to support the world catalog, and how they have overcome them.
Many individuals/organizations have a desire to utilize NoSQL technology, but often lack an understanding of how the underlying functional bits can be utilized to enable their use case. This situation can result in drastic increases in the desire to put the SQL back in NoSQL.
Since the initial commit, Apache Accumulo has provided a number of examples to help jumpstart comprehension of how some of these bits function as well as potentially help tease out an understanding of how they might be applied to a NoSQL friendly use case. One very relatable example demonstrates how Accumulo could be used to emulate a filesystem (dirlist).
In this session we will walk through the dirlist implementation. Attendees should come away with an understanding of the supporting table designs, a simple text search supporting a single wildcard (on file/directory names), and how the dirlist elements work together to accomplish its feature set. Attendees should (hopefully) also come away with a justification for sometimes keeping the SQL out of NoSQL.
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
Danny Chen presented on Uber's use of HBase for global indexing to support large-scale data ingestion. Uber uses HBase to provide a global view of datasets ingested from Kafka and other data sources. To generate indexes, Spark jobs are used to transform data into HFiles, which are loaded into HBase tables. Given the large volumes of data, techniques like throttling HBase access and explicit serialization are used. The global indexing solution supports requirements for high throughput, strong consistency and horizontal scalability across Uber's data lake.
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
Recently, Apache Phoenix has been integrated with Apache (incubator) Omid transaction processing service, to provide ultra-high system throughput with ultra-low latency overhead. Phoenix has been shown to scale beyond 0.5M transactions per second with sub-5ms latency for short transactions on industry-standard hardware. On the other hand, Omid has been extended to support secondary indexes, multi-snapshot SQL queries, and massive-write transactions.
These innovative features make Phoenix an excellent choice for translytics applications, which allow converged transaction processing and analytics. We share the story of building the next-gen data tier for advertising platforms at Verizon Media that exploits Phoenix and Omid to support multi-feed real-time ingestion and AI pipelines in one place, and discuss the lessons learned.
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
This document discusses using Apache NiFi to build a high-speed cyber security data pipeline. It outlines the challenges of ingesting, transforming, and routing large volumes of security data from various sources to stakeholders like security operations centers, data scientists, and executives. It proposes using NiFi as a centralized data gateway to ingest data from multiple sources using a single entry point, transform the data according to destination needs, and reliably deliver the data while avoiding issues like network traffic and data duplication. The document provides an example NiFi flow and discusses metrics from processing over 20 billion events through 100+ production flows and 1000+ transformations.
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
This document discusses supporting Apache HBase and improving troubleshooting and supportability. It introduces two Cloudera employees who work on HBase support and provides an overview of typical troubleshooting scenarios for HBase like performance degradation, process crashes, and inconsistencies. The agenda covers using existing tools like logs and metrics to troubleshoot HBase performance issues with a general approach, and introduces htop as a real-time monitoring tool for HBase.
In the healthcare sector, data security, governance, and quality are crucial for maintaining patient privacy and ensuring the highest standards of care. At Florida Blue, the leading health insurer of Florida serving over five million members, there is a multifaceted network of care providers, business users, sales agents, and other divisions relying on the same datasets to derive critical information for multiple applications across the enterprise. However, maintaining consistent data governance and security for protected health information and other extended data attributes has always been a complex challenge that did not easily accommodate the wide range of needs for Florida Blue’s many business units. Using Apache Ranger, we developed a federated Identity & Access Management (IAM) approach that allows each tenant to have their own IAM mechanism. All user groups and roles are propagated across the federation in order to determine users’ data entitlement and access authorization; this applies to all stages of the system, from the broadest tenant levels down to specific data rows and columns. We also enabled audit attributes to ensure data quality by documenting data sources, reasons for data collection, date and time of data collection, and more. In this discussion, we will outline our implementation approach, review the results, and highlight our “lessons learned.”
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Bloomberg, Comcast, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores.
With the ever-growing list of connectors to new data sources such as Azure Blob Storage, Elasticsearch, Netflix Iceberg, Apache Kudu, and Apache Pulsar, recently introduced Cost-Based Optimizer in Presto must account for heterogeneous inputs with differing and often incomplete data statistics. This talk will explore this topic in detail as well as discuss best use cases for Presto across several industries. In addition, we will present recent Presto advancements such as Geospatial analytics at scale and the project roadmap going forward.
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
Specialized tools for machine learning development and model governance are becoming essential. MLflow is an open source platform for managing the machine learning lifecycle. Just by adding a few lines of code to the function or script that trains their model, data scientists can log parameters, metrics, artifacts (plots, miscellaneous files, etc.) and a deployable packaging of the ML model. Every time that function or script is run, the results will be logged automatically as a byproduct of those lines of code being added, even if the party doing the training run makes no special effort to record the results. MLflow application programming interfaces (APIs) are available for the Python, R and Java programming languages, and MLflow sports a language-agnostic REST API as well. Over a relatively short time period, MLflow has garnered more than 3,300 stars on GitHub, almost 500,000 monthly downloads and 80 contributors from more than 40 companies. Most significantly, more than 200 companies are now using MLflow. We will demo MLflow Tracking, Project and Model components with Azure Machine Learning (AML) Services and show you how easy it is to get started with MLflow on-prem or in the cloud.
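A minimal MLflow Tracking example of those "few lines of code" might look like the following (the experiment name and model are hypothetical):

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

mlflow.set_experiment("demo-experiment")

X, y = load_iris(return_X_y=True)

with mlflow.start_run():
    C = 0.5
    model = LogisticRegression(C=C, max_iter=200).fit(X, y)

    # Log a parameter, a metric, and a deployable packaging of the model.
    mlflow.log_param("C", C)
    mlflow.log_metric("train_accuracy", accuracy_score(y, model.predict(X)))
    mlflow.sklearn.log_model(model, "model")
```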
Extending Twitter's Data Platform to Google CloudDataWorks Summit
Twitter's Data Platform is built using multiple complex open source and in-house projects to support data analytics on hundreds of petabytes of data. Our platform supports storage, compute, data ingestion, discovery and management, and various tools and libraries to help users with both batch and realtime analytics. Our Data Platform operates on multiple clusters across different data centers to help thousands of users discover valuable insights. As we were scaling our Data Platform to multiple clusters, we also evaluated various cloud vendors to support use cases outside of our data centers. In this talk we share our architecture and how we extend our data platform to use the cloud as another data center. We walk through our evaluation process, the challenges we faced supporting data analytics at Twitter scale in the cloud, and our current solution. Extending Twitter's Data Platform to the cloud was a complex task, which we deep dive into in this presentation.
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
At Comcast, our team has been architecting a customer experience platform which is able to react to near-real-time events and interactions and deliver appropriate and timely communications to customers. By combining the low latency capabilities of Apache Flink and the dataflow capabilities of Apache NiFi we are able to process events at high volume to trigger, enrich, filter, and act/communicate to enhance customer experiences. Apache Flink and Apache NiFi complement each other with their strengths in event streaming and correlation, state management, command-and-control, parallelism, development methodology, and interoperability with surrounding technologies. We will trace our journey from starting with Apache NiFi over three years ago and our more recent introduction of Apache Flink into our platform stack to handle more complex scenarios. In this presentation we will compare and contrast which business and technical use cases are best suited to which platform and explore different ways to integrate the two platforms into a single solution.
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
Companies are increasingly moving to the cloud to store and process data. One of the challenges companies have is in securing data across hybrid environments with easy way to centrally manage policies. In this session, we will talk through how companies can use Apache Ranger to protect access to data both in on-premise as well as in cloud environments. We will go into details into the challenges of hybrid environment and how Ranger can solve it. We will also talk through how companies can further enhance the security by leveraging Ranger to anonymize or tokenize data while moving into the cloud and de-anonymize dynamically using Apache Hive, Apache Spark or when accessing data from cloud storage systems. We will also deep dive into the Ranger’s integration with AWS S3, AWS Redshift and other cloud native systems. We will wrap it up with an end to end demo showing how policies can be created in Ranger and used to manage access to data in different systems, anonymize or de-anonymize data and track where data is flowing.
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
Advanced Big Data Processing frameworks have been proposed to harness the fast data transmission capability of Remote Direct Memory Access (RDMA) over high-speed networks such as InfiniBand, RoCEv1, RoCEv2, iWARP, and OmniPath. However, with the introduction of the Non-Volatile Memory (NVM) and NVM express (NVMe) based SSD, these designs along with the default Big Data processing models need to be re-assessed to discover the possibilities of further enhanced performance. In this talk, we will present, NRCIO, a high-performance communication runtime for non-volatile memory over modern network interconnects that can be leveraged by existing Big Data processing middleware. We will show the performance of non-volatile memory-aware RDMA communication protocols using our proposed runtime and demonstrate its benefits by incorporating it into a high-performance in-memory key-value store, Apache Hadoop, Tez, Spark, and TensorFlow. Evaluation results illustrate that NRCIO can achieve up to 3.65x performance improvement for representative Big Data processing workloads on modern data centers.
Background: Some early applications of Computer Vision in Retail arose from e-commerce use cases - but increasingly, it is being used in physical stores in a variety of new and exciting ways, such as:
● Optimizing merchandising execution, in-stocks and sell-thru
● Enhancing operational efficiencies and enabling real-time customer engagement
● Enhancing loss prevention capabilities and response time
● Creating frictionless experiences for shoppers
Abstract: This talk will cover the use of Computer Vision in Retail, the implications to the broader Consumer Goods industry and share business drivers, use cases and benefits that are unfolding as an integral component in the remaking of an age-old industry.
We will also take a ‘peek under the hood’ of Computer Vision and Deep Learning, sharing technology design principles and skill set profiles to consider before starting your CV journey.
Deep learning has matured considerably in the past few years to produce human or superhuman abilities in a variety of computer vision paradigms. We will discuss ways to recognize these paradigms in retail settings, collect and organize data to create actionable outcomes with the new insights and applications that deep learning enables.
We will cover the basics of object detection, then move into the advanced processing of images, describing possible ways that a retail store of the near future could operate: identifying various storefront situations by attaching a deep learning system to a camera stream - for example, spotting item stock levels on shelves, a shelf in need of organization, or perhaps a wandering customer in need of assistance.
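For a sense of what "a deep learning system attached to a camera stream" involves, here is a minimal, hedged object-detection sketch using a pretrained torchvision model as a stand-in for a retail-specific detector (the image file is a hypothetical camera frame):

```python
import torch
import torchvision
from PIL import Image
from torchvision.transforms import functional as F

# A single frame standing in for a live camera feed.
frame = Image.open("shelf_camera_frame.jpg").convert("RGB")

# Pretrained COCO detector; a production system would use a retail-tuned model.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

with torch.no_grad():
    predictions = model([F.to_tensor(frame)])[0]

# Keep only confident detections (bounding box, class id, score).
for box, label, score in zip(
    predictions["boxes"], predictions["labels"], predictions["scores"]
):
    if score > 0.8:
        print(label.item(), [round(v, 1) for v in box.tolist()], round(score.item(), 2))
```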
We will also cover how to use a computer vision system to automatically track customer purchases to enable a streamlined checkout process, and how deep learning can power plausible wardrobe suggestions based on what a customer is currently wearing or purchasing.
Finally, we will cover the various technologies that are powering these applications today: deep learning tools for research and development, production tools to distribute that intelligence to an entire inventory of cameras situated around a retail location, and tools for exploring and understanding the new data streams produced by the computer vision systems.
By the end of this talk, attendees should understand the impact Computer Vision and Deep Learning are having in the Consumer Goods industry, key use cases, techniques and key considerations leaders are exploring and implementing today.
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
Whole-genome shotgun-based next-generation transcriptomics and metagenomics studies often generate 100 to 1000 gigabytes (GB) of sequence data derived from tens of thousands of different genes or microbial species. De novo assembling these data requires an ideal solution that both scales with data size and optimizes for individual genes or genomes. Here we developed an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions the reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomics and metagenomics test datasets from both short-read and long-read sequencing technologies. It achieved near-linear scalability with respect to input data size and number of compute nodes. SpaRC can run on different cloud computing environments without modifications while delivering similar performance. In summary, our results suggest SpaRC provides a scalable solution for clustering billions of reads from next-generation sequencing experiments, and Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar big data genomics problems.
It's your unstructured data: How to get your GenAI app to production (and spe...Zilliz
So you've successfully built a GenAI app POC for your company -- now comes the hard part: bringing it to production. Aparavi addresses the challenges of AI projects while addressing data privacy and PII. Our Service for RAG helps AI developers and data scientists to scale their app to 1000s to millions of users using corporate unstructured data. Aparavi’s AI Data Loader cleans, prepares and then loads only the relevant unstructured data for each AI project/app, enabling you to operationalize the creation of GenAI apps easily and accurately while giving you the time to focus on what you really want to do - building a great AI application with useful and relevant context. All within your environment and never having to share private corporate data with anyone - not even Aparavi.
Discovery Series - Zero to Hero - Task Mining Session 1DianaGray10
This session is focused on providing you with an introduction to task mining. We will go over different types of task mining and provide you with a real-world demo on each type of task mining in detail.
The Zaitechno Handheld Raman Spectrometer is a powerful and portable tool for rapid, non-destructive chemical analysis. It utilizes Raman spectroscopy, a technique that analyzes the vibrational fingerprint of molecules to identify their chemical composition. This handheld instrument allows for on-site analysis of materials, making it ideal for a variety of applications, including:
Material identification: Identify unknown materials, minerals, and contaminants.
Quality control: Ensure the quality and consistency of raw materials and finished products.
Pharmaceutical analysis: Verify the identity and purity of pharmaceutical compounds.
Food safety testing: Detect contaminants and adulterants in food products.
Field analysis: Analyze materials in the field, such as during environmental monitoring or forensic investigations.
The Zaitechno Handheld Raman Spectrometer is easy to use and features a user-friendly interface. It is compact and lightweight, making it ideal for field applications. With its rapid analysis capabilities, the Zaitechno Handheld Raman Spectrometer can help you improve efficiency and productivity in your research or quality control workflows.
Top 12 AI Technology Trends For 2024.pdfMarrie Morris
Technology has become an irreplaceable component of our daily lives, and AI is revolutionizing it for the better. In this article, we will learn about the top 12 AI technology trends for 2024.
Redefining Cybersecurity with AI CapabilitiesPriyanka Aash
2. OUR SPEAKERS
Hao Zou
Software Engineer
Data Science & Engineering
GoPro
David Winters
Big Data Architect
Data Science & Engineering
GoPro
3. TOPICS TO COVER
• Background and Business
• GoPro Data Platform Architecture
• Old File-based Pipeline Architecture
• New Dynamic DDL Architecture
• Dynamic DDL Deep Dive
• Using Cloud-Based Services (Optional)
• Questions
8. DATA CHALLENGES AT GOPRO
• Variety of data - Hardware and Software products
• Software - Mobile and Desktop Apps
• Hardware - Cameras, Drones, Controllers, Accessories, etc.
• External - CRM, ERP, OTT, E-Commerce, Web, Social, etc.
• Variety of data ingestion mechanisms - Lambda Architecture
• Real-time streaming pipeline - GoPro products
• Batch pipeline - External 3rd party systems
• Complex Transformations
• Data often stored in binary to conserve space in cameras
• Heterogeneous data formats (JSON, XML, and packed binary)
• Seamless Data Aggregations
• Blend data between different sources, hardware, and software
15. PROS AND CONS OF OLD SYSTEM
Pros:
• Isolation of workloads
• Fast ingest
• Secure
• Fast delivery/queries
• Loosely coupled clusters
Cons:
• Multiple copies of data
• Tightly coupled storage and compute
• Lack of elasticity
• Operational overhead of multiple clusters
16. NEW DYNAMIC DDL ARCHITECTURE
[Architecture diagram: streaming data, a batch induction framework, and downloads (REST API, FTP downloads, S3 sync) flow through the real-time cluster into an Amazon S3 bucket with a centralized Hive Metastore; an ephemeral ETL cluster produces Parquet + DDL (events + state, aggregates), which ephemeral Data Mart clusters #1..#N serve to notebooks, Tableau, Plotly, Python, and R. Dynamic DDL!]
Improvements:
Single copy of data
Separate storage from compute
Elastic clusters
Single long running cluster to maintain
18. NEW DYNAMIC DDL ARCHITECTURE
[Architecture diagram: HTTP traffic enters the streaming cluster through an ELB; the pipeline for processing of streaming logs writes to S3 and registers tables in the centralized Hive Metastore.]
For each topic, dynamically add the table structure and create the table, or insert data into the table if it already exists.
19. DYNAMIC DDL
• What is Dynamic DDL?
• Dynamic DDL is adding structure (schema) to the data on the fly whenever the providers of the data change their structure.
• Why is Dynamic DDL needed?
• Providers of data change their structure constantly. Without Dynamic DDL, the table schema is hard coded and has to be manually updated whenever the incoming data changes.
• All of the aggregation SQL would also have to be manually updated after each schema change.
• Faster turnaround for data ingestion: data can be ingested and made available within minutes (sometimes seconds).
• How did we do this?
• Using Spark SQL/DataFrames
• See the example that follows
20. DYNAMIC DDL
• Example incoming event:
{"_data":{"record":{"id": "1", "first_name": "John", "last_name": "Fork", "state": "California", "city": "san Mateo"}, "log_ts":"2016-07-20T00:06:01Z"}}
• Flatten the data first into a fixed schema (record_key, record_value, id, log_ts):
{"record_key":"state","record_value":"California","id":"1","log_ts":"2016-07-20T00:06:01Z"}
{"record_key":"last_name","record_value":"Fork","id":"1","log_ts":"2016-07-20T00:06:01Z"}
{"record_key":"city","record_value":"san Mateo","id":"1","log_ts":"2016-07-20T00:06:01Z"}
{"record_key":"first_name","record_value":"John","id":"1","log_ts":"2016-07-20T00:06:01Z"}
• Pivot the keys into a dynamically generated schema:
SELECT MAX(CASE WHEN record_key = 'state' THEN record_value ELSE null END) AS data_record_state,
       MAX(CASE WHEN record_key = 'last_name' THEN record_value ELSE null END) AS data_record_last_name,
       MAX(CASE WHEN record_key = 'first_name' THEN record_value ELSE null END) AS data_record_first_name,
       MAX(CASE WHEN record_key = 'city' THEN record_value ELSE null END) AS data_record_city,
       id AS data_record_id, log_ts AS data_log_ts
FROM test
GROUP BY id, log_ts
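A minimal Spark (Scala) sketch of the same flatten-and-pivot idea, assuming an illustrative S3 input path and the nested layout of the example event above; variable and path names are assumptions for illustration, not the production code:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("DynamicDDLFlattenPivot").enableHiveSupport().getOrCreate()

// Read raw JSON events shaped like the example above (path is illustrative)
val raw = spark.read.json("s3a://example-bucket/raw/events/")

// Flatten every attribute under _data.record (except id) into the fixed schema
// (record_key, record_value, id, log_ts)
val recordKeys = raw.select("_data.record.*").columns.filterNot(_ == "id")
val flattened = recordKeys.map { k =>
  raw.select(
    lit(k).as("record_key"),
    col(s"_data.record.$k").cast("string").as("record_value"),
    col("_data.record.id").as("id"),
    col("_data.log_ts").as("log_ts"))
}.reduce(_ union _)

// Pivot the keys back out into the dynamically generated wide schema,
// equivalent to the MAX(CASE WHEN ...) SQL above
val dynamic = flattened
  .groupBy(col("id"), col("log_ts"))
  .pivot("record_key")
  .agg(first(col("record_value")))

New attributes appearing in later events simply become new record_key values, and therefore new pivoted columns, without any hand-edited DDL.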
21. DYNAMIC DDL USING SPARK SQL/DATAFRAME
• Code snippet of Dynamic DDL transforming new JSON attributes into relational columns
Add the partition columns
Manually create the table due to a bug in Spark
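A hedged sketch of that manual table creation (the actual code was shown as a screenshot); the helper name, table name, and partition columns are assumptions for illustration:

import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper: create the destination Hive table from the DataFrame's schema if it is missing.
// Creating the table ourselves avoids the Spark/Hive metadata mismatch described in SPARK-9761.
def createTableIfMissing(spark: SparkSession, df: DataFrame, table: String, partitionCols: Seq[String]): Unit = {
  if (!spark.catalog.tableExists(table)) {
    val dataCols = df.schema.fields
      .filterNot(f => partitionCols.contains(f.name))
      .map(f => s"`${f.name}` ${f.dataType.simpleString}")
      .mkString(", ")
    val partSpec = partitionCols.map(c => s"`$c` STRING").mkString(", ")
    spark.sql(s"CREATE TABLE $table ($dataCols) PARTITIONED BY ($partSpec) STORED AS PARQUET")
  }
}

This runs once per topic-derived table, before any data is inserted.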
22. DYNAMIC DDL USING SPARK SQL/DATAFRAME
Add the new columns that exist in the incoming data frame but do not exist yet in the destination table
This ALTER TABLE syntax stopped working after upgrading to Spark 2.x
23. DYNAMIC DDL USING SPARK SQL/DATAFRAME
Three temporary ways to work around the problem in Spark 2.x (a sketch of the first option follows):
• Launch a HiveServer2 service, then issue the ALTER TABLE to Hive over JDBC
• Use Spark to connect directly to the Hive Metastore and update the metadata
• Patch the Spark source code to support the ALTER TABLE syntax and repackage it
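For illustration, a sketch of the first option, assuming the Hive JDBC driver is on the classpath; the host, database, user, and table names are assumptions. The column diff compares the incoming DataFrame against the existing destination table:

import java.sql.DriverManager
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical sketch of workaround #1: push ALTER TABLE ... ADD COLUMNS through HiveServer2 over JDBC,
// since early Spark 2.x could not execute that DDL itself.
def addMissingColumns(spark: SparkSession, incoming: DataFrame, table: String): Unit = {
  val existing = spark.table(table).columns.map(_.toLowerCase).toSet
  val newCols  = incoming.schema.fields.filterNot(f => existing.contains(f.name.toLowerCase))

  if (newCols.nonEmpty) {
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://hiveserver2.example.com:10000/default", "etl_user", "")
    try {
      val colSpec = newCols.map(f => s"`${f.name}` ${f.dataType.simpleString}").mkString(", ")
      conn.createStatement().execute(s"ALTER TABLE $table ADD COLUMNS ($colSpec)")
    } finally {
      conn.close()
    }
  }
}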
24. DYNAMIC DDL USING SPARK SQL/DATAFRAME
Project all columns from the table
Append the data into the destination table
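A minimal sketch of those two steps, assuming illustrative names: the incoming DataFrame is projected onto the destination table's column order (so the positional insert lines up), with columns the batch lacks filled in as nulls, and then appended:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

def appendToTable(spark: SparkSession, incoming: DataFrame, table: String): Unit = {
  val destCols = spark.table(table).columns
  // All attribute columns are strings in this scheme, so missing ones become null strings
  val aligned = incoming.select(destCols.map { c =>
    if (incoming.columns.contains(c)) col(c) else lit(null).cast("string").as(c)
  }: _*)
  aligned.write.mode("append").insertInto(table)   // insertInto matches columns by position
}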
25. DYNAMIC DDL USING SPARK SQL/DATAFRAME
Add the new partition key
• Reprocessing the DDL table with a new partition key (tuning tips; see the sketch below)
Choose the partition key wisely
Use coalesce if there are too many partitions
Use coalesce to control the number of job tasks
Use filter if the data is still too large
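A hedged sketch of such a reprocessing job; the table name, filter window, new partition key, and output path are assumptions for illustration:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("RepartitionDDLTable").enableHiveSupport().getOrCreate()
import spark.implicits._

spark.table("analytics.events_state")          // assumed source table
  .filter($"data_log_ts" >= "2016-07-01")      // filter first if the data is still too large
  .coalesce(200)                               // cap the number of tasks and output files
  .write
  .mode("overwrite")
  .partitionBy("data_record_state")            // the assumed new partition key
  .parquet("s3a://example-bucket/events_state_by_state/")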
27. USING S3: WHAT IS S3?
• S3 is not a file system.
• S3 is an object store. Similar to a key-value store.
• S3 objects are presented in a hierarchical view but are not stored in that manner.
• S3 objects are stored with a key derived from a “path”.
• The key is used to fan out the objects across shards.
• The path is for display purposes only. Only the first 3 to 4 characters are used for sharding.
• S3 does not have strong transactional semantics but instead has eventual consistency.
• S3 is not appropriate for realtime updates.
• S3 is suited for longer term storage.
28. USING S3: BEHAVIORS
• S3 has similar behaviors to HDFS but even more extreme.
• Larger latencies
• Larger files/writes – Think GBs
• Write and read latencies are larger but the bandwidth is much larger with S3.
• Thus throughput can be increased with parallel writers (same latency but more throughput through parallel operations).
• Partition your RDDs/DataFrames and increase your workers/executors to optimize the parallelism.
• Each write/read has more overhead due to the web service calls.
• So use larger buffers.
• Match the size of your HDFS buffers/blocks if reading/writing from/to HDFS.
• Collect data for longer durations before writing large buffers in parallel to S3.
• Retry logic – Writes to S3 can and will fail.
• Cannot stream to S3 – Complete files must be uploaded.
• Technically, you can simulate streaming with multipart upload.
29. USING S3: TIPS
• Tips for using S3 with HDFS
• Use the s3a scheme.
• Many optimizations, including buffering options (disk-based, on-heap, or off-heap) and incremental parallel uploads (S3A Fast Upload); a configuration sketch follows this list.
• More here: http://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#S3A
• Don’t use rename/move.
• Moves are great for HDFS to support better transactional semantics when streaming files.
• For S3, moves/renames are copy-and-delete operations which can be very slow, especially due to the eventual consistency.
• Other advanced S3 techniques:
• Hash object names to better shard the objects in a bucket.
• Use multiple buckets to increase bandwidth.
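As a concrete illustration of the s3a tuning above, a sketch setting the relevant Hadoop properties from Spark; the property names are the standard S3A options (Hadoop 2.8+), while the values are assumptions to tune per workload:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("S3ATuningExample").getOrCreate()
val hadoopConf = spark.sparkContext.hadoopConfiguration

hadoopConf.set("fs.s3a.fast.upload", "true")          // enable incremental, parallel multipart uploads
hadoopConf.set("fs.s3a.fast.upload.buffer", "disk")   // buffer blocks on disk ("array"/"bytebuffer" for on-/off-heap)
hadoopConf.set("fs.s3a.multipart.size", "104857600")  // 100 MB parts: fewer, larger writes
hadoopConf.set("fs.s3a.connection.maximum", "100")    // more concurrent connections for parallel writers
hadoopConf.set("fs.s3a.attempts.maximum", "10")       // retry failed S3 calls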
GoPro is the ultimate accessory for active people with smartphones.
Products and services include:
Core products -- Cameras
Advanced Solutions – Karma, stabilization and VR
Accessories and Mounts
Software suite that ties it all together
GoPro has become the ultimate, end-to-end storytelling solution.
This is what we wanted to build.
These were the challenges to solve.
High Level Architecture of Data Platform
Isolation of workloads: 3 clusters (ingest, ETL, delivery)
Lambda architecture
Input and output data formats
Cadence of clusters
A word about Data Sources:
IoT data
Logs from devices, applications (desktop and mobile), external systems and services, ERP, web/email marketing, etc.
Some Raw and Gzip, Some Binary and JSON
Some streaming and some batch
Batch data
Web marketing, campaigns
Social media
ERP
CRM
Lambda architecture
Both batch and stream processing
Basic needs/workloads in a Data Platform
High throughput ingestion
Transformations: joins, aggregations, etc.
Fast queries
Today, we have 3 clusters to isolate these workloads
We started with one cluster, ETL
Everything ran there
Ingest (Flume)
Batch (Framework)
ETL (Hive)
Analytical (Impala)
Lots of resource contention (I/O, memory, cores)
To alleviate the resource contention, we opted for 3 clusters to isolate the workloads.
Ingest cluster for near real-time streaming
Kafka, Spark Streaming (Cloudera Parcels)
Input: Logs, Output: JSON
Minutes cadence
Moving towards more real-time in seconds
Induction framework for scheduled batch ingestion
ETL cluster for heavy duty aggregation
Input: JSON flat files, Output: Aggregated Parquet files
Hive (Map/Reduce)
Hourly cadence
Secure Data Mart
Kerberos, LDAP, Active Directory, Apache Sentry (Cloudera committers)
Input: Compressed Parquet files
Analytical SQL engine for Tableau, ad-hoc queries (Hue), data wrangling (Trifacta), and data science (Jupyter Notebooks and RStudio)
With all that said, we will examine the newer technologies that will enable us to simplify our architecture and merge clusters in the future.
Kudu is one possible new technology that could help us to consolidate some of the clusters.
Let’s take a deeper dive into our streaming ingestion…
Logs are streamed from devices and software applications (desktop and mobile) to web service endpoint
Endpoint is an elastic pool of Tomcat servers sitting behind ELB in AWS
Custom servlet pushes logs into Kafka topics by environment
A series of Spark streaming jobs process the logs from Kafka
Landing place in ingestion cluster is HDFS with JSON flat files
Rationalization of tech stacks…
Why Kafka?
Unrivaled write throughput for a queue
Traditional queue throughput: 100K writes/sec on the biggest box you can buy
Kafka throughput: 1M writes/sec on 3-4 commodity servers
Strong ordering policy of messages
Distributed
Fault-tolerant through replication
Support synchronous and asynchronous writes
Pairs nicely with Spark Streaming for simpler scaling out (Kafka topic partitions map directly to Spark RDD partitions/tasks)
Why Spark Streaming?
Strong transactional semantics - "exactly once" processing
Leverage Spark technology for both data ingest and analytics
Horizontally scalable - High throughput for micro-batching
Large open source community
Keyword: Impedance mismatch
As previously stated, logs are streamed from devices and software applications (desktop and mobile) to web service endpoint
Logs are diverse: gzipped, raw, binary, JSON, batched events, streamed single events
Vary significantly in size from < 1 KB to > 1 MB
Logs are redirected based on data category and routed to appropriate Kafka topic and respective Spark Streaming job
Logs move from Kafka topic to Kafka topic with each Kafka topic having a Spark Streaming job that consumes the log, processes the log, and writes the log to another topic
Tree like structure of jobs with more generic logic towards the root of the tree and more specialized logic moving towards the leaf nodes
There are generic jobs/services and specialized jobs/services
Generic services include PII removal and hashing, IP to Geo lookups, and batched writing to HDFS
We perform batched HDFS writing since Kafka likes small messages (1 KB ideal) and HDFS likes large files (100+ MB)
Specialized services contain business logic
Finally, the logs are written into HDFS as JSON flat files (which are sometimes compressed depending on the type of data)
Scheduled ETL jobs perform a distributed copy (distcp) to move the data to the ETL cluster for further heavier aggregations
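A minimal sketch of one such Spark Streaming job against the spark-streaming-kafka-0-10 API; topic names, broker addresses, batch interval, and output paths are assumptions, and the real jobs also apply the generic and specialized services described above:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val conf = new SparkConf().setAppName("DeviceLogIngest")
val ssc  = new StreamingContext(conf, Seconds(60))   // minutes-level micro-batches

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "kafka1:9092,kafka2:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "device-log-ingest",
  "auto.offset.reset"  -> "latest")

// Kafka topic partitions map directly to Spark partitions/tasks
val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("device-logs"), kafkaParams))

// Batch many small Kafka messages into a few large files per micro-batch (HDFS prefers big files)
stream.map(_.value).foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    rdd.coalesce(4).saveAsTextFile(s"hdfs:///landing/device-logs/batch_${System.currentTimeMillis}")
  }
}

ssc.start()
ssc.awaitTermination()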
A few things…
Two flows of data: streaming and batch
Join data sources
Aggregate data sources
Convert to compressed columnar format (gzipped Parquet files)
On the ETL cluster…
Here’s where we do our heavy lifting.
Almost entirely all Hive Map Reduce jobs
Some Impala to make the really big gnarly aggregations more performant
Previously, had a custom Java Map Reduce job for sessionization of events
This has been replaced with a Spark Streaming job on the ingestion cluster
In the future, want to push as much of the ETL processing back into the ingestion cluster for more real-time processing
We also have a custom Java Induction framework which ingests data from external services that only make data available on slower schedules (daily, twice daily, etc.)
The output from the ETL cluster is Parquet files that are added to partitioned managed tables in the Hive metastore.
The Parquet files are then copied via distcp to the Secure Data Mart.
Parquet files are copied from the ETL cluster and added to partitioned managed tables in the Hive Metastore of the Secure Data Mart.
The Secure Data Mart is protected with Apache Sentry.
Kerberos is used for authentication. Corporate Standard
Active Directory stores the groups. Corporate Standard
Access control is role based and the roles are assigned with Sentry.
Hue has a Sentry UI app to manage authorization.
Store data in one place Data (S3) + Structure (Hive Metastore)
Separate compute nodes from storage nodes
Elasticity size of clusters and number of clusters
Lower operational overhead of maintaining HDFS storage nodes
Redirect batch ingest into stream ingest (Pump batch data into Kafka) RESULT: One codebase for both stream and batch ingestion
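One way to picture that redirect, as a hedged sketch: replay a batch export line by line into the Kafka topic that the streaming pipeline already consumes (file path, topic, and broker names are assumptions):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import scala.io.Source

val props = new Properties()
props.put("bootstrap.servers", "kafka1:9092,kafka2:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
// Each line of the batch export becomes an event on the existing streaming topic,
// so one ingestion codebase serves both the stream and batch paths
Source.fromFile("/data/batch/crm_export.json").getLines().foreach { line =>
  producer.send(new ProducerRecord[String, String]("external-batch-logs", line))
}
producer.flush()
producer.close()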
Even though the fixed schema resolves the problem of data providers changing the data structure frequently, it makes the data really difficult for analysts and data scientists to analyze:
1. How would they know that these rows come from one event?
2. They have to use the id to associate the rows.
/**
 * The table does not exist, so create it. We create it manually due to a bug in Spark where its metadata
 * gets out of synch with Hive's metadata if we let Spark automatically create the table and we later alter
 * the table and add columns to it. See below for more details:
 * https://issues.apache.org/jira/browse/SPARK-9761
 */