SlideShare a Scribd company logo
From Raw Data to 
Analytics with No 
ETL 
Marcel Kornacker // Cloudera, Inc.! 
Lenni Kuff // Cloudera, Inc.
‹‹#›› 
Outline 
• Evolution of ETL in the context of analytics! 
• traditional systems! 
• Hadoop today! 
• Cloudera’s vision for ETL: no ETL! 
• with qualifications
‹‹#›› 
Traditional ETL 
• Extract: physical extraction from source data store! 
• could be an RDBMS acting as an operational data store! 
• or log data materialized as json! 
• Transform:! 
• data cleansing and standardization! 
• conversion of naturally complex/nested data into a flat relational 
schema! 
• Load: the targeted analytic DBMS converts the transformed data into 
its binary format (typically columnar)
‹‹#›› 
Traditional ETL 
• Three aspects to the traditional ETL process:! 
1. semantic transformation such as data standardization/cleansing 
-> makes data more queryable, adds value! 
2. representational transformation: from source to target schema 
(from complex/nested to flat relational) 
-> “lateral” transformation that doesn’t change semantics, 
adds operational overhead! 
3. data movement: from source to staging area to target system 
-> adds yet more operational overhead
‹‹#›› 
Traditional ETL 
• The goals of “analytics with no ETL”:! 
• simplify aspect 1! 
• eliminate aspects 2 and 3
‹‹#›› 
ETL with Hadoop Today 
• A typical ETL workflow with Hadoop looks like this:! 
• raw source data initially lands in HDFS (examples: text/xml/json log files)! 
• that data is mapped into a table to make it queryable: 
CREATE TABLE RawLogData (…) ROW FORMAT DELIMITED FIELDS 
LOCATION ‘/raw-log-data/‘;! 
• the target table is mapped to a different location: 
CREATE TABLE LogData (…) STORED AS PARQUET LOCATION ‘/log-data/‘; 
• the raw source data is converted to the target format: 
INSERT INTO LogData SELECT * FROM RawLogData; 
• the data is then available for batch reporting/analytics (via Impala, Hive, Pig, 
Spark) or interactive analytics (via Impala, Search)
‹‹#›› 
ETL with Hadoop Today 
• Compared to traditional ETL, this has several advantages:! 
• Hadoop acts as a centralized location for all data: raw source data 
lives side by side with the transformed data! 
• data does not need to be moved between multiple platforms/clusters! 
• data in the raw source format is queryable as soon as it lands, 
although at reduced performance, compared to an optimized 
columnar data format! 
• all data transformations are expressed through the same platform 
and can reference any of the Hadoop-resident data sources (and 
more)
‹‹#›› 
ETL with Hadoop Today 
• However, even this still has drawbacks:! 
• new data needs to be loaded periodically into the target table, and 
doing that reliably and within SLAs can be a challenge! 
• you now have two tables: 
one with current but slow data 
another with lagging but fast data
‹‹#›› 
A Vision for Analytics with No ETL 
• Goals:! 
• no explicit loading/conversion step to move raw data into a target 
table! 
• a single view of the data that is! 
• up-to-date! 
• (mostly) in an efficient columnar format
‹‹#›› 
A Vision for Analytics with No ETL 
• Elements of an ETL-light analytic stack:! 
• support for complex/nested schemas 
-> avoid remapping of raw data into a flat relational schema! 
• background and incremental data conversion 
-> retain in-place single view of entire data set, with most data 
being in an efficient format! 
• bonus: schema inference and schema evolution 
-> start analyzing data as soon as it arrives, regardless of its 
complexity
‹‹#›› 
Support for Complex Schemas in Impala 
• Standard relational: all columns have scalar values: 
CHAR(n), DECIMAL(p, s), INT, DOUBLE, TIMESTAMP, etc.! 
• Complex types: structs, arrays, maps 
in essence, a nested relational schema! 
• Supported file formats: 
Parquet, json, XML, Avro! 
• Design principle for SQL extensions: maintain SQL’s way of dealing 
with multi-valued data
‹‹#›› 
Support for Complex Schemas in Impala 
• Example: 
CREATE TABLE Customers ( 
cid BIGINT, 
address STRUCT { 
street STRING, 
zip INT 
}, 
orders ARRAY<STRUCT { 
oid BIGINT, 
total DECIMAL(9, 2), 
items ARRAY< STRUCT { 
iid BIGINT, qty INT, price DECIMAL(9, 2) }> 
}> 
)
‹‹#›› 
Support for Complex Schemas in Impala 
• Total revenue with items that cost more than $10: 
SELECT SUM(i.price * i.qty) 
FROM Customers.orders.items i 
WHERE i.price > 10! 
• Customers and order totals in zip 94611: 
SELECT c.cid, o.total 
FROM Customers c, c.orders o 
WHERE c.address.zip = 94611
‹‹#›› 
Support for Complex Schemas in Impala 
• Customers that have placed more than 10 orders: 
SELECT c.cid 
FROM Customers c 
WHERE COUNT(c.orders) > 10 
(shorthand for: 
WHERE (SELECT COUNT(*) FROM c.orders) > 10)! 
• Number of orders and average item price per customer: 
SELECT c.cid, COUNT(c.orders), 
AVG(c.orders.items.price) 
FROM Customers c
‹‹#›› 
Background Format Conversion 
• Sample workflow:! 
• create table for data: 
CREATE TABLE LogData (…) WITH CONVERSION TO PARQUET;! 
• load data into table: 
LOAD DATA INPATH ‘/raw-log-data/file1’ INTO LogData 
SOURCE FORMAT SEQUENCEFILE; 
• Pre-requisite for incremental conversion: 
multi-format tables and partitions! 
• currently: each table partition has a single file format! 
• instead: allow a mix of file formats (separated into format-specific 
subdirectories)
‹‹#›› 
Background Format Conversion 
• Conversion process! 
• atomic: the switch from the source to the target data files is atomic 
from the perspective of a running query (but any running query 
sees the full data set) ! 
• redundant: with option to retain original data! 
• incremental: Impala’s catalog service detects new data files that are 
not in the target format automatically
‹‹#›› 
Schema Inference and Schema Evolution 
• Schema inference from data files is useful to reduce the barrier to 
analyzing complex source data! 
• as an example, log data often has hundreds of fields! 
• the time required to create the DDL manually is substantial! 
• Example: schema inference from structured data files! 
• available today: 
CREATE TABLE LogData LIKE PARQUET ‘/log-data.pq’! 
• future formats: XML, json, Avro
‹‹#›› 
Schema Inference and Schema Evolution 
• Schema evolution:! 
• a necessary follow-on to schema inference: every schema evolves over time; 
explicit maintenance is as time-consuming as the initial creation! 
• algorithmic schema evolution requires sticking to generally safe schema 
modifications: adding new fields! 
• adding new top-level columns! 
• adding fields within structs! 
• Example workflow: 
LOAD DATA INPATH ‘/path’ INTO LogData SOURCE FORMAT JSON WITH 
SCHEMA EXPANSION;! 
• scans data to determine new columns/fields to add! 
• synchronous: if there is an error, the ‘load’ is aborted and the user notified
‹‹#›› 
Conclusion 
• Hadoop offers a number of advantages over traditional multi-platform ETL 
solutions:! 
• availability of all data sets on a single platform! 
• data becomes accessible through SQL as soon as it lands! 
• However, this can be improved further:! 
• a richer analytic SQL that is extended to handle nested data! 
• an automated background conversion process that preserves an up-to-date 
view of all data while providing BI-typical performance! 
• simple automation of initial schema creation and subsequent maintenance 
that makes dealing with large, complex schemas less labor-intensive
Thank you
‹‹#›› 
This an example 
segue slide on a blue 
background. This could 
also be a quote slide. 
This is an optional subtitle or space for attribution
‹‹#›› 
This an example 
segue slide on a blue 
background. This could 
also be a quote slide. 
This is an optional subtitle or space for attribution
‹‹#›› 
This an example 
segue slide on a blue 
background. This could 
also be a quote slide. 
This is an optional subtitle or space for attribution

More Related Content

What's hot

Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27
Martin Bém
 
Hadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop SummitHadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop Summit
DataWorks Summit
 
Anatomy of a data driven architecture - Tamir Dresher
Anatomy of a data driven architecture - Tamir Dresher   Anatomy of a data driven architecture - Tamir Dresher
Anatomy of a data driven architecture - Tamir Dresher
Tamir Dresher
 
Integrating hadoop - Big Data TechCon 2013
Integrating hadoop - Big Data TechCon 2013Integrating hadoop - Big Data TechCon 2013
Integrating hadoop - Big Data TechCon 2013
Jonathan Seidman
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
nvvrajesh
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
James Serra
 
Research on vector spatial data storage scheme based
Research on vector spatial data storage scheme basedResearch on vector spatial data storage scheme based
Research on vector spatial data storage scheme based
Anant Kumar
 
The Warranty Data Lake – After, Inc.
The Warranty Data Lake – After, Inc.The Warranty Data Lake – After, Inc.
The Warranty Data Lake – After, Inc.
Richard Vermillion
 
Tableau and hadoop
Tableau and hadoopTableau and hadoop
Tableau and hadoop
Craig Jordan
 
Big Data Technologies and Why They Matter To R Users
Big Data Technologies and Why They Matter To R UsersBig Data Technologies and Why They Matter To R Users
Big Data Technologies and Why They Matter To R Users
Adaryl "Bob" Wakefield, MBA
 
So You Want to Build a Data Lake?
So You Want to Build a Data Lake?So You Want to Build a Data Lake?
So You Want to Build a Data Lake?
David P. Moore
 
Microsoft Data Platform - What's included
Microsoft Data Platform - What's includedMicrosoft Data Platform - What's included
Microsoft Data Platform - What's included
James Serra
 
Filling the Data Lake
Filling the Data LakeFilling the Data Lake
Filling the Data Lake
DataWorks Summit/Hadoop Summit
 
Introducing Amazon Aurora
Introducing Amazon AuroraIntroducing Amazon Aurora
Introducing Amazon Aurora
Sailesh Krishnamurthy
 
Impala Unlocks Interactive BI on Hadoop
Impala Unlocks Interactive BI on HadoopImpala Unlocks Interactive BI on Hadoop
Impala Unlocks Interactive BI on Hadoop
Cloudera, Inc.
 
Splice machine-bloor-webinar-data-lakes
Splice machine-bloor-webinar-data-lakesSplice machine-bloor-webinar-data-lakes
Splice machine-bloor-webinar-data-lakes
Edgar Alejandro Villegas
 
Hadoop-DS: Which SQL-on-Hadoop Rules the Herd
Hadoop-DS: Which SQL-on-Hadoop Rules the HerdHadoop-DS: Which SQL-on-Hadoop Rules the Herd
Hadoop-DS: Which SQL-on-Hadoop Rules the Herd
IBM Analytics
 
Design Principles for a Modern Data Warehouse
Design Principles for a Modern Data WarehouseDesign Principles for a Modern Data Warehouse
Design Principles for a Modern Data Warehouse
Rob Winters
 
Data Vault Automation at the Bijenkorf
Data Vault Automation at the BijenkorfData Vault Automation at the Bijenkorf
Data Vault Automation at the Bijenkorf
Rob Winters
 

What's hot (19)

Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27Prague data management meetup 2018-03-27
Prague data management meetup 2018-03-27
 
Hadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop SummitHadoop crash course workshop at Hadoop Summit
Hadoop crash course workshop at Hadoop Summit
 
Anatomy of a data driven architecture - Tamir Dresher
Anatomy of a data driven architecture - Tamir Dresher   Anatomy of a data driven architecture - Tamir Dresher
Anatomy of a data driven architecture - Tamir Dresher
 
Integrating hadoop - Big Data TechCon 2013
Integrating hadoop - Big Data TechCon 2013Integrating hadoop - Big Data TechCon 2013
Integrating hadoop - Big Data TechCon 2013
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
 
Research on vector spatial data storage scheme based
Research on vector spatial data storage scheme basedResearch on vector spatial data storage scheme based
Research on vector spatial data storage scheme based
 
The Warranty Data Lake – After, Inc.
The Warranty Data Lake – After, Inc.The Warranty Data Lake – After, Inc.
The Warranty Data Lake – After, Inc.
 
Tableau and hadoop
Tableau and hadoopTableau and hadoop
Tableau and hadoop
 
Big Data Technologies and Why They Matter To R Users
Big Data Technologies and Why They Matter To R UsersBig Data Technologies and Why They Matter To R Users
Big Data Technologies and Why They Matter To R Users
 
So You Want to Build a Data Lake?
So You Want to Build a Data Lake?So You Want to Build a Data Lake?
So You Want to Build a Data Lake?
 
Microsoft Data Platform - What's included
Microsoft Data Platform - What's includedMicrosoft Data Platform - What's included
Microsoft Data Platform - What's included
 
Filling the Data Lake
Filling the Data LakeFilling the Data Lake
Filling the Data Lake
 
Introducing Amazon Aurora
Introducing Amazon AuroraIntroducing Amazon Aurora
Introducing Amazon Aurora
 
Impala Unlocks Interactive BI on Hadoop
Impala Unlocks Interactive BI on HadoopImpala Unlocks Interactive BI on Hadoop
Impala Unlocks Interactive BI on Hadoop
 
Splice machine-bloor-webinar-data-lakes
Splice machine-bloor-webinar-data-lakesSplice machine-bloor-webinar-data-lakes
Splice machine-bloor-webinar-data-lakes
 
Hadoop-DS: Which SQL-on-Hadoop Rules the Herd
Hadoop-DS: Which SQL-on-Hadoop Rules the HerdHadoop-DS: Which SQL-on-Hadoop Rules the Herd
Hadoop-DS: Which SQL-on-Hadoop Rules the Herd
 
Design Principles for a Modern Data Warehouse
Design Principles for a Modern Data WarehouseDesign Principles for a Modern Data Warehouse
Design Principles for a Modern Data Warehouse
 
Data Vault Automation at the Bijenkorf
Data Vault Automation at the BijenkorfData Vault Automation at the Bijenkorf
Data Vault Automation at the Bijenkorf
 

Viewers also liked

Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
Swiss Big Data User Group
 
Parquet and impala overview external
Parquet and impala overview externalParquet and impala overview external
Parquet and impala overview external
mattlieber
 
"Hadoop and Data Warehouse (DWH) – Friends, Enemies or Profiteers? What about...
"Hadoop and Data Warehouse (DWH) – Friends, Enemies or Profiteers? What about..."Hadoop and Data Warehouse (DWH) – Friends, Enemies or Profiteers? What about...
"Hadoop and Data Warehouse (DWH) – Friends, Enemies or Profiteers? What about...
Kai Wähner
 
MapReduce Design Patterns
MapReduce Design PatternsMapReduce Design Patterns
MapReduce Design Patterns
Donald Miner
 
Easy Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Mac...
Easy Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Mac...Easy Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Mac...
Easy Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Mac...
Amazon Web Services
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetup
Rafal Kwasny
 

Viewers also liked (6)

Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
 
Parquet and impala overview external
Parquet and impala overview externalParquet and impala overview external
Parquet and impala overview external
 
"Hadoop and Data Warehouse (DWH) – Friends, Enemies or Profiteers? What about...
"Hadoop and Data Warehouse (DWH) – Friends, Enemies or Profiteers? What about..."Hadoop and Data Warehouse (DWH) – Friends, Enemies or Profiteers? What about...
"Hadoop and Data Warehouse (DWH) – Friends, Enemies or Profiteers? What about...
 
MapReduce Design Patterns
MapReduce Design PatternsMapReduce Design Patterns
MapReduce Design Patterns
 
Easy Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Mac...
Easy Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Mac...Easy Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Mac...
Easy Analytics on AWS with Amazon Redshift, Amazon QuickSight, and Amazon Mac...
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetup
 

Similar to From Raw Data to Analytics with No ETL

Friction-free ETL: Automating data transformation with Impala | Strata + Hado...
Friction-free ETL: Automating data transformation with Impala | Strata + Hado...Friction-free ETL: Automating data transformation with Impala | Strata + Hado...
Friction-free ETL: Automating data transformation with Impala | Strata + Hado...
Cloudera, Inc.
 
Be A Hero: Transforming GoPro Analytics Data Pipeline
Be A Hero: Transforming GoPro Analytics Data PipelineBe A Hero: Transforming GoPro Analytics Data Pipeline
Be A Hero: Transforming GoPro Analytics Data Pipeline
Chester Chen
 
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdfimpalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
ssusere05ec21
 
Build a modern data platform.pptx
Build a modern data platform.pptxBuild a modern data platform.pptx
Build a modern data platform.pptx
Ike Ellis
 
Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)
Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)
Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)
Michael Rys
 
Hive Evolution: ApacheCon NA 2010
Hive Evolution:  ApacheCon NA 2010Hive Evolution:  ApacheCon NA 2010
Hive Evolution: ApacheCon NA 2010
John Sichi
 
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Fwdays
 
StreamHorizon overview
StreamHorizon overviewStreamHorizon overview
StreamHorizon overview
StreamHorizon
 
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...
Michael Rys
 
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the CloudSpeed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
gluent.
 
ETL
ETL ETL
ETL
butest
 
Jethro for tableau webinar (11 15)
Jethro for tableau webinar (11 15)Jethro for tableau webinar (11 15)
Jethro for tableau webinar (11 15)
Remy Rosenbaum
 
Best Practices for Migrating Your Data Warehouse to Amazon Redshift
Best Practices for Migrating Your Data Warehouse to Amazon RedshiftBest Practices for Migrating Your Data Warehouse to Amazon Redshift
Best Practices for Migrating Your Data Warehouse to Amazon Redshift
Amazon Web Services
 
Impala Architecture presentation
Impala Architecture presentationImpala Architecture presentation
Impala Architecture presentation
hadooparchbook
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
James Chen
 
Introduction to Amazon Athena
Introduction to Amazon AthenaIntroduction to Amazon Athena
Introduction to Amazon Athena
Amazon Web Services
 
Turning a Search Engine into a Relational Database
Turning a Search Engine into a Relational DatabaseTurning a Search Engine into a Relational Database
Turning a Search Engine into a Relational Database
Matthias Wahl
 
Data Ingestion Engine
Data Ingestion EngineData Ingestion Engine
Data Ingestion Engine
Adam Doyle
 
Introduction to Apache NiFi dws19 DWS - DC 2019
Introduction to Apache NiFi   dws19 DWS - DC 2019Introduction to Apache NiFi   dws19 DWS - DC 2019
Introduction to Apache NiFi dws19 DWS - DC 2019
Timothy Spann
 
Oracle OpenWo2014 review part 03 three_paa_s_database
Oracle OpenWo2014 review part 03 three_paa_s_databaseOracle OpenWo2014 review part 03 three_paa_s_database
Oracle OpenWo2014 review part 03 three_paa_s_database
Getting value from IoT, Integration and Data Analytics
 

Similar to From Raw Data to Analytics with No ETL (20)

Friction-free ETL: Automating data transformation with Impala | Strata + Hado...
Friction-free ETL: Automating data transformation with Impala | Strata + Hado...Friction-free ETL: Automating data transformation with Impala | Strata + Hado...
Friction-free ETL: Automating data transformation with Impala | Strata + Hado...
 
Be A Hero: Transforming GoPro Analytics Data Pipeline
Be A Hero: Transforming GoPro Analytics Data PipelineBe A Hero: Transforming GoPro Analytics Data Pipeline
Be A Hero: Transforming GoPro Analytics Data Pipeline
 
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdfimpalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf
 
Build a modern data platform.pptx
Build a modern data platform.pptxBuild a modern data platform.pptx
Build a modern data platform.pptx
 
Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)
Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)
Introduction to Azure Data Lake and U-SQL for SQL users (SQL Saturday 635)
 
Hive Evolution: ApacheCon NA 2010
Hive Evolution:  ApacheCon NA 2010Hive Evolution:  ApacheCon NA 2010
Hive Evolution: ApacheCon NA 2010
 
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...
 
StreamHorizon overview
StreamHorizon overviewStreamHorizon overview
StreamHorizon overview
 
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...
Best Practices and Performance Tuning of U-SQL in Azure Data Lake (SQL Konfer...
 
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the CloudSpeed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud
 
ETL
ETL ETL
ETL
 
Jethro for tableau webinar (11 15)
Jethro for tableau webinar (11 15)Jethro for tableau webinar (11 15)
Jethro for tableau webinar (11 15)
 
Best Practices for Migrating Your Data Warehouse to Amazon Redshift
Best Practices for Migrating Your Data Warehouse to Amazon RedshiftBest Practices for Migrating Your Data Warehouse to Amazon Redshift
Best Practices for Migrating Your Data Warehouse to Amazon Redshift
 
Impala Architecture presentation
Impala Architecture presentationImpala Architecture presentation
Impala Architecture presentation
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
 
Introduction to Amazon Athena
Introduction to Amazon AthenaIntroduction to Amazon Athena
Introduction to Amazon Athena
 
Turning a Search Engine into a Relational Database
Turning a Search Engine into a Relational DatabaseTurning a Search Engine into a Relational Database
Turning a Search Engine into a Relational Database
 
Data Ingestion Engine
Data Ingestion EngineData Ingestion Engine
Data Ingestion Engine
 
Introduction to Apache NiFi dws19 DWS - DC 2019
Introduction to Apache NiFi   dws19 DWS - DC 2019Introduction to Apache NiFi   dws19 DWS - DC 2019
Introduction to Apache NiFi dws19 DWS - DC 2019
 
Oracle OpenWo2014 review part 03 three_paa_s_database
Oracle OpenWo2014 review part 03 three_paa_s_databaseOracle OpenWo2014 review part 03 three_paa_s_database
Oracle OpenWo2014 review part 03 three_paa_s_database
 

More from Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
Cloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
Cloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
Cloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Cloudera, Inc.
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
Cloudera, Inc.
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Cloudera, Inc.
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
Cloudera, Inc.
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
Cloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
Cloudera, Inc.
 

More from Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

Recently uploaded

FIDO Munich Seminar Introduction to FIDO.pptx
FIDO Munich Seminar Introduction to FIDO.pptxFIDO Munich Seminar Introduction to FIDO.pptx
FIDO Munich Seminar Introduction to FIDO.pptx
FIDO Alliance
 
Enterprise_Mobile_Security_Forum_2013.pdf
Enterprise_Mobile_Security_Forum_2013.pdfEnterprise_Mobile_Security_Forum_2013.pdf
Enterprise_Mobile_Security_Forum_2013.pdf
Yury Chemerkin
 
FIDO Munich Seminar Workforce Authentication Case Study.pptx
FIDO Munich Seminar Workforce Authentication Case Study.pptxFIDO Munich Seminar Workforce Authentication Case Study.pptx
FIDO Munich Seminar Workforce Authentication Case Study.pptx
FIDO Alliance
 
Welcome to Cyberbiosecurity. Because regular cybersecurity wasn't complicated...
Welcome to Cyberbiosecurity. Because regular cybersecurity wasn't complicated...Welcome to Cyberbiosecurity. Because regular cybersecurity wasn't complicated...
Welcome to Cyberbiosecurity. Because regular cybersecurity wasn't complicated...
Snarky Security
 
What's New in Teams Calling, Meetings, Devices June 2024
What's New in Teams Calling, Meetings, Devices June 2024What's New in Teams Calling, Meetings, Devices June 2024
What's New in Teams Calling, Meetings, Devices June 2024
Stephanie Beckett
 
Mastering Board Best Practices: Essential Skills for Effective Non-profit Lea...
Mastering Board Best Practices: Essential Skills for Effective Non-profit Lea...Mastering Board Best Practices: Essential Skills for Effective Non-profit Lea...
Mastering Board Best Practices: Essential Skills for Effective Non-profit Lea...
OnBoard
 
FIDO Munich Seminar: Strong Workforce Authn Push & Pull Factors.pptx
FIDO Munich Seminar: Strong Workforce Authn Push & Pull Factors.pptxFIDO Munich Seminar: Strong Workforce Authn Push & Pull Factors.pptx
FIDO Munich Seminar: Strong Workforce Authn Push & Pull Factors.pptx
FIDO Alliance
 
Generative AI Reasoning Tech Talk - July 2024
Generative AI Reasoning Tech Talk - July 2024Generative AI Reasoning Tech Talk - July 2024
Generative AI Reasoning Tech Talk - July 2024
siddu769252
 
The History of Embeddings & Multimodal Embeddings
The History of Embeddings & Multimodal EmbeddingsThe History of Embeddings & Multimodal Embeddings
The History of Embeddings & Multimodal Embeddings
Zilliz
 
Demystifying Neural Networks And Building Cybersecurity Applications
Demystifying Neural Networks And Building Cybersecurity ApplicationsDemystifying Neural Networks And Building Cybersecurity Applications
Demystifying Neural Networks And Building Cybersecurity Applications
Priyanka Aash
 
Indian Privacy law & Infosec for Startups
Indian Privacy law & Infosec for StartupsIndian Privacy law & Infosec for Startups
Indian Privacy law & Infosec for Startups
AMol NAik
 
Self-Healing Test Automation Framework - Healenium
Self-Healing Test Automation Framework - HealeniumSelf-Healing Test Automation Framework - Healenium
Self-Healing Test Automation Framework - Healenium
Knoldus Inc.
 
Redefining Cybersecurity with AI Capabilities
Redefining Cybersecurity with AI CapabilitiesRedefining Cybersecurity with AI Capabilities
Redefining Cybersecurity with AI Capabilities
Priyanka Aash
 
UiPath Community Day Amsterdam: Code, Collaborate, Connect
UiPath Community Day Amsterdam: Code, Collaborate, ConnectUiPath Community Day Amsterdam: Code, Collaborate, Connect
UiPath Community Day Amsterdam: Code, Collaborate, Connect
UiPathCommunity
 
Increase Quality with User Access Policies - July 2024
Increase Quality with User Access Policies - July 2024Increase Quality with User Access Policies - July 2024
Increase Quality with User Access Policies - July 2024
Peter Caitens
 
Keynote : Presentation on SASE Technology
Keynote : Presentation on SASE TechnologyKeynote : Presentation on SASE Technology
Keynote : Presentation on SASE Technology
Priyanka Aash
 
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan..."Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...
Fwdays
 
"Making .NET Application Even Faster", Sergey Teplyakov.pptx
"Making .NET Application Even Faster", Sergey Teplyakov.pptx"Making .NET Application Even Faster", Sergey Teplyakov.pptx
"Making .NET Application Even Faster", Sergey Teplyakov.pptx
Fwdays
 
History and Introduction for Generative AI ( GenAI )
History and Introduction for Generative AI ( GenAI )History and Introduction for Generative AI ( GenAI )
History and Introduction for Generative AI ( GenAI )
Badri_Bady
 
Finetuning GenAI For Hacking and Defending
Finetuning GenAI For Hacking and DefendingFinetuning GenAI For Hacking and Defending
Finetuning GenAI For Hacking and Defending
Priyanka Aash
 

Recently uploaded (20)

FIDO Munich Seminar Introduction to FIDO.pptx
FIDO Munich Seminar Introduction to FIDO.pptxFIDO Munich Seminar Introduction to FIDO.pptx
FIDO Munich Seminar Introduction to FIDO.pptx
 
Enterprise_Mobile_Security_Forum_2013.pdf
Enterprise_Mobile_Security_Forum_2013.pdfEnterprise_Mobile_Security_Forum_2013.pdf
Enterprise_Mobile_Security_Forum_2013.pdf
 
FIDO Munich Seminar Workforce Authentication Case Study.pptx
FIDO Munich Seminar Workforce Authentication Case Study.pptxFIDO Munich Seminar Workforce Authentication Case Study.pptx
FIDO Munich Seminar Workforce Authentication Case Study.pptx
 
Welcome to Cyberbiosecurity. Because regular cybersecurity wasn't complicated...
Welcome to Cyberbiosecurity. Because regular cybersecurity wasn't complicated...Welcome to Cyberbiosecurity. Because regular cybersecurity wasn't complicated...
Welcome to Cyberbiosecurity. Because regular cybersecurity wasn't complicated...
 
What's New in Teams Calling, Meetings, Devices June 2024
What's New in Teams Calling, Meetings, Devices June 2024What's New in Teams Calling, Meetings, Devices June 2024
What's New in Teams Calling, Meetings, Devices June 2024
 
Mastering Board Best Practices: Essential Skills for Effective Non-profit Lea...
Mastering Board Best Practices: Essential Skills for Effective Non-profit Lea...Mastering Board Best Practices: Essential Skills for Effective Non-profit Lea...
Mastering Board Best Practices: Essential Skills for Effective Non-profit Lea...
 
FIDO Munich Seminar: Strong Workforce Authn Push & Pull Factors.pptx
FIDO Munich Seminar: Strong Workforce Authn Push & Pull Factors.pptxFIDO Munich Seminar: Strong Workforce Authn Push & Pull Factors.pptx
FIDO Munich Seminar: Strong Workforce Authn Push & Pull Factors.pptx
 
Generative AI Reasoning Tech Talk - July 2024
Generative AI Reasoning Tech Talk - July 2024Generative AI Reasoning Tech Talk - July 2024
Generative AI Reasoning Tech Talk - July 2024
 
The History of Embeddings & Multimodal Embeddings
The History of Embeddings & Multimodal EmbeddingsThe History of Embeddings & Multimodal Embeddings
The History of Embeddings & Multimodal Embeddings
 
Demystifying Neural Networks And Building Cybersecurity Applications
Demystifying Neural Networks And Building Cybersecurity ApplicationsDemystifying Neural Networks And Building Cybersecurity Applications
Demystifying Neural Networks And Building Cybersecurity Applications
 
Indian Privacy law & Infosec for Startups
Indian Privacy law & Infosec for StartupsIndian Privacy law & Infosec for Startups
Indian Privacy law & Infosec for Startups
 
Self-Healing Test Automation Framework - Healenium
Self-Healing Test Automation Framework - HealeniumSelf-Healing Test Automation Framework - Healenium
Self-Healing Test Automation Framework - Healenium
 
Redefining Cybersecurity with AI Capabilities
Redefining Cybersecurity with AI CapabilitiesRedefining Cybersecurity with AI Capabilities
Redefining Cybersecurity with AI Capabilities
 
UiPath Community Day Amsterdam: Code, Collaborate, Connect
UiPath Community Day Amsterdam: Code, Collaborate, ConnectUiPath Community Day Amsterdam: Code, Collaborate, Connect
UiPath Community Day Amsterdam: Code, Collaborate, Connect
 
Increase Quality with User Access Policies - July 2024
Increase Quality with User Access Policies - July 2024Increase Quality with User Access Policies - July 2024
Increase Quality with User Access Policies - July 2024
 
Keynote : Presentation on SASE Technology
Keynote : Presentation on SASE TechnologyKeynote : Presentation on SASE Technology
Keynote : Presentation on SASE Technology
 
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan..."Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...
 
"Making .NET Application Even Faster", Sergey Teplyakov.pptx
"Making .NET Application Even Faster", Sergey Teplyakov.pptx"Making .NET Application Even Faster", Sergey Teplyakov.pptx
"Making .NET Application Even Faster", Sergey Teplyakov.pptx
 
History and Introduction for Generative AI ( GenAI )
History and Introduction for Generative AI ( GenAI )History and Introduction for Generative AI ( GenAI )
History and Introduction for Generative AI ( GenAI )
 
Finetuning GenAI For Hacking and Defending
Finetuning GenAI For Hacking and DefendingFinetuning GenAI For Hacking and Defending
Finetuning GenAI For Hacking and Defending
 

From Raw Data to Analytics with No ETL

  • 1. From Raw Data to Analytics with No ETL Marcel Kornacker // Cloudera, Inc.! Lenni Kuff // Cloudera, Inc.
  • 2. ‹‹#›› Outline • Evolution of ETL in the context of analytics! • traditional systems! • Hadoop today! • Cloudera’s vision for ETL: no ETL! • with qualifications
  • 3. ‹‹#›› Traditional ETL • Extract: physical extraction from source data store! • could be an RDBMS acting as an operational data store! • or log data materialized as json! • Transform:! • data cleansing and standardization! • conversion of naturally complex/nested data into a flat relational schema! • Load: the targeted analytic DBMS converts the transformed data into its binary format (typically columnar)
  • 4. ‹‹#›› Traditional ETL • Three aspects to the traditional ETL process:! 1. semantic transformation such as data standardization/cleansing -> makes data more queryable, adds value! 2. representational transformation: from source to target schema (from complex/nested to flat relational) -> “lateral” transformation that doesn’t change semantics, adds operational overhead! 3. data movement: from source to staging area to target system -> adds yet more operational overhead
  • 5. ‹‹#›› Traditional ETL • The goals of “analytics with no ETL”:! • simplify aspect 1! • eliminate aspects 2 and 3
  • 6. ‹‹#›› ETL with Hadoop Today • A typical ETL workflow with Hadoop looks like this:! • raw source data initially lands in HDFS (examples: text/xml/json log files)! • that data is mapped into a table to make it queryable: CREATE TABLE RawLogData (…) ROW FORMAT DELIMITED FIELDS LOCATION ‘/raw-log-data/‘;! • the target table is mapped to a different location: CREATE TABLE LogData (…) STORED AS PARQUET LOCATION ‘/log-data/‘; • the raw source data is converted to the target format: INSERT INTO LogData SELECT * FROM RawLogData; • the data is then available for batch reporting/analytics (via Impala, Hive, Pig, Spark) or interactive analytics (via Impala, Search)
  • 7. ‹‹#›› ETL with Hadoop Today • Compared to traditional ETL, this has several advantages:! • Hadoop acts as a centralized location for all data: raw source data lives side by side with the transformed data! • data does not need to be moved between multiple platforms/clusters! • data in the raw source format is queryable as soon as it lands, although at reduced performance, compared to an optimized columnar data format! • all data transformations are expressed through the same platform and can reference any of the Hadoop-resident data sources (and more)
  • 8. ‹‹#›› ETL with Hadoop Today • However, even this still has drawbacks:! • new data needs to be loaded periodically into the target table, and doing that reliably and within SLAs can be a challenge! • you now have two tables: one with current but slow data another with lagging but fast data
  • 9. ‹‹#›› A Vision for Analytics with No ETL • Goals:! • no explicit loading/conversion step to move raw data into a target table! • a single view of the data that is! • up-to-date! • (mostly) in an efficient columnar format
  • 10. ‹‹#›› A Vision for Analytics with No ETL • Elements of an ETL-light analytic stack:! • support for complex/nested schemas -> avoid remapping of raw data into a flat relational schema! • background and incremental data conversion -> retain in-place single view of entire data set, with most data being in an efficient format! • bonus: schema inference and schema evolution -> start analyzing data as soon as it arrives, regardless of its complexity
  • 11. ‹‹#›› Support for Complex Schemas in Impala • Standard relational: all columns have scalar values: CHAR(n), DECIMAL(p, s), INT, DOUBLE, TIMESTAMP, etc.! • Complex types: structs, arrays, maps in essence, a nested relational schema! • Supported file formats: Parquet, json, XML, Avro! • Design principle for SQL extensions: maintain SQL’s way of dealing with multi-valued data
  • 12. ‹‹#›› Support for Complex Schemas in Impala • Example: CREATE TABLE Customers ( cid BIGINT, address STRUCT { street STRING, zip INT }, orders ARRAY<STRUCT { oid BIGINT, total DECIMAL(9, 2), items ARRAY< STRUCT { iid BIGINT, qty INT, price DECIMAL(9, 2) }> }> )
  • 13. ‹‹#›› Support for Complex Schemas in Impala • Total revenue with items that cost more than $10: SELECT SUM(i.price * i.qty) FROM Customers.orders.items i WHERE i.price > 10! • Customers and order totals in zip 94611: SELECT c.cid, o.total FROM Customers c, c.orders o WHERE c.address.zip = 94611
  • 14. ‹‹#›› Support for Complex Schemas in Impala • Customers that have placed more than 10 orders: SELECT c.cid FROM Customers c WHERE COUNT(c.orders) > 10 (shorthand for: WHERE (SELECT COUNT(*) FROM c.orders) > 10)! • Number of orders and average item price per customer: SELECT c.cid, COUNT(c.orders), AVG(c.orders.items.price) FROM Customers c
  • 15. ‹‹#›› Background Format Conversion • Sample workflow:! • create table for data: CREATE TABLE LogData (…) WITH CONVERSION TO PARQUET;! • load data into table: LOAD DATA INPATH ‘/raw-log-data/file1’ INTO LogData SOURCE FORMAT SEQUENCEFILE; • Pre-requisite for incremental conversion: multi-format tables and partitions! • currently: each table partition has a single file format! • instead: allow a mix of file formats (separated into format-specific subdirectories)
  • 16. ‹‹#›› Background Format Conversion • Conversion process! • atomic: the switch from the source to the target data files is atomic from the perspective of a running query (but any running query sees the full data set) ! • redundant: with option to retain original data! • incremental: Impala’s catalog service detects new data files that are not in the target format automatically
  • 17. ‹‹#›› Schema Inference and Schema Evolution • Schema inference from data files is useful to reduce the barrier to analyzing complex source data! • as an example, log data often has hundreds of fields! • the time required to create the DDL manually is substantial! • Example: schema inference from structured data files! • available today: CREATE TABLE LogData LIKE PARQUET ‘/log-data.pq’! • future formats: XML, json, Avro
  • 18. ‹‹#›› Schema Inference and Schema Evolution • Schema evolution:! • a necessary follow-on to schema inference: every schema evolves over time; explicit maintenance is as time-consuming as the initial creation! • algorithmic schema evolution requires sticking to generally safe schema modifications: adding new fields! • adding new top-level columns! • adding fields within structs! • Example workflow: LOAD DATA INPATH ‘/path’ INTO LogData SOURCE FORMAT JSON WITH SCHEMA EXPANSION;! • scans data to determine new columns/fields to add! • synchronous: if there is an error, the ‘load’ is aborted and the user notified
  • 19. ‹‹#›› Conclusion • Hadoop offers a number of advantages over traditional multi-platform ETL solutions:! • availability of all data sets on a single platform! • data becomes accessible through SQL as soon as it lands! • However, this can be improved further:! • a richer analytic SQL that is extended to handle nested data! • an automated background conversion process that preserves an up-to-date view of all data while providing BI-typical performance! • simple automation of initial schema creation and subsequent maintenance that makes dealing with large, complex schemas less labor-intensive
  • 21. ‹‹#›› This an example segue slide on a blue background. This could also be a quote slide. This is an optional subtitle or space for attribution
  • 22. ‹‹#›› This an example segue slide on a blue background. This could also be a quote slide. This is an optional subtitle or space for attribution
  • 23. ‹‹#›› This an example segue slide on a blue background. This could also be a quote slide. This is an optional subtitle or space for attribution