Griffin is a data quality platform built by eBay on Hadoop and Spark to provide a unified process for detecting data quality issues in both real-time and batch data across multiple systems. It defines common data quality dimensions and metrics, calculates measurement values and quality scores, stores the results, and generates trending reports. Griffin provides a centralized data quality service for eBay and has been deployed to process over 1.2PB of data and 800M records daily across 100+ metrics. It is open source and contributions are welcome.
Running cost-effective big data workloads with Azure Synapse and Azure Data Lake Storage - Michael Rys
The presentation discusses how to migrate expensive open source big data workloads to Azure and leverage the latest compute and storage innovations in Azure Synapse with Azure Data Lake Storage to build powerful and cost-effective analytics solutions. It shows how you can bring your .NET expertise to bear with .NET for Apache Spark, and how the shared metadata experience in Synapse makes it easy to create a table in Spark and query it from T-SQL.
The document discusses different types of big data including unstructured, semi-structured, and structured data. It provides examples of each type such as audio, video, and images for unstructured data. JSON, XML, and sensor data are given as examples for semi-structured data. The document also discusses the challenges of processing big data due to its variety, velocity, and volume.
The document discusses Big Data on Azure and provides an overview of HDInsight, Microsoft's Apache Hadoop-based data platform on Azure. It describes HDInsight cluster types for Hadoop, HBase, Storm and Spark and how clusters can be automatically provisioned on Azure. Example applications and demos of Storm, HBase, Hive and Spark are also presented. The document highlights key aspects of using HDInsight including storage integration and tools for interactive analysis.
This document discusses leveraging Hadoop within the existing data warehouse environment of the Department of Immigration and Border Protection (DIBP) in Australia. It provides an overview of DIBP's business and why Hadoop was adopted, describes the existing EDW environment, and discusses the technical implementation of Hadoop. It also outlines next steps such as consolidating the departmental EDW and advanced analytics on Hadoop, and concludes by taking questions.
JethroData meetup: index-based SQL on Hadoop - Oct 2014 - Eli Singer
JethroData: an index-based SQL-on-Hadoop engine.
An architecture comparison of MPP / full-scan SQL engines such as Impala and Hive with index-based access such as Jethro.
SQL and NoSQL NYC meetup Oct 20 2014
Boaz Raufman
Realtime Analytical Query Processing and Predictive Model Building on High Di... - Spark Summit
Spark SQL and MLlib are optimized for running feature extraction and machine learning algorithms on row-based columnar datasets through full scans, but they do not provide constructs for column indexing or time series analysis. For document datasets with timestamps, where features are represented as a variable number of columns per document and use cases demand searching over columns and time to retrieve documents and generate learning models in real time, a close integration between Spark and Lucene was needed. We introduced LuceneDAO at Spark Summit Europe 2016 to build distributed Lucene shards from a data frame, but time series attributes were not part of the data model. In this talk we present our extension to LuceneDAO that maintains timestamps in the document-term view for search and allows time filters. Lucene shards maintain the time-aware document-term view for search and a vector space representation for machine learning pipelines. We use Spark as our distributed query processing engine, where each query is represented as a boolean combination over terms with filters on time. LuceneDAO loads the shards to Spark executors and powers sub-second distributed document retrieval for the queries.
Our synchronous API uses Spark-as-a-Service to power analytical queries, while our asynchronous API uses Kafka, Spark Streaming and HBase to power time series prediction algorithms. In this talk we will demonstrate LuceneDAO write and read performance on millions of documents with 1M+ terms and configurable timestamp aggregate columns, and the latency of the APIs on a suite of queries generated from terms. Key takeaways will be a thorough understanding of how to make Lucene-powered time-aware search a first-class citizen in Spark to build interactive analytical query processing and time series prediction algorithms.
Big data is driving transformative changes in traditional data warehousing. Traditional ETL processes and highly structured data schemas are being replaced with schema flexibility to handle all types of data from diverse sources. This allows for real-time experimentation and analysis beyond just operational reporting. Microsoft is applying lessons from its own big data journey to help customers by providing a comprehensive set of Apache big data tools in Azure along with intelligence and analytics services to gain insights from diverse data sources.
The document discusses Ido Friedman and his background working with various data technologies. It then discusses the concept of a data lake and how it serves as a single store for raw and transformed data used for reporting, analytics, and machine learning. The rest of the document discusses how traditional tools like SQL have changed with the rise of Hadoop and cloud storage. It provides examples of performance and cost differences between running data workloads on Hadoop clusters versus cloud-based data processing services like BigQuery and Dataproc. The document concludes that a large data lake is now possible in the cloud and discusses various deployment options to consider.
Slides for the talk at AI in Production meetup:
https://www.meetup.com/LearnDataScience/events/255723555/
Abstract: Demystifying Data Engineering
With recent progress in the fields of big data analytics and machine learning, Data Engineering is an emerging discipline which is not well-defined and often poorly understood.
In this talk, we aim to explain Data Engineering, its role in Data Science, the difference between a Data Scientist and a Data Engineer, the role of a Data Engineer and common concepts as well as commonly misunderstood ones found in Data Engineering. Toward the end of the talk, we will examine a typical Data Analytics system architecture.
This document discusses big data concepts like volume, velocity, and variety of data. It introduces NoSQL databases as an alternative to relational databases for big data, one that does not require upfront data cleansing or schema definition. Hadoop is presented as a framework for distributed storage and processing of large datasets across clusters of commodity hardware. Key Hadoop components like HDFS, MapReduce, Hive, Pig and YARN are described at a high level. The document also discusses using Azure services like Azure Storage, HDInsight and Stream Analytics with Hadoop.
Data Con LA 2020
Description
Data warehouses are not enough. Data lakes are the backbone of a modern data environment. Data Lakes are best built leveraging unique services of the cloud provider to reduce operations complexity. This session will explain why everyone's talking about data lakes, break down the best services in Azure to build a Data Lake, and walk through code for querying and loading with Azure Databricks and Event Hubs for Kafka. Attendees will leave the session with a firm grasp of why we build data lakes and how Azure Databricks fits in for ETL and querying.
Speaker
Dustin Vannoy, Dustin Vannoy Consulting, Principal Data Engineer
These slides provide highlights of my book HDInsight Essentials. Book link is here: http://www.packtpub.com/establish-a-big-data-solution-using-hdinsight/book
Here I talk about examples and use cases for Big Data & Big Data Analytics and how we accomplished massive-scale sentiment, campaign and marketing analytics for Razorfish using a collection of database, Big Data and analytics technologies.
This presentation covers an "Introduction to Big Data" for enterprises. It includes the challenges and benefits of Big Data, along with a transition plan based on a few case studies.
Presto – Today and Beyond – The Open Source SQL Engine for Querying all Data... - Dipti Borkar
Born at Facebook, Presto is an open source, high performance, distributed SQL query engine. With the disaggregation of storage and compute, Presto was created to simplify querying of all data lakes - cloud data lakes like S3 and on-premise data lakes like HDFS. Presto's high performance and flexibility have made it a very popular choice for interactive query workloads on large Hadoop-based clusters as well as AWS S3, Google Cloud Storage and Azure blob store. Today it has grown to support many users and use cases including ad hoc query, data lakehouse analytics, and federated querying. In this session, we will give an overview of Presto including its architecture and how it works, the problems it solves, and the most common use cases. We'll also share the latest innovations in the project as well as the future roadmap.
Build a simple data lake on AWS using a combination of services, including AWS Glue Data Catalog, AWS Glue Crawlers, AWS Glue Jobs, AWS Glue Studio, Amazon Athena, Amazon Relational Database Service (Amazon RDS), and Amazon S3.
Link to the blog post and video: https://garystafford.medium.com/building-a-simple-data-lake-on-aws-df21ca092e32
This document discusses architecting a data lake. It begins by introducing the speaker and topic. It then defines a data lake as a repository that stores enterprise data in its raw format including structured, semi-structured, and unstructured data. The document outlines some key aspects to consider when architecting a data lake such as design, security, data movement, processing, and discovery. It provides an example design and discusses solutions from vendors like AWS, Azure, and GCP. Finally, it includes an example implementation using Azure services for an IoT project that predicts parts failures in trucks.
Data Science Languages and Industry Analytics - Wes McKinney
September 19, 2015 talk at Berkeley Institute for Data Science. On how comparatively poor JSON / structured data tools pose a challenge for the data science languages (Python, R, Julia, etc.).
Apache Arrow is a new standard for in-memory columnar data processing. It is a complement to Apache Parquet and Apache ORC. In this deck we review key design goals and how Arrow works in detail.
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Parquet - Dremio Corporation
Essentially every successful analytical DBMS in the market today makes use of column-oriented data structures. In the Hadoop ecosystem, Apache Parquet (and Apache ORC) provide similar advantages in terms of processing and storage efficiency. Apache Arrow is the in-memory counterpart to these formats and has been embraced by over a dozen open source projects as the de facto standard for in-memory processing. In this session the PMC Chair for Apache Arrow and the PMC Chair for Apache Parquet discuss the future of column-oriented processing.
A fairy tale about orphans, forests, kings and forking open source software projects, with particular reference to sqlline and Apache Hive.
From a talk I gave at the Apache Hive contributors' meetup in Santa Clara on April 22nd, 2015.
This document discusses how Apache Calcite makes it easier to write database management systems (DBMS) by decomposing them into modular components like a query parser, catalog, algorithms, and storage engines. It presents Calcite as a framework that allows these components to be mixed and matched, with a core relational algebra and rule-based optimization. Calcite powers systems like Apache Hive, Drill, Phoenix, and Kylin by translating SQL and other queries to relational algebra and optimizing queries using over 100 rules before executing them using configurable engines and data sources.
Building a data lake is a daunting task. The promise of a virtual data lake is to provide the advantages of a data lake without consolidating all data into a single repository. With Apache Arrow and Dremio, companies can, for the first time, build virtual data lakes that provide full access to data no matter where it is stored and no matter what size it is.
This document discusses Apache Arrow, an open source cross-language development platform for in-memory analytics. It provides an overview of Arrow's goals of being cross-language compatible, optimized for modern CPUs, and enabling interoperability between systems. Key components include core C++/Java libraries, integrations with projects like Pandas and Spark, and common message patterns for sharing data. The document also describes how Arrow is implemented in practice in systems like Dremio's Sabot query engine.
Enterprise data is moving into Hadoop, but some data has to stay in operational systems. Apache Calcite (the technology behind Hive’s new cost-based optimizer, formerly known as Optiq) is a query-optimization and data federation technology that allows you to combine data in Hadoop with data in NoSQL systems such as MongoDB and Splunk, and access it all via SQL.
Hyde shows how to quickly build a SQL interface to a NoSQL system using Calcite. He shows how to add rules and operators to Calcite to push down processing to the source system, and how to automatically build materialized data sets in memory for blazing-fast interactive analysis.
Don’t optimize my queries, optimize my data! - Julian Hyde
The document discusses strategies for optimizing data through materialized views and how data systems can learn to optimize themselves. It proposes an algorithm that uses sketches and information theory to profile data cardinalities and recommend materialized views. The algorithm aims to defeat the combinatorial search space by only considering combinations with "surprising" cardinalities. This profiling provides the cost and benefit information needed to optimize data structures. The document also discusses using query logs and statistics to infer relationships between tables and design summary tables through lattices.
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle... - Rittman Analytics
Most DBAs are aware something interesting is going on with big data and the Hadoop product ecosystem that underpins it, but aren't so clear about what each component in the stack does, what problem each part solves and why those problems couldn't be solved using the old approach. We'll look at where it's all going with the advent of Spark and machine learning, what's happening with ETL, metadata and analytics on this platform ... and why IaaS and data-warehousing-as-a-service will have such a big impact, sooner than you think.
This document summarizes the history and evolution of data warehousing and analytics architectures. It discusses how data warehouses emerged in the 1970s and were further developed in the late 1980s and 1990s. It then covers how big data and Hadoop have changed architectures, providing more scalability and lower costs. Finally, it outlines components of modern analytics architectures, including Hadoop, data warehouses, analytics engines, and visualization tools that integrate these technologies.
This document provides an overview of architecting a first big data implementation. It defines key concepts like Hadoop, NoSQL databases, and real-time processing. It recommends asking questions about data, technology stack, and skills before starting a project. Distributed file systems, batch tools, and streaming systems like Kafka are important technologies for big data architectures. The document emphasizes moving from batch to real-time processing as a major opportunity.
Are you confused by Big Data? Get in touch with this new "black gold" and familiarize yourself with undiscovered insights through our complimentary introductory lesson on Big Data and Hadoop!
This document discusses building big data solutions using Microsoft's HDInsight platform. It provides an overview of big data and Hadoop concepts like MapReduce, HDFS, Hive and Pig. It also describes HDInsight and how it can be used to run Hadoop clusters on Azure. The document concludes by discussing some challenges with Hadoop and the broader ecosystem of technologies for big data beyond just Hadoop.
Data Integration and Data Warehousing for Cloud, Big Data and IoT: What’s Ne... - Rittman Analytics
Mark Rittman presented at Big Data World in London in March 2017 on data integration and data warehousing for cloud, big data, and IoT. He discussed the history of data warehousing and how it has evolved from traditional RDBMS implementations to embrace big data technologies like Hadoop. He described how cloud data warehouse offerings from Google BigQuery and Amazon Redshift combine the scalability of big data with the structure of data warehousing. Rittman also covered new approaches to ETL using data pipelines, schema discovery using machine learning, emerging open-source BI tools, and his current work in these areas.
Hadoop meets Agile! - An Agile Big Data Model - Uwe Printz
The document proposes an Agile Big Data model to address perceived issues with traditional Hadoop implementations. It discusses the motivation for change and outlines an Agile model with self-organized roles including data stewards, data scientists, project teams, and an architecture board. Key aspects of the proposed model include independent and self-managed project teams, a domain-driven data model, and emphasis on data quality and governance through the involvement of data stewards across domains.
A talk given by Ted Dunning in February 2013 on Apache Drill, an open-source community-driven project to provide easy, dependable, fast and flexible ad hoc query capabilities.
Big Data Strategy for the Relational World - Andrew Brust
1) Andrew Brust is the CEO of Blue Badge Insights and a big data expert who writes for ZDNet and GigaOM Research.
2) The document discusses trends in databases including the growth of NoSQL databases like MongoDB and Cassandra and Hadoop technologies.
3) It also covers topics like SQL convergence with Hadoop, in-memory databases, and recommends that organizations look at how widely database products are deployed before adopting them to avoid being locked into niche products.
This document discusses big data analytics platforms and techniques. It describes various open-source projects like Hadoop, Spark, and Mahout that can perform analytics on large datasets. It also discusses commercial analytics platforms from vendors like SAS, Alpine, and Revolution Analytics. Spark is highlighted as gaining rapid adoption for its speed and expanding machine learning capabilities. Key questions are raised about which open-source projects and commercial offerings will emerge as leaders in their categories.
Big Data 2.0: YARN Enablement for Distributed ETL & SQL with Hadoop - Caserta
In our most recent Big Data Warehousing Meetup, we learned about the transition from Big Data 1.0, Hadoop 1.x and its nascent technologies, to Hadoop 2.x with YARN, which enables distributed ETL, SQL and analytics solutions. Caserta Concepts Chief Architect Elliott Cordo and an Actian engineer covered the complete data value chain of an enterprise-ready platform, including data connectivity, collection, preparation, optimization and analytics with end-user access.
For more information on our services or upcoming events, please visit our website at http://www.casertaconcepts.com/.
Hadoop and the Data Warehouse: Point/Counterpoint - Inside Analysis
Robin Bloor and Teradata
Live Webcast on April 22, 2014
Watch the archive:
https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=2e69345c0a6a4e5a8de6fc72652e3bc6
Can you replace the data warehouse with Hadoop? Is Hadoop an ideal ETL subsystem? And what is the real magic of Hadoop? Everyone is looking to capitalize on the insights that lie in the vast pools of big data. Generating the value of that data relies heavily on several factors, especially choosing the right solution for the right context. With so many options out there, how do organizations best integrate these new big data solutions with the existing data warehouse environment?
Register for this episode of The Briefing Room to hear veteran analyst Dr. Robin Bloor as he explains where Hadoop fits into the information ecosystem. He’ll be briefed by Dan Graham of Teradata, who will offer perspective on how Hadoop can play a critical role in the analytic architecture. Bloor and Graham will interactively discuss big data in the big picture of the data center and will also seek to dispel several common misconceptions about Hadoop.
Visit InsideAnalysis.com for more information.
The document discusses Seagate's plans to integrate hard disk drives (HDDs) with flash storage, systems, services, and consumer devices to deliver unique hybrid solutions for customers. It notes Seagate's annual revenue, employees, manufacturing plants, and design centers. It also discusses Seagate exploring the use of big data analytics and Hadoop across various potential use cases and outlines Seagate's high-level plans for Hadoop implementation.
The way we store and manage data is changing. In the old days, there were only a handful of file formats and databases. Now there are countless databases and numerous file formats. The methods by which we access the data have also increased in number. As R users, we often access and analyze data in highly inefficient ways. Big Data tech has solved some of those problems.
This presentation will take attendees on a quick tour of the various relevant Big Data technologies. I’ll explain how these technologies fit together to form a stack for various data analysis uses cases. We’ll talk about what these technologies mean for the future of analyzing data with R.
Even if you work with “small data” this presentation will still be of interest because some Big Data tech has a small data use case.
Modernize Your Existing EDW with IBM Big SQL & Hortonworks Data Platform - Hortonworks
Find out how Hortonworks and IBM help you address these challenges and optimize your existing EDW environment.
https://hortonworks.com/webinar/modernize-existing-edw-ibm-big-sql-hortonworks-data-platform/
Microsoft's Big Play for Big Data - Visual Studio Live! NY 2012 - Andrew Brust
This document discusses Microsoft's efforts to make big data technologies like Hadoop more accessible through its products. It describes Hadoop, MapReduce, HDFS, and other big data concepts. It then outlines Microsoft's project to create a Hadoop distribution that runs on Windows Server and Windows Azure, including building an ODBC driver to allow tools like Excel to query Hadoop. This will help bring big data to more business users and integrate it with Microsoft's existing BI technologies.
Big Data is the reality of modern business: from big companies to small ones, everybody is trying to find their own benefit. Big Data technologies are not meant to replace traditional ones, but to be complementary to them. In this presentation you will hear what is Big Data and Data Lake and what are the most popular technologies used in Big Data world. We will also speak about Hadoop and Spark, and how they integrate with traditional systems and their benefits.
Apache Tajo - An open source big data warehouse - hadoopsphere
Apache Tajo is an open source distributed data warehouse system that allows for low-latency queries and long-running batch queries on various data sources like HDFS, S3, and HBase. It features ANSI SQL compliance, support for common file formats like CSV and JSON, and Java/Python UDF support. The presentation discusses recent Tajo releases, including new features in version 0.10, and outlines future plans.
Self-Service BI for big data applications using Apache Drill (Big Data Amsterdam) - Dataconomy Media
Modern big data applications such as social, mobile, web and IoT deal with a larger number of users and larger amount of data than the traditional transactional applications. The datasets associated with these applications evolve rapidly, are often self-describing and can include complex types such as JSON and Parquet. In this demo we will show how Apache Drill can be used to provide low latency queries natively on rapidly evolving multi-structured datasets at scale.
1. BI on Big Data
Tomer Shiran - @tshiran
Co-Founder & CEO, Dremio
Strata + Hadoop World London, June 3, 2016
What are your options?
2. 2 BI on Hadoop: What are your Options
Dremio Company Background
Jacques Nadeau, Founder & CTO
• Recognized SQL & NoSQL expert
• Founder of Apache Arrow & Drill
• Previously Quigo (AOL); Offermatica (ADBE); aQuantive (MSFT)
Tomer Shiran, Founder & CEO
• Previously MapR (VP Product & employee #5); Microsoft; IBM Research
• Carnegie Mellon, Technion
Julien Le Dem, Architect
• Founder of Apache Parquet
• Apache Pig PMC Member
• Previously Twitter (Lead, Analytics Data Pipeline); Yahoo! (Architect)
• Stealth data analytics startup
• Founded in 2015
• Backed by top Silicon Valley VCs
• Led by experts in Big Data and open source including the creators of Apache Arrow & Apache Parquet
3. 3 BI on Hadoop: What are your Options
Recent changes to the BI landscape
Good ol’ Days
• Only a few databases (e.g. Oracle, Teradata, SQL Server)
• A few BI tools (MicroStrategy, Cognos)
• Everything worked with everything
• Things were easy!
Modern Reality
• Larger scale, less control and less structure
• Lots of databases!
• Data Lake, not database
• HDFS: It’s a file system, folks!
• NoSQL: Let’s put the schema in the application
• It can feel like the wild west!
4. 4 BI on Hadoop: What are your Options
Major Approaches to BI on Big Data?
ETL to RDBMS
o “Make the new world look like the old world!”
o Load a transformed set of data into relational database
Monolithic (all-in-one) solutions
o Use BI tools that connect directly to Big Data
SQL-on-Big-Data
o Connect BI tools to a query engine sitting on top of Big Data
o Three main sub-categories: Native SQL, Batch SQL, OLAP Cubes
5. 5 BI on Hadoop: What are your Options
So how do we bring BI to Big Data?
[Diagram: three approaches side by side - ETL to Data Warehouse (Big Data → ETL tool → RDBMS → BI options), SQL-on-Big-Data (Big Data → SQL engine → BI options), and Monolithic All-in-one Solutions (monolithic tool with built-in BI on Big Data)]
6. 6 BI on Hadoop: What are your Options
ETL to RDBMS: Introduction
• ETL (Extract, Transform, and Load) a subset of the data into a relational database
o Oracle, PostgreSQL, Teradata, Redshift, Vertica, …
• Connect any desired BI tool to the RDBMS
o Tableau, Qlik, …
• Two options:
o Commercial tools (Informatica, Talend, Pentaho, …)
o Custom development, scripts, etc.
[Diagram: Big Data → ETL tool → RDBMS → BI options]
7. 7 BI on Hadoop: What are your Options
ETL to RDBMS: Example
• Load web server logs from HDFS into RDBMS
• ETL software: Pentaho Data Integration (aka ‘Kettle’)
• RDBMS: MySQL
Steps: connect ETL to RDBMS → add and configure input/output → connect input and output → create and fill RDBMS table → connect BI tool to RDBMS
[Bar chart: monthly totals, April through July]
Source: http://wiki.pentaho.com/display/BAD/Extracting+Data+from+HDFS+to+Load+an+RDBMS
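In practice, pipelines like this are often replaced by a short hand-coded script (a point the next slide's cons list picks up). Below is a minimal sketch of the same load, assuming the web logs have already been copied out of HDFS (e.g. with hdfs dfs -get) and that a local MySQL instance and the mysql-connector-python package are available; table, column, and credential names are illustrative, not from the talk.

```python
# Hand-coded sketch of the pipeline above: parse Apache-style access logs
# and load them into MySQL. Assumes the log file was already copied out of
# HDFS and mysql-connector-python is installed; all names are illustrative.
import re
import mysql.connector

LOG = re.compile(r'(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) (\d+|-)')

def parse(line):
    m = LOG.match(line)
    if not m:
        return None  # skip malformed lines rather than failing the load
    ip, ts, method, path, status, size = m.groups()
    return (ip, ts, method, path, int(status), 0 if size == "-" else int(size))

conn = mysql.connector.connect(host="localhost", user="etl",
                               password="secret", database="weblogs")
cur = conn.cursor()
cur.execute("""CREATE TABLE IF NOT EXISTS access_log (
                 ip VARCHAR(45), ts VARCHAR(32), method VARCHAR(8),
                 path VARCHAR(2048), status SMALLINT, bytes BIGINT)""")

with open("access.log") as f:
    rows = [r for r in (parse(line) for line in f) if r]
cur.executemany("INSERT INTO access_log VALUES (%s, %s, %s, %s, %s, %s)", rows)
conn.commit()
conn.close()
```

A script like this is easy to start with, but as the next slide notes, it has to be re-run (and re-debugged) every time schemas or freshness requirements change.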
8. 8 BI on Hadoop: What are your Options
ETL to RDBMS: Pros and Cons
Pros
• Relational databases and their BI integrations are very mature
• Use your favorite tools
o Tableau, Excel, R, …
Cons
• Traditional ETL tools don’t work well with modern data
o Changing schemas, complex or semi-structured data, …
o Hand-coded scripts are a common substitute
• Data freshness
o How often do you replicate/synchronize?
• Data resolution
o Can’t store all the raw data in the RDBMS (due to scalability and/or cost)
o Need to sample, aggregate or time-constrain the data
…and really, who wants to ETL?
9. 9 BI on Hadoop: What are your Options
Monolithic (or All-in-One) Solutions: Introduction
• Single piece of software on top of Big Data
• Performs both data visualization (BI) and execution
• Utilizes sampling or manual pre-aggregation to reduce the data volume that the user is interacting with
• Examples:
o Datameer
o Platfora
o Zoomdata
[Diagram: monolithic system with built-in BI on top of Big Data]
10. 10 BI on Hadoop: What are your Options
Platfora Architecture Overview
• Constructs aggregates that are
loaded into an external database
o Aggregates provide fast
visualizations
o Aggregations must be created
before consumption
MapReduce/Spark
HDFS
Hadoop Cluster
Hadoop
Proprietary DB
Aggregates
Platfora Cluster
11. 11 BI on Hadoop: What are your Options
Datameer Architecture Overview
• Users interact with samples of the data in an Excel-like interface
• Finished designs use the whole dataset
• Query router determines execution engine based on data size
[Diagram: Datameer nodes beside the Hadoop cluster; a query router with sampling dispatches to single-node custom execution, Tez, or MapReduce over HDFS]
12. 12 BI on Hadoop: What are your Options
Zoomdata Architecture Overview
• Queries on historical (i.e. non-streaming) data are split into many sampling queries
• This sampling provides a view of the data that converges toward an accurate picture
o But adds load on the data source…
• Can handle streaming data sources
[Diagram: Zoomdata server with incremental sampling and a Spark-based cache over multiple data clusters (HDFS / MongoDB), plus a stream processing engine for streaming data sources]
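To make the incremental-sampling idea concrete, here is a toy sketch: a metric is estimated from progressively larger random samples, so early answers stream back quickly and converge toward the full-scan result. This is purely illustrative, not Zoomdata's actual implementation.

```python
# Toy illustration of incremental sampling: estimate a metric from
# progressively larger random samples so early results arrive fast and
# converge toward the full-scan answer. Illustrative only.
import random

data = [random.gauss(100, 15) for _ in range(1_000_000)]  # stand-in dataset

def incremental_mean(rows, batches=10):
    shuffled = random.sample(rows, len(rows))  # one random pass, split into batches
    seen, total = 0, 0.0
    batch = len(rows) // batches
    for i in range(batches):
        chunk = shuffled[i * batch:(i + 1) * batch]
        total += sum(chunk)
        seen += len(chunk)
        yield total / seen  # running estimate after each sampling query

for step, estimate in enumerate(incremental_mean(data), 1):
    print(f"after batch {step:2d}: mean ~ {estimate:.2f}")
```

Each batch refines the estimate, which is why the slide says the view "converges toward an accurate picture" while the per-query load on the source stays small.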
13. 13 BI on Hadoop: What are your Options
Monolithic Solutions: Pros and Cons
Pros
• Only one tool to learn and operate
• Easier than building and maintaining an ETL-to-RDBMS pipeline
• Integrated data preparation in some solutions
Cons
• Can’t analyze the raw data
o Rely on aggregation or sampling before primary analysis
• Can’t use your existing BI or analytics tools (Tableau, Qlik, R, …)
• Can’t run arbitrary SQL queries
14. 14 BI on Hadoop: What are your Options
SQL-on-Big-Data: Introduction
• SQL queries against Big Data
o Hadoop
o NoSQL (MongoDB, HBase, ...)
o Cloud Storage (S3, Azure Data Lake, GCS, …)
• Use your existing BI tools
o Leverage standard ODBC/JDBC drivers
[Diagram: BI tools (Tableau, Qlik, R, …) → SQL engine → Hadoop & NoSQL]
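The ODBC path here is the same one BI tools use under the hood. A minimal sketch of it in code, assuming an ODBC DSN named "bigdata" has been configured for whichever engine's driver you installed; the DSN name, table, and query are illustrative.

```python
# Minimal sketch of the ODBC route BI tools take, assuming a DSN named
# "bigdata" is configured for your SQL engine's ODBC driver; the DSN name
# and query are illustrative.
import pyodbc

conn = pyodbc.connect("DSN=bigdata", autocommit=True)
cur = conn.cursor()
# The engine presents files/collections as SQL tables, whatever the store.
for row in cur.execute("SELECT region, COUNT(*) AS cnt "
                       "FROM logs GROUP BY region ORDER BY cnt DESC"):
    print(row.region, row.cnt)
conn.close()
```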
15. 15 BI on Hadoop: What are your Options
SQL-on-Big-Data: Introduction
Three major design philosophies:
• Native SQL
• Batch & Data Science SQL
• OLAP Cubes on Hadoop
16. 16 BI on Hadoop: What are your Options
Native SQL
• Apache Drill
o Queries Hadoop, RDBMS, Files, NoSQL, Cloud (S3)
o Based on Apache Arrow
o Columnar in-memory execution
• Apache Impala (incubating)
o Utilizes the Hive metastore
o Focused on data in HDFS
• Presto
o Queries Hadoop, RDBMS, Files, NoSQL, Cloud (S3)
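For a sense of how lightweight the client side is, here is a hedged sketch of hitting one of these engines (Presto) programmatically, assuming a coordinator on localhost:8080 and the PyHive client library; catalog, schema, and table names are illustrative.

```python
# Sketch of querying Presto directly, assuming a coordinator on
# localhost:8080 and PyHive installed (pip install 'pyhive[presto]');
# catalog, schema, and table names are illustrative.
from pyhive import presto

conn = presto.connect(host="localhost", port=8080,
                      catalog="hive", schema="web")
cur = conn.cursor()
cur.execute("SELECT status, COUNT(*) FROM access_log GROUP BY status")
for status, cnt in cur.fetchall():
    print(status, cnt)
```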
17. 17 BI on Hadoop: What are your Options
Native SQL: Pros and Cons
Pros
• Highest performance for Big Data workloads
• Connect to Hadoop and also NoSQL systems
• Make Hadoop “look like a database”
Cons
• Queries may still be too slow for interactive analysis on many TB/PB
• Can’t defeat physics
18. 18 BI on Hadoop: What are your Options
Batch & Data Science SQL
• Hive
o Enables SQL queries to be translated to MapReduce/Tez
o Most commonly used for batch processing and ETL workloads
• Spark SQL
o Provides a way to deliver SQL queries in Spark programs (Scala/Java/Python)
o Excellent interleaving with data science work
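The interleaving point is easiest to see in code. A small sketch, assuming a Spark installation and JSON event data at an illustrative HDFS path; column names are made up for the example.

```python
# Sketch of Spark SQL interleaved with MLlib in one program: run SQL over
# raw files, then feed the result to a clustering model. The HDFS path and
# column names are illustrative.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("sql-plus-ml").getOrCreate()

spark.read.json("hdfs:///data/events").createOrReplaceTempView("events")
features = spark.sql("""
    SELECT user_id,
           COUNT(*)         AS n_events,
           AVG(duration_ms) AS avg_duration
    FROM events
    GROUP BY user_id
""")

# Hand the SQL result straight to a machine learning pipeline.
vecs = VectorAssembler(inputCols=["n_events", "avg_duration"],
                       outputCol="features").transform(features)
model = KMeans(k=5, featuresCol="features").fit(vecs)
model.transform(vecs).select("user_id", "prediction").show()
```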
19. 19 BI on Hadoop: What are your Options
Batch & Data Science SQL: Pros and Cons
Pros
o Potentially simpler deployment (no daemons)
• New YARN job (MapReduce/Spark) for each query
o Check-pointing support enables very long-running queries
• Days to weeks (ETL work)
o Works well in tandem with machine learning (Spark)
Cons
o Latency prohibitive for interactive analytics
• Tableau, Qlik Sense, …
o Slower than native SQL engines
20. 20 BI on Hadoop: What are your Options
OLAP Cubes on Hadoop
• Kylin
o Hadoop-only
o Stores OLAP cubes in HBase
o Queries fail if not satisfied by cubes
o Open source
• AtScale
o Hadoop-only
o Leverages an external SQL engine (Hive, Impala, SparkSQL)
o Collaborative cube creation
o Closed source
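As a concrete taste of the cube workflow, here is a hedged sketch of querying Kylin over its REST API, assuming a server on localhost:7070 with Kylin's demo defaults; the credentials, project, and SQL are illustrative.

```python
# Sketch of querying a Kylin cube over its REST API, assuming a server on
# localhost:7070; credentials, project, and SQL are illustrative
# (ADMIN/KYLIN are Kylin's demo defaults).
import requests

resp = requests.post(
    "http://localhost:7070/kylin/api/query",
    auth=("ADMIN", "KYLIN"),
    json={
        "sql": "SELECT part_dt, SUM(price) FROM kylin_sales GROUP BY part_dt",
        "project": "learn_kylin",
    },
)
resp.raise_for_status()
# Answers come from pre-built cuboids, so they return quickly; SQL that no
# cube satisfies fails rather than falling back to a raw scan.
for row in resp.json()["results"]:
    print(row)
```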
21. 21 BI on Hadoop: What are your Options
OLAP Cubes on Hadoop: Pros and Cons
Pros
o Fast queries on pre-aggregated data
o Can use SQL and MDX tools
Cons
o Explicit cube definition/modeling phase
• Not “self-service”
• Frequent updates required due to dependency on business logic
o Aggregate creation and maintenance can take a long time (and a lot of space)
o User connects to and interacts with the cube
• Can’t interact with the raw data
22. 22 BI on Hadoop: What are your Options
SQL-on-Big-Data: Solution Comparison
Native SQL: technologies Drill, Impala, Presto; connectivity SQL and NoSQL; primary use case interactive; query capability raw data; deployment model new daemons collocated with existing services.
Batch & DS SQL: technologies Hive, Spark SQL; connectivity SQL and NoSQL; primary use case ETL or data-science focused; query capability raw data; deployment model a new MapReduce and/or Spark job for each query.
OLAP Cubes: technologies Kylin, AtScale; connectivity Hadoop-only; primary use case constrained interactive; query capability aggregated data; deployment model varies.
23. 23 BI on Hadoop: What are your Options
SQL-on-Big-Data: General Pros and Cons
Pros
• Continue using your favorite BI tools and SQL-based clients
o Tableau, Qlik, Power BI, Excel, R, SAS, …
• Technical analysts can write custom SQL queries
Cons
• Another layer in your data stack
• May need to pre-aggregate the data depending on your scale
• Need a separate data preparation tool (or custom scripts)
24. 24 BI on Hadoop: What are your Options
Deciding what is right for you?
25. 25 BI on Hadoop: What are your Options
BI on Big data: Heuristic
[Decision flowchart; key questions and the options they lead to]
• Is your working data relatively small & static? Yes: ETL to RDBMS.
• Do you already have a favorite BI tool? No: Monolithic/All-in-one Solutions. Is an external cluster okay, and does your schema change frequently? Platfora or Zoomdata. Do you like the Excel metaphor and not need to write SQL? Datameer.
• Do you have very predictable analysis needs? Yes: OLAP Cubes on Hadoop.
• Are you focused on interactive BI, or do you need to query NoSQL? Native SQL.
• Do you want to combine ML with SQL? Yes: SparkSQL. No: Hive.
26. 26 BI on Hadoop: What are your Options
Q&A
Tomer Shiran
tshiran@dremio.com
@tshiran
Reach out to learn what we’re up to at Dremio
(or to join the private beta…)