BI on Big Data
Tomer Shiran - @tshiran
Co-Founder & CEO, Dremio
Strata + Hadoop World London, June 3, 2016
What are your options?
2 BI on Hadoop: What are your Options
Dremio Company Background
Jacques Nadeau
Founder & CTO
• Recognized SQL & NoSQL expert
• Founder of Apache Arrow & Drill
• Previously Quigo (AOL); Offermatica
(ADBE); aQuantive (MSFT)
Tomer Shiran
Founder & CEO
• Previously MapR (VP Product &
employee #5); Microsoft;
IBM Research
• Carnegie Mellon, Technion
Julien Le Dem
• Founder of Apache Parquet
• Apache Pig PMC Member
• Previously Twitter (Lead, Analytics
Data Pipeline); Yahoo! (Architect)
Top Silicon Valley VCs
• Stealth data analytics startup
• Founded in 2015
• Led by experts in Big Data and open source including
the creators of Apache Arrow & Apache Parquet
Recent changes to the BI landscape
Good ol’ Days
• Only a few databases (e.g.
Oracle, Teradata, SQL
• A few BI tools (MicroStrategy,
• Everything worked with
• Things were easy!
Modern Reality
• Larger scale, less control and
less structure
• Lots of databases!
• Data Lake, not database
• HDFS: It’s a file system, folks!
• NoSQL: Let’s put the schema in
the application
• It can feel like the wild west!
Major Approaches to BI on Big Data?
ETL to RDBMS
o “Make the new world look like the old world!”
o Load a transformed set of data into a relational database
Monolithic (all-in-one) solutions
o Use BI tools that connect directly to Big Data
SQL-on-Big-Data
o Connect BI tools to a query engine sitting on top of Big Data
o Three main sub-categories
• Native SQL
• Batch SQL
• OLAP Cubes
So how do we bring BI to Big Data?
[Diagram: three routes from Big Data to BI options: ETL to Data … (via an ETL tool), SQL-on-Big-Data, and Monolithic All-in-one (monolithic tool with built-in BI)]
ETL to RDBMS: Introduction
• ETL (Extract, Transform, and Load) a subset of the data into a
relational database
o Oracle, PostgreSQL, Teradata, Redshift, Vertica, …
• Connect any desired BI tool to the RDBMS
o Tableau, Qlik, …
• Two options:
o Commercial tools (Informatica, Talend, Pentaho,…)
o Custom development, scripts, etc.
[Diagram: Big Data, ETL tool, BI options]
ETL to RDBMS: Example
• Load web server logs from HDFS into RDBMS
• ETL software: Pentaho Data Integration (aka ‘Kettle’)
[Screenshots of the steps: Connect ETL …; Add and Configure …; Connect Input and Output; Create and fill RDBMS table; Connect BI tool (chart over April, May, June, July)]
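The same extract, transform and load steps can be sketched in a few lines of code. This is only an illustration under stated assumptions, not the Pentaho workflow from the slide: the log lines are assumed to be in Common Log Format, SQLite stands in for the RDBMS, and the names `LOG_PATTERN`, `run_etl`, `hits_per_month` and the `hits` table are hypothetical.

```python
import re
import sqlite3

# Assumed input format (Common Log Format), e.g.:
# 10.0.0.1 - - [10/Apr/2016:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ '
    r'\[(?P<day>\d{2})/(?P<month>\w{3})/(?P<year>\d{4})[^\]]*\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def extract(lines):
    """Extract: yield one dict per parseable log line, skipping malformed ones."""
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match:
            yield match.groupdict()


def transform(record):
    """Transform: keep a subset of the fields and normalise their types."""
    size = 0 if record["size"] == "-" else int(record["size"])
    return (record["year"], record["month"], record["path"],
            int(record["status"]), size)


def load(conn, rows):
    """Load: create the (hypothetical) target table and insert the rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hits "
        "(year TEXT, month TEXT, path TEXT, status INTEGER, bytes INTEGER)"
    )
    conn.executemany("INSERT INTO hits VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()


def run_etl(lines, conn):
    """Run the three stages end to end."""
    load(conn, (transform(record) for record in extract(lines)))


def hits_per_month(conn):
    """The kind of aggregate a BI tool would request from the RDBMS."""
    cursor = conn.execute(
        "SELECT month, COUNT(*) FROM hits GROUP BY month ORDER BY month"
    )
    return cursor.fetchall()


# Made-up sample data; the third line is deliberately malformed.
SAMPLE_LOG = [
    '10.0.0.1 - - [10/Apr/2016:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '10.0.0.2 - - [11/Apr/2016:08:01:02 +0000] "GET /pricing HTTP/1.1" 200 512',
    'this line is not a valid log entry',
    '10.0.0.1 - - [02/May/2016:17:45:00 +0000] "GET /index.html HTTP/1.1" 304 -',
]

if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    run_etl(SAMPLE_LOG, connection)
    print(hits_per_month(connection))
```

Note that the transform step already discards fields and malformed lines, so the BI tool only ever sees the subset that was loaded.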
ETL to RDBMS: Pros and Cons
Pros
• Relational databases and their BI integrations are very mature
• Use your favorite tools
o Tableau, Excel, R, …
Cons
• Traditional ETL tools don’t work well with modern data
o Changing schemas, complex or semi-structured data, …
o Hand-coded scripts are a common substitute
• Data freshness
o How often do you replicate/synchronize?
• Data resolution
o Can’t store all the raw data in the RDBMS (due to scalability and/or cost)
o Need to sample, aggregate or time-constrain the data
…and really, who wants to ETL?
Monolithic (or All-in-One) Solutions: Introduction
• Single piece of software on top of Big Data
• Performs both data visualization (BI) and execution
• Utilize sampling or manual pre-aggregation to reduce
the data volume that the user is interacting with
• Examples:
o Datameer
o Platfora
o Zoomdata
[Diagram: monolithic system with built-in BI on top of Big Data]
Platfora Architecture Overview
• Constructs aggregates that are
loaded into an external database
o Aggregates provide fast
o Aggregations must be created
before consumption
[Diagram labels: Hadoop Cluster, Proprietary DB, Platfora Cluster]
Datameer Architecture Overview
• Users interact with samples of the data
in an Excel-like interface
• Finished designs use the whole dataset
• Query router determines execution
engine based on data size
[Diagram labels: Hadoop Cluster, Datameer Nodes, Query Router, Single Node, Custom Execution, Tez, MapReduce]
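A size-based query router of the kind described above might look roughly like the sketch below. The engine names come from the slide's diagram labels, but the thresholds, the mapping from size to engine, and the names `Engine` and `route_query` are assumptions made for illustration; the slide does not say how Datameer actually decides.

```python
from enum import Enum


class Engine(Enum):
    SINGLE_NODE = "single node"
    TEZ = "Tez"
    MAPREDUCE = "MapReduce"


# Hypothetical cut-offs; real values would depend on the cluster.
SINGLE_NODE_MAX_BYTES = 1 * 1024**3    # up to 1 GiB: run on one node
TEZ_MAX_BYTES = 100 * 1024**3          # up to 100 GiB: run on Tez


def route_query(input_bytes: int) -> Engine:
    """Pick an execution engine from the size of the input data."""
    if input_bytes < 0:
        raise ValueError("input size cannot be negative")
    if input_bytes <= SINGLE_NODE_MAX_BYTES:
        return Engine.SINGLE_NODE
    if input_bytes <= TEZ_MAX_BYTES:
        return Engine.TEZ
    return Engine.MAPREDUCE


if __name__ == "__main__":
    for size in (50 * 1024**2, 20 * 1024**3, 5 * 1024**4):
        print(size, route_query(size).value)
```

The point of such a router is that small inputs avoid the start-up cost of a distributed job, while large inputs still get distributed execution.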
Zoomdata Architecture Overview
• Queries on historical (i.e., non-streaming)
data are split into many sampling queries
• This sampling provides a view of the data
that converges toward an accurate picture
o But adds load on the data source…
• Can handle streaming data sources
[Diagram labels: Stream Processing, HDFS / MongoDB, Zoomdata Server, Data Clusters]
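The idea of an answer that converges as more sampling queries come back can be illustrated with a running estimate. This is a generic sketch, not Zoomdata's algorithm: `progressive_mean` is a hypothetical helper, each batch stands in for one sampling query, and the data is assumed to fit in a local list.

```python
import random


def progressive_mean(values, batch_size, seed=0):
    """Yield (rows_seen, running_mean) after each sampled batch.

    Each batch plays the role of one sampling query against the source,
    so more batches mean a better estimate but also more load.
    """
    rng = random.Random(seed)
    shuffled = list(values)
    rng.shuffle(shuffled)          # sample without replacement
    total = 0.0
    seen = 0
    for start in range(0, len(shuffled), batch_size):
        batch = shuffled[start:start + batch_size]
        total += sum(batch)
        seen += len(batch)
        yield seen, total / seen


if __name__ == "__main__":
    data = list(range(1, 101))     # made-up data with true mean 50.5
    for rows_seen, estimate in progressive_mean(data, batch_size=10):
        print(rows_seen, round(estimate, 2))
```

Early estimates are approximate; once every batch has been read, the estimate equals the exact mean, at the cost of one query per batch.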
Monolithic Solutions: Pros and Cons
Pros
• Only one tool to learn and operate
• Easier than building and maintaining an ETL-to-RDBMS pipeline
• Integrated data preparation in some solutions
Cons
• Can’t analyze the raw data
o Rely on aggregation or sampling before primary analysis
• Can’t use your existing BI or analytics tools (Tableau, Qlik, R, …)
• Can’t run arbitrary SQL queries
SQL-on-Big-Data: Introduction
• SQL queries against Big Data
o Hadoop
o NoSQL
• MongoDB, HBase, ...
o Cloud Storage
• S3, Azure Data Lake, GCS, …
• Use your existing BI tools
o Leverage standard ODBC/JDBC drivers
[Diagram: BI tools (Tableau, Qlik, R, …) on top of a SQL Engine on top of Hadoop & NoSQL]
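What "standard drivers" buys the client is that query code only depends on a generic database interface. The sketch below shows this with Python's DB-API, which plays a similar role to ODBC/JDBC; SQLite stands in for the SQL engine, and the `page_views` table, `top_pages` and `demo_connection` are hypothetical names used only for illustration.

```python
import sqlite3


def top_pages(connection, limit):
    """Run a plain SQL aggregate through any DB-API style connection.

    Only cursor(), execute() and fetchall() are used, so the caller does
    not need to know which engine sits behind the connection. The "?"
    placeholder is SQLite's parameter style; other drivers may differ.
    """
    cursor = connection.cursor()
    cursor.execute(
        "SELECT page, COUNT(*) AS views FROM page_views "
        "GROUP BY page ORDER BY views DESC, page ASC LIMIT ?",
        (limit,),
    )
    return cursor.fetchall()


def demo_connection():
    """Build an in-memory stand-in for the SQL engine with made-up data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE page_views (page TEXT, user_id INTEGER)")
    conn.executemany(
        "INSERT INTO page_views VALUES (?, ?)",
        [("/home", 1), ("/home", 2), ("/pricing", 1)],
    )
    return conn


if __name__ == "__main__":
    print(top_pages(demo_connection(), limit=2))
```

Swapping the stand-in for a connection from an engine's own driver would, in principle, leave `top_pages` unchanged apart from the placeholder style.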
SQL-on-Big-Data: Introduction
Three major design philosophies:
• Native SQL
• Batch & Data Science SQL
• OLAP Cubes on Hadoop
Native SQL
• Apache Drill
o Queries Hadoop, RDBMS, Files, NoSQL, Cloud (S3)
o Based on Apache Arrow
o Columnar in-memory execution
• Apache Impala (incubating)
o Utilizes the Hive metastore
o Focused on data in HDFS
• Presto
o Queries Hadoop, RDBMS, Files, NoSQL, Cloud (S3)
Native SQL: Pros and Cons
Pros
• Highest performance for Big Data workloads
• Connect to Hadoop and also NoSQL systems
• Make Hadoop “look like a database”
Cons
• Queries may still be too slow for interactive analysis on many TB/PB
• Can’t defeat physics
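A back-of-the-envelope calculation makes the "physics" point concrete: a query that has to read all the raw data cannot finish faster than the cluster can scan it. The cluster size and per-node throughput below are assumed round numbers for illustration, not measurements, and `full_scan_seconds` is a hypothetical helper.

```python
TB = 10**12  # bytes (decimal units)
PB = 10**15
GB = 10**9


def full_scan_seconds(data_bytes, nodes, bytes_per_second_per_node):
    """Lower bound on query time when every byte must be read once."""
    if nodes <= 0 or bytes_per_second_per_node <= 0:
        raise ValueError("nodes and throughput must be positive")
    return data_bytes / (nodes * bytes_per_second_per_node)


if __name__ == "__main__":
    # Assumed cluster: 100 nodes, each scanning 1 GB/s.
    for label, size in (("1 TB", 1 * TB), ("1 PB", 1 * PB)):
        seconds = full_scan_seconds(size, nodes=100,
                                    bytes_per_second_per_node=1 * GB)
        print(f"{label}: {seconds:,.0f} s")
```

Under these assumed numbers a full scan of 1 TB takes about 10 seconds and 1 PB about 10,000 seconds (nearly three hours), which is why interactive use at that scale tends to rely on reading less data rather than on a faster engine alone.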
Batch & Data Science SQL
• Hive
o Enables SQL queries to be translated to
o Most commonly used for batch processing and ETL
• Spark SQL
o Provides a way to deliver SQL queries in Spark
programs (Scala/Java/Python)
o Excellent interleaving with data science work
Batch & Data Science SQL: Pros and Cons
• Pros
o Potentially simpler deployment (no daemons)
• New YARN job (MapReduce/Spark) for each query
o Check-pointing support enables very long-running queries
• Days to weeks (ETL work)
o Works well in tandem with machine learning (Spark)
• Cons
o Latency prohibitive for interactive analytics
• Tableau, Qlik Sense, …
o Slower than native SQL engines
OLAP Cubes on Hadoop
• Kylin
o Hadoop-only
o Stores OLAP cubes in HBase
o Queries fail if not satisfied by cubes
o Open source
• AtScale
o Hadoop-only
o Leverages external SQL engine
• Hive, Impala, SparkSQL
o Collaborative cube creation
o Closed source
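The trade-off behind cubes can be shown with a toy model: aggregates are computed ahead of time for chosen dimension combinations, lookups against them are cheap, and a query outside those combinations cannot be answered from the cube. The `Cube` and `CubeMiss` names and the sample rows are hypothetical; this is a sketch of the general technique, not how Kylin or AtScale are implemented.

```python
from collections import defaultdict


class CubeMiss(Exception):
    """Raised when no pre-built aggregate covers the requested grouping."""


class Cube:
    def __init__(self, rows, dimension_sets, measure):
        """Pre-aggregate `measure` for each listed combination of dimensions."""
        self.aggregates = {}
        for dims in dimension_sets:
            key = frozenset(dims)
            ordered = tuple(sorted(key))
            totals = defaultdict(float)
            for row in rows:
                totals[tuple(row[d] for d in ordered)] += row[measure]
            self.aggregates[key] = dict(totals)

    def query(self, group_by):
        """Answer a GROUP BY from the pre-built aggregates only.

        Result keys are tuples of dimension values in sorted dimension order.
        """
        key = frozenset(group_by)
        if key not in self.aggregates:
            raise CubeMiss(f"no aggregate for dimensions {sorted(key)}")
        return self.aggregates[key]


if __name__ == "__main__":
    # Made-up fact rows.
    sales = [
        {"country": "UK", "month": "May", "sales": 10.0},
        {"country": "UK", "month": "June", "sales": 5.0},
        {"country": "FR", "month": "May", "sales": 7.0},
    ]
    cube = Cube(sales, [("country",), ("country", "month")], "sales")
    print(cube.query(["country"]))
    try:
        cube.query(["month"])      # not modelled in advance
    except CubeMiss as error:
        print("cube miss:", error)
```

An engine that falls back to the raw data on a miss avoids the failure but pays the full scan cost instead.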
OLAP Cubes on Hadoop: Pros and Cons
• Pros
o Fast queries on pre-aggregated data
o Can use SQL and MDX tools
• Cons
o Explicit cube definition/modeling phase
• Not “self-service”
• Frequent updates required due to dependency on business logic
o Aggregation creation and maintenance can be long (and large)
o User connects to and interacts with the cube
• Can’t interact with the raw data
SQL-on-Big-Data: Solution Comparison
                 | Native SQL                             | Batch & DS SQL                            | OLAP Cubes
Technologies     | Drill, Impala, Presto                  | Hive, Spark SQL                           | Kylin, AtScale
Connectivity     | SQL and NoSQL                          | SQL and NoSQL                             | Hadoop-only
Primary Use Case | Interactive …                          | ETL or data-science …                     | …
Query Capability | Raw data                               | Raw data                                  | Aggregated data
Deployment Model | New daemons collocated with existing … | New MapReduce and/or Spark job for each … | …
SQL-on-Big-Data: General Pros and Cons
Pros
• Continue using your favorite BI tools and SQL-based clients
o Tableau, Qlik, Power BI, Excel, R, SAS, …
• Technical analysts can write custom SQL queries
Cons
• Another layer in your data stack
• May need to pre-aggregate the data depending on your scale
• Need a separate data preparation tool (or custom scripts)
Deciding what is right for you?
BI on Big Data: Heuristic
[Decision flowchart; each question branches on No / Yes]
Questions:
• Do you already have a favorite BI tool?
• Is an external cluster okay?
• Does your schema change frequently?
• Do you want to be able to write SQL?
• Do you like the Excel metaphor?
• Is your working data relatively small & static?
• Do you have very predictable analysis needs?
• Are you focused on interactive BI?
• Do you need to query NoSQL?
• Do you want to combine ML with SQL?
Outcomes:
• ETL to …
• Monolithic/All-in-one Solutions
• OLAP Cubes on Hadoop
• Native SQL
Tomer Shiran
Reach out to learn what we’re up to at Dremio
(or to join the private beta…)

Welcome to Cyberbiosecurity. Because regular cybersecurity wasn't complicated...
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...
Garbage In, Garbage Out: Why poor data curation is killing your AI models (an...
Demystifying Neural Networks And Building Cybersecurity Applications
Demystifying Neural Networks And Building Cybersecurity ApplicationsDemystifying Neural Networks And Building Cybersecurity Applications
Demystifying Neural Networks And Building Cybersecurity Applications
"Hands-on development experience using wasm Blazor", Furdak Vladyslav.pptx
"Hands-on development experience using wasm Blazor", Furdak Vladyslav.pptx"Hands-on development experience using wasm Blazor", Furdak Vladyslav.pptx
"Hands-on development experience using wasm Blazor", Furdak Vladyslav.pptx
FIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptx
FIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptxFIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptx
FIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptx
FIDO Munich Seminar: FIDO Tech Principles.pptx
FIDO Munich Seminar: FIDO Tech Principles.pptxFIDO Munich Seminar: FIDO Tech Principles.pptx
FIDO Munich Seminar: FIDO Tech Principles.pptx
Zaitechno Handheld Raman Spectrometer.pdf
Zaitechno Handheld Raman Spectrometer.pdfZaitechno Handheld Raman Spectrometer.pdf
Zaitechno Handheld Raman Spectrometer.pdf
The History of Embeddings & Multimodal Embeddings
The History of Embeddings & Multimodal EmbeddingsThe History of Embeddings & Multimodal Embeddings
The History of Embeddings & Multimodal Embeddings
UiPath Community Day Amsterdam: Code, Collaborate, Connect
UiPath Community Day Amsterdam: Code, Collaborate, ConnectUiPath Community Day Amsterdam: Code, Collaborate, Connect
UiPath Community Day Amsterdam: Code, Collaborate, Connect
Finetuning GenAI For Hacking and Defending
Finetuning GenAI For Hacking and DefendingFinetuning GenAI For Hacking and Defending
Finetuning GenAI For Hacking and Defending
Exchange, Entra ID, Conectores, RAML: Todo, a la vez, en todas partes
Exchange, Entra ID, Conectores, RAML: Todo, a la vez, en todas partesExchange, Entra ID, Conectores, RAML: Todo, a la vez, en todas partes
Exchange, Entra ID, Conectores, RAML: Todo, a la vez, en todas partes
FIDO Munich Seminar Introduction to FIDO.pptx
FIDO Munich Seminar Introduction to FIDO.pptxFIDO Munich Seminar Introduction to FIDO.pptx
FIDO Munich Seminar Introduction to FIDO.pptx

BI on Big Data - Strata 2016 in London

  • 1. BI on Big Data
      Tomer Shiran - @tshiran
      Co-Founder & CEO, Dremio
      Strata + Hadoop World London, June 3, 2016
      What are your options?
  • 2. Dremio Company Background
      BI on Hadoop: What are your Options
      Jacques Nadeau, Founder & CTO
        • Recognized SQL & NoSQL expert
        • Founder of Apache Arrow & Drill
        • Previously Quigo (AOL); Offermatica (ADBE); aQuantive (MSFT)
      Tomer Shiran, Founder & CEO
        • Previously MapR (VP Product & employee #5); Microsoft; IBM Research
        • Carnegie Mellon, Technion
      Julien Le Dem, Architect
        • Founder of Apache Parquet
        • Apache Pig PMC Member
        • Previously Twitter (Lead, Analytics Data Pipeline); Yahoo! (Architect)
      Top Silicon Valley VCs
      • Stealth data analytics startup
      • Founded in 2015
      • Led by experts in Big Data and open source including the creators of Apache Arrow & Apache Parquet
  • 3. Recent changes to the BI landscape
      Good ol’ Days
        • Only a few databases (e.g. Oracle, Teradata, SQL Server)
        • A few BI tools (MicroStrategy, Cognos)
        • Everything worked with everything
        • Things were easy!
      Modern Reality
        • Larger scale, less control and less structure
        • Lots of databases!
        • Data Lake, not database
        • HDFS: It’s a file system, folks!
        • NoSQL: Let’s put the schema in the application
        • It can feel like the wild west!
  • 4. Major Approaches to BI on Big Data?
      ETL to RDBMS
        o “Make the new world look like the old world!”
        o Load a transformed set of data into a relational database
      Monolithic (all-in-one) solutions
        o Use BI tools that connect directly to Big Data
      SQL-on-Big-Data
        o Connect BI tools to a query engine sitting on top of Big Data
        o Three main sub-categories
          • Native SQL
          • Batch SQL
          • OLAP Cubes
  • 5. So how do we bring BI to Big Data?
      (Diagram: three approaches)
        • ETL to Data Warehouse: Big Data -> ETL tool -> RDBMS -> BI options
        • SQL-on-Big-Data: Big Data -> SQL Engine -> BI options
        • Monolithic All-in-one Solutions: Big Data -> Monolithic tool with built-in BI
  • 6. ETL to RDBMS: Introduction
      • ETL (Extract, Transform, and Load) a subset of the data into a relational database
        o Oracle, PostgreSQL, Teradata, Redshift, Vertica, …
      • Connect any desired BI tool to the RDBMS
        o Tableau, Qlik, …
      • Two options:
        o Commercial tools (Informatica, Talend, Pentaho, …)
        o Custom development, scripts, etc.
      (Diagram: Big Data -> ETL tool -> RDBMS -> BI options)
  • 7. ETL to RDBMS: Example
      • Load web server logs from HDFS into RDBMS
      • ETL software: Pentaho Data Integration (aka ‘Kettle’)
      • RDBMS: MySQL
      Steps shown:
        1. Connect ETL to RDBMS
        2. Add and configure input/output
        3. Connect input and output
        4. Create and fill RDBMS table
        5. Connect BI tool to RDBMS
      (Chart: values from 0 to 250 by month, April through July. Source:)
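The slide performs this load with Pentaho Data Integration and MySQL. As a rough illustration of the same extract-transform-load shape in a hand-coded script, here is a minimal Python sketch. Everything in it is an assumption for illustration: the log lines are an in-memory sample standing in for files in HDFS, SQLite (from the standard library) stands in for MySQL, and the table name `monthly_hits` is hypothetical.

```python
import re
import sqlite3
from collections import Counter

# Hypothetical sample of web server log lines (Common Log Format).
# In the slide's scenario these would be read from files in HDFS.
SAMPLE_LOGS = [
    '10.0.0.1 - - [03/Apr/2016:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 1024',
    '10.0.0.2 - - [17/Apr/2016:11:30:00 +0000] "GET /about.html HTTP/1.1" 200 512',
    '10.0.0.1 - - [02/May/2016:09:15:00 +0000] "GET /index.html HTTP/1.1" 404 0',
    'malformed line that the transform step should skip',
]

LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ '
    r'\[(?P<day>\d{2})/(?P<month>\w{3})/(?P<year>\d{4})[^\]]*\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\d+)$'
)


def extract_transform(lines):
    """Parse raw lines and aggregate them to hits per (year, month)."""
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # skip lines that do not fit the expected format
        counts[(match["year"], match["month"])] += 1
    return counts


def load(conn, counts):
    """Create the target table and insert the aggregated rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS monthly_hits "
        "(year TEXT, month TEXT, hits INTEGER)"
    )
    conn.executemany(
        "INSERT INTO monthly_hits VALUES (?, ?, ?)",
        [(year, month, hits) for (year, month), hits in counts.items()],
    )
    conn.commit()


# SQLite in memory stands in for the MySQL target used on the slide.
conn = sqlite3.connect(":memory:")
load(conn, extract_transform(SAMPLE_LOGS))

# A BI tool would issue a query like this against the RDBMS.
rows = conn.execute(
    "SELECT month, hits FROM monthly_hits ORDER BY hits DESC"
).fetchall()
print(rows)
```

Note that only the monthly aggregate reaches the database; the raw log lines do not, which is the data-resolution trade-off of the ETL-to-RDBMS approach.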
  • 8. ETL to RDBMS: Pros and Cons
      Pros
        • Relational databases and their BI integrations are very mature
        • Use your favorite tools
          o Tableau, Excel, R, …
      Cons
        • Traditional ETL tools don’t work well with modern data
          o Changing schemas, complex or semi-structured data, …
          o Hand-coded scripts are a common substitute
        • Data freshness
          o How often do you replicate/synchronize?
        • Data resolution
          o Can’t store all the raw data in the RDBMS (due to scalability and/or cost)
          o Need to sample, aggregate or time-constrain the data
      …and really, who wants to ETL?
  • 9. Monolithic (or All-in-One) Solutions: Introduction
      • Single piece of software on top of Big Data
      • Performs both data visualization (BI) and execution
      • Utilize sampling or manual pre-aggregation to reduce the data volume that the user is interacting with
      • Examples:
        o Datameer
        o Platfora
        o Zoomdata
      (Diagram: Big Data -> Monolithic system with built-in BI)
  • 10. Platfora Architecture Overview
      • Constructs aggregates that are loaded into an external database
        o Aggregates provide fast visualizations
        o Aggregations must be created before consumption
      (Diagram labels: Hadoop Cluster with MapReduce/Spark and HDFS; Platfora Cluster with Proprietary DB holding Aggregates)
  • 11. Datameer Architecture Overview
      • Users interact with samples of the data in an Excel-like interface
      • Finished designs use the whole dataset
      • Query router determines execution engine based on data size
      (Diagram labels: Datameer Nodes with Query Router, Sampling, Single Node Custom Execution, Tez, MapReduce; Hadoop Cluster with HDFS)
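The idea of a query router that picks an execution engine from the data size can be sketched in a few lines of Python. This is not Datameer's implementation: the thresholds, the engine names as strings, and the assumption that Tez handles mid-sized inputs while MapReduce takes the largest are all hypothetical, chosen only to show the pattern.

```python
# Hypothetical size cut-offs; the slides do not state the real ones.
SINGLE_NODE_MAX_BYTES = 1 * 1024**3   # up to 1 GiB: run on a single node
TEZ_MAX_BYTES = 1 * 1024**4           # up to 1 TiB: run on Tez


def route(size_bytes: int) -> str:
    """Return the name of the engine to use for an input of this size."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    if size_bytes <= SINGLE_NODE_MAX_BYTES:
        return "single-node"
    if size_bytes <= TEZ_MAX_BYTES:
        return "tez"
    return "mapreduce"


for size in (50 * 1024**2, 200 * 1024**3, 5 * 1024**4):
    print(size, route(size))
```

Keeping the decision in one small function makes the cut-offs easy to tune without touching the engines themselves.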
  • 12. Zoomdata Architecture Overview
      • Queries on historical (i.e., non-streaming) data are split into many sampling queries
      • This sampling provides a view of the data that converges toward an accurate picture
        o But adds load on the data source…
      • Can handle streaming data sources
      (Diagram labels: Zoomdata Server with Stream Processing Engine, Spark-based cache, Incremental Sampling; Streaming Data Source; Multiple Data Clusters, HDFS / MongoDB)
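To make "converges toward an accurate picture" concrete, here is a minimal Python sketch of incremental sampling. It is an illustration of the general idea only, not Zoomdata's algorithm: the function name `incremental_estimates`, the batch size, and the synthetic dataset are assumptions. Each batch plays the role of one sampling query, and the running mean is refined after every batch.

```python
import random


def incremental_estimates(values, batch_size, seed=0):
    """Yield (rows_seen, running_mean) after each random batch.

    Rows are visited in a shuffled order without replacement, so the
    estimate equals the exact mean once every row has been seen.
    """
    rng = random.Random(seed)
    order = list(range(len(values)))
    rng.shuffle(order)
    total = 0.0
    seen = 0
    for start in range(0, len(order), batch_size):
        batch = order[start:start + batch_size]
        total += sum(values[i] for i in batch)  # one "sampling query"
        seen += len(batch)
        yield seen, total / seen


# Synthetic data standing in for a historical dataset.
values = [float(v) for v in range(1, 1001)]
estimates = list(incremental_estimates(values, batch_size=100))
for seen, mean in estimates:
    print(f"{seen:5d} rows sampled, estimated mean = {mean:.2f}")
```

Each batch is an extra query against the source, which is the added load the slide warns about.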
  • 13. Monolithic Solutions: Pros and Cons
      Pros
        • Only one tool to learn and operate
        • Easier than building and maintaining an ETL-to-RDBMS pipeline
        • Integrated data preparation in some solutions
      Cons
        • Can’t analyze the raw data
          o Rely on aggregation or sampling before primary analysis
        • Can’t use your existing BI or analytics tools (Tableau, Qlik, R, …)
        • Can’t run arbitrary SQL queries
  • 14. SQL-on-Big-Data: Introduction
      • SQL queries against Big Data
        o Hadoop
        o NoSQL
          • MongoDB, HBase, ...
        o Cloud Storage
          • S3, Azure Data Lake, GCS, …
      • Use your existing BI tools
        o Leverage standard ODBC/JDBC drivers
      (Diagram: Tableau, Qlik, R, … -> SQL Engine -> Hadoop & NoSQL)
  • 15. SQL-on-Big-Data: Introduction
      Three major design philosophies:
        • Native SQL
        • Batch & Data Science SQL
        • OLAP Cubes on Hadoop
  • 16. Native SQL
      • Apache Drill
        o Queries Hadoop, RDBMS, Files, NoSQL, Cloud (S3)
        o Based on Apache Arrow
        o Columnar in-memory execution
      • Apache Impala (incubating)
        o Utilizes the Hive metastore
        o Focused on data in HDFS
      • Presto
        o Queries Hadoop, RDBMS, Files, NoSQL, Cloud (S3)
  • 17. Native SQL: Pros and Cons
      Pros
        • Highest performance for Big Data workloads
        • Connect to Hadoop and also NoSQL systems
        • Make Hadoop “look like a database”
      Cons
        • Queries may still be too slow for interactive analysis on many TB/PB
        • Can’t defeat physics
  • 18. Batch & Data Science SQL
      • Hive
        o Enables SQL queries to be translated to MapReduce/Tez
        o Most commonly used for batch processing and ETL workloads
      • Spark SQL
        o Provides a way to deliver SQL queries in Spark programs (Scala/Java/Python)
        o Excellent interleaving with data science work
  • 19. Batch & Data Science SQL: Pros and Cons
      Pros
        o Potentially simpler deployment (no daemons)
          • New YARN job (MapReduce/Spark) for each query
        o Check-pointing support enables very long-running queries
          • Days to weeks (ETL work)
        o Works well in tandem with machine learning (Spark)
      Cons
        o Latency prohibitive for interactive analytics
          • Tableau, Qlik Sense, …
        o Slower than native SQL engines
  • 20. OLAP Cubes on Hadoop
      • Kylin
        o Hadoop-only
        o Stores OLAP cubes in HBase
        o Queries fail if not satisfied by cubes
        o Open source
      • AtScale
        o Hadoop-only
        o Leverages external SQL engine
          • Hive, Impala, SparkSQL
        o Collaborative cube creation
        o Closed source
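The behaviour "queries fail if not satisfied by cubes" follows directly from how pre-aggregation works, and a toy Python sketch can show why. This is not Kylin's or AtScale's design: the `Cube` class, its methods, and the sample rows are hypothetical. Aggregates are built up front for a chosen list of dimension sets, and a query can only be answered if its grouping matches one of them.

```python
from collections import defaultdict


class Cube:
    """Toy cube: sums of one measure, pre-computed per dimension set."""

    def __init__(self, rows, dimension_sets, measure):
        self._aggregates = {}
        for dims in dimension_sets:
            key_dims = tuple(sorted(dims))
            table = defaultdict(float)
            for row in rows:
                table[tuple(row[d] for d in key_dims)] += row[measure]
            self._aggregates[key_dims] = dict(table)

    def query(self, group_by):
        """Return the pre-computed sums, or fail if none were modeled."""
        key_dims = tuple(sorted(group_by))
        if key_dims not in self._aggregates:
            raise LookupError(f"no cube covers group-by {key_dims}")
        return self._aggregates[key_dims]


# Hypothetical raw rows; after the build only the aggregates are kept.
rows = [
    {"country": "UK", "device": "mobile", "revenue": 10.0},
    {"country": "UK", "device": "desktop", "revenue": 20.0},
    {"country": "US", "device": "mobile", "revenue": 5.0},
]

cube = Cube(rows, dimension_sets=[["country"], ["country", "device"]],
            measure="revenue")

print(cube.query(["country"]))        # answered from the cube
try:
    cube.query(["device"])            # not modeled, so it fails
except LookupError as err:
    print("query failed:", err)
```

Answering `["device"]` would need either a new cube definition or access to the raw rows, which the cube no longer exposes.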
  • 21. OLAP Cubes on Hadoop: Pros and Cons
      Pros
        o Fast queries on pre-aggregated data
        o Can use SQL and MDX tools
      Cons
        o Explicit cube definition/modeling phase
          • Not “self-service”
          • Frequent updates required due to dependency on business logic
        o Aggregation creation and maintenance can be long (and large)
        o User connects to and interacts with the cube
          • Can’t interact with the raw data
  • 22. SQL-on-Big-Data: Solution Comparison
                         | Native SQL                                    | Batch & DS SQL                                | OLAP Cubes
      Technologies       | Drill, Impala, Presto                         | Hive, Spark SQL                               | Kylin, AtScale
      Connectivity       | SQL and NoSQL                                 | SQL and NoSQL                                 | Hadoop-only
      Primary Use Case   | Interactive                                   | ETL or data-science focused                   | Constrained Interactive
      Query Capability   | Raw data                                      | Raw data                                      | Aggregated data
      Deployment Model   | New daemons collocated with existing services | New MapReduce and/or Spark job for each query | Varies
  • 23. SQL-on-Big-Data: General Pros and Cons
      Pros
        • Continue using your favorite BI tools and SQL-based clients
          o Tableau, Qlik, Power BI, Excel, R, SAS, …
        • Technical analysts can write custom SQL queries
      Cons
        • Another layer in your data stack
        • May need to pre-aggregate the data depending on your scale
        • Need a separate data preparation tool (or custom scripts)
  • 24. Deciding what is right for you?
  • 25. BI on Big Data: Heuristic
      (Decision flowchart with Yes/No branches; the branch connections are not recoverable from the transcript)
      Questions:
        • Do you already have a favorite BI tool?
        • Is external cluster okay?
        • Does your schema change frequently?
        • Do you want to be able to write SQL?
        • Do you like Excel metaphor?
        • Is your working data relatively small & static?
        • Do you have very predictable analysis needs?
        • Are you focused on interactive BI?
        • Do you need to query NoSQL?
        • Do you want to combine ML with SQL?
      Outcomes:
        • ETL to RDBMS
        • Monolithic/All-in-one Solutions: Platfora, Zoomdata, Datameer
        • OLAP Cubes on Hadoop
        • Native SQL
        • Hive
        • SparkSQL
  • 26. Q&A
      Tomer Shiran
      @tshiran
      Reach out to learn what we’re up to at Dremio (or to join the private beta…)