SlideShare a Scribd company logo
DREMIODremio Confidential UNDER EMBARGO UNTIL WEDNESDAY, FEBRUARY 17 AT 7:00 AM ET
Apache Arrow
Columnar In-Memory Analytics
UNDER EMBARGO UNTIL WEDNESDAY, FEBRUARY 17 AT 7:00 AM ET
DREMIODremio Confidential UNDER EMBARGO UNTIL WEDNESDAY, FEBRUARY 17 AT 7:00 AM ET
Dremio [NOT TODAY’S TOPIC]
Jacques
Nadeau
Founder & CTO
• Recognized SQL & NoSQL expert
• Apache Drill PMC Chair
• Quigo (AOL); Offermatica (ADBE);
aQuantive (MSFT)
Tomer
Shiran
Founder & CEO
• VP Product, MapR; Microsoft; IBM
Research
• Apache Drill Founder
• Carnegie Mellon, Technion
Julien Le Dem
Architect
• Apache Parquet Founder
• Apache Pig PMC Member
• Twitter (Lead, Analytics Data
Pipeline); Yahoo! (Architect)
Top Silicon Valley VCs• Founded in June 2015
• Led by experts in Big Data and open source
(Apache Parquet, Drill, Pig, Calcite and more)
• Currently in stealth
DREMIODremio Confidential UNDER EMBARGO UNTIL WEDNESDAY, FEBRUARY 17 AT 7:00 AM ET
Introducing Apache Arrow
• New open source project under the Apache Software Foundation
– Top-level project (directly!)
• Introduces new era of Columnar In-Memory Analytics
1. 10-100x speedup & concurrency for most workloads
2. Common data layer enables companies to choose best of breed
systems
3. Users can utilize any programming language
4. Works with relational and complex data as-is; no ETL required
• 13 major open source Big Data projects are already on board
– A significant % of the world’s data will be processed through Arrow!
UNDER EMBARGO UNTIL WEDNESDAY, FEBRUARY 17 AT 7:00 AM ET
DREMIODremio Confidential UNDER EMBARGO UNTIL WEDNESDAY, FEBRUARY 17 AT 7:00 AM ET
Arrow Turbo-Charges Big Data Execution Engines
Apache Arrow Apache Arrow Apache Arrow Apache Arrow
Impala
Apache ArrowApache Arrow Apache Arrow Apache Arrow
…
DREMIODremio Confidential UNDER EMBARGO UNTIL WEDNESDAY, FEBRUARY 17 AT 7:00 AM ET
Performance Advantage of Columnar In-Memory
Intel CPU
SELECT * FROM clickstream WHERE
session_id = 1331246351
Traditional
Memory Buffer
Arrow
Memory Buffer
• Arrow leverages the data parallelism
(SIMD) in modern Intel CPUs
• Arrow optimizes CPU prefetching
and caching
DREMIODremio Confidential UNDER EMBARGO UNTIL WEDNESDAY, FEBRUARY 17 AT 7:00 AM ET
Evolution Towards Heterogeneous Data Infrastructure
RDBMS
Hadoop MapReduce
Databases
Cassandra
Elasticsearch
HBase
Kudu
MongoDB
Parquet
Phoenix
Execution Engines
Drill
Ibis
Impala
MapReduce
Pandas
Spark
Storm
Phase 1
Common Scheduler
YARN Mesos
Kubernetes
Phase 2
Common Data/Memory
Arrow
DREMIODremio Confidential UNDER EMBARGO UNTIL WEDNESDAY, FEBRUARY 17 AT 7:00 AM ET
Advantages of a Common Data Layer
Today With Arrow
• Each system has its own internal
memory format
• 70-80% CPU wasted on serialization
and deserialization
• Similar functionality implemented in
multiple projects
• All systems utilize the same memory
format
• No overhead for cross-system
communication
• Projects can share functionality (eg,
Parquet-to-Arrow reader)
DREMIODremio Confidential UNDER EMBARGO UNTIL WEDNESDAY, FEBRUARY 17 AT 7:00 AM ET
Who’s Behind Apache Arrow?
• The creators and lead developers of 13
major open source Big Data projects
– Employees of Cloudera, Databricks,
Datastax, Dremio, Hortonworks, MapR,
Salesforce, Twitter
• Jacques Nadeau is the PMC Chair (aka VP
Apache Arrow)
– Co-founder & CTO of Dremio
Calcite
Cassandra
Drill
Hadoop
HBase
Ibis
Impala
Kudu
Pandas
Parquet
Phoenix
Spark
Storm
DREMIODremio Confidential UNDER EMBARGO UNTIL WEDNESDAY, FEBRUARY 17 AT 7:00 AM ET
Current Status
• C, C++, Python and Java implementations
currently underway
• Will be adopted by Drill, Ibis, Impala, Kudu,
Parquet and Spark by EOY
• Additional languages (eg, R, JavaScript) and
projects also expected to adopt Arrow by EOY
DREMIODremio Confidential UNDER EMBARGO UNTIL WEDNESDAY, FEBRUARY 17 AT 7:00 AM ET
Questions?
Jacques Nadeau
Dremio Founder & CTO
VP Apache Arrow
Julien Le Dem
Dremio Architect
VP Apache Parquet
DREMIODremio Confidential UNDER EMBARGO UNTIL WEDNESDAY, FEBRUARY 17 AT 7:00 AM ET
APPENDIX
DREMIODremio Confidential UNDER EMBARGO UNTIL WEDNESDAY, FEBRUARY 17 AT 7:00 AM ET
PMC Members/Committers
Jacques Nadeau (PMC Chair)
Todd Lipcon
Ted Dunning
Michael Stack
P. Taylor Goetz
Reynold Xin
Julian Hyde
Julien Le Dem
James Taylor
Jake Luciani
Parth Chandra
Alex Levenson
Marcel Kornacker
Steven Phillips
Hanifi Gunes
Jason Altekruse
Abdel Hakim Deneche
Wes McKinney
Karthik Ramasamy
David Alves
Seshadri Mahalingam
Ippokratis Pandis

More Related Content

What's hot

Cosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceCosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle Service
Databricks
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache Arrow
DataWorks Summit
 
3D: DBT using Databricks and Delta
3D: DBT using Databricks and Delta3D: DBT using Databricks and Delta
3D: DBT using Databricks and Delta
Databricks
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Databricks
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
DataWorks Summit
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
Alluxio, Inc.
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
Databricks
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Noritaka Sekiyama
 
Change Data Feed in Delta
Change Data Feed in DeltaChange Data Feed in Delta
Change Data Feed in Delta
Databricks
 
Parquet overview
Parquet overviewParquet overview
Parquet overview
Julien Le Dem
 
Large Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingLarge Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured Streaming
Databricks
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
Alluxio, Inc.
 
Parallelization of Structured Streaming Jobs Using Delta Lake
Parallelization of Structured Streaming Jobs Using Delta LakeParallelization of Structured Streaming Jobs Using Delta Lake
Parallelization of Structured Streaming Jobs Using Delta Lake
Databricks
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
Anton Kirillov
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
Cloudera, Inc.
 
My first 90 days with ClickHouse.pdf
My first 90 days with ClickHouse.pdfMy first 90 days with ClickHouse.pdf
My first 90 days with ClickHouse.pdf
Alkin Tezuysal
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
Ryan Blue
 
Hive: Loading Data
Hive: Loading DataHive: Loading Data
Hive: Loading Data
Benjamin Leonhardi
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
Venkata Naga Ravi
 
Designing and Implementing a Real-time Data Lake with Dynamically Changing Sc...
Designing and Implementing a Real-time Data Lake with Dynamically Changing Sc...Designing and Implementing a Real-time Data Lake with Dynamically Changing Sc...
Designing and Implementing a Real-time Data Lake with Dynamically Changing Sc...
Databricks
 

What's hot (20)

Cosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceCosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle Service
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache Arrow
 
3D: DBT using Databricks and Delta
3D: DBT using Databricks and Delta3D: DBT using Databricks and Delta
3D: DBT using Databricks and Delta
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
Change Data Feed in Delta
Change Data Feed in DeltaChange Data Feed in Delta
Change Data Feed in Delta
 
Parquet overview
Parquet overviewParquet overview
Parquet overview
 
Large Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured StreamingLarge Scale Lakehouse Implementation Using Structured Streaming
Large Scale Lakehouse Implementation Using Structured Streaming
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
 
Parallelization of Structured Streaming Jobs Using Delta Lake
Parallelization of Structured Streaming Jobs Using Delta LakeParallelization of Structured Streaming Jobs Using Delta Lake
Parallelization of Structured Streaming Jobs Using Delta Lake
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
 
My first 90 days with ClickHouse.pdf
My first 90 days with ClickHouse.pdfMy first 90 days with ClickHouse.pdf
My first 90 days with ClickHouse.pdf
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Hive: Loading Data
Hive: Loading DataHive: Loading Data
Hive: Loading Data
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Designing and Implementing a Real-time Data Lake with Dynamically Changing Sc...
Designing and Implementing a Real-time Data Lake with Dynamically Changing Sc...Designing and Implementing a Real-time Data Lake with Dynamically Changing Sc...
Designing and Implementing a Real-time Data Lake with Dynamically Changing Sc...
 

Viewers also liked

The twins that everyone loved too much
The twins that everyone loved too muchThe twins that everyone loved too much
The twins that everyone loved too much
Julian Hyde
 
Options for Data Prep - A Survey of the Current Market
Options for Data Prep - A Survey of the Current MarketOptions for Data Prep - A Survey of the Current Market
Options for Data Prep - A Survey of the Current Market
Dremio Corporation
 
Bi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonBi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in London
Dremio Corporation
 
Data Science Languages and Industry Analytics
Data Science Languages and Industry AnalyticsData Science Languages and Industry Analytics
Data Science Languages and Industry Analytics
Wes McKinney
 
Apache Calcite: One planner fits all
Apache Calcite: One planner fits allApache Calcite: One planner fits all
Apache Calcite: One planner fits all
Julian Hyde
 
SQL on everything, in memory
SQL on everything, in memorySQL on everything, in memory
SQL on everything, in memory
Julian Hyde
 
Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!
Julian Hyde
 
Apache Calcite overview
Apache Calcite overviewApache Calcite overview
Apache Calcite overview
Julian Hyde
 

Viewers also liked (8)

The twins that everyone loved too much
The twins that everyone loved too muchThe twins that everyone loved too much
The twins that everyone loved too much
 
Options for Data Prep - A Survey of the Current Market
Options for Data Prep - A Survey of the Current MarketOptions for Data Prep - A Survey of the Current Market
Options for Data Prep - A Survey of the Current Market
 
Bi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in LondonBi on Big Data - Strata 2016 in London
Bi on Big Data - Strata 2016 in London
 
Data Science Languages and Industry Analytics
Data Science Languages and Industry AnalyticsData Science Languages and Industry Analytics
Data Science Languages and Industry Analytics
 
Apache Calcite: One planner fits all
Apache Calcite: One planner fits allApache Calcite: One planner fits all
Apache Calcite: One planner fits all
 
SQL on everything, in memory
SQL on everything, in memorySQL on everything, in memory
SQL on everything, in memory
 
Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!
 
Apache Calcite overview
Apache Calcite overviewApache Calcite overview
Apache Calcite overview
 

Similar to Apache Arrow - An Overview

HUG_Ireland_Apache_Arrow_Tomer_Shiran
HUG_Ireland_Apache_Arrow_Tomer_Shiran HUG_Ireland_Apache_Arrow_Tomer_Shiran
HUG_Ireland_Apache_Arrow_Tomer_Shiran
John Mulhall
 
Strata NY 2016: The future of column-oriented data processing with Arrow and ...
Strata NY 2016: The future of column-oriented data processing with Arrow and ...Strata NY 2016: The future of column-oriented data processing with Arrow and ...
Strata NY 2016: The future of column-oriented data processing with Arrow and ...
Julien Le Dem
 
Efficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and ArrowEfficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and Arrow
DataWorks Summit/Hadoop Summit
 
Data Eng Conf NY Nov 2016 Parquet Arrow
Data Eng Conf NY Nov 2016 Parquet ArrowData Eng Conf NY Nov 2016 Parquet Arrow
Data Eng Conf NY Nov 2016 Parquet Arrow
Julien Le Dem
 
Netflix oss season 2 episode 1 - meetup Lightning talks
Netflix oss   season 2 episode 1 - meetup Lightning talksNetflix oss   season 2 episode 1 - meetup Lightning talks
Netflix oss season 2 episode 1 - meetup Lightning talks
Ruslan Meshenberg
 
Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...
Julien Le Dem
 
Delegated Configuration with Multiple Hiera Databases - PuppetConf 2014
Delegated Configuration with Multiple Hiera Databases - PuppetConf 2014Delegated Configuration with Multiple Hiera Databases - PuppetConf 2014
Delegated Configuration with Multiple Hiera Databases - PuppetConf 2014
Puppet
 
An Incomplete Data Tools Landscape for Hackers in 2015
An Incomplete Data Tools Landscape for Hackers in 2015An Incomplete Data Tools Landscape for Hackers in 2015
An Incomplete Data Tools Landscape for Hackers in 2015
Wes McKinney
 
Mule soft mar 2017 Parquet Arrow
Mule soft mar 2017 Parquet ArrowMule soft mar 2017 Parquet Arrow
Mule soft mar 2017 Parquet Arrow
Julien Le Dem
 
Using AWS, Terraform, and Ansible to Automate Splunk at Scale
Using AWS, Terraform, and Ansible to Automate Splunk at ScaleUsing AWS, Terraform, and Ansible to Automate Splunk at Scale
Using AWS, Terraform, and Ansible to Automate Splunk at Scale
Data Works MD
 
How Apache Arrow and Parquet boost cross-language interoperability
How Apache Arrow and Parquet boost cross-language interoperabilityHow Apache Arrow and Parquet boost cross-language interoperability
How Apache Arrow and Parquet boost cross-language interoperability
Uwe Korn
 
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
DataWorks Summit/Hadoop Summit
 
In-Ceph-tion: Deploying a Ceph cluster on DreamCompute
In-Ceph-tion: Deploying a Ceph cluster on DreamComputeIn-Ceph-tion: Deploying a Ceph cluster on DreamCompute
In-Ceph-tion: Deploying a Ceph cluster on DreamCompute
Patrick McGarry
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory data
Wes McKinney
 
2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines
2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines
2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines
Timothy Spann
 
Stackato v5
Stackato v5Stackato v5
Stackato v5
Jonas Brømsø
 
Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)
Wes McKinney
 
Big Data Approaches to Cloud Security
Big Data Approaches to Cloud SecurityBig Data Approaches to Cloud Security
Big Data Approaches to Cloud Security
Paul Morse
 
Stackato v2
Stackato v2Stackato v2
Stackato v2
Jonas Brømsø
 
New York REDIS Meetup Welcome Session
New York REDIS Meetup Welcome SessionNew York REDIS Meetup Welcome Session
New York REDIS Meetup Welcome Session
Aleksandr Yampolskiy
 

Similar to Apache Arrow - An Overview (20)

HUG_Ireland_Apache_Arrow_Tomer_Shiran
HUG_Ireland_Apache_Arrow_Tomer_Shiran HUG_Ireland_Apache_Arrow_Tomer_Shiran
HUG_Ireland_Apache_Arrow_Tomer_Shiran
 
Strata NY 2016: The future of column-oriented data processing with Arrow and ...
Strata NY 2016: The future of column-oriented data processing with Arrow and ...Strata NY 2016: The future of column-oriented data processing with Arrow and ...
Strata NY 2016: The future of column-oriented data processing with Arrow and ...
 
Efficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and ArrowEfficient Data Formats for Analytics with Parquet and Arrow
Efficient Data Formats for Analytics with Parquet and Arrow
 
Data Eng Conf NY Nov 2016 Parquet Arrow
Data Eng Conf NY Nov 2016 Parquet ArrowData Eng Conf NY Nov 2016 Parquet Arrow
Data Eng Conf NY Nov 2016 Parquet Arrow
 
Netflix oss season 2 episode 1 - meetup Lightning talks
Netflix oss   season 2 episode 1 - meetup Lightning talksNetflix oss   season 2 episode 1 - meetup Lightning talks
Netflix oss season 2 episode 1 - meetup Lightning talks
 
Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...Strata London 2016: The future of column oriented data processing with Arrow ...
Strata London 2016: The future of column oriented data processing with Arrow ...
 
Delegated Configuration with Multiple Hiera Databases - PuppetConf 2014
Delegated Configuration with Multiple Hiera Databases - PuppetConf 2014Delegated Configuration with Multiple Hiera Databases - PuppetConf 2014
Delegated Configuration with Multiple Hiera Databases - PuppetConf 2014
 
An Incomplete Data Tools Landscape for Hackers in 2015
An Incomplete Data Tools Landscape for Hackers in 2015An Incomplete Data Tools Landscape for Hackers in 2015
An Incomplete Data Tools Landscape for Hackers in 2015
 
Mule soft mar 2017 Parquet Arrow
Mule soft mar 2017 Parquet ArrowMule soft mar 2017 Parquet Arrow
Mule soft mar 2017 Parquet Arrow
 
Using AWS, Terraform, and Ansible to Automate Splunk at Scale
Using AWS, Terraform, and Ansible to Automate Splunk at ScaleUsing AWS, Terraform, and Ansible to Automate Splunk at Scale
Using AWS, Terraform, and Ansible to Automate Splunk at Scale
 
How Apache Arrow and Parquet boost cross-language interoperability
How Apache Arrow and Parquet boost cross-language interoperabilityHow Apache Arrow and Parquet boost cross-language interoperability
How Apache Arrow and Parquet boost cross-language interoperability
 
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
 
In-Ceph-tion: Deploying a Ceph cluster on DreamCompute
In-Ceph-tion: Deploying a Ceph cluster on DreamComputeIn-Ceph-tion: Deploying a Ceph cluster on DreamCompute
In-Ceph-tion: Deploying a Ceph cluster on DreamCompute
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory data
 
2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines
2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines
2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines
 
Stackato v5
Stackato v5Stackato v5
Stackato v5
 
Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)
 
Big Data Approaches to Cloud Security
Big Data Approaches to Cloud SecurityBig Data Approaches to Cloud Security
Big Data Approaches to Cloud Security
 
Stackato v2
Stackato v2Stackato v2
Stackato v2
 
New York REDIS Meetup Welcome Session
New York REDIS Meetup Welcome SessionNew York REDIS Meetup Welcome Session
New York REDIS Meetup Welcome Session
 

Recently uploaded

240717 ProPILE - Probing Privacy Leakage in Large Language Models.pdf
240717 ProPILE - Probing Privacy Leakage in Large Language Models.pdf240717 ProPILE - Probing Privacy Leakage in Large Language Models.pdf
240717 ProPILE - Probing Privacy Leakage in Large Language Models.pdf
CS Kwak
 
UW Cert degree offer diploma
UW Cert degree offer diploma UW Cert degree offer diploma
UW Cert degree offer diploma
dakyuhe
 
What is Micro Frontends and Why Use it.pdf
What is Micro Frontends and Why Use it.pdfWhat is Micro Frontends and Why Use it.pdf
What is Micro Frontends and Why Use it.pdf
lead93317
 
Old Tools, New Tricks: Unleashing the Power of Time-Tested Testing Tools
Old Tools, New Tricks: Unleashing the Power of Time-Tested Testing ToolsOld Tools, New Tricks: Unleashing the Power of Time-Tested Testing Tools
Old Tools, New Tricks: Unleashing the Power of Time-Tested Testing Tools
Benjamin Bischoff
 
PathSpotter: Exploring Tested Paths to Discover Missing Tests (FSE 2024)
PathSpotter: Exploring Tested Paths to Discover Missing Tests (FSE 2024)PathSpotter: Exploring Tested Paths to Discover Missing Tests (FSE 2024)
PathSpotter: Exploring Tested Paths to Discover Missing Tests (FSE 2024)
Andre Hora
 
08. Ruby Enumerable - Ruby Core Teaching
08. Ruby Enumerable - Ruby Core Teaching08. Ruby Enumerable - Ruby Core Teaching
08. Ruby Enumerable - Ruby Core Teaching
quanhoangd129
 
02. Ruby Basic slides - Ruby Core Teaching
02. Ruby Basic slides - Ruby Core Teaching02. Ruby Basic slides - Ruby Core Teaching
02. Ruby Basic slides - Ruby Core Teaching
quanhoangd129
 
01. Ruby Introduction - Ruby Core Teaching
01. Ruby Introduction - Ruby Core Teaching01. Ruby Introduction - Ruby Core Teaching
01. Ruby Introduction - Ruby Core Teaching
quanhoangd129
 
New York University degree Cert offer diploma Transcripta
New York University degree Cert offer diploma Transcripta New York University degree Cert offer diploma Transcripta
New York University degree Cert offer diploma Transcripta
pyxgy
 
09. Ruby Object Oriented Programming - Ruby Core Teaching
09. Ruby Object Oriented Programming - Ruby Core Teaching09. Ruby Object Oriented Programming - Ruby Core Teaching
09. Ruby Object Oriented Programming - Ruby Core Teaching
quanhoangd129
 
Literals - A Machine Independent Feature
Literals - A Machine Independent FeatureLiterals - A Machine Independent Feature
Literals - A Machine Independent Feature
21h16charis
 
Learning Rust with Advent of Code 2023 - Princeton
Learning Rust with Advent of Code 2023 - PrincetonLearning Rust with Advent of Code 2023 - Princeton
Learning Rust with Advent of Code 2023 - Princeton
Henry Schreiner
 
4. The Build System _ Embedded Android.pdf
4. The Build System _ Embedded Android.pdf4. The Build System _ Embedded Android.pdf
4. The Build System _ Embedded Android.pdf
VishalKumarJha10
 
Bring Strategic Portfolio Management to Monday.com using OnePlan - Webinar 18...
Bring Strategic Portfolio Management to Monday.com using OnePlan - Webinar 18...Bring Strategic Portfolio Management to Monday.com using OnePlan - Webinar 18...
Bring Strategic Portfolio Management to Monday.com using OnePlan - Webinar 18...
OnePlan Solutions
 
Tube Magic Software | Youtube Software | Best AI Tool For Growing Youtube Cha...
Tube Magic Software | Youtube Software | Best AI Tool For Growing Youtube Cha...Tube Magic Software | Youtube Software | Best AI Tool For Growing Youtube Cha...
Tube Magic Software | Youtube Software | Best AI Tool For Growing Youtube Cha...
David D. Scott
 
The Politics of Agile Development.pptx
The  Politics of  Agile Development.pptxThe  Politics of  Agile Development.pptx
The Politics of Agile Development.pptx
NMahendiran
 
Fixing Git Catastrophes - Nebraska.Code()
Fixing Git Catastrophes - Nebraska.Code()Fixing Git Catastrophes - Nebraska.Code()
Fixing Git Catastrophes - Nebraska.Code()
Gene Gotimer
 
Monitoring the Execution of 14K Tests: Methods Tend to Have One Path that Is ...
Monitoring the Execution of 14K Tests: Methods Tend to Have One Path that Is ...Monitoring the Execution of 14K Tests: Methods Tend to Have One Path that Is ...
Monitoring the Execution of 14K Tests: Methods Tend to Have One Path that Is ...
Andre Hora
 
Test Polarity: Detecting Positive and Negative Tests (FSE 2024)
Test Polarity: Detecting Positive and Negative Tests (FSE 2024)Test Polarity: Detecting Positive and Negative Tests (FSE 2024)
Test Polarity: Detecting Positive and Negative Tests (FSE 2024)
Andre Hora
 
The two flavors of Python 3.13 - PyHEP 2024
The two flavors of Python 3.13 - PyHEP 2024The two flavors of Python 3.13 - PyHEP 2024
The two flavors of Python 3.13 - PyHEP 2024
Henry Schreiner
 

Recently uploaded (20)

240717 ProPILE - Probing Privacy Leakage in Large Language Models.pdf
240717 ProPILE - Probing Privacy Leakage in Large Language Models.pdf240717 ProPILE - Probing Privacy Leakage in Large Language Models.pdf
240717 ProPILE - Probing Privacy Leakage in Large Language Models.pdf
 
UW Cert degree offer diploma
UW Cert degree offer diploma UW Cert degree offer diploma
UW Cert degree offer diploma
 
What is Micro Frontends and Why Use it.pdf
What is Micro Frontends and Why Use it.pdfWhat is Micro Frontends and Why Use it.pdf
What is Micro Frontends and Why Use it.pdf
 
Old Tools, New Tricks: Unleashing the Power of Time-Tested Testing Tools
Old Tools, New Tricks: Unleashing the Power of Time-Tested Testing ToolsOld Tools, New Tricks: Unleashing the Power of Time-Tested Testing Tools
Old Tools, New Tricks: Unleashing the Power of Time-Tested Testing Tools
 
PathSpotter: Exploring Tested Paths to Discover Missing Tests (FSE 2024)
PathSpotter: Exploring Tested Paths to Discover Missing Tests (FSE 2024)PathSpotter: Exploring Tested Paths to Discover Missing Tests (FSE 2024)
PathSpotter: Exploring Tested Paths to Discover Missing Tests (FSE 2024)
 
08. Ruby Enumerable - Ruby Core Teaching
08. Ruby Enumerable - Ruby Core Teaching08. Ruby Enumerable - Ruby Core Teaching
08. Ruby Enumerable - Ruby Core Teaching
 
02. Ruby Basic slides - Ruby Core Teaching
02. Ruby Basic slides - Ruby Core Teaching02. Ruby Basic slides - Ruby Core Teaching
02. Ruby Basic slides - Ruby Core Teaching
 
01. Ruby Introduction - Ruby Core Teaching
01. Ruby Introduction - Ruby Core Teaching01. Ruby Introduction - Ruby Core Teaching
01. Ruby Introduction - Ruby Core Teaching
 
New York University degree Cert offer diploma Transcripta
New York University degree Cert offer diploma Transcripta New York University degree Cert offer diploma Transcripta
New York University degree Cert offer diploma Transcripta
 
09. Ruby Object Oriented Programming - Ruby Core Teaching
09. Ruby Object Oriented Programming - Ruby Core Teaching09. Ruby Object Oriented Programming - Ruby Core Teaching
09. Ruby Object Oriented Programming - Ruby Core Teaching
 
Literals - A Machine Independent Feature
Literals - A Machine Independent FeatureLiterals - A Machine Independent Feature
Literals - A Machine Independent Feature
 
Learning Rust with Advent of Code 2023 - Princeton
Learning Rust with Advent of Code 2023 - PrincetonLearning Rust with Advent of Code 2023 - Princeton
Learning Rust with Advent of Code 2023 - Princeton
 
4. The Build System _ Embedded Android.pdf
4. The Build System _ Embedded Android.pdf4. The Build System _ Embedded Android.pdf
4. The Build System _ Embedded Android.pdf
 
Bring Strategic Portfolio Management to Monday.com using OnePlan - Webinar 18...
Bring Strategic Portfolio Management to Monday.com using OnePlan - Webinar 18...Bring Strategic Portfolio Management to Monday.com using OnePlan - Webinar 18...
Bring Strategic Portfolio Management to Monday.com using OnePlan - Webinar 18...
 
Tube Magic Software | Youtube Software | Best AI Tool For Growing Youtube Cha...
Tube Magic Software | Youtube Software | Best AI Tool For Growing Youtube Cha...Tube Magic Software | Youtube Software | Best AI Tool For Growing Youtube Cha...
Tube Magic Software | Youtube Software | Best AI Tool For Growing Youtube Cha...
 
The Politics of Agile Development.pptx
The  Politics of  Agile Development.pptxThe  Politics of  Agile Development.pptx
The Politics of Agile Development.pptx
 
Fixing Git Catastrophes - Nebraska.Code()
Fixing Git Catastrophes - Nebraska.Code()Fixing Git Catastrophes - Nebraska.Code()
Fixing Git Catastrophes - Nebraska.Code()
 
Monitoring the Execution of 14K Tests: Methods Tend to Have One Path that Is ...
Monitoring the Execution of 14K Tests: Methods Tend to Have One Path that Is ...Monitoring the Execution of 14K Tests: Methods Tend to Have One Path that Is ...
Monitoring the Execution of 14K Tests: Methods Tend to Have One Path that Is ...
 
Test Polarity: Detecting Positive and Negative Tests (FSE 2024)
Test Polarity: Detecting Positive and Negative Tests (FSE 2024)Test Polarity: Detecting Positive and Negative Tests (FSE 2024)
Test Polarity: Detecting Positive and Negative Tests (FSE 2024)
 
The two flavors of Python 3.13 - PyHEP 2024
The two flavors of Python 3.13 - PyHEP 2024The two flavors of Python 3.13 - PyHEP 2024
The two flavors of Python 3.13 - PyHEP 2024
 

Apache Arrow - An Overview

  • 1. DREMIODremio Confidential UNDER EMBARGO UNTIL WEDNESDAY, FEBRUARY 17 AT 7:00 AM ET Apache Arrow Columnar In-Memory Analytics UNDER EMBARGO UNTIL WEDNESDAY, FEBRUARY 17 AT 7:00 AM ET
  • 2. DREMIODremio Confidential UNDER EMBARGO UNTIL WEDNESDAY, FEBRUARY 17 AT 7:00 AM ET Dremio [NOT TODAY’S TOPIC] Jacques Nadeau Founder & CTO • Recognized SQL & NoSQL expert • Apache Drill PMC Chair • Quigo (AOL); Offermatica (ADBE); aQuantive (MSFT) Tomer Shiran Founder & CEO • VP Product, MapR; Microsoft; IBM Research • Apache Drill Founder • Carnegie Mellon, Technion Julien Le Dem Architect • Apache Parquet Founder • Apache Pig PMC Member • Twitter (Lead, Analytics Data Pipeline); Yahoo! (Architect) Top Silicon Valley VCs• Founded in June 2015 • Led by experts in Big Data and open source (Apache Parquet, Drill, Pig, Calcite and more) • Currently in stealth
  • 3. DREMIODremio Confidential UNDER EMBARGO UNTIL WEDNESDAY, FEBRUARY 17 AT 7:00 AM ET Introducing Apache Arrow • New open source project under the Apache Software Foundation – Top-level project (directly!) • Introduces new era of Columnar In-Memory Analytics 1. 10-100x speedup & concurrency for most workloads 2. Common data layer enables companies to choose best of breed systems 3. Users can utilize any programming language 4. Works with relational and complex data as-is; no ETL required • 13 major open source Big Data projects are already on board – A significant % of the world’s data will be processed through Arrow! UNDER EMBARGO UNTIL WEDNESDAY, FEBRUARY 17 AT 7:00 AM ET
  • 4. DREMIODremio Confidential UNDER EMBARGO UNTIL WEDNESDAY, FEBRUARY 17 AT 7:00 AM ET Arrow Turbo-Charges Big Data Execution Engines Apache Arrow Apache Arrow Apache Arrow Apache Arrow Impala Apache ArrowApache Arrow Apache Arrow Apache Arrow …
  • 5. DREMIODremio Confidential UNDER EMBARGO UNTIL WEDNESDAY, FEBRUARY 17 AT 7:00 AM ET Performance Advantage of Columnar In-Memory Intel CPU SELECT * FROM clickstream WHERE session_id = 1331246351 Traditional Memory Buffer Arrow Memory Buffer • Arrow leverages the data parallelism (SIMD) in modern Intel CPUs • Arrow optimizes CPU prefetching and caching
  • 6. DREMIODremio Confidential UNDER EMBARGO UNTIL WEDNESDAY, FEBRUARY 17 AT 7:00 AM ET Evolution Towards Heterogeneous Data Infrastructure RDBMS Hadoop MapReduce Databases Cassandra Elasticsearch HBase Kudu MongoDB Parquet Phoenix Execution Engines Drill Ibis Impala MapReduce Pandas Spark Storm Phase 1 Common Scheduler YARN Mesos Kubernetes Phase 2 Common Data/Memory Arrow
  • 7. DREMIODremio Confidential UNDER EMBARGO UNTIL WEDNESDAY, FEBRUARY 17 AT 7:00 AM ET Advantages of a Common Data Layer Today With Arrow • Each system has its own internal memory format • 70-80% CPU wasted on serialization and deserialization • Similar functionality implemented in multiple projects • All systems utilize the same memory format • No overhead for cross-system communication • Projects can share functionality (eg, Parquet-to-Arrow reader)
  • 8. DREMIODremio Confidential UNDER EMBARGO UNTIL WEDNESDAY, FEBRUARY 17 AT 7:00 AM ET Who’s Behind Apache Arrow? • The creators and lead developers of 13 major open source Big Data projects – Employees of Cloudera, Databricks, Datastax, Dremio, Hortonworks, MapR, Salesforce, Twitter • Jacques Nadeau is the PMC Chair (aka VP Apache Arrow) – Co-founder & CTO of Dremio Calcite Cassandra Drill Hadoop HBase Ibis Impala Kudu Pandas Parquet Phoenix Spark Storm
  • 9. DREMIODremio Confidential UNDER EMBARGO UNTIL WEDNESDAY, FEBRUARY 17 AT 7:00 AM ET Current Status • C, C++, Python and Java implementations currently underway • Will be adopted by Drill, Ibis, Impala, Kudu, Parquet and Spark by EOY • Additional languages (eg, R, JavaScript) and projects also expected to adopt Arrow by EOY
  • 10. DREMIODremio Confidential UNDER EMBARGO UNTIL WEDNESDAY, FEBRUARY 17 AT 7:00 AM ET Questions? Jacques Nadeau Dremio Founder & CTO VP Apache Arrow Julien Le Dem Dremio Architect VP Apache Parquet
  • 11. DREMIODremio Confidential UNDER EMBARGO UNTIL WEDNESDAY, FEBRUARY 17 AT 7:00 AM ET APPENDIX
  • 12. DREMIODremio Confidential UNDER EMBARGO UNTIL WEDNESDAY, FEBRUARY 17 AT 7:00 AM ET PMC Members/Committers Jacques Nadeau (PMC Chair) Todd Lipcon Ted Dunning Michael Stack P. Taylor Goetz Reynold Xin Julian Hyde Julien Le Dem James Taylor Jake Luciani Parth Chandra Alex Levenson Marcel Kornacker Steven Phillips Hanifi Gunes Jason Altekruse Abdel Hakim Deneche Wes McKinney Karthik Ramasamy David Alves Seshadri Mahalingam Ippokratis Pandis

Editor's Notes

  1. This is changing the world! Emphasize that.
  2. Trying to turbo-charge all the major technologies that people use today.
  3. Explain that columnar on disk existed for several years, this is columnar in memory Is this only CPU and cache, or also main memory? BOTH, EVERYTHING. That’s what’s amazing here. Very technical explanation – simplify it. One blue vs 4 blues
  4. Maybe improve the slide – from common scheduling to common data in memory
  5. Don’t say it will come in in the coming months and years. Years is too far in the future. Everyone has the need today. We’re not offloading the work for them, they are going to do the work. Relationships – good point Call this a platform?