Apache Arrow
Columnar In-Memory Analytics
Dremio [NOT TODAY’S TOPIC]
Jacques Nadeau
Founder & CTO
• Recognized SQL & NoSQL expert
• Apache Drill PMC Chair
• Quigo (AOL); Offermatica (ADBE); aQuantive (MSFT)
Tomer Shiran
Founder & CEO
• VP Product, MapR; Microsoft; IBM Research
• Apache Drill Founder
• Carnegie Mellon, Technion
Julien Le Dem
Architect
• Apache Parquet Founder
• Apache Pig PMC Member
• Twitter (Lead, Analytics Data Pipeline); Yahoo! (Architect)
Top Silicon Valley VCs
• Founded in June 2015
• Led by experts in Big Data and open source
(Apache Parquet, Drill, Pig, Calcite and more)
• Currently in stealth
Introducing Apache Arrow
• New open source project under the Apache Software Foundation
– Top-level project (directly!)
• Introduces new era of Columnar In-Memory Analytics
1. 10-100x speedup & concurrency for most workloads
2. Common data layer enables companies to choose best-of-breed systems
3. Users can utilize any programming language
4. Works with relational and complex data as-is; no ETL required
• 13 major open source Big Data projects are already on board
– A significant % of the world’s data will be processed through Arrow!
Arrow Turbo-Charges Big Data Execution Engines
[Diagram: Apache Arrow shown beneath each Big Data execution engine (Impala and others)]
Performance Advantage of Columnar In-Memory
SELECT * FROM clickstream WHERE
session_id = 1331246351
[Figure: an Intel CPU scanning a Traditional Memory Buffer versus an Arrow Memory Buffer]
• Arrow leverages the data parallelism
(SIMD) in modern Intel CPUs
• Arrow optimizes CPU prefetching
and caching
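The difference between the two buffers in the figure can be sketched in a few lines of Python. This is only an illustration of row-oriented versus column-oriented layout, not Arrow's actual implementation: the `url` and `ms` fields and all sample values are made up, and plain Python does not itself issue SIMD instructions. The point is that the filter in the columnar case walks one contiguous buffer of 64-bit integers, which is the access pattern SIMD and CPU prefetching reward.

```python
from array import array

TARGET = 1331246351  # session_id from the slide's example query

# Row-oriented layout: each record is its own object, so scanning one
# field means visiting every record, including fields the filter ignores.
# (The url and ms fields are hypothetical sample data.)
rows = [
    {"session_id": 1331246350, "url": "/home", "ms": 120},
    {"session_id": 1331246351, "url": "/cart", "ms": 340},
    {"session_id": 1331246351, "url": "/pay", "ms": 95},
    {"session_id": 1331246352, "url": "/home", "ms": 210},
]
row_hits = [i for i, r in enumerate(rows) if r["session_id"] == TARGET]

# Column-oriented layout: all session_id values sit in one contiguous
# buffer of 64-bit integers; the other columns are stored separately.
session_id = array("q", [r["session_id"] for r in rows])
url = [r["url"] for r in rows]
ms = array("q", [r["ms"] for r in rows])

# The WHERE clause only touches the session_id buffer.
col_hits = [i for i, v in enumerate(session_id) if v == TARGET]

# SELECT *: materialize the matching rows from the column buffers.
result = [(session_id[i], url[i], ms[i]) for i in col_hits]
print(result)
```

Both layouts return the same rows; only the memory access pattern of the scan differs.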
Evolution Towards Heterogeneous Data Infrastructure
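RDBMS
Hadoop MapReduce
Databases: Cassandra, Elasticsearch, HBase, Kudu, MongoDB, Parquet, Phoenix
Execution Engines: Drill, Ibis, Impala, MapReduce, Pandas, Spark, Storm
Phase 1
Common Scheduler: YARN, Mesos, Kubernetes
Phase 2
Common Data/Memory: Arrow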
Advantages of a Common Data Layer
Today
• Each system has its own internal memory format
• 70-80% CPU wasted on serialization and deserialization
• Similar functionality implemented in multiple projects
With Arrow
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (e.g., Parquet-to-Arrow reader)
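The contrast between the two columns can be sketched with the Python standard library. This is a rough analogy under stated assumptions, not Arrow code: `pickle` stands in for whatever wire format two systems might use today, and a `memoryview` over an `array` stands in for two systems agreeing on one in-memory layout. The "producer" and "consumer" roles are hypothetical.

```python
import pickle
from array import array

# A column of 64-bit integers owned by a hypothetical "producer" system.
column = array("q", range(5))

# Today: each system has its own format, so handing data over means
# serializing on one side and deserializing on the other (extra copies).
wire_bytes = pickle.dumps(column.tolist())
received_copy = array("q", pickle.loads(wire_bytes))

# With a common layout: the "consumer" reads the producer's buffer in place.
shared_view = memoryview(column)

# The view aliases the same memory, so a write by the producer is visible
# to the consumer, while the deserialized copy is unaffected.
column[0] = 99
print(shared_view[0], received_copy[0])
```

The serialized path pays for encoding, decoding, and a second copy of the data; the shared view pays for none of them, which is the overhead the slide attributes to per-system memory formats.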
Who’s Behind Apache Arrow?
• The creators and lead developers of 13
major open source Big Data projects
– Employees of Cloudera, Databricks,
Datastax, Dremio, Hortonworks, MapR,
Salesforce, Twitter
• Jacques Nadeau is the PMC Chair (aka VP
Apache Arrow)
– Co-founder & CTO of Dremio
Current Status
• C, C++, Python and Java implementations
currently underway
• Will be adopted by Drill, Ibis, Impala, Kudu,
Parquet and Spark by end of year
• Additional languages (e.g., R, JavaScript) and
projects also expected to adopt Arrow by end of year
Jacques Nadeau
Dremio Founder & CTO
VP Apache Arrow
Julien Le Dem
Dremio Architect
VP Apache Parquet
PMC Members/Committers
Jacques Nadeau (PMC Chair)
Todd Lipcon
Ted Dunning
Michael Stack
P. Taylor Goetz
Reynold Xin
Julian Hyde
Julien Le Dem
James Taylor
Jake Luciani
Parth Chandra
Alex Levenson
Marcel Kornacker
Steven Phillips
Hanifi Gunes
Jason Altekruse
Abdel Hakim Deneche
Wes McKinney
Karthik Ramasamy
David Alves
Seshadri Mahalingam
Ippokratis Pandis


Apache Arrow - An Overview

  • 1. Dremio Confidential UNDER EMBARGO UNTIL WEDNESDAY, FEBRUARY 17 AT 7:00 AM ET Apache Arrow Columnar In-Memory Analytics
  • 2. DREMIODremio Confidential UNDER EMBARGO UNTIL WEDNESDAY, FEBRUARY 17 AT 7:00 AM ET Dremio [NOT TODAY’S TOPIC] Jacques Nadeau Founder & CTO • Recognized SQL & NoSQL expert • Apache Drill PMC Chair • Quigo (AOL); Offermatica (ADBE); aQuantive (MSFT) Tomer Shiran Founder & CEO • VP Product, MapR; Microsoft; IBM Research • Apache Drill Founder • Carnegie Mellon, Technion Julien Le Dem Architect • Apache Parquet Founder • Apache Pig PMC Member • Twitter (Lead, Analytics Data Pipeline); Yahoo! (Architect) Top Silicon Valley VCs• Founded in June 2015 • Led by experts in Big Data and open source (Apache Parquet, Drill, Pig, Calcite and more) • Currently in stealth
  • 3. Introducing Apache Arrow
      • New open source project under the Apache Software Foundation
          – Top-level project (directly!)
      • Introduces new era of Columnar In-Memory Analytics
          1. 10-100x speedup & concurrency for most workloads
          2. Common data layer enables companies to choose best of breed systems
          3. Users can utilize any programming language
          4. Works with relational and complex data as-is; no ETL required
      • 13 major open source Big Data projects are already on board
          – A significant % of the world’s data will be processed through Arrow!
  • 4. Arrow Turbo-Charges Big Data Execution Engines
      • [Diagram: an "Apache Arrow" label repeated for each execution engine; of the engine labels, only "Impala" survives as text]
  • 5. Performance Advantage of Columnar In-Memory
      • Example query: SELECT * FROM clickstream WHERE session_id = 1331246351
      • [Diagram: Intel CPU; Traditional Memory Buffer vs. Arrow Memory Buffer]
      • Arrow leverages the data parallelism (SIMD) in modern Intel CPUs
      • Arrow optimizes CPU prefetching and caching
  • 6. Evolution Towards Heterogeneous Data Infrastructure
      • RDBMS; Hadoop MapReduce
      • Databases: Cassandra, Elasticsearch, HBase, Kudu, MongoDB, Parquet, Phoenix
      • Execution Engines: Drill, Ibis, Impala, MapReduce, Pandas, Spark, Storm
      • Phase 1, Common Scheduler: YARN, Mesos, Kubernetes
      • Phase 2, Common Data/Memory: Arrow
  • 7. Advantages of a Common Data Layer
      • Today
          – Each system has its own internal memory format
          – 70-80% CPU wasted on serialization and deserialization
          – Similar functionality implemented in multiple projects
      • With Arrow
          – All systems utilize the same memory format
          – No overhead for cross-system communication
          – Projects can share functionality (e.g., Parquet-to-Arrow reader)
  • 8. Who’s Behind Apache Arrow?
      • The creators and lead developers of 13 major open source Big Data projects
          – Employees of Cloudera, Databricks, Datastax, Dremio, Hortonworks, MapR, Salesforce, Twitter
      • Jacques Nadeau is the PMC Chair (aka VP Apache Arrow)
          – Co-founder & CTO of Dremio
      • Projects (logos): Calcite, Cassandra, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, Storm
  • 9. Current Status
      • C, C++, Python and Java implementations currently underway
      • Will be adopted by Drill, Ibis, Impala, Kudu, Parquet and Spark by EOY
      • Additional languages (e.g., R, JavaScript) and projects also expected to adopt Arrow by EOY
  • 10. Questions?
      • Jacques Nadeau: Dremio Founder & CTO; VP Apache Arrow
      • Julien Le Dem: Dremio Architect; VP Apache Parquet
  • 12. PMC Members/Committers
      • Jacques Nadeau (PMC Chair), Todd Lipcon, Ted Dunning, Michael Stack, P. Taylor Goetz, Reynold Xin, Julian Hyde, Julien Le Dem, James Taylor, Jake Luciani, Parth Chandra, Alex Levenson, Marcel Kornacker, Steven Phillips, Hanifi Gunes, Jason Altekruse, Abdel Hakim Deneche, Wes McKinney, Karthik Ramasamy, David Alves, Seshadri Mahalingam, Ippokratis Pandis
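
Slide 5 contrasts a traditional memory buffer with a columnar one for a filter on session_id. The difference can be sketched roughly in plain Python. This is only an illustration of the layout idea, not Arrow's API: apart from session_id and its value, the field names and sample values are hypothetical, and the standard-library array module stands in for a contiguous column buffer.

```python
from array import array

# Hypothetical clickstream rows; only session_id comes from the slide's query.
rows = [
    {"session_id": 1331246351, "timestamp": 1, "source_ip": "10.0.0.1"},
    {"session_id": 1331244570, "timestamp": 2, "source_ip": "10.0.0.2"},
    {"session_id": 1331246351, "timestamp": 3, "source_ip": "10.0.0.3"},
]

def filter_rows(rows, target):
    """Row-oriented scan: every record is visited to read one field."""
    return [i for i, row in enumerate(rows) if row["session_id"] == target]

# Columnar layout: one contiguous, fixed-width buffer per column.
session_id = array("q", (row["session_id"] for row in rows))  # 64-bit ints

def filter_column(column, target):
    """Columnar scan: only the session_id buffer is read."""
    return [i for i, value in enumerate(column) if value == target]

row_hits = filter_rows(rows, 1331246351)
col_hits = filter_column(session_id, 1331246351)
```

Both scans select the same row positions. In the columnar case the values being compared sit next to each other in memory at a fixed width, which is the property the slide credits for SIMD and cache-friendly execution; pure Python does not itself exploit it.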

Editor's Notes

  1. This is changing the world! Emphasize that.
  2. Trying to turbo-charge all the major technologies that people use today.
  3. Explain that columnar on disk has existed for several years; this is columnar in memory. Is this only CPU and cache, or also main memory? BOTH, EVERYTHING. That’s what’s amazing here. Very technical explanation: simplify it. One blue vs. 4 blues.
  4. Maybe improve the slide – from common scheduling to common data in memory
  5. Don’t say it will come in the coming months and years. Years is too far in the future. Everyone has the need today. We’re not offloading the work for them; they are going to do the work. Relationships: good point. Call this a platform?
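
Slide 7's contrast between serialization overhead and a shared memory format can be illustrated with a toy analogy inside a single Python process. This is a sketch under stated assumptions, not how Arrow is implemented: pickle stands in for a serialization boundary between systems, and a memoryview stands in for two consumers agreeing on one buffer layout. The column values are made up.

```python
import pickle
from array import array

# A hypothetical column of 64-bit integers owned by a "producer" system.
column = array("q", [10, 20, 30])

# Handoff via serialization: the consumer receives an independent copy,
# and both sides pay to encode and decode it.
copied = pickle.loads(pickle.dumps(column))

# Handoff via a shared buffer: the consumer reads the producer's memory
# directly through the buffer protocol, with no copy.
shared = memoryview(column)

# A later write by the producer is visible through the shared view only.
column[0] = 99
shared_first = shared[0]
copied_first = copied[0]
```

After the write, the shared view reflects the new value while the deserialized copy still holds the old one, which shows that the first path duplicated the data and the second did not.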