SlideShare a Scribd company logo
Apache Arrow: Open Source Standard
Becomes Enterprise Necessity
March 3, 2022
Subsurface Live Winter 2022
Wes McKinney
@wesmckinn
2
About Me
● Apache Arrow co-creator, PMC member
● pandas creator
● Voltron Data co-founder, CTO
● Previously at Ursa Labs, Two Sigma, Cloudera, AQR
● Author of Python for Data Analysis
3
Apache Arrow
Multi-language toolbox for accelerated
data interchange and in-memory processing
● Founded in 2016 by a group of open source developers
● Enables unification of database and data science tech stacks
● Thriving user and developer community
● Provides implementations or bindings in twelve languages
● Adopted by numerous projects and products in the ecosystem
4
Voltron Data
● Creating a unified foundation for the future of
analytical computing with Apache Arrow
● Founded in 2021
● Raised $110M in seed and Series A funding
● Leading corporate contributor to Arrow
We’re hiring!
voltrondata.com/careers
5
What’s New in Apache Arrow
6
Project and Ecosystem Growth
● Sustained growth in users, contributors, applications
● Arrow version 7.0.0 (24th major release) in February 2022
≈800 Arrow contributors
>50M monthly PyArrow downloads
7
Arrow Flight SQL
Next-generation standard for data access using SQL
● Adds SQL semantics to Arrow Flight
● Enables ODBC/JDBC-style data access at the speed of Flight
● Reduces implementation complexity for developers
● Shipped in Arrow version 7.0.0 (February 2020)
○ C++ and Java implementations available
○ Additional development and documentation is ongoing 🚧
8
Arrow C++ Query Engine
Work is ongoing to implement comprehensive
query execution capabilities in the Arrow C++ library
● Common scalar functions ✔
● Common grouped aggregate functions ✔
● Common joins ✔
● Performance and efficiency improvements 🚧
● Window functions 🚧
● Additional join types 🚧
9
Substrait
A cross-language, interoperable specification
for data compute operations
○ A standard, flexible way for APIs and engines to share the
representation of analytics computations
■ Produced by APIs (dplyr, Ibis, SQL, …)
■ Consumed by engines (Arrow C++ engine, DuckDB, …)
○ Work is underway to implement producers and consumers 🚧
10
Ibis
A high-level Python API for data analysis
● Fluent pandas-like syntax
● Can express virtually any SQL query
● Supports modular backends for querying systems including
PostgreSQL, MySQL, SQLite, Impala, ClickHouse, BigQuery
● Work is underway to produce Substrait plans 🚧
○ Enabling Ibis queries to run on the Arrow C++ engine
11
Challenges that Enterprises
Are Solving with Apache Arrow
12
Challenge: Obstacles to Interoperability
● Enterprise data tools and workflows often incorporate multiple
languages, query engines, storage systems, and clouds.
● Arrow provides standard formats, interfaces, and frameworks
to enable efficient integration of these diverse components.
13
Case Study: Microsoft
Microsoft uses Arrow in their Magpie data science middleware
● Unifies multiple cloud and
database backends into a
single end user interface
● Stores intermediate results
as Arrow Tables
● Uses Arrow Flight to move
data between systems
Source: http://cidrdb.org/cidr2021/papers/cidr2021_paper08.pdf
14
Case Study: Streamlit
Streamlit uses Arrow to move data from Python to JavaScript
● Reduces server–browser
data transfer times
● Simplifies addition of new
features to Streamlit
● Allowed deletion of >1000
lines of custom code
Source: https://blog.streamlit.io/all-in-on-apache-arrow/
15
Challenge: Slow Data Access Protocols
● Legacy data transport protocols are often slowed by
serialization and deserialization overhead and other causes.
● The Arrow columnar data format and the Arrow Flight
framework enable fast, serialization-free data transport.
16
Case Study: Snowflake
Snowflake uses Arrow in its JDBC client, Python client,
and Spark connector
● Eliminates serialization
between columnar and row-
oriented data formats
● Achieves 4 – 10x reduction
in fetch times
Figure 1. JDBC fetch performance benchmark for JDBC client
version 3.9.x (without Arrow) versus 3.11.0 (with Arrow)
Source: https://www.snowflake.com/blog/fetching-query-results-
from-snowflake-just-got-a-lot-faster-with-apache-arrow/
17
Case Study: Google Cloud Platform
Google Cloud uses Arrow in its BigQuery Storage API
● Eliminates or speeds
serialization to multiple
data formats
● Achieves 4 – 31x reduction
in download times
● Speedup is stable across
data sizes
Source: https://medium.com/google-cloud/announcing-google-
cloud-bigquery-version-1-17-0-1fc428512171
18
Challenge: Limits of JVM-Based Engines
● Computing engines implemented in JVM languages including
Java and Scala often suffer from performance bottlenecks.
● Arrow columnar data structures can accelerate JVM-based
engines and enable use of pluggable high-performance
components implemented in lower-level languages like C++.
19
Case Study: KNIME
KNIME uses Arrow in its
Columnar Table Backend
● Stores data more compactly in
memory, improving performance
● Eliminates the need to create
Java objects to represent table
elements, reducing GC pressure
● Enables use of shared memory
Source: https://www.knime.com/blog/improved-
performance-with-new-table-backend
20
Case Study: Meta
Meta uses Arrow in its Velox C++ database acceleration library
● Improves the performance of Spark and Presto jobs
● Uses Arrow columnar memory layouts for most data types
● Enables SIMD vectorized expression evaluation
Source: https://github.com/facebookincubator/velox
21
Challenge: Embeddable Query Processing
● Enterprises want the flexibility to embed in-memory analytical
execution capabilities directly in business applications.
● By integrating with Arrow, embeddable engines can achieve
excellent performance, efficiency, and developer experience.
22
Case Study: DataFusion
Arrow DataFusion is extensible query execution framework
● Implemented as an embeddable Rust library
● Supports distributed execution through Ballista
● Uses Arrow as its native in-memory format
● Supports SQL and a DataFrame API
● Donated to the Apache Arrow project
Source: https://arrow.apache.org/datafusion
23
Case Study: DuckDB
DuckDB is an in-process SQL OLAP database system
● Implemented as an embeddable C++ library
● Offers zero-copy integration with Arrow
● Can push down filters and projections into Arrow scans
● Interoperates with the Arrow Python and R APIs
Source: https://duckdb.org/2021/12/03/duck-arrow.html
24
How Voltron Data Supports
Enterprise Applications of
Apache Arrow
25
Enterprise Subscription for Arrow
● A focused set of services designed to accelerate
business success with Apache Arrow
● Offered in three editions starting with free
● Available now (March 2022)
● Learn more and sign up at
voltrondata.com/subscription
○ Professional support
○ Deployment assistance
○ Content and events
○ Private consultations
Trusted by:
Thank you
Wes McKinney
@wesmckinn
Sign up:
voltrondata.com/subscription

More Related Content

What's hot

Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Databricks
 
How to Extend Apache Spark with Customized Optimizations
How to Extend Apache Spark with Customized OptimizationsHow to Extend Apache Spark with Customized Optimizations
How to Extend Apache Spark with Customized Optimizations
Databricks
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
Ryan Blue
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
Nishith Agarwal
 
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Databricks
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data Framework
Wes McKinney
 
Diving into Delta Lake: Unpacking the Transaction Log
Diving into Delta Lake: Unpacking the Transaction LogDiving into Delta Lake: Unpacking the Transaction Log
Diving into Delta Lake: Unpacking the Transaction Log
Databricks
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its Benefits
Databricks
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
Alluxio, Inc.
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
Databricks
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming Jobs
Databricks
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache Arrow
Julien Le Dem
 
Apache Arrow Flight Overview
Apache Arrow Flight OverviewApache Arrow Flight Overview
Apache Arrow Flight Overview
Jacques Nadeau
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Flink Forward
 
Iceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsIceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data Analytics
Alluxio, Inc.
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
Flink Forward
 
Debugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauDebugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden Karau
Spark Summit
 
Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training ...
Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training ...Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training ...
Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training ...
Edureka!
 
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
 Best Practice of Compression/Decompression Codes in Apache Spark with Sophia... Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
Databricks
 
Hive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilHive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas Patil
Databricks
 

What's hot (20)

Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...Building a Streaming Microservice Architecture: with Apache Spark Structured ...
Building a Streaming Microservice Architecture: with Apache Spark Structured ...
 
How to Extend Apache Spark with Customized Optimizations
How to Extend Apache Spark with Customized OptimizationsHow to Extend Apache Spark with Customized Optimizations
How to Extend Apache Spark with Customized Optimizations
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
 
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data Framework
 
Diving into Delta Lake: Unpacking the Transaction Log
Diving into Delta Lake: Unpacking the Transaction LogDiving into Delta Lake: Unpacking the Transaction Log
Diving into Delta Lake: Unpacking the Transaction Log
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its Benefits
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
Productizing Structured Streaming Jobs
Productizing Structured Streaming JobsProductizing Structured Streaming Jobs
Productizing Structured Streaming Jobs
 
The columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache ArrowThe columnar roadmap: Apache Parquet and Apache Arrow
The columnar roadmap: Apache Parquet and Apache Arrow
 
Apache Arrow Flight Overview
Apache Arrow Flight OverviewApache Arrow Flight Overview
Apache Arrow Flight Overview
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
 
Iceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsIceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data Analytics
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
 
Debugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden KarauDebugging PySpark: Spark Summit East talk by Holden Karau
Debugging PySpark: Spark Summit East talk by Holden Karau
 
Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training ...
Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training ...Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training ...
Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training ...
 
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
 Best Practice of Compression/Decompression Codes in Apache Spark with Sophia... Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...
 
Hive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas PatilHive Bucketing in Apache Spark with Tejas Patil
Hive Bucketing in Apache Spark with Tejas Patil
 

Similar to Apache Arrow: Open Source Standard Becomes an Enterprise Necessity

Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
Wes McKinney
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020
Wes McKinney
 
Unlocking the Value of Your Data Lake
Unlocking the Value of Your Data LakeUnlocking the Value of Your Data Lake
Unlocking the Value of Your Data Lake
DATAVERSITY
 
Enterprise guide to building a Data Mesh
Enterprise guide to building a Data MeshEnterprise guide to building a Data Mesh
Enterprise guide to building a Data Mesh
Sion Smith
 
Session 8 - Creating Data Processing Services | Train the Trainers Program
Session 8 - Creating Data Processing Services | Train the Trainers ProgramSession 8 - Creating Data Processing Services | Train the Trainers Program
Session 8 - Creating Data Processing Services | Train the Trainers Program
FIWARE
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data Frames
Wes McKinney
 
Present and future of unified, portable, and efficient data processing with A...
Present and future of unified, portable, and efficient data processing with A...Present and future of unified, portable, and efficient data processing with A...
Present and future of unified, portable, and efficient data processing with A...
DataWorks Summit
 
Day 13 - Creating Data Processing Services | Train the Trainers Program
Day 13 - Creating Data Processing Services | Train the Trainers ProgramDay 13 - Creating Data Processing Services | Train the Trainers Program
Day 13 - Creating Data Processing Services | Train the Trainers Program
FIWARE
 
The Never Landing Stream with HTAP and Streaming
The Never Landing Stream with HTAP and StreamingThe Never Landing Stream with HTAP and Streaming
The Never Landing Stream with HTAP and Streaming
Timothy Spann
 
Belgrade R - Intro to H2O and Deep Water
Belgrade R - Intro to H2O and Deep WaterBelgrade R - Intro to H2O and Deep Water
Belgrade R - Intro to H2O and Deep Water
Sri Ambati
 
H2O at BelgradeR Meetup
H2O at BelgradeR MeetupH2O at BelgradeR Meetup
H2O at BelgradeR Meetup
Jo-fai Chow
 
Presto: Query Anything - Data Engineer’s perspective
Presto: Query Anything - Data Engineer’s perspectivePresto: Query Anything - Data Engineer’s perspective
Presto: Query Anything - Data Engineer’s perspective
Alluxio, Inc.
 
AMIS Oracle OpenWorld en Code One Review 2018 - Pillar 2: Custom Application ...
AMIS Oracle OpenWorld en Code One Review 2018 - Pillar 2: Custom Application ...AMIS Oracle OpenWorld en Code One Review 2018 - Pillar 2: Custom Application ...
AMIS Oracle OpenWorld en Code One Review 2018 - Pillar 2: Custom Application ...
Getting value from IoT, Integration and Data Analytics
 
AMIS Oracle OpenWorld & CodeOne Review - Pillar 2 - Custom Application Develo...
AMIS Oracle OpenWorld & CodeOne Review - Pillar 2 - Custom Application Develo...AMIS Oracle OpenWorld & CodeOne Review - Pillar 2 - Custom Application Develo...
AMIS Oracle OpenWorld & CodeOne Review - Pillar 2 - Custom Application Develo...
Lucas Jellema
 
Big Data Analytics from Azure Cloud to Power BI Mobile
Big Data Analytics from Azure Cloud to Power BI MobileBig Data Analytics from Azure Cloud to Power BI Mobile
Big Data Analytics from Azure Cloud to Power BI Mobile
Roy Kim
 
Serverless Data Platform
Serverless Data PlatformServerless Data Platform
Serverless Data Platform
Shu-Jeng Hsieh
 
Databricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupDatabricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User Group
Paco Nathan
 
2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...
2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...
2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...
Andrew Lamb
 
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Yao Yao
 
What's New in Upcoming Apache Spark 2.3
What's New in Upcoming Apache Spark 2.3What's New in Upcoming Apache Spark 2.3
What's New in Upcoming Apache Spark 2.3
Databricks
 

Similar to Apache Arrow: Open Source Standard Becomes an Enterprise Necessity (20)

Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020
 
Unlocking the Value of Your Data Lake
Unlocking the Value of Your Data LakeUnlocking the Value of Your Data Lake
Unlocking the Value of Your Data Lake
 
Enterprise guide to building a Data Mesh
Enterprise guide to building a Data MeshEnterprise guide to building a Data Mesh
Enterprise guide to building a Data Mesh
 
Session 8 - Creating Data Processing Services | Train the Trainers Program
Session 8 - Creating Data Processing Services | Train the Trainers ProgramSession 8 - Creating Data Processing Services | Train the Trainers Program
Session 8 - Creating Data Processing Services | Train the Trainers Program
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data Frames
 
Present and future of unified, portable, and efficient data processing with A...
Present and future of unified, portable, and efficient data processing with A...Present and future of unified, portable, and efficient data processing with A...
Present and future of unified, portable, and efficient data processing with A...
 
Day 13 - Creating Data Processing Services | Train the Trainers Program
Day 13 - Creating Data Processing Services | Train the Trainers ProgramDay 13 - Creating Data Processing Services | Train the Trainers Program
Day 13 - Creating Data Processing Services | Train the Trainers Program
 
The Never Landing Stream with HTAP and Streaming
The Never Landing Stream with HTAP and StreamingThe Never Landing Stream with HTAP and Streaming
The Never Landing Stream with HTAP and Streaming
 
Belgrade R - Intro to H2O and Deep Water
Belgrade R - Intro to H2O and Deep WaterBelgrade R - Intro to H2O and Deep Water
Belgrade R - Intro to H2O and Deep Water
 
H2O at BelgradeR Meetup
H2O at BelgradeR MeetupH2O at BelgradeR Meetup
H2O at BelgradeR Meetup
 
Presto: Query Anything - Data Engineer’s perspective
Presto: Query Anything - Data Engineer’s perspectivePresto: Query Anything - Data Engineer’s perspective
Presto: Query Anything - Data Engineer’s perspective
 
AMIS Oracle OpenWorld en Code One Review 2018 - Pillar 2: Custom Application ...
AMIS Oracle OpenWorld en Code One Review 2018 - Pillar 2: Custom Application ...AMIS Oracle OpenWorld en Code One Review 2018 - Pillar 2: Custom Application ...
AMIS Oracle OpenWorld en Code One Review 2018 - Pillar 2: Custom Application ...
 
AMIS Oracle OpenWorld & CodeOne Review - Pillar 2 - Custom Application Develo...
AMIS Oracle OpenWorld & CodeOne Review - Pillar 2 - Custom Application Develo...AMIS Oracle OpenWorld & CodeOne Review - Pillar 2 - Custom Application Develo...
AMIS Oracle OpenWorld & CodeOne Review - Pillar 2 - Custom Application Develo...
 
Big Data Analytics from Azure Cloud to Power BI Mobile
Big Data Analytics from Azure Cloud to Power BI MobileBig Data Analytics from Azure Cloud to Power BI Mobile
Big Data Analytics from Azure Cloud to Power BI Mobile
 
Serverless Data Platform
Serverless Data PlatformServerless Data Platform
Serverless Data Platform
 
Databricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User GroupDatabricks Meetup @ Los Angeles Apache Spark User Group
Databricks Meetup @ Los Angeles Apache Spark User Group
 
2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...
2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...
2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...
 
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud PlatformTeaching Apache Spark: Demonstrations on the Databricks Cloud Platform
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform
 
What's New in Upcoming Apache Spark 2.3
What's New in Upcoming Apache Spark 2.3What's New in Upcoming Apache Spark 2.3
What's New in Upcoming Apache Spark 2.3
 

More from Wes McKinney

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
Wes McKinney
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data Transport
Wes McKinney
 
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
Wes McKinney
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics Stack
Wes McKinney
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Wes McKinney
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science Stack
Wes McKinney
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019
Wes McKinney
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
Wes McKinney
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory Data
Wes McKinney
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory data
Wes McKinney
 
Shared Infrastructure for Data Science
Shared Infrastructure for Data ScienceShared Infrastructure for Data Science
Shared Infrastructure for Data Science
Wes McKinney
 
Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)
Wes McKinney
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine Learning
Wes McKinney
 
Raising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data ScienceRaising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data Science
Wes McKinney
 
Improving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityImproving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and Interoperability
Wes McKinney
 
Python Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the FuturePython Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the Future
Wes McKinney
 
PyCon APAC 2016 Keynote
PyCon APAC 2016 KeynotePyCon APAC 2016 Keynote
PyCon APAC 2016 Keynote
Wes McKinney
 
Apache Arrow and Python: The latest
Apache Arrow and Python: The latestApache Arrow and Python: The latest
Apache Arrow and Python: The latest
Wes McKinney
 
High Performance Python on Apache Spark
High Performance Python on Apache SparkHigh Performance Python on Apache Spark
High Performance Python on Apache Spark
Wes McKinney
 
Python Data Ecosystem: Thoughts on Building for the Future
Python Data Ecosystem: Thoughts on Building for the FuturePython Data Ecosystem: Thoughts on Building for the Future
Python Data Ecosystem: Thoughts on Building for the Future
Wes McKinney
 

More from Wes McKinney (20)

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data Transport
 
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics Stack
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS Session
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science Stack
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory Data
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory data
 
Shared Infrastructure for Data Science
Shared Infrastructure for Data ScienceShared Infrastructure for Data Science
Shared Infrastructure for Data Science
 
Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine Learning
 
Raising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data ScienceRaising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data Science
 
Improving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityImproving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and Interoperability
 
Python Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the FuturePython Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the Future
 
PyCon APAC 2016 Keynote
PyCon APAC 2016 KeynotePyCon APAC 2016 Keynote
PyCon APAC 2016 Keynote
 
Apache Arrow and Python: The latest
Apache Arrow and Python: The latestApache Arrow and Python: The latest
Apache Arrow and Python: The latest
 
High Performance Python on Apache Spark
High Performance Python on Apache SparkHigh Performance Python on Apache Spark
High Performance Python on Apache Spark
 
Python Data Ecosystem: Thoughts on Building for the Future
Python Data Ecosystem: Thoughts on Building for the FuturePython Data Ecosystem: Thoughts on Building for the Future
Python Data Ecosystem: Thoughts on Building for the Future
 

Recently uploaded

FIDO Munich Seminar Workforce Authentication Case Study.pptx
FIDO Munich Seminar Workforce Authentication Case Study.pptxFIDO Munich Seminar Workforce Authentication Case Study.pptx
FIDO Munich Seminar Workforce Authentication Case Study.pptx
FIDO Alliance
 
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan..."Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...
Fwdays
 
Enterprise_Mobile_Security_Forum_2013.pdf
Enterprise_Mobile_Security_Forum_2013.pdfEnterprise_Mobile_Security_Forum_2013.pdf
Enterprise_Mobile_Security_Forum_2013.pdf
Yury Chemerkin
 
Retrieval Augmented Generation Evaluation with Ragas
Retrieval Augmented Generation Evaluation with RagasRetrieval Augmented Generation Evaluation with Ragas
Retrieval Augmented Generation Evaluation with Ragas
Zilliz
 
Cracking AI Black Box - Strategies for Customer-centric Enterprise Excellence
Cracking AI Black Box - Strategies for Customer-centric Enterprise ExcellenceCracking AI Black Box - Strategies for Customer-centric Enterprise Excellence
Cracking AI Black Box - Strategies for Customer-centric Enterprise Excellence
Quentin Reul
 
FIDO Munich Seminar: Strong Workforce Authn Push & Pull Factors.pptx
FIDO Munich Seminar: Strong Workforce Authn Push & Pull Factors.pptxFIDO Munich Seminar: Strong Workforce Authn Push & Pull Factors.pptx
FIDO Munich Seminar: Strong Workforce Authn Push & Pull Factors.pptx
FIDO Alliance
 
Keynote : Presentation on SASE Technology
Keynote : Presentation on SASE TechnologyKeynote : Presentation on SASE Technology
Keynote : Presentation on SASE Technology
Priyanka Aash
 
The Path to General-Purpose Robots - Coatue
The Path to General-Purpose Robots - CoatueThe Path to General-Purpose Robots - Coatue
The Path to General-Purpose Robots - Coatue
Razin Mustafiz
 
Zaitechno Handheld Raman Spectrometer.pdf
Zaitechno Handheld Raman Spectrometer.pdfZaitechno Handheld Raman Spectrometer.pdf
Zaitechno Handheld Raman Spectrometer.pdf
AmandaCheung15
 
Keynote : AI & Future Of Offensive Security
Keynote : AI & Future Of Offensive SecurityKeynote : AI & Future Of Offensive Security
Keynote : AI & Future Of Offensive Security
Priyanka Aash
 
What's New in Copilot for Microsoft 365 June 2024.pptx
What's New in Copilot for Microsoft 365 June 2024.pptxWhat's New in Copilot for Microsoft 365 June 2024.pptx
What's New in Copilot for Microsoft 365 June 2024.pptx
Stephanie Beckett
 
Indian Privacy law & Infosec for Startups
Indian Privacy law & Infosec for StartupsIndian Privacy law & Infosec for Startups
Indian Privacy law & Infosec for Startups
AMol NAik
 
Discovery Series - Zero to Hero - Task Mining Session 1
Discovery Series - Zero to Hero - Task Mining Session 1Discovery Series - Zero to Hero - Task Mining Session 1
Discovery Series - Zero to Hero - Task Mining Session 1
DianaGray10
 
The History of Embeddings & Multimodal Embeddings
The History of Embeddings & Multimodal EmbeddingsThe History of Embeddings & Multimodal Embeddings
The History of Embeddings & Multimodal Embeddings
Zilliz
 
Generative AI Reasoning Tech Talk - July 2024
Generative AI Reasoning Tech Talk - July 2024Generative AI Reasoning Tech Talk - July 2024
Generative AI Reasoning Tech Talk - July 2024
siddu769252
 
AMD Zen 5 Architecture Deep Dive from Tech Day
AMD Zen 5 Architecture Deep Dive from Tech DayAMD Zen 5 Architecture Deep Dive from Tech Day
AMD Zen 5 Architecture Deep Dive from Tech Day
Low Hong Chuan
 
The Challenge of Interpretability in Generative AI Models.pdf
The Challenge of Interpretability in Generative AI Models.pdfThe Challenge of Interpretability in Generative AI Models.pdf
The Challenge of Interpretability in Generative AI Models.pdf
Sara Kroft
 
FIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptx
FIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptxFIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptx
FIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptx
FIDO Alliance
 
Redefining Cybersecurity with AI Capabilities
Redefining Cybersecurity with AI CapabilitiesRedefining Cybersecurity with AI Capabilities
Redefining Cybersecurity with AI Capabilities
Priyanka Aash
 
Self-Healing Test Automation Framework - Healenium
Self-Healing Test Automation Framework - HealeniumSelf-Healing Test Automation Framework - Healenium
Self-Healing Test Automation Framework - Healenium
Knoldus Inc.
 

Recently uploaded (20)

FIDO Munich Seminar Workforce Authentication Case Study.pptx
FIDO Munich Seminar Workforce Authentication Case Study.pptxFIDO Munich Seminar Workforce Authentication Case Study.pptx
FIDO Munich Seminar Workforce Authentication Case Study.pptx
 
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan..."Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...
 
Enterprise_Mobile_Security_Forum_2013.pdf
Enterprise_Mobile_Security_Forum_2013.pdfEnterprise_Mobile_Security_Forum_2013.pdf
Enterprise_Mobile_Security_Forum_2013.pdf
 
Retrieval Augmented Generation Evaluation with Ragas
Retrieval Augmented Generation Evaluation with RagasRetrieval Augmented Generation Evaluation with Ragas
Retrieval Augmented Generation Evaluation with Ragas
 
Cracking AI Black Box - Strategies for Customer-centric Enterprise Excellence
Cracking AI Black Box - Strategies for Customer-centric Enterprise ExcellenceCracking AI Black Box - Strategies for Customer-centric Enterprise Excellence
Cracking AI Black Box - Strategies for Customer-centric Enterprise Excellence
 
FIDO Munich Seminar: Strong Workforce Authn Push & Pull Factors.pptx
FIDO Munich Seminar: Strong Workforce Authn Push & Pull Factors.pptxFIDO Munich Seminar: Strong Workforce Authn Push & Pull Factors.pptx
FIDO Munich Seminar: Strong Workforce Authn Push & Pull Factors.pptx
 
Keynote : Presentation on SASE Technology
Keynote : Presentation on SASE TechnologyKeynote : Presentation on SASE Technology
Keynote : Presentation on SASE Technology
 
The Path to General-Purpose Robots - Coatue
The Path to General-Purpose Robots - CoatueThe Path to General-Purpose Robots - Coatue
The Path to General-Purpose Robots - Coatue
 
Zaitechno Handheld Raman Spectrometer.pdf
Zaitechno Handheld Raman Spectrometer.pdfZaitechno Handheld Raman Spectrometer.pdf
Zaitechno Handheld Raman Spectrometer.pdf
 
Keynote : AI & Future Of Offensive Security
Keynote : AI & Future Of Offensive SecurityKeynote : AI & Future Of Offensive Security
Keynote : AI & Future Of Offensive Security
 
What's New in Copilot for Microsoft 365 June 2024.pptx
What's New in Copilot for Microsoft 365 June 2024.pptxWhat's New in Copilot for Microsoft 365 June 2024.pptx
What's New in Copilot for Microsoft 365 June 2024.pptx
 
Indian Privacy law & Infosec for Startups
Indian Privacy law & Infosec for StartupsIndian Privacy law & Infosec for Startups
Indian Privacy law & Infosec for Startups
 
Discovery Series - Zero to Hero - Task Mining Session 1
Discovery Series - Zero to Hero - Task Mining Session 1Discovery Series - Zero to Hero - Task Mining Session 1
Discovery Series - Zero to Hero - Task Mining Session 1
 
The History of Embeddings & Multimodal Embeddings
The History of Embeddings & Multimodal EmbeddingsThe History of Embeddings & Multimodal Embeddings
The History of Embeddings & Multimodal Embeddings
 
Generative AI Reasoning Tech Talk - July 2024
Generative AI Reasoning Tech Talk - July 2024Generative AI Reasoning Tech Talk - July 2024
Generative AI Reasoning Tech Talk - July 2024
 
AMD Zen 5 Architecture Deep Dive from Tech Day
AMD Zen 5 Architecture Deep Dive from Tech DayAMD Zen 5 Architecture Deep Dive from Tech Day
AMD Zen 5 Architecture Deep Dive from Tech Day
 
The Challenge of Interpretability in Generative AI Models.pdf
The Challenge of Interpretability in Generative AI Models.pdfThe Challenge of Interpretability in Generative AI Models.pdf
The Challenge of Interpretability in Generative AI Models.pdf
 
FIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptx
FIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptxFIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptx
FIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptx
 
Redefining Cybersecurity with AI Capabilities
Redefining Cybersecurity with AI CapabilitiesRedefining Cybersecurity with AI Capabilities
Redefining Cybersecurity with AI Capabilities
 
Self-Healing Test Automation Framework - Healenium
Self-Healing Test Automation Framework - HealeniumSelf-Healing Test Automation Framework - Healenium
Self-Healing Test Automation Framework - Healenium
 

Apache Arrow: Open Source Standard Becomes an Enterprise Necessity

  • 1. Apache Arrow: Open Source Standard Becomes Enterprise Necessity March 3, 2022 Subsurface Live Winter 2022 Wes McKinney @wesmckinn
  • 2. 2 About Me ● Apache Arrow co-creator, PMC member ● pandas creator ● Voltron Data co-founder, CTO ● Previously at Ursa Labs, Two Sigma, Cloudera, AQR ● Author of Python for Data Analysis
  • 3. 3 Apache Arrow Multi-language toolbox for accelerated data interchange and in-memory processing ● Founded in 2016 by a group of open source developers ● Enables unification of database and data science tech stacks ● Thriving user and developer community ● Provides implementations or bindings in twelve languages ● Adopted by numerous projects and products in the ecosystem
  • 4. 4 Voltron Data ● Creating a unified foundation for the future of analytical computing with Apache Arrow ● Founded in 2021 ● Raised $110M in seed and Series A funding ● Leading corporate contributor to Arrow We’re hiring! voltrondata.com/careers
  • 5. 5 What’s New in Apache Arrow
  • 6. 6 Project and Ecosystem Growth ● Sustained growth in users, contributors, applications ● Arrow version 7.0.0 (24th major release) in February 2022 ≈800 Arrow contributors >50M monthly PyArrow downloads
  • 7. 7 Arrow Flight SQL Next-generation standard for data access using SQL ● Adds SQL semantics to Arrow Flight ● Enables ODBC/JDBC-style data access at the speed of Flight ● Reduces implementation complexity for developers ● Shipped in Arrow version 7.0.0 (February 2020) ○ C++ and Java implementations available ○ Additional development and documentation is ongoing 🚧
  • 8. 8 Arrow C++ Query Engine Work is ongoing to implement comprehensive query execution capabilities in the Arrow C++ library ● Common scalar functions ✔ ● Common grouped aggregate functions ✔ ● Common joins ✔ ● Performance and efficiency improvements 🚧 ● Window functions 🚧 ● Additional join types 🚧
  • 9. 9 Substrait A cross-language, interoperable specification for data compute operations ○ A standard, flexible way for APIs and engines to share the representation of analytics computations ■ Produced by APIs (dplyr, Ibis, SQL, …) ■ Consumed by engines (Arrow C++ engine, DuckDB, …) ○ Work is underway to implement producers and consumers 🚧
  • 10. 10 Ibis A high-level Python API for data analysis ● Fluent pandas-like syntax ● Can express virtually any SQL query ● Supports modular backends for querying systems including PostgreSQL, MySQL, SQLite, Impala, ClickHouse, BigQuery ● Work is underway to produce Substrait plans 🚧 ○ Enabling Ibis queries to run on the Arrow C++ engine
  • 11. 11 Challenges that Enterprises Are Solving with Apache Arrow
  • 12. 12 Challenge: Obstacles to Interoperability ● Enterprise data tools and workflows often incorporate multiple languages, query engines, storage systems, and clouds. ● Arrow provides standard formats, interfaces, and frameworks to enable efficient integration of these diverse components.
  • 13. 13 Case Study: Microsoft Microsoft uses Arrow in their Magpie data science middleware ● Unifies multiple cloud and database backends into a single end user interface ● Stores intermediate results as Arrow Tables ● Uses Arrow Flight to move data between systems Source: http://cidrdb.org/cidr2021/papers/cidr2021_paper08.pdf
  • 14. 14 Case Study: Streamlit Streamlit uses Arrow to move data from Python to JavaScript ● Reduces server–browser data transfer times ● Simplifies addition of new features to Streamlit ● Allowed deletion of >1000 lines of custom code Source: https://blog.streamlit.io/all-in-on-apache-arrow/
  • 15. 15 Challenge: Slow Data Access Protocols ● Legacy data transport protocols are often slowed by serialization and deserialization overhead and other causes. ● The Arrow columnar data format and the Arrow Flight framework enable fast, serialization-free data transport.
  • 16. 16 Case Study: Snowflake Snowflake uses Arrow in its JDBC client, Python client, and Spark connector ● Eliminates serialization between columnar and row- oriented data formats ● Achieves 4 – 10x reduction in fetch times Figure 1. JDBC fetch performance benchmark for JDBC client version 3.9.x (without Arrow) versus 3.11.0 (with Arrow) Source: https://www.snowflake.com/blog/fetching-query-results- from-snowflake-just-got-a-lot-faster-with-apache-arrow/
  • 17. 17 Case Study: Google Cloud Platform Google Cloud uses Arrow in its BigQuery Storage API ● Eliminates or speeds serialization to multiple data formats ● Achieves 4 – 31x reduction in download times ● Speedup is stable across data sizes Source: https://medium.com/google-cloud/announcing-google- cloud-bigquery-version-1-17-0-1fc428512171
  • 18. 18 Challenge: Limits of JVM-Based Engines ● Computing engines implemented in JVM languages including Java and Scala often suffer from performance bottlenecks. ● Arrow columnar data structures can accelerate JVM-based engines and enable use of pluggable high-performance components implemented in lower-level languages like C++.
  • 19. 19 Case Study: KNIME KNIME uses Arrow in its Columnar Table Backend ● Stores data more compactly in memory, improving performance ● Eliminates the need to create Java objects to represent table elements, reducing GC pressure ● Enables use of shared memory Source: https://www.knime.com/blog/improved- performance-with-new-table-backend
  • 20. 20 Case Study: Meta Meta uses Arrow in its Velox C++ database acceleration library ● Improves the performance of Spark and Presto jobs ● Uses Arrow columnar memory layouts for most data types ● Enables SIMD vectorized expression evaluation Source: https://github.com/facebookincubator/velox
  • 21. 21 Challenge: Embeddable Query Processing ● Enterprises want the flexibility to embed in-memory analytical execution capabilities directly in business applications. ● By integrating with Arrow, embeddable engines can achieve excellent performance, efficiency, and developer experience.
  • 22. 22 Case Study: DataFusion Arrow DataFusion is extensible query execution framework ● Implemented as an embeddable Rust library ● Supports distributed execution through Ballista ● Uses Arrow as its native in-memory format ● Supports SQL and a DataFrame API ● Donated to the Apache Arrow project Source: https://arrow.apache.org/datafusion
  • 23. 23 Case Study: DuckDB DuckDB is an in-process SQL OLAP database system ● Implemented as an embeddable C++ library ● Offers zero-copy integration with Arrow ● Can push down filters and projections into Arrow scans ● Interoperates with the Arrow Python and R APIs Source: https://duckdb.org/2021/12/03/duck-arrow.html
  • 24. 24 How Voltron Data Supports Enterprise Applications of Apache Arrow
  • 25. 25 Enterprise Subscription for Arrow ● A focused set of services designed to accelerate business success with Apache Arrow ● Offered in three editions starting with free ● Available now (March 2022) ● Learn more and sign up at voltrondata.com/subscription ○ Professional support ○ Deployment assistance ○ Content and events ○ Private consultations Trusted by:
  • 26. Thank you Wes McKinney @wesmckinn Sign up: voltrondata.com/subscription