Building a Streaming Microservice Architecture: with Apache Spark Structured ... - Databricks
As we continue to push the boundaries of what is possible with respect to pipeline throughput and data serving tiers, new methodologies and techniques continue to emerge to handle larger and larger workloads.
How to Extend Apache Spark with Customized Optimizations - Databricks
There is a growing set of optimization mechanisms that allow you to achieve competitive SQL performance. Spark has extension points that let third parties add customizations and optimizations without needing them to be merged into Apache Spark, which is very powerful for extensibility. We have added some enhancements to the existing extension points framework to enable fine-grained control. This talk is a deep dive into the extension points available in Spark today, as well as the enhancements we developed to make this API more powerful. It will benefit developers who are looking to customize Spark in their deployments.
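As a rough illustration, an extensions class is enabled purely through configuration; the class name below is hypothetical, and the rules it injects are JVM (Scala/Java) code implementing `SparkSessionExtensions => Unit`:

```python
# Minimal PySpark sketch: Spark instantiates the class named in
# spark.sql.extensions at session startup and lets it inject rules
# (e.g. via injectOptimizerRule on the JVM side).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("custom-optimizations")
    # com.example.MyExtensions is a hypothetical extensions class
    # that must be present on the driver/executor classpath.
    .config("spark.sql.extensions", "com.example.MyExtensions")
    .getOrCreate()
)
```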
Iceberg: A modern table format for big data (Strata NY 2018) - Ryan Blue
Hive tables are an integral part of the big data ecosystem, but the simple directory-based design that made them ubiquitous is increasingly problematic. Netflix uses tables backed by S3 that, like other object stores, don’t fit this directory-based model: listings are much slower, renames are not atomic, and results are eventually consistent. Even tables in HDFS are problematic at scale, and reliable query behavior requires readers to acquire locks and wait.
Owen O’Malley and Ryan Blue offer an overview of Iceberg, a new open source project that defines a table layout addressing the challenges of current Hive tables, with properties specifically designed for cloud object stores such as S3. Iceberg is an Apache-licensed open source project. It specifies a portable table format and standardizes many important features (a brief usage sketch follows the list), including:
* All reads use snapshot isolation without locking.
* No directory listings are required for query planning.
* Files can be added, removed, or replaced atomically.
* Full schema evolution supports changes in the table over time.
* Partitioning evolution enables changes to the physical layout without breaking existing queries.
* Data files are stored as Avro, ORC, or Parquet.
* Support for Spark, Pig, and Presto.
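As a hedged sketch of these features from Spark (assuming an iceberg-spark-runtime jar on the classpath and a catalog named `local` configured; table names and the snapshot id are illustrative):

```python
# Create and populate an Iceberg table through Spark SQL.
spark.sql("CREATE TABLE local.db.events (id BIGINT, city STRING) USING iceberg")
spark.sql("INSERT INTO local.db.events VALUES (1, 'NYC')")

# Snapshot isolation in practice: read the table as of a specific snapshot,
# without any locking or directory listing.
df = (spark.read
      .format("iceberg")
      .option("snapshot-id", 5937117119577207000)  # illustrative id
      .load("local.db.events"))
```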
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ... - Databricks
Uber needs to provide faster, fresher data to data consumers and products, which run hundreds of thousands of analytical queries every day. Uber engineers will share the design, architecture, and use cases of the second generation of Hudi, a self-contained Apache Spark library for building large-scale analytical datasets designed to serve such needs and beyond. Hudi (formerly Hoodie) was created to effectively manage petabytes of analytical data on distributed storage while supporting fast ingestion and queries. In this talk, we will discuss how we leveraged Spark as a general-purpose distributed execution engine to build Hudi, detailing trade-offs and operational experience. We will also show how to ingest data into Hudi using the Spark Datasource/Streaming APIs and build notebooks/dashboards on top using Spark SQL.
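A minimal sketch of datasource-based ingestion, assuming the hudi-spark bundle is on the classpath (table and field names are illustrative):

```python
# Upsert a DataFrame into a Hudi table via the Spark datasource API.
hudi_options = {
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.recordkey.field": "trip_id",
    "hoodie.datasource.write.partitionpath.field": "city",
    "hoodie.datasource.write.precombine.field": "event_ts",
}
(df.write.format("hudi")
   .options(**hudi_options)
   .mode("append")
   .save("/data/lake/trips"))
```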
Diving into Delta Lake: Unpacking the Transaction Log - Databricks
The document provides an overview of Delta Lake, a storage layer that brings ACID transactions to Apache Spark SQL. It discusses key concepts like the Delta log (transaction log), optimistic concurrency control, computing and updating the state of a Delta table, and time travel, covers batch and streaming queries on Delta tables, and concludes with a demo.
Delta Lake is an open-source innovation that brings new capabilities for transactions, version control, and indexing to your data lakes. We uncover how Delta Lake benefits you and why it matters. Through this session, we showcase some of its benefits and how they can improve your modern data engineering pipelines. Delta Lake provides snapshot isolation, which supports concurrent read/write operations and enables efficient inserts, updates, deletes, and rollbacks. It allows background file optimization through compaction and Z-order clustering, achieving better performance. In this presentation, we will learn how Delta Lake solves common data lake challenges and, most importantly, explore the new Delta Time Travel capability.
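A minimal sketch of the log-backed features, assuming a Delta-enabled Spark session (the delta-spark package):

```python
# Each write appends a new commit to the transaction log...
df.write.format("delta").mode("overwrite").save("/data/lake/events")

# ...and time travel reads the table state as of an earlier log version.
v0 = (spark.read.format("delta")
      .option("versionAsOf", 0)
      .load("/data/lake/events"))
```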
Apache Iceberg - A Table Format for Huge Analytic Datasets - Alluxio, Inc.
Data Orchestration Summit
www.alluxio.io/data-orchestration-summit-2019
November 7, 2019
Apache Iceberg - A Table Format for Huge Analytic Datasets
Speaker:
Ryan Blue, Netflix
For more Alluxio events: https://www.alluxio.io/events/
"The common use cases of Spark SQL include ad hoc analysis, logical warehouse, query federation, and ETL processing. Spark SQL also powers the other Spark libraries, including structured streaming for stream processing, MLlib for machine learning, and GraphFrame for graph-parallel computation. For boosting the speed of your Spark applications, you can perform the optimization efforts on the queries prior employing to the production systems. Spark query plans and Spark UIs provide you insight on the performance of your queries. This talk discloses how to read and tune the query plans for enhanced performance. It will also cover the major related features in the recent and upcoming releases of Apache Spark.
"
"Structured Streaming was a new streaming API introduced to Spark over 2 years ago in Spark 2.0, and was announced GA as of Spark 2.2. Databricks customers have processed over a hundred trillion rows in production using Structured Streaming. We received dozens of questions on how to best develop, monitor, test, deploy and upgrade these jobs. In this talk, we aim to share best practices around what has worked and what hasn't across our customer base.
We will tackle questions around how to plan ahead, what kind of code changes are safe for structured streaming jobs, how to architect streaming pipelines which can give you the most flexibility without sacrificing performance by using tools like Databricks Delta, how to best monitor your streaming jobs and alert if your streams are falling behind or are actually failing, as well as how to best test your code."
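A minimal monitoring sketch, assuming a toy rate source (the built-in progress metrics are the raw material for lag and failure alerts):

```python
# Start a checkpointed stream, then poll its progress for alerting.
stream = spark.readStream.format("rate").option("rowsPerSecond", 100).load()

query = (stream.writeStream
         .format("console")
         .option("checkpointLocation", "/tmp/chk/rate_demo")
         .start())

# lastProgress includes input/processed rates and batch durations; a
# persistent gap between input and processed rates means the stream
# is falling behind and should trigger an alert.
progress = query.lastProgress
```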
The columnar roadmap: Apache Parquet and Apache Arrow - Julien Le Dem
This document discusses Apache Parquet and Apache Arrow, open source projects for columnar data formats. Parquet is an on-disk columnar format that optimizes I/O performance through compression and projection pushdown. Arrow is an in-memory columnar format that maximizes CPU efficiency through vectorized processing and SIMD. It aims to serve as a standard in-memory format between systems. The document outlines how Arrow builds on Parquet's success and provides benefits like reduced serialization overhead and ability to share functionality through its ecosystem. It also describes how Parquet and Arrow representations are integrated through techniques like vectorized reading and predicate pushdown.
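A small sketch of the pushdowns mentioned above, using the pyarrow Parquet API (file and column names are illustrative):

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": [1, 2, 3],
                  "city": ["NYC", "SF", "NYC"],
                  "fare": [9.5, 12.0, 7.25]})
pq.write_table(table, "trips.parquet")

subset = pq.read_table(
    "trips.parquet",
    columns=["id", "fare"],          # projection pushdown: touch 2 columns
    filters=[("city", "=", "NYC")],  # predicate pushdown via statistics
)
```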
Arrow Flight is a proposed RPC layer for Apache Arrow that allows for efficient transfer of Arrow record batches between systems. It uses GRPC as the foundation to define streams of Arrow data that can be consumed in parallel across locations. Arrow Flight supports custom actions that can be used to build services on top of the generic API. By extending GRPC, Arrow Flight aims to simplify the creation of data applications while enabling high performance data transfer and locality awareness.
Exactly-Once Financial Data Processing at Scale with Flink and Pinot - Flink Forward
Flink Forward San Francisco 2022.
At Stripe we have created a complete end-to-end exactly-once processing pipeline for financial data at scale, by combining the exactly-once capabilities of Flink, Kafka, and Pinot. The pipeline provides an exactly-once guarantee, end-to-end latency within a minute, deduplication against hundreds of billions of keys, and sub-second query latency against the whole dataset of trillions of rows. In this session we will discuss the technical challenges of designing, optimizing, and operating the whole pipeline, including Flink, Kafka, and Pinot. We will also share our lessons learned and the benefits gained from exactly-once processing.
by Xiang Zhang, Pratyush Sharma & Xiaoman Dong
Tame the small files problem and optimize data layout for streaming ingestion... - Flink Forward
Flink Forward San Francisco 2022.
In modern data platform architectures, stream processing engines such as Apache Flink are used to ingest continuous streams of data into data lakes such as Apache Iceberg. Streaming ingestion into Iceberg tables can suffer from two problems: (1) a small-files problem that can hurt read performance, and (2) poor data clustering that can make file pruning less effective. To address these two problems, we propose adding a shuffling stage to the Flink Iceberg streaming writer. The shuffling stage can intelligently group data via bin packing or range partitioning, which reduces the number of concurrent files that every task writes and also improves data clustering. In this talk, we will explain the motivations in detail and dive into the design of the shuffling stage. We will also share evaluation results that demonstrate the effectiveness of smart shuffling.
by Gang Ye & Steven Wu
Debugging PySpark: Spark Summit East talk by Holden Karau - Spark Summit
Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. This talk will examine how to debug Apache Spark applications, the different options for logging in Spark’s variety of supported languages, as well as some common errors and how to detect them.
Spark’s own internal logging can often be quite verbose, and this talk will examine how to effectively search logs from Apache Spark to spot common problems. In addition to the internal logging, this talk will look at options for logging from within our program itself.
Spark’s accumulators have gotten a bad rap because of how they behave under cache misses or partial recomputes, but this talk will look at how to use Spark’s current accumulators effectively for debugging, as well as a look ahead to the data property accumulators that may come in a future Spark version.
In addition to reading logs and instrumenting our programs with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems.
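A small debugging sketch in the spirit of the talk, counting malformed records with an accumulator (the caveat the talk warns about is baked into the comment):

```python
# Count unparseable lines while processing; treat the total as a
# diagnostic only, since updates made inside transformations can be
# replayed on task retries or recomputation of unpersisted data.
bad_records = spark.sparkContext.accumulator(0)

def parse(line):
    try:
        return [int(line)]
    except ValueError:
        bad_records.add(1)
        return []

ok = spark.sparkContext.textFile("/data/input.txt").flatMap(parse).count()
print("parsed:", ok, "malformed:", bad_records.value)
```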
Spark Streaming | Twitter Sentiment Analysis Example | Apache Spark Training ... - Edureka!
This Edureka Spark Streaming tutorial will help you understand how to use Spark Streaming to stream data from Twitter in real time and then process it for sentiment analysis. It is ideal both for beginners and for professionals who want to learn or brush up on their Apache Spark concepts. Below are the topics covered in this tutorial (a short DStream sketch follows the list):
1) What is Streaming?
2) Spark Ecosystem
3) Why Spark Streaming?
4) Spark Streaming Overview
5) DStreams
6) DStream Transformations
7) Caching/ Persistence
8) Accumulators, Broadcast Variables and Checkpoints
9) Use Case – Twitter Sentiment Analysis
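As a minimal sketch of the DStream API the tutorial is built on (a socket source stands in for the Twitter stream):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "dstream-wordcount")
ssc = StreamingContext(sc, batchDuration=1)  # 1-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda l: l.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # print each micro-batch's counts

ssc.start()
ssc.awaitTermination()
```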
Best Practice of Compression/Decompression Codecs in Apache Spark with Sophia... - Databricks
Nowadays, people are creating, sharing, and storing data at a faster pace than ever before, and effective data compression/decompression can significantly reduce the cost of data usage. Apache Spark is a general distributed computing engine for big data analytics, and it stores and shuffles large amounts of data across the cluster at runtime, so the compression/decompression codecs can impact end-to-end application performance in many ways.
However, there is a trade-off between storage size and compression/decompression throughput (CPU computation). Balancing compression speed and ratio is a very interesting topic, particularly while both software algorithms and CPU instruction sets keep evolving. Apache Spark provides a very flexible compression codec interface with default implementations like GZip, Snappy, LZ4, and ZSTD, and the Intel Big Data Technologies team has also implemented additional codecs for Apache Spark based on the latest Intel platforms, such as ISA-L (igzip), LZ4-IPP, Zlib-IPP, and ZSTD. In this session, we compare the characteristics of those algorithms and implementations by running micro workloads as well as end-to-end workloads on different generations of Intel x86 platforms and disks.
The goal is to help big data software engineers choose the proper compression/decompression codecs for their applications; we will also present methodologies for measuring and tuning the performance bottlenecks of typical Apache Spark workloads.
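For reference, the codec is a per-concern configuration in Spark; a minimal sketch (the specific values are just one possible trade-off):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # codec for shuffle outputs, spills, and broadcasts
    .config("spark.io.compression.codec", "zstd")
    # codec for Parquet files written by Spark SQL
    .config("spark.sql.parquet.compression.codec", "snappy")
    .getOrCreate()
)
```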
Hive Bucketing in Apache Spark with Tejas Patil - Databricks
Bucketing is a partitioning technique that can improve performance in certain data transformations by avoiding data shuffling and sorting. The general idea of bucketing is to partition, and optionally sort, the data based on a subset of columns while it is written out (a one-time cost), while making successive reads of the data more performant for downstream jobs if the SQL operators can make use of this property. Bucketing can enable faster joins (i.e. single stage sort merge join), the ability to short circuit in FILTER operation if the file is pre-sorted over the column in a filter predicate, and it supports quick data sampling.
In this session, you’ll learn how bucketing is implemented in both Hive and Spark. In particular, Patil will describe the changes in the Catalyst optimizer that enable these optimizations in Spark for various bucketing scenarios. Facebook’s performance tests have shown bucketing to make Spark 3-5x faster when the optimization is enabled, and migrating sorted and bucketed workloads to Spark has resulted in 2-3x savings compared to Hive. You’ll also hear about real-world applications of bucketing, like loading cumulative tables with daily deltas, and the characteristics that help identify suitable candidate jobs that can benefit from bucketing.
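A minimal sketch of writing a bucketed table in Spark (the one-time cost the abstract mentions; table and column names are illustrative):

```python
# Later equi-joins on user_id between tables bucketed the same way can
# use a single-stage sort-merge join with no shuffle or sort.
(df.write
   .bucketBy(64, "user_id")
   .sortBy("user_id")
   .mode("overwrite")
   .saveAsTable("events_bucketed"))  # bucketing requires saveAsTable
```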
Apache Arrow at DataEngConf Barcelona 2018 - Wes McKinney
Wes McKinney is a leading open source developer who created Python's pandas library and now leads the Apache Arrow project. Apache Arrow is an open standard for in-memory analytics that aims to improve data sharing and reuse across systems by defining a common columnar data format and memory layout. It allows data to be accessed and algorithms to be reused across different programming languages with near-zero data copying. Arrow is being integrated into various data systems and is working to expand its computational libraries and language support.
Apache Arrow: Present and Future @ ScaledML 2020 - Wes McKinney
This document discusses Apache Arrow, an open source project that provides cross-language data structures and algorithms for efficient data analytics. It summarizes the history and goals of Arrow, provides examples of how it has been adopted, and outlines ongoing development initiatives. Key points include that Arrow aims to accelerate data processing by standardizing columnar data formats and protocols, it has seen widespread adoption with over 50M installs in 2019, and active areas of work include the C++ development platform and Arrow Flight RPC framework.
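A tiny sketch of the cross-system reuse both talks describe: the same columnar data moves between pandas and any Arrow-aware system with minimal copying:

```python
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"id": [1, 2, 3], "score": [0.5, 0.9, 0.1]})
table = pa.Table.from_pandas(df)  # pandas -> Arrow columnar memory
back = table.to_pandas()          # Arrow -> pandas

# `table` can now be handed to any Arrow-aware engine or serialized
# over IPC/Flight without converting to a row-oriented format.
```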
Today, data lakes are widely used and have become extremely affordable as data volumes have grown. However, they are only meant for storage and by themselves provide no direct value. With up to 80% of data stored in the data lake today, how do you unlock the value of the data lake? The value lies in the compute engine that runs on top of a data lake.
Join us for this webinar where Ahana co-founder and Chief Product Officer Dipti Borkar will discuss how to unlock the value of your data lake with the emerging Open Data Lake analytics architecture.
Dipti will cover:
- Open Data Lake analytics - what it is and what use cases it supports
- Why companies are moving to an open data lake analytics approach
- Why the open source data lake query engine Presto is critical to this approach
Enterprise guide to building a Data Mesh - Sion Smith
Making Data Mesh simple, open source, and available to all: without vendor lock-in, without complex tooling, and using an approach centered on 'specifications', existing tools, and a baked-in 'domain' model.
Session 8 - Creating Data Processing Services | Train the Trainers Program - FIWARE
In this technical session for Local Experts in Data Sharing (LEBDs), we explain how to create the data processing services that are key to i4Trust.
ACM TechTalks: Apache Arrow and the Future of Data Frames - Wes McKinney
Wes McKinney gave a talk on Apache Arrow and the future of data frames. He discussed how Arrow aims to standardize columnar data formats and reduce inefficiencies in data processing. It defines an efficient binary format for transferring data between systems and programming languages. As more tools support Arrow natively, it will become more efficient to process data directly in Arrow format rather than converting between data structures. Arrow is gaining adoption in popular data tools like Spark, BigQuery, and InfluxDB to improve performance.
Present and future of unified, portable, and efficient data processing with A... - DataWorks Summit
The world of big data involves an ever-changing field of players. Much as SQL stands as a lingua franca for declarative data analysis, Apache Beam aims to provide a portable standard for expressing robust, out-of-order data processing pipelines in a variety of languages across a variety of platforms. In a way, Apache Beam is a glue that can connect the big data ecosystem together; it enables users to "run any data processing pipeline anywhere."
This talk will briefly cover the capabilities of the Beam model for data processing and discuss its architecture, including the portability model. We’ll focus on the present state of the community and the current status of the Beam ecosystem. We’ll cover the state of the art in data processing and discuss where Beam is going next, including completion of the portability framework and the Streaming SQL. Finally, we’ll discuss areas of improvement and how anybody can join us on the path of creating the glue that interconnects the big data ecosystem.
Speaker
Davor Bonaci, Apache Software Foundation; Simbly, V.P. of Apache Beam; Founder/CEO at Operiant
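A minimal Beam sketch of the portability idea: the same pipeline code runs unchanged on different runners (the local DirectRunner by default; Flink, Spark, or Dataflow via configuration):

```python
import apache_beam as beam

with beam.Pipeline() as p:  # DirectRunner unless configured otherwise
    (p
     | beam.Create(["a b", "a c"])
     | beam.FlatMap(str.split)
     | beam.combiners.Count.PerElement()
     | beam.Map(print))
```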
Day 13 - Creating Data Processing Services | Train the Trainers Program - FIWARE
In this technical session for Local Experts in Data Sharing (LEBDs), we explain how to create the data processing services that are key to i4Trust.
Belgrade R - Intro to H2O and Deep Water - Sri Ambati
The document summarizes a presentation given by Joe Chow on H2O at the BelgradeR Meetup. The agenda covers an introduction to H2O the company, why its machine learning platform is useful, H2O's distributed algorithms and its interfaces for R, Python, and Flow, and Deep Water for distributed deep learning on GPUs with TensorFlow, MXNet, or Caffe. It also discusses the latest developments, such as the xgboost integration and automatic machine learning, with real-world examples and demos illustrating how to use H2O from R, Python, and its web interface.
Presto: Query Anything - Data Engineer’s perspective - Alluxio, Inc.
Data Orchestration Summit
www.alluxio.io/data-orchestration-summit-2019
November 7, 2019
Presto: Query Anything - Data Engineer’s perspective
Speakers:
Kamil Bajda-Pawlikowski, Starburst, Presto Company
Martin Traverso, Presto Software Foundation
For more Alluxio events: https://www.alluxio.io/events/
The annual review session by the AMIS team on their findings, interpretations and opinions regarding news, trends, announcements and roadmaps around Oracle's product portfolio.
Custom application development according to Oracle is primarily relevant for extending SaaS applications and creating customer experiences. The currently recommended approach for building graphical user interfaces (on web and mobile) is low-code Visual Builder with high-code JET injections when required. An alternative low-code stack is available from Oracle in the form of APEX. This slide set discusses the above as well as ADF and Forms. It then introduces Digital Assistant, talks about the state and future of Java, and concludes with CI/CD and DevOps. As presented on November 5th, 2018 at AMIS HQ, Nieuwegein, The Netherlands.
Big Data Analytics from Azure Cloud to Power BI Mobile - Roy Kim
This document discusses using Azure services for big data analytics and data insights. It provides an overview of Azure services like Azure Batch, Azure Data Lake, Azure HDInsight and Power BI. It then describes a demo solution that uses these Azure services to analyze job posting data, including collecting data using a .NET application, storing in Azure Data Lake Store, processing with Azure Data Lake Analytics and Azure HDInsight, and visualizing results in Power BI. The presentation includes architecture diagrams and discusses implementation details.
A talk shared at a meetup of the AWS Taiwan User Group.
The registration page: https://bityl.co/7yRK
The promotion page: https://www.facebook.com/groups/awsugtw/permalink/4123481584394988/
Databricks Meetup @ Los Angeles Apache Spark User Group - Paco Nathan
This document summarizes a presentation on Apache Spark and Spark Streaming. It provides an overview of Spark, describing it as an in-memory cluster computing framework. It then discusses Spark Streaming, explaining that it runs streaming computations as small batch jobs to provide low latency processing. Several use cases for Spark Streaming are presented, including from companies like Stratio, Pearson, Ooyala, and Sharethrough. The presentation concludes with a demonstration of Python Spark Streaming code.
2022-06-23 Apache Arrow and DataFusion: Changing the Game for implementing Da... - Andrew Lamb
DataFusion is an extensible and embeddable query engine, written in Rust, used to create modern, fast, and efficient data pipelines, ETL processes, and database systems.
This presentation explains where it fits into the data ecosystem and how it helps you implement your system in Rust.
Teaching Apache Spark: Demonstrations on the Databricks Cloud Platform - Yao Yao
Yao Yao, Mooyoung Lee
https://github.com/yaowser/learn-spark/tree/master/Final%20project
https://www.youtube.com/watch?v=IVMbSDS4q3A
https://www.academia.edu/35646386/Teaching_Apache_Spark_Demonstrations_on_the_Databricks_Cloud_Platform
https://www.slideshare.net/YaoYao44/teaching-apache-spark-demonstrations-on-the-databricks-cloud-platform-86063070/
Apache Spark is a fast and general engine for big data analytics processing with libraries for SQL, streaming, and advanced analytics
Cloud Computing, Structured Streaming, Unified Analytics Integration, End-to-End Applications
Apache Spark 2.0 set the architectural foundations of Structure in Spark, Unified high-level APIs, Structured Streaming, and the underlying performant components like Catalyst Optimizer and Tungsten Engine. Since then the Spark community has continued to build new features and fix numerous issues in releases Spark 2.1 and 2.2.
Continuing forward in that spirit, the upcoming release of Apache Spark 2.3 has made similar strides too, introducing new features and resolving over 1300 JIRA issues. In this talk, we want to share with the community some salient aspects of the soon-to-be-released Spark 2.3 features (a brief sketch of continuous processing follows the list):
• Kubernetes Scheduler Backend
• PySpark Performance and Enhancements
• Continuous Structured Streaming Processing
• DataSource v2 APIs
• Structured Streaming v2 APIs
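Of the features listed, continuous processing is the most visible API change; a hedged sketch (supported only for map-like operations and certain sources/sinks, with the rate source used here for illustration):

```python
# Same Structured Streaming code, but a continuous trigger: records are
# processed as they arrive, for ~millisecond end-to-end latency. The
# "1 second" is the checkpoint interval, not a batch size.
query = (spark.readStream.format("rate").load()
         .writeStream
         .format("console")
         .trigger(continuous="1 second")
         .start())
```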
Similar to Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci... - Wes McKinney
The document discusses the future of composable data systems and provides an overview from Wes McKinney. Some key points:
- Composable data systems are designed to be modular and reusable across different components through open standards and protocols. This allows new engines to be developed more easily.
- The data landscape is shifting to an era of composability, where monolithic systems will be replaced by modular, reusable pieces.
- Areas of focus for composable systems include execution engines, query interfaces, storage protocols, and optimization.
- Projects like Apache Arrow, Ibis, Substrait, and modular engines like DuckDB, DataFusion, and Velox are moving the industry toward composability.
Apache Arrow Flight: A New Gold Standard for Data Transport - Wes McKinney
This document discusses how structured data is often moved inefficiently between systems, causing waste. It introduces Apache Arrow, an open standard for in-memory data, and how Arrow can help make data movement more efficient. Systems like Snowflake and BigQuery are now using Arrow to help speed up query result fetching by enabling zero-copy data transfers and sharing file formats between query processing and storage.
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future - Wes McKinney
Wes McKinney gave a presentation on the past, present, and future of Python for data analysis. He discussed the origins and development of pandas over the past 12 years from the first open source release in 2009 to the current state. Key points included pandas receiving its first formal funding in 2019, its large community of contributors, and factors driving Python's growth for data science like its package ecosystem and education. McKinney also addressed early concerns about Python and looked to the future, highlighting projects like Apache Arrow that aim to improve performance and interoperability.
Apache Arrow: Leveling Up the Analytics Stack - Wes McKinney
This document discusses the development of Apache Arrow, an open source in-memory data format designed for efficient analytical data processing on modern hardware. It provides a brief history of big data and analytics technologies leading to the need for Arrow. Key points about Arrow include that it aims to eliminate data serialization, enables code sharing across languages, and has over 400 contributors working in 11 programming languages. Notable subcomponents include DataFusion, Gandiva, and Plasma, and development is supported by organizations like Ursa Labs.
Apache Arrow Workshop at VLDB 2019 / BOSS Session - Wes McKinney
A technical deep dive for database system developers into the Arrow columnar format, binary protocol, C++ development platform, and Arrow Flight RPC.
See demo Jupyter notebooks at https://github.com/wesm/vldb-2019-apache-arrow-workshop
Apache Arrow: Leveling Up the Data Science Stack - Wes McKinney
Ursa Labs builds cross-language libraries like Apache Arrow for data science. Arrow provides a columnar data format and utilities for efficient serialization, IO, and querying across programming languages. Ursa Labs contributes to Arrow and funds open source developers to grow the Arrow ecosystem. Their goal is to reduce the CPU time spent on data serialization and enable faster data analysis in languages like R.
This document discusses the history and development of Python data analysis tools, including pandas. It covers Wes McKinney's work on pandas from 2008 to the present, including the motivations for making data analysis easier and more productive. It also summarizes the development of related projects like Apache Arrow for standardizing columnar data representations to improve code reuse across languages.
Apache Arrow: Cross-language Development Platform for In-memory Data - Wes McKinney
Apache Arrow is an open standard for in-memory columnar data and an analytical data processing platform. It aims to simplify system architectures, improve interoperability between systems, and enable data and algorithms to be reused across different programming languages. Arrow provides a portable in-memory data format and computational libraries to build analytical data processing systems. It is language-independent and supports data sharing and algorithm reuse between libraries and processes via shared memory with near-zero overhead.
Apache Arrow -- Cross-language development platform for in-memory data - Wes McKinney
Wes McKinney is the creator of Python's pandas project and a primary developer of Apache Arrow, Apache Parquet, and other open-source projects. Apache Arrow is an open-source cross-language development platform for in-memory analytics that aims to improve data science tools. It provides a shared standard for memory interoperability and computation across languages through its columnar memory format and libraries. Apache Arrow has growing adoption in data science systems and is working to expand language support and computational capabilities.
Shared Infrastructure for Data Science - Wes McKinney
Wes McKinney discussed the evolution of data science tools and infrastructure over the past 10 years and a vision for the next 10 years. He argued that current data science languages like Python, R, and Julia operate in "silos" with separate implementations for data storage, processing, and analytics. However, new projects like Apache Arrow aim to break down these silos by establishing shared standards for in-memory data formats and interchange that can unite the implementations across languages. Arrow provides a portable data frame format, zero-copy interchange capabilities, and potential for high performance data access and flexible computation engines. This would allow data science work to be more portable across programming languages while improving performance.
Data Science Without Borders (JupyterCon 2017) - Wes McKinney
Talk about building shared, language-agnostic computational infrastructure for data science. Discusses the motivation and work that's happening in the Apache Arrow project to help (http://arrow.apache.org)
Memory Interoperability in Analytics and Machine Learning - Wes McKinney
Wes McKinney gave a talk on Apache Arrow, an open source project for memory interoperability between analytics and machine learning systems. Arrow provides efficient columnar memory structures and zero-copy sharing of data between applications. It defines common data types and schemas that can be used across programming languages. Arrow is implemented in C++ and provides language bindings for other languages like Python. It aims to improve performance for tasks like data loading, preprocessing, modeling and serving. Projects like pandas, Spark and Ray are exploring using Arrow internally for more efficient data handling.
Raising the Tides: Open Source Analytics for Data Science - Wes McKinney
The document discusses trends in open source analytics for data science. It notes that industry giants are opening core AI and machine learning technologies. There is also open source "disruption" in data science languages and tools. Two Sigma aims to build a collaborative data science platform through open source contributions to scale access to data and computational capabilities while enhancing productivity and collaboration. Two Sigma participates in open source to drive innovation, increase value of proprietary systems, raise awareness of challenges at scale, and attract talent. Areas of investment include Apache Arrow, Parquet, Pandas, and projects for resource management, distributed computing, and collaboration.
Improving Python and Spark (PySpark) Performance and Interoperability - Wes McKinney
Slides from Spark Summit East 2017 — February 9, 2017 in Boston. Discusses ongoing development work to accelerate Python-on-Spark performance using Apache Arrow and other tools
Python Data Wrangling: Preparing for the Future - Wes McKinney
The document is a slide deck for a presentation on Python data wrangling and the future of the pandas project. It discusses the growth of the Python data science community and key projects like NumPy, pandas, and scikit-learn that have contributed to pandas' popularity. It outlines some issues with the current pandas codebase and proposes a new C++-based core called libpandas for pandas 2.0 to improve performance and interoperability. Benchmark results show serialization formats like Arrow and Feather outperforming pickle and CSV for transferring data.
Wes McKinney gave the keynote presentation at PyCon APAC 2016 in Seoul. He discussed his work on Python data analysis tools like pandas, Apache Arrow, and Feather. He also talked about open source sustainability and governance. McKinney is working on the second edition of his book Python for Data Analysis, which is scheduled for release in 2017.
This document discusses Apache Arrow, an open source project that aims to standardize in-memory data representations to enable efficient data sharing across systems. It summarizes Arrow's goals of improving performance by 10-100x on many workloads through a common data layer, reducing serialization overhead. The document outlines Arrow's language bindings for Java, C++, Python, R, and Julia and efforts to integrate Arrow with systems like Spark, Drill and Impala to enable faster analytics. It encourages involvement in the Apache Arrow community.
High Performance Python on Apache Spark - Wes McKinney
This document contains the slides from a presentation given by Wes McKinney on high performance Python on Apache Spark. The presentation discusses why Python is an important and productive language, defines what is meant by "high performance Python", and explores techniques for building fast Python software such as embracing limitations of the Python interpreter and using native data structures and compiled extensions where needed. Specific examples are provided around control flow, reading CSV files, and the importance of efficient in-memory data structures.
Python Data Ecosystem: Thoughts on Building for the Future - Wes McKinney
Wes McKinney gives a presentation on the Python data ecosystem and building open source communities. He discusses his background working on Python data tools like pandas and Apache projects. McKinney emphasizes the importance of transparency, consensus building, and valuing all contributions when developing open source software. He also examines challenges in Python packaging and sees opportunities in building bridges between data science languages and tools for analyzing new data types and storage technologies.
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...Fwdays
.NET 8 brought a lot of improvements for developers and maturity to the Azure serverless container ecosystem. So, this talk will cover these changes and explain how you can apply them to your projects. Another reason for this talk is the re-invention of Serverless from a DevOps perspective as a Platform Engineering trend with Backstage and the recent Radius project from Microsoft. So now is the perfect time to look at developer productivity tooling and serverless apps from Microsoft's perspective.
This PDF delves into the aspects of information security from a forensic perspective, focusing on privacy leaks. It provides insights into the methods and tools used in forensic investigations to uncover and mitigate privacy breaches in mobile and cloud environments.
Retrieval Augmented Generation Evaluation with RagasZilliz
Retrieval Augmented Generation (RAG) enhances chatbots by incorporating custom data in the prompt. Using large language models (LLMs) as judge has gained prominence in modern RAG systems. This talk will demo Ragas, an open-source automation tool for RAG evaluations. Christy will talk about and demo evaluating a RAG pipeline using Milvus and RAG metrics like context F1-score and answer correctness.
Cracking AI Black Box - Strategies for Customer-centric Enterprise ExcellenceQuentin Reul
The democratization of Generative AI is ushering in a new era of innovation for enterprises. Discover how you can harness this powerful technology to deliver unparalleled customer value and securing a formidable competitive advantage in today's competitive market. In this session, you will learn how to:
- Identify high-impact customer needs with precision
- Harness the power of large language models to address specific customer needs effectively
- Implement AI responsibly to build trust and foster strong customer relationships
Whether you're at the early stages of your AI journey or looking to optimize existing initiatives, this session will provide you with actionable insights and strategies needed to leverage AI as a powerful catalyst for customer-driven enterprise success.
Keynote : Presentation on SASE TechnologyPriyanka Aash
Secure Access Service Edge (SASE) solutions are revolutionizing enterprise networks by integrating SD-WAN with comprehensive security services. Traditionally, enterprises managed multiple point solutions for network and security needs, leading to complexity and resource-intensive operations. SASE, as defined by Gartner, consolidates these functions into a unified cloud-based service, offering SD-WAN capabilities alongside advanced security features like secure web gateways, CASB, and remote browser isolation. This convergence not only simplifies management but also enhances security posture and application performance across global networks and cloud environments. Discover how adopting SASE can streamline operations and fortify your enterprise's digital transformation strategy.
The Zaitechno Handheld Raman Spectrometer is a powerful and portable tool for rapid, non-destructive chemical analysis. It utilizes Raman spectroscopy, a technique that analyzes the vibrational fingerprint of molecules to identify their chemical composition. This handheld instrument allows for on-site analysis of materials, making it ideal for a variety of applications, including:
Material identification: Identify unknown materials, minerals, and contaminants.
Quality control: Ensure the quality and consistency of raw materials and finished products.
Pharmaceutical analysis: Verify the identity and purity of pharmaceutical compounds.
Food safety testing: Detect contaminants and adulterants in food products.
Field analysis: Analyze materials in the field, such as during environmental monitoring or forensic investigations.
The Zaitechno Handheld Raman Spectrometer is easy to use and features a user-friendly interface. It is compact and lightweight, making it ideal for field applications. With its rapid analysis capabilities, the Zaitechno Handheld Raman Spectrometer can help you improve efficiency and productivity in your research or quality control workflows.
Keynote : AI & Future Of Offensive SecurityPriyanka Aash
In the presentation, the focus is on the transformative impact of artificial intelligence (AI) in cybersecurity, particularly in the context of malware generation and adversarial attacks. AI promises to revolutionize the field by enabling scalable solutions to historically challenging problems such as continuous threat simulation, autonomous attack path generation, and the creation of sophisticated attack payloads. The discussions underscore how AI-powered tools like AI-based penetration testing can outpace traditional methods, enhancing security posture by efficiently identifying and mitigating vulnerabilities across complex attack surfaces. The use of AI in red teaming further amplifies these capabilities, allowing organizations to validate security controls effectively against diverse adversarial scenarios. These advancements not only streamline testing processes but also bolster defense strategies, ensuring readiness against evolving cyber threats.
Discovery Series - Zero to Hero - Task Mining Session 1DianaGray10
This session is focused on providing you with an introduction to task mining. We will go over different types of task mining and provide you with a real-world demo on each type of task mining in detail.
The History of Embeddings & Multimodal EmbeddingsZilliz
Frank Liu will walk through the history of embeddings and how we got to the cool embedding models used today. He'll end with a demo on how multimodal RAG is used.
The Challenge of Interpretability in Generative AI Models.pdfSara Kroft
Navigating the intricacies of generative AI models reveals a pressing challenge: interpretability. Our blog delves into the complexities of understanding how these advanced models make decisions, shedding light on the mechanisms behind their outputs. Explore the latest research, practical implications, and ethical considerations, as we unravel the opaque processes that drive generative AI. Join us in this insightful journey to demystify the black box of artificial intelligence.
Dive into the complexities of generative AI with our blog on interpretability. Find out why making AI models understandable is key to trust and ethical use and discover current efforts to tackle this big challenge.
Redefining Cybersecurity with AI CapabilitiesPriyanka Aash
In this comprehensive overview of Cisco's latest innovations in cybersecurity, the focus is squarely on resilience and adaptation in the face of evolving threats. The discussion covers the imperative of tackling Mal information, the increasing sophistication of insider attacks, and the expanding attack surfaces in a hybrid work environment. Emphasizing a shift towards integrated platforms over fragmented tools, Cisco introduces its Security Cloud, designed to provide end-to-end visibility and robust protection across user interactions, cloud environments, and breaches. AI emerges as a pivotal tool, from enhancing user experiences to predicting and defending against cyber threats. The blog underscores Cisco's commitment to simplifying security stacks while ensuring efficacy and economic feasibility, making a compelling case for their platform approach in safeguarding digital landscapes.
Self-Healing Test Automation Framework - HealeniumKnoldus Inc.
Revolutionize your test automation with Healenium's self-healing framework. Automate test maintenance, reduce flakes, and increase efficiency. Learn how to build a robust test automation foundation. Discover the power of self-healing tests. Transform your testing experience.
Self-Healing Test Automation Framework - Healenium
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
1. Apache Arrow: Open Source Standard
Becomes Enterprise Necessity
March 3, 2022
Subsurface Live Winter 2022
Wes McKinney
@wesmckinn
2. 2
About Me
● Apache Arrow co-creator, PMC member
● pandas creator
● Voltron Data co-founder, CTO
● Previously at Ursa Labs, Two Sigma, Cloudera, AQR
● Author of Python for Data Analysis
3. 3
Apache Arrow
Multi-language toolbox for accelerated
data interchange and in-memory processing
● Founded in 2016 by a group of open source developers
● Enables unification of database and data science tech stacks
● Thriving user and developer community
● Provides implementations or bindings in twelve languages
● Adopted by numerous projects and products in the ecosystem
4. 4
Voltron Data
● Creating a unified foundation for the future of
analytical computing with Apache Arrow
● Founded in 2021
● Raised $110M in seed and Series A funding
● Leading corporate contributor to Arrow
We’re hiring!
voltrondata.com/careers
6. 6
Project and Ecosystem Growth
● Sustained growth in users, contributors, applications
● Arrow version 7.0.0 (24th major release) in February 2022
≈800 Arrow contributors
>50M monthly PyArrow downloads
7. 7
Arrow Flight SQL
Next-generation standard for data access using SQL
● Adds SQL semantics to Arrow Flight
● Enables ODBC/JDBC-style data access at the speed of Flight
● Reduces implementation complexity for developers
● Shipped in Arrow version 7.0.0 (February 2022)
○ C++ and Java implementations available
○ Additional development and documentation is ongoing 🚧
8. 8
Arrow C++ Query Engine
Work is ongoing to implement comprehensive
query execution capabilities in the Arrow C++ library
● Common scalar functions ✔
● Common grouped aggregate functions ✔
● Common joins ✔
● Performance and efficiency improvements 🚧
● Window functions 🚧
● Additional join types 🚧
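A hedged Python-side sketch of the same engine's capabilities as surfaced in pyarrow 7.0 (scalar compute functions and grouped aggregation over tables):

```python
import pyarrow as pa
import pyarrow.compute as pc

t = pa.table({"key": ["a", "b", "a"], "val": [1, 2, 3]})

total = pc.sum(t["val"])                                # scalar aggregate
by_key = t.group_by("key").aggregate([("val", "sum")])  # grouped aggregate
```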
9. 9
Substrait
A cross-language, interoperable specification
for data compute operations
○ A standard, flexible way for APIs and engines to share the
representation of analytics computations
■ Produced by APIs (dplyr, Ibis, SQL, …)
■ Consumed by engines (Arrow C++ engine, DuckDB, …)
○ Work is underway to implement producers and consumers 🚧
10. 10
Ibis
A high-level Python API for data analysis
● Fluent pandas-like syntax
● Can express virtually any SQL query
● Supports modular backends for querying systems including
PostgreSQL, MySQL, SQLite, Impala, ClickHouse, BigQuery
● Work is underway to produce Substrait plans 🚧
○ Enabling Ibis queries to run on the Arrow C++ engine
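A hedged Ibis sketch (an unbound table, so no database is contacted; a backend would compile this expression to SQL or, eventually, a Substrait plan):

```python
import ibis

t = ibis.table([("city", "string"), ("fare", "float64")], name="trips")
expr = (t.group_by("city")
         .aggregate(avg_fare=t.fare.mean()))
# Handing `expr` to a backend (PostgreSQL, BigQuery, ...) executes it there.
```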
12. 12
Challenge: Obstacles to Interoperability
● Enterprise data tools and workflows often incorporate multiple
languages, query engines, storage systems, and clouds.
● Arrow provides standard formats, interfaces, and frameworks
to enable efficient integration of these diverse components.
13. 13
Case Study: Microsoft
Microsoft uses Arrow in their Magpie data science middleware
● Unifies multiple cloud and
database backends into a
single end user interface
● Stores intermediate results
as Arrow Tables
● Uses Arrow Flight to move
data between systems
Source: http://cidrdb.org/cidr2021/papers/cidr2021_paper08.pdf
14. 14
Case Study: Streamlit
Streamlit uses Arrow to move data from Python to JavaScript
● Reduces server–browser
data transfer times
● Simplifies addition of new
features to Streamlit
● Allowed deletion of >1000
lines of custom code
Source: https://blog.streamlit.io/all-in-on-apache-arrow/
15. 15
Challenge: Slow Data Access Protocols
● Legacy data transport protocols are often slowed by
serialization and deserialization overhead and other causes.
● The Arrow columnar data format and the Arrow Flight
framework enable fast, serialization-free data transport.
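A hedged client-side sketch of Flight from Python (the server location and dataset path are illustrative):

```python
import pyarrow.flight as flight

client = flight.connect("grpc://data-service:8815")
info = client.get_flight_info(flight.FlightDescriptor.for_path("trips"))

# Record batches stream over gRPC with no row-oriented (de)serialization.
reader = client.do_get(info.endpoints[0].ticket)
table = reader.read_all()  # a pyarrow.Table
```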
16. 16
Case Study: Snowflake
Snowflake uses Arrow in its JDBC client, Python client,
and Spark connector
● Eliminates serialization
between columnar and row-
oriented data formats
● Achieves 4 – 10x reduction
in fetch times
Figure 1. JDBC fetch performance benchmark for JDBC client version 3.9.x (without Arrow) versus 3.11.0 (with Arrow)
Source: https://www.snowflake.com/blog/fetching-query-results-from-snowflake-just-got-a-lot-faster-with-apache-arrow/
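On the Python side, a hedged sketch of the Arrow-native fetch (assumes the snowflake-connector-python package; credentials are placeholders):

```python
import snowflake.connector

conn = snowflake.connector.connect(user="...", password="...", account="...")
cur = conn.cursor()
cur.execute("SELECT * FROM trips")
# Results arrive in Arrow format, skipping row-oriented conversion.
table = cur.fetch_arrow_all()
```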
17. 17
Case Study: Google Cloud Platform
Google Cloud uses Arrow in its BigQuery Storage API
● Eliminates or speeds
serialization to multiple
data formats
● Achieves 4 – 31x reduction
in download times
● Speedup is stable across
data sizes
Source: https://medium.com/google-cloud/announcing-google-cloud-bigquery-version-1-17-0-1fc428512171
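A hedged sketch with the google-cloud-bigquery client, which can return query results directly as Arrow (using the Storage API for parallel downloads when available):

```python
from google.cloud import bigquery

client = bigquery.Client()
sql = ("SELECT name, number "
       "FROM `bigquery-public-data.usa_names.usa_1910_current` LIMIT 1000")
table = client.query(sql).to_arrow()  # a pyarrow.Table
```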
18. 18
Challenge: Limits of JVM-Based Engines
● Computing engines implemented in JVM languages including
Java and Scala often suffer from performance bottlenecks.
● Arrow columnar data structures can accelerate JVM-based
engines and enable use of pluggable high-performance
components implemented in lower-level languages like C++.
19. 19
Case Study: KNIME
KNIME uses Arrow in its
Columnar Table Backend
● Stores data more compactly in
memory, improving performance
● Eliminates the need to create
Java objects to represent table
elements, reducing GC pressure
● Enables use of shared memory
Source: https://www.knime.com/blog/improved-performance-with-new-table-backend
20. 20
Case Study: Meta
Meta uses Arrow in its Velox C++ database acceleration library
● Improves the performance of Spark and Presto jobs
● Uses Arrow columnar memory layouts for most data types
● Enables SIMD vectorized expression evaluation
Source: https://github.com/facebookincubator/velox
21. 21
Challenge: Embeddable Query Processing
● Enterprises want the flexibility to embed in-memory analytical
execution capabilities directly in business applications.
● By integrating with Arrow, embeddable engines can achieve
excellent performance, efficiency, and developer experience.
22. 22
Case Study: DataFusion
Arrow DataFusion is an extensible query execution framework
● Implemented as an embeddable Rust library
● Supports distributed execution through Ballista
● Uses Arrow as its native in-memory format
● Supports SQL and a DataFrame API
● Donated to the Apache Arrow project
Source: https://arrow.apache.org/datafusion
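A hedged sketch, assuming the `datafusion` Python bindings are installed (file and table names are illustrative):

```python
from datafusion import SessionContext

ctx = SessionContext()
ctx.register_parquet("trips", "trips.parquet")
# SQL runs on the embedded Rust engine; results come back as Arrow batches.
batches = ctx.sql("SELECT city, avg(fare) FROM trips GROUP BY city").collect()
```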
23. 23
Case Study: DuckDB
DuckDB is an in-process SQL OLAP database system
● Implemented as an embeddable C++ library
● Offers zero-copy integration with Arrow
● Can push down filters and projections into Arrow scans
● Interoperates with the Arrow Python and R APIs
Source: https://duckdb.org/2021/12/03/duck-arrow.html
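A minimal sketch of the zero-copy integration from Python (assumes the duckdb and pyarrow packages):

```python
import duckdb
import pyarrow as pa

trips = pa.table({"city": ["NYC", "SF"], "fare": [9.5, 12.0]})

con = duckdb.connect()
con.register("trips", trips)  # DuckDB scans the Arrow buffers in place
result = con.execute("SELECT city, avg(fare) FROM trips GROUP BY city").arrow()
```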
25. 25
Enterprise Subscription for Arrow
● A focused set of services designed to accelerate
business success with Apache Arrow
● Offered in three editions starting with free
● Available now (March 2022)
● Learn more and sign up at
voltrondata.com/subscription
○ Professional support
○ Deployment assistance
○ Content and events
○ Private consultations
Trusted by: