This talk explains how Spark 3.0 can improve the performance of SQL applications. Spark 3.0 provides many performance features, such as dynamic partition pruning and enhanced pushdown, each of which can improve the performance of a different type of SQL application.
Join is one of the most important and performance-critical SQL operations in most data warehouses. It is essential when we want to get insights from multiple input datasets. Over the last year, we've added a series of join optimizations internally at Facebook, and we recently started to contribute them back to upstream open source.
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro (Databricks)
Zstandard is a fast compression algorithm which you can use in Apache Spark in various ways. In this talk, I briefly summarize the evolution of Apache Spark in this area, four main use cases, their benefits, and the next steps:
1) Zstandard can optimize Spark local disk IO by compressing shuffle files significantly. This is very useful in K8s environments. It is beneficial not only when you use `emptyDir` with the `memory` medium, but also maximizes the OS cache benefit when you use shared SSDs or container-local storage. In Spark 3.2, SPARK-34390 takes advantage of Zstandard's buffer pool feature, and its performance gain is impressive, too.
2) Event log compression is another area where you can save storage cost on cloud storage like S3 and improve usability. SPARK-34503 officially switched the default event log compression codec from LZ4 to Zstandard.
3) Zstandard data file compression can give you more benefits when you use ORC/Parquet files as your input and output. Apache ORC 1.6 already supports Zstandard, and Apache Spark enables it via SPARK-33978. The upcoming Parquet 1.12 will support Zstandard compression as well.
4) Last, but not least, since Apache Spark 3.0, Zstandard has been used to serialize/deserialize MapStatus data instead of Gzip.
There is more community work underway to utilize Zstandard in Spark. For example, the Apache Avro community also supports Zstandard, and SPARK-34479 aims to support Zstandard in Spark's Avro file format in Spark 3.2.0. A configuration sketch for the first three use cases follows.
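Below is a minimal configuration sketch for the first three use cases, assuming Spark 3.x; the application name, paths, and sample DataFrame are hypothetical, while the codec and option names are the standard Spark/Parquet/ORC values:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("zstd-example") // hypothetical application name
  // 1) compress shuffle and spill blocks with Zstandard
  .config("spark.io.compression.codec", "zstd")
  // 2) compress event logs with Zstandard
  .config("spark.eventLog.enabled", "true")
  .config("spark.eventLog.compress", "true")
  .config("spark.eventLog.compression.codec", "zstd")
  .getOrCreate()

// 3) write data files compressed with Zstandard (requires ORC/Parquet
//    versions with zstd support, per the SPARK/ORC/Parquet notes above)
val df = spark.range(1000).toDF("id")
df.write.option("compression", "zstd").parquet("/tmp/zstd_parquet")
df.write.option("compression", "zstd").orc("/tmp/zstd_orc")
```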
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai (Databricks)
Catalyst is becoming one of the most important components of Apache Spark, as it underpins all the major new APIs in Spark 2.0 and later versions, from DataFrames and Datasets to Streaming. At its core, Catalyst is a general library for manipulating trees.
In this talk, Yin explores a modular compiler frontend for Spark based on this library that includes a query analyzer, optimizer, and an execution planner. Yin offers a deep dive into Spark SQL’s Catalyst optimizer, introducing the core concepts of Catalyst and demonstrating how developers can extend it. You’ll leave with a deeper understanding of how Spark analyzes, optimizes, and plans a user’s query.
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang (Databricks)
As a general computing engine, Spark can process data from various data management/storage systems, including HDFS, Hive, Cassandra and Kafka. For flexibility and high throughput, Spark defines the Data Source API, which is an abstraction of the storage layer. The Data Source API has two requirements.
1) Generality: support reading/writing most data management/storage systems.
2) Flexibility: customize and optimize the read and write paths for different systems based on their capabilities.
Data Source API V2 is one of the most important features coming with Spark 2.3. This talk will dive into the design and implementation of Data Source API V2, comparing it with Data Source API V1. We will also demonstrate how to implement a file-based data source using the Data Source API V2 to show its generality and flexibility.
Top 5 mistakes when writing Spark applications (hadooparchbook)
This document discusses common mistakes made when writing Spark applications and provides recommendations to address them. It covers issues like executors that are too small or too large, shuffle blocks exceeding size limits, data skew slowing jobs down, and excessive stages. The key recommendations are to optimize executor and partition sizes, increase the number of partitions to reduce skew, use techniques like salting to address skew, and favor transformations like reduceByKey over groupByKey to minimize shuffles and memory usage.
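As a minimal sketch of that last recommendation, with a hypothetical pair RDD in a spark-shell session:

```scala
// hypothetical input: a pair RDD of (key, count)
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// groupByKey shuffles every value across the network before aggregating
val sums1 = pairs.groupByKey().mapValues(_.sum)

// reduceByKey combines values map-side first, so far less data is shuffled
val sums2 = pairs.reduceByKey(_ + _)
```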
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ... (Spark Summit)
Apache Spark 2.1.0 boosted the performance of Apache Spark SQL due to Project Tungsten software improvements. A further 16x speedup has been achieved by using Oracle's innovations for Apache Spark SQL. This 16x improvement is made possible by Oracle's Software in Silicon accelerator offload technologies.
Apache Spark SQL in-memory performance is becoming more important due to many factors. Users are now performing more advanced SQL processing on multi-terabyte workloads. In addition, on-prem and cloud servers are getting larger physical memory, enabling these huge workloads to be stored in memory. In this talk we will look at using Spark SQL for feature creation and feature generation within pipelines for Spark ML.
This presentation will explore workloads at scale and with complex interactions. We also provide best practices and tuning suggestions to support these kinds of workloads in real applications in cloud deployments. Ideas for the next generation of the Tungsten project will also be discussed.
Designing Structured Streaming Pipelines—How to Architect Things Right (Databricks)
"Structured Streaming has proven to be the best platform for building distributed stream processing applications. Its unified SQL/Dataset/DataFrame APIs and Spark's built-in functions make it easy for developers to express complex computations. However, expressing the business logic is only part of the larger problem of building end-to-end streaming pipelines that interact with a complex ecosystem of storage systems and workloads. It is important for the developer to truly understand the business problem needs to be solved.
What are you trying to consume? Single source? Joining multiple streaming sources? Joining streaming with static data?
What are you trying to produce? What is the final output that the business wants? What type of queries does the business want to run on the final output?
When do you want it? When does the business want the data? What is the acceptable latency? Do you really need millisecond-level latency?
How much are you willing to pay for it? This is the ultimate question, and the answer significantly determines how feasible it is to solve the above questions.
These are the questions that we ask every customer in order to help them design their pipeline. In this talk, I am going to go through the decision tree of designing the right architecture for solving your problem.
SparkSQL: A Compiler from Queries to RDDs (Databricks)
SparkSQL, a module for processing structured data in Spark, is one of the fastest SQL on Hadoop systems in the world. This talk will dive into the technical details of SparkSQL spanning the entire lifecycle of a query execution. The audience will walk away with a deeper understanding of how Spark analyzes, optimizes, plans and executes a user’s query.
Speaker: Sameer Agarwal
This talk was originally presented at Spark Summit East 2017.
Hive Bucketing in Apache Spark with Tejas Patil (Databricks)
Bucketing is a partitioning technique that can improve performance in certain data transformations by avoiding data shuffling and sorting. The general idea of bucketing is to partition, and optionally sort, the data based on a subset of columns while it is written out (a one-time cost), making successive reads of the data more performant for downstream jobs if the SQL operators can make use of this property. Bucketing can enable faster joins (i.e. single-stage sort merge join), the ability to short-circuit a FILTER operation if the file is pre-sorted on the column in a filter predicate, and quick data sampling.
In this session, you'll learn how bucketing is implemented in both Hive and Spark. In particular, Patil will describe the changes in the Catalyst optimizer that enable these optimizations in Spark for various bucketing scenarios. Facebook's performance tests have shown bucketing to improve Spark performance by 3-5x when the optimization is enabled. Many tables at Facebook are sorted and bucketed, and migrating these workloads to Spark has resulted in a 2-3x savings when compared to Hive. You'll also hear about real-world applications of bucketing, like loading cumulative tables with daily deltas, and the characteristics that can help identify suitable candidate jobs that can benefit from bucketing.
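As a minimal sketch of writing a bucketed, sorted table in Spark, with a hypothetical DataFrame `df` and table name:

```scala
// One-time write cost: hash-partition rows into 64 buckets by user_id and
// pre-sort each bucket, so later equi-joins on user_id can skip the shuffle
// and sort (e.g. a single-stage sort merge join)
df.write
  .bucketBy(64, "user_id")
  .sortBy("user_id")
  .saveAsTable("events_bucketed") // bucketBy requires saveAsTable
```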
Advanced Apache Spark Meetup: Project Tungsten, Nov 12 2015 (Chris Fregly)
The document summarizes a presentation given by Chris Fregly on Project Tungsten and optimizations in Apache Spark. It discusses techniques like using off-heap memory, minimizing cache misses, and saturating I/O to sort 100 terabytes of data in Spark. The presentation also covered a recap of the "100TB GraySort challenge" where custom data structures and algorithms were used to optimize sorting and shuffling of data.
Deep Dive: Memory Management in Apache Spark (Databricks)
Memory management is at the heart of any data-intensive system. Spark, in particular, must arbitrate memory allocation between two main use cases: buffering intermediate data for processing (execution) and caching user data (storage). This talk will take a deep dive through the memory management designs adopted in Spark since its inception and discuss their performance and usability implications for the end user.
Materialized Column: An Efficient Way to Optimize Queries on Nested Columns (Databricks)
In the data warehouse area, it is common to use one or more columns of a complex type, such as a map, and put many subfields into it. This may impact query performance dramatically because: 1) it is a waste of IO, since the whole map column, which may contain tens of subfields, needs to be read, after which Spark traverses the whole map to get the value of the target key; 2) vectorized reads cannot be exploited when a nested-type column is read; and 3) filter pushdown cannot be utilized when nested columns are read. Over the last year, we have added a series of optimizations in Apache Spark to solve the above problems for Parquet.
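As a minimal sketch of the problem, assuming a hypothetical table `user_profiles` with a `map<string,string>` column named `features`:

```scala
// Even though only one key is needed, reading a map subfield like this
// forces a scan of the entire `features` column, disables the vectorized
// Parquet reader, and defeats filter pushdown on the predicate.
spark.sql("""
  SELECT features['age']
  FROM user_profiles
  WHERE features['age'] > '30'
""")
```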
You've seen the technical deep dives on Spark's Catalyst query optimizer. You understand how to fix joins and how to find common traps in a logical query plan. But what happens when you're alone with the Spark UI and the cluster goes idle for 40 minutes? How can you diagnose what's gone wrong with your query and fix it?
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi... (Databricks)
Structured Streaming has proven to be the best platform for building distributed stream processing applications. Its unified SQL/Dataset/DataFrame APIs and Spark's built-in functions make it easy for developers to express complex computations. Delta Lake, on the other hand, is the best way to store structured data because it is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads. Together, these can make it very easy to build pipelines in many common scenarios. However, expressing the business logic is only part of the larger problem of building end-to-end streaming pipelines that interact with a complex ecosystem of storage systems and workloads. It is important for the developer to truly understand the business problem that needs to be solved. Apache Spark, being a unified analytics engine doing both batch and stream processing, often provides multiple ways to solve the same problem. So understanding the requirements carefully helps you architect a pipeline that solves your business needs in the most resource-efficient manner.
In this talk, I am going to examine a number of common streaming design patterns in the context of the following questions.
WHAT are you trying to consume? What are you trying to produce? What is the final output that the business wants? What are your throughput and latency requirements?
WHY do you really have those requirements? Would solving the requirements of the individual pipeline actually solve your end-to-end business requirements?
HOW are you going to architect the solution? And how much are you willing to pay for it?
Clarity in understanding the 'what and why' of any problem automatically brings much clarity on 'how' to architect it using Structured Streaming and, in many cases, Delta Lake.
Fine Tuning and Enhancing Performance of Apache Spark Jobs (Databricks)
Apache Spark defaults provide decent performance for large data sets but leave room for significant performance gains if you are able to tune parameters based on your resources and job.
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S... (Spark Summit)
What if you could get the simplicity, convenience, interoperability, and storage niceties of an old-fashioned CSV with the speed of a NoSQL database and the storage requirements of a gzipped file? Enter Parquet.
At The Weather Company, Parquet files are a quietly awesome and deeply integral part of our Spark-driven analytics workflow. Using Spark + Parquet, we’ve built a blazing fast, storage-efficient, query-efficient data lake and a suite of tools to accompany it.
We will give a technical overview of how Parquet works and how recent improvements from Tungsten enable SparkSQL to take advantage of this design to provide fast queries by overcoming two major bottlenecks of distributed analytics: communication costs (IO bound) and data decoding (CPU bound).
Join operations in Apache Spark are often the biggest source of performance problems and even full-blown exceptions in Spark. After this talk, you will understand the two most basic methods Spark employs for joining DataFrames, down to the level of how Spark distributes the data within the cluster. You'll also find out how to work around common errors and even handle the trickiest corner cases we've encountered! After this talk, you should be able to write performant joins in Spark SQL that scale and are zippy fast!
This session will cover different ways of joining tables in Apache Spark.
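As a minimal sketch of the two basic strategies, with hypothetical DataFrames and join keys:

```scala
import org.apache.spark.sql.functions.broadcast

// Broadcast hash join: ship the small side to every executor,
// so the large side is never shuffled
val joined1 = largeDF.join(broadcast(smallDF), "id")

// Sort merge join (Spark's default for two large inputs): both sides are
// shuffled on the join key, sorted, and merged
val joined2 = largeDF.join(otherLargeDF, "id")
```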
Speaker: Vida Ha
This talk was originally presented at Spark Summit East 2017.
In a world where compute is paramount, it is all too easy to overlook the importance of storage and IO in the performance and optimization of Spark jobs.
The document discusses new features in Apache Spark 3.0, including Adaptive Query Execution (AQE) which optimizes queries during execution based on metrics. AQE allows optimizing the number of shuffle partitions dynamically based on mapper output. The new EXPLAIN format in Spark 3.0 makes query execution plans easier to read. A new tail function was also introduced to read data from the last partition of a DataFrame.
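As a small sketch of the new tail function, with a hypothetical DataFrame:

```scala
val df = spark.range(100).toDF("id")

// Dataset.tail(n) was added in Spark 3.0; it returns the last n rows,
// scanning from the last partitions of the DataFrame
val lastThree = df.tail(3) // Array[Row]
```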
Spark 2.0 is a major release of Apache Spark. This release has brought many changes to the APIs and libraries of Spark. So in this KnolX session, we will be looking at some improvements that were made in Spark 2.0. In these slides we will also get an introduction to some new features in Spark 2.0, like the SparkSession API and Structured Streaming.
Apache CarbonData & Spark Meetup
Apache Spark™ is a unified analytics engine for large-scale data processing.
CarbonData is a high-performance data solution that supports various data analytic scenarios, including BI analysis, ad-hoc SQL query, fast filter lookup on detail records, streaming analytics, and so on. CarbonData has been deployed in many enterprise production environments; in one of the largest scenarios it supports queries on a single table with 3 PB of data (more than 5 trillion records) with response times of less than 3 seconds!
Apache Spark 3.0: Overview of What's New and Why Care (Databricks)
Spark 3.0 introduces several new features and enhancements to improve performance, usability and compatibility. Key highlights include adaptive query execution, which optimizes query plans at runtime based on statistics, dynamic partition pruning to avoid unnecessary data scans, and join hints to influence join strategies. Usability is improved with richer APIs like pandas UDF enhancements and a new Structured Streaming UI. Compatibility and extensibility are enhanced with Java 11 support, Hive 3.x metastore support and Hadoop 3 support.
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F... (Databricks)
In the Big Data field, Spark SQL is an important data processing module for Apache Spark, working with structured row-based data in a majority of operators. A field-programmable gate array (FPGA) with highly customized intellectual property (IP) can bring not only better performance but also lower power consumption when accelerating the CPU-intensive segments of an application.
Jumpstart on Apache Spark 2.2 on Databricks (Databricks)
In this introductory part lecture and part hands-on workshop, you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
Agenda:
• Overview of Spark Fundamentals & Architecture
• What’s new in Spark 2.x
• Unified APIs: SparkSessions, SQL, DataFrames, Datasets
• Introduction to DataFrames, Datasets and Spark SQL
• Introduction to Structured Streaming Concepts
• Four Hands On Labs
You will use Databricks Community Edition, which will give you unlimited free access to a ~6 GB Spark 2.x local mode cluster. And in the process, you will learn how to create a cluster, navigate in Databricks, explore a couple of datasets, perform transformations and ETL, save your data as tables and parquet files, read from these sources, and analyze datasets using DataFrames/Datasets API and Spark SQL.
Level: Beginner to intermediate, not for advanced Spark users.
Prerequisite: You will need a laptop with the Chrome or Firefox browser installed and at least 8 GB of memory. Introductory or basic knowledge of Scala or Python is required, since the Notebooks will be in Scala; Python is optional.
Bio:
Jules S. Damji is an Apache Spark Community Evangelist with Databricks. He is a hands-on developer with over 15 years of experience and has worked at leading companies, such as Sun Microsystems, Netscape, LoudCloud/Opsware, VeriSign, Scalix, and ProQuest, building large-scale distributed systems. Before joining Databricks, he was a Developer Advocate at Hortonworks.
Jump Start on Apache® Spark™ 2.x with Databricks (Databricks)
Apache Spark 2.0 and subsequent releases of Spark 2.1 and 2.2 have laid the foundation for many new features and functionality. Its main three themes—easier, faster, and smarter—are pervasive in its unified and simplified high-level APIs for Structured data.
Jump Start with Apache Spark 2.0 on Databricks (Anyscale)
This document provides an agenda for a 3+ hour workshop on Apache Spark 2.x on Databricks. It includes introductions to Databricks, Spark fundamentals and architecture, and new features in Spark 2.0 like the unified APIs, plus workshops on DataFrames/Datasets, Spark SQL, and structured streaming concepts. The agenda covers lunch and breaks and is divided into hour and half-hour segments.
The Catalyst optimizer optimizes queries written in Spark SQL and the DataFrame API so they run faster. It uses both rule-based and cost-based optimization: rule-based optimization applies rules to determine query execution, while cost-based optimization generates multiple plans and selects the most efficient one. Catalyst transforms logical plans through four phases: analysis, logical optimization, physical planning, and code generation. It represents queries as trees that can be manipulated using pattern-matching rules to optimize queries.
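As a small sketch, assuming any Spark session: explain(true) prints the parsed, analyzed, and optimized logical plans plus the physical plan, so you can watch a query move through Catalyst's phases:

```scala
val df = spark.range(100).toDF("id")

// Prints, in turn: == Parsed Logical Plan ==, == Analyzed Logical Plan ==,
// == Optimized Logical Plan ==, == Physical Plan ==
df.filter("id > 10").groupBy("id").count().explain(true)
```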
Spark Summit EU 2016 Keynote - Simplifying Big Data in Apache Spark 2.0 (Databricks)
Apache Spark 2.0 was released this summer and is already being widely adopted. In this presentation Matei talks about how changes in the API have made it easier to write batch, streaming and realtime applications. The Dataset API, which is now integrated with DataFrames, makes it possible to benefit from powerful optimizations such as pushing queries into data sources, while the Structured Streaming extension to this API makes it possible to run many of the same computations in a streaming fashion automatically.
Apache Spark 2.4 comes packed with a lot of new functionalities and improvements, including the new barrier execution mode, flexible streaming sink, the native AVRO data source, PySpark’s eager evaluation mode, Kubernetes support, higher-order functions, Scala 2.12 support, and more.
Apache Spark 2.0: Faster, Easier, and Smarter (Databricks)
In this webcast, Reynold Xin from Databricks will be speaking about Apache Spark's new 2.0 major release.
The major themes for Spark 2.0 are:
- Unified APIs: Emphasis on building up higher level APIs including the merging of DataFrame and Dataset APIs
- Structured Streaming: Simplify streaming by building continuous applications on top of DataFrames, which allows us to unify streaming, interactive, and batch queries.
- Tungsten Phase 2: Speed up Apache Spark by 10X
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua... (Databricks)
Spark SQL is one of the most popular components in a big data warehouse for SQL queries in batch mode, and it allows users to process data from various data sources in a highly efficient way. However, Spark SQL is a general-purpose SQL engine and is not well designed for ad hoc queries. Intel invented an Apache Spark data source plugin called Spinach to fulfill such requirements, by leveraging user-customized indices and fine-grained data cache mechanisms.
To be more specific, Spinach defines a new Parquet-like data storage format, offering a fine-grained hierarchical cache mechanism in the unit of a “Fiber” in memory. Even existing Parquet or ORC data files can be loaded using corresponding adaptors. Data can be cached in off-heap memory to boost data loading. What's more, Spinach has extended the Spark SQL DDL to allow users to define customized indices on a relation. Currently, B+ tree and bloom filter are the first two types of indices supported. Last but not least, since Spinach resides in the process of the Spark executor, there's no extra effort in deployment. All you need to do is pick Spinach from Spark packages when launching Spark SQL.
Spinach has been deployed in Baidu's production environment since Q4 2016. It has helped several teams migrate their regular data analysis tasks from Hive or MR jobs to ad-hoc queries. In Baidu's search ads system FengChao, data engineers analyze advertising effectiveness based on several TBs of display and click log data every day. Spinach brings a 5x boost compared to original Spark SQL (version 2.1), especially in scenarios with complex searches and large data volumes. It cuts the average search cost from minutes to seconds, while adding only a 3% data size increase for a single index.
Faster Data Integration Pipeline Execution using Spark-Jobserver (Databricks)
As you may already know, the open-source Spark Job Server offers a powerful platform for managing Spark jobs, jars, and contexts, turning Spark into a much more convenient and easy-to-use service. The Spark-Jobserver can keep Spark context warmed up and readily available for accepting new jobs. At Informatica we are leveraging the Spark-Jobserver offerings to solve the data-visualization use-case.
Scaling Machine Learning Feature Engineering in Apache Spark at Facebook (Databricks)
Machine Learning feature engineering is one of the most critical workloads on Spark at Facebook and serves as a means of improving the quality of each of the prediction models we have in production. Over the last year, we've added several features in Spark core/SQL to add first-class support for Feature Injection and Feature Reaping in Spark. Feature Injection is an important prerequisite to (offline) ML training where the base features are injected/aligned with new/experimental features, with the goal of improving model performance over time. From a query engine's perspective, this can be thought of as a LEFT OUTER join between the base training table and the feature table which, if implemented naively, could get extremely expensive. As part of this work, we added native support for writing indexed/aligned tables in Spark, wherein, if the data in the base table and the injected feature can be aligned during writes, the join itself can be performed inexpensively.
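As a minimal sketch of feature injection expressed as a join, with hypothetical tables and an alignment key; the talk's optimization is about making this cheap when both sides are written pre-aligned:

```scala
// Keep every base training row, attaching the experimental feature when present
val training = baseFeaturesDF.join(
  newFeaturesDF,
  Seq("example_id"), // hypothetical join/alignment key
  "left_outer"
)
```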
Adi Polak - Light up the Spark in Catalyst by avoiding UDFs - Codemotion Mila... (Codemotion)
Processing data at scale usually means struggling with performance, strict SLAs, limited hardware, etc. I've struggled with cutting Spark SQL query run-time and found the culprit! I would like to share this culprit, and the solution, with you. Today in the world of Big Data and Spark we are processing high-volume transactions. Catalyst is the Spark SQL query optimizer, and in this talk you will learn how to fully utilize Catalyst's optimization power to make your queries as fast as possible, by pushing down actions, avoiding UDFs as much as possible, and maximizing performance.
Spark SQL under the hood - Data KRK meetup (Mikołaj Kromka)
In recent years Apache Spark has received a lot of hype in the Big Data community. It is seen as a silver bullet for all problems related to gathering, processing and analysing massive datasets. Due to its rapid evolution (do not forget that Spark is one of the most active open source projects), some of the ideas behind it seem unclear and require digging into different blog posts and presentations. During this talk we will dive into the internals of Spark SQL, look at how our queries are translated to the actual code executed on the nodes, and find different ways to debug and optimize them.
Deep Dive into the New Features of Apache Spark 3.1 (Databricks)
Continuing with the objectives to make Spark faster, easier, and smarter, Apache Spark 3.1 extends its scope with more than 1500 resolved JIRAs. We will talk about the exciting new developments in Apache Spark 3.1 as well as some other major initiatives that are coming in the future. In this talk, we want to share with the community many of the more important changes, with examples and demos.
The following features are covered: the SQL features for ANSI SQL compliance, new streaming features, Python usability improvements, and the performance enhancements and new tuning tricks in the query compiler.
Deep Dive of ADBMS Migration to Apache Spark—Use Cases Sharing (Databricks)
eBay has been using an enterprise ADBMS for over a decade, and our team has been working on batch workload migration from ADBMS to Spark since 2018. We gathered many experiences and lessons during the whole migration journey (85% automated + 15% manual migration), during which we exposed many unexpected issues and gaps between ADBMS and Spark SQL. We made many decisions to fill these gaps in practice and contributed many fixes to Spark core in order to unblock ourselves. This should be an interesting and helpful session for many folks, especially data/software engineers planning and executing their own migration work. During this session we will share many of the specific issues we encountered and how we resolved or worked around them in the real migration process.
Similar to SQL Performance Improvements at a Glance in Apache Spark 3.0
The document discusses migrating a data warehouse to the Databricks Lakehouse Platform. It outlines why legacy data warehouses are struggling, how the Databricks Platform addresses these issues, and key considerations for modern analytics and data warehousing. The document then provides an overview of the migration methodology, approach, strategies, and key takeaways for moving to a lakehouse on Databricks.
Data Lakehouse Symposium | Day 1 | Part 1 (Databricks)
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
Data Lakehouse Symposium | Day 1 | Part 2 (Databricks)
The document discusses the challenges of modern data, analytics, and AI workloads. Most enterprises struggle with siloed data systems that make integration and productivity difficult. The future of data lies with a data lakehouse platform that can unify data engineering, analytics, data warehousing, and machine learning workloads on a single open platform. The Databricks Lakehouse platform aims to address these challenges with its open data lake approach and capabilities for data engineering, SQL analytics, governance, and machine learning.
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop (Databricks)
In this session, learn how to quickly supplement your on-premises Hadoop environment with a simple, open, and collaborative cloud architecture that enables you to generate greater value with scaled application of analytics and AI on all your data. You will also learn five critical steps for a successful migration to the Databricks Lakehouse Platform along with the resources available to help you begin to re-skill your data teams.
Democratizing Data Quality Through a Centralized Platform (Databricks)
Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale.
At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform’s capabilities, including:
Giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal
Performing data quality validations using libraries built to work with Spark
Dynamically generating pipelines that can be abstracted away from users
Flagging data that doesn’t meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers
Exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time
Learn to Use Databricks for Data Science (Databricks)
Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever — one that provides easier access and visibility into the data, reports and dashboards built against the data, reproducibility, and insights uncovered within the data. Join us to hear how Databricks' open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale — all on one unified platform.
Why APM Is Not the Same As ML Monitoring (Databricks)
Application performance monitoring (APM) has become the cornerstone of software engineering allowing engineering teams to quickly identify and remedy production issues. However, as the world moves to intelligent software applications that are built using machine learning, traditional APM quickly becomes insufficient to identify and remedy production issues encountered in these modern software applications.
As a lead software engineer at NewRelic, my team built high-performance monitoring systems including Insights, Mobile, and SixthSense. As I transitioned to building ML Monitoring software, I found the architectural principles and design choices underlying APM to not be a good fit for this brand new world. In fact, blindly following APM designs led us down paths that would have been better left unexplored.
In this talk, I draw upon my (and my team’s) experience building an ML Monitoring system from the ground up and deploying it on customer workloads running large-scale ML training with Spark as well as real-time inference systems. I will highlight how the key principles and architectural choices of APM don’t apply to ML monitoring. You’ll learn why, understand what ML Monitoring can successfully borrow from APM, and hear what is required to build a scalable, robust ML Monitoring architecture.
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix (Databricks)
Autonomy and ownership are core to working at Stitch Fix, particularly on the Algorithms team. We enable data scientists to deploy and operate their models independently, with minimal need for handoffs or gatekeeping. By writing a simple function and calling out to an intuitive API, data scientists can harness a suite of platform-provided tooling meant to make ML operations easy. In this talk, we will dive into the abstractions the Data Platform team has built to enable this. We will go over the interface data scientists use to specify a model and what that hooks into, including online deployment, batch execution on Spark, and metrics tracking and visualization.
Stage Level Scheduling Improving Big Data and AI Integration (Databricks)
In this talk, I will dive into the stage level scheduling feature added to Apache Spark 3.1. Stage level scheduling extends upon Project Hydrogen by improving big data ETL and AI integration and also enables multiple other use cases. It is beneficial any time the user wants to change container resources between stages in a single Apache Spark application, whether those resources are CPU, Memory or GPUs. One of the most popular use cases is enabling end-to-end scalable Deep Learning and AI to efficiently use GPU resources. In this type of use case, users read from a distributed file system, do data manipulation and filtering to get the data into a format that the Deep Learning algorithm needs for training or inference and then sends the data into a Deep Learning algorithm. Using stage level scheduling combined with accelerator aware scheduling enables users to seamlessly go from ETL to Deep Learning running on the GPU by adjusting the container requirements for different stages in Spark within the same application. This makes writing these applications easier and can help with hardware utilization and costs.
There are other ETL use cases where users want to change CPU and memory resources between stages, for instance when there is data skew or the data size is much larger in certain stages of the application. In this talk, I will go over the feature details, cluster requirements, the API and use cases. I will demo how the stage level scheduling API can be used by Horovod to seamlessly go from data preparation to training using the TensorFlow Keras API on GPUs.
The talk will also touch on other new Apache Spark 3.1 functionality, such as pluggable caching, which can be used to enable faster dataframe access when operating from GPUs.
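As a minimal sketch of the Spark 3.1 stage level scheduling API, assuming a GPU-enabled cluster with dynamic allocation and a hypothetical RDD `etlOutput`:

```scala
import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}

// Request GPU containers only for the stages computed from this point on
val execReqs = new ExecutorResourceRequests().cores(4).resource("gpu", 1)
val taskReqs = new TaskResourceRequests().resource("gpu", 1)
val gpuProfile = new ResourceProfileBuilder().require(execReqs).require(taskReqs).build

// Stages derived from this RDD are scheduled with the GPU profile
val trainingInput = etlOutput.withResources(gpuProfile)
```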
Simplify Data Conversion from Spark to TensorFlow and PyTorch (Databricks)
In this talk, I would like to introduce an open-source tool built by our team that simplifies the data conversion from Apache Spark to deep learning frameworks.
Imagine you have a large dataset, say 20 GBs, and you want to use it to train a TensorFlow model. Before feeding the data to the model, you need to clean and preprocess your data using Spark. Now you have your dataset in a Spark DataFrame. When it comes to the training part, you may have the problem: How can I convert my Spark DataFrame to some format recognized by my TensorFlow model?
The existing data conversion process can be tedious. For example, to convert an Apache Spark DataFrame to a TensorFlow Dataset file format, you need to either save the Apache Spark DataFrame on a distributed filesystem in parquet format and load the converted data with third-party tools such as Petastorm, or save it directly in TFRecord files with spark-tensorflow-connector and load it back using TFRecordDataset. Both approaches take more than 20 lines of code to manage the intermediate data files, rely on different parsing syntax, and require extra attention for handling vector columns in the Spark DataFrames. In short, all these engineering frictions greatly reduced the data scientists’ productivity.
The Databricks Machine Learning team contributed a new Spark Dataset Converter API to Petastorm to simplify these tedious data conversion process steps. With the new API, it takes a few lines of code to convert a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader with default parameters.
In the talk, I will use an example to show how to use the Spark Dataset Converter to train a Tensorflow model and how simple it is to go from single-node training to distributed training on Databricks.
Scaling your Data Pipelines with Apache Spark on Kubernetes (Databricks)
There is no doubt Kubernetes has emerged as the next generation of cloud-native infrastructure to support a wide variety of distributed workloads. Apache Spark has evolved to run both Machine Learning and large-scale analytics workloads. There is growing interest in running Apache Spark natively on Kubernetes. By combining the flexibility of Kubernetes and scalable data processing with Apache Spark, you can run any data and machine learning pipelines on this infrastructure while effectively utilizing the resources at your disposal.
In this talk, Rajesh Thallam and Sougata Biswas will share how to effectively run your Apache Spark applications on Google Kubernetes Engine (GKE) and Google Cloud Dataproc, and orchestrate the data and machine learning pipelines with managed Apache Airflow on GKE (Google Cloud Composer). The following topics will be covered:
• Understanding key traits of Apache Spark on Kubernetes
• Things to know when running Apache Spark on Kubernetes, such as autoscaling
• Demonstrating analytics pipelines running on Apache Spark orchestrated with Apache Airflow on a Kubernetes cluster
Scaling and Unifying SciKit Learn and Apache Spark Pipelines (Databricks)
Pipelines have become ubiquitous, as the need for stringing multiple functions to compose applications has gained adoption and popularity. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python Scikit-Learn and Apache Spark.
Scaling pipelines at the level of simple functions is desirable for many AI applications, however is not directly supported by Ray’s parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray’s compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations.
Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.
Sawtooth Windows for Feature Aggregations (Databricks)
In this talk about Zipline, we will introduce a new type of windowing construct called a sawtooth window. We will describe various properties of sawtooth windows that we utilize to achieve online-offline consistency, while still maintaining high throughput, low read latency, and tunable write latency for serving machine learning features. We will also talk about a simple deployment strategy for correcting feature drift due to operations over change data that are not abelian groups.
We want to present multiple anti-patterns utilizing Redis in unconventional ways to get the maximum out of Apache Spark. All examples presented are tried and tested in production at scale at Adobe. The most common integration is spark-redis, which interfaces with Redis as a DataFrame backing store or as an upstream for Structured Streaming. We deviate from the common use cases to explore where Redis can plug gaps while scaling out high-throughput applications in Spark.
Niche 1 : Long Running Spark Batch Job – Dispatch New Jobs by polling a Redis Queue
· Why?
o Custom queries on top of a table; we load the data once and query N times
· Why not Structured Streaming
· Working Solution using Redis
Niche 2 : Distributed Counters
· Problems with Spark Accumulators
· Utilize Redis Hashes as distributed counters
· Precautions for retries and speculative execution
· Pipelining to improve performance
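As a minimal sketch of the distributed counter idea, assuming the Jedis client and hypothetical host and key names; note the retry caveat from the list above, since a retried or speculative task would increment the counter twice:

```scala
import redis.clients.jedis.Jedis

df.rdd.foreachPartition { rows =>
  val jedis = new Jedis("redis-host", 6379) // hypothetical endpoint
  try {
    val processed = rows.size.toLong
    // HINCRBY is atomic, so concurrent tasks can safely update one hash
    jedis.hincrBy("job:counters", "rowsProcessed", processed)
  } finally {
    jedis.close()
  }
}
```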
Re-imagine Data Monitoring with whylogs and Spark (Databricks)
In the era of microservices, decentralized ML architectures and complex data pipelines, data quality has become a bigger challenge than ever. When data is involved in complex business processes and decisions, bad data can, and will, affect the bottom line. As a result, ensuring data quality across the entire ML pipeline is both costly and cumbersome, while data monitoring is often fragmented and performed ad hoc. To address these challenges, we built whylogs, an open source standard for data logging. It is a lightweight data profiling library that enables end-to-end data profiling across the entire software stack. The library implements a language- and platform-agnostic approach to data quality and data monitoring. It can work with different modes of data operations, including streaming, batch and IoT data.
In this talk, we will provide an overview of the whylogs architecture, including its lightweight statistical data collection approach and various integrations. We will demonstrate how the whylogs integration with Apache Spark achieves large scale data profiling, and we will show how users can apply this integration into existing data and ML pipelines.
Raven: End-to-end Optimization of ML Prediction Queries (Databricks)
Machine learning (ML) models are typically part of prediction queries that consist of a data processing part (e.g., for joining, filtering, cleaning, featurization) and an ML part invoking one or more trained models. In this presentation, we identify significant and unexplored opportunities for optimization. To the best of our knowledge, this is the first effort to look at prediction queries holistically, optimizing across both the ML and SQL components.
We will present Raven, an end-to-end optimizer for prediction queries. Raven relies on a unified intermediate representation that captures both data processing and ML operators in a single graph structure.
This allows us to introduce optimization rules that
(i) reduce unnecessary computations by passing information between the data processing and ML operators
(ii) leverage operator transformations (e.g., turning a decision tree to a SQL expression or an equivalent neural network) to map operators to the right execution engine, and
(iii) integrate compiler techniques to take advantage of the most efficient hardware backend (e.g., CPU, GPU) for each operator.
We have implemented Raven as an extension to Spark’s Catalyst optimizer to enable the optimization of SparkSQL prediction queries. Our implementation also allows the optimization of prediction queries in SQL Server. As we will show, Raven is capable of improving prediction query performance on Apache Spark and SQL Server by up to 13.1x and 330x, respectively. For complex models, where GPU acceleration is beneficial, Raven provides up to 8x speedup compared to state-of-the-art systems. As part of the presentation, we will also give a demo showcasing Raven in action.
Processing Large Datasets for ADAS Applications using Apache Spark (Databricks)
Semantic segmentation is the classification of every pixel in an image/video. The segmentation partitions a digital image into multiple objects to simplify/change the representation of the image into something that is more meaningful and easier to analyze [1][2]. The technique has a wide variety of applications ranging from perception in autonomous driving scenarios to cancer cell segmentation for medical diagnosis.
Exponential growth in the datasets that require such segmentation is driven by improvements in the accuracy and quality of the sensors generating the data extending to 3D point cloud data. This growth is further compounded by exponential advances in cloud technologies enabling the storage and compute available for such applications. The need for semantically segmented datasets is a key requirement to improve the accuracy of inference engines that are built upon them.
Streamlining the accuracy and efficiency of these systems directly affects the value of the business outcome for organizations that are developing such functionalities as a part of their AI strategy.
This presentation details workflows for labeling, preprocessing, modeling, and evaluating performance/accuracy. Scientists and engineers leverage domain-specific features/tools that support the entire workflow from labeling the ground truth, handling data from a wide variety of sources/formats, developing models and finally deploying these models. Users can scale their deployments optimally on GPU-based cloud infrastructure to build accelerated training and inference pipelines while working with big datasets. These environments are optimized for engineers to develop such functionality with ease and then scale against large datasets with Spark-based clusters on the cloud.
Massive Data Processing in Adobe Using Delta Lake (Databricks)
At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile offering. At the heart of this is complex ingestion of a mix of normalized and denormalized data with various linkage scenarios, powered by a central Identity Linking Graph. This helps power various marketing scenarios that are activated in multiple platforms and channels like email, advertisements, etc. We will go over how we built a cost-effective and scalable data pipeline using Apache Spark and Delta Lake and share our experiences.
What are we storing?
Multi Source – Multi Channel Problem
Data Representation and Nested Schema Evolution
Performance Trade Offs with Various formats
Go over anti-patterns used
(String FTW)
Data Manipulation using UDFs
Writer Worries and How to Wipe them Away
Staging Tables FTW
Datalake Replication Lag Tracking
Performance Time!
[D2T2S04] Generative AI Foundation Model Training and Tuning with SageMaker (Donghwan Lee)
This session presents how to pre-train or fine-tune a Foundation Model using SageMaker Training Jobs / SageMaker JumpStart. The session introduces the following three topics:
1. Training a foundation model from scratch
2. Pre-training a foundation model using an open-source model
3. Fine-tuning a model for a specific domain
Presenters:
Miron Perel, Principal ML GTM Specialist, AWS
Kristine Pearce, Principal ML BD, AWS
Amazon DocumentDB (with MongoDB compatibility) is a fast, reliable, fully managed database service. With Amazon DocumentDB, you can easily set up, operate, and scale MongoDB-compatible databases in the cloud. In this hands-on session, you will run the same application code used with MongoDB and practice using the same drivers and tools.
An LLM-powered contract compliance application which uses the advanced RAG method Self-RAG and a Knowledge Graph together for the first time.
It provides the highest accuracy for contract compliance recorded so far in the oil and gas industry.
2. About Me – Kazuaki Ishizaki
▪ Researcher at IBM Research – Tokyo
https://ibm.biz/ishizaki
– Compiler optimization, language runtime, and parallel processing
▪ Apache Spark committer from 2018/9 (SQL module)
▪ Worked on IBM Java (now OpenJ9) from 1996
– Technical lead for Just-in-time compiler for PowerPC
▪ ACM Distinguished Member
▪ SNS
– @kiszk
– https://www.slideshare.net/ishizaki/
3. Spark 3.0
▪ The long wished-for release…
– More than 1.5 years have passed since Spark 2.4 was released
4. Spark 3.0
▪ Four Categories of Major Changes for SQL
Interactions with developers Dynamic optimizations
Catalyst improvements Infrastructure updates
5. When Was Spark 2.4 Released?
▪ The long wished-for release…
– More than 1.5 years have passed since Spark 2.4 was released
2018 November
6. What We Expected Last Year
▪ The long wished-for release…
– More than 1.5 years have passed since Spark 2.4 was released
Keynote at Spark+AI Summit 2019
2019 April
7. Spark 3.0 Looks Real
▪ The long wished-for release…
– More than 1.5 years have passed since Spark 2.4 was released
Keynote at Spark+AI Summit 2019
2019 November
8. Spark 3.0 has been released!!
▪ The long wished-for release…
– More than 1.5 years have passed since Spark 2.4 was released
Keynote at Spark+AI Summit 2019 (April, 2019). Spark 3.0.0 was released in early June, 2020.
9. Community Worked for Spark 3.0 Release
▪ 3464 issues (as of June 8th, 2020)
– New features
– Improvements
– Bug fixes
Source https://issues.apache.org/jira/projects/SPARK/versions/12339177
10. Many Many Changes for 1.5 years
▪ 3369 issues (as of May 15, 2020)
– Features
– Improvements
– Bug fixes
Hard to understand what’s new
due to many many changes
11. Many Many Changes for 1.5 years
▪ 3369 issues (as of May 15, 2020)
– Features
– Improvements
– Bug fixes
Hard to understand what’s new
due to many many changes
This session guides you to understand
what’s new for SQL performance
12. Seven Major Changes for SQL Performance
1. New EXPLAIN format
2. All types of join hints
3. Adaptive query execution
4. Dynamic partition pruning
5. Enhanced nested column pruning & pushdown
6. Improved aggregation code generation
7. New Scala and Java
13. Seven Major Changes for SQL Performance
1. New EXPLAIN format
2. All types of join hints
3. Adaptive query execution
4. Dynamic partition pruning
5. Enhanced nested column pruning & pushdown
6. Improved aggregation code generation
7. New Scala and Java
Interactions with developers
Dynamic optimizations
Catalyst improvements
Infrastructure updates
14. What is Important to Improve Performance?
▪ Understand how a query is optimized
SPARK-27395
15. What is Important to Improve Performance?
▪ Understand how a query is optimized
SPARK-27395
Easy to Read a Query Plan
16. Read a Query Plan
SPARK-27395
“SELECT key, Max(val) FROM temp WHERE key > 0 GROUP BY key HAVING max(val) > 0”
From #24759
17. Not Easy to Read a Query Plan on Spark 2.4
▪ Not easy to understand how a query is optimized
SPARK-27395
scala> val query = "SELECT key, Max(val) FROM temp WHERE key > 0 GROUP BY key HAVING max(val) > 0"
scala> sql("EXPLAIN " + query).show(false)
From #24759
Output is too long!!
== Physical Plan ==
*(2) Project [key#2, max(val)#15]
+- *(2) Filter (isnotnull(max(val#3)#18) AND (max(val#3)#18 > 0))
+- *(2) HashAggregate(keys=[key#2], functions=[max(val#3)], output=[key#2, max(val)#15,
max(val#3)#18])
+- Exchange hashpartitioning(key#2, 200)
+- *(1) HashAggregate(keys=[key#2], functions=[partial_max(val#3)], output=[key#2,
max#21])
+- *(1) Project [key#2, val#3]
+- *(1) Filter (isnotnull(key#2) AND (key#2 > 0))
+- *(1) FileScan parquet default.temp[key#2,val#3] Batched: true,
DataFilters: [isnotnull(key#2), (key#2 > 0)], Format: Parquet, Location:
InMemoryFileIndex[file:/user/hive/warehouse/temp], PartitionFilters: [], PushedFilters:
[IsNotNull(key), GreaterThan(key,0)], ReadSchema: struct<key:int,val:int>
18. Easy to Read a Query Plan on Spark 3.0
▪ Shows a query plan in a terse format, with detailed information listed separately
SPARK-27395
== Physical Plan ==
Project (8)
+- Filter (7)
+- HashAggregate (6)
+- Exchange (5)
+- HashAggregate (4)
+- Project (3)
+- Filter (2)
+- Scan parquet default.temp (1)
(1) Scan parquet default.temp [codegen id : 1]
Output: [key#2, val#3]
(2) Filter [codegen id : 1]
Input : [key#2, val#3]
Condition : (isnotnull(key#2) AND (key#2 > 0))
(3) Project [codegen id : 1]
Output : [key#2, val#3]
Input : [key#2, val#3]
(4) HashAggregate [codegen id : 1]
Input: [key#2, val#3]
(5) Exchange
Input: [key#2, max#11]
(6) HashAggregate [codegen id : 2]
Input: [key#2, max#11]
(7) Filter [codegen id : 2]
Input : [key#2, max(val)#5, max(val#3)#8]
Condition : (isnotnull(max(val#3)#8) AND
(max(val#3)#8 > 0))
(8) Project [codegen id : 2]
Output : [key#2, max(val)#5]
Input : [key#2, max(val)#5, max(val#3)#8]
scala> sql("EXPLAIN FORMATTED " + query).show(false)
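Spark 3.0 also added an explain mode argument to the Dataset API, so the same terse output is available without an EXPLAIN statement:
scala> sql(query).explain("formatted")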
19. Seven Major Changes for SQL Performance
1. New EXPLAIN format
2. All types of join hints
3. Adaptive query execution
4. Dynamic partitioning pruning
5. Enhanced nested column pruning & pushdown
6. Improved aggregation code generation
7. New Scala and Java versions
(This part: interactions with developers)
20. Only One Join Hint Can Be Used on Spark 2.4
SPARK-27225
Join type    | 2.4
Broadcast    | BROADCAST
Sort merge   | X
Shuffle hash | X
Cartesian    | X
(X = no hint available)
21. All Join Types Can Be Used as Hints
SPARK-27225
Join type    | 2.4       | 3.0
Broadcast    | BROADCAST | BROADCAST
Sort merge   | X         | SHUFFLE_MERGE
Shuffle hash | X         | SHUFFLE_HASH
Cartesian    | X         | SHUFFLE_REPLICATE_NL
Examples
SELECT /*+ SHUFFLE_HASH(a, b) */ * FROM a, b
WHERE a.a1 = b.b1
val shuffleHashJoin = aDF.hint("shuffle_hash")
  .join(bDF, aDF("a1") === bDF("b1"))
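The other hints follow the same pattern in the DataFrame API; a minimal sketch reusing aDF and bDF from the example above:
scala> val broadcastJoin = aDF.hint("broadcast").join(bDF, aDF("a1") === bDF("b1"))
scala> val sortMergeJoin = aDF.hint("shuffle_merge").join(bDF, aDF("a1") === bDF("b1"))
scala> val cartesianJoin = aDF.hint("shuffle_replicate_nl").join(bDF, aDF("a1") === bDF("b1"))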
22. Seven Major Changes for SQL Performance
1. New EXPLAIN format
2. All types of join hints
3. Adaptive query execution
4. Dynamic partitioning pruning
5. Enhanced nested column pruning & pushdown
6. Improved aggregation code generation
7. New Scala and Java versions
(This part: dynamic optimizations)
23. Automatically Tune Parameters for Join and Reduce
▪ Tunes three parameters using runtime statistics (e.g. data size):
1. Sets the number of reducers to avoid wasting memory and I/O resources
2. Selects a better join strategy to improve performance
3. Optimizes skewed joins to avoid imbalanced workloads
SPARK-23128 & SPARK-30864
Yields an 8x performance improvement for Q77 in TPC-DS,
without manually tuning properties run by run
Source: Adaptive Query Execution: Speeding Up Spark SQL at Runtime
24. A Preset Number of Reducers Is Used on Spark 2.4
▪ The number of reducers is set based on the property
spark.sql.shuffle.partitions (default: 200)
SPARK-23128 & SPARK-30864
[Diagram: the map output contains five key groups, so five reducers (Reducer 0–4) are launched, one per partition, regardless of partition size]
25. Tune the Number of Reducers on Spark 3.0
▪ Select the number of reducers to meet the given target partition
size at each reducer
SPARK-23128 & SPARK-30864
spark.sql.adaptive.enabled -> true (default: false in Spark 3.0)
spark.sql.adaptive.coalescePartitions.enabled -> true (default: false in Spark 3.0)
Three reducers for five partitions
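A minimal sketch of turning this on in Spark 3.0; the advisory partition size is an illustrative value to be tuned per workload:
scala> spark.conf.set("spark.sql.adaptive.enabled", "true")
scala> spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
scala> spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64MB") // target size per reducer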
26. Statically Selected Join Strategy on Spark 2.4
▪ Spark 2.4 decides on a sort merge join strategy using statically available information (e.g. table sizes of 100GB and 80GB)
SPARK-23128 & SPARK-30864
[Diagram: Scan table1 (100GB) and Scan table2 (80GB) each pass through Shuffle and Sort into a Sort merge Join; the data size after the Filter (???) is unknown at planning time]
27. Dynamically Change Join Strategy on Spark 3.0
▪ Spark 3.0 dynamically selects a broadcast hash join strategy using runtime information (e.g. a filtered size of 80MB)
SPARK-23128 & SPARK-30864
[Diagram: the static plan sort-merge-joins Scan table1 (100GB) and Scan table2 (80GB); at runtime the Filter output is only 80MB, so the plan is rewritten to Broadcast that side into a Broadcast hash Join]
spark.sql.adaptive.enabled -> true (default: false in Spark 3.0)
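A minimal sketch of observing this behavior (table1 and table2 are hypothetical registered tables): the static plan shows a sort merge join, and after the query actually runs, AQE may have replaced it with a broadcast hash join because the filtered side turned out to be small:
scala> spark.conf.set("spark.sql.adaptive.enabled", "true")
scala> val joined = spark.table("table1").join(spark.table("table2").where("val < 10"), "key")
scala> joined.explain() // static plan: SortMergeJoin
scala> joined.count()   // runtime statistics let AQE switch the join strategy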
28. Skewed Join is Slow on Spark 2.4
▪ The join time is dominated by processing the largest partition
SPARK-23128 & SPARK-30864
[Diagram: joining Table A and Table B; Partition 0 of Table A is far larger than Partitions 1 and 2, so its task dominates the join time]
29. Skewed Join is Faster on Spark 3.0
▪ The large partition is split into multiple partitions
SPARK-23128 & SPARK-30864
spark.sql.adaptive.enabled -> true (default: false in Spark 3.0)
spark.sql.adaptive.skewJoin.enabled -> true (default: false in Spark 3.0)
[Diagram: the skewed Partition 0 of Table A is split into Partitions 0 and 3, and the matching partition of Table B is duplicated, so the join of Table A and Table B runs as evenly sized tasks]
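A minimal sketch of enabling skew handling; the two thresholds decide when a partition counts as skewed, and the values here are illustrative rather than recommendations:
scala> spark.conf.set("spark.sql.adaptive.enabled", "true")
scala> spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
scala> spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "10")
scala> spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")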
30. Seven Major Changes for SQL Performance
1. New EXPLAIN format
2. All types of join hints
3. Adaptive query execution
4. Dynamic partitioning pruning
5. Enhanced nested column pruning & pushdown
6. Improved aggregation code generation
7. New Scala and Java versions
31. Dynamic Partitioning Pruning
▪ Avoids reading unnecessary partitions in a join operation
– by using the results of filter operations on another table
▪ A dynamic filter prevents reading unnecessary partitions
SPARK-11150
Source: Dynamic Partition Pruning in Apache Spark
Yields an 85x performance improvement for Q98 in TPC-DS (10TB)
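In Spark 3.0 this optimization is controlled by a single switch, which is on by default; shown here only to make the knob explicit:
scala> spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true") // default: true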
32. Naïve Broadcast Hash Join on Spark 2.4
▪ All of the data in the large table is read
SPARK-11150
[Diagram: the small table is filtered and Broadcast into a Broadcast hash join, while the FileScan of the large table still reads everything]
33. Prune Data with Dynamic Filter on Spark 3.0
▪ The scan of the large table reads less data, using pushdown with a dynamic filter
SPARK-11150
[Diagram: the filter result on the small table is Broadcast and also pushed down into the FileScan of the large table, so the Broadcast hash join reads only the matching partitions]
34. Example of Dynamic Partitioning Pruning
SPARK-11150
scala> spark.range(7777).selectExpr("id", "id AS key").write.partitionBy("key").saveAsTable("tableLarge")
scala> spark.range(77).selectExpr("id", "id AS key").write.partitionBy("key").saveAsTable("tableSmall")
scala> val query = "SELECT * FROM tableLarge JOIN tableSmall ON tableLarge.key = tableSmall.key AND tableSmall.id < 3"
scala> sql("EXPLAIN FORMATTED " + query).show(false)
== Physical Plan ==
* BroadcastHashJoin Inner BuildRight (8)
:- * ColumnarToRow (2)
: +- Scan parquet default.tablelarge (1)
+- BroadcastExchange (7)
+- * Project (6)
+- * Filter (5)
+- * ColumnarToRow (4)
+- Scan parquet default.tablesmall (3)
(1) Scan parquet default.tablelarge
Output [2]: [id#19L, key#20L]
Batched: true
Location: InMemoryFileIndex [file:/home/ishizaki/Spark/300RC1/spark-3.0.0-bin-hadoop2.7/spark-warehouse/tablelarge/key=0, ... 7776 entries]
PartitionFilters: [isnotnull(key#20L), dynamicpruningexpression(key#20L IN dynamicpruning#56)]
ReadSchema: struct<id:bigint>
Source: Quick Overview of Upcoming Spark 3.0
35. Seven Major Changes for SQL Performance
1. New EXPLAIN format
2. All types of join hints
3. Adaptive query execution
4. Dynamic partitioning pruning
5. Enhanced nested column pruning & pushdown
6. Improved aggregation code generation
7. New Scala and Java versions
(This part: Catalyst improvements)
36. Nested Column Pruning on Spark 2.4
▪ Column pruning reads only the necessary columns from Parquet
– but it can be applied only to limited operations (e.g. LIMIT)
SPARK-25603 & SPARK-25556
Source: #23964
[Diagram: Project + Limit reads only the _1 field of col2]
37. Limited Nested Column Pruning on Spark 2.4
▪ Column pruning reads only the necessary columns from Parquet
– Can be applied to limited operations (e.g. LIMIT)
– Cannot be applied to other operations (e.g. REPARTITION)
SPARK-25603 & SPARK-25556
Source: #23964
[Diagram: Project + Limit reads only col2._1, but Project + Repartition still reads both _1 and _2 of col2]
38. Generalize Nested Column Pruning on Spark 3.0
▪ Nested column pruning can be applied to all operators
– e.g. LIMIT, REPARTITION, …
SPARK-25603 & SPARK-25556
[Diagram: Project + Repartition now also reads only col2._1]
Source: #23964
39. Example of Nested Column Pruning
▪ Parquet only reads col2._1, as shown in ReadSchema
SPARK-25603 & SPARK-25556
scala> spark.range(1000).map(x => (x, (x, s"$x" * 10))).toDF("col1", "col2").write.parquet("/tmp/p")
scala> spark.read.parquet("/tmp/p").createOrReplaceTempView("temp")
LIMIT:
scala> sql("SELECT col2._1 FROM (SELECT col2 FROM temp LIMIT 1000000)").explain
== Physical Plan ==
CollectLimit 1000000
+- *(1) Project [col2#22._1 AS _1#28L]
   +- *(1) FileScan parquet [col2#22] ..., ReadSchema: struct<col2:struct<_1:bigint>>
REPARTITION:
scala> sql("SELECT col2._1 FROM (SELECT /*+ REPARTITION(1) */ col2 FROM temp)").explain
== Physical Plan ==
Exchange RoundRobinPartitioning(1)
+- *(1) Project [col2#5._1 AS _1#11L]
   +- *(1) FileScan parquet [col2#5] ..., PushedFilters: [], ReadSchema: struct<col2:struct<_1:bigint>>
Source: #23964
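The generalized pruning shown above is governed by one switch in Spark 3.0 (on by default); shown here only to make the knob explicit:
scala> spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true") // default: true in Spark 3.0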
40. No Nested Column Pushdown on Spark 2.4
▪ Parquet cannot apply predicate pushdown to nested fields
SPARK-25603 & SPARK-25556
scala> spark.range(1000).map(x => (x, (x, s"$x" * 10))).toDF("col1", "col2").write.parquet("/tmp/p")
scala> spark.read.parquet("/tmp/p").filter("col2._1 = 100").explain
== Physical Plan ==
*(1) Project [col1#12L, col2#13]
+- *(1) Filter (isnotnull(col2#13) && (col2#13._1 = 100))
+- *(1) FileScan parquet [col1#12L,col2#13] ..., PushedFilters: [IsNotNull(nested)], ...
Spark 2.4
[Diagram: the FileScan reads all data rows; only the Filter above it keeps rows where col2._1 = 100]
Source: #28319
41. Nested Column Pushdown on Spark 3.0
▪ Parquet applies the pushdown filter and reads only part of the columns
SPARK-25603 & SPARK-25556
scala> spark.range(1000).map(x => (x, (x, s"$x" * 10))).toDF("col1", "col2").write.parquet("/tmp/p")
scala> spark.read.parquet("/tmp/p").filter("col2._1 = 100").explain
Spark 3.0
== Physical Plan ==
*(1) Project [col1#0L, col2#1]
+- *(1) Filter (isnotnull(col2#1) AND (col2#1._1 = 100))
+- FileScan parquet [col1#0L,col2#1] ..., DataFilters: [isnotnull(col2#1), (col2#1._1 = 100)],
..., PushedFilters: [IsNotNull(col2), EqualTo(col2._1,100)], ...
[Diagram: the FileScan reads only the chunks that may include col2._1 = 100; the Filter then keeps the rows where col2._1 = 100]
Source: #28319
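As a quick sanity check on the same data, the pushdown shows up as EqualTo(col2._1,100) under PushedFilters in the plan above; the general Parquet pushdown switch (on by default) must also be enabled:
scala> spark.conf.set("spark.sql.parquet.filterPushdown", "true") // default: true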
42. Seven Major Changes for SQL Performance
1. New EXPLAIN format
2. All types of join hints
3. Adaptive query execution
4. Dynamic partitioning pruning
5. Enhanced nested column pruning & pushdown
6. Improved aggregation code generation
7. New Scala and Java versions
43. Complex Aggregation is Slow on Spark 2.4
▪ A complex query is not compiled to native code
SPARK-21870
Poor performance for Q66 in TPC-DS
Source: #20695
44. How SQL is Translated to native code
▪ In Spark, Catalyst translates a given query to Java code
▪ The HotSpot compiler in OpenJDK translates the Java code into native code
SPARK-21870
[Diagram: SQL → Spark (Catalyst Java code generation) → Java code → HotSpot → native code]
45. How SQL is Translated to native code
▪ In Spark, Catalyst translates a given query to Java code
▪ The HotSpot compiler in OpenJDK gives up generating native code for methods with more than 8,000 Java bytecode instructions
SPARK-21870
[Diagram: SQL → Spark (Catalyst Java code generation) → Java code → HotSpot → native code]
46. Making Aggregation Java Code Small
▪ In Spark, Catalyst translates a given query to Java code
▪ The HotSpot compiler in OpenJDK gives up generating native code for methods with more than 8,000 Java bytecode instructions
SPARK-21870
Catalyst splits a large Java method into small ones
to allow HotSpot to generate native code
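As a related, hedged aside: Spark also has its own guard for oversized generated code; if a compiled method would exceed this bytecode limit, Spark falls back to a non-codegen path for that stage. Lowering it toward HotSpot's 8,000-bytecode threshold is a tuning experiment, not a recommendation:
scala> sql("SET spark.sql.codegen.hugeMethodLimit=8000") // default: 65535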
47. Example of Small Aggregation Code
▪ Average function (over 100 rows) for each of 50 columns
SPARK-21870
scala> val numCols = 50
scala> val colExprs = (0 until numCols).map { i => s"id AS col$i" }
scala> spark.range(100).selectExpr(colExprs: _*).createOrReplaceTempView("temp")
scala> val aggExprs = (0 until numCols).map { i => s"AVG(col$i)" }
scala> val query = s"SELECT ${aggExprs.mkString(", ")} FROM temp"
scala> import org.apache.spark.sql.execution.debug._
scala> sql(query).debugCodegen()
Found 2 WholeStageCodegen subtrees.
== Subtree 1 / 2 (maxMethodCodeSize:3679; maxConstantPoolSize:1107(1.69% used); numInnerClasses:0) ==
...
== Subtree 2 / 2 (maxMethodCodeSize:5581; maxConstantPoolSize:882(1.35% used); numInnerClasses:0) ==
Source: PR #20965
48. Example of Small Aggregation Code
▪ Average function (over 100 rows) for each of 50 columns
SPARK-21870
scala> val numCols = 50
scala> val colExprs = (0 until numCols).map { i => s"id AS col$i" }
scala> spark.range(100).selectExpr(colExprs: _*).createOrReplaceTempView("temp")
scala> val aggExprs = (0 until numCols).map { i => s"AVG(col$i)" }
scala> val query = s"SELECT ${aggExprs.mkString(", ")} FROM temp"
scala> import org.apache.spark.sql.execution.debug._
scala> sql(query).debugCodegen()
Found 2 WholeStageCodegen subtrees.
== Subtree 1 / 2 (maxMethodCodeSize:3679; maxConstantPoolSize:1107(1.69% used); numInnerClasses:0) ==
...
== Subtree 2 / 2 (maxMethodCodeSize:5581; maxConstantPoolSize:882(1.35% used); numInnerClasses:0) ==
...
scala> sql("SET spark.sql.codegen.aggregate.splitAggregateFunc.enabled=false")
scala> sql(query).debugCodegen()
Found 2 WholeStageCodegen subtrees.
== Subtree 1 / 2 (maxMethodCodeSize:8917; maxConstantPoolSize:957(1.46% used); numInnerClasses:0) ==
...
== Subtree 2 / 2 (maxMethodCodeSize:9862; maxConstantPoolSize:728(1.11% used); numInnerClasses:0) ==
...
The SET above disables this feature; maxMethodCodeSize then exceeds HotSpot's 8,000-bytecode threshold, so these methods are no longer JIT-compiled
Source: PR #20965
49. Seven Major Changes for SQL Performance
1. New EXPLAIN format
2. All types of join hints
3. Adaptive query execution
4. Dynamic partitioning pruning
5. Enhanced nested column pruning & pushdown
6. Improved aggregation code generation
7. New Scala and Java versions
(This part: infrastructure updates)
50. Support New Versions of Languages
▪ Java 11 (the latest Long-Term-Support release of OpenJDK, supported from 2018 to 2026)
– Further optimizations in the HotSpot compiler
– Improved G1GC (for large heaps)
– Experimental new ZGC (low latency)
▪ Scala 2.12 (released in Nov. 2016)
– Redesigned to leverage new Java 8 features
SPARK-24417 & SPARK-25956
NOTE: Other class libraries are also updated
51. Takeaway
▪ Spark 3.0 improves SQL application performance
1. New EXPLAIN format
2. All types of join hints
3. Adaptive query execution
4. Dynamic partitioning pruning
5. Enhanced nested column pruning & pushdown
6. Improved aggregation code generation
7. New Scala and Java versions
Please visit https://www.slideshare.net/ishizaki/ tomorrow
if you want to see this slide again
52. Resources
▪ Introducing Apache Spark 3.0: Now available in Databricks
Runtime 7.0
– https://databricks.com/jp/blog/2020/06/18/introducing-apache-spark-3-0-now-available-in-databricks-runtime-7-0.html
▪ Now on Databricks: A Technical Preview of Databricks Runtime 7
Including a Preview of Apache Spark 3.0
– https://databricks.com/blog/2020/05/13/now-on-databricks-a-technical-preview-of-databricks-runtime-7-including-a-preview-of-apache-spark-3-0.html
▪ Quick Overview of Upcoming Spark 3.0 (in Japanese)
– https://www.slideshare.net/maropu0804/quick-overview-of-upcoming-spark-30
53. Resources…
▪ Madhukar’s Blog
– https://blog.madhukaraphatak.com/
▪ Adaptive Query Execution: Speeding Up Spark SQL at Runtime
– https://databricks.com/blog/2020/05/29/adaptive-query-execution-speeding-up-spark-sql-at-runtime.html
▪ Dynamic Partition Pruning in Apache Spark
– https://databricks.com/session_eu19/dynamic-partition-pruning-in-apache-spark