Materialized Column: An Efficient Way to Optimize Queries on Nested Columns

In data warehousing, it is common to store many subfields in one or more complex-typed columns, such as a map. This can hurt query performance dramatically because: 1) It wastes IO. The whole map column, which may contain tens of subfields, needs to be read, and Spark then traverses the whole map to get the value of the target key. 2) Vectorized read cannot be exploited when a nested-type column is read. 3) Filter pushdown cannot be utilized when nested columns are read. Over the last year, we have added a series of optimizations in Apache Spark to solve the above problems for Parquet.

Materialized Column: An Efficient Way to Optimize Queries on Nested Columns
Guo, Jun (jason.guo.vip@gmail.com)
Lead of Data Engine Team, @ByteDance

Who we are
o Data Engine team of ByteDance
o Build a one-stop OLAP platform on
which users can analyze PB-level data
by writing SQL without caring about
the underlying execution engine

What we do
o Manage Spark SQL / Presto / Hive
workloads
o Offer Open API and self-serve platform
o Optimize Spark SQL / Presto / Hive
engines
o Design data architecture for most
business lines in ByteDance

Agenda
▪ Spark SQL at ByteDance
▪ Why nested types are widely used
▪ What are the main issues of nested types
▪ Optional solutions
▪ How does Materialized Column solve these problems

Spark SQL at ByteDance
Timeline, 2016 to 2020 (stages in order):
▪ Small-scale experiments
▪ Ad-hoc workloads
▪ A few ETL pipelines in production
▪ Full production deployment
▪ Main engine in the DW area

Why nested types are widely used
▪ Event log
  ▪ A lot of new tracking events are created every day
  ▪ It is not a good idea to create a new column for each new type of event
▪ Dimension
  ▪ Dimension tables are dumped from the MySQL databases of service backends
  ▪ A service backend may add new fields on demand. These fields may not be
    helpful for now, but they may be useful in the future

Main issues for nested types
▪ Unnecessary data is read, which is a
waste of IO
▪ Vectorized read cannot be exploited when
a nested-type column is read
▪ Filter pushdown cannot be utilized
when a nested column is read
▪ Duplicated computation, e.g. JSON
parsing is CPU-intensive
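
The filter pushdown issue can be illustrated with a toy model. This is a minimal sketch, not Spark or Parquet code: it assumes that a columnar file keeps min/max statistics for each flat column in each row group, and that no comparable per-key statistics are usable for a single key inside a map. The `RowGroup` class and both scan functions are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class RowGroup:
    """A toy row group holding the values of one flat INT column."""
    ages: List[int]

    @property
    def max_age(self) -> int:
        # Stands in for the max statistic a columnar format keeps per row group.
        return max(self.ages)


def scan_flat_column(groups: List[RowGroup], threshold: int) -> Tuple[List[int], int]:
    """Evaluate `age > threshold`, skipping row groups whose statistics rule out a match."""
    matches: List[int] = []
    groups_read = 0
    for group in groups:
        if group.max_age <= threshold:
            continue  # statistics prove that no row here can match
        groups_read += 1
        matches.extend(a for a in group.ages if a > threshold)
    return matches, groups_read


def scan_map_column(groups: List[List[Dict[str, str]]], threshold: int) -> Tuple[List[int], int]:
    """Evaluate the same predicate on a map column: every row group must be read."""
    matches: List[int] = []
    groups_read = 0
    for group in groups:
        groups_read += 1  # nothing lets us skip: the key is buried inside the map
        for people in group:
            age = people.get("age")
            if age is not None and int(age) > threshold:
                matches.append(int(age))
    return matches, groups_read


flat_groups = [RowGroup([18, 20, 25]), RowGroup([31, 40, 45]), RowGroup([19, 22, 28])]
map_groups = [[{"age": str(a), "name": "n"} for a in g.ages] for g in flat_groups]

flat_result = scan_flat_column(flat_groups, 30)
map_result = scan_map_column(map_groups, 30)
```

Both scans return the same rows, but in this toy data the flat-column scan touches one row group while the map scan touches all three.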

Optional solutions – A separate table
▪ DW users design a solution to solve
these problems
▪ Maintain a new table that adds new
columns extracted from the
nested columns
▪ Downstream users query this
new table and its new columns for better
performance

Optional solutions – A separate table
▪ Pros
  ▪ Queries run on simple types, so all of the
    problems above are solved
▪ Cons
  ▪ Need to push all downstream users to
    migrate their queries / pipelines to the new
    table and new columns
  ▪ Duplicated storage and computation cost
  ▪ Cannot handle frequent subfield changes

Optional solutions – Vectorized Read on Nested Column
▪ Refactor the Parquet vectorized reader to
support vectorized read for nested types
▪ Support predicate pushdown for struct

Optional solutions – Vectorized Read on Nested Column
▪ Pros
  ▪ Enables vectorized read without any storage
    overhead
▪ Cons
  ▪ Need to refactor the vectorized readers for
    Parquet and ORC separately
  ▪ Filter pushdown for Array/Map is still not
    available
  ▪ The performance of vectorized read on
    nested types is not as good as that for simple
    types
▪ Improves performance with struct by
about 100%
▪ Improves performance with map by
about 163%

How does Materialized Column solve these problems
Original table
CREATE TABLE base_table (
  item STRING,
  count INT,
  people MAP<STRING, STRING>,
  date STRING
)
USING parquet
PARTITIONED BY (date);
Add materialized column
ALTER TABLE base_table ADD COLUMNS
(
  age INT MATERIALIZED CAST(people['age'] AS INTEGER)
);

How does Materialized Column solve these problems
Write with materialized column
explain extended insert into base_table partition(date='20201010') select 'appole', 1,
map('age','18','name','jack','gender','male')
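
The plan output from this slide is not preserved in the text. As a rough illustration of what writing with a materialized column plausibly involves, the sketch below assumes the engine evaluates the column's defining expression at insert time and stores the result next to the original map. The registry `MATERIALIZED_COLUMNS` and the helpers `cast_to_int` and `materialize` are hypothetical names for this sketch, not part of Spark.

```python
from typing import Any, Callable, Dict, Optional


def cast_to_int(value: Optional[str]) -> Optional[int]:
    """Mimic a lenient CAST(... AS INTEGER): missing or unparsable input yields None."""
    if value is None:
        return None
    try:
        return int(value)
    except ValueError:
        return None


# Hypothetical registry: materialized column name -> expression evaluated on the incoming row.
MATERIALIZED_COLUMNS: Dict[str, Callable[[Dict[str, Any]], Any]] = {
    # Corresponds to: age INT MATERIALIZED CAST(people['age'] AS INTEGER)
    "age": lambda row: cast_to_int(row["people"].get("age")),
}


def materialize(row: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the row with every materialized column filled in."""
    stored = dict(row)
    for name, expression in MATERIALIZED_COLUMNS.items():
        stored[name] = expression(row)
    return stored


incoming = {
    "item": "appole",
    "count": 1,
    "people": {"age": "18", "name": "jack", "gender": "male"},
    "date": "20201010",
}
stored_row = materialize(incoming)
```

In this sketch the writer supplies only the original columns. The extra `age` value is derived automatically, and the map is still stored unchanged.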

How does Materialized Column solve these problems
Query without materialized column rewrite
Query with materialized column rewrite
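
The query plans from this slide are not preserved in the text. The sketch below shows the general idea of such a rewrite under a simplifying assumption: any sub-expression identical to a materialized column's defining expression is replaced by a reference to that column. Expressions are modelled as nested tuples. The tuple encoding, the `MATERIALIZED` mapping, and the function names are hypothetical and do not reflect Spark's optimizer API.

```python
from typing import Set, Tuple

# Toy expression encoding (hypothetical):
#   ("col", name), ("lit", value), ("get", map_expr, key_expr),
#   ("cast", expr, type_name), ("gt", left, right)
Expr = Tuple

# Defining expression of each materialized column -> reference to that column.
MATERIALIZED = {
    ("cast", ("get", ("col", "people"), ("lit", "age")), "int"): ("col", "age"),
}


def rewrite(expr: Expr) -> Expr:
    """Replace every sub-expression that matches a materialized column definition."""
    if expr in MATERIALIZED:
        return MATERIALIZED[expr]
    # Recurse into child expressions; leave operator names and literals untouched.
    return tuple(rewrite(part) if isinstance(part, tuple) else part for part in expr)


def referenced_columns(expr: Expr) -> Set[str]:
    """Collect the names of all columns an expression reads."""
    if expr[0] == "col":
        return {expr[1]}
    columns: Set[str] = set()
    for part in expr[1:]:
        if isinstance(part, tuple):
            columns |= referenced_columns(part)
    return columns


# WHERE CAST(people['age'] AS INTEGER) > 30
original_filter = ("gt", ("cast", ("get", ("col", "people"), ("lit", "age")), "int"), ("lit", 30))
rewritten_filter = rewrite(original_filter)

# people['name'] has no materialized column, so it is left as-is.
untouched = rewrite(("get", ("col", "people"), ("lit", "name")))
```

After the rewrite, the filter reads only the flat `age` column. In a real engine this is what would make column pruning, vectorized read, and filter pushdown applicable without users changing their SQL.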

How does Materialized Column solve these problems
Test case   | Without Materialized Column rewrite | With Materialized Column rewrite | Performance | Read data size
SQL_adhoc_1 | 6.3 min / 797.6 GB                  | 3.4 min / 111.8 GB               | 85.3%↑      | 86%↓
SQL_adhoc_2 | 16.5 min / 3.2 TB                   | 5.0 min / 111.1 GB               | 230%↑       | 96.6%↓
SQL_etl_1   | 24 min / 3.7 TB                     | 9.1 min / 686.1 GB               | 130.8%↑     | 82%↓
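
The slide does not say how the two percentage columns were derived. The sketch below assumes that "Performance" is the speedup `before / after - 1` on run time, that "Read data size" is the reduction `1 - after / before` on bytes read, and that 1 TB is counted as 1024 GB. Under these assumptions the read-size figures for all three rows and the speedups for the two ad-hoc queries match the slide after rounding. The `SQL_etl_1` speedup comes out at about 163.7% rather than the reported 130.8%, so that figure may rest on a different measure or timing.

```python
GB_PER_TB = 1024  # assumption: the slide counts 1 TB as 1024 GB

# (test case, minutes before, GB read before, minutes after, GB read after)
RESULTS = [
    ("SQL_adhoc_1", 6.3, 797.6, 3.4, 111.8),
    ("SQL_adhoc_2", 16.5, 3.2 * GB_PER_TB, 5.0, 111.1),
    ("SQL_etl_1", 24.0, 3.7 * GB_PER_TB, 9.1, 686.1),
]


def speedup_percent(before_min: float, after_min: float) -> float:
    """Speedup of the rewritten query relative to the original, in percent."""
    return (before_min / after_min - 1.0) * 100.0


def read_reduction_percent(before_gb: float, after_gb: float) -> float:
    """Reduction in data read by the rewritten query, in percent."""
    return (1.0 - after_gb / before_gb) * 100.0


summary = {
    name: (
        round(speedup_percent(t_before, t_after), 1),
        round(read_reduction_percent(gb_before, gb_after), 1),
    )
    for name, t_before, gb_before, t_after, gb_after in RESULTS
}
```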


What's hot

Top 5 Mistakes to Avoid When Writing Apache Spark Applications

Cloudera, Inc.

The document discusses 5 common mistakes people make when writing Spark applications: 1) Not properly sizing executors for memory and cores. 2) Having shuffle blocks larger than 2GB which can cause jobs to fail. 3) Not addressing data skew which can cause joins and shuffles to be very slow. 4) Not properly managing the DAG to minimize shuffles and stages. 5) Classpath conflicts from mismatched dependencies causing errors.

A Deep Dive into Query Execution Engine of Spark SQL

Databricks

Spark SQL enables Spark to perform efficient and fault-tolerant relational query processing with analytics database technologies. The relational queries are compiled to the executable physical plans consisting of transformations and actions on RDDs with the generated Java code. The code is compiled to Java bytecode, executed at runtime by JVM and optimized by JIT to native machine code at runtime. This talk will take a deep dive into Spark SQL execution engine. The talk includes pipelined execution, whole-stage code generation, UDF execution, memory management, vectorized readers, lineage based RDD transformation and action.

Memory Management in Apache Spark

Databricks

Memory management is at the heart of any data-intensive system. Spark, in particular, must arbitrate memory allocation between two main use cases: buffering intermediate data for processing (execution) and caching user data (storage). This talk will take a deep dive through the memory management designs adopted in Spark since its inception and discuss their performance and usability implications for the end user.

Adaptive Query Execution: Speeding Up Spark SQL at Runtime

Databricks

Over the years, there has been extensive and continuous effort on improving Spark SQL’s query optimizer and planner, in order to generate high quality query execution plans. One of the biggest improvements is the cost-based optimization framework that collects and leverages a variety of data statistics (e.g., row count, number of distinct values, NULL values, max/min values, etc.) to help Spark make better decisions in picking the most optimal query plan.

Parquet performance tuning: the missing guide

Ryan Blue

Parquet performance tuning focuses on optimizing Parquet reads by leveraging columnar organization, encoding, and filtering techniques. Statistics and dictionary filtering can eliminate unnecessary data reads by filtering at the row group and page levels. However, these optimizations require columns to be sorted and fully dictionary encoded within files. Increasing dictionary size thresholds and decreasing row group sizes can help avoid dictionary encoding fallback and improve filtering effectiveness. Future work may include new encodings, compression algorithms like Brotli, and page-level filtering in the Parquet format.

Streaming SQL with Apache Calcite

Julian Hyde

With the rise of the Internet of Things (IoT) and low-latency analytics, streaming data becomes ever more important. Surprisingly, one of the most promising approaches for processing streaming data is SQL. In this presentation, Julian Hyde shows how to build streaming SQL analytics that deliver results with low latency, adapt to network changes, and play nicely with BI tools and stored data. He also describes how Apache Calcite optimizes streaming queries, and the ongoing collaborations between Calcite and the Storm, Flink and Samza projects. This talk was given Julian Hyde at Apache Big Data conference, Vancouver, on 2016/05/09.

Building a SIMD Supported Vectorized Native Engine for Spark SQL

Databricks

Spark SQL works very well with structured row-based data. Vectorized reader and writer for parquet/orc can make I/O much faster. It also used WholeStageCodeGen to improve the performance by Java JIT code. However Java JIT is usually not working very well on utilizing latest SIMD instructions under complicated queries. Apache Arrow provides columnar in-memory layout and SIMD optimized kernels as well as a LLVM based SQL engine Gandiva. These native based libraries can accelerate Spark SQL by reduce the CPU usage for both I/O and execution.

Optimizing Apache Spark UDFs

Databricks

User Defined Functions is an important feature of Spark SQL which helps extend the language by adding custom constructs. UDFs are very useful for extending spark vocabulary but come with significant performance overhead. These are black boxes for Spark optimizer, blocking several helpful optimizations like WholeStageCodegen, Null optimization etc. They also come with a heavy processing cost associated with String functions requiring UTF-8 to UTF-16 conversions which slows down spark jobs and increases memory requirements. In this talk, we will go over how at Informatica we optimized UDFs to be as performant as Spark native functions both in terms of time and memory and allow these functions to participate in spark optimization steps.

Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang

Databricks

As a general computing engine, Spark can process data from various data management/storage systems, including HDFS, Hive, Cassandra and Kafka. For flexibility and high throughput, Spark defines the Data Source API, which is an abstraction of the storage layer. The Data Source API has two requirements. 1) Generality: support reading/writing most data management/storage systems. 2) Flexibility: customize and optimize the read and write paths for different systems based on their capabilities. Data Source API V2 is one of the most important features coming with Spark 2.3. This talk will dive into the design and implementation of Data Source API V2, with comparison to the Data Source API V1. We also demonstrate how to implement a file-based data source using the Data Source API V2 for showing its generality and flexibility.

From Query Plan to Query Performance: Supercharging your Apache Spark Queries...

Databricks

The SQL tab in the Spark UI provides a lot of information for analysing your spark queries, ranging from the query plan, to all associated statistics. However, many new Spark practitioners get overwhelmed by the information presented, and have trouble using it to their benefit. In this talk we want to give a gentle introduction to how to read this SQL tab. We will first go over all the common spark operations, such as scans, projects, filter, aggregations and joins; and how they relate to the Spark code written. In the second part of the talk we will show how to read the associated statistics to pinpoint performance bottlenecks.

Delta Lake: Optimizing Merge

Databricks

Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring

Databricks

The Spark Listener interface provides a fast, simple and efficient route to monitoring and observing your Spark application - and you can start using it in minutes. In this talk, we'll introduce the Spark Listener interfaces available in core and streaming applications, and show a few ways in which they've changed our world for the better at SpotX. If you're looking for a "Eureka!" moment in monitoring or tracking of your Spark apps, look no further than Spark Listeners and this talk!

Spark shuffle introduction

colorant

This document discusses Spark shuffle, which is an expensive operation that involves data partitioning, serialization/deserialization, compression, and disk I/O. It provides an overview of how shuffle works in Spark and the history of optimizations like sort-based shuffle and an external shuffle service. Key concepts discussed include shuffle writers, readers, and the pluggable block transfer service that handles data transfer. The document also covers shuffle-related configuration options and potential future work.

Autoscaling Flink with Reactive Mode

Flink Forward

Flink Forward San Francisco 2022. Resource Elasticity is a frequently requested feature in Apache Flink: Users want to be able to easily adjust their clusters to changing workloads for resource efficiency and cost saving reasons. In Flink 1.13, the initial implementation of Reactive Mode was introduced, later releases added more improvements to make the feature production ready. In this talk, we’ll explain scenarios to deploy Reactive Mode to various environments to achieve autoscaling and resource elasticity. We’ll discuss the constraints to consider when planning to use this feature, and also potential improvements from the Flink roadmap. For those interested in the internals of Flink, we’ll also briefly explain how the feature is implemented, and if time permits, conclude with a short demo. by Robert Metzger

Modularized ETL Writing with Apache Spark

Databricks

Apache Spark has been an integral part of Stitch Fix’s compute infrastructure. Over the past five years, it has become our de facto standard for most ETL and heavy data processing needs and expanded our capabilities in the Data Warehouse. Since all our writes to the Data Warehouse are through Apache Spark, we took advantage of that to add more modules that supplement ETL writing. Config driven and purposeful, these modules perform tasks onto a Spark Dataframe meant for a destination Hive table. These are organized as a sequence of transformations on the Apache Spark dataframe prior to being written to the table.These include a process of journalizing. It is a process which helps maintain a non-duplicated historical record of mutable data associated with different parts of our business. Data quality, another such module, is enabled on the fly using Apache Spark. Using Apache Spark we calculate metrics and have an adjacent service to help run quality tests for a table on the incoming data. And finally, we cleanse data based on provided configurations, validate and write data into the warehouse. We have an internal versioning strategy in the Data Warehouse that allows us to know the difference between new and old data for a table. Having these modules at the time of writing data allows cleaning, validation and testing of data prior to entering the Data Warehouse thus relieving us, programmatically, of most of the data problems. This talk focuses on ETL writing in Stitch Fix and describes these modules that help our Data Scientists on a daily basis.

The Apache Spark File Format Ecosystem

Databricks

Dynamic Partition Pruning in Apache Spark

Databricks

In data analytics frameworks such as Spark it is important to detect and avoid scanning data that is irrelevant to the executed query, an optimization which is known as partition pruning. Dynamic partition pruning occurs when the optimizer is unable to identify at parse time the partitions it has to eliminate. In particular, we consider a star schema which consists of one or multiple fact tables referencing any number of dimension tables. In such join operations, we can prune the partitions the join reads from a fact table by identifying those partitions that result from filtering the dimension tables. In this talk we present a mechanism for performing dynamic partition pruning at runtime by reusing the dimension table broadcast results in hash joins and we show significant improvements for most TPCDS queries.

“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...

Flink Forward

Flink Forward San Francisco 2022. To improve Amazon Alexa experiences and support machine learning inference at scale, we built an automated end-to-end solution for incremental model building or fine-tuning machine learning models through continuous learning, continual learning, and/or semi-supervised active learning. Customer privacy is our top concern at Alexa, and as we build solutions, we face unique challenges when operating at scale such as supporting multiple applications with tens of thousands of transactions per second with several dependencies including near-real time inference endpoints at low latencies. Apache Flink helps us transform and discover metrics in near-real time in our solution. In this talk, we will cover the challenges that we faced, how we scale the infrastructure to meet the needs of ML teams across Alexa, and go into how we enable specific use cases that use Apache Flink on Amazon Kinesis Data Analytics to improve Alexa experiences to delight our customers while preserving their privacy. by Aansh Shah

Oracle Real Application Clusters 19c- Best Practices and Internals- EMEA Tour...

Sandesh Rao

In this session, I will cover under-the-hood features that power Oracle Real Application Clusters (Oracle RAC) 19c specifically around Cache Fusion and Service management. Improvements in Oracle RAC helps in integration with features such as Multitenant and Data Guard. In fact, these features benefit immensely when used with Oracle RAC. Finally we will talk about changes to the broader Oracle RAC Family of Products stack and the algorithmic changes that helps quickly detect sick/dead nodes/instances and the reconfiguration improvements to ensure that the Oracle RAC Databases continue to function without any disruption

Improving Spark SQL at LinkedIn

Databricks

Improving the Spark SQL usability and computing efficiency is one of the missions for Linkedin’s Spark team. In this talk, we will present the Spark SQL ecosystem and roadmaps at Linkedin, and introduce the highlighted projects we are working on, such as: * Improving Dataset performance with automated column pruning * Bringing an efficient 2d join algorithm to Spark SQL * Fixing join skewness with adaptive execution * Enhancing the cost-optimizer with a history-based learning approach

What's hot (20)

Top 5 Mistakes to Avoid When Writing Apache Spark Applications

A Deep Dive into Query Execution Engine of Spark SQL

Memory Management in Apache Spark

Adaptive Query Execution: Speeding Up Spark SQL at Runtime

Parquet performance tuning: the missing guide

Streaming SQL with Apache Calcite

Building a SIMD Supported Vectorized Native Engine for Spark SQL

Optimizing Apache Spark UDFs

Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang

From Query Plan to Query Performance: Supercharging your Apache Spark Queries...

Delta Lake: Optimizing Merge

Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring

Spark shuffle introduction

Autoscaling Flink with Reactive Mode

Modularized ETL Writing with Apache Spark

The Apache Spark File Format Ecosystem

Dynamic Partition Pruning in Apache Spark

“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...

Oracle Real Application Clusters 19c- Best Practices and Internals- EMEA Tour...

Improving Spark SQL at LinkedIn

Similar to Materialized Column: An Efficient Way to Optimize Queries on Nested Columns

The Science of DBMS: Data Storage & Organization

SAP Technology

The thinking persons guide to data warehouse design

Calpont

The document discusses key considerations for designing a data warehouse, including building a logical design, transitioning to a physical design, and monitoring and tuning the design. It recommends using a modeling tool to capture logical designs, manual partitioning in some cases, and letting database engines do the work. It also covers physical design decisions like SQL vs NoSQL, row vs column storage, partitioning, indexing and optimizing data loads. Regular monitoring of workloads, bottlenecks and ratios is advised to tune performance.

Ibm redbook

Rahul Verma

The document summarizes a new 660-page IBM RedBook about DataStage documentation and examples. It provides overviews of DataStage architecture, best practices, popular stage descriptions, and a retail processing scenario with hundreds of pages and downloadable files. The RedBook aims to address past complaints about a lack of DataStage documentation by providing extensive guidelines, tips and examples.

MWLUG 2016 : AD117 : Xpages & jQuery DataTables

Michael Smith

- DataTables is a jQuery plugin that enhances the accessibility of data in HTML tables. It allows for easy creation of rich, interactive views in XPages applications. - Data can be added to DataTables from HTML, a JavaScript array, or an Ajax data source like REST. Callbacks provide a way to add interactivity similar to the XPages lifecycle. - Advanced configuration options include click handlers, renderers, filtering, lazy loading, and categorization to create feature-rich views.

The Science of DBMS: Query Optimization

SAP Technology

Best practice bi_design_bestpracticesv_1_5

rajibzzaman

This document discusses best practices for designing and tuning Oracle Business Intelligence 11g repositories, dashboards, reports, and queries. It covers topics such as repository design, including physical layer, business model, and presentation layer considerations. It also discusses dashboard and report design best practices, reading the query log, and performance tuning for both relational and multidimensional databases. The document provides an agenda and guidelines for each topic area.

HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce

Cloudera, Inc.

Most developers are familiar with the topic of “database design”. In the relational world, normalization is the name of the game. How do things change when you’re working with a scalable, distributed, non-SQL database like HBase? This talk will cover the basics of HBase schema design at a high level and give several common patterns and examples of real-world schemas to solve interesting problems. The storage and data access architecture of HBase (row keys, column families, etc.) will be explained, along with the pros and cons of different schema decisions.

SQL Pass Summit Presentations from Datavail - Optimize SQL Server: Query Tuni...

Datavail

GIDS 2016 Understanding and Building No SQLs

techmaddy

Storage becomes the key part of any Big Data system. There are few non-functional parameters that are expected from the Big Data storage systems like reliability, horizontal scalability, high availability, fault tolerance, etc. To support these properties and the change of data storage and access patterns in Big Data systems lead to a class of storage - NoSQLs. If there’s one rule in design -- there will always be trade-offs. CAP theorem defines the choices that we can make with the trade-offs. And ACID rules change to BASE in NoSQLs. This talk focuses on understanding NoSQLs, the design decisions for designing NoSQL databases, an complete design example of key-value database, and patterns of replication and sharding.

Pl sql best practices document

Ashwani Pandey

This document provides guidelines for developing PL/SQL components including naming conventions, formatting, commenting practices, and optimizations. Key guidelines include using prefixes for different object types, indenting with 3 spaces, writing descriptive header comments, avoiding unnecessary full table scans, and leveraging collections like nested tables for persistence. Performance best practices focus on proper indexing, avoiding context switching between SQL and PL/SQL, and bulk operations over iterative processing.

Ledingkart Meetup #2: Scaling Search @Lendingkart

Mukesh Singh

Scaling Search at Lendingkart discusses how Lendingkart scaled their search capabilities to handle large increases in data volume. They initially tried scaling databases vertically and horizontally, but searches were still slow at 8 seconds. They implemented ElasticSearch for its near real-time search, high scalability, and out-of-the-box functionality. Logstash was used to seed data from MySQL and MongoDB into ElasticSearch. Custom analyzers and mappings were developed. Searches then reduced to 230ms and aggregations to 200ms, allowing the business to scale as transactional data grew 3000% and leads 250%.

World2016_T5_S7_TeradataFunctionalOverview

Farah Omer

MicroStrategy and Teradata have a long partnership in providing business intelligence capabilities. MicroStrategy is optimized to run on Teradata and leverages many Teradata features and extensions for performance and scalability. These include multi-pass SQL, bulk inserts, Teradata indexing, functions, and syntax. MicroStrategy also integrates with Teradata tools and provides additional functionality like middle-tier computations and caching.

Myth busters - performance tuning 102 2008

paulguerin

The document provides an overview of various techniques for optimizing database and application performance. It discusses fundamentals like minimizing logical I/O, balancing workload, and serial processing. It also covers the cost-based optimizer, column constraints and indexes, SQL tuning tips, subqueries vs joins, and non-SQL issues like undo storage and data migrations. Key recommendations include using column constraints, focusing on serial processing per table, and not over-relying on statistics to solve all performance problems.

SPL_ALL_EN.pptx

政宏张

Dan Hotka's Top 10 Oracle 12c New Features

Embarcadero Technologies

Watch the full webinar at: http://embt.co/1pb4Zb4 This presentation is a must-see for anyone interested in Oracle 12! Dan is an Oracle ACE Director and has assembled this presentation with fresh and inside information from Oracle Corp and OOW13. Dan has pulled his top Oracle 12 features from the plethora of new features available and documented in his user group presentations "Oracle 12c New Features for Developers" and "Oracle 12c New Features for DBA's". Top 10 features will include: New SQL Syntax New SQL and PL/SQL Limits Pluggable Database New Packages Deprecated Features New SQL Tuning Features This presentation covers new SQL & PL/SQL syntax and options, the container DB of course, new SQL optimizer features, deprecated features, hints, and more. If you're supporting applications, then you won't want to miss this webinar!

Seatug Presentation (Excel to Data Viz culture) Seattle Tableau User Group

Russell Spangler

This document discusses a Tableau user group meeting focused on transitioning from Excel to data visualization using Tableau. The meeting will feature presentations from two Tableau experts at Amazon on their experience using Tableau. The objectives of the meeting are explained, including comparing Excel and Tableau, overcoming data obstacles, design tips, and real world examples. Tips are provided on leveraging Tableau's capabilities, assessing existing Excel reports, starting out with Tableau, and achieving the goal of transitioning to Tableau.

Taming the shrew Power BI

Kellyn Pot'Vin-Gorman

This document discusses techniques for optimizing Power BI performance. It recommends tracing queries using DAX Studio to identify slow queries and refresh times. Tracing tools like SQL Profiler and log files can provide insights into issues occurring in the data sources, Power BI layer, and across the network. Focusing on optimization by addressing wait times through a scientific process can help resolve long-term performance problems.

MySQL Optimizer: What's New in 8.0

Manyi Lu

The document discusses new features and improvements in the MySQL 8.0 optimizer. Key highlights include: - New SQL syntax like SELECT...FOR UPDATE SKIP LOCKED and NOWAIT to handle row locking contention. - Support for common table expressions to improve readability and allow referencing derived tables multiple times. - Enhancements to the cost model to produce more accurate estimates based on factors like data location. - Better support for data types like UUID and IPv6, including optimized storage formats and new functions.

Recent MariaDB features to learn for a happy life

Federico Razzoli

After MariaDB 10.6 LTS was made available last year, three Short Term Support versions were released. While they shouldn’t be used in production, they allow us to test the features that will be included in the next LTS version. I follow the development of MariaDB through their JIRA, I test the new features, and I regularly review each new major version on the Vettabase website. In this talk I will summarise the most relevant features, show how to use them, and discuss how we can leverage them for real-world cases.

Be A Hero: Transforming GoPro Analytics Data Pipeline

Chester Chen

The document discusses GoPro's transition to a new data platform architecture. The old architecture had several clusters for different workloads which caused operational overhead and lack of elasticity. The new architecture separates storage and computing, uses S3 for storage and ephemeral instances as compute clusters. It also introduces a centralized Hive metastore and uses dynamic DDL to flexibly ingest and aggregate both batch and streaming data while allowing the schema to change on the fly. This improves cost, scalability and enables more advanced analytics capabilities.

Similar to Materialized Column: An Efficient Way to Optimize Queries on Nested Columns (20)

The Science of DBMS: Data Storage & Organization

The thinking persons guide to data warehouse design

Ibm redbook

MWLUG 2016 : AD117 : Xpages & jQuery DataTables

The Science of DBMS: Query Optimization

Best practice bi_design_bestpracticesv_1_5

HBaseCon 2012 | HBase Schema Design - Ian Varley, Salesforce

SQL Pass Summit Presentations from Datavail - Optimize SQL Server: Query Tuni...

GIDS 2016 Understanding and Building No SQLs

Pl sql best practices document

Ledingkart Meetup #2: Scaling Search @Lendingkart

World2016_T5_S7_TeradataFunctionalOverview

Myth busters - performance tuning 102 2008

SPL_ALL_EN.pptx

Dan Hotka's Top 10 Oracle 12c New Features

Seatug Presentation (Excel to Data Viz culture) Seattle Tableau User Group

Taming the shrew Power BI

MySQL Optimizer: What's New in 8.0

Recent MariaDB features to learn for a happy life

Be A Hero: Transforming GoPro Analytics Data Pipeline

Materialized Column: An Efficient Way to Optimize Queries on Nested Columns

1. Materialized Column: An Efficient Way to Optimize Queries on Nested Columns
Guo, Jun (jason.guo.vip@gmail.com)
Lead of Data Engine Team, @ByteDance

2. Who we are
o Data Engine team of ByteDance
o Build a one-stop OLAP platform on which users can analyze PB-level data by writing SQL, without caring about the underlying execution engine

3. What we do
o Manage Spark SQL / Presto / Hive workloads
o Offer Open API and self-serve platform
o Optimize Spark SQL / Presto / Hive engines
o Design data architecture for most business lines in ByteDance

4. Agenda
▪ Spark SQL at ByteDance
▪ Why nested types are widely used
▪ What are the main issues of nested types
▪ Optional solutions
▪ How does Materialized Column solve these problems

5. Spark SQL at ByteDance

6. Spark SQL at ByteDance
Timeline, 2016 to 2020 (stages in slide order):
▪ Small-scale experiments
▪ Ad-hoc workload
▪ Few ETL pipelines in production
▪ Full-production deployment
▪ Main engine in DW area

7. Why nested types are widely used

8. Why nested types are widely used
▪ Event log
  ▪ A lot of new tracking events are created every day
  ▪ It is not a good idea to create a new column for each new type of event
▪ Dimension
  ▪ Dimension tables are dumped from the MySQL databases of service backends
  ▪ A service backend may add new fields on demand. These fields may not be helpful for now but they may be useful in the future

9. Main issues for nested types

10. Main issues for nested types
▪ Unnecessary data is read, which is a waste of IO
▪ Vectorized read cannot be exploited when a nested-type column is read
▪ Filter pushdown cannot be utilized when a nested column is read
▪ Duplicated computation, e.g. JSON parsing is CPU-intensive

11. Optional solutions

12. Optional solutions – A separate table
▪ DW users design a solution to solve these problems
▪ Maintain a new table that adds new columns extracted from the nested columns
▪ Downstream users should query this new table and its new columns for better performance

13. Optional solutions – A separate table
▪ Pros
  ▪ Queries are on simple types, so all of the problems above are solved
▪ Cons
  ▪ Need to push all downstream users to migrate their queries / pipelines to the new table and new columns
  ▪ Duplicated storage and computation cost
  ▪ Cannot handle frequent subfield changes

14. Optional solutions – Vectorized Read on Nested Column
▪ Refactor the Parquet vectorized reader to support vectorized read for nested types
▪ Support predicate pushdown for struct

15. Optional solutions – Vectorized Read on Nested Column
▪ Pros
  ▪ Enable vectorized read without any storage overhead
▪ Cons
  ▪ Need to refactor the vectorized readers for Parquet and ORC separately
  ▪ Filter pushdown for Array/Map is still not available
  ▪ The performance of vectorized read on nested types is not as good as that for simple types
    ▪ Improve performance with struct by about 100%
    ▪ Improve performance with map by about 163%

16. How does Materialized Column solve these problems

17. How does Materialized Column solve these problems
Original table:
    CREATE TABLE base_table (
      item STRING,
      count INT,
      people MAP<STRING, STRING>,
      date STRING
    ) USING parquet PARTITIONED BY (date);
Add materialized column:
    ALTER TABLE base_table ADD COLUMNS (
      age INT MATERIALIZED CAST(people['age'] AS INTEGER)
    );

18. How does Materialized Column solve these problems

19. How does Materialized Column solve these problems
Write with materialized column:
    explain extended
    insert into base_table partition(date='20201010')
    select 'appole', 1, map('age','18','name','jack','gender','male')
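One way to picture the write path: when a row is inserted, the engine evaluates each materialized expression and stores the result alongside the original nested column. The Python sketch below only illustrates that idea under this assumption; `MATERIALIZED`, `cast_int`, and `with_materialized` are hypothetical names, not part of Spark or of the implementation presented in the talk.

```python
from typing import Any, Callable, Dict, Optional


def cast_int(value: Optional[str]) -> Optional[int]:
    """Mimic a lenient CAST(... AS INTEGER): missing or invalid input gives None."""
    try:
        return int(value) if value is not None else None
    except ValueError:
        return None


# Hypothetical registry: materialized column name -> expression over the row.
# Mirrors `age INT MATERIALIZED CAST(people['age'] AS INTEGER)`.
MATERIALIZED: Dict[str, Callable[[Dict[str, Any]], Any]] = {
    "age": lambda row: cast_int(row["people"].get("age")),
}


def with_materialized(row: Dict[str, Any]) -> Dict[str, Any]:
    """Return the row as it would be stored: original columns plus derived ones."""
    stored = dict(row)
    for name, expr in MATERIALIZED.items():
        stored[name] = expr(row)
    return stored


# The row from the insert statement on the slide.
row = {
    "item": "appole",
    "count": 1,
    "people": {"age": "18", "name": "jack", "gender": "male"},
    "date": "20201010",
}
stored = with_materialized(row)
print(stored["age"])
```

The map column is kept unchanged, so existing queries on `people` keep working; the extra `age` value is what later reads can use instead of the map.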

20. How does Materialized Column solve these problems
[Figure: query with materialized column rewrite vs. query without materialized column rewrite]
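The rewrite step can be thought of as a pattern substitution over the query's expression tree: any sub-expression that matches a materialized column's defining expression is replaced by a reference to that column, so the scan reads a simple-typed column instead of the whole map. A minimal sketch, assuming expressions are modelled as nested tuples; the tuple encoding and the `rewrite` helper are hypothetical and far simpler than a real optimizer rule.

```python
# Hypothetical expression encoding: ("col", name), ("lit", value),
# ("map_get", map_expr, key), ("cast", expr, type), ("gt", left, right).
PEOPLE_AGE = ("cast", ("map_get", ("col", "people"), "age"), "int")

# Defining expression of each materialized column -> reference to that column.
MATERIALIZED = {PEOPLE_AGE: ("col", "age")}


def rewrite(expr):
    """Replace every sub-expression that has a materialized column, bottom to top."""
    if expr in MATERIALIZED:
        return MATERIALIZED[expr]
    if isinstance(expr, tuple):
        return tuple(rewrite(part) for part in expr)
    return expr


# WHERE CAST(people['age'] AS INTEGER) > 18
predicate = ("gt", PEOPLE_AGE, ("lit", 18))
rewritten = rewrite(predicate)
print(rewritten)

# An expression with no materialized counterpart is left untouched.
other = ("map_get", ("col", "people"), "name")
print(rewrite(other) == other)
```

Because the rewritten predicate refers to a plain integer column, it becomes a candidate for vectorized read and filter pushdown, and users do not have to change their SQL.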

21. How does Materialized Column solve these problems
Test case | Without Materialized Column rewrite | With Materialized Column rewrite | Performance | Read data size
SQL_adhoc_1 | 6.3 min / 797.6 GB | 3.4 min / 111.8 GB | 85.3%↑ | 86%↓
SQL_adhoc_2 | 16.5 min / 3.2 TB | 5.0 min / 111.1 GB | 230%↑ | 96.6%↓
SQL_etl_1 | 24 min / 3.7 TB | 9.1 min / 686.1 GB | 130.8%↑ | 82%↓
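The percentage columns can be recomputed from the time and size figures. The sketch below assumes the performance gain is defined as (time before / time after - 1) and the read reduction as (1 - size after / size before), with 1 TB taken as 1024 GB; these definitions are assumptions, since the slides do not state them. Under them, the recomputed values agree with the slide for every entry except the SQL_etl_1 performance gain, where the times as transcribed (24 min and 9.1 min) give roughly 164% rather than 130.8%, so one of those figures may have been rounded or transcribed differently.

```python
GB_PER_TB = 1024  # assumption: binary units

# (minutes, GB read) without rewrite, then with rewrite, as listed on the slide.
results = {
    "SQL_adhoc_1": ((6.3, 797.6), (3.4, 111.8)),
    "SQL_adhoc_2": ((16.5, 3.2 * GB_PER_TB), (5.0, 111.1)),
    "SQL_etl_1": ((24.0, 3.7 * GB_PER_TB), (9.1, 686.1)),
}


def speedup_pct(before_min: float, after_min: float) -> float:
    """Performance gain: how much faster the rewritten query runs, in percent."""
    return (before_min / after_min - 1) * 100


def read_reduction_pct(before_gb: float, after_gb: float) -> float:
    """Share of input data that no longer has to be read, in percent."""
    return (1 - after_gb / before_gb) * 100


summary = {
    name: (
        round(speedup_pct(before[0], after[0]), 1),
        round(read_reduction_pct(before[1], after[1]), 1),
    )
    for name, (before, after) in results.items()
}

for name, (speedup, reduction) in summary.items():
    print(f"{name}: {speedup}% faster, {reduction}% less data read")
```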

22. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.
