Bucketing is a popular data partitioning technique that pre-shuffles and (optionally) pre-sorts data during writes. This is ideal for the many write-once, read-many datasets at Facebook, where Spark can automatically avoid expensive shuffles/sorts (when the underlying data is joined/aggregated on its bucketed keys), resulting in substantial savings in both CPU and IO.
Over the last year, we’ve added a series of optimizations in Apache Spark as a means of achieving feature parity with Hive. These include avoiding shuffle/sort when joining/aggregating/inserting on tables with mismatched bucket counts, allowing users to skip shuffle/sort when writing to bucketed tables, and adding data validators before writing bucketed data, among many others. As a direct consequence of these efforts, we’ve witnessed over 10x growth (spanning 40% of total compute) in queries that read one or more bucketed tables across the entire data warehouse at Facebook.
In this talk, we’ll take a deep dive into the internals of bucketing support in Spark SQL, describe use cases where bucketing is useful, touch upon ongoing work to automatically suggest bucketing for tables based on query column lineage, and summarize the lessons learned from developing bucketing support in Spark at Facebook over the last two years.
In Spark SQL the physical plan provides the fundamental information about the execution of the query. The objective of this talk is to convey understanding and familiarity of query plans in Spark SQL, and use that knowledge to achieve better performance of Apache Spark queries. We will walk you through the most common operators you might find in the query plan and explain some relevant information that can be useful in order to understand some details about the execution. If you understand the query plan, you can look for the weak spot and try to rewrite the query to achieve a more optimal plan that leads to more efficient execution.
The main content of this talk is based on Spark source code but it will reflect some real-life queries that we run while processing data. We will show some examples of query plans and explain how to interpret them and what information can be taken from them. We will also describe what is happening under the hood when the plan is generated focusing mainly on the phase of physical planning. In general, in this talk we want to share what we have learned from both Spark source code and real-life queries that we run in our daily data processing.
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc... (Databricks)
Spark SQL is a highly scalable and efficient relational processing engine with easy-to-use APIs and mid-query fault tolerance. It is a core module of Apache Spark. Spark SQL can process, integrate and analyze the data from diverse data sources (e.g., Hive, Cassandra, Kafka and Oracle) and file formats (e.g., Parquet, ORC, CSV, and JSON). This talk will dive into the technical details of Spark SQL spanning the entire lifecycle of a query execution. The audience will get a deeper understanding of Spark SQL and understand how to tune Spark SQL performance.
Deep Dive into the New Features of Apache Spark 3.0 (Databricks)
Continuing with the objectives to make Spark faster, easier, and smarter, Apache Spark 3.0 extends its scope with more than 3000 resolved JIRAs. We will talk about the exciting new developments in the Spark 3.0 as well as some other major initiatives that are coming in the future.
Adaptive Query Execution: Speeding Up Spark SQL at Runtime (Databricks)
Over the years, there has been extensive and continuous effort on improving Spark SQL’s query optimizer and planner, in order to generate high quality query execution plans. One of the biggest improvements is the cost-based optimization framework that collects and leverages a variety of data statistics (e.g., row count, number of distinct values, NULL values, max/min values, etc.) to help Spark make better decisions in picking the most optimal query plan.
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark (Bo Yang)
The slides explain how shuffle works in Spark and help people understand more details about Spark internal. It shows how the major classes are implemented, including: ShuffleManager (SortShuffleManager), ShuffleWriter (SortShuffleWriter, BypassMergeSortShuffleWriter, UnsafeShuffleWriter), ShuffleReader (BlockStoreShuffleReader).
Understanding Memory Management In Spark For Fun And Profit (Spark Summit)
1) The document discusses memory management in Spark applications and summarizes different approaches tried by developers to address out of memory errors in Spark executors.
2) It analyzes the root causes of memory issues like executor overheads and data sizes, and evaluates fixes like increasing memory overhead, reducing cores, and more frequent garbage collection.
3) The document dives into Spark and JVM level configuration options for memory like storage pool sizes, caching formats, and garbage collection settings to improve reliability, efficiency and performance of Spark jobs.
Join operations are often the biggest source of performance problems and even full-blown exceptions in Spark. After this talk, you will understand the two most basic methods Spark employs for joining DataFrames – to the level of detail of how Spark distributes the data within the cluster. You’ll also find out how to work out common errors and even handle the trickiest corner cases we’ve encountered! After this talk, you should be able to write performant joins in Spark SQL that scale and are zippy fast!
This session will cover different ways of joining tables in Apache Spark.
Speaker: Vida Ha
This talk was originally presented at Spark Summit East 2017.
Building a SIMD Supported Vectorized Native Engine for Spark SQL (Databricks)
Spark SQL works very well with structured row-based data. Vectorized readers and writers for Parquet/ORC can make I/O much faster. It also uses WholeStageCodegen to improve performance through Java JIT-compiled code. However, Java JIT usually does not make good use of the latest SIMD instructions under complicated queries. Apache Arrow provides a columnar in-memory layout and SIMD-optimized kernels, as well as Gandiva, an LLVM-based SQL engine. These native libraries can accelerate Spark SQL by reducing CPU usage for both I/O and execution.
Fine Tuning and Enhancing Performance of Apache Spark Jobs (Databricks)
Apache Spark defaults provide decent performance for large data sets but leave room for significant performance gains if you tune parameters based on your resources and job.
Why you should care about data layout in the file system with Cheng Lian and ... (Databricks)
Efficient data access is one of the key factors for having a high performance data processing pipeline. Determining the layout of data values in the filesystem often has fundamental impacts on the performance of data access. In this talk, we will show insights on how data layout affects the performance of data access. We will first explain how modern columnar file formats like Parquet and ORC work and explain how to use them efficiently to store data values. Then, we will present our best practice on how to store datasets, including guidelines on choosing partitioning columns and deciding how to bucket a table.
This document discusses Spark shuffle, which is an expensive operation that involves data partitioning, serialization/deserialization, compression, and disk I/O. It provides an overview of how shuffle works in Spark and the history of optimizations like sort-based shuffle and an external shuffle service. Key concepts discussed include shuffle writers, readers, and the pluggable block transfer service that handles data transfer. The document also covers shuffle-related configuration options and potential future work.
Presto is an open-source distributed SQL query engine for interactive analytics. It uses a connector architecture to query data across different data sources and formats in the same query. Presto's query planning and execution involves scanning data sources, optimizing query plans, distributing queries across workers, and aggregating results. Understanding Presto's query plans helps optimize queries and troubleshoot performance issues.
1) Ross Lawley presented on connecting Spark to MongoDB. The MongoDB Spark connector started as an intern project in 2015 and was officially launched in 2016, written in Scala with Python and R support.
2) To read data from MongoDB, the connector partitions the collection, optionally using preferred shard locations for locality. It computes each partition's data as an iterator to be consumed by Spark.
3) For writing data, the connector groups data into batches by partition and inserts into MongoDB collections. DataFrames/Datasets will upsert if there is an ID.
4) The connector supports structured data in Spark by inferring schemas, creating relations, and allowing multi-language access from Scala, Python and R
Top 5 Mistakes When Writing Spark Applications (Spark Summit)
This document discusses 5 common mistakes when writing Spark applications:
1) Improperly sizing executors by not considering cores, memory, and overhead. The optimal configuration depends on the workload and cluster resources.
2) Applications failing due to shuffle blocks exceeding 2GB size limit. Increasing the number of partitions helps address this.
3) Jobs running slowly due to data skew in joins and shuffles. Techniques like salting keys can help address skew.
4) Not properly managing the DAG to avoid shuffles and bring work to the data. Using ReduceByKey over GroupByKey and TreeReduce over Reduce when possible.
5) Classpath conflicts arising from mismatched library versions, which can be addressed using shading.
Optimizing spark jobs through a true understanding of spark core. Learn: What is a partition? What is the difference between read/shuffle/write partitions? How to increase parallelism and decrease output files? Where does shuffle data go between stages? What is the "right" size for your spark partitions and files? Why does a job slow down with only a few tasks left and never finish? Why doesn't adding nodes decrease my compute time?
This talk will break down merge in Delta Lake—what is actually happening under the hood—and then explain how you can optimize a merge. Some code snippets and sample configs will also be shared.
These slides present how DBT, Coral, and Iceberg can provide a novel data management experience for defining SQL workflows. In this UX, users define their workflows as a cascade of SQL queries, which then get auto-materialized and incrementally maintained. Applications of this user experience include Declarative DAG workflows, streaming/batch convergence, and materialized views.
Query optimizers and people have one thing in common: the better they understand their data, the better they can do their jobs. Optimizing queries is hard if you don't have good estimates for the sizes of the intermediate join and aggregate results. Data profiling is a technique that scans data, looking for patterns within the data such as keys, functional dependencies, and correlated columns. These richer statistics can be used in Apache Calcite's query optimizer, and the projects that use it, such as Apache Hive, Phoenix and Drill. We describe how we built a data profiler as a table function in Apache Calcite, review the recent research and algorithms that made it possible, and show how you can use the profiler to improve the quality of your data.
Cosco: An Efficient Facebook-Scale Shuffle Service (Databricks)
Cosco is an efficient shuffle-as-a-service that powers Spark (and Hive) jobs at Facebook warehouse scale. It is implemented as a scalable, reliable and maintainable distributed system. Cosco is based on the idea of partial in-memory aggregation across a shared pool of distributed memory. This provides vastly improved efficiency in disk usage compared to Spark's built-in shuffle. Long term, we believe the Cosco architecture will be key to efficiently supporting jobs at ever larger scale. In this talk we'll take a deep dive into the Cosco architecture and describe how it's deployed at Facebook. We will then describe how it's integrated to run shuffle for Spark, and contrast it with Spark's built-in sort-based shuffle mechanism and SOS (presented at Spark+AI Summit 2018).
Cost-Based Optimizer in Apache Spark 2.2 (Databricks)
The document discusses Apache Spark's new cost-based optimizer (CBO) in version 2.2. It describes how the CBO works in two key steps:
1. It collects and propagates statistics about tables and columns to estimate the cardinality of operations like filters, joins and aggregates.
2. It calculates the estimated cost of different execution plans and selects the most optimal plan based on minimizing the estimated cost. This allows it to pick more efficient join orders and join algorithms.
The document provides examples of how the CBO improves queries on TPC-DS benchmarks by producing smaller intermediate results and faster execution times compared to the previous rule-based optimizer in Spark 2.1.
Slides for my talk in Spark Summit 2017 SF : https://spark-summit.org/2017/events/hive-bucketing-in-apache-spark/
Bucketing is a partitioning technique that can improve performance in certain data transformations by avoiding data shuffling and sorting. The general idea of bucketing is to partition, and optionally sort, the data based on a subset of columns while it is written out (a one-time cost), making successive reads of the data more performant for downstream jobs if the SQL operators can make use of this property. Bucketing enables faster joins (e.g. a single-stage sort-merge join), the ability to short-circuit a FILTER operation if the file is pre-sorted on the column in the filter predicate, and quick data sampling.
In this session, you’ll learn how bucketing is implemented in both Hive and Spark. In particular, Patil will describe the changes in the Catalyst optimizer that enable these optimizations in Spark for various bucketing scenarios. Facebook’s performance tests have shown bucketing to improve Spark performance by 3-5x when the optimization is enabled. Many tables at Facebook are sorted and bucketed, and migrating these workloads to Spark has resulted in 2-3x savings when compared to Hive. You’ll also hear about real-world applications of bucketing, like loading cumulative tables with daily deltas, and the characteristics that can help identify suitable candidate jobs that can benefit from bucketing.
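As a rough, illustrative sketch (not taken from the talk itself), the open-source DataFrameWriter API expresses the same idea: write both sides of a frequent join bucketed and sorted on the join key, and the sort-merge join can then run as a single stage. The table and column names below are made up.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("bucketing-sketch").getOrCreate()
import spark.implicits._

// Toy stand-ins for a fact table and a dimension table.
val orders    = Seq((1, "laptop"), (2, "phone"), (1, "mouse")).toDF("customer_id", "item")
val customers = Seq((1, "alice"), (2, "bob")).toDF("customer_id", "name")

// Write both tables bucketed (and sorted) on the join key, with the same bucket count.
orders.write.bucketBy(8, "customer_id").sortBy("customer_id").saveAsTable("orders_bucketed")
customers.write.bucketBy(8, "customer_id").sortBy("customer_id").saveAsTable("customers_bucketed")

// The sort-merge join should now plan without an Exchange or Sort on either side.
spark.table("orders_bucketed")
  .join(spark.table("customers_bucketed"), "customer_id")
  .explain()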
Robert Haas
Why does my query need a plan? Sequential scan vs. index scan. Join strategies. Join reordering. Joins you can't reorder. Join removal. Aggregates and DISTINCT. Using EXPLAIN. Row count and cost estimation. Things the query planner doesn't understand. Other ways the planner can fail. Parameters you can tune. Things that are nearly always slow. Redesigning your schema. Upcoming features and future work.
This document discusses different searching methods like sequential, binary, and hashing. It defines searching as finding an element within a list. Sequential search searches lists sequentially until the element is found or the end is reached, with efficiency of O(n) in worst case. Binary search works on sorted arrays by eliminating half of remaining elements at each step, with efficiency of O(log n). Hashing maps keys to table positions using a hash function, allowing searches, inserts and deletes in O(1) time on average. Good hash functions uniformly distribute keys and generate different hashes for similar keys.
The document outlines an agenda for a workshop on hands-on data science with R. The pre-lunch session covers R fundamentals including installing R and RStudio, data types and structures, control flow, and data import/export. The post-lunch session covers data analysis techniques in R like working with datasets, visualization, classification, regression, clustering, and association rule mining. Examples use the built-in iris dataset and other datasets.
Data engineering and analytics using Python (Purna Chander)
This document provides an overview of data engineering and analytics using Python. It discusses Jupyter notebooks and commonly used Python modules for data science like Pandas, NumPy, SciPy, Matplotlib and Seaborn. It describes Anaconda distribution and the key features of Pandas including data loading, structures like DataFrames and Series, and core operations like filtering, mapping, joining, sorting, cleaning and grouping. It also demonstrates data visualization using Seaborn and a machine learning example of linear regression.
SQLFire is a scalable SQL database that provides an alternative to NoSQL databases by using SQL instead. It features hash partitioning, entity groups to colocate related data, and data aware stored procedures that can run queries in parallel across partitions. SQLFire uses a tunable consistency model and supports distributed transactions without global locks. Data is stored primarily in memory but can optionally be persisted to disk in parallel for high throughput writes without seeks. Benchmark results showed SQLFire can scale linearly from 2 to 10 servers and 200 to 1200 clients.
Don’t optimize my queries, optimize my data! (Julian Hyde)
The document discusses strategies for optimizing data through materialized views and how data systems can learn to optimize themselves. It proposes an algorithm that uses sketches and information theory to profile data cardinalities and recommend materialized views. The algorithm aims to defeat the combinatorial search space by only considering combinations with "surprising" cardinalities. This profiling provides the cost and benefit information needed to optimize data structures. The document also discusses using query logs and statistics to infer relationships between tables and design summary tables through lattices.
This document provides an overview of basic usage of the Apache Spark framework for data analysis. It describes what Spark is, how to install it, and how to use it from Scala, Python, and R. It also explains the key concepts of RDDs (Resilient Distributed Datasets), transformations, and actions. Transformations like filter, map, join, and reduce return new RDDs, while actions like collect, count, and first return results to the driver program. The document provides examples of common transformations and actions in Spark.
This document introduces Spark SQL and the Catalyst query optimizer. It discusses that Spark SQL allows executing SQL on Spark, builds SchemaRDDs, and optimizes query execution plans. It then provides details on how Catalyst works, including its use of logical expressions, operators, and rules to transform query trees and optimize queries. Finally, it outlines some interesting open issues and how to contribute to Spark SQL's development.
Scaling Machine Learning Feature Engineering in Apache Spark at Facebook (Databricks)
Machine Learning feature engineering is one of the most critical workloads on Spark at Facebook and serves as a means of improving the quality of each of the prediction models we have in production. Over the last year, we’ve added several features in Spark core/SQL to add first-class support for Feature Injection and Feature Reaping in Spark. Feature Injection is an important prerequisite to (offline) ML training where the base features are injected/aligned with new/experimental features, with the goal of improving model performance over time. From a query engine’s perspective, this can be thought of as a LEFT OUTER join between the base training table and the feature table which, if implemented naively, could get extremely expensive. As part of this work, we added native support for writing indexed/aligned tables in Spark, wherein, if the data in the base table and the injected features can be aligned during writes, the join itself can be performed inexpensively.
Covered:
1. Databases and Schemas
2. Tablespaces
3. Data Type
4. Exploring Databases
5. Locating the database server's message log
6. Locating the database's system identifier
7. Listing databases on this database server
8. How much disk space does a table use?
9. Which are my biggest tables?
10. How many rows are there in a table?
11. Quickly estimating the number of rows in a table
12. Understanding object dependencies
1. The document discusses various sorting algorithms like bubble sort, quick sort, and merge sort. It explains their time and space complexities.
2. It describes how to count inversions during merge sort by tracking the number of elements a back element passes during the merging process. Pseudocode for the merge-and-count and sort-and-count functions is provided.
3. Related problems on sorting and inversion counting from online judges like UVA are listed along with references for further reading.
A talk given by Julian Hyde at DataEngConf SF on April 17th 2018.
Did you know that databases often “cheat”? Even with a scalable query engine and smart optimizer, many real-world queries would be too slow if the engine read all the data, so the engine re-writes your query to use a pre-materialized result. B-tree indexes made the first relational databases possible, and there are now many flavors of materialization, from explicit materialized views to OLAP-style caching and spatial indexes. Materialization is more relevant than ever in today’s heterogeneous, distributed systems.
If you are evaluating data engines, we describe what materialization features to look for in your next engine. If you are implementing an engine, we describe the features provided by Apache Calcite to design, maintain and use materializations.
This document describes a project that provides methods for estimating the cardinality of conjunctive queries over RDF data. It discusses the key modules including an RDF loader, parser and importer to extract triples from RDF files and load them into a database. The database stores the triples and generates statistics tables. A cardinality estimator takes a conjunctive query and database statistics to output an estimate of the query's cardinality.
JDD 2016 - Pawel Szulc - Writing Your Own RDD For Fun And Profit (PROIDEA)
You all know what RDD stands for, right? You have the mental model of a distributed collection. But have you ever considered writing your own RDD? During this talk we will do just that. We will start by explaining the essence of how RDDs are implemented internally, followed by a semi-live demo (*), where we will implement a few RDDs from scratch.
After this talk you will not only be able to write your own RDD, but you will also have a deeper understanding of how Apache Spark works under the hood.
I guarantee fun during the talk and profit during your next job interview.
(*) by 'semi-live' the author means not actually coding live (because that almost never works), but slowly pulling small commits from the repo :)
Ontop: Answering SPARQL Queries over Relational Databases (Guohui Xiao)
We present Ontop, an open-source Ontology-Based Data Access (OBDA) system that allows for querying relational data sources through a conceptual representation of the domain of interest, provided in terms of an ontology, to which the data sources are mapped. Key features of Ontop are its solid theoretical foundations, a virtual approach to OBDA, which avoids materializing triples and is implemented through the query rewriting technique, extensive optimizations exploiting all elements of the OBDA architecture, its compliance to all relevant W3C recommendations (including SPARQL queries, R2RML mappings, and OWL 2 QL and RDFS ontologies), and its support for all major relational databases.
This document discusses various methods for working with lists in Python. It covers how to create lists, common list methods like append and insert, built-in functions like len and random.shuffle, accessing list elements using indexes, slicing lists, list operators like + and in, list comprehensions, comparing lists, splitting strings into lists, and examples of using lists for problems involving lottery numbers, cards, and letter frequencies. It also discusses passing lists to functions, copying lists, and searching lists using linear and binary search algorithms.
CRL: A Rule Language for Table Analysis and Interpretation (Alexey Shigarov)
Tables presented in spreadsheets can be a source of important information that needs to be loaded into relational databases. However, many of them have complex structures, which does not allow populating databases with their information directly. The presentation is devoted to rule-based information extraction from arbitrary tables presented in spreadsheets and its transformation into a structured canonical form that can be loaded into a database by standard ETL tools. We suggest a novel rule language called CRL for table analysis and interpretation. It enables developing a simple program to recover missing relationships describing table semantics. Particular sets of rules can be designed for different types of tables to provide extraction and transformation steps in the process of unstructured tabular data integration.
The document discusses migrating a data warehouse to the Databricks Lakehouse Platform. It outlines why legacy data warehouses are struggling, how the Databricks Platform addresses these issues, and key considerations for modern analytics and data warehousing. The document then provides an overview of the migration methodology, approach, strategies, and key takeaways for moving to a lakehouse on Databricks.
Data Lakehouse Symposium | Day 1 | Part 1 (Databricks)
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
Data Lakehouse Symposium | Day 1 | Part 2 (Databricks)
The document discusses the challenges of modern data, analytics, and AI workloads. Most enterprises struggle with siloed data systems that make integration and productivity difficult. The future of data lies with a data lakehouse platform that can unify data engineering, analytics, data warehousing, and machine learning workloads on a single open platform. The Databricks Lakehouse platform aims to address these challenges with its open data lake approach and capabilities for data engineering, SQL analytics, governance, and machine learning.
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop (Databricks)
In this session, learn how to quickly supplement your on-premises Hadoop environment with a simple, open, and collaborative cloud architecture that enables you to generate greater value with scaled application of analytics and AI on all your data. You will also learn five critical steps for a successful migration to the Databricks Lakehouse Platform along with the resources available to help you begin to re-skill your data teams.
Democratizing Data Quality Through a Centralized Platform (Databricks)
Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale.
At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform’s capabilities, including:
Giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal
Performing data quality validations using libraries built to work with Spark
Dynamically generating pipelines that can be abstracted away from users
Flagging data that doesn’t meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers
Exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time
Learn to Use Databricks for Data Science (Databricks)
Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever — one that provides easier access and visibility into the data, reports and dashboards built against the data, reproducibility, and insights uncovered within the data. Join us to hear how Databricks’ open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale — all on one unified platform.
Why APM Is Not the Same As ML Monitoring (Databricks)
Application performance monitoring (APM) has become the cornerstone of software engineering allowing engineering teams to quickly identify and remedy production issues. However, as the world moves to intelligent software applications that are built using machine learning, traditional APM quickly becomes insufficient to identify and remedy production issues encountered in these modern software applications.
As a lead software engineer at NewRelic, my team built high-performance monitoring systems including Insights, Mobile, and SixthSense. As I transitioned to building ML Monitoring software, I found the architectural principles and design choices underlying APM to not be a good fit for this brand new world. In fact, blindly following APM designs led us down paths that would have been better left unexplored.
In this talk, I draw upon my (and my team’s) experience building an ML Monitoring system from the ground up and deploying it on customer workloads running large-scale ML training with Spark as well as real-time inference systems. I will highlight how the key principles and architectural choices of APM don’t apply to ML monitoring. You’ll learn why, understand what ML Monitoring can successfully borrow from APM, and hear what is required to build a scalable, robust ML Monitoring architecture.
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix (Databricks)
Autonomy and ownership are core to working at Stitch Fix, particularly on the Algorithms team. We enable data scientists to deploy and operate their models independently, with minimal need for handoffs or gatekeeping. By writing a simple function and calling out to an intuitive API, data scientists can harness a suite of platform-provided tooling meant to make ML operations easy. In this talk, we will dive into the abstractions the Data Platform team has built to enable this. We will go over the interface data scientists use to specify a model and what that hooks into, including online deployment, batch execution on Spark, and metrics tracking and visualization.
Stage Level Scheduling Improving Big Data and AI Integration (Databricks)
In this talk, I will dive into the stage level scheduling feature added to Apache Spark 3.1. Stage level scheduling extends upon Project Hydrogen by improving big data ETL and AI integration and also enables multiple other use cases. It is beneficial any time the user wants to change container resources between stages in a single Apache Spark application, whether those resources are CPU, Memory or GPUs. One of the most popular use cases is enabling end-to-end scalable Deep Learning and AI to efficiently use GPU resources. In this type of use case, users read from a distributed file system, do data manipulation and filtering to get the data into a format that the Deep Learning algorithm needs for training or inference and then sends the data into a Deep Learning algorithm. Using stage level scheduling combined with accelerator aware scheduling enables users to seamlessly go from ETL to Deep Learning running on the GPU by adjusting the container requirements for different stages in Spark within the same application. This makes writing these applications easier and can help with hardware utilization and costs.
There are other ETL use cases where users want to change CPU and memory resources between stages, for instance when there is data skew or when the data size is much larger in certain stages of the application. In this talk, I will go over the feature details, cluster requirements, the API and use cases. I will demo how the stage level scheduling API can be used by Horovod to seamlessly go from data preparation to training using the Tensorflow Keras API using GPUs.
The talk will also touch on other new Apache Spark 3.1 functionality, such as pluggable caching, which can be used to enable faster dataframe access when operating from GPUs.
Simplify Data Conversion from Spark to TensorFlow and PyTorch (Databricks)
In this talk, I would like to introduce an open-source tool built by our team that simplifies the data conversion from Apache Spark to deep learning frameworks.
Imagine you have a large dataset, say 20 GBs, and you want to use it to train a TensorFlow model. Before feeding the data to the model, you need to clean and preprocess your data using Spark. Now you have your dataset in a Spark DataFrame. When it comes to the training part, you may have the problem: How can I convert my Spark DataFrame to some format recognized by my TensorFlow model?
The existing data conversion process can be tedious. For example, to convert an Apache Spark DataFrame to a TensorFlow Dataset file format, you need to either save the Apache Spark DataFrame on a distributed filesystem in parquet format and load the converted data with third-party tools such as Petastorm, or save it directly in TFRecord files with spark-tensorflow-connector and load it back using TFRecordDataset. Both approaches take more than 20 lines of code to manage the intermediate data files, rely on different parsing syntax, and require extra attention for handling vector columns in the Spark DataFrames. In short, all these engineering frictions greatly reduced the data scientists’ productivity.
The Databricks Machine Learning team contributed a new Spark Dataset Converter API to Petastorm to simplify these tedious data conversion process steps. With the new API, it takes a few lines of code to convert a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader with default parameters.
In the talk, I will use an example to show how to use the Spark Dataset Converter to train a Tensorflow model and how simple it is to go from single-node training to distributed training on Databricks.
Scaling your Data Pipelines with Apache Spark on Kubernetes (Databricks)
There is no doubt Kubernetes has emerged as the next generation of cloud native infrastructure to support a wide variety of distributed workloads. Apache Spark has evolved to run both Machine Learning and large scale analytics workloads. There is growing interest in running Apache Spark natively on Kubernetes. By combining the flexibility of Kubernetes and scalable data processing with Apache Spark, you can run any data and machine pipelines on this infrastructure while effectively utilizing resources at disposal.
In this talk, Rajesh Thallam and Sougata Biswas will share how to effectively run your Apache Spark applications on Google Kubernetes Engine (GKE) and Google Cloud Dataproc, and how to orchestrate the data and machine learning pipelines with managed Apache Airflow on GKE (Google Cloud Composer). The following topics will be covered: understanding key traits of Apache Spark on Kubernetes; things to know when running Apache Spark on Kubernetes, such as autoscaling; and a demonstration of running analytics pipelines on Apache Spark orchestrated with Apache Airflow on a Kubernetes cluster.
Scaling and Unifying SciKit Learn and Apache Spark Pipelines (Databricks)
Pipelines have become ubiquitous, as the need for stringing multiple functions to compose applications has gained adoption and popularity. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python Scikit-Learn and Apache Spark.
Scaling pipelines at the level of simple functions is desirable for many AI applications, however is not directly supported by Ray’s parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray’s compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations.
Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.
Sawtooth Windows for Feature Aggregations (Databricks)
In this talk about zipline, we will introduce a new type of windowing construct called a sawtooth window. We will describe various properties of sawtooth windows that we utilize to achieve online-offline consistency, while still maintaining high throughput, low read latency and tunable write latency for serving machine learning features. We will also talk about a simple deployment strategy for correcting feature drift – due to operations that are not “abelian groups” operating over change data.
We want to present multiple anti-patterns utilizing Redis in unconventional ways to get the maximum out of Apache Spark. All examples presented are tried and tested in production at scale at Adobe. The most common integration is spark-redis, which interfaces with Redis as a DataFrame backing store or as an upstream for Structured Streaming. We deviate from the common use cases to explore where Redis can plug gaps while scaling out high throughput applications in Spark.
Niche 1 : Long Running Spark Batch Job – Dispatch New Jobs by polling a Redis Queue
· Why?
o Custom queries on top of a table; we load the data once and query N times
· Why not Structured Streaming
· Working Solution using Redis
Niche 2 : Distributed Counters
· Problems with Spark Accumulators
· Utilize Redis Hashes as distributed counters
· Precautions for retries and speculative execution
· Pipelining to improve performance
Re-imagine Data Monitoring with whylogs and Spark (Databricks)
In the era of microservices, decentralized ML architectures and complex data pipelines, data quality has become a bigger challenge than ever. When data is involved in complex business processes and decisions, bad data can, and will, affect the bottom line. As a result, ensuring data quality across the entire ML pipeline is both costly, and cumbersome while data monitoring is often fragmented and performed ad hoc. To address these challenges, we built whylogs, an open source standard for data logging. It is a lightweight data profiling library that enables end-to-end data profiling across the entire software stack. The library implements a language and platform agnostic approach to data quality and data monitoring. It can work with different modes of data operations, including streaming, batch and IoT data.
In this talk, we will provide an overview of the whylogs architecture, including its lightweight statistical data collection approach and various integrations. We will demonstrate how the whylogs integration with Apache Spark achieves large scale data profiling, and we will show how users can apply this integration into existing data and ML pipelines.
Raven: End-to-end Optimization of ML Prediction Queries (Databricks)
Machine learning (ML) models are typically part of prediction queries that consist of a data processing part (e.g., for joining, filtering, cleaning, featurization) and an ML part invoking one or more trained models. In this presentation, we identify significant and unexplored opportunities for optimization. To the best of our knowledge, this is the first effort to look at prediction queries holistically, optimizing across both the ML and SQL components.
We will present Raven, an end-to-end optimizer for prediction queries. Raven relies on a unified intermediate representation that captures both data processing and ML operators in a single graph structure.
This allows us to introduce optimization rules that
(i) reduce unnecessary computations by passing information between the data processing and ML operators
(ii) leverage operator transformations (e.g., turning a decision tree to a SQL expression or an equivalent neural network) to map operators to the right execution engine, and
(iii) integrate compiler techniques to take advantage of the most efficient hardware backend (e.g., CPU, GPU) for each operator.
We have implemented Raven as an extension to Spark’s Catalyst optimizer to enable the optimization of SparkSQL prediction queries. Our implementation also allows the optimization of prediction queries in SQL Server. As we will show, Raven is capable of improving prediction query performance on Apache Spark and SQL Server by up to 13.1x and 330x, respectively. For complex models, where GPU acceleration is beneficial, Raven provides up to 8x speedup compared to state-of-the-art systems. As part of the presentation, we will also give a demo showcasing Raven in action.
Processing Large Datasets for ADAS Applications using Apache Spark (Databricks)
Semantic segmentation is the classification of every pixel in an image/video. The segmentation partitions a digital image into multiple objects to simplify/change the representation of the image into something that is more meaningful and easier to analyze [1][2]. The technique has a wide variety of applications ranging from perception in autonomous driving scenarios to cancer cell segmentation for medical diagnosis.
Exponential growth in the datasets that require such segmentation is driven by improvements in the accuracy and quality of the sensors generating the data extending to 3D point cloud data. This growth is further compounded by exponential advances in cloud technologies enabling the storage and compute available for such applications. The need for semantically segmented datasets is a key requirement to improve the accuracy of inference engines that are built upon them.
Streamlining the accuracy and efficiency of these systems directly affects the value of the business outcome for organizations that are developing such functionalities as a part of their AI strategy.
This presentation details workflows for labeling, preprocessing, modeling, and evaluating performance/accuracy. Scientists and engineers leverage domain-specific features/tools that support the entire workflow from labeling the ground truth, handling data from a wide variety of sources/formats, developing models and finally deploying these models. Users can scale their deployments optimally on GPU-based cloud infrastructure to build accelerated training and inference pipelines while working with big datasets. These environments are optimized for engineers to develop such functionality with ease and then scale against large datasets with Spark-based clusters on the cloud.
Massive Data Processing in Adobe Using Delta Lake (Databricks)
At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile offering. At the heart of this is complex ingestion of a mix of normalized and denormalized data, with various linkage scenarios powered by a central Identity Linking Graph. This helps power various marketing scenarios that are activated in multiple platforms and channels like email, advertisements, etc. We will go over how we built a cost-effective and scalable data pipeline using Apache Spark and Delta Lake and share our experiences.
What are we storing?
Multi Source – Multi Channel Problem
Data Representation and Nested Schema Evolution
Performance Trade Offs with Various formats
Go over anti-patterns used
(String FTW)
Data Manipulation using UDFs
Writer Worries and How to Wipe them Away
Staging Tables FTW
Datalake Replication Lag Tracking
Performance Time!
3. About me
Cheng Su
• Software Engineer at Facebook (Data Infrastructure Organization)
• Working in Spark team
• Previously worked in Hive/Corona team
4. Agenda
• Spark at Facebook
• What is Bucketing
• Spark Bucketing Optimizations (JIRA: SPARK-19256)
• Bucketing Compatibility across SQL Engines
• The Road Ahead
7. What is Bucketing (query plan)

SQL query to create a bucketed table:
CREATE TABLE user
(id INT, info STRING)
CLUSTERED BY (id)
SORTED BY (id)
INTO 8 BUCKETS

SQL query to write the bucketed table:
INSERT OVERWRITE
TABLE user
SELECT id, info
FROM . . .
WHERE . . .

Query plan to write the bucketed table:
InsertIntoTable
  Sort(id)
    ShuffleExchange(id, 8, HashFunc)
      . . .
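For comparison, a hedged sketch of roughly equivalent DDL in open-source Spark SQL (the table name, USING clause, and range() source below are illustrative, not from the slides):

// Create a bucketed, sorted table with Spark's native DDL, then populate it.
spark.sql("""
  CREATE TABLE user (id INT, info STRING)
  USING parquet
  CLUSTERED BY (id) SORTED BY (id) INTO 8 BUCKETS
""")

spark.sql("""
  INSERT OVERWRITE TABLE user
  SELECT id, CAST(id AS STRING) AS info
  FROM range(100)
""")

// In the slide's plan above, the insert sits on top of Sort(id) and a shuffle
// into 8 hash buckets; EXPLAIN lets you inspect what your own build plans here.
spark.sql("EXPLAIN INSERT OVERWRITE TABLE user SELECT id, CAST(id AS STRING) FROM range(100)")
  .show(truncate = false)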
9. Spark Bucketing Optimizations (join)
Avoid shuffle and sort when sort-merge-joining bucketed tables

SQL query to join tables:
SELECT . . .
FROM left L
JOIN right R
ON L.id = R.id

Query plan with shuffle and sort:
SortMergeJoin
  Sort(id)           Sort(id)
  Shuffle(id)        Shuffle(id)
  TableScan(L)       TableScan(R)

Query plan to sort-merge-join two bucketed tables with the same number of buckets:
SortMergeJoin
  TableScan(L)       TableScan(R)
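A small, hedged way to confirm this from code (it pokes at Spark's internal plan classes, so treat it as a debugging sketch; it assumes the two bucketed tables from the earlier example exist, and adaptive execution may wrap the plan):

import org.apache.spark.sql.execution.exchange.ShuffleExchangeExec

val joined = spark.table("orders_bucketed")
  .join(spark.table("customers_bucketed"), "customer_id")

// Collect any shuffle exchanges in the executed plan; with matching buckets
// on the join key there should be none.
val shuffles = joined.queryExecution.executedPlan.collect { case e: ShuffleExchangeExec => e }
println(s"shuffle exchanges in plan: ${shuffles.size}")  // expected: 0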
12. Spark Bucketing Optimizations (join)
Avoid shuffle when shuffled-hash-joining bucketed tables

SQL query to join tables:
SELECT . . .
FROM left L
JOIN right R
ON L.id = R.id

Query plan with shuffle:
ShuffledHashJoin
  Shuffle(id)        Shuffle(id)
  TableScan(L)       TableScan(R)

Query plan to shuffled-hash-join two bucketed tables with the same number of buckets:
ShuffledHashJoin
  TableScan(L)       TableScan(R)
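Whether Spark picks a shuffled hash join at all is a planner decision; the configuration keys below do exist in open-source Spark and nudge it that way, though the outcome still depends on table sizes and statistics, so this is only a sketch:

// Prefer a shuffled hash join over sort-merge join, and rule out broadcast joins.
spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

// With both tables bucketed the same way on the join key, look for a
// ShuffledHashJoin reading the bucketed scans directly, with no Exchange below it.
spark.table("orders_bucketed")
  .join(spark.table("customers_bucketed"), "customer_id")
  .explain()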
15. Spark Bucketing Optimizations (join)
Avoid shuffle and sort when joining a non-bucketed table with a bucketed table

SQL query to join tables:
SELECT . . .
FROM left L
JOIN right R
ON L.id = R.id

Query plan to sort-merge-join a non-bucketed table (L) with a bucketed table (R),
where only the non-bucketed side is shuffled and sorted:
SortMergeJoin
  Sort(id)
  Shuffle(id)
  TableScan(L)       TableScan(R)
17. Spark Bucketing Optimizations (join)
Avoid shuffle and sort when joining bucketed tables with different numbers of buckets

SQL query to join tables:
SELECT . . .
FROM left L
JOIN right R
ON L.id = R.id

Query plan to join a 4-bucket table (L) with a 16-bucket table (R):
SortMergeJoin
  TableScan(L)
  SortedCoalesce(4)
    TableScan(R)

SortedCoalesceExec: physical plan operator that inherits its child's ordering.
SortedCoalescedRDD: extends CoalescedRDD to read the child RDD's partitions in a sort-merge way (priority queue).
18. Sort merge join of bucketed sorted tables with different buckets
- Coalesce the bigger one in a sort-merge way (Sorted-Coalesce)
- Join by buffering one side and streaming the bigger one
(Diagram: sorted bucket files from Table Scan L and Table Scan R are sorted-coalesced into fewer sorted streams, which are then fed into the Join.)
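A minimal sketch of the sorted-coalesce idea (not the actual SortedCoalescedRDD code): merge several already-sorted bucket iterators into one sorted stream using a priority queue keyed on each iterator's head element.

import scala.collection.mutable

def sortedCoalesce[T](parts: Seq[Iterator[T]])(implicit ord: Ordering[T]): Iterator[T] = new Iterator[T] {
  // Queue entries are (head element, iterator it came from); reverse the ordering
  // because Scala's PriorityQueue dequeues the maximum and we want the minimum.
  private val queue =
    mutable.PriorityQueue.empty[(T, Iterator[T])](Ordering.by[(T, Iterator[T]), T](_._1).reverse)
  parts.foreach(it => if (it.hasNext) queue.enqueue((it.next(), it)))

  def hasNext: Boolean = queue.nonEmpty
  def next(): T = {
    val (value, it) = queue.dequeue()
    if (it.hasNext) queue.enqueue((it.next(), it))  // refill from the iterator we just consumed
    value
  }
}

// Example: four sorted "bucket" iterators merged into one sorted iterator.
val merged = sortedCoalesce(Seq(Iterator(0, 2, 9), Iterator(1, 4), Iterator(0, 3, 7), Iterator(2, 6)))
println(merged.toList)  // List(0, 0, 1, 2, 2, 3, 4, 6, 7, 9)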
19. Spark Bucketing Optimizations (join)
Avoid shuffle and sort when joining bucketed tables with different numbers of buckets

SQL query to join tables:
SELECT . . .
FROM left L
JOIN right R
ON L.id = R.id

Query plan to join a 4-bucket table (L) with a 16-bucket table (R):
SortMergeJoin
  Repartition(16)
    TableScan(L)
  TableScan(R)

RepartitionWithoutShuffleExec: physical plan operator that inherits its child's ordering.
RepartitionWithoutShuffleRDD: divides, reads, and filters the child RDD's partitions.
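A toy sketch of the divide-read-filter idea behind the repartition-without-shuffle read (the function names and placeholder hash below are made up): output partition j re-reads parent bucket j % 4 and keeps only the rows whose new bucket id is j, so no data moves between executors and per-partition ordering is preserved.

def hash(key: Int): Int = key & Int.MaxValue  // placeholder for the engine's bucketing hash

def repartitionWithoutShuffle(buckets4: IndexedSeq[Seq[Int]], newNumBuckets: Int): IndexedSeq[Seq[Int]] =
  (0 until newNumBuckets).map { j =>
    // Re-read the parent bucket this output bucket maps onto, then keep only
    // the keys that belong to output bucket j under the larger bucket count.
    buckets4(j % buckets4.length).filter(key => hash(key) % newNumBuckets == j)
  }

val buckets4  = IndexedSeq(Seq(0, 4, 8), Seq(1, 5, 9), Seq(2, 6), Seq(3, 7))  // keys laid out by hash(k) % 4
val buckets16 = repartitionWithoutShuffle(buckets4, 16)
// Each key k now sits in partition hash(k) % 16, obtained without any data movement.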
21. Spark Bucketing Optimizations (group-by)
Avoid shuffle and sort when sort-aggregating bucketed tables

SQL query to group by the bucketed column:
SELECT . . .
FROM t
GROUP BY id

Query plan with shuffle and sort:
SortAggregate
  Sort(id)
  Shuffle(id)
  TableScan(t)

Query plan to sort-aggregate a bucketed table:
SortAggregate
  TableScan(t)
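As with the joins, a quick hedged check in open-source Spark: grouping by the bucketed column should plan the aggregate directly over the bucketed scan (this reuses the illustrative table from the earlier sketch).

spark.table("orders_bucketed")
  .groupBy("customer_id")
  .count()
  .explain()  // expect the aggregate over the bucketed scan with no Exchange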
24. Spark Bucketing Optimizations (group-by)
Avoid shuffle when hash-aggregating bucketed tables

SQL query to group by the bucketed column:
SELECT . . .
FROM t
GROUP BY id

Query plan with shuffle:
HashAggregate
  Shuffle(id)
  TableScan(t)

Query plan to hash-aggregate a bucketed table:
HashAggregate
  TableScan(t)
27. Spark Bucketing Optimizations (union all)
Avoid shuffle and sort when joining/grouping-by on a union-all of bucketed tables

SQL query to group by on a union-all of tables:
SELECT . . .
FROM (
  SELECT … FROM L
  UNION ALL
  SELECT … FROM R
)
GROUP BY id

Query plan to sort-aggregate the union-all of bucketed tables:
SortAggregate
  Union
    TableScan(L)    TableScan(R)

Change UnionExec to produce a SortedCoalescedRDD instead of a CoalescedRDD.
30. Spark Bucketing Optimizations (filter)
Filter pushdown for bucketed tables

SQL queries to read a bucketed table with a filter on the bucketed column (id):
SELECT … FROM t
WHERE id = 1

SELECT … FROM t
WHERE id IN (1, 2, 3)

Query plan to read the bucketed table with filter pushdown:
Filter
  TableScan(t)

PushDownBucketFilter: physical plan rule to extract the bucketed-column filter from FilterExec, then filter out unnecessary buckets from e.g. HiveTableScanExec (i.e. unrelated buckets are not read at all).
31. Bucket Filter Push Down
SELECT … FROM t
WHERE id = 1

Normal Filter: scan all bucket files, then filter the rows.
Bucket Filter Push Down: only read the required bucket files.
(Diagram: with the pushdown, only the bucket file whose rows can match id = 1 is read.)
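A hedged sketch of the pruning arithmetic (bucketHash stands in for whichever bucketing hash the table was written with): from equality / IN predicates on the bucketed column, compute the bucket ids that can possibly match and read only those files.

def bucketHash(value: Int): Int = value & Int.MaxValue   // placeholder hash

def requiredBuckets(predicateValues: Seq[Int], numBuckets: Int): Set[Int] =
  predicateValues.map(v => bucketHash(v) % numBuckets).toSet

// WHERE id IN (1, 2, 3) on an 8-bucket table: only these bucket files are scanned.
val toRead = requiredBuckets(Seq(1, 2, 3), numBuckets = 8)
println(toRead)  // Set(1, 2, 3) with the placeholder hash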
32. Spark Bucketing Optimizations (validation)
Validate bucketing and sorting before writing bucketed tables

SQL query to write a bucketed table:
INSERT OVERWRITE
TABLE t
SELECT …
FROM …

Query plan to validate bucketing and sorting before writing the table:
InsertIntoTable(t)
  SortVerifier
    ShuffleVerifier

ShuffleVerifierExec: computes the bucket id for each row on the fly and compares it with the RDD partition id.
SortVerifierExec: compares the ordering between the current and previous rows.
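A toy RDD-level sketch of the same validation idea (not the actual ShuffleVerifierExec / SortVerifierExec operators): check on the fly that every row hashes into its partition's bucket and that rows arrive in sorted order, failing fast otherwise.

import org.apache.spark.rdd.RDD

def verifyBucketingAndSorting(rows: RDD[Int], numBuckets: Int, bucketHash: Int => Int): RDD[Int] =
  rows.mapPartitionsWithIndex { (partitionId, iter) =>
    var previous = Int.MinValue
    iter.map { key =>
      // Bucketing check: the row must belong to the bucket this partition will write.
      require(bucketHash(key) % numBuckets == partitionId,
        s"row with key $key does not belong to bucket $partitionId")
      // Sorting check: keys must be non-decreasing within the partition.
      require(key >= previous, s"rows are not sorted: $key after $previous")
      previous = key
      key
    }
  }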
34. Spark Bucketing Optimizations (others)
• Sorted-coalesced read of multiple partitions of a bucketed table
• Prefer sort-merge-join for bucketed sorted tables
• Prefer sort-aggregate for bucketed sorted tables
• Avoid shuffle for NULL-safe-equal join (<=>) on bucketed tables
• Allow skipping shuffle and sort before writing bucketed tables
• Automatically align dynamic allocation maximum executors with buckets
• Efficient Hive table sampling support
35. Bucketing Compatibility across SQL Engines
• Hive hash is different from murmur3 hash! (bitwise-and with 2^31-1 in org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorUtils.getBucketNumber)
• Use the same bucketing hash function (e.g. Hive hash) across SQL engines (Spark/Presto/Hive)
• The numbers of buckets of all tables should be divisible by each other (e.g. powers of two)
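For reference, a sketch of the bucket-number formula the slide points at: masking the hash code with 2^31 - 1 keeps the result non-negative before the modulo. Any engine that wants Hive-compatible buckets has to pair this formula with the same hash function.

def getBucketNumber(hashCode: Int, numBuckets: Int): Int =
  (hashCode & Int.MaxValue) % numBuckets   // bitwise-and with 2^31 - 1, then modulo

// Hive's hash code for an INT value is the value itself, so with 8 buckets
// a row with id = 100 lands in bucket:
println(getBucketNumber(100, 8))  // 4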
36. Bucketing Compatibility across SQL Engines
• Changing the number of buckets should be easy and painless across compute engines for SQL users
• When and what to bucket?
• Bucket when more than one query does a join or group-by on some columns
37. The Road Ahead
• Bucketing should be user-transparent
• Auto-bucketing project
• Audit join/group-by column information for all warehouse queries
• Recommend bucketed columns and the number of buckets based on computational cost models
• What is the problem with bucketing? Can we have better data placement, besides bucketing and partitioning?