In the big data field, Spark SQL is an important data processing module for Apache Spark that works with structured, row-based data in the majority of its operators. A field-programmable gate array (FPGA) with highly customized intellectual property (IP) can deliver both better performance and lower power consumption when it accelerates the CPU-intensive segments of an application.
Join operations in Apache Spark are often the biggest source of performance problems and even full-blown exceptions. After this talk, you will understand the two most basic methods Spark employs for joining DataFrames, down to the level of detail of how Spark distributes the data within the cluster. You’ll also find out how to work out common errors and even handle the trickiest corner cases we’ve encountered! After this talk, you should be able to write performant joins in Spark SQL that scale and are zippy fast!
This session will cover different ways of joining tables in Apache Spark.
Speaker: Vida Ha
This talk was originally presented at Spark Summit East 2017.
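The abstract above does not name the two join methods. The sketch below assumes they are the shuffle hash join and the broadcast hash join, and models both in plain Python rather than Spark. The function names and sample rows are hypothetical; the point is only how data moves in each case.

```python
from collections import defaultdict


def shuffle_hash_join(left, right, num_partitions):
    """Repartition both sides by join key, then join each partition locally."""
    left_parts = defaultdict(list)
    right_parts = defaultdict(list)
    # "Shuffle": rows with the same key land in the same partition on both sides.
    for key, value in left:
        left_parts[hash(key) % num_partitions].append((key, value))
    for key, value in right:
        right_parts[hash(key) % num_partitions].append((key, value))

    joined = []
    for pid in range(num_partitions):
        # Build a hash table on one side of the partition and probe it with the other.
        table = defaultdict(list)
        for key, value in right_parts[pid]:
            table[key].append(value)
        for key, left_value in left_parts[pid]:
            for right_value in table.get(key, []):
                joined.append((key, left_value, right_value))
    return joined


def broadcast_hash_join(large_partitions, small):
    """Copy the small side to every partition; the large side is never moved."""
    table = defaultdict(list)
    for key, value in small:
        table[key].append(value)

    joined = []
    for partition in large_partitions:
        for key, left_value in partition:
            for right_value in table.get(key, []):
                joined.append((key, left_value, right_value))
    return joined


# Hypothetical sample data: (user_id, item) and (user_id, name).
orders = [(1, "book"), (2, "pen"), (1, "lamp"), (3, "desk")]
users = [(1, "ana"), (2, "bo")]

shuffled = shuffle_hash_join(orders, users, num_partitions=4)
broadcast = broadcast_hash_join([orders[:2], orders[2:]], users)
print(sorted(shuffled) == sorted(broadcast))
```

Both strategies return the same rows. They differ in cost: the first moves every row of both inputs, while the second moves only the small input, once per partition of the large one.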
Presto on Apache Spark: A Tale of Two Computation Engines (Databricks)
The architectural tradeoffs between the map/reduce paradigm and parallel databases have been a long and open discussion since the dawn of MapReduce more than a decade ago. At Facebook, we have spent the past several years independently building and scaling both Presto and Spark to Facebook-scale batch workloads, and it is now increasingly evident that there is significant value in coupling Presto’s state-of-the-art low-latency evaluation with Spark’s robust and fault-tolerant execution engine.
A Deep Dive into Query Execution Engine of Spark SQL (Databricks)
Spark SQL enables Spark to perform efficient and fault-tolerant relational query processing with analytics database technologies. The relational queries are compiled into executable physical plans consisting of transformations and actions on RDDs with generated Java code. The code is compiled to Java bytecode, executed at runtime by the JVM and optimized by the JIT into native machine code. This talk will take a deep dive into the Spark SQL execution engine. The talk covers pipelined execution, whole-stage code generation, UDF execution, memory management, vectorized readers, and lineage-based RDD transformations and actions.
This document discusses Spark shuffle, which is an expensive operation that involves data partitioning, serialization/deserialization, compression, and disk I/O. It provides an overview of how shuffle works in Spark and the history of optimizations like sort-based shuffle and an external shuffle service. Key concepts discussed include shuffle writers, readers, and the pluggable block transfer service that handles data transfer. The document also covers shuffle-related configuration options and potential future work.
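To make the writer and reader roles concrete, here is a toy Python model of a sort-based shuffle. It is a sketch of the idea only, not Spark's implementation: each map task tags records with a reduce partition id, sorts by that id, and keeps an offset index so a reducer can fetch just its slice. All function names are hypothetical, and serialization, compression and disk I/O are left out.

```python
def partition_for(key, num_reduce_partitions):
    # Hash partitioning: equal keys always map to the same reduce partition.
    return hash(key) % num_reduce_partitions


def map_side_write(records, num_reduce_partitions):
    """Return (data, offsets): records grouped by partition id plus an index."""
    tagged = [(partition_for(k, num_reduce_partitions), k, v) for k, v in records]
    tagged.sort(key=lambda t: t[0])  # sort by partition id only (stable sort)
    data = [(k, v) for _, k, v in tagged]

    # offsets[i]:offsets[i + 1] is the slice of `data` owned by partition i.
    offsets = [0] * (num_reduce_partitions + 1)
    for pid, _, _ in tagged:
        offsets[pid + 1] += 1
    for i in range(num_reduce_partitions):
        offsets[i + 1] += offsets[i]
    return data, offsets


def reduce_side_read(map_outputs, pid):
    """Collect the slice for partition `pid` from every map output."""
    rows = []
    for data, offsets in map_outputs:
        rows.extend(data[offsets[pid]:offsets[pid + 1]])
    return rows


map_outputs = [
    map_side_write([(0, "a"), (1, "b"), (2, "c"), (3, "d")], 2),
    map_side_write([(4, "e"), (5, "f")], 2),
]
print(reduce_side_read(map_outputs, 0))
print(reduce_side_read(map_outputs, 1))
```

With integer keys and two reduce partitions, even keys end up in partition 0 and odd keys in partition 1, whichever map task produced them.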
From Query Plan to Query Performance: Supercharging your Apache Spark Queries... (Databricks)
The SQL tab in the Spark UI provides a lot of information for analysing your Spark queries, ranging from the query plan to all associated statistics. However, many new Spark practitioners get overwhelmed by the information presented, and have trouble using it to their benefit. In this talk we want to give a gentle introduction to how to read this SQL tab. We will first go over all the common Spark operations, such as scans, projects, filters, aggregations and joins, and how they relate to the Spark code written. In the second part of the talk we will show how to read the associated statistics to pinpoint performance bottlenecks.
Using Apache Spark to analyze large datasets in the cloud presents a range of challenges. Different stages of your pipeline may be constrained by CPU, memory, disk and/or network IO. But what if all those stages have to run on the same cluster? In the cloud, you have limited control over the hardware your cluster runs on.
You may have even less control over the size and format of your raw input files. Performance tuning is an iterative and experimental process. It’s frustrating with very large datasets: what worked great with 30 billion rows may not work at all with 400 billion rows. But with strategic optimizations and compromises, 50+ TiB datasets can be no big deal.
Using the Spark UI and simple metrics, explore how to diagnose and remedy issues in jobs:
Sizing the cluster based on your dataset (shuffle partitions)
Ingestion challenges – well begun is half done (globbing S3, small files)
Managing memory (sorting GC – when to go parallel, when to go G1, when offheap can help you)
Shuffle (give a little to get a lot – configs for better out of box shuffle)
Spill (partitioning for the win)
Scheduling (FAIR vs FIFO, is there a difference for your pipeline?)
Caching and persistence (it’s the cost of doing business, so what are your options?)
Fault tolerance (blacklisting, speculation, task reaping)
Making the best of a bad deal (skew joins, windowing, UDFs, very large query plans)
Writing to S3 (dealing with write partitions, HDFS and s3DistCp vs writing directly to S3)
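For the first item in the list, sizing shuffle partitions from the dataset, the arithmetic can be sketched as below. The 128 MiB target partition size and the rounding to a multiple of the core count are rule-of-thumb assumptions, not figures from the talk, and `suggest_shuffle_partitions` is a hypothetical helper. The result is the kind of value one would try for `spark.sql.shuffle.partitions`.

```python
import math

MIB = 1024 ** 2
TIB = 1024 ** 4


def suggest_shuffle_partitions(shuffle_bytes, total_cores, target_partition_bytes=128 * MIB):
    """Estimate a shuffle partition count from shuffle size and cluster cores."""
    # Enough partitions that each one holds roughly the target number of bytes.
    by_size = math.ceil(shuffle_bytes / target_partition_bytes)
    # Round up to a multiple of the core count so the last wave of tasks is full.
    waves = math.ceil(by_size / total_cores)
    return max(total_cores, waves * total_cores)


# Assumed scenario: a 1 TiB shuffle on a cluster with 200 executor cores.
partitions = suggest_shuffle_partitions(1 * TIB, total_cores=200)
print(partitions)
```

This is only a starting point for the iterative tuning the abstract describes; the Spark UI's shuffle read and spill metrics show whether the chosen partition size is too large or too small.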
Cosco: An Efficient Facebook-Scale Shuffle Service (Databricks)
Cosco is an efficient shuffle-as-a-service that powers Spark (and Hive) jobs at Facebook warehouse scale. It is implemented as a scalable, reliable and maintainable distributed system. Cosco is based on the idea of partial in-memory aggregation across a shared pool of distributed memory. This provides vastly improved efficiency in disk usage compared to Spark's built-in shuffle. Long term, we believe the Cosco architecture will be key to efficiently supporting jobs at ever larger scale. In this talk we'll take a deep dive into the Cosco architecture and describe how it's deployed at Facebook. We will then describe how it's integrated to run shuffle for Spark, and contrast it with Spark's built-in sort-based shuffle mechanism and SOS (presented at Spark+AI Summit 2018).
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang (Databricks)
As a general computing engine, Spark can process data from various data management/storage systems, including HDFS, Hive, Cassandra and Kafka. For flexibility and high throughput, Spark defines the Data Source API, which is an abstraction of the storage layer. The Data Source API has two requirements.
1) Generality: support reading/writing most data management/storage systems.
2) Flexibility: customize and optimize the read and write paths for different systems based on their capabilities.
Data Source API V2 is one of the most important features coming with Spark 2.3. This talk will dive into the design and implementation of Data Source API V2 and compare it with Data Source API V1. We also demonstrate how to implement a file-based data source using Data Source API V2 to show its generality and flexibility.
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro... (Spark Summit)
This document summarizes Project Tungsten, an effort by Databricks to substantially improve the memory and CPU efficiency of Spark applications. It discusses how Tungsten optimizes memory and CPU usage through techniques like explicit memory management, cache-aware algorithms, and code generation. It provides examples of how these optimizations improve performance for aggregation queries and record sorting. The roadmap outlines expanding Tungsten's optimizations in Spark 1.4 through 1.6 to support more workloads and achieve end-to-end processing using binary data representations.
Join is one of the most important and critical SQL operations in most data warehouses. It is essential when we want to get insights from multiple input datasets. Over the last year, we’ve added a series of join optimizations internally at Facebook, and we recently started to contribute them back to upstream open source.
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud (Noritaka Sekiyama)
This document provides an overview and summary of Amazon S3 best practices and tuning for Hadoop/Spark in the cloud. It discusses the relationship between Hadoop/Spark and S3, the differences between HDFS and S3 and their use cases, details on how S3 behaves from the perspective of Hadoop/Spark, well-known pitfalls and tunings related to S3 consistency and multipart uploads, and recent community activities related to S3. The presentation aims to help users optimize their use of S3 storage with Hadoop/Spark frameworks.
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia... (Databricks)
People are creating, sharing and storing data at a faster pace than ever before, so effective data compression and decompression can significantly reduce the cost of data usage. Apache Spark is a general distributed computing engine for big data analytics that stores and shuffles large amounts of data across the cluster at runtime, so the compression/decompression codecs can affect end-to-end application performance in many ways.
However, there’s a trade-off between storage size and compression/decompression throughput (CPU computation). Balancing compression speed and ratio is an interesting topic, particularly while both software algorithms and CPU instruction sets keep evolving. Apache Spark provides a flexible compression codec interface with default implementations such as GZip, Snappy, LZ4 and ZSTD, and the Intel Big Data Technologies team has implemented more codecs for Apache Spark based on the latest Intel platforms, such as ISA-L (igzip), LZ4-IPP, Zlib-IPP and ZSTD. In this session, we compare the characteristics of those algorithms and implementations by running micro workloads as well as end-to-end workloads on different generations of Intel x86 platforms and disks.
Choosing the proper compression/decompression codecs for their applications should be a best practice for big data software engineers, and we will also present methodologies for measuring and tuning the performance bottlenecks of typical Apache Spark workloads.
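The size-versus-throughput trade-off described above can be measured with a small micro workload. The sketch below uses only codecs from the Python standard library (`zlib`, `bz2`, `lzma`) as stand-ins. They are not the Snappy, LZ4, ZSTD or Intel codecs the session compares, and the synthetic CSV payload is an assumption chosen purely for illustration.

```python
import bz2
import lzma
import time
import zlib


def measure(name, compress, decompress, payload):
    """Time one compress/decompress round trip and report the size ratio."""
    start = time.perf_counter()
    packed = compress(payload)
    compress_s = time.perf_counter() - start

    start = time.perf_counter()
    restored = decompress(packed)
    decompress_s = time.perf_counter() - start

    if restored != payload:
        raise ValueError(f"{name}: round trip changed the data")
    return {
        "codec": name,
        "ratio": len(packed) / len(payload),  # lower means smaller output
        "compress_s": compress_s,
        "decompress_s": decompress_s,
    }


# Synthetic, repetitive CSV-like payload of roughly 1 MB (hypothetical data).
payload = b"user_id,event,timestamp\n" + b"".join(
    b"%d,click,%d\n" % (i % 1000, 1_600_000_000 + i) for i in range(50_000)
)

codecs = [
    ("zlib-1", lambda data: zlib.compress(data, 1), zlib.decompress),
    ("zlib-9", lambda data: zlib.compress(data, 9), zlib.decompress),
    ("bz2", bz2.compress, bz2.decompress),
    ("lzma", lzma.compress, lzma.decompress),
]

results = [measure(name, c, d, payload) for name, c, d in codecs]
for r in results:
    print(
        f"{r['codec']:<7} ratio={r['ratio']:.3f} "
        f"compress={r['compress_s'] * 1000:.1f} ms "
        f"decompress={r['decompress_s'] * 1000:.1f} ms"
    )
```

Timings depend on the machine and the data, so the numbers are only meaningful relative to each other on the same run.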
Radical Speed for SQL Queries on Databricks: Photon Under the Hood (Databricks)
Join this session to hear the Photon product and engineering team talk about the latest developments with the project.
As organizations embrace data-driven decision-making, it has become imperative for them to invest in a platform that can quickly ingest and analyze massive amounts and types of data. With their data lakes, organizations can store all their data assets in cheap cloud object storage. But data lakes alone lack robust data management and governance capabilities. Fortunately, Delta Lake brings ACID transactions to your data lakes – making them more reliable while retaining the open access and low storage cost you are used to.
Using Delta Lake as its foundation, the Databricks Lakehouse platform delivers a simplified and performant experience with first-class support for all your workloads, including SQL, data engineering, data science & machine learning. With a broad set of enhancements in data access and filtering, query optimization and scheduling, as well as query execution, the Lakehouse achieves state-of-the-art performance to meet the increasing demands of data applications. In this session, we will dive into Photon, a key component responsible for efficient query execution.
Photon was first introduced at Spark and AI Summit 2020 and is written from the ground up in C++ to take advantage of modern hardware. It uses the latest techniques in vectorized query processing to capitalize on data- and instruction-level parallelism in CPUs, enhancing performance on real-world data and applications — all natively on your data lake. Photon is fully compatible with the Apache Spark™ DataFrame and SQL APIs to ensure workloads run seamlessly without code changes. Come join us to learn more about how Photon can radically speed up your queries on Databricks.
Apache Spark Tutorial | Spark Tutorial for Beginners | Apache Spark Training ... (Edureka!)
This Edureka Spark Tutorial will help you understand all the basics of Apache Spark. This Spark tutorial is ideal for both beginners and professionals who want to learn or brush up on Apache Spark concepts. Below are the topics covered in this tutorial:
1) Big Data Introduction
2) Batch vs Real Time Analytics
3) Why Apache Spark?
4) What is Apache Spark?
5) Using Spark with Hadoop
6) Apache Spark Features
7) Apache Spark Ecosystem
8) Demo: Earthquake Detection Using Apache Spark
Deep Dive: Memory Management in Apache Spark (Databricks)
Memory management is at the heart of any data-intensive system. Spark, in particular, must arbitrate memory allocation between two main use cases: buffering intermediate data for processing (execution) and caching user data (storage). This talk will take a deep dive through the memory management designs adopted in Spark since its inception and discuss their performance and usability implications for the end user.
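As a rough illustration of how execution and storage share one pool, the sketch below works through the sizing arithmetic of Spark's unified memory model. It assumes the documented defaults of recent Spark versions (300 MiB reserved, `spark.memory.fraction` of 0.6, `spark.memory.storageFraction` of 0.5); check the configuration reference for the version in use. The helper name and the 4 GiB heap are hypothetical.

```python
MIB = 1024 ** 2
RESERVED_BYTES = 300 * MIB  # fixed reservation assumed for Spark internals


def unified_memory_regions(heap_bytes, memory_fraction=0.6, storage_fraction=0.5):
    """Split an executor heap into the regions of the unified memory model."""
    usable = max(heap_bytes - RESERVED_BYTES, 0)
    unified = int(usable * memory_fraction)      # shared by execution and storage
    storage = int(unified * storage_fraction)    # cached blocks here are not evicted by execution
    return {
        "usable": usable,
        "unified": unified,
        "storage_protected": storage,
        "execution_initial": unified - storage,  # execution may also borrow unused storage memory
        "user": usable - unified,                # user data structures and internal metadata
    }


regions = unified_memory_regions(4 * 1024 * MIB)  # assumed 4 GiB executor heap
for name, size in regions.items():
    print(f"{name:<18} {size / MIB:8.1f} MiB")
```

The boundary between the storage and execution halves is soft: either side can use the other's free space, and only the protected storage share is safe from eviction when execution needs memory.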
Dynamic Partition Pruning in Apache Spark (Databricks)
In data analytics frameworks such as Spark it is important to detect and avoid scanning data that is irrelevant to the executed query, an optimization known as partition pruning. Dynamic partition pruning occurs when the optimizer is unable to identify at parse time the partitions it has to eliminate. In particular, we consider a star schema which consists of one or multiple fact tables referencing any number of dimension tables. In such join operations, we can prune the partitions the join reads from a fact table by identifying those partitions that result from filtering the dimension tables. In this talk we present a mechanism for performing dynamic partition pruning at runtime by reusing the dimension table broadcast results in hash joins, and we show significant improvements for most TPC-DS queries.
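The mechanism can be sketched in plain Python under simplifying assumptions: the fact table is a dict keyed by its partition column, and the filtered dimension table becomes a dict that plays the role of the broadcast hash table. The table contents and the `join_with_pruning` helper are hypothetical, and this is a model of the idea rather than Spark's optimizer rule.

```python
# Fact table partitioned by date_id: {partition value: rows of (date_id, amount)}.
fact_partitions = {
    20240101: [(20240101, 10.0), (20240101, 5.0)],
    20240102: [(20240102, 7.5)],
    20240103: [(20240103, 3.0)],
}
# Dimension table rows: (date_id, weekday).
dim_dates = [(20240101, "Mon"), (20240102, "Tue"), (20240103, "Wed")]


def join_with_pruning(fact_partitions, dim_rows, dim_predicate):
    """Join fact and dimension rows, scanning only the fact partitions that can match."""
    # Build side: filter the dimension table and build the hash table that
    # a broadcast hash join would ship to every task.
    broadcast = {key: attr for key, attr in dim_rows if dim_predicate(attr)}

    # Reuse the broadcast keys as a runtime filter on the partition column,
    # so partitions with no possible match are never read.
    scanned = [pid for pid in fact_partitions if pid in broadcast]

    joined = [
        (key, amount, broadcast[key])
        for pid in scanned
        for key, amount in fact_partitions[pid]
    ]
    return joined, scanned


joined, scanned = join_with_pruning(fact_partitions, dim_dates, lambda day: day == "Tue")
print(joined)
print(f"scanned {len(scanned)} of {len(fact_partitions)} fact partitions")
```

The set of partitions to read is known only after the dimension filter has run, which is why this pruning has to happen at runtime rather than at parse time.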
Using Delta Lake to Transform a Legacy Apache Spark to Support Complex Update... (Databricks)
The convergence of big data technology toward the traditional database domain has become an industry trend. At present, open source big data processing engines such as Apache Spark, Apache Hadoop and Apache Flink already support SQL interfaces, and SQL usage is dominant. Companies use this open source software to build their own ETL frameworks and OLAP technology. OLTP, however, is still a strong point of traditional databases. One of the main reasons is their support for ACID.
Spark SQL Deep Dive @ Melbourne Spark Meetup (Databricks)
This document summarizes a presentation on Spark SQL and its capabilities. Spark SQL allows users to run SQL queries on Spark, including HiveQL queries with UDFs, UDAFs, and SerDes. It provides a unified interface for reading and writing data in various formats. Spark SQL also allows users to express common operations like selecting columns, joining data, and aggregation concisely through its DataFrame API. This reduces the amount of code users need to write compared to lower-level APIs like RDDs.
Presto: Fast SQL-on-Anything (including Delta Lake, Snowflake, Elasticsearch ... (Databricks)
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Comcast, GrubHub, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores.
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra... (Databricks)
This talk presents how we accelerated deep learning processing, from preprocessing to inference and training, on Apache Spark at SK Telecom. At SK Telecom, half of the Korean population are our customers. To support them, we have 400,000 cell towers, which generate logs with geospatial tags.
The document provides details about Oracle's SPARC S7 servers and SPARC S7 processor. It discusses the key features and capabilities of the SPARC S7 processor, including software-in-silicon features for security, compression, and analytics acceleration. It also provides specifications for the SPARC S7-2 and SPARC S7-2L server models, which are based on the SPARC S7 processor.
Healthcare Claim Reimbursement using Apache Spark (Databricks)
The document discusses rewriting a claims reimbursement system using Spark. It describes how Spark provides better performance, scalability and cost savings compared to the previous Oracle-based system. Key points include using Spark for ETL to load data into a Delta Lake data lake, implementing the business logic in a reusable Java library, and seeing significant increases in processing volumes and speeds compared to the prior system. Challenges and tips for adoption are also provided.
The Apache Spark config behind the industry's first 100TB Spark SQL benchmark (Lenovo Data Center)
Some configurations deserve their own SlideShare entry: this is one of them. When the industry's first 100TB Spark SQL benchmark was reached, the media took notice. For good reason.
Intel, Mellanox, Lenovo and IBM came together to investigate a topology that leveraged advances in CPU, memory, storage and networking to assess the readiness of Spark SQL to harness new capabilities -- and speeds.
Advancing GPU Analytics with RAPIDS Accelerator for Spark and Alluxio (Alluxio, Inc.)
This document discusses accelerating Apache Spark workloads using RAPIDS Accelerator for Spark and Alluxio. It provides an introduction to RAPIDS Accelerator for Spark, shows significant performance gains over CPU-only Spark, and discusses combining GPU acceleration with Alluxio for optimized performance and cost on cloud datasets. Configuration options for RAPIDS and Alluxio are also covered.
This document summarizes a presentation about accelerating Apache Spark workloads using NVIDIA's RAPIDS accelerator. It notes that global data generation is expected to grow exponentially to 221 zettabytes by 2026. RAPIDS can provide significant speedups and cost savings for Spark workloads by leveraging GPUs. Benchmark results show an NVIDIA decision support benchmark running 5.7x faster and with 4.5x lower costs on a GPU cluster compared to CPU. The document outlines RAPIDS integration with Spark and provides information on qualification, configuration, and future developments.
The document discusses optimizing Oracle and Siebel applications on the Sun UltraSPARC T1 platform. It describes how Siebel's multi-threaded architecture is well-suited to the T1 processor's ability to run multiple threads in parallel. It provides examples of consolidating Siebel environments and optimizing performance through Solaris, Siebel, and Oracle database tuning. Metrics show Siebel performing well with low CPU utilization on T1 systems.
The document discusses accelerating Ceph storage performance using SPDK. SPDK introduces optimizations like asynchronous APIs, userspace I/O stacks, and polling mode drivers to reduce software overhead and better utilize fast storage devices. This allows Ceph to better support high performance networks and storage like NVMe SSDs. The document provides an example where SPDK helped XSKY's BlueStore object store achieve significant performance gains over the standard Ceph implementation.
The document discusses optimizing Oracle and Siebel applications on Sun Microsystems' UltraSPARC T1 (Niagara) platform. It provides an overview of Siebel architecture and its suitability for the T1 processor. Performance benchmarks show Siebel scaling well by taking advantage of the T1 processor's multithreading capabilities. The document also discusses various optimizations that can be done at the application, database, storage, and operating system levels to further improve performance.
Presentation oracle super cluster t5-8 technical deep dive (solarisyougood)
This document provides an overview and agenda for a presentation on the Oracle SuperCluster T5-8. The document outlines key specifications of the Oracle SuperCluster T5-8 including its SPARC T5 compute nodes, Exadata storage servers, ZFS storage appliance, and InfiniBand networking. It also discusses configurations for the SuperCluster including database and application domains on the SPARC T5 nodes. Use cases and competitive advantages are highlighted such as performance, efficiency through data compression, and reliability.
New Generation of SPARC Processors Boosting Oracle S/W Angelo Rajadurai (Orgad Kimchi)
This document discusses Oracle's SPARC T5 processor and SPARC T5 server systems. It provides an overview of the SPARC T5 processor's specifications and performance advantages. It then describes the new SPARC T5-8 and T5-4 server models, which offer up to 128 processor cores, 4TB of memory, and improved I/O and storage capabilities. Benchmark results are presented showing that the SPARC T5-8 significantly outperforms IBM Power systems on price/performance for database, middleware, and other workloads. A case study is also described where a financial services company found the SPARC T5-8 offered better streaming performance and lower costs than IBM Power solutions.
Building a big data intelligent application on top of xPatterns using tools that leverage Spark, Shark, Mesos, Tachyon and Cassandra. Also covered: Jaws, our own open-sourced Spark SQL RESTful service; our contributions to the Spark and Mesos projects; and lessons learned.
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache... (Anant Corporation)
This document discusses automating Apache Cassandra operations using Apache Airflow. It recommends using Airflow to schedule and automate workflows for ETL, data hygiene, import/export, and more. It provides an overview of using Apache Spark jobs within Airflow DAGs to perform tasks like data cleaning, deduplication, and migrations for Cassandra. The document includes demos of using Airflow and Spark with Cassandra on DataStax Astra and discusses considerations for implementing this solution.
Spark is a framework for efficient parallel data processing. It uses resilient distributed datasets (RDDs) that can be operated on in parallel, cached in memory, and recomputed when needed. The core of Spark provides functions for data sharing and basic operations like filtering, mapping, and reducing RDDs. Additional Spark modules provide capabilities for SQL, streaming, machine learning, and graph processing.
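The three RDD properties named above (parallel partitions, in-memory caching, recomputation from lineage) can be modelled with a toy class. `MiniRDD` is a hypothetical teaching stand-in, not Spark's API: it runs in one process, and its `cache()` materializes eagerly, whereas Spark's is lazy.

```python
import functools


class MiniRDD:
    """Toy stand-in for an RDD: partitions plus the lineage needed to rebuild them."""

    def __init__(self, source_partitions=None, parent=None, transform=None):
        self._source = source_partitions  # only set on the root dataset
        self._parent = parent             # lineage: where this dataset came from
        self._transform = transform       # per-partition function applied to the parent
        self._cached = None

    def _compute(self):
        if self._cached is not None:
            return self._cached           # served from memory
        if self._parent is None:
            return self._source
        # Recompute from lineage: rebuild the parent, then apply the transform.
        return [self._transform(part) for part in self._parent._compute()]

    def map(self, fn):
        return MiniRDD(parent=self, transform=lambda part: [fn(x) for x in part])

    def filter(self, predicate):
        return MiniRDD(parent=self, transform=lambda part: [x for x in part if predicate(x)])

    def cache(self):
        self._cached = self._compute()    # eager here, unlike Spark
        return self

    def uncache(self):
        self._cached = None               # simulates losing the cached copy
        return self

    def reduce(self, fn):
        return functools.reduce(fn, (x for part in self._compute() for x in part))


numbers = MiniRDD(source_partitions=[[1, 2, 3], [4, 5, 6]])
even_squares = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x).cache()
total_cached = even_squares.reduce(lambda a, b: a + b)

even_squares.uncache()                    # cached data is gone, lineage is not
total_recomputed = even_squares.reduce(lambda a, b: a + b)
print(total_cached, total_recomputed)
```

Because each dataset remembers its parent and transform, dropping the cached copy costs time but not correctness.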
This document discusses security features of the SPARC M7 CPU. It introduces Silicon Secured Memory, which provides hardware-based memory protection to stop malicious programs from accessing other application memory without performance impact. This results in improved security, reliability, and availability of applications. Benchmark results are also provided showing the SPARC M7's performance advantages over other chips.
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ... (Yahoo Developer Network)
Splice Machine is an open-source database that combines the benefits of modern lambda architectures with the full expressiveness of ANSI-SQL. Like lambda architectures, it employs separate compute engines for different workloads - some call this an HTAP database (Hybrid Transactional and Analytical Platform). This talk describes the architecture and implementation of Splice Machine V2.0. The system is powered by a sharded key-value store for fast short reads and writes, and short range scans (Apache HBase) and an in-memory, cluster data flow engine for analytics (Apache Spark). It differs from most other clustered SQL systems such as Impala, SparkSQL, and Hive because it combines analytical processing with a distributed Multi-Value Concurrency Method that provides fine-grained concurrency which is required to power real-time applications. This talk will highlight the Splice Machine storage representation, transaction engine, cost-based optimizer, and present the detailed execution of operational queries on HBase, and the detailed execution of analytical queries on Spark. We will compare and contrast how Splice Machine executes queries with other HTAP systems such as Apache Phoenix and Apache Trafodion. We will end with some roadmap items under development involving new row-based and column-based storage encodings.
Speakers:
Monte Zweben is a technology industry veteran. Monte’s early career was spent with the NASA Ames Research Center as the Deputy Chief of the Artificial Intelligence Branch, where he won the prestigious Space Act Award for his work on the Space Shuttle program. He then founded and was the Chairman and CEO of Red Pepper Software, a leading supply chain optimization company, which merged in 1996 with PeopleSoft, where he was VP and General Manager, Manufacturing Business Unit. In 1998, he was the founder and CEO of Blue Martini Software – the leader in e-commerce and multi-channel systems for retailers. Blue Martini went public on NASDAQ in one of the most successful IPOs of 2000, and is now part of JDA. Following Blue Martini, he was the chairman of SeeSaw Networks, a digital, place-based media company. Monte is also the co-author of Intelligent Scheduling and has published articles in the Harvard Business Review and various computer science journals and conference proceedings. He currently serves on the Board of Directors of Rocket Fuel Inc. as well as the Dean’s Advisory Board for Carnegie Mellon’s School of Computer Science.
VMworld 2013: Virtualizing Databases: Doing IT Right (VMworld)
VMworld 2013
Michael Corey, Ntirety, Inc
Jeff Szastak, VMware
Learn more about VMworld and register at http://www.vmworld.com/index.jspa?src=socmed-vmworld-slideshare
This document outlines the agenda and content for a presentation on xPatterns, a tool that provides APIs and tools for ingesting, transforming, querying and exporting large datasets on Apache Spark, Shark, Tachyon and Mesos. The presentation demonstrates how xPatterns has evolved its infrastructure to leverage these big data technologies for improved performance, including distributed data ingestion, transformation APIs, an interactive Shark query server, and exporting data to NoSQL databases. It also provides examples of how xPatterns has been used to build applications on large healthcare datasets.
Similar to Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based FPGA Accelerators (20)
The document discusses migrating a data warehouse to the Databricks Lakehouse Platform. It outlines why legacy data warehouses are struggling, how the Databricks Platform addresses these issues, and key considerations for modern analytics and data warehousing. The document then provides an overview of the migration methodology, approach, strategies, and key takeaways for moving to a lakehouse on Databricks.
Data Lakehouse Symposium | Day 1 | Part 1 (Databricks)
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
Data Lakehouse Symposium | Day 1 | Part 2 (Databricks)
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
The document discusses the challenges of modern data, analytics, and AI workloads. Most enterprises struggle with siloed data systems that make integration and productivity difficult. The future of data lies with a data lakehouse platform that can unify data engineering, analytics, data warehousing, and machine learning workloads on a single open platform. The Databricks Lakehouse platform aims to address these challenges with its open data lake approach and capabilities for data engineering, SQL analytics, governance, and machine learning.
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop (Databricks)
In this session, learn how to quickly supplement your on-premises Hadoop environment with a simple, open, and collaborative cloud architecture that enables you to generate greater value with scaled application of analytics and AI on all your data. You will also learn five critical steps for a successful migration to the Databricks Lakehouse Platform along with the resources available to help you begin to re-skill your data teams.
Democratizing Data Quality Through a Centralized Platform (Databricks)
Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale.
At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform’s capabilities, including:
Giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal
Performing data quality validations using libraries built to work with Spark
Dynamically generating pipelines that can be abstracted away from users
Flagging data that doesn’t meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers
Exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time
Learn to Use Databricks for Data ScienceDatabricks
Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever: one that provides easier access and visibility into the data, reports and dashboards built against the data, reproducibility, and insights uncovered within the data. Join us to hear how Databricks’ open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale, all on one unified platform.
Why APM Is Not the Same As ML MonitoringDatabricks
Application performance monitoring (APM) has become the cornerstone of software engineering allowing engineering teams to quickly identify and remedy production issues. However, as the world moves to intelligent software applications that are built using machine learning, traditional APM quickly becomes insufficient to identify and remedy production issues encountered in these modern software applications.
As a lead software engineer at NewRelic, my team built high-performance monitoring systems including Insights, Mobile, and SixthSense. As I transitioned to building ML Monitoring software, I found the architectural principles and design choices underlying APM to not be a good fit for this brand new world. In fact, blindly following APM designs led us down paths that would have been better left unexplored.
In this talk, I draw upon my (and my team’s) experience building an ML Monitoring system from the ground up and deploying it on customer workloads running large-scale ML training with Spark as well as real-time inference systems. I will highlight how the key principles and architectural choices of APM don’t apply to ML monitoring. You’ll learn why, understand what ML Monitoring can successfully borrow from APM, and hear what is required to build a scalable, robust ML Monitoring architecture.
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
Autonomy and ownership are core to working at Stitch Fix, particularly on the Algorithms team. We enable data scientists to deploy and operate their models independently, with minimal need for handoffs or gatekeeping. By writing a simple function and calling out to an intuitive API, data scientists can harness a suite of platform-provided tooling meant to make ML operations easy. In this talk, we will dive into the abstractions the Data Platform team has built to enable this. We will go over the interface data scientists use to specify a model and what that hooks into, including online deployment, batch execution on Spark, and metrics tracking and visualization.
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
In this talk, I will dive into the stage level scheduling feature added to Apache Spark 3.1. Stage level scheduling extends upon Project Hydrogen by improving big data ETL and AI integration and also enables multiple other use cases. It is beneficial any time the user wants to change container resources between stages in a single Apache Spark application, whether those resources are CPU, Memory or GPUs. One of the most popular use cases is enabling end-to-end scalable Deep Learning and AI to efficiently use GPU resources. In this type of use case, users read from a distributed file system, do data manipulation and filtering to get the data into a format that the Deep Learning algorithm needs for training or inference, and then send the data into a Deep Learning algorithm. Using stage level scheduling combined with accelerator aware scheduling enables users to seamlessly go from ETL to Deep Learning running on the GPU by adjusting the container requirements for different stages in Spark within the same application. This makes writing these applications easier and can help with hardware utilization and costs.
There are other ETL use cases where users want to change CPU and memory resources between stages, for instance when there is data skew or when the data size is much larger in certain stages of the application. In this talk, I will go over the feature details, cluster requirements, the API and use cases. I will demo how the stage level scheduling API can be used by Horovod to seamlessly go from data preparation to training using the TensorFlow Keras API on GPUs.
The talk will also touch on other new Apache Spark 3.1 functionality, such as pluggable caching, which can be used to enable faster dataframe access when operating from GPUs.
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
In this talk, I would like to introduce an open-source tool built by our team that simplifies the data conversion from Apache Spark to deep learning frameworks.
Imagine you have a large dataset, say 20 GBs, and you want to use it to train a TensorFlow model. Before feeding the data to the model, you need to clean and preprocess your data using Spark. Now you have your dataset in a Spark DataFrame. When it comes to the training part, you may have the problem: How can I convert my Spark DataFrame to some format recognized by my TensorFlow model?
The existing data conversion process can be tedious. For example, to convert an Apache Spark DataFrame to a TensorFlow Dataset file format, you need to either save the Apache Spark DataFrame on a distributed filesystem in parquet format and load the converted data with third-party tools such as Petastorm, or save it directly in TFRecord files with spark-tensorflow-connector and load it back using TFRecordDataset. Both approaches take more than 20 lines of code to manage the intermediate data files, rely on different parsing syntax, and require extra attention for handling vector columns in the Spark DataFrames. In short, all these engineering frictions greatly reduced the data scientists’ productivity.
The Databricks Machine Learning team contributed a new Spark Dataset Converter API to Petastorm to simplify these tedious data conversion process steps. With the new API, it takes a few lines of code to convert a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader with default parameters.
In the talk, I will use an example to show how to use the Spark Dataset Converter to train a TensorFlow model and how simple it is to go from single-node training to distributed training on Databricks.
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
There is no doubt Kubernetes has emerged as the next generation of cloud native infrastructure to support a wide variety of distributed workloads. Apache Spark has evolved to run both Machine Learning and large scale analytics workloads. There is growing interest in running Apache Spark natively on Kubernetes. By combining the flexibility of Kubernetes and scalable data processing with Apache Spark, you can run any data and machine learning pipelines on this infrastructure while effectively utilizing the resources at your disposal.
In this talk, Rajesh Thallam and Sougata Biswas will share how to effectively run your Apache Spark applications on Google Kubernetes Engine (GKE) and Google Cloud Dataproc, and how to orchestrate data and machine learning pipelines with managed Apache Airflow on GKE (Google Cloud Composer). The following topics will be covered:
- Understanding key traits of Apache Spark on Kubernetes
- Things to know when running Apache Spark on Kubernetes, such as autoscaling
- Demonstrating analytics pipelines running on Apache Spark, orchestrated with Apache Airflow on a Kubernetes cluster
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
Pipelines have become ubiquitous, as the need for stringing multiple functions to compose applications has gained adoption and popularity. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python Scikit-Learn and Apache Spark.
Scaling pipelines at the level of simple functions is desirable for many AI applications, but it is not directly supported by Ray’s parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray’s compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations.
Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.
Sawtooth Windows for Feature AggregationsDatabricks
In this talk about Zipline, we will introduce a new type of windowing construct called a sawtooth window. We will describe various properties of sawtooth windows that we utilize to achieve online-offline consistency, while still maintaining high throughput, low read latency and tunable write latency for serving machine learning features. We will also talk about a simple deployment strategy for correcting feature drift due to operations that are not “abelian groups”, that operate over change data.
We want to present multiple anti-patterns that utilize Redis in unconventional ways to get the maximum out of Apache Spark. All examples presented are tried and tested in production at scale at Adobe. The most common integration is spark-redis, which interfaces with Redis as a DataFrame backing store or as an upstream for Structured Streaming. We deviate from the common use cases to explore where Redis can plug gaps while scaling out high throughput applications in Spark.
Niche 1 : Long Running Spark Batch Job – Dispatch New Jobs by polling a Redis Queue
· Why?
o Custom queries on top of a table; we load the data once and query N times
· Why not Structured Streaming
· Working Solution using Redis
Niche 2 : Distributed Counters
· Problems with Spark Accumulators
· Utilize Redis Hashes as distributed counters
· Precautions for retries and speculative execution
· Pipelining to improve performance
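The retry precaution listed under Niche 2 can be illustrated with a minimal sketch. It is not taken from the talk: it assumes each task writes its count into a hash field keyed by partition id (an HSET-style overwrite) rather than incrementing a shared field, so a retried or speculative attempt cannot double count. `FakeRedisHash` is a hypothetical in-memory stand-in that mimics only the `hset` and `hvals` calls, and the counter name is made up.

```python
class FakeRedisHash:
    """Hypothetical in-memory stand-in for the two Redis hash commands used here."""

    def __init__(self):
        self._hashes = {}

    def hset(self, name, key, value):
        # Like HSET: create the hash if needed and overwrite the field.
        self._hashes.setdefault(name, {})[key] = str(value)

    def hvals(self, name):
        # Like HVALS: return every field value stored in the hash.
        return list(self._hashes.get(name, {}).values())


def record_partition_count(store, counter, partition_id, count):
    # Overwriting a per-partition field is idempotent: a second attempt for
    # the same partition replaces the earlier value instead of adding to it.
    store.hset(counter, str(partition_id), count)


def total_count(store, counter):
    # The driver sums the per-partition fields once the job is done.
    return sum(int(value) for value in store.hvals(counter))


store = FakeRedisHash()
record_partition_count(store, "rows_processed", 0, 120)
record_partition_count(store, "rows_processed", 1, 80)
record_partition_count(store, "rows_processed", 1, 80)  # speculative duplicate
total = total_count(store, "rows_processed")
print(total)
```

With a real Redis client the same two functions could be pointed at a live connection, and the per-partition writes could be batched through a pipeline to cut round trips.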
Re-imagine Data Monitoring with whylogs and SparkDatabricks
In the era of microservices, decentralized ML architectures and complex data pipelines, data quality has become a bigger challenge than ever. When data is involved in complex business processes and decisions, bad data can, and will, affect the bottom line. As a result, ensuring data quality across the entire ML pipeline is both costly and cumbersome, while data monitoring is often fragmented and performed ad hoc. To address these challenges, we built whylogs, an open source standard for data logging. It is a lightweight data profiling library that enables end-to-end data profiling across the entire software stack. The library implements a language and platform agnostic approach to data quality and data monitoring. It can work with different modes of data operations, including streaming, batch and IoT data.
In this talk, we will provide an overview of the whylogs architecture, including its lightweight statistical data collection approach and various integrations. We will demonstrate how the whylogs integration with Apache Spark achieves large scale data profiling, and we will show how users can apply this integration into existing data and ML pipelines.
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
Machine learning (ML) models are typically part of prediction queries that consist of a data processing part (e.g., for joining, filtering, cleaning, featurization) and an ML part invoking one or more trained models. In this presentation, we identify significant and unexplored opportunities for optimization. To the best of our knowledge, this is the first effort to look at prediction queries holistically, optimizing across both the ML and SQL components.
We will present Raven, an end-to-end optimizer for prediction queries. Raven relies on a unified intermediate representation that captures both data processing and ML operators in a single graph structure.
This allows us to introduce optimization rules that:
(i) reduce unnecessary computations by passing information between the data processing and ML operators,
(ii) leverage operator transformations (e.g., turning a decision tree to a SQL expression or an equivalent neural network) to map operators to the right execution engine, and
(iii) integrate compiler techniques to take advantage of the most efficient hardware backend (e.g., CPU, GPU) for each operator.
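As an illustration of rule (ii), the sketch below turns a small decision tree into a nested SQL CASE expression and checks it against a direct tree walk. It is not Raven's implementation: the dictionary-based tree format, the `people` table, and the feature names are all hypothetical, and SQLite stands in for the SQL engine.

```python
import sqlite3


def tree_to_sql(node):
    """Render a decision tree as a nested SQL CASE expression."""
    if "label" in node:  # leaf: emit the predicted label
        return str(node["label"])
    left = tree_to_sql(node["left"])
    right = tree_to_sql(node["right"])
    return (f"CASE WHEN {node['feature']} <= {node['threshold']} "
            f"THEN {left} ELSE {right} END")


def predict(node, row):
    """Reference prediction: walk the tree for a single row."""
    while "label" not in node:
        branch = "left" if row[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["label"]


# Hypothetical two-level tree over made-up features.
tree = {
    "feature": "age", "threshold": 30,
    "left": {"label": 0},
    "right": {
        "feature": "income", "threshold": 50000,
        "left": {"label": 0},
        "right": {"label": 1},
    },
}

rows = [
    {"age": 25, "income": 90000},
    {"age": 40, "income": 40000},
    {"age": 40, "income": 60000},
]

sql_expr = tree_to_sql(tree)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (age INTEGER, income INTEGER)")
conn.executemany("INSERT INTO people VALUES (:age, :income)", rows)
sql_predictions = [r[0] for r in
                   conn.execute(f"SELECT {sql_expr} FROM people ORDER BY rowid")]
tree_predictions = [predict(tree, row) for row in rows]
print(sql_expr)
print(sql_predictions, tree_predictions)
```

Once the model is an ordinary SQL expression, the query optimizer can treat it like any other projection, which is the kind of cross-component optimization the abstract describes.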
We have implemented Raven as an extension to Spark’s Catalyst optimizer to enable the optimization of SparkSQL prediction queries. Our implementation also allows the optimization of prediction queries in SQL Server. As we will show, Raven is capable of improving prediction query performance on Apache Spark and SQL Server by up to 13.1x and 330x, respectively. For complex models, where GPU acceleration is beneficial, Raven provides up to 8x speedup compared to state-of-the-art systems. As part of the presentation, we will also give a demo showcasing Raven in action.
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
Semantic segmentation is the classification of every pixel in an image/video. The segmentation partitions a digital image into multiple objects to simplify/change the representation of the image into something that is more meaningful and easier to analyze [1][2]. The technique has a wide variety of applications ranging from perception in autonomous driving scenarios to cancer cell segmentation for medical diagnosis.
Exponential growth in the datasets that require such segmentation is driven by improvements in the accuracy and quality of the sensors generating the data extending to 3D point cloud data. This growth is further compounded by exponential advances in cloud technologies enabling the storage and compute available for such applications. The need for semantically segmented datasets is a key requirement to improve the accuracy of inference engines that are built upon them.
Streamlining the accuracy and efficiency of these systems directly affects the value of the business outcome for organizations that are developing such functionalities as a part of their AI strategy.
This presentation details workflows for labeling, preprocessing, modeling, and evaluating performance/accuracy. Scientists and engineers leverage domain-specific features/tools that support the entire workflow from labeling the ground truth, handling data from a wide variety of sources/formats, developing models and finally deploying these models. Users can scale their deployments optimally on GPU-based cloud infrastructure to build accelerated training and inference pipelines while working with big datasets. These environments are optimized for engineers to develop such functionality with ease and then scale against large datasets with Spark-based clusters on the cloud.
Massive Data Processing in Adobe Using Delta LakeDatabricks
At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile Offering. At the heart of this is the complex ingestion of a mix of normalized and denormalized data, with various linkage scenarios powered by a central Identity Linking Graph. This helps power various marketing scenarios that are activated in multiple platforms and channels like email, advertisements, etc. We will go over how we built a cost effective and scalable data pipeline using Apache Spark and Delta Lake and share our experiences.
What are we storing?
Multi Source – Multi Channel Problem
Data Representation and Nested Schema Evolution
Performance Trade Offs with Various formats
Go over anti-patterns used
(String FTW)
Data Manipulation using UDFs
Writer Worries and How to Wipe them Away
Staging Tables FTW
Datalake Replication Lag Tracking
Performance Time!
4. ABOUT US
Calvin Hung @wasaitech
Calvin is the co-founder, CEO and CTO of WASAI
Technology, which specializes in FPGA-based
datacenter acceleration for Apache Spark,
Apache Hadoop and Genomics Analysis
applications. He has more than 15 years of
experience in software and hardware architecture
co-design and performance optimization.
Weiting Chen (William) @intel
Weiting is a senior software engineer at Intel
Software. He has worked on Big Data and Cloud
Solutions including Spark, Hadoop, OpenStack, and
Kubernetes for more than 6 years.
5. MOTIVATION
▪ CPUs support SIMD instructions such as SSE, AVX2, and AVX512. We
would like to unleash that power in SPARK.
▪ Many accelerators, such as FPGAs, GPUs, and ASICs, can help the CPU
offload functions and speed up performance.
6. PROBLEM DEFINITION
▪ How to avoid row-to-columnar and columnar-to-row conversion overhead
during data processing?
▪ How can SPARK leverage AVX support?
▪ How to coordinate accelerators (e.g. FPGA, GPU) to work
with the CPU in SPARK3.0?
▪ How can FPGA help to speed up SPARK?
▪ How to minimize data copy and serialization overhead when copying
from host to device?
▪ How to enhance performance during DMA transfer?
7. SOLUTION: SPARK + ARROW + FPGA
A better way to run SPARK with AVX and accelerator support
- Apache Arrow
- SPARK 3.0 New Features
- OAP Native SQL Engine (Intel)
- FPGA Accelerators (WASAI)
8. THE GOALS
End-to-End Columnar-to-Columnar Data Processing:
To avoid columnar-to-row or row-to-columnar overhead when processing data.
Support AVX(via OAP Native SQL):
Columnar based Reader -> Columnar based Data Processing(w/ AVX) -> Columnar based Writer Result
With FPGA Integration and acceleration:
Columnar based Reader -> Columnar based Data Processing(w/ AVX) -> Columnar based Data Copy to
Device -> Columnar based Data Processing(on FPGA) -> Columnar
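To make the conversion overhead in these goals concrete, here is a minimal pure-Python sketch of the row-to-columnar and columnar-to-row steps that an end-to-end columnar pipeline avoids. The function names, schema, and data are hypothetical, and plain lists stand in for Spark's and Arrow's real columnar batches.

```python
def rows_to_columns(rows, schema):
    """Row-to-columnar: touches every value once to regroup it by column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(schema)}


def columns_to_rows(columns, schema):
    """Columnar-to-row: touches every value again to rebuild the tuples."""
    return list(zip(*(columns[name] for name in schema)))


schema = ("item", "price")
rows = [(1, 2.5), (2, 4.0), (1, 1.5)]

columns = rows_to_columns(rows, schema)

# A columnar operator reads one contiguous column and ignores the rest,
# which is the access pattern SIMD units and accelerators favor.
total_price = sum(columns["price"])

# Converting back for a row-based operator repeats the per-value work.
round_trip = columns_to_rows(columns, schema)
print(total_price, round_trip == rows)
```

Each conversion is linear in the number of values, so a plan that alternates between row-based and columnar operators pays that cost at every boundary. Keeping data columnar from reader to writer removes those passes.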
9. APACHE ARROW
Without Arrow:
• Each system has its own internal memory format
• 70-80% computation wasted on serialization and deserialization
• Similar functionality implemented in multiple projects
With Arrow:
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality
Reference: https://arrow.apache.org
10. NEW FEATURES in SPARK3.0
SPARK-27396 Public APIs for extended Columnar Processing Support
https://issues.apache.org/jira/browse/SPARK-27396
▪ An interface to extend columnar processing API
▪ Provide an opportunity to create a custom API for columnar data processing with
OAP Native SQL Engine and FPGA support
▪ Advanced users can define a new interface to communicate with accelerators
such as GPU or FPGA
SPARK-24615 Accelerator-aware task scheduling for SPARK
https://issues.apache.org/jira/browse/SPARK-24615
▪ An interface for SPARK to allocate accelerators at the task level
▪ Makes SPARK tasks aware of accelerators such as GPU, FPGA, etc.
▪ Currently only supports GPU
▪ FPGA can be supported in the same way (vendor specific)
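As a rough sketch of what task-level accelerator allocation looks like from the user's side, the helper below builds the Spark 3.0 resource settings for a named resource and renders them as `spark-submit` arguments. The property patterns (`spark.executor.resource.{name}.amount`, `spark.executor.resource.{name}.discoveryScript`, `spark.task.resource.{name}.amount`) follow Spark's resource configuration scheme; the `fpga` resource name and the discovery script path are assumptions for illustration, and a real FPGA setup would depend on the vendor's plugin.

```python
def resource_conf(resource, per_executor, per_task, discovery_script):
    """Build Spark 3.0 resource scheduling properties for one resource type."""
    return {
        f"spark.executor.resource.{resource}.amount": str(per_executor),
        f"spark.executor.resource.{resource}.discoveryScript": discovery_script,
        f"spark.task.resource.{resource}.amount": str(per_task),
    }


def to_submit_args(conf):
    """Render the properties as --conf arguments for spark-submit."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


# Hypothetical: one FPGA per executor, one per task, and a made-up script
# path that would report the device addresses on each worker.
conf = resource_conf("fpga", 1, 1, "/opt/spark/getFpgaResources.sh")
submit_args = to_submit_args(conf)
print(" ".join(submit_args))
```

Swapping `"fpga"` for `"gpu"` yields the GPU settings that the slide describes as currently supported.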
11. OAP Native SQL Engine Plugin
Intel Optimized Analytics Package(OAP): Native SQL Engine
https://github.com/Intel-bigdata/OAP/
An End-to-End SPARK Columnar based data processing with Intel AVX support
[Architecture diagram. Labels: SPARK SQL, SPARK Catalyst, Columnar Shuffle, Arrow Data Source, Arrow Data Processing, Apache Arrow, Intel CPU, Other Accelerators (FPGA, GPU, …)]
12. FPGA Acceleration
END TO END, COLUMNAR TO COLUMNAR DATA PROCESSING
[Diagram comparing two data paths between Parquet Reader (FileScan) and Parquet Writer (FileWriter).
SPARK3.0: ColumnarVector, Columnar-to-Row, InternalRow, Whole Stage Codegen with row based operators, Row-to-Columnar, ColumnarVector.
SPARK3.0 + OAP Native SQL Engine: Arrow data throughout, processed by columnar based operators, with FPGA templates and FPGA operators (Aggregation/GroupBy/…).]
13. OAP NATIVE SQL ENGINE HIGHLIGHT
https://github.com/Intel-bigdata/OAP/
▪ An Open Source Columnar based Data Processing for SPARK
▪ Apache Arrow based Solution
▪ Enable AVX Support with SIMD instruction acceleration
▪ Leverage SPARK3.0 Support
▪ Communicate with 3rd Party Accelerators
▪ Support Data Source Parsing, SQL Operators, and Columnar Shuffle
▪ Common SQL Operators Support such as filter, join, groupby,
aggregate, …etc.
14. SPARK + FPGA + ARROW
Why SPARK SQL + FPGA
- SPARK SQL essentially processes a structured, row-based dataset all at once with
a single query made up of a number of SQL operators. The operators can be simple, while the
dataset can be extremely large.
- FPGA with highly specialized IPs can handle such multiple-instruction, single-dataset
analysis faster and more power- and resource-efficiently than CPU and GPU
under the same total cost of ownership.
Why Arrow
- Offloading SPARK SQL workloads from the Java runtime to FPGA by having a
new WholeStageCodegen invoke native function calls to process data on the
FPGA can be messy. Apache Arrow can hold Columnar Batch data inside native
memory and manage its memory reference inside Spark.
15. SPARK SQL FPGA ACCELERATION
▪ SQL Operators (Aggregation, GroupBy,
Filter, Sort, Join, …etc)
▪ Using Apache Arrow for data transfer
between Java runtime and FPGA to
reduce data traffic
▪ Next step will be leveraging
Arrow::RecordBatch
18. SPARK SQL + ARROW PERFORMANCE
▪ A simple query with 300GB dataset from TPC-DS Q55
▪ With Apache Arrow, the performance boost can be up to 33%, and CPU
load is visibly reduced.
SELECT ss_sold_date_sk, sum(ss_ext_sales_price) FROM store_sales
WHERE ss_item_sk = 3175 GROUP BY ss_sold_date_sk
[Bar chart: query time in minutes (axis 0 to 14) for dataset sizes of 32, 90, and 300 GB, comparing Arrow-boosted and Original.]
Test system: Intel® Xeon® Gold 6120 CPU x2, DDR4 256GB, Intel PAC Arria10 x1
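The query on this slide is a filter followed by a grouped sum, and it reads only three columns of `store_sales`. The sketch below evaluates the same logic over a columnar layout in plain Python to show why that access pattern suits a columnar engine. The table contents are made-up toy values, and dictionaries of lists stand in for real Arrow record batches.

```python
from collections import defaultdict

# Toy columnar table: one list per column (values are invented).
store_sales = {
    "ss_sold_date_sk": [1, 1, 2, 2, 3],
    "ss_item_sk": [3175, 10, 3175, 3175, 10],
    "ss_ext_sales_price": [10.0, 99.0, 5.0, 2.5, 1.0],
}


def sales_by_date(table, item_sk):
    """SELECT ss_sold_date_sk, sum(ss_ext_sales_price)
    WHERE ss_item_sk = item_sk GROUP BY ss_sold_date_sk."""
    totals = defaultdict(float)
    # Only the three referenced columns are scanned; any other column in
    # the table would never be read or converted.
    for date_sk, item, price in zip(table["ss_sold_date_sk"],
                                    table["ss_item_sk"],
                                    table["ss_ext_sales_price"]):
        if item == item_sk:
            totals[date_sk] += price
    return dict(totals)


result = sales_by_date(store_sales, 3175)
print(result)
```

In a row-based plan every row would be materialized in full before the filter runs, whereas here the filter and aggregation work directly on the column buffers.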
19. SYSTEM STACK
[Stack diagram, by layer:
Storage (Storage/Data Format): JSON, Parquet
Distributed Execution (Big Data Cores): MapReduce, Spark SQL Engine, Spark RDD/DF
OS (OS Core System): CentOS, RHEL, Ubuntu
FPGA Accelerator (Accelerators): MapReduce Accelerator, Spark SQL Accelerator, Spark RDD/DF Accelerator, Data Decoder, Compressor
WASAI components: System Lib & Drivers, IOBooster, EvoCores]
20. OTHER ACCELERATORS
Solution | Description | Workloads | Result
Spark RDD | groupByKey, foldByKey, etc | microbench | 80%~3x performance boost. Shuffle size 90% to 99% reduction.
General Sort | General TimSort for both Hadoop & Spark | TeraSort microbench | 20% performance boost
Compression | Compression encoding/decoding | microbench | Ongoing
Erasure Coding | EC codec | microbench | Reach maximum throughput of PCIe
Input Format Parsing | JSON, CSV, Parquet format parser | microbench | 2X~7.8X performance boost
Test system: Intel® Xeon® Gold 6120 CPU x2, DDR4 256GB, Intel PAC Arria10 x1
21. KEY TAKEAWAYS
▪ End-to-End Columnar Data Processing can optimize performance
on CPU, FPGA and other accelerators in the native layer.
▪ FPGA can help to accelerate SPARK in many cases that involve heavy
CPU-intensive operations.
▪ Last but not least, SPARK3.0 support opens up many new
opportunities for the future.
22. NEXT …
▪ More features in OAP Native SQL Engine
▪ OAP Native SQL Engine + FPGA integration
[Architecture diagram. Labels: OAP Native SQL Engine Plugin, Apache Arrow, Arrow Data Source, Arrow Data Processing, Columnar Shuffle, WASAI CodeGen, Intel CPU, WASAI FPGA Accelerators]
23. CALL TO ACTION
We encourage you to try OAP Native SQL Engine for SPARK at
https://github.com/Intel-bigdata/OAP/
Wasai SPARK SQL + FPGA Solution
https://www.wasaitech.com/
Please contact
Intel: weiting.chen@intel.com
Wasai: calvin.hung@wasaitech.com