The increasing challenge of serving ever-growing data volumes driven by AI and analytics workloads makes disaggregated storage and compute more attractive, as it enables companies to scale storage and compute capacity independently to match their respective growth rates. Cloud-based big data services are gaining momentum as they provide simplified management, elasticity, and a pay-as-you-go model.
In Spark SQL, the physical plan provides the fundamental information about how a query will execute. The objective of this talk is to build understanding and familiarity with query plans in Spark SQL, and to use that knowledge to achieve better performance from Apache Spark queries. We will walk you through the most common operators you might find in a query plan and explain the information they expose about the execution. If you understand the query plan, you can look for weak spots and rewrite the query to obtain a better plan that leads to more efficient execution.
The main content of this talk is based on the Spark source code, but it also reflects real-life queries that we run while processing data. We will show examples of query plans, explain how to interpret them and what information can be taken from them, and describe what happens under the hood when a plan is generated, focusing mainly on the physical planning phase. In short, we want to share what we have learned from both the Spark source code and the real-life queries we run in our daily data processing.
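As a minimal, hedged sketch of how to inspect a plan yourself (assuming a running SparkSession; the path and column names are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("plan-inspection").getOrCreate()

# Hypothetical input: a Parquet table of events
df = spark.read.parquet("/data/events")
agg = df.groupBy("user_id").count()

# Spark 3.0+: a readable operator tree plus per-operator details
agg.explain(mode="formatted")

# Older versions: print the parsed, analyzed and optimized logical plans
# plus the physical plan in one go
agg.explain(True)
```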
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc... (Databricks)
Spark SQL is a highly scalable and efficient relational processing engine with easy-to-use APIs and mid-query fault tolerance. It is a core module of Apache Spark. Spark SQL can process, integrate and analyze data from diverse data sources (e.g., Hive, Cassandra, Kafka and Oracle) and file formats (e.g., Parquet, ORC, CSV, and JSON). This talk will dive into the technical details of Spark SQL, spanning the entire lifecycle of a query execution. The audience will gain a deeper understanding of Spark SQL and learn how to tune its performance.
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia... (Databricks)
Nowadays, people are creating, sharing and storing data at a faster pace than ever before, so effective data compression and decompression can significantly reduce the cost of data usage. Apache Spark is a general distributed computing engine for big data analytics, and at runtime it stores and shuffles large amounts of data across the cluster, so the compression/decompression codecs can affect end-to-end application performance in many ways.
However, there is a trade-off between storage size and compression/decompression throughput (CPU computation). Balancing compression speed against ratio is an interesting topic, particularly while both software algorithms and CPU instruction sets keep evolving. Apache Spark provides a flexible compression codec interface with default implementations such as GZip, Snappy, LZ4 and ZSTD, and the Intel Big Data Technologies team has implemented additional codecs based on the latest Intel platforms, such as ISA-L (igzip), LZ4-IPP, Zlib-IPP and ZSTD. In this session, we compare the characteristics of those algorithms and implementations by running micro workloads as well as end-to-end workloads on different generations of Intel x86 platforms and disks.
This should help big data software engineers choose the proper compression/decompression codecs for their applications, and we also present methodologies for measuring and tuning the performance bottlenecks of typical Apache Spark workloads.
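In stock Spark, the codec used for shuffle, broadcast and spill data is a single configuration knob (the Intel codecs named above ship as separate plugins, not with upstream Spark); a minimal sketch:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("codec-tuning")
    # Codec for shuffle outputs, broadcast variables and spills;
    # upstream Spark ships lz4 (default), lzf, snappy and zstd
    .config("spark.io.compression.codec", "zstd")
    # Trade CPU for ratio: higher zstd levels compress harder
    .config("spark.io.compression.zstd.level", "3")
    # Columnar file compression is a separate knob
    .config("spark.sql.parquet.compression.codec", "snappy")
    .getOrCreate()
)
```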
This document discusses Spark shuffle, which is an expensive operation that involves data partitioning, serialization/deserialization, compression, and disk I/O. It provides an overview of how shuffle works in Spark and the history of optimizations like sort-based shuffle and an external shuffle service. Key concepts discussed include shuffle writers, readers, and the pluggable block transfer service that handles data transfer. The document also covers shuffle-related configuration options and potential future work.
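A hedged sketch of the kind of shuffle-related options such a document covers (these are standard Spark configuration keys; the values are illustrative, not recommendations):

```python
from pyspark import SparkConf

conf = (
    SparkConf()
    # External shuffle service: serve map outputs from a separate
    # process so executors can be reclaimed without losing shuffle files
    .set("spark.shuffle.service.enabled", "true")
    # Compress map outputs before they hit disk and the network
    .set("spark.shuffle.compress", "true")
    # Write buffer per shuffle file writer: fewer, larger disk writes
    .set("spark.shuffle.file.buffer", "1m")
    # Cap on in-flight remote fetch data per reduce task
    .set("spark.reducer.maxSizeInFlight", "96m")
)
```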
Parquet performance tuning: the missing guide (Ryan Blue)
Parquet performance tuning focuses on optimizing Parquet reads by leveraging columnar organization, encoding, and filtering techniques. Statistics and dictionary filtering can eliminate unnecessary data reads by filtering at the row group and page levels. However, these optimizations require columns to be sorted and fully dictionary encoded within files. Increasing dictionary size thresholds and decreasing row group sizes can help avoid dictionary encoding fallback and improve filtering effectiveness. Future work may include new encodings, compression algorithms like Brotli, and page-level filtering in the Parquet format.
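A sketch of how these knobs are commonly applied from Spark. Spark copies write options into the Hadoop configuration seen by the Parquet writer, but the exact behavior is version-dependent, and the DataFrame and path here are hypothetical:

```python
# Sort first so min/max statistics become selective and values cluster
# into fewer dictionary pages
(
    df.sort("event_type", "event_time")
      .write
      # Smaller row groups give finer-grained statistics filtering
      .option("parquet.block.size", str(64 * 1024 * 1024))
      # Larger dictionary pages delay fallback to plain encoding
      .option("parquet.dictionary.page.size", str(8 * 1024 * 1024))
      .option("parquet.enable.dictionary", "true")
      .parquet("/data/events_tuned")
)
```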
From DataFrames to Tungsten: A Peek into Spark's Future (Reynold Xin, Databri...) (Spark Summit)
The document discusses Spark's DataFrame API and the Tungsten project. DataFrames make Spark accessible to different users by providing a common API across languages like Python, R and Scala. Tungsten aims to improve Spark's performance for the next five years through techniques like runtime code generation and off-heap memory management. Initial results show Tungsten doubling performance. Together, DataFrames and Tungsten will help Spark scale to larger data and queries across different languages and execution backends.
Magnet Shuffle Service: Push-based Shuffle at LinkedIn (Databricks)
The number of daily Apache Spark applications at LinkedIn has increased by 3X in the past year. The shuffle process alone, which is one of the most costly operators in batch computation, is processing PBs of data and billions of blocks daily in our clusters. With such a rapid increase of Apache Spark workloads, we quickly realized that the shuffle process can become a severe bottleneck for both infrastructure scalability and workloads efficiency. In our production clusters, we have observed both reliability issues due to shuffle fetch connection failures and efficiency issues due to the random reads of small shuffle blocks on HDDs.
To tackle those challenges and optimize shuffle performance in Apache Spark, we have developed Magnet shuffle service, a push-based shuffle mechanism that works natively with Apache Spark. Our paper on Magnet has been accepted by VLDB 2020. In this talk, we will introduce how push-based shuffle can drastically increase shuffle efficiency when compared with the existing pull-based shuffle. In addition, by combining push-based shuffle and pull-based shuffle, we show how Magnet shuffle service helps to harden shuffle infrastructure at LinkedIn scale by both reducing shuffle related failures and removing scaling bottlenecks. Furthermore, we will share our experiences of productionizing Magnet at LinkedIn to process close to 10 PB of daily shuffle data.
Problems with PostgreSQL on Multi-core Systems with Multi-Terabyte Data (Jignesh Shah)
This document discusses PostgreSQL performance on multi-core systems with multi-terabyte data. It covers current market trends towards more cores and larger data sizes. Benchmark results show that PostgreSQL scales well on inserts up to a certain number of clients/cores but struggles with OLTP and TPC-E workloads due to lock contention. Issues are identified with sequential scans, index scans, and maintenance tasks like VACUUM as data sizes increase. The document proposes making PostgreSQL utilities and tools able to leverage multiple cores/processes to improve performance on modern hardware.
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag... (Databricks)
Structured Streaming provides stateful stream processing capabilities in Spark SQL through built-in operations like aggregations and joins as well as user-defined stateful transformations. It handles state automatically through watermarking to limit state size by dropping old data. For arbitrary stateful logic, MapGroupsWithState requires explicit state management by the user.
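A minimal sketch of the watermarking behavior described above (broker address, topic and column names are placeholders):

```python
from pyspark.sql.functions import col, window

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
    .option("subscribe", "events")                     # placeholder topic
    .load()
    .selectExpr("CAST(value AS STRING) AS key", "timestamp AS ts")
)

# The watermark bounds state size: windows more than 10 minutes behind
# the maximum observed event time are finalized and their state dropped
counts = (
    events.withWatermark("ts", "10 minutes")
          .groupBy(window(col("ts"), "5 minutes"), col("key"))
          .count()
)

query = counts.writeStream.outputMode("append").format("console").start()
```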
Netflix’s architecture involves thousands of microservices built to serve unique business needs. As this architecture grew, it became clear that the data storage and query needs were unique to each area; there is no one silver bullet which fits the data needs for all microservices. CDE (Cloud Database Engineering team) offers polyglot persistence, which promises to offer ideal matches between problem spaces and persistence solutions. In this meetup you will get a deep dive into the Self service platform, our solution to repairing Cassandra data reliably across different datacenters, Memcached Flash and cross region replication and Graph database evolution at Netflix.
How to Actually Tune Your Spark Jobs So They Work (Ilya Ganelin)
This document summarizes a USF Spark workshop that covers Spark internals and how to optimize Spark jobs. It discusses how Spark works with partitions, caching, serialization and shuffling data. It provides lessons on using less memory by partitioning wisely, avoiding shuffles, using the driver carefully, and caching strategically to speed up jobs. The workshop emphasizes understanding Spark and tuning configurations to improve performance and stability.
Cosco: An Efficient Facebook-Scale Shuffle Service (Databricks)
Cosco is an efficient shuffle-as-a-service that powers Spark (and Hive) jobs at Facebook warehouse scale. It is implemented as a scalable, reliable and maintainable distributed system. Cosco is based on the idea of partial in-memory aggregation across a shared pool of distributed memory. This provides vastly improved efficiency in disk usage compared to Spark's built-in shuffle. Long term, we believe the Cosco architecture will be key to efficiently supporting jobs at ever larger scale. In this talk we'll take a deep dive into the Cosco architecture and describe how it's deployed at Facebook. We will then describe how it's integrated to run shuffle for Spark, and contrast it with Spark's built-in sort-based shuffle mechanism and SOS (presented at Spark+AI Summit 2018).
Accelerating Spark SQL Workloads to 50X Performance with Apache Arrow-Based F... (Databricks)
In the big data field, Spark SQL is an important data processing module of Apache Spark that works with structured, row-based data in the majority of its operators. A field-programmable gate array (FPGA) with highly customized intellectual property (IP) can bring not only better performance but also lower power consumption when accelerating the CPU-intensive segments of an application.
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang (Databricks)
As a general computing engine, Spark can process data from various data management/storage systems, including HDFS, Hive, Cassandra and Kafka. For flexibility and high throughput, Spark defines the Data Source API, which is an abstraction of the storage layer. The Data Source API has two requirements.
1) Generality: support reading/writing most data management/storage systems.
2) Flexibility: customize and optimize the read and write paths for different systems based on their capabilities.
Data Source API V2 is one of the most important features coming with Spark 2.3. This talk will dive into the design and implementation of Data Source API V2, comparing it with Data Source API V1. We will also demonstrate how to implement a file-based data source using Data Source API V2 to show its generality and flexibility.
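The V2 interfaces themselves (TableProvider, ScanBuilder, and so on) are implemented in Scala or Java, but from the user's side any V2 source is addressed uniformly; a sketch with a hypothetical source name:

```python
# "com.example.logs" is a hypothetical Data Source V2 implementation on
# the classpath; options are passed through to the source itself
df = (
    spark.read.format("com.example.logs")
    .option("path", "/var/log/app")   # interpreted by the source
    .option("rotation", "hourly")     # hypothetical source option
    .load()
)

# A V2 source can expose a write path with its own capabilities
df.write.format("com.example.logs").mode("append").save()
```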
Building a SIMD Supported Vectorized Native Engine for Spark SQL (Databricks)
Spark SQL works very well with structured, row-based data, and its vectorized readers and writers for Parquet/ORC make I/O much faster. It also uses whole-stage code generation to improve performance through Java JIT code. However, the Java JIT usually does not utilize the latest SIMD instructions well under complicated queries. Apache Arrow provides a columnar in-memory layout and SIMD-optimized kernels, as well as Gandiva, an LLVM-based SQL engine. These native libraries can accelerate Spark SQL by reducing CPU usage for both I/O and execution.
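From the Python side, Spark's own route to columnar, SIMD-friendly execution is Arrow; a small sketch (the config key shown is for Spark 3.x):

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf

# Ship rows between the JVM and Python workers as Arrow columnar batches
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

@pandas_udf("double")
def to_celsius(f: pd.Series) -> pd.Series:
    # Vectorized: runs on whole batches, so pandas/NumPy can apply
    # SIMD-optimized kernels instead of per-row Python calls
    return (f - 32.0) * 5.0 / 9.0

df = spark.range(10).selectExpr("CAST(id AS DOUBLE) AS temp_f")
df.select(to_celsius("temp_f").alias("temp_c")).show()
```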
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015 (Chris Fregly)
The document summarizes a presentation given by Chris Fregly on Project Tungsten and optimizations in Apache Spark. It discusses techniques like using off-heap memory, minimizing cache misses, and saturating I/O to sort 100 terabytes of data in Spark. The presentation also covered a recap of the "100TB GraySort challenge" where custom data structures and algorithms were used to optimize sorting and shuffling of data.
"The common use cases of Spark SQL include ad hoc analysis, logical warehouse, query federation, and ETL processing. Spark SQL also powers the other Spark libraries, including structured streaming for stream processing, MLlib for machine learning, and GraphFrame for graph-parallel computation. For boosting the speed of your Spark applications, you can perform the optimization efforts on the queries prior employing to the production systems. Spark query plans and Spark UIs provide you insight on the performance of your queries. This talk discloses how to read and tune the query plans for enhanced performance. It will also cover the major related features in the recent and upcoming releases of Apache Spark.
"
Apache Spark presentation at HasGeek Fifth Elephant
https://fifthelephant.talkfunnel.com/2015/15-processing-large-data-with-apache-spark
Covering Big Data Overview, Spark Overview, Spark Internals and its supported libraries
"Spark Search" - In-memory, Distributed Search with Lucene, Spark, and Tachyo...Lucidworks
Spark Search is a personal project that integrates Lucene with Apache Spark for interactive search, analytics, and machine learning on big data. Experiments showed that indexing large datasets with Lucene directly was faster than using Solr or Elasticsearch on a single node with minimum parallelism, due to their additional overhead. Spark provides an in-memory distributed computing framework that can help address the challenges of indexing and searching big data with Lucene at scale more easily than traditional distributed search technologies. The presentation called for participation to help build out the Spark Search community and project.
Accelerate and Scale Big Data Analytics with Disaggregated Compute and Storage (Alluxio, Inc.)
Alluxio Tech Talk
Jul 17, 2019
Speakers:
Brien Porter, Intel
Alex Ma, Alluxio
The ever-increasing challenge of processing and extracting value from exploding data with AI and analytics workloads makes a memory-centric architecture with disaggregated storage and compute more attractive. This decoupled architecture enables users to innovate faster and scale on demand. Enterprises are also increasingly looking towards object stores to power their big data and machine learning workloads in a cost-effective way. However, object stores provide neither big-data-compatible APIs nor the required performance.
In this webinar, the Intel and Alluxio teams will present a proposed reference architecture using Alluxio as the in-memory accelerator for object stores to enable modern analytical workloads such as Spark, Presto, Tensorflow, and Hive. We will also present a technical overview of Alluxio.
How AI and ML are driving Memory Architecture changes (Danny Sabour)
Artificial intelligence and machine learning are fundamentally changing compute workloads in the cloud, at the edge, and in the IoT node. Memory architectures have changed to treat persistence as a primary need, ahead of speed and low power. MRAM, with its inherent persistence, low power and speed, is destined to become the next-generation memory of choice all the way from the IoT node to the edge and the cloud.
Flash Memory Summit enterprise update 2019 (Howard Marks)
The document provides an annual update on enterprise flash storage trends. It discusses how flash has become mainstream for primary storage due to declining costs. All-flash arrays now have a larger market share than hybrid arrays. Emerging technologies discussed include NVMe over Fabrics, which extends NVMe protocols over Ethernet and Fibre Channel, and Storage Class Memory using 3D XPoint, which provides faster storage than NAND flash. The document highlights several vendors that are adopting these technologies.
Optimizing Performance and Computing Resource Efficiency of In-Memory Big Dat... (Databricks)
The performance of modern big data frameworks such as Spark depends greatly on high-speed storage and shuffling, which impose a significant memory burden on production data centers. In many production situations, persistence- and shuffle-intensive applications suffer a major performance loss due to lack of memory, so the common practice is to over-allocate the memory assigned to data workers, which in turn reduces overall resource utilization. One efficient way to address this dilemma between performance and cost efficiency is data center computing resource disaggregation. This paper proposes and implements a system that combines the Apache Spark big data framework with a novel in-memory distributed file system to achieve memory disaggregation for data persistence and shuffling. We address the challenge of optimizing performance at low cost by co-designing the proposed in-memory distributed file system with large-volume DIMM-based persistent memory (PMEM) and RDMA technology. The disaggregation design allows each part of the system to be scaled independently, which is particularly suitable for cloud deployments. The proposed system is evaluated in a production-level cluster using real enterprise-level Spark production applications. The results of an empirical evaluation show that the system can achieve up to a 3.5-fold performance improvement for shuffle-intensive applications with the same amount of memory, compared to the default Spark setup. Moreover, by leveraging PMEM, we demonstrate that our system can effectively increase the memory capacity of the computing cluster by 66.5% at affordable cost, with a reasonable execution-time overhead with respect to using local DRAM only.
Healthcare Claim Reimbursement using Apache Spark (Databricks)
The document discusses rewriting a claims reimbursement system using Spark. It describes how Spark provides better performance, scalability and cost savings compared to the previous Oracle-based system. Key points include using Spark for ETL to load data into a Delta Lake data lake, implementing the business logic in a reusable Java library, and seeing significant increases in processing volumes and speeds compared to the prior system. Challenges and tips for adoption are also provided.
Use cases like high-performance computing (HPC), AI, and IoTA can generate a huge volume of data. Learn how Intel® Optane™ DC persistent memory can be an alternative to DRAM for applications that benefit from a very large volatile memory capacity.
Presented at Spark+AI Summit Europe 2019
https://databricks.com/session_eu19/apache-spark-at-scale-in-the-cloud
Using Apache Spark to analyze large datasets in the cloud presents a range of challenges. Different stages of your pipeline may be constrained by CPU, memory, disk and/or network IO. But what if all those stages have to run on the same cluster? In the cloud, you have limited control over the hardware your cluster runs on.
You may have even less control over the size and format of your raw input files. Performance tuning is an iterative and experimental process. It’s frustrating with very large datasets: what worked great with 30 billion rows may not work at all with 400 billion rows. But with strategic optimizations and compromises, 50+ TiB datasets can be no big deal.
Using the Spark UI and simple metrics, we explore how to diagnose and remedy issues on jobs (a configuration sketch follows this list):
Sizing the cluster based on your dataset (shuffle partitions)
Ingestion challenges – well begun is half done (globbing S3, small files)
Managing memory (sorting GC – when to go parallel, when to go G1, when offheap can help you)
Shuffle (give a little to get a lot – configs for better out of box shuffle) – Spill (partitioning for the win)
Scheduling (FAIR vs FIFO, is there a difference for your pipeline?)
Caching and persistence (it’s the cost of doing business, so what are your options?)
Fault tolerance (blacklisting, speculation, task reaping)
Making the best of a bad deal (skew joins, windowing, UDFs, very large query plans)
Writing to S3 (dealing with write partitions, HDFS and s3DistCp vs writing directly to S3)
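A hedged configuration sketch touching several of the items above (the keys are standard Spark configs; the values are illustrative and workload-dependent):

```python
from pyspark import SparkConf

conf = (
    SparkConf()
    # Size shuffle partitions to the dataset, not the default of 200
    .set("spark.sql.shuffle.partitions", "4000")
    # FAIR scheduling lets concurrent jobs in one app share executors
    .set("spark.scheduler.mode", "FAIR")
    # Re-run suspiciously slow tasks elsewhere (speculation/task reaping)
    .set("spark.speculation", "true")
    # G1 often behaves better than the parallel collector on large heaps
    .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
    # Move large sort/aggregation buffers off the JVM heap
    .set("spark.memory.offHeap.enabled", "true")
    .set("spark.memory.offHeap.size", "8g")
)
```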
Realizing Exabyte-scale PM Centric Architectures and Memory Fabrics (inside-BigData.com)
This presentation discusses the need for exabyte-scale persistent memory architectures and memory fabrics to support growing data and workload demands. It notes that current distributed systems rely on general-purpose CPUs and RDMA, but purpose-built architectures are needed. Achieving sub-microsecond latency across large memory pools spanning thousands of nodes requires innovations in memory technology, networking fabrics and protocols. P4 programmable switches could enable prototyping of new memory fabrics like Gen-Z that meet performance requirements.
Accelerating Shuffle: A Tailor-Made RDMA Solution for Apache Spark with Yuval... (Spark Summit)
The opportunity in accelerating Spark by improving its network data transfer facilities has been under much debate in the last few years. RDMA (remote direct memory access) is a network acceleration technology that is very prominent in the HPC (high-performance computing) world, but has not yet made its way to mainstream Apache Spark. Proper implementation of RDMA in network-oriented applications can improve scalability, throughput, latency and CPU utilization. In this talk we are going to present a new RDMA solution for Apache Spark that shows amazing improvements in multiple Spark use cases. The solution is under development in our labs, and is going to be released to the public as an open-source plug-in.
Building a High Performance Analytics Platform (Santanu Dey)
The document discusses using flash memory to build a high performance data platform. It notes that flash memory is faster than disk storage and cheaper than RAM. The platform utilizes NVMe flash drives connected via PCIe for high speed performance. This allows it to provide in-memory database speeds at the cost and density of solid state drives. It can scale independently by adding compute nodes or storage nodes. The platform offers a unified database for both real-time and analytical workloads through common APIs.
Elastify Cloud-Native Spark Application with Persistent Memory (Databricks)
Cloud-native deployment has become one of the major trends for large-scale big data analytics. Compared to an on-premise data center, the cloud offers much stronger scalability and higher elasticity to big data applications. However, the cloud is also considered less performant than on-premise alternatives due to virtualization and cluster resource disaggregation. We present a new cloud-native Spark application architecture backed by persistent memory technology. The key ingredient of this architecture is a novel acceleration engine that uses Intel's 3D XPoint technology as external memory. We discuss how the performance of multiple aspects of data processing can be improved using this new architecture. As a key takeaway, the audience will gain an understanding of the benefits of the latest persistent memory technology and how it can be leveraged in cloud data processing architectures.
The document discusses in-memory computing and emerging technologies. It describes how in-memory applications are driving new storage class memory like 3D XPoint that has lower latency than NAND but higher capacity than DRAM. The document also discusses how in-memory solutions are using tiering of memory and storage like DRAM, 3D XPoint, NVM, and NAND to handle larger datasets. Emerging high speed fabrics and disaggregated storage are enabling more efficient scaling of memory and storage tiers independent of compute.
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ... (Databricks)
This document summarizes a presentation about using the Crail distributed storage system to improve Spark performance on high-performance computing clusters with RDMA networking and NVMe flash storage. The key points are:
1) Traditional Spark storage and networking APIs do not bypass the operating system kernel, limiting performance on modern hardware.
2) The Crail system provides user-level APIs for RDMA networking and NVMe flash to improve Spark shuffle, join, and sorting workloads by 2-10x on a 128-node cluster.
3) Crail allows Spark workloads to fully utilize high-speed networks and disaggregate memory and flash storage across nodes without performance penalties.
EVCache: Lowering Costs for a Low Latency Cache with RocksDB (Scott Mansfield)
EVCache is a distributed, sharded, replicated key-value store optimized for Netflix's use cases on AWS. It is based on Memcached but uses RocksDB for persistent storage, lowering costs compared to storing all data in memory. Moneta is the next generation EVCache server, using Rend and Mnemonic libraries to intelligently manage data placement in RAM and SSD. This provides high performance for both volatile and batch workloads while reducing costs by 70% compared to the original Memcached-based design.
Analytics, Big Data and Nonvolatile Memory Architectures – Why you Should Car... (StampedeCon)
This session will begin with an overview of current non-volatile memory (NVM, aka persistent memory) architectures and its relationship between several levels of memory and storage hierarchy, both near- and far-processor. A discussion on its significant impact on computing analytic workloads now and in the near future will ensue, including use cases and the concept of very large persistent memory surfaces as applied to both analytic computation and storage for big data workflows. The presentation will end with ‘why you should care’ about such technologies which inevitably will completely change the way we think about solving data-intensive problems.
Dataplane networking acceleration with OpenDataplane / Максим Уваров (Linaro) (Ontico)
HighLoad++ 2017
Moscow Hall, November 7, 13:00
Abstract:
http://www.highload.ru/2017/abstracts/2909.html
OpenDataPlane (ODP, https://www.opendataplane.org) is an open-source API for network data plane applications that provides an abstraction layer between the network chip and the application. Vendors such as TI, Freescale and Cavium now ship SDKs with ODP support for their SoC chips. By analogy with the graphics stack, ODP can be compared to the OpenGL API, but in the domain of network programming.
...
DAOS - Scale-Out Software-Defined Storage for HPC/Big Data/AI Convergence (inside-BigData.com)
In this deck, Johann Lombardi from Intel presents: DAOS - Scale-Out Software-Defined Storage for HPC/Big Data/AI Convergence.
"Intel has been building an entirely open source software ecosystem for data-centric computing, fully optimized for Intel® architecture and non-volatile memory (NVM) technologies, including Intel Optane DC persistent memory and Intel Optane DC SSDs. Distributed Asynchronous Object Storage (DAOS) is the foundation of the Intel exascale storage stack. DAOS is an open source software-defined scale-out object store that provides high bandwidth, low latency, and high I/O operations per second (IOPS) storage containers to HPC applications. It enables next-generation data-centric workflows that combine simulation, data analytics, and AI."
Unlike traditional storage stacks that were primarily designed for rotating media, DAOS is architected from the ground up to make use of new NVM technologies, and it is extremely lightweight because it operates end-to-end in user space with full operating system bypass. DAOS offers a shift away from an I/O model designed for block-based, high-latency storage to one that inherently supports fine- grained data access and unlocks the performance of next- generation storage technologies.
Watch the video: https://youtu.be/wnGBW31yhLM
Learn more: https://www.intel.com/content/www/us/en/high-performance-computing/daos-high-performance-storage-brief.html
RedisConf18 - Re-architecting Redis-on-Flash with Intel 3D XPoint™ Memory (Redis Labs)
The document discusses re-architecting Redis-on-Flash with Intel 3D XPoint memory. It introduces 3D XPoint as a new type of memory that is persistent, offers high capacity (6 TB per system), and is cheaper than DRAM. Redis Labs and Intel are collaborating to build the next version of Redis-on-Flash on 3D XPoint memory to increase scalability through larger memory modules and to reduce costs compared to DRAM. The challenges include higher latency compared to DRAM and evolving standards.
Ceph Community Talk on High-Performance Solid State Ceph (Ceph Community)
The document summarizes a presentation given by representatives from various companies on optimizing Ceph for high-performance solid state drives. It discusses testing a real workload on a Ceph cluster with 50 SSD nodes that achieved over 280,000 read and write IOPS. Areas for further optimization were identified, such as reducing latency spikes and improving single-threaded performance. Various companies then described their contributions to Ceph performance, such as Intel providing hardware for testing and Samsung discussing SSD interface improvements.
Similar to Accelerating Apache Spark Shuffle for Data Analytics on the Cloud with Remote Persistent Memory Pools
The document discusses migrating a data warehouse to the Databricks Lakehouse Platform. It outlines why legacy data warehouses are struggling, how the Databricks Platform addresses these issues, and key considerations for modern analytics and data warehousing. The document then provides an overview of the migration methodology, approach, strategies, and key takeaways for moving to a lakehouse on Databricks.
Data Lakehouse Symposium | Day 1 | Part 1 (Databricks)
The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse.
Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today.
Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow.
This is an educational event.
Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.
Data Lakehouse Symposium | Day 1 | Part 2 (Databricks)
The document discusses the challenges of modern data, analytics, and AI workloads. Most enterprises struggle with siloed data systems that make integration and productivity difficult. The future of data lies with a data lakehouse platform that can unify data engineering, analytics, data warehousing, and machine learning workloads on a single open platform. The Databricks Lakehouse platform aims to address these challenges with its open data lake approach and capabilities for data engineering, SQL analytics, governance, and machine learning.
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop (Databricks)
In this session, learn how to quickly supplement your on-premises Hadoop environment with a simple, open, and collaborative cloud architecture that enables you to generate greater value with scaled application of analytics and AI on all your data. You will also learn five critical steps for a successful migration to the Databricks Lakehouse Platform along with the resources available to help you begin to re-skill your data teams.
Democratizing Data Quality Through a Centralized Platform (Databricks)
Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale.
At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform's capabilities (a toy validation sketch follows the list below), including:
Giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal
Performing data quality validations using libraries built to work with spark
Dynamically generating pipelines that can be abstracted away from users
Flagging data that doesn’t meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers
Exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time
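Zillow's platform itself is proprietary, but a toy version of one Spark-backed validation (all names hypothetical) could look like:

```python
from pyspark.sql import DataFrame
from pyspark.sql.functions import col

def check_null_rate(df: DataFrame, column: str, max_null_rate: float) -> None:
    """Fail fast when a column's null rate breaks the producer's contract."""
    total = df.count()
    nulls = df.filter(col(column).isNull()).count()
    null_rate = nulls / total if total else 0.0
    if null_rate > max_null_rate:
        raise ValueError(
            f"{column}: null rate {null_rate:.2%} exceeds {max_null_rate:.2%}"
        )

# A producer-defined expectation, evaluated before downstream use
check_null_rate(spark.read.parquet("/data/listings"), "zip_code", 0.01)
```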
Learn to Use Databricks for Data Science (Databricks)
Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever: one that provides easier access and visibility into the data, reports and dashboards built against the data, reproducibility, and insights uncovered within the data. Join us to hear how Databricks' open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale, all on one unified platform.
Why APM Is Not the Same As ML Monitoring (Databricks)
Application performance monitoring (APM) has become the cornerstone of software engineering allowing engineering teams to quickly identify and remedy production issues. However, as the world moves to intelligent software applications that are built using machine learning, traditional APM quickly becomes insufficient to identify and remedy production issues encountered in these modern software applications.
As a lead software engineer at NewRelic, my team built high-performance monitoring systems including Insights, Mobile, and SixthSense. As I transitioned to building ML Monitoring software, I found the architectural principles and design choices underlying APM to not be a good fit for this brand new world. In fact, blindly following APM designs led us down paths that would have been better left unexplored.
In this talk, I draw upon my (and my team’s) experience building an ML Monitoring system from the ground up and deploying it on customer workloads running large-scale ML training with Spark as well as real-time inference systems. I will highlight how the key principles and architectural choices of APM don’t apply to ML monitoring. You’ll learn why, understand what ML Monitoring can successfully borrow from APM, and hear what is required to build a scalable, robust ML Monitoring architecture.
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix (Databricks)
Autonomy and ownership are core to working at Stitch Fix, particularly on the Algorithms team. We enable data scientists to deploy and operate their models independently, with minimal need for handoffs or gatekeeping. By writing a simple function and calling out to an intuitive API, data scientists can harness a suite of platform-provided tooling meant to make ML operations easy. In this talk, we will dive into the abstractions the Data Platform team has built to enable this. We will go over the interface data scientists use to specify a model and what that hooks into, including online deployment, batch execution on Spark, and metrics tracking and visualization.
Stage Level Scheduling Improving Big Data and AI Integration (Databricks)
In this talk, I will dive into the stage level scheduling feature added to Apache Spark 3.1. Stage level scheduling extends upon Project Hydrogen by improving big data ETL and AI integration and also enables multiple other use cases. It is beneficial any time the user wants to change container resources between stages in a single Apache Spark application, whether those resources are CPU, Memory or GPUs. One of the most popular use cases is enabling end-to-end scalable Deep Learning and AI to efficiently use GPU resources. In this type of use case, users read from a distributed file system, do data manipulation and filtering to get the data into a format that the Deep Learning algorithm needs for training or inference and then sends the data into a Deep Learning algorithm. Using stage level scheduling combined with accelerator aware scheduling enables users to seamlessly go from ETL to Deep Learning running on the GPU by adjusting the container requirements for different stages in Spark within the same application. This makes writing these applications easier and can help with hardware utilization and costs.
There are other ETL use cases where users want to change CPU and memory resources between stages, for instance there is data skew or perhaps the data size is much larger in certain stages of the application. In this talk, I will go over the feature details, cluster requirements, the API and use cases. I will demo how the stage level scheduling API can be used by Horovod to seamlessly go from data preparation to training using the Tensorflow Keras API using GPUs.
The talk will also touch on other new Apache Spark 3.1 functionality, such as pluggable caching, which can be used to enable faster dataframe access when operating from GPUs.
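A minimal sketch of the Spark 3.1+ stage-level scheduling API (it applies at the RDD level and assumes a cluster with dynamic allocation and GPU-aware scheduling configured; the ETL output and training function are hypothetical):

```python
from pyspark.resource import (ExecutorResourceRequests, ResourceProfileBuilder,
                              TaskResourceRequests)

# ETL stages run on the default executors; for the training stage,
# request executors that each carry a GPU
ereqs = ExecutorResourceRequests().cores(4).memory("16g").resource("gpu", 1)
treqs = TaskResourceRequests().cpus(1).resource("gpu", 1)
profile = ResourceProfileBuilder().require(ereqs).require(treqs).build

# etl_df is the hypothetical output of the data preparation stages
gpu_stage = etl_df.rdd.withResources(profile)
trained = gpu_stage.mapPartitions(train_partition)  # hypothetical trainer
```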
Simplify Data Conversion from Spark to TensorFlow and PyTorch (Databricks)
In this talk, I would like to introduce an open-source tool built by our team that simplifies the data conversion from Apache Spark to deep learning frameworks.
Imagine you have a large dataset, say 20 GBs, and you want to use it to train a TensorFlow model. Before feeding the data to the model, you need to clean and preprocess your data using Spark. Now you have your dataset in a Spark DataFrame. When it comes to the training part, you may have the problem: How can I convert my Spark DataFrame to some format recognized by my TensorFlow model?
The existing data conversion process can be tedious. For example, to convert an Apache Spark DataFrame to a TensorFlow Dataset file format, you need to either save the Apache Spark DataFrame on a distributed filesystem in parquet format and load the converted data with third-party tools such as Petastorm, or save it directly in TFRecord files with spark-tensorflow-connector and load it back using TFRecordDataset. Both approaches take more than 20 lines of code to manage the intermediate data files, rely on different parsing syntax, and require extra attention for handling vector columns in the Spark DataFrames. In short, all these engineering frictions greatly reduced the data scientists’ productivity.
The Databricks Machine Learning team contributed a new Spark Dataset Converter API to Petastorm to simplify these tedious data conversion process steps. With the new API, it takes a few lines of code to convert a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader with default parameters.
In the talk, I will use an example to show how to use the Spark Dataset Converter to train a Tensorflow model and how simple it is to go from single-node training to distributed training on Databricks.
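A sketch of the converter API the talk describes, as documented in Petastorm (the cache directory, DataFrame and model are placeholders):

```python
from petastorm.spark import SparkDatasetConverter, make_spark_converter

# Where the converter materializes the intermediate Parquet cache
spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF,
               "file:///tmp/petastorm_cache")

converter = make_spark_converter(preprocessed_df)  # hypothetical DataFrame

# TensorFlow: yields a tf.data.Dataset
with converter.make_tf_dataset() as dataset:
    model.fit(dataset, steps_per_epoch=100, epochs=3)  # hypothetical model

# PyTorch: yields a DataLoader-like object
with converter.make_torch_dataloader() as dataloader:
    for batch in dataloader:
        pass  # hypothetical training step
```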
Scaling your Data Pipelines with Apache Spark on Kubernetes (Databricks)
There is no doubt Kubernetes has emerged as the next generation of cloud native infrastructure to support a wide variety of distributed workloads. Apache Spark has evolved to run both Machine Learning and large scale analytics workloads. There is growing interest in running Apache Spark natively on Kubernetes. By combining the flexibility of Kubernetes and scalable data processing with Apache Spark, you can run any data and machine pipelines on this infrastructure while effectively utilizing resources at disposal.
In this talk, Rajesh Thallam and Sougata Biswas will share how to effectively run your Apache Spark applications on Google Kubernetes Engine (GKE) and Google Cloud Dataproc, and how to orchestrate data and machine learning pipelines with managed Apache Airflow on GKE (Google Cloud Composer). The following topics will be covered:
Understanding key traits of Apache Spark on Kubernetes
Things to know when running Apache Spark on Kubernetes, such as autoscaling
Demonstrating analytics pipelines running on Apache Spark, orchestrated with Apache Airflow on a Kubernetes cluster
Scaling and Unifying SciKit Learn and Apache Spark Pipelines (Databricks)
Pipelines have become ubiquitous, as the need for stringing multiple functions to compose applications has gained adoption and popularity. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python Scikit-Learn and Apache Spark.
Scaling pipelines at the level of simple functions is desirable for many AI applications; however, it is not directly supported by Ray's parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray's compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations.
Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.
Sawtooth Windows for Feature Aggregations (Databricks)
In this talk about Zipline, we will introduce a new type of windowing construct called a sawtooth window. We will describe various properties of sawtooth windows that we utilize to achieve online-offline consistency, while still maintaining high throughput, low read latency and tunable write latency for serving machine learning features. We will also talk about a simple deployment strategy for correcting feature drift due to operations that are not abelian groups and that operate over change data.
We want to present multiple anti-patterns that use Redis in unconventional ways to get the maximum out of Apache Spark. All examples presented are tried and tested in production at scale at Adobe. The most common integration is spark-redis, which interfaces with Redis as a DataFrame backing store or as an upstream for Structured Streaming. We deviate from the common use cases to explore where Redis can plug gaps while scaling out high-throughput applications in Spark.
Niche 1: Long-Running Spark Batch Job – Dispatch New Jobs by Polling a Redis Queue
· Why?
o Custom queries on top of a table; we load the data once and query N times
· Why not Structured Streaming?
· Working solution using Redis
Niche 2: Distributed Counters (see the sketch after this list)
· Problems with Spark accumulators
· Utilize Redis hashes as distributed counters
· Precautions for retries and speculative execution
· Pipelining to improve performance
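A sketch of the distributed-counter niche with redis-py (host and key names are placeholders; note the retry caveat from the list above):

```python
import redis

def count_rows(partition):
    seen = 0
    for row in partition:
        seen += 1
        yield row
    # One connection and one atomic HINCRBY per partition, not per row
    r = redis.Redis(host="redis-host", port=6379)  # placeholder address
    r.hincrby("job:42:counters", "rows_processed", seen)

# Counters update as the result is materialized; with retries or
# speculative execution enabled, key the field by task attempt to
# avoid double counting
processed = df.rdd.mapPartitions(count_rows)
```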
Re-imagine Data Monitoring with whylogs and Spark (Databricks)
In the era of microservices, decentralized ML architectures and complex data pipelines, data quality has become a bigger challenge than ever. When data is involved in complex business processes and decisions, bad data can, and will, affect the bottom line. As a result, ensuring data quality across the entire ML pipeline is both costly, and cumbersome while data monitoring is often fragmented and performed ad hoc. To address these challenges, we built whylogs, an open source standard for data logging. It is a lightweight data profiling library that enables end-to-end data profiling across the entire software stack. The library implements a language and platform agnostic approach to data quality and data monitoring. It can work with different modes of data operations, including streaming, batch and IoT data.
In this talk, we will provide an overview of the whylogs architecture, including its lightweight statistical data collection approach and various integrations. We will demonstrate how the whylogs integration with Apache Spark achieves large scale data profiling, and we will show how users can apply this integration into existing data and ML pipelines.
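As a rough illustration only: whylogs v1 profiles pandas DataFrames directly, so the simplest (if lossy) route from Spark is a sampled conversion; the native pyspark integration mentioned in the talk has its own API, which we do not reproduce here:

```python
import whylogs as why

# spark_df is a hypothetical Spark DataFrame; sampling keeps the
# toPandas() conversion cheap at the cost of approximate statistics
sample_pdf = spark_df.sample(0.01).toPandas()

results = why.log(sample_pdf)      # lightweight statistical profile
profile_view = results.view()      # mergeable, serializable summary
print(profile_view.to_pandas())    # per-column statistics
```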
Raven: End-to-end Optimization of ML Prediction Queries (Databricks)
Machine learning (ML) models are typically part of prediction queries that consist of a data processing part (e.g., for joining, filtering, cleaning, featurization) and an ML part invoking one or more trained models. In this presentation, we identify significant and unexplored opportunities for optimization. To the best of our knowledge, this is the first effort to look at prediction queries holistically, optimizing across both the ML and SQL components.
We will present Raven, an end-to-end optimizer for prediction queries. Raven relies on a unified intermediate representation that captures both data processing and ML operators in a single graph structure.
This allows us to introduce optimization rules that
(i) reduce unnecessary computations by passing information between the data processing and ML operators,
(ii) leverage operator transformations (e.g., turning a decision tree into a SQL expression or an equivalent neural network) to map operators to the right execution engine, and
(iii) integrate compiler techniques to take advantage of the most efficient hardware backend (e.g., CPU, GPU) for each operator.
We have implemented Raven as an extension to Spark’s Catalyst optimizer to enable the optimization of SparkSQL prediction queries. Our implementation also allows the optimization of prediction queries in SQL Server. As we will show, Raven is capable of improving prediction query performance on Apache Spark and SQL Server by up to 13.1x and 330x, respectively. For complex models, where GPU acceleration is beneficial, Raven provides up to 8x speedup compared to state-of-the-art systems. As part of the presentation, we will also give a demo showcasing Raven in action.
Processing Large Datasets for ADAS Applications using Apache Spark (Databricks)
Semantic segmentation is the classification of every pixel in an image/video. The segmentation partitions a digital image into multiple objects to simplify/change the representation of the image into something that is more meaningful and easier to analyze [1][2]. The technique has a wide variety of applications ranging from perception in autonomous driving scenarios to cancer cell segmentation for medical diagnosis.
Exponential growth in the datasets that require such segmentation is driven by improvements in the accuracy and quality of the sensors generating the data extending to 3D point cloud data. This growth is further compounded by exponential advances in cloud technologies enabling the storage and compute available for such applications. The need for semantically segmented datasets is a key requirement to improve the accuracy of inference engines that are built upon them.
Streamlining the accuracy and efficiency of these systems directly affects the value of the business outcome for organizations that are developing such functionalities as a part of their AI strategy.
This presentation details workflows for labeling, preprocessing, modeling, and evaluating performance/accuracy. Scientists and engineers leverage domain-specific features/tools that support the entire workflow from labeling the ground truth, handling data from a wide variety of sources/formats, developing models and finally deploying these models. Users can scale their deployments optimally on GPU-based cloud infrastructure to build accelerated training and inference pipelines while working with big datasets. These environments are optimized for engineers to develop such functionality with ease and then scale against large datasets with Spark-based clusters on the cloud.
Massive Data Processing in Adobe Using Delta Lake (Databricks)
At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile offering. At the heart of this is complex ingestion of a mix of normalized and denormalized data with various linkage scenarios, powered by a central Identity Linking Graph. This powers various marketing scenarios that are activated in multiple platforms and channels like email, advertisements, etc. We will go over how we built a cost-effective and scalable data pipeline using Apache Spark and Delta Lake and share our experiences (a minimal merge sketch follows the agenda below).
What are we storing?
Multi Source – Multi Channel Problem
Data Representation and Nested Schema Evolution
Performance Trade Offs with Various formats
Go over anti-patterns used
(String FTW)
Data Manipulation using UDFs
Writer Worries and How to Wipe them Away
Staging Tables FTW
Datalake Replication Lag Tracking
Performance Time!
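A minimal sketch of the Delta Lake upsert pattern behind such a pipeline (paths, table and join keys are hypothetical):

```python
from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "/delta/profiles")  # hypothetical path

# Upsert a staged micro-batch of normalized/denormalized records
(
    target.alias("t")
    .merge(updates.alias("s"), "t.profile_id = s.profile_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

# Nested schema evolution on append
(
    updates.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("/delta/profiles")
)
```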
5. Motivation - Challenges of Spark shuffle
▪ Data Center Infrastructure evolution
▪ Compute and storage disaggregation has become a key trend, and diskless environments are increasingly popular
▪ Modern datacenters are evolving: high-speed networks between compute and disaggregated storage, together with tiered storage architectures, make local storage less attractive
▪ New storage technologies are emerging, e.g., storage class memory (or PMem)
▪ Spark shuffle problems
▪ Uneven resource utilization of CPU and Memory
▪ Out of memory issues and GC
▪ Disk I/O too slow
▪ Data spill degrades performance
▪ Shuffle I/O grows quadratically with data
▪ Local SSDs wear out by frequent intermediate data writes
▪ Unaffordable re-compute cost
▪ Other related works
▪ Intel disaggregated shuffle with DAOS [1], Facebook Cosco [2], Baidu DCE shuffle [3], JD.com & MemVerge RSS [4], etc.
1. https://www.slideshare.net/databricks/improving-apache-spark-by-taking-advantage-of-disaggregated-architecture
2. https://databricks.com/session/cosco-an-efficient-facebook-scale-shuffle-service
3. http://apache-spark-developers-list.1001551.n3.nabble.com/Enabling-fully-disaggregated-shuffle-on-Spark-td28329.html
4. https://databricks.com/session/optimizing-performance-and-computing-resource-efficiency-of-in-memory-big-data-analytics-with-disaggregated-persistent-memory
6. Re-cap of Shuffle
[Diagram: each map task loads its split of an input HDFS file and sorts it; each map's output becomes intermediate data that the shuffle randomly partitions across reducers; reducers sort their partitions and write an output HDFS file. Data is compressed on write and decompressed on read at each hop. Shuffle writes go to local storage (the shuffle service can cache the data); shuffle reads fetch remote data via the network.]
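To make the recap concrete, here is a minimal, illustrative sketch (in C++, not Spark's actual Scala internals) of what each map task does on the write side: hash-partition records by key across the reducers, then sort within each partition before it would be compressed and written out locally.

```cpp
// Minimal sketch of a map task's shuffle-write step, under the assumption
// of simple (key, value) records; real Spark adds serialization, spilling,
// and compression on top of this.
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

using Record = std::pair<int, std::string>;  // (key, value)

std::vector<std::vector<Record>> shuffleWrite(const std::vector<Record>& mapOutput,
                                              std::size_t numReducers) {
  std::vector<std::vector<Record>> partitions(numReducers);
  for (const auto& rec : mapOutput) {
    // Hash-partition: records with the same key always land in the same bucket.
    std::size_t p = std::hash<int>{}(rec.first) % numReducers;
    partitions[p].push_back(rec);
  }
  // Sort within each partition so the reduce side can merge cheaply.
  for (auto& part : partitions) {
    std::sort(part.begin(), part.end(),
              [](const Record& a, const Record& b) { return a.first < b.first; });
  }
  return partitions;  // each partition is then compressed and written locally
}
```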
7. Spark Shuffle Bottlenecks
▪ Spark Shuffle (nWeight – a Graph Computation Workload)
▪ Context: an iterative graph-parallel algorithm, implemented with GraphX, that computes the association between two vertices 2-3 hops apart in the graph (e.g., recommending a video to my friends' friends)
[Chart: Spark worker node CPU utilization over elapsed time, stacked by average %idle, %steal, %iowait, %nice, %system, and %user.]
▪ Spark Shuffle (TeraSort)
▪ Context: TeraSort samples the input data and uses map/reduce to sort the data into a total order.
9. PMem - A New Memory Tier
▪ IDC reports indicate that data is growing very fast
▪ The global datasphere growth rate (CAGR) is 27%**
▪ But DRAM density scaling is slowing: from 4X/3yr to 2X/3yr to 2X/4yr*
▪ A new memory system will be needed to meet the data growth needs of new use cases
▪ PMem: a new category that sits between memory and storage
▪ Delivers a unique combination of affordable large capacity and support for data persistence
▪ Two operational modes
▪ Memory Mode: enlarges system memory size
▪ App Direct Mode: exposes two sets of independent memory resources to the OS and applications
**Source: Data Age 2025, sponsored by Seagate with data from IDC Global DataSphere, Nov 2018
*Source: "3D NAND Technology – Implications for Enterprise Storage Applications" by J. Yoon (IBM), 2015 Flash Memory Summit
10. Remote Persistent Memory Usage
High Availability / Data Replication
• Replicate data in local PM across the fabric and store it in remote PM
• For backup
Remote PM
• Extend on-node memory capacity (with or without persistence) in a disaggregated architecture to enlarge compute-node memory
• e.g., IMDB
Shared Remote PM
• PM holds SHARED data among distributed applications
• e.g., remote shuffle service, IMDB
https://www.snia.org/sites/default/files/PM-Summit/2018/presentations/05_PM_Summit_Grun_PM_%20Final_Post_CORRECTED.pdf
11. Access Remote Persistent Memory over RDMA
Remote Persistent Memory offers:
• Remote persistence, without losing any of the characteristics of memory
• PM is really fast, so it needs ultra-low-latency networking
• PM has very high bandwidth, so it needs an ultra-efficient protocol, transport offload, and high network bandwidth
• Remote access must not add significant latency
RDMA offers:
• Network switches & adapters that deliver predictability, fairness, and zero packet loss
• Zero-copy data movement between two systems with volatile DRAM, offloading data movement from the CPU to the NIC
• Low latency (< µsec)
• High bandwidth: 200 Gb/s and 400 Gb/s links, zero-copy, kernel bypass, HW-offloaded one-sided memory-to-remote-memory operations
• Reliable, credit-based data and control delivery implemented in hardware
• Network resiliency and scale-out
RPMem over Fabric adds complexity:
• To guarantee written data is durable on the target node, CPU caches need to be bypassed or flushed to get the data into the ADR power-fail-safe domain
• When writing to PMem, a synchronous acknowledgement is needed once writes have reached the durability domain, but current RDMA Write semantics do not provide such an acknowledgement
12. RPMem Durability
• RDMA
• Guarantees that data has been successfully received and accepted for execution by the remote HCA
• Doesn't guarantee the data has reached remote host memory – ADR is needed for that
• Doesn't guarantee the data is visible/durable for other consumers' accesses (other connections, the host processor)
• A small RDMA read can be used to force written data to PMem
• New transport operation – RDMA FLUSH
• A new RDMA command opcode
• Flushes all previous writes, or specific regions
• Provides a memory placement guarantee to the upper-layer software
• RDMA Flush forces previous RDMA Write data into the durability domain
• It makes PM operations with RDMA more efficient!
[Diagrams: two ladder charts between the application on Peer A, both peers' RNICs, and Peer B's memory controller and PMem. In the first, RDMA Writes arrive as non-allocating posted writes, and a trailing flushing RDMA Read (with its read ACK) forces the preceding writes to PMem. In the second, the new RDMA Flush command replaces the flushing read: the flush propagates through the RNICs and the memory controller down to PMem before being acknowledged.]
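The flushing-read pattern in the first diagram can be sketched with libibverbs. This is a hedged illustration, not Intel's code: it assumes an already-connected reliable-connection queue pair, a registered memory region, and an out-of-band exchange of the peer's remote address and rkey; setup and error handling are omitted.

```cpp
// Sketch of RDMA Write followed by a small "flushing" RDMA Read.
// Assumes qp is a connected RC queue pair and mr covers local_buf.
#include <infiniband/verbs.h>
#include <cstdint>

int write_then_flushing_read(ibv_qp* qp, ibv_mr* mr, void* local_buf,
                             uint32_t len, uint64_t remote_addr, uint32_t rkey) {
  ibv_sge sge{};
  sge.addr = reinterpret_cast<uintptr_t>(local_buf);
  sge.length = len;
  sge.lkey = mr->lkey;

  // 1. RDMA WRITE: completes when the remote HCA accepts the data, which
  //    does NOT guarantee it reached remote host memory or PMem.
  ibv_send_wr write_wr{}, *bad = nullptr;
  write_wr.opcode = IBV_WR_RDMA_WRITE;
  write_wr.sg_list = &sge;
  write_wr.num_sge = 1;
  write_wr.wr.rdma.remote_addr = remote_addr;
  write_wr.wr.rdma.rkey = rkey;
  if (ibv_post_send(qp, &write_wr, &bad)) return -1;

  // 2. Small RDMA READ of the just-written region: it acts as a barrier
  //    that pushes the preceding write toward the ADR durability domain.
  ibv_sge read_sge = sge;
  read_sge.length = 1;  // one byte is enough for the flush effect
  ibv_send_wr read_wr{};
  read_wr.opcode = IBV_WR_RDMA_READ;
  read_wr.send_flags = IBV_SEND_SIGNALED;  // poll the CQ for this one
  read_wr.sg_list = &read_sge;
  read_wr.num_sge = 1;
  read_wr.wr.rdma.remote_addr = remote_addr;
  read_wr.wr.rdma.rkey = rkey;
  if (ibv_post_send(qp, &read_wr, &bad)) return -1;

  // 3. Wait for the read's completion: earlier writes are ordered ahead
  //    of it, so its arrival implies they have left the NIC/PCIe path.
  ibv_wc wc{};
  while (ibv_poll_cq(qp->send_cq, 1, &wc) == 0) { /* spin */ }
  return (wc.status == IBV_WC_SUCCESS) ? 0 : -1;
}
```

The one-byte read completes only after the HCA has ordered the earlier writes ahead of it, which, combined with ADR (or non-allocating writes), supplies the durability guarantee that a plain RDMA Write completion lacks; RDMA Flush replaces this workaround with a first-class opcode.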
14. Re-cap: Remote Persistent Memory Extension for Spark shuffle Design
▪ 1. Serialize obj to off-heap memory
▪ 2. Write to the local shuffle dir
▪ 3. Read from the local shuffle dir
▪ 4. Send to the remote reader over TCP/IP
Ø Lots of context switches
Ø POSIX buffered read/write on the shuffle disk
Ø TCP/IP-based socket send for remote shuffle reads
[Diagram: in the vanilla design, the executor JVM in user space writes shuffle files into spark.local.dir on SSD/HDD through the kernel (step 2), reads them back (step 3), and sends them to remote readers over the network (step 4).]
1. Serialize obj to off-heap memory
2. Persist to PMEM
3. Read from remote PMEM through RDMA; PMEM is used as the RDMA memory buffer
Ø No context switches
Ø Efficient read/write on PMEM
Ø RDMA read for remote shuffle reads
[Diagram: in the PMoF design, the executor JVM's new Shuffle Writer serializes objects from the heap into an off-heap bytebuffer (step 1), persists them to PMEM through user-space PMEM drivers (step 2), and the new Shuffle Reader reads remote PMEM directly through the RDMA NIC (step 3), with no kernel involvement.]
Spark PMoF: https://github.com/intel-bigdata/spark-pmof
Strata-ca-2019: https://conferences.oreilly.com/strata/strata-ca-2019/public/schedule/detail/72992
15. Spark-PMoF End-to-End Time Evaluation – TeraSort Workload
§ Terasort
§ End-to-end time gain vs. vanilla Spark: 22.7x
§ 1.29x speedup over 4x NVMe
§ PMoF shortens remote read latency dramatically
§ Read-blocked time for HDD, NVMe & PMem (from the Spark UI): 8.3 min vs. 11 s vs. 7 ms
§ PMem provides higher write/read bandwidth per node than HDD & NVMe, and higher endurance
§ Decision support workload
§ Less I/O-intensive than TeraSort
§ 3.2x speedup in total execution time across the 99 queries
§ I/O-intensive workloads benefit more from the PMoF performance improvement
Performance results are based on testing as of 12/06/2019 and may not reflect all publicly available security updates. See configuration disclosure on slide for details. No product
can be absolutely secure. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks. Configurations refer to page 34
[Chart: execution time of the 99 queries, Spark-PMoF vs. vanilla Spark.]
[Chart: Spark 550 GB TeraSort end-to-end time in seconds, log scale, lower is better: terasort-hdd 12,277.2; terasort-nvme 695; terasort-pmof 540.5.]
16. Extending to fully Disaggregated Shuffle solution
▪ Remote persistent memory demonstrated good results, but what more is needed?
▪ Real production environments pose more challenges
▪ Disaggregated, diskless environments
▪ Scaling shuffle and compute independently
▪ CPU/memory imbalance issues
▪ Some jobs run for a long time; the stage-recompute cost is intolerable in case of shuffle failure
▪ Elastic deployment with compute and storage disaggregation requires an independent shuffle solution
▪ Decoupling shuffle I/O from specific network/storage hardware makes it possible to deliver a dedicated SLA for critical applications
▪ Fault tolerance in case of shuffle failure, with no need to recompute
▪ Offloading spill as well reduces compute-side memory requirements
▪ Balanced resource utilization
▪ Leverage state-of-the-art storage media
▪ To provide a high-performance, high-endurance storage backend
▪ This drives the intent to build an RPMem-based, fully disaggregated shuffle solution!
17. RPMP Architecture
▪ Remote Persistent Memory Pool for Spark (Spark RPMP): a new, fully disaggregated shuffle solution that leverages state-of-the-art hardware technologies, including persistent memory and RDMA. It comprises:
▪ A new pluggable shuffle manager
▪ A persistent-memory-based distributed storage system
▪ An RDMA-powered network library, and an innovative approach that uses persistent memory both as the shuffle media and as the RDMA memory region, reducing extra memory copies and context switches
▪ Features
▪ Provides allocate/free/read/write APIs on pooled PMem resources
▪ Data is replicated to multiple nodes for high availability
▪ Can be extended to other usage scenarios such as a PMem-based database, data store, or cache store
▪ Benefits
▪ Improved Spark scalability by disaggregating Spark shuffle from the compute nodes to a high-performance distributed storage
▪ Improved Spark shuffle performance with high-speed persistent memory and a low-latency RDMA network
▪ Improved reliability by providing a manageable, highly available shuffle service that supports shuffle data replication and fault tolerance
[Diagram: compute nodes running SQL, transactions, streaming, and machine learning workloads over DRAM connect through RNICs to the RPMP storage tier, where a proxy fronts DRAM data caches and PMem serving remote shuffle, S3, and K/V usages.]
18. Remote Persistent Memory Pool overview
[Diagram: Mapper1..MapperN and Reducer1..ReducerN executors, each running the PMoF Shuffle Manager with its Shuffle Writer and Shuffle Reader, issue RDMA reads and writes against the RPMP nodes.]
§ Shuffle cares most about write performance (latency/bandwidth)
§ A 100Gb NIC is needed; theoretically, 8x PMEM on a single node provides 10 GB+ of write bandwidth
[Diagram: RPMP node 1 and RPMP node 2, each consisting of an RPMP Proxy and an RPMP Core with network, controller, and storage layers, share a global memory address space and exchange heartbeat and replication traffic.]
§ The RPMP storage node is chosen by consistent hashing to avoid a single point of failure (see the sketch below)
§ A timely ActiveNodeMap is maintained via heartbeats
§ Data is replicated from the driver node to a worker node over RDMA
§ If the driver node goes down, the worker node remains writable and readable
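A minimal sketch of the consistent-hashing idea, assuming virtual nodes on a sorted ring; the class and method names are illustrative rather than the actual RPMP code, and std::hash stands in for the xxHash function referenced later in the deck.

```cpp
// Illustrative consistent-hash ring for picking an RPMP storage node.
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <string>

class ConsistentHashRing {
 public:
  // Place each node at several virtual points to smooth the key distribution.
  void addNode(const std::string& node, int vnodes = 64) {
    for (int i = 0; i < vnodes; ++i)
      ring_[hash_(node + "#" + std::to_string(i))] = node;
  }
  // Drop a node when the heartbeat-maintained ActiveNodeMap marks it dead;
  // only keys owned by its virtual points move, so no full reshuffle occurs.
  void removeNode(const std::string& node, int vnodes = 64) {
    for (int i = 0; i < vnodes; ++i)
      ring_.erase(hash_(node + "#" + std::to_string(i)));
  }
  // The first virtual point clockwise from the block's hash owns the block.
  // Assumes the ring is non-empty.
  std::string nodeFor(uint64_t blockId) const {
    auto it = ring_.lower_bound(hash_(std::to_string(blockId)));
    return (it == ring_.end()) ? ring_.begin()->second : it->second;  // wrap
  }
 private:
  std::hash<std::string> hash_;
  std::map<std::size_t, std::string> ring_;  // ring position -> node
};
```

A mapper would ask nodeFor(shuffleBlockId) for the primary node and, for replication, walk to the next distinct node on the ring.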
19. RPMP CORE architecture details
§ RPMP Client
§ The RPMP client provides transactional read/write/allocate/free and object put/get interfaces to users
§ Both C++ and Java APIs are provided
§ Data is then transferred via HPNL (RDMA) between the selected server nodes and the client
§ RPMP Server
§ The RPMP proxy maintains a unified ActiveNodeMap
§ The network layer is based on HPNL and provides RDMA data transfer
§ The controller layer is responsible for global address management, transaction processing, etc.
§ The storage layer is responsible for PMem management using the high-performance PMDK libraries
[Diagram: the RPMP server stack – a storage layer with a PmemAllocator over /dev/dax0.0-/dev/dax2.1 devices; a network layer with HPNL, encode/decode, buffer management, and checksumming; and a controller layer with a scheduler, transactions, global address management, and an accelerator – alongside the RPMP client stack, which exposes the tx_alloc/tx_free/tx_read/tx_write/put/get interface over the same HPNL network layer, plus a storage proxy.]
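To show how the interface above is meant to be used, here is a hypothetical round trip through tx_alloc/tx_write/tx_read/tx_free. The class below is a tiny in-memory stub standing in for the RDMA-backed client so the calling pattern compiles and runs; the names and signatures are assumptions, not the actual spark-pmof API.

```cpp
// Hypothetical round trip through RPMP-style client interfaces.
#include <cstdint>
#include <cstring>
#include <map>
#include <vector>

class RpmpClientStub {  // stand-in for the RDMA-backed RPMP client
 public:
  uint64_t tx_alloc(uint64_t size) {           // returns a global address
    uint64_t addr = next_;
    next_ += size;
    store_[addr].resize(size);
    return addr;
  }
  void tx_write(uint64_t addr, const void* buf, uint64_t n) {
    std::memcpy(store_[addr].data(), buf, n);  // real client: RDMA to pooled PMem
  }
  void tx_read(uint64_t addr, void* buf, uint64_t n) {
    std::memcpy(buf, store_[addr].data(), n);  // real client: one-sided RDMA read
  }
  void tx_free(uint64_t addr) { store_.erase(addr); }
 private:
  uint64_t next_ = 0x1000;
  std::map<uint64_t, std::vector<char>> store_;
};

int main() {
  RpmpClientStub client;
  const uint64_t kBlock = 4ull << 20;          // one 4 MB shuffle block
  std::vector<char> out(kBlock, 'x'), in(kBlock);
  uint64_t addr = client.tx_alloc(kBlock);     // reserve pooled space
  client.tx_write(addr, out.data(), kBlock);   // transactional write
  client.tx_read(addr, in.data(), kBlock);     // read it back
  client.tx_free(addr);                        // return the region to the pool
  return out == in ? 0 : 1;
}
```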
20. Spark RPMP optimization features
▪ Optimized RDMA communication
▪ Leverages HPNL as a high-performance, protocol-agnostic network messenger
▪ The server handles all write operations; clients implement read-only operations using one-sided RDMA reads
▪ Controller accelerator layer
▪ Partition Merge
▪ Aggregates small partitions into larger blocks to accelerate the reduce phase and cut the number of reducer connections
▪ Sort
▪ Sorts the shuffle data on the fly, so no sorting compute is needed in the reduce phase, reducing compute-node CPU utilization
▪ Provides controllable, fine-grained control over resource utilization when compute-node CPU resources are limited
▪ Storage
▪ A global address space accessible with memory-like APIs
▪ A transactional key-value store based on libpmemobj
▪ An allocator manages the PMem; the storage proxy directs requests to the different allocators
* https://github.com/Cyan4973/xxHash
21. RPMP Workflow
Write (single node):
1. Write data to a specific address.
2. The server issues an RDMA read (client DRAM -> server DRAM).
3. Flush (DRAM -> PMEM).
4. Request ACK.
Read (single node):
1. Read data from a specific address.
2. RDMA write (server PMEM -> client DRAM).
3. Request ACK.
Write (replicated via proxy):
1. Write data to a specific address.
2. RDMA read (client DRAM -> server DRAM; secondary node DRAM -> primary node DRAM).
3. Flush (DRAM -> PMEM).
4. Request ACK.
Read (replicated): same as the single-node read.
[Diagrams: client/server ladder charts for each flow, showing DRAM and PMem on the server(s) and, in the replicated case, the proxy plus primary and secondary servers.]
22. PMEM Based Shuffle Write optimization
§ Map
§ Provision the PMem namespace in advance
§ Leverage a circular buffer to build unidirectional channels for the RDMA primitives
§ Serialized data is written to an off-heap buffer; once it hits the threshold (4 MB by default), a block is created via libpmemobj on the PMem device with a memcopy (see the sketch after this slide's diagram)
§ Append-only writes; each block is written only once
§ No index file: the mapping info is stored in the PMem object metadata
§ libpmem-based, kernel bypass
§ Reduce
§ Reduce uses memcopy to read the data
§ Reduce reads PMem memory directly through RDMA
[Diagram: a map task's KV data in the JVM flows through the PMEM Shuffle Writer and JNI into PMDK (libpmemobj, C), which lays out Partition[0]..Partition[n] on the persistent memory device (devdax/fsdax mode); the RPMP client circular buffer sits in native DRAM, and N reduce tasks read the partitions.]
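A hedged sketch of the map-side write path just described, using PMDK's libpmemobj C API from C++: stage serialized records in a buffer, and once the 4 MB threshold is hit, allocate a persistent block and copy it in with a persisting memcpy. The pool path, layout name, sizes, and class are illustrative; the real writer is driven over JNI from the JVM and also records the partition-mapping metadata, which is omitted here.

```cpp
// Sketch of threshold-triggered block writes to PMem via libpmemobj.
#include <libpmemobj.h>
#include <cstring>
#include <stdexcept>
#include <vector>

constexpr size_t kBlockSize = 4u << 20;  // 4 MB spill threshold from the slide

class PmemBlockWriter {
 public:
  explicit PmemBlockWriter(const char* pool_path) {
    pop_ = pmemobj_create(pool_path, "shuffle", 256u << 20, 0666);
    if (!pop_) pop_ = pmemobj_open(pool_path, "shuffle");  // pool already exists
    if (!pop_) throw std::runtime_error(pmemobj_errormsg());
    staging_.reserve(kBlockSize);
  }
  ~PmemBlockWriter() {
    flush();  // persist any tail smaller than the threshold
    pmemobj_close(pop_);
  }
  // Append one serialized record; spill a block once the threshold is hit.
  void append(const void* data, size_t len) {
    const char* p = static_cast<const char*>(data);
    staging_.insert(staging_.end(), p, p + len);
    if (staging_.size() >= kBlockSize) flush();
  }

 private:
  void flush() {
    if (staging_.empty()) return;
    PMEMoid oid;
    // Allocate a persistent object sized to the staged data (append-only:
    // every block is written exactly once).
    if (pmemobj_zalloc(pop_, &oid, staging_.size(), /*type_num=*/1))
      throw std::runtime_error(pmemobj_errormsg());
    // Copy and persist in one call, entirely in user space (kernel bypass).
    pmemobj_memcpy_persist(pop_, pmemobj_direct(oid),
                           staging_.data(), staging_.size());
    staging_.clear();
  }
  PMEMobjpool* pop_;
  std::vector<char> staging_;
};
```

pmemobj_memcpy_persist copies and flushes without entering the kernel, which is the kernel-bypass property the slide calls out.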
24. Performance Evaluation
§ Configuration
§ 2 nodes: one as the RPMEM client and the other as the RPMEM server
§ 40 Gb RDMA NIC; 4x PMem modules on the RPMEM server
§ Tested the remote_allocate, remote_free, remote_write, and remote_read interfaces with a single client
§ Performance
§ Remote read maxes out the NIC bandwidth
Interface | Throughput | Note
allocate | 3.7 GB/s | Expect higher performance with more clients
remote_write | 2.9 GB/s | Expect higher performance with more clients
remote_read | 4.9 GB/s | Limited by the 40 Gb NIC
[Diagram: RPMEM client (node 1) connected to the RPMEM server (node 2) holding 4x PMem.]
[Diagram: two Spark nodes with HDFS on HDDs; node 2 additionally holds 2x PMem for shuffle.]
§ Configuration
§ 2 nodes
§ Baseline: shuffle on HDD
§ RPMem: 2x 128 GB PMem on node 2 for shuffle and external sort
§ Workload: TeraSort, 100 GB
§ Performance
§ 1.98x speedup
Execution time: vanilla Spark (100 GB) 416 s; RPMem shuffle 210 s
Performance results are based on testing as of 5/30/2020 and may not reflect all publicly available security updates. See configuration disclosure on slide for details. No product
can be absolutely secure. For more complete information about performance and benchmark results, visit www.intel.com/benchmarks. Configurations refer to page 34
26. Summary
• Spark shuffle poses many challenges in large-scale production environments
• Remote persistent memory extends PM usage modes to new scenarios, with RDMA being the most widely accepted technology for remote persistent memory access
• A remote persistent memory pool for Spark shuffle enables a fully disaggregated, high-performance, low-latency shuffle solution that accelerates Spark shuffle
▪ Improved Spark scalability by disaggregating Spark shuffle from the compute nodes to a high-performance distributed storage
▪ Improved Spark shuffle performance with high-speed persistent memory and a low-latency RDMA network
▪ Improved reliability by providing a manageable, highly available shuffle service that supports shuffle data replication and fault tolerance
28. Accelerate Your Data Analytics & AI Journey with Intel
[Slide: Intel's analytics/AI portfolio – optimized ML/DL libraries & tools (Intel Distribution for Python, Intel-optimized frameworks); optimized cloud platforms (Amazon Web Services, Google Cloud Platform, Microsoft Azure, Baidu Cloud & more); and hardware spanning CPU (multi-purpose analytics/AI foundation, high-performance in-memory analytics), GPU (AI, HPC, media & graphics; in development), FPGA (real-time & multi-use DL inference), data center DL inference (Goya) and DL training (Gaudi), and edge DL inference. Also shown: the Intel-optimized end-to-end data analytics & AI pipeline – discovery of possibilities & next steps; data setup, ingestion & cleaning; developing models using analytics/AI; deploying into production & iterating. See intel.com/AI, software.intel.com, intel.com/yourdataonintel.]
29. Intel OAP
https://github.com/Intel-bigdata/OAP/
End-to-end columnar data processing with Intel AVX support
[Diagram: Optimized Analytics Packages for Spark – the OAP Native SQL Engine plugin sits between Spark SQL/Catalyst and Apache Arrow, providing an Arrow data source, Arrow data processing, and columnar shuffle on Intel CPUs and other accelerators (FPGA, GPU, ...); the shuffle stack extends from local shuffle to remote shuffle to remote persistent memory shuffle, backed by the Remote Persistent Memory Pool.]
32. Legal Information: Benchmark and Performance Disclaimers
▪ Performance results are based on testing as of Feb. 2019 and may not reflect all publicly available security updates. See
configuration disclosure for details. No product can be absolutely secure.
▪ Software and workloads used in performance tests may have been optimized for performance only on Intel
microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems,
components, software, operations and functions. Any change to any of those factors may cause the results to vary. You
should consult other information and performance tests to assist you in fully evaluating your contemplated purchases,
including the performance of that product when combined with other products. For more information, see Performance
Benchmark Test Disclosure.
▪ Configurations: see performance benchmark test configurations.