The document summarizes a presentation given by Chris Fregly on Project Tungsten and optimizations in Apache Spark. It discusses techniques like using off-heap memory, minimizing cache misses, and saturating I/O to sort 100 terabytes of data in Spark. The presentation also covered a recap of the "100TB GraySort challenge" where custom data structures and algorithms were used to optimize sorting and shuffling of data.
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsCloudera, Inc.
The document discusses 5 common mistakes people make when writing Spark applications:
1) Not properly sizing executors for memory and cores.
2) Having shuffle blocks larger than 2GB which can cause jobs to fail.
3) Not addressing data skew which can cause joins and shuffles to be very slow.
4) Not properly managing the DAG to minimize shuffles and stages.
5) Classpath conflicts from mismatched dependencies causing errors.
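The first mistake, executor sizing, comes down to simple arithmetic. The sketch below is one rule-of-thumb calculation, not the talk's own code: the function name `size_executors` and every default (5 cores per executor, one core and 1 GB reserved per node for the OS and daemons, roughly 7% memory overhead, one executor slot left for the application master) are assumptions chosen for illustration.

```python
def size_executors(nodes, cores_per_node, mem_gb_per_node,
                   cores_per_executor=5, reserved_cores=1,
                   reserved_mem_gb=1, overhead_fraction=0.07):
    """Return a rough executor layout for a cluster (illustrative defaults)."""
    # Leave some cores on every node for the OS and cluster daemons.
    usable_cores = cores_per_node - reserved_cores
    executors_per_node = usable_cores // cores_per_executor
    if executors_per_node < 1:
        raise ValueError("not enough cores per node for one executor")

    # Keep one executor slot free for the driver / application master.
    total_executors = executors_per_node * nodes - 1

    # Split the remaining node memory evenly, then subtract off-heap overhead.
    mem_per_executor = (mem_gb_per_node - reserved_mem_gb) / executors_per_node
    heap_gb = int(mem_per_executor * (1 - overhead_fraction))

    return {
        "num_executors": total_executors,
        "executor_cores": cores_per_executor,
        "executor_memory_gb": heap_gb,
    }


plan = size_executors(nodes=6, cores_per_node=16, mem_gb_per_node=64)
print(plan)
```

For a hypothetical cluster of 6 nodes with 16 cores and 64 GB each, this yields 17 executors with 5 cores and 19 GB of heap apiece. Treat the numbers as a starting point to measure against, not a fixed recipe.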
A Deep Dive into Query Execution Engine of Spark SQLDatabricks
Spark SQL enables Spark to perform efficient and fault-tolerant relational query processing with analytics database technologies. The relational queries are compiled to the executable physical plans consisting of transformations and actions on RDDs with the generated Java code. The code is compiled to Java bytecode, executed at runtime by JVM and optimized by JIT to native machine code at runtime. This talk will take a deep dive into Spark SQL execution engine. The talk includes pipelined execution, whole-stage code generation, UDF execution, memory management, vectorized readers, lineage based RDD transformation and action.
This is the presentation I gave at JavaDay Kiev 2015 on the architecture of Apache Spark. It covers the memory model, the shuffle implementations, data frames, and other high-level topics, and can be used as an introduction to Apache Spark.
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaDatabricks
Apache Spark is a fast and flexible compute engine for a variety of diverse workloads. Optimizing performance for different applications often requires an understanding of Spark internals and can be challenging for Spark application developers. In this session, learn how Facebook tunes Spark to run large-scale workloads reliably and efficiently. The speakers will begin by explaining the various tools and techniques they use to discover performance bottlenecks in Spark jobs. Next, you’ll hear about important configuration parameters and their experiments tuning these parameters on large-scale production workloads. You’ll also learn about Facebook’s new efforts towards automatically tuning several important configurations based on the nature of the workload. The speakers will conclude by sharing their results with automatic tuning and future directions for the project.
Memory management is at the heart of any data-intensive system. Spark, in particular, must arbitrate memory allocation between two main use cases: buffering intermediate data for processing (execution) and caching user data (storage). This talk will take a deep dive through the memory management designs adopted in Spark since its inception and discuss their performance and usability implications for the end user.
Optimizing Spark jobs through a true understanding of Spark Core. Learn: What is a partition? What is the difference between read/shuffle/write partitions? How to increase parallelism and decrease output files? Where does shuffle data go between stages? What is the "right" size for your Spark partitions and files? Why does a job slow down with only a few tasks left and never finish? Why doesn't adding nodes decrease my compute time?
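The question of the "right" partition size can be turned into a back-of-the-envelope calculation. The sketch below is an illustration under stated assumptions, not the talk's method: the helper name `partition_count` and the 128 MiB target size are hypothetical, and rounding up to a multiple of the core count is just one way to keep the last wave of tasks full.

```python
import math


def partition_count(total_bytes, target_partition_bytes=128 * 1024**2,
                    total_cores=None):
    """Estimate a partition count from data size (128 MiB target is assumed)."""
    # At least one partition, otherwise enough to hit the target size.
    n = max(1, math.ceil(total_bytes / target_partition_bytes))
    if total_cores:
        # Round up to a multiple of the available cores so that the final
        # wave of tasks does not leave most cores idle.
        n = math.ceil(n / total_cores) * total_cores
    return n


ten_gib = 10 * 1024**3
print(partition_count(ten_gib))                  # by size only
print(partition_count(ten_gib, total_cores=32))  # aligned to 32 cores
```

With 10 GiB of input this gives 80 partitions by size alone, or 96 when aligned to 32 cores.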
The document discusses tuning Spark parameters to optimize performance. It describes how to control Spark's resource usage through parameters like num-executors, executor-cores, and executor-memory. Advanced parameters like spark.shuffle.memoryFraction and spark.reducer.maxSizeInFlight are also covered. Dynamic allocation allows scaling resources up and down based on workload. Tips provided include tuning memory usage, choosing serialization and storage levels, setting parallelism, and avoiding operations like groupByKey. An example recommends tuning the collaborative filtering algorithm in the RW project, reducing runtime from 27 minutes to under 7 minutes.
The document discusses intra-cluster replication in Apache Kafka, including its architecture, in which partitions are replicated across brokers for high availability. Kafka uses a leader and in-sync replicas approach to provide strongly consistent replication while tolerating failures. Performance considerations in Kafka replication include latency and durability tradeoffs for producers and optimizing throughput for consumers.
Apache Spark presentation at HasGeek FifthElephant
https://fifthelephant.talkfunnel.com/2015/15-processing-large-data-with-apache-spark
Covering Big Data Overview, Spark Overview, Spark Internals and its supported libraries
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkBo Yang
The slides explain how shuffle works in Spark and help people understand more details about Spark internals. They show how the major classes are implemented, including: ShuffleManager (SortShuffleManager), ShuffleWriter (SortShuffleWriter, BypassMergeSortShuffleWriter, UnsafeShuffleWriter), ShuffleReader (BlockStoreShuffleReader).
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...Databricks
This document discusses best practices for optimizing Apache Spark applications. It covers techniques for speeding up file loading, optimizing file storage and layout, identifying bottlenecks in queries, dealing with many partitions, using datasource tables, managing schema inference, file types and compression, partitioning and bucketing files, managing shuffle partitions with adaptive execution, optimizing unions, using the cost-based optimizer, and leveraging the data skipping index. The presentation aims to help Spark developers apply these techniques to improve performance.
Building a SIMD Supported Vectorized Native Engine for Spark SQLDatabricks
Spark SQL works very well with structured row-based data. Vectorized readers and writers for Parquet/ORC can make I/O much faster. It also uses WholeStageCodeGen to improve performance through JIT-compiled Java code. However, the Java JIT usually does not make good use of the latest SIMD instructions in complicated queries. Apache Arrow provides a columnar in-memory layout and SIMD-optimized kernels, as well as Gandiva, an LLVM-based SQL engine. These native libraries can accelerate Spark SQL by reducing CPU usage for both I/O and execution.
Fine Tuning and Enhancing Performance of Apache Spark JobsDatabricks
Apache Spark's defaults provide decent performance for large data sets, but tuning parameters to match the available resources and the job can yield significant performance gains.
Best Practice of Compression/Decompression Codes in Apache Spark with Sophia...Databricks
Nowadays, people are creating, sharing and storing data at a faster pace than ever before, and effective data compression/decompression could significantly reduce the cost of data usage. Apache Spark is a general distributed computing engine for big data analytics. It stores and shuffles large amounts of data across the cluster at runtime, so the compression/decompression codecs can impact end-to-end application performance in many ways.
However, there’s a trade-off between the storage size and compression/decompression throughput (CPU computation). Balancing compression speed and ratio is a very interesting topic, particularly while both software algorithms and the CPU instruction set keep evolving. Apache Spark provides a very flexible compression codec interface with default implementations like GZip, Snappy, LZ4, ZSTD etc., and the Intel Big Data Technologies team has also implemented more codecs for Apache Spark based on the latest Intel platform, like ISA-L (igzip), LZ4-IPP, Zlib-IPP and ZSTD. In this session, we’d like to compare the characteristics of those algorithms and implementations by running different micro workloads as well as end-to-end workloads on different generations of Intel x86 platforms and disks.
The session is intended as a best-practice guide for big data software engineers choosing the proper compression/decompression codecs for their applications, and we will also present methodologies for measuring and tuning the performance bottlenecks of typical Apache Spark workloads.
This document discusses Spark shuffle, which is an expensive operation that involves data partitioning, serialization/deserialization, compression, and disk I/O. It provides an overview of how shuffle works in Spark and the history of optimizations like sort-based shuffle and an external shuffle service. Key concepts discussed include shuffle writers, readers, and the pluggable block transfer service that handles data transfer. The document also covers shuffle-related configuration options and potential future work.
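The partitioning step of a shuffle can be sketched in a few lines. This is a simplified, single-process illustration rather than Spark's implementation: the names `partition_for` and `shuffle_write` are hypothetical, and CRC32 stands in for the key hash only because it is stable across Python runs.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    """Map a string key to a reducer partition (CRC32 is a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shuffle_write(records, num_partitions):
    """Group (key, value) records into per-partition buckets, as a map task would."""
    buckets = defaultdict(list)
    for key, value in records:
        buckets[partition_for(key, num_partitions)].append((key, value))
    return dict(buckets)


records = [("spark", 1), ("kafka", 1), ("spark", 1), ("parquet", 1)]
for partition, rows in sorted(shuffle_write(records, num_partitions=3).items()):
    print(partition, rows)
```

Because the partition depends only on the key, every record for a given key lands in the same bucket, which is what lets a reducer see all values for that key. Serialization, compression, spilling, and network transfer are left out entirely.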
Spark Streaming allows processing of live data streams in Spark. It integrates streaming data and batch processing within the same Spark application. Spark SQL provides a programming abstraction called DataFrames and can be used to query structured data in Spark. Structured Streaming in Spark 2.0 provides a high-level API for building streaming applications on top of Spark SQL's engine. It allows running the same queries on streaming data as on batch data and unifies streaming, interactive, and batch processing.
Apache Spark is an in-memory data processing solution that can work with existing data sources like HDFS and can make use of your existing computation infrastructure like YARN or Mesos. This talk will give a basic introduction to Apache Spark and its various components, such as MLlib, Shark, and GraphX, with a few examples.
The Parquet Format and Performance Optimization OpportunitiesDatabricks
The Parquet format is one of the most widely used columnar storage formats in the Spark ecosystem. Given that I/O is expensive and that the storage layer is the entry point for any query execution, understanding the intricacies of your storage format is important for optimizing your workloads.
As an introduction, we will provide context around the format, covering the basics of structured data formats and the underlying physical data storage model alternatives (row-wise, columnar and hybrid). Given this context, we will dive deeper into specifics of the Parquet format: representation on disk, physical data organization (row-groups, column-chunks and pages) and encoding schemes. Now equipped with sufficient background knowledge, we will discuss several performance optimization opportunities with respect to the format: dictionary encoding, page compression, predicate pushdown (min/max skipping), dictionary filtering and partitioning schemes. We will learn how to combat the evil that is ‘many small files’, and will discuss the open-source Delta Lake format in relation to this and Parquet in general.
This talk serves both as an approachable refresher on columnar storage as well as a guide on how to leverage the Parquet format for speeding up analytical workloads in Spark using tangible tips and tricks.
Parquet performance tuning: the missing guideRyan Blue
Parquet performance tuning focuses on optimizing Parquet reads by leveraging columnar organization, encoding, and filtering techniques. Statistics and dictionary filtering can eliminate unnecessary data reads by filtering at the row group and page levels. However, these optimizations require columns to be sorted and fully dictionary encoded within files. Increasing dictionary size thresholds and decreasing row group sizes can help avoid dictionary encoding fallback and improve filtering effectiveness. Future work may include new encodings, compression algorithms like Brotli, and page-level filtering in the Parquet format.
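The effect of sorting on statistics-based filtering can be shown with a toy model. The sketch below is not Parquet code: `RowGroup`, `build_row_groups`, and `scan_equals` are hypothetical names, and a "row group" here is just a Python list carrying its own min/max.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    min_value: int
    max_value: int
    rows: list


def build_row_groups(values, group_size):
    """Chunk values into row groups and record min/max statistics for each."""
    groups = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        groups.append(RowGroup(min(chunk), max(chunk), chunk))
    return groups


def scan_equals(row_groups, target):
    """Return matching rows and how many row groups actually had to be read."""
    matches, groups_read = [], 0
    for group in row_groups:
        # Skip the whole group when the statistics rule out a match.
        if target < group.min_value or target > group.max_value:
            continue
        groups_read += 1
        matches.extend(v for v in group.rows if v == target)
    return matches, groups_read


values = [5, 1, 9, 3, 7, 2, 8, 4, 6, 0]
print(scan_equals(build_row_groups(values, 5), 4))          # unsorted input
print(scan_equals(build_row_groups(sorted(values), 5), 4))  # sorted input
```

On the unsorted input both row groups have wide, overlapping ranges and both are read. After sorting, only one group can contain the value, so the other is skipped.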
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Databricks
Structured Streaming provides stateful stream processing capabilities in Spark SQL through built-in operations like aggregations and joins as well as user-defined stateful transformations. It manages state automatically for the built-in operations and uses watermarking to limit state size by dropping old data. For arbitrary stateful logic, mapGroupsWithState requires the user to manage state explicitly.
The document compares on-heap and off-heap caching options. It discusses heap memory usage in the JVM and alternatives like off-heap memory using memory mapped files, ByteBuffers, and Unsafe. Popular off-heap caches like Chronicle, Hazelcast, and Redis are presented along with comparisons of their features, performance, and garbage collection impact. The document aims to help developers choose the most suitable cache for their application needs.
London Spark Meetup Project Tungsten Oct 12 2015Chris Fregly
Building on a previous talk about how Spark beat Hadoop @ 100TB Daytona GraySort, we present low-level details of Project Tungsten which includes many CPU and Memory optimizations.
Project Tungsten: Bringing Spark Closer to Bare MetalDatabricks
As part of the Tungsten project, Spark has started an ongoing effort to dramatically improve performance to bring the execution closer to bare metal. In this talk, we’ll go over the progress that has been made so far and the areas we’re looking to invest in next. This talk will discuss the architectural changes that are being made as well as some discussion into how Spark users can expect their application to benefit from this effort. The focus of the talk will be on Spark SQL but the improvements are general and applicable to multiple Spark technologies.
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopDatabricks
Tech-talk at Bay Area Apache Spark Meetup.
Apache Spark 2.0 will ship with the second generation Tungsten engine. Building upon ideas from modern compilers and MPP databases, and applying them to data processing queries, we have started an ongoing effort to dramatically improve Spark’s performance and bring execution closer to bare metal. In this talk, we’ll take a deep dive into Apache Spark 2.0’s execution engine and discuss a number of architectural changes around whole-stage code generation/vectorization that have been instrumental in improving CPU efficiency and gaining performance.
Spark Summit EU talk by Sameer AgarwalSpark Summit
This document discusses Project Tungsten, which aims to substantially improve the memory and CPU efficiency of Spark. It describes how Spark has optimized IO but the CPU has become the bottleneck. Project Tungsten focuses on improving execution performance through techniques like explicit memory management, code generation, cache-aware algorithms, whole-stage code generation, and columnar in-memory data formats. It shows how these techniques provide significant performance improvements, such as 5-30x speedups on operators and 10-100x speedups on radix sort. Future work includes cost-based optimization and improving performance on many-core machines.
Deep Dive into Project Tungsten: Bringing Spark Closer to Bare Metal-(Josh Ro...Spark Summit
This document summarizes Project Tungsten, an effort by Databricks to substantially improve the memory and CPU efficiency of Spark applications. It discusses how Tungsten optimizes memory and CPU usage through techniques like explicit memory management, cache-aware algorithms, and code generation. It provides examples of how these optimizations improve performance for aggregation queries and record sorting. The roadmap outlines expanding Tungsten's optimizations in Spark 1.4 through 1.6 to support more workloads and achieve end-to-end processing using binary data representations.
The document is a slide deck presentation given by Chris Fregly on Spark and related big data technologies. It discusses techniques for improving performance through mechanical sympathy with hardware, including optimizing data layout for CPU cache locality and using lock-free thread synchronization. It also covers Spark SQL query optimization and the Spark core, with a live demo of an example "After Dark" application that uses various big data tools.
This document appears to be a slide deck presentation about optimizing Apache Spark performance. Some key points discussed include techniques for improving CPU cache locality through data structure design, avoiding unnecessary object creation, and using lock-free thread synchronization. Examples demonstrated include optimizing matrix multiplication and incrementing shared counters across threads. Performance improvements of 30-65% were shown from these optimizations.
This document appears to be a slide deck presentation about Apache Spark. It discusses Spark's core concepts like tuning and mechanical sympathy. It provides examples of how to optimize Spark performance by improving CPU cache line affinity, such as through a cache-friendly matrix multiplication algorithm. The presentation also covers Spark SQL and query optimization using the Catalyst optimizer. It promotes the speaker's Advanced Apache Spark Meetup group and upcoming speaking events.
This document summarizes a presentation about Apache Spark. It discusses Spark's capabilities for streaming data, machine learning, and SQL queries. It also covers topics like Spark performance tuning, integration with other technologies like Kafka and Cassandra, and techniques for optimizing performance like leveraging CPU caches and avoiding thread context switches. The presentation includes live demos of Spark processing and analyzing user-submitted data.
This document summarizes a presentation about optimizing Apache Spark for performance. It discusses techniques like leveraging CPU caches, reducing random memory access, avoiding thread context switches, and using immutable data structures to minimize locking. It also promotes the concept of "mechanical sympathy" - designing software and hardware to work together efficiently. The presentation contains demos showing the impact of these optimizations on sorting and matrix multiplication performance.
1. The document discusses techniques for improving Apache Spark performance through mechanical sympathy, which means optimizing for hardware performance by considering factors like CPU cache usage and minimizing random memory access.
2. It provides examples of how to improve sorting, matrix multiplication, and thread synchronization by making them more cache-friendly and reducing cache misses and context switches.
3. The speaker demonstrates performance improvements from these techniques using Linux perf and flame graph profiling tools. Optimizations like Project Tungsten that customize Spark for the hardware are also discussed.
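The cache-friendly matrix multiplication mentioned in these summaries usually relies on transposing the second matrix so that the inner loop walks both operands row by row. The sketch below shows that access pattern in plain Python. It is an illustration only: the function names are hypothetical, and Python lists of boxed numbers will not reproduce the speedups measured in the talk, which depend on contiguous memory.

```python
def transpose(matrix):
    """Return the transpose of a list-of-rows matrix."""
    return [list(column) for column in zip(*matrix)]


def matmul_naive(a, b):
    """Textbook multiply: the inner loop strides down a column of b."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[x] * b[x][j] for x in range(inner)) for j in range(cols)]
            for row in a]


def matmul_transposed(a, b):
    """Transpose b once so the inner loop reads both operands sequentially."""
    b_t = transpose(b)
    return [[sum(x * y for x, y in zip(row, column)) for column in b_t]
            for row in a]


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(matmul_naive(a, b))
print(matmul_transposed(a, b))
```

Both functions return the same product. The difference is only in the order in which elements of `b` are touched.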
This document summarizes a presentation about optimizing Apache Spark for performance. Some key points discussed include:
- Techniques for optimizing Spark sorting and matrix multiplication to be more CPU cache friendly by using sequential access patterns and minimizing random memory access
- Different approaches for synchronizing concurrent updates to shared counters from multiple threads, including lock-based, immutable, and lock-free atomic implementations
- The Project Tungsten efforts to customize Spark's data structures and algorithms to operate directly on compressed byte arrays to reduce garbage collection and maximize CPU cache usage
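The shared-counter comparison can be approximated in Python with two variants. This is a sketch under stated assumptions: the first function is a straightforward lock-based counter, while the second swaps in a different technique, per-thread partial counts merged at the end, because the Python standard library has no compare-and-swap primitive for a true lock-free atomic counter. Both function names are hypothetical.

```python
import threading


def count_with_lock(num_threads, increments):
    """Every increment takes a shared lock, so threads contend on each update."""
    total = 0
    lock = threading.Lock()

    def worker():
        nonlocal total
        for _ in range(increments):
            with lock:
                total += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


def count_with_partials(num_threads, increments):
    """Each thread counts privately; results are merged once at the end."""
    partials = [0] * num_threads

    def worker(slot):
        local = 0
        for _ in range(increments):
            local += 1
        partials[slot] = local  # one write per thread, no shared hot spot

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partials)


print(count_with_lock(4, 1000), count_with_partials(4, 1000))
```

Both variants produce the same total. The second avoids contention by not sharing state during the hot loop, which is the same idea that makes composable, mergeable aggregates scale.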
Dallas DFW Data Science Meetup Jan 21 2016Chris Fregly
The document discusses approximate and probabilistic algorithms and data structures for big data. It summarizes scaling techniques like parallelism and composability. Common algorithms covered include Bloom filters, Count-Min sketches, HyperLogLog, and Locality Sensitive Hashing. These algorithms trade exact results for speed and memory efficiency using techniques like hashing elements multiple times. The document also discusses when approximation is appropriate versus not. Example use cases and visual explanations of the algorithms are provided.
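A Bloom filter is the simplest of the structures listed above and shows the "hash each element several times" idea directly. The sketch below is a minimal illustration, not a production implementation: the class and method names, the default sizes, and the use of seeded SHA-256 as the hash family are all assumptions.

```python
import hashlib


class BloomFilter:
    """Set membership with possible false positives but no false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


seen = BloomFilter()
seen.add("spark")
seen.add("kafka")
print(seen.might_contain("spark"), seen.might_contain("kafka"))
```

Memory use is fixed by `num_bits` regardless of how many items are added. The cost is a false-positive rate that grows as the filter fills.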
This document summarizes a presentation given by Chris Fregly on recommendations and similarity algorithms in Apache Spark. The presentation covered scaling techniques like parallelism and composability in Spark. It also discussed different similarity measures like Euclidean, cosine, and Jaccard similarity. Recommendation algorithms mentioned include clustering users based on behavior and item metadata to find similar users and items. The document provides an overview of the key topics presented.
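The three similarity measures named here have short, standard definitions. The sketch below writes them out in plain Python. The function names are chosen for illustration, and a real Spark job would compute these over distributed vectors rather than local lists.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; ignores magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(a & b) / len(a | b)


print(euclidean_distance([0, 0], [3, 4]))
print(cosine_similarity([1, 0], [0, 1]))
print(jaccard_similarity({"a", "b", "c"}, {"b", "c", "d"}))
```

Euclidean distance suits dense numeric features, cosine similarity suits ratings or term vectors where direction matters more than length, and Jaccard similarity suits sets such as "items a user has viewed".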
Spark Summit East NYC Meetup 02-16-2016 Chris Fregly
This document summarizes a presentation on recommendations and machine learning using Apache Spark. It discusses scaling techniques like parallelism and composability. It also covers similarity measures, recommendation algorithms, and common approximations used like sampling. Common algorithms and data structures are explained, like similarity graphs and PageRank. Finally, it provides an overview of Netflix's recommendations system and data pipeline.
Advanced Analytics and Recommendations with Apache Spark - Spark Maryland/DC ...Chris Fregly
Title
Real-time, Advanced Analytics and Recommendations using Machine Learning, Graph Processing, Natural Language Processing, and Approximations with Apache Spark, Stanford CoreNLP, and Twitter Algebird
BONUS: Netflix Recommendations: Then and Now
Agenda
Intro
Live, Interactive Recommendations Demo
Spark ML, GraphX, Streaming, Kafka, Cassandra, Docker
Types of Similarity
Euclidean vs. Non-Euclidean Similarity
User-to-User Similarity
Content-based, Item-to-Item Similarity (Amazon)
Collaborative-based, User-to-Item Similarity (Netflix)
Graph-based, Item-to-Item Similarity Pathway (Spotify)
Similarity Approximations at Scale
Twitter Algebird
MinHash and Bucketing
Locality Sensitive Hashing (LSH)
BONUS: Netflix Recommendations: From Ratings to Real-Time
DVD-Ratings-based $1M Netflix Prize (2009)
Streaming-based "Trending Now" (2016)
Wrap Up
Q & A
Bio
Chris Fregly is a Principal Data Solutions Engineer for the newly-formed IBM Spark Technology Center, an Apache Spark Contributor, and a Netflix Open Source Committer.
Chris is also the founder of the global Advanced Apache Spark Meetup and author of the upcoming book, Advanced Spark @ advancedspark.com.
Previously, Chris was a Data Solutions Engineer at Databricks and a Streaming Data Engineer at Netflix.
Related Links
https://github.com/fluxcapacitor/pipeline/wiki
http://cdn.oreillystatic.com/en/assets/1/event/105/Algebra%20for%20Scalable%20Analytics%20Presentation.pdf
http://static.echonest.com/BoilTheFrog/
http://www.netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf
http://blog.echen.me/2011/10/24/winning-the-netflix-prize-a-summary/
http://www.cc.gatech.edu/~zha/CSE8801/CF/kdd-fp074-koren.pdf
5th Athens Big Data Meetup - PipelineIO Workshop - Real-Time Training and Dep...Athens Big Data
Title: Real-Time Training and Deploying Spark ML Recommendations With Kafka and NetflixOSS
Speaker: Chris Fregly (https://linkedin.com/in/cfregly/)
Date: Monday, October 17, 2016
Event: https://meetup.com/Athens-Big-Data/events/234546355/
DC Spark Users Group March 15 2016 - Spark and Netflix RecommendationsChris Fregly
This document discusses various techniques for providing recommendations, including non-personalized and personalized recommendations. Non-personalized recommendations include using highest rated actors, top K aggregations, social graphs, or PageRank on likes/dislikes to address the cold start problem for new users. Personalized recommendations techniques discussed include user-to-user clustering based on similarity of viewing patterns or ratings. The document also covers feature engineering techniques such as normalization, dimensionality reduction, and one-hot encoding for modeling recommendations.
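Two of the feature engineering steps mentioned, normalization and one-hot encoding, are easy to show on small inputs. The sketch below is a plain-Python illustration with hypothetical function names. It uses min-max scaling as one common form of normalization, which may differ from the variant used in the talk.

```python
def min_max_normalize(values):
    """Rescale numeric values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]


def one_hot_encode(categories):
    """Turn categorical labels into 0/1 indicator vectors."""
    vocabulary = sorted(set(categories))
    index = {label: i for i, label in enumerate(vocabulary)}
    encoded = [[1 if index[label] == i else 0 for i in range(len(vocabulary))]
               for label in categories]
    return vocabulary, encoded


print(min_max_normalize([1, 3, 5]))
print(one_hot_encode(["drama", "comedy", "drama"]))
```

Normalization keeps features with large numeric ranges from dominating distance-based similarity, and one-hot encoding lets categorical attributes such as genre take part in the same vector math.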
Title:
Real-time, Advanced Analytics and Recommendations using Machine Learning, Natural Language Processing, Graph Processing, and Approximations with Apache Spark, Stanford CoreNLP, and Twitter Algebird
Agenda
Intro
Live, Interactive Recommendations Demo
Spark ML, GraphX, Streaming, Kafka, Cassandra, Docker
Types of Similarity
Euclidean vs. Non-Euclidean Similarity
User-to-User Similarity
Content-based, Item-to-Item Similarity (Amazon)
Collaborative-based, User-to-Item Similarity (Netflix)
Graph-based, Item-to-Item Similarity Pathway (Spotify)
Similarity Approximations at Scale
Twitter Algebird
MinHash and Bucketing
Locality Sensitive Hashing (LSH)
Netflix Recommendations: From Ratings to Real-Time
DVD-Ratings-based $1M Netflix Prize (2009)
Streaming-based "Trending Now" (2016)
Wrap Up
Q & A
*Bio*
Chris Fregly is a Principal Data Solutions Engineer for the newly-formed IBM Spark Technology Center, an Apache Spark Contributor, and a Netflix Open Source Committer. Chris is also the founder of the global Advanced Apache Spark Meetup and author of the upcoming book, Advanced Spark @ advancedspark.com. Previously, Chris was a Data Solutions Engineer at Databricks and a Streaming Data Engineer at Netflix.
*Related Links*
https://github.com/fluxcapacitor/pipeline/wiki
http://cdn.oreillystatic.com/en/assets/1/event/105/Algebra%20for%20Scalable%20Analytics%20Presentation.pdf
http://static.echonest.com/BoilTheFrog/
http://www.netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf
http://blog.echen.me/2011/10/24/winning-the-netflix-prize-a-summary/
http://www.cc.gatech.edu/~zha/CSE8801/CF/kdd-fp074-koren.pdf
Chicago Spark Meetup 03 01 2016 - Spark and RecommendationsChris Fregly
The document is a presentation about Apache Spark and recommendations. It discusses scaling with parallelism and composability, different types of similarity metrics like Euclidean, cosine, and Jaccard, feature engineering, non-personalized recommendations for cold starts, and personalized recommendations using clustering of users and items. It also covers approximating similarity calculations and common machine learning libraries and tools.
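One common way to approximate Jaccard similarity at scale is MinHash, which replaces each set with a short signature. The sketch below is a minimal illustration and not the presentation's code: the function names, the 128-hash default, and the seeded SHA-256 hash family are assumptions.

```python
import hashlib


def _seeded_hash(seed, token):
    """One member of a hash family, selected by seed."""
    digest = hashlib.sha256(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(tokens, num_hashes=128):
    """Keep, for each hash function, the minimum hash over the set."""
    tokens = set(tokens)
    if not tokens:
        raise ValueError("cannot build a signature for an empty set")
    return [min(_seeded_hash(seed, t) for t in tokens)
            for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


a = {"spark", "kafka", "cassandra", "docker"}
b = {"spark", "kafka", "redis", "docker"}
print(estimate_jaccard(minhash_signature(a), minhash_signature(b)))
```

The true Jaccard similarity of the two example sets is 3/5. The estimate varies around that value and tightens as `num_hashes` grows. Signatures have a fixed size, so comparing them costs the same however large the original sets are.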
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...Chris Fregly
* Title *
Spark After Dark 1.5: Deep Dive Into Latest Perf and Scale Improvements in Spark Ecosystem
* Abstract *
Combining the most popular and technically-deep material from his wildly popular Advanced Apache Spark Meetup, Chris Fregly will provide code-level deep dives into the latest performance and scalability advancements within the Apache Spark Ecosystem by exploring the following:
1) Building a Scalable and Performant Spark SQL/DataFrames Data Source Connector such as Spark-CSV, Spark-Cassandra, Spark-ElasticSearch, and Spark-Redshift
2) Speeding Up Spark SQL Queries using Partition Pruning and Predicate Pushdowns with CSV, JSON, Parquet, Avro, and ORC
3) Tuning Spark Streaming Performance and Fault Tolerance with KafkaRDD and KinesisRDD
4) Maintaining Stability during High Scale Streaming Ingestion using Approximations and Probabilistic Data Structures from Spark, Redis, and Twitter's Algebird
5) Building Effective Machine Learning Models using Feature Engineering, Dimension Reduction, and Natural Language Processing with MLlib/GraphX, ML Pipelines, DIMSUM, Locality Sensitive Hashing, and Stanford's CoreNLP
6) Tuning Core Spark Performance by Acknowledging Mechanical Sympathy for the Physical Limitations of OS and Hardware Resources such as CPU, Memory, Network, and Disk with Project Tungsten, Asynchronous Netty, and Linux epoll
* Demos *
This talk features many interesting and audience-interactive demos - as well as code-level deep dives into many of the projects listed above.
All demo code is available on Github at the following link: https://github.com/fluxcapacitor/pipeline/wiki
In addition, the entire demo environment has been Dockerized and made available for download on Docker Hub at the following link: https://hub.docker.com/r/fluxcapacitor/pipeline/
* Speaker Bio *
Chris Fregly is a Principal Data Solutions Engineer for the newly-formed IBM Spark Technology Center, an Apache Spark Contributor, a Netflix Open Source Committer, as well as the Organizer of the global Advanced Apache Spark Meetup and Author of the Upcoming Book, Advanced Spark.
Previously, Chris was a Data Solutions Engineer at Databricks and a Streaming Data Engineer at Netflix.
When Chris isn’t contributing to Spark and other open source projects, he’s creating book chapters, slides, and demos to share knowledge with his peers at meetups and conferences throughout the world.
Brussels Spark Meetup Oct 30, 2015: Spark After Dark 1.5: Real-time, Advanc...Chris Fregly
Combining the most popular and technically-deep material from his wildly popular Advanced Apache Spark Meetup, Chris Fregly will provide code-level deep dives into the latest performance and scalability advancements within the Apache Spark Ecosystem by exploring the following:
1) Building a Scalable and Performant Spark SQL/DataFrames Data Source Connector such as Spark-CSV, Spark-Cassandra, Spark-ElasticSearch, and Spark-Redshift
2) Speeding Up Spark SQL Queries using Partition Pruning and Predicate Pushdowns with CSV, JSON, Parquet, Avro, and ORC
3) Tuning Spark Streaming Performance and Fault Tolerance with KafkaRDD and KinesisRDD
4) Maintaining Stability during High Scale Streaming Ingestion using Approximations and Probabilistic Data Structures from Spark, Redis, and Twitter's Algebird
5) Building Effective Machine Learning Models using Feature Engineering, Dimension Reduction, and Natural Language Processing with MLlib/GraphX, ML Pipelines, DIMSUM, Locality Sensitive Hashing, and Stanford's CoreNLP
6) Tuning Core Spark Performance by Acknowledging Mechanical Sympathy for the Physical Limitations of OS and Hardware Resources such as CPU, Memory, Network, and Disk with Project Tungsten, Asynchronous Netty, and Linux epoll
* Demos *
This talk features many interesting and audience-interactive demos - as well as code-level deep dives into many of the projects listed above.
All demo code is available on Github at the following link: https://github.com/fluxcapacitor/pipeline/wiki
In addition, the entire demo environment has been Dockerized and made available for download on Docker Hub at the following link: https://hub.docker.com/r/fluxcapacitor/pipeline/
* Speaker Bio *
Chris Fregly is a Principal Data Solutions Engineer for the newly-formed IBM Spark Technology Center, an Apache Spark Contributor, a Netflix Open Source Committer, as well as the Organizer of the global Advanced Apache Spark Meetup and Author of the Upcoming Book, Advanced Spark.
Previously, Chris was a Data Solutions Engineer at Databricks and a Streaming Data Engineer at Netflix.
When Chris isn’t contributing to Spark and other open source projects, he’s creating book chapters, slides, and demos to share knowledge with his peers at meetups and conferences throughout the world.
Zurich, Berlin, Vienna Spark and Big Data Meetup Nov 02 2015Chris Fregly
This document discusses optimizations made in Apache Spark to improve performance for large-scale sorting and shuffling of data. Some key optimizations discussed include improving data layout and algorithms to maximize CPU cache locality, saturating network and disk I/O, and implementing more efficient shuffle algorithms and data structures like sort-based shuffling and optimized hash maps. These optimizations helped Spark achieve world-record performance in the 2014 Daytona GraySort benchmark by sorting 100TB of data in 23 minutes.
Similar to Advanced Apache Spark Meetup Project Tungsten Nov 12 2015 (20)
AWS reInvent 2022 reCap AI/ML and DataChris Fregly
This document discusses Amazon Web Services (AWS) products and services for building end-to-end machine learning and data strategies. It covers topics such as ML infrastructure, governance, data preparation, model training, deployment, and education. Specific services mentioned include Amazon SageMaker, AWS Lake Formation, Amazon Redshift, Amazon EMR, AWS Glue, and AWS services for hardware acceleration like AWS Trainium and AWS Graviton.
Pandas on AWS - Let me count the ways.pdf (Chris Fregly)
Chris Fregly (Principal Solution Architect, AI and machine learning at AWS) will give a brief presentation on the various ways to run Pandas, Modin, and Ray at scale on AWS. He will then answer questions from the audience and from the moderator, Alejandro Herrera at Ponder.
Chris Fregly is a Principal Solution Architect for AI and Machine Learning at Amazon Web Services (AWS) based in San Francisco, California. He is the organizer of the Global Data Science on AWS meetup. He is co-author of the O'Reilly Book, "Data Science on AWS."
Related Links
O'Reilly Book: https://www.amazon.com/dp/1492079391/
Website: https://datascienceonaws.com
Meetup: https://meetup.datascienceonaws.com
GitHub Repo: https://github.com/data-science-on-aws/
YouTube: https://youtube.datascienceonaws.com
Slideshare: https://slideshare.datascienceonaws.com
Ray AI Runtime (AIR) on AWS - Data Science On AWS Meetup (Chris Fregly)
RSVP Webinar: https://www.eventbrite.com/e/webinarkubeflow-tensorflow-tfx-pytorch-gpu-spark-ml-amazonsagemaker-tickets-45852865154
Talk #0: Introductions and Meetup Announcements By Chris Fregly and Antje Barth
Talk #1: Ray Overview, Ray AI Runtime on AWS using Amazon SageMaker, EC2, EMR, EKS by Chris Fregly, Principal Specialist Solution Architect, AI and Machine Learning @ AWS
Talk #2: Deep-dive Blueprints for Amazon Elastic Kubernetes Service (EKS) including Ray and Spark by Apoorva Kulkarni, Sr. Specialist Solution Architect, Containers and Kubernetes @ AWS
Zoom link: https://us02web.zoom.us/j/82308186562
Smokey and the Multi-Armed Bandit Featuring BERT Reynolds Updated (Chris Fregly)
The document discusses using multi-armed bandit tests to compare natural language models. It describes training BERT models with TensorFlow and PyTorch, and training a multi-armed bandit model with Vowpal Wabbit for reinforcement learning. It then demonstrates testing the BERT models with the bandit model and scaling multi-armed bandits on AWS.
Amazon reInvent 2020 Recap: AI and Machine Learning (Chris Fregly)
Video here: https://youtu.be/YSXe02Y5pHM
NEW RELEASE! Build, Automate, Manage, and Scale ML Workflows with the NEW Amazon SageMaker Pipelines by Hallie Crosby Weishahn.
Description of Talk and Demo
AWS recently announced Amazon SageMaker Pipelines (https://aws.amazon.com/sagemaker/pipelines/), the first purpose-built, easy-to-use Continuous Integration and Continuous Delivery (CI/CD) service for machine learning.
SageMaker Pipelines has three main components which improve the operational resilience and reproducibility of your workflows: 1) pipelines, 2) model registry, and 3) projects.
In this talk and demo, Hallie will walk us through the new Amazon SageMaker Pipelines feature including MLOps support.
Date/Time
9-10am US Pacific Time (Third Monday of Every Month)
RSVP: https://www.eventbrite.com/e/1-hr-free-workshop-pipelineai-gpu-tpu-spark-ml-tensorflow-ai-kubernetes-kafka-scikit-tickets-45852865154
Meetup: https://www.meetup.com/Data-Science-on-AWS/
Zoom: https://zoom.us/j/690414331
Webinar ID: 690 414 331
Phone: +1 646 558 8656 (US Toll) or +1 408 638 0968 (US Toll)
Related Links
Meetup: https://meetup.datascienceonaws.com
GitHub Repo: https://github.com/data-science-on-aws/
O'Reilly Book: https://datascienceonaws.com
YouTube: https://youtube.datascienceonaws.com
Slideshare: https://slideshare.datascienceonaws.com
Support: https://support.pipeline.ai
Monthly Workshop: https://www.eventbrite.com/e/full-day-workshop-kubeflow-gpu-kerastensorflow-20-tf-extended-tfx-kubernetes-pytorch-xgboost-tickets-63362929227
Waking the Data Scientist at 2am: Detect Model Degradation on Production Mod... (Chris Fregly)
The document discusses Amazon SageMaker Model Monitor and Debugger for monitoring machine learning models in production. SageMaker Model Monitor collects prediction data from endpoints, creates a baseline, and runs scheduled monitoring jobs to detect deviations from the baseline. It generates reports and metrics in CloudWatch. SageMaker Debugger helps debug training issues by capturing debug data with no code changes and providing real-time alerts and visualizations in Studio. Both services help detect model degradation and take corrective actions like retraining.
Quantum Computing with Amazon Braket
In this talk, I describe some fundamental principles of quantum computing including qubits, superposition, and entanglement. I will demonstrate how to perform secure quantum computing tasks across many Quantum Processing Units (QPUs) using Amazon Braket, IAM, and S3.
AI and Machine Learning, Quantum Computing, Amazon Braket, QPU
15 Tips to Scale a Large AI/ML Workshop - Both Online and In-Person (Chris Fregly)
In this talk, we present tips and best practices for scaling a large workshop to thousands of simultaneous attendees, both online and in-person. While our workshop is focused on AI and machine learning on AWS, we generalize our learnings for any domain or specialization.
The document provides an overview of announcements from Amazon Web Services' annual re:Invent conference in December 2019. Key details include:
- The conference had 65,000 attendees and 3,000 sessions.
- Announcements covered improving the developer experience, compute, storage, AI/ML, databases/analytics, networking, security, and extending AWS beyond regions.
- New services and features were announced for Lambda, API Gateway, Step Functions, EventBridge, Amplify, SageMaker, EC2, EKS, EBS, S3, Rekognition, Lex, Translate, Transcribe, Comprehend, Personalize, Forecast, Fraud Detector, and more.
This document provides an overview and agenda for a workshop on end-to-end machine learning pipelines using TFX, Kubeflow, Airflow and MLflow. The agenda covers setting up an environment with Kubernetes, using TensorFlow Extended (TFX) components to build pipelines, ML pipelines with Airflow and Kubeflow, hyperparameter tuning with Kubeflow, and deploying notebooks with Kubernetes. Hands-on exercises are also provided to explore key areas like TensorFlow Data Validation, TensorFlow Transform, TensorFlow Model Analysis and Airflow ML pipelines.
Title
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTorch + XGBoost + Airflow + MLflow + Spark + Jupyter + TPU
Video
https://youtu.be/vaB4IM6ySD0
Description
In this workshop, we build real-world machine learning pipelines using TensorFlow Extended (TFX), KubeFlow, and Airflow.
Described in the 2017 paper, TFX is used internally by thousands of Google data scientists and engineers across every major product line within Google.
KubeFlow is a modern, end-to-end pipeline orchestration framework that embraces the latest AI best practices including hyper-parameter tuning, distributed model training, and model tracking.
Airflow is the most widely used pipeline orchestration framework in machine learning.
Pre-requisites
Modern browser - and that's it!
Every attendee will receive a cloud instance
Nothing will be installed on your local laptop
Everything can be downloaded at the end of the workshop
Location
Online Workshop
Agenda
1. Create a Kubernetes cluster
2. Install KubeFlow, Airflow, TFX, and Jupyter
3. Set Up ML Training Pipelines with KubeFlow and Airflow
4. Transform Data with TFX Transform
5. Validate Training Data with TFX Data Validation
6. Train Models with Jupyter, Keras/TensorFlow 2.0, PyTorch, XGBoost, and KubeFlow
7. Run a Notebook Directly on Kubernetes Cluster with KubeFlow
8. Analyze Models using TFX Model Analysis and Jupyter
9. Perform Hyper-Parameter Tuning with KubeFlow
10. Select the Best Model using KubeFlow Experiment Tracking
11. Reproduce Model Training with TFX Metadata Store and Pachyderm
12. Deploy the Model to Production with TensorFlow Serving and Istio
13. Save and Download your Workspace
Key Takeaways
Attendees will gain experience training, analyzing, and serving real-world Keras/TensorFlow 2.0 models in production using model frameworks and open-source tools.
Related Links
1. PipelineAI Home: https://pipeline.ai
2. PipelineAI Community Edition: http://community.pipeline.ai
3. PipelineAI GitHub: https://github.com/PipelineAI/pipeline
4. Advanced Spark and TensorFlow Meetup (SF-based, Global Reach): https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup
5. YouTube Videos: https://youtube.pipeline.ai
6. SlideShare Presentations: https://slideshare.pipeline.ai
7. Slack Support: https://joinslack.pipeline.ai
8. Web Support and Knowledge Base: https://support.pipeline.ai
9. Email Support: support@pipeline.ai
Speaker: Umayah Abdennabi
Agenda
* Intro Grammarly (Umayah Abdennabi, 5 mins)
* Meetup Updates and Announcements (Chris, 5 mins)
* Custom Functions in Spark SQL (30 mins)
Speaker: Umayah Abdennabi
Spark comes with a rich Expression library that can be extended to make custom expressions. We will look into custom expressions and why you would want to use them.
* TF 2.0 + Keras (30 mins)
Speaker: Francesco Mosconi
TensorFlow 2.0 was announced at the March TF Dev Summit, and it brings many changes and upgrades. The most significant change is the inclusion of Keras as the default model building API. In this talk, we'll review the main changes introduced in TF 2.0 and highlight the differences between open source Keras and tf.keras.
* SQuAD Deep-Dive: Question & Answer with Context (45 mins)
Speaker: Brett Koonce (https://quarkworks.co)
SQuAD (Stanford Question Answer Dataset) is an NLP challenge based around answering questions by reading Wikipedia articles, designed to be a real-world machine learning benchmark. We will look at several different ways to tackle the SQuAD problem, building up to state of the art approaches in terms of time, complexity, and accuracy.
https://rajpurkar.github.io/SQuAD-explorer/
https://dawn.cs.stanford.edu/benchmark/#squad
Food and drinks will be provided. The event will be held at Grammarly's office at One Embarcadero Center on the 9th floor. When you arrive at One Embarcadero, take the escalator to the second floor where you will find the lobby and elevators to the office suites. Come on up to the 9th floor (no need to check in at security), and ring the Grammarly doorbell.
PipelineAI Continuous Machine Learning and AI - Rework Deep Learning Summit -... (Chris Fregly)
Traditional machine learning pipelines end with lifeless models sitting on disk in the research lab. These traditional models are typically trained on stale, offline, historical batch data. Static models and stale data are not sufficient to power today's modern, AI-first Enterprises that require continuous model training, continuous model optimizations, and lightning-fast model experiments directly in production. Through a series of open source, hands-on demos and exercises, we will use PipelineAI to breathe life into these models using 4 new techniques that we’ve pioneered:
* Continuous Validation (V)
* Continuous Optimizing (O)
* Continuous Training (T)
* Continuous Explainability (E).
The Continuous "VOTE" techniques have proven to maximize pipeline efficiency, minimize pipeline costs, and increase pipeline insight at every stage from continuous model training (offline) to live model serving (online).
Attendees will learn to create continuous machine learning pipelines in production with PipelineAI, TensorFlow, and Kafka.
PipelineAI Real-Time Machine Learning - Global Artificial Intelligence Confer... (Chris Fregly)
Perform Online Predictions using Slack
Compare Models with A/B and Multi-Armed Bandit Tests
Train Online Models with Kafka Streams
Create new models quickly
Deploy to production safely
Mirror traffic to validate online performance
Any Framework, Any Hardware, Any Cloud
Dashboard to manage the lifecycle of models from local development to live production
Generates optimized runtimes for the models
Custom targeting rules, shadow mode, and percentage-based rollouts to safely test features in live production
Continuous model training, model validation, and pipeline optimization
https://youtu.be/zpkH9oiIovU
https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/258276286/
Related Links
PipelineAI Home: https://pipeline.ai
PipelineAI Community Edition: https://community.pipeline.ai
PipelineAI GitHub: https://github.com/PipelineAI/pipeline
PipelineAI Quick Start: https://quickstart.pipeline.ai
Advanced Spark and TensorFlow Meetup (SF-based, Global Reach): https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup
YouTube Videos: https://youtube.pipeline.ai
SlideShare Presentations: https://slideshare.pipeline.ai
Slack Support: https://joinslack.pipeline.ai
Web Support and Knowledge Base: https://support.pipeline.ai
Email Support: help@pipeline.ai
Hyper-Parameter Tuning Across the Entire AI Pipeline GPU Tech Conference San ... (Chris Fregly)
Chris Fregly, Founder @ PipelineAI, will walk you through a real-world, complete end-to-end Pipeline-optimization example. We highlight hyper-parameters - and model pipeline phases - that have never been exposed until now.
While most hyperparameter optimizers stop at the training phase (e.g., learning rate, tree depth, EC2 instance type), we extend model validation and tuning into a new post-training optimization phase including 8-bit reduced precision weight quantization and neural network layer fusing, among many other framework and hardware-specific optimizations.
Next, we introduce hyperparameters at the prediction phase including request-batch sizing and chipset (CPU v. GPU v. TPU).
Lastly, we determine a PipelineAI Efficiency Score of our overall Pipeline including Cost, Accuracy, and Time. We show techniques to maximize this PipelineAI Efficiency Score using our massive PipelineDB along with the Pipeline-wide hyper-parameter tuning techniques mentioned in this talk.
Bio
Chris Fregly is Founder and Applied AI Engineer at PipelineAI, a Real-Time Machine Learning and Artificial Intelligence Startup based in San Francisco.
He is also an Apache Spark Contributor, a Netflix Open Source Committer, founder of the Global Advanced Spark and TensorFlow Meetup, and author of the O’Reilly Training and Video Series titled "High Performance TensorFlow in Production with Kubernetes and GPUs."
Previously, Chris was a Distributed Systems Engineer at Netflix, a Data Solutions Engineer at Databricks, and a Founding Member and Principal Engineer at the IBM Spark Technology Center in San Francisco.
PipelineAI Optimizes Your Enterprise AI Pipeline from Distributed Training to... (Chris Fregly)
https://pipeline.ai
With PipelineAI, You Can…
* Generate Hardware-Specific Model Optimizations
* Deploy and Compare Models in Live Production
* Optimize Complete AI Pipeline Across Many Models
* Hyper-Parameter Tune Both Training & Predicting Phases
Advanced Spark and TensorFlow Meetup - Dec 12 2017 - Dong Meng, MapR + Kubern... (Chris Fregly)
This document discusses distributed deep learning on the MapR Converged Data Platform. It provides an overview of MapR's enterprise big data journey and capabilities for distributed deep learning. It describes using containers and Kubernetes for deep learning model development and deployment, with NVIDIA GPUs for computation. It presents architectures and patterns for separating or collocating MapR and GPU clusters. Finally, it previews demos of parameter server/workers and real-time face detection using streams.
High Performance Distributed TensorFlow in Production with GPUs - NIPS 2017 -... (Chris Fregly)
Online Workshop
Note: A GPU-based cloud instance will be provided to each attendee for the duration of this event!!
At 8am PT on the morning of this workshop, we will email the Webinar details to your email address registered with Eventbrite.
If this email address is not up to date - or you do not get the email by 8am PT - please email your Eventbrite confirmation to help@pipeline.ai and we'll send you the details.
http://pipeline.ai
Title
PipelineAI Distributed Spark ML + TensorFlow AI + GPU Workshop
Time
Start: 9am PT
End: 1pm PT
Highlights
We will each build an end-to-end, continuous TensorFlow AI model training and deployment pipeline on our own GPU-based cloud instance.
At the end, we will combine our cloud instances to create the LARGEST Distributed TensorFlow AI Training and Serving Cluster in the WORLD!
Pre-requisites
Just a modern browser, internet connection, and a good night's sleep! We'll provide the rest.
Agenda
Spark ML
TensorFlow AI
Storing and Serving Models with HDFS
Trade-offs of CPU vs. GPU, Scale Up vs. Scale Out
CUDA + cuDNN GPU Development Overview
TensorFlow Model Checkpointing, Saving, Exporting, and Importing
Distributed TensorFlow AI Model Training (Distributed TensorFlow)
TensorFlow's Accelerated Linear Algebra Framework (XLA)
TensorFlow's Just-in-Time (JIT) Compiler, Ahead of Time (AOT) Compiler
Centralized Logging and Visualizing of Distributed TensorFlow Training (TensorBoard)
Distributed TensorFlow AI Model Serving/Predicting (TensorFlow Serving)
Centralized Logging and Metrics Collection (Prometheus, Grafana)
Continuous TensorFlow AI Model Deployment (TensorFlow, Airflow)
Hybrid Cross-Cloud and On-Premise Deployments (Kubernetes)
High-Performance and Fault-Tolerant Micro-services (NetflixOSS)
More Info including GitHub and Docker Repos
http://pipeline.ai
Advanced Apache Spark Meetup Project Tungsten Nov 12 2015
1. Power of data. Simplicity of design. Speed of innovation.
IBM Spark
spark.tc
Project Tungsten
Advanced Apache Spark Meetup
Chris Fregly
Principal Data Solutions Engineer
We’re Hiring - Only Nice People!
Nov 12, 2015
2. Who Am I?
Streaming Data Engineer, Netflix
Netflix Open Source Committer
Data Solutions Engineer, Databricks
Apache Spark Contributor
Principal Data Solutions Engineer, IBM Spark Technology Center
Founder, Advanced Apache Spark Meetup
Author, Advanced Spark (Due 2016)
My Ma’s First Time in California
3. Random Slide: More Ma “First Time” Pics
In California
Using Chopsticks
Using “New” iPhone
4. Upcoming Meetups and Conferences
London Spark Meetup (Oct 12th)
Scotland Data Science Meetup (Oct 13th)
Dublin Spark Meetup (Oct 15th)
Barcelona Spark Meetup (Oct 20th)
Madrid Big Data Meetup (Oct 22nd)
Paris Spark Meetup (Oct 26th)
Amsterdam Spark Summit (Oct 27th)
Brussels Spark Meetup (Oct 30th)
Zurich Big Data Meetup (Nov 2nd)
Geneva Spark Meetup (Nov 5th)
San Francisco Datapalooza.io (Nov 10th)
San Francisco Advanced Spark (Nov 12th)
Oslo Big Data Hadoop Meetup (Nov 18th)
Helsinki Spark Meetup (Nov 20th)
Stockholm Spark Meetup (Nov 23rd)
Copenhagen Spark Meetup (Nov 26th)
Budapest Spark Meetup (Nov 27th)
Singapore Strata Conference (Dec 1st)
San Francisco Advanced Spark (Dec 8th)
Mountain View Advanced Spark (Dec 10th)
Toronto Spark Meetup (Dec 14th)
Washington DC Spark Meetup (Jan 2016)
5. Advanced Apache Spark Meetup
Meetup Metrics
~1600 Members in just 4 mos!
4th Most Active Spark Meetup!!
Meetup Goals
Dig deep into codebase of Spark and related projects
Study integrations of Cassandra, ElasticSearch, Tachyon, S3, BlinkDB, Mesos, YARN, Kafka, R
Surface and share patterns and idioms of these well-designed, distributed, big data components
THANKS TO ALL OF YOU!!
6. All Slides and Code Are Available!
slideshare.net/cfregly
github.com/fluxcapacitor
hub.docker.com/r/fluxcapacitor
7. Themes of this Talk
Filter
Off-Heap
Parallelize
Approximate
Find Similarity
Minimize Seeks
Maximize Scans
Customize for Workload
Tune Performance At Every Layer
Be Nice, Collaborate!
Like a Mom!!
8. Outline
① Mechanical Sympathy
② Recap of 100TB GraySort Challenge
③ Project Tungsten Deep Dive
9. Mechanical Sympathy
Hardware and software working together in harmony.
- Martin Thompson
http://mechanical-sympathy.blogspot.com
Whatever your data structure, my array will beat it.
- Scott Meyers
Every C++ Book, basically
Hair
Sympathy
- Bruce Jenner
10. Spark and Mechanical Sympathy
Project Tungsten (Spark 1.4-1.6+)
GraySort Challenge (Spark 1.1-1.2)
Minimize Memory and GC
Maximize CPU Cache Locality
Saturate Network I/O
Saturate Disk I/O
11. AlphaSort Technique: Sort 100-Byte Records
AlphaSort: Dereference Not Required!
List [(Key, Pointer)]
Key is directly available for comparison
Naïve
List [Pointer]
Must dereference key for comparison
Ptr
Dereference for Key Comparison
Key
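A minimal sketch of the two approaches, assuming 100-byte records whose first 10 bytes are the key. The object and method names are hypothetical, not from the talk. On the JVM the (key, pointer) tuples are still heap objects, so this only illustrates the access pattern: the naïve sort touches two full records per comparison, while the AlphaSort-style sort copies each key once and compares the copies.

```scala
object AlphaSortSketch {
  // Assumed record layout: a 10-byte key followed by a 90-byte value.
  val KeyLen = 10

  // Compare the first KeyLen bytes of two arrays as unsigned bytes.
  def compareKeys(a: Array[Byte], b: Array[Byte]): Int = {
    var i = 0
    while (i < KeyLen) {
      val c = (a(i) & 0xff) - (b(i) & 0xff)
      if (c != 0) return c
      i += 1
    }
    0
  }

  // Naive: sort record indices ("pointers"); every comparison dereferences two records.
  def naiveSort(records: Array[Array[Byte]]): Array[Int] =
    records.indices.toArray.sortWith((i, j) => compareKeys(records(i), records(j)) < 0)

  // AlphaSort-style: copy each key next to its pointer once, then compare the copies.
  def alphaSort(records: Array[Array[Byte]]): Array[Int] = {
    val entries = records.indices.toArray.map(i => (records(i).take(KeyLen), i))
    entries.sortWith((x, y) => compareKeys(x._1, y._1) < 0).map(_._2)
  }
}
```

Both methods return the record indices in key order, so the 90-byte values never move during the sort.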
12. Power of data. Simplicity of design. Speed of innovation.
CPU Cache Line and Memory Sympathy
Key (10 bytes) + Pointer (4 bytes*) = 14 bytes
Not CPU Cache-line Friendly!
Key (10 bytes) + Pad (2 bytes) + Pointer (4 bytes*) = 16 bytes
CPU Cache-line Friendly!
Key-Prefix (4 bytes) + Pointer (4 bytes*) = 8 bytes
2x CPU Cache-line Friendly!
*4-byte pointers assume Compressed OOPs
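The 8-byte variant can be sketched by packing a 4-byte key prefix and a 4-byte record index into one primitive Long, so a 64-byte cache line holds eight entries. This is an illustration under assumptions, not Spark's code: the names are hypothetical, keys must have at least 4 bytes, and ties on the prefix are broken by index here, whereas a real sorter would fall back to comparing the full keys.

```scala
object PackedPrefixSort {
  // Pack the first 4 key bytes (big-endian, unsigned) above a 32-bit record index.
  def pack(key: Array[Byte], index: Int): Long = {
    val prefix =
      ((key(0) & 0xffL) << 24) | ((key(1) & 0xffL) << 16) |
      ((key(2) & 0xffL) << 8) | (key(3) & 0xffL)
    (prefix << 32) | (index & 0xffffffffL)
  }

  // The record index lives in the low 32 bits.
  def index(packed: Long): Int = packed.toInt

  // Sort record indices by key prefix using one primitive Long per record.
  // Flipping the sign bit makes signed Long order match unsigned prefix order.
  def sortByPrefix(keys: Array[Array[Byte]]): Array[Int] = {
    val packed = Array.tabulate(keys.length)(i => pack(keys(i), i) ^ Long.MinValue)
    java.util.Arrays.sort(packed)
    packed.map(p => index(p ^ Long.MinValue))
  }
}
```

Sorting a flat Array[Long] scans memory sequentially and never follows a pointer during comparison.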
13. Power of data. Simplicity of design. Speed of innovation.
Performance Comparison
13
14. Power of data. Simplicity of design. Speed of innovation.
Similar Trick: Direct Cache Access (DCA)
Pull out packet header alongside pointer to payload
14
15. Power of data. Simplicity of design. Speed of innovation.
CPU Cache Lines: Sequential vs. Random
15
16. Power of data. Simplicity of design. Speed of innovation.
CPU Cache Naïve Matrix Multiplication
// Dot product of each row & column vector
for (i <- 0 until numRowsA)
  for (j <- 0 until numColsB)
    for (k <- 0 until numColsA)
      res(i)(j) += matA(i)(k) * matB(k)(j)
Bad: Column-wise traversal of matB,
not using CPU cache line,
ineffective pre-fetching
17. Power of data. Simplicity of design. Speed of innovation.
CPU Cache Friendly Matrix Multiplication
// Transpose B (matBT has numColsB rows and numRowsB columns)
for (i <- 0 until numRowsB)
  for (j <- 0 until numColsB)
    matBT(j)(i) = matB(i)(j)
// Modify dot product calculation for B Transpose
for (i <- 0 until numRowsA)
  for (j <- 0 until numColsB)
    for (k <- 0 until numColsA)
      res(i)(j) += matA(i)(k) * matBT(j)(k)
// OLD: res(i)(j) += matA(i)(k) * matB(k)(j)
// NEW: reference j before k
Good: Full CPU cache line,
effective prefetching
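The two loops can be written as a self-contained, runnable comparison. This is a sketch with hypothetical names, assuming rectangular, non-empty Array[Array[Double]] matrices. Each inner array is a separate heap object on the JVM, so walking down a column of b jumps between arrays on every step, while the transposed version scans two rows sequentially.

```scala
object MatMulSketch {
  type Matrix = Array[Array[Double]]

  // Naive: the inner loop walks down a column of b, touching a new row array each step.
  def naive(a: Matrix, b: Matrix): Matrix = {
    val (n, m, p) = (a.length, b.length, b(0).length)
    val res = Array.ofDim[Double](n, p)
    for (i <- 0 until n; j <- 0 until p) {
      var sum = 0.0
      var k = 0
      while (k < m) { sum += a(i)(k) * b(k)(j); k += 1 }
      res(i)(j) = sum
    }
    res
  }

  // Cache-friendly: transpose b once so the inner loop scans two rows sequentially.
  def cacheFriendly(a: Matrix, b: Matrix): Matrix = {
    val (n, m, p) = (a.length, b.length, b(0).length)
    val bT = Array.tabulate(p, m)((j, k) => b(k)(j))
    val res = Array.ofDim[Double](n, p)
    for (i <- 0 until n; j <- 0 until p) {
      val (rowA, rowB) = (a(i), bT(j))
      var sum = 0.0
      var k = 0
      while (k < m) { sum += rowA(k) * rowB(k); k += 1 }
      res(i)(j) = sum
    }
    res
  }
}
```

Both methods add the products in the same order, so they return identical results; only the memory access pattern differs.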
18. Power of data. Simplicity of design. Speed of innovation.
Instrumenting and Monitoring CPU
Use Linux perf command!
18
http://www.brendangregg.com/blog/2015-11-06/java-mixed-mode-flame-graphs.html
19. Power of data. Simplicity of design. Speed of innovation.
Results of Matrix Multiply Comparison
[Chart: perf counters for Naïve Matrix Multiply vs. Cache-Friendly Matrix Multiply; annotated differences: ~72x, ~8x, ~3x, ~3x, ~2x, ~7x, ~10x, ~10x]
perf stat --event L1-dcache-load-misses,L1-dcache-prefetch-misses,LLC-load-misses,LLC-prefetch-misses,cache-misses,stalled-cycles-frontend
JVM flags: -XX:+PreserveFramePointer -XX:-Inline
55 hp vs. 550 hp
20. Power of data. Simplicity of design. Speed of innovation.
Demo!
Compare CPU Naïve & Cache-Friendly Matrix Multiplication
20
21. Power of data. Simplicity of design. Speed of innovation.
CPU Cache Naïve Tuple Counters
object CacheNaiveTupleIncrement {
var tuple = (0,0)
…
def increment(leftIncrement: Int, rightIncrement: Int) : (Int, Int) = {
this.synchronized {
tuple = (tuple._1 + leftIncrement, tuple._2 + rightIncrement)
tuple
}
}
}
21
22. Power of data. Simplicity of design. Speed of innovation.
CPU Cache Naïve Case Class Counters
case class MyTuple(left: Int, right: Int)
object CacheNaiveCaseClassCounters {
var tuple = new MyTuple(0,0)
…
def increment(leftIncrement: Int, rightIncrement: Int) : MyTuple = {
this.synchronized {
tuple = new MyTuple(tuple.left + leftIncrement,
tuple.right + rightIncrement)
tuple
}
}
}
22
23. Power of data. Simplicity of design. Speed of innovation.
CPU Cache Friendly Lock-Free Counters
import java.util.concurrent.atomic.AtomicLong

object CacheFriendlyLockFreeCounters {
  // a single Long (8-bytes) will maintain 2 separate Ints (4-bytes each)
  val tuple = new AtomicLong()
  …
  def increment(leftIncrement: Int, rightIncrement: Int) : Long = {
    var originalLong = 0L
    var updatedLong = 0L
    do {
      originalLong = tuple.get()
      val originalRightInt = originalLong.toInt // low 32 bits hold the right counter
      val originalLeftInt = (originalLong >>> 32).toInt // shift right to get left counter
      val updatedRightInt = originalRightInt + rightIncrement // increment right counter
      val updatedLeftInt = originalLeftInt + leftIncrement // increment left counter
      updatedLong = updatedLeftInt.toLong << 32 // place the left counter in the high 32 bits
      updatedLong |= (updatedRightInt & 0xFFFFFFFFL) // mask so a negative right counter cannot sign-extend into the left
    } while (!tuple.compareAndSet(originalLong, updatedLong)) // retry if another thread updated first
    updatedLong
  }
}
23
Quiz: Why not @volatile?
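One likely answer to the quiz: @volatile guarantees visibility of writes but does not make a read-modify-write atomic, so two threads could read the same old value and one update would be lost. Compare-and-set retries until the update is applied against the value that was actually read. A compact sketch of the same idea with decode helpers follows; the names are hypothetical, and a tail-recursive retry stands in for the do/while loop.

```scala
import java.util.concurrent.atomic.AtomicLong

object PackedCounters {
  private val packed = new AtomicLong(0L)

  // Left counter lives in the high 32 bits, right counter in the low 32 bits.
  def left(v: Long): Int = (v >>> 32).toInt
  def right(v: Long): Int = v.toInt

  // Retry the compare-and-set until no other thread has changed the value in between.
  @annotation.tailrec
  def increment(l: Int, r: Int): Long = {
    val old = packed.get()
    val updated = ((left(old) + l).toLong << 32) | ((right(old) + r) & 0xffffffffL)
    if (packed.compareAndSet(old, updated)) updated else increment(l, r)
  }

  def current: Long = packed.get()
}
```

Callers decode a snapshot with PackedCounters.left(PackedCounters.current) and PackedCounters.right(PackedCounters.current); both counters always come from the same atomic read.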
24. Power of data. Simplicity of design. Speed of innovation.
Demo!
Compare CPU Naïve & Cache-Friendly Tuple Counter Sync
24
25. Power of data. Simplicity of design. Speed of innovation.
Results of Counters Comparison
[Chart: perf counters for Naïve Tuple Counters vs. Naïve Case Class Counters vs. Cache Friendly Lock-Free Counters; annotated differences: ~2x, ~1.5x, ~3.5x, ~2x, ~2x, ~1.5x, ~1.5x, ~1.5x]
26. Power of data. Simplicity of design. Speed of innovation.
Profiling Visualizations: Flame Graphs
With Java Stack Traces!!
26
Example: Spark Word Count
Java Stack Traces
are Good!
Plateaus
are Bad!!
27. Power of data. Simplicity of design. Speed of innovation.
Outline
① Mechanical Sympathy
② Recap of 100TB GraySort Challenge
③ Project Tungsten Deep Dive
27
28. Power of data. Simplicity of design. Speed of innovation.
100TB GraySort Challenge
Sort 100TB of 100-Byte Records with 10-byte Keys
Custom Data Structs & Algos for Sort & Shuffle
Saturate Network and Disk I/O Controllers
28
29. Power of data. Simplicity of design. Speed of innovation.
100TB GraySort Challenge Results
29
Performance Goals
Saturate Network I/O
Saturate Disk I/O
(2013) (2014)
30. Power of data. Simplicity of design. Speed of innovation.
Winning Hardware Configuration
Compute
206 Workers, 1 Master (AWS EC2 i2.8xlarge)
32 cores: Intel Xeon CPU E5-2670 @ 2.5 GHz
244 GB RAM, 8 x 800GB SSD, RAID 0 striping, ext4
3 GBps mixed read/write disk I/O per node
Network
AWS Placement Groups, VPC, Enhanced Networking
Single Root I/O Virtualization (SR-IOV)
10 Gbps, low latency, low jitter (iperf: ~9.5 Gbps)
30
31. Power of data. Simplicity of design. Speed of innovation.
Winning Software Configuration
Spark 1.2, OpenJDK 1.7
Disable caching, compression, speculative execution, shuffle spill
Force NODE_LOCAL task scheduling for optimal data locality
HDFS 2.4.1 short-circuit local reads, 2x replication
Empirically chose between 4-6 partitions per CPU core
206 nodes * 32 cores = 6592 cores
6592 cores * 4 = 26,368 partitions
6592 cores * 6 = 39,552 partitions
6592 cores * 4.25 ~= 28,000 partitions (empirical best)
Range partitioning takes advantage of sequential keyspace
Required ~10s of sampling 79 keys from each partition
31
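The partition arithmetic and the sampling step can be sketched as follows. This is not Spark's RangePartitioner: the names are hypothetical, and keys are modeled as Long values rather than the challenge's 10-byte keys. Split points are taken from a sorted sample, and each key is routed by binary search.

```scala
object RangePartitionSketch {
  // Partitions = nodes * cores per node * partitions per core.
  def partitions(nodes: Int, coresPerNode: Int, perCore: Double): Long =
    math.round(nodes * coresPerNode * perCore)

  // Choose numPartitions - 1 split points from a sorted sample of keys.
  def splitPoints(sample: Seq[Long], numPartitions: Int): Array[Long] = {
    val sorted = sample.sorted.toArray
    Array.tabulate(numPartitions - 1)(i => sorted((i + 1) * sorted.length / numPartitions))
  }

  // A key goes to the first partition whose upper bound is >= the key.
  def partitionFor(key: Long, splits: Array[Long]): Int = {
    val pos = java.util.Arrays.binarySearch(splits, key)
    if (pos >= 0) pos else -(pos + 1)
  }
}
```

Because partition i only holds keys at or below split point i, concatenating the sorted partitions in order yields a globally sorted output.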
32. Power of data. Simplicity of design. Speed of innovation.
New Sort Shuffle Manager for Spark 1.2
Original “hash-based”
New “sort-based”
① Use less OS resources (socket buffers, file descriptors)
② TimSort partitions in-memory
③ MergeSort partitions on-disk into a single master file
④ Serve partitions from master file: seek once, sequential scan
32
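The "single master file plus index" idea can be sketched in memory. This is a simplified model, not Spark's shuffle writer: the names are hypothetical, a byte array stands in for the on-disk file, and records are sorted by partition id only.

```scala
object SortShuffleSketch {
  // One map task's output: all partitions concatenated in one array ("master file"),
  // plus an index of where each partition starts and ends.
  final case class MapOutput(data: Array[Byte], offsets: Array[Int])

  def write(records: Seq[(Int, Array[Byte])], numPartitions: Int): MapOutput = {
    val sorted = records.sortBy(_._1) // stable sort by partition id
    val offsets = new Array[Int](numPartitions + 1)
    sorted.foreach { case (p, bytes) => offsets(p + 1) += bytes.length }
    for (p <- 1 to numPartitions) offsets(p) += offsets(p - 1) // prefix sums
    MapOutput(sorted.flatMap(_._2).toArray, offsets)
  }

  // A reducer reads its partition with one "seek" and one sequential scan.
  def read(out: MapOutput, partition: Int): Array[Byte] =
    out.data.slice(out.offsets(partition), out.offsets(partition + 1))
}
```

Each mapper produces one data file and one index instead of one file per reducer, which is where the savings in file descriptors and buffers come from.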
33. Power of data. Simplicity of design. Speed of innovation.
Asynchronous Network Module
Switch to asynchronous Netty vs. synchronous java.nio
Switch to zero-copy epoll
Use only kernel-space between disk and network controllers
Custom memory management
spark.shuffle.blockTransferService=netty
Spark-Netty Performance Tuning
spark.shuffle.io.preferDirectBuffers=true
Reuse off-heap buffers
spark.shuffle.io.numConnectionsPerPeer=8 (for example)
Increase to saturate hosts with multiple disks (8 x 800GB SSD)
33
Details in
SPARK-2468
34. Power of data. Simplicity of design. Speed of innovation.
Custom Algorithms and Data Structures
Optimized for sort & shuffle workloads
o.a.s.util.collection.TimSort[K,V]
Based on JDK 1.7 TimSort
Performs best with partially-sorted runs
Optimized for elements of (K,V) pairs
Sorts impl of SortDataFormat (e.g. KVArraySortDataFormat)
o.a.s.util.collection.AppendOnlyMap
Open addressing hash, quadratic probing
Array of [(K, V), (K, V)]
Good memory locality
Keys never removed, values only append
34
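A minimal sketch of an append-only, open-addressing map with keys and values interleaved in one flat array. It is a simplified illustration and not Spark's AppendOnlyMap: the class name is hypothetical, keys are Strings, values are Ints, and the table does not grow.

```scala
final class AppendOnlyIntMap(capacity: Int = 64) {
  require(capacity > 1 && (capacity & (capacity - 1)) == 0, "capacity must be a power of two")
  private val mask = capacity - 1
  // Keys and values interleaved in one flat array: [k0, v0, k1, v1, ...]
  private val data = new Array[AnyRef](2 * capacity)
  private var used = 0

  // Find the slot holding `key`, or the empty slot where it would be inserted.
  private def slot(key: String): Int = {
    var pos = key.hashCode & mask
    var delta = 1
    while (data(2 * pos) != null && data(2 * pos) != key) {
      pos = (pos + delta) & mask // quadratic probing: the step grows by one each time
      delta += 1
    }
    pos
  }

  def get(key: String): Option[Int] =
    Option(data(2 * slot(key) + 1)).map(_.asInstanceOf[Integer].intValue)

  // Insert the key or merge into its existing value; keys are never removed.
  def changeValue(key: String, merge: Option[Int] => Int): Int = {
    val pos = slot(key)
    val old = Option(data(2 * pos + 1)).map(_.asInstanceOf[Integer].intValue)
    if (old.isEmpty) {
      require(used < capacity - 1, "this sketch does not grow the table")
      used += 1
      data(2 * pos) = key
    }
    val updated = merge(old)
    data(2 * pos + 1) = Integer.valueOf(updated)
    updated
  }
}
```

A word count is then m.changeValue(word, old => old.getOrElse(0) + 1) per word; a key and its value sit in adjacent array slots, which is the memory-locality point the slide makes.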
35. Power of data. Simplicity of design. Speed of innovation.
Daytona GraySort Challenge Goal Success
1.1 GBps/node network I/O (Reducers)
Theoretical max = 1.25 GBps for 10 Gbps Ethernet
3 GBps/node disk I/O (Mappers)
Aggregate Cluster Network I/O!
220 GBps / 206 nodes ~= 1.1 GBps per node
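The unit arithmetic behind these figures, as a small sketch with hypothetical helper names: a 10 gigabit-per-second link carries at most 10 / 8 = 1.25 gigabytes per second, and the per-node rate is the aggregate divided by the 206 workers.

```scala
object NetworkMath {
  // 8 bits per byte: gigabits per second to gigabytes per second.
  def gbitToGByte(gigabitsPerSec: Double): Double = gigabitsPerSec / 8.0

  // Per-node throughput from an aggregate cluster figure.
  def perNode(aggregate: Double, nodes: Int): Double = aggregate / nodes
}
```

NetworkMath.perNode(220, 206) is roughly 1.07, which the slide rounds to 1.1.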
36. Power of data. Simplicity of design. Speed of innovation.
Shuffle Performance Tuning Tips
Hash Shuffle Manager (Deprecated)
spark.shuffle.consolidateFiles (Mapper)
o.a.s.shuffle.FileShuffleBlockResolver
Intermediate Files
Increase spark.shuffle.file.buffer (Reducer)
Increase spark.reducer.maxSizeInFlight if memory allows
Use Smaller Number of Larger Executors
Minimizes intermediate files and overall shuffle
More opportunity for PROCESS_LOCAL
SQL: BroadcastHashJoin vs. ShuffledHashJoin
spark.sql.autoBroadcastJoinThreshold
Use DataFrame.explain(true) or EXPLAIN to verify
36
Many Threads
(1 per CPU)
37. Power of data. Simplicity of design. Speed of innovation.
Outline
① Mechanical Sympathy
② Recap of 100TB GraySort Challenge
③ Project Tungsten Deep Dive
37
38. Power of data. Simplicity of design. Speed of innovation.
Project Tungsten
Data Structs & Algos Operate Directly on Byte Arrays
Maximize CPU Cache Locality, Minimize GC
Utilize Dynamic Code Generation
38
SPARK-7076
(Spark 1.4)
39. Power of data. Simplicity of design. Speed of innovation.
Quick Review of Project Tungsten Jiras
39
SPARK-7076
(Spark 1.4)
40. Power of data. Simplicity of design. Speed of innovation.
Why is CPU the Bottleneck?
CPU is used for serialization, hashing, compression!
Network and Disk I/O bandwidth are relatively high
GraySort optimizations improved network & shuffle
Partitioning, pruning, and predicate pushdowns
Binary, compressed, columnar file formats (Parquet)
40
41. Power of data. Simplicity of design. Speed of innovation.
Yet Another Spark Shuffle Manager!
spark.shuffle.manager =
hash (Deprecated)
< 10,000 reducers
Output partition file hashes the key of (K,V) pair
Mapper creates an output file per partition
Leads to M*P output files for all partitions
sort (GraySort Challenge)
> 10,000 reducers
Default from Spark 1.2-1.5
Mapper creates single output file for all partitions
Minimizes OS resources, netty + epoll optimizes network I/O, disk I/O, and memory
Uses custom data structures and algorithms for sort-shuffle workload
Wins Daytona GraySort Challenge
tungsten-sort (Project Tungsten)
Default since 1.5
Modification of existing sort-based shuffle
Uses sun.misc.Unsafe for self-managed memory and garbage collection
Maximize CPU utilization and cache locality with AlphaSort-inspired binary data structures/algorithms
Perform joins, sorts, and other operators on both serialized and compressed byte buffers
41
42. Power of data. Simplicity of design. Speed of innovation.
CPU & Memory Optimizations
Custom Managed Memory
Reduces GC overhead
Both on and off heap
Exact size calculations
Direct Binary Processing
Operate on serialized/compressed arrays
Kryo can reorder/sort serialized records
LZF can reorder/sort compressed records
More CPU Cache-aware Data Structs & Algorithms
o.a.s.sql.catalyst.expressions.UnsafeRow
o.a.s.unsafe.map.BytesToBytesMap
Code Generation (default in 1.5)
Generate source code from overall query plan
100+ UDFs converted to use code generation
42
UnsafeFixedWidthAggregationMap
TungstenAggregationIterator
CodeGenerator
GenerateUnsafeRowJoiner
UnsafeSortDataFormat
UnsafeShuffleSortDataFormat
PackedRecordPointer
UnsafeRow
UnsafeInMemorySorter
UnsafeExternalSorter
UnsafeShuffleWriter
Mostly Same Join Code,
UnsafeProjection
UnsafeShuffleManager
UnsafeShuffleInMemorySorter
UnsafeShuffleExternalSorter
Details in
SPARK-7075
43. Power of data. Simplicity of design. Speed of innovation.
sun.misc.Unsafe
43
Info
addressSize()
pageSize()
Objects
allocateInstance()
objectFieldOffset()
Classes
staticFieldOffset()
defineClass()
defineAnonymousClass()
ensureClassInitialized()
Synchronization
monitorEnter()
tryMonitorEnter()
monitorExit()
compareAndSwapInt()
putOrderedInt()
Arrays
arrayBaseOffset()
arrayIndexScale()
Memory
allocateMemory()
copyMemory()
freeMemory()
getAddress() – not guaranteed after GC
getInt()/putInt()
getBoolean()/putBoolean()
getByte()/putByte()
getShort()/putShort()
getLong()/putLong()
getFloat()/putFloat()
getDouble()/putDouble()
getObjectVolatile()/putObjectVolatile()
Used by
Tungsten
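A minimal sketch of the Memory group above: allocate an off-heap block, write and read Longs by raw address, and free it explicitly. The object and method names are hypothetical. It assumes a JDK where sun.misc.Unsafe is still reachable via reflection; recent JDKs may print deprecation warnings for these calls.

```scala
object OffHeapSketch {
  // sun.misc.Unsafe has no public constructor; read the singleton field via reflection.
  val unsafe: sun.misc.Unsafe = {
    val field = classOf[sun.misc.Unsafe].getDeclaredField("theUnsafe")
    field.setAccessible(true)
    field.get(null).asInstanceOf[sun.misc.Unsafe]
  }

  // Store the values in off-heap memory, sum them back, then free the block.
  def sumOffHeap(values: Array[Long]): Long = {
    val base = unsafe.allocateMemory(values.length * 8L)
    try {
      for (i <- values.indices) unsafe.putLong(base + i * 8L, values(i))
      var sum = 0L
      for (i <- values.indices) sum += unsafe.getLong(base + i * 8L)
      sum
    } finally unsafe.freeMemory(base) // the GC never sees this block, so it must be freed by hand
  }
}
```

The block is invisible to the garbage collector, which is the point: no object headers, no GC scanning, and the caller owns the lifetime.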
45. Power of data. Simplicity of design. Speed of innovation.
Traditional Java Object Row Layout
4-byte String
Multi-field Object
45
46. Power of data. Simplicity of design. Speed of innovation.
Custom Data Structures for Workload
UnsafeRow (Dense Binary Row)
Dense, 8-bytes per field (word-aligned)
TaskMemoryManager (Virtual Memory Address)
OS-Style Memory Paging
BytesToBytesMap (Dense Binary HashMap)
AlphaSort-Style (Key + Pointer)
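A dense, word-aligned row can be sketched with a plain byte array: a null bitset padded to 8-byte words, followed by one 8-byte slot per field. This is an illustration of the layout idea only, with hypothetical names. It handles only Long fields, omits a variable-length region, and uses ByteBuffer's default big-endian order.

```scala
import java.nio.ByteBuffer

object DenseRowSketch {
  // Null bitset size: one 8-byte word per 64 fields.
  def bitsetBytes(numFields: Int): Int = ((numFields + 63) / 64) * 8

  // Layout: [null bitset][8 bytes per field value]
  def write(fields: Array[Option[Long]]): Array[Byte] = {
    val header = bitsetBytes(fields.length)
    val buf = ByteBuffer.allocate(header + 8 * fields.length)
    for (i <- fields.indices) {
      fields(i) match {
        case Some(v) => buf.putLong(header + 8 * i, v)
        case None =>
          val word = (i / 64) * 8
          buf.putLong(word, buf.getLong(word) | (1L << (i % 64))) // mark field i as null
      }
    }
    buf.array()
  }

  def isNull(row: Array[Byte], i: Int): Boolean =
    (ByteBuffer.wrap(row).getLong((i / 64) * 8) & (1L << (i % 64))) != 0

  // Field offset is plain arithmetic: no object headers or pointers to follow.
  def getLong(row: Array[Byte], numFields: Int, i: Int): Long =
    ByteBuffer.wrap(row).getLong(bitsetBytes(numFields) + 8 * i)
}
```

A three-field row takes 8 + 3 * 8 = 32 bytes in this sketch, and every field is reachable at a fixed offset.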
47. Power of data. Simplicity of design. Speed of innovation.
UnsafeRow Layout Example
47
Pre-Tungsten
Tungsten
48. Power of data. Simplicity of design. Speed of innovation.
Custom Memory Management
o.a.s.memory.
TaskMemoryManager & MemoryConsumer
Memory management: virtual memory allocation, paging
Off-heap: direct 64-bit address
On-heap: 13-bit page num + 27-bit page offset
o.a.s.shuffle.sort.
PackedRecordPointer
64-bit word
(24-bit partition key, (13-bit page num, 27-bit page offset))
o.a.s.unsafe.types.
UTF8String
Primitive Array[Byte]
48
2^13 pages * 2^27 page size = 1 TB RAM per Task
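The 24/13/27-bit split can be sketched with shifts and masks. The names are hypothetical and this is not Spark's PackedRecordPointer source, only the bit layout described on the slide.

```scala
object PackedPointerSketch {
  // 64-bit word: [24-bit partition id][13-bit page number][27-bit offset in page]
  val PartitionBits = 24
  val PageBits = 13
  val OffsetBits = 27

  def pack(partition: Int, page: Int, offset: Int): Long = {
    require(partition >= 0 && partition < (1 << PartitionBits))
    require(page >= 0 && page < (1 << PageBits))
    require(offset >= 0 && offset < (1 << OffsetBits))
    (partition.toLong << (PageBits + OffsetBits)) | (page.toLong << OffsetBits) | offset.toLong
  }

  def partition(p: Long): Int = (p >>> (PageBits + OffsetBits)).toInt
  def page(p: Long): Int = ((p >>> OffsetBits) & ((1L << PageBits) - 1)).toInt
  def offset(p: Long): Int = (p & ((1L << OffsetBits) - 1)).toInt

  // Bytes addressable by page number + offset: 2^13 pages * 2^27 bytes per page.
  val addressableBytes: Long = 1L << (PageBits + OffsetBits)
}
```

With the partition id in the top bits, comparing two packed words as unsigned 64-bit values groups records by partition without dereferencing them.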
49. Power of data. Simplicity of design. Speed of innovation.
UnsafeFixedWidthAggregationMap
Aggregations
o.a.s.sql.execution.
UnsafeFixedWidthAggregationMap
Uses BytesToBytesMap
In-place updates of serialized data
No object creation on hot-path
Improved external agg support
No OOMs for large, single key aggs
o.a.s.sql.catalyst.expressions.codegen.
GenerateUnsafeRowJoiner
Combine 2 UnsafeRows into 1
o.a.s.sql.execution.aggregate.
TungstenAggregate & TungstenAggregationIterator
Operates directly on serialized, binary UnsafeRow
2 Steps: hash-based agg (grouping), then sort-based agg
Supports spilling and external merge sorting
49
50. Power of data. Simplicity of design. Speed of innovation.
Equality
Bitwise comparison on UnsafeRow
No need to calculate equals(), hashCode()
Row 1
Equals!
Row 2
50
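A small sketch of bitwise equality, assuming both rows use the same fixed-width binary layout so that equal values always produce equal bytes. The encode helper and its 8-bytes-per-Long layout are hypothetical stand-ins for a real row format.

```scala
import java.nio.ByteBuffer

object BitwiseEqualitySketch {
  // Hypothetical fixed-width row: 8 bytes per Long field, no object headers.
  def encode(fields: Long*): Array[Byte] = {
    val buf = ByteBuffer.allocate(8 * fields.length)
    fields.foreach(f => buf.putLong(f))
    buf.array()
  }

  // Equality is a byte-for-byte comparison; no per-field equals() calls.
  def rowsEqual(a: Array[Byte], b: Array[Byte]): Boolean =
    java.util.Arrays.equals(a, b)
}
```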
51. Power of data. Simplicity of design. Speed of innovation.
Joins
Surprisingly, not many code changes
o.a.s.sql.catalyst.expressions.
UnsafeProjection
Converts InternalRow to UnsafeRow
51
52. Power of data. Simplicity of design. Speed of innovation.
Sorting
o.a.s.util.collection.unsafe.sort.
UnsafeSortDataFormat
UnsafeInMemorySorter
UnsafeExternalSorter
RecordPointerAndKeyPrefix
UnsafeShuffleWriter
AlphaSort-Style Cache Friendly
52
Ptr
Key-Prefix
2x CPU Cache-line Friendly!
Using multiple subclasses of SortDataFormat
simultaneously will prevent JIT inlining.
This affects sort & shuffle performance.
Supports merging compressed records
if compression CODEC supports it (LZF)
53. Power of data. Simplicity of design. Speed of innovation.
Spilling
Efficient Spilling
Exact data size is known
No need to maintain heuristics & approximations
Controls amount of spilling
Spill merge on compressed, binary records!
If compression CODEC supports it
53
UnsafeFixedWidthAggregationMap.getPeakMemoryUsedBytes()
Exact Peak Memory
for Spark Jobs
54. Power of data. Simplicity of design. Speed of innovation.
Code Generation
Problem
Boxing causes excessive object creation
Expensive expression tree evals per row
JVM can’t inline polymorphic impls
Solution
Codegen by-passes virtual function calls
Defer source code generation to each operator, UDF, UDAF
Use Scala quasiquote macros for Scala AST source code gen
Rewrite and optimize code for overall plan, 8-byte align, etc
Use Janino to compile generated source code into bytecode
54
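The problem and the solution can be contrasted in a toy expression tree. This is a sketch, not Catalyst: the types and the genCode method here are hypothetical, and the generated Java source is only produced as a String (compiling it, for example with Janino, is left out). The eval path makes one virtual call per node per row; the genCode path flattens the whole tree into a single expression.

```scala
object CodegenSketch {
  sealed trait Expr {
    def eval(row: Array[Long]): Long // interpreted: one virtual call per node per row
    def genCode: String              // generated: one flat Java expression
  }
  final case class Col(i: Int) extends Expr {
    def eval(row: Array[Long]): Long = row(i)
    def genCode: String = s"row[$i]"
  }
  final case class Lit(v: Long) extends Expr {
    def eval(row: Array[Long]): Long = v
    def genCode: String = s"${v}L"
  }
  final case class Add(l: Expr, r: Expr) extends Expr {
    def eval(row: Array[Long]): Long = l.eval(row) + r.eval(row)
    def genCode: String = s"(${l.genCode} + ${r.genCode})"
  }
  final case class Mul(l: Expr, r: Expr) extends Expr {
    def eval(row: Array[Long]): Long = l.eval(row) * r.eval(row)
    def genCode: String = s"(${l.genCode} * ${r.genCode})"
  }

  // Wrap the flattened expression in Java source that a runtime compiler could turn into bytecode.
  def generate(e: Expr): String =
    s"public long eval(long[] row) { return ${e.genCode}; }"
}
```

The generated method works on primitive longs with no boxing and no polymorphic dispatch, which gives the JIT a single small method to optimize.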
55. Power of data. Simplicity of design. Speed of innovation.
Spark SQL UDF Code Generation
100+ UDFs now generating code
More to come in Spark 1.6+
Details in
SPARK-8159, SPARK-9571
Each Implements
Expression.genCode()!
56. Power of data. Simplicity of design. Speed of innovation.
Creating a Custom UDF with Codegen
Study existing implementations
https://github.com/apache/spark/pull/7214/files
Extend base trait
o.a.s.sql.catalyst.expressions.Expression.genCode()
Register the function
o.a.s.sql.catalyst.analysis.FunctionRegistry.registerFunction()
Augment DataFrame with new UDF (Scala implicits)
o.a.s.sql.functions.scala
Don’t forget about Python!
python.pyspark.sql.functions.py
56
57. Power of data. Simplicity of design. Speed of innovation.
Who Benefits from Project Tungsten?
Users of DataFrames
All Spark SQL Queries
Catalyst
All RDDs
Serialization, Compression, and Aggregations
57
58. Power of data. Simplicity of design. Speed of innovation.
Performance Results
Query Time
Garbage
Collection
58
OOM’d on
Large Dataset!
59. Power of data. Simplicity of design. Speed of innovation.
Thank You!!!
Chris Fregly @cfregly
IBM Spark Technology Center
San Francisco, California
Relevant Links
advancedspark.com
Sign up for the book & global meetup!
github.com/fluxcapacitor/pipeline
Clone, contribute, and commit code!
hub.docker.com/r/fluxcapacitor/pipeline/wiki
Run all demos in your own environment with Docker!
59
60. Power of data. Simplicity of design. Speed of innovation.