1. What Is Apache Spark?
Apache Spark is a cluster computing platform designed to be
fast and general purpose.
On the speed side, Spark extends the popular MapReduce
model to efficiently support more types of computations,
including interactive queries and stream processing.
On the generality side, Spark is designed to cover a wide
range of workloads that previously required separate
distributed systems, including batch applications, iterative
algorithms, interactive queries, and streaming.
Spark is designed to be highly accessible, offering simple APIs
in Python, Java, Scala, and SQL, and rich built-in libraries.
Spark itself is written in Scala, and runs on the Java Virtual
Machine (JVM).
3. The Spark stack
Spark Core
Spark Core contains the basic functionality of Spark, including components for task
scheduling, memory management, fault recovery, interacting with storage systems, and
more. Spark Core is also home to the API that defines resilient distributed datasets
(RDDs).
Spark SQL
Spark SQL is Spark’s package for working with structured data. It allows querying data via
SQL as well as the Apache Hive variant of SQL, HQL.
Spark Streaming
Spark Streaming is a Spark component that enables processing of live streams of data.
MLlib
Spark comes with a library containing common machine learning (ML) functionality,
called MLlib. MLlib provides multiple types of machine learning algorithms, including
classification, regression, clustering, and collaborative filtering, as well as supporting
functionality such as model evaluation and data import.
GraphX
GraphX is a library for manipulating graphs (e.g., a social network’s friend graph) and
performing graph-parallel computations.
Cluster Managers
Under the hood, Spark is designed to efficiently scale up from one to many thousands of
compute nodes. To achieve this while maximizing flexibility, Spark can run over a variety
of cluster managers, including Hadoop YARN, Apache Mesos, and a simple cluster
manager included in Spark itself called the Standalone Scheduler.
4. Storage Layers for Spark
• Spark can create distributed datasets from any file
stored in the Hadoop distributed file system (HDFS) or
other storage systems supported by the Hadoop APIs
(including your local file system, Amazon S3,
Cassandra, Hive, HBase, etc.).
• It’s important to remember that Spark does not
require Hadoop; it simply has support for storage
systems implementing the Hadoop APIs.
• Spark supports text files, SequenceFiles, Avro,
Parquet, and any other Hadoop InputFormat.
5. Installing Spark
• Download and extract
Download a compressed TAR file, or tar ball.
You don’t need to have Hadoop, but if you have an existing Hadoop cluster or
HDFS installation, download the matching version.
Extract the tar.
Update the bashrc file
• Spark directory Contents.
README.md
Contains short instructions for getting started with Spark.
bin
Contains executable files that can be used to interact with Spark in various ways like
shell.
core, streaming, python, …
Contains the source code of major components of the Spark project.
examples
Contains some helpful Spark standalone jobs that you can look at and run to learn about
the Spark API.
6. Core Spark Concepts
• Every Spark application consists of a driver program
that launches various parallel operations on a cluster.
The driver program contains your application’s main
function and defines distributed datasets on the cluster,
then applies operations to them.
Driver programs access Spark through a
SparkContext object, which represents a
connection to a computing cluster.
Once you have a SparkContext, you can
use it to build RDDs.
To run operations on RDD, driver
programs typically manage a number of
nodes called executors.
7. Spark’s Python and Scala Shells
• Spark comes with interactive shells that enable ad hoc data
analysis.
• Unlike most other shells, however, which let you manipulate
data using the disk and memory on a single machine, Spark’s
shells allow you to interact with data that is distributed on
disk or in memory across many machines, and Spark takes
care of automatically distributing this processing.
• bin/pyspark and bin/spark-shell to open the respective shell.
• When the shell starts, you will notice a lot of log messages. The
verbosity can be reduced by lowering log4j.rootCategory from INFO to WARN in
conf/log4j.properties.
8. Standalone Applications
• The main difference from using it in the shell is that you need to
initialize your own SparkContext. After that, the API is the same.
• Initializing Spark in Scala
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
val conf = new SparkConf().setMaster("local").setAppName("WordCount")
val sc = new SparkContext(conf)
• Once we have our build defined, we can easily package and run our
application using the bin/spark-submit script.
• The spark-submit script sets up a number of environment variables
used by Spark.
Maven build and run (mvn clean && mvn compile && mvn package)
$SPARK_HOME/bin/spark-submit \
--class com.oreilly.learningsparkexamples.mini.java.WordCount \
./target/learning-spark-mini-example-0.0.1.jar \
./README.md ./wordcounts
9. Word count Scala application
• // Create a Scala Spark Context.
• val conf = new SparkConf().setAppName("wordCount")
• val sc = new SparkContext(conf)
• // Load our input data.
• val input = sc.textFile(inputFile)
• // Split it up into words.
• val words = input.flatMap(line => line.split(" "))
• // Transform into pairs and count.
• val counts = words.map(word => (word, 1)).reduceByKey{case (x, y)
=> x + y}
• // Save the word count back out to a text file, causing evaluation.
• counts.saveAsTextFile(outputFile)
10. Resilient Distributed Dataset
• An RDD is simply an immutable distributed collection of
elements.
• Each RDD is split into multiple partitions, which may be
computed on different nodes of the cluster.
• Once created, RDDs offer two types of operations:
transformations and actions.
• Transformations construct a new RDD from a previous one.
• Actions, on the other hand, compute a result based on an
RDD, and either return it to the driver program or save it to
an external storage system.
• In Spark all work is expressed as either creating new RDDs,
transforming existing RDDs, or calling operations on RDDs
to compute a result.
11. Create RDDs
• Spark provides two ways to create RDDs:
loading an external dataset
val lines = sc.textFile("/path/to/README.md")
parallelizing a collection in your driver program.
The simplest way to create RDDs is to take an existing
collection in your program and pass it to SparkContext’s
parallelize() method.
Beyond prototyping and testing, this is not widely used
since it requires that you have your entire dataset in
memory on one machine.
val lines = sc.parallelize(List(1,2,3))
12. Lazy Evaluation
• Lazy evaluation means that when we call a transformation
on an RDD (for instance, calling map()), the operation is not
immediately performed. Instead, Spark internally records
metadata to indicate that this operation has been
requested.
• Rather than thinking of an RDD as containing specific data,
it is best to think of each RDD as consisting of instructions
on how to compute the data that we build up through
transformations. Loading data into an RDD is lazily
evaluated in the same way transformations are. So, when
we call sc.textFile(), the data is not loaded until it is
necessary.
• Spark uses lazy evaluation to reduce the number of passes
it has to take over our data by grouping operations
together.
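For example, a minimal sketch of lazy evaluation (assuming a log.txt file is available, as in the next slide): nothing is read or filtered until the action runs.
val lines = sc.textFile("log.txt")                          // nothing is read yet
val errors = lines.filter(line => line.contains("error"))   // still nothing is computed
println(errors.count())                                     // the action triggers loading and filtering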
13. RDD Operations(Transformations)
• Transformations are operations on RDDs that return a new RDD, such as
map() and filter().
• Transformed RDDs are computed lazily, only when you use them in an
action.
val inputRDD = sc.textFile("log.txt")
val errorsRDD = inputRDD.filter(line => line.contains("error"))
• Transformations operation does not mutate the existing inputRDD. Instead,
it returns a pointer to an entirely new RDD.
• InputRDD can still be reused later in the program
• Transformations can actually operate on any number of input RDDs. Like
Union to merge the RDDs.
• As you derive new RDDs from each other using transformations, Spark keeps
track of the set of dependencies between different RDDs, called the lineage
graph. It uses this information to compute each RDD on demand and to
recover lost data if part of a persistent RDD is lost.
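A small sketch of how a lineage graph builds up (continuing the inputRDD/errorsRDD example above; warningsRDD and badLinesRDD are illustrative names):
val warningsRDD = inputRDD.filter(line => line.contains("warning"))
val badLinesRDD = errorsRDD.union(warningsRDD)   // badLinesRDD depends on both parent RDDs
If a partition of badLinesRDD is lost, Spark can recompute it from this lineage.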
14. Common Transformations
• Element-wise transformations
MAP
The map() transformation takes in a function and applies it to each
element in the RDD with the result of the function being the new
value of each element in the resulting RDD.
val input = sc.parallelize(List(1, 2, 3, 4))
val result = input.map(x => x * x)
println(result.collect().mkString(","))
Filter
The filter() transformation takes in a function and returns an RDD
that only has elements that pass the filter() function.
FLATMAP
Sometimes we want to produce multiple output elements for each
input element. The operation to do this is called flatMap().
val lines = sc.parallelize(List("hello world", "hi"))
val words = lines.flatMap(line => line.split(" "))
words.first() // returns "hello"
16. Examples(Transformation)
• Create RDD / collect(Action)
val seq = Seq(4,2,5,3,1,4)
val rdd = sc.parallelize(seq)
rdd.collect()
• Filter
val filteredrdd = rdd.filter(x=>x>2)
• Distinct
val rdddist = rdd.distinct()
rdddist.collect()
• Map (square a number)
val rddsquare = rdd.map(x=>x*x);
• Flatmap
val rdd1 = sc.parallelize(List(("Hadoop PIG Hive"), ("Hive PIG PIG Hadoop"),
("Hadoop Hadoop Hadoop")))
val rdd2 = rdd1.flatMap(x => x.split(" "))
• Sample an RDD
rdd.sample(false,0.5).collect()
17. Examples(Transformation) Cont…
• Create RDD
val x = Seq(1,2,3)
val y = Seq(3,4,5)
val rdd1 = sc.parallelize(x)
val rdd2 = sc.parallelize(y)
• Union
rdd1.union(rdd2).collect()
• Intersection
rdd1.intersection(rdd2).collect()
• subtract
rdd1.subtract(rdd2).collect()
• Cartesian
rdd1.cartesian(rdd2).collect()
18. RDD Operations(Actions)
• Actions are operations that return a result to the driver
program or write it to storage, and kick off a
computation, such as count() and first().
• Actions force the evaluation of the transformations
required for the RDD they were called on, since they
need to actually produce output.
println("Input had " + badLinesRDD.count() + " concerning lines")
println("Here are 10 examples:")
badLinesRDD.take(10).foreach(println)
• It is important to note that each time we call a new
action, the entire RDD must be computed “from
scratch.”
20. Examples(Action)
• Reduce
The most common action on basic RDDs you will likely use is reduce(), which
takes a function that operates on two elements of the type in your RDD and
returns a new element of the same type. A simple example of such a function
is +, which we can use to sum our RDD. With reduce(), we can easily sum the
elements of our RDD, count the number of elements, and perform other types
of aggregations.
val seq = Seq(4,2,5,3,1,4)
val rdd = sc.parallelize(seq)
val sum = rdd.reduce((x,y)=> x+y)
• Fold
Similar to reduce() is fold(), which also takes a function with the same
signature as needed for reduce(), but in addition takes a “zero value” to be
used for the initial call on each partition. The zero value you provide should be
the identity element for your operation;
val seq = Seq(4,2,5,3,1,4)
val rdd = sc.parallelize(seq,2)
val rdd2 = rdd.fold(1)((x,y)=>x+y);
22. Examples(Action) cont.…
• takeOrdered(num)(ordering)
Reverse Order (Highest number)
val seq = Seq(3,9,2,3,5,4)
val rdd = sc.parallelize(seq,2)
rdd.takeOrdered(1)(Ordering[Int].reverse)
Custom Order (Highest based on age)
case class Person(name:String,age:Int)
val rdd = sc.parallelize(Array(("x",10),("y",14),("z",12)))
val rdd2 = rdd.map(x=>Person(x._1,x._2))
rdd2.takeOrdered(1)(Ordering[Int].reverse.on(x=>x.age))
(highest/lowest repeated word in word count program)
val rdd1 = sc.parallelize(List(("Hadoop PIG Hive"), ("Hive PIG PIG Hadoop"), ("Hadoop
Hadoop Hadoop")))
val rdd2 = rdd1.flatMap(x => x.split(" ")).map(x => (x,1))
val rdd3 = rdd2.reduceByKey((x,y) => (x+y))
rdd3.takeOrdered(3)(Ordering[Int].on(x=>x._2)) //Lowest value
rdd3.takeOrdered(3)(Ordering[Int].reverse.on(x=>x._2)) //Highest value
23. Persist(Cache) RDD
• Spark’s RDDs are by default recomputed each time you run an action on them. If
you would like to reuse an RDD in multiple actions, you can ask Spark to persist it
using RDD.persist().
• After computing it the first time, Spark will store the RDD contents in memory
(partitioned across the machines in your cluster), and reuse them in future actions.
Persisting RDDs on disk instead of memory is also possible.
• If a node that has data persisted on it fails, Spark will recompute the lost partitions
of the data when needed.
• We can also replicate our data on multiple nodes if we want to be able to handle
node failure without slowdown.
• If you attempt to cache too much data to fit in memory, Spark will automatically
evict old partitions using a Least Recently Used (LRU) cache policy. Caching
unnecessary data can lead to eviction of useful data and more recomputation
time.
• RDDs come with a method called unpersist() that lets you manually remove them
from the cache.
• cache() is the same as calling persist() with the default storage level.
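A small sketch of persisting an RDD that is reused by two actions (MEMORY_ONLY is the default storage level):
import org.apache.spark.storage.StorageLevel
val input = sc.parallelize(List(1, 2, 3, 4))
val result = input.map(x => x * x)
result.persist(StorageLevel.MEMORY_ONLY)   // marks the RDD for caching; nothing is computed yet
println(result.count())                    // first action computes the RDD and caches its partitions
println(result.collect().mkString(","))    // reuses the cached partitions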
24. Persistence levels
Note: persist()/cache() only take effect when the RDD is next computed, that is, when an
action runs on the RDD (or on an RDD derived from it) after the persist call; marking an
RDD as persisted does not trigger any computation by itself.
25. Spark Summary
• To summarize, every Spark program and shell
session will work as follows:
Create some input RDDs from external data.
Transform them to define new RDDs using
transformations like filter().
Ask Spark to persist() any intermediate RDDs that will
need to be reused. cache() is the same as calling
persist() with the default storage level.
Launch actions such as count() and first() to kick off a
parallel computation, which is then optimized and
executed by Spark.
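Putting the four steps together in one short sketch (assuming a local README.md, as used earlier):
val lines = sc.textFile("README.md")                           // 1. create an input RDD
val sparkLines = lines.filter(line => line.contains("Spark"))  // 2. transform it
sparkLines.persist()                                           // 3. persist intermediate results for reuse
println(sparkLines.count())                                    // 4. actions kick off the parallel computation
println(sparkLines.first())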
26. Pair RDD (Key/Value Pair)
• Spark provides special operations on RDDs containing
key/value pairs, called pair RDDs.
• Key/value RDDs are commonly used to perform
aggregations, and often we will do some initial ETL to get
our data into a key/value format.
• Key/value RDDs expose new operations (e.g., counting up
reviews for each product, grouping together data with the
same key, and grouping together two different RDDs)
• Creating Pair RDDs
Some loading formats directly return the Pair RDDs.
Using map function
Scala
val pairs = lines.map(x => (x.split(" ")(0), x))
27. Transformations on Pair RDDs
• Pair RDDs are allowed to use all the transformations
available to standard RDDs.
• Aggregations
When datasets are described in terms of key/value
pairs, it is common to want to aggregate statistics
across all elements with the same key. We have
looked at the fold(), aggregate(), and reduce() actions
on basic RDDs, and similar per-key transformations
exist on pair RDDs. Spark has a similar set of
operations that combines values that have the same
key. These operations return RDDs and thus are called
transformations rather than actions.
28. Transformations...cont…
• reduceByKey()
is quite similar to reduce(); both take a function and use it to combine values.
reduceByKey() runs several parallel reduce operations, one for each key in the
dataset, where each operation combines values that have the same key.
Because datasets can have very large numbers of keys, reduceByKey() is not
implemented as an action that returns a value to the user program. Instead, it
returns a new RDD consisting of each key and the reduced value for that key.
• Example Word count in Scala
val input = sc.textFile("s3://...")
val words = input.flatMap(x => x.split(" "))
val result = words.map(x => (x, 1)).reduceByKey((acc, value) => acc + value)
• We can actually implement word count even faster by using the
countByValue() function on the first RDD: input.flatMap(x =>x.split("
")).countByValue().
30. Transformations...cont…
• foldByKey()
is quite similar to fold(); both use a zero value of
the same type of the data in our RDD and
combination function. As with fold(), the provided
zero value for foldByKey() should have no impact
when added with your combination function to
another element.
Those familiar with the combiner concept from MapReduce should note that
calling reduceByKey() and foldByKey() will automatically perform combining locally
on each machine before computing global totals for each key. The user does not
need to specify a combiner. The more general combineByKey() interface allows you
to customize combining behavior.
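A small foldByKey() sketch (the pairs RDD is illustrative), summing the values per key with 0 as the zero value:
val pairs = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3)))
val sums = pairs.foldByKey(0)((x, y) => x + y)
sums.collect()   // e.g. Array((a,4), (b,2))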
31. Transformations...cont…
• combineByKey()
is the most general of the per-key aggregation functions. Most of the other
per-key combiners are implemented using it. Like aggregate(), combineByKey()
allows the user to return values that are not the same type as our input
data.
As combineByKey() goes through the elements in a partition, each element
either has a key it hasn’t seen before or has the same key as a previous
element.
If it’s a new element, combineByKey() uses a function we provide, called
createCombiner(), to create the initial value for the accumulator on that key. It’s
important to note that this happens the first time a key is found in each partition, rather
than only the first time the key is found in the RDD.
If it is a value we have seen before while processing that partition, it will instead use the
provided function, mergeValue(), with the current value for the accumulator for that key
and the new value.
Since each partition is processed independently, we can have multiple
accumulators for the same key. When we are merging the results from each
partition, if two or more partitions have an accumulator for the same key we
merge the accumulators using the user-supplied mergeCombiners() function.
32. Transformations...cont…
• Per-key average using combineByKey() in Scala
val result = input.combineByKey(
  (v) => (v, 1),
  (acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1),
  (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)
).map { case (key, value) => (key, value._1 / value._2.toFloat) }
result.collectAsMap().map(println(_))
33. Transformations...cont…
• GroupByKey()
With keyed data a common use case is grouping our data by key—for
example, viewing all of a customer’s orders together.
If our data is already keyed in the way we want, groupByKey() will
group our data using the key in our RDD. On an RDD consisting of keys
of type K and values of type V, we get back an RDD of type [K,
Iterable[V]].
If you find yourself writing code where you groupByKey() and then use
a reduce() or fold() on the values, you can probably achieve the same
result more efficiently by using one of the per-key aggregation
functions. Rather than reducing the RDD to an in-memory value, we
reduce the data per key and get back an RDD with the reduced values
corresponding to each key. For example, rdd.reduceByKey(func)
produces the same RDD as rdd.groupByKey().mapValues(value =>
value.reduce(func)) but is more efficient as it avoids the step of
creating a list of values for each key.
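A small groupByKey() sketch (the orders RDD is illustrative):
val orders = sc.parallelize(List(("alice", "book"), ("bob", "pen"), ("alice", "lamp")))
val grouped = orders.groupByKey()        // RDD[(String, Iterable[String])]
grouped.mapValues(_.toList).collect()    // e.g. Array((alice,List(book, lamp)), (bob,List(pen)))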
34. Transformations...cont…
• cogroup
In addition to grouping data from a single RDD, we can group
data sharing the same key from multiple RDDs using a function
called cogroup(). cogroup() over two RDDs sharing the same key
type, K, with the respective value types V and W gives us back
RDD[(K, (Iterable[V], Iterable[W]))]. If one of the RDDs doesn’t
have elements for a given key that is present in the other RDD,
the corresponding Iterable is simply empty.
cogroup() gives us the power to group data from multiple RDDs.
cogroup() is used as a building block for the joins. However
cogroup() can be used for much more than just implementing
joins. We can also use it to implement intersect by key.
Additionally, cogroup() can work on three or more RDDs at
once.
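A small cogroup() sketch over two illustrative pair RDDs with String keys:
val rddA = sc.parallelize(List(("a", 1), ("b", 2)))
val rddB = sc.parallelize(List(("a", "x"), ("c", "y")))
rddA.cogroup(rddB).collect()
// one grouped entry per key, e.g. (a,(Iterable(1),Iterable(x))), (b,(Iterable(2),Iterable())), (c,(Iterable(),Iterable(y)))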
35. Transformations...cont…
• Joins
Joining data together is probably one of the most common operations on a
pair RDD, and we have a full range of options including right and left outer
joins, cross joins, and inner joins.
• Scala shell inner join
storeAddress = {(Store("Ritual"), "1026 Valencia St"), (Store("Philz"), "748 Van
Ness Ave"), (Store("Philz"), "3101 24th St"), (Store("Starbucks"), "Seattle")}
storeRating = {(Store("Ritual"), 4.9), (Store("Philz"), 4.8)}
storeAddress.join(storeRating) == {(Store("Ritual"), ("1026 Valencia St", 4.9)),
(Store("Philz"), ("748 Van Ness Ave", 4.8)), (Store("Philz"), ("3101 24th St",
4.8))}
• leftOuterJoin() and rightOuterJoin()
storeAddress.leftOuterJoin(storeRating) == {(Store("Ritual"),("1026 Valencia
St",Some(4.9))), (Store("Starbucks"),("Seattle",None)), (Store("Philz"),("748
Van Ness Ave",Some(4.8))), (Store("Philz"),("3101 24th St",Some(4.8)))}
storeAddress.rightOuterJoin(storeRating) == {(Store("Ritual"),(Some("1026
Valencia St"),4.9)), (Store("Philz"),(Some("748 Van Ness Ave"),4.8)),
(Store("Philz"), (Some("3101 24th St"),4.8))}
38. Tuning the level of parallelism
• When performing aggregations or grouping operations, we can ask Spark
to use a specific number of partitions. Spark will always try to infer a
sensible default value based on the size of your cluster, but in some cases
you will want to tune the level of parallelism for better performance.
val data = Seq(("a", 3), ("b", 4), ("a", 1))
sc.parallelize(data).reduceByKey((x, y) => x + y) // Default parallelism
sc.parallelize(data).reduceByKey((x, y) => x + y,10) // Custom parallelism
• Repartitioning your data is a fairly expensive operation. Spark also has an
optimized version of repartition() function called coalesce() that allows
avoiding data movement, but only if you are decreasing the number of
RDD partitions.
• To know whether you can safely call coalesce(), you can check the size of
the RDD using rdd.partitions.size or rdd.getNumPartitions.
• To see each partition data of a RDD,
scala> rdd.mapPartitionsWithIndex( (index: Int, it: Iterator[(Int,Int)]) =>
it.toList.map(x => index + ":" + x).iterator).collect
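A small sketch of changing the partition count (illustrative values):
val nums = sc.parallelize(1 to 100, 10)
println(nums.getNumPartitions)   // 10
val fewer = nums.coalesce(2)     // decreasing partitions; avoids a full shuffle
println(fewer.getNumPartitions)  // 2
val more = nums.repartition(20)  // repartition() shuffles data to any partition count
println(more.getNumPartitions)   // 20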
39. Sorting Data
• Having sorted data is quite useful in many cases, especially when
you’re producing downstream output. We can sort an RDD with
key/value pairs provided that there is an ordering defined on the
key. Once we have sorted our data, any subsequent call on the
sorted data to collect() or save() will result in ordered data.
• Since we often want our RDDs in the reverse order, the sortByKey()
function takes a boolean parameter, ascending, indicating whether we
want the keys in ascending order (it defaults to true).
• Sometimes we want a different sort order entirely, and to support
this we can provide our own comparison function.
Example: ordering the word-count pairs below by their counts (a sketch of sorting integer keys as if they were strings follows after this example)
val rdd1 = sc.parallelize(List(("Hadoop PIG Hive"), ("Hive PIG PIG Hadoop"),
("Hadoop Hadoop Hadoop")))
val rdd2 = rdd1.flatMap(x => x.split(" ")).map(x => (x,1))
val rdd3 = rdd2.reduceByKey((x,y) => (x+y))
rdd3.takeOrdered(3)(Ordering[Int].on(x=>x._2))
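A sketch of sorting integer keys as if they were strings, using sortByKey() with a custom implicit Ordering (the counts RDD is illustrative):
val counts = sc.parallelize(List((3, "c"), (20, "t"), (100, "h")))
implicit val sortIntegersByString: Ordering[Int] = new Ordering[Int] {
  override def compare(a: Int, b: Int): Int = a.toString.compareTo(b.toString)
}
counts.sortByKey().collect()   // keys ordered as strings: 100, 20, 3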
40. Actions available on Pair RDDs
• As with the transformations, all of the traditional actions available on the
base RDD are also available on pair RDDs. Some additional actions are
available on pair RDDs to take advantage of the key/value nature of the
data.
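For example, a few pair RDD actions (the pairs RDD is illustrative):
val pairs = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3)))
pairs.countByKey()     // Map(a -> 2, b -> 1): number of elements per key
pairs.collectAsMap()   // the RDD as a Map (one value kept per key)
pairs.lookup("a")      // Seq(1, 3): all values for key "a"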
41. Accumulators
• Accumulators, a type of shared variable, provide a simple syntax for aggregating values
from worker nodes back to the driver program.
• One of the most common uses of accumulators is to count events that occur
during job execution for debugging purposes.
val sc = new SparkContext(...)
val file = sc.textFile("file.txt")
val blankLines = sc.accumulator(0) // Create Accumulator[Int] initialized to 0
val callSigns = file.flatMap(line => {
if (line == "") {
blankLines += 1 // Add to the accumulator
}
line.split(" ")})
callSigns.saveAsTextFile("output.txt")
println("Blank lines: " + blankLines.value)
Note that we will see the right count only after we run the saveAsTextFile() action, because the
transformation above it, flatMap(), is lazy, so the side effect of incrementing the accumulator will
happen only when the flatMap() transformation is forced to occur by the saveAsTextFile() action.
Also note that tasks on worker nodes cannot access the accumulator’s value()—from the point of view
of these tasks, accumulators are write-only variables. This allows accumulators to be implemented
efficiently, without having to communicate every update.
42. Accumulators cont…
• Accumulators and Fault Tolerance
Spark automatically deals with failed or slow machines by re-executing failed or slow tasks. For
example, if the node running a partition of a map() operation crashes, Spark will rerun it on
another node; and even if the node does not crash but is simply much slower than other
nodes, Spark can preemptively launch a “speculative” copy of the task on another node, and
take its result if that finishes. Even if no nodes fail, Spark may have to rerun a task to rebuild a
cached value that falls out of memory. The net result is therefore that the same function may
run multiple times on the same data depending on what happens on the cluster.
The end result is that for accumulators used in actions, Spark applies each task’s update to
each accumulator only once. Thus, if we want a reliable absolute value counter, regardless of
failures or multiple evaluations, we must put it inside an action like foreach().
• Custom Accumulators
Spark supports accumulators of type Int, Double, Long, and Float.
Spark also includes an API to define custom accumulator types and custom aggregation
operations.
Custom accumulators need to extend AccumulatorParam.
we can use any operation for add, provided that operation is commutative and associative.
An operation op is commutative if a op b = b op a for all values a, b.
An operation op is associative if (a op b) op c = a op (b op c) for all values a, b, and c.
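A sketch of a custom accumulator using the classic AccumulatorParam API shown above (StringSetParam and seenWords are illustrative names); set union is commutative and associative, so it is a valid add operation:
import org.apache.spark.AccumulatorParam
object StringSetParam extends AccumulatorParam[Set[String]] {
  def zero(initialValue: Set[String]): Set[String] = Set.empty[String]
  def addInPlace(s1: Set[String], s2: Set[String]): Set[String] = s1 ++ s2
}
val seenWords = sc.accumulator(Set.empty[String])(StringSetParam)
sc.parallelize(List("hello world", "hi")).foreach { line =>
  seenWords += line.split(" ").toSet   // updated inside an action, so each task's update is applied once
}
println(seenWords.value)               // read back on the driver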
43. Broadcast Variables
• Broadcast variables, another type of shared variable, allow the program to efficiently send a large,
read-only value to all the worker nodes for use in one or more Spark operations.
Without a broadcast, it is expensive to ship a large value (such as the Array below) from the driver
with every task, and if we used the same object again later (for example, running the same code for
other files), it would be sent to each node once more.
val c = sc.broadcast(Array(100,200))
val a = sc.parallelize(List("This is Krishna", "Sureka learning the Spark"))
val d = a.flatMap(x=>x.split(" ")).map(x=>x+c.value(1))
• Optimizing Broadcasts
When we are broadcasting large values, it is important to choose a data serialization format
that is both fast and compact.
44. Working on a Per-Partition Basis
Working with data on a per-partition basis allows us to avoid redoing
setup work for each data item. Operations like opening a database
connection or creating a random number generator are examples of
setup steps that we wish to avoid doing for each element.
Spark has per-partition versions of map and foreach to help reduce the
cost of these operations by letting you run code only once for each
partition of an RDD.
Example-1 (mapPartitions)
val a = sc.parallelize(List(1,2,3,4,5,6),2)
scala> val b = a.mapPartitions((x:Iterator[Int])=>{println("Hello"); x.toList.map(y=>y+1).iterator})
scala> val b = a.mapPartitions((x:Iterator[Int])=>{println("Hello"); x.map(y=>y+1)})
scala> b.collect
Hello
Hello
res20: Array[Int] = Array(2, 3, 4, 5, 6, 7)
scala> a.mapPartitions(x=>x.filter(y => y > 1)).collect
res35: Array[Int] = Array(2, 3, 4, 5, 6)
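A sketch of the setup-per-partition idea described above: the helper object is built once per partition instead of once per element (ExpensiveParser is a hypothetical stand-in for e.g. a database connection):
class ExpensiveParser {
  def parse(s: String): Int = s.trim.length   // stand-in for real parsing work
}
val rawLines = sc.parallelize(List(" a ", " bb ", "ccc "), 2)
val lengths = rawLines.mapPartitions { iter =>
  val parser = new ExpensiveParser            // created once per partition, not per line
  iter.map(parser.parse)
}
lengths.collect()   // Array(1, 2, 3)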
45. Working on a Per-Partition Basis
Example-2 (mapPartitionWithIndex)
scala> val b = a.mapPartitionsWithIndex((index: Int, x: Iterator[Int]) => {println("Hello from " + index); if (index == 1) {x.map(y => y + 1)} else {x.map(y => y + 200)}})
b: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[28] at mapPartitionsWithIndex at <console>:29
scala> val b = a.mapPartitionsWithIndex((index, x) => {println("Hello from " + index); if (index == 1) {x.map(y => y + 1)} else {x.map(y => y + 200)}})
scala> b.collect
Hello from 0
Hello from 1
res44: Array[Int] = Array(201, 202, 203, 5, 6, 7)