Apache Spark is a fast, general-purpose cluster computing system that allows processing of large datasets in parallel across clusters. It can be used for batch processing, streaming, and interactive queries. Spark improves on Hadoop MapReduce by using an in-memory computing model that is faster than disk-based approaches. It includes APIs for Java, Scala, Python and supports machine learning algorithms, SQL queries, streaming, and graph processing.
Storm is a distributed real-time computation framework created by Nathan Marz at BackType/Twitter to analyze tweets, links, and users on Twitter in real-time. It provides scalability, fault tolerance, and guarantees of data processing. Storm addresses problems with Hadoop like lack of real-time processing, long latency, and tedious coding through its stream processing capabilities and by being stateless. It has features like scalability, fault tolerance through Zookeeper, and guarantees of at least once processing.
This document provides an overview of resource aware scheduling in Apache Storm. It discusses the challenges of scheduling Storm topologies at Yahoo scale, including increasing heterogeneous clusters, low cluster utilization, and unbalanced resource usage. It then introduces the Resource Aware Scheduler (RAS) built for Storm, which allows fine-grained resource control and isolation for topologies through APIs and cgroups. Key features of RAS include pluggable scheduling strategies, per user resource guarantees, and topology priorities. Experimental results from Yahoo Storm clusters show significant improvements to throughput and resource utilization with RAS. Future work may include improved scheduling strategies and real-time resource monitoring.
Bobby from Yahoo presents on running Apache Storm as a service on and off Hadoop. Storm provides low-latency data processing through streaming data flows defined by topologies of spouts and bolts. Yahoo runs Storm as a service and also maintains Spark. Bobby discusses securing standalone Storm, running Storm on YARN for security, reduced overhead and elasticity, and future work including Nimbus high availability and running Storm topologies as unmanaged applications in YARN.
Storm is an open source distributed real-time computation system. It provides guarantees of processing data reliably in real-time. Storm allows for building real-time streaming data pipelines that process unbounded streams of data reliably. Key features include being distributed, fault-tolerant, guaranteeing message processing, and providing a high level abstraction over message passing.
The document provides an introduction to Apache Storm, an open-source distributed real-time computation system. It outlines the core concepts of Storm including topologies, spouts, bolts, streams and tuples. Spouts are sources of streams, while bolts process input streams and produce output streams. Topologies define the logic of an application as a graph of operators and streams. The document also discusses Storm's architecture, guaranteed processing, usage examples at companies like Twitter and Yahoo, and comparisons to other frameworks.
A tutorial presentation based on storm.apache.org documentation.
I gave this presentation at Amirkabir University of Technology as Teaching Assistant of Cloud Computing course of Dr. Amir H. Payberah in spring semester 2015.
The document summarizes updates and new features in Apache Storm 1.0 including:
- Pacemaker replaces Zookeeper for distributed coordination allowing larger clusters
- A distributed cache API allows sharing files between topologies
- High availability Nimbus allows Nimbus hosts to join/leave dynamically
- Native streaming windows support event time processing and out of order tuples
- State management provides automatic checkpointing for stateful bolts
- Automatic back pressure throttles spouts when buffers reach high watermarks
- The resource aware scheduler allocates tasks based on component resource requirements
- Usability improvements include dynamic log levels, tuple sampling for debugging, and distributed log searching
- New integrations support Cassandra, Sol
This document compares the batch and streaming capabilities of Spark and Storm. Spark supports both batch and micro-batch processing while Storm supports micro-batch and real-time stream processing. Spark has been in production mode since 2013 and is implemented in Scala, while Storm has been used since 2011 and is implemented in Clojure and Java. Spark includes libraries for SQL, streaming, and machine learning while Storm uses spouts to read data streams and bolts to filter and join data in topologies. Both integrate with Hadoop and support fault tolerance, though Spark has improved reliability when used with YARN. Performance tests show Spark Streaming can process more records per second than Storm.
Storm: distributed and fault-tolerant realtime computation (nathanmarz)
Storm is a distributed real-time computation system that provides guaranteed message processing, horizontal scalability, and fault tolerance. It allows users to define data processing topologies and submit them to a Storm cluster for distributed execution. Spouts emit streams of tuples that are processed by bolts. Storm tracks processing to ensure reliability and replays failed tasks. It provides tools for deployment, monitoring, and optimization of real-time data processing.
Java 9 is just around the corner. In this session, we'll describe the new modularization support (Jigsaw), new JDK tools, enhanced APIs and many performance improvements that were added to the new version.
This document discusses how to use Storm and Hadoop together to enable real-time and batch processing of large datasets. It describes using Hadoop to precompute batch views of data, and Storm to incrementally update real-time views as new data streams in. This allows for low-latency queries by combining precomputed batch views with real-time views that compensate for recent data not yet absorbed into the batch views.
This document presents two new approaches for reliable message processing in distributed streaming systems like Apache Storm:
1. A fingerprint-based approach that embeds a digest representing message context that is recursively passed down and updated.
2. A share-split approach that embeds a "share" with each message and splits the share at each component until the leaf where shares are reported.
It also discusses prototyping one approach by integrating it into Apache Storm and notes on the implementation.
One Grid to rule them all: Building a Multi-tenant Data Cloud with YARN (DataWorks Summit)
This document discusses using Apache Helix for managing multi-tenant data and applications on YARN. Helix is a generic cluster management framework that handles task and container assignment, failure handling, and workload balancing in a decoupled manner from the core application logic. It provides a high-level overview of key Helix concepts like resources, partitions, and states. The document also outlines how Helix integrates with YARN by using components like the TargetProvider to determine container requirements, Provisioner to acquire/release containers from YARN, and Rebalancer to assign tasks to containers based on constraints. This allows building fault-tolerant applications that can scale efficiently based on workload without having to handle complex cluster management code.
Slides from talk given at the NYC Cassandra Meetup. Discussing how Storm works and how it integrates well with Apache Cassandra.
There is also a segue into an example project that uses Storm and Cassandra to implement a scalable reactive web crawler.
http://github.com/tjake/stormscraper
Bobby Evans and Tom Graves, the engineering leads for Spark and Storm development at Yahoo, will talk about how these technologies are used on Yahoo's grids and the reasons to use one or the other.
Bobby Evans is the low latency data processing architect at Yahoo. He is a PMC member on many Apache projects including Storm, Hadoop, Spark, and Tez. His team is responsible for delivering Storm as a service to all of Yahoo and maintaining Spark on Yarn for Yahoo (Although Tom really does most of that work).
Tom Graves a Senior Software Engineer on the Platform team at Yahoo. He is an Apache PMC member on Hadoop, Spark, and Tez. His team is responsible for delivering and maintaining Spark on Yarn for Yahoo.
Apache Storm and Twitter Streaming API integration (Uday Vakalapudi)
Storm is a distributed, fault-tolerant system for real-time computation on streams of data. It uses a non-persistent API and executes topologies made up of spouts that produce data tuples and bolts that transform tuples. Once submitted, Storm topologies will continuously execute the computation logic in the bolts in a distributed, fault-tolerant manner as long as the logic is correct.
Developing Java Streaming Applications with Apache Storm (Lester Martin)
This document provides an overview of developing Java streaming applications with Apache Storm. It discusses what Storm is, its conceptual model including tuples, streams, spouts, bolts and topologies. It demonstrates developing a word count topology with code examples. It also covers Storm's runtime architecture, additional features like reliability, and integrating Storm with technologies like Kafka and HBase.
This document summarizes a presentation on Inferno, a system for scalable deep learning on Apache Spark. Inferno allows deep learning models built with Blaze, La Trobe University's deep learning system, to be trained faster using a Spark cluster. It coordinates distributed training of Blaze models across worker nodes, with optimized communication of weights and hyperparameters. Evaluation shows Inferno can train ResNet models on ImageNet up to 4-5 times faster than a single GPU. The presentation provides an overview of deep learning and Spark, demonstrates how Blaze allows easy model building, and explains Inferno's architecture for distributed deep learning training on Spark.
Spark is an open-source cluster computing framework that provides high performance for both batch and streaming data processing. It addresses limitations of other distributed processing systems like MapReduce by providing in-memory computing capabilities and supporting a more general programming model. Spark core provides basic functionalities and serves as the foundation for higher-level modules like Spark SQL, MLlib, GraphX, and Spark Streaming. RDDs are Spark's basic abstraction for distributed datasets, allowing immutable distributed collections to be operated on in parallel. Key benefits of Spark include speed through in-memory computing, ease of use through its APIs, and a unified engine supporting multiple workloads.
Spark is an open-source cluster computing framework that uses in-memory processing to allow data sharing across jobs for faster iterative queries and interactive analytics. It uses Resilient Distributed Datasets (RDDs) that can survive failures through lineage tracking, and it supports programming in Scala, Java, and Python for batch, streaming, and machine learning workloads.
Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It extends the MapReduce model of Hadoop to efficiently use it for more types of computations, which includes interactive queries and stream processing. This slide shares some basic knowledge about Apache Spark.
Apache Spark is an open-source unified analytics engine for large-scale data processing. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. Some key components of Apache Spark include Resilient Distributed Datasets (RDDs), DataFrames, Datasets, and Spark SQL for structured data processing. Spark also supports streaming, machine learning via MLlib, and graph processing with GraphX.
Spark is a fast, general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R for distributed tasks including SQL, streaming, and machine learning. Spark improves on MapReduce by keeping data in-memory, allowing iterative algorithms to run faster than disk-based approaches. Resilient Distributed Datasets (RDDs) are Spark's fundamental data structure, acting as a fault-tolerant collection of elements that can be operated on in parallel.
An engine to process big data faster (than MapReduce), easily, and in an extremely scalable way. An open-source, parallel, in-memory processing, cluster computing framework. A solution for loading, processing, and end-to-end analysis of large-scale data. Iterative and interactive: Scala, Java, Python, R, and a command-line interface.
This document provides an overview of the Apache Spark framework. It covers Spark fundamentals including the Spark execution model using Resilient Distributed Datasets (RDDs), basic Spark programming, and common Spark libraries and use cases. Key topics include how Spark improves on MapReduce by operating in-memory and supporting general graphs through its directed acyclic graph execution model. The document also reviews Spark installation and provides examples of basic Spark programs in Scala.
This document provides an overview of Spark driven big data analytics. It begins by defining big data and its characteristics. It then discusses the challenges of traditional analytics on big data and how Apache Spark addresses these challenges. Spark improves on MapReduce by allowing distributed datasets to be kept in memory across clusters. This enables faster iterative and interactive processing. The document outlines Spark's architecture including its core components like RDDs, transformations, actions and DAG execution model. It provides examples of writing Spark applications in Java and Java 8 to perform common analytics tasks like word count.
Spark example (ShidrokhGoudarzi1)
Spark is a fast general-purpose engine for large-scale data processing. It has advantages over MapReduce such as speed, ease of use, and the ability to run everywhere. Spark supports SQL querying, streaming, machine learning, and graph processing, and offers APIs in Scala, Java, and Python. Spark applications have drivers, executors, and tasks, and work with RDDs and shared variables. The Spark shell provides an interactive way to learn the API and analyze data.
Spark is an open source cluster computing framework that allows processing of large datasets across clusters of computers using a simple programming model. It provides high-level APIs in Java, Scala, Python and R.
Typical machine learning workflows in Spark involve loading data, preprocessing, feature engineering, training models, evaluating performance, and tuning hyperparameters. Spark MLlib provides algorithms for common tasks like classification, regression, clustering and collaborative filtering.
The document provides an example of building a spam filtering application in Spark. It involves reading email data, extracting features using tokenization and hashing, training a logistic regression model, evaluating performance on test data, and tuning hyperparameters via cross validation.
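As a rough sketch of what such a pipeline can look like with the RDD-based MLlib API (the file names, feature size, and training parameters below are illustrative assumptions, not the presentation's actual code):

  import org.apache.spark.mllib.feature.HashingTF
  import org.apache.spark.mllib.regression.LabeledPoint
  import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

  // Hypothetical input files; assumes an existing SparkContext `sc`
  val spam = sc.textFile("spam.txt")
  val ham  = sc.textFile("ham.txt")

  // Tokenize each email and hash the words into a fixed-size feature vector
  val tf = new HashingTF(numFeatures = 10000)
  val spamFeatures = spam.map(email => tf.transform(email.split(" ")))
  val hamFeatures  = ham.map(email => tf.transform(email.split(" ")))

  // Label spam as 1.0 and ham as 0.0, then train a logistic regression model
  val training = spamFeatures.map(f => LabeledPoint(1.0, f))
    .union(hamFeatures.map(f => LabeledPoint(0.0, f)))
    .cache()
  val model = LogisticRegressionWithSGD.train(training, numIterations = 100)

  // Score a new message
  val test = tf.transform("urgent offer click now".split(" "))
  println(model.predict(test))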
Spark is a fast and general engine for large-scale data processing. It was designed to be fast, easy to use and supports machine learning. Spark achieves high performance by keeping data in-memory as much as possible using its Resilient Distributed Datasets (RDDs) abstraction. RDDs allow data to be partitioned across nodes and operations are performed in parallel. The Spark architecture uses a master-slave model with a driver program coordinating execution across worker nodes. Transformations operate on RDDs to produce new RDDs while actions trigger job execution and return results.
Apache Spark is an open-source distributed processing engine that is up to 100 times faster than Hadoop for processing data stored in memory and 10 times faster for data stored on disk. It provides high-level APIs in Java, Scala, Python and SQL and supports batch processing, streaming, and machine learning. Spark runs on Hadoop, Mesos, Kubernetes or standalone and can access diverse data sources using its core abstraction called resilient distributed datasets (RDDs).
Apache Spark presentation at HasGeek Fifth Elephant
https://fifthelephant.talkfunnel.com/2015/15-processing-large-data-with-apache-spark
Covering Big Data Overview, Spark Overview, Spark Internals and its supported libraries
- Apache Spark is an open-source cluster computing framework that is faster than Hadoop for batch processing and also supports real-time stream processing.
- Spark was created to be faster than Hadoop for interactive queries and iterative algorithms by keeping data in-memory when possible.
- Spark consists of Spark Core for the basic RDD API and also includes modules for SQL, streaming, machine learning, and graph processing. It can run on several cluster managers including YARN and Mesos.
This session covers how to use the PySpark interface to develop Spark applications, from loading and ingesting data to applying transformations. It covers working with different data sources, applying transformations, and Python best practices for developing Spark apps. The demo covers integrating Apache Spark apps, in-memory processing capabilities, working with notebooks, and integrating analytics tools into Spark applications.
The document provides an overview of Apache Spark, including what it is, its ecosystem, features, and architecture. Some key points:
- Apache Spark is an open-source cluster computing framework for large-scale data processing. It is up to 100x faster than Hadoop for iterative/interactive algorithms.
- Spark features include its RDD abstraction, lazy evaluation, and use of DAGs to optimize performance. It supports Scala, Java, Python, and R.
- The Spark ecosystem includes tools like Spark SQL, MLlib, GraphX, and Spark Streaming. It can run on Hadoop YARN, Mesos, or in standalone mode.
- Spark's architecture includes the SparkContext,
Spark is a framework for large-scale data processing that improves on MapReduce. It handles batch, iterative, and streaming workloads using a directed acyclic graph (DAG) model. Spark aims for generality, low latency, fault tolerance, and simplicity. It uses an in-memory computing model with Resilient Distributed Datasets (RDDs) and a driver-executor architecture. Common Spark performance issues relate to partitioning, shuffling data between stages, task placement, and load balancing. Evaluation tools include the Spark UI, Sar, iostat, and benchmarks like SparkBench and GroupBy tests.
Your data is getting bigger while your boss is getting anxious to have insights! This tutorial covers Apache Spark that makes data analytics fast to write and fast to run. Tackle big datasets quickly through a simple API in Python, and learn one programming paradigm in order to deploy interactive, batch, and streaming applications while connecting to data sources incl. HDFS, Hive, JSON, and S3.
Introduction to Data Science
1.1 What is Data Science, importance of data science,
1.2 Big data and data Science, the current Scenario,
1.3 Industry Perspective Types of Data: Structured vs. Unstructured Data,
1.4 Quantitative vs. Categorical Data,
1.5 Big Data vs. Little Data, Data science process
1.6 Role of Data Scientist
Big Data and Analytics Shaping the Future of Payments (RuchiRathor2)
The payments industry is experiencing a data-driven revolution powered by big data and analytics.
Here's a glimpse into 5 ways this dynamic duo is transforming how we pay.
In essence, big data and analytics are playing a pivotal role in building a future filled with faster, more secure, and convenient payment methods for everyone.
1. Overview of Statistical software such as ODK, surveyCTO, and CSPro
2. Software installation(for computer, and tablet or mobile devices)
3. Create a data entry application
4. Create the data dictionary
5. Create the data entry forms
6. Enter data
7. Add Edits to the Data Entry Application
8. CAPI questions and texts
Annex K RBF's The World Game pdf document (Steven McGee)
Signals & Telemetry Annex K for RBF's The World Game / Trade Federations / USPTO 13/573,002 Heart Beacon Cycle Time - Space Time Chain meters, metrics, standards. Adaptive Procedural template framework structured data derived from DoD / NATO's system of systems engineering tech framework
Data analytics is a powerful tool that can transform business decision-making across industries. Contact District 11 Solutions, which specializes in data analytics, to make informed decisions and achieve your business goals.
How AI is Revolutionizing Data Collection.pdf (PromptCloud)
Artificial Intelligence (AI) is transforming the landscape of data collection, making it more efficient, accurate, and insightful than ever before. With AI, businesses can automate the extraction of vast amounts of data from diverse sources, analyze patterns in real-time, and gain deeper insights with minimal human intervention. This revolution in data collection enables companies to make faster, data-driven decisions, enhance their competitive edge, and unlock new opportunities for growth.
AI-powered tools can handle complex and dynamic web content, adapt to changes in website structures, and even understand the context of data through natural language processing. This means that data collection is not only faster but also more precise, reducing the time and effort required for manual data extraction. Furthermore, AI can process unstructured data, such as social media posts and customer reviews, providing valuable insights into customer sentiment and market trends.
Embrace the future of data collection with AI and stay ahead of the curve. Learn more about how PromptCloud’s AI-driven web scraping solutions can transform your data strategy. https://www.promptcloud.com/contact/
Getting Started with Interactive Brokers API and Python.pdf (Riya Sen)
In the fast-paced world of finance, automation is key to staying ahead of the curve. Traders and investors are increasingly turning to programming languages like Python to streamline their strategies and enhance their decision-making processes. In this blog post, we will delve into the integration of Python with Interactive Brokers, one of the leading brokerage platforms, and explore how this dynamic duo can revolutionize your trading experience.
Towards an Analysis-Ready, Cloud-Optimised service for FAIR fusion data (Samuel Jackson)
We present our work to improve data accessibility and performance for data-intensive tasks within the fusion research community. Our primary goal is to develop services that facilitate efficient access for data-intensive applications while ensuring compliance with FAIR principles [1], as well as adoption of interoperable tools, methods and standards.
The major outcome of our work is the successful creation and deployment of a data service for the MAST (Mega Ampere Spherical Tokamak) experiment [2], leading to substantial enhancements in data discoverability, accessibility, and overall data retrieval performance, particularly in scenarios involving large-scale data access. Our work follows the principles of Analysis-Ready, Cloud Optimised (ARCO) data [3] by using cloud optimised data formats for fusion data.
Our system consists of a query-able metadata catalogue, complemented with an object storage system for publicly serving data from the MAST experiment. We will show how our solution integrates with the Pandata stack [4] to enable data analysis and processing at scales that would have previously been intractable, paving the way for data-intensive workflows running routinely with minimal pre-processing on the part of the researcher. By using a cloud-optimised file format such as zarr [5] we can enable interactive data analysis and visualisation while avoiding large data transfers. Our solution integrates with common python data analysis libraries for large, complex scientific data such as xarray [6] for complex data structures and dask [7] for parallel computation and lazily working with larger that memory datasets.
The incorporation of these technologies is vital for advancing simulation, design, and enabling emerging technologies like machine learning and foundation models, all of which rely on efficient access to extensive repositories of high-quality data. Relying on the FAIR guiding principles for data stewardship not only enhances data findability, accessibility, and reusability, but also fosters international cooperation on the interoperability of data and tools, driving fusion research into new realms and ensuring its relevance in an era characterised by advanced technologies in data science.
[1] Wilkinson, M., Dumontier, M., Aalbersberg, I. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018 (2016) https://doi.org/10.1038/sdata.2016.18
[2] M Cox, The Mega Amp Spherical Tokamak, Fusion Engineering and Design, Volume 46, Issues 2–4, 1999, Pages 397-404, ISSN 0920-3796, https://doi.org/10.1016/S0920-3796(99)00031-9
[3] Stern, Charles, et al. "Pangeo forge: crowdsourcing analysis-ready, cloud optimized data production." Frontiers in Climate 3 (2022): 782909.
[4] Bednar, James A., and Martin Durant. "The Pandata Scalable Open-Source Analysis Stack." (2023).
[5] Alistair Miles (2024) ‘zarr-developers/zarr-python: v2.17.1’. Zenodo. doi: 10.5281/zenodo.10790679
[6] Hoyer, S. & Hamman, J., (20
Harnessing Wild and Untamed (Publicly Available) Data for the Cost efficient ... (weiwchu)
We recently discovered that models trained with large-scale speech datasets sourced from the web could achieve superior accuracy and potentially lower cost than traditionally human-labeled or simulated speech datasets. We developed a customizable AI-driven data labeling system. It infers word-level transcriptions with confidence scores, enabling supervised ASR training. It also robustly generates phone-level timestamps even in the presence of transcription or recognition errors, facilitating the training of TTS models. Moreover, It automatically assigns labels such as scenario, accent, language, and topic tags to the data, enabling the selection of task-specific data for training a model tailored to that particular task. We assessed the effectiveness of the datasets by fine-tuning open-source large speech models such as Whisper and SeamlessM4T and analyzing the resulting metrics. In addition to openly-available data, our data handling system can also be tailored to provide reliable labels for proprietary data from certain vertical domains. This customization enables supervised training of domain-specific models without the need for human labelers, eliminating data breach risks and significantly reducing data labeling cost.
2. Apache Spark
• Apache Spark is a lightning-fast cluster computing technology, designed for fast computation
• It extends the Hadoop MapReduce model to efficiently support more types of computations, including interactive queries and stream processing
• Spark is not a modified version of Hadoop and is not really dependent on Hadoop, because it has its own cluster management
• Spark can use Hadoop in two ways: storage and processing. Since Spark has its own cluster management, it typically uses Hadoop for storage only
3. Apache Spark
• The main feature of Spark is its in-memory cluster computing, which increases the processing speed of an application
• Spark is designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive queries and streaming
• By supporting all these workloads in a single system, it reduces the management burden of maintaining separate tools
4. Features of Apache Spark
• Speed − Spark helps run an application on a Hadoop cluster up to 100 times faster in memory and 10 times faster when running on disk. This is achieved by reducing the number of read/write operations to disk and storing intermediate processing data in memory
• Supports multiple languages − Spark provides built-in APIs in Java, Scala, and Python, so you can write applications in different languages
• Advanced analytics − Spark supports not only ‘map’ and ‘reduce’ but also SQL queries, streaming data, machine learning (ML), and graph algorithms
5. Components of Spark
• The different components of Spark are described below
Apache Spark Core
• Spark Core is the underlying general execution engine for the Spark platform that all other functionality is built upon. It provides in-memory computing and the ability to reference datasets in external storage systems
6. Components of Spark
Spark SQL
• Spark SQL is a component on top of Spark Core that introduces a data abstraction called SchemaRDD, which provides support for structured and semi-structured data
Spark Streaming
• Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs RDD (Resilient Distributed Dataset) transformations on those mini-batches of data
MLlib (Machine Learning Library)
• MLlib is a distributed machine learning framework on top of Spark that takes advantage of Spark's distributed, memory-based architecture. Spark MLlib is reported to be about nine times as fast as the Hadoop disk-based version of Apache Mahout
GraphX
• GraphX is a distributed graph-processing framework on top of Spark. It provides an API for expressing graph computations that can model user-defined graphs using the Pregel abstraction API
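As a small, hedged sketch of the Spark SQL component against the Spark 1.x SQLContext API (the people.json file and column names are assumptions; later versions expose the same idea through DataFrames and SparkSession):

  import org.apache.spark.sql.SQLContext

  val sqlContext = new SQLContext(sc)             // assumes an existing SparkContext `sc`
  val people = sqlContext.jsonFile("people.json") // yields a SchemaRDD (a DataFrame in later versions)
  people.registerTempTable("people")

  val adults = sqlContext.sql("SELECT name, age FROM people WHERE age >= 18")
  adults.collect().foreach(println)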
7. Spark Architecture
The Spark architecture includes the following three main components:
• Data storage
• API
• Resource management
Data storage:
• Spark typically uses HDFS for data storage, but it works with any Hadoop-compatible data source, including HDFS, HBase, Cassandra, etc.
8. Spark Architecture
API:
• The API enables application developers to create Spark-based applications using a standard API interface. Spark provides APIs for the Scala, Java, and Python programming languages
Resource management:
• Spark can be deployed as a standalone server or on a distributed computing framework such as Mesos or YARN
9. Resilient Distributed Datasets
• Resilient Distributed Datasets (RDDs) are the core concept in the Spark framework
• Spark stores data in RDDs across different partitions
• RDDs help with rearranging computations and optimizing data processing
• They are also fault tolerant, because an RDD knows how to recreate and recompute its datasets
• RDDs are immutable. You can modify an RDD with a transformation, but the transformation returns a new RDD while the original RDD remains the same
10. Resilient Distributed Datasets
• RDDs provide an API for various transformations and materializations of data, as well as control over caching and partitioning of elements to optimize data placement
• An RDD can be created either from external storage or from another RDD, and it stores information about its parents so that it can recompute a partition in case of failure
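A minimal sketch of both creation paths, assuming an existing SparkContext `sc` and a hypothetical HDFS path:

  // From external storage: each line of the file becomes an element of the RDD
  val lines = sc.textFile("hdfs:///data/app.log")

  // From an in-memory collection
  val nums = sc.parallelize(1 to 1000)

  // From another RDD: the child remembers its parent (its lineage),
  // so a lost partition can be recomputed from `lines` after a failure
  val errors = lines.filter(_.contains("ERROR"))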
11. Resilient Distributed Datasets
RDDs support two types of operations:
• Transformation: transformations don't return a single value; they return a new RDD. Nothing gets evaluated when you call a transformation function; it just takes an RDD and returns a new RDD
• Some of the transformation functions are map, filter, flatMap, groupByKey, reduceByKey, aggregateByKey, pipe, and coalesce
• Action: an action evaluates and returns a value. When an action is called on an RDD, all the data processing queries are computed at that time and the resulting value is returned
• Some of the action operations are reduce, collect, count, first, take, countByKey, and foreach
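A minimal sketch of the distinction, assuming an existing SparkContext `sc`: the transformations only build up new RDDs lazily, and nothing runs until an action is called.

  val nums = sc.parallelize(1 to 10)

  // Transformations: return new RDDs, nothing is evaluated yet
  val doubled = nums.map(_ * 2)
  val bigOnes = doubled.filter(_ > 10)

  // Actions: trigger evaluation of the whole lineage and return values to the driver
  val total = bigOnes.reduce(_ + _)   // 12 + 14 + 16 + 18 + 20 = 80
  val sample = bigOnes.take(3)        // Array(12, 14, 16)
  println(s"total=$total sample=${sample.mkString(",")}")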
12. RDD Persistence
• One of the most important capabilities of Spark is persisting (or caching) a dataset in memory across operations
• When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset. This allows future actions to be much faster
• Spark’s cache is fault-tolerant – if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it
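A short caching sketch, assuming an existing SparkContext `sc` and a hypothetical log file; the first action materializes the cached partitions and later actions reuse them:

  import org.apache.spark.storage.StorageLevel

  val logs = sc.textFile("hdfs:///data/app.log")
  val errors = logs.filter(_.contains("ERROR"))

  errors.persist(StorageLevel.MEMORY_ONLY)   // errors.cache() is shorthand for this level

  println(errors.count())                                 // computes and caches the partitions
  println(errors.filter(_.contains("timeout")).count())   // reuses the cached partitions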
14. Components
• Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in your main program (called the driver program)
• The SparkContext can connect to several types of cluster managers (Spark’s own standalone cluster manager, Mesos, or YARN), which allocate resources across applications
• Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application
• Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors
• Finally, SparkContext sends tasks to the executors to run
15. Components
There are several useful things to note about this architecture:
• Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads
• The driver program must listen for and accept incoming connections from its executors throughout its lifetime. As such, the driver program must be network addressable from the worker nodes
• Because the driver schedules tasks on the cluster, it should be run close to the worker nodes, preferably on the same local area network
16. Spark Streaming
• Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams
• Data can be ingested from many sources such as Kafka, Flume, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window
• Finally, processed data can be pushed out to filesystems
17. Spark Streaming
• Spark Streaming works by dividing the live stream of data into batches (called micro-batches) of a pre-defined interval (N seconds) and then treating each batch of data as an RDD
• It is important to choose the batch interval based on your use case and data processing requirements
• If the value of N is too low, the micro-batches will not have enough data to give meaningful results during analysis
19. Spark Streaming
• Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches
• Spark Streaming provides a high-level abstraction called a discretized stream, or DStream, which represents a continuous stream of data. Internally, a DStream is represented as a sequence of RDDs
20. Discretized Streams (DStreams)
• A DStream represents a continuous stream of data, either the input data stream received from a source or the processed data stream generated by transforming the input stream
• Internally, a DStream is represented by a continuous series of RDDs, Spark’s abstraction of an immutable, distributed dataset
• Each RDD in a DStream contains data from a certain interval, as sketched below
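A minimal DStream sketch (the localhost:9999 text source and the 5-second batch interval are assumptions) that counts words in each micro-batch:

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val conf = new SparkConf().setAppName("NetworkWordCount")
  val ssc = new StreamingContext(conf, Seconds(5))      // N = 5 second micro-batches

  val lines = ssc.socketTextStream("localhost", 9999)   // assumed TCP text source
  val counts = lines.flatMap(_.split(" "))              // DStream transformations mirror RDD transformations
    .map(word => (word, 1))
    .reduceByKey(_ + _)
  counts.print()                                        // emits the counts for each batch interval

  ssc.start()            // start receiving and processing data
  ssc.awaitTermination()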
21. Spark runtime components
Figure 1: Spark runtime components in cluster deploy mode. Elements of a Spark application are in blue boxes and an application’s tasks running inside task slots are labeled with a “T”. Unoccupied task slots are in white boxes.
22. Responsibilities of the client process component
• The client process starts the driver program
• For example, the client process can be a spark-submit script for running applications, a spark-shell script, or a custom application using the Spark API
• The client process prepares the classpath and all configuration options for the Spark application
• It also passes application arguments, if any, to the application running inside the driver
23. Responsibilities of the driver component
• The driver orchestrates and monitors execution of a Spark application
• There is always one driver per Spark application
• The Spark context and scheduler inside the driver are responsible for:
• Requesting memory and CPU resources from cluster managers
• Breaking application logic into stages and tasks
• Sending tasks to executors
• Collecting the results
24. Responsibilities of the driver component
Figure 2: Spark runtime components in client deploy mode. The driver is running inside the client’s JVM process.
25. Responsibilities of the driver component
Two basic ways the driver program can be run are:
• Cluster deploy mode, depicted in figure 1. In this mode, the driver process runs as a separate JVM process inside the cluster, and the cluster manages its resources
• Client deploy mode, depicted in figure 2. In this mode, the driver runs inside the client’s JVM process and communicates with the executors managed by the cluster
26. Responsibilities of the executors
• The executors, which are JVM processes, accept tasks from the driver, execute those tasks, and return the results to the driver
• Each executor has several task slots (or CPU cores) for running tasks in parallel
• Although these task slots are often referred to as CPU cores in Spark, they are implemented as threads and don’t need to correspond to the number of physical CPU cores on the machine
27. Creation of the Spark context
• Once the driver is started, it configures an instance of SparkContext
• When running a standalone Spark application by submitting a jar file, or by using the Spark API from another program, your Spark application starts and configures the Spark context, as sketched below
• There can be only one Spark context per JVM
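A minimal sketch of a standalone application configuring its Spark context; the application name and master URL below are placeholders (the master is usually supplied by spark-submit rather than hard-coded):

  import org.apache.spark.{SparkConf, SparkContext}

  object MyApp {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf()
        .setAppName("MyApp")
        .setMaster("local[*]")   // e.g. "yarn", "mesos://...", "spark://host:7077"

      val sc = new SparkContext(conf)   // only one Spark context per JVM
      try {
        println(sc.parallelize(1 to 100).reduce(_ + _))
      } finally {
        sc.stop()
      }
    }
  }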
28. High-level architecture
• Spark provides a well-defined and layered architecture where all its layers and components are loosely coupled, and integration with external components/libraries/extensions is performed using well-defined contracts
29. High-level architecture
• Physical machines: This layer represents the physical or virtual machines/nodes on which Spark jobs are executed. These nodes collectively represent the total capacity of the cluster with respect to CPU, memory, and data storage.
• Data storage layer: This layer provides the APIs to store and retrieve data from the persistent storage area for Spark jobs/applications. It is used by Spark workers to dump data to persistent storage whenever the cluster memory is not sufficient to hold the data. Spark is extensible and capable of using any kind of filesystem. RDDs, which hold the data, are agnostic to the underlying storage layer and can persist data in various persistent storage areas, such as local filesystems, HDFS, or any other NoSQL database such as HBase, Cassandra, MongoDB, S3, and Elasticsearch.
• Resource manager: The architecture of Spark abstracts out the deployment of the Spark framework and its associated applications. Spark applications can leverage cluster managers such as YARN and Mesos for the allocation and deallocation of various physical resources, such as CPU and memory, for the client jobs. The resource manager layer provides the APIs used to request the allocation and deallocation of available resources across the cluster.
• Spark core libraries: The Spark core library represents the Spark Core engine, which is responsible for the execution of Spark jobs. It contains APIs for in-memory distributed data processing and a generalized execution model that supports a wide variety of applications and languages.
• Spark extensions/libraries: This layer represents the additional frameworks/APIs/libraries developed by extending the Spark core APIs to support different use cases. For example, Spark SQL is one such extension, developed to perform ad hoc queries and interactive analysis over large datasets.
31. Spark execution model – master worker view
• Spark is built around the concepts of Resilient Distributed Datasets and a Directed Acyclic Graph (DAG) representing the transformations and dependencies between them
32. Spark execution model – master worker view
• At a high level, a Spark application (often referred to as the Driver Program or Application Master) consists of the SparkContext and user code that interacts with it, creating RDDs and performing a series of transformations to achieve the final result
• These RDD transformations are then translated into a DAG and submitted to the Scheduler to be executed on a set of worker nodes
33. Execution workflow
• User code containing RDD transformations forms a Directed Acyclic Graph, which is then split into stages of tasks by the DAGScheduler, as sketched below
• Tasks run on workers and the results are then returned to the client
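A minimal sketch (assuming an existing SparkContext `sc` and a hypothetical input path) of a job whose DAG the DAGScheduler splits into two stages at the shuffle introduced by reduceByKey:

  // Stage 1: narrow transformations (flatMap, map) are pipelined together
  val pairs = sc.textFile("hdfs:///data/words.txt")
    .flatMap(_.split("\\s+"))
    .map(word => (word, 1))

  // reduceByKey requires a shuffle, so the DAG is cut here
  val counts = pairs.reduceByKey(_ + _)   // Stage 2 runs after the shuffle

  // The action makes the DAGScheduler build and submit the stages,
  // and the TaskScheduler launches their tasks on the executors
  counts.collect().foreach(println)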
35. Execution workflow
• SparkContext
• represents the connection to a Spark cluster, and can be used to create RDDs, accumulators, and broadcast variables on that cluster
• DAGScheduler
• computes a DAG of stages for each job and submits them to the TaskScheduler
• determines preferred locations for tasks (based on cache status or shuffle file locations) and finds a minimum schedule to run the jobs
• TaskScheduler
• responsible for sending tasks to the cluster, running them, retrying if there are failures, and mitigating stragglers
• SchedulerBackend
• backend interface for scheduling systems that allows plugging in different implementations (Mesos, YARN, Standalone, local)
• BlockManager
• provides interfaces for putting and retrieving blocks both locally and remotely into various stores (memory, disk, and off-heap)