APACHE SPARK
✓ Need for Spark
✓ Introduction to Apache Spark
✓ Spark features
✓ Spark architecture
✓ What are RDDs
✓ Transformations & Actions
✓ Spark execution model
✓ Spark ecosystem
Why Spark?
Need for a general-purpose cluster computing system,
because existing tools each target a single workload:
➢MapReduce is limited to batch processing
➢Storm is limited to real-time stream processing
➢Impala/Tez are limited to interactive processing
➢Neo4j/Giraph are limited to graph processing
Need for Spark
• Need for a powerful engine that can process
data in real time (streaming) as well as in
batch mode
• Need for a powerful engine that can respond in
sub-second time and perform in-memory analytics
• Apache Spark is a powerful open source engine
that provides real-time (stream), interactive,
graph, in-memory, and batch processing
with speed, ease of use, and sophisticated
analytics.
What is Apache Spark
A lightning-fast, general-purpose cluster
computing system
Introduction to Apache Spark
➢Apache Spark is a lightning-fast cluster computing
tool
➢General-purpose distributed system
➢Up to 100 times faster than MapReduce for in-memory workloads
➢Written in Scala
➢Provides APIs in Scala, Java, and Python
➢Integrates with Hadoop and can process existing
Hadoop data
History
• Started at UC Berkeley's AMPLab in 2009
• Open sourced in 2010
• Donated to the Apache Software Foundation in 2013; became a top-level
project in 2014
• Became the most active project at Apache in 2015
Sort Record (figure)
Apache Spark features
• Speed
• Ease of use
• Low latency
• Integration with Hadoop
• Rich set of operators
• Fault tolerant
• Generalized execution model
Spark Architecture
• Works in a master/slave fashion
– Master node
– Slave nodes
Spark Nodes
Master node
• Manager node
• Assigns work to the slave nodes
• Handles management, monitoring, and maintenance of
the slaves, and keeps track of their work
• Master daemon: runs on the master node
Slave Nodes
• Worker nodes
• Do the work assigned by the master
• Slave daemon: runs on every slave node
Basic Spark Architecture
• The user develops the work/application
• The work is submitted to the master
• The master divides the work
• and submits the pieces to the nodes of the cluster
• Each slave performs its part of the work
– In this manner Spark achieves distributed
computing and parallel processing
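The flow above can be sketched in plain Python. This is a toy model, not Spark's API: the names split_work, worker, and master are hypothetical, and a local thread pool stands in for the slave nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def split_work(items, n_parts):
    """Master side: divide the work into n_parts roughly equal chunks."""
    return [items[i::n_parts] for i in range(n_parts)]


def worker(chunk):
    """Slave side: do the sub-work for one chunk (here: a sum of squares)."""
    return sum(x * x for x in chunk)


def master(items, n_workers=4):
    """Divide the work, hand each chunk to a worker, combine the partial results."""
    chunks = split_work(items, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(worker, chunks))
    return sum(partials)


total = master(list(range(10)))
print(total)
```

The caller only talks to master; how many workers run and which chunk each one gets stays hidden, which is the point of the master/slave split.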
Resilient Distributed Dataset
• The basic core abstraction in Spark
– Resilient – if data is lost it is recreated
automatically (fault tolerant)
– Distributed – data is stored and processed across the nodes of the cluster
– Dataset – data can come from different data stores
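A minimal sketch of the "resilient" idea, assuming that lost data is rebuilt by replaying the steps that produced it. ToyDataset and its methods are hypothetical names for illustration, not Spark classes.

```python
class ToyDataset:
    """Toy stand-in for an RDD: it remembers how it was built (its lineage)."""

    def __init__(self, source, steps=()):
        self.source = source   # callable that re-reads the original data
        self.steps = steps     # functions applied, in order, to each element
        self._cache = None     # materialized result; may be lost

    def map(self, fn):
        # Immutable style: return a new dataset with a longer lineage.
        return ToyDataset(self.source, self.steps + (fn,))

    def compute(self):
        if self._cache is None:   # nothing computed yet, or the data was lost
            data = self.source()
            for fn in self.steps:
                data = [fn(x) for x in data]
            self._cache = data
        return self._cache

    def lose_data(self):
        """Simulate a node failure that drops the computed data."""
        self._cache = None


ds = ToyDataset(lambda: [1, 2, 3]).map(lambda x: x * 10).map(lambda x: x + 1)
first = ds.compute()
ds.lose_data()
second = ds.compute()   # rebuilt by replaying the lineage, not from a backup copy
print(first == second)
```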
• An RDD is a simple, immutable collection of
objects
• An RDD can contain any type of Scala, Java,
Python, or R object
• Each RDD is split into partitions,
which may be computed on different nodes of
the cluster
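Partitioning can be illustrated with a short plain-Python sketch. The partition function is a hypothetical helper, and contiguous slicing is only one possible scheme, chosen here for simplicity.

```python
def partition(items, num_partitions):
    """Split a collection into num_partitions contiguous slices."""
    size, extra = divmod(len(items), num_partitions)
    parts, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions take one additional element.
        end = start + size + (1 if i < extra else 0)
        parts.append(tuple(items[start:end]))   # tuples keep each slice read-only
        start = end
    return parts


parts = partition(list(range(10)), 3)
per_partition = [sum(p) for p in parts]   # each slice could be summed on a different node
print(parts)
print(per_partition, sum(per_partition))
```

Because each slice is processed independently, the per-partition results can be computed anywhere and combined at the end.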
What is an RDD?
• RDDs are the fundamental unit of data in Spark
• Core Spark abstraction
• Enable parallel processing on a dataset
• Immutable, recomputable, fault tolerant
• In Spark programs we perform
operations on RDDs
• Transformations and actions are used to process
RDDs
RDD operations
• Two types of operations
▪ Transformation
- Creates a new RDD from an existing one
- E.g.: map, filter, flatMap, join, etc.
▪ Action
- Returns a result or writes it to storage
- E.g.: count, collect, saveAsTextFile, etc.
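To show what operations such as map, filter, flatMap, count, and collect compute, here are rough plain-Python analogues on an ordinary list. This is an illustration only: these calls run eagerly on one machine, and the sample data is made up.

```python
from itertools import chain

lines = ["spark is fast", "spark is general"]

# Transformation analogues: each builds a new collection; `lines` is not modified.
upper = list(map(str.upper, lines))                            # like map
with_fast = list(filter(lambda s: "fast" in s, lines))         # like filter
words = list(chain.from_iterable(s.split() for s in lines))    # like flatMap

# Action analogues: each returns a plain value to the caller.
word_count = len(words)          # like count
distinct = sorted(set(words))    # like collect, after dropping duplicates

print(upper, with_fast, word_count, distinct)
```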
• Lazy evaluation
– transformations are only recorded; execution does not
start until an action is triggered
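Lazy evaluation can be sketched with a toy class that records transformations and runs them only when an action is called. LazyDataset is a hypothetical name, not Spark's RDD class, and the calls list exists only to show when the work happens.

```python
class LazyDataset:
    """Toy model of lazy evaluation."""

    def __init__(self, data, ops=()):
        self._data = data
        self._ops = ops   # recorded transformations, not yet executed

    def map(self, fn):
        return LazyDataset(self._data, self._ops + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._data, self._ops + (("filter", pred),))

    def collect(self):
        """Action: only now are the recorded transformations executed."""
        result = list(self._data)
        for kind, fn in self._ops:
            if kind == "map":
                result = [fn(x) for x in result]
            else:
                result = [x for x in result if fn(x)]
        return result


calls = []


def double(x):
    calls.append(x)   # record that the function actually ran
    return x * 2


ds = LazyDataset([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)   # nothing has executed yet
result = ds.collect()
calls_after_action = len(calls)    # the action triggered the work
print(calls_before_action, result, calls_after_action)
```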
Spark context
• The Spark context is an object
• Every Spark application requires a Spark context
• Main entry point of a Spark application
• Interacts with the cluster manager
• Tells Spark how to access the cluster
• RDDs are created using the Spark context
Spark execution model
• The developer writes the application/program
• It needs the Spark context object, the main
entry point of a Spark application, which
interacts with the cluster manager
• Data nodes are the slaves of HDFS
• Worker nodes are the slaves of Spark
• The cluster manager interacts with the worker
nodes and obtains the resources
• The executor is the distributed agent responsible
for the execution of tasks
The driver program
• The driver program runs the main() function
of the application and is the place where the
Spark context is created
• The driver program, which runs on the master
node of the Spark cluster, schedules the job
execution and negotiates with the cluster
manager
Executor
• An executor is a distributed agent responsible for
the execution of tasks
• Every Spark application has its own executor
processes
• Executors perform all the data processing
• They read data from and write data to external
sources
• They store computation results
in memory (cache) or on disk
• They interact with the storage systems
Cluster manager
• An external service responsible for acquiring
resources on the Spark cluster and allocating
them to a Spark job (for example Spark standalone,
Hadoop YARN, Apache Mesos, or Kubernetes)
Spark ecosystem
Spark Core
• The main Spark engine
• The kernel of Spark
• It is in charge of essential I/O functionality
Spark SQL
• Enables users to run SQL queries
• Can handle structured or semi-structured data
• One of the most popular SQL engines in big data
Spark Streaming
• Handles live streams with low latency
• Supports interactive and analytical
applications on streaming data
• Can process near real-time data from multiple
sources
• Internally converts the streams into micro-batches,
processes them in the cluster, and pushes the results to
data stores
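The micro-batch idea can be sketched in plain Python. This is a simplification under stated assumptions: batches are cut by element count here, whereas Spark Streaming cuts them by time interval, and micro_batches, process, and store are hypothetical names.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Group an incoming stream of events into small batches."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def process(batch):
    """Stand-in for the cluster job that runs on each micro-batch."""
    return sum(batch)


store = []   # stand-in for the external data store
for batch in micro_batches(range(1, 8), batch_size=3):
    store.append(process(batch))

print(store)
```

Each event waits until its batch is complete before being processed, which is why this model gives near real-time rather than per-event results.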
MLlib
• Scalable machine learning library
• Used for advanced analytics
GraphX
• Enables users to process graph data
• We can represent our data as a graph
• E.g.:
– In LinkedIn, degrees of connection (1st-degree and 2nd-degree
connections)
– In Facebook, friends of friends
Such requirements can be handled efficiently by the
graph engine
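The friends-of-friends query can be sketched on a small in-memory graph. This is plain Python, not the GraphX API; second_degree is a hypothetical helper and the names in the friends dictionary are made-up sample data.

```python
def second_degree(graph, person):
    """Return people exactly two hops away who are not already direct friends."""
    direct = graph.get(person, set())
    result = set()
    for friend in direct:
        result |= graph.get(friend, set())
    return result - direct - {person}


# Adjacency sets: each person maps to the set of their direct friends.
friends = {
    "ana": {"bo", "cy"},
    "bo": {"ana", "dee"},
    "cy": {"ana", "dee", "eli"},
    "dee": {"bo", "cy"},
    "eli": {"cy"},
}

print(sorted(second_degree(friends, "ana")))
```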
Storage system
• Spark depends on third-party storage
systems, such as:
– HDFS
– HBase
– Cassandra
– Amazon S3, and so on
Use cases (figure)
Companies using Spark (figure)
Disadvantages
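• No file management system of its own (relies on external storage such as HDFS)
• Expensive: in-memory processing needs a lot of RAM
• Near real-time rather than true real-time processing (streams are handled as micro-batches)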
An Introduction to Apache Spark
