Apache Spark is a cluster computing technology designed for fast computation. It extends Hadoop's MapReduce model to efficiently support more types of computation, including interactive queries and stream processing. This slide deck shares some basic knowledge about Apache Spark.
Spark is an open source cluster computing framework that allows processing of large datasets across clusters of computers using a simple programming model. It provides high-level APIs in Scala, Java, Python and R for building parallel applications. Spark features include in-memory computing, lazy evaluation, and support for streaming, SQL, machine learning and graph processing. The core of Spark is the Resilient Distributed Dataset (RDD), which allows data to be partitioned across the nodes of a cluster and supports parallel operations.
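As a rough sketch of that core idea (hypothetical PySpark code, not taken from any deck listed here): a local collection is partitioned across the cluster and transformed in parallel.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-basics")

# Partition a local collection into 4 partitions; each is processed by a task.
numbers = sc.parallelize(range(1, 11), numSlices=4)

squares = numbers.map(lambda x: x * x)        # transformation
evens = squares.filter(lambda x: x % 2 == 0)  # transformation

print(evens.collect())  # action: [4, 16, 36, 64, 100]
sc.stop()
```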
This document is a presentation on Apache Spark that compares its performance to MapReduce. It discusses how Spark is faster than MapReduce, provides code examples of performing word counts in both Spark and MapReduce, and explains features that make Spark suitable for big data analytics, such as simplifying data analysis, providing built-in machine learning and graph libraries, and supporting multiple languages. It also lists many large companies that use Spark for applications like recommendations, business intelligence, and fraud detection.
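The deck's own word-count listings are not reproduced here; a typical PySpark word count looks roughly like the following (the input path is a placeholder):

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "word-count")

counts = (sc.textFile("input.txt")              # placeholder path
            .flatMap(lambda line: line.split()) # split lines into words
            .map(lambda word: (word, 1))        # pair each word with 1
            .reduceByKey(lambda a, b: a + b))   # sum counts per word

for word, n in counts.collect():
    print(word, n)
sc.stop()
```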
The document discusses Apache Spark, an open source cluster computing framework for real-time data processing. It notes that Spark is up to 100 times faster than Hadoop for in-memory processing and 10 times faster on disk. The main feature of Spark is its in-memory cluster computing capability, which increases processing speeds. Spark runs on a driver-executor model and uses resilient distributed datasets and directed acyclic graphs to process data in parallel across a cluster.
Transitioning Compute Models: Hadoop MapReduce to Spark (Slim Baltagi)
This presentation is an analysis of the observed trends in the transition from the Hadoop ecosystem to the Spark ecosystem. The related talk took place at the Chicago Hadoop User Group (CHUG) meetup held on February 12, 2015.
This document provides an overview of Hadoop and its uses. It defines Hadoop as a distributed processing framework for large datasets across clusters of commodity hardware. It describes HDFS for distributed storage and MapReduce as a programming model for distributed computations. Several examples of Hadoop applications are given like log analysis, web indexing, and machine learning. In summary, Hadoop is a scalable platform for distributed storage and processing of big data across clusters of servers.
Using pySpark with Google Colab & Spark 3.0 preview (Mario Cartia)
Spark is an open source engine for large-scale data processing. It provides APIs in multiple languages including Python. Spark includes libraries for SQL, streaming, machine learning and runs on single machines or clusters. The document discusses Spark's architecture, RDD API, transformations and actions, shuffle operations, persistence, and structured APIs like DataFrames and SQL.
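As a rough illustration of the structured APIs mentioned above (a sketch with invented column names, not code from the talk), the same query can be expressed with the DataFrame API or with SQL:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("structured-apis").getOrCreate()

# A tiny in-memory DataFrame; the data and column names are made up.
df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"],
)

df.filter(df.age > 30).show()  # DataFrame API

df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()  # same query in SQL

spark.stop()
```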
Spark-on-YARN: Empower Spark Applications on Hadoop Cluster (DataWorks Summit)
This document discusses Apache Spark-on-YARN, which allows Spark applications to leverage existing Hadoop clusters. Spark improves efficiency over Hadoop via in-memory computing and supports rich APIs. Spark-on-YARN provides access to HDFS data and resources on Hadoop clusters without extra deployment costs. It supports running Spark jobs in YARN cluster and client modes. The document describes Yahoo's use of Spark-on-YARN for machine learning applications on large datasets.
Apache Spark is a fast distributed data processing engine that runs in memory. It can be used with Java, Scala, Python and R. Spark uses resilient distributed datasets (RDDs) as its main data structure. RDDs are immutable and partitioned collections of elements that allow transformations like map and filter. Spark is 10-100x faster than Hadoop for iterative algorithms and can be used for tasks like ETL, machine learning, and streaming.
Apache Spark Architecture (Big Data and Analytics) (Jyotasana Bharti)
A presentation on Apache Spark architecture, covering its features, how it works, its applications, and more.
Introduction
Features
Understanding Apache Spark Architecture
Working of Apache Spark Architecture
Applications
Conclusion
References
In this era of ever-growing data, the need to analyze it for meaningful business insights becomes more and more significant. There are different Big Data processing alternatives like Hadoop, Spark, and Storm. Spark, however, is unique in providing batch as well as streaming capabilities, making it a preferred choice for lightning-fast Big Data analysis platforms.
This presentation on Spark architecture will give an idea of what Apache Spark is, the essential features of Spark, and the different Spark components. Here, you will learn about Spark Core, Spark SQL, Spark Streaming, Spark MLlib, and GraphX. You will understand how Spark processes an application and runs it on a cluster with the help of its architecture. Finally, you will perform a demo on Apache Spark. So, let's get started with Apache Spark architecture.
YouTube Video: https://www.youtube.com/watch?v=CF5Ewk0GxiQ
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
Simplilearn’s Apache Spark and Scala certification training is designed to:
1. Advance your expertise in the Big Data Hadoop Ecosystem
2. Help you master essential Apache Spark skills, such as Spark Streaming, Spark SQL, machine learning programming, GraphX programming, and Spark shell scripting
3. Help you land a Hadoop developer job requiring Apache Spark expertise by giving you a real-life industry project coupled with 30 demos
What skills will you learn?
By completing this Apache Spark and Scala course you will be able to:
1. Understand the limitations of MapReduce and the role of Spark in overcoming these limitations
2. Understand the fundamentals of the Scala programming language and its features
3. Explain and master the process of installing Spark as a standalone cluster
4. Develop expertise in using Resilient Distributed Datasets (RDD) for creating applications in Spark
5. Master Structured Query Language (SQL) using SparkSQL
6. Gain a thorough understanding of Spark streaming features
7. Master and describe the features of Spark ML programming and GraphX programming
Who should take this Scala course?
1. Professionals aspiring for a career in the field of real-time big data analytics
2. Analytics professionals
3. Research professionals
4. IT developers and testers
5. Data scientists
6. BI and reporting professionals
7. Students who wish to gain a thorough understanding of Apache Spark
Learn more at https://www.simplilearn.com/big-data-and-analytics/apache-spark-scala-certification-training
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ... (Simplilearn)
This presentation about Apache Spark covers all the basics that a beginner needs to know to get started with Spark. It covers the history of Apache Spark, what is Spark, the difference between Hadoop and Spark. You will learn the different components in Spark, and how Spark works with the help of architecture. You will understand the different cluster managers on which Spark can run. Finally, you will see the various applications of Spark and a use case on Conviva. Now, let's get started with what is Apache Spark.
Below topics are explained in this Spark presentation:
1. History of Spark
2. What is Spark
3. Hadoop vs Spark
4. Components of Apache Spark
5. Spark architecture
6. Applications of Spark
7. Spark use case
This document provides an overview of Apache Spark, including:
- Spark is an open source cluster computing framework built for speed and ease of use. It can access data from HDFS and other sources.
- Key features include simplicity, speed (both in memory and disk-based), streaming, machine learning, and support for multiple languages.
- Spark's architecture includes its core engine and additional modules for SQL, streaming, machine learning, graphs, and R integration. It can run on standalone, YARN, or Mesos clusters.
- Example uses of Spark include ETL; online data enrichment, fraud detection, and recommender systems using streaming; and customer segmentation using machine learning.
Workshop - How to Build Recommendation Engine using Spark 1.6 and HDP
Hands-on - Build a data analytics application using Spark, Hortonworks, and Zeppelin. The session explains RDD concepts, DataFrames, and sqlContext, shows how to use SparkSQL to work with DataFrames, and explores the graphical abilities of Zeppelin.
b) Follow along - Build a recommendation engine. This will show how to build a predictive analytics (MLlib) recommendation engine with scoring, giving a better understanding of architecture and coding in Spark for ML.
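The workshop's notebooks are not included here; a minimal collaborative-filtering sketch using the DataFrame-based ALS API of more recent Spark versions (the workshop itself targets Spark 1.6) might look like this, with made-up ratings:

```python
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("recommender-sketch").getOrCreate()

# Toy (user, item, rating) triples; a real job would load these from storage.
ratings = spark.createDataFrame(
    [(0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 4.0), (2, 0, 5.0)],
    ["userId", "itemId", "rating"],
)

# Parameter values are arbitrary, for illustration only.
als = ALS(userCol="userId", itemCol="itemId", ratingCol="rating",
          rank=5, maxIter=10, coldStartStrategy="drop")
model = als.fit(ratings)

# Scoring: top 2 item recommendations per user.
model.recommendForAllUsers(2).show(truncate=False)

spark.stop()
```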
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra... (Edureka!)
This Edureka "What is Spark" tutorial will introduce you to the big data analytics framework Apache Spark. This tutorial is ideal both for beginners and for professionals who want to learn or brush up on their Apache Spark concepts. Below are the topics covered in this tutorial:
1) Big Data Analytics
2) What is Apache Spark?
3) Why Apache Spark?
4) Using Spark with Hadoop
5) Apache Spark Features
6) Apache Spark Architecture
7) Apache Spark Ecosystem - Spark Core, Spark Streaming, Spark MLlib, Spark SQL, GraphX
8) Demo: Analyze Flight Data Using Apache Spark
This document discusses Apache Spark, an open-source cluster computing framework for big data processing. It provides an overview of Spark, how it fits into the Hadoop ecosystem, why it is useful for big data analytics, and hands-on analysis of data using Spark. Key features that make Spark suitable for big data analytics include simplifying data analysis, built-in machine learning and graph processing libraries, support for multiple programming languages, and faster performance than Hadoop MapReduce.
Learning spark ch01 - Introduction to Data Analysis with Spark (phanleson)
References to Spark Course
Course : Introduction to Big Data with Apache Spark : http://ouo.io/Mqc8L5
Course : Spark Fundamentals I : http://ouo.io/eiuoV
Course : Functional Programming Principles in Scala : http://ouo.io/rh4vv
This document provides an overview of Apache Spark, including its history, features, architecture and use cases. Spark started in 2009 at UC Berkeley and later became an Apache Software Foundation project. It provides faster processing than Hadoop by keeping data in memory. Spark supports batch, streaming and interactive processing on large datasets using its core abstraction called resilient distributed datasets (RDDs).
Apache Spark is a fast, general-purpose cluster computing system that allows processing of large datasets in parallel across clusters. It can be used for batch processing, streaming, and interactive queries. Spark improves on Hadoop MapReduce by using an in-memory computing model that is faster than disk-based approaches. It includes APIs for Java, Scala, Python and supports machine learning algorithms, SQL queries, streaming, and graph processing.
Spark is an open-source cluster computing framework that uses in-memory processing to allow data sharing across jobs for faster iterative queries and interactive analytics. It uses Resilient Distributed Datasets (RDDs), which can survive failures through lineage tracking, and supports programming in Scala, Java, and Python for batch, streaming, and machine learning workloads.
Spark is an open-source cluster computing framework that provides high performance for both batch and streaming data processing. It addresses limitations of other distributed processing systems like MapReduce by providing in-memory computing capabilities and supporting a more general programming model. Spark core provides basic functionalities and serves as the foundation for higher-level modules like Spark SQL, MLlib, GraphX, and Spark Streaming. RDDs are Spark's basic abstraction for distributed datasets, allowing immutable distributed collections to be operated on in parallel. Key benefits of Spark include speed through in-memory computing, ease of use through its APIs, and a unified engine supporting multiple workloads.
This document provides an overview of Spark driven big data analytics. It begins by defining big data and its characteristics. It then discusses the challenges of traditional analytics on big data and how Apache Spark addresses these challenges. Spark improves on MapReduce by allowing distributed datasets to be kept in memory across clusters. This enables faster iterative and interactive processing. The document outlines Spark's architecture including its core components like RDDs, transformations, actions and DAG execution model. It provides examples of writing Spark applications in Java and Java 8 to perform common analytics tasks like word count.
This document provides an overview of the Apache Spark framework. It covers Spark fundamentals including the Spark execution model using Resilient Distributed Datasets (RDDs), basic Spark programming, and common Spark libraries and use cases. Key topics include how Spark improves on MapReduce by operating in-memory and supporting general graphs through its directed acyclic graph execution model. The document also reviews Spark installation and provides examples of basic Spark programs in Scala.
Spark is an open source cluster computing framework that allows processing of large datasets across clusters of computers using a simple programming model. It provides high-level APIs in Java, Scala, Python and R.
Typical machine learning workflows in Spark involve loading data, preprocessing, feature engineering, training models, evaluating performance, and tuning hyperparameters. Spark MLlib provides algorithms for common tasks like classification, regression, clustering and collaborative filtering.
The document provides an example of building a spam filtering application in Spark. It involves reading email data, extracting features using tokenization and hashing, training a logistic regression model, evaluating performance on test data, and tuning hyperparameters via cross validation.
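The document's own listing is not reproduced; a self-contained sketch of such a pipeline with toy data (all strings and parameter values invented for illustration) could look like this:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

spark = SparkSession.builder.appName("spam-filter-sketch").getOrCreate()

# Toy labelled emails (label 1.0 = spam); real data would come from storage.
train = spark.createDataFrame(
    [("win money now", 1.0), ("meeting at noon", 0.0),
     ("cheap pills online", 1.0), ("project status update", 0.0),
     ("free prize claim now", 1.0), ("lunch tomorrow?", 0.0),
     ("limited offer click here", 1.0), ("notes from the call", 0.0)],
    ["text", "label"],
)

# Feature extraction via tokenization and hashing, then logistic regression.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashing_tf = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[tokenizer, hashing_tf, lr])

# Hyperparameter tuning via cross-validation, as the description outlines.
grid = (ParamGridBuilder()
        .addGrid(hashing_tf.numFeatures, [1024, 4096])
        .addGrid(lr.regParam, [0.01, 0.1])
        .build())
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(), numFolds=2)
model = cv.fit(train)

model.transform(train).select("text", "prediction").show(truncate=False)
spark.stop()
```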
Spark is a framework for large-scale data processing that improves on MapReduce. It handles batch, iterative, and streaming workloads using a directed acyclic graph (DAG) model. Spark aims for generality, low latency, fault tolerance, and simplicity. It uses an in-memory computing model with Resilient Distributed Datasets (RDDs) and a driver-executor architecture. Common Spark performance issues relate to partitioning, shuffling data between stages, task placement, and load balancing. Evaluation tools include the Spark UI, Sar, iostat, and benchmarks like SparkBench and GroupBy tests.
This document provides an overview and introduction to Spark, including:
- Spark is a general purpose computational framework that provides more flexibility than MapReduce while retaining properties like scalability and fault tolerance.
- Spark concepts include resilient distributed datasets (RDDs), transformations that create new RDDs lazily, and actions that run computations and return values to materialize RDDs.
- Spark can run on standalone clusters or as part of Cloudera's Enterprise Data Hub, and examples of its use include machine learning, streaming, and SQL queries.
An engine to process big data in a faster (than MapReduce), easier, and extremely scalable way. An open-source, parallel, in-memory-processing, cluster computing framework. A solution for loading, processing, and end-to-end analysis of large-scale data. Iterative and interactive: Scala, Java, Python, R, and a command-line interface.
This session covers how to work with the PySpark interface to develop Spark applications, from loading and ingesting data to applying transformations. The session covers working with different data sources, applying transformations, and Python best practices for developing Spark apps. The demo covers integrating Apache Spark apps, in-memory processing capabilities, working with notebooks, and integrating analytics tools into Spark applications.
This document provides an overview of Apache Spark, including:
- Spark allows for fast iterative processing by keeping data in memory across parallel jobs for faster sharing than MapReduce.
- The core of Spark is the resilient distributed dataset (RDD) which allows parallel operations on distributed data.
- Spark comes with libraries for SQL queries, streaming, machine learning, and graph processing.
- Apache Spark is an open-source cluster computing framework that is faster than Hadoop for batch processing and also supports real-time stream processing.
- Spark was created to be faster than Hadoop for interactive queries and iterative algorithms by keeping data in-memory when possible.
- Spark consists of Spark Core for the basic RDD API and also includes modules for SQL, streaming, machine learning, and graph processing. It can run on several cluster managers including YARN and Mesos.
Spark is a fast and general cluster computing system that improves on MapReduce by keeping data in-memory between jobs. It was developed in 2009 at UC Berkeley and open sourced in 2010. Spark core provides in-memory computing capabilities and a programming model that allows users to write programs as transformations on distributed datasets.
Tuning and Monitoring Deep Learning on Apache Spark (Databricks)
Deep Learning on Apache Spark has the potential for huge impact in research and industry. This talk will describe best practices for building deep learning pipelines with Spark.
Rather than comparing deep learning systems or specific optimizations, this talk will focus on issues that are common to many deep learning frameworks when running on a Spark cluster: optimizing cluster setup and data ingest, tuning the cluster, and monitoring long-running jobs. We will demonstrate the techniques we cover using Google’s popular TensorFlow library.
More specifically, we will cover typical issues users encounter when integrating deep learning libraries with Spark clusters. Clusters can be configured to avoid task conflicts on GPUs and to allow using multiple GPUs per worker. Setting up pipelines for efficient data ingest improves job throughput. Interactive monitoring facilitates both the work of configuration and checking the stability of deep learning jobs.
Speaker: Tim Hunter
This talk was originally presented at Spark Summit East 2017.
Putting the Spark into Functional Fashion Tech Analytics (Gareth Rogers)
Metail uses Apache Spark and a functional programming approach to process and analyze data from its fashion recommendation application. It collects data through various pipelines to understand user journeys and optimize business processes like photography. Metail's data pipeline is influenced by functional paradigms like immutability and uses Spark on AWS to operate on datasets in a distributed, scalable manner. The presentation demonstrated Metail's use of Clojure, Spark, and AWS services to build a functional data pipeline for analytics purposes.
Productionizing Spark and the Spark Job Server (Evan Chan)
You won't find this in many places - an overview of deploying, configuring, and running Apache Spark, including Mesos vs YARN vs Standalone clustering modes, useful config tuning parameters, and other tips from years of using Spark in production. Also, learn about the Spark Job Server and how it can help your organization deploy Spark as a RESTful service, track Spark jobs, and enable fast queries (including SQL!) of cached RDDs.
Spark is a fast, general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R for distributed tasks including SQL, streaming, and machine learning. Spark improves on MapReduce by keeping data in-memory, allowing iterative algorithms to run faster than disk-based approaches. Resilient Distributed Datasets (RDDs) are Spark's fundamental data structure, acting as a fault-tolerant collection of elements that can be operated on in parallel.
Productionizing Spark and the REST Job Server - Evan Chan (Spark Summit)
The document discusses productionizing Apache Spark and using the Spark REST Job Server. It provides an overview of Spark deployment options like YARN, Mesos, and Spark Standalone mode. It also covers Spark configuration topics like jars management, classpath configuration, and tuning garbage collection. The document then discusses running Spark applications in a cluster using tools like spark-submit and the Spark Job Server. It highlights features of the Spark Job Server like enabling low-latency Spark queries and sharing cached RDDs across jobs. Finally, it provides examples of using the Spark Job Server in production environments.
This document discusses Spark Streaming and its use for near real-time ETL. It provides an overview of Spark Streaming, how it works internally using receivers and workers to process streaming data, and an example use case of building a recommender system to find matches using both batch and streaming data. Key points covered include the streaming execution model, handling data receipt and job scheduling, and potential issues around data loss and (de)serialization.
2. ✓ Need for Spark
✓ Introduction to Apache Spark
✓ Spark features
✓ Spark architecture
✓ What are RDDs?
✓ Transformations & Actions
✓ Spark execution model
✓ Spark ecosystem
3. Why Spark?
Need for a general-purpose cluster computing system, as:
➢ MapReduce is limited to batch processing
➢ Storm is limited to real-time stream processing
➢ Impala/Tez are limited to interactive processing
➢ Neo4j/Giraph are limited to graph processing
4. Need for Spark
• Need for a powerful engine that can process data in real time (streaming) as well as in batch mode
• Need for a powerful engine that can respond in sub-seconds and perform in-memory analytics
• Apache Spark is a powerful open-source engine that provides real-time (stream), interactive, graph, in-memory, and batch processing with speed, ease of use, and sophisticated analytics.
5. What is Apache Spark?
A lightning-fast, general-purpose cluster computing system.
6. Introduction to Apache Spark
➢ Apache Spark is a lightning-fast cluster computing tool
➢ A general-purpose distributed system
➢ Up to 100 times faster than MapReduce for in-memory workloads
➢ Written in Scala
➢ Provides APIs in Scala, Java, and Python
➢ Integrates with Hadoop and can process existing Hadoop data
7. History
• Started at UC Berkeley in 2009
• Open sourced in 2010
• Donated to the Apache Software Foundation in 2013; became a top-level project in 2014
• Became the most active project at Apache in 2015
9. Apache Spark features
• Speed
• Ease of use
• Low latency
• Integration with Hadoop
• Rich set of operators
• Fault tolerance
• Generalized execution model
12. Master node
• The manager node
• Assigns work to the slave nodes
• Handles management, monitoring, and maintenance of the slaves, assigns work to them, and keeps track of that work
• The master daemon runs on the master node
13. Slave nodes
• Worker nodes
• Do the work assigned by the master
• The slave daemon runs on all slave nodes
15. • The user develops the work/application
• The work is submitted to the master
• The master divides the work and submits it to all the nodes in the cluster
• All the slaves work on their sub-tasks
– In this manner Spark achieves distributed computing and parallel processing
16. Resilient Distributed Dataset
• The basic core abstraction in Spark
– Resilient – if data is lost, it is recreated automatically (fault tolerant)
– Distributed – data is stored and processed in a distributed fashion
– Dataset – the data can come from different data stores
17. • An RDD is a simple, immutable collection of objects
• An RDD can contain any type of (Scala, Java, Python, or R) object
• Each RDD is split into partitions, which may be computed on different nodes of the cluster
18. What is an RDD?
• RDDs are the fundamental unit of data in Spark
• The core Spark abstraction
• They enable parallel processing on a dataset
• Immutable, recomputable, fault tolerant
• In Spark programming we perform operations on RDDs
• Transformations and actions are used to process RDDs
19. RDD operations
• Two types of operations:
▪ Transformation
- Creates a new RDD from an existing one
- E.g. map, filter, flatMap, join, etc.
▪ Action
- Returns a result to the driver or writes it to storage
- E.g. count, collect, save, etc.
20. • Lazy evaluation
– execution will not start until an action is triggered
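A tiny, hypothetical demonstration of this (not from the deck): the transformations below only build a lineage graph, and work happens at the action.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "lazy-eval-demo")

rdd = sc.parallelize(range(1_000_000))
doubled = rdd.map(lambda x: x * 2)                # transformation: nothing runs yet
multiples = doubled.filter(lambda x: x % 3 == 0)  # still nothing runs

print(multiples.count())  # action: triggers the whole pipeline (prints 333334)
sc.stop()
```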
21. Spark context
• The Spark context is an object
• Every Spark application requires a Spark context
• It is the main entry point of a Spark application
• It interacts with the cluster manager
• It specifies how Spark should access the cluster
• RDDs are created using the Spark context
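A minimal sketch of creating a Spark context in PySpark (the app name and master URL are placeholders):

```python
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("my-app").setMaster("local[*]")
sc = SparkContext(conf=conf)  # entry point; talks to the cluster manager

rdd = sc.parallelize([1, 2, 3])  # RDDs are created through the context
print(rdd.sum())  # 6

sc.stop()
```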
23. • The developer develops the application/program
• It needs a Spark context object, the main entry point of a Spark application, which can interact with the cluster manager
• Data nodes are the slaves of HDFS
• Worker nodes are the slaves of Spark
• The cluster manager interacts with the worker nodes to get resources
• The executor is the distributed agent responsible for the execution of tasks
24. The driver program
• The driver program runs the main() function of the application and is the place where the Spark context is created
• The driver program, which runs on the master node of the Spark cluster, schedules the job execution and negotiates with the cluster manager
25. Executor
• An executor is a distributed agent responsible for the execution of tasks
• Every Spark application has its own executor processes
• Executors perform all the data processing
• They read data from and write data to external sources
• They store computation results in memory, in cache, or on hard disk drives
• They interact with the storage systems
26. Cluster manager
• An external service responsible for acquiring resources on the Spark cluster and allocating them to a Spark job
28. Spark Core
• The main Spark engine and the kernel of Spark
• It is in charge of the essential I/O functionalities
29. Spark SQL
• Enables users to run SQL queries
• Can handle structured and semi-structured data
• One of the most popular SQL engines in big data
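A sketch of querying semi-structured (JSON) data with Spark SQL; the records and field names are invented for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-sketch").getOrCreate()

json_lines = spark.sparkContext.parallelize([
    '{"name": "alice", "address": {"city": "Paris"}}',
    '{"name": "bob",   "address": {"city": "Oslo"}}',
])
people = spark.read.json(json_lines)  # schema is inferred from the JSON

people.createOrReplaceTempView("people")
spark.sql("SELECT name, address.city AS city FROM people").show()

spark.stop()
```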
30. Spark Streaming
• Can handle live streams with very low latency
• Powers interactive and analytical applications
• Can process near real-time data from multiple sources
• Internally converts the streams into micro-batches, processes them in the cluster, and pushes the results to data stores
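The slide describes the classic micro-batch model; the sketch below uses the newer Structured Streaming API with the built-in "rate" test source so it runs without an external stream. All option values are arbitrary.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# Count events in 10-second windows, printing each micro-batch to the console.
counts = stream.groupBy(window("timestamp", "10 seconds")).count()
query = counts.writeStream.outputMode("complete").format("console").start()

query.awaitTermination(30)  # run for ~30 seconds
query.stop()
spark.stop()
```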
32. GraphX
• Enables users to handle graph data processing
• We can represent our data in terms of a graph
• E.g.:
– on LinkedIn: degrees of connection (1st-degree, 2nd-degree connections)
– on Facebook: friends of friends
Such requirements can be handled efficiently by the graph engine.
33. Storage systems
• Spark depends on third-party storage systems, such as:
– HDFS
– HBase
– Cassandra
– Amazon S3, and so on
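All URIs in this closing sketch are placeholders; the point is only that Spark reads from third-party storage via the appropriate scheme (connectors for HBase and Cassandra would additionally need their JARs, which are not shown here).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-sketch").getOrCreate()

logs = spark.read.text("hdfs://namenode:8020/data/events.log")  # HDFS (placeholder)
users = spark.read.parquet("s3a://my-bucket/tables/users/")     # Amazon S3 (placeholder)

print(logs.count(), users.count())
spark.stop()
```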