Big Data Training -
Amazon EMR
About me
• I’m Vishal Periyasamy Rajendran
• Senior Data Engineer
• Focused on architecting and developing big
data solutions on the AWS cloud.
• 8x AWS certifications + other certifications on
Azure, Snowflake etc.
• You can find me on
• LinkedIn:
https://www.linkedin.com/in/vishal-p-2703a9131/
• Medium:
https://medium.com/@vishalrv1904
Amazon EMR
Agenda
• EMR Overview
• EMR Fundamental blocks
• Launch types of EMR
• EMR Storage
• EMR Managed Scaling
• EMR Security
• EMR Pricing
• Hands-on
What is EMR?
Elastic MapReduce
Managed Hadoop framework on EC2 instances.
Includes Spark, HBase, Presto, Hive & more
Several integration points with AWS.
Basic blocks of
EMR
• Master node:
The master node manages the cluster
and typically runs the master components
of distributed applications.
Major services such as the Spark History
Server, the YARN ResourceManager, and the
HDFS NameNode run on the master node.
Basic blocks of
EMR
• Core node:
A node with software components
that run tasks and store data in the
Hadoop Distributed File System (HDFS)
on your cluster.
Multi-node clusters have at least one
core node.
Basic blocks of
EMR
• Task node:
A node with software components
that only run tasks. You can use
task nodes to add capacity for
parallel computation on data,
such as Hadoop MapReduce tasks and
Spark executors.
Task nodes don’t run the DataNode
daemon and don’t store data in HDFS.
Launch types of
EMR
• EMR on EKS
• EMR Serverless (announced in preview, November 2021)
• EMR on EC2 instances
• Instance groups
• Instance fleets
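As a rough illustration of how the three node types map onto an instance-group cluster, the sketch below builds the kind of `InstanceGroups` list that boto3's `run_job_flow` accepts under its `Instances` argument. The group names, instance type, counts, and the choice of Spot for task nodes are placeholder assumptions, not values from this training.

```python
def build_instance_groups(core_count=2, task_count=2,
                          instance_type="m5.xlarge"):
    """Return an InstanceGroups list with master, core, and optional task groups.

    All defaults are hypothetical placeholders chosen for illustration.
    """
    if core_count < 1:
        raise ValueError("a multi-node cluster needs at least one core node")
    groups = [
        # One master node manages the cluster.
        {"Name": "Master", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": 1},
        # Core nodes run tasks and store HDFS data.
        {"Name": "Core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": core_count},
    ]
    if task_count > 0:
        # Task nodes only run tasks, so losing one does not lose HDFS data.
        groups.append(
            {"Name": "Task", "InstanceRole": "TASK", "Market": "SPOT",
             "InstanceType": instance_type, "InstanceCount": task_count})
    return groups


for group in build_instance_groups():
    print(group["InstanceRole"], group["InstanceCount"], group["Market"])
```

Task nodes are often a reasonable place for Spot capacity because they hold no HDFS blocks, but that is a design choice rather than a requirement.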
EMR Storage
HDFS
• Hadoop Distributed File System
• Multiple copies stored across cluster instances
for redundancy
• Files stored as blocks (128MB default size)
• Ephemeral – HDFS data is lost when cluster is
terminated!
• But, useful for caching intermediate results or
workloads with significant random I/O
• Hadoop tries to process data where it is stored
on HDFS
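To make the block and replication points concrete, here is a small arithmetic sketch. The 128 MB block size comes from the slide; the replication factor of 3 is Hadoop's stock default and is an assumption here, since EMR may configure a lower value on small clusters.

```python
import math

BLOCK_SIZE_MB = 128  # default block size quoted on the slide


def hdfs_footprint(file_size_mb, replication=3, block_size_mb=BLOCK_SIZE_MB):
    """Return (number of blocks, raw MB consumed across the cluster).

    replication=3 is an assumed value; check dfs.replication on a real cluster.
    """
    blocks = math.ceil(file_size_mb / block_size_mb)
    raw_mb = file_size_mb * replication
    return blocks, raw_mb


blocks, raw_mb = hdfs_footprint(1000)
print(f"1000 MB file: {blocks} blocks, {raw_mb} MB of raw cluster storage")
```

The last block of a file only occupies the space it needs, which is why the raw figure is computed from the file size rather than from the block count.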
Local file system:
• Suitable only for temporary data (buffers,
caches, etc)
EMRFS:
• Access S3 as if it were HDFS
• Allows persistent storage after cluster
termination
• EMRFS Consistent View: optional, for S3
consistency
• Uses DynamoDB to track consistency
• May need to tune read/write
capacity on the DynamoDB table
• Since December 2020, S3 is strongly
consistent, so Consistent View is
generally no longer needed
EMR Scaling
EMR Automatic Scaling:
• The older approach
• Custom scaling rules based on CloudWatch
metrics
• Supports instance groups only
EMR Managed Scaling:
• Supports instance groups and instance fleets
• Scales Spot, On-Demand, and Savings Plan
instances within the same cluster
• Available for Spark, Hive, and YARN workloads
Scale-up Strategy
• Adds core nodes first, then task nodes,
up to the maximum units specified
Scale-down Strategy
• Removes task nodes first, then core
nodes, no further than the minimum
constraints
Spot nodes are always removed before On-Demand
instances
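The minimum and maximum limits mentioned above can be sketched as a managed scaling policy. The helper below is hypothetical; it only builds the dictionary shape that boto3's `put_managed_scaling_policy` expects as `ManagedScalingPolicy`, and the unit counts are made-up examples.

```python
def build_managed_scaling_policy(min_units, max_units,
                                 max_core_units=None,
                                 max_on_demand_units=None):
    """Build a ManagedScalingPolicy dict; all numbers are caller-supplied."""
    if not 1 <= min_units <= max_units:
        raise ValueError("need 1 <= min_units <= max_units")
    limits = {
        "UnitType": "Instances",
        "MinimumCapacityUnits": min_units,   # scale-down never goes below this
        "MaximumCapacityUnits": max_units,   # scale-up never goes above this
    }
    if max_core_units is not None:
        # Capacity beyond this cap is added as task nodes instead of core nodes.
        limits["MaximumCoreCapacityUnits"] = max_core_units
    if max_on_demand_units is not None:
        # Capacity beyond this cap is requested as Spot.
        limits["MaximumOnDemandCapacityUnits"] = max_on_demand_units
    return {"ComputeLimits": limits}


policy = build_managed_scaling_policy(2, 10, max_core_units=3,
                                      max_on_demand_units=5)
print(policy)
# With boto3 installed and credentials configured, this could be applied as:
#   boto3.client("emr").put_managed_scaling_policy(
#       ClusterId="<your-cluster-id>", ManagedScalingPolicy=policy)
```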
EMR
Security
• EMRFS
• S3 encryption (SSE or CSE) at rest
• TLS in transit between EMR nodes and S3
• S3
• SSE-S3, SSE-KMS
• Local disk encryption
• Spark communication between drivers &
executors is encrypted
• Hive communication between the Glue metastore
and EMR uses TLS
• Force HTTPS (TLS) in S3 bucket policies with the
aws:SecureTransport condition key
• IAM roles and policies
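As one way to picture the HTTPS-only rule, the sketch below generates a bucket policy that denies any request where `aws:SecureTransport` is false. The bucket name and the statement ID are placeholders assumed for the example.

```python
import json


def deny_insecure_transport_policy(bucket_name):
    """Return a bucket policy dict that rejects non-TLS requests."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyInsecureTransport",  # arbitrary label
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                f"arn:aws:s3:::{bucket_name}",
                f"arn:aws:s3:::{bucket_name}/*",
            ],
            # The condition matches only requests made over plain HTTP.
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }],
    }


# "example-emr-bucket" is a hypothetical bucket name.
print(json.dumps(deny_insecure_transport_policy("example-emr-bucket"), indent=2))
```

Because the statement is an explicit Deny, it takes precedence over any Allow granted elsewhere to the same principal.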
EMR Pricing
• Amazon EMR on Amazon EC2:
• The Amazon EMR price is added to the Amazon EC2 price (the
price for the underlying servers) and Amazon Elastic Block
Store (Amazon EBS) price (if attaching Amazon EBS volumes).
These are also billed per second, with a one-minute minimum.
• Amazon EMR on Amazon EKS:
• The Amazon EMR price is added to the Amazon EKS pricing or
any other services used with EKS. You can run EKS on AWS
using either EC2 or AWS Fargate.
• Amazon EMR Serverless:
• With EMR Serverless, there are no upfront costs, and you pay
for only the resources you use. You pay for vCPU, memory, and
storage resources consumed by your applications.
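The EMR on EC2 billing rule (EMR price added to the EC2 price, per-second billing, one-minute minimum) can be written out as simple arithmetic. The hourly rates below are invented for illustration and are not real AWS prices; EBS charges are left out to keep the sketch short.

```python
def emr_on_ec2_cost(runtime_seconds, instance_count,
                    ec2_rate_per_hour, emr_rate_per_hour):
    """Estimate cost in dollars for a cluster of identical instances."""
    billed_seconds = max(60, runtime_seconds)       # one-minute minimum
    hourly = ec2_rate_per_hour + emr_rate_per_hour  # EMR price is additive
    return instance_count * hourly * billed_seconds / 3600


# Hypothetical rates: $0.20/h for EC2 plus $0.05/h for EMR.
print(emr_on_ec2_cost(30, 1, 0.20, 0.05))    # a 30 s run is billed as 60 s
print(emr_on_ec2_cost(3600, 4, 0.20, 0.05))  # four instances for one hour
```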
© Presidio, Inc. All rights reserved. Proprietary and Confidential.
Questions
Amazon EMR
Hands-on
EMR Cluster
Hands-on
• EMR portal overview
• EMR cluster creation overview
• SSH into the cluster
• Running applications
• Spark shell
• spark-submit options
• EMR step
• EMR Notebook
• Logs overview
Spark Deployment
Modes
Client Mode
Spark Deployment
Modes
Cluster Mode
Spark Memory Allocation
Spark Memory Allocation
• Storage Memory:
• It’s mainly used to store Spark cache data, such as RDD cache, Unroll data, and so on.
• Execution Memory:
• It’s mainly used to store temporary data in the calculation process of Shuffle, Join,
Sort, Aggregation, etc.
• User Memory:
• It’s mainly used to store the data needed for RDD transformations, such as
RDD dependency information.
• Reserved Memory:
• The memory is reserved for the system and is used to store Spark’s internal objects.
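The four regions can be estimated from the executor heap size. This sketch assumes Spark's unified memory manager with its commonly documented defaults (300 MB reserved, `spark.memory.fraction` = 0.6, `spark.memory.storageFraction` = 0.5); a given EMR release may ship different values, so treat the numbers as illustrative.

```python
RESERVED_MB = 300  # assumed fixed reservation for Spark's internal objects


def spark_memory_regions(heap_mb, memory_fraction=0.6, storage_fraction=0.5):
    """Split an executor heap (in MB) into the four regions listed above."""
    usable = heap_mb - RESERVED_MB
    if usable <= 0:
        raise ValueError("heap must be larger than the reserved memory")
    unified = usable * memory_fraction    # shared by storage and execution
    storage = unified * storage_fraction  # initial storage share
    return {
        "reserved": RESERVED_MB,
        "user": usable - unified,
        "storage": storage,
        "execution": unified - storage,
    }


for region, size_mb in spark_memory_regions(4096).items():
    print(f"{region:>9}: {size_mb:.1f} MB")
```

The boundary between storage and execution memory is soft: either side can borrow from the other when it is idle, so the split above is a starting point rather than a hard limit.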
EMR Bootstrap
• Use a bootstrap action to install additional
software or customize the configuration of
cluster instances
• Bootstrap actions are scripts that run on
the cluster after Amazon EMR launches
the instance using the Amazon Linux
Amazon Machine Image (AMI).
• Bootstrap actions run before Amazon EMR
installs the applications that you specify
when you create the cluster and before
cluster nodes begin processing data.
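A bootstrap action is registered as a script stored in S3 plus optional arguments. The helper below is a hypothetical sketch that builds one entry in the shape boto3's `run_job_flow` accepts in its `BootstrapActions` list; the script path and arguments are placeholders.

```python
def build_bootstrap_action(name, script_s3_path, args=None):
    """Return one BootstrapActions entry for a script stored in S3."""
    if not script_s3_path.startswith("s3://"):
        raise ValueError("bootstrap scripts are referenced by an s3:// path")
    return {
        "Name": name,
        "ScriptBootstrapAction": {
            "Path": script_s3_path,
            "Args": list(args or []),  # passed to the script on every node
        },
    }


# Hypothetical script that installs extra Python packages on each node.
action = build_bootstrap_action(
    "install-python-libs",
    "s3://example-bucket/bootstrap/install_libs.sh",
    args=["pandas", "pyarrow"],
)
print(action)
```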
EMR Spark
Configuration
• spark.dynamicAllocation.enabled
• spark.executor.memory
• spark.driver.memory
• spark.driver.memoryOverhead
• spark.executor.memoryOverhead
• spark.driver.cores
• spark.executor.instances
• spark-submit arguments:
• --num-executors
• --executor-memory
• --executor-cores
• --py-files
• --packages
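To show how the properties and arguments listed above fit together, the sketch below assembles a `spark-submit` command line. The helper name, the script path, and every value are assumptions for illustration; suitable sizes depend on the instance types in the cluster.

```python
import shlex


def build_spark_submit(app, num_executors=2, executor_memory="4g",
                       executor_cores=2, conf=None, py_files=None,
                       packages=None, deploy_mode="cluster"):
    """Return the spark-submit command as a list of arguments."""
    cmd = [
        "spark-submit",
        "--deploy-mode", deploy_mode,
        "--num-executors", str(num_executors),
        "--executor-memory", executor_memory,
        "--executor-cores", str(executor_cores),
    ]
    # Each Spark property is passed as its own --conf key=value pair.
    for key, value in sorted((conf or {}).items()):
        cmd += ["--conf", f"{key}={value}"]
    if py_files:
        cmd += ["--py-files", ",".join(py_files)]
    if packages:
        cmd += ["--packages", ",".join(packages)]
    cmd.append(app)  # the application comes after all options
    return cmd


cmd = build_spark_submit(
    "s3://example-bucket/jobs/etl.py",  # hypothetical script location
    conf={
        "spark.dynamicAllocation.enabled": "false",
        "spark.executor.memoryOverhead": "512m",
    },
)
print(shlex.join(cmd))
```

Dynamic allocation is switched off in the example so that the fixed `--num-executors` value is the one that applies.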
EMR Hands-On
Write data to S3 using an EMR Spark application.
EMR Hands-On
Write data to RDS PostgreSQL using an EMR Spark application.
EMR Hands-On
Write data to S3 using an EMR Spark Kinesis streaming application.
EMR
Assignments
• Explore different file formats:
• CSV file format
• JSON file format
• Avro file format
• ORC file format
• Parquet file format
• Explore different compressions:
• ZIP
• GZIP
• BZIP2
• Snappy
EMR
Assignments
• Create an S3 bucket and configure a Lambda function as the trigger for every new object creation.
• The Lambda function should receive the event from S3 and submit a step on the EMR cluster with the required arguments.
• The EMR Spark application should read the file from S3 and add some additional metadata columns such as load
datetime.
• After transformation, the output DataFrame should be stored in a target S3 bucket.
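One possible starting point for the Lambda side of this assignment is sketched below. It only turns an S3 event into EMR step definitions that run `spark-submit` through `command-runner.jar`; the function name, the script path, and the `--input`/`--output` flags are hypothetical and would need to match your own Spark script.

```python
from urllib.parse import unquote_plus


def build_steps_from_s3_event(event, script_path, target_bucket):
    """Return one EMR step definition per object in an S3 event."""
    steps = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        steps.append({
            "Name": f"process-{key}",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit", "--deploy-mode", "cluster", script_path,
                    "--input", f"s3://{bucket}/{key}",           # hypothetical flag
                    "--output", f"s3://{target_bucket}/output/",  # hypothetical flag
                ],
            },
        })
    return steps


# Trimmed-down sample of the event shape S3 sends to Lambda.
sample_event = {"Records": [{"s3": {"bucket": {"name": "source-bucket"},
                                    "object": {"key": "incoming/my+file.csv"}}}]}
steps = build_steps_from_s3_event(
    sample_event, "s3://example-bucket/jobs/add_metadata.py", "target-bucket")
print(steps[0]["HadoopJarStep"]["Args"])
# Inside the Lambda handler, the steps could then be submitted with:
#   boto3.client("emr").add_job_flow_steps(JobFlowId="<cluster-id>", Steps=steps)
```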
EMR
Assignments
• Create a Spark Streaming application
with Kinesis as input.
• Perform real-time inserts, updates, and
deletes on the RDS database.
Feedback

