Big Data Training
About me
• I’m Vishal Periyasamy Rajendran
• Senior Data Engineer
• Focused on architecting and developing big data solutions on the AWS cloud.
• 8x AWS certifications + other certifications on
Azure, Snowflake etc.
• You can find me on
  • LinkedIn: https://www.linkedin.com/in/vishal-p-2703a9131/
  • Medium: https://medium.com/@vishalrv1904
Agenda
• Big data Overview
• Dimensions of Big data
• Traditional approach and limitations
• Hadoop Overview
• Spark Overview
• Hive Overview
• Other Big data frameworks
Big Data Overview

What is Big data?
• Smartphone users collectively generate roughly 40 exabytes of data every month.
• According to Forbes, 2.5 quintillion bytes of data are created every day.
What is Big data?
• A collection of data so large and complex that no traditional data management tool can store or process it.
Dimensions of Big Data
6 V’s of Big Data
• Volume: the scale of data.
• Velocity: the speed of data.
• Variety: the diversity of data.
• Veracity: the accuracy of data.
• Value: the insights gained from data.
• Variability: how often data can change.

Big Data Phases
• Data Collection
• Data Cleansing / Validation
• Data Transformation
• Data Storage
• Data Visualization
Different Pipelines:
• ETL (Extract, Transform, Load)
• ELT (Extract, Load, Transform)
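The two pipeline styles differ only in where the transform step sits relative to the load step. A minimal sketch in plain Python, assuming made-up records, a hypothetical `clean` rule, and ordinary lists standing in for the storage layer (none of these names come from the training material):

```python
# Toy illustration of ETL vs ELT. The records, the `clean` rule, and the
# list-based "warehouse" are all assumptions made for this sketch.

def extract():
    # Pretend these rows came from a source system.
    return [{"state": " Kerala ", "cases": "12"}, {"state": "GOA", "cases": "7"}]

def clean(row):
    # Transformation step: normalise text and convert types.
    return {"state": row["state"].strip().lower(), "cases": int(row["cases"])}

def etl():
    warehouse = []
    # Transform BEFORE loading: only clean rows ever reach storage.
    warehouse.extend(clean(row) for row in extract())
    return warehouse

def elt():
    raw_zone = []
    # Load the raw rows first ...
    raw_zone.extend(extract())
    # ... then transform inside the storage layer, keeping the raw copy.
    curated = [clean(row) for row in raw_zone]
    return raw_zone, curated

etl_result = etl()
raw_rows, elt_result = elt()
print(etl_result == elt_result)
```

Both routes end with the same curated rows; ELT additionally keeps the untouched raw rows, which can be re-transformed later if the rules change.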
Traditional Approach
• An enterprise uses a single computer to store and process big data.
• Limitations:
  • The one processor handling all the data becomes a bottleneck.
  • It cannot keep up with huge, continually growing amounts of data.

Traditional Approach
• Google’s Solution:
  • Solved the processor problem using an algorithm called MapReduce.
  • Divides the task into small parts and assigns them to many computers.
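The divide-and-assign idea can be sketched as a word count in plain Python. This is only an illustration of the map, shuffle, and reduce steps on one machine; the function names and the sample chunks are assumptions, and each chunk stands in for the share of input a separate computer would handle:

```python
from collections import defaultdict

def map_phase(chunk):
    # Map: emit a (word, 1) pair for every word in one chunk of input.
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    # Shuffle: group all values that share a key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: collapse each key's values into one result.
    return {key: sum(values) for key, values in grouped.items()}

# Each chunk stands in for the part of the input handled by one machine.
chunks = ["big data big", "data tools", "big tools"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
word_counts = reduce_phase(shuffle(mapped))
print(word_counts)
```

Because `map_phase` looks at one chunk at a time, the map work could run on many machines in parallel; only the shuffle needs to bring matching keys together.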
Hadoop Overview
• Using the solution provided by Google, Doug Cutting and his team developed an open-source project called Hadoop.
Hadoop Overview
• MapReduce
  • Framework for distributed data processing.
  • Maps data to key/value pairs.
  • Reduces intermediate results to final output.
  • Largely supplanted by Spark these days.
• YARN (Yet Another Resource Negotiator)
  • Manages cluster resources for multiple data processing frameworks.
• HDFS (Hadoop Distributed File System)
  • Distributes data blocks across the cluster in a redundant manner.
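The idea of redundant block distribution can be sketched in a few lines of Python. This is a toy model, not HDFS's actual placement policy (which is rack-aware); the node names, the tiny block size, and the round-robin rule are all assumptions chosen for illustration:

```python
def split_into_blocks(data: bytes, block_size: int):
    # Cut the file into fixed-size blocks (the last one may be shorter).
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(num_blocks: int, nodes, replication: int):
    # Round-robin placement: block b goes to `replication` consecutive nodes.
    # This is a simplification, not HDFS's real rack-aware policy.
    placement = {}
    for b in range(num_blocks):
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement

def readable_after_failure(placement, failed_node):
    # A block survives if at least one replica lives on a healthy node.
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )

nodes = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cluster
blocks = split_into_blocks(b"x" * 10, block_size=4)
placement = place_blocks(len(blocks), nodes, replication=3)
print(len(blocks), placement[0], readable_after_failure(placement, "node-a"))
```

With three replicas per block, losing any single node still leaves every block readable, which is the point of storing blocks redundantly.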

Spark Overview
• Hadoop MapReduce must persist data back to disk after every Map or Reduce action.
  • This slows processing down.
• Spark is a distributed processing framework for big data.
• Apache Spark is popular for its speed: it can run up to 100 times faster in memory and up to ten times faster on disk than Hadoop MapReduce, because it keeps intermediate data in memory (RAM) where possible.
• Supports Java, Scala, Python, and R.
Spark Components
How Spark Works
• Spark applications run as independent sets of processes on a cluster.
• Executors run computations and store data.
• The Spark context sends application code and tasks to the executors.
• Cluster manager: YARN
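The division of labour between the driver and the executors can be mimicked with a thread pool from the Python standard library. This is an analogy only: real Spark executors are separate processes on other machines, and the `task` function and partition size here are assumptions made for the sketch:

```python
from concurrent.futures import ThreadPoolExecutor

def task(partition):
    # The "application code" shipped to an executor: sum one partition.
    return sum(partition)

# The driver splits the data into partitions, one task per partition.
data = list(range(1, 11))
partitions = [data[i:i + 3] for i in range(0, len(data), 3)]

# Worker threads stand in for executors; the pool plays the cluster manager's
# role of handing out resources. Real Spark executors are separate processes.
with ThreadPoolExecutor(max_workers=2) as executors:
    partial_sums = list(executors.map(task, partitions))

# The driver combines the partial results.
total = sum(partial_sums)
print(partial_sums, total)
```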

Spark Context vs SQL Context vs Hive Context vs Spark Session
• Spark 1.x introduced three entry points:
  • Spark Context:
    • The entry point of every Spark application.
    • The first step to using RDDs and connecting to a Spark cluster.
  • SQL Context:
    • Used for Spark SQL execution and structured data processing.
  • Hive Context:
    • Used by the application to communicate with Hive.
Spark Context vs SQL Context vs Hive Context vs Spark Session
• Spark 2.x introduced the Spark session:
  • Spark Session:
    • A combination of Spark context, SQL context, and Hive context.
Resilient Distributed Dataset (RDD) & DataFrame
• An RDD (Resilient Distributed Dataset) is the fundamental data structure of Spark.
• A DataFrame is organized into named columns.
  • Supports operations such as select, agg, sum, and avg.
  • Supports Spark SQL.
  • The Catalyst optimizer is available.
• Both are fault-tolerant, immutable distributed collections of objects: once created, they cannot be changed.
Different types of Evaluation
• Eager Evaluation:
  • Evaluates an expression as soon as it is encountered. It is the strategy you are most likely familiar with and is used in most programming languages.
• Lazy Evaluation:
  • Delays the evaluation of an expression until its value is needed.
  • In Spark, you can apply as many transformations as you want, but execution does not start until an action is called.
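The contrast can be seen in plain Python, where list comprehensions are eager and generator expressions are lazy. This is an analogy for Spark's behaviour rather than Spark code; the `double` function and the call log are assumptions added so the timing of the work is visible:

```python
log = []

def double(x):
    # Record every call so we can see WHEN the work happens.
    log.append(f"double({x})")
    return x * 2

numbers = [1, 2, 3]

# Eager: the list comprehension runs `double` immediately.
eager = [double(n) for n in numbers]
calls_after_eager = len(log)

# Lazy: building the generator pipeline runs nothing yet.
log.clear()
lazy = (double(n) for n in numbers)
lazy = (n for n in lazy if n > 2)
calls_before_action = len(log)

# The "action": consuming the pipeline finally triggers the work.
result = list(lazy)
calls_after_action = len(log)
print(calls_after_eager, calls_before_action, calls_after_action, result)
```

Stacking the two generator steps plays the role of chaining transformations, and `list(...)` plays the role of an action.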

Transformation & Actions
• Transformations are the instructions you use to modify the DataFrame in the way you want; they are lazily executed.
  • Narrow transformations (each output partition depends on a single input partition):
    • select
    • filter
    • withColumn
  • Wide transformations (data is shuffled across partitions):
    • groupBy
    • repartition
• Actions are statements that ask for a value to be computed immediately; they are eager.
  • show, collect, save, count.
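Why a filter is narrow while a group-by is wide can be shown with lists standing in for partitions. This is a plain-Python sketch, not Spark's implementation; the sample rows and the character-code partitioner are assumptions made for illustration:

```python
from collections import defaultdict

# Two partitions of (state, cases) rows; the data is made up for this sketch.
partitions = [
    [("kerala", 5), ("goa", 1)],
    [("goa", 4), ("kerala", 2)],
]

def narrow_filter(partition, min_cases):
    # Narrow: each output partition is computed from one input partition.
    return [row for row in partition if row[1] >= min_cases]

def wide_group_sum(partitions, num_output_partitions):
    # Wide: rows with the same key must be moved to the same output
    # partition (a shuffle) before they can be combined.
    shuffled = [defaultdict(int) for _ in range(num_output_partitions)]
    for partition in partitions:
        for key, value in partition:
            # Assumed partitioner: sum of character codes modulo partition count.
            target = sum(map(ord, key)) % num_output_partitions
            shuffled[target][key] += value
    return [dict(p) for p in shuffled]

filtered = [narrow_filter(p, 2) for p in partitions]
grouped = wide_group_sum(partitions, num_output_partitions=2)
totals = {k: v for part in grouped for k, v in part.items()}
print(filtered, totals)
```

The filter never looks outside its own partition, whereas the grouped sum for a state needs rows from both input partitions.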
Spark’s Catalyst Optimizer
• As you apply transformations, Spark records them in a Directed Acyclic Graph (DAG).
• Once the DAG is constructed, Spark’s Catalyst optimizer performs a set of rule-based and cost-based optimizations to determine a logical and then a physical plan of execution.
• The Catalyst optimizer groups operations together, reducing the number of passes over the data and improving performance.
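The benefit of grouping operations can be sketched with a toy plan executor. This shows only the "fewer passes over the data" idea and is not how Catalyst is implemented; the plan format, function names, and sample rows are all assumptions:

```python
# A logical plan as an ordered list of (kind, function) steps.
plan = [
    ("filter", lambda row: row["cases"] > 0),
    ("map", lambda row: {**row, "state": row["state"].lower()}),
    ("filter", lambda row: row["state"] != "goa"),
]

def run_step_by_step(rows, plan):
    # Naive execution: one full pass over the data per step.
    passes = 0
    for kind, fn in plan:
        passes += 1
        if kind == "filter":
            rows = [r for r in rows if fn(r)]
        else:
            rows = [fn(r) for r in rows]
    return rows, passes

def run_fused(rows, plan):
    # Fused execution: every row flows through all steps in a single pass.
    out = []
    for row in rows:
        keep = True
        for kind, fn in plan:
            if kind == "filter":
                if not fn(row):
                    keep = False
                    break
            else:
                row = fn(row)
        if keep:
            out.append(row)
    return out, 1

rows = [
    {"state": "Goa", "cases": 3},
    {"state": "Kerala", "cases": 9},
    {"state": "Delhi", "cases": 0},
]
slow_result, slow_passes = run_step_by_step(rows, plan)
fast_result, fast_passes = run_fused(rows, plan)
print(slow_result == fast_result, slow_passes, fast_passes)
```

Both executors return the same rows, but the fused version touches the data once instead of once per step.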
© Presidio, Inc. All rights reserved. Proprietary and Confidential.
Spark Hands-on
Spark Assignment
• Input:
  • COVID data CSV file
• Expected outputs:
  • Convert all state names to lowercase.
  • Find the day with the greatest number of COVID cases.
  • Find the state with the second-largest number of COVID cases.
  • Find the Union Territory with the fewest deaths.
  • Find the state with the lowest ratio of deaths to total confirmed cases.
  • Find the month with the most newly recovered cases.
    • Display the month by name: if the month is 02, it should display as February.
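Two of the smaller requirements, lowercasing state names and showing a month number as a name, can be prototyped in plain Python before moving to Spark. The CSV layout below (`date`, `state`, `confirmed` columns, ISO dates) is an assumption; the real file's columns may differ:

```python
import csv
import io

# Made-up sample in an assumed layout; the real COVID CSV's columns may differ.
sample_csv = """date,state,confirmed
2021-02-14,Kerala,120
2021-03-01,GOA,45
"""

MONTHS = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)

def month_name(month_number: str) -> str:
    # "02" -> "February"; MONTHS is zero-indexed, months are 1..12.
    return MONTHS[int(month_number) - 1]

rows = []
for record in csv.DictReader(io.StringIO(sample_csv)):
    rows.append({
        "state": record["state"].lower(),
        # Characters 5..6 of an ISO date (YYYY-MM-DD) hold the month.
        "month": month_name(record["date"][5:7]),
        "confirmed": int(record["confirmed"]),
    })
print(rows)
```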

Hive Overview
Apache Hive
• Uses familiar SQL syntax (HiveQL)
• Scalable: works with “big data” on a cluster
• Most appropriate for data warehouse applications
• Easy OLAP queries, far easier than writing MapReduce in Java
• Interactive and highly optimized
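To give a feel for how compact an aggregate query is compared with hand-written map and reduce steps, here is a sketch that runs a SQL statement from Python. SQLite stands in for Hive so the example runs without a cluster; the `covid` table and its columns are hypothetical, and the query uses only basic SQL constructs (SELECT, SUM, GROUP BY, ORDER BY) that HiveQL also provides:

```python
import sqlite3

# SQLite stands in for Hive so the sketch can run without a cluster.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE covid (state TEXT, confirmed INTEGER)")
conn.executemany(
    "INSERT INTO covid VALUES (?, ?)",
    [("kerala", 120), ("goa", 45), ("kerala", 30)],
)

# A typical aggregate query: total confirmed cases per state, largest first.
query = """
    SELECT state, SUM(confirmed) AS total_confirmed
    FROM covid
    GROUP BY state
    ORDER BY total_confirmed DESC
"""
totals = conn.execute(query).fetchall()
conn.close()
print(totals)
```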
Other Big Data Frameworks
• Apache Pig:
  • Introduces Pig Latin, a scripting language that lets you use SQL-like syntax to define your map and reduce steps.
• Apache HBase:
  • Non-relational, petabyte-scale database.
  • In-memory, based on Google’s Bigtable, on top of HDFS.
• Presto:
  • It can connect to many different “big data” databases and data stores at once, and query across them.
  • Interactive queries at the petabyte scale.
• Apache Zeppelin:
  • Interactively run scripts/code against your data.

Questions
33

More Related Content

Similar to Big Data training

Advanced Analytics and Big Data (August 2014)
Advanced Analytics and Big Data (August 2014)Advanced Analytics and Big Data (August 2014)
Advanced Analytics and Big Data (August 2014)
Thomas W. Dinsmore
 
Spark_Talha.pptx
Spark_Talha.pptxSpark_Talha.pptx
Spark_Talha.pptx
ITLAb21
 
Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016
Zohar Elkayam
 
CC -Unit4.pptx
CC -Unit4.pptxCC -Unit4.pptx
CC -Unit4.pptx
Revathiparamanathan
 
Apache Spark
Apache SparkApache Spark
Apache Spark
masifqadri
 
Colorado Springs Open Source Hadoop/MySQL
Colorado Springs Open Source Hadoop/MySQL Colorado Springs Open Source Hadoop/MySQL
Colorado Springs Open Source Hadoop/MySQL
David Smelker
 
Presentation big dataappliance-overview_oow_v3
Presentation   big dataappliance-overview_oow_v3Presentation   big dataappliance-overview_oow_v3
Presentation big dataappliance-overview_oow_v3
xKinAnx
 
Big Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and ZeppelinBig Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and Zeppelin
prajods
 
Big Data Retrospective - STL Big Data IDEA Jan 2019
Big Data Retrospective - STL Big Data IDEA Jan 2019Big Data Retrospective - STL Big Data IDEA Jan 2019
Big Data Retrospective - STL Big Data IDEA Jan 2019
Adam Doyle
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
bddmoscow
 
Hadoop and SQL: Delivery Analytics Across the Organization
Hadoop and SQL:  Delivery Analytics Across the OrganizationHadoop and SQL:  Delivery Analytics Across the Organization
Hadoop and SQL: Delivery Analytics Across the Organization
Seeling Cheung
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics Platform
N Masahiro
 
Apache Spark on HDinsight Training
Apache Spark on HDinsight TrainingApache Spark on HDinsight Training
Apache Spark on HDinsight Training
Synergetics Learning and Cloud Consulting
 
Big data applications
Big data applicationsBig data applications
Big data applications
Juan Pablo Paz Grau, Ph.D., PMP
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
markgrover
 
5 Comparing Microsoft Big Data Technologies for Analytics
5 Comparing Microsoft Big Data Technologies for Analytics5 Comparing Microsoft Big Data Technologies for Analytics
5 Comparing Microsoft Big Data Technologies for Analytics
Jen Stirrup
 
Apache drill
Apache drillApache drill
Apache drill
MapR Technologies
 
Data processing with spark in r & python
Data processing with spark in r & pythonData processing with spark in r & python
Data processing with spark in r & python
Maloy Manna, PMP®
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
Mostafa
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
Caserta
 

Similar to Big Data training (20)

Advanced Analytics and Big Data (August 2014)
Advanced Analytics and Big Data (August 2014)Advanced Analytics and Big Data (August 2014)
Advanced Analytics and Big Data (August 2014)
 
Spark_Talha.pptx
Spark_Talha.pptxSpark_Talha.pptx
Spark_Talha.pptx
 
Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016Rapid Cluster Computing with Apache Spark 2016
Rapid Cluster Computing with Apache Spark 2016
 
CC -Unit4.pptx
CC -Unit4.pptxCC -Unit4.pptx
CC -Unit4.pptx
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
Colorado Springs Open Source Hadoop/MySQL
Colorado Springs Open Source Hadoop/MySQL Colorado Springs Open Source Hadoop/MySQL
Colorado Springs Open Source Hadoop/MySQL
 
Presentation big dataappliance-overview_oow_v3
Presentation   big dataappliance-overview_oow_v3Presentation   big dataappliance-overview_oow_v3
Presentation big dataappliance-overview_oow_v3
 
Big Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and ZeppelinBig Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and Zeppelin
 
Big Data Retrospective - STL Big Data IDEA Jan 2019
Big Data Retrospective - STL Big Data IDEA Jan 2019Big Data Retrospective - STL Big Data IDEA Jan 2019
Big Data Retrospective - STL Big Data IDEA Jan 2019
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
 
Hadoop and SQL: Delivery Analytics Across the Organization
Hadoop and SQL:  Delivery Analytics Across the OrganizationHadoop and SQL:  Delivery Analytics Across the Organization
Hadoop and SQL: Delivery Analytics Across the Organization
 
Technologies for Data Analytics Platform
Technologies for Data Analytics PlatformTechnologies for Data Analytics Platform
Technologies for Data Analytics Platform
 
Apache Spark on HDinsight Training
Apache Spark on HDinsight TrainingApache Spark on HDinsight Training
Apache Spark on HDinsight Training
 
Big data applications
Big data applicationsBig data applications
Big data applications
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
 
5 Comparing Microsoft Big Data Technologies for Analytics
5 Comparing Microsoft Big Data Technologies for Analytics5 Comparing Microsoft Big Data Technologies for Analytics
5 Comparing Microsoft Big Data Technologies for Analytics
 
Apache drill
Apache drillApache drill
Apache drill
 
Data processing with spark in r & python
Data processing with spark in r & pythonData processing with spark in r & python
Data processing with spark in r & python
 
Programming in Spark using PySpark
Programming in Spark using PySpark      Programming in Spark using PySpark
Programming in Spark using PySpark
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
 

Recently uploaded

( Call  ) Girls Nehru Place 9711199012 Beautiful Girls
( Call  ) Girls Nehru Place 9711199012 Beautiful Girls( Call  ) Girls Nehru Place 9711199012 Beautiful Girls
( Call  ) Girls Nehru Place 9711199012 Beautiful Girls
Nikita Singh$A17
 
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...
❻❸❼⓿❽❻❷⓿⓿❼ SATTA MATKA DPBOSS KALYAN FAST RESULTS CHART KALYAN MATKA MATKA RE...
#kalyanmatkaresult #dpboss #kalyanmatka #satta #matka #sattamatka
 
Kolkata @Call @Girls Service 0000000000 Rani Best High Class Kolkata Available
Kolkata @Call @Girls Service 0000000000 Rani Best High Class Kolkata AvailableKolkata @Call @Girls Service 0000000000 Rani Best High Class Kolkata Available
Kolkata @Call @Girls Service 0000000000 Rani Best High Class Kolkata Available
roshansa9823
 
Mumbai Central @Call @Girls 🛴 9930687706 🛴 Aaradhaya Best High Class Mumbai A...
Mumbai Central @Call @Girls 🛴 9930687706 🛴 Aaradhaya Best High Class Mumbai A...Mumbai Central @Call @Girls 🛴 9930687706 🛴 Aaradhaya Best High Class Mumbai A...
Mumbai Central @Call @Girls 🛴 9930687706 🛴 Aaradhaya Best High Class Mumbai A...
1258strict
 
Niagara College degree offer diploma Transcript
Niagara College  degree offer diploma TranscriptNiagara College  degree offer diploma Transcript
Niagara College degree offer diploma Transcript
taqyea
 
*Call *Girls in Hyderabad 🤣 8826483818 🤣 Pooja Sharma Best High Class Hyderab...
*Call *Girls in Hyderabad 🤣 8826483818 🤣 Pooja Sharma Best High Class Hyderab...*Call *Girls in Hyderabad 🤣 8826483818 🤣 Pooja Sharma Best High Class Hyderab...
*Call *Girls in Hyderabad 🤣 8826483818 🤣 Pooja Sharma Best High Class Hyderab...
roobykhan02154
 
( Call ) Girls South Mumbai phone 9930687706 You Are Serach A Beautyfull Doll...
( Call ) Girls South Mumbai phone 9930687706 You Are Serach A Beautyfull Doll...( Call ) Girls South Mumbai phone 9930687706 You Are Serach A Beautyfull Doll...
( Call ) Girls South Mumbai phone 9930687706 You Are Serach A Beautyfull Doll...
seenu pandey
 
一比一原版(usyd毕业证书)悉尼大学毕业证如何办理
一比一原版(usyd毕业证书)悉尼大学毕业证如何办理一比一原版(usyd毕业证书)悉尼大学毕业证如何办理
一比一原版(usyd毕业证书)悉尼大学毕业证如何办理
67n7f53
 
Streamlining Legacy Complexity Through Modernization
Streamlining Legacy Complexity Through ModernizationStreamlining Legacy Complexity Through Modernization
Streamlining Legacy Complexity Through Modernization
sanjay singh
 
Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...
Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...
Cómo hemos implementado semántica de "Exactly Once" en nuestra base de datos ...
javier ramirez
 
[D3T1S03] Amazon DynamoDB design puzzlers
[D3T1S03] Amazon DynamoDB design puzzlers[D3T1S03] Amazon DynamoDB design puzzlers
[D3T1S03] Amazon DynamoDB design puzzlers
Amazon Web Services Korea
 
Karol Bagh @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
Karol Bagh @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model SafeKarol Bagh @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
Karol Bagh @ℂall @Girls ꧁❤ 9873777170 ❤꧂VIP Jya Khan Top Model Safe
bookmybebe1
 
BIGPPTTTTTTTTtttttttttttttttttttttt.pptx
BIGPPTTTTTTTTtttttttttttttttttttttt.pptxBIGPPTTTTTTTTtttttttttttttttttttttt.pptx
Big Data training

  • 2. About me • I’m Vishal Periyasamy Rajendran • Senior Data Engineer • Focused on architecting and developing big data solutions on the AWS cloud. • 8x AWS certifications, plus other certifications on Azure, Snowflake, etc. • You can find me on: • LinkedIn: https://www.linkedin.com/in/vishal-p-2703a9131/ • Medium: https://medium.com/@vishalrv1904
  • 3. Agenda • Big Data Overview • Dimensions of Big Data • Traditional approach and limitations • Hadoop Overview • Spark Overview • Hive Overview • Other Big Data frameworks
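  • 5. What is Big Data? • An often-quoted estimate is that smartphone users generate approximately 40 exabytes of data every month; this can only be an aggregate across all users, not a per-user figure. • According to Forbes, 2.5 quintillion bytes (about 2.5 exabytes) of data are created every day.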
  • 6. What is Big Data? • A collection of data so large and complex that no traditional data management tool can store or process it.
  • 8. 6 V’s of Big Data • Volume: the scale of data. • Velocity: the speed of data. • Variety: the diversity of data. • Veracity: the accuracy of data. • Value: the insights gained from data. • Variability: how often data can change.
  • 10. Big Data Phases • Data collection • Data cleansing / validation • Data transformation • Data storage • Data visualization • Different pipelines: • ETL (Extract, Transform, Load) • ELT (Extract, Load, Transform)
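The phases above can be sketched as a tiny ETL pipeline. This is only an illustration using the Python standard library; the sample rows, the column names and the `cases` table are hypothetical and do not come from the slides.

```python
import csv
import io
import sqlite3

# Hypothetical raw input standing in for the data collection phase.
RAW_CSV = """state,date,confirmed
Kerala,2021-02-01,120
 Goa ,2021-02-01,15
Delhi,2021-02-01,not-a-number
"""

def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def cleanse(rows):
    """Cleansing / validation: drop rows whose count is not an integer."""
    return [r for r in rows if r["confirmed"].strip().isdigit()]

def transform(rows):
    """Transformation: normalise state names and convert types."""
    return [(r["state"].strip().lower(), r["date"], int(r["confirmed"]))
            for r in rows]

def load(rows, conn):
    """Storage: write the cleaned rows to a table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cases "
        "(state TEXT, date TEXT, confirmed INTEGER)"
    )
    conn.executemany("INSERT INTO cases VALUES (?, ?, ?)", rows)
    conn.commit()

def run_etl(text, conn):
    # ETL order: transform before loading.
    load(transform(cleanse(extract(text))), conn)

conn = sqlite3.connect(":memory:")
run_etl(RAW_CSV, conn)
stored = conn.execute(
    "SELECT state, confirmed FROM cases ORDER BY state"
).fetchall()
print(stored)
```

In an ELT pipeline the raw rows would be loaded first, and the cleansing and transformation steps would then run inside the target store.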
  • 12. Traditional Approach • An enterprise has a single computer to store and process big data. • Limitations: • The single processor that handles the data becomes a bottleneck. • It cannot deal with huge, continually growing amounts of data.
  • 13. Traditional Approach • Google’s solution: • Solved the processor problem using an algorithm called MapReduce. • It divides the task into small parts, assigns them to many computers, and combines their results.
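The divide-and-assign idea behind MapReduce can be sketched with a word count. This is a single-process illustration in plain Python, not Google's or Hadoop's implementation; the function names and the sample chunks are hypothetical.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a final result."""
    return key, sum(values)

def word_count(chunks):
    # On a real cluster each chunk would be mapped on a different machine;
    # here the "machines" are simulated with a plain loop.
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

chunks = ["big data is big", "data is everywhere"]
counts = word_count(chunks)
print(counts)
```

Because each chunk is mapped independently, adding machines lets more chunks be processed at the same time; only the shuffle step needs data to move between them.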
  • 15. Hadoop Overview • Building on the solution published by Google, Doug Cutting and his team developed an open-source project called Hadoop.
  • 16. Hadoop Overview • MapReduce: framework for distributed data processing • Maps data to key/value pairs • Reduces intermediate results to final output • Largely supplanted by Spark these days • YARN (Yet Another Resource Negotiator): manages cluster resources for multiple data processing frameworks • HDFS (Hadoop Distributed File System): distributes data blocks across the cluster in a redundant manner
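The redundant block distribution described for HDFS can be modelled in a few lines. This is a deliberately simplified sketch: the round-robin placement, the node names and the tiny block size are assumptions made for illustration, whereas real HDFS uses a rack-aware placement policy and much larger blocks.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut a file into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks, nodes, replication=3):
    """Assign every block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }

def readable_after_failure(placement, failed_node):
    """A file stays readable if every block has a replica on a live node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )

# Hypothetical 10-byte "file" on a hypothetical four-node cluster.
blocks = split_into_blocks(b"x" * 10, block_size=4)
nodes = ["node1", "node2", "node3", "node4"]
placement = place_replicas(len(blocks), nodes)
print(placement)
print(readable_after_failure(placement, "node3"))
```

With three replicas per block, losing any single node still leaves two copies of every block, which is what lets the file system tolerate machine failures.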
  • 18. Spark Overview • Hadoop MapReduce must persist data back to disk after every Map or Reduce action. • This disk I/O slows processing down. • Spark: a distributed processing framework for big data. • Apache Spark is popular for its speed. It is claimed to run up to 100 times faster in memory and up to ten times faster on disk than Hadoop MapReduce, because it keeps intermediate data in memory (RAM) where possible. • Supports Java, Scala, Python, and R.
  • 20. How Spark Works • Spark apps run as independent processes on a cluster. • Executors run computations and store data. • The Spark context sends application code and tasks to the executors. • Cluster manager: YARN
  • 21. Spark Context vs SQL Context vs Hive Context vs Spark Session • Spark 1.x introduced three entry points: • Spark Context: • The entry point of every Spark application. • The first step in using RDDs and connecting to a Spark cluster. • SQL Context: • Used for Spark SQL execution and structured data processing. • Hive Context: • Used by the application to communicate with Hive.
  • 22. Spark Context vs SQL Context vs Hive Context vs Spark Session • Spark 2.x introduced the Spark Session: • Spark Session: • A single entry point that combines the Spark context, SQL context and Hive context.
  • 23. Resilient Distributed Dataset (RDD) & DataFrame • The RDD (Resilient Distributed Dataset) is the fundamental data structure of Spark. • A DataFrame is organized into named columns. • Supports APIs such as select, agg, sum, avg, etc. • Supports Spark SQL. • The Catalyst optimizer is available. • Both are fault-tolerant, immutable distributed collections of objects, which means you cannot change them once they are created.
  • 24. Different Types of Evaluation • Eager evaluation: • The evaluation strategy you are most likely familiar with; it is used in most programming languages and evaluates an expression as soon as it is encountered. • Lazy evaluation: • An evaluation strategy that delays the evaluation of an expression until its value is needed. • In Spark, lazy evaluation means that you can apply as many transformations as you want, but Spark will not start executing them until an action is called.
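The difference between the two strategies can be seen in plain Python, where list comprehensions are eager and generator expressions are lazy. This is an analogy for Spark's transformations and actions, not Spark code; `read_rows` and `log` are hypothetical names used only to record when work happens.

```python
log = []

def read_rows():
    """Yield the numbers 1..5, recording each read so we can see when it runs."""
    for n in range(1, 6):
        log.append(f"read {n}")
        yield n

# Eager: the list comprehension is evaluated immediately.
eager = [n * 2 for n in [1, 2, 3]]

# Lazy: chaining "transformations" only builds a pipeline, no rows are read.
doubled = (n * 2 for n in read_rows())
large = (n for n in doubled if n > 4)
work_before_action = list(log)  # snapshot: still empty at this point

# The "action": asking for the values finally triggers the whole pipeline.
result = list(large)

print(eager)
print(work_before_action)
print(result)
print(log)
```

Nothing is read until `list(large)` is called, which mirrors how Spark defers work until an action asks for a result.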
  • 25. Transformations & Actions • Transformations are the instructions you use to modify the DataFrame in the way you want; they are executed lazily. • Narrow transformations: • select • filter • withColumn • Wide transformations: • groupBy • repartition • Actions are statements that ask for a value to be computed immediately; they are executed eagerly. • show, collect, save, count.
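The narrow/wide distinction can be modelled with partitions represented as plain Python lists. This is a conceptual sketch, not the Spark API; the sample rows and the toy partitioner are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical data set split into two partitions of (state, cases) rows.
partitions = [
    [("kerala", 10), ("goa", 3)],
    [("kerala", 5), ("delhi", 7)],
]

def narrow_filter(parts, predicate):
    """Narrow: each output partition depends on exactly one input partition."""
    return [[row for row in part if predicate(row)] for part in parts]

def wide_group_sum(parts, num_partitions):
    """Wide: rows with the same key must be shuffled to the same partition."""
    shuffled = [defaultdict(int) for _ in range(num_partitions)]
    for part in parts:
        for key, value in part:
            # Deterministic toy partitioner (an assumption, not Spark's).
            target = sum(key.encode()) % num_partitions
            shuffled[target][key] += value
    return [dict(p) for p in shuffled]

filtered = narrow_filter(partitions, lambda row: row[1] >= 5)
grouped = wide_group_sum(partitions, num_partitions=2)
totals = {k: v for part in grouped for k, v in part.items()}
print(filtered)
print(totals)
```

The filter never looks outside its own partition, whereas the grouping has to bring both `kerala` rows together, which is why wide transformations involve moving data across the cluster.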
  • 26. Spark’s Catalyst Optimizer • When performing different transformations, Spark stores them in a Directed Acyclic Graph (DAG). • Once the DAG is constructed, Spark’s Catalyst optimizer performs a set of rule-based and cost-based optimizations to determine a logical and then a physical plan of execution. • The Catalyst optimizer groups operations together, reducing the number of passes over the data and improving performance.
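The benefit of grouping operations can be shown with a toy plan executed two ways. This is a hypothetical illustration of the "fewer passes" idea only; it does not reproduce how Catalyst represents or rewrites plans.

```python
def run_naive(rows, plan):
    """Execute one full pass over the data for every step in the plan."""
    passes = 0
    for kind, fn in plan:
        passes += 1
        if kind == "filter":
            rows = [r for r in rows if fn(r)]
        else:  # "map"
            rows = [fn(r) for r in rows]
    return rows, passes

def run_fused(rows, plan):
    """Apply every step to each row in turn, so the data is scanned once."""
    out = []
    for r in rows:
        keep = True
        for kind, fn in plan:
            if kind == "filter":
                if not fn(r):
                    keep = False
                    break
            else:  # "map"
                r = fn(r)
        if keep:
            out.append(r)
    return out, 1

# Hypothetical plan: keep even numbers, multiply by 10, keep values above 20.
plan = [
    ("filter", lambda x: x % 2 == 0),
    ("map", lambda x: x * 10),
    ("filter", lambda x: x > 20),
]
rows = list(range(1, 9))

naive_result, naive_passes = run_naive(rows, plan)
fused_result, fused_passes = run_fused(rows, plan)
print(naive_result, naive_passes)
print(fused_result, fused_passes)
```

Both strategies return the same rows, but the fused version reads the input once instead of once per step, which matters a great deal when each pass means scanning a large data set.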
  • 27. Spark Hands-on • © Presidio, Inc. All rights reserved. Proprietary and Confidential.
  • 28. Spark Assignment • Input: • COVID data CSV file • Expected outputs: • Convert all state names to lowercase. • Find the day with the greatest number of COVID cases. • Find the state with the second-largest number of COVID cases. • Find the Union Territory with the fewest deaths. • Find the state with the lowest ratio of deaths to total confirmed cases. • Find the month with the most newly recovered cases. • Display the month by name: if the month is 02, it should display as February.
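Two of the smaller steps in the assignment, lowercasing state names and turning a month number into its name, can be sketched in plain Python before they are ported to Spark. The helper names and the sample totals below are hypothetical; the real column names depend on the CSV file supplied with the assignment.

```python
import calendar

def normalise_state(name: str) -> str:
    """Trim whitespace and convert a state name to lowercase."""
    return name.strip().lower()

def month_label(month: str) -> str:
    """Map a month number such as '02' to its English name."""
    return calendar.month_name[int(month)]

def second_largest(totals: dict) -> str:
    """Return the key with the second-highest value."""
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[1][0]

# Hypothetical per-state totals, not real COVID figures.
sample_totals = {"Kerala": 300, " Goa ": 40, "Delhi": 120}
clean_totals = {normalise_state(k): v for k, v in sample_totals.items()}

print(clean_totals)
print(second_largest(clean_totals))
print(month_label("02"))
```

In Spark the same logic would be expressed with DataFrame column functions and an ordering on the aggregated column rather than Python dictionaries.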
  • 30. Apache Hive • Uses familiar SQL syntax (HiveQL). • Scalable: works with “big data” on a cluster. • Most appropriate for data warehouse applications. • Easy OLAP queries: much easier than writing MapReduce in Java. • Interactive and highly optimized.
  • 32. Other Big Data Frameworks • Apache Pig: • Introduces Pig Latin, a scripting language that lets you use SQL-like syntax to define your map and reduce steps. • Apache HBase: • Non-relational, petabyte-scale database. • In-memory, based on Google’s Bigtable, on top of HDFS. • Presto: • Can connect to many different “big data” databases and data stores at once, and query across them. • Interactive queries at the petabyte scale. • Apache Zeppelin: • Interactively run scripts/code against your data.
  • 33. Questions