Hadoop/Spark Non-Technical Basics
Zitao Liu
Department of Computer Science
University of Pittsburgh
ztliu@cs.pitt.edu
September 24, 2015
Zitao Liu (University of Pittsburgh) www.zitaoliu.com September 24, 2015 1 / 17
Big Data Analytics
Big Data analytics always requires two components:
A filesystem to store big data.
A computation framework to analyze big data.
Hadoop
Apache Hadoop
The term “Hadoop” carries many meanings. Let’s look at Apache
Hadoop first.
Apache Hadoop is an open-source software framework written in Java for
distributed storage and distributed processing of very large data sets
on computer clusters built from commodity hardware.
Apache Hadoop
The base Apache Hadoop framework is composed of the following
modules:
Hadoop Common
Hadoop Distributed File System - storage
Hadoop YARN
Hadoop MapReduce - processing
Hadoop Distributed File System (HDFS)
The Hadoop Distributed File System (HDFS) is a distributed, scalable, and
portable filesystem written in Java for the Hadoop framework.
HDFS stores data on commodity machines, providing very high aggregate
bandwidth across the cluster.
HDFS stores large files (typically in the range of gigabytes to terabytes)
across multiple machines.
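To make "stores large files across multiple machines" concrete, here is a minimal Python sketch of splitting a file into fixed-size blocks and copying each block to several nodes. The 128 MB block size and replication factor of 3 are common HDFS defaults, assumed here. The function name, node names, and round-robin placement are hypothetical simplifications, not how HDFS actually chooses nodes.

```python
import math

BLOCK_SIZE_MB = 128  # assumed: a common HDFS default block size
REPLICATION = 3      # assumed: a common HDFS default replication factor


def place_blocks(file_size_mb, nodes,
                 block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Split a file into blocks and assign each block to distinct nodes.

    Round-robin placement is an illustrative simplification only.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    placement = {}
    for block in range(num_blocks):
        # Pick `replication` consecutive nodes, wrapping around the cluster.
        placement[block] = [nodes[(block + r) % len(nodes)]
                            for r in range(replication)]
    return placement


if __name__ == "__main__":
    # A 1 GB file on a hypothetical four-node cluster.
    for block, replicas in place_blocks(1024, ["n1", "n2", "n3", "n4"]).items():
        print(block, replicas)
```

Because every block lives on several machines, losing one node does not lose data, and different machines can read different blocks of the same file in parallel.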
Hadoop MapReduce
MapReduce is a programming model and an associated implementation for
processing and generating large data sets with a parallel, distributed
algorithm on a cluster.
A MapReduce program is composed of:
A Map procedure, which filters and sorts the input.
A Reduce procedure, which summarizes the mapped results.
Figure 1: Image from http://tessera.io/docs-datadr/
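The Map and Reduce procedures can be illustrated with the classic word-count example. This is a single-machine Python sketch of the programming model only, not Hadoop's API. The function names are hypothetical, and the `shuffle` step stands in for the grouping that a real framework performs between the two phases.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over all documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    docs = ["big data needs storage", "big data needs processing"]
    print(word_count(docs))
```

On a cluster, many machines would run `map_phase` on different pieces of the input at the same time, and many others would run `reduce_phase` on different keys.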
Hadoop Ecosystem
The Hadoop ecosystem includes:
Distributed filesystems, such as HDFS.
Distributed programming frameworks, such as MapReduce, Pig, and Spark.
SQL-on-Hadoop engines, such as Hive, Drill, and Presto.
NoSQL databases:
  Column data model, such as HBase and Cassandra.
  Document data model, such as MongoDB.
· · ·
MapReduce vs. Spark
A quick history:
Figure 2: Image from http://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf
Advantages of MapReduce
MapReduce has proven to be an ideal platform for implementing complex
batch applications as diverse as:
sifting through and analyzing system logs
running ETL
computing web indexes
powering personal recommendation systems
· · ·
Limitations of MapReduce
Some limitations of MapReduce:
Batch-mode processing (a one-pass computation model)
Difficult to program directly in MapReduce
Performance bottlenecks
In short, MR doesn’t compose well for a large number of applications.
Therefore, people built specialized systems as workarounds, such as Spark.
Details can be found at
http://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf
Apache Spark
Spark fits into the Hadoop open-source community, building on top of the
Hadoop Distributed File System (HDFS). It is a framework for writing
fast, distributed programs.
Faster: its in-memory approach can be 10 times faster than MapReduce for
certain applications, and it is better suited to iterative algorithms in ML.
Clean, concise APIs in Scala, Java and Python.
Interactive query analysis (from the Scala and Python shells).
Real-time analysis (Spark Streaming).
Advantages of Spark
Low-latency computations by caching the working dataset in memory
and then performing computations at memory speeds.
Efficient iterative algorithms, because subsequent iterations share
data through memory or repeatedly access the same in-memory dataset.
Figure 3: Image from http://blog.cloudera.com/blog/2013/11/putting-spark-to-use-fast-in-memory-computing-for-your-big-data-app
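The benefit of caching the working dataset can be seen in a toy Python sketch that counts how often the data is read from "disk". This is not Spark code. `DiskDataset` and both functions are hypothetical stand-ins, and the running-average update is just a placeholder for a real iterative algorithm.

```python
class DiskDataset:
    """Stand-in for data stored on disk; counts how many times it is read."""

    def __init__(self, values):
        self._values = list(values)
        self.reads = 0

    def load(self):
        self.reads += 1  # each call models one full pass over the disk
        return list(self._values)


def iterate_without_cache(dataset, iterations):
    """Re-read the dataset from 'disk' on every iteration."""
    estimate = 0.0
    for _ in range(iterations):
        data = dataset.load()
        estimate = (estimate + sum(data) / len(data)) / 2
    return estimate


def iterate_with_cache(dataset, iterations):
    """Read once, keep the working set in memory, and reuse it."""
    cached = dataset.load()
    estimate = 0.0
    for _ in range(iterations):
        estimate = (estimate + sum(cached) / len(cached)) / 2
    return estimate


if __name__ == "__main__":
    uncached, cached = DiskDataset([1, 2, 3]), DiskDataset([1, 2, 3])
    iterate_without_cache(uncached, 10)
    iterate_with_cache(cached, 10)
    print(uncached.reads, cached.reads)
```

Both versions compute the same answer, but the cached version touches the disk once instead of once per iteration, which is where the savings for iterative algorithms come from.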
Apache Spark
Spark has the upper hand as long as we’re talking about iterative
computations that need to pass over the same data many times.
But when it comes to one-pass ETL-like jobs, for example, data
transformation or data integration, MapReduce is the better fit - this is
what it was designed for [1].
[1] https://www.xplenty.com/blog/2014/11/apache-spark-vs-hadoop-mapreduce/
Apache Spark Cost
For optimal performance, the memory in the Spark cluster should be at least
as large as the amount of data you need to process, because the data has to
fit into memory. So, if you need to process really big data, Hadoop will be
the cheaper option, since hard disk space costs much less than memory [2].
[2] https://www.xplenty.com/blog/2014/11/apache-spark-vs-hadoop-mapreduce/
Thank you
Q & A
Hadoop/Spark Non-Technical Basics

  • 1. Hadoop/Spark Non-Technical Basics Zitao Liu Department of Computer Science University of Pittsburgh ztliu@cs.pitt.edu September 24, 2015 Zitao Liu (University of Pittsburgh) www.zitaoliu.com September 24, 2015 1 / 17
  • 2. Big Data Analytics. Big data analytics always requires two components: a filesystem to store big data, and a computation framework to analyze big data.
  • 3. Big Data Analytics. Big data analytics always requires two components: a filesystem to store big data, and a computation framework to analyze big data. Hadoop provides both.
  • 4. Apache Hadoop. “Hadoop” carries many meanings, so let’s look at Apache Hadoop first. Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.
  • 5. Apache Hadoop. The base Apache Hadoop framework is composed of the following modules: Hadoop Common, Hadoop Distributed File System, Hadoop YARN, and Hadoop MapReduce.
  • 6. Apache Hadoop. The base Apache Hadoop framework is composed of the following modules: Hadoop Common, Hadoop Distributed File System (storage), Hadoop YARN, and Hadoop MapReduce (processing).
  • 7. Hadoop Distributed File System (HDFS). HDFS is a distributed, scalable, and portable file system written in Java for the Hadoop framework. It stores data on commodity machines, providing very high aggregate bandwidth across the cluster. HDFS stores large files (typically in the range of gigabytes to terabytes) across multiple machines.
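To make "stores large files across multiple machines" concrete, here is a minimal Python sketch of how a file might be cut into fixed-size blocks and spread over nodes. It is an illustration, not HDFS code: the 128 MB block size and three copies per block are assumed values that do not come from the slides, `place_blocks` and the node names are hypothetical, and the round-robin placement is a simplification of whatever policy a real cluster uses.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024   # assumed block size in bytes (128 MB)
REPLICATION = 3                  # assumed number of copies kept per block


def place_blocks(file_size, nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Split a file of `file_size` bytes into blocks and assign each block
    to `replication` distinct nodes using simple round-robin placement."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    num_blocks = math.ceil(file_size / block_size)
    placement = {}
    for block in range(num_blocks):
        # Consecutive nodes (wrapping around) hold the copies of this block.
        placement[block] = [nodes[(block + r) % len(nodes)] for r in range(replication)]
    return placement


if __name__ == "__main__":
    layout = place_blocks(1024 ** 3, ["node1", "node2", "node3", "node4"])
    for block, holders in layout.items():
        print(block, holders)
```

Because every block lives on several machines, losing one machine does not lose the file, and different machines can read different blocks at the same time, which is where the aggregate bandwidth comes from.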
  • 8. Hadoop MapReduce. MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster. A MapReduce program is composed of a Map procedure and a Reduce procedure. Figure 1: Image from http://tessera.io/docs-datadr/
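The Map and Reduce procedures can be illustrated with the classic word-count example. The sketch below runs on a single machine in plain Python and only mimics the model; it does not use the Hadoop API. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and the explicit shuffle step stands in for the grouping that a framework would perform between the two phases.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["big data needs storage", "big data needs processing"]
    print(word_count(docs))
```

On a cluster, each call to the map step could run on a different machine against a different part of the input, and each key could be reduced independently, which is what makes the model parallel.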
  • 9. Hadoop Ecosystem. The Hadoop ecosystem includes: distributed filesystems, such as HDFS; distributed programming, such as MapReduce, Pig, and Spark; SQL-on-Hadoop, such as Hive, Drill, and Presto; NoSQL databases, with a column data model (such as HBase and Cassandra) or a document data model (such as MongoDB); and more.
  • 10. MapReduce vs. Spark. A quick history: Figure 2: Image from http://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf
  • 11. Advantages of MapReduce. MapReduce has proven to be an ideal platform for implementing complex batch applications as diverse as sifting through and analyzing system logs, running ETL, computing web indexes, powering personal recommendation systems, and more.
  • 12. Limitations of MapReduce. Some limitations of MapReduce: batch-mode processing (a one-pass computation model); difficulty of programming directly in MapReduce; performance bottlenecks. In short, MapReduce doesn’t compose well for a large number of applications. Therefore, people built specialized systems as workarounds, such as Spark. Details can be found in http://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf.
  • 13. Apache Spark. Spark fits into the Hadoop open-source community, building on top of the Hadoop Distributed File System (HDFS). It is a framework for writing fast, distributed programs. Faster (an in-memory approach): 10 times faster than MapReduce for certain applications. Better for iterative algorithms in ML. Clean, concise APIs in Scala, Java, and Python. Interactive query analysis (from the Scala and Python shells). Real-time analysis (Spark Streaming).
  • 14. Advantages of Spark. Low-latency computations, by caching the working dataset in memory and then performing computations at memory speeds. Efficient iterative algorithms, by having subsequent iterations share data through memory or repeatedly access the same dataset. Figure 3: Image from http://blog.cloudera.com/blog/2013/11/putting-spark-to-use-fast-in-memory-computing-for-your-big-data-app
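The benefit of caching for iterative algorithms can be shown with a toy experiment in plain Python; it does not use Spark. Everything here is hypothetical: `DiskDataset` stands in for data stored on disk and simply counts how many full reads occur, and `fit_mean` is a small gradient-descent loop chosen only because it passes over the same data many times.

```python
class DiskDataset:
    """Hypothetical stand-in for a dataset on disk; counts how often it is read."""

    def __init__(self, values):
        self._values = list(values)
        self.reads = 0

    def load(self):
        self.reads += 1              # each call models one full pass over disk
        return list(self._values)


def fit_mean(get_data, iterations=10, step=0.5):
    """Gradient descent on squared error; the estimate moves towards the data mean."""
    estimate = 0.0
    for _ in range(iterations):
        data = get_data()            # every iteration needs the whole dataset
        gradient = sum(estimate - x for x in data) / len(data)
        estimate -= step * gradient
    return estimate


def run_without_cache(dataset, iterations=10):
    """Re-read the dataset from 'disk' on every iteration."""
    return fit_mean(dataset.load, iterations)


def run_with_cache(dataset, iterations=10):
    """Read the dataset once, keep it in memory, and reuse it on every iteration."""
    cached = dataset.load()
    return fit_mean(lambda: cached, iterations)


if __name__ == "__main__":
    uncached, cached = DiskDataset([1, 2, 3]), DiskDataset([1, 2, 3])
    print(run_without_cache(uncached), "disk reads:", uncached.reads)
    print(run_with_cache(cached), "disk reads:", cached.reads)
```

Both runs compute the same estimate, but the cached run touches the simulated disk once instead of once per iteration, which is the effect the slide attributes to keeping the working dataset in memory.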
  • 15. Apache Spark. Spark has the upper hand as long as we’re talking about iterative computations that need to pass over the same data many times. But when it comes to one-pass ETL-like jobs, for example, data transformation or data integration, then MapReduce is the deal - this is what it was designed for. Source: https://www.xplenty.com/blog/2014/11/apache-spark-vs-hadoop-mapreduce/
  • 16. Apache Spark Cost. The memory in the Spark cluster should be at least as large as the amount of data you need to process, because the data has to fit into the memory for optimal performance. So, if you need to process really Big Data, Hadoop will definitely be the cheaper option since hard disk space comes at a much lower rate than memory space. Source: https://www.xplenty.com/blog/2014/11/apache-spark-vs-hadoop-mapreduce/
  • 17. Thank You. Q & A