
INFO-H-419: Data Warehouses project

Hadoop in Data Warehousing
by Alexey Grigorev

Hadoop: In this Presentation
1. Introduction
2. Origins
3. MapReduce
4. Hadoop as MapReduce Implementation
5. Data Warehouse on Hadoop
6. Hadoop and Data Warehousing
7. Conclusions

• Lots of data
• How to deal with it?
• Hadoop to the rescue!
• When to use?
• When not to use?
• Curiosity

MapReduce: Origins
• Functional Programming
• Higher-order functions that operate on lists
• map
• applies a function to each element of the list
• reduce = fold = accumulate
• aggregates a list and produces one output value
• No side effects

MapReduce: Origins
• (define (+1 el) (+ el 1))
• (map +1 (list 1 2 3))
• (reduce + 0 (map +1 (list 1 2 3)))
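
For readers more at home in Python than in Lisp, the same two expressions can be sketched with the built-in `map` and `functools.reduce`. The name `plus_one` is only an illustrative stand-in for the slide's `+1` function.

```python
from functools import reduce
from operator import add

def plus_one(el):
    """Stand-in for the slide's (define (+1 el) (+ el 1))."""
    return el + 1

# (map +1 (list 1 2 3))
mapped = list(map(plus_one, [1, 2, 3]))

# (reduce + 0 (map +1 (list 1 2 3)))
total = reduce(add, map(plus_one, [1, 2, 3]), 0)

print(mapped)  # [2, 3, 4]
print(total)   # 9
```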





MapReduce: Origins
• These functions do not have side effects
• And can be parallelized easily
• Can split the input data into chunks:
• (list 1 2 3 4) -> (list 1 2) and (list 3 4)
• Apply map to each chunk separately, and then combine (reduce) them

MapReduce: Origins
• Mapping separately:

(define res1 (reduce + 0 (map +1 (list 1 2))))
(reduce + res1 (map +1 (list 3 4)))

• This is the same as (reduce + 0 (map +1 (list 1 2 3 4)))
• Note that for this to work the function passed to reduce must be associative, as + is
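
A minimal Python sketch of the chunking argument, under the assumption that the reducing function is plain addition. The helper name `map_reduce_chunk` is invented for illustration. It checks that chaining the chunks, combining independent partial results, and processing the whole list all give the same value.

```python
from functools import reduce
from operator import add

def plus_one(el):
    return el + 1

def map_reduce_chunk(chunk, initial=0):
    """Map +1 over one chunk, then fold the results with + starting from `initial`."""
    return reduce(add, map(plus_one, chunk), initial)

data = [1, 2, 3, 4]
chunks = [data[:2], data[2:]]  # (list 1 2) and (list 3 4)

# Sequential chaining, as on the slide: res1 feeds the second reduce.
res1 = map_reduce_chunk(chunks[0])
chained = map_reduce_chunk(chunks[1], initial=res1)

# Independent partial results combined afterwards; this is what
# lets the chunks be processed on different workers.
partials = [map_reduce_chunk(c) for c in chunks]
combined = reduce(add, partials, 0)

whole = map_reduce_chunk(data)
print(chained, combined, whole)  # 14 14 14
```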

• A map function
• takes a key-value pair (in_key, in_val)
• produces zero or more key-value pairs: intermediate results
• intermediate results are grouped by key
• A reduce function
• for each group in the intermediate results
• aggregates and produces the final output

MapReduce Stages
Each MapReduce job is executed in 3 stages
• map stage: apply map to each key-value pair
• group together the intermediate results by key
• reduce stage: apply reduce to each group
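
The three stages can be sketched as a small in-memory driver. This is an illustration only, not how Hadoop is implemented: `run_job`, the sample records, and the word-length example are all assumptions made for the sketch.

```python
from collections import defaultdict

def run_job(records, map_fn, reduce_fn):
    """Run one MapReduce job in memory over (key, value) records."""
    # Map stage: each input pair yields zero or more intermediate pairs.
    intermediate = []
    for in_key, in_val in records:
        intermediate.extend(map_fn(in_key, in_val))

    # Group stage: collect intermediate values by key.
    groups = defaultdict(list)
    for out_key, out_val in intermediate:
        groups[out_key].append(out_val)

    # Reduce stage: aggregate each group into the final output.
    return {key: reduce_fn(key, vals) for key, vals in groups.items()}

def emit_word_lengths(line_no, text):
    """Hypothetical map function: emit (word length, 1) for each word."""
    for word in text.split():
        yield len(word), 1

def sum_values(key, values):
    """Hypothetical reduce function: add up the values of one group."""
    return sum(values)

records = [(0, "lorem ipsum dolor"), (1, "sit amet")]
histogram = run_job(records, emit_word_lengths, sum_values)
print(histogram)  # {5: 3, 3: 1, 4: 1}
```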

MapReduce Stages
• map: (in_key, in_val) -> [(out_key, out_val)]
• reduce: (out_key, [out_val]) -> final output

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean dictum justo est, quis
sagittis leo tincidunt sit amet. Donec scelerisque rutrum quam non sagittis. Phasellus sem
nisi, cursus eu lacinia eu, tempor ac eros. Class aptent taciti sociosqu ad litora torquent per
conubia nostra, per inceptos himenaeos. In mollis elit quis orci congue, quis aliquet mauris
mollis. Interdum et malesuada fames ac ante ipsum primis in faucibus.

Proin euismod non quam vitae pretium. Quisque vel nisl et leo volutpat rhoncus quis ac eros.
Sed lacus tellus, aliquam non ullamcorper in, dictum at magna. Vestibulum consequat
egestas lacinia. Proin tempus rhoncus mi, et lacinia elit ornare auctor. Sed sagittis euismod
massa ut posuere. Interdum et malesuada fames ac ante ipsum primis in faucibus. Duis
fringilla dolor ornare mi dictum ornare.

MapReduce Example
def map(String input_key, String doc):
    for each word w in doc:
        EmitIntermediate(w, 1)

def reduce(String output_key, Iterator output_vals):
    int res = 0
    for each v in output_vals:
        res += v
    Emit(res)

MapReduce Example


• map stage: output 1 for each word
• group a list of (w, 1) pairs into (w, [1, 1, ..., 1])
• reduce stage: for each word, calculate how many ones there are

MapReduce Example: Result
• amet: 2
• ante: 2
• aptent: 1
• consectetur: 1
• dictum: 3
• dolor: 2
• elit: 3
• ...
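
A runnable Python sketch of the word count, assuming words are lowercased and punctuation is dropped (the slides do not state the tokenization rule). Grouping is done by sorting the intermediate pairs, which roughly mirrors how the pairs arrive at a reducer grouped by key. The sample sentence is made up and much shorter than the text on the slide.

```python
import re
from itertools import groupby
from operator import itemgetter

def map_words(doc):
    """Map stage: emit (word, 1) for every word, lowercased, punctuation dropped."""
    for word in re.findall(r"[a-z]+", doc.lower()):
        yield word, 1

def reduce_counts(pairs):
    """Group sorted (word, 1) pairs by word and sum the ones in each group."""
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield word, sum(count for _, count in group)

# Made-up sample input, not the full text from the slide.
doc = "Lorem ipsum dolor sit amet. Donec sit amet, dolor."
counts = dict(reduce_counts(map_words(doc)))
for word in sorted(counts):
    print(f"{word}: {counts[word]}")
```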



... is a framework that allows for the distributed processing of large data
sets across clusters of computers using simple programming models. It is
designed to scale up from single servers to thousands of machines, each
offering local computation and storage. Rather than rely on hardware to
deliver high-availability, the library itself is designed to detect and handle
failures at the application layer, so delivering a highly-available service on
top of a cluster of computers, each of which may be prone to failures.

• Open Source implementation of MapReduce
• "Hadoop":
• Hadoop MapReduce
• HBase
• Hive
• ... many others

Hadoop Cluster: Terminology
• Name Node: orchestrates the process
• Workers: nodes that do the computation
• Mappers do the map phase
• Reducers do the reduce phase





[Figure: map and reduce workers, each with local storage]





• No execution plan


• Node done: another task assigned
• Node failed: task reassigned

• No communication costs
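
The scheduling behaviour on this slide (work is handed out at run time, a finished node gets another task, a failed node's task is reassigned) can be illustrated with a toy sequential simulation. Everything here is hypothetical: `run_tasks`, the worker names, and the `fails_on` set exist only for this sketch and are not part of any Hadoop API.

```python
from collections import deque

def run_tasks(tasks, workers, fails_on):
    """Hand tasks to workers one at a time; requeue a task if its worker fails.

    `fails_on` is a set of (worker, task) pairs that simulate node failures.
    Returns the (task, worker) assignments that completed, in order.
    """
    pending = deque(tasks)
    idle = deque(workers)
    done = []
    while pending:
        if not idle:
            raise RuntimeError("no workers left to run the remaining tasks")
        task = pending.popleft()
        worker = idle.popleft()
        if (worker, task) in fails_on:
            # Node failed: the task is reassigned, the node is not reused.
            pending.append(task)
        else:
            # Node done: it becomes available for another task.
            done.append((task, worker))
            idle.append(worker)
    return done

completed = run_tasks(["m1", "m2", "m3"], ["A", "B"], fails_on={("B", "m2")})
print(completed)  # [('m1', 'A'), ('m3', 'A'), ('m2', 'A')]
```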

• Simple, especially for programmers who know FP
• Fault tolerant
• No schema, can process any data
• Flexible
• Cheap and runs on commodity hardware

• No declarative high-level language like SQL
• Performance issues:
• Map and Reduce are blocking
• Name Node: single point of failure
• It's young


[Abouzeid, Azza et al 2009]

Hadoop as a Data Warehouse
• Cheetah
• Hive

• Typical DW relation-like schemas
• ... But not exactly
• They call it virtual views


• Virtual views consist of columns that can be queried
• Everything inside is entirely denormalized
• Append-only design and slowly changing dimensions
• Proprietary

• A data warehousing solution built by Facebook
• For Big data analysis:
• in 2010 (4 years ago!), 30+ PB
• Has its own data model
• HiveQL: a declarative SQL-like language for ad-hoc querying

STATUS_UPDATES(userid int, status string, ds string)
PROFILES(userid int, school string, gender int)

LOAD DATA LOCAL INPATH 'logs/status_updates'
INTO TABLE status_updates
PARTITION (ds='2009-03-20')

FROM
  (SELECT a.status, b.school, b.gender
   FROM status_updates a JOIN profiles b
   ON (a.userid = b.userid and a.ds='2009-03-20')) subq1
INSERT OVERWRITE TABLE gender_summary
  PARTITION (ds='2009-03-20')
  SELECT subq1.gender, count(1)
  GROUP BY subq1.gender
INSERT OVERWRITE TABLE school_summary
  PARTITION (ds='2009-03-20')
  SELECT subq1.school, count(1)
  GROUP BY subq1.school
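
To make the shape of this query concrete, here is a plain-Python sketch of what it computes: the join is scanned once and feeds two aggregates, one per INSERT clause. The rows below are invented sample data for a single partition, and the dictionaries stand in for the Hive tables.

```python
from collections import Counter

# Invented sample rows for one partition (ds='2009-03-20').
status_updates = [  # (userid, status)
    (1, "hello"), (2, "at school"), (1, "again"), (3, "hi"),
]
profiles = {  # userid -> (school, gender)
    1: ("ULB", 0), 2: ("MIT", 1), 3: ("ULB", 1),
}

gender_summary = Counter()
school_summary = Counter()

# One pass over the joined rows (subq1) feeds both aggregates,
# mirroring the two INSERT clauses that share a single FROM.
for userid, status in status_updates:
    if userid in profiles:  # inner join on userid
        school, gender = profiles[userid]
        gender_summary[gender] += 1
        school_summary[school] += 1

print(dict(gender_summary))  # {0: 2, 1: 2}
print(dict(school_summary))  # {'ULB': 3, 'MIT': 1}
```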

REDUCE subq2.school, subq2.meme, subq2.cnt
  USING 'top10.py' AS (school, meme, cnt)
FROM (
  SELECT subq1.school, subq1.meme, count(1) as cnt
  FROM (
    MAP b.school, a.status
      USING 'meme_extractor.py'
      AS (school, meme)
    FROM status_updates a JOIN profiles b
    ON (a.userid = b.userid)) subq1
  GROUP BY subq1.school, subq1.meme
  DISTRIBUTE BY school, meme
  SORT BY school, meme, cnt desc)
) subq2
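
The slides do not show the scripts themselves. As a rough idea of what a reducer script such as 'top10.py' might do, here is a hypothetical sketch. It assumes rows arrive as tab-separated text lines (the usual convention for Hive's script operators) and that all rows for a school reach the same script instance; the function name `top_n` is invented.

```python
from collections import defaultdict
from heapq import nlargest

def top_n(lines, n=10):
    """Yield the n highest-count 'school<TAB>meme<TAB>cnt' rows per school."""
    by_school = defaultdict(list)
    for line in lines:
        school, meme, cnt = line.rstrip("\n").split("\t")
        by_school[school].append((int(cnt), meme))
    for school in sorted(by_school):
        for cnt, meme in nlargest(n, by_school[school]):
            yield f"{school}\t{meme}\t{cnt}"

# When run as a script, the same function would be fed from standard input:
#     for row in top_n(sys.stdin): print(row)
sample = ["ULB\tcats\t3", "ULB\tdogs\t5", "MIT\tcats\t1"]
result = list(top_n(sample, n=1))
print(result)  # ['MIT\tcats\t1', 'ULB\tdogs\t5']
```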

Hadoop + Data Warehouse
• Hadoop and Data Warehouses can co-exist
• DW: OLAP, BI, transactional data
• Hadoop: Raw, unstructured data

• Extract: load to HDFS, parse, prepare
• Run some analysis
• Transform: clean data and transform to some structured format
• with MapReduce
• Load: extract from HDFS, load to DW
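
A small sketch of the transform step, run locally rather than as a MapReduce job. The 'date|customer|text' input format, the field names, and the cleaning rules are all assumptions made for illustration; the output is CSV of the kind a warehouse bulk loader could consume.

```python
import csv
import io

def transform(raw_lines):
    """Clean raw 'date|customer|free text' lines into structured rows.

    Malformed lines are skipped rather than loaded.
    """
    for line in raw_lines:
        parts = [p.strip() for p in line.split("|")]
        if len(parts) != 3 or not all(parts):
            continue
        date, customer, text = parts
        yield {"date": date, "customer": customer, "text": text.lower()}

# Invented raw input, including one line that cannot be parsed.
raw = [
    "2013-12-15 | c42 |  Great SERVICE ",
    "garbage line",
    "2013-12-16 | c7 | Line was busy",
]
rows = list(transform(raw))

# Write the cleaned rows as CSV, ready to be loaded into the DW.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["date", "customer", "text"])
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```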

ETL: examples
• Text processing
• Call center records analysis
• extract sentiment
• link to profile
• which customers are more important to keep?
• Image processing

Active Storage
• Don't delete the data after processing
• Hadoop storage is cheap: it can store anything
• Run more analysis when needed
• For example: extract new keywords/features from the old dataset

Active Storage - 2
• Up to 80% of data is dormant (or cold)
• Hadoop storage can be way cheaper than high-cost data management
• Move this data to Hadoop
• When needed, quickly analyze it there or move it back to the DW


Analytical Sandbox
• What are we looking for in this data?
• No structure, so it is hard to know
• Run ad-hoc Hive queries to see what's there

• Hadoop is becoming more and more popular
• Many companies plan to adopt
• Best used with existing DW solutions
• as an ETL tool
• as Active Storage
• as Analytical Sandbox

1. Lee, Kyong-Ha, et al. "Parallel data processing with MapReduce: a survey." ACM SIGMOD Record 40.4 (2012): 11-20.
2. "MapReduce vs Data Warehouse". Webpage, [link]. Accessed 15/12/2013.
3. Ordonez, Carlos, Il-Yeol Song, and Carlos Garcia-Alvarado. "Relational versus non-relational database systems for
data warehousing." Proceedings of the ACM 13th international workshop on Data warehousing and OLAP. ACM, 2010.
4. A. Awadallah, D. Graham. "Hadoop and the Data Warehouse: When to Use Which." (2011). [pdf] (by Cloudera and
5. Thusoo, Ashish, et al. "Hive: a warehousing solution over a map-reduce framework." Proceedings of the VLDB
Endowment 2.2 (2009): 1626-1629. [pdf]
6. Chen, Songting. "Cheetah: a high performance, custom data warehouse on top of MapReduce." Proceedings of the
VLDB Endowment 3.1-2 (2010): 1459-1468. [pdf]

7. "How (and Why) Hadoop is Changing the Data Warehousing Paradigm." Webpage [link]. Accessed 15/12/2013.
8. P. Russom. "Integrating Hadoop into Business Intelligence and Data Warehousing." (2013). [pdf]
9. M. Ferguson. "Offloading and Accelerating Data Warehouse ETL Processing Using Hadoop." [pdf]
10. Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: simplified data processing on large clusters." Communications of
the ACM 51.1 (2008): 107-113. [pdf]
11. "What is Hadoop?" Webpage [link]. Accessed 15/12/2013.
12. Apache Hadoop project home page, url: [link].
13. Apache HBase home page, [link].
14. Apache Mahout home page, [link].
15. "How Hadoop Cuts Big Data Costs" [link]. Accessed 05/01/2014.
16. "The Impact of Data Temperature on the Data Warehouse." whitepaper by Teradata (2012). [pdf]
17. Abouzeid, Azza, et al. "HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical
workloads." Proceedings of the VLDB Endowment 2.1 (2009): 922-933. [pdf]
Thank you
Prepared with Shower

FIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptx
FIDO Alliance
Generative AI Reasoning Tech Talk - July 2024
Generative AI Reasoning Tech Talk - July 2024Generative AI Reasoning Tech Talk - July 2024
Generative AI Reasoning Tech Talk - July 2024
The History of Embeddings & Multimodal Embeddings
The History of Embeddings & Multimodal EmbeddingsThe History of Embeddings & Multimodal Embeddings
The History of Embeddings & Multimodal Embeddings
Perth MuleSoft Meetup July 2024
Perth MuleSoft Meetup July 2024Perth MuleSoft Meetup July 2024
Perth MuleSoft Meetup July 2024
Michael Price
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan..."Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...
What's New in Copilot for Microsoft 365 June 2024.pptx
What's New in Copilot for Microsoft 365 June 2024.pptxWhat's New in Copilot for Microsoft 365 June 2024.pptx
What's New in Copilot for Microsoft 365 June 2024.pptx
Stephanie Beckett
UiPath Community Day Amsterdam: Code, Collaborate, Connect
UiPath Community Day Amsterdam: Code, Collaborate, ConnectUiPath Community Day Amsterdam: Code, Collaborate, Connect
UiPath Community Day Amsterdam: Code, Collaborate, Connect
Generative AI technology is a fascinating field that focuses on creating comp...
Generative AI technology is a fascinating field that focuses on creating comp...Generative AI technology is a fascinating field that focuses on creating comp...
Generative AI technology is a fascinating field that focuses on creating comp...
Nohoax Kanont
Choosing the Best Outlook OST to PST Converter: Key Features and Considerations
Choosing the Best Outlook OST to PST Converter: Key Features and ConsiderationsChoosing the Best Outlook OST to PST Converter: Key Features and Considerations
Choosing the Best Outlook OST to PST Converter: Key Features and Considerations
webbyacad software
Keynote : AI & Future Of Offensive Security
Keynote : AI & Future Of Offensive SecurityKeynote : AI & Future Of Offensive Security
Keynote : AI & Future Of Offensive Security
Priyanka Aash
FIDO Munich Seminar: FIDO Tech Principles.pptx
FIDO Munich Seminar: FIDO Tech Principles.pptxFIDO Munich Seminar: FIDO Tech Principles.pptx
FIDO Munich Seminar: FIDO Tech Principles.pptx
FIDO Alliance
FIDO Munich Seminar: Strong Workforce Authn Push & Pull Factors.pptx
FIDO Munich Seminar: Strong Workforce Authn Push & Pull Factors.pptxFIDO Munich Seminar: Strong Workforce Authn Push & Pull Factors.pptx
FIDO Munich Seminar: Strong Workforce Authn Push & Pull Factors.pptx
FIDO Alliance
Scaling Vector Search: How Milvus Handles Billions+
Scaling Vector Search: How Milvus Handles Billions+Scaling Vector Search: How Milvus Handles Billions+
Scaling Vector Search: How Milvus Handles Billions+

Recently uploaded (20)

Increase Quality with User Access Policies - July 2024
Increase Quality with User Access Policies - July 2024Increase Quality with User Access Policies - July 2024
Increase Quality with User Access Policies - July 2024
FIDO Munich Seminar: Securing Smart Car.pptx
FIDO Munich Seminar: Securing Smart Car.pptxFIDO Munich Seminar: Securing Smart Car.pptx
FIDO Munich Seminar: Securing Smart Car.pptx
Mastering Board Best Practices: Essential Skills for Effective Non-profit Lea...
Mastering Board Best Practices: Essential Skills for Effective Non-profit Lea...Mastering Board Best Practices: Essential Skills for Effective Non-profit Lea...
Mastering Board Best Practices: Essential Skills for Effective Non-profit Lea...
Retrieval Augmented Generation Evaluation with Ragas
Retrieval Augmented Generation Evaluation with RagasRetrieval Augmented Generation Evaluation with Ragas
Retrieval Augmented Generation Evaluation with Ragas
History and Introduction for Generative AI ( GenAI )
History and Introduction for Generative AI ( GenAI )History and Introduction for Generative AI ( GenAI )
History and Introduction for Generative AI ( GenAI )
Zaitechno Handheld Raman Spectrometer.pdf
Zaitechno Handheld Raman Spectrometer.pdfZaitechno Handheld Raman Spectrometer.pdf
Zaitechno Handheld Raman Spectrometer.pdf
FIDO Munich Seminar In-Vehicle Payment Trends.pptx
FIDO Munich Seminar In-Vehicle Payment Trends.pptxFIDO Munich Seminar In-Vehicle Payment Trends.pptx
FIDO Munich Seminar In-Vehicle Payment Trends.pptx
FIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptx
FIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptxFIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptx
FIDO Munich Seminar: Biometrics and Passkeys for In-Vehicle Apps.pptx
Generative AI Reasoning Tech Talk - July 2024
Generative AI Reasoning Tech Talk - July 2024Generative AI Reasoning Tech Talk - July 2024
Generative AI Reasoning Tech Talk - July 2024
The History of Embeddings & Multimodal Embeddings
The History of Embeddings & Multimodal EmbeddingsThe History of Embeddings & Multimodal Embeddings
The History of Embeddings & Multimodal Embeddings
Perth MuleSoft Meetup July 2024
Perth MuleSoft Meetup July 2024Perth MuleSoft Meetup July 2024
Perth MuleSoft Meetup July 2024
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan..."Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...
What's New in Copilot for Microsoft 365 June 2024.pptx
What's New in Copilot for Microsoft 365 June 2024.pptxWhat's New in Copilot for Microsoft 365 June 2024.pptx
What's New in Copilot for Microsoft 365 June 2024.pptx
UiPath Community Day Amsterdam: Code, Collaborate, Connect
UiPath Community Day Amsterdam: Code, Collaborate, ConnectUiPath Community Day Amsterdam: Code, Collaborate, Connect
UiPath Community Day Amsterdam: Code, Collaborate, Connect
Generative AI technology is a fascinating field that focuses on creating comp...
Generative AI technology is a fascinating field that focuses on creating comp...Generative AI technology is a fascinating field that focuses on creating comp...
Generative AI technology is a fascinating field that focuses on creating comp...
Choosing the Best Outlook OST to PST Converter: Key Features and Considerations
Choosing the Best Outlook OST to PST Converter: Key Features and ConsiderationsChoosing the Best Outlook OST to PST Converter: Key Features and Considerations
Choosing the Best Outlook OST to PST Converter: Key Features and Considerations
Keynote : AI & Future Of Offensive Security
Keynote : AI & Future Of Offensive SecurityKeynote : AI & Future Of Offensive Security
Keynote : AI & Future Of Offensive Security
FIDO Munich Seminar: FIDO Tech Principles.pptx
FIDO Munich Seminar: FIDO Tech Principles.pptxFIDO Munich Seminar: FIDO Tech Principles.pptx
FIDO Munich Seminar: FIDO Tech Principles.pptx
FIDO Munich Seminar: Strong Workforce Authn Push & Pull Factors.pptx
FIDO Munich Seminar: Strong Workforce Authn Push & Pull Factors.pptxFIDO Munich Seminar: Strong Workforce Authn Push & Pull Factors.pptx
FIDO Munich Seminar: Strong Workforce Authn Push & Pull Factors.pptx
Scaling Vector Search: How Milvus Handles Billions+
Scaling Vector Search: How Milvus Handles Billions+Scaling Vector Search: How Milvus Handles Billions+
Scaling Vector Search: How Milvus Handles Billions+

Hadoop in Data Warehousing

  • 1. 1 INFO-H-419: Data Warehouses project Hadoop in Data Warehousing by Alexey Grigorev
  • 2. 2 Hadoop: In this Presentation 1. Introduction 2. Origins 3. MapReduce 4. Hadoop as MapReduce Implementation 5. Data Warehouse on Hadoop 6. Hadoop and Data Warehousing 7. Conclusions
  • 3. 3 Why? • Lots of data • How to deal with it? • Hadoop to the rescue! • When to use? • When not to use? • Curiosity
  • 4. 4 MapReduce: Origins • Functional Programming • Higher-order functions to operate on lists • map • apply to each element of the list • reduce = fold = accumulate • aggregate a list and produce one value of output • No side effects
  • 5. 5 MapReduce: Origins • (define (+1 el) (+ el 1)) • (map +1 (list 1 2 3)) ⇒ (list 2 3 4) • (reduce + 0 (list 2 3 4)) ⇒ 9 • (reduce + 0 (map +1 (list 1 2 3))) ⇒ 9
  • 6. 6 MapReduce: Origins • These functions do not have side effects • And can be parallelized easily • Can split the input data into chunks: • (list 1 2 3 4) ⇒ (list 1 2) and (list 3 4) • Apply map to each chunk separately, and then combine (reduce) them together
  • 7. 7 MapReduce: Origins • Mapping separately: • (define res1 (reduce + 0 (map +1 (list 1 2)))) • (reduce + res1 (map +1 (list 3 4))) • This is the same as (reduce + 0 (map +1 (list 1 2 3 4))) • Note that for reduce the function must be additive
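The Lisp-style examples on the Origins slides carry over to Python almost one to one. The following is a minimal sketch, assuming `functools.reduce` as a stand-in for the slides' `reduce` and a hypothetical helper `plus_one` for `+1`. It checks that reducing two chunks separately, with the first partial result passed as the initial value for the second, gives the same answer as reducing the whole list.

```python
from functools import reduce
from operator import add


def plus_one(el):
    """Counterpart of the slides' (define (+1 el) (+ el 1))."""
    return el + 1


# map applies the function to each element of the list.
mapped = list(map(plus_one, [1, 2, 3]))  # [2, 3, 4]

# reduce (fold) aggregates the list into one value, starting from 0.
total = reduce(add, mapped, 0)  # 9

# Split the input into two chunks and process them separately.
res1 = reduce(add, map(plus_one, [1, 2]), 0)
chunked = reduce(add, map(plus_one, [3, 4]), res1)

# The same computation on the unsplit list.
whole = reduce(add, map(plus_one, [1, 2, 3, 4]), 0)

print(mapped, total, chunked, whole)
```

The chunked and unsplit results agree because addition lets partial sums be combined in any grouping; a reducing function without that property would not split this way.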
  • 8. 8 MapReduce • A map function • takes a key-value pair (in_key, in_val) • produces zero or more key-value pairs: intermediate results • intermediate results are grouped by key • A reduce function • for each group in the intermediate results • aggregates and produces the final output
  • 9. 9 MapReduce Stages • Each MapReduce job is executed in 3 stages • map stage: apply map to each key-value pair • group together the intermediate results by key • reduce stage: apply reduce to each group
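The three stages can be sketched on a single machine in a few lines of Python. This is only an illustration of the data flow, not how Hadoop implements it; the function name `run_mapreduce` and the sales example (`sales_map`, `sales_reduce`, the `orders` data) are made up for this sketch.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Run one MapReduce job over (key, value) records in a single process."""
    # Map stage: each input pair yields zero or more intermediate pairs.
    intermediate = []
    for in_key, in_val in records:
        intermediate.extend(map_fn(in_key, in_val))

    # Group stage: collect the intermediate values by key.
    groups = defaultdict(list)
    for out_key, out_val in intermediate:
        groups[out_key].append(out_val)

    # Reduce stage: aggregate each group into the final output.
    return {key: reduce_fn(key, vals) for key, vals in groups.items()}


# Hypothetical job: total sales amount per city.
def sales_map(order_id, order):
    city, amount = order
    yield city, amount


def sales_reduce(city, amounts):
    return sum(amounts)


orders = [(1, ("Brussels", 10)), (2, ("Ghent", 5)), (3, ("Brussels", 7))]
totals = run_mapreduce(orders, sales_map, sales_reduce)
print(totals)
```

In a real cluster the map and reduce calls run on different nodes and the grouping step moves data between them; here everything happens in one loop so the stages are easy to follow.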
  • 11. 11 Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aenean dictum justo est, quis sagittis leo tincidunt sit amet. Donec scelerisque rutrum quam non sagittis. Phasellus sem nisi, cursus eu lacinia eu, tempor ac eros. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. In mollis elit quis orci congue, quis aliquet mauris mollis. Interdum et malesuada fames ac ante ipsum primis in faucibus. Proin euismod non quam vitae pretium. Quisque vel nisl et leo volutpat rhoncus quis ac eros. Sed lacus tellus, aliquam non ullamcorper in, dictum at magna. Vestibulum consequat egestas lacinia. Proin tempus rhoncus mi, et lacinia elit ornare auctor. Sed sagittis euismod massa ut posuere. Interdum et malesuada fames ac ante ipsum primis in faucibus. Duis fringilla dolor ornare mi dictum ornare.
  • 12. 12 MapReduce Example
      def map(String input_key, String doc):
          for each word w in doc:
              EmitIntermediate(w, 1)

      def reduce(String output_key, Iterator output_vals):
          int res = 0
          for each v in output_vals:
              res += v
          Emit(res)
  • 13. 13 MapReduce Example • map stage: output 1 for each word w • group a list of (w, 1) pairs into (w, [1, 1, ..., 1]) • reduce stage: for each w calculate how many ones there are
  • 14. 14 MapReduce Example: Result • amet: 2 • ante: 2 • aptent: 1 • consectetur: 1 • dictum: 3 • dolor: 2 • elit: 3 • ...
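The word-count example can be tried out in plain Python. The sketch below mirrors the slide's `map` and `reduce` pseudocode under a few assumptions: words are lower-cased runs of the letters a to z, the grouping is done in memory, and the names `map_words`, `reduce_counts`, and `word_count` are chosen here for illustration. The input is only the first two sentences of the sample text, so the counts are smaller than the full result on the slide.

```python
import re
from collections import defaultdict


def map_words(input_key, doc):
    """Emit (w, 1) for each word w in the document."""
    for w in re.findall(r"[a-z]+", doc.lower()):
        yield w, 1


def reduce_counts(output_key, output_vals):
    """Count how many ones were grouped under this word."""
    res = 0
    for v in output_vals:
        res += v
    return res


def word_count(docs):
    """Map every document, group the pairs by word, then reduce each group."""
    groups = defaultdict(list)
    for key, doc in docs.items():
        for w, one in map_words(key, doc):
            groups[w].append(one)
    return {w: reduce_counts(w, ones) for w, ones in sorted(groups.items())}


sample = {
    "slide11": "Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
               "Aenean dictum justo est, quis sagittis leo tincidunt sit amet."
}
counts = word_count(sample)
print(counts["amet"], counts["sit"], counts["dolor"])
```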
  • 16. 16 “Hadoop ... is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.”
  • 17. 17 Hadoop • Open Source implementation of MapReduce • "Hadoop": • HDFS • Hadoop MapReduce • HBase • Hive • ... many others
  • 18. 18 Hadoop Cluster: Terminology • Name Node: orchestrates the process • Workers: nodes that do the computation • Mappers do the map phase • Reducers do the reduce phase
  • 27. 27 Fault-Tolerance ≈ Load-Balancing • No execution plan • Node done ⇒ another task assigned • Node failed ⇒ task reassigned • No communication costs
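Why fault tolerance and load balancing end up being nearly the same mechanism can be seen in a toy simulation. This is a simplified sketch, not Hadoop's actual scheduler: workers take turns asking for work in round-robin order, and `fails_on` is a hypothetical set of (worker, task) pairs that fail once.

```python
from collections import deque


def run_tasks(tasks, workers, fails_on):
    """Hand tasks to workers one at a time; re-queue a task if its worker fails."""
    pending = deque(tasks)
    remaining_failures = set(fails_on)
    done_by = {}
    turn = 0
    while pending:
        worker = workers[turn % len(workers)]  # next worker asking for work
        turn += 1
        task = pending.popleft()
        if (worker, task) in remaining_failures:
            # Node failed: put the task back so another node can take it.
            remaining_failures.discard((worker, task))
            pending.append(task)
            continue
        # Node done: record the result; the loop then assigns another task.
        done_by[task] = worker
    return done_by


result = run_tasks(["t1", "t2", "t3"], ["w1", "w2", "w3"], {("w2", "t2")})
print(result)
```

There is no up-front execution plan in this loop: a finished worker and a failed worker are handled by the same queue, which is the point the slide makes.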
  • 28. 28 Advantages • Simple, especially for programmers who know FP • Fault tolerant • No schema, can process any data • Flexible • Cheap and runs on commodity hardware
  • 29. 29 Disadvantages • No declarative high-level language like SQL • Performance issues: • Map and Reduce are blocking • Name Node: single point of failure • It's young
  • 31. 31 Hadoop as a Data Warehouse • Cheetah • Hive
  • 32. 32 Cheetah • Typical DW relation-like schemas • ... But not exactly • They call it virtual views
  • 34. 34 Cheetah • Virtual views consist of columns that can be queried • Everything inside is entirely denormalized • Append-only design and slowly changing dimensions • Proprietary
  • 35. 35 Hive • A data warehousing solution built by Facebook • For Big data analysis: • in 2010 (4 years ago!), 30+ PB • Has its own data model • HiveQL: a declarative SQL-like language for ad-hoc querying
  • 36. 36 HiveQL Tables
      STATUS_UPDATES(userid int, status string, ds string)
      PROFILES(userid int, school string, gender int)

      LOAD DATA LOCAL INPATH 'logs/status_updates'
      INTO TABLE status_updates
      PARTITION (ds='2009-03-20')
  • 37. 37 HiveQL
      FROM
        (SELECT a.status, b.school, b.gender
         FROM status_updates a JOIN profiles b
         ON (a.userid = b.userid and a.ds='2009-03-20')) subq1
      INSERT OVERWRITE TABLE gender_summary
        PARTITION (ds='2009-03-20')
        SELECT subq1.gender, count(1)
        GROUP BY subq1.gender
      INSERT OVERWRITE TABLE school_summary
        PARTITION (ds='2009-03-20')
        SELECT subq1.school, count(1)
        GROUP BY subq1.school
  • 39. 39 HiveQL
      REDUCE subq2.school, subq2.meme, subq2.cnt
      USING 'top10.py' AS (school, meme, cnt)
      FROM (
        SELECT subq1.school, subq1.meme, count(1) as cnt
        FROM
          (MAP b.school, a.status
           USING 'meme_extractor.py'
           AS (school, meme)
           FROM status_updates a JOIN profiles b
           ON (a.userid = b.userid)) subq1
        GROUP BY subq1.school, subq1.meme
        DISTRIBUTE BY school, meme
        SORT BY school, meme, cnt desc
      ) subq2
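The slides do not show the contents of the scripts named in this query. As a rough idea of what a reducer such as `top10.py` might do, here is a hedged sketch: Hive passes rows to such scripts as tab-separated lines on standard input and reads result rows from standard output, so the core logic is a function over lines. The function name `top_memes`, the in-memory grouping, and the sample rows are all assumptions made for illustration.

```python
from collections import defaultdict
from heapq import nlargest


def top_memes(lines, n=10):
    """Yield the n most frequent memes per school from tab-separated rows."""
    by_school = defaultdict(list)
    for line in lines:
        school, meme, cnt = line.rstrip("\n").split("\t")
        by_school[school].append((int(cnt), meme))
    for school in sorted(by_school):
        # nlargest orders by count first, then by meme name for ties.
        for cnt, meme in nlargest(n, by_school[school]):
            yield f"{school}\t{meme}\t{cnt}"


# Made-up sample rows; in a real script `lines` would be sys.stdin
# and each yielded row would be printed to standard output.
rows = ["ULB\tcats\t3", "ULB\tdogs\t5", "VUB\tcats\t1"]
top1 = list(top_memes(rows, n=1))
print(top1)
```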
  • 41. 41 Hadoop + Data Warehouse • Hadoop and Data Warehouses can co-exist • DW: OLAP, BI, transactional data • Hadoop: Raw, unstructured data
  • 42. 42 ETL • Extract: load to HDFS, parse, prepare • Run some analysis • Transform: clean data and transform to some structured format • with MapReduce • Load: extract from HDFS, load to DW
  • 43. 43 ETL: examples • Text processing • Call center records analysis • extract sentiment • link to profile • which customers are more important to keep? • Image processing
  • 44. 44 Active Storage • Don't delete the data after processing • Hadoop storage is cheap: it can store anything • Run more analysis when needed • Like: extract new keywords/features from the old dataset
  • 45. 45 Active Storage - 2 • Up to 80% of data is dormant (or cold) • Hadoop storage can be far cheaper than high-cost data management solutions • Move this data to Hadoop • When needed, quickly analyze it there or move it back to the DW
  • 49. 49 Analytical Sandbox • What are we looking for in this data? • No structure - hard to know • Run ad-hoc Hive queries to see what's there
  • 50. 50 Conclusions • Hadoop is becoming more and more popular • Many companies plan to adopt it • Best used with existing DW solutions • as an ETL tool • as Active Storage • as an Analytical Sandbox