The document provides an overview of buffer overflow vulnerabilities including:
1) It explains how buffer overflows work by writing more data to a buffer than it was allocated to hold, overwriting adjacent memory.
2) An example is given of a function that is vulnerable to buffer overflow by copying user input into a fixed-size buffer without checks.
3) It shows how, by passing too much input data, the buffer can be overflowed and the return address on the stack overwritten to point to the injected data instead of the intended return location.
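As a minimal illustration of that unchecked-copy mistake (my own sketch, not code from the document, and in Python via ctypes rather than the C the summary implies):

import ctypes

buf = ctypes.create_string_buffer(8)   # fixed-size 8-byte buffer
payload = b"A" * 64                    # "user input" far larger than the buffer

# unchecked copy, like strcpy() in C: writes 64 bytes into an 8-byte buffer,
# corrupting adjacent memory -- expect a crash or undefined behavior
ctypes.memmove(buf, payload, len(payload))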
Trading volume mapping R in recent environment (Nagi Teramo)
This document discusses visualizing trading volume data using R. It includes:
1. Loading and manipulating trading volume data from 2012 using data.table and dplyr packages to summarize the monthly total trading amount by country.
2. Creating an interactive choropleth map visualization of the monthly trading data over time using the rMaps package. Custom HTML and JavaScript are used to animate the map updating each month.
3. Providing the full codes in a GitHub repository for others to reproduce the analysis and visualization.
The Terror-Free Guide to Introducing Functional Scala at Work (Jorge Vásquez)
Too often, our applications are dominated by boilerplate that's not fun to write or test, and that makes our business logic complicated. In object-oriented programming, classes and interfaces help us with abstraction to reduce boilerplate. But, in functional programming, we use type classes.
Historically, type classes in functional programming have been very complex and confusing, partially because they import ideas from Haskell that don't make sense in Scala, and partially because of their esoteric origins in category theory.
In this presentation, Jorge Vásquez presents a new library called ZIO Prelude, which offers a distinctly Scala take on Functional Abstractions, and you will learn how you can eliminate common types of boilerplate by using it.
Come see how you can improve your happiness and productivity with a new take on what it means to do functional programming in Scala!
This talk is about using Hive in practice. We will go through some of the specific use cases for which Hive is currently being used at Last.fm, highlighting its strengths and weaknesses along the way.
This document summarizes a presentation given by Diane Mueller from ActiveState and Dr. Mike Müller from Python Academy. It compares MATLAB and Python capabilities for scientific computing. Python has many libraries like NumPy, SciPy, IPython and matplotlib that provide similar functionality to MATLAB. Together these are often called "Pylab". The presentation provides an overview of Python, NumPy arrays, visualization with matplotlib, and integrating Python with other languages.
This is part of an introductory course on Big Data Tools for Artificial Intelligence. These slides introduce students to the new in-memory cluster computing framework, Spark.
HyperLogLog in Hive - How to count sheep efficiently? (bzamecnik)
This document discusses using HyperLogLog (HLL) in Hive to efficiently estimate the number of unique elements or cardinality in big datasets. It describes how HLL provides fast approximate counting using probabilistic data structures. It covers implementing HLL as user-defined functions in Hive, comparing different open source implementations, and examples of using HLL to estimate unique visitors per day and in a rolling window.
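For intuition, here is a minimal HyperLogLog sketch in Python (my own illustration, not one of the Hive UDF implementations the document compares; it omits the small- and large-range bias corrections that real implementations apply):

import hashlib

P = 12                                 # precision: 2^12 = 4096 registers
M = 1 << P
ALPHA = 0.7213 / (1 + 1.079 / M)       # bias-correction constant for large M

def new_sketch():
    return [0] * M

def add(registers, value):
    h = int(hashlib.sha1(str(value).encode()).hexdigest()[:16], 16)  # 64-bit hash
    idx = h & (M - 1)                  # low P bits choose a register
    rest = h >> P                      # remaining 64 - P bits
    rank = (64 - P) - rest.bit_length() + 1   # leading zeros + 1
    registers[idx] = max(registers[idx], rank)

def estimate(registers):
    # normalized harmonic mean of 2^register over all registers
    return ALPHA * M * M / sum(2.0 ** -r for r in registers)

sk = new_sketch()
for i in range(100_000):
    add(sk, i)
print(round(estimate(sk)))             # close to 100000 (~1.6% typical error)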
Big Data Day LA 2015 - Large Scale Distinct Count -- The HyperLogLog algorith... (Data Con LA)
"At OpenX we not only use the tools in big data ecosystems to solve our business problems, but also explore the cutting edge algorithms for practical uses. HyperLogLog is one of the algorithm that we use intensively in our internal system. It has really low computation cost and can easily plug into map-reduce framework (hadoop or spark). Some of the applications that worth to highlight are:
* high cardinality test
* distinct count of unique users over time
* Visualize hyperloglog for fraud detection"
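The map-reduce friendliness mentioned above comes from mergeability: two register arrays built over disjoint data partitions combine losslessly by elementwise max, so a rolling-window distinct count is just a merge of daily sketches. A sketch, continuing the toy implementation style above:

def merge(a, b):
    # the max of two HLL register arrays is exactly the sketch of the union
    return [max(x, y) for x, y in zip(a, b)]

# rolling 7-day unique users = merge of the last 7 daily sketches,
# then a single estimate() over the merged registers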
This document describes the Pig system, which is a high-level data flow system built on top of MapReduce. Pig provides a language called Pig Latin for analyzing large datasets. Pig Latin programs are compiled into MapReduce jobs. The compilation process involves several steps: (1) parsing and type checking the Pig Latin code, (2) logical optimization, (3) converting the logical plan into physical operators like GROUP and JOIN, (4) mapping the physical operators to MapReduce stages, and (5) optimizing the MapReduce plan. This allows users to write data analysis programs more declaratively without coding MapReduce jobs directly.
Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of commodity hardware. It implements Google's MapReduce programming model and the Hadoop Distributed File System (HDFS) for reliable data storage. Key components include a JobTracker that coordinates jobs, TaskTrackers that run tasks on worker nodes, and a NameNode that manages the HDFS namespace and DataNodes that store application data. The framework provides fault tolerance, parallelization, and scalability.
Build a Big Data solution using DB2 for z/OS (Jane Man)
The document discusses building a Big Data solution using IBM DB2 for z/OS and IBM BigInsights. It provides an overview of new functions in DB2 11 that allow DB2 applications to access and analyze data stored in Hadoop. Specifically, it describes the JAQL_SUBMIT and HDFS_READ functions that enable submitting analytic jobs to BigInsights from DB2 and reading the results back into DB2. Examples are provided that show an integrated workflow of submitting a JAQL query to BigInsights from DB2, reading the results into a DB2 table, and querying the results. Potential use cases for integrating DB2 and BigInsights are also outlined.
The document discusses Hadoop and Spark frameworks for big data analytics. It describes that Hadoop consists of HDFS for distributed storage and MapReduce for distributed processing. Spark is faster than MapReduce for iterative algorithms and interactive queries since it keeps data in-memory. While MapReduce is best for one-pass batch jobs, Spark performs better for iterative jobs that require multiple passes over datasets.
Virtualization and Open Virtualization Format (OVF) (rajsandhu1989)
This document discusses virtualization and its role as the backbone of cloud computing. It defines virtualization as the creation of virtual versions of hardware platforms, operating systems, storage devices and network resources. The document outlines different types of virtualization including hardware/server virtualization, storage virtualization, network virtualization, and desktop virtualization. It describes how server virtualization works using hypervisors to divide physical servers into multiple virtual machines. The benefits of virtualization discussed include resource sharing, load balancing, easier backup and recovery, and scalability.
Forrester predicts that CIOs who are late to the Hadoop game will finally make the platform a priority in 2015. Hadoop has evolved into a must-know technology and has opened up better career, salary and job opportunities for many professionals.
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard) (npinto)
This document provides an introduction and overview of Hadoop, an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It outlines what Hadoop is, how its core components MapReduce and HDFS work, advantages like scalability and fault tolerance, disadvantages like complexity, and resources for getting started with Hadoop installations and programming.
Hadoop is a software framework that allows for distributed processing of large data sets across clusters of computers. It includes MapReduce for distributed computing, HDFS for storage, and runs efficiently on large clusters by distributing data and processing across nodes. Example applications include log analysis, machine learning, and sorting 1TB of data in under a minute. It is fault-tolerant, scalable, and designed for processing vast amounts of data in a reliable and cost-effective manner.
Cloud Deployments with Apache Hadoop and Apache HBase (DATAVERSITY)
The document discusses cloud deployments of Apache Hadoop and Apache HBase. It begins by introducing the speaker and their background with Cloudera and various Apache projects, then provides an overview of Cloudera and what they do. The majority of the document covers Apache Hadoop and Apache HBase: what they are, and how they are open source and horizontally scalable. It also discusses deploying a Hadoop and HBase cluster on Amazon EC2 using Apache Whirr to provision the machines. Real-world examples of using these technologies include building a web index for a search engine.
Social Data Analytics using IBM Big Data Technologies (Nicolas Morales)
Distilling Insights from Social Media Using Big Data Technologies
Have you ever wondered what your customers are saying about you in Social media, and the impact it might be having on your business? This session will focus on how BigInsights and Big Data technologies can be used to glean useful and actionable insights from social media data.
You'll see how data can be ingested and prepped, and how to do text analytics on social data in real time. Using Hadoop, we'll show you how you can store and analyze your large volume of historical social media data and reference data. This talk and demo will provide an introduction to text analytics and how it is used within the IBM Big Data platform for a social media solution.
This document discusses machine learning algorithms and their suitability for parallelization using MapReduce. It begins by introducing machine learning and different types of algorithms, including supervised, unsupervised, and reinforcement learning. It then discusses how several common machine learning algorithms can be expressed using the MapReduce framework. Single-pass algorithms like language modeling and naive Bayes classification are well-suited since they involve extracting statistics from each data point. Iterative algorithms can also be parallelized by chaining multiple MapReduce jobs, though the map tasks may need access to global parameters. Overall, the document analyzes how different machine learning algorithms map to the data processing patterns of MapReduce.
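To make the single-pass point concrete, here is a toy sketch (mine, not the document's) of naive Bayes training expressed in the map/reduce shape: the mapper extracts (label, word) statistics from each document independently, and the reducer only needs addition.

from collections import defaultdict

docs = [("spam", "cheap pills now"), ("ham", "meeting at noon"), ("spam", "cheap meeting")]

def mapper(label, text):
    # each document is processed independently: ideal for a single map pass
    for word in text.split():
        yield (label, word), 1

def reducer(pairs):
    # the reduce side is purely additive, so it parallelizes trivially
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return counts

counts = reducer(kv for label, text in docs for kv in mapper(label, text))
# counts[("spam", "cheap")] == 2, etc.; class-conditional probabilities
# then follow by normalizing the counts per label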
Big data relates to large and complex datasets that are difficult to analyze using traditional methods. It refers to datasets that are too large to be handled by typical database management tools. Big data can help organizations make better evidence-based decisions by analyzing structured and unstructured data from a variety of sources using specialized analytical techniques.
Boris Lublinsky and Alexey Yakubovich give us an overview of using Oozie. This presentation was given on December 13th, 2012 at the Nokia offices in Chicago, IL.
View the HD video of this talk here: http://vimeo.com/chug/oozie-overview
The document discusses scheduling Hadoop pipelines using various Apache projects. It provides an example of a marketing profit and loss (PnL) pipeline that processes booking, marketing spend, and web log data. It describes scheduling the example jobs using cron-style scheduling and the problems with time-based scheduling. It then introduces Apache Oozie and Apache Falcon for more robust workflow scheduling based on dataset availability. It provides examples of using Oozie coordinators and workflows and Falcon feeds and processes to schedule the example PnL pipeline based on when input data is available rather than fixed time schedules.
This deck presents best practices for getting good performance out of Apache Hive. It covers getting data into Hive, using the ORC file format, getting good layout into partitions and files based on query patterns, execution using Tez and YARN queues, memory configuration, and debugging common query performance issues. It also describes Hive bucketing and reading Hive EXPLAIN query plans.
The document compares the query execution plans produced by Apache Hive and PostgreSQL. It shows that Hive's old-style execution plans are overly verbose and difficult to understand, providing many low-level details across multiple stages. In contrast, PostgreSQL's plans are more concise and readable, showing the logical query plan in a top-down manner with actual table names and fewer lines of text. The document advocates for Hive to adopt a simpler execution plan format similar to PostgreSQL's.
The document contains screenshots and descriptions of the setup and configuration of a Hadoop cluster. It includes images showing the cluster with different numbers of live and dead nodes, replication settings across nodes, and outputs of commands like fsck and job execution information. The screenshots demonstrate how to view cluster health metrics, manage nodes, and run MapReduce jobs on the Hadoop cluster.
*Disclaimer: this is just my imaginary example of a Comms Plan for the Puma work, not the actual strategy created by Droga5 for Puma. I had nothing to do with that plan and am just a fan of their work.
What is Comms Planning? is a presentation that provides a clear answer of the role of the Comms Planner within an Advertising Agency. I use the example of the Puma Social campaign to prove the point.
This document provides an overview of using R, Hadoop, and Rhadoop for scalable analytics. It begins with introductions to basic R concepts like data types, vectors, lists, and data frames. It then covers Hadoop basics like MapReduce. Next, it discusses libraries for data manipulation in R like reshape2 and plyr. Finally, it focuses on Rhadoop projects like RMR for implementing MapReduce in R and considerations for using RMR effectively.
Storage and computation are getting cheaper AND easily accessible on demand in the cloud. We now collect and store some really large data sets, e.g. user activity logs, genome sequencing data, sensor data, etc. Hadoop and the ecosystem of projects built around it present simple and easy-to-use tools for storing and analyzing such large data collections on commodity hardware.
Topics Covered
* The Hadoop architecture.
* Thinking in MapReduce.
* Run some sample MapReduce jobs using Hadoop Streaming (a sketch follows this list).
* Introduce Pig Latin, an easy-to-use data processing language.
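As a rough illustration of the Hadoop Streaming topic above (my own Python sketch, not code from the talk; the file name and paths are invented), a word-count job is just a script that reads stdin and writes tab-separated key-value pairs to stdout:

#!/usr/bin/env python
# wordcount_streaming.py -- run the same file as mapper or reducer, e.g.:
#   hadoop jar hadoop-streaming.jar -input in/ -output out/ \
#     -mapper 'wordcount_streaming.py map' -reducer 'wordcount_streaming.py reduce' \
#     -file wordcount_streaming.py
import sys

def do_map():
    # emit "word<TAB>1" for every token seen
    for line in sys.stdin:
        for word in line.split():
            print(word + "\t1")

def do_reduce():
    # Hadoop sorts mapper output by key, so equal words arrive consecutively
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current and current is not None:
            print(current + "\t" + str(total))
            total = 0
        current = word
        total += int(count)
    if current is not None:
        print(current + "\t" + str(total))

if __name__ == "__main__":
    do_map() if sys.argv[1] == "map" else do_reduce()

The same pipeline can be tested without a cluster: cat input.txt | python wordcount_streaming.py map | sort | python wordcount_streaming.py reduce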
Speaker Profile: Mahesh Reddy is an entrepreneur, chasing dreams. He works on large-scale crawl and extraction of structured data from the web. He is a graduate from IIT Kanpur (2000-05) and previously worked at Yahoo! Labs as Research Engineer/Tech Lead on Search and Advertising products.
Spark is an open-source cluster computing framework. It started as a project in 2009 at UC Berkeley and was open sourced in 2010. It has over 300 contributors from 50+ organizations. Spark uses Resilient Distributed Datasets (RDDs) that allow in-memory cluster computing across clusters. RDDs provide a programming model for distributed datasets that can be created from external storage or by transforming existing RDDs. RDDs support operations like map, filter, reduce to perform distributed computations lazily.
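A small PySpark sketch of that model (illustrative only, not from the summarized talk): transformations build a lazy lineage, and the action at the end triggers the distributed computation.

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")
rdd = sc.parallelize(range(1, 101))          # an RDD from an in-memory collection
result = (rdd.map(lambda x: x * x)           # transformation: recorded, not run
             .filter(lambda x: x % 2 == 0)   # transformation: still lazy
             .reduce(lambda a, b: a + b))    # action: triggers the actual job
print(result)
sc.stop()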
This tutorial demonstrates how to create and work with geological databases in Surpac. Key steps covered include:
1. Creating DTMs from strings, spot heights, and a combination of breaklines and spot heights.
2. Setting up a new Surpac database by defining mandatory collar and survey tables, as well as optional tables for assays and geology.
3. Importing data, viewing tables, displaying and manipulating drillholes, creating sections, compositing, extracting data using domains, and displaying histograms.
The tutorial provides a comprehensive introduction to building and utilizing geological databases in Surpac for tasks such as resource estimation and feasibility studies.
GALE: Geometric active learning for Search-Based Software Engineering (CS, NcState)
Multi-objective evolutionary algorithms (MOEAs) help software engineers find novel solutions to complex problems. When automatic tools explore too many options, they are slow to use and hard to comprehend. GALE is a near-linear time MOEA that builds a piecewise approximation to the surface of best solutions along the Pareto frontier. For each piece, GALE mutates solutions towards the better end. In numerous case studies, GALE finds comparable solutions to standard methods (NSGA-II, SPEA2) using far fewer evaluations (e.g. 20 evaluations, not 1,000). GALE is recommended when a model is expensive to evaluate, or when some audience needs to browse and understand how an MOEA has made its conclusions.
Cloud Computing course presentation, Tarbiat Modares University
By: Sina Ebrahimi, Mohammadreza Noei
Advisor: Sadegh Dorri Nogoorani, PhD.
Presentation Date: 1397/03/07
Video Link in Aparat: https://www.aparat.com/v/N5VbK
Video Link on TMU Cloud: http://cloud.modares.ac.ir/public.php?service=files&t=9ecb8d2dd08df6f990a3eb63f42011f7
This presentation's pptx file (some animations may be lost in slideshare): http://cloud.modares.ac.ir/public.php?service=files&t=f62282dbd205abaa66de2512d9fdfc83
Training in Analytics, R and Social Media Analytics (Ajay Ohri)
This document provides an overview of basics of analysis, analytics, and R. It discusses why analysis is important, key concepts like central tendency, variance, and frequency analysis. It also covers exploratory data analysis, common analytics software, using R for tasks like importing data, data manipulation, visualization and more. Examples and demos are provided for many common R functions and techniques.
Pig Latin is a data flow language and execution framework for parallel computation. It allows users to express data analysis programs intuitively as a series of steps. Pig runs these steps on Hadoop for scalable processing. Key features include a simple declarative language, support for nested data types, user defined functions, and a debugging environment. The document provides an overview of Pig Latin concepts like loading and transforming data, filtering, joining, and outputting results. It also compares Pig Latin to MapReduce and SQL, highlighting Pig's advantages for iterative data analysis tasks on large datasets.
Spark DataFrames provide a more optimized way to work with structured data compared to RDDs. DataFrames allow skipping unnecessary data partitions when querying, such as only reading data partitions that match certain criteria like date ranges. DataFrames also integrate better with storage formats like Parquet, which stores data in a columnar format and allows skipping unrelated columns during queries to improve performance. The code examples demonstrate loading a CSV file into a DataFrame, finding and removing duplicate records, and counting duplicate records by key to identify potential duplicates.
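A hedged sketch of the workflow that summary describes (the file and column names are my own invention):

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("dedup-demo").getOrCreate()
df = spark.read.csv("events.csv", header=True, inferSchema=True)

# count records per key to surface potential duplicates
(df.groupBy("event_id")
   .count()
   .filter(F.col("count") > 1)
   .show())

# keep one row per key, then store in a columnar format so later queries
# can skip unrelated columns
df.dropDuplicates(["event_id"]).write.parquet("events.parquet")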
The document provides an introduction to Hadoop, including an overview of its core components HDFS and MapReduce, and motivates their use by explaining the need to process large amounts of data in parallel across clusters of computers in a fault-tolerant and scalable manner. It also presents sample code walkthroughs and discusses the Hadoop ecosystem of related projects like Pig, HBase, Hive and Zookeeper.
This document provides an introduction to Hadoop, including its motivation and key components. It discusses the scale of cloud computing that Hadoop addresses, and describes the core Hadoop technologies - the Hadoop Distributed File System (HDFS) and MapReduce framework. It also briefly introduces the Hadoop ecosystem, including other related projects like Pig, HBase, Hive and ZooKeeper. Sample code is walked through to illustrate MapReduce programming. Key aspects of HDFS like fault tolerance, scalability and data reliability are summarized.
This document discusses using Ruby for big data analysis. It provides an overview of Hadoop and Spark ecosystems for distributed storage and processing. It then demonstrates analyzing Twitter data with a Ruby script using the Hadoop streaming API and Spark. The Ruby script extracts date fields from tweets and counts tweets by day, illustrating how Ruby can leverage Hadoop and Spark for distributed data analysis.
R is an open-source statistical programming language that can be used for data analysis and visualization. The document provided an introduction to R including how to install R, create variables, import and assemble data, perform basic statistical analyses like t-tests and linear regression, and create plots and graphs. Key functions and concepts introduced included using c() to combine values into vectors, reading in data from CSV files, using lm() for linear regression, and the basic plot() function.
OLAP Basics and Fundamentals (Bharat Kalia)
The document discusses online analytical processing (OLAP) and the need for OLAP capabilities beyond basic data analysis. It describes how OLAP uses multidimensional data models and pre-computed aggregates to provide fast and interactive analysis of data across multiple dimensions. Different approaches for implementing OLAP like ROLAP, MOLAP, and hybrid systems are covered.
This document provides an overview of Pig Latin, a data flow language used for analyzing large datasets. Pig Latin scripts are compiled into MapReduce programs that can run on Hadoop. The key points covered include:
- Pig Latin allows expressing data transformations like filtering, joining, grouping in a declarative way similar to SQL. This is compiled into MapReduce jobs.
- It features a rich data model including tuples, bags and nested data to represent complex data structures from files.
- User defined functions (UDFs) allow custom processing like extracting terms from documents or checking for spam.
- The language provides commands like LOAD, FOREACH, FILTER, JOIN to load, transform and analyze data in parallel across
What is Distributed Computing, Why we use Apache Spark (Andy Petrella)
In this talk we introduce the notion of distributed computing and then tackle Spark's advantages.
The Spark core content is very tiny because the whole explanation has been done live using a Spark Notebook (https://github.com/andypetrella/spark-notebook/blob/geek/conf/notebooks/Geek.snb).
This talk has been given together by @xtordoir and myself at the University of Liège, Belgium.
This document provides an introduction to Apache Spark, including its history and key concepts. It discusses how Spark was developed at UC Berkeley in response to big data processing needs and how it builds upon earlier systems like Google's MapReduce. The document then covers Spark's core abstractions like RDDs and DataFrames/Datasets and common transformations and actions. It also provides an overview of Spark SQL and how to deploy Spark applications on a cluster.
This document introduces the theory behind geological database processes and provides detailed examples using the geological database modelling functions in Surpac. By working through this tutorial you will gain skills in the creation, use, and modification of geological databases.
This document provides an introduction to Apache Spark, including its architecture and programming model. Spark is a cluster computing framework that provides fast, in-memory processing of large datasets across multiple cores and nodes. It improves upon Hadoop MapReduce by allowing iterative algorithms and interactive querying of datasets through its use of resilient distributed datasets (RDDs) that can be cached in memory. RDDs act as immutable distributed collections that can be manipulated using transformations and actions to implement parallel operations.
This document summarizes a presentation about data science at OLX. It discusses OLX's moderation and recommender systems. For moderation, it describes OLX's machine learning models that automatically moderate listings for issues like duplicates, spam, and illegal/NSFW content. Moderators review flagged content. For recommendations, it discusses collaborative filtering and item embeddings to suggest relevant listings to users. It also outlines OLX's team structure, goal setting process, and expectations for data scientists, which include a focus on modeling, evaluation and some production work.
Whylogs is an open source tool for data monitoring that automatically creates statistical summaries called profiles of datasets. It helps with data monitoring by generating these profiles which can be compared over time to detect changes visually or programmatically. This allows issues like schema changes or bugs in data pipelines to be identified. The profiles have properties like being descriptive, lightweight and mergeable, which enables monitoring across distributed systems by allowing profile data to be logically merged. Whylogs thus provides a step towards observability of data systems.
The document outlines the plan and syllabus for a Data Engineering Zoomcamp hosted by DataTalks.Club. It introduces the four instructors for the course - Ankush Khanna, Sejal Vaidya, Victoria Perez Mola, and Alexey Grigorev. The 10-week course will cover topics like data ingestion, data warehousing with BigQuery, analytics engineering with dbt, batch processing with Spark, streaming with Kafka, and a culminating 3-week student project. Pre-requisites include experience with Python, SQL, and the command line. Course materials will be pre-recorded videos and there will be weekly live office hours for support. Students can earn a certificate and compete on a
This document discusses Zalando's use of AI to improve size and fit recommendations for customers. It outlines several challenges including varying size conventions, limited fit data for new items, and sparse customer purchase histories. It then describes Zalando's approaches to address these, including algorithms that use item images to predict sizes for new items lacking data (SizeNet) and models that learn from customers' past purchases and feedback to provide personalized size recommendations. The goal is to help customers find the right fit on their first purchase to reduce returns and improve the shopping experience.
AI-Powered Computer Vision Applications in Media Industry - Yulia Pavlova (Alexey Grigorev)
Computer vision techniques like facial recognition and image captioning can help automate metadata generation for media companies. Facial recognition can identify people in photos to assist editors and improve searchability, while image captioning can propose captions. A case study of applying these techniques to photos from English Premier League football games achieved 99% accuracy for facial recognition and precision of 78.7% for image captioning. Combining the two allows generating customized captions that include names identified through facial recognition. Challenges remain when the automatic caption does not match details in the image.
This document discusses several paradoxes that can arise in data science. It begins by discussing modelling and simulations that can be used when data is unavailable. It then outlines Simpson's Paradox, where a trend seen in groups disappears or reverses when the groups are combined. Next, it discusses the accuracy paradox, where a metric stops being useful once it becomes the target. It also discusses the learnability-Gödel paradox related to the limitations of mathematics according to Gödel's incompleteness theorems. Finally, it discusses the law of unintended consequences as it relates to data science.
An algorithm is considered fair if its results and performance are independent of sensitive variables like gender, ethnicity, etc. Fairness can be introduced at different stages of model development, such as in data collection, preparation, and model selection. Techniques for identifying and mitigating bias include causal reasoning, explainability, fairness metrics, and counterfactuals. Counterfactual fairness evaluates predictions across different protected attribute values while holding other variables constant. Explainability helps ensure models make decisions for the right reasons. Overall fairness aims to achieve equal outcomes or opportunities across groups.
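As a toy rendering of the counterfactual idea (the names and encoding are hypothetical, not from the document): flip only the protected attribute and measure how often the model's prediction changes.

def counterfactual_flip_rate(model, X, protected="gender"):
    # assumes a pandas DataFrame with a binary 0/1 protected column
    X_cf = X.copy()
    X_cf[protected] = 1 - X_cf[protected]   # the counterfactual individuals
    changed = model.predict(X) != model.predict(X_cf)
    return changed.mean()                   # 0.0 = counterfactually fair on this data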
This document discusses MLOps at OLX, including:
- The main areas of data science work at OLX like search, recommendations, fraud detection, and content moderation.
- How OLX uses teams structured by both feature areas and roles to collaborate on projects.
- A maturity model for MLOps with levels from no MLOps to fully automated processes.
- How OLX has improved from siloed work to cross-functional teams and adding more automation to model creation, release, and application integration over time.
Introduction to Transformers for NLP - Olga Petrova (Alexey Grigorev)
Olga Petrova gives an introduction to transformers for natural language processing (NLP). She begins with an overview of representing words using tokenization, word embeddings, and one-hot encodings. Recurrent neural networks (RNNs) are discussed as they are important for modeling sequential data like text, but they struggle with long-term dependencies. Attention mechanisms were developed to address this by allowing the model to focus on relevant parts of the input. Transformers use self-attention and have achieved state-of-the-art results in many NLP tasks. Bidirectional Encoder Representations from Transformers (BERT) provides contextualized word embeddings trained on large corpora.
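To ground the attention mechanism mentioned above, here is a minimal single-head self-attention in NumPy (my sketch; real transformers add multiple heads, masking, and per-layer learned projections):

import numpy as np

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # token-to-token relevance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V                                # context-mixed representations

d = 8
X = np.random.randn(5, d)                             # 5 token embeddings
out = self_attention(X, *(np.random.randn(d, d) for _ in range(3)))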
This document discusses the use of machine learning in online marketplaces. It outlines how machine learning is used for recommendations, search, trust and safety, seller experience, and pricing/monetization. Specific applications mentioned include collaborative and content-based recommendation systems, ranking models for search, automated content moderation, image quality assessment, dynamic pricing, and promoting listings. The document provides examples of algorithms like counting, collaborative filtering, learning to rank, and neural networks that power these machine learning applications in online marketplaces.
ML Zoomcamp 2.1 - Car Price Prediction Project (Alexey Grigorev)
This document outlines the plan for a car price prediction project. The plan includes preparing and exploring the data, using linear regression to predict prices, understanding how linear regression works, evaluating the model's accuracy, engineering features, regularizing the model, and implementing the model. The source code for the project is available at a GitHub link provided. The next step mentioned is exploratory data analysis of the data.
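The mathematical core of such a project fits in a few lines (a sketch under my own assumptions, not the course's actual code): ordinary least squares via the normal equation.

import numpy as np

def train_linear_regression(X, y):
    X = np.column_stack([np.ones(len(X)), X])   # prepend a bias column
    w = np.linalg.solve(X.T @ X, X.T @ y)       # solve (X^T X) w = X^T y
    return w[0], w[1:]                          # intercept, feature weights

X = np.random.rand(100, 3)                      # stand-in feature matrix
y = X @ [2.0, -1.0, 0.5] + 3.0                  # synthetic "prices"
bias, coef = train_linear_regression(X, y)      # recovers ~3.0 and ~[2, -1, 0.5]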
Increase Quality with User Access Policies - July 2024 (Peter Caitens)
⭐️ Increase Quality with User Access Policies ⭐️, presented by Peter Caitens and Adam Best of Salesforce. View the slides from this session to hear all about “User Access Policies” and how they can help you onboard users faster with greater quality.
Retrieval Augmented Generation Evaluation with Ragas (Zilliz)
Retrieval Augmented Generation (RAG) enhances chatbots by incorporating custom data in the prompt. Using large language models (LLMs) as judge has gained prominence in modern RAG systems. This talk will demo Ragas, an open-source automation tool for RAG evaluations. Christy will talk about and demo evaluating a RAG pipeline using Milvus and RAG metrics like context F1-score and answer correctness.
The Zaitechno Handheld Raman Spectrometer is a powerful and portable tool for rapid, non-destructive chemical analysis. It utilizes Raman spectroscopy, a technique that analyzes the vibrational fingerprint of molecules to identify their chemical composition. This handheld instrument allows for on-site analysis of materials, making it ideal for a variety of applications, including:
Material identification: Identify unknown materials, minerals, and contaminants.
Quality control: Ensure the quality and consistency of raw materials and finished products.
Pharmaceutical analysis: Verify the identity and purity of pharmaceutical compounds.
Food safety testing: Detect contaminants and adulterants in food products.
Field analysis: Analyze materials in the field, such as during environmental monitoring or forensic investigations.
The Zaitechno Handheld Raman Spectrometer is easy to use and features a user-friendly interface. It is compact and lightweight, making it ideal for field applications. With its rapid analysis capabilities, the Zaitechno Handheld Raman Spectrometer can help you improve efficiency and productivity in your research or quality control workflows.
The History of Embeddings & Multimodal Embeddings (Zilliz)
Frank Liu will walk through the history of embeddings and how we got to the cool embedding models used today. He'll end with a demo on how multimodal RAG is used.
"Building Future-Ready Apps with .NET 8 and Azure Serverless Ecosystem", Stan...Fwdays
.NET 8 brought a lot of improvements for developers and maturity to the Azure serverless container ecosystem. So, this talk will cover these changes and explain how you can apply them to your projects. Another reason for this talk is the re-invention of Serverless from a DevOps perspective as a Platform Engineering trend with Backstage and the recent Radius project from Microsoft. So now is the perfect time to look at developer productivity tooling and serverless apps from Microsoft's perspective.
UiPath Community Day Amsterdam: Code, Collaborate, Connect (UiPathCommunity)
Welcome to our third live UiPath Community Day Amsterdam! Come join us for a half-day of networking and UiPath Platform deep-dives, for devs and non-devs alike, in the middle of summer ☀.
📕 Agenda:
12:30 Welcome Coffee/Light Lunch ☕
13:00 Event opening speech
Ebert Knol, Managing Partner, Tacstone Technology
Jonathan Smith, UiPath MVP, RPA Lead, Ciphix
Cristina Vidu, Senior Marketing Manager, UiPath Community EMEA
Dion Mes, Principal Sales Engineer, UiPath
13:15 ASML: RPA as Tactical Automation
Tactical robotic process automation for solving short-term challenges, while establishing standard and re-usable interfaces that fit IT's long-term goals and objectives.
Yannic Suurmeijer, System Architect, ASML
13:30 PostNL: an insight into RPA at PostNL
Showcasing the solutions our automations have provided, the challenges we’ve faced, and the best practices we’ve developed to support our logistics operations.
Leonard Renne, RPA Developer, PostNL
13:45 Break (30')
14:15 Breakout Sessions: Round 1
Modern Document Understanding in the cloud platform: AI-driven UiPath Document Understanding
Mike Bos, Senior Automation Developer, Tacstone Technology
Process Orchestration: scale up and have your Robots work in harmony
Jon Smith, UiPath MVP, RPA Lead, Ciphix
UiPath Integration Service: connect applications, leverage prebuilt connectors, and set up customer connectors
Johans Brink, CTO, MvR digital workforce
15:00 Breakout Sessions: Round 2
Automation, and GenAI: practical use cases for value generation
Thomas Janssen, UiPath MVP, Senior Automation Developer, Automation Heroes
Human in the Loop/Action Center
Dion Mes, Principal Sales Engineer @UiPath
Improving development with coded workflows
Idris Janszen, Technical Consultant, Ilionx
15:45 End remarks
16:00 Community fun games, sharing knowledge, drinks, and bites 🍻
Generative AI technology is a fascinating field that focuses on creating comp... (Nohoax Kanont)
Generative AI technology is a fascinating field that focuses on creating computer models capable of generating new, original content. It leverages the power of large language models, neural networks, and machine learning to produce content that can mimic human creativity. This technology has seen a surge in innovation and adoption since the introduction of ChatGPT in 2022, leading to significant productivity benefits across various industries. With its ability to generate text, images, video, and audio, generative AI is transforming how we interact with technology and the types of tasks that can be automated.
Choosing the Best Outlook OST to PST Converter: Key Features and Considerations (webbyacad software)
When looking for a good software utility to convert Outlook OST files to PST format, it is important to find one that is easy to use and has useful features. WebbyAcad OST to PST Converter Tool is a great choice because it is simple to use for anyone, whether you are tech-savvy or not. It can smoothly change your files to PST while keeping all your data safe and secure. Plus, it can handle large amounts of data and convert multiple files at once, which can save you a lot of time. It even comes with 24*7 technical support assistance and a free trial, so you can try it out before making a decision. Whether you need to recover, move, or back up your data, Webbyacad OST to PST Converter is a reliable option that gives you all the support you need to manage your Outlook data effectively.
Keynote : AI & Future Of Offensive Security (Priyanka Aash)
In the presentation, the focus is on the transformative impact of artificial intelligence (AI) in cybersecurity, particularly in the context of malware generation and adversarial attacks. AI promises to revolutionize the field by enabling scalable solutions to historically challenging problems such as continuous threat simulation, autonomous attack path generation, and the creation of sophisticated attack payloads. The discussions underscore how AI-powered tools like AI-based penetration testing can outpace traditional methods, enhancing security posture by efficiently identifying and mitigating vulnerabilities across complex attack surfaces. The use of AI in red teaming further amplifies these capabilities, allowing organizations to validate security controls effectively against diverse adversarial scenarios. These advancements not only streamline testing processes but also bolster defense strategies, ensuring readiness against evolving cyber threats.
2. 2
Hadoop: In this Presentation
1. Introduction
2. Origins
3. MapReduce
4. Hadoop as MapReduce Implementation
5. Data Warehouse on Hadoop
6. Hadoop and Data Warehousing
7. Conclusions
3. 3
Why?
• Lots of data
• How to deal with it?
• Hadoop to rescue!
• When to use?
• When not to use?
• Curiosity
4. 4
MapReduce: Origins
• Functional Programming
• Higher-order functions that operate on lists
• map
• apply to each element of the list
• reduce = fold = accumulate
• aggregate a list and produce one value of output
• No side effects
5. 5
MapReduce: Origins
• (define (+1 el) (+ el 1))
• (map +1 (list 1 2 3)) ⇒ (list 2 3 4)
• (reduce + 0 (list 2 3 4)) ⇒ 9
• (reduce + 0 (map +1 (list 1 2 3))) ⇒ 9
6. 6
MapReduce: Origins
• These functions do not have side effects
• And can be parallelized easily
• Can split the input data into chunks:
• (list 1 2 3 4) ⇒ (list 1 2) and (list 3 4)
• Apply map to each chunk separately, and then combine (reduce) them together
7. 7
MapReduce: Origins
• Mapping separately:
• (define res1 (reduce + 0 (map +1 (list 1 2))))
• (reduce + res1 (map +1 (list 3 4)))
• This is the same as (reduce + 0 (map +1 (list 1 2 3 4)))
• Note that for reduce the function must be associative
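The same argument in Python, for readers less used to Lisp (my rendering, not a slide from the deck):

from functools import reduce
from operator import add

inc = lambda x: x + 1

res1 = reduce(add, map(inc, [1, 2]), 0)        # reduce the first chunk
total = reduce(add, map(inc, [3, 4]), res1)    # fold the second chunk on top
assert total == reduce(add, map(inc, [1, 2, 3, 4]), 0) == 14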
8. 8
MapReduce
• A map function
• takes a key-value pair (in_key, in_val)
• produces zero or more key-value pairs: intermediate results
• intermediate results are grouped by key
• A reduce function
• for each group in the intermediate results
• aggregates and produces the final output
9. 9
MapReduce Stages
Each MapReduce job is executed in 3 stages:
• map stage: apply map to each key-value pair
• group stage: group together the intermediate results by key
• reduce stage: apply reduce to each group
12. 12
MapReduce Example
def map(String input_key, String doc):
    for each word w in doc:
        EmitIntermediate(w, 1)

def reduce(String output_key, Iterator output_vals):
    int res = 0
    for each v in output_vals:
        res += v
    Emit(res)
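The pseudocode above translates almost line-for-line into runnable Python, with the grouping stage made explicit (my rendering, not from the deck):

from collections import defaultdict

def map_fn(input_key, doc):
    for word in doc.split():
        yield word, 1                          # EmitIntermediate(w, 1)

def reduce_fn(output_key, output_vals):
    return output_key, sum(output_vals)        # Emit(res)

docs = {"d1": "a rose is a rose", "d2": "a b"}

groups = defaultdict(list)                     # group stage: intermediate results by key
for key, doc in docs.items():
    for word, one in map_fn(key, doc):
        groups[word].append(one)

counts = dict(reduce_fn(w, vals) for w, vals in groups.items())
# {'a': 3, 'rose': 2, 'is': 1, 'b': 1}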
13. 13
MapReduce Example
• map stage: output (w, 1) for each word w
• group stage: collect the pairs for each w into (w, [1, 1, ..., 1])
• reduce stage: for each w, calculate how many ones there are
16. 16
“
Hadoop
... is a framework that allows for the distributed processing of large data
sets across clusters of computers using simple programming models. It is
designed to scale up from single servers to thousands of machines, each
offering local computation and storage. Rather than rely on hardware to
deliver high-availability, the library itself is designed to detect and handle
failures at the application layer, so delivering a highly-available service on
top of a cluster of computers, each of which may be prone to failures.
17. 17
Hadoop
• Open Source implementation of MapReduce
• "Hadoop":
• HDFS
• Hadoop MapReduce
• HBase
• Hive
• ... many others
18. 18
Hadoop Cluster: Terminology
• Name Node: orchestrates the process
• Workers: nodes that do the computation
• Mappers do the map phase
• Reducers do the reduce phase
28. 28
Advantages
• Simple, especially for programmers who know FP
• Fault tolerant
• No schema, can process any data
• Flexible
• Cheap and runs on commodity hardware
29. 29
Disadvantages
• No declarative high-level language like SQL
• Performance issues:
• Map and Reduce are blocking
• Name Node: single point of failure
• It's young
34. 34
Cheetah
• Virtual views consist of columns that can be queried
• Everything inside is entirely denormalized
• Append-only design and slowly changing dimensions
• Proprietary
35. 35
Hive
• A data warehousing solution built by Facebook
• For Big data analysis:
• in 2010 (4 years ago!), 30+ PB
• Has its own data model
• HiveQL: a declarative SQL-like language for ad-hoc querying
36. 36
HiveQL
Tables
STATUS_UPDATES(userid int, status string, ds string)
PROFILES(userid int, school string, gender int)

LOAD DATA LOCAL INPATH '/logs/status_updates'
INTO TABLE status_updates
PARTITION (ds='2009-03-20')
37. 37
HiveQL
FROM
  (SELECT a.status, b.school, b.gender
   FROM status_updates a JOIN profiles b
   ON (a.userid = b.userid AND a.ds='2009-03-20')) subq1

INSERT OVERWRITE TABLE gender_summary
PARTITION (ds='2009-03-20')
SELECT subq1.gender, COUNT(1)
GROUP BY subq1.gender

INSERT OVERWRITE TABLE school_summary
PARTITION (ds='2009-03-20')
SELECT subq1.school, COUNT(1)
GROUP BY subq1.school
39. 39
HiveQL
REDUCE subq2.school, subq2.meme, subq2.cnt
USING 'top10.py' AS (school, meme, cnt)
FROM (
  SELECT subq1.school, subq1.meme, COUNT(1) AS cnt
  FROM (
    MAP b.school, a.status
    USING 'meme-extractor.py' AS (school, meme)
    FROM status_updates a JOIN profiles b
    ON (a.userid = b.userid)
  ) subq1
  GROUP BY subq1.school, subq1.meme
  DISTRIBUTE BY school, meme
  SORT BY school, meme, cnt desc
) subq2
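The deck does not include 'meme-extractor.py' itself, but Hive's MAP/REDUCE ... USING contract is simple: rows arrive on the script's stdin as tab-separated text, and tab-separated output lines map onto the AS (...) columns. A hypothetical sketch in Python:

import sys

def extract_memes(status):
    # stand-in for real meme extraction: treat hashtag-like tokens as memes
    return [t for t in status.split() if t.startswith("#")]

for line in sys.stdin:
    school, status = line.rstrip("\n").split("\t", 1)
    for meme in extract_memes(status):
        print(school + "\t" + meme)            # maps onto AS (school, meme)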
41. 41
Hadoop + Data Warehouse
• Hadoop and Data Warehouses can co-exist
• DW: OLAP, BI, transactional data
• Hadoop: Raw, unstructured data
42. 42
ETL
• Extract: load to HDFS, parse, prepare
• Run some analysis
• Transform: clean data and transform to some structured format
• with MapReduce
• Load: extract from HDFS, load to DW
43. 43
ETL: examples
• Text processing
• Call center records analysis
• extract sentiment
• link to profile
• which customers are more important to keep?
• Image processing
44. 44
Active Storage
• Don't delete the data after processing
• Hadoop storage is cheap: it can store anything
• Run more analysis when needed
• Like: extract new keywords/features from the old dataset
45. 45
Active Storage - 2
• Up to 80% of data is dormant (or cold)
• Hadoop storage can be way cheaper than high-cost data management
solutions
• Move this data to Hadoop
• When needed quickly analyze there or move back to DW
49. 49
Analytical Sandbox
• What are we looking for in this data?
• No structure - hard to know
• Run ad-hoc Hive queries to see what's there
50. 50
Conclusions
• Hadoop is becoming more and more popular
• Many companies plan to adopt it
• Best used with existing DW solutions
• as an ETL
• as Active Storage
• as Analytical Sandbox
51. 51
References
1. Lee, Kyong-Ha, et al. "Parallel data processing with MapReduce: a survey." ACM SIGMOD Record 40.4 (2012): 11-20. [pdf]
2. "MapReduce vs Data Warehouse". Webpage, [link]. Accessed 15/12/2013.
3. Ordonez, Carlos, Il-Yeol Song, and Carlos Garcia-Alvarado. "Relational versus non-relational database systems for data warehousing." Proceedings of the ACM 13th international workshop on Data warehousing and OLAP. ACM, 2010. [pdf]
4. A. Awadallah, D. Graham. "Hadoop and the Data Warehouse: When to Use Which." (2011). [pdf] (by Cloudera and Teradata)
5. Thusoo, Ashish, et al. "Hive: a warehousing solution over a map-reduce framework." Proceedings of the VLDB Endowment 2.2 (2009): 1626-1629. [pdf]
6. Chen, Songting. "Cheetah: a high performance, custom data warehouse on top of MapReduce." Proceedings of the VLDB Endowment 3.1-2 (2010): 1459-1468. [pdf]
52. 52
References
7. "How (and Why) Hadoop is Changing the Data Warehousing Paradigm." Webpage [link]. Accessed 15/12/2013.
8. P. Russom. "Integrating Hadoop into Business Intelligence and Data Warehousing." (2013). [pdf]
9. M. Ferguson. "Offloading and Accelerating Data Warehouse ETL Processing Using Hadoop." [pdf]
10. Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: simplified data processing on large clusters." Communications of
the ACM 51.1 (2008): 107-113. [pdf]
11. "What is Hadoop?" Webpage [link]. Accessed 15/12/2013.
12. Apache Hadoop project home page, url: [link].
13. Apache HBase home page, [link].
14. Apache Mahout home page, [link].
15. "How Hadoop Cuts Big Data Costs" [link]. Accessed 05/01/2014.
16. "The Impact of Data Temperature on the Data Warehouse." whitepaper by Terradata (2012). [pdf]
17. Abouzeid, Azza, et al. "HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical
workloads." Proceedings of the VLDB Endowment 2.1 (2009): 922-933. [pdf]