Training (Day –1) 
Big data: four parameters:
–Velocity: Streaming data and large volume data movement. 
–Volume: Scale from terabytes to zettabytes. 
–Variety: Manage the complexity of multiple relational and non-relational data types and schemas. 
–Voracity: Produced data has to be consumed fast before it becomes meaningless.
Not just internet companies 
Big Data Shouldn’t Be a Silo
Must be an integrated part of enterprise information architecture
Data >> Information >> Business Value 
Retail: By combining data feeds on inventory, transactional records, social media and online trends, retailers can make real-time decisions around product and inventory mix adjustments, marketing and promotions, pricing or product quality issues.
Financial Services: By combining data across various groups and services such as financial markets, money management and lending, financial services companies can gain a comprehensive view of their individual customers and markets.
Government: By collecting and analyzing data across agencies, locations and employee groups, the government can reduce redundancies, identify productivity gaps, pinpoint consumer issues and find procurement savings across agencies.
Healthcare: Big data in healthcare could be used to help improve hospital operations, provide better tracking of procedural outcomes, accelerate research and grow knowledge bases more rapidly. According to a 2011 Cleveland Clinic study, leveraging big data is estimated to drive $300B in annual value through 8% systematic cost savings.
Processing Granularity
[Figure: processing granularity, ordered from small data size to large data size. Reference: Bina Ramamurthy 2011]
Single-core
–Single-core, single processor
–Single-core, multi-processor
Multi-core
–Multi-core, single processor
–Multi-core, multi-processor
Cluster
–Cluster of processors (single or multi-core) with shared memory
–Cluster of processors with distributed memory
–Broadly, the approach in HPC is to distribute the work across a cluster of machines, which access a shared file system hosted by a SAN.
Grid of clusters
Embarrassingly parallel processing
MapReduce, distributed file system
Cloud computing
Granularity levels
–Pipelined: instruction level
–Concurrent: thread level
–Service: object level
–Indexed: file level
–Mega: block level
–Virtual: system level
How to Process Big Data?
Need to process large datasets (>100 TB)
–Just reading 100 TB of data can be overwhelming
–Takes ~11 days to read on a standard computer
–Takes a day across a 10 Gbit link (very high-end storage solution)
–On a single node (@ 50 MB/s): 23 days
–On a 1000-node cluster: 33 min
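The single-node, cluster and link figures above follow from dividing data size by throughput. A minimal sketch of that arithmetic, assuming decimal units (1 TB = 10^12 bytes) and a cluster whose nodes read in perfect parallel at 50 MB/s each; the function name is illustrative only:

```python
def read_time_seconds(data_bytes, bytes_per_second, nodes=1):
    """Time to scan data_bytes when each of `nodes` readers sustains bytes_per_second."""
    return data_bytes / (bytes_per_second * nodes)

TB = 10**12
MB = 10**6
DATA = 100 * TB

single_node = read_time_seconds(DATA, 50 * MB)           # one node at 50 MB/s
cluster = read_time_seconds(DATA, 50 * MB, nodes=1000)   # 1000 nodes, ideal parallelism
link_10gbit = read_time_seconds(DATA, 10 * 10**9 / 8)    # 10 Gbit/s = 1.25 GB/s

print(f"single node : {single_node / 86400:.1f} days")    # 23.1 days
print(f"1000 nodes  : {cluster / 60:.1f} minutes")        # 33.3 minutes
print(f"10 Gbit link: {link_10gbit / 3600:.1f} hours")    # 22.2 hours
```

Real clusters fall short of the ideal 1000x speed-up because of stragglers and scheduling overhead, so the 33 minutes should be read as a lower bound.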
Examples
•Web logs;
•RFID;
•sensor networks;
•social networks;
•social data (due to the social data revolution),
•Internet text and documents;
•Internet search indexing;
•call detail records;
•astronomy,
•atmospheric science,
•genomics,
•biogeochemical,
•biological, and
•other complex and/or interdisciplinary scientific research;
•military surveillance;
•medical records;
•photography archives;
•video archives; and
•large-scale e-commerce.
Not so easy… 
Moving data from storage cluster to computation cluster is not feasible 
In large clusters 
–Failure is expected, rather than exceptional. 
–In large clusters, computers fail every day 
–Data is corrupted or lost 
–Computations are disrupted 
–The number of nodes in a cluster may not be constant. 
–Nodes can be heterogeneous. 
Very expensive to build reliability into each application 
–A programmer worries about errors, data motion, communication… 
–Traditional debugging and performance tools don’t apply 
Need a common infrastructure and standard set of tools to handle this complexity 
–Efficient, scalable, fault-tolerant and easy to use
Why are Hadoop and MapReduce needed?
The answer to this question comes from another trend in disk drives:
–seek time is improving more slowly than transfer rate. 
Seeking is the process of moving the disk’s head to a particular place on the disk to read or write data. 
It characterizes the latency of a disk operation, whereas the transfer rate corresponds to a disk’s bandwidth. 
If the data access pattern is dominated by seeks, it will take longer to read or write large portions of the dataset than streaming through it, which operates at the transfer rate.
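The gap between seek-dominated and streaming access can be made concrete with a little arithmetic. A rough sketch, assuming a 10 ms average seek, a 100 MB/s transfer rate, a 1 TB dataset and 100-byte records; all four numbers are illustrative assumptions, not measurements from the slides:

```python
SEEK_TIME_S = 0.010           # assumed average seek latency: 10 ms
TRANSFER_RATE = 100 * 10**6   # assumed sustained transfer rate: 100 MB/s
DATASET_BYTES = 10**12        # assumed dataset size: 1 TB
RECORD_BYTES = 100            # assumed record size

def seek_dominated_time(num_records):
    """One seek per record, plus the time to transfer that record."""
    return num_records * (SEEK_TIME_S + RECORD_BYTES / TRANSFER_RATE)

def streaming_time():
    """Read the whole dataset sequentially at the transfer rate."""
    return DATASET_BYTES / TRANSFER_RATE

total_records = DATASET_BYTES // RECORD_BYTES
stream = streaming_time()
one_percent = seek_dominated_time(total_records // 100)

print(f"stream whole dataset : {stream / 3600:.1f} hours")
print(f"seek to 1% of records: {one_percent / 3600:.1f} hours")
```

Under these assumptions, seeking to just 1% of the records takes roughly a hundred times longer than streaming through the entire dataset.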
Why are Hadoop and MapReduce needed?
On the other hand, for updating a small proportion of records in a database, a traditional B-Tree (the data structure used in relational databases, which is limited by the rate at which it can perform seeks) works well.
For updating the majority of a database, a B-Tree is less efficient than MapReduce, which uses Sort/Merge to rebuild the database.
MapReduce can be seen as a complement to an RDBMS.
MapReduce is a good fit for problems that need to analyze the whole dataset in a batch fashion, particularly for ad hoc analysis.
Hadoop distributions 
Apache™ Hadoop™ 
Apache Hadoop-based Services for Windows Azure 
Cloudera’s Distribution Including Apache Hadoop (CDH)
Hortonworks Data Platform
IBM InfoSphere BigInsights
Platform Symphony MapReduce
MapR Hadoop Distribution
EMC Greenplum MR (using MapR’s M5 Distribution)
Zettaset Data Platform
SGI Hadoop Clusters (uses Cloudera distribution)
Grand Logic JobServer
OceanSync Hadoop Management Software
Oracle Big Data Appliance (uses Cloudera distribution)
What’s up with the names? 
When naming software projects, Doug Cutting seems to have been inspired by his family. 
Lucene is his wife’s middle name, and her maternal grandmother’s first name.
His son, as a toddler, used Nutch as the all-purpose word for meal and later named a yellow stuffed elephant Hadoop.
Doug said he “was looking for a name that wasn’t already a web domain and wasn’t trademarked, so I tried various words that were in my life but not used by anybody else. Kids are pretty good at making up words.”
Hadoop features 
Distributed Framework for processing and storing data generally on commodity hardware. 
Completely Open Source. 
Written in Java 
–Runs on Linux, Mac OS/X, Windows, and Solaris. 
–Client apps can be written in various languages. 
•Scalable: store and process petabytes, scale by adding hardware
•Economical: thousands of commodity machines
•Efficient: run tasks where data is located 
•Reliable: data is replicated, failed tasks are rerun 
•Primarily used for batch data processing, not real-time / user facing applications
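Components of Hadoop
•HDFS (Hadoop Distributed File System)
–Modeled on GFS
–Reliable, high-bandwidth file system that can store TBs and PBs of data.
•Map-Reduce
–Uses the Map/Reduce metaphor from the Lisp language
–A distributed processing framework paradigm that processes the data stored on HDFS as key-value pairs.
[Figure: DFS and processing framework. Client 1 and Client 2; input data, Map, Shuffle & Sort, Reduce, output data]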
HDFS
•Very Large Distributed File System
–10K nodes, 100 million files, 10 PB
–Linearly scalable
–Supports large files (in GBs or TBs)
•Economical
–Uses commodity hardware
–Nodes fail every day. Failure is expected, rather than exceptional.
–The number of nodes in a cluster is not constant.
•Optimized for Batch Processing
HDFS Goals 
•Highly fault-tolerant 
–runs on commodity HW, which can fail frequently 
•High throughput of data access 
–Streaming access to data 
•Large files 
–Typical file is gigabytes to terabytes in size 
–Support for tens of millions of files 
•Simple coherency 
–Write-once-read-many access model
HDFS: Files and Blocks 
•Data Organization 
–Data is organized into files and directories 
–Files are divided into large, uniform-sized blocks
–Typically 128 MB
–Blocks are distributed across cluster nodes 
•Fault Tolerance 
–Blocks are replicated (default 3) to handle hardware failure 
–Replication based on Rack-Awareness for performance and fault tolerance 
–Keeps checksums of data for corruption detection and recovery 
–Client reads both checksum and data from DataNode. If checksum fails, it tries other replicas
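The block and checksum behaviour described above can be sketched in a few lines. This is a toy illustration, not HDFS code: `zlib.crc32` stands in for whatever checksum the file system uses, and the DataNode names and function names are made up for the example.

```python
import zlib

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, the typical block size cited above

def num_blocks(file_size):
    """Number of blocks needed for a file (the last block may be partial)."""
    return (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE

def read_block(replicas, expected_crc):
    """Return (node, data) from the first replica whose checksum matches."""
    for node, data in replicas:
        if zlib.crc32(data) == expected_crc:
            return node, data
        # checksum mismatch: fall through and try the next replica
    raise IOError("all replicas are corrupt")

good = b"block contents"
crc = zlib.crc32(good)
# three replicas (the default replication factor); the first one is corrupted
replicas = [("datanode-1", b"block c0ntents"),
            ("datanode-2", good),
            ("datanode-3", good)]

print(num_blocks(1024**3))            # a 1 GiB file needs 8 blocks
print(read_block(replicas, crc)[0])   # datanode-2
```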
HDFS: Files and Blocks 
•High Throughput: 
–Client talks to both NameNode and DataNodes
–Data is not sent through the NameNode. 
–Throughput of file system scales nearly linearly with the number of nodes. 
•HDFS exposes block placement so that computation can be migrated to data
HDFS Components
•NameNode
–Manages file namespace operations such as opening, creating and renaming files.
–Maps each file name to its list of blocks and their locations
–File metadata
–Authorization and authentication
–Collects block reports from DataNodes on block locations
–Replicates missing blocks
–Keeps the entire namespace in memory, plus checkpoints and a journal
•DataNode
–Handles block storage on multiple volumes and data integrity.
–Clients access the blocks directly from DataNodes for read and write
–DataNodes periodically send block reports to the NameNode
–Block creation, deletion and replication upon instruction from the NameNode.
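HDFS Architecture
[Figure: the NameNode (the master) holds the metadata, the DataNodes (the slaves) hold replicated copies of blocks 1 to 5, and the client exchanges metadata with the NameNode and performs I/O with the DataNodes]
NameNode metadata shown in the figure:
name:/users/joeYahoo/myFile - blocks:{1,3}
name:/users/bobYahoo/someData.gzip - blocks:{2,4,5}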
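The NameNode's bookkeeping (file name to blocks, blocks to locations, block reports, detection of missing replicas) can be mimicked with two dictionaries. The class below is a hypothetical toy model for illustration only; the class, method and DataNode names are assumptions and do not correspond to Hadoop's actual classes.

```python
from collections import defaultdict

class ToyNameNode:
    def __init__(self, replication=3):
        self.replication = replication
        self.namespace = {}                       # file name -> list of block ids
        self.block_locations = defaultdict(set)   # block id -> DataNodes holding it

    def add_file(self, name, block_ids):
        self.namespace[name] = list(block_ids)

    def block_report(self, datanode, block_ids):
        """Replace everything previously known about this DataNode with its report."""
        for nodes in self.block_locations.values():
            nodes.discard(datanode)
        for block_id in block_ids:
            self.block_locations[block_id].add(datanode)

    def locate(self, name):
        """Return (block id, sorted DataNodes) for each block of a file."""
        return [(b, sorted(self.block_locations[b])) for b in self.namespace[name]]

    def under_replicated(self):
        """Blocks of known files with fewer replicas than the target."""
        return sorted(b for blocks in self.namespace.values() for b in blocks
                      if len(self.block_locations[b]) < self.replication)

nn = ToyNameNode()
nn.add_file("/users/joeYahoo/myFile", [1, 3])
nn.add_file("/users/bobYahoo/someData.gzip", [2, 4, 5])
nn.block_report("dn1", [1, 2, 4])
nn.block_report("dn2", [1, 2, 3, 5])
nn.block_report("dn3", [2, 3, 4, 5])
nn.block_report("dn4", [1, 3, 4, 5])
print(nn.locate("/users/joeYahoo/myFile"))
print(nn.under_replicated())    # [] while every block has three replicas

nn.block_report("dn4", [])      # simulate dn4 losing all of its blocks
print(nn.under_replicated())    # [1, 3, 4, 5]
```

Note that the client only asks the toy NameNode where blocks live; the block data itself would be fetched from the DataNodes returned by `locate`.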
Hadoop DFS Interface
Simple commands
hdfs dfs -ls, -du, -rm, -rm -r
Uploading files
hdfs dfs -copyFromLocal foo mydata/foo
Downloading files
hdfs dfs -copyToLocal mydata/foo foo
hdfs dfs -cat mydata/foo
Admin
hdfs dfsadmin -report
Map-Reduce: Introduction
•Parallel job processing framework
•Written in Java
•Close integration with HDFS 
•Provides : 
–Auto partitioning of job into sub tasks 
–Auto retry on failures 
–Linear Scalability 
–Locality of task execution 
–Plugin based framework for extensibility
Map-Reduce
•MapReduce programs are executed in two main phases, called
–mapping and
–reducing.
•In the mapping phase, MapReduce takes the input data and feeds each data element to the mapper.
•In the reducing phase, the reducer processes all the outputs from the mapper and arrives at a final result.
•The mapper is meant to filter and transform the input into something that the reducer can aggregate over.
•MapReduce uses lists and (key/value) pairs as its main data primitives.
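The two phases can be imitated on a single machine with ordinary lists and (key, value) pairs. The sketch below is a toy, in-memory word count; `run_mapreduce`, `word_mapper` and `count_reducer` are illustrative names, not part of Hadoop's API.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Run mapper over every record, group values by key, then reduce per key."""
    groups = defaultdict(list)
    for record in records:                  # mapping phase
        for key, value in mapper(record):
            groups[key].append(value)
    # reducing phase: keys are visited in sorted order, as after a shuffle & sort
    return [reducer(key, groups[key]) for key in sorted(groups)]

def word_mapper(line):
    """Transform one input line into (word, 1) pairs."""
    return [(word.lower(), 1) for word in line.split()]

def count_reducer(word, counts):
    """Aggregate all the counts emitted for one word."""
    return (word, sum(counts))

lines = ["the quick brown fox", "the lazy dog", "The fox"]
print(run_mapreduce(lines, word_mapper, count_reducer))
# [('brown', 1), ('dog', 1), ('fox', 2), ('lazy', 1), ('quick', 1), ('the', 3)]
```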
Map-Reduce Program
–Based on two functions: Map and Reduce
–Every Map/Reduce program must specify a Mapper and optionally a Reducer
–Operate on key and value pairs
Map-Reduce works like a Unix pipeline:
cat input | grep | sort | uniq -c | cat > output
Input | Map | Shuffle & Sort | Reduce | Output
cat /var/log/auth.log* | grep "session opened" | cut -d' ' -f10 | sort | uniq -c > ~/userlist
Map function: Takes a key/value pair and generates a set of intermediate key/value pairs
map(k1, v1) -> list(k2, v2)
Reduce function: Takes an intermediate key together with all the intermediate values associated with that key, and merges them into output key/value pairs
reduce(k2, list(v2)) -> list(k3, v3)
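The auth.log pipeline maps onto the two signatures directly: grep and cut become the map function, sort becomes the shuffle, and uniq -c becomes the reduce function. A minimal sketch, assuming a hypothetical log format in which the user name follows the word "user"; the sample lines are invented for the example:

```python
from itertools import groupby
from operator import itemgetter

def map_fn(offset, line):
    """map(k1, v1) -> list(k2, v2): emit (user, 1) for each 'session opened' line."""
    if "session opened" not in line:        # the grep step
        return []
    words = line.split()
    user = words[words.index("user") + 1]   # the cut step (assumed log layout)
    return [(user, 1)]

def reduce_fn(user, counts):
    """reduce(k2, list(v2)) -> list(k3, v3): count the sessions per user."""
    return [(user, sum(counts))]

def run(lines):
    intermediate = []
    for offset, line in enumerate(lines):
        intermediate.extend(map_fn(offset, line))
    intermediate.sort(key=itemgetter(0))    # shuffle & sort
    output = []
    for user, pairs in groupby(intermediate, key=itemgetter(0)):
        output.extend(reduce_fn(user, [count for _, count in pairs]))
    return output

log_lines = [
    "Jan 10 12:00:01 host sshd[1]: session opened for user alice by (uid=0)",
    "Jan 10 12:05:00 host sshd[2]: session closed for user alice",
    "Jan 10 12:06:00 host sshd[3]: session opened for user bob by (uid=0)",
    "Jan 10 12:07:00 host sshd[4]: session opened for user alice by (uid=0)",
]
print(run(log_lines))   # [('alice', 2), ('bob', 1)]
```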
Map-Reduce on Hadoop
Hadoop and its elements
[Figure: data flow of a Map-Reduce job on Hadoop]
Input: input files in HDFS (File 1 … File N)
Splits: Split 1 … Split M, one per map task, spread over Machine 1 … Machine M
Record Reader: turns each split into (Key, Value) pairs
Mapper: Map 1 … Map M
Combiner: Combiner 1 … Combiner C
Partitioner: Partition 1 … Partition P
Reducer: Reducer 1 … Reducer R (on Machine x)
Output: output files in HDFS (File 1 … File O)
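The combiner and partitioner stages between map and reduce can be sketched as follows. This is a simplified illustration, not Hadoop's implementation: `zlib.crc32` is used as a stand-in hash so that key placement is stable across runs, and all function names are made up for the example.

```python
import zlib
from collections import Counter, defaultdict

def partition(key, num_reducers):
    """Pick the reducer for a key; every occurrence of a key goes to the same one."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def combine(pairs):
    """Combiner: pre-aggregate one map task's (key, count) pairs locally."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return list(totals.items())

def shuffle(map_outputs, num_reducers):
    """Route the combined pairs of every map task to per-reducer groups."""
    reducers = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in map_outputs:
        for key, count in combine(pairs):
            reducers[partition(key, num_reducers)][key].append(count)
    return reducers

map_outputs = [
    [("a", 1), ("b", 1), ("a", 1)],   # output of map task 1
    [("a", 1), ("c", 1)],             # output of map task 2
]
reducers = shuffle(map_outputs, num_reducers=2)
result = {key: sum(counts) for groups in reducers for key, counts in groups.items()}
print(result)
```

The combiner shrinks map task 1's output from three pairs to two before anything crosses the network, which is the reason it sits on the map side of the diagram.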
Hadoop Eco-system 
•Hadoop Common: The common utilities that support the other Hadoop subprojects. 
•Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
•Hadoop MapReduce: A software framework for distributed processing of large data sets on compute clusters. 
•Other Hadoop-related projects at Apache include: 
–Avro™: A data serialization system. 
–Cassandra™: A scalable multi-master database with no single points of failure. 
–Chukwa™: A data collection system for managing large distributed systems. 
–HBase™: A scalable, distributed database that supports structured data storage for large tables. 
–Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying. 
–Mahout™: A scalable machine learning and data mining library.
–Pig™: A high-level data-flow language and execution framework for parallel computation. 
–ZooKeeper™: A high-performance coordination service for distributed applications.
Exercise: task
You have time-series data (timestamp, ID, value) collected from 10,000 sensors every millisecond. Your central system stores this data and allows more than 500 people to concurrently access it and execute queries on it. While the last month of data is accessed more frequently, some analytics algorithms build models using historical data as well.
•Task:
–Provide an architecture for such a system that meets the following goals
–Fast
–Available
–Fair
–Or, provide analytics algorithm and data-structure design considerations (e.g. k-means clustering, or regression) for 3 months' worth of this data set.
•Group / individual presentation
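A useful first step for either task is to estimate the data volume. A back-of-the-envelope sketch, assuming a 20-byte record (64-bit timestamp, 32-bit ID, 64-bit value), no compression or replication, and 90 days for three months; these are assumptions for illustration, not part of the exercise statement:

```python
SENSORS = 10_000
SAMPLES_PER_SECOND = 1_000     # one reading per sensor per millisecond
RECORD_BYTES = 8 + 4 + 8       # assumed: 64-bit timestamp, 32-bit ID, 64-bit value
SECONDS_PER_DAY = 86_400

records_per_second = SENSORS * SAMPLES_PER_SECOND
bytes_per_day = records_per_second * RECORD_BYTES * SECONDS_PER_DAY
bytes_90_days = bytes_per_day * 90   # assumed: three months is about 90 days

print(f"{records_per_second:,} records/s")              # 10,000,000 records/s
print(f"{bytes_per_day / 10**12:.2f} TB/day")           # 17.28 TB/day
print(f"{bytes_90_days / 10**15:.2f} PB in 90 days")    # 1.56 PB in 90 days
```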
End of session 
Day –1: Introduction

Hadoop introduction
