2013: year of real-time access to Big Data?

              Geoffrey Hendrey

Agenda
•   Hadoop MapReduce basics
•   Hadoop stack & data formats
•   File access times and mechanics
•   Key-based indexing systems (HBase)
•   MapReduce, Hive/Pig
•   MPP approaches & alternatives
A very bad* diagram

*this diagram makes it appear that data flows through the master node.
A better picture
Job Configuration
Map and Reduce Java Code
Reducer Group Iterators
• Reducer groups values together by key
• Your code will iterate over the values, emit reduced
  result

              Bear:[1,1]     →      Bear:2

• Hadoop reducer value iterators return THE SAME
  OBJECT each next(). Object is “reused” to reduce
  garbage collection load
• Beware of “reused” objects (this is a VERY common
  cause of long and confusing debugs)
• Cause for concern: you are emitting an object with
  non-primitive values. STALE “reused object” state from
  previous value.
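The reuse pitfall above can be reproduced without a cluster. The sketch below is a plain-Java simulation, not Hadoop code: `Count` and `reusingIterator` are hypothetical stand-ins for a mutable value type and for an iterator that returns the same instance on every next().

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class ReusedValueDemo {
    /** Mutable holder standing in for a value type (hypothetical). */
    static final class Count {
        int value;
        Count(int value) { this.value = value; }
    }

    /** Iterator that hands back the SAME Count instance on every next(). */
    static Iterator<Count> reusingIterator(int... values) {
        final Count reused = new Count(0);
        return new Iterator<Count>() {
            private int i = 0;
            public boolean hasNext() { return i < values.length; }
            public Count next() {
                reused.value = values[i++]; // overwrite state in place
                return reused;
            }
        };
    }

    /** Buggy: keeps references, so every element aliases the reused object. */
    static List<Integer> collectByReference(Iterator<Count> it) {
        List<Count> kept = new ArrayList<>();
        while (it.hasNext()) kept.add(it.next());
        List<Integer> out = new ArrayList<>();
        for (Count c : kept) out.add(c.value);
        return out;
    }

    /** Safe: copies the state out of the reused object before advancing. */
    static List<Integer> collectByCopy(Iterator<Count> it) {
        List<Integer> out = new ArrayList<>();
        while (it.hasNext()) out.add(it.next().value);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(collectByReference(reusingIterator(1, 2, 3))); // [3, 3, 3]
        System.out.println(collectByCopy(reusingIterator(1, 2, 3)));      // [1, 2, 3]
    }
}
```

The buggy version only ever sees the last value, which is the kind of symptom that makes these bugs slow to track down. Copying the value (or its fields) before calling next() again avoids it.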
Hadoop Writables
• Values in Hadoop are transmitted (shuffled, emitted) in a binary
  format
• Hadoop includes primitive types: IntWritable, Text, LongWritable,
  etc.
• You must implement Writable interface for custom objects
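   public void write(DataOutput d) throws IOException {
       d.writeUTF(this.string);      // write fields in a fixed order
       d.writeByte(this.column);
   }
   public void readFields(DataInput di) throws IOException {
       this.string = di.readUTF();   // read back in the same order
       this.column = di.readByte();
   }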
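A write()/readFields() pair like the one above can be checked with a round trip through java.io streams alone. This is a minimal sketch that does not depend on Hadoop: the class name `Cell` is hypothetical, and in a real job the class would implement Hadoop's Writable interface.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class WritableRoundTrip {
    /** Stand-in for a custom Writable (hypothetical name). */
    static final class Cell {
        String string;
        byte column;

        public void write(DataOutput d) throws IOException {
            d.writeUTF(this.string);
            d.writeByte(this.column);
        }

        public void readFields(DataInput di) throws IOException {
            // Fields must be read in exactly the order they were written.
            this.string = di.readUTF();
            this.column = di.readByte();
        }
    }

    /** Serializes a Cell to bytes and reads it back into a fresh instance. */
    static Cell roundTrip(String s, byte column) {
        try {
            Cell in = new Cell();
            in.string = s;
            in.column = column;
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            in.write(new DataOutputStream(bytes));

            Cell out = new Cell();
            out.readFields(new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray())));
            return out;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Cell c = roundTrip("bear", (byte) 7);
        System.out.println(c.string + ":" + c.column); // bear:7
    }
}
```

A round-trip test of this shape catches the most common mistake, which is reading fields in a different order than they were written.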
Hadoop Keys (WritableComparable)
• Be very careful to implement equals and hashcode
  consistently with compareTo()
• compareTo() will control the sort order of keys
  arriving in reducer
• Hadoop includes ability to write custom partitioner
 public int getPartition(Document doc,
                         Text v, int numReducers) {
     // Same docId always maps to the same reducer
     return doc.getDocId()%numReducers;
 }
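The two points above (consistent equals/hashCode/compareTo, and partitioning by part of the key) can be sketched in plain Java. `DocKey` and its fields are hypothetical; in Hadoop the key would also implement WritableComparable. The sketch uses Math.floorMod rather than `%`, an assumption on my part to keep the partition non-negative if ids can be negative.

```java
import java.util.Objects;

public class DocKeyDemo {
    /** Hypothetical composite key: document id plus field name. */
    static final class DocKey implements Comparable<DocKey> {
        final int docId;
        final String field;

        DocKey(int docId, String field) {
            this.docId = docId;
            this.field = field;
        }

        @Override
        public int compareTo(DocKey o) {
            // Sort by docId first, then by field name.
            int c = Integer.compare(docId, o.docId);
            return c != 0 ? c : field.compareTo(o.field);
        }

        @Override
        public boolean equals(Object o) {
            // Equal exactly when compareTo() returns 0.
            return o instanceof DocKey && compareTo((DocKey) o) == 0;
        }

        @Override
        public int hashCode() {
            // Built from the same fields that compareTo() and equals() use.
            return Objects.hash(docId, field);
        }
    }

    /** Partition on docId only, so all fields of a document reach one reducer. */
    static int getPartition(DocKey key, int numReducers) {
        // floorMod keeps the result in [0, numReducers) even for negative ids.
        return Math.floorMod(key.docId, numReducers);
    }

    public static void main(String[] args) {
        System.out.println(getPartition(new DocKey(10, "title"), 4)); // 2
        System.out.println(getPartition(new DocKey(10, "body"), 4));  // 2
        System.out.println(getPartition(new DocKey(-7, "title"), 4)); // 1
    }
}
```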
Typical Hadoop File Formats
Hadoop Stack Review
Distributed File System
HDFS performance characteristics
• HDFS was designed for high throughput, not low
  seek latency
• best-case configurations have shown HDFS to
  perform 92K/s random reads

• Personal experience: HDFS very robust. Fault
  tolerance is “real”. I’ve unplugged machines
  and never lost data.
Motivation for Real-time Hadoop
• Big Data is more opaque than small data
  – Spreadsheets choke
  – BI tools can’t scale
  – Small samples often fail to replicate issues
• Engineers, data scientists, analysts need:
  – Faster “time to answer” on Big Data
  – Rapid “find, quantify, extract”
• Solve “I don’t know what I don’t know”
• MapReduce jobs are hard to debug
Survey of real-time capabilities

• Real-time, in-situ, self-service is the
  “Holy Grail” for the business analyst
• spectrum of real-time capabilities exists
  on Hadoop

 Available in Hadoop                   Proprietary

           HDFS        HBase   Drill
 Easy                                        Hard
Real-time spectrum on Hadoop
Use Case                                      Support         Real-time

Seek to a particular byte in a distributed    HDFS            YES
file

Seek to a particular value in a distributed   HBase           YES
file, by key (1-dimensional indexing)

Answer complex questions expressible in       MapReduce       NO
code (e.g. matching users to music            (Hive, Pig)
albums). Data science.

Ad-hoc query for scattered records given      MPP             YES
simple constraints (field[4]=="music" &&      Architectures
field[9]=="dvd")
Hadoop Underpinned By HDFS
•   Hadoop Distributed File System (HDFS)
•   inspired by Google FileSystem (GFS)
•   underpins every piece of data in “Hadoop”
•   Hadoop FileSystem API is pluggable
•   HDFS can be replaced with other suitable
    distributed filesystem
    – S3
    – kosmos
    – etc
Amazon S3
MapFile for real-time access?
  – Index file must be loaded by client (slow)
  – Index file must fit in RAM of client by default
  – scan an average of 50% of the sampling
    interval
  – Large records make scanning intolerable
  – not a viable “real world” solution for random
    access
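The lookup cost described above follows from the sparse index: only every Nth key is sampled, so a lookup binary-searches the sampled keys in memory and then scans forward through the data. Below is a simplified in-memory sketch of that idea. It is not the MapFile implementation; arrays stand in for the data file and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class SparseIndexDemo {
    private final String[] keys;    // sorted keys, standing in for the data file
    private final String[] values;
    private final List<Integer> index = new ArrayList<>(); // position of every Nth key
    int recordsScanned;             // records touched by the last get()

    SparseIndexDemo(String[] sortedKeys, String[] values, int interval) {
        this.keys = sortedKeys;
        this.values = values;
        for (int i = 0; i < sortedKeys.length; i += interval) index.add(i);
    }

    /** Builds n records k00, k01, ... with a sample taken every `interval` keys. */
    static SparseIndexDemo build(int n, int interval) {
        String[] k = new String[n], v = new String[n];
        for (int i = 0; i < n; i++) {
            k[i] = String.format("k%02d", i);
            v[i] = "v" + i;
        }
        return new SparseIndexDemo(k, v, interval);
    }

    String get(String key) {
        // Binary search the in-memory index for the last sampled key <= key.
        int lo = 0, hi = index.size() - 1, start = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (keys[index.get(mid)].compareTo(key) <= 0) {
                start = index.get(mid);
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        recordsScanned = 0;
        if (start < 0) return null; // key sorts before the first record
        // Linear scan of the data, starting at the sampled position.
        for (int i = start; i < keys.length; i++) {
            recordsScanned++;
            int c = keys[i].compareTo(key);
            if (c == 0) return values[i];
            if (c > 0) return null; // passed the spot where the key would be
        }
        return null;
    }

    public static void main(String[] args) {
        SparseIndexDemo f = build(10, 4);
        System.out.println(f.get("k06") + " after scanning " + f.recordsScanned + " records");
        // v6 after scanning 3 records
    }
}
```

With large records, each step of that forward scan means reading and skipping a whole record, which is why the scan dominates lookup time.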
Apache HBase

• Clone of Google’s Big Table.
• Key-based access mechanism
• Designed to hold billions of rows
• “Tables” stored in HDFS
• Supports MapReduce over tables, into
  tables
• Requires you to think hard, and commit
  to a key design.
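Why the key design matters: HBase keeps rows sorted by row key, so the key layout decides which lookups and scans are cheap. The sketch below uses a java.util.TreeMap as a stand-in for a sorted table. It does not use the HBase client API, and the `user#timestamp` layout is a hypothetical example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

public class RowKeyDemo {
    // Sorted map standing in for a table whose rows are ordered by key.
    private final TreeMap<String, String> rows = new TreeMap<>();

    /** Composite key: user id, then a zero-padded timestamp so keys sort by time. */
    static String rowKey(String user, long timestamp) {
        return user + "#" + String.format("%013d", timestamp);
    }

    void put(String user, long timestamp, String value) {
        rows.put(rowKey(user, timestamp), value);
    }

    /** Point lookup by full key. */
    String get(String user, long timestamp) {
        return rows.get(rowKey(user, timestamp));
    }

    /** Prefix scan: all rows of one user, in timestamp order. */
    List<String> scanUser(String user) {
        String prefix = user + "#";
        SortedMap<String, String> range =
                rows.subMap(prefix, prefix + Character.MAX_VALUE);
        return new ArrayList<>(range.values());
    }

    public static void main(String[] args) {
        RowKeyDemo table = new RowKeyDemo();
        table.put("alice", 2000L, "b");
        table.put("bob", 1500L, "c");
        table.put("alice", 1000L, "a");
        System.out.println(table.scanUser("alice")); // [a, b]
    }
}
```

With this layout, "all events for one user" is a contiguous range. "All events in one time window across users" is not, and would need a different key or a second table. That is the commitment the key design forces.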
HBase Architecture
HBase random read performance
• 7 servers, each with
   • 8 cores
   • 32GB DDR3 and
   • 24 x 146GB SAS 2.0 10K RPM disks.
• HBase table
   • 3 billion records,
   • 6600 regions.
   • data size is between 128 and 256 bytes per row,
     spread in 1 to 5 columns.
Zoomed-in “Get” time histogram

MapReduce
• “MapReduce is a framework for processing
  parallelizable problems across huge datasets
  using a large number of computers” (Wikipedia)
• MapReduce is strongly tied to HDFS in Hadoop.
• Systems built on HDFS (e.g. HBase) leverage this
  common foundation for integration with the MR
  paradigm
MapReduce and Data Science
• Many complex algorithms can be expressed in
  the MapReduce paradigm
  – NLP
  – Graph processing
  – Image codecs
• The more complex the algorithm, the more Map
  and Reduce processes become complex
  programs in their own right.
• Often cascade multiple MR jobs in succession
Is MapReduce real-time?
• MapReduce on Hadoop has certain latencies
  that are hard to improve
  – Copy
  – Shuffle, sort
  – Iterate
• Run time depends on both the size of the
  input data and the number of processors
  available
• In a nutshell, it’s a “batch process” and isn’t
  “real-time”
Hive and Pig
• Run on top of MapReduce
• Provide “Table” metaphor familiar to SQL users
• Provide SQL-like (or actually same) syntax
• Store a “schema” in a database, mapping tables
  to HDFS files
• Translate “queries” to MapReduce jobs
• No more real-time than MapReduce
MPP Architectures
• Massively Parallel Processing
• Lots of machines, so also lots of memory
Examples:
• Spark – general purpose data science framework,
  sort of like real-time MapReduce for data
  science
• Dremel – columnar approach, geared toward
  answering SQL-like aggregations and BI-style
  questions

Spark
• Originally designed for iterative machine
  learning problems at Berkeley
• MapReduce does not do a great job on iterative
  workloads
• Spark makes more explicit use of memory
  caches than Hadoop
• Spark can load data from any Hadoop input
  source
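Why caching matters for iterative work can be shown with a toy simulation. This is plain Java, not the Spark API: `load()` is a hypothetical stand-in for an expensive input read, and the loop counts how often the input is re-read with and without keeping it in memory.

```java
import java.util.function.Supplier;

public class IterativeCacheDemo {
    static int loads = 0; // how many times the input was (re)read

    /** Hypothetical expensive input read, e.g. from a distributed file system. */
    static double[] load() {
        loads++;
        return new double[] {1.0, 2.0, 3.0, 4.0};
    }

    /** One gradient step that moves the estimate toward the mean of the data. */
    static double step(double estimate, double[] data) {
        double grad = 0;
        for (double x : data) grad += estimate - x;
        return estimate - 0.5 * grad / data.length;
    }

    /** Runs the iterative algorithm, asking `input` for the data on every pass. */
    static double iterate(Supplier<double[]> input, int iterations) {
        double estimate = 0;
        for (int i = 0; i < iterations; i++) {
            estimate = step(estimate, input.get());
        }
        return estimate;
    }

    /** Every iteration re-reads the input, as a chain of separate batch jobs would. */
    static int loadsWithoutCache(int iterations) {
        loads = 0;
        iterate(IterativeCacheDemo::load, iterations);
        return loads;
    }

    /** The input is read once and kept in memory across iterations. */
    static int loadsWithCache(int iterations) {
        loads = 0;
        double[] cached = load();
        iterate(() -> cached, iterations);
        return loads;
    }

    public static void main(String[] args) {
        System.out.println(loadsWithoutCache(10)); // 10
        System.out.println(loadsWithCache(10));    // 1
    }
}
```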
Effect of Memory Caching in Spark
Is Spark Real-time?
• If data fits in memory, execution time for most
  algorithms still depends on
  – amount of data to be processed
  – number of processors
• So, it still “depends”
• …but definitely more focused on fast time-to-answer
• Interactive Scala and Java shells
Dremel MPP architecture
• MPP architecture for ad-hoc query on nested
  data
• Apache Drill is an open-source clone of Dremel
• Dremel originally developed at Google
• Features “in situ” data analysis
• “Dremel is not intended as a replacement for
  MR and is often used in conjunction with it to
  analyze outputs of MR pipelines or rapidly
  prototype larger computations.” (Dremel:
  Interactive Analysis of Web-Scale Datasets)
In Situ Analysis

• Moving Big Data is a nightmare
• In situ: ability to access data in place
  – In HDFS
  – In Big Table
Uses For Dremel At Google
•    Analysis of crawled web documents.
•    Tracking install data for applications on Android
     Market.
•    Crash reporting for Google products.
•    OCR results from Google Books.
•    Spam analysis.
•    Debugging of map tiles on Google Maps.
•    Tablet migrations in managed Bigtable instances.
•    Results of tests run on Google’s distributed build
     system.
•   Etc, etc.
Why so many uses for Dremel?
• On any Big Data problem or application, dev
  team faces these problems:
  – “I don’t know what I don’t know” about data
  – Debugging often requires finding and correlating
    specific needles in the haystack
  – Support and marketing often require segmentation
    analysis (identify and characterize wide swaths of
    data)
• Every developer/analyst wants
  – Faster time to answer
  – Fewer trips around the mulberry bush
Column Oriented Approach
Dremel MPP query execution tree
Is Dremel real-time?
Alternative approaches?
• Both MapReduce and MPP query architectures
  take a “throw hardware at the problem”
  approach.
• Alternatives?
  – Use MapReduce to build distributed indexes on data
  – Combine columnar storage and inverted indexes to
    create columnar inverted indexes
  – Aim for the sweet spot for data scientist and
    engineer: Ad-hoc queries with results returned in
    seconds on a single processing node.
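One way to picture the "columnar inverted index" idea above: keep one inverted index per column, mapping each value to the set of rows that hold it, and answer a conjunctive constraint by intersecting row sets. The sketch below is a single-node, in-memory illustration with hypothetical names. It is not a description of any particular product.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

public class ColumnarInvertedIndex {
    // One inverted index per column: value -> set of row ids holding that value.
    private final Map<Integer, Map<String, BitSet>> columns = new HashMap<>();

    /** Indexes one record; fields[i] is the value in column i. */
    void add(int rowId, String[] fields) {
        for (int col = 0; col < fields.length; col++) {
            columns.computeIfAbsent(col, c -> new HashMap<>())
                   .computeIfAbsent(fields[col], v -> new BitSet())
                   .set(rowId);
        }
    }

    /** Row ids where the given column equals the given value. */
    BitSet rowsWhere(int col, String value) {
        Map<String, BitSet> postings = columns.get(col);
        BitSet hit = postings == null ? null : postings.get(value);
        // Clone so callers can modify the result without touching the index.
        return hit == null ? new BitSet() : (BitSet) hit.clone();
    }

    /** Row ids matching both constraints: intersect the two row sets. */
    BitSet rowsWhereBoth(int colA, String a, int colB, String b) {
        BitSet result = rowsWhere(colA, a);
        result.and(rowsWhere(colB, b));
        return result;
    }

    public static void main(String[] args) {
        ColumnarInvertedIndex idx = new ColumnarInvertedIndex();
        idx.add(0, new String[] {"music", "dvd"});
        idx.add(1, new String[] {"music", "cd"});
        idx.add(2, new String[] {"film", "dvd"});
        System.out.println(idx.rowsWhereBoth(0, "music", 1, "dvd")); // {0}
    }
}
```

The query touches only the two columns named in the constraint and never scans the records themselves, which is what makes answers in seconds on one node plausible for scattered-record lookups.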
Contact Info
Twitter: @geoffhendrey @vertascale


References
• Dremel: Interactive Analysis of Web-Scale Datasets

Real time hadoop + mapreduce intro

  • 1. 2013: year of real-time access to Big Data? Geoffrey Hendrey @geoffhendrey @vertascale
  • 2. Agenda • Hadoop MapReduce basics • Hadoop stack & data formats • File access times and mechanics • Key-based indexing systems (HBase) • MapReduce, Hive/Pig • MPP approaches & alternatives
  • 3. A very bad* diagram *this diagram makes it appear that data flows through the master node.
  • 6. Map and Reduce Java Code
  • 8. Reducer Group Iterators • Reducer groups values together by key • Your code will iterate over the values, emit reduced result Bear:[1,1] Bear:2 • Hadoop reducer value iterators return THE SAME OBJECT each next(). Object is “reused” to reduce garbage collection load • Beware of “reused” objects (this is a VERY common cause of long and confusing debugs) • Cause for concern: you are emitting an object with non-primitive values. STALE “reused object” state from previous value.
  • 9. Hadoop Writables • Values in Hadoop are transmitted (shuffled, emitted) in a binary format • Hadoop includes primitive types: IntWritable, Text, LongWritable, etc • You must implement Writable interface for custom objects public void write(DataOutput d) throws IOException { d.writeUTF(this.string); d.writeByte(this.column); } public void readFields(DataInput di) throws IOException { this.string = di.readUTF(); this.column = di.readByte(); }
  • 10. Hadoop Keys (WritableComparable) • Be very careful to implement equals and hashcode consistently with compareTo() • compareTo() will control the sort order of keys arriving in reducer • Hadoop includes ability to write custom partitioner public int getPartition(Document doc, Text v, int numReducers) { return doc.getDocId()%numReducers; }
  • 14. HDFS performance characteristics • HDFS was designed for high throughput, not low seek latency • best-case configurations have shown HDFS to perform 92K/s random reads • Personal experience: HDFS very robust. Fault tolerance is “real”. I’ve unplugged machines and never lost data.
  • 15. Motivation for Real-time Hadoop • Big Data is more opaque than small data – Spreadsheets choke – BI tools can’t scale – Small samples often fail to replicate issues • Engineers, data scientists, analysts need: – Faster “time to answer” on Big Data – Rapid “find, quantify, extract” • Solve “I don’t know what I don’t know” • MapReduce jobs are hard to debug
  • 16. Survey of real-time capabilities • Real-time, in-situ, self-service is the “Holy Grail” for the business analyst • spectrum of real-time capabilities exists on Hadoop Available in Hadoop Proprietary HDFS HBase Drill Easy Hard
  • 17. Real-time spectrum on Hadoop (Use Case: Support, Real-time) • Seek to a particular byte in a distributed file: HDFS, YES • Seek to a particular value in a distributed file, by key (1-dimensional indexing): HBase, YES • Answer complex questions expressible in code (e.g. matching users to music albums). Data science: MapReduce (Hive, Pig), NO • Ad-hoc query for scattered records given simple constraints (field[4]=="music" && field[9]=="dvd"): MPP Architectures, YES
  • 18. Hadoop Underpinned By HDFS • Hadoop Distributed File System (HDFS) • inspired by Google FileSystem (GFS) • underpins every piece of data in “Hadoop” • Hadoop FileSystem API is pluggable • HDFS can be replaced with other suitable distributed filesystem – S3 – kosmos – etc
  • 20. MapFile for real-time access? – Index file must be loaded by client (slow) – Index file must fit in RAM of client by default – scan an average of 50% of the sampling interval – Large records make scanning intolerable – not a viable “real world” solution for random access
  • 21. Apache HBase • Clone of Google’s Big Table. • Key-based access mechanism • Designed to hold billions of rows • “Tables” stored in HDFS • Supports MapReduce over tables, into tables • Requires you to think hard, and commit to a key design.
  • 23. HBase random read performance • 7 servers, each with • 8 cores • 32GB DDR3 and • 24 x 146GB SAS 2.0 10K RPM disks. • Hbase table • 3 billion records, • 6600 regions. • data size is between 128-256 bytes per row, spread in 1 to 5 columns.
  • 24. Zoomed-in “Get” time histogram
  • 25. MapReduce • “MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of computers”-wikipedia • MapReduce is strongly tied to HDFS in Hadoop. • Systems built on HDFS (i.e. HBase) leverage this common foundation for integration with the MR paradigm
  • 26. MapReduce and Data Science • Many complex algorithms can be expressed in the MapReduce paradigm – NLP – Graph processing – Image codecs • The more complex the algorithm, the more Map and Reduce processes become complex programs in their own right. • Often cascade multiple MR jobs in succession
  • 27. Is MapReduce real-time? • MapReduce on Hadoop has certain latencies that are hard to improve – Copy – Shuffle, sort – Iterate • Run time depends on both the size of the input data and the number of processors available • In a nutshell, it’s a “batch process” and isn’t “real-time”
  • 28. Hive and Pig • Run on top of MapReduce • Provide “Table” metaphor familiar to SQL users • Provide SQL-like (or actually same) syntax • Store a “schema” in a database, mapping tables to HDFS files • Translate “queries” to MapReduce jobs • No more real-time than MapReduce
  • 29. MPP Architectures • Massively Parallel Processing • Lots of machines, so also lots of memory Examples: • Spark – general purpose data science framework sort of like real-time MapReduce for data science • Dremel – columnar approach, geared toward answering SQL-like aggregations and BI-style questions
  • 30. Spark • Originally designed for iterative machine learning problems at Berkeley • MapReduce does not do a great job on iterative workloads • Spark makes more explicit use of memory caches than Hadoop • Spark can load data from any Hadoop input source
  • 31. Effect of Memory Caching in Spark
  • 32. Is Spark Real-time? • If data fits in memory, execution time for most algorithms still depends on – amount of data to be processed – number of processors • So, it still “depends” • …but definitely more focused on fast time-to- answer • Interactive scala and java shells
  • 33. Dremel MPP architecture • MPP architecture for ad-hoc query on nested data • Apache Drill is an open-source clone of Dremel • Dremel originally developed at Google • Features “in situ” data analysis • “Dremel is not intended as a replacement for MR and is often used in conjunction with it to analyze outputs of MR pipelines or rapidly prototype larger computations.” (Dremel: Interactive Analysis of Web-Scale Datasets)
  • 34. In Situ Analysis • Moving Big Data is a nightmare • In situ: ability to access data in place – In HDFS – In Big Table
  • 35. Uses For Dremel At Google • Analysis of crawled web documents. • Tracking install data for applications on Android Market. • Crash reporting for Google products. • OCR results from Google Books. • Spam analysis. • Debugging of map tiles on Google Maps. • Tablet migrations in managed Bigtable instances. • Results of tests run on Google’s distributed build system. • Etc, etc.
  • 36. Why so many uses for Dremel? • On any Big Data problem or application, dev team faces these problems: – “I don’t know what I don’t know” about data – Debugging often requires finding and correlating specific needles in the haystack – Support and marketing often require segmentation analysis (identify and characterize wide swaths of data) • Every developer/analyst wants – Faster time to answer – Fewer trips around the mulberry bush
  • 38. Dremel MPP query execution tree
  • 40. Alternative approaches? • Both MapReduce and MPP query architectures take “throw hardware at the problem” approach. • Alternatives? – Use MapReduce to build distributed indexes on data – Combine columnar storage and inverted indexes to create columnar inverted indexes – Aim for the sweet spot for data scientist and engineer: Ad-hoc queries with results returned in seconds on a single processing node.
  • 41. Contact Info Email: Twitter: @geoffhendrey @vertascale www:
  • 42. references • Dremel: Interactive Analysis of Web-Scale Datasets