Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and Streaming applications

1© Cloudera, Inc. All rights reserved.
13 April 2016
Ted Malaska| Principle Solutions Architect @ Cloudera,
Jonathan Hsieh| HBase Tech Lead @ Cloudera, Apache HBase PMC
Apache HBase + Spark:
Leveraging your Non-Relational
Datastore in Batch and
Streaming applications

About Ted and Jon
Ted Malaska
• Principal Solutions Architect
@ Cloudera
• Apache HBase SparkOnHBase
Contributor
• Contact
• ted.malaska@cloudera.com
Jon Hsieh
• Tech Lead/Eng Manager
HBase Team @ Cloudera
• Apache HBase PMC
• Apache Flume founder
• Contact
• jon@cloudera.com
• @jmhsieh
Hsieh and Malaska, Hadoop Summit EU Dublin 2016

Outline
• Introduction
• Architecture and integration patterns
• Typing and API usage examples
• Future work and Conclusion

• Apache HBase is a distributed non-
relational datastore that specializes in
strongly consistent, low-latency,
random access reads, writes, and
short scans. As a storage system, it is
an obvious source for reading RDDs
and a destination for writing RDDs
• Apache Spark is a distributed in-
memory processing system that can
be used for batch and continuous,
near-real time streaming
jobs. Spark’s programming model is
built upon the RDD (resilient
distributed dataset) abstraction
Apache HBase + Apache Spark

Example Use cases
• Streaming Analytics into HBase to replace Lambda Architectures (with
Kafka)
• Weblogs
• ETL in Spark to bulkload into HBase
• 25-50B records per weekly batch
• Using SQL for extraction layer to query HBase entity-centric timeseries data

Architecture and Integration
Patterns

How does data get in and out of HBase?
HBase Client
Put, Incr, Append
HBase Client
Get, Scan
Bulk Import
HBase Client
HBase ReplicationHBase Replication
low latency
high throughput
Gets
Short scan
Full Scan, Snapshot,
MapReduce
HBase Scanner

HBase + MapReduce: Batch processing patterns
• Read dataset from HBase Table
• Use HBase’s MR InputFormats
• TableInputFormat
• MultiTableInputFormat
• TableSnapshotInputFormat
• Write dataset to HBase Table
• Use HBase’s MR OutputFormat
• TableOutputFormat
• MultiTableOutputFormat
• HFileOutputFormat
Read from HBase Table
Write to HBase Table

HBase + Spark: Batch processing patterns
• Read dataset(RDD) from HBase Table
• Use HBase’s MR InputFormats
• TableInputFormat
• MultiTableInputFormat
• TableSnapshotInputFormat
• Write dataset(RDD) to HBase Table
• Use HBase’s MR OutputFormat
• TableOutputFormat
• MultiTableOutputFormat
• HFileOutputFormat
Read HBase Table as RDD
Write RDD as HBase Table

Spark Streaming
• Take an Data source
• Partition in to mini batches RDDs
• Compute using Spark engine
• Output mini batch RDDs
Mini batch input RDD
Data source
Mini batch output RDD

HBase + Spark Streaming – Enriching With HBase Data
• “Join” a dataset with HBase data
• Enrich Streaming data source with
HBase data
• Extract information from minibatch
• Read/write/update HBase data in
processing
• Output HBase-data enriched stream
of output RDDs
Data source
HBase-enriched mini batch output RDD

How does Spark get data in and out of HBase?
HBase Client
Put, Incr, Append
HBase Client
Get, Scan
Bulk Import
HBase Client
low latency
high throughput
Gets
Short scan
MapReduce
HBase Scanner

HBase Client
Put, Incr, Append
HBase Client
Get, Scan
Bulk Import
HBase Client
low latency
high throughput
Gets
Short scan
MapReduce
HBase Scanner
Batch RDD via HBase’s MR
Input/ Output Formats
Streaming using Hbase to
Enrich stream data
Streaming using HBase to
Enrich stream data

Typing and API Usage

Under the covers
Driver
Walker Node
Configs
Executor
Static Space
Configs
HConnection
Tasks Tasks
Walker Node
Executor
Static Space
Configs
HConnection
Tasks Tasks

Key Addition: HBaseContext
• Create an HBaseContext
// an Hadoop/HBase Configuration object
val conf = HBaseConfiguration.create()
conf.addResource(new Path("/etc/hbase/conf/core-site.xml"))
conf.addResource(new Path("/etc/hbase/conf/hbase-site.xml"))
// sc is the Spark Context; hbase context corresponds to an HBase Connection
val hbaseContext = new HBaseContext(sc, conf)
// A sample RDD
val rdd = sc.parallelize(Array(
(Bytes.toBytes("1")), (Bytes.toBytes("2")),
(Bytes.toBytes("7"))))

• Foreach
• Map
• BulkLoad
• BulkLoadThinRows
• BulkGet (aka Multiget)
• BulkDelete
Operations on the HBaseContext

Foreach
• Read HBase data in parallel for each partition and compute
rdd.hbaseForeachPartition(hbaseContext, (it, conn) => {
// do something
val bufferedMutator = conn.getBufferedMutator(
TableName.valueOf("t1"))
it.foreach(r => {
... // HBase API put/incr/append/cas calls
}
bufferedMutator.flush()
bufferedMutator.close()
})

Map
• Take an HBase dataset and map it in parallel for each partition to produce a new
RDD
val getRdd = rdd.hbaseMapPartitions(hbaseContext, (it, conn) => {
val table = conn.getTable(TableName.valueOf("t1"))
var res = mutable.MutableList[String]()
it.map( r => {
... // HBase API Scan Results
}
})

BulkLoad
• Bulk load a data set into Hbase (for all cases, generally wide tables)
rdd.hbaseBulkLoad (tableName, t => {
Seq((new KeyFamilyQualifier(t.rowKey, t.family,
t.qualifier), t.value)).iterator
},
stagingFolder)

BulkLoadThinRows
• Bulk load a data set into HBase (for skinny tables, <10k cols)
hbaseContext.bulkLoadThinRows[(String, Iterable[(Array[Byte], Array[Byte],
Array[Byte])])] (rdd, TableName.valueOf(tableName), t => {
val rowKey = Bytes.toBytes(t._1)
val familyQualifiersValues = new FamiliesQualifiersValues
t._2.foreach(f => {
val family:Array[Byte] = f._1
val qualifier = f._2
val value:Array[Byte] = f._3
familyQualifiersValues +=(family, qualifier, value)
})
(new ByteArrayWrapper(rowKey), familyQualifiersValues)
}, stagingFolder.getPath)

Scan vs Bulk Get (Parallel HBase Multigets)
Scan HBase Table Bulk Get HBase Table

BulkPut
• Parallelized HBase Multiput
hbaseContext.bulkPut[(Array[Byte], Array[(Array[Byte], Array[Byte],
Array[Byte])])](rdd, tableName, (putRecord) => {
val put = new Put(putRecord._1)
putRecord._2.foreach((putValue) =>
put.add(putValue._1, putValue._2, putValue._3))
put
}
}

BulkDelete
• Parallelized HBase Multi-deletes
hbaseContext.bulkDelete[Array[Byte]](rdd, tableName,
putRecord => new Delete(putRecord),
4) // batch size
rdd.hbaseBulkDelete(hbaseContext, tableName,
putRecord => new Delete(putRecord),
4) // batch size

SparkSQL
• Using SparkSQL to query HBase Data
// Setup Schema Mapping
val dataframe = sqlContext.load("org.apache.hadoop.hbase.spark",
Map("hbase.columns.mapping" -> "KEY_FIELD STRING :key, A_FIELD STRING c:a,
B_FIELD STRING c:b,", "hbase.table" -> "t1"))
dataframe.registerTempTable("hbaseTmp")
// Query
sqlContext.sql("SELECT KEY_FIELD FROM hbaseTmp " +
"WHERE " + "(KEY_FIELD = 'get1' and B_FIELD < '3') or " +
"(KEY_FIELD <= 'get3' and B_FIELD = '8')")
.foreach(r => println(" - "+r))

SparkSQL + MLLib
• Process data extracted from SparkSQL
val resultDf = sqlContext.sql("SELECT gamer_id, oks, games_won, games_played
FROM gamer")
// Parse data to apply typing information
val parsedData = resultDf.map(r => {
val array = Array(r.getInt(1).toDouble, r.getInt(2).toDouble,
r.getInt(3).toDouble)
Vectors.dense(array) })
val dataCount = parsedData.count()
if (dataCount > 0) {
val clusters = KMeans.train(parsedData, 3, 5)
clusters.clusterCenters.foreach(v => println(" Vector Center:" + v))
}

Future work and Conclusion

Development and Distribution Status
• Today
• Batch Analysis patterns with existing MR Input/Output Formats
• Streaming Analysis Patterns
• Committed to HBase trunk branch (2.0) as part of HBase project
• Available in CDH5.7.0 with commercial support
• Used in production and pre-production today at ~10 Cloudera customers
• Recent Additions
• Kerberos and Secure HBase access
• To come: Kerberos ticket renewals for Spark Streaming
• New JSON based HBase table schema specification

HBase Client
Put, Incr, Append
HBase Client
Get, Scan
Bulk Import
HBase Client
low latency
high throughput
Gets
Short scan
Full Scan,
MapReduce
HBase Scanner
Batch RDD via HBase’s MR
Input/ Output Formats
Enrich stream data
Enrich stream data
HBase Data as Spark
Streaming data source

Future: HBase Data as a Source
• HBase edits as a Spark streaming data
source (with Kafka?)
• Gather other data
• Do some computation
• Write the data out
HBase
Replication
Data source

Thank you!

Use Case – Streaming Counting
Hsieh and Malaska, Hadoop Summit EU
• Puts vs Increments
• Bulk Puts/Gets is good
• You can get perfect counting
4/13/2016

DStream
DStream
DStream
Spark Streaming
Single Pass
Source Receiver RDD
Source Receiver RDD
RDD
Filter Count HBase Increments
Source Receiver RDD
RDD
RDD
Single Pass
Filter Count HBase Increments
First
Batch
Second
Batch

DStream
DStream
DStream
Single Pass
Source Receiver RDD
Source Receiver RDD
RDD
Filter Count
HBase Puts
Source Receiver
RDD
partitions
RDD
Parition
RDD
Single Pass
Filter Count
Pre-first
Batch
First
Batch
Second
Batch
Stateful RDD 1
HBase Puts
Stateful RDD 2
Stateful RDD 1
Spark Streaming

Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and Streaming applications

Related slideshows

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (8)

Similar to Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and Streaming applications

Similar to Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and Streaming applications (20)

More from DataWorks Summit/Hadoop Summit

More from DataWorks Summit/Hadoop Summit (20)

Recently uploaded

Recently uploaded (20)

Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and Streaming applications

Editor's Notes