13 April 2016
Ted Malaska | Principal Solutions Architect @ Cloudera
Jonathan Hsieh | HBase Tech Lead @ Cloudera, Apache HBase PMC
Apache HBase + Spark: Leveraging your Non-Relational Datastore in Batch and Streaming applications
About Ted and Jon
Ted Malaska
• Principal Solutions Architect @ Cloudera
• Apache HBase SparkOnHBase Contributor
• Contact
Jon Hsieh
• Tech Lead/Eng Manager, HBase Team @ Cloudera
• Apache HBase PMC
• Apache Flume founder
• Contact
• @jmhsieh
Hsieh and Malaska, Hadoop Summit EU Dublin 2016
Outline
• Introduction
• Architecture and integration patterns
• Typing and API usage examples
• Future work and Conclusion
Apache HBase + Apache Spark
• Apache HBase is a distributed non-relational datastore that specializes in strongly consistent, low-latency, random access reads, writes, and short scans. As a storage system, it is an obvious source for reading RDDs and a destination for writing RDDs.
• Apache Spark is a distributed in-memory processing system that can be used for batch and continuous, near-real-time streaming jobs. Spark’s programming model is built upon the RDD (resilient distributed dataset) abstraction.
Example Use cases
• Streaming Analytics into HBase to replace Lambda Architectures (with Kafka)
• Weblogs
• ETL in Spark to bulkload into HBase
• 25-50B records per weekly batch
• Using SQL for extraction layer to query HBase entity-centric timeseries data
Architecture and Integration Patterns
How does data get in and out of HBase?
[Diagram of HBase data paths. Labels: HBase Client (Put, Incr, Append); HBase Client (Get, Scan); Bulk Import; HBase Replication; HBase Scanner. Access ranges from low latency (Gets, short scan) to high throughput (full scan, snapshot, MapReduce).]
HBase + MapReduce: Batch processing patterns
• Read dataset from HBase Table
• Use HBase’s MR InputFormats
• TableInputFormat
• MultiTableInputFormat
• TableSnapshotInputFormat
• Write dataset to HBase Table
• Use HBase’s MR OutputFormat
• TableOutputFormat
• MultiTableOutputFormat
• HFileOutputFormat
Read from HBase Table
Write to HBase Table
HBase + Spark: Batch processing patterns
• Read dataset (RDD) from HBase Table
• Use HBase’s MR InputFormats
• TableInputFormat
• MultiTableInputFormat
• TableSnapshotInputFormat
• Write dataset (RDD) to HBase Table
• Use HBase’s MR OutputFormat
• TableOutputFormat
• MultiTableOutputFormat
• HFileOutputFormat
Read HBase Table as RDD
Write RDD as HBase Table
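As a rough sketch of the "Read HBase Table as RDD" pattern, the snippet below passes HBase's TableInputFormat to Spark's newAPIHadoopRDD. The table name "t1", the application name, and the local master fallback are placeholder assumptions. Running it needs a reachable HBase cluster and the HBase client and MapReduce jars on the classpath.

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.{SparkConf, SparkContext}

object ReadTableAsRdd {
  def main(args: Array[String]): Unit = {
    // Assumption: fall back to a local master when none is supplied by spark-submit.
    val sparkConf = new SparkConf()
      .setAppName("read-hbase-table")
      .setIfMissing("spark.master", "local[2]")
    val sc = new SparkContext(sparkConf)

    // Picks up hbase-site.xml from the classpath; "t1" is a placeholder table.
    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, "t1")

    // Each record is (row key wrapper, Result); one Spark partition per HBase region.
    val hbaseRdd = sc.newAPIHadoopRDD(
      conf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    // Convert to plain strings before collecting, since Result is not serializable.
    val rowKeys = hbaseRdd.map { case (_, result) => Bytes.toString(result.getRow) }
    rowKeys.take(10).foreach(println)

    sc.stop()
  }
}
```

Swapping in TableSnapshotInputFormat or one of the output formats follows the same shape, with the corresponding Hadoop configuration keys.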
Spark Streaming
• Take a data source
• Partition it into mini-batch RDDs
• Compute using the Spark engine
• Output mini-batch RDDs
[Diagram: Data source, Mini batch input RDD, Mini batch output RDD]
HBase + Spark Streaming – Enriching With HBase Data
• “Join” a dataset with HBase data
• Enrich a streaming data source with HBase data
• Extract information from each mini-batch
• Read/write/update HBase data in processing
• Output an HBase-data-enriched stream of output RDDs
[Diagram: Data source, Mini batch input RDD, HBase-enriched mini batch output RDD]
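The enrichment flow can be modeled without a cluster. The sketch below is plain Scala, not Spark or HBase code: an in-memory Map stands in for the HBase table, batches are cut by size rather than by time as Spark Streaming does, and the Event and Enriched types and the sample data are hypothetical.

```scala
object EnrichSketch {
  // Hypothetical record types, for illustration only.
  final case class Event(userId: String, action: String)
  final case class Enriched(userId: String, action: String, country: String)

  // Stand-in for an HBase table keyed by user id.
  val profiles: Map[String, String] = Map("u1" -> "IE", "u2" -> "US")

  // Cut the input into mini-batches, then enrich each batch using one
  // batched lookup (the role a multi-get would play against HBase).
  def enrich(events: Seq[Event], batchSize: Int): List[List[Enriched]] =
    events.grouped(batchSize).map { batch =>
      val keys = batch.map(_.userId).distinct
      val fetched = keys.flatMap(k => profiles.get(k).map(v => k -> v)).toMap
      batch.map(e => Enriched(e.userId, e.action, fetched.getOrElse(e.userId, "unknown"))).toList
    }.toList

  def main(args: Array[String]): Unit = {
    val events = Seq(Event("u1", "click"), Event("u3", "view"), Event("u2", "view"))
    enrich(events, 2).zipWithIndex.foreach { case (batch, i) =>
      println(s"batch $i: $batch")
    }
  }
}
```

Looking up the distinct keys once per batch, rather than once per event, is the design point: it is what makes batched gets against HBase cheaper than per-record round trips.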
How does Spark get data in and out of HBase?
[Diagram of HBase data paths annotated with Spark integration points. Labels: HBase Client (Put, Incr, Append); HBase Client (Get, Scan); Bulk Import; HBase Replication; HBase Scanner; low latency (Gets, short scan) to high throughput (full scan, snapshot, MapReduce). Annotations: Batch RDD via HBase’s MR Input/Output Formats; Streaming using HBase to enrich stream data (shown twice).]
Typing and API Usage
Under the covers
[Diagram: the Driver holds the Configs. Each Worker Node runs an Executor whose static space holds the Configs and an HConnection, alongside that executor's Tasks.]
Key Addition: HBaseContext
• Create an HBaseContext
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.util.Bytes

// a Hadoop/HBase Configuration object
val conf = HBaseConfiguration.create()
conf.addResource(new Path("/etc/hbase/conf/core-site.xml"))
conf.addResource(new Path("/etc/hbase/conf/hbase-site.xml"))
// sc is the Spark Context; hbase context corresponds to an HBase Connection
val hbaseContext = new HBaseContext(sc, conf)
// A sample RDD of row keys
val rdd = sc.parallelize(Array(
  Bytes.toBytes("1"), Bytes.toBytes("2"),
  Bytes.toBytes("3"), Bytes.toBytes("4"),
  Bytes.toBytes("5"), Bytes.toBytes("6"),
  Bytes.toBytes("7")))
Operations on the HBaseContext
• Foreach
• Map
• BulkLoad
• BulkLoadThinRows
• BulkGet (aka Multiget)
• BulkDelete
Foreach
• Read HBase data in parallel for each partition and compute
rdd.hbaseForeachPartition(hbaseContext, (it, conn) => {
  // do something
  val bufferedMutator = conn.getBufferedMutator(TableName.valueOf("t1"))
  it.foreach(r => {
    ... // HBase API put/incr/append/cas calls
  })
  // push any buffered mutations before releasing the mutator
  bufferedMutator.flush()
  bufferedMutator.close()
})
Map
• Take an HBase dataset and map it in parallel for each partition to produce a new RDD
import scala.collection.mutable

val getRdd = rdd.hbaseMapPartitions(hbaseContext, (it, conn) => {
  val table = conn.getTable(TableName.valueOf("t1"))
  var res = mutable.MutableList[String]()
  it.map(r => {
    ... // HBase API Scan Results
  })
})
BulkLoad
• Bulk load a data set into HBase (for all cases, generally wide tables)
rdd.hbaseBulkLoad(tableName, t => {
  // one (row key, family, qualifier) -> value cell per input record
  Seq((new KeyFamilyQualifier(t.rowKey, t.family, t.qualifier), t.value)).iterator
}, stagingFolder)
BulkLoadThinRows
• Bulk load a data set into HBase (for skinny tables, <10k cols)
hbaseContext.bulkLoadThinRows[(String, Iterable[(Array[Byte], Array[Byte], Array[Byte])])](
  rdd, TableName.valueOf(tableName), t => {
    val rowKey = Bytes.toBytes(t._1)
    // collect all (family, qualifier, value) cells of this row
    val familyQualifiersValues = new FamiliesQualifiersValues
    t._2.foreach(f => {
      val family: Array[Byte] = f._1
      val qualifier = f._2
      val value: Array[Byte] = f._3
      familyQualifiersValues += (family, qualifier, value)
    })
    (new ByteArrayWrapper(rowKey), familyQualifiersValues)
  }, stagingFolder.getPath)
Scan vs Bulk Get (Parallel HBase Multigets)
Scan HBase Table Bulk Get HBase Table
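A Bulk Get can be sketched as follows. This assumes the bulkGet signature of the hbase-spark module (table name, batch size, RDD of keys, a function building a Get, a function converting each Result). The table "t1", the sample row keys, and the batch size of 2 are placeholders. It needs a running HBase cluster and the hbase-spark jar.

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{Get, Result}
import org.apache.hadoop.hbase.spark.HBaseContext
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.{SparkConf, SparkContext}

object BulkGetSketch {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
      .setAppName("hbase-bulk-get")
      .setIfMissing("spark.master", "local[2]") // assumption: local fallback
    val sc = new SparkContext(sparkConf)
    val hbaseContext = new HBaseContext(sc, HBaseConfiguration.create())

    // Placeholder row keys to fetch.
    val rowKeys = sc.parallelize(Seq("1", "2", "3").map(s => Bytes.toBytes(s)))

    // Each partition issues multigets of up to `batchSize` Gets at a time.
    val rows = hbaseContext.bulkGet[Array[Byte], String](
      TableName.valueOf("t1"),
      2, // batch size (placeholder)
      rowKeys,
      (key: Array[Byte]) => new Get(key),
      // Results for missing rows are empty, so getRow may be null here.
      (result: Result) => Bytes.toString(result.getRow))

    rows.collect().foreach(println)
    sc.stop()
  }
}
```

A scan reads a contiguous key range region by region, while a bulk get fetches an arbitrary set of keys in parallel, which tends to suit lookups driven by another dataset.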
BulkPut
• Parallelized HBase Multiput
hbaseContext.bulkPut[(Array[Byte], Array[(Array[Byte], Array[Byte], Array[Byte])])](
  rdd, tableName, (putRecord) => {
    val put = new Put(putRecord._1)
    // addColumn replaces the deprecated Put.add(family, qualifier, value)
    putRecord._2.foreach((putValue) =>
      put.addColumn(putValue._1, putValue._2, putValue._3))
    put
  })
BulkDelete
• Parallelized HBase Multi-deletes
// via the HBaseContext
hbaseContext.bulkDelete[Array[Byte]](rdd, tableName,
  rowKey => new Delete(rowKey),
  4) // batch size
// equivalent call via the implicit RDD function
rdd.hbaseBulkDelete(hbaseContext, tableName,
  rowKey => new Delete(rowKey),
  4) // batch size
SparkSQL
• Using SparkSQL to query HBase Data
// Setup Schema Mapping
val dataframe = sqlContext.load("org.apache.hadoop.hbase.spark",
  Map("hbase.columns.mapping" ->
        "KEY_FIELD STRING :key, A_FIELD STRING c:a, B_FIELD STRING c:b,",
      "hbase.table" -> "t1"))
// Register the DataFrame under the table name used in the query
dataframe.registerTempTable("hbaseTmp")
// Query
sqlContext.sql("SELECT KEY_FIELD FROM hbaseTmp " +
  "WHERE " + "(KEY_FIELD = 'get1' and B_FIELD < '3') or " +
  "(KEY_FIELD <= 'get3' and B_FIELD = '8')")
  .foreach(r => println(" - " + r))
SparkSQL + MLlib
• Process data extracted from SparkSQL
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val resultDf = sqlContext.sql(
  "SELECT gamer_id, oks, games_won, games_played FROM gamer")
// Parse data to apply typing information
val parsedData = resultDf.rdd.map(r => {
  val array = Array(r.getInt(1).toDouble, r.getInt(2).toDouble,
    r.getInt(3).toDouble)
  Vectors.dense(array)
})
val dataCount = parsedData.count()
if (dataCount > 0) {
  // 3 clusters, 5 iterations
  val clusters = KMeans.train(parsedData, 3, 5)
  clusters.clusterCenters.foreach(v => println(" Vector Center:" + v))
}
Future work and Conclusion
Development and Distribution Status
• Today
• Batch Analysis patterns with existing MR Input/Output Formats
• Streaming Analysis Patterns
• Committed to HBase trunk branch (2.0) as part of HBase project
• Available in CDH5.7.0 with commercial support
• Used in production and pre-production today at ~10 Cloudera customers
• Recent Additions
• Kerberos and Secure HBase access
• To come: Kerberos ticket renewals for Spark Streaming
• New JSON-based HBase table schema specification
How does Spark get data in and out of HBase?
[Diagram of HBase data paths annotated with Spark integration points. Labels: HBase Client (Put, Incr, Append); HBase Client (Get, Scan); Bulk Import; HBase Replication; HBase Scanner; low latency (Gets, short scan) to high throughput (full scan, MapReduce). Annotations: Batch RDD via HBase’s MR Input/Output Formats; Streaming using HBase to enrich stream data (shown twice); HBase Data as Spark Streaming data source.]
Future: HBase Data as a Source
• HBase edits as a Spark streaming data source (with Kafka?)
• Gather other data
• Do some computation
• Write the data out
[Diagram: HBase Replication, Data source, Mini batch input RDD]
Thank you!
Use Case – Streaming Counting
• Puts vs Increments
• Bulk Puts/Gets are good
• You can get perfect counting
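One reading of "Puts vs Increments" is about what happens when a mini-batch is replayed after a failure. The plain-Scala sketch below illustrates that reading only: a Map stands in for an HBase counter table, and all names are hypothetical. Applying the same increments twice double-counts, while writing the job's running totals as puts gives the same result however often it is retried.

```scala
object CountingSketch {
  // Stand-in for an HBase table of counters.
  type Store = Map[String, Long]

  // Count occurrences of each key within one mini-batch.
  def countBatch(batch: Seq[String]): Map[String, Long] =
    batch.groupBy(identity).map { case (k, v) => k -> v.size.toLong }

  // Increment style: add each delta to whatever is already stored.
  def applyIncrements(store: Store, deltas: Map[String, Long]): Store =
    deltas.foldLeft(store) { case (s, (k, d)) => s.updated(k, s.getOrElse(k, 0L) + d) }

  // Put style: overwrite stored values with totals tracked by the job.
  def applyPuts(store: Store, totals: Map[String, Long]): Store =
    store ++ totals

  def main(args: Array[String]): Unit = {
    val deltas = countBatch(Seq("a", "b", "a"))

    // The same batch is applied twice, as after a retried task.
    val viaIncrements = applyIncrements(applyIncrements(Map.empty, deltas), deltas)

    // Running totals live in the job's state; the write is a plain overwrite.
    val totals = applyIncrements(Map.empty, deltas)
    val viaPuts = applyPuts(applyPuts(Map.empty, totals), totals)

    println(s"increments after replay: $viaIncrements") // a counted 4 times, b twice
    println(s"puts after replay:       $viaPuts")       // a counted 2 times, b once
  }
}
```

In Spark Streaming the running totals would typically be held in stateful RDDs, so the put carries the absolute count rather than a delta.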
[Diagram: Spark Streaming with HBase Increments. For each batch (first, second), the receiver RDDs of the DStream go through a single pass of Filter and Count, and the result is written as HBase Increments.]
[Diagram: Spark Streaming with HBase Puts. For each batch (pre-first, first, second), the RDD partitions go through a single pass of Filter and Count, the counts update a stateful RDD (Stateful RDD 1, Stateful RDD 2), and the result is written as HBase Puts.]


