SlideShare a Scribd company logo
HBase Tuning
Performance and Correctness
Lars Hofhansl
Principal Architect, Salesforce (10 years!)
HBase, Phoenix Committer, PMC
Apache Incubator PMC
Apache Foundation Member
http://hadoop-hbase.blogspot.com/
Apache HBase Performance Tuning
Boring Topic
Experiment with Colorful Slides
Agenda
• HDFS
• HBase – Server
• HBase – Client
• Correctness
• Performance

Recommended for you

Scaling HBase for Big Data
Scaling HBase for Big DataScaling HBase for Big Data
Scaling HBase for Big Data

The tech talk was gieven by Ranjeeth Kathiresan, Salesforce Senior Software Engineer & Gurpreet Multani, Salesforce Principal Software Engineer in June 2017.

big datasalesforceengineering
Facebook Messages & HBase
Facebook Messages & HBaseFacebook Messages & HBase
Facebook Messages & HBase

The document discusses Facebook's use of HBase to store messaging data. It provides an overview of HBase, including its data model, performance characteristics, and how it was a good fit for Facebook's needs due to its ability to handle large volumes of data, high write throughput, and efficient random access. It also describes some enhancements Facebook made to HBase to improve availability, stability, and performance. Finally, it briefly mentions Facebook's migration of messaging data from MySQL to their HBase implementation.

hbase
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBase

This document discusses file system usage in HBase. It provides an overview of the three main file types in HBase: write-ahead logs (WALs), data files, and reference files. It describes durability semantics, IO fencing techniques for region server recovery, and how HBase leverages data locality through short circuit reads, checksums, and block placement hints. The document is intended help understand HBase's interactions with HDFS for tuning IO performance.

filesystemhbasehadoop
HDFS
hdfs-site.xml
HDFS - Background
• Stores HBase WAL and HFiles
• No sync-to-disk by default
• Datanode writes tmp file, moves it into place
• Old data lost on power outage
HDFS Correctness Settings
• dfs.datanode.synconclose = true
(since Hadoop 1.1)
• mount ext4 with dirsync! Or use XFS
• You must do this!
HDFS Performance Settings
1. Sync behind writes
2. Stale Datanode Detection
3. Short Circuit Reads
4. Miscellaneous Settings

Recommended for you

HBase Storage Internals
HBase Storage InternalsHBase Storage Internals
HBase Storage Internals

Apache HBase is the Hadoop opensource, distributed, versioned storage manager well suited for random, realtime read/write access. This talk will give an overview on how HBase achieve random I/O, focusing on the storage layer internals. Starting from how the client interact with Region Servers and Master to go into WAL, MemStore, Compactions and on-disk format details. Looking at how the storage is used by features like snapshots, and how it can be improved to gain flexibility, performance and space efficiency.

clouderahbasehadoop summit
Off-heaping the Apache HBase Read Path
Off-heaping the Apache HBase Read Path Off-heaping the Apache HBase Read Path
Off-heaping the Apache HBase Read Path

Anoop Sam John and Ramkrishna Vasudevan (Intel) HBase provides an LRU based on heap cache but its size (and so the total data size that can be cached) is limited by Java’s max heap space. This talk highlights our work under HBASE-11425 to allow the HBase read path to work directly from the off-heap area.

intelhbaseconhbase
Apache Hadoop YARN: best practices
Apache Hadoop YARN: best practicesApache Hadoop YARN: best practices
Apache Hadoop YARN: best practices

This document provides best practices for YARN administrators and application developers. For administrators, it discusses YARN configuration, enabling ResourceManager high availability, configuring schedulers like Capacity Scheduler and Fair Scheduler, sizing containers, configuring NodeManagers, log aggregation, and metrics. For application developers, it discusses whether to use an existing framework or develop a native application, understanding YARN components, writing the client, and writing the ApplicationMaster.

HDFS Sync Behind Writes
• Syncs partial blocks to disk – best effort
(OK, since blocks are immutable)
• Necessary with sync-on-close for performance
• Always enable this
• dfs.datanode.sync.behind.writes = true
(Since Hadoop 1.1)
Stale Datanodes - Background
• Datanodes (DNs) send block reports to the
Namenode (NN)
• After 10min(!) w/o a report, DN is declared dead
• NN will still direct reads and writes to those DNs
• Bad for recovery. Down by 1 DN by definition.
(every 3rd read/write goes to a bad DN)
Stale Datanodes - Detection
Don’t use a DN for read or write when it looks like it is
stale (default off)
• dfs.namenode.avoid.read.stale.datanode = true
• dfs.namenode.avoid.write.stale.datanode = true
• dfs.namenode.stale.datanode.interval = 30000
(default)
HDFS short circuit reads
Read local blocks directly without DN, when
RegionServers and DNs are co-located.
• dfs.client.read.shortcircuit = true
• dfs.client.read.shortcircuit.buffer.size = 131072
(important, OOM on direct buffers, default on 0.98+)
• hbase.regionserver.checksum.verify = true
(default on 0.98+)
• dfs.domain.socket.path
(local Unix domain socket, not group or world readable)

Recommended for you

HBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation Buffers
HBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation BuffersHBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation Buffers
HBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation Buffers

Todd Lipcon presents a solution to avoid full garbage collections (GCs) in HBase by using MemStore-Local Allocation Buffers (MSLABs). The document outlines that write operations in HBase can cause fragmentation in the old generation heap, leading to long GC pauses. MSLABs address this by allocating each MemStore's data into contiguous 2MB chunks, eliminating fragmentation. When MemStores flush, the freed chunks are large and contiguous. With MSLABs enabled, the author saw basically zero full GCs during load testing. MSLABs improve performance and stability by preventing GC pauses caused by fragmentation.

hbase hadoop clouderahbase hug presentationmemstore-local allocation buffers
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries

Apache Hive is a data warehousing system for large volumes of data stored in Hadoop. However, the data is useless unless you can use it to add value to your company. Hive provides a SQL-based query language that dramatically simplifies the process of querying your large data sets. That is especially important while your data scientists are developing and refining their queries to improve their understanding of the data. In many companies, such as Facebook, Hive accounts for a large percentage of the total MapReduce queries that are run on the system. Although Hive makes writing large data queries easier for the user, there are many performance traps for the unwary. Many of them are artifacts of the way Hive has evolved over the years and the requirement that the default behavior must be safe for all users. This talk will present examples of how Hive users have made mistakes that made their queries run much much longer than necessary. It will also present guidelines for how to get better performance for your queries and how to look at the query plan to understand what Hive is doing.

owen o'malleyhadoophive
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and Hudi

Flink Forward San Francisco 2022. With a real-time processing engine like Flink and a transactional storage layer like Hudi, it has never been easier to build end-to-end low-latency data platforms connecting sources like Kafka to data lake storage. Come learn how to blend Lakehouse architectural patterns with real-time processing pipelines with Flink and Hudi. We will dive deep on how Flink can leverage the newest features of Hudi like multi-modal indexing that dramatically improves query and write performance, data skipping that reduces the query latency by 10x for large datasets, and many more innovations unique to Flink and Hudi. by Ethan Guo & Kyle Weller

stream processingbig dataapache flink
Misc HDFS tips
Keep DN running with some failed disks
• dfs.datanode.failed.volumes.tolerated = <N>
(tolerate losing this many disks)
Distribute data across disks at a DN
• dfs.datanode.fsdataset.volume.choosing.policy =
AvailableSpaceVolumeChoosingPolicy
(HDFS-1804 hit drives with more space with higher probability for writes when free space
differs by more than 10GB by default)
Misc HDFS settings
(just trust me on these)
• dfs.block.size = 268435456
(note that WAL is rolled at 95% of this)
• ipc.server.tcpnodelay = true
• ipc.client.tcpnodelay = true
Misc HDFS settings
(just trust me on these, really)
• dfs.datanode.max.xcievers = 8192
• dfs.namenode.handler.count = 64
• dfs.datanode.handler.count = 8
(match number of spindles)
Apache HBase Performance Tuning

Recommended for you

Intro to HBase
Intro to HBaseIntro to HBase
Intro to HBase

This document introduces HBase, an open-source, non-relational, distributed database modeled after Google's BigTable. It describes what HBase is, how it can be used, and when it is applicable. Key points include that HBase stores data in columns and rows accessed by row keys, integrates with Hadoop for MapReduce jobs, and is well-suited for large datasets, fast random access, and write-heavy applications. Common use cases involve log analytics, real-time analytics, and messages-centered systems.

hbase introduction
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...

Building highly efficient data lakes using Apache Hudi (Incubating) Even with the exponential growth in data volumes, ingesting/storing/managing big data remains unstandardized & in-efficient. Data lakes are a common architectural pattern to organize big data and democratize access to the organization. In this talk, we will discuss different aspects of building honest data lake architectures, pin pointing technical challenges and areas of inefficiency. We will then re-architect the data lake using Apache Hudi (Incubating), which provides streaming primitives right on top of big data. We will show how upserts & incremental change streams provided by Hudi help optimize data ingestion and ETL processing. Further, Apache Hudi manages growth, sizes files of the resulting data lake using purely open-source file formats, also providing for optimized query performance & file system listing. We will also provide hands-on tools and guides for trying this out on your own data lake. Speaker: Vinoth Chandar (Uber) Vinoth is Technical Lead at Uber Data Infrastructure Team

sf big analyticsubergopro
HBase Advanced - Lars George
HBase Advanced - Lars GeorgeHBase Advanced - Lars George
HBase Advanced - Lars George

This talk delves into the many ways that a user has to use HBase in a project. Lars will look at many practical examples based on real applications in production, for example, on Facebook and eBay and the right approach for those wanting to find their own implementation. He will also discuss advanced concepts, such as counters, coprocessors and schema design.

apache hadoopnosqlhbase
HBase
RegionServer Settings
hbase-site.xml
Compactions
Compactions - Background
• Writes are buffered in the memstore
• Memstore contents flushed to disk as HFiles
• Need to limit # HFiles by rewriting small HFiles
into fewer larger ones
• Remove deleted and expired Cells
• Same data written multiple times => Write
Amplification!
Read vs. Write
• Read requires merging HFiles => fewer is
better
• Write throughput better with fewer
compactions => leads to more files
• Optimize for Read or Write, not both

Recommended for you

Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing

Apache Tez is a framework for accelerating Hadoop query processing. It is based on expressing a computation as a dataflow graph and executing it in a highly customizable way. Tez is built on top of YARN and provides benefits like better performance, predictability, and utilization of cluster resources compared to traditional MapReduce. It allows applications to focus on business logic rather than Hadoop internals.

Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities

Learn about Hudi's architecture, concurrency control mechanisms, table services and tools. By : Abhishek Modi, Balajee Nagasubramaniam, Prashant Wason, Satish Kotha, Nishith Agarwal

big datadistributed databasesanalytics
Getting Started with HBase
Getting Started with HBaseGetting Started with HBase
Getting Started with HBase

This document provides an overview of HBase and why NoSQL databases like HBase were developed. It discusses how relational databases do not scale horizontally well with large amounts of data. HBase was created by Google to address these scaling issues and was inspired by their BigTable database. The document explains the HBase data model with rows, columns, and versions. It describes how data is stored physically in HFiles and served from memory and disks. Basic operations like put, get, and scan are also covered.

hbase nosql
Write Amplification
Vs.
Read Performance
Control the number of HFiles
• hbase.hstore.blockingStoreFiles = 10
(do not allow more flushes when there more than <N> files)
small for read, large for write, will stop flushes and writes
• hbase.hstore.compactionThreshold = 3
(number of files that starts a compaction)
small for read, large for write
• hbase.hregion.memstore.flush.size = 128
(max memstore size, default is good)
larger good for fewer compaction (watch Region Server heap)
Time Based Compactions
• HBase does time based major compactions
• expensive, always at wrong time
• hbase.hregion.majorcompaction = 604800000
(week, default)
• hbase.hregion.majorcompaction.jitter = 0.5 (½
week, default)
Memstore/Cache Sizing
• hbase.hregion.memstore.flush.size = 128
• hbase.hregion.memstore.block.multiplier
(allow single memstore to grow by this multiplier, good for heavy, bursty
writes)
• hbase.regionserver.global.memstore.upperLimit (0.98)
hbase.regionserver.global.memstore.size (1.0+)
(percent of heap, default 0.4, decrease for read heavy load)
• hfile.block.cache.size
(percent heap used for the block cache, default 0.4)

Recommended for you

Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013

Apache Kafka is a distributed publish-subscribe messaging system that allows both publishing and subscribing to streams of records. It uses a distributed commit log that provides low latency and high throughput for handling real-time data feeds. Key features include persistence, replication, partitioning, and clustering.

zookeepertrihugkafka
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements

This document discusses supporting Apache HBase and improving troubleshooting and supportability. It introduces two Cloudera employees who work on HBase support and provides an overview of typical troubleshooting scenarios for HBase like performance degradation, process crashes, and inconsistencies. The agenda covers using existing tools like logs and metrics to troubleshoot HBase performance issues with a general approach, and introduces htop as a real-time monitoring tool for HBase.

dataworks summit 2019dws19dataworks summit washington dc
HBase at Xiaomi
HBase at XiaomiHBase at Xiaomi
HBase at Xiaomi

Speakers: Liang Xie and Honghua Feng (Xiamoi) This talk covers the HBase environment at Xiaomi, including thoughts and practices around latency, hardware/OS/VM configuration, GC tuning, the use of a new write thread model and reverse scan, and block index optimization. It will also include some discussion of planned JIRAs based on these approaches.

hbasehbasecon
Autotune BlockCache vs. Memstores (1.0+)
HBASE-5349, not well tested, Must Experiment
• hbase.regionserver.global.memstore.size.{max|min}.range
• hfile.block.cache.size.{max|min}.range
• hbase.regionserver.heapmemory.tuner.class
• hbase.regionserver.heapmemory.tuner.period
Data Locality
• Essential for Short Circuit Reads
• hbase.hstore.min.locality.to.skip.major.compact
(compact even when unnecessary to restore locality)
• hbase.master.wait.on.regionservers.timeout
(allow master to wait a bit upon restart, so not all region go to the first servers
who sign in 30-90s is good. Default it 4.5s)
• Don’t use the HDFS balancer!
HBase
Column Family
Settings
Block Encoding
• NONE, FAST_DIFF, PREFIX, etc
• alter 'test', { NAME => 'cf',
DATA_BLOCK_ENCODING => 'FAST_DIFF' }
• Scan friendly, decodes as you scan
• Not so Get friendly (might need to decode many
previous Cells)
• Currently produces a lot of extra garbage
• Safe to enable, always

Recommended for you

Netezza workload management
Netezza workload managementNetezza workload management
Netezza workload management

Netezza provides workload management options to efficiently service user queries. It allows restricting the maximum concurrent jobs, creating resource sharing groups to control resource allocation disproportionately, and uses multiple schedulers like gatekeeper and GRA. Gatekeeper queues jobs and schedules based on priority and resource availability. GRA allocates resources to jobs based on user's resource group. Short queries can be prioritized using short query bias which reserves system resources for such queries.

netezza
Using Netezza Query Plan to Improve Performace
Using Netezza Query Plan to Improve PerformaceUsing Netezza Query Plan to Improve Performace
Using Netezza Query Plan to Improve Performace

The document discusses using Netezza query plans to improve performance. It explains that Netezza generates a query plan detailing the optimal execution path determined by the query optimizer. The query plan shows snippets that read/join data and where they execute. Key points in plans like estimated rows, costs and distributions are examined to identify issues. Common performance problems and steps for analysis are outlined, like generating statistics, changing distributions and rewriting queries.

netezzaperformance tuning
Concurrency
ConcurrencyConcurrency
Concurrency

Parallel processing involves executing multiple tasks simultaneously using multiple cores or processors. It can provide performance benefits over serial processing by reducing execution time. When developing parallel applications, developers must identify independent tasks that can be executed concurrently and avoid issues like race conditions and deadlocks. Effective parallelization requires analyzing serial code to find optimization opportunities, designing and implementing concurrent tasks, and testing and tuning to maximize performance gains.

computer programmingparallel computingconcurrent processing
Compression
• NONE, GZIP, SNAPPY, etc
• create ’test', {NAME => ’cf', COMPRESSION => 'SNAPPY’}}
• Compresses entire blocks, not Scan or Get friendly
• Typically does not achieve much over block encoding
• Blocks cached decompressed, unless
hbase.block.data.cachecompressed = true
(more cache capacity, but every access needs decompressions)
• Need to test with your data
HFile Block Size
• Don’t confuse with HDFS block size!
• create ‘test′,{NAME => ‘cf′, BLOCKSIZE => ’4096'}
• Default 64k good compromise between Scans
and point Gets
• Increase for large Scans
• Decrease for many point gets
• Rarely want to change this, likely never > 1mb
RegionServer - Garbage Collection
(source: http://www.everystockphoto.com)
Weak Generational Hypothesis
Most Allocated Objects Die Young

Recommended for you

Row or Columnar Database
Row or Columnar DatabaseRow or Columnar Database
Row or Columnar Database

There are two main types of relational database management systems (RDBMS): row-based and columnar. Row-based systems store all of a row's data contiguously on disk, while columnar systems store each column's data together across all rows. Columnar databases are generally better for read-heavy workloads like data warehousing that involve aggregating or retrieving subsets of columns, whereas row-based databases are better for transactional systems that require updating or retrieving full rows frequently. The optimal choice depends on the specific access patterns and usage of the data.

data warehousingdatabasedatabase management system
HDFS User Reference
HDFS User ReferenceHDFS User Reference
HDFS User Reference

The document discusses HDFS architecture and components. It describes how HDFS uses NameNodes and DataNodes to store and retrieve file data in a distributed manner across clusters. The NameNode manages the file system namespace and regulates access to files by clients. DataNodes store file data in blocks and replicate them for fault tolerance. The document outlines the write and read workflows in HDFS and how NameNodes and DataNodes work together to manage data storage and access.

hadoophdfs
Websphere MQ (MQSeries) fundamentals
Websphere MQ (MQSeries) fundamentalsWebsphere MQ (MQSeries) fundamentals
Websphere MQ (MQSeries) fundamentals

The document provides an overview of the fundamentals of Websphere MQ including: - The key MQ objects like messages, queues, channels and how they work - Basic MQ administration tasks like defining, displaying, altering and deleting MQ objects using MQSC commands - Hands-on exercises are included to demonstrate programming with MQ and administering MQ objects

messagingmqqueuing
Garbage Collection - Background
HotSpot manages four generations (CMS collector):
• Eden for all new objects
• Survivor I and II where surviving objects are promoted when
eden is collected
• Tenured space. Objects surviving a few rounds (16 by default)
of eden/survivor collection are promoted into the tenured
space
• Perm gen for classes, interned strings, and other more or less
permanent objects. (gone, finally, in JDK8)
Garbage Collection - HBase
• Garbage from operations is shortlived (single
RPC)
• Memstore is relatively long-lived
(allocated in 2mb chunks)
• Blockcache is long-lived
(allocation in 64k blocks)
• Deal with the “operational” garbage efficiently
Garbage Collection (CMS)
-Xmn512m
very small eden space
-XX:+UseParNewGC
collect eden in parallel
-XX:+UseConcMarkSweepGC
use the non-moving CMS collector
-XX:CMSInitiatingOccupancyFraction=70
start collecting when 70% of tenured gen is full, avoid collection under pressure
-XX:+UseCMSInitiatingOccupancyOnly
do not try to adjust CMS setting
RegionServer Machine Sizing

Recommended for you

Project Risk Management
Project Risk ManagementProject Risk Management
Project Risk Management

The document discusses project risk management. It defines project risk as the loss multiplied by the likelihood. Successful project leaders plan thoroughly to understand challenges, anticipate problems, and minimize variation. Project failures can occur if objectives are impossible, deliverables are possible but other objectives are unrealistic, or feasible deliverables and objectives but insufficient planning. Risk management includes qualitative and quantitative risk assessment to understand probability and impact of risks. It is important to document risks, have risk management plans, and regularly review assumptions and risks.

risk managementproject managementrisk
Netezza fundamentals for developers
Netezza fundamentals for developersNetezza fundamentals for developers
Netezza fundamentals for developers

This document provides an introduction to the Netezza database appliance, including its architecture and key components. The Netezza uses an Asymmetric Massively Parallel Processing (AMPP) architecture with an array of servers (S-Blades) connected to disks. Each S-Blade contains a Database Accelerator card that offloads processing from the CPU. The document outlines the various hardware components and how they work together to process queries in parallel. It also defines common Netezza objects like users, groups, tables and databases that can be created and managed.

netezzapure systems
NENUG Apr14 Talk - data modeling for netezza
NENUG Apr14 Talk - data modeling for netezzaNENUG Apr14 Talk - data modeling for netezza
NENUG Apr14 Talk - data modeling for netezza

This document discusses considerations for data modeling on Netezza appliances to optimize performance. It recommends distributing data uniformly across snippet processors to maximize parallel processing. When joining tables, the distribution key should match join columns to keep processors independent. Zone maps and clustered tables can reduce data reads from disk. Materialized views on frequently accessed columns further improve performance for single table and join queries.

data modelingnetezzaperformance
RegionServer Machine Sizing
• How much RAM/Heap?
• How many disks?
• What size of disk?
• Network?
• Number of cores?
RegionServer Disk/Java Heap ratio
• Disk/Heap ratio:
RegionSize / MemstoreSize *
ReplicationFactor *
HeapFractionForMemstores * 2
(assuming memstores on average ½ filled)
• 10gb/128mb * 3 * 0.4 * 2 = 192, with default
settings
RegionServer Disk/Java Heap ratio
• Each 192 bytes on disk need 1 byte of Heap
• With 32gb of heap, can barely fill 6T
disk/machine
(32gb * 192 = 6tb)
192?!
W.T.F.
How about 1gb regions?
1gb/128mb * 3 * 0.4 * 2 = 19

Recommended for you

HBase Low Latency, StrataNYC 2014
HBase Low Latency, StrataNYC 2014HBase Low Latency, StrataNYC 2014
HBase Low Latency, StrataNYC 2014

We start by looking at distributed database features that impact latency. Then we take a deeper look at the HBase read and write paths with a focus on request latency. We examine the sources of latency and how to minimize them.

hbaselatencyperformance
Hbase: an introduction
Hbase: an introductionHbase: an introduction
Hbase: an introduction

Introduction to HBase. HBase is a NoSQL databases which experienced a tremendous increase in popularity during the last years. Large companies like Facebook, LinkedIn, Foursquare are using HBase. In this presentation we will address questions like: what is HBase?, and compared to relational databases?, what is the architecture?, how does HBase work?, what about the schema design?, what about the IT ressources?. Questions that should help you consider whether this solution might be suitable in your case.

hadoophbasebigdata
Hbase 20141003
Hbase 20141003Hbase 20141003
Hbase 20141003

This document provides an overview of HBase, including: - HBase is a distributed, scalable, big data store modeled after Google's BigTable. It provides a fault-tolerant way to store large amounts of sparse data. - HBase is used by large companies to handle scaling and sparse data better than relational databases. It features automatic partitioning, linear scalability, commodity hardware, and fault tolerance. - The document discusses HBase operations, schema design best practices, hardware recommendations, alerting, backups and more. It provides guidance on designing keys, column families and cluster configuration to optimize performance for read and write workloads.

hbasehadoopbig data
(source: http://www.everystockphoto.com)
RegionServer sizing configs
• hbase.hregion.max.filesize (default 10g is good)
• hbase.hregion.memstore.flush.size (default 128mb)
(decrease for read heavy loads)
• hbase.regionserver.maxlogs
(HDFS blocksize * 0.95 * <this> should larger than
0.4*JavaHeap)
RegionServer Hardware
• <= 6T disk space per machine
• Enough heap (~diskspace/200)
• Many cores are good. HBase is CPU intensive.
• Match network and disk throughput
(1ge and 24 disks is not good 125mb/s vs 2.4gb/s)
(10ge and 24 disks is OK, 1ge and 4 or 6 disks is OK)
• But… For reads with filters more disks are still better.
HBase Client Settings

Recommended for you

Elastic HBase on Mesos - HBaseCon 2015
Elastic HBase on Mesos - HBaseCon 2015Elastic HBase on Mesos - HBaseCon 2015
Elastic HBase on Mesos - HBaseCon 2015

Elastic HBase on Mesos aims to improve resource utilization of HBase clusters by running HBase in Docker containers managed by Mesos and Marathon. This allows HBase clusters to dynamically scale based on varying workload demands, increases utilization by running mixed workloads on shared resources, and simplifies operations through standard containerization. Key benefits include easier management, higher efficiency through elastic scaling and resource sharing, and improved cluster tunability.

bigdatamesosdocker
HBase: Where Online Meets Low Latency
HBase: Where Online Meets Low LatencyHBase: Where Online Meets Low Latency
HBase: Where Online Meets Low Latency

This document summarizes a presentation about optimizing for low latency in HBase. It discusses how to measure latency, the write and read paths in HBase, sources of latency like garbage collection and compactions, and techniques for reducing latency like streaming puts, block caching, and timeline consistency. The key points are that single puts can achieve millisecond latency while garbage collection and machine failures can cause pauses of 10s of milliseconds to seconds, and optimizing for the "magical 1%" of requests after the 99th percentile is important to improve average latency.

hbaseconhbase
PGConf.ASIA 2019 Bali - Tune Your LInux Box, Not Just PostgreSQL - Ibrar Ahmed
PGConf.ASIA 2019 Bali - Tune Your LInux Box, Not Just PostgreSQL - Ibrar AhmedPGConf.ASIA 2019 Bali - Tune Your LInux Box, Not Just PostgreSQL - Ibrar Ahmed
PGConf.ASIA 2019 Bali - Tune Your LInux Box, Not Just PostgreSQL - Ibrar Ahmed

This document discusses tuning Linux and PostgreSQL for performance. It recommends: - Tuning Linux kernel parameters like huge pages, swappiness, and overcommit memory. Huge pages can improve TLB performance. - Tuning PostgreSQL parameters like shared_buffers, work_mem, and checkpoint_timeout. Shared_buffers stores the most frequently accessed data. - Other tips include choosing proper hardware, OS, and database based on workload. Tuning queries and applications can also boost performance.

postgresqlconferenceasia
Client/Server RPC chunk size
• No streaming RPC in HBase
• Can only asymptotically approach the
full network bandwidth
• Typical intra datacenter latency: 0.1ms-1ms
• Transmitting 2mb over 1ge: 150ms
• Transmitting 2mb over 10ge: 15ms
2mb chunks between Client and Server are good
But, how Should I do that?
Client Chunk Size Settings
Write:
• hbase.client.write.buffer = 2mb (default write buffer, good)
Read
• Scan.setCaching(<n>) (default 100 rows)
(but… how large are the rows? Must guess!)
• hbase.client.scanner.max.result.size = 2mb (default scan
buffer, 0.98.12+ only)
Client
Consider RPC size * hbase.regionserver.handler.count for
server GC
Need to be able to ride over splits and region moves:
hbase.client.pause = 100
hbase.client.retries.number = 35
hbase.ipc.client.tcpnodelay = true

Recommended for you

Hadoop Architecture_Cluster_Cap_Plan
Hadoop Architecture_Cluster_Cap_PlanHadoop Architecture_Cluster_Cap_Plan
Hadoop Architecture_Cluster_Cap_Plan

Hadoop is an open-source framework that processes large datasets in a distributed manner across commodity hardware. It uses a distributed file system (HDFS) and MapReduce programming model to store and process data. Hadoop is highly scalable, fault-tolerant, and reliable. It can handle data volumes and variety including structured, semi-structured and unstructured data.

hadoopinstall
HBaseCon 2015: Elastic HBase on Mesos
HBaseCon 2015: Elastic HBase on MesosHBaseCon 2015: Elastic HBase on Mesos
HBaseCon 2015: Elastic HBase on Mesos

Adobe has packaged HBase in Docker containers and uses Marathon and Mesos to schedule them—allowing us to decouple the RegionServer from the host, express resource requirements declaratively, and open the door for unassisted real-time deployments, elastic (up and down) real-time scalability, and more. In this talk, you'll hear what we've learned and explain why this approach could fundamentally change HBase operations.

clouderabigdatahbase
[B4]deview 2012-hdfs
[B4]deview 2012-hdfs[B4]deview 2012-hdfs
[B4]deview 2012-hdfs

The document discusses how HDFS architecture has evolved to meet new requirements for higher scalability, availability, and improved random read performance. It summarizes the key aspects of HDFS architecture in 2010, including limitations, and improvements made since then, such as read pipeline optimizations, federated namespaces, and high availability name nodes. It also outlines future directions for HDFS architecture.

Replication (trust me)
• hbase.zookeeper.useMulti = true (needs ZK 3.4)
this one is important for correctness
Other defaults are good:
• replication.sleep.before.failover = 30000
• replication.source.maxretriesmultiplier = 300
• replication.source.ratio = 0.10
Linux
• Turn THP (Transparent Huge Pages) OFF
• Set Swappiness to 0
• Set vm.min_free_kbytes to AT LEAST 1GB (8GB on
larger systems, server allocation immediately)
• Set zone_reclaim_mode to 0
(one cache on NUMA)
• dirsync mount option for EXT4, or use XFS
Not Covered
• Security/Kerberos
• HA NameNode/QJM
• ZK/Disk Layout
• Obscure Configs
• Offheap Caching, G1 GC
(source: http://www.morguefile.com)

Recommended for you

HBase Operations and Best Practices
HBase Operations and Best PracticesHBase Operations and Best Practices
HBase Operations and Best Practices

This document provides an overview and best practices for operating HBase clusters. It discusses HBase and Hadoop architecture, how to set up an HBase cluster including Zookeeper and region servers, high availability considerations, scaling the cluster, backup and restore processes, and operational best practices around hardware, disks, OS, automation, load balancing, upgrades, monitoring and alerting. It also includes a case study of a 110 node HBase cluster.

hbase performance tuninghow to setup hbasehbase
Apache HBase Low Latency
Apache HBase Low LatencyApache HBase Low Latency
Apache HBase Low Latency

A deeper look at the HBase read and write paths with a focus on request latency. We look at sources of latency and how to minimize them.

latencyapache hbaseperformance
004 architecture andadvanceduse
004 architecture andadvanceduse004 architecture andadvanceduse
004 architecture andadvanceduse

This document provides an overview of HBase architecture and advanced usage topics. It discusses course credit requirements, HBase architecture components like storage, write path, read path, files, region splits and more. It also covers advanced topics like secondary indexes, search integration, transactions and bloom filters. The document emphasizes that HBase uses log-structured merge trees for efficient data handling and operates at the disk transfer level rather than disk seek level for performance. It also provides details on various classes involved in write-ahead logging.

hbase
TL;DR:
• Enable HDFS Sync on close, Sync behind writes
• Mount EXT4 with dirsync
• Enabled Stale Datanode detection
• Tune HBase read vs. write load
• Set HFile block size for your load
• Get RPC Client/Server chunk size right
Thank You!
http://hadoop-hbase.blogspot.com/

More Related Content

What's hot

Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
Flink Forward
 
HBase Accelerated: In-Memory Flush and Compaction
HBase Accelerated: In-Memory Flush and CompactionHBase Accelerated: In-Memory Flush and Compaction
HBase Accelerated: In-Memory Flush and Compaction
DataWorks Summit/Hadoop Summit
 
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
StreamNative
 
Scaling HBase for Big Data
Scaling HBase for Big DataScaling HBase for Big Data
Scaling HBase for Big Data
Salesforce Engineering
 
Facebook Messages & HBase
Facebook Messages & HBaseFacebook Messages & HBase
Facebook Messages & HBase
强 王
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBase
enissoz
 
HBase Storage Internals
HBase Storage InternalsHBase Storage Internals
HBase Storage Internals
DataWorks Summit
 
Off-heaping the Apache HBase Read Path
Off-heaping the Apache HBase Read Path Off-heaping the Apache HBase Read Path
Off-heaping the Apache HBase Read Path
HBaseCon
 
Apache Hadoop YARN: best practices
Apache Hadoop YARN: best practicesApache Hadoop YARN: best practices
Apache Hadoop YARN: best practices
DataWorks Summit
 
HBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation Buffers
HBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation BuffersHBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation Buffers
HBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation Buffers
Cloudera, Inc.
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
DataWorks Summit
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
Flink Forward
 
Intro to HBase
Intro to HBaseIntro to HBase
Intro to HBase
alexbaranau
 
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
Chester Chen
 
HBase Advanced - Lars George
HBase Advanced - Lars GeorgeHBase Advanced - Lars George
HBase Advanced - Lars George
JAX London
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
DataWorks Summit
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
Nishith Agarwal
 
Getting Started with HBase
Getting Started with HBaseGetting Started with HBase
Getting Started with HBase
Carol McDonald
 
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
mumrah
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
DataWorks Summit
 

What's hot (20)

Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
 
HBase Accelerated: In-Memory Flush and Compaction
HBase Accelerated: In-Memory Flush and CompactionHBase Accelerated: In-Memory Flush and Compaction
HBase Accelerated: In-Memory Flush and Compaction
 
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
 
Scaling HBase for Big Data
Scaling HBase for Big DataScaling HBase for Big Data
Scaling HBase for Big Data
 
Facebook Messages & HBase
Facebook Messages & HBaseFacebook Messages & HBase
Facebook Messages & HBase
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBase
 
HBase Storage Internals
HBase Storage InternalsHBase Storage Internals
HBase Storage Internals
 
Off-heaping the Apache HBase Read Path
Off-heaping the Apache HBase Read Path Off-heaping the Apache HBase Read Path
Off-heaping the Apache HBase Read Path
 
Apache Hadoop YARN: best practices
Apache Hadoop YARN: best practicesApache Hadoop YARN: best practices
Apache Hadoop YARN: best practices
 
HBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation Buffers
HBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation BuffersHBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation Buffers
HBase HUG Presentation: Avoiding Full GCs with MemStore-Local Allocation Buffers
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
 
Intro to HBase
Intro to HBaseIntro to HBase
Intro to HBase
 
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
SF Big Analytics 20190612: Building highly efficient data lakes using Apache ...
 
HBase Advanced - Lars George
HBase Advanced - Lars GeorgeHBase Advanced - Lars George
HBase Advanced - Lars George
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
 
Getting Started with HBase
Getting Started with HBaseGetting Started with HBase
Getting Started with HBase
 
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 

Viewers also liked

HBase at Xiaomi
HBase at XiaomiHBase at Xiaomi
HBase at Xiaomi
HBaseCon
 
Netezza workload management
Netezza workload managementNetezza workload management
Netezza workload management
Biju Nair
 
Using Netezza Query Plan to Improve Performace
Using Netezza Query Plan to Improve PerformaceUsing Netezza Query Plan to Improve Performace
Using Netezza Query Plan to Improve Performace
Biju Nair
 
Concurrency
ConcurrencyConcurrency
Concurrency
Biju Nair
 
Row or Columnar Database
Row or Columnar DatabaseRow or Columnar Database
Row or Columnar Database
Biju Nair
 
HDFS User Reference
HDFS User ReferenceHDFS User Reference
HDFS User Reference
Biju Nair
 
Websphere MQ (MQSeries) fundamentals
Websphere MQ (MQSeries) fundamentalsWebsphere MQ (MQSeries) fundamentals
Websphere MQ (MQSeries) fundamentals
Biju Nair
 
Project Risk Management
Project Risk ManagementProject Risk Management
Project Risk Management
Biju Nair
 
Netezza fundamentals for developers
Netezza fundamentals for developersNetezza fundamentals for developers
Netezza fundamentals for developers
Biju Nair
 
NENUG Apr14 Talk - data modeling for netezza
NENUG Apr14 Talk - data modeling for netezzaNENUG Apr14 Talk - data modeling for netezza
NENUG Apr14 Talk - data modeling for netezza
Biju Nair
 

Viewers also liked (10)

HBase at Xiaomi
HBase at XiaomiHBase at Xiaomi
HBase at Xiaomi
 
Netezza workload management
Netezza workload managementNetezza workload management
Netezza workload management
 
Using Netezza Query Plan to Improve Performace
Using Netezza Query Plan to Improve PerformaceUsing Netezza Query Plan to Improve Performace
Using Netezza Query Plan to Improve Performace
 
Concurrency
ConcurrencyConcurrency
Concurrency
 
Row or Columnar Database
Row or Columnar DatabaseRow or Columnar Database
Row or Columnar Database
 
HDFS User Reference
HDFS User ReferenceHDFS User Reference
HDFS User Reference
 
Websphere MQ (MQSeries) fundamentals
Websphere MQ (MQSeries) fundamentalsWebsphere MQ (MQSeries) fundamentals
Websphere MQ (MQSeries) fundamentals
 
Project Risk Management
Project Risk ManagementProject Risk Management
Project Risk Management
 
Netezza fundamentals for developers
Netezza fundamentals for developersNetezza fundamentals for developers
Netezza fundamentals for developers
 
NENUG Apr14 Talk - data modeling for netezza
NENUG Apr14 Talk - data modeling for netezzaNENUG Apr14 Talk - data modeling for netezza
NENUG Apr14 Talk - data modeling for netezza
 

Similar to Apache HBase Performance Tuning

HBase Low Latency, StrataNYC 2014
HBase Low Latency, StrataNYC 2014HBase Low Latency, StrataNYC 2014
HBase Low Latency, StrataNYC 2014
Nick Dimiduk
 
Hbase: an introduction
Hbase: an introductionHbase: an introduction
Hbase: an introduction
Jean-Baptiste Poullet
 
Hbase 20141003
Hbase 20141003Hbase 20141003
Hbase 20141003
Jean-Baptiste Poullet
 
Elastic HBase on Mesos - HBaseCon 2015
Elastic HBase on Mesos - HBaseCon 2015Elastic HBase on Mesos - HBaseCon 2015
Elastic HBase on Mesos - HBaseCon 2015
Cosmin Lehene
 
HBase: Where Online Meets Low Latency
HBase: Where Online Meets Low LatencyHBase: Where Online Meets Low Latency
HBase: Where Online Meets Low Latency
HBaseCon
 
PGConf.ASIA 2019 Bali - Tune Your LInux Box, Not Just PostgreSQL - Ibrar Ahmed
PGConf.ASIA 2019 Bali - Tune Your LInux Box, Not Just PostgreSQL - Ibrar AhmedPGConf.ASIA 2019 Bali - Tune Your LInux Box, Not Just PostgreSQL - Ibrar Ahmed
PGConf.ASIA 2019 Bali - Tune Your LInux Box, Not Just PostgreSQL - Ibrar Ahmed
Equnix Business Solutions
 
Hadoop Architecture_Cluster_Cap_Plan
Hadoop Architecture_Cluster_Cap_PlanHadoop Architecture_Cluster_Cap_Plan
Hadoop Architecture_Cluster_Cap_Plan
Narayana B
 
HBaseCon 2015: Elastic HBase on Mesos
HBaseCon 2015: Elastic HBase on MesosHBaseCon 2015: Elastic HBase on Mesos
HBaseCon 2015: Elastic HBase on Mesos
HBaseCon
 
[B4]deview 2012-hdfs
[B4]deview 2012-hdfs[B4]deview 2012-hdfs
[B4]deview 2012-hdfs
NAVER D2
 
HBase Operations and Best Practices
HBase Operations and Best PracticesHBase Operations and Best Practices
HBase Operations and Best Practices
Venu Anuganti
 
Apache HBase Low Latency
Apache HBase Low LatencyApache HBase Low Latency
Apache HBase Low Latency
Nick Dimiduk
 
004 architecture andadvanceduse
004 architecture andadvanceduse004 architecture andadvanceduse
004 architecture andadvanceduse
Scott Miao
 
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry TrendsBig Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Esther Kundin
 
HBase: Extreme makeover
HBase: Extreme makeoverHBase: Extreme makeover
HBase: Extreme makeover
bigbase
 
hbaseconasia2017: Large scale data near-line loading method and architecture
hbaseconasia2017: Large scale data near-line loading method and architecturehbaseconasia2017: Large scale data near-line loading method and architecture
hbaseconasia2017: Large scale data near-line loading method and architecture
HBaseCon
 
Facebook keynote-nicolas-qcon
Facebook keynote-nicolas-qconFacebook keynote-nicolas-qcon
Facebook keynote-nicolas-qcon
Yiwei Ma
 
支撑Facebook消息处理的h base存储系统
支撑Facebook消息处理的h base存储系统支撑Facebook消息处理的h base存储系统
支撑Facebook消息处理的h base存储系统
yongboy
 
HBase Sizing Guide
HBase Sizing GuideHBase Sizing Guide
HBase Sizing Guide
larsgeorge
 
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry TrendsBig Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Esther Kundin
 
Large-scale Web Apps @ Pinterest
Large-scale Web Apps @ PinterestLarge-scale Web Apps @ Pinterest
Large-scale Web Apps @ Pinterest
HBaseCon
 

Similar to Apache HBase Performance Tuning (20)

HBase Low Latency, StrataNYC 2014
HBase Low Latency, StrataNYC 2014HBase Low Latency, StrataNYC 2014
HBase Low Latency, StrataNYC 2014
 
Hbase: an introduction
Hbase: an introductionHbase: an introduction
Hbase: an introduction
 
Hbase 20141003
Hbase 20141003Hbase 20141003
Hbase 20141003
 
Elastic HBase on Mesos - HBaseCon 2015
Elastic HBase on Mesos - HBaseCon 2015Elastic HBase on Mesos - HBaseCon 2015
Elastic HBase on Mesos - HBaseCon 2015
 
HBase: Where Online Meets Low Latency
HBase: Where Online Meets Low LatencyHBase: Where Online Meets Low Latency
HBase: Where Online Meets Low Latency
 
PGConf.ASIA 2019 Bali - Tune Your LInux Box, Not Just PostgreSQL - Ibrar Ahmed
PGConf.ASIA 2019 Bali - Tune Your LInux Box, Not Just PostgreSQL - Ibrar AhmedPGConf.ASIA 2019 Bali - Tune Your LInux Box, Not Just PostgreSQL - Ibrar Ahmed
PGConf.ASIA 2019 Bali - Tune Your LInux Box, Not Just PostgreSQL - Ibrar Ahmed
 
Hadoop Architecture_Cluster_Cap_Plan
Hadoop Architecture_Cluster_Cap_PlanHadoop Architecture_Cluster_Cap_Plan
Hadoop Architecture_Cluster_Cap_Plan
 
HBaseCon 2015: Elastic HBase on Mesos
HBaseCon 2015: Elastic HBase on MesosHBaseCon 2015: Elastic HBase on Mesos
HBaseCon 2015: Elastic HBase on Mesos
 
[B4]deview 2012-hdfs
[B4]deview 2012-hdfs[B4]deview 2012-hdfs
[B4]deview 2012-hdfs
 
HBase Operations and Best Practices
HBase Operations and Best PracticesHBase Operations and Best Practices
HBase Operations and Best Practices
 
Apache HBase Low Latency
Apache HBase Low LatencyApache HBase Low Latency
Apache HBase Low Latency
 
004 architecture andadvanceduse
004 architecture andadvanceduse004 architecture andadvanceduse
004 architecture andadvanceduse
 
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry TrendsBig Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends
 
HBase: Extreme makeover
HBase: Extreme makeoverHBase: Extreme makeover
HBase: Extreme makeover
 
hbaseconasia2017: Large scale data near-line loading method and architecture
hbaseconasia2017: Large scale data near-line loading method and architecturehbaseconasia2017: Large scale data near-line loading method and architecture
hbaseconasia2017: Large scale data near-line loading method and architecture
 
Facebook keynote-nicolas-qcon
Facebook keynote-nicolas-qconFacebook keynote-nicolas-qcon
Facebook keynote-nicolas-qcon
 
支撑Facebook消息处理的h base存储系统
支撑Facebook消息处理的h base存储系统支撑Facebook消息处理的h base存储系统
支撑Facebook消息处理的h base存储系统
 
HBase Sizing Guide
HBase Sizing GuideHBase Sizing Guide
HBase Sizing Guide
 
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry TrendsBig Data and Hadoop - History, Technical Deep Dive, and Industry Trends
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends
 
Large-scale Web Apps @ Pinterest
Large-scale Web Apps @ PinterestLarge-scale Web Apps @ Pinterest
Large-scale Web Apps @ Pinterest
 

Recently uploaded

Quantum Communications Q&A with Gemini LLM
Quantum Communications Q&A with Gemini LLMQuantum Communications Q&A with Gemini LLM
Quantum Communications Q&A with Gemini LLM
Vijayananda Mohire
 
K2G - Insurtech Innovation EMEA Award 2024
K2G - Insurtech Innovation EMEA Award 2024K2G - Insurtech Innovation EMEA Award 2024
K2G - Insurtech Innovation EMEA Award 2024
The Digital Insurer
 
Running a Go App in Kubernetes: CPU Impacts
Running a Go App in Kubernetes: CPU ImpactsRunning a Go App in Kubernetes: CPU Impacts
Running a Go App in Kubernetes: CPU Impacts
ScyllaDB
 
GDG Cloud Southlake #34: Neatsun Ziv: Automating Appsec
GDG Cloud Southlake #34: Neatsun Ziv: Automating AppsecGDG Cloud Southlake #34: Neatsun Ziv: Automating Appsec
GDG Cloud Southlake #34: Neatsun Ziv: Automating Appsec
James Anderson
 
BLOCKCHAIN FOR DUMMIES: GUIDEBOOK FOR ALL
BLOCKCHAIN FOR DUMMIES: GUIDEBOOK FOR ALLBLOCKCHAIN FOR DUMMIES: GUIDEBOOK FOR ALL
BLOCKCHAIN FOR DUMMIES: GUIDEBOOK FOR ALL
Liveplex
 
20240702 QFM021 Machine Intelligence Reading List June 2024
20240702 QFM021 Machine Intelligence Reading List June 202420240702 QFM021 Machine Intelligence Reading List June 2024
20240702 QFM021 Machine Intelligence Reading List June 2024
Matthew Sinclair
 
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Erasmo Purificato
 
UiPath Community Day Kraków: Devs4Devs Conference
UiPath Community Day Kraków: Devs4Devs ConferenceUiPath Community Day Kraków: Devs4Devs Conference
UiPath Community Day Kraków: Devs4Devs Conference
UiPathCommunity
 
find out more about the role of autonomous vehicles in facing global challenges
find out more about the role of autonomous vehicles in facing global challengesfind out more about the role of autonomous vehicles in facing global challenges
find out more about the role of autonomous vehicles in facing global challenges
huseindihon
 
Verti - EMEA Insurer Innovation Award 2024
Verti - EMEA Insurer Innovation Award 2024Verti - EMEA Insurer Innovation Award 2024
Verti - EMEA Insurer Innovation Award 2024
The Digital Insurer
 
Recent Advancements in the NIST-JARVIS Infrastructure
Recent Advancements in the NIST-JARVIS InfrastructureRecent Advancements in the NIST-JARVIS Infrastructure
Recent Advancements in the NIST-JARVIS Infrastructure
KAMAL CHOUDHARY
 
How to Avoid Learning the Linux-Kernel Memory Model
How to Avoid Learning the Linux-Kernel Memory ModelHow to Avoid Learning the Linux-Kernel Memory Model
How to Avoid Learning the Linux-Kernel Memory Model
ScyllaDB
 
How RPA Help in the Transportation and Logistics Industry.pptx
How RPA Help in the Transportation and Logistics Industry.pptxHow RPA Help in the Transportation and Logistics Industry.pptx
How RPA Help in the Transportation and Logistics Industry.pptx
SynapseIndia
 
[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf
[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf
[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf
Kief Morris
 
Implementations of Fused Deposition Modeling in real world
Implementations of Fused Deposition Modeling  in real worldImplementations of Fused Deposition Modeling  in real world
Implementations of Fused Deposition Modeling in real world
Emerging Tech
 
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdfWhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
ArgaBisma
 
Calgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptxCalgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptx
ishalveerrandhawa1
 
@Call @Girls Pune 0000000000 Riya Khan Beautiful Girl any Time
@Call @Girls Pune 0000000000 Riya Khan Beautiful Girl any Time@Call @Girls Pune 0000000000 Riya Khan Beautiful Girl any Time
@Call @Girls Pune 0000000000 Riya Khan Beautiful Girl any Time
amitchopra0215
 
Knowledge and Prompt Engineering Part 2 Focus on Prompt Design Approaches
Knowledge and Prompt Engineering Part 2 Focus on Prompt Design ApproachesKnowledge and Prompt Engineering Part 2 Focus on Prompt Design Approaches
Knowledge and Prompt Engineering Part 2 Focus on Prompt Design Approaches
Earley Information Science
 
Coordinate Systems in FME 101 - Webinar Slides
Coordinate Systems in FME 101 - Webinar SlidesCoordinate Systems in FME 101 - Webinar Slides
Coordinate Systems in FME 101 - Webinar Slides
Safe Software
 

Recently uploaded (20)

Quantum Communications Q&A with Gemini LLM
Quantum Communications Q&A with Gemini LLMQuantum Communications Q&A with Gemini LLM
Quantum Communications Q&A with Gemini LLM
 
K2G - Insurtech Innovation EMEA Award 2024
K2G - Insurtech Innovation EMEA Award 2024K2G - Insurtech Innovation EMEA Award 2024
K2G - Insurtech Innovation EMEA Award 2024
 
Running a Go App in Kubernetes: CPU Impacts
Running a Go App in Kubernetes: CPU ImpactsRunning a Go App in Kubernetes: CPU Impacts
Running a Go App in Kubernetes: CPU Impacts
 
GDG Cloud Southlake #34: Neatsun Ziv: Automating Appsec
GDG Cloud Southlake #34: Neatsun Ziv: Automating AppsecGDG Cloud Southlake #34: Neatsun Ziv: Automating Appsec
GDG Cloud Southlake #34: Neatsun Ziv: Automating Appsec
 
BLOCKCHAIN FOR DUMMIES: GUIDEBOOK FOR ALL
BLOCKCHAIN FOR DUMMIES: GUIDEBOOK FOR ALLBLOCKCHAIN FOR DUMMIES: GUIDEBOOK FOR ALL
BLOCKCHAIN FOR DUMMIES: GUIDEBOOK FOR ALL
 
20240702 QFM021 Machine Intelligence Reading List June 2024
20240702 QFM021 Machine Intelligence Reading List June 202420240702 QFM021 Machine Intelligence Reading List June 2024
20240702 QFM021 Machine Intelligence Reading List June 2024
 
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
Paradigm Shifts in User Modeling: A Journey from Historical Foundations to Em...
 
UiPath Community Day Kraków: Devs4Devs Conference
UiPath Community Day Kraków: Devs4Devs ConferenceUiPath Community Day Kraków: Devs4Devs Conference
UiPath Community Day Kraków: Devs4Devs Conference
 
find out more about the role of autonomous vehicles in facing global challenges
find out more about the role of autonomous vehicles in facing global challengesfind out more about the role of autonomous vehicles in facing global challenges
find out more about the role of autonomous vehicles in facing global challenges
 
Verti - EMEA Insurer Innovation Award 2024
Verti - EMEA Insurer Innovation Award 2024Verti - EMEA Insurer Innovation Award 2024
Verti - EMEA Insurer Innovation Award 2024
 
Recent Advancements in the NIST-JARVIS Infrastructure
Recent Advancements in the NIST-JARVIS InfrastructureRecent Advancements in the NIST-JARVIS Infrastructure
Recent Advancements in the NIST-JARVIS Infrastructure
 
How to Avoid Learning the Linux-Kernel Memory Model
How to Avoid Learning the Linux-Kernel Memory ModelHow to Avoid Learning the Linux-Kernel Memory Model
How to Avoid Learning the Linux-Kernel Memory Model
 
How RPA Help in the Transportation and Logistics Industry.pptx
How RPA Help in the Transportation and Logistics Industry.pptxHow RPA Help in the Transportation and Logistics Industry.pptx
How RPA Help in the Transportation and Logistics Industry.pptx
 
[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf
[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf
[Talk] Moving Beyond Spaghetti Infrastructure [AOTB] 2024-07-04.pdf
 
Implementations of Fused Deposition Modeling in real world
Implementations of Fused Deposition Modeling  in real worldImplementations of Fused Deposition Modeling  in real world
Implementations of Fused Deposition Modeling in real world
 
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdfWhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
WhatsApp Image 2024-03-27 at 08.19.52_bfd93109.pdf
 
Calgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptxCalgary MuleSoft Meetup APM and IDP .pptx
Calgary MuleSoft Meetup APM and IDP .pptx
 
@Call @Girls Pune 0000000000 Riya Khan Beautiful Girl any Time
@Call @Girls Pune 0000000000 Riya Khan Beautiful Girl any Time@Call @Girls Pune 0000000000 Riya Khan Beautiful Girl any Time
@Call @Girls Pune 0000000000 Riya Khan Beautiful Girl any Time
 
Knowledge and Prompt Engineering Part 2 Focus on Prompt Design Approaches
Knowledge and Prompt Engineering Part 2 Focus on Prompt Design ApproachesKnowledge and Prompt Engineering Part 2 Focus on Prompt Design Approaches
Knowledge and Prompt Engineering Part 2 Focus on Prompt Design Approaches
 
Coordinate Systems in FME 101 - Webinar Slides
Coordinate Systems in FME 101 - Webinar SlidesCoordinate Systems in FME 101 - Webinar Slides
Coordinate Systems in FME 101 - Webinar Slides
 

Apache HBase Performance Tuning

  • 1. HBase Tuning Performance and Correctness Lars Hofhansl Principal Architect, Salesforce (10 years!) HBase, Phoenix Committer, PMC Apache Incubator PMC Apache Foundation Member http://hadoop-hbase.blogspot.com/
  • 4. Agenda • HDFS • HBase – Server • HBase – Client • Correctness • Performance
  • 6. HDFS - Background • Stores HBase WAL and HFiles • No sync-to-disk by default • Datanode writes tmp file, moves it into place • Old data lost on power outage
  • 7. HDFS Correctness Settings • dfs.datanode.synconclose = true (since Hadoop 1.1) • mount ext4 with dirsync! Or use XFS • You must do this!
  • 8. HDFS Performance Settings 1. Sync behind writes 2. Stale Datanode Detection 3. Short Circuit Reads 4. Miscellaneous Settings
  • 9. HDFS Sync Behind Writes • Syncs partial blocks to disk – best effort (OK, since blocks are immutable) • Necessary with sync-on-close for performance • Always enable this • dfs.datanode.sync.behind.writes = true (Since Hadoop 1.1)
  • 10. Stale Datanodes - Background • Datanodes (DNs) send block reports to the Namenode (NN) • After 10min(!) w/o a report, DN is declared dead • NN will still direct reads and writes to those DNs • Bad for recovery. Down by 1 DN by definition. (every 3rd read/write goes to a bad DN)
  • 11. Stale Datanodes - Detection Don’t use a DN for read or write when it looks like it is stale (default off) • dfs.namenode.avoid.read.stale.datanode = true • dfs.namenode.avoid.write.stale.datanode = true • dfs.namenode.stale.datanode.interval = 30000 (default)
  • 12. HDFS short circuit reads Read local blocks directly without DN, when RegionServers and DNs are co-located. • dfs.client.read.shortcircuit = true • dfs.client.read.shortcircuit.buffer.size = 131072 (important, OOM on direct buffers, default on 0.98+) • hbase.regionserver.checksum.verify = true (default on 0.98+) • dfs.domain.socket.path (local Unix domain socket, not group or world readable)
  • 13. Misc HDFS tips Keep DN running with some failed disks • dfs.datanode.failed.volumes.tolerated = <N> (tolerate losing this many disks) Distribute data across disks at a DN • dfs.datanode.fsdataset.volume.choosing.policy = AvailableSpaceVolumeChoosingPolicy (HDFS-1804 hit drives with more space with higher probability for writes when free space differs by more than 10GB by default)
  • 14. Misc HDFS settings (just trust me on these) • dfs.block.size = 268435456 (note that WAL is rolled at 95% of this) • ipc.server.tcpnodelay = true • ipc.client.tcpnodelay = true
  • 15. Misc HDFS settings (just trust me on these, really) • dfs.datanode.max.xcievers = 8192 • dfs.namenode.handler.count = 64 • dfs.datanode.handler.count = 8 (match number of spindles)
  • 19. Compactions - Background • Writes are buffered in the memstore • Memstore contents flushed to disk as HFiles • Need to limit # HFiles by rewriting small HFiles into fewer larger ones • Remove deleted and expired Cells • Same data written multiple times => Write Amplification!
  • 20. Read vs. Write • Read requires merging HFiles => fewer is better • Write throughput better with fewer compactions => leads to more files • Optimize for Read or Write, not both
  • 22. Control the number of HFiles • hbase.hstore.blockingStoreFiles = 10 (do not allow more flushes when there more than <N> files) small for read, large for write, will stop flushes and writes • hbase.hstore.compactionThreshold = 3 (number of files that starts a compaction) small for read, large for write • hbase.hregion.memstore.flush.size = 128 (max memstore size, default is good) larger good for fewer compaction (watch Region Server heap)
  • 23. Time Based Compactions • HBase does time based major compactions • expensive, always at wrong time • hbase.hregion.majorcompaction = 604800000 (week, default) • hbase.hregion.majorcompaction.jitter = 0.5 (½ week, default)
  • 24. Memstore/Cache Sizing • hbase.hregion.memstore.flush.size = 128 • hbase.hregion.memstore.block.multiplier (allow single memstore to grow by this multiplier, good for heavy, bursty writes) • hbase.regionserver.global.memstore.upperLimit (0.98) hbase.regionserver.global.memstore.size (1.0+) (percent of heap, default 0.4, decrease for read heavy load) • hfile.block.cache.size (percent heap used for the block cache, default 0.4)
  • 25. Autotune BlockCache vs. Memstores (1.0+) HBASE-5349, not well tested, Must Experiment • hbase.regionserver.global.memstore.size.{max|min}.range • hfile.block.cache.size.{max|min}.range • hbase.regionserver.heapmemory.tuner.class • hbase.regionserver.heapmemory.tuner.period
  • 26. Data Locality • Essential for Short Circuit Reads • hbase.hstore.min.locality.to.skip.major.compact (compact even when unnecessary to restore locality) • hbase.master.wait.on.regionservers.timeout (allow master to wait a bit upon restart, so not all region go to the first servers who sign in 30-90s is good. Default it 4.5s) • Don’t use the HDFS balancer!
  • 28. Block Encoding • NONE, FAST_DIFF, PREFIX, etc • alter 'test', { NAME => 'cf', DATA_BLOCK_ENCODING => 'FAST_DIFF' } • Scan friendly, decodes as you scan • Not so Get friendly (might need to decode many previous Cells) • Currently produces a lot of extra garbage • Safe to enable, always
  • 29. Compression • NONE, GZIP, SNAPPY, etc • create ’test', {NAME => ’cf', COMPRESSION => 'SNAPPY’}} • Compresses entire blocks, not Scan or Get friendly • Typically does not achieve much over block encoding • Blocks cached decompressed, unless hbase.block.data.cachecompressed = true (more cache capacity, but every access needs decompressions) • Need to test with your data
  • 30. HFile Block Size • Don’t confuse with HDFS block size! • create ‘test′,{NAME => ‘cf′, BLOCKSIZE => ’4096'} • Default 64k good compromise between Scans and point Gets • Increase for large Scans • Decrease for many point gets • Rarely want to change this, likely never > 1mb
  • 31. RegionServer - Garbage Collection (source: http://www.everystockphoto.com)
  • 32. Weak Generational Hypothesis Most Allocated Objects Die Young
  • 33. Garbage Collection - Background HotSpot manages four generations (CMS collector): • Eden for all new objects • Survivor I and II where surviving objects are promoted when eden is collected • Tenured space. Objects surviving a few rounds (16 by default) of eden/survivor collection are promoted into the tenured space • Perm gen for classes, interned strings, and other more or less permanent objects. (gone, finally, in JDK8)
  • 34. Garbage Collection - HBase • Garbage from operations is shortlived (single RPC) • Memstore is relatively long-lived (allocated in 2mb chunks) • Blockcache is long-lived (allocation in 64k blocks) • Deal with the “operational” garbage efficiently
  • 35. Garbage Collection (CMS) -Xmn512m very small eden space -XX:+UseParNewGC collect eden in parallel -XX:+UseConcMarkSweepGC use the non-moving CMS collector -XX:CMSInitiatingOccupancyFraction=70 start collecting when 70% of tenured gen is full, avoid collection under pressure -XX:+UseCMSInitiatingOccupancyOnly do not try to adjust CMS setting
  • 37. RegionServer Machine Sizing • How much RAM/Heap? • How many disks? • What size of disk? • Network? • Number of cores?
  • 38. RegionServer Disk/Java Heap ratio • Disk/Heap ratio: RegionSize / MemstoreSize * ReplicationFactor * HeapFractionForMemstores * 2 (assuming memstores on average ½ filled) • 10gb/128mb * 3 * 0.4 * 2 = 192, with default settings
  • 39. RegionServer Disk/Java Heap ratio • Each 192 bytes on disk need 1 byte of Heap • With 32gb of heap, can barely fill 6T disk/machine (32gb * 192 = 6tb) 192?! W.T.F.
  • 40. How about 1gb regions? 1gb/128mb * 3 * 0.4 * 2 = 19
  • 42. RegionServer sizing configs • hbase.hregion.max.filesize (default 10g is good) • hbase.hregion.memstore.flush.size (default 128mb) (decrease for read heavy loads) • hbase.regionserver.maxlogs (HDFS blocksize * 0.95 * <this> should larger than 0.4*JavaHeap)
  • 43. RegionServer Hardware • <= 6T disk space per machine • Enough heap (~diskspace/200) • Many cores are good. HBase is CPU intensive. • Match network and disk throughput (1ge and 24 disks is not good 125mb/s vs 2.4gb/s) (10ge and 24 disks is OK, 1ge and 4 or 6 disks is OK) • But… For reads with filters more disks are still better.
  • 45. Client/Server RPC chunk size • No streaming RPC in HBase • Can only asymptotically approach the full network bandwidth • Typical intra datacenter latency: 0.1ms-1ms • Transmitting 2mb over 1ge: 150ms • Transmitting 2mb over 10ge: 15ms
  • 46. 2mb chunks between Client and Server are good But, how Should I do that?
  • 47. Client Chunk Size Settings Write: • hbase.client.write.buffer = 2mb (default write buffer, good) Read • Scan.setCaching(<n>) (default 100 rows) (but… how large are the rows? Must guess!) • hbase.client.scanner.max.result.size = 2mb (default scan buffer, 0.98.12+ only)
  • 48. Client Consider RPC size * hbase.regionserver.handler.count for server GC Need to be able to ride over splits and region moves: hbase.client.pause = 100 hbase.client.retries.number = 35 hbase.ipc.client.tcpnodelay = true
  • 49. Replication (trust me) • hbase.zookeeper.useMulti = true (needs ZK 3.4) this one is important for correctness Other defaults are good: • replication.sleep.before.failover = 30000 • replication.source.maxretriesmultiplier = 300 • replication.source.ratio = 0.10
  • 50. Linux • Turn THP (Transparent Huge Pages) OFF • Set Swappiness to 0 • Set vm.min_free_kbytes to AT LEAST 1GB (8GB on larger systems, server allocation immediately) • Set zone_reclaim_mode to 0 (one cache on NUMA) • dirsync mount option for EXT4, or use XFS
  • 51. Not Covered • Security/Kerberos • HA NameNode/QJM • ZK/Disk Layout • Obscure Configs • Offheap Caching, G1 GC
  • 53. TL;DR: • Enable HDFS Sync on close, Sync behind writes • Mount EXT4 with dirsync • Enabled Stale Datanode detection • Tune HBase read vs. write load • Set HFile block size for your load • Get RPC Client/Server chunk size right