HBase Performance Tuning

Wu Jiaojiao & Li Yuansan
3 min read · Mar 3, 2021

HBase is a distributed column-oriented database built on top of the Hadoop file system. It is an open-source project and is horizontally scalable.

It is used for write-heavy applications and whenever fast random access to available data is needed. Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.

So if you notice your HBase is not performing as expected, here are some points you can check:

Poor schema design

Each table has just one index: the row key. The key is used to locate where your data is stored, so avoid sequential keys, which prevent HBase from distributing data evenly across multiple servers and therefore lower performance. Row keys should be less than 4 KB each.

Good way to define the row keys:

  • Reverse domain names
  • String identifiers
  • Timestamps as suffix in key

Bad way to define row keys:

  • Domain names
  • Sequential numeric values
  • Timestamps alone
  • Timestamps as prefix of row key
  • Mutable or repeatedly updated values

Related entities should be kept in adjacent rows where possible. This reduces the network I/O spent hopping between multiple servers.
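As a minimal sketch of the key-design advice above, the helpers below are illustrative (the function names and key layout are assumptions, not an HBase API): one reverses a domain name so sibling hosts sort together, one appends a timestamp as a suffix rather than a prefix, and one salts a sequential ID so monotonically increasing values don't hotspot a single region.

```python
def reverse_domain(domain: str) -> str:
    # "www.example.com" -> "com.example.www": keys for the same
    # organization become adjacent, aiding scans and locality.
    return ".".join(reversed(domain.split(".")))

def make_row_key(domain: str, event_ts: int) -> str:
    # Timestamp goes at the END of the key (suffix), so concurrent
    # writes spread across regions instead of piling onto the newest one.
    return f"{reverse_domain(domain)}#{event_ts}"

def salted_key(seq_id: int, buckets: int = 16) -> str:
    # Prefix a small salt bucket derived from the ID so sequential
    # numeric values are scattered across `buckets` key ranges.
    salt = seq_id % buckets
    return f"{salt:02d}#{seq_id:012d}"
```

The trade-off with salting is that point reads must compute the same salt, and full scans must fan out across all buckets, so pick the bucket count to match your region count.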

Also check whether you have exceeded any of HBase's documented sizing limits, such as row key size, cell size, or the number of column families per table.

Inappropriate workload

Did you choose the right database? If the data volume is small, say less than 300 GB, the benefits of HBase are not fully realized. Or is the load a one-time short burst?

Tuning the Cluster

  • Increase the request handler thread count
    If you need to handle a high volume of requests, increase the number of handler threads spawned by the RegionServers.
  • Configure the size and number of WAL files
    HBase uses the Write Ahead Log, or WAL, to recover MemStore data not yet flushed to disk if a RegionServer crashes. Administrators should configure these WAL files to be slightly smaller than the HDFS block size.
  • Configure compactions
    If you are an administrator and expect the HBase clusters to host large amounts of data, consider the effect that compactions have on write throughput. For write-intensive data request patterns, you should consider less frequent compactions and more StoreFiles per region.
  • Considerations for splitting tables
    You can split tables during table creation based on the target number of regions per RegionServer to avoid costly dynamic splitting as the table starts to fill.
  • Tune JVM garbage collection in RegionServers
    You can tune garbage collection in HBase RegionServers for stability, because a RegionServer cannot utilize a very large heap due to the cost of garbage collection. Administrators should specify no more than 24 GB for one RegionServer.
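As a rough sketch, the tuning knobs above map to settings like the following in hbase-site.xml. The property names come from the standard HBase configuration; the values shown are illustrative starting points under an assumed write-heavy workload, not recommendations for every cluster.

```xml
<!-- hbase-site.xml (illustrative values; tune for your own workload) -->
<configuration>
  <!-- Request handler thread count per RegionServer -->
  <property>
    <name>hbase.regionserver.handler.count</name>
    <value>60</value>
  </property>
  <!-- Roll the WAL slightly before it reaches the HDFS block size -->
  <property>
    <name>hbase.regionserver.logroll.multiplier</name>
    <value>0.95</value>
  </property>
  <!-- Write-intensive pattern: tolerate more StoreFiles per region
       before minor compactions and write-blocking kick in -->
  <property>
    <name>hbase.hstore.compactionThreshold</name>
    <value>5</value>
  </property>
  <property>
    <name>hbase.hstore.blockingStoreFiles</name>
    <value>20</value>
  </property>
</configuration>
```

RegionServer heap and GC flags are set separately via `HBASE_REGIONSERVER_OPTS` in hbase-env.sh, and pre-splitting can be done at creation time in the HBase shell, e.g. `create 'events', 'cf', SPLITS => ['10', '20', '30']` (split points shown are placeholders).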

Cluster just started

If all of the above doesn't help, check whether the problem is simply that the cluster has just started: it takes a little while to bootstrap everything. In that case, let it run longer and measure the performance again.

HDD used instead of SSD

Last but not least, double-check your storage option to make sure you have not mistakenly chosen HDD instead of SSD.

Thanks for reading. I hope this gives you some ideas on how to tune your HBase performance. Feedback and suggestions are welcome.

Reference:

https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.0.0/hbase-data-access/content/method_for_configuring_cluster_for_the_first_time.html
