Apache HBase is a column family-oriented, distributed NoSQL database built on Hadoop Distributed File System (HDFS). It offers real-time read/write access to data and features high availability, high performance, and scalability. HBase originated as a subproject of Apache Hadoop, is compatible with the Hadoop ecosystem, and offers Bigtable-like capabilities on top of Hadoop. It is suitable for storing semi-structured or unstructured sparse data. HBase uses HDFS as its file storage system, MapReduce for processing large amounts of data, and ZooKeeper for distributed coordination.
HBase adopts the ZooKeeper framework. ZooKeeper maintains the relationship between the master and each region server as well as the status of the master and region servers. To access an HBase cluster, a client first communicates with ZooKeeper to obtain the addresses of the master and region servers. Then, the client sends requests to the master to create, delete, or modify tables and to the corresponding region servers to read and write data. ZooKeeper ensures that one and only one master is available at any time in an HBase cluster. In addition, ZooKeeper stores relationships between regions and region servers to provide the addressing service. ZooKeeper also stores the cluster metadata, including schema and table definitions, monitors the status of region servers, and notifies the master in real time when region servers go online or offline. The master manages region servers. For example, it monitors the status of each region server, assigns regions to region servers, and balances loads among region servers. Region servers provide read/write access to data for clients. A region server hosts multiple regions. Regions are the basic unit of data storage. HBase splits a large table into multiple regions and stores them in HDFS.
HBase is weakly typed and stores all keys and values as byte arrays. You must convert the byte arrays into required types such as strings, numbers, or complex objects.
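The conversion between typed values and byte arrays can be sketched as follows. This is a minimal illustration in Python, not HBase client code: the helper names are hypothetical, and the 8-byte big-endian layout for integers is an assumption chosen because it keeps non-negative numbers in the same order as their byte representation.

```python
import struct

def encode_long(value: int) -> bytes:
    """Encode a signed 64-bit integer as 8 big-endian bytes (assumed layout)."""
    return struct.pack(">q", value)

def decode_long(raw: bytes) -> int:
    """Decode 8 big-endian bytes back into a signed 64-bit integer."""
    return struct.unpack(">q", raw)[0]

def encode_string(value: str) -> bytes:
    """Encode a string as UTF-8 bytes."""
    return value.encode("utf-8")

def decode_string(raw: bytes) -> str:
    """Decode UTF-8 bytes back into a string."""
    return raw.decode("utf-8")

# The store itself only ever sees bytes; the application owns the types.
row_key = encode_string("user#1001")
cell_value = encode_long(42)
```

Because the store keeps no type information, the reader must know which decoder to apply to each column; decoding `cell_value` with `decode_string` would not fail loudly but would return meaningless text.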
Data Storage Model
HBase uses column families to store data. A column family is a collection of ordered columns. All columns in a column family share the same prefix. When you create a table, you do not need to specify column names, but must specify column family names.
Column families help improve the data processing and query efficiency by grouping related columns together, and allow you to implement access control by column family.
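The column family model described above can be pictured with a small in-memory sketch. The class and method names are hypothetical and exist only for illustration; the point is that families are fixed when the table is created, while column names (qualifiers) appear on write and carry the family name as a prefix.

```python
from collections import defaultdict

class SketchTable:
    """Toy model of a table whose columns are grouped into column families."""

    def __init__(self, name, families):
        self.name = name
        self.families = set(families)      # fixed at table creation
        self.rows = defaultdict(dict)      # row_key -> {"family:qualifier": value}

    def put(self, row_key, family, qualifier, value):
        # Column names are free-form, but the family must already exist.
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][f"{family}:{qualifier}"] = value

    def get_family(self, row_key, family):
        # Return every column of one family for the given row.
        prefix = family + ":"
        columns = self.rows.get(row_key, {})
        return {c: v for c, v in columns.items() if c.startswith(prefix)}

table = SketchTable("users", ["info", "stats"])
table.put(b"u1", "info", "name", b"Ada")
table.put(b"u1", "info", "email", b"ada@example.com")
table.put(b"u1", "stats", "logins", b"7")
```

Calling `table.get_family(b"u1", "info")` returns only the `info:` columns, which mirrors how grouping related columns keeps a read from touching unrelated data.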
However, column families have disadvantages. When a client reads data from HBase, the client reads all data in the column family to which the specified row key belongs. If a table contains only a few column families, many network and I/O resources are consumed each time a client reads data from the table. If a table contains many column families, the network and I/O resources consumed per data cell decrease, but more memory is consumed because each column family is allocated its own MemStore. In addition, many column families can cause data in MemStores to be flushed to many small files that are difficult to merge.
If the column families contain a large amount of data, a read operation may trigger a full garbage collection (Full GC) and fail with a timeout. In addition, all the handlers of the region server may be occupied, which degrades the performance of the region server. In extreme cases, the region server may become unavailable.
HBase can be deployed in a distributed cluster. HBase splits a table into one or more regions and assigns the regions to region servers for management. A region is a group of contiguous rows in the table. Each row is uniquely identified by a row key. A region can thus be understood as a contiguous range of row keys. HBase supports two split policies: pre-splitting and automatic splitting. Each of the two split policies has its own splitting algorithms and methods.
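Because each region covers a contiguous range of row keys, finding the region for a row key amounts to a search over sorted region start keys. The following sketch assumes a hypothetical table with four regions and made-up server names; it illustrates the idea and is not how the HBase client is implemented.

```python
import bisect

# Sorted start keys of four hypothetical regions. The first region starts at
# the empty key, so every possible row key falls into exactly one region.
region_start_keys = [b"", b"g", b"n", b"t"]
region_servers = ["rs1", "rs2", "rs1", "rs3"]   # assumed assignment

def locate_region(row_key: bytes) -> int:
    """Return the index of the region whose key range contains row_key."""
    # bisect_right finds the first start key greater than row_key;
    # the region just before that position owns the key.
    return bisect.bisect_right(region_start_keys, row_key) - 1

def locate_server(row_key: bytes) -> str:
    """Return the region server that hosts the region for row_key."""
    return region_servers[locate_region(row_key)]
```

In this sketch a start key is inclusive, so `locate_region(b"g")` falls into the second region rather than the first.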
When you create a table, HBase uses one of the following algorithms to pre-split the table into multiple regions based on the estimated distribution of row keys:
- HexStringSplit: suitable when row keys use hexadecimal strings as prefixes.
- DecimalStringSplit: suitable when row keys use decimal numbers as prefixes.
- UniformSplit: suitable when row keys use random prefixes.
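To make pre-splitting concrete, the sketch below computes evenly spaced split points for the hexadecimal-prefix case. It is an illustration of the idea only, not the HBase implementation: the function name and the assumption of an 8-digit hexadecimal key space are hypothetical.

```python
from typing import List

def hex_split_points(num_regions: int, width: int = 8) -> List[str]:
    """Return num_regions - 1 evenly spaced hex boundaries.

    Assumes row keys start with a fixed-width hexadecimal prefix that is
    uniformly distributed over the whole key space.
    """
    if num_regions < 2:
        return []                      # a single region needs no split point
    key_space = 16 ** width            # number of distinct prefixes
    return [
        format(i * key_space // num_regions, f"0{width}x")
        for i in range(1, num_regions)
    ]

print(hex_split_points(4))  # ['40000000', '80000000', 'c0000000']
```

Even spacing balances load only if the prefixes really are uniform; skewed keys would still concentrate writes in a few regions.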
Auto split policies include:
- ConstantSizeRegionSplit: In versions earlier than HBase 0.94.0, HBase uses this split policy as the default split policy. If the size of any store in a region reaches the value of hbase.hregion.max.filesize, HBase splits the region.
- IncreasingToUpperBoundRegionSplit: In HBase 0.94.0 to 2.0.0, HBase uses this split policy as the default split policy. HBase determines whether to split a region based on the size of stores in the region and the total number of regions in the table. HBase splits a region when the size of a store in the region reaches the value of sizeToCheck.
- SteppingSplit: By default, a table with less than 256 MB of data contains only one region. If the size of the table reaches 256 MB, HBase splits the table into two regions. When the data size of a region reaches the specified value, for example, 10 GB, HBase further splits the region.
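The three automatic split policies differ mainly in how they choose the size threshold that triggers a split. The sketch below compares them under stated assumptions: the function names are hypothetical, the 256 MB initial size and 10 GB maximum come from the example above, and the cubic growth rule for IncreasingToUpperBoundRegionSplit is an assumption about how sizeToCheck grows with the region count, not something this article specifies.

```python
MB = 1024 ** 2
GB = 1024 ** 3

def constant_size_threshold(max_file_size: int) -> int:
    """ConstantSizeRegionSplit: always compare against the configured maximum."""
    return max_file_size

def increasing_threshold(region_count: int, max_file_size: int,
                         initial_size: int = 256 * MB) -> int:
    """IncreasingToUpperBoundRegionSplit (assumed growth rule).

    Assumption: the threshold grows with the cube of the table's region
    count and is capped at max_file_size.
    """
    if region_count == 0 or region_count > 100:
        return max_file_size
    return min(max_file_size, initial_size * region_count ** 3)

def stepping_threshold(region_count: int, max_file_size: int,
                       initial_size: int = 256 * MB) -> int:
    """SteppingSplit: small threshold for the first region, maximum afterwards."""
    return initial_size if region_count == 1 else max_file_size

def should_split(store_size: int, threshold: int) -> bool:
    """A region is split once a store reaches the threshold."""
    return store_size >= threshold
```

With a 10 GB maximum, `stepping_threshold(1, 10 * GB)` is 256 MB and `stepping_threshold(2, 10 * GB)` is 10 GB, which matches the two-step behavior described in the list.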
HBase supports primary and secondary indexes. It uses row keys as primary indexes and supports secondary indexing schemes such as Apache Phoenix, Lily HBase Indexer, and CDH Search. Apache Phoenix supports covering indexes, global indexes, and local indexes, whereas CDH Search supports batch indexes and real-time indexes.
Different compression algorithms result in different compression ratios. The compression ratios of Gzip, Lempel-Ziv-Oberhumer (LZO), and Snappy are 7.4, 4.88, and 4.5 respectively. Gzip yields the highest compression ratio, but it is relatively slow. LZO and Snappy have lower compression ratios but are fast.
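As a worked example of what these ratios mean for storage, the sketch below estimates on-disk size for each codec. It assumes that a compression ratio is the uncompressed size divided by the compressed size; the function name is hypothetical and real ratios vary with the data.

```python
# Ratios quoted in the text above (uncompressed size / compressed size, assumed).
COMPRESSION_RATIOS = {"Gzip": 7.4, "LZO": 4.88, "Snappy": 4.5}

def compressed_size_gb(raw_size_gb: float, codec: str) -> float:
    """Estimate the stored size of raw_size_gb of data under the given codec."""
    return raw_size_gb / COMPRESSION_RATIOS[codec]

for codec in COMPRESSION_RATIOS:
    # Show how much space 100 GB of raw data would occupy with each codec.
    print(f"{codec}: {compressed_size_gb(100, codec):.1f} GB")
```

Under these figures, 100 GB of raw data shrinks to roughly 13.5 GB with Gzip and roughly 22.2 GB with Snappy, which is the space cost of choosing the faster codec.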
High Reliability and Availability
HBase uses write-ahead logs (hlogs) to ensure the reliability of data writes. Each region server has hlogs that track data changes. For each write operation, HBase first records the operation in an hlog and then writes the data to a MemStore, which is later flushed to an HFile. When a failure occurs, you can use hlogs to restore data. In addition, HBase stores three replicas of each HFile, which ensures high availability of data.
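The log-first write order and the recovery it enables can be sketched with a toy region server. Everything here is a simplified in-memory stand-in with hypothetical names: a Python list plays the hlog, a dict plays the MemStore, and flushed dicts play HFiles. Replication and real file I/O are left out.

```python
class SketchRegionServer:
    """Toy write path: log first, then memory, then flush to immutable files."""

    def __init__(self, flush_threshold=3):
        self.wal = []            # append-only log, stands in for the hlog
        self.memstore = {}       # in-memory buffer of recent writes
        self.hfiles = []         # flushed, immutable snapshots
        self.flush_threshold = flush_threshold

    def put(self, row_key, value):
        self.wal.append((row_key, value))    # 1. record the operation in the log
        self.memstore[row_key] = value       # 2. apply it to the in-memory store
        if len(self.memstore) >= self.flush_threshold:
            self.flush()                     # 3. persist when the buffer is full

    def flush(self):
        # Write the buffer out sorted by row key, then drop the log entries
        # that the flushed file now covers.
        self.hfiles.append(dict(sorted(self.memstore.items())))
        self.memstore.clear()
        self.wal.clear()

    def crash(self):
        # Simulate losing everything that lived only in memory.
        self.memstore = {}

    def recover(self):
        # Replay the log in order to rebuild the in-memory state.
        for row_key, value in self.wal:
            self.memstore[row_key] = value
```

Writes that were acknowledged but not yet flushed survive a crash in this model only because they were appended to the log before being applied in memory.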
In consideration of the value of data and maintenance cost, HBase can automatically manage the time to live (TTL) of data based on your configurations. You can specify the TTL in seconds for column families. HBase automatically deletes the rows whose storage period reaches the specified threshold.
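The TTL rule can be sketched as a filter over stored entries. The function and field names below are hypothetical, and the sketch only shows the expiry test; it does not model when or how HBase physically removes the expired data.

```python
import time

def live_cells(cells, ttl_seconds, now=None):
    """Return the cells whose age is still below the TTL.

    Each cell is a dict with a "timestamp" field in seconds. A cell expires
    once its age reaches ttl_seconds.
    """
    now = time.time() if now is None else now
    return [cell for cell in cells if now - cell["timestamp"] < ttl_seconds]

# Example with a fixed clock so the outcome does not depend on wall time.
cells = [
    {"row": "r1", "timestamp": 0},
    {"row": "r2", "timestamp": 50},
    {"row": "r3", "timestamp": 100},
]
remaining = live_cells(cells, ttl_seconds=60, now=100)
```

With a TTL of 60 seconds and the clock at 100, only `r2` and `r3` remain, because `r1` is 100 seconds old.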
Parallel Computing on a Single Node
HBase achieves parallel computing by splitting a table into multiple regions and using different CPU cores to compute data in the regions at the same time.
HBase supports linearly scalable distributed computing.
Multi-table Association Analysis
HBase does not support join operations. However, HBase supports large tables and can achieve join-like effects.
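One common way to read "join-like effects" is denormalization: related records are combined at write time into one wide row, so a single read returns what a join would have produced. The sketch below assumes that reading; the data, the row key format, and the function name are hypothetical.

```python
users = {
    "u1": {"name": "Ada"},
    "u2": {"name": "Lin"},
}
orders = [
    {"order_id": "o1", "user_id": "u1", "amount": 30},
    {"order_id": "o2", "user_id": "u2", "amount": 45},
    {"order_id": "o3", "user_id": "u1", "amount": 12},
]

def build_wide_rows(users, orders):
    """Pre-join orders with user attributes into one wide row per order."""
    wide = {}
    for order in orders:
        user = users.get(order["user_id"], {})
        # Prefixing the row key with the user ID keeps a user's orders adjacent.
        row_key = f'{order["user_id"]}#{order["order_id"]}'
        wide[row_key] = {
            "order:amount": order["amount"],
            "user:name": user.get("name"),
        }
    return wide

wide_table = build_wide_rows(users, orders)
```

The trade-off is that user attributes are copied into every order row, so an update to a user must be propagated to all of that user's rows by the application.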
HBase uses the AccessController coprocessor to control users’ read, write, execute, create, and administrator permissions on data.
The preceding analysis of Apache HBase storage and computing shows that Apache HBase is a competitor to Alibaba Cloud Hologres. The following table provides the competitive matrix between Apache HBase and Alibaba Cloud Hologres.
|Strategic positioning||Product positioning||An open-source, column family-oriented distributed database.||A data warehouse that can process petabytes of data in real time.|
|Features||System architecture||HBase adopts an architecture in which storage and computing are coupled. It relies on underlying HDFS for data storage. HDFS clusters can only be scaled out manually.||Hologres adopts an architecture in which storage and computing are separated. Hologres adopts the massively parallel processing (MPP) architecture. It combines the traditional shared storage architecture for database storage with the shared-nothing architecture for database computing. In the MPP architecture, storage and computing are separated, which improves the parallel computing capability.||Hologres
A new trend in big data scenarios is to decouple storage from computing. Hologres allows you to scale out computing and storage resources separately. You can customize computing resources and storage space based on your business needs. Therefore, Hologres is more flexible and cost-effective.
HBase splits tables into regions and further splits regions based on the region size. It stores the regions of a table on different nodes of a cluster. When you add new nodes to the cluster, HBase adjusts the load of each existing node and distributes loads to the new nodes. In this way, cluster resources are scaled out in a dynamic way without interrupting services.
|Storage modes||HBase only supports row store. It stores data as quadruples of row key, column, value, and timestamp.||Hologres supports both row store and column store.||Hologres
Hologres supports the row store mode of traditional databases, and the column store mode of big data databases. The Hologres team is working on support for the external storage feature. This feature allows you to store data in HDFS. Hologres supports multiple storage modes. You can select a storage mode based on the data access frequency and access methods.
|Table schema||HBase uses weak schemas. It supports complex data structures in tables.||Hologres uses strong schemas and supports a variety of data types.||Hologres
Strong schemas make development more efficient. If data quality is less desirable and data interfaces are not defined, it is easier to diagnose issues in development based on schemas.
|Compression||The compression ratio is low and much redundant information exists.||The compression ratio is high, particularly in the case of column store.|
|Global sorting||HBase supports global sorting.||Hologres supports partial sorting.||Hologres supports configurable clustered indexes.|
|Sharding||HBase supports pre-splitting and auto-splitting.||When Hologres is deployed in a cluster, it supports two sharding methods: hash-based sharding and random sharding.|
|Batch import||HBase supports batch import with BulkLoads.||Hologres supports batch import with BulkLoads.|
|Real-time data writes||Supported. You can query data immediately after the data is written. The compaction performance limits the write transactions per second (TPS).||Supported. You can query data immediately after the data is written. Hologres supports high write TPS.||Hologres
Hologres features higher queries per second (QPS) and TPS. HBase is prone to write hotspotting, which can make region servers unstable or even unresponsive.
|SQL support||HBase uses Phoenix to provide limited support for SQL operations, and does not support joins.
The key-value storage model results in poor SQL performance.
|Hologres is highly compatible with the protocols, syntax, and ecosystem of PostgreSQL, and provides powerful functionality and excellent performance.||Hologres
Hologres supports standard SQL syntax as well as PostgreSQL 11 syntax and functions. You can choose HBase or Hologres based on the scenario and required ecosystem.
|Storage capability||HBase stores data in HDFS. You must manage the HBase cluster yourself. The cluster automatically creates and keeps multiple replicas of data. The storage capacity depends on the cluster size, which can be linearly expanded. HBase supports log-structured merge-trees (LSM trees) and multiple compression algorithms.||Hologres stores data in Apsara Distributed File System or HDFS. The storage capacity depends on the Hologres cluster size, which can be linearly expanded. A single table can store more than 3 PB of data. Hologres supports diverse storage modes and compression algorithms, thereby improving storage capabilities.|
|Advanced features||Query and analysis capabilities||HBase only supports the Get and Scan operations by default. It supports high QPS for Get operations but provides poor Scan performance.
HBase uses a coprocessor to support Phoenix SQL. The performance of SQL operations is poor. In addition, HBase does not support complex computing with SQL statements.
|Hologres allows you to query tens of billions of data records in real time and receive analysis results with sub-second latency. It also allows you to query hundreds or even thousands of billions of data records in real time and receive analysis results within seconds. Hologres provides powerful table join capabilities and supports high QPS for Get operations.||Hologres
Hologres allows you to query and analyze large amounts of data in real time, and provides powerful capabilities of joining tables.
|QPS of complex online analytical processing (OLAP)||HBase does not support OLAP.||Hologres supports complex OLAP at a high QPS.||Hologres|
|QPS of online data services||Get operations: 50,000+ QPS||Get operations: 50,000+ QPS||Hologres uses an asynchronous architecture and supports high QPS and TPS.|
|Federated computing||HBase does not support federated computing.||Hologres supports federated computing of real-time and batch data.||Hologres
Hologres allows you to create foreign tables for data in heterogeneous data sources without affecting data privacy and security. This allows you to implement federated queries and computing.
|Reliability||Atomicity, consistency, isolation, and durability (ACID) of transactions||HBase partially ensures ACID of transactions at the row level.||Hologres partially ensures ACID of transactions at the row level.|
|Transaction isolation||HBase supports concurrent writes as well as concurrent reads and writes.||Hologres supports concurrent writes as well as concurrent reads and writes based on snapshots.|
|Backup and disaster recovery||Supported. HBase provides three standard replicas for disaster recovery in big data scenarios.||Supported. Hologres provides three standard replicas for disaster recovery in big data scenarios.|
|Ease of use||Query||Query language||HBase supports the Java API for queries and requires other frameworks such as Apache Phoenix to support SQL queries.||Hologres is compatible with PostgreSQL.||Hologres
Hologres supports SQL syntax without the need of any other components or frameworks. It allows you to perform DDL operations on more types of objects. It supports full joins.
|DDL||HBase allows you to perform the following DDL operations on namespaces, tables, and column families: create, alter, drop, describe, and list.||Hologres allows you to perform the following DDL operations on databases, tables, views, schemas, cast functions, extensions, roles, users, user mappings, and groups: create, alter, and drop.|
|DML||HBase allows you to perform the following DML operations: put, get, scan, delete, and truncate.||Hologres allows you to perform the following DML operations: select, insert, update, and delete.|
|DCL||HBase allows you to perform the following DCL operations: grant, revoke, and rollback.||Hologres allows you to perform the following DCL operations: grant, revoke, and rollback.|
|Development tool||HBase supports open-source tools.||Hologres supports terminal tools, open-source tools, and Alibaba Cloud tools.|
|Scalability||HBase adopts an architecture in which storage and computing are coupled. Storage and computing resources must be scaled out at the same time.||Hologres allows you to scale out storage and computing resources separately. This helps improve parallel processing capabilities.||Hologres
Hologres adopts an architecture in which storage and computing are separated. This type of architecture is suitable for real-time big data processing for the following reasons:
1. To remove storage or computing bottlenecks, you can scale out storage or computing resources separately without the need to scale out both types of resources. This prevents the waste of resources.
2. You do not need to change the allocation plans of storage or computing resources frequently.
3. You do not need to migrate large amounts of data during a resource scale-out.
|Operations and maintenance (O&M)||If you choose HBase, you must perform O&M operations on HBase by yourself.||Hologres provides fully managed services.
The system can automatically detect the topology changes of each cluster. You do not need to concern yourself with the topology changes.
Hologres provides fully managed services. This frees you from O&M.
|Ecosystem||HBase is compatible with the Hadoop ecosystem.||Hologres is highly compatible with the PostgreSQL ecosystem.|
|Scenario||HBase is a write intensive database that can be used to store large amounts of unstructured data. It provides excellent performance for point queries.||Hologres is a real-time data warehouse that can replace HBase. It eliminates data silos and allows you to query and analyze large amounts of data in real time. It supports elastic scale-out of cluster resources and provides full SQL support.|
|Development||Application development is complex. You must design business metrics, dimensions, tables, and aggregations as key-value pairs, and implement key processing operations to allow users to query and filter data at the application layer. The system efficiency depends heavily on how well keys are designed. In various complex scenarios such as data writes, data analysis, and data queries, the application layer depends on the basic interfaces for processing key-value pairs.||Hologres allows you to perform table-oriented application development, which is simple. The applications allow users to use standard SQL statements for complex multi-dimensional analysis, nested queries, and associated queries. Hologres provides Java Database Connectivity (JDBC) and Open Database Connectivity (ODBC) interfaces for you to perform theme-oriented modeling and development.||Hologres
Compared with the metric-oriented and wide table-oriented development supported by HBase, the theme-oriented modeling supported by Hologres has many benefits. It reduces the loss of information among heterogeneous data collection, processing, and analysis systems. It requires fewer intermediate steps in data processing and makes data usage more flexible.
Hologres provides the following advantages over HBase:
1. Hologres adopts an architecture in which storage and computing are separated. Hologres allows you to scale out storage and computing resources separately, and configure storage and computing resources based on your business needs. This helps you reduce the cost of resources.
2. Hologres supports multiple storage modes. This helps you save storage space and improve storage usage.
3. Hologres is highly compatible with PostgreSQL. Hologres supports the standard PostgreSQL syntax and provides an advanced community ecosystem.
4. Hologres supports federated queries of data in heterogeneous data sources.
5. Hologres allows you to query and analyze large amounts of data in real time. It provides excellent performance for complex queries and point queries with high QPS and TPS.
6. Hologres supports association analysis and is suitable for OLAP analysis.
7. Hologres is a real-time cloud-based data warehouse. It provides fully managed services and frees you from O&M.
8. Hologres uses strong schemas and supports table-oriented development. It requires fewer intermediate steps in data processing and improves the efficiency of data warehouse development.