According to the 2019 Gartner report, Greenplum is the only open-source product that ranked in the top 10 in both the typical data analytics and real-time analytics fields. Greenplum provides a powerful big data engine that supports features such as real-time processing, elastic scale-out, hybrid loads, cloud-native capabilities, and integrated data analysis. Based on the massively parallel processing (MPP) architecture and the online analytical processing (OLAP) model, Greenplum supports elastic and linear scale-out, and adopts parallel storage, parallel communication, parallel computing, and optimization technologies. Greenplum is compatible with the PostgreSQL ecosystem and provides powerful, efficient, and secure data storage, processing, and real-time analytics capabilities. It can store, process, and analyze petabytes of structured, semi-structured, and unstructured data. Greenplum imposes no restrictions on hardware environments or platforms, which offers deployment flexibility: it can be deployed on enterprise hosts, containers, private clouds, and public clouds.
Greenplum consists of three parts: the master, segments, and the interconnect. The master is the entry point to the Greenplum Database system. It accepts client connections and SQL queries, distributes workloads to segment instances, and coordinates the results that each segment instance returns. The master does not store user data; it serves as the client access entry and stores the metadata and data dictionary that describe how tables are distributed across segment instances. Segment instances are independent PostgreSQL databases that store and process data. The interconnect is the networking layer that handles communication among segment instances, typically over standard Ethernet.
Greenplum supports basic data types and a wide array of complex data types, including semi-structured data types. Complex data types include array, JSON, XML, and hstore.
Data storage models
Greenplum supports polymorphic data storage. Three storage models are available: row store, column store, and external storage. Row store is suitable for online transaction processing (OLTP) workloads, and column store is suitable for OLAP workloads. With external storage, an external system such as the Hadoop Distributed File System (HDFS) stores the data, and Greenplum stores only the metadata.
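As an illustration, the table definitions below sketch the three storage models in Greenplum DDL. The table and column names are hypothetical, and the external-table example assumes a configured PXF (or similar) connector, so treat the LOCATION URL as a placeholder:

```sql
-- Row-oriented heap table (the default), suited to OLTP-style point reads and updates.
CREATE TABLE accounts_row (id int, balance numeric)
DISTRIBUTED BY (id);

-- Append-optimized, column-oriented table, suited to analytical scans.
CREATE TABLE accounts_col (id int, balance numeric)
WITH (appendonly=true, orientation=column)
DISTRIBUTED BY (id);

-- External table: the data lives in HDFS, and Greenplum stores only the metadata.
-- The pxf:// URL below is illustrative, not an actual path.
CREATE EXTERNAL TABLE accounts_ext (id int, balance numeric)
LOCATION ('pxf://data/accounts.txt?PROFILE=hdfs:text')
FORMAT 'TEXT';
```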
Data distribution methods (sharding)
Greenplum can distribute data to a single node or across the nodes of a cluster. To distribute data across the nodes of a cluster, Greenplum divides the data into shards and distributes the shards across the nodes. Greenplum supports three distribution methods: hash-based sharding, random sharding, and replicated tables.
Hash-based sharding: distributes data across nodes and assigns each row to a specific node based on the hash values of distribution columns.
Random sharding: randomly and evenly distributes data across all nodes in a round-robin manner. Data rows that have the same values may be distributed to different nodes.
Replicated table: distributes data by replicating all the rows in a replicated table to all nodes. Replicated table data is evenly distributed because every node has the same rows.
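The three methods map to the DISTRIBUTED clause of CREATE TABLE. A minimal sketch with hypothetical tables (DISTRIBUTED REPLICATED requires Greenplum 6 or later):

```sql
-- Hash-based sharding: rows with the same customer_id always land on the same segment.
CREATE TABLE orders (order_id bigint, customer_id int, amount numeric)
DISTRIBUTED BY (customer_id);

-- Random sharding: rows are spread in a round-robin manner; no co-location guarantee.
CREATE TABLE event_log (ts timestamp, payload text)
DISTRIBUTED RANDOMLY;

-- Replicated table: every segment stores a full copy of all rows.
CREATE TABLE currency_rates (code text, rate numeric)
DISTRIBUTED REPLICATED;
```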
Replicated tables have two significant benefits:
(1) Replicated tables help you avoid distributed queries. If every node has a replica of a table's data, joins can be executed locally and all data can be queried on a single node, which prevents data movement among the nodes of the cluster. If you use a replicated table to store a small table, for example, a table that contains thousands of rows, query performance improves significantly. The replicated table distribution method is not suitable for tables that contain large amounts of data.
(2) Replicated tables allow you to run user-defined functions (UDFs) on nodes to access tables. In the MPP architecture, data is divided into shards that are distributed across different nodes, so each node contains only partial data. In this case, a UDF that runs on a node cannot access a complete table, and attempting to do so causes data computing errors. Replicated tables avoid this problem because every node stores all rows.
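A minimal sketch of the second benefit, with hypothetical names: the lookup function below can safely run on a segment because the replicated table is complete on every node, whereas the same function reading a hash-distributed table would see only that segment's shard:

```sql
CREATE TABLE country_codes (code text, name text)
DISTRIBUTED REPLICATED;

-- Every segment holds all rows of country_codes, so this function
-- returns correct results wherever it executes.
CREATE FUNCTION lookup_country(c text) RETURNS text AS $$
  SELECT name FROM country_codes WHERE code = c;
$$ LANGUAGE sql;
```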
Data distribution methods (partitioning)
Greenplum supports multi-level partitioned tables and two partitioning types: range partitioning and list partitioning (a separate partition for each value of the partition key).
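The two types can be combined into a multi-level partitioned table. A sketch with hypothetical columns, using the classic Greenplum partitioning syntax:

```sql
-- Range partitioning by month, with a list subpartition per region.
CREATE TABLE sales (id bigint, sale_date date, region text, amount numeric)
DISTRIBUTED BY (id)
PARTITION BY RANGE (sale_date)
  SUBPARTITION BY LIST (region)
  SUBPARTITION TEMPLATE (
    SUBPARTITION north VALUES ('north'),
    SUBPARTITION south VALUES ('south'),
    DEFAULT SUBPARTITION other_regions
  )
( START (date '2024-01-01') INCLUSIVE
  END   (date '2025-01-01') EXCLUSIVE
  EVERY (INTERVAL '1 month') );
```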
Greenplum supports multiple index types, such as B-tree, bitmap, GIN, and GiST indexes. Greenplum does not support hash indexes.
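For illustration, the statements below create each supported index type on hypothetical tables; bitmap indexes are generally recommended only for low-cardinality columns:

```sql
CREATE INDEX idx_orders_id   ON orders USING btree  (order_id);
CREATE INDEX idx_orders_stat ON orders USING bitmap (status);      -- low-cardinality column
CREATE INDEX idx_docs_body   ON docs   USING gin    (body_jsonb);  -- composite values such as jsonb
CREATE INDEX idx_geo_shape   ON geo    USING gist   (shape);       -- geometric or range data
```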
Compression ratios and results vary based on compression algorithms. In Greenplum, row store tables use the common compression algorithms ZLIB and QUICKLZ. Column store tables use the RLE_TYPE, ZLIB, and QUICKLZ compression algorithms. In most cases, QUICKLZ compression uses less CPU capacity and compresses data faster at a lower compression ratio than ZLIB compression. ZLIB compression provides higher compression ratios at a lower speed than QUICKLZ compression. Column store tables can provide a high compression ratio that ranges from 3:1 to 5:1.
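These algorithms are selected per table through storage options. A sketch, assuming append-optimized tables; note that QUICKLZ accepts only compresslevel=1, ZLIB accepts levels 1 to 9, and RLE_TYPE applies only to column-oriented tables:

```sql
-- QuickLZ: faster compression at a lower ratio.
CREATE TABLE t_quick (id int, v text)
WITH (appendonly=true, compresstype=quicklz, compresslevel=1)
DISTRIBUTED BY (id);

-- zlib: higher compression ratio at a higher CPU cost.
CREATE TABLE t_zlib (id int, v text)
WITH (appendonly=true, compresstype=zlib, compresslevel=5)
DISTRIBUTED BY (id);

-- Run-length encoding, column store only; effective on repetitive values.
CREATE TABLE t_rle (id int, v text)
WITH (appendonly=true, orientation=column, compresstype=rle_type)
DISTRIBUTED BY (id);
```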
High availability of data
Greenplum provides high availability by replicating data from a primary segment to the corresponding mirror segment. Greenplum provides one standard replica for disaster recovery in big data scenarios. Greenplum supports group mirroring and spread mirroring policies.
Currently, Greenplum does not support automatic time-to-live (TTL) management; you must delete and update expired data manually. Greenplum provides commands for you to reclaim storage space, which prevents dead tuples at the physical storage layer from reducing scan rates. You can run the VACUUM command to rearrange the data in a table and reclaim storage space, which improves data read performance.
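For example, on a hypothetical sales table:

```sql
-- Reclaim space left by deleted and updated rows; can run while the table is in use.
VACUUM sales;

-- Rewrite the table to fully compact it; takes an exclusive lock.
VACUUM FULL sales;

-- Refresh planner statistics after large-scale changes.
ANALYZE sales;
```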
Parallel computing capabilities of a single node
The minimum unit for parallel processing in Greenplum is an instance, not a node. Each instance has its own PostgreSQL directory structure and daemon process. Therefore, even Greenplum Database running on a single node can be considered a parallel computing system. In most cases, six to eight instances are configured for a single node, which means that six to eight PostgreSQL databases run in parallel on that node. This allows you to take advantage of all the CPU and I/O resources of each node.
Distributed computing capabilities
Greenplum uses PostgreSQL databases as its sub-instances. Coordinated by the Greenplum interconnect, dozens or even thousands of these PostgreSQL instances compute in parallel. The instances use a shared-nothing architecture, which maximizes the parallel computing capabilities of Greenplum.
Multi-table association analysis
Greenplum supports correlated subqueries and queries based on arrays and aggregate functions.
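A minimal example of a correlated subquery on a hypothetical employees table; the inner query references e.dept_id from the outer query, so it is logically re-evaluated for each outer row:

```sql
SELECT e.name, e.salary
FROM employees e
WHERE e.salary > (
    SELECT avg(d.salary)
    FROM employees d
    WHERE d.dept_id = e.dept_id
);
```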
Comparison between Hologres and Greenplum
Based on the preceding analysis of the storage and computing capabilities of Greenplum, Greenplum and Alibaba Cloud Hologres are competing products. The following table compares them.
| Category | Feature | Greenplum | Alibaba Cloud Hologres | More competitive |
| --- | --- | --- | --- | --- |
| Strategic positioning | Product positioning | A hybrid transaction/analytical processing (HTAP) data warehouse. | A hybrid serving/analytical processing (HSAP) data warehouse that can process petabytes of data in real time. | |
| Basic features | System architecture | Couples storage with computing and uses the MPP architecture, also known as a shared-nothing architecture. | Decouples storage from computing, and allows you to scale out computing and storage nodes separately. | Hologres. Decoupling storage from computing is a new trend in big data scenarios. Hologres lets you scale out computing and storage nodes separately and configure resources based on your business needs, which makes it flexible and cost-effective. |
| | Polymorphic storage | Supports row store, column store, and external storage. By default, data is stored on local disks. | Supports row store and column store. By default, data is stored in a distributed file system, such as Apsara Distributed File System or HDFS. | Both support row store and column store. |
| | Multiple sharding methods | Supports three sharding methods for data distributed across the nodes of a cluster: hash-based sharding, random sharding, and replicated tables. | Supports two sharding methods for data distributed across the nodes of a cluster: hash-based sharding and random sharding. | Greenplum. Greenplum supports more sharding methods, which lets you distribute data in a fine-grained manner and schedule cluster resources appropriately based on your business needs. Replicated tables also allow UDFs that run on nodes to access tables, which improves the performance of small-table queries. |
| | Multi-level partitioning | Supported. Greenplum supports two partitioning types: range partitioning and list partitioning. | Partially supported. Currently, Hologres supports only list partitioning and does not support subpartitions. | Greenplum. Greenplum provides finer-grained partitioning policies and more partitioning types, which helps you manage and maintain TTL. The Hologres team is working on support for multi-level partitioning and dynamic partitioning, and plans to provide and optimize the features supported by Greenplum as soon as possible. |
| | Real-time writes | Limited support. Greenplum allows you to execute INSERT statements in micro-batches, which compromises write performance. | Supported. Hologres delivers high write performance and allows you to query data immediately after it is written. | Hologres |
| | Real-time updates | Supported. | Supported. | Hologres supports real-time data writes and updates, and allows you to query data immediately after it is written. |
| | SQL support | Highly compatible with the protocols, syntax, and ecosystem of PostgreSQL 9.4. | Highly compatible with the protocols, syntax, and ecosystem of PostgreSQL 11.2. | Hologres. Hologres supports a later version of PostgreSQL than Greenplum. |
| Advanced features | Storage capability | Stores data on local Serial Advanced Technology Attachment (SATA) disks. The storage capacity depends on the cluster size and can be increased through linear scale-out. A single table can store a maximum of 32 TB of data. Greenplum provides diverse storage models and compression algorithms to improve storage capability. | Stores data in Apsara Distributed File System or HDFS. The storage capacity depends on the cluster size and can be increased through linear scale-out. A single table can store a maximum of 3 PB of data. Hologres supports diverse storage models and compression algorithms to improve storage capability. | Hologres. Hologres outperforms Greenplum in the storage capacity of a single table, and its upper limit is higher in big data scenarios. |
| | Query and analysis capabilities | Lacks performance test results for tens of billions or even hundreds of billions of data records. Single-table queries are slow, and real-time responses cannot be ensured. | Allows you to query tens of billions of data records in real time and receive analysis results in sub-seconds, and to query hundreds of billions or even trillions of records and receive results in seconds. Provides powerful table join capabilities. | Hologres. Hologres allows you to query and analyze large amounts of data in real time. |
| | Concurrent queries per second (QPS) supported in complex OLAP scenarios | Supports low concurrent QPS. | Supports high concurrent QPS based on an asynchronous architecture. | The concurrent QPS supported by Hologres is twice that supported by Greenplum. |
| | Concurrent QPS supported by online data services | Supports low concurrent QPS. | Supports high concurrent QPS. | Hologres incorporates optimizations that are specific to simple queries and supports hundreds of QPS; for point queries, Hologres supports millions of QPS. |
| | Federated computing | Supports real-time offline federated computing. | Supports real-time offline federated computing. | |
| Reliability | Atomicity, consistency, isolation, and durability (ACID) of transactions | Full support. | Limited support. | Greenplum |
| | Transaction isolation | Supports multiversion concurrency control (MVCC) to isolate transactions, which ensures that data is consistent among transactions and is not affected by changes in other concurrent transactions. | Supports MVCC, but delivers low performance in concurrency control. | Greenplum. Greenplum supports full transaction capabilities. |
| | Backup and disaster recovery | (1) Provides one standard replica for disaster recovery in big data scenarios. (2) Supports the mirroring mechanism for nodes: if a primary master or segment instance fails, Greenplum activates the corresponding mirror instance to take over. | Supported. Provides three standard replicas for disaster recovery in big data scenarios. | Hologres allows you to replicate a raw table to create multiple replicas, which ensures high availability and data security. |
| Ease of use | SQL syntax: DDL | Supported. Keywords: create, alter, and drop. Objects: database, table, view, schema, cast, sequences, role, user, user mapping, and group. | Supported. Keywords: create, alter, and drop. Objects: database, table, view, schema, cast, extension, role, user, user mapping, and group. | |
| | SQL syntax: DML | Supported. Keywords: select, insert, update, and delete. | Supported. Keywords: select, insert, update, and delete. | |
| | SQL syntax: DCL | Supported. Keywords: grant, revoke, and rollback. | Supported. Keywords: grant, revoke, and rollback. | |
| | Development tools | Uses terminal tools and open source tools. Terminal tools are the major tools used. | Uses terminal tools, open source tools, and Alibaba Cloud tools. | |
| | Scalability | Uses a shared-nothing architecture with few node interactions. Supports linear scale-out: you can add nodes to improve parallel processing capabilities. | Allows you to add storage and computing nodes separately for linear scale-out, which improves parallel processing capabilities. | Hologres. Its storage-computing decoupled architecture suits real-time big data processing for the following reasons: (1) when scale-out is needed to remove a storage or computing bottleneck, you can expand only the constrained resource, which prevents waste; (2) you do not need to change resource allocation plans frequently; (3) it avoids the large-scale data migration that scale-out may otherwise require. |
| | Operations and maintenance (O&M) | Relies on manual system O&M and performance tuning. Hardware O&M costs are high, and the performance tuning procedure is complex. | Automatically detects the topology changes of a cluster, so you do not need to concern yourself with them. | Hologres provides fully managed services, so no maintenance efforts are required. |
| Scenarios | | Stores large amounts of data, allows you to query large amounts of data offline, and supports the elastic scale-out of clusters. | Connects data silos, stores large amounts of data, allows you to query and analyze large amounts of data in real time, and supports the elastic scale-out of clusters. | |
Hologres provides the following advantages over Greenplum:
(1) Hologres uses an architecture where storage is decoupled from computing. Hologres allows you to scale out storage and computing resources in a separate way, and configure storage and computing resources based on your business needs. This helps you save costs.
(2) Hologres is highly compatible with PostgreSQL 11.4. Hologres supports the standard PostgreSQL syntax and provides a community-based ecosystem. You can access and use the Hologres service at low costs.
(3) Hologres supports federated queries among disparate data sources.
(4) Hologres allows you to store large amounts of data in a single table. Hologres also allows you to query and analyze large amounts of data in real time.
(5) Hologres delivers high performance for concurrent tasks and online services. Hologres incorporates the optimizations that are specific to simple queries, and supports hundreds of QPS. Hologres supports millions of QPS for point queries.
(6) Hologres allows you to replicate a raw table to create multiple replicas. This ensures high availability and data security.
(7) Hologres provides a real-time cloud data warehouse service. Hologres provides fully managed services, which indicates that no maintenance efforts are required.