This presentation covers three topics:
1. Mainstream Real-Time Data Warehouse Architecture: Lambda
2. Alibaba’s Practice with Lambda
3. Hologres, a Cloud-Native HSAP System
1. Mainstream Real-Time Data Warehouse Architecture: Lambda
1.1 Timeliness, a Booster of the Value of Data
Enterprises are embracing digital transformation. As we all know, the value of data decreases rapidly over time. Therefore, timeliness boosts the value of data.
Timeliness here is a broad concept. First, it involves the end-to-end collection, processing, and analytics of real-time data. Second, it concerns turning real-time analytics into real-time data services for online production systems. Third, it requires enterprises to offer quick self-service analytics on their existing architecture so that they can respond rapidly to business changes. Enterprises can realize the full value of data only when they succeed in all three aspects.
1.2 Mainstream Real-Time Data Warehouse Architecture: Lambda
Many enterprises advance in their digital transformation through trials and errors. They constantly upgrade data architectures to solve business problems. Currently, Lambda is the most popular real-time data warehouse architecture.
Meituan, Zhihu, Cainiao, and other enterprises have succeeded in practice with the Lambda architecture. As shown in the following figure, a real-time data warehouse built on the Lambda architecture distributes the collected data to a real-time layer and an offline layer according to business needs. The real-time layer processes real-time data, and the offline layer processes offline data. The processing results are integrated before the data service layer connects to data applications. In this way, real-time data and offline data can jointly serve online data applications and data products.
- Processing at the offline layer: The offline layer stores the collected data in an offline data warehouse and unifies it at the operational data store (ODS) layer. The ODS data is then cleaned to build a data warehouse detail (DWD) layer. During data modeling, enterprises pre-aggregate the DWD data to improve the efficiency of data analytics and stratify it for different business needs. The pre-aggregated results are stored in offline storage systems before they can serve data services.
- Processing at the real-time layer: The real-time layer processes data according to a logic similar to that of the offline layer, but with stricter timeliness requirements. Like the offline layer, it subscribes to and consumes data from upstream databases or systems, and then writes the data into its real-time DWD or summary (DWS) layer. Real-time computing engines supply the computing power, and the computed results are stored in real-time storage systems, such as a key-value (KV) store.
In consideration of costs and development efficiency, the real-time and offline layers are usually not completely synchronous. The real-time layer generally stores data for two to three days, or seven days in special cases. Data that needs to be retained for months or years is stored in offline warehouses.
The real-time data warehouse is thus split into a real-time layer and an offline layer. However, the business usually cares about the results of data analytics rather than how the data is processed, and those results must cover both real-time and offline data. Because the two layers store their outputs in different storage systems, the stream-processed real-time data and the batch-processed offline data must be integrated before they can serve data services.
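The integration step above can be sketched in a few lines. The example below is a hypothetical simplification of a Lambda serving-layer merge, not Alibaba's implementation: a batch-computed aggregate from the offline layer is combined with a stream-computed aggregate from the real-time layer into one serving view, with the batch result treated as the source of truth where the two overlap.

```python
# Hypothetical sketch of the Lambda serving-layer merge (not a real system).
# The offline layer owns finalized history; the real-time layer owns fresh data.

def merge_serving_view(batch_agg: dict, speed_agg: dict) -> dict:
    """Combine batch (authoritative) and streaming (recent) aggregates.

    Keys are (date, item_id); values are, e.g., order counts.
    Where both layers report the same key, the batch result wins,
    because batch reprocessing is treated as the source of truth.
    """
    merged = dict(speed_agg)      # start with fresh real-time results
    merged.update(batch_agg)      # batch results override overlaps
    return merged

# Offline layer: finalized results up to yesterday.
batch = {("2020-11-10", "item_1"): 120, ("2020-11-10", "item_2"): 45}
# Real-time layer: today's running counts, plus a stale value for yesterday.
speed = {("2020-11-11", "item_1"): 7, ("2020-11-10", "item_1"): 118}

view = merge_serving_view(batch, speed)
```

Even in this toy form, the merge illustrates the pain point discussed next: the same key can carry two different values, one per pipeline, and a policy is needed to reconcile them.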
It seems that this architecture solves many business problems related to offline data warehousing, data analytics, and data dashboards. However, this architecture is not perfect. Some pain points still exist.
1.3 Pain Points in the Lambda Architecture
(1) Inconsistency: Two semantics, two logics, and two data sets are generated.
The Lambda architecture processes data at two layers: the offline layer for batch processing and the real-time layer for stream processing. The two layers have different computing engines and storage engines, resulting in different semantics. As a result, data is processed according to different logics and into different results.
Because the offline layer and the real-time layer write their results in different storage systems, the same data is stored in at least two storage systems. One copy is from batch processing, and one copy is from stream processing. These copies need to be integrated before they can serve data services. This process requires the data structure to be redefined because storage, transformation, and integration can cause inconsistency.
Data inconsistency caused by this architectural complexity cannot be fully eliminated. The industry practice is to tolerate it at the business level. For example, the architecture is considered acceptable if the business can accept a difference of less than 3% between real-time and offline data.
(2) Complex architecture built on interconnected systems, with high O&M costs: Batch processing often uses offline computing engines such as MaxCompute or self-built Hadoop engines, while stream processing introduces additional products, such as Flink and Spark, into the system. The processed data then needs to be written into storage systems, so still more products join the data service layer, making the architecture even more complex. For example, this layer may include HBase for efficient point queries, Presto or Impala for interactive analytics over the offline data warehouse, MySQL for data export, and open-source systems such as Druid and ClickHouse for end-to-end real-time performance in the real-time data warehouse.

These systems complicate the architecture and increase O&M costs. Data developers need to familiarize themselves with many systems, which raises the cost of learning. In addition, multi-layer data processing and cleaning cause serious redundancy: the real-time layer and the offline layer each retain a data set, and yet another data set is generated when the two are integrated. This data expansion consumes vast storage resources.
(3) Long development cycle and less agile business: Each data set or business solution requires data checking and verification before launch. Any problem found during data checking is difficult to locate and diagnose, because problems can occur at any stage, and some are not discovered until the data application layer. When a problem is found, you need to go back over all the procedures of data integration, real-time computing, offline processing, and data collection. The process is complex, and revising the pipeline and backfilling missing data takes a long time. In addition, each data set is processed separately by the real-time and offline layers, so the processing pipeline is long. If a new field needs to be inserted, especially somewhere upstream, the pipeline must be rerun and missing historical data must be backfilled, which consumes time and resources.
(4) Inflexible response to new business demands: After data development is completed, the functionalities are acknowledged by businesses, and the system starts to deliver excellent results. Businesses then appreciate the value of the data architecture more and raise more demands. Product operation teams or the management may request new data analytics reports for a data set that they consider valuable, or ask for self-service real-time analytics. However, in the Lambda architecture, all computing and analytics are done at the computing layer. For example, to incorporate new business data in the offline layer, you need to create a new job at the DWS layer, write the new data to the DWS-layer storage, synchronize the data to the data service system, and then provide online reporting services. In this process, data developers need to review and assess the data requirements, and the development cycle includes an offline processing time of T+1. In urgent situations that cannot wait for T+1, business opportunities can be lost. The same applies to the development of new real-time pipelines for new business requirements: development, testing, and launching all extend the development cycle. As a result, the Lambda architecture does not meet the flexibility requirements of online businesses.
2. Alibaba’s Practice with Lambda
2.1 Old Architecture for Refined Operation of Search Recommendations
Architectures in practice are more complex than the ideal Lambda architecture. The most complex data architectures at Alibaba are those for search recommendations. The following figure shows Alibaba's old architecture for the refined operation of search recommendations, which closely follows the Lambda pattern.
On the left are the data sources of Alibaba, which include transaction, user behavior, product attribute, and search recommendation data. The data is either batch-imported into MaxCompute through data integration, or collected through the real-time message queuing of Alibaba DataHub and then cleaned by Flink.
Architectural evolution: The architecture evolves as Flink develops rapidly. One typical business of Alibaba Cloud in this area is the dashboard of the Double 11 Shopping Festival, where data is cleaned by Flink and then written into HBase. HBase is connected to the real-time dashboard to provide efficient point query services.
As Flink enhances its computing power and extends more real-time data warehousing and reporting applications, product operation teams and the management recognize the value of real-time services to business. After discovering the offline analytics capabilities of MaxCompute and the powerful real-time capabilities of Flink, the business asks whether the same data can be analyzed in real time. Therefore, open-source products, such as Druid, are introduced. Flink writes its online logs into Druid in real time. Druid analyzes the data in real time and sends the results to real-time reporting, decision-making, and real-time inventory systems.
Two pipelines are thus formed: Druid analyzes real-time data, and MaxCompute analyzes offline data. However, due to performance bottlenecks such as limited storage capacity, Druid stores data only for two to three days, or seven days in special cases. During marketing events such as big promotions, new data usually needs to be compared with historical data. For example, the data of a seasonal promotion is compared with that of the last year or the year before, the data of a promotion period is compared with that of the warm-up period, and the current data is compared with that of the last week or the week before for marketing strategy analytics. To integrate offline and real-time data for such comparisons, more products are introduced into the system: the offline data in MaxCompute and the real-time data in Druid are merged in MySQL before the data is used to provide online services.
2.2 Integrated Systems, Scenarios, and Analytics Services
As shown in the figure, the data sources and data cleaning remain the same as in the ideal Lambda architecture. However, the business application and business analytics layers change as needs grow, and more products are introduced to support them. This is still a typical Lambda architecture, and with the rapid business expansion of Alibaba, it becomes increasingly unsatisfactory in terms of consistency, O&M simplicity, costs, and agility.
What capabilities do these new systems offer and why are they introduced into the architecture as business becomes more complex?
- KV store: Redis, MySQL, HBase, or Cassandra. A KV store improves point query efficiency in scenarios that require high queries per second (QPS).
- Interactive computing: Presto or Drill. An interactive engine supports ad-hoc analytical queries over the data warehouse.
- Real-time data warehouse: ClickHouse or Druid. A real-time data warehouse offers real-time storage and online computing.
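To make this division of labor concrete, here is a toy sketch (my own illustration, not the behavior of any product listed above): a KV store answers a point query with a single hash lookup, while an analytical engine answers an ad-hoc aggregation by scanning rows. Serving either workload on the wrong engine is what pushes Lambda deployments to operate both kinds of systems.

```python
# Toy illustration of why point queries and interactive analytics
# land on different engines. Hypothetical data, not a real product.

rows = [
    {"user_id": u, "city": "HZ" if u % 2 else "SH", "gmv": u * 10}
    for u in range(1, 1001)
]

# KV-store style: index once, then each point query is an O(1) lookup.
kv_index = {r["user_id"]: r for r in rows}

def point_query(user_id: int) -> dict:
    """High-QPS serving path: fetch one user's row by key."""
    return kv_index[user_id]

# Analytical-engine style: an ad-hoc aggregation scans the whole table.
def interactive_query(city: str) -> int:
    """Analytics path: sum GMV over all rows matching a predicate."""
    return sum(r["gmv"] for r in rows if r["city"] == city)

profile = point_query(42)        # one lookup
hz_gmv = interactive_query("HZ")  # full scan
```

The lookup path stays fast no matter how many rows exist, while the aggregation path touches every row; an engine optimized for one access pattern degrades on the other.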
2.3 New Architecture for Refined Operation of Search Recommendations
Various big data products are introduced to support these three types of capabilities. Is it possible to integrate these capabilities into one engine that can handle the same tasks in multiple business scenarios? To store data and provide services to applications in a unified way, Alibaba conceives the following architecture.
The upstream data processing and cleaning remain the same, but more capabilities are enabled when the system provides services for business applications. For example, the architecture provides the capabilities of point query, result caching, offline acceleration, federated analytics, and interactive analytics. This architecture is defined as a hybrid serving analytical processing (HSAP) system, which integrates analytics and services. It can analyze data and provide services from the same data at the same time.
Hologres is a cloud-native HSAP system launched by Alibaba in this context.
First, Hologres stores real-time and offline data in a unified way. It supports real-time writes for live data and batch import of offline data.
Second, data services are designed around real-time analytics that can satisfy the needs of both real-time business analytics and online services.
Third, Hologres accelerates queries over offline data in MaxCompute and serves online applications directly. The original real-time data warehouse architecture remains unchanged.
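The contrast with Lambda can be sketched as follows. This is a hypothetical simplification of the HSAP idea, not Hologres internals: a single table accepts both streaming writes and batch loads, and the same table then serves both point queries and analytical aggregations, with no merge step and no second copy of the data.

```python
# Hypothetical sketch of the HSAP idea: one store, two write paths,
# two query paths. Not Hologres internals.

class UnifiedTable:
    def __init__(self):
        self._rows = {}                        # primary key -> row

    def stream_upsert(self, key, row):         # real-time write path
        self._rows[key] = row

    def batch_load(self, rows):                # offline import path
        for key, row in rows.items():
            self._rows[key] = row

    def point_get(self, key):                  # serving: point query
        return self._rows[key]

    def aggregate(self, field):                # analytics: full scan
        return sum(r[field] for r in self._rows.values())

t = UnifiedTable()
t.batch_load({1: {"gmv": 100}, 2: {"gmv": 200}})  # yesterday's batch import
t.stream_upsert(3, {"gmv": 5})                     # live event
t.stream_upsert(1, {"gmv": 110})                   # live correction

total = t.aggregate("gmv")  # new data is queryable immediately after write
```

Because both write paths land in the same store, the point query and the aggregation see one consistent data set, which is exactly the inconsistency and integration burden that the two-pipeline Lambda design cannot avoid.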
3. Hologres, a Cloud-Native HSAP System
3.1 Core Benefits of Hologres
In this cloud-native HSAP system, the same data can be used in real-time analytics and online services.
Quick response: The millisecond-level response of Hologres can easily satisfy your needs for complex and multi-dimensional analytics of massive data. Hologres can process tens of millions of point queries and thousands of simple real-time queries per second.
Real-time storage: Hologres can write hundreds of millions of transactions per second (TPS), satisfying your needs for timeliness. You can query data immediately after it is written.
Acceleration of MaxCompute data: Hologres can directly analyze data stored in MaxCompute without data migration or redundant storage.
PostgreSQL ecosystem: Hologres is compatible with the PostgreSQL ecosystem. Hologres is developer-friendly and compatible with PostgreSQL tools such as psql, Navicat, and DataGrip. It also interacts seamlessly with business intelligence (BI) tools.
Online service data of Hologres during the Double 11 Shopping Festival of 2019: Hologres enabled real-time writes of 130 million TPS and 145 million QPS for highly concurrent online queries. Data could be queried immediately after it was written.
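Because Hologres speaks the PostgreSQL wire protocol, standard drivers and tools connect to it as they would to any PostgreSQL server. The sketch below only assembles a libpq-style connection string and a query; the host, port, database, and credentials are placeholders, and the actual connect call (for example, with psycopg2) is shown in a comment rather than executed.

```python
# Hologres is PostgreSQL-compatible, so a standard libpq-style DSN works.
# All values below are placeholders, not real endpoints or credentials.

host = "your-instance.hologres.aliyuncs.com"  # placeholder endpoint
port = "80"                                   # depends on your endpoint
dsn = (
    f"host={host} port={port} dbname=your_db "
    "user=your_access_id password=your_access_key"
)

query = "SELECT city, sum(gmv) FROM orders GROUP BY city;"

# With real credentials, you would run something like:
#   import psycopg2
#   conn = psycopg2.connect(dsn)
#   with conn.cursor() as cur:
#       cur.execute(query)
#       print(cur.fetchall())
```

The same DSN also works from psql or any BI tool that supports PostgreSQL, which is the practical meaning of the ecosystem compatibility described above.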
3.2 Typical Scenarios of Hologres
Acceleration of offline data queries: Hologres responds to interactive queries over offline MaxCompute data within seconds, transforming cold data into easy-to-understand analytics results without additional extract, transform, and load (ETL) work. This helps enterprises make decisions more efficiently and reduces time costs.
Real-time data warehouse: Built on Flink and Hologres, the real-time data warehouse provides real-time user diagnostics from different perspectives through observation and monitoring. This supports targeted operation policies and achieves refined real-time user operation.
Federated computing for real-time and offline data: The computing combines MaxCompute and Hologres. Based on business logic, it allows you to analyze offline data in real time and perform federated queries of both real-time and offline data, achieving end-to-end refined real-time operation.
The following describes in detail the architecture of the preceding three scenarios and their application.
3.3 Case Study – Internet Media Content
The following figure demonstrates a typical case in the Internet industry. VivaVideo is a short video app popular in Southeast Asia. In addition to real-time dashboards and reporting, the Internet industry relies on user analytics, user profiling, user tagging, and real-time video recommendations. This scenario is similar to the refined search recommendation of Alibaba.
3.4 Data Chain Streamlining
In Hologres, the end-to-end data ecosystem is built around data and integrates the big data, PostgreSQL, and Alibaba Cloud ecosystems. The end-to-end chain connects data sources, data synchronization, data processing, data O&M, and data analytics and application.