Revolutionizing Data Management with HTAP Databases
Organizations handle vast amounts of data daily, necessitating efficient systems such as Hybrid Transactional/Analytical Processing (HTAP). HTAP combines online transaction processing (OLTP) and online analytical processing (OLAP) in a single system, enabling real-time insights and prompt action. HTAP architectures such as a primary row store paired with an in-memory column store play a crucial role in modern data-intensive applications, while distributed designs such as a row store with column-store replicas add scalability and performance.
HTAP DATABASES: WHAT IS NEW AND WHAT IS NEXT
INTRODUCTION Every organization is processing more data than ever, and that data arrives with high volume, high velocity, and high variety. A single HTAP system that can effectively handle both online transactional processing (OLTP) and online analytical processing (OLAP) for quick decision-making is advantageous for enterprises with data-intensive applications. With an HTAP system, for example, business owners in the retail industry can analyze the most recent transaction data in real time, pinpoint sales trends, and take prompt action, such as launching advertising campaigns for products that show promise.
HTAP DEFINITION A 2014 Gartner report proposed the hybrid transactional/analytical processing (HTAP) application architecture, which uses in-memory computing technologies to enable concurrent analytical and transaction processing on the same in-memory data store. Such an architecture makes extract-transform-load (ETL) processes unnecessary, expediting data analytics and enabling significant business innovation.
HTAP DATABASES
Primary Row Store + In-Memory Column Store
Distributed Row Store + Column Store Replica
Disk Row Store + Distributed Column Store
Primary Column Store + Delta Row Store
PRIMARY ROW STORE + IN-MEMORY COLUMN STORE These HTAP databases use a primary row store as the foundation for OLTP operations and an in-memory column store to process OLAP workloads. All data is persisted in the primary row store, which is memory-optimized to handle data updates efficiently. Updates are also written to a delta store, which is later merged into the column store.
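To make the write and merge paths concrete, here is a minimal sketch of this pattern. All class and method names are hypothetical, not any vendor's API; a real system would persist the row store and handle updates and deletes, not just inserts.

```python
# Sketch of the primary-row-store + in-memory-column-store pattern (illustrative only).

class HybridStore:
    def __init__(self, columns):
        self.row_store = {}                           # primary store: row_id -> full row
        self.column_store = {c: [] for c in columns}  # in-memory columnar copy for OLAP
        self.delta = []                               # recent writes not yet merged

    def insert(self, row_id, row):
        """OLTP path: write the row store first, then buffer the change in the delta."""
        self.row_store[row_id] = row
        self.delta.append((row_id, row))

    def merge_delta(self):
        """Fold buffered updates into the column store so OLAP sees fresh data."""
        for row_id, row in self.delta:
            for col, value in row.items():
                self.column_store[col].append(value)
        self.delta.clear()

store = HybridStore(columns=["product", "amount"])
store.insert(1, {"product": "shoes", "amount": 3})
store.merge_delta()
print(store.column_store["amount"])  # OLAP scan reads the merged columnar data
```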
DISTRIBUTED ROW STORE + COLUMN STORE REPLICA This category relies on a distributed architecture to enable HTAP. While responding to transaction requests, the master node asynchronously ships its logs to slave nodes. The row store is the primary storage, but some slave nodes are designated as column-store servers to accelerate queries. Transactions are handled in a distributed manner for high scalability; complex queries are executed on the nodes that host a column store. TiDB, a Raft-based distributed HTAP database, is one example: it asynchronously replicates Raft logs from the leader node to follower nodes that store row-based replicas, and to learner nodes that maintain column-based replicas for analytics.
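The sketch below illustrates the asynchronous log-shipping idea, loosely modeled on the design described above. The classes and methods are illustrative only, not TiDB's actual interfaces, and real Raft replication involves terms, quorums, and failure handling omitted here.

```python
# Sketch of async log shipping from a row-store leader to a columnar replica.

class ColumnReplica:
    def __init__(self):
        self.columns = {}        # column name -> list of values
        self.applied_index = 0   # last log index applied to the columnar copy

    def apply(self, index, row):
        for col, value in row.items():
            self.columns.setdefault(col, []).append(value)
        self.applied_index = index

class Leader:
    def __init__(self, replica):
        self.log = []            # ordered log of committed rows
        self.replica = replica

    def commit(self, row):
        """OLTP commit: append to the log; replication happens asynchronously."""
        self.log.append(row)

    def replicate(self):
        """Background task: ship any unapplied log entries to the column replica."""
        for i in range(self.replica.applied_index, len(self.log)):
            self.replica.apply(i + 1, self.log[i])

replica = ColumnReplica()
leader = Leader(replica)
leader.commit({"product": "shoes", "amount": 3})
leader.replicate()   # analytics now see the row in columnar form
```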
DISK ROW STORE + DISTRIBUTED COLUMN STORE These databases enable HTAP by pairing a disk-based RDBMS with a distributed in-memory column store (IMCS). The RDBMS retains its full capability for OLTP workloads, while a tightly connected IMCS cluster accelerates query processing. Columnar data is extracted from the row store; hot data is kept in the IMCS and cold data is moved to disk. For instance, MySQL Heatwave enables real-time analytics by combining a MySQL database with a distributed IMCS cluster called Heatwave. Transactions are handled entirely by the MySQL database, frequently accessed columns are loaded into Heatwave, and a complex query can be pushed down to the IMCS engine for acceleration.
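A minimal sketch of the pushdown decision: a query can be offloaded to the columnar cluster only if every column it touches has been loaded there, otherwise it falls back to the row-store engine. The routing function and names are assumptions for illustration, not Heatwave's actual logic.

```python
# Sketch of routing a query between the row store and an IMCS cluster.

def route_query(query_columns, imcs_loaded_columns):
    if set(query_columns) <= imcs_loaded_columns:
        return "imcs"        # accelerate: run on the columnar cluster
    return "row_store"       # fall back: some columns exist only in the RDBMS

loaded = {"price", "qty", "region"}
print(route_query(["price", "qty"], loaded))       # -> imcs
print(route_query(["price", "comment"], loaded))   # -> row_store
```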
PRIMARY COLUMN STORE + DELTA ROW STORE This group of databases uses a primary column store as the foundation for OLAP and a delta row store for OLTP. In these in-memory delta-main HTAP databases, the entire dataset is kept in the main column store, and new data is written to the row-based delta store. The highly read-optimized column store yields outstanding OLAP performance; however, OLTP scalability is poor because only a delta row store is available for OLTP workloads. SAP HANA is one example.
HTAP TECHNIQUES
Transaction Processing (TP) Techniques
Analytical Processing (AP) Techniques
Data Synchronization (DS) Techniques
Query Optimization Techniques
Resource Scheduling Techniques
TRANSACTION PROCESSING (TP) TECHNIQUES
1. The first type is MVCC + logging, which processes transactions using MVCC protocols together with logging mechanisms. Each insert is first written to the log and the row store, then added to the in-memory delta store. An update creates a new version with a lifespan defined by a begin timestamp and an end timestamp, and the earlier version of the row is marked as deleted in a delete bitmap. Because the DML operations are carried out in memory, transaction processing is efficient.
2. The second type, 2PC + Raft + logging, uses a distributed architecture. It offers distributed transaction processing and excellent scalability. ACID transactions on the distributed nodes are processed using the write-ahead log (WAL) technique, a Raft-based consensus algorithm, and the two-phase commit (2PC) protocol.
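The following sketch illustrates the first type: versions carry begin/end timestamps, updates close the old version's lifespan and record it in a delete bitmap, and readers see the version visible at their timestamp. Structures and names are illustrative, not a specific system's implementation.

```python
# Minimal MVCC sketch with begin/end timestamps and a delete bitmap.

class VersionedRow:
    def __init__(self, value, begin_ts, end_ts=float("inf")):
        self.value = value
        self.begin_ts = begin_ts   # timestamp at which this version became visible
        self.end_ts = end_ts       # timestamp at which it was superseded

class MVCCTable:
    def __init__(self):
        self.versions = {}          # key -> list of VersionedRow, newest last
        self.delete_bitmap = set()  # versions logically deleted by an update

    def update(self, key, value, ts):
        chain = self.versions.setdefault(key, [])
        if chain:
            old = chain[-1]
            old.end_ts = ts                        # close the old version's lifespan
            self.delete_bitmap.add((key, old.begin_ts))
        chain.append(VersionedRow(value, ts))

    def read(self, key, ts):
        """Return the version visible at timestamp ts, honoring [begin_ts, end_ts)."""
        for v in reversed(self.versions.get(key, [])):
            if v.begin_ts <= ts < v.end_ts:
                return v.value
        return None

t = MVCCTable()
t.update("k", "v1", ts=10)
t.update("k", "v2", ts=20)
print(t.read("k", ts=15))  # -> "v1": a reader at ts=15 still sees the old version
```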
ANALYTICAL PROCESSING (AP) TECHNIQUES
1. The first type is in-memory delta and column scan. This line of work scans the columnar data and the in-memory delta data together, since the delta store can contain updated entries that have not yet been merged into the column store. Data freshness is high for OLAP because the recently committed delta tuples in memory are visible to the scan.
2. The second type, log-based delta and column scan, likewise scans both columnar and delta data for incoming queries, but the delta resides in log files. This method is more expensive than the first because it must read any delta files that have not been merged, and because shipping and merging the delta files involves long delays, data freshness is low.
3. The third option is a pure column scan, which is highly efficient because it scans only columnar data and avoids the expense of reading any delta data. However, since the data is continually updated in the row store, this strategy results in low freshness.
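A toy sketch of the first technique: a query scans the columnar data and the unmerged delta together, so results include fresh tuples. The structures are hypothetical, not a specific system's implementation.

```python
# Sketch of a unified "delta + column" scan.

def scan(column_store, delta, column, predicate):
    """Yield matching values from both the column store and the delta buffer."""
    for value in column_store.get(column, []):
        if predicate(value):
            yield value
    for row in delta:                 # recently written rows, not yet merged
        value = row.get(column)
        if value is not None and predicate(value):
            yield value

column_store = {"amount": [3, 7, 12]}
delta = [{"product": "hat", "amount": 25}]   # fresh insert still in the delta
print(list(scan(column_store, delta, "amount", lambda v: v > 5)))  # [7, 12, 25]
```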
DATA SYNCHRONIZATION (DS) TECHNIQUES Three DS approaches are used by different HTAP databases: (i) in-memory delta merge, (ii) disk-based delta merge, and (iii) rebuild from the primary row store.
1. The first category periodically merges the most recent in-memory delta data into the primary column store. Among the methods proposed to optimize the merge process is a two-phase transaction-based data migration.
2. In the second category, disk-based delta files are merged into the main column store. If the delta data is indexed by a B+-tree, delta items can be retrieved quickly through key lookups, which speeds up the merge.
3. The third category rebuilds the in-memory column store from the primary row store. This is typical when the number of delta changes exceeds a threshold; in that situation, rebuilding the column store is more effective than merging the updates, which would consume a great deal of memory.
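The sketch below captures the threshold logic from the third category: merge small deltas in place, but rebuild the column store from the primary row store once the backlog grows too large. The threshold value and data structures are assumptions for illustration.

```python
# Sketch of threshold-based synchronization between row and column stores.

REBUILD_THRESHOLD = 1000   # assumed cutoff; real systems tune this

def synchronize(row_store, column_store, delta):
    if len(delta) > REBUILD_THRESHOLD:
        # Too many pending changes: rebuilding wholesale beats merging.
        column_store.clear()
        for row in row_store.values():
            for col, value in row.items():
                column_store.setdefault(col, []).append(value)
    else:
        # Small backlog: fold the delta entries into the columnar copy.
        for row in delta:
            for col, value in row.items():
                column_store.setdefault(col, []).append(value)
    delta.clear()
```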
QUERY OPTIMIZATION TECHNIQUES Three query optimization techniques are covered: (i) HTAP column selection, (ii) hybrid row/column scan, and (iii) HTAP CPU/GPU acceleration.
1. The first type uses statistics and the historical workload to choose frequently requested columns from the primary store and load them into memory. A query can then be pushed down to the in-memory column store for acceleration. The drawback is that if the columns needed by a new query have not been selected, the query falls back to row-based processing.
2. The second type accelerates a query with a hybrid row/column scan (see the sketch after this list). These methods decompose a complex query so that parts run over the row store and parts over the column store, then combine the results. This is common for a select-project-join (SPJ) query that can be processed with both a column-based full scan and a row-based index scan.
3. The third category exploits heterogeneous CPU/GPU architectures to accelerate HTAP workloads. These methods use the task-parallel nature of CPUs for OLTP and the data-parallel nature of GPUs for OLAP. However, they favor high OLAP throughput at the cost of low OLTP throughput.
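The sketch below shows the access-path choice behind the hybrid row/column scan: a selective point lookup favors the row-store index, while a wide scan favors the column store. The cost constants are made up purely for illustration.

```python
# Toy cost-based choice between a row-store index scan and a columnar scan.

def choose_access_path(estimated_rows, total_rows, has_index):
    selectivity = estimated_rows / max(total_rows, 1)  # kept for inspection/logging
    index_cost = estimated_rows * 2.0 if has_index else float("inf")  # per-row lookup
    column_cost = total_rows * 0.1                                    # tight columnar scan
    return "row_index_scan" if index_cost < column_cost else "column_scan"

# A selective lookup favors the row-store index; a broad scan favors columns.
print(choose_access_path(estimated_rows=10, total_rows=1_000_000, has_index=True))
print(choose_access_path(estimated_rows=500_000, total_rows=1_000_000, has_index=True))
```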
RESOURCE SCHEDULING TECHNIQUES Scheduling techniques fall into two categories: workload-driven approaches and freshness-driven approaches. The former adjusts the parallelism of OLTP and OLAP threads based on the performance of the executed workloads. For instance, when OLAP threads are using up all available CPU resources, the task scheduler may reduce OLAP parallelism and increase OLTP parallelism. The latter switches the execution modes that govern resource allocation and data exchange between OLTP and OLAP. For instance, to maximize throughput, the scheduler first executes OLTP and OLAP in isolation and synchronizes the data periodically; as data freshness drops, it transitions to an execution mode with shared CPU, memory, and data.
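Here is a minimal sketch combining both ideas: a workload-driven thread rebalance plus a freshness-driven mode switch. The thresholds and signals are illustrative assumptions, not any system's actual policy.

```python
# Sketch of a scheduler mixing workload-driven and freshness-driven decisions.

def schedule(cpu_olap_share, staleness_secs, threads):
    # Workload-driven: rebalance threads when OLAP starves OLTP of CPU.
    if cpu_olap_share > 0.8:
        threads["olap"] = max(1, threads["olap"] - 1)
        threads["oltp"] += 1
    # Freshness-driven: switch execution mode as the data grows stale.
    mode = "isolated" if staleness_secs < 5.0 else "shared"
    return mode, threads

mode, threads = schedule(cpu_olap_share=0.9, staleness_secs=8.0,
                         threads={"oltp": 4, "olap": 4})
print(mode, threads)   # -> shared {'oltp': 5, 'olap': 3}
```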
HTAP BENCHMARKS A widely used end-to-end HTAP benchmark is the CH-benchmark, which combines two TPC benchmarks: TPC-C for transactional workloads and TPC-H for analytical workloads. HTAPBench, another end-to-end benchmark, also combines TPC-C and TPC-H but proposes a different metric. Comparing the two reveals how each extends the original data generators for data production, how their execution rules use benchmark parameters to manage the concurrent execution of OLTP and OLAP workloads, and how their performance metrics combine transactions per minute (tpmC) with finished queries per hour (QphH).
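As a simple illustration of reporting both sides of a mixed workload, the sketch below pairs the transactional and analytical throughput figures. The computation is illustrative, in the spirit of these benchmarks, not either benchmark's official metric.

```python
# Sketch of reporting paired HTAP throughput metrics.

def htap_report(committed_txns, elapsed_minutes, finished_queries, elapsed_hours):
    tpmC = committed_txns / elapsed_minutes   # transactions per minute
    QphH = finished_queries / elapsed_hours   # analytical queries per hour
    return {"tpmC": tpmC, "QphH": QphH}

print(htap_report(committed_txns=120_000, elapsed_minutes=30,
                  finished_queries=450, elapsed_hours=0.5))
```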
HTAP DATABASE EVALUATION Existing evaluation techniques yield a number of lessons about HTAP databases, focusing on the trade-offs these systems make to manage OLTP and OLAP workloads together. Quantitative statistics give insight into how various HTAP databases perform under different circumstances; for instance, one can assess the performance degradation a system must bear to keep data fresh, striking a balance between workload isolation and data freshness.
CHALLENGES AND OPEN PROBLEMS
Automatic Column Selection for HTAP Workload
Learned HTAP Query Optimizer
Adaptive HTAP Resource Scheduling
HTAP Benchmark Suite
AUTOMATIC COLUMN SELECTION FOR HTAP WORKLOAD Choosing which columns to load from the row store into the in-memory column store is crucial for an HTAP workload. Current techniques, such as Oracle 21c's Heat Map, rely mainly on historical statistics to choose the columns to keep in memory. Such techniques are costly and rigid because they generate recommendations by executing all the queries. Lately, learning-based techniques for problems such as view selection, join ordering, and knob tuning have become widely used in the database field, so new automatic approaches are needed to choose columns for HTAP workloads efficiently and effectively. The key problem is designing a fast learning approach that captures workload access patterns without running the complete workload through the database. Moreover, considering data encoding at the same time is difficult because it enlarges the search space.
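The sketch below shows the statistics-driven baseline this open problem aims to improve on: count column accesses in the recent workload and load the hottest columns under a memory budget. A learned approach would predict these frequencies instead of replaying queries. All names and numbers are illustrative.

```python
# Sketch of frequency-based column selection under a memory budget.
from collections import Counter

def select_columns(workload, column_sizes, memory_budget):
    """workload: list of queries, each a list of accessed column names."""
    heat = Counter(col for query in workload for col in query)
    chosen, used = [], 0
    for col, _ in heat.most_common():             # hottest columns first
        if used + column_sizes[col] <= memory_budget:
            chosen.append(col)
            used += column_sizes[col]
    return chosen

workload = [["price", "qty"], ["price"], ["price", "region"]]
sizes = {"price": 40, "qty": 30, "region": 50}
print(select_columns(workload, sizes, memory_budget=80))  # ['price', 'qty']
```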
LEARNED HTAP QUERY OPTIMIZER When optimizing a query, current techniques choose between row-store and column-store access paths in an HTAP database using cost functions: they estimate row and column cardinalities and use those estimates to compute the scan costs for the row store and the column store. Such strategies are troublesome for correlated and skewed data because the cost estimates become inaccurate. Lately, learned query optimizers have demonstrated practical benefits by learning to map an incoming query to an execution plan. Consequently, building a learned query optimizer for HTAP databases is equally promising. Given the size of the learning space, the key problem is to take both row-based and column-based operators into account during the learning phase.
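A toy sketch of the idea: a model scores candidate plans that mix row-based and column-based operators, replacing hand-tuned cost formulas. The "model" here is a stub with made-up weights; a real one would be trained on executed plans and their observed latencies.

```python
# Sketch of plan selection driven by a (stubbed) learned cost model.

def learned_cost(plan_features):
    # Stand-in for a trained regressor; weights are illustrative only.
    weights = {"row_scan_rows": 0.002, "column_scan_rows": 0.0001, "joins": 5.0}
    return sum(weights[k] * v for k, v in plan_features.items())

candidates = {
    "row_plan":    {"row_scan_rows": 1_000_000, "column_scan_rows": 0, "joins": 2},
    "column_plan": {"row_scan_rows": 0, "column_scan_rows": 1_000_000, "joins": 2},
}
best = min(candidates, key=lambda name: learned_cost(candidates[name]))
print(best)   # -> column_plan under these illustrative weights
```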
ADAPTIVE HTAP RESOURCE SCHEDULING HTAP resource scheduling helps databases balance the trade-off between workload isolation and data freshness by switching the OLAP and OLTP execution modes. Executing OLAP and OLTP tasks separately yields high throughput but poor data freshness; shared execution of mixed workloads favors high data freshness but suffers significant workload interference. Existing freshness-driven scheduling takes a rule-based approach to controlling the execution mode and ignores the workload pattern, while workload-driven scheduling adjusts the OLTP and OLAP threads according to the workload but does not account for freshness. Consequently, it is vital to account for both workload and freshness when scheduling resources. A lightweight adaptive scheduling approach is preferable: one that captures the workload pattern for better performance while also meeting data-freshness requirements.
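A minimal sketch of what such an adaptive policy could look like: the execution mode is chosen by weighing both the observed workload mix and a freshness requirement, rather than a fixed rule. All signals, thresholds, and mode names are hypothetical.

```python
# Sketch of an adaptive mode choice combining workload and freshness signals.

def adaptive_mode(olap_fraction, staleness_secs, freshness_sla_secs):
    if staleness_secs > freshness_sla_secs:
        return "shared"      # freshness violated: co-execute and share data
    if olap_fraction > 0.7:
        return "isolated"    # analytics-heavy: isolate to avoid interference
    return "hybrid"          # mixed load: share CPU but batch synchronization

print(adaptive_mode(olap_fraction=0.8, staleness_secs=2.0, freshness_sla_secs=5.0))
```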
HTAP BENCHMARK SUITE First, TPC-H has a uniform data distribution and little correlation between columns, which makes it a weak test for OLAP; HTAP benchmarks that build on TPC-H should therefore include join-crossing correlations with skew. Second, according to Gartner's definition of an HTAP transaction [34, 35], a transaction may include analytical operations, yet no existing HTAP benchmark covers this capability. This calls for a new HTAP benchmark with analytical activities, for example by extending TPC-C transactions to include analytical operations. Finally, few specialized micro-benchmarks exist for HTAP operations such as resource allocation, query optimization, and data synchronization. Overall, this argues for a new testbed that extends various elements of current benchmarks into a comprehensive assessment of HTAP databases.