HTAP Databases and the processes that keep them going - Part 1 of 4
Introduction
This is the first in a 4-part series on HTAP (Hybrid Transaction and Analytical Processing) databases. By now you know I am nuts about concurrency and its use in OSs, DBs, calculations and everything in between. I will cover a few weaknesses and strengths of these DB systems, some notable implementations in production systems, and then write some fake version of PlusCal to describe what I think the system should look like in an abstract code implementation. Note I use math because it is the most powerful language I know, and PlusCal because it is the best tool I know for describing these kinds of systems where concurrency is everywhere. So let's begin with a drawing of the primary-row store with an in-memory column store.
Please note this diagram is adapted from the paper "HTAP Databases: A Survey" by Zhang et al. (2024). The contents of this blog are inspired by insights from that brilliant paper as well, though I will not go into the depth that the paper provides. The reader is encouraged to read it.
Problem
The best way to start off the series is to understand why more than one architecture type is used in these systems. The issue is that data freshness and isolation performance are a trade-off. One could have a system that writes constantly from the row store to the column store via the delta data store, and this gives high data freshness for near-real-time analytics, but then isolation performance will be low. The converse is also true. Another consideration is distribution: distributing the row or column store allows great scalability but lowers data freshness, since it takes time to move data from one node to another. The converse is again true: keeping the data on a single node gives high freshness but raises a scalability concern.
How does the primary-row store with an in-memory column store approach the problem? Strengths and weaknesses
The row store keeps all OLTP data from transactions, and then, via the delta data store, the column store is periodically updated with data for analytics. Clients are then able to query the system for transactions or analytics on the same data. The usual data processing applies to the OLTP side: transactions come in, they are put into LSM (log-structured merge) trees, and once the in-memory buffer fills up, it is written to disk in a merge operation. Writes are appends, so they are very fast, but reads require searching the LSM data structures, and there is a time penalty for that. The Log is just the WAL (write-ahead log) and is typical for systems that use MVCC (multi-version concurrency control), which is in rather widespread adoption today.
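To make that write path concrete, here is a minimal runnable sketch of the LSM idea in Go (I will use Go for all the runnable examples in this post, since its goroutines and channels map nicely onto processes). The memtable limit, the file names and the plain-map memtable are my own illustrative assumptions, not any real engine's design:

// Minimal sketch of the OLTP write path: append to the WAL first, apply to an
// in-memory memtable, and flush to a sorted on-disk segment once the memtable
// fills. All sizes and file names here are assumptions for illustration.
package main

import (
    "fmt"
    "os"
    "sort"
)

type store struct {
    wal      *os.File
    memtable map[string]string // a real engine uses a skip list or similar
    limit    int
    segments int
}

func (s *store) put(key, val string) error {
    // 1. Durability first: append the write to the WAL.
    if _, err := fmt.Fprintf(s.wal, "PUT %s %s\n", key, val); err != nil {
        return err
    }
    // 2. Apply to the memtable (fast, volatile memory).
    s.memtable[key] = val
    // 3. When the memtable is full, merge it out to an immutable segment.
    if len(s.memtable) >= s.limit {
        return s.flush()
    }
    return nil
}

func (s *store) flush() error {
    keys := make([]string, 0, len(s.memtable))
    for k := range s.memtable {
        keys = append(keys, k)
    }
    sort.Strings(keys) // segments are sorted so later merges and reads are cheap
    f, err := os.Create(fmt.Sprintf("segment-%d.sst", s.segments))
    if err != nil {
        return err
    }
    defer f.Close()
    for _, k := range keys {
        fmt.Fprintf(f, "%s=%s\n", k, s.memtable[k])
    }
    s.segments++
    s.memtable = map[string]string{} // start a fresh memtable
    return nil
}

func main() {
    wal, err := os.Create("wal.log")
    if err != nil {
        panic(err)
    }
    s := &store{wal: wal, memtable: map[string]string{}, limit: 3}
    for i := 0; i < 7; i++ {
        s.put(fmt.Sprintf("k%d", i), fmt.Sprintf("v%d", i))
    }
}

Note the ordering: the WAL append happens before the memtable update, which is what makes recovery after a crash possible.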
Now for some free-hand fake PlusCal
This will not be pretty, but you will get the gist. Note this is free-hand, without an IDE or git, so I don't have a code snippet to share, but I encourage you to learn PlusCal, as it's the only way I know how to express this level of concurrency. Also, this code will not be translatable to TLA+, as it's fake and intended to give you a very high-level view of the system. You can take it and do an implementation in whatever language you like; that's a benefit of language-agnostic pseudocode. (After the sketch I will also show a runnable rendering of it.)
// global data
channel c1 == FIFO queue of txns  // OLTP clients -> row store
channel c2 == FIFO queue of txns  // OLAP clients -> column store

process OLTP client {
    local var txn t1
    send t1 to the row store over channel c1 (TCP perhaps) to write/read
}

process OLAP client {
    local var txn t2
    send t2 to the column store over channel c2 (TCP perhaps) to read
}

process row store {
    while (true) {
        if (channel c1 has a txn t1) {
            write t1 into the WAL via the Log process
            if (t1 is a write) {
                t1 is parsed by the query engine and written to the
                root node of some B+tree, or to a node in a skip list
            }
            if (t1 is a read) {
                t1 is parsed by the query planner/optimizer, then run
                on the tables to search for row data
                return row data
            }
        }
    }
}

process column store {
    while (true) {
        if (channel c2 has a txn t2) {
            t2 is parsed by the query planner, then run on the
            tables to search for column data
            if (column data not in the data structures) {
                pull in data from the delta data store
            }
        }
    }
}

process delta data store {
    while (true) {
        if (event triggered) {  // e.g. a timer or a write threshold
            pull in data from the row store into the delta data store
        }
    }
}

process log (WAL) {
    while (true) {
        if (txn event occurs) {
            write the txn to the log for failure and recovery
        }
    }
}

process persistent storage {
    while (true) {
        write the data to disk on a scheduled job with a file-write IO operation
    }
}
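To show these processes actually interleaving, here is a heavily simplified but runnable Go rendering of the sketch above: goroutines stand in for the processes, and buffered channels stand in for c1 and c2. The txn struct, the 50 ms delta-sync interval, and every name here are my own illustrative assumptions; a real system would layer MVCC, locking and a query engine on top.

package main

import (
    "fmt"
    "sync"
    "time"
)

type txn struct {
    key, val string // val == "" marks a read
}

// table is a toy stand-in for the row store, column store and delta store.
type table struct {
    mu   sync.Mutex
    rows map[string]string
}

func newTable() *table { return &table{rows: map[string]string{}} }

func (t *table) put(k, v string) { t.mu.Lock(); t.rows[k] = v; t.mu.Unlock() }

func (t *table) get(k string) (string, bool) {
    t.mu.Lock()
    defer t.mu.Unlock()
    v, ok := t.rows[k]
    return v, ok
}

func (t *table) snapshot() map[string]string {
    t.mu.Lock()
    defer t.mu.Unlock()
    out := map[string]string{}
    for k, v := range t.rows {
        out[k] = v
    }
    return out
}

func main() {
    c1 := make(chan txn, 16) // FIFO channel to the row store
    c2 := make(chan txn, 16) // FIFO channel to the column store
    row, col, delta := newTable(), newTable(), newTable()
    done := make(chan struct{})

    // row store process: log first, then apply writes or answer reads.
    go func() {
        for t := range c1 {
            fmt.Printf("WAL append: %+v\n", t) // stands in for the log process
            if t.val != "" {
                row.put(t.key, t.val)
            } else if v, ok := row.get(t.key); ok {
                fmt.Println("row read:", t.key, "=", v)
            }
        }
    }()

    // delta data store process: on a timer event, pull from the row store.
    go func() {
        for {
            select {
            case <-done:
                return
            case <-time.After(50 * time.Millisecond): // sync interval is an assumption
                for k, v := range row.snapshot() {
                    delta.put(k, v)
                }
            }
        }
    }()

    // column store process: serve analytical reads, pulling misses from delta.
    go func() {
        for t := range c2 {
            if v, ok := col.get(t.key); ok {
                fmt.Println("col read:", t.key, "=", v)
            } else if v, ok := delta.get(t.key); ok {
                col.put(t.key, v) // miss: pull the fresh value from the delta store
                fmt.Println("col read via delta:", t.key, "=", v)
            } else {
                fmt.Println("col miss (data not fresh yet):", t.key)
            }
        }
    }()

    // clients: one OLTP write, then two OLAP reads of the same key.
    c1 <- txn{key: "acct:1", val: "100"} // OLTP write
    c2 <- txn{key: "acct:1"}             // OLAP read: may miss before the delta sync fires
    time.Sleep(200 * time.Millisecond)
    c2 <- txn{key: "acct:1"} // after a delta sync this read should succeed
    time.Sleep(100 * time.Millisecond)
    close(done)
}

Run it a few times and you will likely see the first analytical read miss while the second succeeds; that gap is exactly the freshness trade-off from the Problem section, tuned here by the sync interval.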
Analysis and Final thoughts
Notice how many processes we have here: 7. That's a lot of threading! Also notice the global variables, which are channels shared between the processes; luckily these are FIFO queues. Then notice how the local txns move from the clients through the system. We need to take care not to block the processes, since we use if-logic, and we can get into deadlocks or livelocks depending on how the processes share data and how that data is changed. There is a lot of concurrency here, and it's hard to keep it all in your head: go ahead and try it. I didn't write any explicit locking here, but you can assume any shared resource or data must be locked or updated with a compare-and-swap operation (a tiny example of the latter follows below). I also didn't cover MVCC and its locking, or 2PL (two-phase locking), or how latches (locks on individual nodes of data) are taken on the B+tree or skip list during concurrent execution. These just add complexity. In summary, this system is tricky to get right! But how beautiful is the concurrency!? Really nice to work with, right? Anyway, thanks for reading and stay tuned for the next system.
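Since I just hand-waved at compare-and-swap, here is the smallest Go illustration I could think of: four goroutines each add 10 to a shared balance using a CAS retry loop instead of a lock. The variable names and counts are arbitrary assumptions for the example:

package main

import (
    "fmt"
    "sync"
    "sync/atomic"
)

func main() {
    var balance int64 = 100
    var wg sync.WaitGroup
    for i := 0; i < 4; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for {
                old := atomic.LoadInt64(&balance)
                // Retry until no other goroutine changed balance between
                // the load and the swap; this is lock-free, not wait-free.
                if atomic.CompareAndSwapInt64(&balance, old, old+10) {
                    return
                }
            }
        }()
    }
    wg.Wait()
    fmt.Println(balance) // always prints 140, despite the concurrent updates
}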
So where can you use such a system? We know this system has high freshness but poor isolation performance, so it will only work in areas where that is acceptable. One could be banking, as mentioned in the paper listed above. Others could be real-time analytics systems that do not require high-performance computation. The key point I will come back to during this series is that there is no single database to rule them all.
In closing, I am deliberately being abstract here and describing what I think should be done in this algorithm. This does not mean it's correct! If you are interested in looking into actual systems that use this architecture, please see Oracle's in-memory dual-format database and the SQL Server implementation that uses Hekaton (with a row-based engine). Have a great week and come by later to see the rest.