Posts

HTAP databases are cool but lets go further - part 4 of 4

Image
Introduction So far we have discussed the HTAP database category at length and I gave a bunch of pseudocode for making it accessible to build your own version. But I want to go further and do some things that are weird and perhaps we can find something interesting. So what about incorporating vector DBs into our designs and swapping out the Row or Column in one of these designs and then replacing it with the Vector type? I think experimentation is the root of invention so I hope you will join in on the fun. But first lets look start with one more HTAP databaes. The above is a Primary Column store with Delta Row store where each component does as the name suggests and stores data in the appropriate format for analytics and transactions, respectively. Very nice. This is a typical architecture used for the famous Hyper database and SAP HANA. These are also HTAP type DBs and find application in fraud detection due to providing high freshness of data, but note the sca...

HTAP database - Distributed column storage - Part 3 of 4

Image
Introduction This is the 3rd in a 4-part series on HTAP (Hybrid Transaction and Analytical Processing) databases. In the previous blog post ( link opens in separate window ) in this series I described the architecture known as "Distributed row store with a replicated column store" and discussed a simple design of such a system and the concurrency that would be required when designing it. If you like distributed systems go check that one out as well. I will now describe a separate system architecture for HTAP known as "Distributed column storage with Row store on a disk" , see the image below for the architecture. This blog will cover weaknesses and strengths in this architecture, some notable implementations in production systems and some pseudocode model in fake Pluscal to describe what I think it should look like. Again this is my mental model, you are encouraged to make your own and mine is purposely high level. As an aside, Pluscal is a math based language f...

HTAP databases - lets get distributed - Part 2 of 4

Image
Introduction This is the second in a 4-part series on HTAP (Hybrid Transaction and Analytical Processing) databases. In the previous blog post ( link opens in separate window ) in this series I described the architecture known as "Primary-row store with an in-memory column store" and discussed a simple design of such a system and the concurrency that would be required when designing it. I will now describe a separate system architecture for HTAP known as "Distributed row store with a replicated column store" , see the image below for the architecture. This blog will cover weaknesses and strengths in this architecture, some notable implemementations in production systems and some pseudocode model in fake Pluscal to describe what I think it should look like. Again this is my mental model, you are encouraged to make your own and mine is purposely high level. As an aside, Pluscal is a math based language for describing concurrent systems. I use it as its more genera...

HTAP Databases and the processes that keep them going - Part 1 of 4

Image
Introduction This is the first in a 4-part series on HTAP (Hybrid Transaction and Analytical Processing) databases. By now you know I am nuts about concurrency and its use in OSs, DBs, calculations and everything in between. I will cover a few weaknesses and strengths in these DB systems, some notable implemementations in production systems and then write some fake version of Pluscal to describe what I think it should look like in an abstract code implementation. Note I use math as its the most powerful language I know and pluscal as its the best tool for describing these types of systems where concurrency is everywhere. So lets begin with a drawing of the Primary-row store with an in-memory column store . Please note this diagram is adapted from the paper "HTAP databases: A Survey", by Zhang et al. 2024. The contents of this blog are inspired by insights from this brilliant paper as well. However I will not go into the depth that the paper provides. The reader ...

Rate Limiters are everywhere: Distributed and Local

Image
What is a rate limiter The Rate Limiter (RL) is a middleware (think Redis) or some software strategy to limit access of clients to resources that are limited or for which we want to ensure that fair usage and access is encouraged. In the context of websites the approach also provides security as it limits access to adversarial clients who wish to abuse the system. This approach also finds use in the context of databases - which I will give details on in this post. Note: adversaries are a superset of the user set (they can do what users do plus other things we dont normally want to allow) who behave badly, and could be external actors to the system or registered members. Rate Limiters in Distributed Systems The poster child for rate limiting - in my opinion - is Redis as it is the first thing that comes to mind from my background and experience. The reason is that they have made the use of rate limiting simple via their api. You should check out Redis if you have not alre...

Cache and Buffer Pool Manager

Image
The present post is about Caches and the Buffer Pool Manager for a database named SparrowDB that I am building. The programming language is Rust. The DB is a SQL relational database I wish to take from scratch to distributed. If you wish to come along for the ride, I will post weekly on my thoughts, struggles and completed work. Disclaimer: A lot of what I know and some parts of this blog comes from lessons by Andy Pavlo from CMU who made a youtube course, and I highly recommend you learn from him if you desire to work with database internals. Cache Policies : There are several types of caches we could consider eg. course-grained and fine-grained, but we will only consider the former category. Furthermore we will only use LRU-K in this project, but I wish to give an honourable mention to frequency based cache policies such as Q2 and LFU. The LRU-K is an improvement on the LRU policy as it avoids thrashing in the database cache where we evict a page only to read it again in t...

Workers - what can they do and what do they look like? Lets make a model in Go.

Image
Today I want to talk about work in software. So naturally/unnaturally I gravitate towards ... orchestrators, for reasons that will become clear shortly. Borg and K8's come to mind. What do these have in common and what do they contain that makes them indispensable to modern computing. I think its the ability to make distributed systems so easy to deploy (K8s). Perhaps also that it hides much of the complexity of failures, and writing software to handle these cases. In my opinion what I love about them is the concept of work. It's the same reason I like the MapReduce system, and the concept of doing work as a whole resonates with me. No this isn't an AI assistant writing for me, I use the word resonate because I like that word. So in this post I will model a Worker, and what does that actually mean for the system, ie. What do Workers do?  Manager (M) , Workers (W) and KVstores (KV) are the system we will think about today. Lets start with the following:  Worker comes to work...