20241201 ◎

did the laundry
finished reading (skimming) the chapter 3 of “Desining Data-Intensive Applications”
troubleshooted a problem that my posts in my Mastodon server are not published to some of other servers
- No clue

MEMO: Designing Data-Intensive Applications Chapter 3

This chapter covers the implementation of databases’ read-and-write operations. Although the content is quite interesting, I don’t find much content relatable to my everyday work as an application engineer. Therefore, I will write down some clluttered memos for myself.

The use of the word “log”

The word log is often used to refer to application logs, where an application outputs text that describes what’s happening. In this book, log is used in the more general sense: an append-only sequence of records. It doesn’t have to be human-readable; it might be binary and intended only for other programs to read.

OLTP and OLAP

OLTP: online transaction processing

OLAP: online analytic processing

It is a common practice among big companies to have a read-only and separated copy of application data (from the main OLTP systems) for data analysis. The separated database is called data warehouse.

This way, we can ensure that the queries for data analysis run by non-engineers do not affect the central OLTP database, and by separating databases for different uses, we can optimize a data warehouse, in which we run OLAP-type (analytic-type) queries.

The process to populate a data warehouse with application data is called Extract-Transform-Load (ETL).

A data warehouse is commonly designed as column-oriented storage.

The idea behind column-oriented storage is simple: don’t store all the values from one row together, but store all the values from each column together instead.

A normal relational database is row-oriented, in a way that the entire row will be fetched from the database, and its columsn are tightly coupled (cannot extract some of them effectively).

I will skip the explanation of the column-oriented or materialized views (which cache the results of queries that are frequently used) for now.

Indexes

This chapter also covers a wide range of the implementations of indexes. Starting from hash indexes, they show extensive forms or improvements of it.

Any kind of index usually slows down writes, because the index also needs to be updated every time data is written.

This is an important trade-off in storage systems: well-chosen indexes speed up read queries, but every index slows down writes.

Around two years ago, my colleague tried to add a bunch of indexes that seemed to be rarely used, and I left a comment on his pull request opposing the addition, but most of the collaborators did not think it was problematic, and the pull request was merged.

I don’t blame them at all as we did our best, but I wish I had read this book at that time so that I could have been more helpful to the team.

Hash Indexes

Then the simplest possible indexing strategy is this: keep an in-memory hash map where every key is mapped to a byte offset in the data file (the location at which the value can be found).

This approach is adopted by Bitcask.

Although this approach works efficiently both for reads and writes, but there are some limitations.

The entire hash table must fit in memory because a key-value data structure is used.
Range queries are not efficient because a simple key-value pair only has access to a single record, and we need to iterate through each key to perform range queries.

Sorted String Table (SSTable)

On top of the hash indexes approach, keys are sorted. This constraint does not directly address the problems of the hash indexes, but it provides several improvements.

Indexes are implemented to break data down into segments, and SStable is more efficient in terms of segment mergers because we can do something like mergesort to merge sorted segments.
We can reduce the number of keys in memory compared to the hash indexes approach by leveraging the fact that we can instantly pick a segment by specifying a range.
- I know I should provide more explanations, but let me skip that for today.
Compression is also efficient, grouping records in a range into a block.

Log-Structured Merge-Tree (LSM-Tree)

I repeated reading the section, but I could not get the definition of LSM-Tree. I will research this later (maybe).

LSM-Tree seems to be a principle/strategy of data management, using SSTable as building blocks, but I’m not confident.

LSM-Tree seems not to be a tree structure in the conventional sense such as Binary Tree.

Bloom Filter

A Bloom filter is a memory-efficient data structure for approximating the contents of a set. It can tell you if a key does not appear in the database, and thus saves many unnecessary disk reads for nonexistent keys.

B-Trees

The most widely used indexing structure

B-trees break the database down into fixeed-size blocks or pages, traditionally 4KB in size (sometimes bigger), and read or write one page at a time (whereas the log-structured indexes break the database down into variable-sized segments). This design corresponds more closely to the underlying hardware, as disks are also arranged in fixed-size blocks.

The implementation of B-trees is quite impressive, but without illustrations, it’s very hard to explain. Ask me to buy this book for you.

This section ends with the comparision of B-Trees and LSM-Trees

Upgrading Amazon Lightsail (VPS) Plan

2 GB RAM, 2 vCPUs, 60 GB SSD -> 4 GB RAM, 2 vCPUs, 80 GB SSD

Stop the Docker containers in the old VPS instance
Stop the VPS
Create a snapshot
Select the snapshot -> Create an instance from a snapshot -> Select an upgraded plan
Delete the domain assignment in the old instance
Assign the domain to the new instance and add rules for HTTPS
Connect to the new instance
sudo certbot delete to delete the old certificate
sudo certbot certonly --standalone to get the new cerficate
docker compose up -d

Now, it costs $24/month to run the server. Yikes!

but it is a good experience running a server on my own.

162.6 lb

Protein shake 50g Tapioca 0g Tuna salad 60g Beef eye of round 70g

total protein -> 180 g (>= 140)

bench press, jogging (1 hour), lat pull down

MUST:

Consider the problem with embedding models
Ask the property manager about the rent payment
Get my hair cut

index 20241130 20241202