20250105

some research tasks

With the way I manage my daily reports, the number of HTML files grows linearly. I think this usage is uncommon, so maybe I should create a web application and migrate to it.

I tried to fix my circadian rhythm, but it did not go well…

Designing Data-Intensive Applications - Chapter 6 - Partitioning

Whereas replication is all about having multiple copies of the same data for better response time or availability, partitioning is a technique to split a massive amount of data into smaller groups for scalability.

Partitioning Key-Value Data

There are two main approaches to partitioning.

Partitioning by Key Range
- This approach partitions data with continuous key ranges. For example, data are partitioned into two parts; one is for keys between “a” to “m” and the other for “n” to “z” lexicographically.
- The advantage of this approach is that we can perform range queries efficiently.
- The disadvantage of this is that data may create hot spots. For example, when using timestamp as a key, all writes for a day will direct to the same partition.
Partitioning by Hash of Key
- This approach partitions data with the hash values of keys so that data are distributed evenly.
- Range queries cannot be performed.
- It’s unlikely to have hot spots since (key, data) are distributed fiarly among the partitions.

Secondary Indexes in Partitioning

It’s uncommon for users to search for specific data only with primary keys, so it’s natural to have secondary indexes. There are two approaches to this.

local index (document-partitioned)
- Each partition maintains secondary indexes of its data, i.e., a partition has no information about secondary indexes of other partitions.
- Clients need to query all partitions to retrieve all results if the results of secondary indexes are scattered across multiple partitions.
global index (term-partitioned)
- Each partition maintains some of secondary indexes for all data so that clients only need to query one partition to get data of an index.
- Writes need to propagate for all partitions so that their secondary indexes include the newly added records.

In general, document-partitioned is good at processing writes, whereas term-partitioned is good at reading.

Rebalancing

When adding a new node to have more data, the partitions must be rebalanced. Because we are dealing with a massive amount of data, we should minimize the moves of data in partitions; otherwise, we will experience an excessive amouint of loads on the system.

We cannot simply calculate mod N for partitioning because increasing the number of nodes moves a large amount of data in partitions. Let’s say we have a record with key = 10 and N = 5. At first, the record resides in partition 0 (10 mod 5), but after increasing the number of nodes, the record will be moved to partition 4 (10 mod 6). This type of move is inevitable and frequent with this approach.

Fixed number of partitions

There is a simple mitigation to reduce the volume of data moved: have many more partitions from the beginning than the actual number of nodes. With this approach, each node is assigned many partitions, such as 100 partitions. When a new node is added, each existing partition provides the new node with some of its partitions. Though, it’s difficult to choose the number of partitions.

Dynamic partitioning

When using key range partitioning, some flexibility about the number of partitions in a node is beneficial, avoiding hot spots. One approach is to configure a threshold for the number of records in a partition and divide/merge partitions to follow the configuration.

Manual Rebalancing vs Automatic Rebalancing

Fully automating rebalancing may lead to an undesired situation. For example, when the system experiences high loads, some of the nodes are unavailable, and this may trigger rebalancing, which puts more loads on the system. Therefore, it’s a rule of thumb to make some steps require manual reviews/operations.

Request Routing

There are several ways to let clients know which partition to make queries, and the most popular one is to have a separate coordination service.

Clients send requests to the service, and it replies with the knowledge of which partition clients should send requests next.

1 hour walk

TODO:

2044E, F
2048D
Smog test
- Ask my parents to transfer the ownership of my car
Naturalization Interview (20250207)

index 20250104 20250106