With the way I manage my daily reports, the number of HTML files grows linearly. I think this usage is uncommon, so maybe I should create a web application and migrate to it.
I tried to fix my circadian rhythm, but it did not go well…
Whereas replication is all about having multiple copies of the same data for better response time or availability, partitioning is a technique to split a massive amount of data into smaller groups for scalability.
There are two main approaches to partitioning.
It’s uncommon for users to search for specific data only with primary keys, so it’s natural to have secondary indexes. There are two approaches to this.
In general, document-partitioned is good at processing writes, whereas term-partitioned is good at reading.
When adding a new node to have more data, the partitions must be rebalanced. Because we are dealing with a massive amount of data, we should minimize the moves of data in partitions; otherwise, we will experience an excessive amouint of loads on the system.
We cannot simply calculate mod N for partitioning because increasing the number of nodes moves a large amount of data in partitions. Let’s say we have a record with key = 10 and N = 5. At first, the record resides in partition 0 (10 mod 5), but after increasing the number of nodes, the record will be moved to partition 4 (10 mod 6). This type of move is inevitable and frequent with this approach.
There is a simple mitigation to reduce the volume of data moved: have many more partitions from the beginning than the actual number of nodes. With this approach, each node is assigned many partitions, such as 100 partitions. When a new node is added, each existing partition provides the new node with some of its partitions. Though, it’s difficult to choose the number of partitions.
When using key range partitioning, some flexibility about the number of partitions in a node is beneficial, avoiding hot spots. One approach is to configure a threshold for the number of records in a partition and divide/merge partitions to follow the configuration.
Fully automating rebalancing may lead to an undesired situation. For example, when the system experiences high loads, some of the nodes are unavailable, and this may trigger rebalancing, which puts more loads on the system. Therefore, it’s a rule of thumb to make some steps require manual reviews/operations.
There are several ways to let clients know which partition to make queries, and the most popular one is to have a separate coordination service.
Clients send requests to the service, and it replies with the knowledge of which partition clients should send requests next.
1 hour walk
TODO: