20250212

I slept a lot, as I could not get enough sleep yesterday.

Designing Data-Intensive Applications - Chapter 8 - The Trouble with Distributed Systems

Problems in Distributed Systems

Unreliable Network

Unlike phone lines, network communication does not guarantee reliable delivery: packets can be lost in transit or delayed for a long time. When no response arrives, the requester cannot tell whether the request was lost, the response was lost or delayed, or the receiver is down.

To manage resources properly in distributed systems (for example, to promote a new database leader), node failures must be detected. This is typically done with timeouts. Because the best timeout value depends on the infrastructure, it should be determined empirically: a timeout that is too low mistakes temporary network delays for node failures, while one that is too high delays the detection of real failures.
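
As a rough sketch of what "determined empirically" could look like, the snippet below derives a timeout from measured round-trip times and uses it to flag a silent node. The function names, the 99th-percentile choice, and the safety factor are my own assumptions for illustration, not something the book prescribes.

```python
def percentile(samples, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(samples)
    # ceil(pct * n / 100) computed with integer arithmetic
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def suggest_timeout(rtt_samples_ms, safety_factor=2.0):
    """Hypothetical heuristic: a high percentile of observed RTTs times a margin."""
    if not rtt_samples_ms:
        raise ValueError("need at least one round-trip sample")
    return percentile(rtt_samples_ms, 99) * safety_factor


def is_suspected_down(last_heartbeat_s, now_s, timeout_s):
    """A node is only *suspected* down: a slow network looks the same as a crash."""
    return (now_s - last_heartbeat_s) > timeout_s


# 99 fast responses and one slow outlier (made-up numbers)
samples = [10] * 99 + [200]
timeout_ms = suggest_timeout(samples)
```

The suspicion check can only ever say "no heartbeat within the timeout", which is why the result is a suspicion and not a fact.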

Unreliable Clocks

Computer clocks are inherently unreliable due to hardware limitations, so it is wrong to assume that the clocks of different nodes are in sync. For example, using timestamps to order database operations might seem intuitive, but in practice a later operation can end up with an earlier timestamp because of clock drift or synchronization issues. Last Write Wins (LWW) is a common way to resolve such conflicts: the operation with the most recent timestamp takes precedence. With skewed clocks, however, LWW can silently discard a write that actually happened later. When designing distributed systems, we should be aware of these limitations and make decisions accordingly.
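
A small simulation of how LWW can pick the wrong write when clocks disagree. The `Write` class, the `lww_merge` helper, and the 5 ms skew are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Write:
    value: str
    timestamp_ms: int  # taken from the writing node's local clock


def lww_merge(a, b):
    """Last Write Wins: keep the write with the larger timestamp."""
    return a if a.timestamp_ms >= b.timestamp_ms else b


# Assumed scenario: node A's clock runs 5 ms fast, node B's clock is accurate.
true_time_ms = 1000
first = Write("x=1", true_time_ms + 5)   # written first, stamped by node A
second = Write("x=2", true_time_ms + 2)  # written 2 ms later, stamped by node B

winner = lww_merge(first, second)
print(winner.value)  # x=1, so the later write x=2 is silently dropped
```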

Process Pauses

Processes can occasionally pause, often due to system-level operations like garbage collection (GC). The Java Virtual Machine (JVM) is particularly known for its stop-the-world GC pauses, which freeze all running threads and have occasionally been reported to last for minutes. This issue is not unique to Java, though: Python and many other languages can also experience process pauses due to garbage collection and other system operations.

In distributed systems, such pauses can lead to critical issues. For example, in leader-based replication, a temporarily paused leader might be incorrectly considered dead, causing the system to elect a new leader. When the original leader resumes, it still believes it is the leader, so the system may end up with multiple leaders, leading to conflicts and inconsistencies.
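
The two-leader scenario can be reproduced with a toy model. Everything here (the `Node` class, the heartbeat bookkeeping, the 10-second timeout) is a simplified assumption, not a real election protocol.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.thinks_it_is_leader = False
        self.last_heartbeat_s = 0.0


def elect_if_leader_silent(leader, standby, now_s, timeout_s):
    """Promote the standby if the leader sent no heartbeat within timeout_s."""
    if now_s - leader.last_heartbeat_s > timeout_s:
        standby.thinks_it_is_leader = True
        return standby
    return leader


a, b = Node("a"), Node("b")
a.thinks_it_is_leader = True
a.last_heartbeat_s = 0.0

# Node a is paused (for example by GC) from t=0 to t=15 and sends no heartbeats.
# At t=12 the rest of the system gives up on it and promotes node b.
current = elect_if_leader_silent(a, b, now_s=12.0, timeout_s=10.0)

# At t=15 node a resumes. Nothing told it about the election,
# so its local state is unchanged and both nodes now act as leader.
leaders = [n.name for n in (a, b) if n.thinks_it_is_leader]
print(leaders)  # ['a', 'b']
```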

How to Approach the Problems

In distributed systems, failures can be partial: some parts break while others keep working. Many distributed systems handle this by relying on a quorum: once a minimum number of nodes (the number is configurable) agree on something, it is treated as true. Most commonly, the quorum is an absolute majority, that is, more than half of the nodes.
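
A minimal sketch of the majority rule. The function names are my own, and the optional `quorum` parameter stands in for the "configurable" number mentioned above.

```python
def majority_quorum(n_nodes):
    """Smallest number of nodes that is more than half of n_nodes."""
    return n_nodes // 2 + 1


def has_quorum(votes, n_nodes, quorum=None):
    """True if enough nodes agree; defaults to an absolute majority."""
    needed = majority_quorum(n_nodes) if quorum is None else quorum
    return votes >= needed


print(majority_quorum(5))  # 3
print(majority_quorum(4))  # 3
print(has_quorum(2, 5))    # False
```

A majority is a safe default because two disjoint groups of nodes cannot both hold one at the same time, so there can never be two conflicting decisions.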


TODO:


index 20250211 20250213