20250311

Designing Data-Intensive Applications - Chapter 10 - Batch Processing

(I don’t have first-hand experience with batch processing or MapReduce, so I will just remember the term “MapReduce” for future reference rather than digging deep into its details.)

MapReduce is a common framework for batch processing on a distributed system: a job reads a set of input files and produces a set of output files. It consists of two components: the mapper, which is called once per input record and emits any number of key-value pairs, and the reducer, which collects all values belonging to the same key and processes them.
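
To make the two roles concrete, here is a minimal sketch of the model in plain Python, using the canonical word-count example (the names mapper, reducer, and map_reduce are my own, not any framework’s API):

    from collections import defaultdict

    def mapper(line):
        # Called once per input record; emits (key, value) pairs.
        for word in line.split():
            yield (word, 1)

    def reducer(word, counts):
        # Called once per key, with all the values grouped under it.
        return (word, sum(counts))

    def map_reduce(records):
        groups = defaultdict(list)   # stands in for the framework's shuffle
        for record in records:
            for key, value in mapper(record):
                groups[key].append(value)
        return [reducer(key, values) for key, values in groups.items()]

    print(sorted(map_reduce(["the quick brown fox", "the lazy dog"])))
    # [('brown', 1), ('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]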

MapReduce follows the same philosophy as UNIX pipes (although UNIX pipes work on streams of data rather than files): each stage is deterministic given its input, and stages can be chained by directing the output of one MapReduce job into the input of the next.
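
A small illustration of chaining, again in plain Python (the stage helper is hypothetical and just mimics one map/reduce pass): stage 1 computes word counts, and its output feeds straight into stage 2, which builds a histogram of how many distinct words occur n times.

    def stage(records, mapper, reducer):
        # One map/reduce pass: map every record, group by key, reduce.
        groups = {}
        for record in records:
            for key, value in mapper(record):
                groups.setdefault(key, []).append(value)
        return [reducer(key, values) for key, values in groups.items()]

    lines = ["to be or not to be"]

    # Stage 1: word count.
    counts = stage(lines,
                   mapper=lambda line: [(w, 1) for w in line.split()],
                   reducer=lambda k, vs: (k, sum(vs)))

    # Stage 2: consumes stage 1's output, counting words per frequency.
    histogram = stage(counts,
                      mapper=lambda kv: [(kv[1], 1)],
                      reducer=lambda k, vs: (k, sum(vs)))

    print(sorted(counts))     # [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
    print(sorted(histogram))  # [(1, 2), (2, 2)] -- two words once, two twice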

UNIX commands and pipes are designed to run on a single machine, whereas MapReduce is built for distributed systems: it splits the work across many mappers and reducers so a job can leverage the resources of multiple machines.

Apache Hadoop 3.4.1 — MapReduce Tutorial

The documentation includes some examples. Java appears to be the primary language for writing Hadoop MapReduce jobs.
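
That said, Hadoop also ships Hadoop Streaming, which runs any executable that reads records from stdin and writes key/value lines (tab-separated) to stdout, so a job can be written in Python too. A hedged sketch (the file names are my own; the hadoop-streaming jar path varies by installation):

    # mapper.py -- emit one "word\t1" line per word on stdin
    import sys
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    # reducer.py -- Hadoop sorts mapper output by key before the reduce
    # phase, so identical keys arrive on consecutive stdin lines.
    import sys
    from itertools import groupby
    for word, group in groupby(sys.stdin, key=lambda l: l.split("\t")[0]):
        total = sum(int(l.split("\t")[1]) for l in group)
        print(f"{word}\t{total}")

    # Local simulation of the streaming job (sort plays the shuffle role):
    #   cat input.txt | python3 mapper.py | sort | python3 reducer.py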

PySpark Overview — PySpark 3.5.5 Documentation

BigData with PySpark: MapReduce Primer — nyu-cds.github.io

There is also PySpark, the Python API for Apache Spark, which lets you express the same MapReduce-style programming model from Python.
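
The same word count on Spark’s RDD API, as a hedged sketch (assumes a local Spark installation and an input.txt next to the script):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("wordcount").getOrCreate()

    counts = (
        spark.sparkContext.textFile("input.txt")    # one record per line
        .flatMap(lambda line: line.split())         # map: emit words
        .map(lambda word: (word, 1))                # to (key, value) pairs
        .reduceByKey(lambda a, b: a + b)            # shuffle + reduce
    )
    print(sorted(counts.collect()))
    spark.stop()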


TODO:


index 20250310 20250312