The Unix Philosophy and Map Reduce



What is Map Reduce?

Within a computer program, it is a design pattern for implementing functions that operate on sequences of elements, which uses the powerful idea of treating functions as first-class values that we can pass around and manipulate in our programs.

In a distributed system context, it is a programming model for executing batch jobs with parallel, distributed, algorithm on a cluster (e.g. processing and generating big data sets)

As an open source implementation (e.g. Hadoop), it is a system that orchestrates the processing of data by marshalling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing for redundancy and fault tolerance.

The use of this model is beneficial only when the optimized distributed shuffle operation (which reduces network communication cost) and fault tolerance features of the MapReduce framework come into play. Optimizing the communication cost is essential to a good MapReduce algorithm. The scalability and fault-tolerance achieved for a variety of applications by optimizing the execution engine.


File is the most powerful abstraction. What is a file? Anything you can read()or write() or open(). Thus, all these devices can be treated as a file: CDs, printers, monitors etc.

We can think of each process as a bundle of elements kept by the kernel to keep track of all these running tasks.

memory-mapped file I/O: Map a file on disk to memory. Instead of having to open the file and use commands such as read() and write(), the file looks as if it were any other type of RAM. It is the job of the operating system to maintain security and stability, so it needs to check if a process tries to write to a read only area and return an error.


How having a stack we are able to implement functions and recursion:

Each function has its own copy of its input arguments. This is because each function is allocated a new stack frame with its arguments in a fresh area of memory.

This is the reason why a variable defined inside a function can not be seen by other functions. Global variables (which can be seen by any function) are kept in a separate area of data memory.

This facilitates recursive calls. This means a function is free to call itself again, because a new stack frame will be created for all its local variables.

Each frame contains the address to return to. C only allows a single value to be returned from a function, so by convention this value is returned to the calling function in a specified register, rather than on the stack.

Because each frame has a reference to the one before it, a debugger can “walk” backwards, following the pointers up the stack. From this it can produce a stack trace which shows you all functions that were called leading into this function. This is extremely useful for debugging.

You can see how the way functions works fits exactly into the nature of a stack. Any function can call any other function, which then becomes the up most function (put on top of the stack). Eventually that function will return to the function that called it (takes itself off the stack).

Stacks do make calling functions slower, because values must be moved out of registers and into memory. Some architectures allow arguments to be passed in registers directly; however to keep the semantics that each function gets a unique copy of each argument the registers must rotate.

stack overflow is a common way of hacking a system by passing bogus values. If you as a programmer accept arbitrary input into a stack variable (say, reading from the keyboard or over the network) you need to explicitly say how big that data is going to be.


References:

11 March 2019