SCALABILITY GOAL, FAULT TOLERANCE, OPTIMIZATION AND DATA LOCALITY
The two biggest advantages of MapReduce are parallel processing and data locality:
Parallel Processing:
In MapReduce, a job is divided among multiple nodes, and each node works on its part of the job simultaneously. MapReduce is therefore based on the divide-and-conquer paradigm, which lets us process the data on different machines. Because the data is processed by multiple machines in parallel rather than by a single machine, the processing time is reduced considerably, as shown in the figure below (2).
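The divide-and-conquer idea can be sketched in a few lines of Python. This is only a toy illustration, not how a real MapReduce framework is implemented: threads in one process stand in for separate machines, and the names map_chunk, reduce_counts and word_count are hypothetical helpers chosen for this example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_chunk(lines):
    """Map step: count the words in one chunk of the input."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def reduce_counts(partials):
    """Reduce step: merge the per-chunk counts into one result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def word_count(lines, num_workers=3):
    # Divide: split the input into chunks, one per worker.
    chunks = [lines[i::num_workers] for i in range(num_workers)]
    # Conquer: each worker processes its own chunk independently.
    # (Threads are an assumption standing in for cluster nodes.)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    # Combine: merge the partial results into the final answer.
    return reduce_counts(partials)


lines = ["the quick brown fox", "the lazy dog", "the fox"]
result = word_count(lines)
print(result["the"], result["fox"])  # 3 2
```

The result does not depend on the number of workers, which is what allows the same job to be spread over more machines as the data grows.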
Data Locality:
Instead of moving data to the processing unit, the MapReduce framework moves the processing unit to the data. In a traditional system, data was brought to the processing unit and processed there. But as the data grew very large, bringing it to the processing unit posed the following issues:
MapReduce overcomes the above issues by bringing the processing unit to the data. As the above image shows, the data is distributed among multiple nodes, and each node processes the part of the data residing on it. This gives us the following advantages:
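As a rough illustration of moving the processing to the data, the sketch below assigns each block of input to a node that already stores it, and falls back to a remote node only when no holder is available. The function assign_tasks, the node and block names, and the assumption that a block may be stored on more than one node are all hypothetical; a real scheduler weighs many more factors.

```python
def assign_tasks(block_locations, free_nodes):
    """Assign each block to a node, preferring a node that already stores it.

    block_locations: dict mapping block id -> list of nodes holding a copy.
    free_nodes: nodes that can accept work (hypothetical names).
    Returns a dict mapping block id -> (node, is_local).
    """
    load = {node: 0 for node in free_nodes}
    assignment = {}
    for block, holders in block_locations.items():
        local = [n for n in holders if n in load]
        # Prefer the least-loaded node that holds the block; otherwise
        # use the least-loaded node overall (the data must then be copied).
        candidates = local if local else list(load)
        node = min(candidates, key=lambda n: load[n])
        load[node] += 1
        assignment[block] = (node, bool(local))
    return assignment


block_locations = {
    "block-0": ["node-a", "node-b"],
    "block-1": ["node-b", "node-c"],
    "block-2": ["node-a", "node-c"],
    "block-3": ["node-d"],  # node-d is not available in this example
}
assignment = assign_tasks(block_locations, ["node-a", "node-b", "node-c"])
for block, (node, is_local) in assignment.items():
    print(block, node, "local" if is_local else "remote")
```

In this example only block-3 has to be read remotely, because the single node holding it is not in the list of free nodes.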
Fault Tolerance:
Fault tolerance is an important aspect of MapReduce as it ensures that the processing can continue even in the event of node failures.
MapReduce provides several mechanisms for fault tolerance, including:
By combining these mechanisms, MapReduce provides a high degree of fault tolerance, which helps to ensure that large-scale processing can be completed even in the presence of node failures.
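One mechanism commonly associated with MapReduce fault tolerance is re-running a failed task on a different node. The sketch below shows the idea under simplifying assumptions: the helper run_with_retries, the simulated flaky_sum task and the node names are hypothetical, and a failure is modelled as a RuntimeError rather than detected through the framework's own monitoring.

```python
def run_with_retries(task, data, nodes, max_attempts=3):
    """Run task(node, data), moving to the next node after a failure."""
    failures = []
    for attempt in range(max_attempts):
        node = nodes[attempt % len(nodes)]
        try:
            return task(node, data), node, failures
        except RuntimeError as err:
            # Record the failure and try again on another node.
            failures.append((node, str(err)))
    raise RuntimeError(f"task failed on all attempts: {failures}")


def flaky_sum(node, data):
    # Simulated failure: node-a is "down" in this toy example.
    if node == "node-a":
        raise RuntimeError("node unreachable")
    return sum(data)


total, node, failures = run_with_retries(flaky_sum, [1, 2, 3], ["node-a", "node-b"])
print(total, node, failures)  # 6 node-b [('node-a', 'node unreachable')]
```

Because the task depends only on its input data, it can be repeated on another node without affecting the final result.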
References:
1. https://thirdeyedata.ai/hadoop-mapreduce/