PROCESS RESILIENCE
Process resilience in distributed systems refers to the ability of the system to continue operating correctly despite failures in individual processes or components. This concept is vital for ensuring that distributed systems can handle faults gracefully and maintain service availability and consistency.
Processes can be made fault tolerant by organizing several identical processes into a group. A message sent to the group is delivered to all of the “copies” of the process (the group members), and then only one of them performs the required service. If one of the processes fails, it is assumed that one of the others will still be able to function and service any pending request or operation.
To tolerate a faulty process, organize several identical processes into a group, which can be either flat or hierarchical.
a. flat groups: A flat group is good for fault tolerance because information is exchanged directly among all group members. All processes within the group have equal roles, and control is completely distributed among them. This may impose more overhead and is harder to implement, precisely because control is completely distributed.
b. hierarchical groups: In a hierarchical group, all communication goes through a single coordinator. This is relatively easy to implement, but it is neither very fault tolerant nor very scalable, because the coordinator is a single point of failure and a potential bottleneck.
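The difference between the two organizations can be illustrated with a minimal in-process sketch. All names here (Replica, FlatGroup, HierarchicalGroup, the alive flag) are hypothetical, and a crash is simulated by a boolean rather than a real process failure. The rule "the first live member serves the request" is just one possible choice, assumed for illustration.

```python
class Replica:
    """One group member; `alive` simulates whether the process has crashed."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.inbox = []

    def receive(self, request):
        # A crashed process silently drops the message.
        if self.alive:
            self.inbox.append(request)
        return self.alive

    def serve(self, request):
        return f"{self.name} handled {request}"


class FlatGroup:
    """All members are equal; every request is delivered to all of them."""

    def __init__(self, replicas):
        self.replicas = list(replicas)

    def send(self, request):
        # Deliver the request to every member and keep those that got it.
        receivers = [r for r in self.replicas if r.receive(request)]
        if not receivers:
            raise RuntimeError("all group members have failed")
        # Only one member performs the service (assumed: the first live one).
        return receivers[0].serve(request)


class HierarchicalGroup:
    """All communication goes through a single coordinator."""

    def __init__(self, coordinator, workers):
        self.coordinator = coordinator
        self.workers = list(workers)

    def send(self, request):
        # If the coordinator is down, nobody can be reached.
        if not self.coordinator.receive(request):
            raise RuntimeError("coordinator failed; the group is unavailable")
        live = [w for w in self.workers if w.receive(request)]
        if not live:
            raise RuntimeError("no live workers")
        return live[0].serve(request)


flat = FlatGroup([Replica("p1"), Replica("p2"), Replica("p3")])
print(flat.send("read x"))        # p1 handled read x
flat.replicas[0].alive = False    # p1 crashes
print(flat.send("read x"))        # p2 handled read x
```

In the flat group, the crash of p1 is masked because p2 also received the request. In the hierarchical sketch, the same crash at the coordinator makes the whole group unreachable even though every worker is still alive.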
GROUP MEMBERSHIP
Centralized: a group server maintains a database for each group and handles all requests to join or leave. This is efficient and easy to implement, but the server is a single point of failure.
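A minimal sketch of the centralized approach, assuming the group server is a single object holding an in-memory table (the class name GroupServer and its methods are hypothetical). A real server would receive these requests over the network.

```python
class GroupServer:
    """Single server that owns the membership database for every group."""

    def __init__(self):
        self._groups = {}  # group name -> set of process ids

    def join(self, group, process_id):
        # Create the group on first use, then record the new member.
        self._groups.setdefault(group, set()).add(process_id)

    def leave(self, group, process_id):
        # Leaving a group you are not in is a harmless no-op.
        self._groups.get(group, set()).discard(process_id)

    def members(self, group):
        return sorted(self._groups.get(group, set()))


server = GroupServer()
server.join("replicas", "p1")
server.join("replicas", "p2")
server.leave("replicas", "p1")
print(server.members("replicas"))  # ['p2']
```

Every membership change is a single dictionary update, which is why this approach is efficient and simple. If the server object is lost, all membership information is lost with it.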
Distributed: to join a group, a new process sends a message to all group members saying that it wishes to join (assuming reliable multicasting is available). To leave, a process ideally sends a goodbye message to all members; but if it crashes (rather than just being slow), the others have to discover that on their own and remove it from the group.
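The distributed scheme can be sketched as each member keeping its own view of the group. The notes do not say how a crash is discovered; the sketch below assumes periodic heartbeats with a timeout, which is one common choice. The names (Member, on_join, on_goodbye, on_heartbeat, remove_suspected) and the 3-second timeout are hypothetical, and time is passed in explicitly so the example is deterministic. Note that a timeout can only suspect a crash: a very slow process looks the same as a crashed one.

```python
class Member:
    """One group member's local view of who else is in the group."""

    def __init__(self, name, timeout=3.0):
        self.name = name
        self.timeout = timeout
        self.last_seen = {}  # peer name -> time of its last message

    def on_join(self, peer, now):
        # A join request was multicast to the group; add the peer.
        self.last_seen[peer] = now

    def on_goodbye(self, peer):
        # Clean departure: the peer told everyone it is leaving.
        self.last_seen.pop(peer, None)

    def on_heartbeat(self, peer, now):
        # Only current members can refresh their timestamp.
        if peer in self.last_seen:
            self.last_seen[peer] = now

    def remove_suspected(self, now):
        # Peers silent for longer than the timeout are assumed crashed.
        crashed = [p for p, t in self.last_seen.items()
                   if now - t > self.timeout]
        for p in crashed:
            del self.last_seen[p]
        return crashed

    def view(self):
        return sorted(self.last_seen)


a = Member("a")
a.on_join("b", now=0.0)
a.on_join("c", now=0.0)
a.on_goodbye("c")                     # c leaves cleanly
a.on_heartbeat("b", now=2.0)
print(a.remove_suspected(now=4.0))    # []     (b was heard from 2s ago)
print(a.remove_suspected(now=6.0))    # ['b']  (b silent for 4s > 3s)
print(a.view())                       # []
```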
Redundancy:
Replication:
Failover Mechanisms:
Load Balancing:
Checkpointing and Rollback: