
DISTRIBUTED SYSTEM NOTES (IOE, BCA, TU)

FAULT, FAULT TOLERANCE AND FAULT MANAGEMENT

FAULT

A fault is a defect or failure in a system component that causes the system to deviate from its intended behavior.

TYPES OF FAULT

  • Crash Faults:

    • Definition: Occur when a node or a component in the system unexpectedly stops functioning and halts all operations.
    • Example: A server suddenly loses power and goes offline (a heartbeat-based detector for such faults is sketched after this list).
  • Omission Faults:

    • Definition: Happen when a component fails to send or receive messages. This can be further divided into:
      • Send Omission: A node fails to send a message.
      • Receive Omission: A node fails to receive a message.
    • Example: Network congestion causes packets to be dropped, resulting in lost messages.
  • Timing Faults:

    • Definition: Occur when the system's timing constraints are violated, either by messages arriving too late or too early.
    • Example: A time-sensitive transaction takes longer than the acceptable limit, causing a timeout.
  • Byzantine Faults:

    • Definition: These are arbitrary faults where components behave erratically and inconsistently, possibly due to malicious attacks or severe bugs. They can produce incorrect or misleading results.
    • Example: A compromised node sends conflicting information to different parts of the system.
  • Network Faults:

    • Definition: Issues that arise within the communication network, such as packet loss, network partitioning, and high latency.
    • Example: A network partition isolates a subset of nodes from the rest of the system, causing communication failures.
  • Hardware Faults:

    • Definition: Failures related to physical components such as servers, storage devices, and network hardware.
    • Example: A hard disk crash leads to data loss.
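
To make crash and omission faults concrete, here is a minimal, hypothetical Python sketch of heartbeat-based failure detection: a monitor records each node's last heartbeat and suspects any node that stays silent past a timeout. The node names, timeout value, and single-process simulation are illustrative assumptions, not real network code.

```python
import time

TIMEOUT = 3.0  # seconds without a heartbeat before suspecting a crash

last_heartbeat = {}  # node_id -> timestamp of the last heartbeat received

def record_heartbeat(node_id):
    """Called whenever a heartbeat message arrives from a node."""
    last_heartbeat[node_id] = time.time()

def suspected_crashed(node_id):
    """A node is suspected if its last heartbeat is older than TIMEOUT."""
    last = last_heartbeat.get(node_id)
    return last is None or (time.time() - last) > TIMEOUT

# Usage sketch: node "a" heartbeats, node "b" stays silent.
record_heartbeat("a")
time.sleep(0.1)
print(suspected_crashed("a"))  # False: heartbeat is recent
print(suspected_crashed("b"))  # True: no heartbeat ever received
```

Note that a timeout-based detector cannot distinguish a crashed node from a slow network, which is why real systems treat such nodes as "suspected" rather than definitively failed.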

FAULT TOLERANCE 

Fault tolerance in distributed systems is the capability of a system to continue functioning properly even when one or more of its components fail. This is a critical feature for distributed systems, as it ensures reliability, availability, and continuous service despite failures.

TECHNIQUES FOR FAULT TOLERANCE 

  • Redundancy:

    • Data Redundancy: Storing copies of data in multiple locations to prevent data loss.
    • Component Redundancy: Having multiple instances of critical system components so that if one fails, another can take over.
  • Replication:

    • Stateful Replication: Replicating the state of a service or system component across multiple nodes.
    • Stateless Replication: Replicating requests and responses, useful for services that do not maintain state.
  • Failover:

    • Automatic switching to a standby system or component upon the failure of the primary system (a minimal failover sketch appears after this list).
  • Consensus Protocols:

    • Protocols like Paxos, Raft, and Zab ensure that a group of nodes in a distributed system can agree on a single value or course of action even in the presence of failures.
  • Checkpointing and Rollback:

    • Saving the state of a system at intervals so that it can be restored to a known good state in case of a failure (a checkpoint-and-rollback sketch follows the failover example below).
  • Load Balancing:

    • Distributing workloads across multiple nodes so that no single node becomes overloaded or a single point of failure.
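
As a concrete illustration of failover, the following minimal Python sketch tries the primary replica first and falls back to standbys when a call fails. The replica names and the simulated remote call are hypothetical; a real system would issue network requests and coordinate which standby takes over.

```python
import random

REPLICAS = ["primary", "standby-1", "standby-2"]

def call_node(node, request):
    """Simulated remote call that fails randomly to mimic crashes."""
    if random.random() < 0.5:
        raise ConnectionError(f"{node} is unreachable")
    return f"{node} handled {request!r}"

def call_with_failover(request):
    """Try each replica in order, failing over on error."""
    last_error = None
    for node in REPLICAS:
        try:
            return call_node(node, request)
        except ConnectionError as err:
            last_error = err  # this replica failed: try the next one
    raise RuntimeError("all replicas failed") from last_error

# Usage: succeeds as long as at least one simulated replica is up.
try:
    print(call_with_failover("read key=42"))
except RuntimeError as err:
    print(err)
```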
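Checkpointing and rollback can be sketched just as briefly. This hypothetical example snapshots an in-memory state and restores the last known-good snapshot after a simulated faulty update; real systems would persist checkpoints to stable storage rather than keep them in memory.

```python
import copy

state = {"balance": 100, "version": 1}
checkpoints = []

def take_checkpoint():
    """Save a deep copy of the current state as a known-good snapshot."""
    checkpoints.append(copy.deepcopy(state))

def rollback():
    """Restore the most recent checkpoint after a detected fault."""
    global state
    state = copy.deepcopy(checkpoints[-1])

take_checkpoint()          # known-good state saved
state["balance"] = -999    # simulated faulty update corrupts the state
rollback()                 # recover the last known-good state
print(state)               # {'balance': 100, 'version': 1}
```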

EXAMPLES OF FAULT-TOLERANT SYSTEMS

  • Google File System (GFS):

    • Utilizes data replication and chunk servers to ensure high availability and fault tolerance.
  • Apache Hadoop:

    • Uses data replication across DataNodes to ensure fault tolerance in its distributed file system (HDFS).
  • Amazon Web Services (AWS):

    • Employs Availability Zones and regions to ensure services are fault-tolerant by isolating failures.
  • Apache Kafka:

    • Uses partitioning and replication of logs to ensure message durability and fault tolerance.
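
As one hedged, concrete example of configuring this kind of fault tolerance, the sketch below uses the third-party kafka-python library to create a producer that waits for all in-sync replicas to acknowledge each message before considering it sent. The broker address, topic name, and payload are placeholders, and a running Kafka cluster is assumed.

```python
from kafka import KafkaProducer  # third-party: pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # placeholder broker address
    acks="all",    # wait until all in-sync replicas have the message
    retries=5,     # retry transient send failures automatically
)
producer.send("events", b"order-created")  # placeholder topic and payload
producer.flush()  # block until buffered messages are acknowledged
```

With acks="all", a message survives the loss of any single broker holding a replica, which is the durability guarantee the bullet above refers to.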

FAULT MANAGEMENT

  • Fault Detection:

    • Monitoring and Logging: Continuously monitoring system performance and logging activities to detect anomalies.
      • Example: Using tools like Prometheus for monitoring and Elasticsearch for logging (a minimal health-check sketch appears at the end of this section).
  • Fault Diagnosis:

    • Identifying the underlying cause of a fault by analyzing logs and system states.
      • Example: Using tools like Splunk for log analysis.
  • Fault Recovery:

    • Restoring normal operation after a fault, for example by automatically switching to a standby component (failover).
      • Example: High-availability clusters using technologies like Pacemaker and Corosync.
  • Fault Prevention:

    • Reducing the likelihood of faults before they occur, through practices such as rigorous testing and through redundant hardware that masks individual component failures.
      • Example: RAID configurations for disk redundancy.
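
Tying fault detection back to code, the following hypothetical Python sketch probes each node's TCP port and reports nodes that fail the health check. The node addresses are placeholders; production systems would rely on a monitoring stack such as Prometheus rather than a hand-rolled loop like this.

```python
import socket

NODES = [("10.0.0.1", 8080), ("10.0.0.2", 8080)]  # placeholder nodes

def is_reachable(host, port, timeout=1.0):
    """Return True if a TCP connection to the node succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections and timeouts
        return False

for host, port in NODES:
    if not is_reachable(host, port):
        print(f"ALERT: {host}:{port} failed health check")
```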