DISTRIBUTED SYSTEM NOTES, IOE, BCA, TU
FAULT, FAULT TOLERANCE AND FAULT MANAGEMENT
FAULT
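A fault is a defect or abnormal condition in a hardware or software component that can cause the system to deviate from its intended behavior. Strictly speaking, a fault may lead to an error (an incorrect internal state), which may in turn lead to a failure (a visible deviation from the expected service), although the three terms are often used loosely.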
TYPES OF FAULT
- Crash Faults:
  - Definition: Occur when a node or a component in the system unexpectedly stops functioning and halts all operations.
  - Example: A server suddenly loses power and goes offline.
- Omission Faults:
  - Definition: Happen when a component fails to send or receive messages. They can be further divided into:
    - Send Omission: A node fails to send a message.
    - Receive Omission: A node fails to receive a message.
  - Example: Network congestion causes packets to be dropped, resulting in lost messages.
- Timing Faults:
  - Definition: Occur when the system's timing constraints are violated, with messages or responses arriving either too late or too early.
  - Example: A time-sensitive transaction takes longer than the acceptable limit, causing a timeout.
- Byzantine Faults:
  - Definition: These are arbitrary faults where components behave erratically and inconsistently, possibly due to malicious attacks or severe bugs. They can produce incorrect or misleading results.
  - Example: A compromised node sends conflicting information to different parts of the system.
- Network Faults:
  - Definition: Issues that arise within the communication network, such as packet loss, network partitioning, and high latency.
  - Example: A network partition isolates a subset of nodes from the rest of the system, causing communication failures.
- Hardware Faults:
  - Definition: Failures related to physical components such as servers, storage devices, and network hardware.
  - Example: A hard disk crash leads to data loss.
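The difference between an omission fault and a timing fault can be made concrete with a small sketch. The names used here (`Delivery`, `classify`, `earliest`, `deadline`) are hypothetical and chosen only for illustration. The sketch also assumes an observer who already knows whether a message was ever delivered; in a real asynchronous network a receiver cannot reliably tell a lost message from a very slow one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Delivery:
    """One message as seen by an all-knowing observer (an assumption)."""
    sent_at: float
    received_at: Optional[float]  # None means the message never arrived


def classify(delivery: Delivery, earliest: float, deadline: float) -> str:
    """Return 'omission', 'timing', or 'ok' for a single message.

    earliest and deadline bound the acceptable latency in seconds.
    """
    if delivery.received_at is None:
        # The message was never received: an omission fault.
        return "omission"
    latency = delivery.received_at - delivery.sent_at
    if latency < earliest or latency > deadline:
        # The message arrived, but outside its timing constraints.
        return "timing"
    return "ok"


lost = classify(Delivery(sent_at=0.0, received_at=None), 0.0, 1.0)
late = classify(Delivery(sent_at=0.0, received_at=2.5), 0.0, 1.0)
fine = classify(Delivery(sent_at=0.0, received_at=0.4), 0.0, 1.0)
```

With a latency window of 0 to 1 second, the first message is classified as an omission, the second as a timing fault, and the third as correct.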
FAULT TOLERANCE
Fault tolerance in distributed systems is the capability of a system to continue functioning properly even when one or more of its components fail. This is a critical feature for distributed systems, as it ensures reliability, availability, and continuous service despite failures.
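As a minimal sketch of this idea, the following code keeps serving a read request even though one replica has crashed. The replicas are modeled as plain functions, and `ReplicaDown`, `read_with_fallback`, `healthy`, and `crashed` are hypothetical names introduced only for this illustration; a real system would contact remote nodes over the network.

```python
class ReplicaDown(Exception):
    """Raised by a replica that cannot serve the request."""


def read_with_fallback(replicas, key):
    """Try each replica in order and return the first successful answer."""
    for replica in replicas:
        try:
            return replica(key)
        except ReplicaDown:
            # This replica failed; move on to the next one.
            continue
    # Every replica failed, so the fault can no longer be masked.
    raise ReplicaDown(f"all {len(replicas)} replicas failed")


def healthy(key):
    return f"value-of-{key}"


def crashed(key):
    raise ReplicaDown("node offline")


# The first replica is down, but the request still succeeds.
result = read_with_fallback([crashed, healthy], "x")
```

The caller receives an answer as long as at least one replica is working, which is the sense in which the failure is tolerated. Only when every replica fails does the error reach the caller.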
TECHNIQUES FOR FAULT TOLERANCE
EXAMPLE OF A FAULT-TOLERANT SYSTEM
FAULT MANAGEMENT
- Fault Detection:
  - Monitoring and Logging: Continuously monitoring system performance and logging activities to detect anomalies.
  - Example: Using tools like Prometheus for monitoring and Elasticsearch for storing and searching logs.
- Fault Diagnosis:
  - Root Cause Analysis: Identifying the underlying cause of a fault by analyzing logs and system states.
  - Example: Using tools like Splunk for log analysis.
- Fault Recovery:
  - Failover: Automatically switching to a standby component when a primary component fails.
  - Example: High-availability clusters using technologies like Pacemaker and Corosync.
- Fault Prevention:
  - Redundancy: Adding extra components that can take over in case of a failure.
  - Example: RAID configurations that use mirroring or parity (for example, RAID 1 or RAID 5) for disk redundancy.
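Detection and recovery are often combined: a monitor suspects a node that has stopped sending heartbeats, and traffic is then switched to a standby. The sketch below illustrates this under simple assumptions. `HeartbeatMonitor` and `choose_active` are hypothetical names, timestamps are passed in explicitly so the example is deterministic, and it is not meant to describe how Pacemaker, Corosync, or any other real tool works internally.

```python
class HeartbeatMonitor:
    """Suspects a node if no heartbeat arrived within `timeout` seconds."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # node name -> time of its last heartbeat

    def heartbeat(self, node, now):
        """Record that `node` reported in at time `now`."""
        self.last_seen[node] = now

    def suspected(self, node, now):
        """Return True if `node` has never reported or has been silent too long."""
        last = self.last_seen.get(node)
        return last is None or now - last > self.timeout


def choose_active(monitor, primary, standby, now):
    """Failover rule: prefer the primary, fall back to the standby."""
    if not monitor.suspected(primary, now):
        return primary
    if not monitor.suspected(standby, now):
        return standby
    return None  # both nodes are suspected; nothing is left to fail over to


monitor = HeartbeatMonitor(timeout=3.0)
monitor.heartbeat("primary", now=0.0)
monitor.heartbeat("standby", now=0.0)
before = choose_active(monitor, "primary", "standby", now=2.0)

# The primary goes silent; only the standby keeps reporting.
monitor.heartbeat("standby", now=4.0)
after = choose_active(monitor, "primary", "standby", now=5.0)
```

At time 2.0 both nodes are considered alive, so the primary stays active. At time 5.0 the primary has been silent for longer than the timeout, so the standby takes over. A timeout can also wrongly suspect a node that is only slow, which is why the choice of timeout value matters in practice.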