Distributed Systems

A distributed system is a collection of independent computers that appear to the user as a single coherent system, this is in contrast to a networked system. In a networked system computers exchange information, however in a distributed system an application has many parts running on different computiers, and information is shared to accomplish a specific purpose.

Distributed Systems allow resources to be more easily shared, to be location independant, to distribute human resources, to increase performance, to increase modularity and scalability and to increase availability and reliability.

In distributed systems a middleware layer masks the underlying networks, hardware, operating systems and programming languages and presents the system as a single service.

Certain design features have to be considered when designing a distributed system

  • Variable network bandwidth needs
  • Possible network latency issues
  • No hardware support for synchronization
  • No global clock
  • Unpredictable component failure model
  • Security Issues

Remote Method Invocation (RMI) is Java’s implementation of object-to-object communication among Java objects to realise a distributed computing model. It allows objects to be distributed on various machines and invoke methods on objects located on remote sites.

Replication

Replication is the process of assigning mutliple copies of the same information in a system. It can enhance reliability and improve performance, but requires consistency. In order to maintain consistency it is important to check a replica is up to date, which could be achieved by use of a global clock, server-push, client-pull or regular updates.

  • In a strict consistency model we need an absolute global time, the concept of “most recent” must be unambiguous. This is akin to a uniprocessor system.
  • In a sequential consistency model the result of any execution is the same as if the operations by all processes on the data store were executed in some sequential order. So, processes can run concurrently and I/O operations may be interleaved but all processes are aware of this and the output will be the same regardless. The only different from strict consistency is that there is no reference to a global time.
  • In a causal consistency model writes that are potentially causally related must be seen by all processes in the same order. Concurrent writes may be seen in a different order on different machines. It is “weaker” than sequential consistency, and is based on the premise that if event B is caused or influenced by an earlier event A, then everyone must first see A and then B.

Consitency control can be enforced in Java with use of the “synchronized” keyword in a method declaration.

Synchronisation

Clock Synchronisation is a problem for distributed systems as the internal clocks of computers can differ, and even if set accurately can start to drift. The implication of this is the notion of physical time is problematic in networked computer systems, we are limited in our ability to timestamp events and nodes and thus accurately determine the order in which events occured.

Cristian’s algorithm suggests getting the current time from a time server, which does use a level of estimation but by repeated requests to estimate the round trip time this can be minimised. The use of a time server does however provide only a single point of failure, although again this can be resolved through grouping servers and using multicast requests.

The Network Time Protocol (NTP) is a scalable protocol for synchronising time, it allows a logical hierarchy of servers and supports authentication of trusted servers. NTP utilises UDP.

Fault Tolerance

\mbox{ Availability }=\dfrac{\mbox{ Reliability }}{\mbox{ Reliability } + \mbox{ Downtime }}

Where reliability is measured in mean-time-between-failures (MTBF), downtime is measured in mean-time-to-repair.

Fault tolerance is the ability of a system to continue operating properly in the event of a failure of some of its components, this is particularly important in high-availability or life-critical systems.

  • Hardware Faults are often physical faults and are typically unpredictable. Redundancy can resolve fault issues.
  • Software Faults usually arise from specification or design errors, or implementation mistakes.

Replication can give fault tolerance in different ways:

  1. Replication (multiple instances of the same system, used in parallel)
  2. Redundancy (multiple instances of the same system for fall-back)
  3. Diversity (multiple instances of a different implementation to cope with errors in specific implementations)

Recovery may be needed to restore a system in the event of a fault

  • Backward recovery allows the rollback of a system to a previously saved state (can be used regardless of damage to current state, can handle unpredictable errors, however requires significant resources and is a risk of the domino effect)
  • Forward recovery attempts to find a new state from which the system can continue operation (includes tools such as error compensation, fault masking and self-checking components however can only remove predictable errors ).

Leave a Comment