DEV Community

Sriram R


Introduction to Distributed Systems

What's a Distributed System?

A distributed system is a system whose components run on different computers connected by a network. These computers communicate and coordinate their actions by passing messages to one another.

[Image: Distributed System]

Components of a Distributed System

  1. Nodes: the individual machines or processes that make up the distributed system
  2. Network: the communication channel through which the nodes exchange messages

Why do we need Distributed Systems?

Performance

There are limits to what a single node can do: each machine has hardware-based limits.
We can scale up a machine by adding more RAM and CPUs, but beyond a certain point each further improvement to a single computer becomes very expensive.

Instead, we can often achieve the same performance with several less expensive machines working together.

Scalability

Most computer systems deal with information. They are in charge of storing and processing data.
Since a single machine's performance can only be scaled to a certain point, we need more than one machine to handle the volume of data produced today. One computer cannot handle all of your requests on its own.

With multiple machines, we can store and process data more efficiently by splitting it across them.
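One common way to split data across machines is hash-based partitioning. The sketch below is illustrative: the node names and the simple modulo scheme are assumptions, not a production design (real systems typically use consistent hashing so nodes can be added without remapping most keys).

```python
# A minimal sketch of hash-based partitioning: each key is hashed and
# mapped to one of several nodes, so data and load are split between
# machines. Node names and the modulo scheme are illustrative.
import hashlib

NODES = ["node-0", "node-1", "node-2"]

def node_for_key(key: str) -> str:
    # Hash the key and take the result modulo the node count
    # to deterministically pick an owner node.
    digest = hashlib.sha256(key.encode()).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

# Every key maps to exactly one node, and the same key always
# maps to the same node.
assignments = {user: node_for_key(user) for user in ["alice", "bob", "carol"]}
```

Because the mapping is deterministic, any machine can compute which node owns a key without asking a central coordinator.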

Availability

Most services need to be available 24 hours a day, 7 days a week, which is a big challenge. A single machine can break down at any time.
If your service goes down, you start losing money immediately.
And if you store all of your data on a single machine and that machine crashes, you lose all of your data.

To be highly available, we need multiple machines so that if one fails, we can quickly switch to another.
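The "switch to another machine" idea can be sketched as a simple client-side failover loop. Everything here is a stand-in: `fetch_from` simulates a network call, and the replica names are hypothetical.

```python
# A minimal failover sketch: try each replica in turn and return the
# first successful response. fetch_from simulates a network read and
# fails for replicas that are not in the healthy set.
def fetch_from(replica: str, healthy: set) -> str:
    if replica not in healthy:
        raise ConnectionError(f"{replica} is down")
    return f"data from {replica}"

def read_with_failover(replicas, healthy):
    last_error = None
    for replica in replicas:
        try:
            return fetch_from(replica, healthy)
        except ConnectionError as err:
            last_error = err  # this replica failed; try the next one
    raise last_error  # every replica failed

# The primary is down, so the read transparently falls back to a replica.
result = read_with_failover(["primary", "replica-1"], healthy={"replica-1"})
# result == "data from replica-1"
```

Real systems add timeouts, health checks, and replication to keep the replicas' data in sync, but the core idea is the same: no single machine is a single point of failure.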

Difficulties Designing Distributed Systems

Network Asynchrony

Communication networks are asynchronous: there is no guarantee on how long it will take for a message to travel from one machine to another, and messages can arrive out of order.
This makes distributed systems hard to design and reason about.

To understand better, let's take an example.
Let's say a user disliked a post on a social media site, then realised they meant to like it and changed their vote.
Because the network is asynchronous, it's possible that the like was received and processed first.
The user's real intent was a like, but since the messages arrived out of order and the dislike was processed last, the system marks the post as disliked.
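One common defence against out-of-order delivery is to attach a client-side sequence number to each update and have the server ignore anything older than what it has already applied (a last-writer-wins rule). The sketch below is illustrative; the field names are assumptions.

```python
# A sketch of taming out-of-order delivery with sequence numbers:
# the server applies an update only if its sequence number is newer
# than the last one it has seen, so stale messages are dropped.
post_state = {"vote": None, "seq": 0}

def apply_vote(vote: str, seq: int) -> None:
    # Ignore stale updates that arrive after a newer one was processed.
    if seq > post_state["seq"]:
        post_state["vote"] = vote
        post_state["seq"] = seq

# The user sent dislike (seq 1) and then like (seq 2), but the
# network delivered them in the opposite order.
apply_vote("like", seq=2)
apply_vote("dislike", seq=1)   # stale; dropped
# post_state["vote"] == "like"
```

With this rule in place, the final state reflects the user's last action regardless of the order in which the network delivers the messages.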

Partial Failures

The failure of some components of a system is called a partial failure. If the application doesn't account for this, it can produce incorrect results.

For example, let's say you have multiple machines where your users' data is spread out, and you lose connection with one of them. Users whose data was stored on that machine will have to wait for it to come back up.

Partial failures also make atomic transactions much more complicated: an operation that must either complete on every node or on none of them cannot simply proceed while some nodes are down.
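The user-data example above can be sketched in a few lines. The node names, the user-to-node mapping, and the `TimeoutError` are all illustrative assumptions; the point is that a partial failure affects only the users whose data lives on the failed node.

```python
# A sketch of partial failure: users are spread across nodes, and when
# one node is unreachable, only the users stored on it are affected.
user_to_node = {"alice": "node-0", "bob": "node-1"}
down_nodes = {"node-1"}

def get_profile(user: str) -> str:
    node = user_to_node[user]
    if node in down_nodes:
        # Simulate a failed network call to an unreachable node.
        raise TimeoutError(f"{node} unreachable; {user}'s data unavailable")
    return f"profile of {user} from {node}"

get_profile("alice")     # succeeds: alice's data is on a healthy node
# get_profile("bob") raises TimeoutError: bob's data is on the down node
```

The rest of the system keeps working, which is exactly what makes partial failures subtle: the application must decide what to do for the affected subset (wait, retry, or serve from a replica).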

[Image: Partial Failure]

Concurrency

Concurrency is the ability to perform more than one computation at the same time, possibly on the same data. This adds complexity because concurrent computations can interfere with each other and produce incorrect results.
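The classic example of such interference is two threads doing a read-modify-write on a shared counter. The sketch below uses a lock to make the increment atomic; removing the lock may lose updates, because the two threads can read the same old value before either writes back.

```python
# A sketch of coordinating concurrent computations on shared data:
# two threads increment a shared counter, and a lock makes each
# read-modify-write atomic so no update is lost.
import threading

counter = 0
lock = threading.Lock()

def increment(n: int) -> None:
    global counter
    for _ in range(n):
        with lock:       # without this lock, updates may interleave and be lost
            counter += 1

threads = [threading.Thread(target=increment, args=(100_000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# counter == 200_000 with the lock; often less without it
```

In a distributed system the same problem appears between machines rather than threads, and the "lock" becomes far more expensive: it has to be coordinated over an asynchronous, partially failing network.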

Measuring Correctness

How do we know that a system is correct, i.e. working as it should?
Two kinds of properties determine whether a system is correct:

Safety

A safety property says that something in the system must never happen.

If we think of a bicycle as a system, for example, a safety property might be that a wheel must never detach while the bike is being ridden. If the wheel comes off while the bike is moving, bad things happen.

Liveness

A liveness property defines something that must eventually happen in a system.

In the case of a bicycle, liveness might mean that the bike eventually moves when pedalled and eventually stops when the brakes are applied.
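These two kinds of properties can be stated as checks on a concrete system. As an illustrative sketch (the job queue and its names are assumptions, not from the article), consider a work queue where safety means "no job is ever processed twice" and liveness means "every submitted job is eventually processed":

```python
# A sketch of safety and liveness as checkable properties of a simple
# job queue: safety = nothing bad ever happens (no duplicates),
# liveness = something good eventually happens (every job runs).
from collections import deque

submitted = ["job-1", "job-2", "job-3"]
queue = deque(submitted)
processed = []

while queue:
    job = queue.popleft()
    processed.append(job)

# Safety: no job was processed more than once.
assert len(processed) == len(set(processed))
# Liveness: every submitted job was eventually processed.
assert set(processed) == set(submitted)
```

In real distributed systems the hard part is that network asynchrony and partial failures make it impossible to guarantee both kinds of properties in every situation, so designs must choose which one to weaken and when.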
