☕ Tech Dive - Distributed Computing
A tech dive on distributed computing and the problems that arise from using a network of computers. We also talk about some of the distributed systems models used to design algorithms.
Hope you’re all having a great day!
As your application grows you’ll eventually reach a point where you have to scale horizontally.
You’ll need to use multiple computers to run your application, where one computer may act as a database, another as a load balancer, a third may handle authentication, and so on.
One big issue with scaling horizontally is that working with distributed systems can be a big pain. It’s fundamentally different than working on a single computer and there are tons of new ways for things to go wrong!
When you’re working with a single computer, your program is typically deterministic. Either the program works, or it doesn’t work. Running the same operation in your program multiple times will produce the same result.
This is not the case for a distributed system. You can have partial failures with your distributed system and they can be “non-deterministic”. That means that you can run the same operation on your program multiple times and it might work sometimes and fail other times. Certain parts of your system may go down randomly causing different results. As you can imagine, this makes testing and debugging your application much more difficult.
Additionally, the bigger your system gets, the more likely it is that one of its components is broken at any given moment. If you’re using a distributed system of thousands of individual nodes, you should assume that something is always broken.
When using a distributed system, you’re typically using a Shared Nothing Architecture.
This means that the nodes in your distributed system are sharing nothing. Each machine has its own memory, disk, and CPU and the only way one node can access another node’s memory, disk, or processing resources is through a network request.
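As a minimal sketch of this idea, here are two toy "nodes" in one Python script: the first owns some data (the endpoint and payload are made up for illustration), and the only way the second can read it is by making a network request.

```python
import http.server
import threading
import urllib.request

# Toy "node" that owns some data and exposes it over HTTP. In a
# shared-nothing architecture, this network endpoint is the only way
# another node can read this machine's state. (Hypothetical payload.)
class DataHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        body = b'{"user_count": 42}'
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet

server = http.server.HTTPServer(("127.0.0.1", 0), DataHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# A second "node" fetching the data -- it cannot touch the first
# node's memory or disk directly; it can only send a request.
with urllib.request.urlopen(f"http://127.0.0.1:{server.server_port}/") as resp:
    data = resp.read().decode()
server.shutdown()
print(data)  # → {"user_count": 42}
```

Everything beyond the standard library here is invented for the sketch; the point is simply that node-to-node communication is always a request over the network, never a shared pointer.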
The internal network inside of a data center is an Asynchronous Network, meaning that one node can send a packet to another node, but the network gives no guarantee as to when it will arrive, or whether it will arrive at all.
When you send a request, it’s possible for your request to get lost (someone unplugged a network cable, for example), get stuck in a queue (the recipient was overloaded), or for the recipient to be unavailable (the recipient node was shut down).
The requestor has no way of knowing whether the packet was delivered. The only way of knowing is with a response, and that might also get lost or delayed.
The way distributed systems handle this unreliability is through a Timeout. The sender waits a certain amount of time for a response. Once the timeout threshold has passed (and no response is received), the sender assumes the request failed.
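Here's a sketch of the timeout pattern in Python. The remote call is simulated with a sleeping thread rather than a real network request, and the 0.5-second threshold is an arbitrary choice for illustration.

```python
import time
import concurrent.futures

TIMEOUT = 0.5  # seconds -- an arbitrary threshold for this sketch

def call_remote(delay):
    # Stand-in for a network request; the sleep simulates latency.
    time.sleep(delay)
    return "pong"

def request_with_timeout(delay):
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(call_remote, delay)
        try:
            return future.result(timeout=TIMEOUT)
        except concurrent.futures.TimeoutError:
            return None  # no response in time: assume the request failed

print(request_with_timeout(0.1))  # fast node responds: pong
print(request_with_timeout(2.0))  # slow node: assumed dead, None
```

Note that the sender can't distinguish "the node is dead" from "the node is slow" here: the 2.0-second call may well have succeeded on the other end, which is exactly the ambiguity the next paragraphs discuss.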
Timeouts may seem simple at the surface, but they’re actually quite complicated. One important question to consider is how long should your timeout be?
A long timeout means a long wait until a node is declared dead. This results in a bad user experience, as the user may have to wait or see an error message. It can also slow down your entire system, as nodes waste time waiting on responses from dead machines.
A timeout that’s too short is also problematic, as you will be prematurely declaring a node to be dead. This means a higher risk of incorrectly declaring a node dead, when in fact the node only suffered a temporary slowdown (usually due to a load spike on the network).
When you declare a node to be dead, this results in its responsibilities being transferred to other nodes which means additional load being placed on the other nodes and network.
The length of your timeout is typically chosen experimentally: measure how long network round-trips take over an extended period to get a sense of how long a delay you should expect, then use that (along with your application’s characteristics) to set the timeout duration.
Some systems can continually measure response times and their variability and then automatically adjust timeouts according to the observed response time distribution. This is done in Akka and Cassandra for example.
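A crude sketch of this idea: derive the timeout from the observed round-trip-time distribution, say mean plus a few standard deviations. This is a simplification of the accrual-style failure detectors those systems use; the samples and the `k=4` multiplier are made up for illustration.

```python
import statistics

def adaptive_timeout(rtt_samples, k=4.0):
    # Timeout = mean RTT plus k standard deviations, so the threshold
    # widens automatically when response times become more variable.
    # k=4 is an arbitrary choice for this sketch.
    mean = statistics.mean(rtt_samples)
    stdev = statistics.stdev(rtt_samples) if len(rtt_samples) > 1 else 0.0
    return mean + k * stdev

# Round-trip times (seconds) observed over some measurement window:
calm  = [0.050, 0.048, 0.052, 0.049, 0.051]
spiky = [0.050, 0.048, 0.052, 0.300, 0.051]
print(adaptive_timeout(calm))   # tight timeout while the network is stable
print(adaptive_timeout(spiky))  # looser timeout when variability spikes
```

The design choice here is that high variability, not just high averages, should loosen the timeout, which reduces the false "node is dead" declarations described above.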
The discussion around timeouts brings us to another big pain point of working with a Distributed System... Time.
Applications depend on time to answer various questions around things like cache expiry, timestamps for events, alerts, etc. Typically, you can divide time into measuring durations (measured by Monotonic Clocks) and measuring points in time (measured by Time of Day clocks).
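Python exposes both kinds of clock, which makes the distinction easy to see:

```python
import time

# Time-of-day clock: a wall-clock timestamp (seconds since the Unix
# epoch). Good for labeling a point in time, but it can jump backwards
# when NTP corrects the clock, so it's unsafe for measuring durations.
event_timestamp = time.time()

# Monotonic clock: guaranteed to only move forward, so the difference
# between two readings is a reliable duration.
start = time.monotonic()
time.sleep(0.1)
elapsed = time.monotonic() - start
print(f"slept for roughly {elapsed:.3f}s")
```

A monotonic reading is only meaningful relative to another reading on the same machine, which is one reason comparing timestamps across nodes is so tricky.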
With your distributed system, measuring time becomes tricky because communication is not instantaneous. It takes time for messages to travel across your network.
Additionally, each node on your network will have its own internal clock. Their internal clocks are not perfectly accurate so one node’s clock may be slightly slower or faster than another node’s clock.
A possible solution to unsynced Time of Day clocks is NTP - Network Time Protocol. This allows the clocks on your network to be synced to within a few milliseconds of UTC. Even with NTP, however, a host of errors can still pop up: network delays, incorrect configuration, leap seconds, etc.
Therefore, when working with distributed systems you should view your clock readings as being within a confidence interval. You might be 95% confident that the time is now between 1.2 and 1.4 seconds past 3:04 PM.
System Models for Distributed Computing Algorithms
When designing algorithms to run on a distributed system, it's important to have a base set of assumptions.
Software designers typically refer to these assumptions as a Computational Model. You might be familiar with the RAM Computational Model used for analyzing an algorithm's asymptotic complexity.
With distributed computing, we have System Models. This is an abstraction that describes what an algorithm can assume in terms of the hardware and software configuration.
Synchronous Model
The synchronous model assumes bounded network delay and clock error. In other words, you have a fixed upper bound for the network delay and clock error.
The synchronous model isn't realistic for most practical systems, however, since delays are usually unbounded, so it isn't used much in practice.
Partially Synchronous Model
The partially synchronous model assumes that a system behaves like a synchronous system most of the time but will sometimes exceed the upper bound for network delays or clock drifts.
This is realistic for many systems. For many distributed computing systems, there is a fixed upper bound for the majority of requests but there are occasional time periods where things go wrong (maybe there's excessive load on the network or one of the nodes failed). In that scenario, there will be an unbounded delay.
Asynchronous Model
With the asynchronous model, an algorithm is not allowed to make any assumptions around timing whatsoever. The algorithm assumes that the system does not even have a clock, so things like timeouts are not possible.
The asynchronous model is the most restrictive.