Microservices Design Patterns

The Problem: Unreliable External Resources in Distributed Systems

In the dazzling world of distributed systems, one question that frequently pops up is, "What happens when things go wrong?" Well, that's a great question because, let's face it, in the real world, things do go wrong! More specifically, things often go wrong with external resources. But what exactly are these external resources, and why are they often so problematic?

Let's imagine you're running a bustling online store. Your application interacts with various services like inventory databases, payment gateways, third-party delivery APIs, and more. All these services are external resources. They are the links in the chain that your application depends upon to function smoothly.

However, these resources, like everything else in the world, are not infallible. They can have temporary hiccups due to network glitches, load spikes, or even hardware failures. Think of it as a traffic jam on the way to your physical store - the destination is intact and the vehicle is working fine, but the path is temporarily blocked. When one of these 'traffic jams' happens in your application, operations that depend on these resources are bound to fail.

In a monolithic system, you might have a single database or a couple of internal services that, if they fail, will bring down the whole system. The probability of such a failure happening is relatively low, and when it does happen, there's nothing much left to do but restore the service as quickly as possible.

However, in a distributed system, things are different. Your application is now a collection of smaller services, each potentially interacting with multiple external resources. The chances of encountering a transient failure in one of these many interactions significantly increase. It's as if you own a chain of stores now, spread across different locations. If there's a traffic jam blocking the route to one of your stores, it doesn't mean all your other stores need to close as well.

This is precisely the kind of resilience that distributed systems aim to achieve. When an operation fails due to a transient error with an external resource, we don't want the entire system to collapse. Instead, we prefer a strategy that can tolerate these hiccups and continue serving the users.

In many cases, the transient errors resolve themselves after a short period. It's like waiting for the traffic jam to clear. So, one naive solution would be to retry the failed operation immediately. Sounds good, right? Well, not so fast! This approach can backfire quite spectacularly.
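To see what that naive approach looks like in practice, here's a rough Java sketch of an immediate, unbounded retry loop. The checkInventory method, its simulated failure rate, and the product ID are all invented for illustration - they stand in for any call to an external resource.

import java.util.concurrent.ThreadLocalRandom;

// A naive retry: keep hammering the external resource until a call happens to succeed.
public class NaiveRetry {

    // Stands in for any call to an external resource that fails intermittently.
    static int checkInventory(String productId) {
        if (ThreadLocalRandom.current().nextInt(10) < 7) {   // simulate a transient failure most of the time
            throw new RuntimeException("inventory service unavailable");
        }
        return 42;                                           // pretend stock level
    }

    public static void main(String[] args) {
        int stock;
        while (true) {                         // no attempt limit...
            try {
                stock = checkInventory("sku-123");
                break;                         // it finally worked, stop retrying
            } catch (RuntimeException e) {
                // ...and no delay: retry immediately, whatever the error was.
            }
        }
        System.out.println("stock = " + stock);
    }
}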

Why is that, you may wonder? Let's go back to the traffic jam analogy. What happens if all the blocked vehicles decide to move forward at the same time as soon as the path clears a bit? Chaos, right? The same thing can happen in your system. If all the failed operations are retried at once, it might lead to a sudden spike in load, causing more harm than good. This is known as the thundering herd problem, a situation we definitely want to avoid.
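A common way to keep a herd of clients from charging forward at once is to add a random "jitter" to each client's wait time, so retries spread out instead of landing at the same instant. Here's a minimal sketch of such a delay calculation; the base delay and jitter range are arbitrary, illustrative values, not a recommendation.

import java.util.concurrent.ThreadLocalRandom;

// Sketch: compute a retry delay with random jitter so clients don't retry in lockstep.
public class JitteredDelay {

    // A fixed base delay plus up to one extra second of random jitter (illustrative values).
    static long nextDelayMillis(long baseDelayMillis) {
        long jitter = ThreadLocalRandom.current().nextLong(0, 1_000);
        return baseDelayMillis + jitter;
    }

    public static void main(String[] args) throws InterruptedException {
        long delay = nextDelayMillis(2_000);   // somewhere between 2 and 3 seconds
        System.out.println("Waiting " + delay + " ms before the next attempt");
        Thread.sleep(delay);                   // pause before retrying the failed operation
    }
}

Because each client picks a slightly different delay, the retries trickle back in instead of arriving as one big wave.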

Moreover, repeatedly trying to interact with an unavailable resource can waste valuable processing power and network bandwidth. This is akin to repeatedly trying to open a locked door. It's not going to budge until someone unlocks it, so continuously pushing against it will only exhaust you.

Finally, not all errors are transient. Some failures are more permanent and will not resolve themselves over time. Retrying operations in such scenarios will just delay the inevitable, impacting your system's responsiveness and user experience.
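In code, that usually means classifying the error before deciding whether a retry is even worth attempting. The sketch below treats timeouts and refused connections as transient and everything else as permanent; which errors truly are transient depends on the specific resource you're calling.

import java.net.ConnectException;
import java.net.SocketTimeoutException;

// Sketch: decide whether a failure is worth retrying at all.
public class TransientErrorCheck {

    // Timeouts and refused connections often clear up on their own; here, anything else
    // is assumed to be permanent and should fail fast instead of being retried.
    static boolean isTransient(Throwable error) {
        return error instanceof SocketTimeoutException
            || error instanceof ConnectException;
    }

    public static void main(String[] args) {
        Exception timeout    = new SocketTimeoutException("read timed out");
        Exception badRequest = new IllegalArgumentException("malformed product id");

        System.out.println("Retry the timeout?     " + isTransient(timeout));     // true
        System.out.println("Retry the bad request? " + isTransient(badRequest));  // false
    }
}

A permanent error gets surfaced right away, while a transient one earns another attempt.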

So, how do we tackle these issues? We want to make our system resilient to transient errors, but we also need to avoid the pitfalls of mindless and aggressive retries. The answer lies in a thoughtful approach to retrying failed operations, one that can adapt based on the nature of the error and the response of the system - the Retry Pattern.

In the software world, you might have heard about "defensive programming". The Retry Pattern is a great example of this concept. It is about being ready for unexpected issues and having a plan to manage them gracefully. With the Retry Pattern, we can attempt to perform an operation that might fail, taking precautions to avoid the pitfalls of naive retries and enhancing the overall reliability of our system.

The Retry Pattern can be especially beneficial in microservices architecture where services often communicate over a network. Network communication is inherently unreliable - packets can get lost, latency can fluctuate, and servers can become temporarily unreachable. These are all transient errors that the Retry Pattern can handle effectively.

The application of the Retry Pattern isn't limited to network communication. It can be applied anywhere in your system where an operation has a reasonable chance of succeeding after a transient failure. Database operations, filesystem operations, inter-process communication - the Retry Pattern can improve reliability in all these scenarios.
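To make that reusability tangible, here is a minimal sketch of what a generic retry helper might look like. The RetryExecutor class and its names are invented for this illustration rather than taken from a real library; production-grade libraries such as Resilience4j or Spring Retry offer far richer policies, and we'll dig into a real-world Java example in the following sections.

import java.util.concurrent.Callable;

// A minimal, generic retry helper: run any operation up to a fixed number of attempts,
// pausing between attempts. All names here are illustrative, not from a real library.
public class RetryExecutor {

    private final int maxAttempts;
    private final long delayMillis;

    public RetryExecutor(int maxAttempts, long delayMillis) {
        this.maxAttempts = maxAttempts;
        this.delayMillis = delayMillis;
    }

    // Runs the operation, retrying on any exception until the attempt budget is used up.
    public <T> T execute(Callable<T> operation) throws Exception {
        Exception lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.call();          // success: hand the result back
            } catch (Exception e) {
                lastFailure = e;                  // remember why this attempt failed
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMillis);    // give the transient problem time to clear
                }
            }
        }
        throw lastFailure;                        // attempts exhausted: surface the last error
    }

    public static void main(String[] args) throws Exception {
        RetryExecutor retry = new RetryExecutor(3, 1_000);

        // The same helper wraps a database query, a file read, or a remote call equally well.
        String result = retry.execute(() -> "pretend result from some external resource");
        System.out.println(result);
    }
}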

The Retry Pattern not only helps us manage transient failures but also enhances the user experience. Instead of throwing an error at the user at the first sign of trouble, we can make a few more attempts to complete their request. The user might not even notice the hiccup.

As with any design pattern, the Retry Pattern is not a one-size-fits-all solution. It needs to be implemented thoughtfully, considering the nature of your application and the operations you're trying to protect. For example, retrying a failed operation immediately might make sense in a high-speed trading application where every millisecond counts. In contrast, a social media app might choose to wait a bit longer before retrying a failed operation to avoid overloading the servers.
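With a helper like the hypothetical RetryExecutor sketched above, that trade-off largely comes down to configuration. The numbers below are purely illustrative:

// Purely illustrative settings, reusing the hypothetical RetryExecutor shown earlier.
RetryExecutor lowLatency   = new RetryExecutor(2, 0);      // one quick re-attempt, no waiting
RetryExecutor conservative = new RetryExecutor(5, 5_000);  // up to 5 attempts, 5 seconds apart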

Now, before we move ahead, let's address the elephant in the room. Isn't the Retry Pattern just a fancy name for a simple loop that tries an operation until it succeeds? Well, at a high level, it might seem that way. But there's much more to the Retry Pattern than just looping over a piece of code. To truly appreciate its intricacies and understand how to implement it effectively, we need to dive deeper into its architecture and inner workings.

How does the Retry Pattern decide when to retry an operation and when to give up? How does it avoid the thundering herd problem? What happens when the operation being retried has side effects? Let's explore these questions in the following sections. We will also walk through a real-world Java example to understand the Retry Pattern's practical implementation, discuss its performance implications, and look at some typical use cases.

By the end of this journey, you will have a thorough understanding of the Retry Pattern and how it can enhance the reliability and resilience of your distributed system. So, are you ready to dive deep into the Retry Pattern? Let's get started!
