Many organizations are building or migrating towards a more distributed architecture. Advances in cloud computing and container orchestration over the last few years have accelerated the development of microservices at scale. A microservices architecture brings many benefits for scalability and business agility. There is, however, a price to pay for using one. One of the biggest challenges to overcome is the extra complexity in both infrastructure and application logic.
Fallacies of distributed computing
Many years ago, when monoliths still roamed the earth, a few guys at Sun Microsystems came up with the Fallacies of distributed computing. These are eight, obviously flawed, assumptions about distributed systems that any software engineer or architect will have to deal with when building and designing a microservices architecture.
Most of these fallacies deal with networking, such as “the network is reliable” and “latency is zero”. Some of the others deal with security or the operators of the ecosystem.
Dealing with failures that come from using a microservices architecture is an interesting conundrum that has its effects on technical as well as functional aspects of the architecture.
What is failure?
Let’s have a look at what failure can look like. Imagine a customer of a video streaming service browsing through the catalog. The catalog shows many titles in different categories, as well as recommendations tailored to the customer’s previous selections. Now imagine that these recommendations are powered by a recommendation service with a special type of data store, and that this data store is suffering from heavy load, resulting in timeouts for the customer’s request.
Code for resilience
There are many ways to deal with failures in services or on the network. One of them is to retry retrieving the data from the data store. This pattern is useful on many occasions. However, in this case it might make the problem worse: the data store is already overloaded, and extra requests will not improve the situation.
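To make the retry pattern concrete, here is a minimal sketch in Python. It assumes a hypothetical `DataStoreTimeout` error and a caller-supplied `operation` function, neither of which comes from a real library. The number of attempts is bounded and the wait between attempts grows and is randomised, which limits (but does not remove) the extra load retries put on a struggling data store.

```python
import random
import time


class DataStoreTimeout(Exception):
    """Hypothetical error raised when the data store does not answer in time."""


def retry_with_backoff(operation, max_attempts=3, base_delay=0.1, max_delay=2.0,
                       sleep=time.sleep):
    """Call operation(), retrying on DataStoreTimeout with capped, jittered backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except DataStoreTimeout:
            if attempt == max_attempts:
                # Out of attempts: let the caller decide what to show instead.
                raise
            # Double the wait on every attempt, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: wait a random time up to the computed delay so that
            # many clients do not all retry at the same moment.
            sleep(random.uniform(0, delay))
```

The `sleep` parameter is only there so the waiting can be replaced in tests. Even with backoff, every retry is still an extra request, which is why retries alone are a poor fit for an overloaded data store.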
A second pattern to use in such a case is the circuit breaker. This is exactly what it sounds like. After a number of failures the circuit breaker opens and no further requests will be made towards the data store. But what about the response to the customer? There are a number of things that can be done.
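A circuit breaker can be sketched in a few lines of Python. This is an illustration under assumptions, not a production implementation: the class name, the `failure_threshold` and `reset_timeout` parameters and their defaults are all made up for the example, and the sketch is not thread-safe. It counts consecutive failures, opens once the threshold is reached, rejects calls while open, and lets a single trial call through after the timeout.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast without touching the data store.
                raise CircuitOpenError("circuit is open")
            # Timeout elapsed: half-open, let this one trial call through.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                # Trip the breaker, or re-open it after a failed trial call.
                self.opened_at = self.clock()
            raise
        # A success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each data store request in `breaker.call(...)` and treat `CircuitOpenError` as the signal to fall back to something else instead of waiting for another timeout.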
Design for failure
Coding for resilience is a technical approach and works well for intermittent failures. For larger outages, however, another approach is more feasible. In the previous section we spoke about the circuit breaker pattern. But what to do when the circuit breaker is in an open state? The first approach that could be taken is returning the last known good state. Recommendations do not change every few minutes, so using a stale cached state will probably be fine. However, this might not always be possible. The cache might not have been populated for this customer, or the cached state might have been evicted due to time-to-live configurations.
In that case it might be more feasible to not show the personalised recommendations at all. Showing a standard set of recommendations or simply hiding the UI component are both good options.
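The fallback order described above (fresh result, then last known good state, then a standard set or a hidden component) could look roughly like this in Python. Everything here is hypothetical and only for illustration: the `RecommendationCache` class, the `recommendations_for` function, the one-hour time-to-live and the default titles.

```python
import time

# Hypothetical standard set shown when nothing personalised is available.
DEFAULT_RECOMMENDATIONS = ["Popular title A", "Popular title B"]


class RecommendationCache:
    """Tiny in-memory cache holding the last known good result per customer."""

    def __init__(self, ttl_seconds=3600.0, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._entries = {}

    def put(self, customer_id, titles):
        self._entries[customer_id] = (self.clock(), titles)

    def get(self, customer_id):
        entry = self._entries.get(customer_id)
        if entry is None:
            return None  # never populated for this customer
        stored_at, titles = entry
        if self.clock() - stored_at > self.ttl_seconds:
            del self._entries[customer_id]  # evicted by time-to-live
            return None
        return titles


def recommendations_for(customer_id, fetch_personalised, cache, show_defaults=True):
    """Return (source, titles); titles is None when the UI component should be hidden."""
    try:
        titles = fetch_personalised(customer_id)
        cache.put(customer_id, titles)
        return "personalised", titles
    except Exception:
        # Broad catch to keep the sketch short; real code would be more selective.
        cached = cache.get(customer_id)
        if cached is not None:
            return "stale-cache", cached
        if show_defaults:
            return "default", DEFAULT_RECOMMENDATIONS
        return "hidden", None
```

The `show_defaults` flag stands in for the choice between a standard set and hiding the component, so that choice lives in one place and can be changed without touching the failure handling.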
Handling failure is all about user experience. In our example the personalised recommendations are not core functionality and can easily be turned off when needed. The customer can still watch video streams and will probably not even notice anything different in their experience.
This makes dealing with failure a business decision. The business can best decide which features are absolutely critical for the user experience, and what approach to take when core functionality fails compared to features that fit in a more nice-to-have category.
Failure is inevitable
Failure is a given, and with more moving parts the chance that something fails on any given request grows quickly. Designing for failure and incorporating resilience patterns is a must to provide a high-quality user experience.
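To put a rough number on this, consider a request that needs several services to all respond. The sketch below assumes that every service has the same availability and that failures are independent, which is a simplification; the function name and the 99.9% figure are illustrative only.

```python
def chance_of_failure(per_service_availability, service_count):
    """Probability that at least one of the services a request depends on fails,
    assuming independent failures and equal availability per service."""
    return 1.0 - per_service_availability ** service_count


for count in (1, 10, 50):
    print(count, round(chance_of_failure(0.999, count), 4))
```

Under these assumptions, a request that touches ten services at 99.9% availability each fails about 1 time in 100, and one that touches fifty fails close to 1 time in 20.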
About Mark van der Walle
Mark is an experienced software architect with more than 17 years of professional experience in software development and operations. Throughout those years Mark has always had a drive for designing and building reliable and simple solutions to complex problems. To realize this, Mark has a strong focus on quality backed by solid engineering, CI/CD pipelines, DevOps principles, craftsmanship and observability. At his customers Mark guides development teams and supports business stakeholders in building cloud native applications and going through cloud native transitions.