MICROSERVICES: CODE FOR RESILIENCE, DESIGN FOR FAILURE

April 20, 2021

Mark van der Walle

Many organizations are building or migrating towards a more distributed architecture. Advancements in cloud computing and container orchestration over the last few years have accelerated the development of microservices at scale. Such a microservices architecture brings many benefits for scalability and business agility. There is, however, a price to pay for using a microservices architecture. One of the biggest challenges to overcome is the extra complexity in both infrastructure and application logic.

Fallacies of distributed computing

Many years ago, when monoliths still roamed the earth, a few guys at Sun Microsystems came up with the Fallacies of distributed computing. These are eight, obviously flawed, assumptions about distributed systems that any software engineer or architect will have to deal with when building and designing a microservices architecture.

Most of these fallacies deal with networking, such as “the network is reliable” and “latency is zero”. Others deal with security or the operators of the ecosystem.

Dealing with failures that come from using a microservices architecture is an interesting conundrum that has its effects on technical as well as functional aspects of the architecture.

What is failure?

Let’s have a look at what failure can look like. Imagine a customer of a video streaming service browsing through the catalog. The catalog shows many titles in different categories, as well as recommendations tailored to the customer’s previous selections. Now imagine that these recommendations are powered by a recommendation service with a special type of data store, and that this data store is suffering from heavy load, resulting in timeouts for the customer’s request.

Code for resilience

There are many ways to deal with failures in services or the network. One of them is to retry retrieving the data from the data store. This pattern is useful on many occasions. However, in this case it might make the problem worse: the data store is already overloaded, and extra requests will not improve the situation.
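As an illustration of the retry pattern, here is a minimal sketch in Python. The function name `retry`, its parameters, and the choice to treat `TimeoutError` as the retryable failure are assumptions made for this example. The growing delay between attempts (exponential backoff) is one common way to avoid hammering a struggling data store, but it does not remove the risk described above.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying on TimeoutError with exponential backoff.

    All names and defaults here are hypothetical; tune them per service.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except TimeoutError:
            # Out of attempts: let the caller see the original failure.
            if attempt == attempts - 1:
                raise
            # Wait base_delay, then 2x, then 4x, ... before the next try.
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

Capping the number of attempts matters as much as the backoff: an unbounded retry loop turns one slow request into a steady stream of extra load.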

A second pattern to use in such a case is the circuit breaker. This is exactly what it sounds like. After a number of failures the circuit breaker opens and no further requests will be made towards the data store. But what about the response to the customer? There are a number of things that can be done.
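A circuit breaker can be sketched in a few lines of Python. This is a simplified illustration, not a production implementation: the class and parameter names are hypothetical, and the `reset_timeout` that lets a single trial request through after a while (often called the half-open state) is an assumption added here, since the text above only describes the opening of the breaker.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the protected operation."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means closed

    def call(self, operation: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Open: fail fast without touching the data store.
                raise CircuitOpenError("circuit is open; call skipped")
            # Timeout elapsed: allow one trial call through (half-open).
        try:
            result = operation()
        except Exception:
            self._failures += 1
            failed_trial = self._opened_at is not None
            if failed_trial or self._failures >= self.failure_threshold:
                # Open (or re-open) the circuit and restart the timer.
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each data store request in `breaker.call(...)` and treat `CircuitOpenError` as the signal to use a fallback response instead.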

Design for failure

Coding for resilience is a technical approach and works well for intermittent failures. For larger outages, however, another approach is more suitable. In the previous section we spoke about the circuit breaker pattern. But what to do when the circuit breaker is in an open state? The first approach that could be taken is returning the last known good state. Recommendations do not change every few minutes, so using a stale cached state will probably be fine. However, this might not always be possible. The cache might not have been populated for this customer, or the cached state might have been evicted due to time-to-live configurations.
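Returning the last known good state could look roughly like the Python sketch below. The class name `LastKnownGood`, the `fetch` callable, and the in-memory dictionary are all assumptions for illustration; a real service would more likely use a shared cache with its own eviction rules. Note that `get` returns `None` when nothing is cached, which is exactly the case where another fallback is needed.

```python
from typing import Callable, Dict, List, Optional


class LastKnownGood:
    """Serve fresh recommendations when possible, stale ones otherwise."""

    def __init__(self, fetch: Callable[[str], List[str]]) -> None:
        self._fetch = fetch  # hypothetical call to the recommendation service
        self._cache: Dict[str, List[str]] = {}

    def get(self, customer_id: str) -> Optional[List[str]]:
        try:
            fresh = self._fetch(customer_id)
        except Exception:
            # Service failed or the circuit is open: fall back to the
            # last successful response, or None if there is none.
            return self._cache.get(customer_id)
        # Remember a copy of the successful response for later outages.
        self._cache[customer_id] = list(fresh)
        return fresh
```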

In that case it might be better not to show the personalised recommendations at all. Showing a standard set instead, or simply hiding the UI component, are both good options.
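The choice between a standard set and a hidden component can be made explicit in code. The following Python sketch is an assumption-laden illustration: `RecommendationPanel`, `build_panel`, and the `DEFAULT_TITLES` list are hypothetical names, and the `use_default_set` flag stands in for whatever the business decides.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical non-personalised titles, e.g. what is popular right now.
DEFAULT_TITLES = ["popular-1", "popular-2", "popular-3"]


@dataclass
class RecommendationPanel:
    visible: bool
    titles: List[str]
    personalised: bool


def build_panel(
    personalised: Optional[List[str]],
    use_default_set: bool = True,
) -> RecommendationPanel:
    """Decide what the UI shows when personalised data may be missing."""
    if personalised:
        # Normal case: personalised recommendations are available.
        return RecommendationPanel(True, personalised, True)
    if use_default_set:
        # Degraded case: show a standard set instead.
        return RecommendationPanel(True, list(DEFAULT_TITLES), False)
    # Degraded case: hide the component entirely.
    return RecommendationPanel(False, [], False)
```

Passing `None` or an empty list for `personalised` triggers the degraded path, so the caller does not need to know why the recommendations are missing.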

User Experience

Handling failure is all about user experience. In our example the personalised recommendations are not core functionality and can easily be turned off when needed. The customer can still watch video streams and will probably not even notice any difference in their experience.

This makes dealing with failure a business decision. The business can best decide which features are absolutely critical for the user experience, and what approach to take for core functionality versus features that fit in a more nice-to-have category.

Failure is inevitable

Failure is a given, and every additional moving part increases the chance that something, somewhere, fails. Designing for failure and incorporating resilience patterns is a must to provide a high level of user experience.
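To get a feel for how quickly the risk accumulates, the small Python sketch below computes the chance that at least one service is unavailable. It rests on strong simplifying assumptions that are not from this article: every service has the same availability, failures are independent, and a request needs all services. The function name and the 99.9% figure are purely illustrative.

```python
def chance_of_any_failure(availability: float, service_count: int) -> float:
    """Probability that at least one of `service_count` services is down.

    Assumes identical, independent services (a deliberate simplification).
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if service_count < 0:
        raise ValueError("service_count must not be negative")
    # All services are up with probability availability ** service_count.
    return 1.0 - availability ** service_count


if __name__ == "__main__":
    # Compare a single service with chains of 10 and 50 services.
    for count in (1, 10, 50):
        print(count, round(chance_of_any_failure(0.999, count), 4))
```

Real systems rarely fail independently, so these numbers are best read as a rough intuition rather than a forecast.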

About the author

Mark is an experienced software architect with more than 14 years of professional experience in software development and operations. Throughout his career, Mark has always had a drive to design and build reliable and simple solutions to complex problems.

Mark van der Walle

Lead Software Architect | Netherlands
