Whenever you operate an application in production, things go wrong at some point. This is an unfortunate truth of designing, building and operating applications at any scale. If you run a larger landscape and split that landscape into microservices, the number of moving parts and integrations grows rapidly, and the chances of failure grow at least as fast.
Now, knowing that things will go wrong, how do we find the cause? How do we understand what happened, where and why? Was it application X? Did it happen on the call from application X to application Y? Or was it something else entirely?
To figure this out we need the right information, not just on one application, but on the entire application landscape and all its integration points. This is what observability is all about. By ensuring that each application provides the right information, a DevOps team should be able to quickly identify where, what and when a problem occurred or started occurring.
So what is needed to make this kind of insight possible? That obviously depends on the landscape, but in general it is a combination of the following:
● Application metrics
● Log aggregation
● Distributed tracing
● Audit logging
● Change and deployment logging
● Exception tracking
● Health check API
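The health check API is the simplest of these to sketch. The following is a minimal example using only the Python standard library; the specific dependency checks, the `/health` path and the JSON shape are illustrative assumptions, not a prescribed format:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


# Hypothetical dependency checks; real ones would ping the database,
# the message queue, and so on, with a short timeout.
def check_database() -> bool:
    return True


def check_message_queue() -> bool:
    return True


def build_health_report() -> dict:
    """Aggregate the individual checks into one overall status."""
    checks = {"database": check_database(), "message_queue": check_message_queue()}
    return {"status": "UP" if all(checks.values()) else "DOWN", "checks": checks}


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            report = build_health_report()
            self.send_response(200 if report["status"] == "UP" else 503)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(json.dumps(report).encode())
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        pass  # keep the example quiet


# To serve it: HTTPServer(("", 8080), HealthHandler).serve_forever()
```

In a real service the individual checks should time out quickly, so that a slow dependency can never take the health endpoint itself down.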
An application needs to deliver all of this to a central store, through either a push or a pull method. In the push model the application sends events, logs and metrics to a central observability platform. In the pull model the central platform connects to the application and retrieves the data for storage.
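As a sketch of the difference, assume a service keeping simple in-memory counters. In the pull model the platform scrapes a snapshot the service exposes; in the push model the service posts that same snapshot itself. The collector URL, metric names and payload format below are hypothetical:

```python
import json
import time
import urllib.request

# In-memory counters; a real service would use a proper metrics library.
METRICS = {"checkout_requests_total": 0, "checkout_failures_total": 0}


def record_checkout(success: bool) -> None:
    METRICS["checkout_requests_total"] += 1
    if not success:
        METRICS["checkout_failures_total"] += 1


# Pull model: the platform periodically fetches this snapshot from an
# endpoint the application exposes.
def metrics_snapshot() -> str:
    return json.dumps({"timestamp": time.time(), "metrics": METRICS})


# Push model: the application sends the same snapshot to the platform.
# The collector URL is a made-up example.
def push_metrics(collector_url: str = "http://observability.internal/ingest") -> None:
    req = urllib.request.Request(
        collector_url,
        data=metrics_snapshot().encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req, timeout=5)
```

The pull model keeps the scrape schedule and retry logic in one place (the platform); the push model works better behind firewalls or for short-lived jobs that may not live long enough to be scraped.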
Now that we have all this data, how do we get to the right information? For this we need correlation. Correlating logs, deployments, changes to the platform and all of the above is key to finding the root cause.
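One common way to make logs correlatable is to stamp every log line with a correlation (or trace) ID that travels with a request through every service. Here is a minimal sketch using Python's standard `logging` module; in practice the ID would come from an incoming header (the name `X-Correlation-ID` is just a convention) rather than being generated locally:

```python
import logging
import uuid


def new_correlation_id() -> str:
    """Generate an ID; real services would reuse one from the incoming request."""
    return uuid.uuid4().hex


class CorrelationFilter(logging.Filter):
    """Stamps every log record with the current correlation ID, so the
    central log store can group lines from different services by request."""

    def __init__(self, correlation_id: str):
        super().__init__()
        self.correlation_id = correlation_id

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = self.correlation_id
        return True


def build_logger(correlation_id: str) -> logging.Logger:
    logger = logging.getLogger("checkout")
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler()
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(correlation_id)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    logger.addFilter(CorrelationFilter(correlation_id))
    return logger
```

With the same ID present in the logs of every service a request touched, the log aggregation platform can reconstruct the full path of a single failing checkout.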
Let us look at an example. Imagine a large e-commerce company where suddenly a portion of all checkouts fail. The teams maintaining the platform become aware of this problem because they have dashboards in place for the different components and processes. The metrics from the checkout service show that, from a certain point in time, the instance in one of the three datacenters stopped receiving traffic.
A follow-up look at either the traces from the API gateway or the log events shows that all calls towards this service are being rejected. The teams then look at the changes implemented at the time the problem started occurring and notice that a network change went in at exactly that moment. This change can either be rolled back or updated to fix the problem.
Without proper metrics, logs and correlation with deployments and changes it is very hard to figure out what is going wrong, let alone the cause of the problem. The example is of course very much simplified. In practice there are many more services, moving parts, ongoing changes, hardware failures and anything else Murphy will throw at us.
There is also a danger in having metrics, logs and more on everything: it can quickly lead to information overload and alert fatigue. It is therefore important to have dashboards and alerts on those things that are most critical to the functioning of the application platform and its business processes. The observability platform should provide a means to dive deeper into the data when needed, but also an overview that surfaces important issues without effort.
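One way to keep alerts focused on real problems is to alert on rates over a sliding window rather than on individual errors. A minimal sketch, where the window size and threshold are arbitrary example values:

```python
from collections import deque


class ErrorRateAlert:
    """Fires only when the failure rate over a sliding window crosses a
    threshold, instead of paging someone for every single error."""

    def __init__(self, window: int = 100, threshold: float = 0.05):
        self.results = deque(maxlen=window)  # True = success, False = failure
        self.threshold = threshold

    def observe(self, success: bool) -> bool:
        """Record one result; return True if the alert should fire."""
        self.results.append(success)
        if len(self.results) < self.results.maxlen:
            return False  # not enough data yet
        failures = self.results.count(False)
        return failures / len(self.results) > self.threshold
```

A single transient failure never fires; a sustained failure rate does. Tuning the window and threshold per signal is exactly the kind of decision each team should own.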
Last but not least, each team should decide what data they want stored and what they consider important for the proper operation of their piece of the puzzle. For some teams this might be the latency of database calls, for others the size and growth of the message queue.
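For a team that cares about the latency of database calls, recording it can be as simple as a timing decorator. The metric name and the stand-in query below are placeholders for illustration:

```python
import time
from functools import wraps

# In-memory store of observed latencies per metric name; a real setup
# would feed these into a histogram in the metrics pipeline.
LATENCIES: dict[str, list[float]] = {}


def timed(name: str):
    """Record the wall-clock duration of each call under the given name."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                LATENCIES.setdefault(name, []).append(time.perf_counter() - start)
        return wrapper
    return decorator


@timed("db.query_orders")
def query_orders():
    time.sleep(0.01)  # stand-in for a real database call
    return []
```

The point is not the mechanism but the ownership: each team instruments the calls that matter for their service and decides what "healthy" looks like for them.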
Getting started with observability is not an easy task, as there are many decisions to be made, but the technology to make all of this possible is there. It is even possible to apply AI and machine learning to predict failures and prevent them from happening at all. But that is a story for another time.