It’s Wednesday, and the accounting team is closing this month’s sales and running end-of-month processing on a multicloud platform deployed four months ago. They run sales order entries on one cloud provider and the accounting application on another. Spanning both clouds is a common security system and API manager, among other services.
What took just a few hours to process from start to finish last month now takes almost a day. Then comes an angry call from the CFO: “What the hell is going on?” Or, put more politely: What’s happening with your multicloud performance this month?
Multicloud deployments and cloud deployments in general behave differently at different levels of stress. There was little stress during processing last month; this month there is a medium level of stress that is causing a serious performance problem.
Those of you who diagnose and fix performance issues already understand this, but if you don’t, here is the best way to think about cloud performance: every interdependent component depends on every other component working well. Problems arise when one component does not pull its weight in the “cloud performance supply chain.” The issue may be network or database latency, memory I/O latency, or storage performance. The result is the same: overall performance suffers.
In our example, any malfunctioning component could have caused a cascading set of events that killed overall performance. In this case, end-of-month processing suffered, even though the load only increased from small to medium stress levels.
Of course, the slowest component sets your overall performance, and the cloud is no different. The bottleneck can be network performance, a slow database, a lack of required CPU resources, or a poorly performing application. These “cloud gremlins” are what cloud architects and developers chase for days, sometimes months; in many cases, they are not easy to trace. So where do you look?
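To make the “slowest component wins” point concrete, here is a minimal sketch with entirely hypothetical per-component latencies. In a serial request path, end-to-end latency is the sum of each component’s latency, so a single slow link (here, an inter-cloud network hop) dominates the total:

```python
# Hypothetical per-request latencies in milliseconds for each component
# in a serial multicloud request path. These numbers are illustrative only.
stage_latency_ms = {
    "api_manager": 12,
    "security_check": 8,
    "inter_cloud_network": 140,  # the "gremlin": a slow inter-cloud link
    "database": 25,
    "application": 30,
}

# End-to-end latency is the sum of the serial stages.
total_ms = sum(stage_latency_ms.values())

# The component contributing the most latency is the one to chase first.
bottleneck = max(stage_latency_ms, key=stage_latency_ms.get)

print(f"End-to-end latency: {total_ms} ms")
print(f"Slowest component:  {bottleneck} ({stage_latency_ms[bottleneck]} ms)")
```

Even though four of the five components are fast, the one slow stage accounts for most of the end-to-end time, which is why fixing anything other than the true bottleneck barely moves the needle.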
The best answer is to employ a good cloud operations and management tool, preferably one that provides operational observability. Instead of making you wade through reams of detailed data (often called noise), an observability tool extracts the meaning from that data. A good tool will usually pinpoint where the performance issue exists and may even identify the root cause.
The network may have a latency problem, which is easy to diagnose. The tool could also trace the problem back to a poorly performing VPN that sends and receives data between one cloud provider and another. This is a frequent problem in multicloud deployments: inter-cloud communications carry heavy traffic and are therefore stressed, so inter-cloud connections need to be maintained more carefully. In fact, in the last few performance issues I was asked to diagnose, the root cause was an inter-cloud communications network issue.
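Before reaching for a full observability suite, even a crude probe can flag a slow inter-cloud link. The sketch below (my own illustration, not any vendor’s tooling; the endpoint name is a placeholder) times a TCP connection to a remote service several times and reports the median, which is often enough to spot a degraded VPN path:

```python
# Crude inter-cloud latency probe: time several TCP connections to an
# endpoint and report the median connect time in milliseconds. A sudden
# jump in this number points at the network path, not the application.
import socket
import statistics
import time

def probe_latency_ms(host: str, port: int, samples: int = 5) -> float:
    """Return the median TCP connect time to host:port in milliseconds."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=5):
            pass  # connection established; close immediately
        timings.append((time.perf_counter() - start) * 1000)
    return statistics.median(timings)

# Example usage (placeholder hostname, not a real endpoint):
# print(probe_latency_ms("db.other-cloud.example.internal", 5432))
```

Running a probe like this from each cloud toward the other, and comparing the numbers over time, quickly separates “the application is slow” from “the link between clouds is slow.”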
Other common issues with multicloud deployments include database performance problems at a single cloud provider that cause latency across multiple applications. Often, the applications themselves are blamed, and code fixes are even ordered. Only when the code fixes fail to help does the database emerge as the culprit. The moral of that story: diagnose first and fix second.
Of course, the list goes on. Multiclouds are complex, distributed platform implementations. Applications and data residing in multiple clouds can also be complex. Performance issues will appear frequently. My best advice is to invest in a good set of cross-cloud cloudops technologies that work across providers and can quickly diagnose the most common problems. Some even provide self-healing services to proactively fix problems. These tools pay for themselves with the first problem they solve.
Copyright © 2023 IDG Communications, Inc.