Business applications are complex beasts, filled with interdependencies, feedback loops, and performance bottlenecks that are not always fully understood. When failures happen, there is always a scramble to figure out what went wrong, how to fix it, and, critically, how to prevent it from happening again. To get their heads around these systems and how they fail, IT teams tend to look at them through metaphors of failure. Today we’ll discuss some of these ideas and how they can help, or hurt, your ability to build more resilient systems.
The “Weakest Link” isn’t Root-Cause Analysis
The weakest link metaphor is obvious – a physical chain can’t be more reliable than any single link. You’ll sometimes hear people apply it to an always-on application or platform, arguing that a system can never be more reliable than its least reliable “link in the chain”. It’s an attractive model because it’s so straightforward. But it’s also wrong. Systems go down because of cascading failures; it’s almost never one thing alone. One small failure feeds back into other systems and triggers further failures until, eventually, the whole system fails. If you only look for the failure that happened immediately before the crash, you’re not finding the root cause; you’re just finding the component that failed last.
The “Last Straw” isn’t Fault Isolation
We’ve all heard the story about the straw that broke the camel’s back, as if camels or IT systems had only two states: “not overloaded” and “overloaded”. This is another comfortable fiction that we need to stop using. The Last Straw idea leads to counterproductive efforts such as bandwidth throttling and connection limits. These efforts might prevent another full-on failure, but a throttled service is hardly “always-on”, and you still haven’t diagnosed the true cause of your last failure. The system was probably suffering from multiple cascading failures, and this “last straw” was just a red herring.
Your ITIL Process Doesn’t include the “Frayed Rope” Metaphor
The frayed rope metaphor looks at complex IT systems as if they were a rope being dragged over a sharp rock or a dockside cleat. Individual threads break one at a time. Each thread that breaks puts more stress on the remaining threads, which then fail in turn, eventually causing an accident. In an IT system, the threads that fail are individual components: a drive, a network adapter, or a software microservice. Each failure puts more stress on the remaining components through a feedback loop (added latency, increased heat, shifted queries), causing more subsystems to fail: more drives, more network cards, and even whole server nodes. System capability degrades until, eventually, you can no longer keep the service running.
So is the shutdown the fault of that last drive or server that failed? Of course not. No more than the rope breaking is the fault of that last remaining thread finally giving way. An experienced climber or sailor is always inspecting their ropes and would never rely on a frayed or damaged line. And an experienced IT operations team will be constantly on the lookout for minor issues within the system, addressing them before they risk a complete failure.
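To make the frayed rope dynamic concrete, here is a minimal toy simulation in Python. It is only a sketch under simplifying assumptions: a fixed total load is shared equally among the surviving components, and a component fails as soon as its share exceeds its capacity. The function name `simulate_cascade` and all the numbers are hypothetical and do not describe any real product or cluster.

```python
def simulate_cascade(capacities, total_load, initial_failures):
    """Toy equal-load-sharing model of a cascading failure.

    capacities       -- maximum load each component can carry (hypothetical units)
    total_load       -- load that must be shared by all surviving components
    initial_failures -- indices of components that fail first, for unrelated reasons

    Returns (failure_order, survivors).
    """
    alive = set(range(len(capacities))) - set(initial_failures)
    failure_order = list(initial_failures)

    while alive:
        # Assumption: the load is split evenly across whatever is still running.
        share = total_load / len(alive)
        overloaded = sorted(i for i in alive if capacities[i] < share)
        if not overloaded:
            break  # the remaining components can carry the load
        for i in overloaded:
            alive.remove(i)
            failure_order.append(i)

    return failure_order, alive


# Five components, each comfortably carrying 20 units while all are healthy.
capacities = [30, 24, 35, 50, 28]

order, survivors = simulate_cascade(capacities, total_load=100, initial_failures=[0])
print("failure order:", order)
print("survivors:", survivors)
```

In this run, the loss of component 0 pushes the per-component share from 20 to 25 units, which overloads component 1, and the failures continue until nothing is left. The last component to fail is component 3, the one with the highest capacity, which is exactly why blaming the final failure says little about the cause. With no initial failure, the same system carries the load indefinitely.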
No matter what metaphor your team uses, IT resilience challenges are real
IT operations teams need the skills and tools to monitor and maintain every component in their complex systems and to proactively address minor failures before they impact performance or availability. The rise of cloud computing and its various service offerings is proof that IT teams don’t always have the desire to manage systems and components at this level. Cloud providers, even though they do have the skills and tools, still rely on providers like StorPool to manage the storage systems and components for them, freeing them to focus on higher-value projects.
StorPool’s storage architecture ensures no single point of failure and guarantees its continuous operation even if a component becomes unavailable. StorPool Analytics tracks detailed performance and availability metrics across the cluster, often predicting component failures before they happen. StorPool Alerts and Reporting ensures timely notification of any at-risk components so that they can be replaced before they fail. We can’t address every aspect of your IT systems, but with StorPool, you can be confident that your storage won’t be your final straw, or weak link, or last strand.
Learn more at https://storpool.com/get-started