You might think that your systems are reliable enough, but that won’t matter to your customers if your single hour of downtime falls exactly when they need to use your services. Give end-user applications the maximum amount of uptime by accurately assessing the reliability of your systems and fully managing both planned and unplanned downtime.
How Reliable Are Your Systems, Really?
Assessing your system reliability gives you a picture of how many hours your systems might unexpectedly be offline annually. Hardware vendors do give companies one set of numbers to work from, but this is only a small part of the picture. Because your systems depend on a combination of hardware, software, and networking components, to get a real idea of your system reliability you must take all of these pieces into account.
Network-based systems rely on a minimum of 8 components:
- Power supplies
- Server disk drives
- Application software
- Network connections
- Operating systems (OS)
- Central processing units (CPUs)
- Database management systems
- Network switching and routing devices
To calculate the reliability of your whole system, express each component’s reliability as a decimal and raise it to the power of the number of components that share that reliability. If components have different reliabilities, do this for each group and then multiply the results together. For instance, if you have 5 items with 97% reliability and 3 items with 99% reliability, you would have (.97^5) * (.99^3) = .83, or 83% reliability.
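The calculation above can be sketched in a few lines of Python. This is a minimal illustration, not a full reliability model: it assumes the system needs every component to be working and that components fail independently of one another. The function name `system_reliability` and the `(reliability, count)` input format are made up for this example.

```python
from math import prod


def system_reliability(groups):
    """Return overall reliability for a system that needs every component.

    groups: iterable of (reliability, count) pairs, with reliability given
    as a decimal (0.97 for 97%). Assumes independent failures.
    """
    return prod(reliability ** count for reliability, count in groups)


# The example from the text: 5 items at 97% and 3 items at 99%.
mixed = system_reliability([(0.97, 5), (0.99, 3)])
print(f"{mixed:.2f}")  # prints 0.83
```

Passing a single group, such as `[(0.99, 8)]`, gives the reliability of a system built from identical components.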
Most components of a system are said to have 99% reliability. Unfortunately, this does not mean that the system as a whole has 99% uptime; that would be a large overestimate. If a system depends on 8 components, each with 99% reliability, then the reliability of the system is .99^8 = .92, or 92%. The more components a system has, the lower this number gets (the same scenario for a 10-component system comes out to about 90%). A system with 92% reliability can be expected to be unavailable 8% of the time. In an always-on, 24 x 7 x 365 business environment, that is 8% of the 8,760 hours in a year: about 700 hours, or a little over 29 days of downtime per year.
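To make the conversion from reliability to hours concrete, here is a small Python sketch. It assumes a 365-day year of round-the-clock operation (8,760 hours) and identical, independent components; the name `expected_downtime_hours` is hypothetical. Note that the paragraph above rounds reliability to 92% before converting, so the unrounded result is somewhat lower than 700 hours.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours in an always-on year


def expected_downtime_hours(component_reliability, component_count):
    """Estimate annual unplanned downtime for a system of identical components.

    Assumes the system needs every component and that failures are independent.
    """
    system = component_reliability ** component_count
    return (1 - system) * HOURS_PER_YEAR


hours = expected_downtime_hours(0.99, 8)
print(f"{hours:.0f} hours, or about {hours / 24:.0f} days")  # prints 677 hours, or about 28 days
```

Either way, the order of magnitude is the same: roughly four weeks of downtime per year for a system whose parts are each 99% reliable.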
Quantifying Planned Outages
While many people consider unplanned downtime (usually caused by natural disasters, failed hardware, hacked systems, etc.) to be more devastating, planned downtime is much more common, accounting for about 90% of all system downtime. This is because processes like backups, updates, upgrades, and maintenance are routine and unavoidable. Since these processes typically follow predictable schedules at reliable frequencies, estimating the quantity (and by extension, the impact) of planned downtime is simpler and typically more accurate.
To estimate the amount of planned downtime that your company can expect, you must first complete a thorough audit of all normal maintenance activities. This includes processes like database backups, updates, reorganizations, hardware replacements, and so on. For each process, calculate the historical average downtime per episode and keep it accessible. Adjust each historical average for any growth trends, then multiply the adjusted average for each process by the number of times that process is performed annually. Adding up the results gives your estimated planned downtime for the year. Although processes such as hardware and software updates aren’t always consistent, these averages give you a good guide to the frequency and duration of necessary downtime.
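The estimate described above amounts to a sum of products, which the following Python sketch illustrates. The `MaintenanceActivity` class, its field names, and all of the figures are hypothetical placeholders; substitute the numbers from your own audit. The growth adjustment is modeled here as a simple multiplier, which is one possible way to apply a trend.

```python
from dataclasses import dataclass


@dataclass
class MaintenanceActivity:
    name: str
    avg_hours_per_episode: float  # historical average downtime per episode
    growth_factor: float          # trend adjustment, e.g. 1.10 for 10% growth
    episodes_per_year: int        # how often the process runs annually

    def annual_hours(self):
        """Adjusted average per episode times the number of episodes per year."""
        return self.avg_hours_per_episode * self.growth_factor * self.episodes_per_year


def total_planned_downtime(activities):
    """Sum the annual planned downtime across all audited activities."""
    return sum(activity.annual_hours() for activity in activities)


# Illustrative figures only, not taken from any real audit.
audit = [
    MaintenanceActivity("database backup", 0.5, 1.10, 52),
    MaintenanceActivity("OS updates", 2.0, 1.00, 12),
    MaintenanceActivity("hardware replacement", 4.0, 1.00, 2),
]

for activity in audit:
    print(f"{activity.name}: {activity.annual_hours():.1f} hours/year")
print(f"Total planned downtime: {total_planned_downtime(audit):.1f} hours/year")
```

Keeping each activity as its own record makes it easy to see which processes contribute the most downtime and to update a single average when the audit is repeated.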
Once this is complete, you will have a good idea of your system reliability and how much planned downtime you can expect. You can then calculate hourly costs and other damages from the specific consequences of downtime on your systems.