It’s been a rough week for some IT guys. United Airlines, the New York Stock Exchange, and TD Ameritrade all had very public network failures. So far, no one has claimed that anything nefarious was going on. That is, these shortcomings were not the result of any cyber attacks or hacking attempts. Rather, they appear to have been the result of failure in standard operating procedures.
How can such a thing happen?
To some degree, such failures are not so much a question of if, but of when. No system is perfect and problems of one sort or another are inevitable. The question is what is an acceptable rate of failure. That is, if the NYSE cannot be guaranteed to be always working perfectly, what is an acceptable rate at which to have disruptions?
The Wall Street Journal had an article that relates to this point (What We Learned From the NYSE, United Airlines Tech Outages, Jul 9). The article notes that old-fashioned landline telephones had an uptime of 99.999%; that is, Ma Bell would leave you without a working phone for only about five minutes out of every year. Of course, old-school phone systems were regulated and nudged to that level of reliability by their overseers. Unregulated, private networks aren’t held to quite the same standard. The article claims that firms are generally unwilling to invest enough to bring their reliability up to that of a landline. It also notes that there are other things going on.
Today’s problems with reliability are more fundamental, a reflection of the complexity of contemporary networks, the volume of data, the pace of change, insufficient organizational and cultural practices, and a legacy of arcane and poorly written business software that traditionally put little emphasis on usability or customer experience.
Outages persist because of the interdependency of computer systems, fueled by the rise of digital services across all industries, particularly those with customer-facing software such as mobile apps, according to former NYSE Euronext CIO Paul Cassell, now CIO of Pico Quantitative Trading LLC.
Complexity clearly matters. The more systems interact, the greater the impact of poor reliability. A component that achieves an Ivory soap level of uptime (i.e., 99.44%) is generally available when needed. However, if a transaction requires each of ten such components to be working, things can get dicey. Assuming the components fail independently, 0.9944^10 ≈ 0.9454, so the chance of a transaction failing (about 5.5%) is much higher than the chance of any one component failing (0.56%). To achieve “five nines” of reliability at the transaction level, the reliability of each component must be well above 99.999%; with ten such components, each would need roughly 99.9999%.
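For readers who want to check the arithmetic, here is a minimal sketch in Python. It assumes that components fail independently and that a transaction needs every one of them to be up, which is a simplification of how real systems behave. The function names are made up for illustration and do not come from the article.

```python
# Back-of-the-envelope reliability arithmetic.
# Assumption: components fail independently, and a transaction
# succeeds only if every component it touches is working.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_minutes_per_year(availability: float) -> float:
    """Expected minutes of downtime per year at a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def series_availability(component_availability: float, n_components: int) -> float:
    """Availability of a transaction that needs all n components at once."""
    return component_availability ** n_components


def required_component_availability(target: float, n_components: int) -> float:
    """Per-component availability needed to hit a transaction-level target."""
    return target ** (1.0 / n_components)


if __name__ == "__main__":
    # "Five nines" works out to a little over five minutes a year.
    print(f"{downtime_minutes_per_year(0.99999):.2f} minutes/year")

    # Ten Ivory-soap components chained together.
    chained = series_availability(0.9944, 10)
    print(f"transaction availability: {chained:.4f}")
    print(f"transaction downtime: {downtime_minutes_per_year(chained):.0f} minutes/year")

    # What each of ten components needs for five nines overall.
    needed = required_component_availability(0.99999, 10)
    print(f"needed per component: {needed:.7f}")
```

The point of the last function is that the per-component target grows stricter as the number of components rises, so every added dependency raises the bar for all the others.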
Complexity also complicates recovery. The more points at which failure can happen, the more difficult it can be to identify just what failed. More generally, even if the whole process doesn’t shut down, complexity can make it hard to replicate particular failure modes. For example, short of having all its systems go kaput, United might have issues rebooking passengers traveling on a particular route who have bought a particular fare class through a particular travel agent. Knowing which variable drives the failure can be far from obvious.
Should we ever expect these private networks to match old-fashioned phones? The article suggests not.
Underneath it all, the economics of falling prices carry a trade-off. Consumers get more for their money in the mobile, digital era, but that often leaves margin-stretched companies with fewer resources to invest in robustness and maintenance. Reliability is as much a function of business and risk management as it is about tech.
It is not clear that customers will pay for higher reliability, at least reliability that affects convenience as opposed to, say, safety. United’s computer systems may be balky but its planes don’t fall from the sky. A high failure rate on the latter would scare away customers. Uncertainty about system-wide computer failures, even if they happen several times per year, doesn’t figure into the purchase decision in the same way. Unless customers are going to pay for admittedly small increases in reliability, firms are unlikely to break the bank to deliver those incremental gains.