Thursday, January 15, 2015

Some Very Expensive Software Failures

Why Concentrate on Failure?

"So long as a man attends to his business the public does not count his drinks. When he fails they notice if he takes even a glass of root beer." - Corra May Harris

Logically, direct measurement of value should be the first place an organization starts to look at itself, but that's not how it usually happens. Instead, the trigger for most organizations to embark on some self-examination is failure—either one whale of a failure or thousands of annoying failure mosquitoes.

This concentration on failure may seem illogical—and may be illogical in many circumstances—but it does fit with our understanding of quality as subjective value. Of all the troublesome aspects of using computers, failures are by far the most annoying to the most people. Without ever conducting a detailed impact case study, or even a greatest single benefit study, people know that they don't like it when their computer fails. Thus, customers heap abundant praise and appreciation on the software organization that doesn't fail them.

Of course, the definition of failure changes with time, as expectations change. Once customers become accustomed to a certain level of service, a lapse from that level becomes a failure. Some customers have come to expect a succession of "breakthroughs" in software, so that achieving only a modest gain is seen as a failure. Thus, the first step in managing failures is to manage customer expectations—but that's always the first step in managing quality.

What Do Failures Cost?

Some perfectionists in software engineering are overly preoccupied with failure, and most others don't rationally analyze the value they place on failure-free operation. Nonetheless, when we do measure the cost of failure carefully, we generally find that great value can be added by producing more reliable software. In this section, we'll take a look at a few examples that should convince you.

Case history 1: A national bank

The national bank of Country X issued loans to all the banks in the country. Each loan was confirmed by a telegram showing the amount of the loan, the repayment conditions, and the interest rate. The telegram became the legal document for the loan. The COBOL program that composed and sent these telegrams had been in operation for almost 15 years, and had worked flawlessly. Somebody noticed, however, that the serial number field would run out of digits and begin duplicating serial numbers within a few months. As each loan was legally identified by the serial number on the telegram, duplication could not be allowed.

Management directed that the serial number field be expanded. The programming manager assigned the job to one of the team leaders, who gave it to a programmer, saying, "Expand the serial number field by two digits." The programmer made this trivial change, ran a few tests, and the system was put into operation the next day. Everything worked fine.

Some time later, a financial analyst noticed a slight discrepancy between estimated loan receipts and actual loan receipts. After much searching, it was discovered that the serial number expansion had overlaid the low order digits of the interest rate field, causing the final two digits of every interest rate to be truncated to "00." Although the difference between 7.3845% and 7.3800% is quite small, when you are lending hundreds of billions of dollars, it quickly adds up to something significant. In this case, it added up to more than a billion dollars that the national bank could never recover.
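To make the mechanism concrete, here is a minimal sketch in Python (not the bank's COBOL) of how widening one fixed-width field can overlay its neighbor. The record layout, the field widths, and the names compose_record and read_rate are all assumptions made up for illustration; the story does not give the real layout. The sketch assumes the rate sits just before the serial number and that the record length was never enlarged.

```python
# Hypothetical fixed-width record: [ rate: 6 digits ][ serial: 6 digits ]
# The rate is stored as digits with four implied decimals (7.3845% -> "073845").
RATE_WIDTH = 6
OLD_SERIAL_WIDTH = 6
NEW_SERIAL_WIDTH = 8  # "Expand the serial number field by two digits."
RECORD_LEN = RATE_WIDTH + OLD_SERIAL_WIDTH  # assumption: never enlarged


def compose_record(rate_percent, serial, serial_width):
    """Write the rate, then the right-aligned serial number, into one buffer."""
    buf = list("0" * RECORD_LEN)
    rate_digits = f"{round(rate_percent * 10_000):0{RATE_WIDTH}d}"
    buf[0:RATE_WIDTH] = rate_digits
    # The serial is anchored to the end of the record, so a wider serial
    # grows to the left, into whatever field precedes it.
    serial_digits = f"{serial:0{serial_width}d}"
    start = RECORD_LEN - serial_width
    buf[start:RECORD_LEN] = serial_digits
    return "".join(buf)


def read_rate(record):
    """Read the rate back the way a downstream program would."""
    return int(record[:RATE_WIDTH]) / 10_000


before = compose_record(7.3845, 123457, OLD_SERIAL_WIDTH)
after = compose_record(7.3845, 123457, NEW_SERIAL_WIDTH)
print(before, read_rate(before))
print(after, read_rate(after))
```

In this sketch the two new high-order serial digits are leading zeros, so they silently replace the last two digits of the rate with "00". Nothing crashes and every telegram still looks plausible, which is consistent with a quick test run passing.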

Case history 2: A public utility

A utility company was changing its billing algorithm to accommodate rate changes (a utility company euphemism for "rate increases"). All this involved was updating a few numerical constants in the existing billing program.

Management directed that the constants be updated. The programming manager assigned the job to one of the team leaders, who gave it to a programmer, saying, "Replace these constants in the program." The programmer made this trivial change, ran a few tests, and the system was put into operation the next day. Everything worked fine.

Some time later, the Comptroller's office noticed a slight discrepancy between estimated receipts and actual receipts. After much searching, it was discovered that two low order digits in one of the constants had been entered with "75" transposed to "57", causing a number of the bills to be short by a small amount. Billing millions of customers, this small difference added up to X dollars that the utility could never recover.
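The arithmetic behind "a small amount times millions of customers" can be sketched as follows. Every number here is hypothetical, chosen only to show the shape of the calculation: the rate constants, the usage figure, and the bill count are not from any of the four clients, and the function names are invented for this example.

```python
from decimal import Decimal

# Hypothetical constants: the intended rate ends in "75", the entered one in "57".
INTENDED_RATE = Decimal("0.1275")  # dollars per kWh, made up for illustration
ENTERED_RATE = Decimal("0.1257")   # same constant with the two digits transposed


def shortfall_per_bill(usage_kwh):
    """Dollars underbilled on one bill because of the transposed digits."""
    return (INTENDED_RATE - ENTERED_RATE) * usage_kwh


def total_shortfall(usage_kwh, bills):
    """Underbilling summed over many bills with the same (hypothetical) usage."""
    return shortfall_per_bill(usage_kwh) * bills


print(shortfall_per_bill(900))          # one bill at 900 kWh
print(total_shortfall(900, 1_000_000))  # a million such bills
```

A shortfall of a dollar or two per bill is far too small for any one customer to report, yet under these assumed figures it passes a million dollars for every million bills sent.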

The reason I say "X dollars" is that I've heard this story from four different clients, with different values of X. Estimated losses ranged from a low of $42 million to a high of $1.1 billion. Given that this happened four times to my clients, and given how few public utilities are clients of mine, I'm sure it's actually happened many more times.

Case history 3: A state lottery

I know of this one through the public press, so I can tell you that it's about the New York State Lottery:

A few years ago, the New York State legislature authorized a special lottery to raise extra money for some worthy purpose. As this special lottery was a variant of the regular lottery, the program to print the lottery tickets had to be modified. Fortunately, all this involved was changing one digit in the existing program.

Management directed that the change be made. The programming manager assigned the job to one of the team leaders, who gave it to a programmer, saying, "Change this digit to a five." The programmer made this trivial change, ran a few tests, and the system was put into operation the next day. Everything worked fine.


A few weeks later, when ticket sales were in full swing, one of the players bought two tickets and noticed that they had identical numbers. As there were supposed to be no duplicates in this lottery, he brought his tickets to the Daily News, which printed a photo of him and his two tickets on the front page. Public confidence in the lottery plunged, and the explanation that the error was "trivial" did nothing to restore it. To quiet the public outcry, all lotteries were shut down pending the report of a blue ribbon investigating committee (this is government, after all). Altogether, it took 11 months for the matter to be resolved and the lotteries to be reestablished. At that time, the lotteries had been netting the state about $4 million to $5 million per month, so the total loss of revenue was estimated at between $44 million and $55 million.
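The loss estimate is simple multiplication, shown here as a quick check using only the figures quoted above. The function name lost_revenue_range is just a label for this sketch.

```python
MONTHS_SHUT_DOWN = 11
MONTHLY_NET_LOW = 4_000_000   # dollars per month, low end of the quoted range
MONTHLY_NET_HIGH = 5_000_000  # dollars per month, high end of the quoted range


def lost_revenue_range(months, low, high):
    """Return the (low, high) revenue lost over the shutdown period."""
    return months * low, months * high


low, high = lost_revenue_range(MONTHS_SHUT_DOWN, MONTHLY_NET_LOW, MONTHLY_NET_HIGH)
print(f"${low:,} to ${high:,}")
```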

What's Next?

I have many more cases of failure, but to keep this blog short, I'll pause here. In my next blog essay, I'll give a few more cases, then describe the universal pattern of huge losses. After that, I'll provide some guides for preventing such failures.

Note

This essay is adapted from a portion of Chapter 2 from Responding to Significant Software Events. 

This book, in turn, is part of the Quality Software Bundle, which is an economical way to obtain the entire nine volumes of the Quality Software Series (plus two more relevant volumes).
