Thursday, January 22, 2015

The Universal Pattern of Huge Software Losses

Apology: This post was supposed to appear automatically as a follow-up to my three cases of large, costly software failures, but evidently I had a software failure of my own, and the Google scheduler didn't do what I thought I had asked for. So, here is the first follow-up, a bit late. I hope it's worth waiting for.

To complete my data gathering, I'll present two more loss cases, then proceed to describing the pattern that governs all of these cases, followed by the number one rule for preventing such losses in the first place.

Case history 4: A broker's statement

I know this story from the outside, as a customer of a large brokerage firm:
One month, a spurious line of $100,000.00 was printed on the summary portion of 1,500,000 accounts, and nobody knew why it was there. Twenty percent of clients called about it, using perhaps 50,000 hours of account representative time, or $1,000,000 at least. An unknown amount of customer time was used, and the effect on customer confidence was unknown. The total cost of this failure was at least $2,000,000, and the failure resulted from one of the simplest known errors in COBOL coding: failing to clear a blank line in a printing area.
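The arithmetic behind those figures is worth a quick check. The sketch below uses only the numbers quoted in the case; the per-call time and hourly rate it prints are implied by those numbers and are not stated anywhere in the original account.

```python
# Figures quoted in the brokerage case above.
accounts = 1_500_000
call_percent = 20            # "twenty percent of clients called"
rep_hours = 50_000           # estimated account representative time
rep_cost = 1_000_000         # dollars, described as "at least"

# Derived values (implied by the figures, not stated in the case).
calls = accounts * call_percent // 100
minutes_per_call = rep_hours * 60 / calls
implied_hourly_rate = rep_cost / rep_hours
cost_per_account = rep_cost / accounts

print(f"calls received:        {calls:,}")
print(f"minutes per call:      {minutes_per_call:.0f}")
print(f"implied hourly rate:   ${implied_hourly_rate:.2f}")
print(f"rep cost per account:  ${cost_per_account:.2f}")
```

Per statement, the representative cost comes to well under a dollar. The loss becomes large only because that cost is repeated across 1,500,000 accounts.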

Case history 5: A buying club statement

I know this story, too, from the outside, as a customer of a mail-order company, and also from the inside, as their consultant:
One month, a new service phone number for customer inquiries was printed on each bill. Unfortunately, the phone number had one digit incorrect, producing the number of a local doctor instead of the mail-order company. The doctor's phone was continuously busy for a week until he could get it disconnected. Many patients suffered, though I don't know if anyone died as a result of not being able to reach the doctor. The total cost of this failure would have been hard to calculate except for the fact that the doctor sued the mail-order company and won a large settlement. One of the terms of the settlement was that the doctor not reveal the amount, but I presume it was big enough. The failure resulted from an even simpler error in COBOL coding: copying a constant wrong.
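To show how small a safeguard against this kind of error can be, here is a minimal sketch in Python (not the COBOL of the original system). Every name and phone number in it is hypothetical, and it does not describe anything the company actually did. It simply compares the number that appears in the printed text against a single authoritative copy before anything is mailed.

```python
# Hypothetical values for illustration only; not the numbers from the case.
AUTHORITATIVE_SERVICE_NUMBER = "555-0100"  # assumed to come from a trusted record


def render_bill_footer(service_number: str) -> str:
    """Build the line of text printed on each bill."""
    return f"Questions? Call customer service at {service_number}."


def footer_has_correct_number(footer: str, expected_number: str) -> bool:
    """Return True only if the expected number appears in the footer text."""
    return expected_number in footer


good_footer = render_bill_footer("555-0100")
bad_footer = render_bill_footer("555-0108")  # one digit copied wrong

print(footer_has_correct_number(good_footer, AUTHORITATIVE_SERVICE_NUMBER))
print(footer_has_correct_number(bad_footer, AUTHORITATIVE_SERVICE_NUMBER))
```

The check is only as good as the authoritative copy, so that value should come from a source other than the person typing the constant into the program.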

The Universal Pattern of Huge Losses

I'll stop here, because I suspect you are getting bored with reading all these cases. Let me assure you, however, that they were anything but boring to the top management of the organizations involved. Rather than give a number of similar cases I have in my files, let's consider each case as a data point and try to extract some generalized meaning.

Every such case that I have investigated follows a universal pattern:

1. There is an existing system in operation, and it is considered reliable and crucial to the operation.

2. A quick change to the system is desired, usually from very high in the organization.

3. The change is labeled "trivial."

4. Nobody notices that statement 3 is a statement about the difficulty of making the change, not the consequences of making it, or of making it wrong.

5. The change is made without any of the usual software engineering safeguards, however minimal, that the organization has in place.

6. The change is put directly into the normal operations.

7. The individual effect of the change is small, so that nobody notices immediately.

8. This small effect is multiplied by many uses, producing a large consequence.

The Universal Pattern of Management Coping With a Large Loss

Whenever I have been able to trace management action subsequent to the loss, I have found that the universal pattern continues. After the failure is spotted:

9. Management's first reaction is to minimize its magnitude, so the consequences are continued for somewhat longer than necessary.

10. When the magnitude of the loss becomes undeniable, the programmer who actually touched the code is fired—for having done exactly what the supervisor said.

11. The supervisor is demoted to programmer, perhaps because of a demonstrated understanding of the technical aspects of the job.

12. The manager who assigned the work to the supervisor is slipped sideways into a staff position, presumably to work on software engineering practices.

13. Higher managers are left untouched. After all, what could they have done?

The First Rule of Failure Prevention

Once you understand the Universal Pattern of Huge Losses, you know what to do whenever you hear someone say things like:

• "This is a trivial change."

• "What can possibly go wrong?"

• "This won't change anything."

When you hear someone express the idea that something is too small to be worth observing, always take a look. That's the First Rule of Failure Prevention.


Nothing is too small to be worth observing.


What's Next?

Now that you're familiar with the pattern, we'll take a breather until the next post. There I'll provide other guides for preventing such failures.

Note

This essay is adapted from a portion of Chapter 2 from Responding to Significant Software Events.

This book, in turn, is part of the Quality Software Bundle, which is an economical way to obtain the entire nine volumes of the Quality Software Series (plus two more relevant volumes).
