The terms ‘fault’ and ‘failure’ are sometimes used loosely to mean the same thing, but they are actually quite different. A fault is something inherent in the software – a failure is something that happens in the real world. Faults do not necessarily lead to failures and failures often occur in software that is not ‘faulty’.
The reason for this is that whether some behaviour counts as a failure depends on the judgement of the observer and their expectations of the software. For example, I recently tried to buy two day passes on the Lisbon metro, one for myself and one for my wife. The metro uses reusable cards, so you buy 2 cards and then credit each with the appropriate pass. The dialogue with the machine went as follows:
How many cards (0.5€ each): 2
How many passes (3.7€ each): 2
Total to pay: 15.8€
To put it mildly, I was surprised. I tried twice and the same thing happened. I then bought the passes one at a time and all was fine – I paid the correct fee of 8.4€.
From my perspective, this was a software failure. It meant that I had to spend longer than I should have buying these passes. On the train, I tried to think about what might have happened. My guess is that it is possible to buy more than one day pass at a time and have them all credited to the same card. So, the 2nd question should have been:
How many passes on each card?
From a testing perspective, the software was probably fine and free of defects and, if you understood the system, then you would have entered 1 pass per card.
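The two readings of the second question can be checked with a few lines of Python. This is only a sketch of my guess at the machine's arithmetic: the function names are hypothetical and the real pricing logic is unknown. Only the prices and totals come from the dialogue above.

```python
from decimal import Decimal

CARD_PRICE = Decimal("0.5")  # euros per card, from the machine's prompt
PASS_PRICE = Decimal("3.7")  # euros per day pass, from the machine's prompt


def total_if_passes_overall(cards, passes):
    """What I expected: 'passes' is the total number of passes bought."""
    return cards * CARD_PRICE + passes * PASS_PRICE


def total_if_passes_per_card(cards, passes):
    """My guess at the machine's reading: 'passes' go onto every card."""
    return cards * CARD_PRICE + cards * passes * PASS_PRICE


print(total_if_passes_overall(2, 2))       # 8.4
print(total_if_passes_per_card(2, 2))      # 15.8
print(2 * total_if_passes_per_card(1, 1))  # 8.4, buying one at a time
```

The per-card reading reproduces the 15.8€ the machine asked for, and buying the passes one at a time gives the 8.4€ I eventually paid, which is consistent with the guess but does not prove it.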
So, failures are not some absolute thing that can be tested for. They will always happen because different people will have different expectations of systems. That’s the theme of my keynote talk at the SEPGEurope 2010 conference in Porto. We need to design software to help people understand what it’s doing and help them recover from failures.
I’ve written elsewhere (http://se9book.wordpress.com/2010/04/30/designing-for-failure/) about my Mac mail problems and this got me thinking about how folks who aren’t computer experts cope. When things go wrong, it’s OK for people like me who are software professionals and understand words like ‘cache’ and ‘reboot’. But what about ordinary people who don’t know and don’t care about how software is organised and who are constantly baffled by the fact that their computers behave in idiosyncratic ways? How do they cope – expensive helplines or helpful friends and relatives I guess? Or do they just put up with things and think that this is how it has to be?
Really, as software engineers, we have to do better. It’s not just about making our software more dependable but also about thinking how we communicate with the users of our systems and help them when things go wrong.
Software is now so complex that, for sure, it’s sometimes going to go wrong. Most software systems don’t accept this and make it harder than it should be to detect and fix problems. Take my experience today with Mac Mail. This is an email client that picks up mail from one or more servers. Mine is configured to access 4 separate accounts. This morning, it connected to 3 of them with no problem but when trying to connect to the 4th (the main account) it simply hung and the connecting icon kept spinning.
This being an Exchange server I immediately dismissed it as a server problem and got on with something else. I tried again and was surprised it didn’t work because our sys admins are usually quite quick at rebooting the Exchange server (lots of practice!). So, I connected with another machine and that was absolutely fine. I asked for suggestions – none were forthcoming – so I tried periodically throughout the morning with no success. Then I noticed that my incremental backup was trying to back up 40GB – and I certainly hadn’t created anything like that since last night.
Then problem number 2. Time Machine on the Mac doesn’t tell you what it has backed up or is trying to back up. Into Google, found a utility that does this, installed it and discovered this was a file in the Mail library called Recovered Mail, which was huge. Google again and discovered my problem was known and the Recovered Mail had to be deleted along with the offline cache. Into Terminal, deleted these and all was well.
What I found annoying was that it would have been so easy to design the systems differently so that the problems could have been diagnosed and fixed in 2 minutes instead of several hours. If Mail had actually kept a log that could be examined, then I could have easily found out what it was trying to do. And, if it published the files it used and where they were installed, that would also have been helpful. Even better, if Mail had asked me whether I really wanted it to create a very large file before doing so, I would have picked up the problem immediately. If Time Machine had an elementary interface that showed what it was backing up, that would also have helped.
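To show how little code this takes, here is a minimal Python sketch of a file write that logs what it is doing and asks before creating a very large file. Everything in it is an assumption made for illustration: the logger name, the 1 GiB threshold, and the `write_with_check` and `confirm` names are hypothetical and have nothing to do with how Mail is actually built.

```python
import logging
import os

log = logging.getLogger("mailclient")  # hypothetical logger name

LARGE_FILE_BYTES = 1024 ** 3  # assumed threshold (1 GiB), tune per application


def write_with_check(path, data, confirm, limit=LARGE_FILE_BYTES):
    """Write data to path, logging the action and asking before large writes.

    confirm is a callable that takes a message and returns True or False.
    In a desktop application it would show a dialog; here the caller decides.
    """
    full_path = os.path.abspath(path)
    size = len(data)
    # Log the full path so a user can later find (and delete) the file.
    log.info("Preparing to write %d bytes to %s", size, full_path)
    if size > limit and not confirm(
        f"About to create a {size}-byte file at {full_path}. Continue?"
    ):
        log.warning("User declined large write to %s", full_path)
        return False
    with open(full_path, "wb") as handle:
        handle.write(data)
    log.info("Wrote %s", full_path)
    return True
```

With a log like this and a single confirmation prompt, a runaway 40GB file would have been visible at the moment it was created rather than hours later.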
So, if you are designing software, think about what happens when it goes wrong. Don’t assume your users are stupid: provide ways to make the state of the system visible. If you are about to create an exceptionally large file, ask the user first, and make sure that users can delete the files you create. Don’t use ‘invisible’ files. And use timeouts – if something takes 50 times longer than normal, it really isn’t right.
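The timeout rule can be sketched in Python as follows. This is one possible approach using the standard library's `concurrent.futures`, not a recommendation of a particular design. The `run_with_timeout` name and the idea of passing in a known "normal" duration are assumptions; only the factor of 50 comes from the rule of thumb above.

```python
import concurrent.futures
import time

SLOWDOWN_LIMIT = 50  # "50 times longer than normal"


def run_with_timeout(operation, normal_seconds, factor=SLOWDOWN_LIMIT):
    """Run operation() and give up if it takes factor times longer than normal.

    Note: the worker thread is abandoned, not killed. A real client would
    also need a way to cancel the underlying work, such as a socket timeout.
    """
    budget = normal_seconds * factor
    start = time.monotonic()
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(operation)
    try:
        return future.result(timeout=budget)
    except concurrent.futures.TimeoutError:
        elapsed = time.monotonic() - start
        raise TimeoutError(
            f"Gave up after {elapsed:.2f}s; normal is about {normal_seconds}s"
        ) from None
    finally:
        pool.shutdown(wait=False)  # do not block the caller on a hung worker
```

The point is that the caller gets a clear error to show the user instead of an icon that spins all morning.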