I’ve written elsewhere (https://se9book.wordpress.com/2010/04/30/designing-for-failure/) about my Mac mail problems and this got me thinking about how folks who aren’t computer experts cope. When things go wrong, it’s OK for people like me who are software professionals and understand words like ‘cache’ and ‘reboot’. But what about ordinary people who don’t know and don’t care about how software is organised and who are constantly baffled by the fact that their computers behave in idiosyncratic ways? How do they cope – expensive helplines or helpful friends and relatives, I guess? Or do they just put up with things and think that this is how it has to be?
Really, as software engineers, we have to do better. It’s not just about making our software more dependable but also about thinking how we communicate with the users of our systems and help them when things go wrong.
Software is now so complex that, for sure, it’s sometimes going to go wrong. Most software systems don’t accept this and make it harder than it should be to detect and fix problems. Take my experience today with Mac Mail. This is an email client that picks up mail from one or more servers. Mine is configured to access 4 separate accounts. This morning, it connected to 3 of them with no problem but when trying to connect to the 4th (the main account) it simply hung and the connecting icon kept spinning.
This being an Exchange server I immediately dismissed it as a server problem and got on with something else. I tried again and was surprised it didn’t work because our sys admins are usually quite quick at rebooting the Exchange server (lots of practice!). So, I connected with another machine and that was absolutely fine. I asked for suggestions – none were forthcoming – so I tried periodically throughout the morning with no success. Then I noticed that my incremental backup was trying to back up 40GB – and I certainly hadn’t created anything like that since last night.
Then problem number 2. Time Machine on the Mac doesn’t tell you what it has backed up or is trying to back up. Into Google, found a utility that does this, installed it and discovered this was a file in the Mail library called Recovered Mail, which was huge. Google again and discovered my problem was known and the Recovered Mail had to be deleted along with the offline cache. Into Terminal, deleted these and all was well.
What I found annoying was that it would have been so easy to design the systems differently so that the problems could have been diagnosed and fixed in 2 minutes instead of several hours. If Mail had actually kept a log that could be examined, then I could have easily found out what it was trying to do. And, if it published the files it used and where they were installed, that would also have been helpful. Even better, if Mail had asked me whether I really wanted it to create a very large file before doing so, I would have picked up the problem immediately. If Time Machine had an elementary interface that showed what it was backing up, that would also have helped.
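To make the logging and confirmation ideas concrete, here is a minimal Python sketch. It is my own illustration, not how Mail works: the 1 GiB threshold, the logger name `mailclient`, and the helpers `write_with_confirmation`, `write_fn` and `confirm_fn` are all hypothetical names chosen for the example.

```python
import logging
from pathlib import Path

# Hypothetical threshold: anything over 1 GiB counts as "exceptionally large".
LARGE_FILE_BYTES = 1 * 1024**3

# Hypothetical application logger; a real client would also configure a log file
# in a documented location so users can inspect it.
log = logging.getLogger("mailclient")


def write_with_confirmation(path, expected_bytes, write_fn, confirm_fn):
    """Log the intended write and ask the user before creating a very large file.

    write_fn(path) performs the actual write; confirm_fn(path, expected_bytes)
    asks the user and returns True to proceed. Returns True if the file was
    written and False if the user declined.
    """
    path = Path(path)
    log.info("About to write %d bytes to %s", expected_bytes, path)

    if expected_bytes > LARGE_FILE_BYTES and not confirm_fn(path, expected_bytes):
        log.warning("User declined large write to %s", path)
        return False

    write_fn(path)
    log.info("Finished writing %s", path)
    return True
```

The point of the design is that every file the program creates is named in a log the user can read, and an unusually large write cannot happen silently.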
So, if you are designing software, think about what happens when it goes wrong. Don’t assume your users are stupid; provide ways to make the state of the system visible. When you create files, ask the user for confirmation if they are exceptionally large, and make sure that users can delete them. Don’t use ‘invisible’ files. And use timeouts: if something takes 50 times longer than normal, it really isn’t right.
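The timeout rule of thumb could be sketched along these lines in Python. This is an illustration under assumptions rather than a recommended implementation: `run_with_timeout`, `TooSlowError` and `normal_seconds` are hypothetical names, and the caller is assumed to know roughly how long the operation normally takes.

```python
import concurrent.futures
import time

# The "50 times longer than normal" rule of thumb.
SLOWDOWN_LIMIT = 50


class TooSlowError(Exception):
    """Raised when an operation takes far longer than its normal duration."""


def run_with_timeout(operation, normal_seconds, slowdown_limit=SLOWDOWN_LIMIT):
    """Run operation() and give up waiting after normal_seconds * slowdown_limit.

    Note: Python cannot kill the worker thread, so a stuck operation keeps
    running in the background; this only stops waiting and reports the problem.
    """
    timeout = normal_seconds * slowdown_limit
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(operation)
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        raise TooSlowError(
            f"operation exceeded {timeout:.2f}s "
            f"({slowdown_limit}x its normal {normal_seconds:.2f}s)"
        ) from None
    finally:
        # Do not block here waiting for a possibly hung worker.
        executor.shutdown(wait=False)


if __name__ == "__main__":
    try:
        # Pretend a connection that normally takes 10 ms has hung for a second.
        run_with_timeout(lambda: time.sleep(1), normal_seconds=0.01)
    except TooSlowError as err:
        print("Tell the user something is wrong:", err)
```

Had the connection attempt in my Mail example been wrapped like this, the spinning icon would have turned into an error message I could act on.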
It’s fairly clear that the Toyota Prius braking system problems are software related and to do with the interaction between the ABS and the regenerative system that recovers energy from braking. Whether this was a software fault or a specification fault isn’t clear but it reveals that when we add complexity to a system, we are likely to run into problems. Toyota are being rightly criticised for this (and, for sure, they have handled the problem badly) but they are being innovative and an inevitable part of the price of innovation is that problems will occur.
We need to remember this if and when we are tempted to be judgemental about software faults.