Designing for failure

Software is now so complex that it is certain to go wrong sometimes. Most software systems don’t accept this, and so they make it harder than it should be to detect and fix problems. Take my experience today with Mac Mail, an email client that picks up mail from one or more servers. Mine is configured to access four separate accounts. This morning it connected to three of them with no problem, but when it tried to connect to the fourth (the main account) it simply hung and the connecting icon kept spinning.

This being an Exchange server, I immediately dismissed it as a server problem and got on with something else. I tried again later and was surprised it still didn’t work, because our sys admins are usually quite quick at rebooting the Exchange server (lots of practice!). So I connected from another machine, and that was absolutely fine. I asked for suggestions, but none were forthcoming, so I tried periodically throughout the morning with no success. Then I noticed that my incremental backup was trying to back up 40GB, and I certainly hadn’t created anything like that since last night.

Then problem number 2. Time Machine on the Mac doesn’t tell you what it has backed up or is trying to back up. Into Google, found a utility that does this, installed it and discovered that the culprit was a file in the Mail library called Recovered Mail, which was huge. Google again, and I discovered that my problem was a known one and that the Recovered Mail file had to be deleted along with the offline cache. Into Terminal, deleted these and all was well.

What I found annoying was that it would have been so easy to design the systems differently, so that the problems could have been diagnosed and fixed in 2 minutes instead of several hours. If Mail had kept a log that could be examined, I could easily have found out what it was trying to do. If it had documented which files it uses and where they are stored, that would also have been helpful. Even better, if Mail, on trying to create a very large file, had asked me whether I really wanted to do this, I would have picked up the problem immediately. And if Time Machine had an elementary interface that showed what it was backing up, that would have helped too.
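To make the logging and "ask before creating a huge file" ideas concrete, here is a minimal sketch in Python. It is not how Mail works; the logger name, the 1GB threshold, the function write_with_size_check and the confirm callback are all hypothetical names chosen for illustration.

```python
import logging

# Hypothetical logger name; a real application would also configure a
# log file that the user can find and read.
logger = logging.getLogger("mailclient")

# Assumed threshold for "exceptionally large"; 1GB is an arbitrary choice.
LARGE_FILE_BYTES = 1024 ** 3


def write_with_size_check(path, data, confirm, limit=LARGE_FILE_BYTES):
    """Write data to path, logging the attempt and asking first if it is huge.

    confirm is a callable (path, size) -> bool supplied by the caller,
    for example a dialog box. Returns True if the file was written.
    """
    size = len(data)
    logger.info("about to write %d bytes to %s", size, path)
    if size > limit and not confirm(path, size):
        logger.warning("write of %d bytes to %s declined by user", size, path)
        return False
    with open(path, "wb") as f:
        f.write(data)
    logger.info("wrote %d bytes to %s", size, path)
    return True
```

With something like this in place, the log alone would have shown which file was growing and where it lived, and the confirmation would have flagged the problem before the backup ever noticed it.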

So, if you are designing software, think about what happens when it goes wrong. Don’t assume your users are stupid: provide ways to make the state of the system visible. When you create files, ask the user for confirmation if they are exceptionally large, and make sure that users can delete them. Don’t use ‘invisible’ files. And use timeouts: if something takes 50 times longer than normal, it really isn’t right.
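The timeout rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: run_with_timeout and SLOWDOWN_FACTOR are hypothetical names, and the caller is assumed to know roughly how long the operation normally takes.

```python
import concurrent.futures

# "50 times longer than normal" from the text above, as a default factor.
SLOWDOWN_FACTOR = 50


def run_with_timeout(operation, typical_seconds, factor=SLOWDOWN_FACTOR):
    """Run operation(), giving up if it takes factor times longer than usual.

    typical_seconds is the caller's estimate of the normal duration.
    Note: Python cannot kill the worker thread, so this only stops
    waiting and reports the failure; the operation itself may continue
    in the background.
    """
    timeout = typical_seconds * factor
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(operation)
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        raise TimeoutError(
            f"operation exceeded {timeout:.1f}s "
            f"({factor}x the typical {typical_seconds:.1f}s)"
        ) from None
    finally:
        # Do not block here waiting for a hung operation to finish.
        executor.shutdown(wait=False)
```

A connection that normally takes a second or two would then fail with a clear error after about a minute, rather than leaving an icon spinning all morning.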


