Every now and then, I write a blog article that could probably get me sued, sacked or both; this started off as one of those and has been heavily edited as to avoid naming names…
Software Quality Sucks; the ‘Release Early, Release Often’ meme appears to have permeated into every level of the IT stack; from the buggy applications to the foundational infrastructure, it appears that it is acceptable to foist beta quality code on your customers as a stable release.
Having run a test team for the past few years has been eye-opening; by the time my team gets hands on your code…there should be no P1s and very few P2s but the amount of fundamentally broken code that has made it to us is scary.
And then also running an infrastructure team, this is beyond scary and heading into realms of terror and just to make things nice and frightening, every now and then, I ‘like’ to search vendor patch/bug databases for terms like ‘data corruption’, ‘data loss’ and other such cheery terms; don’t do this if you want to sleep well at night.
Recently I have come across such wonderful phenomena as a performance monitoring tool which slows your system down the longer it runs; clocks that drift for no explicable reason and can lock out authentication; reboots which can take hours; non-disruptive upgrades which are only non-disruptive if run at a quiet time; errors that you should ignore most of the time but sometimes they might be real; files that disappear on renaming; updates replacing a update which makes a severity 1 problem worse..even installing fixes seems to be fraught with risk.
Obviously no-one in their right minds ever takes a new vendor code release into production; certainly your sanity needs questioning if you put a new product which has less than two year’s GA into production. Yet often the demands are that we do so.
But it does lead me wondering, has software quality really got worse? It certainly feels that it has? So what are the possible reasons, especially in the realms of infrastructure?
Complexity? Yes, infrastructure devices are trying to do more; no-where is this more obvious than in the realms of storage where both capabilities and integration points have multiplied significantly. It is no longer enough to support the FC protocol; you must support SMB, NFS, iSCSI and integration points with VMware and Hyper-V. And with VMware on an 12 month refresh cycle pretty much, it is getting tougher for vendors and users to decide which version to settle on.
The Internet? How could this cause a reduction in software quality? Actually, the Internet as a distribution method has made it a lot easier and cheaper to release fixes; before if you had a serious bug, you would find yourself having to distribute physical media and often in the case of infrastructure, mobilising a force of Engineers to upgrade software. This cost money, took time and generally you did not want to do it; it was a big hassle. Now, send out an advisory notice with a link and let your customers get on with it.
End-users? We are a lot more accepting of poor quality code; we are used to patching everything from our PC to our Consoles to our Cameras to our TVs; especially, those of us who work in IT and find it relatively easy to do so.
Perhaps it is time to start a ‘Slow Software Movement’ which focuses on delivering things right first time?
I have been burned by the non-disruptive upgrade before with a HP SAN led to a vm snapshot corruption. Quite painful considering
“a ‘Slow Software Movement’ which focuses on delivering things right”
– it’s called Debian.
🙂
The problem with trying to get things right the first time is that to actually accomplish the goal you need to do formal correctness analysis which is very difficult, very time consuming, and hence very expensive. NASA had a cost per line in excess of $1,500 and yet they still had to occasionally patch the shuttles code. Since customers demand both a rich featureset and a reasonable cost you’ll never have formal correctness in commercial products so the best you can hope for is a fairly well debugged codebase where hopefully only the new features are likely to break.