

5 Minutes

One of the frustrations when dealing with vendors is actually getting real availability figures for their kit; you will get generalisations like ‘it is designed to be 99.999% available’ or perhaps ‘99.9999% available’. But what do those figures really mean to you and how significant are they?

Well, 99.999% available equates to a bit over 5 minutes of downtime and 99.9999% equates to a bit over 30 seconds downtime over a year. And in the scheme of things, that sounds pretty good.
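
For anyone who wants to check the arithmetic themselves, here’s a minimal Python sketch (assuming a non-leap 365-day year):

```python
# Downtime permitted per year for a given availability percentage.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year

def downtime_minutes_per_year(availability_pct: float) -> float:
    """Return the downtime budget in minutes per year."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for pct in (99.9, 99.99, 99.999, 99.9999):
    print(f"{pct}% available -> {downtime_minutes_per_year(pct):.2f} minutes/year")

# 99.999%  allows roughly 5.26 minutes a year
# 99.9999% allows roughly 0.53 minutes, i.e. a bit over 30 seconds
```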

However, these are design criteria and aims; what are the real-world figures? Vendors, you will find, are very coy about this; in fact, every presentation I have had with regards to availability has been under very strict NDA, and sometimes we are not even allowed to take notes. Presentations are never allowed to be taken away.

Yet, there’s a funny thing…I’ve never known a presentation where the design criteria were not met or even significantly exceeded. So why are the vendors so coy about their figures? I have never been entirely sure; it may be that their ‘mid-range’ arrays display very similar real-world availability figures to their more ‘Enterprise’ arrays…or it might be that once you have real-world availability figures, you might start asking some harder questions.

Sample size: raw availability figures are not especially useful if you don’t know the sample size. Availability figures are almost always quoted as an average and, unless you’ve got a really bad design, more arrays can skew the figures.
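
To see how a fleet average can bury a genuinely bad day, here’s a hypothetical worked example (the fleet size and outage length are invented purely for illustration):

```python
# Hypothetical fleet: 1,000 arrays, one of which suffered a single 8-hour
# outage during the year; every other array had zero downtime.
MINUTES_PER_YEAR = 365 * 24 * 60

fleet_size = 1000
outage_minutes = 8 * 60

total_array_minutes = fleet_size * MINUTES_PER_YEAR
fleet_availability = 100 * (1 - outage_minutes / total_array_minutes)

print(f"Fleet-average availability: {fleet_availability:.5f}%")
# ~99.99991% -- comfortably better than 'five nines' on average,
# yet one customer still lost a full working day.
```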

Sample characteristics: I’ve known vendors, when backed into a corner and forced to provide figures, do some really sneaky things; for example, they may provide figures only for a specific model and software release. This is often done to hide a bad release. You should always try to ask for the figures for the entire life of a product; this will allow you to judge the quality of the code. If possible, ask for a breakdown on a month-by-month basis annotated with the code release schedule.

There are many tricks that vendors try to pull to hide causes of downtime and non-availability, but instead of focusing on the availability figures, as a customer it is sometimes better to ask different, specific questions.

What is the longest outage that you have suffered on one of your arrays? What was the root cause? How much data loss was sustained? Did the customer have to invoke disaster recovery or any recovery procedures? What is the average length of outage on an array that has gone down?

Do not believe a vendor when they tell you that they don’t have these figures and information closely and easily to hand. They do, and if they don’t, they are being pretty negligent about their QC and analytics. Surely they don’t just use all their Big Data capability to crunch marketing stats? Scrub that, they probably do.

Another nasty thing that vendors are in the habit of doing is forcing customers to not disclose to other customers that they have had issues and what they were. And of course we all comply and never discuss such things.

So 5 minutes…it’s about long enough to ask some awkward questions.

The Complexity Legacy

I don’t blog about my day-job very often but I want to relate a conversation I had today. I was chatting to one of the storage administrators who works on our corporate IT systems; they’ve recently put in some XIV systems (some might be an understatement) and I asked how he was getting on with them. He’s been doing the storage administrator thing for a long time and cut his teeth on the Big Iron arrays, so I thought he might be a bit resentful at how easy the XIV is to administer, but no…he mentioned a case recently when they needed to allocate a large chunk of storage in a real hurry; it took 30 minutes to do a job which he felt would have taken all day on a VMAX.

And I believe him but…

Here’s the thing: in theory, using the latest GUI tools such as Unisphere for VMAX, it should be just as quick on a VMAX. So what is going on? Quite simply, the Big Iron arrays are hampered by a legacy of complexity; even experienced administrators, and perhaps especially experienced administrators, like to treat them as complex, cumbersome beasts. It is almost as if we’ve developed a fear of them and treat them with kid gloves.

And I don’t believe it is just VMAX that is suffering from this; all of the Big Iron arrays suffer from this perception of complexity. Perhaps because they are still expensive, perhaps because the vendors like to position them as Enterprise beasts and not as something which is as easy to configure as your home NAS, and perhaps because the storage community is completely complicit in the secret, occult world of Enterprise storage?

Teach the elephants to dance…they can and they might not crush your toes.

Just How Much Storage?

A good friend of mine recently got in contact to ask my professional opinion on something for a book he was writing; it always amazes me that anyone asks my professional opinion on anything…especially people who have known me for many years. But as he’s a great friend, I thought I’d try to help.

He asked me how much a petabyte of storage would cost today and when I thought it would be affordable for an individual. Both parts of the question are interesting in their own way.

How much would a petabyte of storage cost? Why, it very much depends; it’s not as much as it cost last year but not as cheap as some people would think. Firstly, it depends on what you might want to do with it; capacity, throughput and I/O performance are just part of the equation.

Of course then you’ve got the cost of actually running it; 400-500 spindles of spinning stuff takes a reasonable amount of power, cooling and facilities. Even if you can pack it densely, it is still likely to fall through the average floor.

There are some very good deals to be had mind you but you are still looking at several hundred thousand pounds, especially if you look at a four year cost.

And when will the average individual be able to afford a petabyte of storage? Well, without some significant changes in storage technology, we are some time away from this being feasible. Even with 10 Terabyte disks, we are talking over a hundred disks.
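
The disk-count arithmetic is easy enough to sketch; the protection overhead below is an assumed figure, not anyone’s product specification:

```python
# Rough spindle count for a petabyte of usable capacity.
usable_tb = 1000            # 1 PB expressed in terabytes
disk_size_tb = 10           # the 10 Terabyte disks mentioned above
protection_overhead = 1.25  # assume ~25% lost to RAID/protection and formatting

disks = usable_tb * protection_overhead / disk_size_tb
print(f"Roughly {disks:.0f} disks")  # ~125 disks, before hot spares

# With the 2-3 TB drives shipping today, the same petabyte works out at
# the 400-500 spindles mentioned earlier.
```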

But will we ever need a petabyte of personal storage? That’s extremely hard to say; I wonder if we will see the amount of personal storage peak in the next decade?

And as for on-premises personal storage?

That should start to go into decline; for me it is already beginning to do so. I carry less storage around than I used to…I’ve replaced my 120GB iPod with a 32GB phone, but if I’m out with my camera, I’ve probably got 32GB+ of cards with me. Yet with connected cameras coming and 4G (once we get reasonable tariffs), this will probably start to fall off.

I also expect to see the use of spinning rust go into decline as PVRs are replaced with streaming devices; it seems madness to me that a decent proportion of the world’s storage is storing redundant copies of the same content. How many copies of EastEnders does the world need stored on locally spinning drives?

So I am not sure that we will get to a petabyte of personal storage any time soon but we already have access to many petabytes of storage via the Interwebs.

Personally, I didn’t buy any spinning rust last year and although I expect to buy some this year; this will mostly be refreshing what I’ve got.

Professionally, it looks like over a petabyte per month is going to be pretty much run-rate.

That is a trend I expect to see continue; the difference between commercial and personal consumption is going to grow. There will be scary amounts of data around about you and generated by you; you just won’t know it or access it.

Software Sucks!

Every now and then, I write a blog article that could probably get me sued, sacked or both; this started off as one of those and has been heavily edited as to avoid naming names…

Software Quality Sucks; the ‘Release Early, Release Often’ meme appears to have permeated into every level of the IT stack; from the buggy applications to the foundational infrastructure, it appears that it is acceptable to foist beta quality code on your customers as a stable release.

Running a test team for the past few years has been eye-opening; by the time my team gets its hands on your code, there should be no P1s and very few P2s, but the amount of fundamentally broken code that has made it to us is scary.

And running an infrastructure team as well, this goes beyond scary and heads into the realms of terror. Just to make things nice and frightening, every now and then I ‘like’ to search vendor patch/bug databases for terms like ‘data corruption’, ‘data loss’ and other such cheery terms; don’t do this if you want to sleep well at night.

Recently I have come across such wonderful phenomena as: a performance monitoring tool which slows your system down the longer it runs; clocks that drift for no explicable reason and can lock out authentication; reboots which can take hours; non-disruptive upgrades which are only non-disruptive if run at a quiet time; errors that you should ignore most of the time but which sometimes might be real; files that disappear on renaming; an update replacing an update which makes a severity 1 problem worse…even installing fixes seems to be fraught with risk.

Obviously no-one in their right mind ever takes a new vendor code release into production; certainly your sanity needs questioning if you put a new product with less than two years’ GA into production. Yet often the demands are that we do so.

But it does leave me wondering: has software quality really got worse? It certainly feels that it has. So what are the possible reasons, especially in the realms of infrastructure?

Complexity? Yes, infrastructure devices are trying to do more; nowhere is this more obvious than in the realms of storage, where both capabilities and integration points have multiplied significantly. It is no longer enough to support the FC protocol; you must support SMB, NFS, iSCSI and integration points with VMware and Hyper-V. And with VMware on pretty much a 12-month refresh cycle, it is getting tougher for vendors and users to decide which version to settle on.

The Internet? How could this cause a reduction in software quality? Actually, the Internet as a distribution method has made it a lot easier and cheaper to release fixes; before, if you had a serious bug, you would find yourself having to distribute physical media and often, in the case of infrastructure, mobilising a force of Engineers to upgrade software. This cost money, took time and generally you did not want to do it; it was a big hassle. Now, you send out an advisory notice with a link and let your customers get on with it.

End-users? We are a lot more accepting of poor-quality code; we are used to patching everything from our PCs to our consoles to our cameras to our TVs; especially those of us who work in IT and find it relatively easy to do so.

Perhaps it is time to start a ‘Slow Software Movement’ which focuses on delivering things right first time?

Good Enough Isn’t?

One of the impacts of the global slowdown has been that many companies have been focussing on services and infrastructure which is just good enough. For some time now, many of the mainstream arrays have been considered to be good enough. But the impact of SSD and Flash may change our thinking and in fact I hope it does.

So perhaps Good Enough Isn’t Really Good Enough? Good Enough is only really Good Enough if you are prepared to stagnate and not change; if we look at many enterprise infrastructures, they haven’t really changed that much over the past 20 years and the thinking behind them has not changed dramatically. Even virtualisation has not really changed our thinking because, despite the many pundits and bloggers like me who witter on about service thinking and Business alignment, for many it is still just hot air.

There appears to be a lack of imagination that permeates our whole business; if a vendor turns up and says ‘I have a solution which can reduce your back-up windows by 50%’, the IT manager could think ‘Well, I don’t have a problem with my back-up windows; they all run perfectly well and everyone is happy…’. What they don’t tend to ask is ‘If my back-up windows are reduced by 50%, what can I do with the time that I have saved; what new service can be offered to the Business?’

Over the past few years, the focus has been on Good Enough; we need to get out of this rut and start to believe that we can do things better.

As storage people, we have been beaten up by everyone with regards to cost and yet I still hear it time and time again that storage is the bottle-neck in all infrastructures; time to provision, performance, and capacity; yet we are still happy to sit comfortably talking about ‘Good Enough Storage’.

Well, let me tell you that it isn’t ‘Good Enough’ and we need to be a lot more vocal in articulating why it isn’t and why doing things differently would be better; working a lot closer with our customers in explaining the impact of ‘Good Enough’ and letting them decide what is ‘Good Enough’.

Patience is a Virtue?

Or is patience just an acceptance of latency and friction? A criticism oft made of today’s generation is that they expect everything now and that this is a bad thing; but is it really?

If a bottle of fine wine could mature in an instant and be as good as a ’61, would this be a bad thing? If you could produce a Michelin-quality meal in a microwave, would it be a bad thing?

Yes, today we do have to accept that such things take time but is it really a virtue? Is there anything wrong with aspiring to do things quicker whilst maintaining quality?

We should not just accept that latency and friction in process is inevitable; we should work to try to remove them from the way that we work.

For example, change management is considered to be a necessary ITIL process but does it have to be the lengthy bureaucratic process that it is? If your infrastructure is dynamic, surely your change process should be dynamic too? If you are installing a new server, should you have to raise a change:

1) to rack and stack
2) to configure the network
3) to install the operating system
4) to present the storage
5) to add the new server to the monitoring solution etc, etc

Each of these is an individual change raised by a separate team. Or should you be able to do this all programmatically? Now, obviously in a traditional data-centre some of these require physical work, but once the server has been physically commissioned, there is nothing there which should not be able to be done programmatically and pretty much automatically.
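
To make that concrete, here’s a purely hypothetical sketch of what an ‘infrastructure API’ for those post-rack steps might look like; none of these function names correspond to any real vendor product:

```python
# Hypothetical sketch: everything after physical rack-and-stack becomes one
# programmatic, auditable operation instead of four or five manual tickets.

def configure_network(name: str, vlan: int) -> None:
    print(f"[{name}] switch ports configured, VLAN {vlan}, DNS registered")

def install_operating_system(name: str) -> None:
    print(f"[{name}] standard OS build deployed")

def present_storage(name: str, size_gb: int) -> None:
    print(f"[{name}] {size_gb}GB of storage zoned, masked and presented")

def register_monitoring(name: str) -> None:
    print(f"[{name}] added to the monitoring solution")

def commission_server(name: str, vlan: int, size_gb: int) -> None:
    """One logical change record, raised and executed programmatically."""
    configure_network(name, vlan)
    install_operating_system(name)
    present_storage(name, size_gb)
    register_monitoring(name)
    print(f"[{name}] change record closed automatically")

commission_server("app-server-01", vlan=120, size_gb=500)
```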

And so it goes for many of the traditional IT processes; they simply introduce friction and latency to reduce the risk of the IT department smacking into a wall. This is often deeply resented by the Business, who simply want to get their services up and running; it is also resented by the people who are following the processes; and then it is thrown away in an emergency (which happens more often than you would possibly expect 😉 ).

This is not a rant against ITIL; it was a tool for a more sedate time. But in a time when Patience is no longer really a virtue, do we need a better way? Or perhaps something like an IT Infrastructure API?

Don’t throw away the rule-book but replace it with something better.

P.S. Patience was actually my grandmother; she had her vices but we loved her very much.

Meltdown

The recent RBS systems meltdown and the rumoured reasons for it are a salutary reminder to all as to how much we are reliant on the continued availability of core IT systems; these systems are pretty much essential to modern life. Yet arguably the corporations that run these systems have become incredibly cavalier and negligent about them; their maintenance and long-term sustainability, even in supposedly heavily regulated sectors such as Banking, is woeful.

There is an ‘It Ain’t Broke, So Don’t Fix It’ mentality that has led to systems that are unbelievably complex and tightly coupled; this is especially true of those early adopters of IT technologies such as the Banking sector.

I spent my early IT years working for a retail bank in the UK and even twenty years ago, this mentality was prevalent and dangerous; code that no-one understood sat at the core of systems, and wrappers written to try to hide the ancient code meant that you needed to be half-coder, half-historian to stand a chance of working out exactly what it did.

If we add another twenty years to this, twenty years of rapid change in which we have seen the rise of the Internet, 24-hour access to information and services, mobile computing and a financial collapse, you have almost a perfect storm. Rapidly changing technology coupled with intense pressure on costs has led to under-investment in core infrastructure whilst the Business chases the new. Experience has oft been replaced with expedience.

There is simply no easy Business Case that flies to justify the re-writing and redevelopment of your core legacy applications, even if you still understand them; well, there wasn’t until last week. If you don’t do this and if you don’t start to understand your core infrastructure and applications, you might well find yourself in the same position that the guys at RBS have.

Systems that have become too complex and are hacked together to do things that they were never supposed to do; systems which, if I’m being generous, were developed in the 80s but more likely the 70s, trying to cope with the demands of the 24-hour generation; systems which are carrying out more processing in real time and yet are, at their heart, batch systems.

If we continue down this route, there will be more failures and yet more questions to be answered. Dealing with legacy should no longer be ‘It Ain’t Broke, So Don’t Fix It’ but ‘It Probably Is Broke, You Just Don’t Know It…yet!’ Look at your Business: if it has changed out of all recognition, if your processes and products no longer resemble those of twenty years ago, it is unlikely that IT systems designed twenty years ago are fit for purpose. And if you’ve stuck twenty years’ worth of sticking plaster on them to try and make them fit for purpose, it’s going to hurt when you try to remove the sticking plaster.

This is not a religious argument about Cloud, Distributed Systems, Mainframe but one about understanding the importance of IT to your Business and investing in it appropriately.

IT may not be your Business but IT makes your Business…you probably wouldn’t leave your offices to fall into disrepair, patching over the cracks until the building falls down…don’t do the same to your IT.

The Last of the Dinosaurs?

Chris ‘The Storage Architect’ Evans and I were having a Twitter conversation during the EMC keynote where they announced the VMAX 40K; Chris was watching the live-stream and I was watching the Chelsea Flower Show. From Chris’ comments, I think that I got the better deal.

But we got to talking about the relevance of the VMAX and the whole bigger is better thing. Every refresh, the VMAX just gets bigger and bigger, more spindles and more capacity. Of course EMC are not the only company guilty of the bigger is better hubris.

VMAX and the like are the ‘Big Iron’ of the storage world; they are the choice of the lazy architect; the infrastructure patterns that they support are incredibly well understood and text-book, but do they really support Cloud-like infrastructures going forward?

Now, there is no doubt in my mind that you could implement something which resembles a cloud, or let’s say a virtual data-centre, based on VMAX and its competitors. Certainly if you were a Service Provider with aspirations to move into the space, it’s an accelerated on-ramp to a new business model.

Yet just because you can, does that mean you should? EMC have done a huge amount of work to make it attractive; an API to enable you to programmatically deploy and manage storage allows portals to be built to encourage a self-service model. Perhaps you believe that this will allow light-touch administration and the end of the storage administrator.

And then Chris and I started to talk about some of the realities; change control on a box of this size is going to be horrendous; in your own data-centre, co-ordination is going to be horrible, but as a service provider? Well, that’s going to be some interesting terms and conditions.

Migration: in your own environment, to migrate a petabyte array in a year means migrating 20 terabytes a week, more or less. Now, depending on your workload, year-ends, quarter-ends and known peaks, your window for migrations could be quite small. And depending how you do it, it is not necessarily non-service-impacting; mirroring at the host level means significantly increasing your host workload.
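
The arithmetic behind that figure is straightforward; the ‘usable weeks’ number below is an illustrative assumption about change freezes, not a measurement:

```python
# Weekly migration rate needed to move a petabyte array in a year.
petabyte_tb = 1000
weeks_in_year = 52

print(f"{petabyte_tb / weeks_in_year:.1f} TB per week")  # ~19.2 TB/week

# Assume change freezes around year-ends, quarter-ends and known peaks
# leave only 40 usable migration weeks (purely illustrative):
usable_weeks = 40
print(f"{petabyte_tb / usable_weeks:.1f} TB per usable week")  # 25 TB/week
```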

As a service provider, you have to know a lot about workloads that you don’t really influence and don’t necessarily understand. As a service provider customer, you have to have a lot of faith in your service provider. When you are talking about massively-shared pieces of infrastructure, this becomes yet more problematic. You are going to have to reserve capacity and capability to support migration; if you find yourself overcommitting on performance, i.e. you assume that peaks don’t all happen at once, you have to understand the workload impact of migration.

I am just not convinced that these massively monolithic arrays are entirely sensible; you can certainly provide secure multi-tenancy, but can you prevent behaviours impacting the availability and performance of your data? And can you do it in all circumstances, such as during code-level changes and migrations?

And if you’ve ever seen the back-out plan for a failed Enginuity upgrade; well the last time I saw one, it was terrifying.

I guess the phrase ‘Eggs and Baskets’ comes to mind; yet we still believe that bigger is better when we talk about arrays.

I think we need to have some serious discussion about optimum array sizes to cope with exceptions and when things go wrong, and then some discussion about the migration conundrum. Currently I’m thinking that a petabyte is as large as I want to go; as for the number of hosts/virtual hosts attached, I’m not sure. Although it might be better to think about the number of services an array supports and what can co-exist, both performance-wise and availability-window-wise.

No, the role of the Storage Admin is far from dead; it’s just become about administering and managing services as opposed to LUNs. Yet, the long-term future of the Big Iron array is limited for most people.

If you as an architect continue to architect all your solutions around Big Iron storage…you could be limiting your own future and the future of your company.

And you know what? I think EMC know this…but they don’t want to scare the horses!

A New Symm?

So EMC World is here and the breathless hype begins all over again, and in amongst the shiny, shiny, shiny booths, the acolytes worship the monolith that is the new Symmetrix. Yet a question teases the doubters: do we need a new Symmetrix?

Okay, enough of the ‘Venus in Furs’ inspired imagery, although it might be strangely appropriate for the Las Vegas setting; there is a question which needs to be asked: do we need a new Symmetrix?

Now, these days I am probably far enough removed, but not so distant that I can’t have a stab at an answer. And the answer is: no, I don’t believe that we actually needed a new Symmetrix, but EMC needed to develop one anyway.

There are certainly lots of great improvements; a simpler management interface and bringing it into the Unisphere world have been long overdue. It seems that many manufacturers are beginning to realise that customers want commonality and that shiny GUIs can help to sell a product.

Improvements to TimeFinder snaps are welcome; we’ve come a long way from BCVs and mirror positions; there’s still a long way to go to get customers to come along with you though. Many cling onto the complex rules with tenacity.

Certainly the mirroring of FAST-VP so that, in the event of fail-over, a Performance Recovery Point of 0 is achievable is very nice; it’s something I’ve blogged about before and it addresses a weakness in many automated tiering solutions.

eMLC SSDs, bringing the cost of SSD down whilst maintaining performance: this is another overdue capability, as is the support of 2.5″ SAS drives, improving density and, I suspect, the performance of spinning rust.

Physical dispersal of cabinets; you probably won’t believe how long this has been discussed and asked for. Long, long overdue but hey, EMC are not the only guilty parties.

And of course, Storage ‘Federation’ of 3rd party arrays; I’m sure HDS and IBM will welcome the vindication of their technology by EMC or at least have a good giggle.

But did we need a new Symmetrix to deliver all this? Or would the old one have done?

Probably but where’s the fun in that?

I don’t know but perhaps concentrating on the delivery to the business before purchasing a new Big Iron array might be more fitting. I don’t know about you but in the same way that I look at mainframes with nostalgia and affection; I’m beginning to look at the Symmetrix and the like in the same way.

If you need one, you need one but ask yourself…do I really need one?

Flash Changed My Life

All the noise about all flash arrays and acquisitions set me thinking a bit about SSDs and flash; how it has changed things for me.

To be honest, the flash discussions haven’t yet really impinged on my reality in my day-to-day job; we do have the odd discussion about moving metadata onto flash but we don’t need it quite yet; most of the stuff we do is large sequential I/O and spinning rust is mostly adequate. Streaming rust, i.e. tape, is actually adequate for a great proportion of our workload. But we keep a watching eye on the market and where the various manufacturers are going with flash.

But flash has made a big difference to the way that I use my personal machines and if I was going to deploy flash in a way that would make the largest material difference to my user-base, I would probably put it in their desktops.

Firstly, I now turn my desktop off; I never used to unless I really had to, but waiting for it to boot or even wake from sleep was at times painful. And sleep had a habit of not sleeping or flipping out on a restart; turning the damn thing off is much better. This has had the consequence that I now have my desktops on an eco-plug which turns off all the peripherals as well; good for the planet and good for my bills.

Secondly, the fact that the SSD is smaller means that I keep less crap on it and am a bit more sensible about what I install. Much of my data is now stored on the home NAS environment which means I am reducing the number of copies I hold; I find myself storing less data. There is another contributing factor; fast Internet access means that I tend to keep less stuff backed-up and can stream a lot from the Cloud.

Although the SSD is smaller and probably needs a little more disciplined house-keeping, running a full virus check, which I do on occasion, is a damn sight quicker and there are no more lengthy defrags to worry about.

Thirdly, applications load a lot faster; although my desktop has lots of ‘chuff’ and can cope with lots of applications open, I am more disciplined about not keeping applications open because their loading times are that much shorter. This helps keep my system running snappily, as does shutting down nightly I guess.

I often find on my non-SSD work laptop that I have stupid numbers of documents open; some have been open for days and even weeks. This never happens on my desktop.

So all in all, I think if you really want bang-for-buck and to put smiles on many of your users’ faces, the first thing you’ll do is flash-enable the stuff that they do every day.