Living on a prayer

Much of what we do in Storage Management can be considered living on a prayer; this is not just a result of the parlous state of the storage management tools that we use but also due to the complex interactions which can happen within shared infrastructure.

And frighteningly enough, we are in the process of making the whole thing worse! The two darling storage technologies of the moment, thin provisioning and de-dupe, scare me witless. Both of these technologies in the wrong hands have the capability to bring a server estate to its knees. By wrong hands, I mean just about anybody.

Both of them allow you to virtualise capacity and allocate storage which isn't actually there and hope that you never need it.

Thin provisioning is predicated on the inefficient practices that we have come to know and love; we all know that when a user asks for storage, they nearly always ask for too much! Thin provisioning allows us to allocate the disk logically, and only when it is written to does it actually get consumed.

The problem is, what happens in the event of a perfect storm and every application wants its capacity at the same time? How much do you over-commit your physical capacity? Or maybe not a perfect storm; you simply realise that you are going to have to add physical capacity above and beyond that which is supported by the array, just to cater for the rate at which the thinly provisioned storage is being consumed. A rapid application migration ensues.
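To put some numbers on the over-commit question, here is a quick back-of-the-envelope sketch; all the figures and the alert threshold are purely illustrative assumptions, not vendor defaults:

```python
# Hypothetical sketch: modelling a thin pool's overcommit and the "perfect storm".
# Figures and the 80% alert threshold are illustrative, not vendor defaults.

def pool_report(physical_tb, allocations_tb, written_fraction, alert_at=0.80):
    """Summarise a thin pool: overcommit ratio and whether physical
    consumption has crossed the alert threshold."""
    logical = sum(allocations_tb)                  # capacity promised to hosts
    consumed = sum(a * f for a, f in zip(allocations_tb, written_fraction))
    overcommit = logical / physical_tb             # >1.0 means betting on sparse writes
    utilisation = consumed / physical_tb
    return {
        "overcommit_ratio": round(overcommit, 2),
        "pool_utilisation": round(utilisation, 2),
        "alert": utilisation >= alert_at,
    }

# 100 TB of real disk, 250 TB promised; today only a fraction is written...
print(pool_report(100, [50, 80, 120], [0.3, 0.25, 0.2]))
# ...but in a perfect storm every application writes what it was promised:
print(pool_report(100, [50, 80, 120], [1.0, 1.0, 1.0]))
```

The uncomfortable bit is that there is no right answer for `alert_at` or the overcommit ratio; they depend entirely on how well you understand your applications' write behaviour.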

And then there is the scary proposition of de-duped primary storage. You could be many times over-subscribed with de-duped storage; certainly in a virtualised server environment or a development environment where you have many copies of the same data. And then someone does something; a user decides to turn on encryption, and what were many deduped copies of the same data become many full copies of the same data, and you have run out of storage space in spectacular fashion.
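The encryption scenario is easy to illustrate with a toy sketch; the block names and counts here are made up for illustration. Identical blocks dedupe down to one physical copy, until per-copy encryption makes every block's ciphertext unique again:

```python
# Toy illustration of dedupe collapse under encryption.
# Block names and counts are invented for the example.

def physical_blocks(blocks):
    """Physical blocks needed after dedupe = number of distinct blocks."""
    return len(set(blocks))

# Ten VMs cloned from the same golden image: 1000 identical blocks each.
golden = ["golden-block-%d" % i for i in range(1000)]
vm_blocks = golden * 10

before = physical_blocks(vm_blocks)     # dedupes to just 1000 physical blocks

# Encrypting each VM's copy (per-volume keys/IVs) makes every block distinct,
# modelled here by tagging each block with its VM's identity:
encrypted = ["vm%d:%s" % (vm, b) for vm in range(10) for b in golden]
after = physical_blocks(encrypted)      # nothing dedupes any more

print(before, after)                    # a 10x jump in physical consumption
```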

Migrating deduped primary storage between arrays is also going to be a lot of fun and is going to need a lot of planning. Deduping primary storage may well be one of the ultimate vendor lock-ins if we are not careful.

Both thin-provisioning and primary storage dedupe take a lot of control away from the storage team; this is not necessarily a bad thing but the storage team now need to understand a lot more about what their users are doing.

It will no longer be enough just to think about supplying spinning rust which can deliver a certain amount of capacity and performance. We are going to have to understand what the users are doing day-to-day and we are going to have to communicate with each other.

And yes, we'll need better tools which allow us to see what is going on in the environment but also to model the complex interactions and impacts of various events. We are going to need to know if a user is intending to do something like enabling encryption, a big data-refresh or operating system patching; events which in the past were not hugely significant could now have serious ramifications for our storage capacity.

I still think that thin-provisioning and de-dupe are good ideas but like all good ideas; they come with challenges and a certain amount of risk…

Flexible Thinking

I think that there is some interesting discussion to be had from Hu's latest blog in response to my own comments and thoughts about whether storage virtualisation as demonstrated by the external storage virtualisation devices has a long-term future.

The response is not really to do with virtualisation at all; it is all about aligning IT and the Business. He talks about an all too familiar case where storage decisions are made locally by the Business Units and the procurement strategy does not take account of the long-term health of the group; ongoing OpEx costs are not borne by individual business units and become the problem of the IT department. The concept and value of shared infrastructure was not really understood by the Business.

But I suspect that these very same Business Units will be happy to use Cloud-based services, Infrastructure as a Service etc, concepts which are built on multi-tenanted shared external infrastructure. Why? Because they can deploy rapidly and flexibly; they don't need to engage the slow and cumbersome IT department and they feel that they are masters of their own destiny.

Storage virtualisation won't bring these Business Units back but storage virtualisation may enable IT to put together a flexible and responsive service catalogue. CIOs need to engage with their customers and understand what they want but they need the support of vendors to clearly demonstrate cost to their customers and to their boards.

The service must be something that the customer wants; customers do not understand why it takes weeks to provision servers, storage and networks. They can go down to PC World and the like and pick things off the shelf *NOW*.

The more savvy customer knows that they can enter a credit card number with a cloud provider and provision dozens of servers much quicker than the IT department can; what's more, they can turn them off again equally quickly and not be stuck with kit that they do not need and do not want.

Customers want dynamic, flexible infrastructures which can rapidly respond to their needs; Corporate IT departments along with their vendors have been pretty poor at providing these. Virtualisation and abstracted infrastructures are enabling technologies but Hu's right in that people and processes are key…

As for external storage virtualisation devices, I still wonder whether they are the future. They are not necessary to provide the service and, if I were building a green-field data centre with no legacy to deal with, I am not sure I would deploy them.

Infrastructure provision is at an interesting inflection point; if you don't understand this and your customers do, you have got a problem.

Set the Wide Stripes Free

There have been a couple of articles recently on the HDS blogs about HDP; Hitachi Dynamic Provisioning is HDS' thin provisioning offering and, like most thin-provisioning implementations, it also offers wide-striping. I am not going to get into whether HDS' offering is chubby, skinny or just slightly overweight but what I am going to ask is…
if wide-striping is so foundational and so important to the storage industry, and especially to improving my TCO as an end-user, why do I have to pay extra for it?

HDS and EMC are both extremely guilty in this regard; both Virtual Provisioning and Dynamic Provisioning cost me extra as an end-user to license. But this is the technology upon which all future block-based storage arrays will be built. If you guys want to improve the TCO and show that you are serious about reducing the complexity of managing your arrays, you will license it for free. You will encourage the end-user to break free from the shackles of complexity and you will improve the image of Tier-1 storage in the enterprise.

I understand that you feel that you have to maintain the legacy architectures and designs that you have inflicted upon us in the past, but it is time that you stopped enabling our masochistic tendencies and encouraged us to move away from the pain of the past. You can do this by removing the licensing costs for wide-striping; keeping the cost for thin-provisioning is just about acceptable but charging for this key simplifying technology is not!

And if one of you does it first; it's okay to copy! Really it is!! I can feel that we're going to be friends.

Stuff Happens!

Somewhere in the world, there are a bunch of sys-admins, DBAs, application specialists, storage specialists, incident managers, service managers, network specialists and probably a whole bunch of other people running round trying to recover service after a data-centre outage.

How much running around, panic, chaos, shouting and headless-chicken mode there is depends on how much planning, practice and preparedness they have for the event. You might not even notice if you are using the service, because if they have done their work properly, you shouldn't.

Outages happen; big horrible nasty outages happen. In a career which now spans over twenty years, I've been involved with probably half a dozen; from PDUs catching fire due to overload to failed air-conditioning to wrong application of the EPO*. I have been involved in numerous tests; failing over services and whole data-centres on a regular basis and for most of these tests, the end-user would not have been aware anything was happening.

So when Amazon lose a data-centre in their cloud, this should not be news! It will happen; it may be a whole data centre, it may be a partial loss. This is not a failure of the Cloud as a concept; it is not even a failure of the public Cloud; there are thousands of companies who host their IT at hosting companies and it's not that different.

What it is a failure of is those companies who are using the Cloud without considering all the normal disciplines. Yes, deploying to the Cloud is quick, easy and often cheap but if you do it without thought, without planning, it will end up as expensive as any traditional IT deployment. Deploying in the Cloud removes much of the grunt-work but it doesn't remove the need for thought!

Shit happens, deal with it and plan for it!

* Emergency Power Off switches should always be protected by a shield and should never be able to be mistaken for a door opening button! But the momentary silence is bliss!

Enterprise Storage?

Tony Asaro and I have had a bit of a snit over the uniqueness of the USP-V; he opines that it is unique and I maintain that it is not. In many ways, this comes down to Tony's opinion that the USP-V is unique because it is the only external storage virtualisation array which is Enterprise Storage. In his opinion, neither the V-Series nor the SVC are Enterprise Storage and hence they do not compete with the USP, DMX and DS8K range. Also, in SVC's case, because it does not have its own disk and simply virtualises external arrays, it is not a storage device (I'll leave that comment alone).

So what this really boils down to is: what is Enterprise Storage? A couple of years ago, I probably could have sat down and told you what is and what isn't Enterprise Storage, but now? I'm not so sure. I can list you some characteristics of Enterprise Storage but the problem is that pretty much all of the arrays from most vendors have those characteristics!

  • Highly Available – 99.99%+ available

  • Highly Scalable – Supports 500+ disks and many attached hosts

  • Highly Performant – Whatever that means

  • Non-disruptive upgrades – Internal code and hardware can be replaced/upgraded with no service outage

  • Supports multiple RAID Levels

  • Supports multiple disk-types and sizes within the array
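For what it's worth, the "99.99%+" figure in that list translates into a surprisingly small downtime budget; a quick bit of arithmetic:

```python
# Quick sketch: what each availability figure actually allows in downtime
# per (non-leap) year.

MINUTES_PER_YEAR = 365 * 24 * 60   # 525,600 minutes

for nines in (0.99, 0.999, 0.9999, 0.99999):
    downtime = MINUTES_PER_YEAR * (1 - nines)
    print("%7.3f%% available -> %8.1f minutes/year down" % (nines * 100, downtime))
# 99.99% works out at roughly 52.6 minutes of outage a year.
```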

The problem is, as I say, pretty much every array from most vendors has these characteristics. So what actually is Enterprise Storage, or is it entirely defined by the price you pay? Are some things simply too cheap to be classed as Enterprise Storage?

You see, I'm no longer sure, and does it really matter? I suspect it matters a lot to the Hitachis and the EMCs of this world, but to anyone else? For the rest of us, it probably comes down to the eye of the beholder. Thoughts anyone?

Sort of Right, Kind of Wrong!

Steve Duplessie is both right and wrong in his post on SSDs here!

He is right that simply sticking SSDs into an array and treating them as just Super Speedy Disk can cause yet more work and heartache! Concepts such as Tier 0 are just a nightmare to manage!

He is also right that the problem should be defined high-level as the interaction between the user and their data, getting them access to the data as quickly as possible.

He is also right that just fixing one part of the infrastructure and making one part faster does not fix the whole problem. It just moves the problem around!

Unfortunately, whilst every other component in the infrastructure has got faster and faster, storage is arguably actually getting slower! At a SNIA Academy event recently, it was suggested that if storage speeds had kept up with the rest of the infrastructure improvements, disks would now spin at 192,000 RPM. The ratio of capacity to IOPS gets less and less favourable every year; wide striping has helped mitigate the issue but, as disks get bigger, we either accept a situation where we waste more and more capacity (the areal density of IOPS means that most of the capacity on a spindle should just be used for data at rest) or we need a faster storage medium.
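A rough sketch of that worsening ratio; the per-drive figures below are common rules of thumb, not any vendor's spec sheet:

```python
# Rough sketch of "IOPS areal density": random IOPS available per TB.
# The per-drive figures are illustrative rules of thumb, not specifications.

drives = {
    # name: (capacity_tb, random_iops_per_spindle)
    "15K FC, 146 GB":  (0.146, 180),
    "15K FC, 600 GB":  (0.600, 180),
    "7.2K SATA, 2 TB": (2.000, 80),
}

for name, (cap_tb, iops) in drives.items():
    print("%-16s %6.0f IOPS/TB" % (name, iops / cap_tb))
# Bigger spindles deliver the same (or fewer) IOPS spread over far more
# capacity, so either much of that capacity sits idle or a faster medium
# is needed for the hot data.
```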

But we probably don't need a huge amount of faster storage medium and a small sprinkling will go a long way; that's why we need dynamic optimisation tools which move hot chunks of data about. SSDs will be good but just treating them as old-fashioned LUNs might not be the best use of them.

Automation is the answer, but I think Steve knows that! Dynamic optimisation of infrastructure end-to-end is the Holy Grail; we are some way off that, I suspect! I'd just settle for reliable and efficient automation tools for Storage Management at this point.

Pots, Kettles, Stones and Glasshouses

I have a lot of sympathy for Chad's and Chuck's recent posts here and here on Oracle support for VMware, but I would have a lot more sympathy for them if EMC did not have such a track-record of using the support matrix as a marketing weapon.

EMC's continued refusal to certify another controller in front of their arrays, be it NetApp, HDS or IBM, makes their current spat with Oracle quite amusing from where I'm sitting. I know many customers who have requested that EMC certify the NetApp V-Series in front of various EMC arrays; this is always met with an unequivocal NO and dark mutterings about there being issues. If you challenge EMC as to what the issues are, you generally get a lot of hand-waving and nothing more.

Now, we know that various EMC arrays work behind NetApp because there is an increasing number of customers doing so without EMC's certification, but certification would be nice, or at least an open discussion as to what the problems are.

The support matrix should not be used as a marketing and sales tool; it should be used to genuinely add value to the customer/vendor relationship. So guys put your own houses in order before throwing stones!

Maintenance Madness

Despite working for a vendor, David Merrill has a habit of posting some very good entries full of common sense; I find myself nodding in agreement with much of what he posts. His latest couple of entries here and here had me nodding along; it's not just the vendors who are guilty of some dubious voodoo economics. I'm sure that most of us have put together business cases which, if they were really scrutinised, don't stack up.

We often talk about trying to make capital acquisitions cost-neutral in less than eighteen months; a reduction in OpEx to offset the capital cost. Vendors are often complicit in this; as I mentioned in my previous entry, inflated maintenance costs mean that it is often cheaper to refresh and take the bundled maintenance offered with a new system than to continue to pay maintenance on the legacy kit.

However, if I examine the failures that we tend to have; it is generally the moving parts which fail; you know those things which spin at speed? Yes, the spinning rust. And if there is one thing which has fallen in cost; it is spinning rust.

Okay, with the very much older disks, vendors simply can't get new drives that small, but I assume that most of you are aware that a large number of maintenance replacements are not actually new components? They can be previously failed and reconditioned components, or perhaps pulled from arrays which have been migrated to the latest and greatest technology.

Maintenance in the IT industry is a fantastic example of Voodoo Economics…but hey it's green, well they are recycling and re-using! But remember, there is a third part to that; REDUCE!

Vendors don't have any real incentive to reduce maintenance costs. Firstly, it enables a constant upgrade treadmill, because if you really had to evaluate the value of the new features, life would be a lot more complex; and if you don't upgrade, maintenance is a very nice, high-margin activity.

Actually, EMC should be thanking companies like HDS and IBM; their virtualisation products enable people to keep their legacy arrays around for a lot longer and hence keep paying EMC high maintenance! And no, I'm not saying that EMC's maintenance charges are especially high; there are much worse offenders out there!

Investment Strategies and Virtualisation

I sat in a meeting today where the subject of how often you refresh your storage infrastructure came up. I know that many companies are working on a three-to-five-year model, but we were discussing whether this should be increased to seven years and what needs to happen to make this so.

There are a few reasons why we were coming to this conclusion. Firstly, spinning rust in the Enterprise is probably at its peak and anything over the current maximum size of a spindle has potentially limited use, i.e. anything over a 1-2 terabyte drive is not especially useful for shared storage infrastructure. Please note, I say shared storage infrastructure!

Larger drives may still have a part to play in your archive tier, but even that is debatable. And if you look at most Enterprise end-user desktops, they often have rather small local drives. It is the home user and their insatiable demand for storage which really drives the size of spindles now.

We also know that the performance of spinning rust is probably not going to improve dramatically. So what does change? Well, we have the introduction of SSDs, and a couple of things mean that a four-to-five-year refresh cycle for that technology is probably sensible. And then there are the storage controllers themselves; these don't especially wear out, but technology does move on.

But the current designs of arrays mean that when we refresh, we are forced to refresh the lot. We are also forced to refresh by overly inflated maintenance costs. Let's be honest; most refreshes are justified by cost savings on the OpEx, i.e. maintenance. Even if I go to a virtualised infrastructure as espoused by HDS or IBM, these maintenance costs still mean it is often more attractive to refresh rather than sweat the asset.
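A simplified sketch of the refresh-versus-sweat sum; all the figures here, including the maintenance uplift, are illustrative assumptions, not real vendor pricing:

```python
# Simplified refresh-vs-sweat comparison. Every figure is an illustrative
# assumption, not real vendor pricing.

def total_cost_sweat(annual_maintenance, years):
    """Keep the legacy array and pay (often escalating) maintenance."""
    return sum(annual_maintenance * (1.15 ** y) for y in range(years))  # ~15%/yr uplift

def total_cost_refresh(capital, bundled_years, annual_maintenance_new, years):
    """Buy new kit: capital outlay, but maintenance bundled for the first years."""
    paid_years = max(0, years - bundled_years)
    return capital + annual_maintenance_new * paid_years

years = 4
sweat = total_cost_sweat(annual_maintenance=300_000, years=years)
refresh = total_cost_refresh(capital=900_000, bundled_years=3,
                             annual_maintenance_new=100_000, years=years)
print("sweat: %.0f  refresh: %.0f" % (sweat, refresh))
# With maintenance inflated enough, the refresh "pays for itself" on OpEx alone,
# which is precisely the treadmill the vendors benefit from.
```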

However, the current economic climate means that we are now beginning to examine the model of keeping things for longer and to scrutinise our maintenance budgets very carefully: dropping maintenance for software which is now stable and at terminal release; potentially talking to third-party maintenance organisations which are much more willing to support legacy kit at a reasonable cost.

And we are considering strategies which enable us to continue to make use of kit for longer; VMware's announcements today, allowing replication and thin provisioning at the hypervisor layer, are a case in point. So, funnily enough, EMC have come round to external storage virtualisation; you just buy it from VMware as a software product.

It'll be interesting to see what other traditional storage-related functionality makes its way into the hypervisor, and at what point EMC realise that they are actually selling 'traditional' storage virtualisation, just as a software product, and at what point they become a software company.

Funny old world: as EMC slowly metamorphoses into a software butterfly selling storage virtualisation, Oracle becomes a hardware grub. And in the space of a week, EMC 'kill' DMX with V-MAX, then they kill V-MAX with vSphere. Now that's what I call progress!

FAST and Furious

Whilst HDS and EMC throw rocks at each other over whether it is better to build custom parts or to take things off the shelf and only go custom where required (I expect the other Barry to sit on his hands, but there are good reasons why the SVC team decided to build out of commodity parts, and I suspect they are very similar reasons to EMC's), I think we should look beyond the hardware at what is coming down the line to us.

The most important thing roadmapped is FAST, Fully Automated Storage Tiering. FAST changes things; it takes a whole bunch of ideas from a whole bunch of places and runs with them. If you are another vendor and you feel aggrieved that EMC have stolen your idea; just take heart, it won't be the first time in history that this has happened and it won't be the last.

The foundation is Wide-Striping* using a model which splits your data into chunk(let)s and spreads it across spindles. Once these chunks are distributed, you can monitor the characteristics of the I/O at an individual chunk level; this allows us to do tiering at a sub-LUN level. A hot chunk of data can be moved to a higher tier and a cooler chunk of data down into a lower tier.

In the past we have been limited to moving a whole LUN (with the exception of Compellent); this has always been a time-consuming job, identifying what needs to move and then moving it. Yes, technologies have come along to make this easier, but to sweat the asset, and especially to make best use of SSDs, we needed to move individual 'blocks', since in a given file-system it is possible that only some blocks are hot and frequently accessed. Traditionally, if you could, you would hold these in cache, but if SSDs are expensive, cache is yet more so. This approach will allow some cache to be replaced by SSDs and, for some cache-unfriendly workloads, to all intents and purposes, you have massively increased the amount of cache available. You might not want to hold a terabyte or so of real cache for that evil 100% random-read app, but with SSDs this becomes viable and not at a huge utilisation hit.
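The sub-LUN mechanic described above can be sketched very simply; this is a deliberately toy model of heat-based chunk movement, nothing to do with EMC's actual implementation:

```python
# Toy sketch of sub-LUN tiering: track I/O heat per chunk, then promote the
# hottest chunks to faster tiers and demote the rest. A deliberate
# simplification of the FAST idea, not EMC's implementation.

from collections import Counter

TIERS = ["SSD", "FC", "SATA"]          # fastest to slowest

class TieredLun:
    def __init__(self, n_chunks, ssd_chunks, fc_chunks):
        self.heat = Counter()                       # I/Os per chunk this cycle
        self.tier = {c: "SATA" for c in range(n_chunks)}
        self.capacity = {"SSD": ssd_chunks, "FC": fc_chunks}

    def io(self, chunk):
        self.heat[chunk] += 1                       # record a hit

    def rebalance(self):
        """Place the hottest chunks on the fastest tiers, spill the rest down."""
        ranked = [c for c, _ in self.heat.most_common()] + \
                 [c for c in self.tier if c not in self.heat]
        i = 0
        for tier in TIERS[:-1]:                     # fill SSD, then FC
            for c in ranked[i:i + self.capacity[tier]]:
                self.tier[c] = tier
            i += self.capacity[tier]
        for c in ranked[i:]:                        # everything else to SATA
            self.tier[c] = "SATA"
        self.heat.clear()                           # start the next sample window

lun = TieredLun(n_chunks=8, ssd_chunks=1, fc_chunks=3)
for _ in range(100):
    lun.io(5)                                       # chunk 5 is very hot
for _ in range(10):
    lun.io(2)                                       # chunk 2 is warm
lun.rebalance()
print(lun.tier[5], lun.tier[2], lun.tier[7])        # hottest on SSD, cold on SATA
```

Even this toy shows the new-workload problem: a chunk with no history lands wherever the defaults put it, which is exactly why the array is going to need clues from us.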

But there are going to be issues with the FAST approach; firstly, where do you put a new workload? If you simply assign it some disk and let the array decide, what the hell is it going to do with the workload? It could put it on the slowest tier possible and then migrate up; it could stick it on the fastest tier and migrate down. Both of these approaches have significant risk, so I suspect we are going to have to give the array some clues and we are going to have to understand more about the whole system we are putting in. The difference in performance between the top tier and the bottom tier is going to be large.

No longer will the Storage Admin be a LUN Monkey; they are going to need to really understand their workloads and applications. They are going to need to learn to talk to the application developers and understand workloads, and they are also going to have to understand business cycles.

For example, applications which spend eleven months of the year pretty much idle may suddenly, at year end, need a lot of performance. What happens if all your applications demand stellar performance once a year? Perhaps you need a way of warning the array that it needs to prefetch a load of data. A badly written end-of-year reporting extract can generate thousands of random read IOPs; a badly written piece of user-generated SQL, which in the past just crippled the application, could with FAST cripple the whole array as it tries to react.

The FAST approach is potentially the thin-provisioning of IOPs. This is going to need a lot of thinking about. Potentially you will have to domain your storage to protect applications from the impact of one another. We are going to need to know more about the whole system than ever before if we are to truly benefit from FAST.

We will need to build rules which suit our applications; sure, V-MAX will come with its own canned rules for things like VMware and known applications. Indeed, EMC will probably be leveraging all the performance data that they have been gathering over the years to help us write the rules. Storage Templates, as described by Steve Todd here, are just the start.

So although, at one level, the Storage Admin's job could get a lot easier, the Storage Manager's job has got a whole lot harder. Yes Barry, I asked for FAST and now you've given it to us; now we'll have to work out what this all means!

I have some really 'interesting' ideas as to where EMC could take V-MAX but they'll have to wait for another time as I'm still supposed to be on leave from Enterprise IT.

* It's Wide Striping not Wide StripPing as I keep seeing written; Wide Stripping is what happens on a Rugby Tour after a good night out!