Much of what we do in Storage Management can be considered living on a prayer; this is not just a result of the parlous state of the storage management tools that we use but also due to the complex interactions which can happen within shared infrastructure.
And frighteningly enough, we are in the process of making the whole thing worse! The two darling storage technologies of the moment, thin provisioning and de-dupe, scare me witless. Both of these technologies in the wrong hands have the capability to bring a server estate to its knees. By wrong hands, I mean just about anybody.
Both of them allow you to virtualise capacity, allocating storage which isn't actually there and hoping that you never need it.
Thin provisioning is predicated on the inefficient practices that we have come to know and love; we all know that when a user asks for storage, they nearly always ask for too much! Thin provisioning allows us to allocate the disk logically, and only when it gets written to does it actually get consumed.
The problem is, what happens in the event of a perfect storm and every application wants its capacity at the same time? How much do you overcommit your physical capacity? Or maybe not a perfect storm; you just realise that you're going to have to add physical capacity above and beyond that which the array supports, simply to cater for the rate at which the thinly provisioned storage is being consumed. A rapid application migration ensues.
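To make the sums concrete, here is a rough back-of-the-envelope sketch of the kind of check involved; the figures, thresholds and the 45-day lead time are purely illustrative, not a recommendation:

```python
# Back-of-the-envelope thin-provisioning check (all numbers are illustrative).

physical_tb = 100.0          # usable physical capacity in the pool
allocated_tb = 280.0         # logically allocated to hosts
consumed_tb = 62.0           # actually written so far
growth_tb_per_day = 0.9      # observed consumption rate

overcommit_ratio = allocated_tb / physical_tb
utilisation = consumed_tb / physical_tb
days_to_full = (physical_tb - consumed_tb) / growth_tb_per_day

print(f"Overcommit ratio : {overcommit_ratio:.1f}:1")
print(f"Pool utilisation : {utilisation:.0%}")
print(f"Days until full  : {days_to_full:.0f}")

# If the pool fills faster than new disk can be procured and installed,
# the utilisation figure is giving you false comfort.
lead_time_days = 45
if days_to_full < lead_time_days:
    print("WARNING: capacity will run out before new disk can arrive.")
```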
And then there is the scary proposition of de-duped primary storage. You could be many times over-subscribed with de-duped storage; certainly in a virtualised server environment or a development environment where you have many copies of the same data. And then someone does something: a user decides to turn on encryption, and what were many de-duped copies of the same data become many full copies of the same data, and you have run out of storage space in spectacular fashion.
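Purely to illustrate the arithmetic (the capacities and ratios below are invented), this is what a collapsing de-dupe ratio does to your effective capacity:

```python
# Illustrative only: effective capacity collapsing as the de-dupe ratio falls.

physical_tb = 100.0      # raw usable capacity
logical_tb = 350.0       # logical data stored against it

def effective_capacity(physical_tb, dedupe_ratio):
    """Logical capacity the array can hold at a given de-dupe ratio."""
    return physical_tb * dedupe_ratio

for ratio in (4.0, 2.0, 1.0):   # 1.0 = encryption has defeated de-dupe
    cap = effective_capacity(physical_tb, ratio)
    headroom = cap - logical_tb
    state = "OK" if headroom >= 0 else "OUT OF SPACE"
    print(f"de-dupe {ratio:.0f}:1 -> effective {cap:.0f} TB, "
          f"headroom {headroom:+.0f} TB ({state})")
```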
Migrating de-duped primary storage between arrays is going to be a lot of fun as well and will need a lot of planning. De-duping primary storage may well be one of the ultimate vendor lock-ins if we are not careful.
Both thin provisioning and primary storage de-dupe take a lot of control away from the storage team; this is not necessarily a bad thing, but the storage team now needs to understand a lot more about what their users are doing.
It will no longer be enough just to think about supplying spinning rust which can deliver a certain amount of capacity and performance. We are going to have to understand what the users are doing day-to-day, and we are going to have to communicate with each other.
And yes, we'll need better tools which allow us not only to see what is going on in the environment but also to model the complex interactions and impacts of various events. We are going to need to know if a user is intending to do something like enabling encryption, a big data refresh or operating system patching; events which in the past were not hugely significant could now have serious ramifications for our storage capacity.
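As a sketch of the sort of modelling I mean (the events and figures below are entirely made up), even a toy calculation of planned changes against pool headroom would be a start:

```python
# Toy model of planned-event impact on a thin/de-duped pool (figures invented).

physical_tb = 100.0
consumed_tb = 62.0

# Extra physical consumption each planned change is expected to trigger.
planned_events = {
    "enable encryption on dev VMs": 18.0,   # de-dupe savings lost
    "quarterly data refresh":        7.5,   # new unique blocks written
    "OS patching across the estate": 3.0,   # thin volumes inflate slightly
}

for event, impact_tb in planned_events.items():
    consumed_tb += impact_tb
    headroom = physical_tb - consumed_tb
    flag = "" if headroom > 10 else "  <-- talk to the storage team first!"
    print(f"{event:32s} consumed {consumed_tb:6.1f} TB, "
          f"headroom {headroom:5.1f} TB{flag}")
```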
I still think that thin provisioning and de-dupe are good ideas but, like all good ideas, they come with challenges and a certain amount of risk…
Martin-
I think you raise a very interesting point. Rather than exposing the dangers of these oversubscription technologies, I think it highlights the need for the infrastructure and application teams to work together closely. Remember, in the end you all work for the same company, and things like dedupe and thin provisioning mean big savings for your employer. Oversubscription happens all over the datacenter, not just with storage.
Think about the chargeback model that many large enterprises employ. Why do they do that? It’s not to make Infrastructure a revenue center, but to make the application owners more accurate with their requests for things like storage. In a perfect world where infrastructure and application owners work collaboratively to properly size and maintain the environment, a chargeback model is not required.
The two keys to successfully oversubscribing your storage environment are non-technical and lie with management:
#1- management must fully support the architecture and make sure everyone who touches it is mindful of the effects of changes on oversubscription
#2- give the storage team the capability to quickly add storage if the need arises (pre-approved purchase orders, etc.).
Both things will minimize the risk of oversubscription.
I completely agree; it is certainly time for application teams and infrastructure teams to work a lot more closely and develop a much better appreciation of each other’s challenges. We need to de-mystify the black arts and operate with transparency, explaining how the infrastructure works and what the risks can be.
And charge-back models in an oversubscribed environment? Sounds like a potential profit centre to me!!
“Just a Storage Guy” was right when he said that oversubscription happens all over, but not just in the datacenter. How many times have we seen service providers oversubscribe network links, airlines oversubscribe planes, and Ticketmaster oversell venues?
I think it has become less of a problem of planning and more of a problem of statistics. In the analog world, are the benefits of oversubscription such that it is financially in our best interest to continue these practices, or will the backlash from consumers be too great?
In the digital world, what is the statistical likelihood of the “perfect storm” happening, and most importantly, how does that statistic interrelate with our guaranteed uptime requirements? Of course, IT is the only field I know of where one-in-a-million occurrences happen every day.
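To put a rough number on it (the model and figures here are entirely hypothetical), a simple binomial estimate of the perfect storm looks something like this:

```python
# Hypothetical "perfect storm" estimate: probability that simultaneous demand
# exceeds physical capacity. The model and numbers are invented for illustration.
from math import comb

n_apps = 50          # applications sharing the thin pool
p_spike = 0.02       # chance any one app claims its full allocation this week
spikes_to_fail = 8   # simultaneous spikes needed to exhaust the pool

# P(X >= spikes_to_fail) for X ~ Binomial(n_apps, p_spike)
p_storm = sum(
    comb(n_apps, k) * p_spike**k * (1 - p_spike)**(n_apps - k)
    for k in range(spikes_to_fail, n_apps + 1)
)
print(f"Weekly chance of a perfect storm: {p_storm:.2e}")
print(f"Chance of at least one in a year: {1 - (1 - p_storm)**52:.2e}")
```

Whether that number is acceptable depends entirely on the uptime you have promised, which is exactly the point.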
In the end, it all boils down to this: “Risk Management” is not “Risk Elimination”.
Martin,
Very interesting and worthwhile post. I have several responses to this, which I have posted on my blog under the fanciful headline “Who’s Afraid of the Big, Bad Dedupe?”: http://onlinestorageoptimization.com/index.php/whos-afraid-of-the-big-bad-dedupe/
I expect there will be further discussion on this topic, which I look forward to reading.
Carter George, VP Products, Ocarina Networks
http://www.ocarinanetworks.com
Martin…thanks for raising these issues. Prospective buyers need to understand these pitfalls. Data reduction should be completely transparent and complementary to existing storage management best practices. Today, dedupe for online data does not pass that test on multiple fronts.
For anyone interested, I propose an alternative perspective based upon real-time compression in my blog post titled ‘Is Dedupe Right for Online Data?’
http://storwize.wordpress.com/2009/07/28/is-dedupe-right-for-online-data/
Thanks,
Peter Smails
SVP, Marketing, Storwize
“Hope for the best, plan for the worst” is an old saying that still holds true. You should never use Thin Provisioning unless you understand your plan to rebuild if disks fail and how you will manage the risk of exhausting your capacity.
At HDS we avoid broad statements about capacity savings since where you set the peg is extremely dependent on the applications and usage patterns, and on your risk management preferences and strategies. You make a good point about the need for increased communication with and awareness of the requirements of the application community.
Some situations allow for generous over-provisioning, others don’t. Our customers have found that in a tiered storage environment it certainly is a useful option.
You focus on overprovisioning as the key benefit of thin provisioning. Our experience has led us to believe that, with or without running thin, the improvements in operational provisioning and in automating performance optimization and load balancing make it worthwhile. Saving money by running thin and through storage reclamation is gravy on top of the real improvements in these latter areas.