With Automated Tiering solutions now rapidly becoming de rigueur, in announcement form at least, perhaps it's time for us to consider again what is going to go wrong and how much trouble the whole thing is going to cause some people.
Fully Automated Tiering solutions are going to cause catastrophic failures when they are deployed without an understanding of the potential pitfalls. I keep going on about this because I think that some people are sleep-walking into a false nirvana where they no longer need skilled people to manage their infrastructure estates; in fact, automated solutions mean that you are going to need even more skilled people. You will probably need fewer monkeys, but you are going to need people who can think and learn.
Fully Automated Tiering solutions cannot be fully automated and still be safe for most large businesses to run. They work on the premise that data gets aged out to slower, less performant disk; this reduces the amount of expensive fast disk that you require and can save you serious money. So why wouldn't you do this? It's obvious that you want to, isn't it?
Actually, in most cases it probably is. Unstructured file data is almost certainly safe to shift on this basis; in fact, most unstructured data could be stuck on the lowest tier of disk from day one and most people wouldn't notice. You could write most of it to /dev/null and not that many people would notice.
But structured data often has a more complex life-cycle. In accounting and billing systems, for instance, the data can be very active initially before entering a period of dormancy when it basically goes to sleep. Then something happens to wake it up: periodic accounting runs, audits; and the data will be awake and almost certainly required pretty sharpish. If your automated array has moved all the dormant data to a less performant tier, your periodic accounting jobs will suddenly take a lot longer than expected and potentially cripple your infrastructure. And that is just the predictable periodic jobs.
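To make that failure mode concrete, here is a minimal sketch of the kind of naive, purely age-based placement an automated array might apply; the tier names and idle thresholds below are invented for illustration and are not any particular vendor's algorithm.

```python
from datetime import datetime

# Illustrative tiers and idle thresholds only; these numbers are invented
# for the example and are not any vendor's actual policy.
DEMOTE_TO_FC_AFTER_DAYS = 7     # EFD -> FC once data has been idle a week
DEMOTE_TO_SATA_AFTER_DAYS = 30  # FC -> SATA once idle for a month

def tier_for(last_access: datetime, now: datetime) -> str:
    """Naive, purely age-based placement: the only input is time since last access."""
    idle_days = (now - last_access).days
    if idle_days < DEMOTE_TO_FC_AFTER_DAYS:
        return "EFD"
    if idle_days < DEMOTE_TO_SATA_AFTER_DAYS:
        return "FC"
    return "SATA"

month_end = datetime(2011, 3, 31)
# Billing data untouched since the previous month-end run looks "cold"...
print(tier_for(datetime(2011, 2, 28), month_end))  # -> SATA
# ...right up until tonight's month-end batch hammers it on the slowest tier.
```

The policy is doing exactly what it was asked to do; it simply has no idea that the "cold" data is about to be needed in a hurry.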
Automated Tiering solutions also encourage poor data management policies; actually, they don't so much encourage as reinforce a lot of the bad practice which is in place at the moment. How many people have applications whose data will grow forever? Applications which have been designed to acquire data but never age it out? With Automated Tiering solutions, growing your data forever no longer attracts the cost that it once did, certainly from a storage point of view, and it might even make you look good: as the data continues to grow, the amount of active data as a percentage of the estate will fall, the amount of data sitting on the lower tiers will increase, and you could argue that you have really good storage management policies in place. And you do!
However, you have really poor Data Management policies in place, and I think that one of the consequences of Automated Tiering is the need for Data Management: your storage might well be self-managing, but your data isn't. If you do not address this and do not get a true understanding of your data, its life-cycle and its value, you are looking down the barrel of a gun. Eventually, you are going to face a problem which dwarfs the current storage-management problems.
What kind of problems?
- Applications that are impossible to upgrade because the time taken to upgrade data-in-place will be unacceptable.
- Applications that are impossible to recover in a timely fashion.
- Application data which is impossible to migrate.
- Data corruptions which down your application for weeks.
And I'm sure there are a lot more. It's hard enough to get the business to agree on retention periods for back-ups; you need to start addressing Data Management now with both your Business and your Application teams.
The good news is that as Storage Management becomes easier, you've got more time to think about it; the bad news is you should have been doing something about it already.
(Full Disclosure, I am an EMC employee)
Martin-
You bring up a very good point that I fully agree with. I don’t think there is one silver-bullet solution when it comes to tiering, and tiering should not be an excuse for bad data management practices.
I am personally a big fan of EMC’s current approach (I know, big surprise) of combining sub-LUN tiering (FAST) with solid-state caching (FAST Cache). Basically, do your tiering as a two-tier approach (FC & SATA) and then use the EFD drives as a solid-state read/write cache. The tiering handles the slow progression of data from warm to cold, while the EFD cache handles the suddenly hot data that was previously warm/cold.
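As a toy sketch of the shape of that idea, and emphatically not how FAST or FAST Cache are actually implemented (every name and policy below is invented), a small fast cache sitting in front of slower tiers behaves something like this:

```python
from collections import OrderedDict

class TieredStore:
    """Toy model: blocks live on a slow tier; a small EFD cache fronts the reads."""

    def __init__(self, cache_slots=2):
        self.tier = {}                      # block -> "FC" or "SATA"
        self.cache = OrderedDict()          # LRU-ordered EFD cache
        self.cache_slots = cache_slots

    def read(self, block):
        if block in self.cache:             # suddenly-hot data served from EFD
            self.cache.move_to_end(block)
            return f"{block}: cache hit (EFD)"
        where = self.tier.get(block, "SATA")
        self._cache(block)                  # first slow read pulls it into cache
        return f"{block}: read from {where}, now cached"

    def _cache(self, block):
        self.cache[block] = True
        if len(self.cache) > self.cache_slots:
            self.cache.popitem(last=False)  # evict the least recently used block

store = TieredStore()
store.tier["billing-2010"] = "SATA"         # tiering demoted it long ago
print(store.read("billing-2010"))           # first access pays the SATA penalty
print(store.read("billing-2010"))           # repeat accesses come from cache
```

The point of the sketch is simply that the first touch of previously cold data still comes off the slow tier; only the repeat accesses get the benefit of the cache.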
I wouldn’t be surprised to see a vendor come out with some sort of tiering scheduling interface to move specific data up beforehand, based on business rules (month-end reporting etc.). That said, I think that adds to the complexity.
A scheduling interface is almost certainly going to come from at least one vendor, and I suspect from most. Does it add complexity? Only in that we have to start really understanding our data; not really such a bad thing in my opinion. If we are going to be overcommitting on I/O, and that is pretty much what we are doing, we need to understand how much overcommitment we have and whether we can handle a perfect-storm peak load.
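As a back-of-envelope illustration of what I mean by overcommitment (all the figures below are invented for the example, not measurements from any real array or application):

```python
# Rough per-spindle IOPS and drive counts; every number here is an invented
# ballpark figure purely to show the arithmetic.
iops_per_drive = {"EFD": 5000, "FC_15k": 180, "SATA": 80}
drives         = {"EFD": 8,    "FC_15k": 120, "SATA": 240}

deliverable = sum(iops_per_drive[t] * n for t, n in drives.items())

# Every application's own peak demand, summed as if they all peaked at once.
app_peak_iops = {"billing": 30000, "oltp": 45000, "reporting": 20000, "file": 5000}
demanded = sum(app_peak_iops.values())

print(f"deliverable ~ {deliverable:,} IOPS, perfect-storm demand ~ {demanded:,} IOPS")
print(f"overcommitment ratio ~ {demanded / deliverable:.2f}x")
```

If that ratio is comfortably above one, you are betting that the peaks never coincide; that is a bet you should be making consciously, not by accident.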
We will still need to consider balancing workloads, and although automation and instrumentation should make this easier, we will still require skilled humans to make decisions. The skills will be different from those required today, however.
I don’t want to shill for one vendor or approach, but just taking EMC as an example, its FAST Cache approach would presumably deal with the sub-24-hour turnaround you’d need for your once-every-30-days billing cycle data, while its “normal” FAST would benefit data that tends to stay hot for longer intervals.
FAST Cache may be of benefit in some cases but not all; if the data you are interested in has ended up on the lowest tier of disk, it still has to get off that disk to end up in cache. Depending on the access patterns, this may or may not introduce latency that you can live with.
There may well be data that you want to ‘pin’ to particular tiers of disk; for example, you may decide that some data never needs to make it to your highest tier and some data should never be placed on your lowest tier.
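Purely as a hypothetical illustration, a pinning policy might express something like a floor and a ceiling per dataset; the names and notation below are mine, not any vendor's actual interface.

```python
TIER_ORDER = ["EFD", "FC", "SATA"]  # fastest to slowest

# Hypothetical pinning rules: a 'ceiling' (highest tier allowed) and a
# 'floor' (lowest tier allowed) per dataset. Dataset names are made up.
pin_policy = {
    "general_ledger": {"ceiling": "EFD", "floor": "FC"},    # never allowed on SATA
    "web_logs":       {"ceiling": "FC",  "floor": "SATA"},  # never promoted to EFD
}

def allowed_tiers(dataset):
    """Return the tiers a dataset may occupy under the pinning policy."""
    rule = pin_policy.get(dataset, {"ceiling": "EFD", "floor": "SATA"})
    top = TIER_ORDER.index(rule["ceiling"])
    bottom = TIER_ORDER.index(rule["floor"])
    return TIER_ORDER[top:bottom + 1]

print(allowed_tiers("general_ledger"))  # ['EFD', 'FC']
print(allowed_tiers("web_logs"))        # ['FC', 'SATA']
```

Of course, writing rules like these requires you to know which datasets matter and why; which is rather the point.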
The implementation of automated tiering does not abrogate responsibility for understanding and management of your data.
Martin,
Great post! This is something that is definitely near and dear to our hearts at Isilon. Would love to speak with you to expand on how this aligns with our vision. Interested?
Chris