Storagebod

Simple Scales?

One of the biggest problems that I and many others have with ‘Big Data’ is that the term tends to imply that all data is the same; actually, that is probably a problem the storage industry has been wrestling with for some time.

Not all data has the same value, life-cycle, access requirements or access characteristics, but in many ways we have carried on as though it does. What is true of ‘Small Data’ is also true of ‘Big Data’; ‘Big Data’ comes in all sorts of shapes and sizes.

Does this mean that EMC are wrong to talk about ‘Big Data’? Well, the answer is both yes and no. I prefer to think about ‘Data at Scale’: how do we manage all types of data at scale? And are we trying to manage data at the wrong level? Once it gets to the array, is it already too late to manage it?

Interestingly, when we start to manage data at scale, we start to look for simplicity, stripping out much of the value-add that the vendors have spent their development dollars building into traditional arrays. Instead, we look to implement these value-add features in the applications which generate the data, as it is at that level that the data characteristics are best understood.

The question for the storage vendors is whether this trend will continue. Is it simply that the features do not yet scale, both administratively and technically? Or is the management of data at scale simply different?

Is the era of general-purpose storage coming to an end? Will all storage trend towards the lowest capability and rely on application intelligence to provide features? Or will the storage vendors find ways to clothe their products with new value propositions and complexity?


One Comment

  1. Glyn Bowden says:

    Hey Martin,

    Great post! So let me give you my take on things here.

    The “Big Data” term is missing a word. That word is “set”. Everything currently in one of these pools I would consider part of the same data set, and therefore of the same class and relevance as every other piece of data. If only a small subsection of it were required at any one point, then traditional storage approaches would be adequate, with some archiving or ILM-type activity to push it into a store when it becomes passive.

    However, big data (set) assumes that any of that data may be needed at any time, and quickly, with no obvious pattern to that access for a lot of the data, or at least with a response demanded that doesn’t allow any data movement to happen as a result of an intelligent prediction. Think of Facebook, for example: a huge pool of data that could be called upon at any point and needs to be massively distributed, as your embarrassing photos from the last #storagebeers could be recalled for tagging at any moment. Just as an example. :-)

    There are definitely challenges in this, as over time the value of certain pools will decrease, and at that point they would be moved off the Big Data pool and into more traditional backup. Quite how that will happen is up for discussion, but a data mover of some sort, or the application processing itself, would simply have a third destination on which to place the expired data. CDMI will be a massive enabler of this, I believe, so that’s certainly a capability to keep track of.
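As a rough illustration of the data-mover idea described above, here is a minimal sketch of an age-based placement policy. The tier names (`big-data-pool`, `nearline`, `backup`) and the 30/365-day thresholds are entirely hypothetical; a real mover or CDMI-aware application would map placement decisions onto actual stores.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical destinations; a real system would map these to actual stores.
HOT, WARM, ARCHIVE = "big-data-pool", "nearline", "backup"

@dataclass
class DataObject:
    name: str
    last_accessed: datetime

def choose_tier(obj: DataObject, now: datetime) -> str:
    """Age-based placement: recently touched data stays in the big-data
    pool; passive data is pushed towards the third, archive destination."""
    age = now - obj.last_accessed
    if age < timedelta(days=30):
        return HOT
    if age < timedelta(days=365):
        return WARM
    return ARCHIVE

now = datetime(2012, 1, 1)
print(choose_tier(DataObject("photo.jpg", now - timedelta(days=2)), now))    # big-data-pool
print(choose_tier(DataObject("old-log.gz", now - timedelta(days=400)), now)) # backup
```

The point of the sketch is only that the policy lives outside the array: whatever generates or manages the data decides placement, and the storage simply provides the destinations.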

    Also, your point about Big Data storage not currently having the other value props that vendors are touting now, such as thin provisioning, dedupe etc., is spot on, for now. There is no reason that some of this intelligence can’t find its way back to the arrays, and I have no doubt that it will, but how effective that is will depend greatly on how distributed or parallel the filesystem is. Dedupe across multiple backend storage units would be a pretty neat trick, but would the processing and bandwidth requirements be worth the result? I’m not so sure. What would be required for this functionality to operate efficiently would be some intelligence external to the storage, in a manager of sorts with an understanding of each individual object. That, again, sounds a hell of a lot like CDMI to me.
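To make the cross-unit dedupe point concrete, here is a minimal sketch of why an external manager is needed: deduplicating across backend units requires a shared index of chunk fingerprints that no single unit can hold on its own. The `chunk_index` dictionary and `store_chunk` function are illustrative assumptions, not any real product's API.

```python
import hashlib

# A shared, external index mapping chunk hashes to the backend unit that
# already holds them -- the "manager of sorts" external to the storage.
chunk_index: dict[str, str] = {}

def store_chunk(data: bytes, unit: str) -> str:
    """Store a chunk on `unit` unless an identical chunk already exists
    anywhere in the cluster; return the unit that actually holds it."""
    digest = hashlib.sha256(data).hexdigest()
    if digest in chunk_index:
        # Duplicate found on another (or the same) unit: write nothing new.
        return chunk_index[digest]
    chunk_index[digest] = unit  # first copy: record its location
    return unit

print(store_chunk(b"hello", "unit-A"))  # unit-A
print(store_chunk(b"hello", "unit-B"))  # unit-A (deduped across units)
```

The sketch also shows where the cost concern comes from: every write must consult (and keep consistent) an index spanning all units, which is exactly the processing and bandwidth overhead questioned above.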

    “Big Data” as a phrase has very quickly gone the same way as “Cloud”: pretty meaningless. The concept, however, is very real and will begin to affect more and more organisations as our ability to produce new data and consume that data continues its exponential growth.
