A number of conversations about archiving, backup and the like got me thinking, on my walk to work this morning, about data and its half-life.
I am wondering whether big data, customer self-service and generally improved access to data have changed the half-life of data, especially that of structured data. Let's take Amazon as an example and compare it to a mail-order retailer of the past.
In days gone by, let's say I ordered something by mail order; I'd select my items, possibly from a catalogue, and place my order. The retailer would enter the order into their system and fulfil it. Once that order was completed, it was extremely unlikely that the record associated with it would ever be accessed again, and it almost certainly would not be accessed multiple times; that record was, to all intents and purposes, dead.
Now, if we look at Amazon, the process of ordering is pretty similar, apart from the fact that the catalogue is online and I enter the order into their system myself. But once the order is completed, the data is still accessed many times.
Every time I browse the catalogue and look at an item, Amazon checks to see if I’ve ordered that item in the past and warns me if I have. I regularly check my own order history as well. Amazon use my order history to recommend items to me. The record still has value and lives on. The half-life of that data is changing and becoming longer.
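To make the contrast concrete, here is a minimal sketch of how an order record keeps earning its keep long after fulfilment. The data model and names (orders, history, already_ordered, recommend) are entirely hypothetical and not anything Amazon actually does; the point is simply that the same completed record gets read again and again, once for the "you've bought this before" warning and again for co-purchase recommendations.

```python
from collections import Counter, defaultdict
from datetime import date

# Hypothetical completed order records: (customer_id, item_id, order_date)
orders = [
    ("alice", "book-123", date(2011, 3, 4)),
    ("alice", "cable-9", date(2012, 7, 1)),
    ("bob", "book-123", date(2012, 2, 14)),
    ("bob", "lamp-7", date(2012, 2, 14)),
]

# Index completed orders by customer so the old records stay queryable.
history = defaultdict(list)
for customer, item, when in orders:
    history[customer].append((item, when))

def already_ordered(customer, item):
    """Warn a browsing customer if they've bought this item before."""
    return [when for past_item, when in history[customer] if past_item == item]

def recommend(customer, top_n=3):
    """Suggest items bought by customers who share purchase history with this one."""
    mine = {item for item, _ in history[customer]}
    scores = Counter()
    for other, purchases in history.items():
        if other == customer:
            continue
        theirs = {item for item, _ in purchases}
        if mine & theirs:                 # shared purchase history
            scores.update(theirs - mine)  # suggest what they have and I don't
    return [item for item, _ in scores.most_common(top_n)]

print(already_ordered("alice", "book-123"))  # [datetime.date(2011, 3, 4)]
print(recommend("alice"))                    # ['lamp-7']
```

Every call above touches records for orders that were fulfilled long ago; the data never goes cold.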
We may be generating more and more unstructured data; its growth far outstrips that of structured data, and it attracts a huge amount of focus. But I suggest we ignore structured data growth at our peril: we need to understand the impact of its changed half-life.
It's a real problem. Over the years I've watched the half-life of backups increase substantially, which is what led me to discuss it last year.
The real challenge in managing storage growth is not actually keeping pace with data growth, but making sure that we're not keeping data around "because it's cheap to do so"; in the end, stagnant, stale data that has exceeded its half-life sitting on primary storage just creates even worse problems.
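As a rough illustration of what "exceeded its half-life on primary storage" might look like in practice, here is a sketch that walks a filesystem and flags files that have not been read within an assumed threshold. The 180-day half-life is an arbitrary example policy, not a recommendation, and on volumes mounted with noatime or relatime the access times it relies on will not be trustworthy.

```python
#!/usr/bin/env python3
"""Flag files whose last access exceeds an assumed half-life threshold."""
import os
import sys
import time

HALF_LIFE_DAYS = 180          # example policy threshold, not a hard rule
CUTOFF = time.time() - HALF_LIFE_DAYS * 86400

def stale_files(root):
    """Yield (path, size_bytes, days_idle) for files not read since the cutoff."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue              # vanished or unreadable: skip it
            if st.st_atime < CUTOFF:
                days_idle = int((time.time() - st.st_atime) / 86400)
                yield path, st.st_size, days_idle

if __name__ == "__main__":
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    total = 0
    for path, size, idle in stale_files(root):
        total += size
        print(f"{idle:5d} days idle  {size:12d} bytes  {path}")
    print(f"\n{total / 1e9:.1f} GB of candidate data for archive or deletion")
```

Even a crude report like this makes the "it's cheap to keep" argument harder to sustain, because it puts a number on how much primary capacity is sitting idle.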
Understanding data, its value, its use and its life-cycle, and then managing it accordingly, is probably the biggest challenge facing storage and data teams. The focus on increased agility has yet again pushed this back into the shadows, because increasing agility and velocity is easy and cool.
Compression and deduplication are sticking plasters which let people pretend they have this under control. They don't, and, like antibiotics, those measures will become less effective over time.
This does not mean that I am anti-dedupe or anti-compression, but ultimately we will end up with our backs against the wall.
Well, I think that overall the average half-life of data may be increasing, because every day new ways of generating, analyzing and "cross-using" already-stored or parallel-stored data are invented. For some of it, keeping it really makes sense and creates additional business cases; some of it has to be kept for compliance and legal reasons; but IMHO a huge part of it is stored simply for lack of good information-lifecycle processes and policies. Although I hope Amazon stores its transaction data in a structured format, "Big Data" (Tadaaa!) seems to be the ideal candidate for "Store me -> Forget me -> Find me again -> Be overwhelmed by my mass -> Give up -> Just keep me". We definitely need better ways to organize and keep track of unstructured data, and I'm sure all the major players have several interesting things coming in their future product pipelines for that purpose.
By the way, a nice approach would be to have a kind of neural network of information, with weighted values determining the worth of a certain data set in relation to others… hmm… my fingers start to tingle… time to write some code :o)
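For what it's worth, here is one way that idea might be sketched if you read "network of information with weighted values" as a weighted graph rather than a literal neural network: data sets are nodes, the links express how strongly one data set feeds another, and a simple PageRank-style propagation turns those link weights into a relative worth per data set. All the node names and weights below are made up purely for illustration.

```python
# Toy "information network": data sets as nodes, weighted links expressing
# how strongly one data set's use depends on another.
relations = {
    "orders":           {"recommendations": 0.8, "order_history_ui": 0.9},
    "clickstream":      {"recommendations": 0.5},
    "recommendations":  {},
    "order_history_ui": {},
    "old_campaign_dump": {},   # nothing links to it, and it links to nothing
}

def score_worth(relations, damping=0.85, iterations=50):
    """Iteratively propagate worth along weighted links (PageRank-style)."""
    nodes = list(relations)
    worth = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src, links in relations.items():
            total_weight = sum(links.values())
            if not total_weight:
                continue               # dead-end node keeps only its base worth
            for dst, weight in links.items():
                new[dst] += damping * worth[src] * (weight / total_weight)
        worth = new
    return worth

for name, value in sorted(score_worth(relations).items(), key=lambda kv: -kv[1]):
    print(f"{name:18s} {value:.3f}")
```

Data sets that other data sets rely on float to the top, while something like the unreferenced dump sinks to its base worth, which is roughly the kind of signal an information-lifecycle policy could act on.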