
More Data

One of the problems with Big Data is that it is Big; this may seem obvious, but to many people it's not. When they hear Big Data, what they end up thinking of is simply More Data, so they engineer a solution based upon that premise.

What they don't realise is that they are not simply dealing with More Data; they are dealing with Big Data! For example, I know of one initiative which captures, let's say, 50,000 data points today, and someone has decided that it might be better if it captured 5,000,000 data points. The chosen solution is simply to throw bigger hardware at it and not to re-engineer the underlying database.

Yes, there will be some tweaking and work done on the indices, but ultimately it will be the same database. This is not to say that it won't work, but will it still work when the 5,000,000 inevitably becomes 50,000,000 data points? It is simply not enough to extrapolate performance from your existing solution, yet how much of capacity planning is exactly that?
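To see why extrapolation is dangerous, here is a minimal back-of-envelope sketch in Python. The cache size, bytes per data point and latency figures below are assumptions picked purely for illustration, not numbers from the initiative above; the point is that the same database can look perfectly fine at 5,000,000 data points because the working set still fits in memory, then fall off a cliff at 50,000,000 when it no longer does.

    # Back-of-envelope model (illustrative only; every figure here is an assumption):
    # why linear extrapolation breaks once the working set no longer fits in cache.

    RAM_CACHE_GB = 64        # assumed buffer cache available to the database
    BYTES_PER_POINT = 4096   # assumed on-disk footprint per data point (row + index)
    CACHE_HIT_US = 100       # assumed latency for a lookup served from memory
    DISK_MISS_US = 8000      # assumed latency for a random read from spinning disk

    def avg_lookup_latency_us(data_points: int) -> float:
        """Average lookup latency under a crude uniform-access cache model."""
        dataset_gb = data_points * BYTES_PER_POINT / 1e9
        hit_ratio = min(1.0, RAM_CACHE_GB / dataset_gb)
        return hit_ratio * CACHE_HIT_US + (1 - hit_ratio) * DISK_MISS_US

    for n in (50_000, 5_000_000, 50_000_000):
        print(f"{n:>11,} data points -> ~{avg_lookup_latency_us(n):,.0f} µs per lookup")

Under these assumptions the first two sizes both come out at roughly 100 µs per lookup, while the third jumps to over 5,000 µs; that cliff is exactly what straight-line capacity planning misses.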

If you are already in the position of looking at More Data, it will probably become Big Data before you know it, and if you haven't already engineered for it, you are going to have a legacy, dare I say it 'Toxic', situation in short order.

Everything is changing; don’t get left behind.

Big != More; Big == Different.

Think Big, Think Different.

 


2 Comments

  1. Great post Martin. I totally agree that more data inevitably becomes Big Data once people truly understand the untapped potential their data offers them from a business perspective. They will just want more and more data points to tap into, which means more and more innovation, creativity, improvements, etc. “Ask not what to do with your data, ask what your data can do for you” Sorry – had to throw that in.

  2. InsaneGeek says:

    This post hits very close to home, and one of the key items that really needs to be talked about is limitations. Why won't it scale up, why should we look at it differently, why must we plan for this, etc. Limitations don't mean the technology choices we've made are bad, but that there are simply things it won't be able to do no matter how big a check you are writing.

    For example, earlier this year I had a performance problem come to me where the end user had done an excellent job of describing a problem that they wanted to run on existing infrastructure (great detail, too bad it was a horrible problem).

    They had two tasks that needed to do 99% random I/O over NFS (or another shared-cluster medium):

    1) sustained reads/writes of 140,000 4KB XML documents/sec over a dataset of 10 billion files (a 40TB dataset)
    2) 2.9 million retrievals/second, or ~250 billion retrievals/day, of 200-byte data structures (a ~10TB dataset); see the rough arithmetic sketched below
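
    Just as a rough sense of scale (the ~180 random IOPS per 15K RPM spindle figure is an assumption, and this ignores RAID penalties, NFS overhead and controller limits entirely), back-of-envelope arithmetic like the following is usually enough to show why no check is big enough:

        # Rough spindle-count arithmetic (illustrative only; the IOPS-per-spindle
        # figure is an assumption and RAID/NFS/controller overheads are ignored).

        IOPS_PER_SPINDLE = 180   # assumed random IOPS from one 15K RPM disk

        workloads = {
            "140K random 4KB XML reads+writes/sec": 140_000,
            "2.9M random 200-byte retrievals/sec": 2_900_000,
        }

        for name, iops in workloads.items():
            print(f"{name}: ~{iops / IOPS_PER_SPINDLE:,.0f} spindles minimum")

        # Sanity check on the daily figure quoted above
        print(f"2.9M/sec over a day ≈ {2_900_000 * 86_400 / 1e9:,.1f} billion retrievals")

    Under that assumption the first workload needs hundreds of spindles and the second well over ten thousand, before you even get to the metadata cost of 10 billion files.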

    I had to have a fairly significant conversation with them: no, we will never be able to do that, and it's not because we made poor vendor decisions; the business wouldn't be able to purchase a system that could do it at the rate they wanted (within reason anyway). This came out of left field for us. I just received a performance call one day and found out that they had been running in production for about 8 months, but because its dataset grew at a rate of N squared, the performance started to drop, and could we “buy a faster array, network, servers, etc.” to get out of the problem.

    So be wary: big data may already be hiding in your organization, and you may be blissfully unaware until you get a phone call that sounds completely insane. Make sure people understand what limits there are on buying them out of a problem, because IT can't buy itself out of a big data problem the way it could in the past.
