At a SNW briefing session today, X-IO (Xiotech) talked a lot of sense about Big Data; in fact, it was almost the most sense I have heard spoken about Big Data in a long time. The fact is that most Big Data isn’t really that big and the data-sets are not huge; there are exceptions, but most big data-sets that companies will actually use can be measured in a few terabytes, not the tens or hundreds of terabytes that the big storage vendors want to talk about.
Sentiment data derived from social networking does not necessarily make for big data sets. A tweet, for example, is 140 characters, so roughly 140 bytes; a terabyte is 1 099 511 627 776 bytes. We can store a lot of tweets in a terabyte, and within that data there is a lot of information that can be extracted.
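The back-of-the-envelope arithmetic here can be sketched in a couple of lines of Python. This assumes one byte per character (plain ASCII, no per-tweet metadata), so treat it as an upper bound rather than a real capacity-planning figure:

```python
# Rough sketch: how many 140-byte tweets fit in a terabyte?
# Assumption: 1 character = 1 byte (ASCII); real tweets stored as
# UTF-8 plus metadata would take more space per tweet.

TWEET_BYTES = 140
TERABYTE = 2 ** 40  # 1 099 511 627 776 bytes, as quoted above

tweets_per_tb = TERABYTE // TWEET_BYTES
print(f"{tweets_per_tb:,} tweets per terabyte")  # → 7,853,654,484 tweets per terabyte
```

Nearly eight billion tweets in a single terabyte, which rather makes the point that the interesting problem is extracting signal from that data, not finding somewhere to put it.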
In fact, there are probably some Big Answers in that not so Big Data but we need to get rid of the noise; in order to do this, we need to be able to process this data differently and directly. The most important thing that the storage can do is to vanish and become invisible; allow data processing to be carried out in the most natural way and not require various work-arounds which hide the deficiencies of the storage.
If your storage vendor spends all their time talking about the bigness of data, then perhaps they are the wrong vendor.
Hi Martin,
isn’t “Big Data” more in the direction of “big amounts of unstructured data”? I don’t think that “Big Data = many terabytes” hits the point. You can have big SQL databases of normalized, easy-to-query data and you wouldn’t call that Big Data (yes, of course a lot of storage hardware vendors would…). Of course you could say that tweets are structured in some way (sender, date, content, references to other tweets) but the really interesting information is the content itself. It could be an attached picture, or a link to an article which is itself maybe a re-post or a reaction to another article. And of course the references to other people, plus the retweets, the replies, the favorites etc. are an area of interest in their own right. It is the analytic intelligence that can get valuable information out of all that which turns the 140 bytes into Big Data. Furthermore, in many examples Big Data is a stream of data that should be analyzed in real time to get the best out of it (like the never-ending stream of tweets about bacon, cloud and American sports in the tweeting storage community). So “Big Data” is not just “much data”, and sales people should start to cope with that.
Remember, unstructured data is just data for which you haven’t yet worked out the structure… even a video stream could have a structure, but at present it is relatively hard to make that structure meaningful.
And Big Data is whatever the vendor is selling…
Totally agree with you… for some people it’s “big”; for some it’s just lots and lots of little bits 🙂
[…] Answers Need Big Data? Oct 30th, 2012 by Martin […]
Hey Martin,
To me, “Big Data” is actually the bridge between structured and unstructured (modeled / unmodeled). It is (will be, actually) the set of tools that will help us model unstructured data. Ultimately, in a perfect world, all unstructured data would become structured, but I doubt this will ever really happen.
And even so, there will always be a need for PB-scale storage for unstructured data waiting to be modeled. But I would define Big Data as the set of tools plus both sets of data for an enterprise or a service provider. In that sense, I think Big Data is not several TBs, far from it, but I hear your argument.
However, I think it’s fair to say that for an enterprise, the meaningful data extracted by Big Data tools from their 100s of TB of data (structured, semi- and un-) will probably not be much larger than a few TB.
My 2 cents 😉