I see various figures thrown about for how much unstructured data people produce and store; often it is this data which people believe can be deduped, ILMed, migrated, compressed etc. It is usually expressed as a percentage of the total data stored by a company: is it 10%, 15%, 20%, 30% or more? But how much is there really?
I suspect very few people really know. For example, nearly all of my unstructured data sits on my laptop hard drive; a great deal of it is my offline mail folders. I'm a fairly heavy mail user and probably have 2-3 gigs of PST files generated over three years. Documents, another couple of gigs; so all in all I suspect I have less than about 10 gigs of unstructured data, and I suspect that I'm way above average in our company. Even if I were average, we'd only really be looking at 150 terabytes of data across the company and, to be honest, that is less than 5% of our total storage estate.
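For anyone who wants to check the sums, here is a rough back-of-envelope sketch in Python. It only uses the figures quoted above (10 gigs per user, 150 terabytes, 5%); decimal units and the variable names are my own assumptions, and the headcount and estate size it prints are merely what those figures imply, not numbers stated in the post.

```python
# Back-of-envelope check of the figures quoted in the post.
# Assumption: decimal units (1 TB = 1000 GB).
GB_PER_TB = 1000

per_user_gb = 10             # "less than about 10 gigs" per user
total_unstructured_tb = 150  # "150 terabytes of data"
share_pct = 5                # "less than 5%", taken here as an upper bound

# Headcount implied by 150 TB at 10 GB per user.
implied_users = total_unstructured_tb * GB_PER_TB / per_user_gb

# Smallest total estate for which 150 TB is at most 5% of it.
min_estate_tb = total_unstructured_tb * 100 / share_pct

print(f"implied users: {implied_users:,.0f}")
print(f"minimum total estate: {min_estate_tb:,.0f} TB")
```

In other words, the figures are consistent with something like 15,000 users and a total estate of at least 3 PB, if you take them at face value.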
Logs for various systems should be compressed by default, and we know logs generally compress really well. But we've had cron jobs for years which rotate logs, compress them etc. Granted, we need to keep them for longer and longer periods of time, but even so we should be able to get that sort of stuff onto the right tier of disk straight away.
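As a rough illustration of how well logs compress, here is a minimal Python sketch using the standard `gzip` module. The log lines are synthetic and entirely made up for the example, and `compression_ratio` is a hypothetical helper name; real logs will give different ratios, so treat this as a way to measure your own data rather than a benchmark.

```python
import gzip


def compression_ratio(data: bytes) -> float:
    """Return original size divided by gzip-compressed size."""
    return len(data) / len(gzip.compress(data))


# Synthetic, made-up log lines purely for illustration; real logs will differ.
lines = [
    f"2000-01-01 00:{i // 60 % 60:02d}:{i % 60:02d} INFO request id={i} status=200\n"
    for i in range(10_000)
]
sample = "".join(lines).encode("utf-8")

ratio = compression_ratio(sample)
print(f"{len(sample)} bytes of log text, gzip ratio about {ratio:.1f}:1")
```

To try it on a real file, read the file in binary mode and pass the bytes to `compression_ratio`.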
So what other unstructured data might people be storing? Okay, I work for a media company and we have wodges of the stuff, but that's slightly unusual.
I wonder how much unstructured stuff there really is? Any thoughts? And of the unstructured stuff which users are storing, how much really belongs on the corporate storage in the first place?
This is purely the anecdotal musings of a storage manager but what are the real figures?
Financial services company.
~3,000 users.
10 TB.
We keep a pretty close eye out for MP3 and other non-business media files.
One question I’d ask in all this talk of ‘unstructured data ZBs’ is… “how much of the unstructured data is unique and couldn’t be reconstituted?”
E.g. PVRs and iPods are often quoted as having a massive total storage, yet most of it is duplicated and can be recovered from master media libraries (if the will is there). Similarly, real-time transcoding helps a lot here…
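One rough way to put a number on the "how much is unique?" question for a file share is to hash every file and count the bytes held in second and later copies. The Python sketch below does that with SHA-256; `duplicate_bytes` is a hypothetical helper name, and it only catches whole-file duplicates, so it says nothing about near-duplicates or content that could be re-derived by transcoding.

```python
import hashlib
import os
from collections import defaultdict
from typing import Tuple


def duplicate_bytes(root: str) -> Tuple[int, int]:
    """Return (total_bytes, duplicate_bytes) for regular files under root.

    Files with identical SHA-256 digests count as copies of one another;
    every copy after the first is counted as duplicate data.
    """
    sizes_by_digest = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            digest = hashlib.sha256()
            with open(path, "rb") as handle:
                # Read in 1 MiB chunks so large files are not loaded whole.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
            sizes_by_digest[digest.hexdigest()].append(os.path.getsize(path))
    total = sum(sum(sizes) for sizes in sizes_by_digest.values())
    duplicated = sum(sum(sizes[1:]) for sizes in sizes_by_digest.values())
    return total, duplicated
```

Calling `duplicate_bytes` on a directory path returns the total bytes and the bytes that are whole-file duplicates; dividing the second by the first gives a crude duplicate share.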
CCTV – yes that is generating a lot of unique data, but little easily accessible information…
I’d wager that for most large companies (i.e. those dealing in 10s of PB today) their structured data is far larger than their unstructured data (it certainly is in mine).
So to me the Q is – why can’t we make it easier for user storage to be CDN-like caches of master content, with changes held as ‘user-end deltas’ (e.g. kind of like a central master with user-owned writeable snapshots)? That would certainly reduce the quantity of data stored…
Now, user-generated content (e.g. pictures, videos etc.) is a whole other topic!
I’d also wager that a large % of a business user’s personal unstructured data is an informal ‘local cache’ of data to compensate for issues with accessibility of the central store…
We need to focus on prevention at the information layer rather than band-aids at the data layer…
Ian, when you guys fix the bandwidth issues to the home, we’ll fix the content-on-demand thing... (I’m pretending that we’re not an ISP and it’s our problem as well).
Fibre to every house!! That’d change things somewhat!
Well, you hit on one of the 2 biggest questions on my mind (the other is utilization rates, but that’s for another day).
Question #1 is: what do you count? Is this just the datacenter, or do we include all the zillions of hard drives that we all have in support of companies? Some verticals have massive amounts of unstructured data. These include healthcare, media, science and engineering, oil and gas, finance, insurance, etc. I read once that a single “crash test dummy” test generates over 10TB (dummies create more data than we do!). As hospitals and medical research centers digitize, that’s a bunch of data, too. And how about all these websites with their billions of HTML files? YouTube, Flickr, and every TV/radio station in the world that keeps its coverage and shows online… it boggles the mind (then again, mine boggles easily).
But the question is valid. All I know is it is A LOT. Is that definitive enough?