At the moment, I am in the unusual position of considering building a greenfield backup/archiving solution with none of the legacy considerations of the past. I do not have to think about migrating existing policies into the new environment.
This gives me the fairly rare opportunity to consider how to do things properly. Although some key technology decisions have been made (i.e. what tools I have available) and some key policy decisions have been made (i.e. the archive is forever, whatever that means), much is still up for grabs.
So I hope to be able to lay some long-held sacred cows to rest, especially in the arena of back-up; instead of focussing on back-up, I want to focus on restore and how we restore.
Like security, I want to back up just enough so that I can restore what needs to be restored reliably; this means that areas like build policies should be taken into account. If you are never, ever going to do a bare-metal restore, why back up all those operating system files? Do I really need 4000 copies of Red Hat backed up? We are putting in place automated build tools and application repositories; it will be quicker to restore using these than to do a traditional restore.
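To make that concrete, here is a minimal sketch (in Python) of what "back up just enough" could look like; the mount points and the split between rebuildable OS volumes and data volumes are invented for illustration, not taken from any real policy.

# Sketch only: OS volumes are rebuilt from the build tools and application
# repositories, so they never enter the backup selection; the paths are assumptions.
OS_MOUNTS = {"/", "/boot", "/usr", "/var", "/opt"}
DATA_HINTS = ("/data", "/oradata", "/home", "/exports")

def backup_selection(mounts):
    """Return only the mount points worth backing up; OS volumes get rebuilt, not restored."""
    selected = []
    for mount in mounts:
        if mount in OS_MOUNTS:
            continue                      # restore path: automated rebuild, not tape
        if mount.startswith(DATA_HINTS):
            selected.append(mount)        # restore path: traditional backup/restore
    return selected

if __name__ == "__main__":
    mounts = ["/", "/boot", "/usr", "/data/app1", "/oradata/db01", "/home"]
    print(backup_selection(mounts))       # ['/data/app1', '/oradata/db01', '/home']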
So identifying what needs to be backed up is going to be key; I think I've got buy-in from the sysadmins and the DBAs to help with this. We are not going to sit and say that it's too hard; that's lame!
Also, from day one, I shall be enforcing zero failure tolerance; I've seen what happens when you tolerate a certain number of failures and how insidious it is. You build an air of complacency which eventually bites you, either with data loss or the sheer tedium of producing SOX reports about the failures. By taking a zero-failure-tolerance stance from day one, hopefully when this environment is running tens of thousands of back-up jobs, managing it will not be the horrendous nightmare that it is in other environments.
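In practice, that means every failed job gets chased, not just failures above some threshold. A rough sketch of the kind of check I mean, assuming an invented job-status report of job_name,status per line (any real backup product has its own reporting format):

# Minimal zero-failure-tolerance check. The CSV layout (job_name,status) is an
# assumption for illustration, not the format of any particular backup product.
import csv
import sys

def failed_jobs(report_path):
    """Return every job whose status is not SUCCESS; no threshold, no tolerance."""
    failures = []
    with open(report_path, newline="") as fh:
        for row in csv.DictReader(fh):
            if row["status"].upper() != "SUCCESS":
                failures.append(row["job_name"])
    return failures

if __name__ == "__main__":
    failures = failed_jobs(sys.argv[1])
    if failures:
        print("Failed backup jobs (all of them get chased, not a sample):")
        for name in failures:
            print(" ", name)
        sys.exit(1)        # any failure at all makes the run red
    print("All backup jobs succeeded.")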
Oh, I do have another advantage: the whole environment is being designed with back-up in mind; it is not an adjunct, it is a key deliverable.
G’day,
It is a topic that has been mulled over by many an admin – why do I need to back _everything_ up, surely I can just back up the volumes with actual data on them?
Until, in the middle of the night, the lone DBA or sysadmin goes, jeez, look at all that space on the root volume…
I guess this is where the value of dedupe really comes into its own.
What really sucks is that there is no silver bullet for backups. No one application does it all well 🙁
That is, until someone has a cron/startup script somewhere that everybody forgot about and nobody can figure out why that app won’t work at all because that guy left 3 years ago; and 20 hours later you figure out that the simple little thing that only removes a lock file somewhere is why you’ve been down for so much longer than you were supposed to be.
I’ve been thinking about going in the opposite direction: when the time to restore is so long because of sheer capacity, is there any value in it for the “oops” moment? I’m pushing for a root vol on mass storage. Having gone down the route of SAN booting a number of years ago with limited success, I’m putting a spin on it by software-mirroring my local drives onto the SAN. Yes, it takes up more space, but I can do snapshots, clone, replicate, etc. and I can have consistency across them all; additionally, the concerns of managing and configuring it are dramatically reduced (I have an OS rather than an HBA BIOS to deal with, server admins can see logs without involving storage, etc). If something horrible happens I can SAN-boot a snapshot much quicker than any bare-metal tape restore; think about replicated DR situations where the data is just there and you have to present it rather than spin it off something.
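One nice side effect of doing the mirroring in the OS is that checking its health is just another server-side script. A rough sketch, assuming Linux md software RAID with a local disk and a SAN LUN in each mirror (the layout is an assumption, not my actual setup):

# Sketch: parse /proc/mdstat and flag any md mirror that isn't fully in sync.
# The point is that server admins can check this from the OS, no HBA BIOS needed.
import re

def degraded_arrays(mdstat_path="/proc/mdstat"):
    """Return md device names whose status line shows a missing member ('_')."""
    degraded, current = [], None
    with open(mdstat_path) as fh:
        for line in fh:
            m = re.match(r"^(md\d+)\s*:", line)
            if m:
                current = m.group(1)
            # "[2/2] [UU]" is healthy, "[2/1] [U_]" means a member is missing
            if current and re.search(r"\[\d+/\d+\]\s+\[[U_]*_[U_]*\]", line):
                degraded.append(current)
                current = None
    return degraded

if __name__ == "__main__":
    bad = degraded_arrays()
    print("degraded mirrors:", bad or "none")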
Unix/Linux is a lot more forgiving on hardware, but even with Windows you can mount the boot LUN on a different host and “get it” working pretty quickly with some out-of-place driver installs.
To me, tape is really becoming a matter of compliance, a stick-it-in-the-ground type of task, or for where you need to roll the clock back six months; snapshots + replication are becoming the only sustainable oops-protection method going forward. Restores are slow to begin with; throw in a large volume of small files and by the time you restore a large quantity of data, well, you might be out of business. A small number of files is no problem, but have some admin wipe out a volume with 40 million 4k files in it and try to do a file-level restore.
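A back-of-the-envelope number makes the point; the per-file overhead and streaming throughput below are assumptions for illustration, not measurements from any real restore:

# Rough arithmetic for the 40-million-file example above. Both the 2 ms per-file
# metadata overhead and the 200 MB/s streaming rate are assumed figures.
FILES = 40_000_000
FILE_SIZE_BYTES = 4 * 1024
PER_FILE_OVERHEAD_S = 0.002   # create/open/set-attributes per restored file (assumed)
STREAM_MB_PER_S = 200         # sustained restore throughput (assumed)

data_gb = FILES * FILE_SIZE_BYTES / 1024**3
stream_hours = (FILES * FILE_SIZE_BYTES / (STREAM_MB_PER_S * 1024**2)) / 3600
metadata_hours = FILES * PER_FILE_OVERHEAD_S / 3600

print(f"data volume:         {data_gb:,.0f} GB")           # ~153 GB
print(f"pure streaming time: {stream_hours:,.1f} hours")    # well under an hour
print(f"metadata time:       {metadata_hours:,.1f} hours")  # roughly a day

Even with generous assumptions, the per-file metadata work dwarfs the time spent actually moving the data.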
My situation might be a bit different, as we talk about file counts in the billions, where moving the data to and from the host isn’t the slow part… doing the individual file metadata operations is the problem. We are still doing daily file-level VTL backups, but I’m trying to push the business in a different direction (a couple of days’/weeks’ worth of snapshots for the oops situation and block-level dumps to VTL for extended retention). After my current project drops off, they want me to re-evaluate our protection methods.