
Disastrous Thinking

As a follow-up to my blog here, I'd like to share some more thoughts on availability and the potential negative impacts of some of the new technologies out there.

How many of you run clusters of servers? HACMP? Veritas Cluster? Microsoft Cluster? VMware Clustering? I suspect lots of you do. How many of you cluster NAS heads? Yet again, I suspect lots of you do. How many of you cluster arrays? Not so many, I guess? Certainly in my experience, it is uncommon to cluster an array. And when I talk about clustering an array, I don't mean the implementation of replication.

So, if you don't cluster your arrays, how do you protect against the failure of a RAID rank? It's statistically unlikely, but is it more or less likely than the loss of a data-centre? I'm not sure, and for many people the failure of a RAID rank could well mean the invocation of the disaster recovery plan. Why?

The loss of a RAID rank might well lead to the loss of an application/service, and if it is an absolutely business-critical service, can you bring it up at the remote replication site in isolation? As a discrete component? If you can, can you cope with increased transaction times due to latency? Many applications now have complex interactions with partner applications; these might not be well understood. So the failure of a RAID rank could lead to the invocation of the Disaster Recovery Plan. Actually, in my experience, this is very nearly always the case unless the service has been designed with recovery in mind; this requires infrastructure and application teams to work together, something which we are not exactly good at.

But you now take up the challenge and make sure that every application can be failed over as a discrete component. Excellent, a winner is you! You know the impact of losing a RAID rank, you know what applications it impacts, you've done your service mappings etc, etc. And you have been very careful to lay things out to minimise a single RAID failure's impact.

And then you implement automated storage tiering. Suddenly, you no longer know in advance what impact a RAID rank failure may have; you have no idea what applications may be impacted. And actually, the failure of a single RAID rank may well have a huge impact. We could be looking at restoring many terabytes of data, and many applications failing, to cope with the loss of a couple of terabytes.

It will depend on the implementation of the automated storage tiering, and I am concerned that at present we do not know enough about the various implementations which will be hitting our arrays over the next eighteen months. So despite automation making things a lot easier day-to-day, we cannot treat it as Automagic Storage Tiering; we need to know how it works and how we plan to manage it.
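To see why sub-LUN tiering widens the blast radius of a rank failure, here is a rough back-of-envelope model, not any vendor's actual placement algorithm: assume each application's extents land on uniformly random ranks and count how many applications touch the failed rank. The figures (32 ranks, 100 extents per application) are purely illustrative assumptions.

```python
import random

def expected_apps_hit(num_apps, extents_per_app, num_ranks, trials=10_000):
    """Monte Carlo estimate of how many applications lose data when one
    RAID rank fails, under a crude model of sub-LUN auto-tiering where
    each extent is placed on a uniformly random rank."""
    total = 0
    for _ in range(trials):
        failed = random.randrange(num_ranks)
        hit = sum(
            1 for _ in range(num_apps)
            if any(random.randrange(num_ranks) == failed
                   for _ in range(extents_per_app))
        )
        total += hit
    return total / trials

# Dedicated layout: each app confined to one rank -> exactly 1 app hit.
# Scattered layout: with 100 extents over 32 ranks, almost every app
# has at least one extent on the failed rank.
print(expected_apps_hit(num_apps=32, extents_per_app=100, num_ranks=32))
```

Under these assumptions roughly 30 of the 32 applications are touched by a single rank failure, versus exactly one with a carefully dedicated layout; that is the service-mapping problem the paragraph above describes.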

And perhaps for key applications, we will need to cluster storage arrays locally; that in itself will bring challenges.

I'm still a big fan of automated storage tiering but over the next few months, I would like to see the various vendors start talking about how they mitigate some of this risk. Barry Burke has made a big thing about the impact of a double disk failure on an XIV array in the past; in a FAST v2 environment, I would like to see how EMC mitigate against very similar problems.

I would also like to know what the impact of a PAM card failure is on a NetApp array; does the array degrade to the extent that it is not usable? What kind of tools can NetApp give me to assess the potential impact? As Preston points out here, the failure of individual components within an array could have significant impacts.

We are heading towards a situation where technology gets ever more complex and arguably ever more reliable. But we rely on it to an ever greater extent, so we must understand risks and mitigations to a much greater degree than we have in the past.



10 Comments

  1. Chuck Hollis says:

    As always, excellent questions.
    As a vendor, we feel pretty good about this discussion, although you may find that answers will vary across the industry.
    Example: for many years, IBM has sold an “enterprise storage” product with two controllers that drops to half-performance if a controller fails.
Some may find that sort of thing acceptable in their environment; others may not 🙂
    — Chuck

  2. Jon says:

The loss of a RAID set is not uncommon; it is an occurrence that I'd say the majority of the larger broadcast and multimedia clients I've worked with have seen. Therefore, synced data in critical paths is an absolute must! Always stack the odds in your favour!!
However, I’d ask – you mention a do-it-yourself clustering approach, which leads to bespoke set-ups, often untested in critical circumstances; isn’t the winner an off-the-shelf storage solution, tried, tested and qualified? Why re-invent the wheels?
    (the plural of which was entirely intended!)

  3. While you can never eliminate the occurrence of double-drive failures, you can indeed reduce the probability of their occurrence.
XIV’s RAID-X has issues because the dependent set of drives that will be damaged by (for example) two drives failing simultaneously is so large. With RAID-X, every LUN has data on every single spindle (up to 180 per XIV array), and there are only two copies – if two drives fail, by definition there will be some blocks for which both copies are lost.
    With Symmetrix, we strongly recommend the use of RAID-6 (2 parity) for all SATA RAID sets. This straight-away reduces the probability of data loss to a very, very small number, because it requires THREE (not 2) drives to fail WITHIN THE RAID SET before data is lost. Even though a LUN might be made up of multiple such RAID-6 sets, the dependent set of drives that are exposed to data loss is limited to the RAID-6 set size (6+2 or 14+2). The probability of data loss with RAID-6 is thus a tiny fraction of the probability for RAID-X.
    FWIW, even simple RAID-1 has a lower probability of data loss due to drive failure than XIV’s RAID-X. And all this holds true even considering the differences in drive rebuild/time-to-protected between the approaches.
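The probability argument in the comment above can be sketched numerically. The following is an illustrative model only, not EMC's or IBM's actual reliability maths: with RAID-X-style mirroring across all spindles, any two concurrent failures lose data by construction, whereas with drives partitioned into RAID-6 (6+2) sets, even three concurrent failures rarely land in the same set. The drive count (160) and set size (8) are assumptions.

```python
import random

def p_loss_raid6(num_drives, set_size, concurrent_failures, trials=100_000):
    """Monte Carlo: probability that `concurrent_failures` simultaneous
    drive failures destroy data when drives are partitioned into RAID-6
    sets of `set_size` (data loss requires >= 3 failures in one set)."""
    losses = 0
    drives = list(range(num_drives))
    for _ in range(trials):
        failed = random.sample(drives, concurrent_failures)
        per_set = {}
        for d in failed:
            per_set[d // set_size] = per_set.get(d // set_size, 0) + 1
        if max(per_set.values()) >= 3:
            losses += 1
    return losses / trials

# RAID-X-style mirroring across every spindle: P(loss | 2 failures) == 1.0
# by the argument above. RAID-6 across 160 drives in 6+2 sets, even with
# THREE concurrent failures:
print(p_loss_raid6(num_drives=160, set_size=8, concurrent_failures=3))
```

Under these assumptions the RAID-6 figure comes out around 0.2%, versus certainty for the distributed-mirror case with one fewer failure; that is the "tiny fraction" the comment refers to, before even considering rebuild-time differences.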

  4. Han Solo says:

Ummm… there is a huge flaw here: most arrays are already clustered designs, basically two or more servers connected to the same backend physical disks, pretty much the same way you would build a regular HA cluster.
    What you’re asking is: do you cluster your clusters?

  5. Martin G says:

So Barry, we are still playing probabilities? And are you recommending RAID-6 for SSDs? What happens if I lose an SSD rank in a FAST v2 environment? Am I exposed to many LUNs failing? Do you move a block onto SSD or do you copy the block? Do you cache the write and update both the SSD and SATA copy, so that basically you are mirroring the SSD onto SATA (or whatever disk type it might be)? When we start doing clever things, invariably we end up having to do lots of clever things to mitigate our own cleverness.

  6. Martin G says:

I may well be asking that question. However, I would argue that the failure of a disk array is inordinately more disruptive in many data-centres than a simple host failing. And the impact of a failing disk rank could be just as great as that of a failing array. RAID ranks do fail on occasion, and as we do clever things, the impact of these failures could well get worse. Do you want to invoke DR for a simple RAID failure?

  7. Rob says:

    > if you don’t cluster your arrays; how do you protect against the failure of a RAID rank?
A lot of times tiering will drive the decision. At a large client site, their definition of a tier1 application was 1 minute RPO, 5 minute RTO. The design meant spanning storage frames at a minimum; loss of a frame could be accounted for. I shadowed drives at the host level across frames, across data centers. During one particularly nasty outage, the app I was responsible for was the only one up. Of course no one could get to it as the network backbone was not available. There went that whole tiering strategy, I guess. I *guess* a number of folks assumed the main datacenter is there, and/or budget wouldn’t allow them to truly design correctly, or the OS was lacking: “can we get around this requirement?”
    “Back in the day” VMWare with SRM was on the distant horizon.
    (tier1 = very important. Adequate budget and design, loss of a RAID array has zero impact in my opinion. This isn’t always possible in commercial space on down, budgets and all that).

  8. ianhf says:

Is it a good time to mention my long-standing RFE request to be able to build a raid-group out of arrays themselves? Think of a raid-6 group spanning multiple arrays (essentially arrays instead of spindles) – 7yrs ago I requested raid-group federated arrays. 😉
You also have the conversation about hot-spare disks as well as just parity disks – but as spindle sizes ever increase, the time to return to BAU becomes longer (days & weeks) and we’ll need more parity disks (or just move to erasure codes).
    We certainly also need vendors to get transparent and start openly publishing rebuild times, replacement times, impacts & degradations (FR & NFR) upon a component issue.
    I certainly agree that as we automate more of the storage and manage it with policies, we need tools several steps better than current ones for articulating risk, contention, consequence & impact. Tools that can demonstrate risk in terms of policy-framed SLA/RPO/RTO etc and allow the admin to refine/control for ‘acceptable business risk’.
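The rebuild-time point above is simple arithmetic: time to return to protection is bounded below by drive capacity divided by sustained rebuild throughput. A minimal sketch, using illustrative capacity and throughput figures that are assumptions rather than any vendor's published numbers:

```python
def rebuild_hours(capacity_tb, rebuild_mb_per_s):
    """Naive lower bound on rebuild time: capacity / rebuild throughput.
    Real rebuilds are slower still, since they contend with host I/O.
    Uses decimal units: 1 TB = 1,000,000 MB."""
    return capacity_tb * 1_000_000 / rebuild_mb_per_s / 3600

# Illustrative drive sizes and throttled rebuild rates:
for tb, mbps in [(1, 80), (2, 80), (4, 50)]:
    print(f"{tb} TB at {mbps} MB/s -> {rebuild_hours(tb, mbps):.1f} hours")
```

Even under these optimistic assumptions a 4 TB spindle at a throttled 50 MB/s takes the better part of a day; scale capacities up and the exposure window stretches towards the days-and-weeks the comment warns about.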

  9. ianhf says:

I’d also argue there is a strong need for clarity over DR & HA – very few services really benefit from clusters (plenty of information shows clusters reducing availability whilst tripling cost).
And given snapshots, deduplication, classic B&R and ‘within database’ backup (e.g. Flashback etc), do you really need to invoke DR? ASM ‘normal’ resiliency mode with failure groups copes pretty well with raid-group failures on arrays…
    So let the array do the transparent policy based sub-lun tiering, and the host (ASM, VMFS etc) do the ‘data resiliency’ between disparate arrays?

  10. Barry Whyte says:

We have numerous users providing HA solutions using SVC’s Vdisk Mirroring, which essentially provides the kind of clustering you refer to – allowing a single virtual disk to have a copy on multiple storage controllers. I know of Ian’s requests to extend this to RAID-5 or 6 – then you could simply power off an entire controller if required with no loss of access.
    Of course, there would be nothing to stop you using the VDM function over the top of an SVC implementation of auto-tiering.
