July 18th, 2007

Why RAID 5 stops working in 2009

Posted by Robin Harris @ 6:18 am

Categories: Disk drives, RAID

Tags: Disk, RAID, RAID 5, Robin Harris

The storage version of Y2K? No, it’s a function of capacity growth and RAID 5’s limitations. If you are thinking about SATA RAID for home or business use, or using RAID today, you need to know why.

RAID 5 protects against a single disk failure. You can recover all your data if a single disk breaks. The problem: once a disk breaks, there is another increasingly common failure lurking. And in 2009 it is almost certain to find you.

Disks fail
While disks are incredibly reliable devices, they do fail. Our best data - from CMU and Google - finds that over 3% of drives fail each year in the first three years of drive life, and then failure rates start rising fast.

With 7 brand new disks, you have a ~20% chance of seeing a disk failure each year. Factor in the rising failure rate with age and you are almost certain to see a disk failure over a 4-year service life.

But you’re protected by RAID 5, right? Not in 2009.

Reads fail
SATA drives are commonly specified with an unrecoverable read error (URE) rate of one per 10^14 bits read. That means that about once every 100,000,000,000,000 bits, the disk will very politely tell you that, so sorry, but I really, truly can’t read that sector back to you.

One hundred trillion bits is about 12 terabytes. Sound like a lot? Not in 2009.

Disk capacities double
Disk drive capacities double every 18-24 months. We have 1 TB drives now, and in 2009 we’ll have 2 TB drives.

When a disk fails in a 7-drive RAID 5, you’ll have 6 remaining 2 TB drives. As the RAID controller is busily reading through those 6 disks to reconstruct the data from the failed drive, it is almost certain it will see a URE.
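
A rough sketch of that arithmetic, assuming every bit read fails independently at the spec'd rate and that a terabyte is 10^12 bytes. The function name and parameters are mine, for illustration only. Under this model the expected number of UREs over the full rebuild is just under one, and the chance of hitting at least one works out to about 62%.

```python
import math


def p_ure_during_rebuild(surviving_drives, drive_tb, bits_per_error=1e14):
    """Chance of at least one URE while reading every surviving drive in full.

    Assumptions (not from any vendor document): each bit fails independently
    with probability 1/bits_per_error, and 1 TB = 10**12 bytes.
    """
    bits_read = surviving_drives * drive_tb * 1e12 * 8
    p_bit = 1.0 / bits_per_error
    # 1 - (1 - p)^n, computed via log1p/expm1 so the tiny p is not lost
    # to floating-point rounding.
    return -math.expm1(bits_read * math.log1p(-p_bit))


if __name__ == "__main__":
    # 7-drive RAID 5 with one failed drive: 6 survivors of 2 TB each.
    p = p_ure_during_rebuild(surviving_drives=6, drive_tb=2)
    print(f"Chance of a URE during rebuild: {p:.0%}")
```

Real drives do not fail bit by bit at a uniform rate, so treat the output as a reading of the spec sheet, not a measurement.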

So the read fails. And when that happens, you are one unhappy camper. The message “we can’t read this RAID volume” travels up the chain of command until an error message is presented on the screen. 12 TB of your carefully protected - you thought! - data is gone. Oh, you didn’t back it up to tape? Bummer!

So now what?
The obvious answer, and the one that storage marketers have begun trumpeting, is RAID 6, which protects your data against 2 failures. Which is all well and good, until you consider this: as drives increase in size, any drive failure will always be accompanied by a read error. So RAID 6 will give you no more protection than RAID 5 does now, but you’ll pay more anyway for extra disk capacity and slower write performance.

Gee, paying more for less! I can hardly wait!

The Storage Bits take
Users of enterprise storage arrays have less to worry about: your tiny costly disks have less capacity and thus a smaller chance of encountering a URE. And your spec’d URE rate of one per 10^15 bits also helps.

There are some other fixes out there as well, some fairly obvious and some, I’m certain, waiting for someone much brighter than me to invent. But even today a 7 drive RAID 5 with 1 TB disks has a 50% chance of a rebuild failure. RAID 5 is reaching the end of its useful life.
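
To see how drive size and the spec'd URE rate trade off, here is a small comparison for a 7-drive RAID 5 rebuild. It is a sketch under the same kind of simple assumption as the spec itself: independent bit errors, decimal terabytes, and a full read of all 6 surviving drives. The helper names are hypothetical. Note that this model gives somewhat lower figures than a straight ratio of bits read to bits per error, which for 1 TB drives at one per 10^14 is about 0.5.

```python
import math

SURVIVING_DRIVES = 6  # 7-drive RAID 5 with one drive failed


def p_rebuild_hits_ure(bits_read, bits_per_error):
    """P(at least one URE) assuming independent errors at 1/bits_per_error."""
    return -math.expm1(bits_read * math.log1p(-1.0 / bits_per_error))


def rebuild_table(drive_sizes_tb=(1, 2), specs=(1e14, 1e15)):
    """Return [(drive_tb, [probability for each spec])] for a full rebuild."""
    rows = []
    for tb in drive_sizes_tb:
        bits = SURVIVING_DRIVES * tb * 1e12 * 8  # 1 TB taken as 10**12 bytes
        rows.append((tb, [p_rebuild_hits_ure(bits, s) for s in specs]))
    return rows


if __name__ == "__main__":
    print("drive size   1 per 10^14   1 per 10^15")
    for tb, probs in rebuild_table():
        print(f"{tb} TB".ljust(13) + "".join(f"{p:<14.1%}" for p in probs))
```

The tenfold better enterprise spec cuts the modeled risk by close to a factor of ten at these capacities, which is the point about enterprise arrays above.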

Update: I’ve clearly tapped into a rich vein of RAID folklore. Just to be clear, I’m talking about a failed drive (i.e. all sectors are gone) plus a URE on another sector during a rebuild. With 12 TB of capacity in the remaining RAID 5 stripe and a URE rate of one per 10^14 bits, you are highly likely to encounter a URE. Almost certain, if the drive vendors are right.

As well-informed commenter Liam Newcombe notes:

The key point that seems to be missed in many of the comments is that when a disk fails in a RAID 5 array and it has to rebuild there is a significant chance of a non-recoverable read error during the rebuild (BER / UER). As there is no longer any redundancy the RAID array cannot rebuild, this is not dependent on whether you are running Windows or Linux, hardware or software RAID 5, it is simple mathematics. An honest RAID controller will log this and generally abort, allowing you to restore undamaged data from backup onto a fresh array.

Thus my comment about hoping you have a backup.

Mr. Newcombe, just as I was beginning to like him, then took me to task for stating that “RAID 6 will give you no more protection than RAID 5 does now”. What I had hoped to communicate is this: in a few years - if not 2009 then not long after - all SATA RAID failures will consist of a disk failure + URE.

RAID 6 will protect you against this quite nicely, just as RAID 5 protects against a single disk failure today. In the future, though, you will require RAID 6 to protect against single disk failures + the inevitable URE and so, effectively, RAID 6 in a few years will give you no more protection than RAID 5 does today. This isn’t RAID 6’s fault. Instead it is due to the increasing capacity of disks and their steady URE rate. RAID 5 won’t work at all, and, instead, RAID 6 will replace RAID 5.

Originally the developers of RAID suggested RAID 6 as a means of protecting against 2 disk failures. As we now know, a single disk failure means a second disk failure is much more likely - see the CMU pdf Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You? for details - or check out my synopsis in Everything You Know About Disks Is Wrong. RAID 5 protection is a little dodgy today due to this effect and RAID 6 - in a few years - won’t be able to help.

Finally, I recalculated the chance of at least one failure per year among 7 drives using the 3.1% AFR from the CMU paper and the formula suggested by a couple of readers - 1 - 0.969^(number of disks) - and got 19.8%. So I changed the ~23% number to ~20%.
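
For anyone who wants to check the number, the readers' formula fits in a few lines. This is a minimal sketch that assumes the drives fail independently and all share the same 3.1% annual failure rate; the function name is mine.

```python
def p_any_drive_fails(n_drives, afr=0.031):
    """Chance that at least one of n_drives fails within a year.

    Assumes independent failures and one shared annual failure rate (AFR).
    """
    # Every drive survives with probability (1 - afr); subtract from 1.
    return 1.0 - (1.0 - afr) ** n_drives


if __name__ == "__main__":
    print(f"{p_any_drive_fails(7):.1%}")  # prints 19.8%
```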

Comments welcome, of course. And I got home despite a blowout on Scottsdale’s 101N in 110 degree heat. I thought of it as a Bikram Tire Changing Asana.

Robin Harris has been messing with computers for over 30 years and selling and marketing data storage for over 20 in companies large and small. See his full profile and disclosure of his industry affiliations.
