September 17th, 2007

Data corruption is worse than you know

Posted by Robin Harris @ 9:01 pm

Categories: Disk drives, RAID, RAM

Tags: Disk, CERN, Data Corruption, Data, News, Error, Robin Harris

Many people reacted with disbelief to my recent series on data corruption (see How data gets lost, 50 ways to lose your data and How Microsoft puts your data at risk), claiming it had never happened to them. Really? Never had to reinstall an application, an OS, or had a file that wouldn’t open?

Are you sure?
The research on silent data corruption has been theoretical or anecdotal, not statistical. But now, finally, some statistics are in. And the numbers are worse than I’d imagined.

Petabytes of on-disk data analyzed
At CERN, the world’s largest particle physics lab, several researchers have analyzed the creation and propagation of silent data corruption. CERN’s huge collider - built beneath Switzerland and France - will generate 15 thousand terabytes of data next year.

The experiments at CERN - high energy “shots” that create many terabytes of data in a few seconds - then require months of careful statistical analysis to find traces of rare and short-lived particles. Errors in the data could invalidate the results, so CERN scientists and engineers did a systematic analysis to find silent data corruption events.

Statistics work best with large sample sizes. As you’ll see, CERN has very large sample sizes.

The program
The analysis looked at data corruption at 3 levels:

  • Disk errors. They wrote a special 2 GB file to more than 3,000 nodes every 2 hours and read it back, checking for errors, for 5 weeks. They found 500 errors on 100 nodes.
    • Single bit errors. 10% of disk errors.
    • Sector (512 bytes) sized errors. 10% of disk errors.
    • 64 KB regions. 80% of disk errors. This one turned out to be a bug in WD disk firmware interacting with 3Ware controller cards which CERN fixed by updating the firmware in 3,000 drives.
  • RAID errors. They ran the verify command on 492 RAID systems each week for 4 weeks. The RAID controllers were spec’d at a bit error rate (BER) of 1 in 10^14 bits read/written. The good news is that the observed BER was only about a third of the spec’d rate. The bad news is that in reading/writing 2.4 petabytes of data there were some 300 errors.
  • Memory errors. Good news: only 3 double-bit errors in 3 months on 1300 nodes. Bad news: according to the spec there shouldn’t have been any. Single-bit errors can be corrected; double-bit errors can’t.
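For readers who want to try the disk test at home, here is a minimal Python sketch of the write-and-read-back idea. It is not CERN's actual probe: the function names, the 1 MiB chunk size and the SHA-256-derived pattern are all my assumptions, chosen so the expected contents can be regenerated without keeping a second copy.

```python
import hashlib
import os

CHUNK = 1024 * 1024  # 1 MiB per chunk (an arbitrary choice for this sketch)


def pattern_chunk(index: int, size: int = CHUNK) -> bytes:
    """Return a deterministic block of bytes derived from the chunk index."""
    seed = hashlib.sha256(index.to_bytes(8, "big")).digest()
    return (seed * (size // len(seed) + 1))[:size]


def write_probe(path: str, n_chunks: int) -> None:
    """Write n_chunks of known pattern data and push them to the device."""
    with open(path, "wb") as f:
        for i in range(n_chunks):
            f.write(pattern_chunk(i))
        f.flush()
        os.fsync(f.fileno())


def verify_probe(path: str, n_chunks: int) -> list:
    """Re-read the file and return (chunk_index, mismatched_bytes) pairs."""
    bad = []
    with open(path, "rb") as f:
        for i in range(n_chunks):
            data = f.read(CHUNK)
            expected = pattern_chunk(i)
            if data != expected:
                # Count differing bytes, plus any bytes missing from a short read.
                diff = sum(a != b for a, b in zip(data, expected))
                diff += abs(len(expected) - len(data))
                bad.append((i, diff))
    return bad
```

Calling `write_probe("probe.bin", 2048)` writes a 2 GB file, and `verify_probe("probe.bin", 2048)` returns an empty list if every byte came back intact. One caveat: a read straight after a write is often served from the operating system's page cache rather than the disk, so a serious test would wait, or flush caches, before verifying.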

All of these errors will corrupt user data. When they checked 8.7 TB of user data for corruption - 33,700 files - they found 22 corrupted files, or 1 in every 1500 files.

The bottom line
CERN found an overall byte error rate of 3 * 10^-7, a rate considerably higher than the 1 in 10^14 or 1 in 10^12 figures spec’d for individual components would suggest. This isn’t sinister.

It’s the BER of each link in the chain from CPU to disk and back again, plus the fact that some traffic, such as transferring a byte from the network to a disk, requires 6 memory r/w operations. That really pumps up the data volume and with it the likelihood of encountering an error.
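To see how per-link rates add up, here is a small Python sketch. It assumes each step a byte passes through fails independently with its own small probability. That is my simplification, not CERN's model, and the rates in the usage note are made-up illustrations, not measurements.

```python
import math


def end_to_end_error_rate(per_step_rates):
    """Probability that at least one of several independent steps corrupts a byte.

    Computes 1 - prod(1 - p) using log1p/expm1 so that very small
    probabilities are not lost to floating-point rounding.
    """
    return -math.expm1(sum(math.log1p(-p) for p in per_step_rates))
```

For example, `end_to_end_error_rate([1e-12] * 6)` models six memory operations at a hypothetical 1 in 10^12 each and comes out at roughly six times the single-step rate. For small rates the total is close to the sum of the parts, so every extra hop or copy raises the odds of an error.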

The Storage Bits take
My system has 1 TB of data on it, so if the CERN numbers hold true for me I have 3 corrupt files. Not a big deal for most people today. But if the industry doesn’t fix it, the silent data corruption problem will get worse. In “Rules of thumb in data engineering” the late Jim Gray posited that everything on disk today will be in main memory in 10 years.

If that empirical relationship holds, my PC in 2017 will have a 1 TB main memory and a 200 TB disk store. And about 500 corrupt files. At that point everyone will see data corruption and the vendors will have to do something.
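The back-of-the-envelope numbers above come straight from CERN's user-data check (22 corrupted files among 33,700 files in 8.7 TB). The Python sketch below redoes the arithmetic. It assumes corruption scales linearly with the amount of data stored and that my files look like CERN's, which is a rough guess at best.

```python
# Figures reported from the CERN check of user data.
CORRUPT_FILES = 22
FILES_CHECKED = 33_700
DATA_CHECKED_TB = 8.7


def files_per_corruption() -> float:
    """Average number of files checked per corrupted file found."""
    return FILES_CHECKED / CORRUPT_FILES


def expected_corrupt_files(data_tb: float) -> float:
    """Linear extrapolation of corrupt files for a given amount of data in TB."""
    return CORRUPT_FILES / DATA_CHECKED_TB * data_tb
```

`files_per_corruption()` gives a little over 1,500, `expected_corrupt_files(1)` gives about 2.5 (call it 3), and `expected_corrupt_files(200)` gives just over 500.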

So why not start fixing the problem now?

Comments welcome, of course. Here’s a link to the CERN Data Integrity paper. CERN runs Linux clusters, but based on the research Windows and Mac wouldn’t be much different.

Robin Harris has been messing with computers for over 30 years and selling and marketing data storage for over 20 in companies large and small. See his full profile and disclosure of his industry affiliations.



