
October 4th, 2009

DRAM error rates: Nightmare on DIMM street

Posted by Robin Harris @ 10:04 pm

Categories: Uncategorized

Tags: DRAM, Error, DIMM, Hardware Failure, Memory, Semiconductors, Hardware, Components, Robin Harris

A two-and-a-half year study of DRAM on tens of thousands of Google servers found that DIMM error rates are hundreds to thousands of times higher than previously thought: a mean of 3,751 correctable errors per DIMM per year.

This is the world’s first large-scale study of RAM errors in the field. It looked at multiple vendors, DRAM densities and DRAM types including DDR1, DDR2 and FB-DIMM.

Every system architect and motherboard designer should read it carefully.

If you can’t trust DRAM . . .
Here are some hard numbers from DRAM Errors in the Wild: A Large-Scale Field Study by Bianca Schroeder, U of Toronto, and Eduardo Pinheiro and Wolf-Dietrich Weber of Google.

The Google servers use ECC DRAM that typically corrects single bit errors and reports double bit errors. It is a rare notebook or consumer desktop that supports ECC.

You could be having DRAM problems and not know it because even the system doesn’t know.
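To make "corrects single bit errors and reports double bit errors" concrete, here is a toy sketch of that kind of code (SECDED, an extended Hamming code) protecting 4 data bits with 4 check bits. This is an illustration only: real ECC DIMMs typically protect 64 data bits with 8 check bits, and the function names `encode` and `decode` are made up for this example, not anything from the study.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits)."""
    data = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                     # index 0: overall parity; 1..7: Hamming(7,4)
    code[3], code[5], code[6], code[7] = data
    for p in (1, 2, 4):                # each check bit covers positions with bit p set
        code[p] = sum(code[i] for i in range(1, 8) if i & p) % 2
    code[0] = sum(code[1:]) % 2        # overall parity makes double errors detectable
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for p in (1, 2, 4):
        if sum(code[i] for i in range(1, 8) if i & p) % 2:
            syndrome |= p
    overall_ok = sum(code) % 2 == 0
    if syndrome == 0 and overall_ok:
        status = "ok"
    elif not overall_ok:
        # Odd number of flips: treat as a single-bit error at position `syndrome`
        # (syndrome 0 means the overall parity bit itself flipped).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return "uncorrectable", None
    nibble = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return status, nibble


word = encode(0b1011)
print(decode(word))                    # ('ok', 11)

word[5] ^= 1                           # one flipped bit
print(decode(word))                    # ('corrected', 11)

word[2] ^= 1                           # a second flipped bit
print(decode(word))                    # ('uncorrectable', None)
```

Without the check bits, all three reads would simply hand back whatever was in the cells, which is exactly the non-ECC situation.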

Non-ECC DRAM is more common
Most DIMMs don’t include ECC because it costs more. Without ECC the system doesn’t know a memory error has occurred.

Everything is fine until the data corruption means a missed memory reference, an incorrect value or a flipped bit in a file being written to disk. What you see is a “file not found” or “file not readable” message or, worse yet, silent data corruption - or even a system crash. And nothing that says “memory error.”

Conventional Wisdom
The industry take on DRAM is summed up in a quote from an old AnandTech FAQ that took the industry at its word:

Everyone can agree that hard errors are fairly rare. . . . For the frequency of soft errors. . . . IBM stated . . . that at sea level, a soft error event occurs once per month of constant use in a 128MB PC100 SDRAM module. Micron has stated that it is closer to once per six months . . . .

An even bigger surprise: it appears that hard errors, not soft errors, are the dominant error mode - the reverse of the conventional wisdom.
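A quick back-of-the-envelope check shows where "hundreds to thousands of times higher" comes from. The sketch below simply divides the study's mean by the two vendor figures quoted above. It is a rough comparison under loud assumptions: the vendor numbers are soft errors on a 128MB PC100 module, while the study's mean counts all correctable errors on newer, larger DIMMs, and the helper name `times_higher` is made up for this example.

```python
MEAN_CE_PER_DIMM_YEAR = 3751   # study mean: correctable errors per DIMM per year
IBM_PER_YEAR = 12              # "once per month of constant use"
MICRON_PER_YEAR = 2            # "once per six months"


def times_higher(observed_per_year, claimed_per_year):
    """How many times larger the observed rate is than the claimed rate."""
    return observed_per_year / claimed_per_year


print(f"vs IBM figure:    {times_higher(MEAN_CE_PER_DIMM_YEAR, IBM_PER_YEAR):.0f}x")
print(f"vs Micron figure: {times_higher(MEAN_CE_PER_DIMM_YEAR, MICRON_PER_YEAR):.0f}x")
```

That works out to roughly 300 times the IBM figure and nearly 1,900 times the Micron figure.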

Good news
The study had several findings that are good news for consumers:

  • Temperature plays little role in errors - just as Google found with disk drives - so heroic cooling isn’t necessary.
  • The problem isn’t getting worse. The latest, most dense generations of DRAM perform as well, error-wise, as previous generations.
  • Heavily used systems have more errors - meaning casual users have less to worry about.
  • No significant differences between vendors or DIMM types (DDR1, DDR2 or FB-DIMM). You can buy on price - at least for the ECC-type DIMMs they investigated.
  • Only 8% of DIMMs had errors per year on average. Fewer DIMMs = fewer error problems - good news for users of smaller systems.

But something to think about for large-memory servers running, say, in-memory databases.
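Here is a rough sketch of why DIMM count matters. It takes the 8% per-DIMM annual figure above and, as a simplifying assumption that is mine and not the study's, treats DIMMs as independent. The study found that errors cluster on particular machines, so real numbers will differ; the function name `p_machine_affected` is hypothetical.

```python
def p_machine_affected(n_dimms, p_dimm=0.08):
    """Chance that at least one of n_dimms sees an error in a year.

    Assumes each DIMM is affected independently with probability p_dimm,
    which is a simplification for illustration only.
    """
    return 1 - (1 - p_dimm) ** n_dimms


for n in (1, 2, 4, 8, 16, 32):
    print(f"{n:2d} DIMMs: {p_machine_affected(n):.0%} chance of at least one affected DIMM")
```

Under that assumption, a 2-DIMM desktop sits at about 15% per year, while a fully loaded server climbs well past even odds.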

Bad news
Besides error rates much higher than expected - which is plenty bad - the study found that error rates depended on the motherboard, not on DIMM type or vendor. This means that some popular mobos have poor EMI hygiene. Route a memory trace too close to a noisy component or skimp on ground layers and you get instant error problems.

Hard errors - actual hardware failures - are much more common than expected as well and may be the most common type of memory failure. Google replaces all DIMMs with hard errors - as do most data centers - as a matter of policy.

Other interesting findings
For all platforms they found that 20% of the machines with errors make up more than 90% of all observed errors on that platform. There be lemons out there!

In more than 93% of the cases a machine that sees a correctable error experiences at least one more in the same year. They don’t get better by themselves.

High-quality error correction codes are effective in reducing uncorrectable errors. There are “chip-kill” DIMM/mobo combinations that can detect and correct 4-bit errors, but few vendors make those.

Besides costing more, ECC DIMMs are about 3-5% slower than unprotected DIMMs. Few of us would ever notice that small a performance hit, but gamers might care.

The Storage Bits take
You’d think that after several decades of semiconductor DRAM usage this study would be old news. I did.

Like most folks I accepted industry assurances that DRAM is reliable. My main machine today uses power-hungry fully-buffered ECC DIMMs.

But I was surprised when I checked the Memory section of “About this Mac” and discovered that 1 of my 6 2GB DIMMs was reporting correctable memory errors. Time to see if the “lifetime” warranty means anything.

I suspect this is another example of the industry’s code of omerta. Big system vendors have scads of data on disk drives, DRAM, network adapters, OS and filesystem based on mortality and tech support calls, but do they share this with the consuming public? Nothing to see here folks, just move along.

Kudos to Google for doing the long-term research required for substantive results and then sharing those results with the rest of us. Data is what makes your computer YOUR computer, and it is worth protecting. Forking over a bit more for ECC mobos and DIMMs may be worth it for serious users.

I expect ECC systems will become a lot more popular in the years ahead.

Comments welcome, of course. Can someone please document how to access ECC error reporting on Windows and Linux machines too? Thanks.
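For Linux, one commenter below points to EDAC. On kernels with an EDAC driver loaded for the memory controller, per-controller counters are exposed as `ce_count` (corrected errors) and `ue_count` (uncorrected errors) under `/sys/devices/system/edac/mc/`. Here is a minimal sketch that reads them. Assumptions: the machine has ECC memory and a supported chipset, and the helper name `read_edac_counts` is invented for this example. It does not cover Windows.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce_count": int or None, "ue_count": int or None}}.

    An empty dict means no EDAC memory controllers were found, e.g. no ECC
    memory, no EDAC driver loaded, or not a Linux system.
    """
    root_path = Path(root)
    counts = {}
    if not root_path.is_dir():
        return counts
    for mc in sorted(root_path.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((mc / name).read_text().strip())
            except (OSError, ValueError):
                entry[name] = None      # counter missing or unreadable
        counts[mc.name] = entry
    return counts


counts = read_edac_counts()
if not counts:
    print("No EDAC memory controllers found.")
for controller, entry in counts.items():
    print(f"{controller}: corrected={entry['ce_count']} uncorrected={entry['ue_count']}")
```

A non-zero and growing corrected count on one controller is the Linux equivalent of the warning I found in “About this Mac.”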

Robin Harris has been messing with computers for over 30 years and selling and marketing data storage for over 20 in companies large and small. See his full profile and disclosure of his industry affiliations.



  • Talkback
Windows.  jdbukis@... | 10/05/09
Which utility? Care to enlighten us?  de-void | 10/05/09
I dunno about Vista... but in 7...  James T. Kirk | 10/05/09
WIndows has no such program.  AzuMao | 10/05/09
Where in "About this Mac"?  MC_z | 10/05/09
Click on Memory under Hardware...  olePigeon | 10/05/09
Thanks  MC_z | 10/05/09
Might want to check if you have ECC memory  DNSB | 10/05/09
Here's what a Mac ECC error looks like  R HarrisZDNet Moderator | 10/05/09
Mac ECC  R HarrisZDNet Moderator | 10/05/09
If you're a gamer, you'd take the faster cheaper dimm  georgeou | 10/05/09
Nope!  AzuMao | 10/05/09
Riiiiight  Narg | 10/05/09
Wroooooong  MC_z | 10/05/09
Wrong  seveprim@... | 10/05/09
CRC  DNSB | 10/05/09
from nothing-much to man-am-I-screwed.  Agnostic_OS | 10/12/09
Did you even read my post? In fact; do you even know what a computer is?  AzuMao | 10/05/09
Rather low probability  DNSB | 10/05/09
What are you on about?  AzuMao | 10/06/09
Rather low probability  Agnostic_OS | 10/12/09
I think Google must be doing something different.  CobraA1 | 10/05/09
...doing something different  RWNorman | 10/05/09
Sure  CobraA1 | 10/05/09
Not so sure  rjplummer | 10/05/09
For another thing  jorjitop | 10/05/09
I never mentioned the OS  CobraA1 | 10/05/09
ECC is hardware, genius.  AzuMao | 10/05/09
Erm  Bozzer | 10/06/09
Stable machines...  gjsherr | 10/05/09
reliable servers  BikeSp | 10/09/09
Huh?  AzuMao | 10/05/09
RE: DRAM error rates: nightmare on DIMM street  AlterGeek | 10/05/09
Alpha particle problem was solved years ago!  lcarliner@... | 10/05/09
ABSOLUTELY NOT!  de-void | 10/05/09
Other outside causes  Agnostic_OS | 10/12/09
error rates  BikeSp | 10/09/09
Magnetic field weakening is irrelevant...  David A. Pimentel | 10/05/09
DRAM error rate... cosmic rays are relavant  jrlambert | 10/05/09
RE: DRAM error rates: nightmare on DIMM street  lcarliner@... | 10/05/09
So Identify the Brand of Motherboards!  mark@... | 10/05/09
Google doesn't identify bum vendors.  R HarrisZDNet Moderator | 10/05/09
Full disclosure would be nice...  MV_z | 10/06/09
RE: DRAM error rates: Vista error reporting  flboffin | 10/05/09
Hard Fault isn't an Error  AlanO93 | 10/05/09
re: Vista error reporting  wolfmeiister@... | 10/05/09
Helpful Linux commands related to memory  lamapper | 10/05/09
Many thanks  Agnostic_OS | 10/12/09
ECC isn't necessary here...  stormculture | 10/05/09
ECC is necessary here  nichow | 10/05/09
That is a huge amount of overhead in two ways  DevGuy_z | 10/05/09
No  AzuMao | 10/05/09
There are multiple ECC versions  R HarrisZDNet Moderator | 10/05/09
ECC more robust solution  Agnostic_OS | 10/12/09
Hardfault is both an error and normal  nichow | 10/05/09
Page faults..  AzuMao | 10/05/09
This Just Stinks! Forget the Lawyers, Just Shoot the Accountants Instead!!  Irritated_User | 10/05/09
"Sometimers" isn't just for old guys any more!  kd5auq | 10/05/09
RE: DRAM error rates: nightmare on DIMM street  scotsilv | 10/05/09
RE: DRAM error rates: nightmare on DIMM street  tommydino | 10/05/09
On Linux, it's EDAC  Mace Moneta | 10/05/09
Linux newbe  Agnostic_OS | 10/12/09
Finding reliability source is tough - here are some places to start looking  Chas_ | 10/05/09
RE: DRAM error rates: nightmare on DIMM street  rakslice | 10/05/09
ECC RAM is not that expensive...  MV_z | 10/06/09
RE: DRAM error rates: nightmare on DIMM street  bbodnar1@... | 10/06/09
performance difference for gamers is more than 3-5 % because:  pard | 10/06/09
No. For most games latency is more important than bandwidth.  AzuMao | 10/06/09
that's incorrect  pard | 10/08/09
Well excuse me for using the wrong wording. I meant timing.  AzuMao | 10/08/09
Cosmic rays  BillFerreira | 10/06/09
Cosmic ray levels?  samvilain | 10/06/09
On linux and server boards use ipmitools  danb1974 | 10/06/09
RE: DRAM error rates: nightmare on DIMM street  gunawan_1970@... | 10/07/09
"Should start thinking about it"??  AzuMao | 10/07/09
Without getting a server motherboard, what is a good MB that uses ECC?  John238 | 10/07/09
Why not server?  BikeSp | 10/09/09
Linux ECC monitoring  BikeSp | 10/09/09
RE: DRAM error rates: nightmare on DIMM street  Sabatian Hiddink | 10/12/09
There was a really good ECC memory product, but nobody was interested  vschukin | 10/13/09
