From: Robert Hancock <hancockr@shaw.ca>
To: Bruce Allen <ballen@gravity.phys.uwm.edu>
Cc: Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	Bruce Allen <bruce.allen@aei.mpg.de>
Subject: Re: ECC and DMA to/from disk controllers
Date: Mon, 10 Sep 2007 19:09:13 -0600
Message-ID: <46E5EAB9.9020504@shaw.ca>
In-Reply-To: <fa.qDlrttL6xIyU0yqEbWvsJiyy5tU@ifi.uio.no>

Bruce Allen wrote:
> Dear LKML,
> 
> Apologies in advance for potential mis-use of LKML, but I don't know 
> where else to ask.
> 
> An ongoing study on datasets of several petabytes has shown that there 
> can be 'silent data corruption' at rates much larger than one might 
> naively expect from the expected error rates in RAID arrays and the 
> expected probability of single-bit uncorrected errors in hard disks.
> 
> The origin of this data corruption is still unknown.  See for example 
> http://cern.ch/Peter.Kelemen/talk/2007/kelemen-2007-C5-Silent_Corruptions.pdf 
> 
> 
> In thinking about this, I began to wonder about the following.  Suppose 
> that a (possibly RAID) disk controller correctly reads data from disk 
> and has correct data in the controller memory and buffers.  However when 
> that data is DMA'd into system memory some errors occur (cosmic rays, 
> electrical noise, etc).  Am I correct that these errors would NOT be 
> detected, even on a 'reliable' server with ECC memory?  In other words 
> the ECC bits would be calculated in server memory based on incorrect 
> data from the disk.

It depends on where the data got corrupted. Transfers over the PCI bus 
are normally protected by parity, and PCI Express protects each packet 
on the link with a CRC, so errors there would get detected. Such errors 
are quite rare unless the motherboard or expansion card is faulty or 
badly designed with timing problems.
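
To illustrate the difference between the two kinds of bus protection, 
here is a minimal sketch in Python. It is a simplification, not a model 
of real bus hardware: the single parity bit is computed over a whole 
made-up payload (real PCI computes parity per bus cycle), and zlib's 
CRC-32 stands in for whatever CRC the link actually uses. The helper 
names and the payload are invented for the example.

```python
import zlib

def even_parity(data: bytes) -> int:
    """Return one even-parity bit computed over every bit of data."""
    ones = sum(bin(b).count("1") for b in data)
    return ones & 1

def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)

# Made-up payload standing in for data in flight on the bus.
payload = b"sector data read correctly by the controller"
one_flip = flip_bit(payload, 5)       # single-bit error
two_flips = flip_bit(one_flip, 77)    # second error in the same transfer

# A check "detects" the error if its value differs from the original's.
parity_detects_one = even_parity(one_flip) != even_parity(payload)
parity_detects_two = even_parity(two_flips) != even_parity(payload)
crc_detects_one = zlib.crc32(one_flip) != zlib.crc32(payload)
crc_detects_two = zlib.crc32(two_flips) != zlib.crc32(payload)

print(parity_detects_one, parity_detects_two)
print(crc_detects_one, crc_detects_two)
```

A single parity bit catches any odd number of flipped bits but is blind 
to an even number, whereas a CRC also catches the double-bit case on a 
short payload like this one.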

However, it's conceivable that data could get corrupted inside the 
controller, or inside the chipset. This also seems quite rare, except 
in the presence of design flaws (like some VIA southbridges that had 
nasty problems with losing data if PCI bus masters kept the CPU off the 
PCI bus too long, which we have to work around).

> 
> The alternative is that disk controllers (or at least ones that are 
> meant to be reliable) DMA both the data AND the ECC byte into system 
> memory. So that if an error occurs in this transfer, then it would most 
> likely be picked up and corrected by the ECC mechanism.  But I don't 
> think that 'this is how it works'.  Could someone knowledgeable please 
> confirm or contradict?

I don't know of any controller that works this way. It would greatly 
increase CPU overhead, since the CPU would need to perform this CRC 
calculation on every transfer.
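
As a rough sketch of what such a scheme would cost the host, assume a 
purely hypothetical controller that appends a 4-byte CRC-32 to each 
512-byte sector it DMAs into memory. The buffer layout and the function 
names controller_pack and host_verify are invented for illustration and 
do not describe any real controller or driver.

```python
import struct
import zlib

def controller_pack(data: bytes) -> bytes:
    """Hypothetical controller side: data followed by its CRC-32."""
    return data + struct.pack("<I", zlib.crc32(data))

def host_verify(buf: bytes) -> bytes:
    """Host side: the check the CPU would run on every completed transfer."""
    data = buf[:-4]
    (stored,) = struct.unpack("<I", buf[-4:])
    if zlib.crc32(data) != stored:
        raise IOError("checksum mismatch after DMA")
    return data

sector = bytes(range(256)) * 2            # one 512-byte sector
dma_buffer = controller_pack(sector)      # what would land in system memory
clean = host_verify(dma_buffer)           # intact transfer passes

# Simulate a bit flipped somewhere between controller and memory.
corrupted = bytearray(dma_buffer)
corrupted[100] ^= 0x04
try:
    host_verify(bytes(corrupted))
    corruption_caught = False
except IOError:
    corruption_caught = True

print(len(dma_buffer), corruption_caught)
```

The corruption is caught, but only because host_verify touches every 
byte of every transfer, which is the extra CPU work described above.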

-- 
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@nospamshaw.ca
Home Page: http://www.roberthancock.com/

