linux-raid.vger.kernel.org archive mirror
From: Roger Heflin <rogerheflin@gmail.com>
To: Greg Freemyer <greg.freemyer@gmail.com>
Cc: "Michał Przyłuski" <mikylie@gmail.com>,
	"Peter Rabbitson" <rabbit+list@rabbit.us>,
	Redeeman <redeeman@metanurb.dk>,
	linux-raid@vger.kernel.org
Subject: Re: detection/correction of corruption with raid6
Date: Fri, 05 Dec 2008 18:39:18 -0600
Message-ID: <4939C9B6.30600@gmail.com>
In-Reply-To: <87f94c370812051443nd154992kfb61f3b6f0f5625d@mail.gmail.com>

Greg Freemyer wrote:

> I'm also very concerned about silent corruption and we often "verify"
> our critical large files by  performing MD5 verifies against a known
> good value.  Especially when we make copies or move them from one
> media to another.
> 
> But in all the cases of silent corruption I've seen, it was never the
> disk.  Instead I've seen it be the cable, the controller, bad memory,
> bad power supply, but never the disk itself.  Not to say the disk
> controller could not be the cause, just that I have not seen it.
> 
> I did not read the relevant threads, but do they cover all of these
> sources of silent corruption, or just if a disk is the source?
> 
> Thanks
> Greg

I will second what Greg says. I have debugged a number of corruptions
related to filesystems, and I have never seen the disk be the cause.
I have seen 3-4 different controllers corrupt data (bad PCI/MB
interactions with controllers from 2 different manufacturers, plus
one controller that was simply bad).

And then the #1 issue is actual bad memory or a bad power supply in
the machine. None of the cases I saw affected *ONLY* a single disk;
they affected all of the disks on the controller, so whatever has to
be done would almost have to be done at the filesystem level or the
application level.

The typical corruption is not data coming off of the disk. The
platters themselves (and the internals of the disk) appear to have
very good corruption detection and correction, so it is really
unlikely for a bad sector read to not get caught. The PCI bus only
has parity (and parity errors on the PCI bus are likely not being
monitored unless you installed the edac_mc module). Parity only
catches errors that flip an odd number of bits, so roughly 50% of
random multi-bit errors get missed. This was one of the bad PCI/MB
interactions: one of the slots on a certain MB (every board of that
specific model, with cards from a couple of different companies)
*HAD* to be throttled to not produce corrupt data every 1GB of reads
or so.
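
To make the parity point concrete, here is a minimal sketch in Python.
It assumes a single even-parity bit protecting each word, which is a
simplification and not a model of the real PCI signalling. The helper
names (parity, flip_bits, parity_detects, undetected_fraction) are
made up for this illustration.

```python
def parity(word: int) -> int:
    """Return the even-parity bit (0 or 1) of a non-negative integer."""
    return bin(word).count("1") % 2


def flip_bits(word: int, positions) -> int:
    """Return word with the bits at the given (distinct) positions inverted."""
    for pos in positions:
        word ^= 1 << pos
    return word


def parity_detects(original: int, received: int) -> bool:
    """True if a single parity bit would notice the difference."""
    return parity(original) != parity(received)


def undetected_fraction(width: int) -> float:
    """Fraction of all non-zero error patterns of the given bit width
    that leave the parity bit unchanged (an even number of flipped bits)."""
    patterns = range(1, 1 << width)
    missed = sum(1 for error in patterns if parity(error) == 0)
    return missed / len(patterns)
```

A single flipped bit is always caught, e.g.
parity_detects(0xDEADBEEF, flip_bits(0xDEADBEEF, [3])) is True, while
flipping two bits ([3, 17]) is not. Counting every possible error
pattern on an 8-bit word, undetected_fraction(8) comes out to 127/255,
just under one half.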

And internally the controllers often have poor checking, and will
miss things if the controller goes bad. The disks themselves appear
to have very good internal controls; I have never seen disk
electronics screw up and corrupt data either.

Basically, don't waste time worrying about a single disk corrupting
data silently. Worry about everything after the disk first, as that
is the weakest link and is far more likely to bite you.
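
For the application-level check, something along the lines of the MD5
verification Greg describes could look like the sketch below. It is
only an illustration: the function names are hypothetical, MD5 is used
because that is what Greg mentioned, and any hashlib algorithm could
be swapped in.

```python
import hashlib
import shutil


def file_digest(path, algorithm="md5", chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file(path, expected_hex, algorithm="md5"):
    """Compare a file's digest against a known-good hex value."""
    return file_digest(path, algorithm) == expected_hex.lower()


def copy_and_verify(src, dst, algorithm="md5"):
    """Copy src to dst, then re-read dst and compare digests.

    Caveat: a read-back right after the copy may be served from the
    page cache rather than from the disk, so this mostly checks the
    copy path; re-verifying later (or on another machine) is stronger.
    """
    expected = file_digest(src, algorithm)
    shutil.copyfile(src, dst)
    return verify_file(dst, expected, algorithm)
```

Because the digest is computed by the application over the data it
actually received, this kind of check covers the cable, controller,
bus and memory as well as the disk, which is the point of doing it
above the block layer.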

