From: David Brown <david.brown@hesbynett.no>
To: Roy Sigurd Karlsbakk <roy@karlsbakk.net>
Cc: Linux Raid <linux-raid@vger.kernel.org>,
Bernd Schubert <bernd.schubert@fastmail.fm>
Subject: Re: Checksumming RAID?
Date: Tue, 27 Nov 2012 13:37:31 +0100
Message-ID: <50B4B40B.3000807@hesbynett.no>
In-Reply-To: <22100889.14.1354016374529.JavaMail.root@zimbra>

On 27/11/2012 12:39, Roy Sigurd Karlsbakk wrote:
>> I can certainly sympathise with you, but I am not sure that data
>> checksumming would help here. If your hardware raid sends out
>> nonsense, then it is going to be very difficult to get anything
>> trustworthy. The obvious answer here is to throw out the broken
>> hardware raid and use a system that works - but it is equally
>> obvious that that is easier said than done! But I would find it
>> hard to believe that this is a common issue with hardware raid
>> systems - it goes against the whole point of data storage.
>>
>> There is always a chance of undetected read errors - the question
>> is if the chances of such read errors, and the consequences of
>> them, justify the costs of extra checking. And if they /do/ justify
>> extra checking, are data checksums the right way?
>
> The chance of a silent corruption is rather small with your average
> 3TB home storage. On the other hand, if you had a petabyte or five,
> the chances would be very high indeed to get silent corruption (ref
> the CERN study done in 2007). In my last job, I worked with ZFS with
> ~350TiB storage, and there we saw errors happen rather frequently,
> but then, since ZFS checksums data and uses it to deal with errors,
> we never saw any data loss. That is, except on an older machine,
> running ZFS on a hardware RAID controlled storage unit (NexSAN
> SATABeast). We had error corruption on that one as well, after a disk
> failure, and had to resort to restoring from tape, since ZFS couldn't
> control the RAID.

Of course even a small chance-per-bit turns into a significant total
chance when you have enough bits! There is always a chance of
undetected issues - your aim is to reduce that chance until it is no
longer relevant (or until the chance is under 1 in 150 million per year
- then you should worry more about being killed by lightning).
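
To put rough numbers on that, here is a small sketch of how a per-bit
error chance scales with capacity. The per-bit rate is a made-up
placeholder, not a measured figure (it is not taken from the CERN study
or any drive datasheet), and the model assumes independent errors, which
real drives will not strictly obey.

```python
import math

def p_any_error(p_per_bit: float, n_bits: float) -> float:
    """Chance of at least one error among n_bits independent bits.

    Computes 1 - (1 - p)^n via log1p/expm1 so that very small p
    does not get rounded away.
    """
    return -math.expm1(n_bits * math.log1p(-p_per_bit))

# ASSUMPTION: hypothetical undetected-error rate per bit read.
P_PER_BIT = 1e-17

for label, terabytes in (("3 TB", 3), ("5 PB", 5000)):
    bits = terabytes * 1e12 * 8
    print(f"{label}: {p_any_error(P_PER_BIT, bits):.4%} per full read")
```

With this placeholder rate the 3 TB case stays well under a tenth of a
percent per full read, while the 5 PB case is already a substantial
fraction - the same per-bit chance, just many more bits.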
>
>> I agree with Neil's post that end-to-end checksums (such as CRCs in
>> a gzip file, or GPG integrity checks) are the best check when they
>> are possible, but they are not always possible because they are not
>> transparent.
>
> The problem with end-to-end-checksums at the application level, is it
> will only be able to detect the error, not fix it, similar to the
> issues I mentioned above.
>
Checksumming, as suggested by the originally mentioned paper, will not
be able to correct anything either. At first glance, it might seem that
it would tell you which block was wrong, and therefore let you re-build
that block from the rest of the raid stripe. But that will not be the
case if there are issues while writing, such as unexpected power
failures - it could just as easily be the data blocks that are correctly
written while the checksum block is wrong. And exactly as discussed in
Neil's post on "smart" recovery, the principle of least surprise
suggests giving the data blocks back unchanged is the least harmful.
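
A toy illustration of that ambiguity, assuming a CRC32 per block stored
separately from the data (the block contents and layout here are
invented for the example, not taken from the paper):

```python
import zlib

def matches(data: bytes, stored_crc: int) -> bool:
    """True if the data block agrees with its stored checksum."""
    return zlib.crc32(data) == stored_crc

old = b"old contents"
new = b"new contents"

# Case A: power fails after the data block is written but before the
# checksum block is updated. The data is good, the checksum is stale.
case_a = (new, zlib.crc32(old))

# Case B: the checksum block is current, but the data block has been
# silently corrupted.
case_b = (b"new c0ntents", zlib.crc32(new))

results = [matches(data, crc) for data, crc in (case_a, case_b)]
print(results)  # both cases report a mismatch
```

Both cases look identical from the block layer: a mismatch, with no
indication of whether the data or the checksum is the stale one.
"Repairing" the data in case A would destroy a correct write.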

Doing checksumming (and in particular, recovery) requires higher-level
knowledge of the data. The filesystem can track when it writes a file,
and update metadata (including, if desired, a data checksum) once it
knows the file is correctly stored. But I don't think it can sensibly
be done at the block device level - the recovery procedure doesn't know
what is old data, what is new data, or which bit is important to the
filesystem.

So I think it can make sense to use a filesystem like ZFS or BTRFS that
can do checksumming - that is a reasonable level to add the checksum.

One way to handle this at md block level would be to have an option for
raid arrays to always do a full stripe read and consistency check
whenever a block is read. If the consistency check fails (without any
errors being indicated from the drives), the array should simply return
a read error - it should /not/ attempt to recover the data (since it
can't tell which parts are the real problem). If arrays with this
option are used as first-level arrays, with a "normal" md raid array
(raid1, raid5, etc.) on top, then the normal raid recovery process will
replace the bad data and initiate a new write to correct the undetected
read error. I think this would perhaps give you the level of
reliability you are looking for, and be suitable for big arrays (indeed,
it would be unsuitable for small arrays as you need at least two levels).
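
As a rough sketch of the idea (plain Python with XOR parity standing in
for a raid5-style stripe; none of these names are md code or an existing
md option, and the write-back repair step is left out):

```python
from functools import reduce

class StripeReadError(OSError):
    """Raised when a stripe fails its consistency check."""

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def checked_read(data_blocks, parity, index):
    """Lower level: read the full stripe and verify it on every read.

    On a mismatch, report a read error and do NOT try to guess
    which block is the bad one.
    """
    if xor_blocks(data_blocks) != parity:
        raise StripeReadError("stripe parity mismatch")
    return data_blocks[index]

def mirror_read(legs, index):
    """Upper level: a raid1-like layer that falls back to another leg."""
    for data_blocks, parity in legs:
        try:
            return checked_read(data_blocks, parity, index)
        except StripeReadError:
            continue  # a real array would also rewrite the bad leg
    raise StripeReadError("all legs failed the consistency check")

good = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(good)
bad = [b"AAAA", b"BxBB", b"CCCC"]  # silent corruption, no drive error

legs = [(bad, parity), (good, parity)]
print(mirror_read(legs, 1))  # served from the healthy leg
```

The lower level only ever answers "consistent" or "read error"; the
decision about where good data comes from stays with the upper level,
which has a second copy to compare against.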

Best regards,
David