* Checksumming RAID?
From: Roy Sigurd Karlsbakk @ 2012-11-26 13:27 UTC (permalink / raw)
To: Linux Raid
Hi all
I see from an article at http://pages.cs.wisc.edu/~bpkroth/cs736/md-checksums/md-checksums-paper.pdf that an implementation has been made to allow for ZFS-like checksumming inside Linux MD. However, this code doesn't seem to exist in any kernel trees. Does anyone know the current status for data checksumming in MD?
--
Vennlige hilsener / Best regards
roy
--
Roy Sigurd Karlsbakk
(+47) 98013356
roy@karlsbakk.net
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of xenotypic etymology. In most cases, adequate and relevant synonyms exist in Norwegian.
* Re: Checksumming RAID?
From: David Brown @ 2012-11-27 9:45 UTC (permalink / raw)
To: Roy Sigurd Karlsbakk; +Cc: Linux Raid
On 26/11/2012 14:27, Roy Sigurd Karlsbakk wrote:
> Hi all
>
> I see from an article at
> http://pages.cs.wisc.edu/~bpkroth/cs736/md-checksums/md-checksums-paper.pdf
> that an implementation has been made to allow for ZFS-like
> checksumming inside Linux MD. However, this code doesn't seem to
> exist in any kernel trees. Does anyone know the current status for
> data checksumming in MD?
>
See <http://neil.brown.name/blog/20110227114201> for a discussion on
data checksums.
As far as I have seen on this mailing list, there has been no "official"
work on checksums as described in that paper. I suspect it's just a
matter of a student or two doing a project as part of their university
degree. It's great that people can do that - they are free to take a
copy of the kernel, and experiment with new ideas. If the ideas are
good, then it is possible to work them back into the mainline kernel
development.
However, in this case I think there is not much support for data
checksumming amongst the "big boys" in this part of the Linux kernel -
as explained by Neil in his blog post.
My first thought when reading the paper in question is that it doesn't
really add much that is actually useful. md does not need checksums -
it already has a more powerful system for error detection and correction
through the parity blocks. If you want more checksumming than raid5
gives you, then use raid6.
What might be of interest for confirming the data integrity is to say
that whenever a block is to be read, the stripe it is in should be
scrubbed. This would enforce regular scrubbing of data that is
regularly used, and give the same benefits as the article's data
checksumming. It would lead to more disk reads when you have small
reads, but the overhead would be small for larger reads or for RMW
writes (since the whole stripe, minus the parity, is read in this case).
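To make the scrub-on-read idea concrete, here is a minimal userspace
sketch (toy sizes, hypothetical names - nothing here is md code):
recompute the XOR parity over the whole stripe and compare it with the
stored parity chunk before returning the block.

#include <stdio.h>
#include <string.h>

#define NDISKS 4 /* 3 data chunks + 1 parity chunk per stripe */
#define CHUNK 8  /* toy chunk size in bytes */

/* Return 0 if the stripe is self-consistent, -1 otherwise (the
 * caller would then fail or retry the read). */
static int stripe_consistent(unsigned char chunks[NDISKS][CHUNK])
{
    unsigned char p[CHUNK];
    memset(p, 0, sizeof(p));
    for (int d = 0; d < NDISKS - 1; d++)        /* data chunks */
        for (int i = 0; i < CHUNK; i++)
            p[i] ^= chunks[d][i];
    return memcmp(p, chunks[NDISKS - 1], CHUNK) ? -1 : 0;
}

int main(void)
{
    unsigned char stripe[NDISKS][CHUNK] = {
        "dataAAA", "dataBBB", "dataCCC", ""
    };
    for (int i = 0; i < CHUNK; i++)             /* write correct parity */
        stripe[NDISKS - 1][i] =
            stripe[0][i] ^ stripe[1][i] ^ stripe[2][i];
    printf("clean stripe:   %s\n",
           stripe_consistent(stripe) ? "MISMATCH" : "ok");
    stripe[1][2] ^= 0x40;                       /* silent corruption */
    printf("after bit flip: %s\n",
           stripe_consistent(stripe) ? "MISMATCH" : "ok");
    return 0;
}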
However, referring to another of Neil's blog posts at
<http://neil.brown.name/blog/20100211050355>, you have to ask yourself
how likely is it that data will be read from the drive with an error,
but without the disk telling you of the error - and what can you
sensibly do about it? You don't need checksums to tell you that there
is a problem reading data from the disk - the disk already has very
comprehensive checking of the data, and if that fails it will report an
error and the md layer will re-construct the data from the parity and
the rest of the stripe.
So before worrying about data checksums, please read Neil's posts, and
try to think out scenarios where it really would help. And if you find
you have a good argument, then post it here.
David
* Re: Checksumming RAID?
From: Bernd Schubert @ 2012-11-27 10:17 UTC (permalink / raw)
To: David Brown; +Cc: Roy Sigurd Karlsbakk, Linux Raid
On 11/27/2012 10:45 AM, David Brown wrote:
> On 26/11/2012 14:27, Roy Sigurd Karlsbakk wrote:
>> Hi all
>>
>> I see from an article at
>> http://pages.cs.wisc.edu/~bpkroth/cs736/md-checksums/md-checksums-paper.pdf
>>
>> that an implementation has been made to allow for ZFS-like
>> checksumming inside Linux MD. However, this code doesn't seem to
>> exist in any kernel trees. Does anyone know the current status for
>> data checksumming in MD?
>>
>
> See <http://neil.brown.name/blog/20110227114201> for a discussion on
> data checksums.
>
> As far as I have seen on this mailing list, there has been no "official"
> work on checksums as described in that paper. I suspect it's just a
> matter of a student or two doing a project as part of their university
> degree. It's great that people can do that - they are free to take a
> copy of the kernel, and experiment with new ideas. If the ideas are
> good, then it is possible to work them back into the mainline kernel
> development.
>
> However, in this case I think there is not much support for data
> checksumming amongst the "big boys" in this part of the Linux kernel -
> as explained by Neil in his blog post.
>
> My first thought when reading the paper in question is that it doesn't
> really add much that is actually useful. md does not need checksums -
> it already has a more powerful system for error detection and correction
> through the parity blocks. If you want more checksumming than raid5
> gives you, then use raid6.
>
>> What might be of interest for confirming the data integrity is to say
> that whenever a block is to be read, the stripe it is in should be
> scrubbed. This would enforce regular scrubbing of data that is
> regularly used, and give the same benefits as the article's data
> checksumming. It would lead to more disk reads when you have small
> reads, but the overhead would be small for larger reads or for RMW
> writes (since the whole stripe, minus the parity, is read in this case).
>
> However, referring to another of Neil's blog posts at
> <http://neil.brown.name/blog/20100211050355>, you have to ask yourself
> how likely is it that data will be read from the drive with an error,
> but without the disk telling you of the error - and what can you
> sensibly do about it? You don't need checksums to tell you that there
> is a problem reading data from the disk - the disk already has very
> comprehensive checking of the data, and if that fails it will report an
> error and the md layer will re-construct the data from the parity and
> the rest of the stripe.
That's the theory; real life unfortunately teaches a different story. I
just helped to recover as much data as possible from a troublesome
Infortrend raid system, which in turn is part of a software raid. The
stupid hardware raid decided for unknown reasons to return different
data on each read. And this is already the 4th or 5th time that has
happened (it's a rather big installation, and each time a different
hardware raid causes the trouble).
And yes, I have also already seen several hard disks return wrong
data. That is actually the reason why some hardware raid vendors such as
DDN do parity reads all the time and then correct wrong data or fail
the disks entirely.
I will send patches to better handle parity mismatches during the next
few weeks (for performance reasons, only for background checks).
Cheers,
Bernd
* Re: Checksumming RAID?
From: David Brown @ 2012-11-27 11:20 UTC (permalink / raw)
To: Bernd Schubert; +Cc: Roy Sigurd Karlsbakk, Linux Raid
On 27/11/2012 11:17, Bernd Schubert wrote:
> On 11/27/2012 10:45 AM, David Brown wrote:
>> On 26/11/2012 14:27, Roy Sigurd Karlsbakk wrote:
>>> Hi all
>>>
>>> I see from an article at
>>> http://pages.cs.wisc.edu/~bpkroth/cs736/md-checksums/md-checksums-paper.pdf
>>>
>>>
>>> that an implementation has been made to allow for ZFS-like
>>> checksumming inside Linux MD. However, this code doesn't seem to
>>> exist in any kernel trees. Does anyone know the current status for
>>> data checksumming in MD?
>>>
>>
>> See <http://neil.brown.name/blog/20110227114201> for a discussion on
>> data checksums.
>>
>> As far as I have seen on this mailing list, there has been no "official"
>> work on checksums as described in that paper. I suspect it's just a
>> matter of a student or two doing a project as part of their university
>> degree. It's great that people can do that - they are free to take a
>> copy of the kernel, and experiment with new ideas. If the ideas are
>> good, then it is possible to work them back into the mainline kernel
>> development.
>>
>> However, in this case I think there is not much support for data
>> checksumming amongst the "big boys" in this part of the Linux kernel -
>> as explained by Neil in his blog post.
>>
>> My first thought when reading the paper in question is that it doesn't
>> really add much that is actually useful. md does not need checksums -
>> it already has a more powerful system for error detection and correction
>> through the parity blocks. If you want more checksumming than raid5
>> gives you, then use raid6.
>>
>> What might be of interest for confirming the data integrity is to say
>> that whenever a block is to be read, the stripe it is in should be
>> scrubbed. This would enforce regular scrubbing of data that is
>> regularly used, and give the same benefits as the article's data
>> checksumming. It would lead to more disk reads when you have small
>> reads, but the overhead would be small for larger reads or for RMW
>> writes (since the whole stripe, minus the parity, is read in this case).
>>
>> However, referring to another of Neil's blog posts at
>> <http://neil.brown.name/blog/20100211050355>, you have to ask yourself
>> how likely is it that data will be read from the drive with an error,
>> but without the disk telling you of the error - and what can you
>> sensibly do about it? You don't need checksums to tell you that there
>> is a problem reading data from the disk - the disk already has very
>> comprehensive checking of the data, and if that fails it will report an
>> error and the md layer will re-construct the data from the parity and
>> the rest of the stripe.
>
> That's the theory; real life unfortunately teaches a different story. I
> just helped to recover as much data as possible from a troublesome
> Infortrend raid system, which in turn is part of a software raid. The
> stupid hardware raid decided for unknown reasons to return different
> data on each read. And this is already the 4th or 5th time that has
> happened (it's a rather big installation, and each time a different
> hardware raid causes the trouble).
> And yes, I have also already seen several hard disks return wrong
> data. That is actually the reason why some hardware raid vendors such as
> DDN do parity reads all the time and then correct wrong data or fail
> the disks entirely.
>
> I will send patches to better handle parity mismatches during the next
> few weeks (for performance reasons, only for background checks).
>
> Cheers,
> Bernd
>
>
I can certainly sympathise with you, but I am not sure that data
checksumming would help here. If your hardware raid sends out nonsense,
then it is going to be very difficult to get anything trustworthy. The
obvious answer here is to throw out the broken hardware raid and use a
system that works - but it is equally obvious that that is easier said
than done! But I would find it hard to believe that this is a common
issue with hardware raid systems - it goes against the whole point of
data storage.
There is always a chance of undetected read errors - the question is if
the chances of such read errors, and the consequences of them, justify
the costs of extra checking. And if they /do/ justify extra checking,
are data checksums the right way? I agree with Neil's post that
end-to-end checksums (such as CRCs in a gzip file, or GPG integrity
checks) are the best check when they are possible, but they are not
always possible because they are not transparent.
mvh.,
David
* Re: Checksumming RAID?
From: Roy Sigurd Karlsbakk @ 2012-11-27 11:39 UTC (permalink / raw)
To: David Brown; +Cc: Linux Raid, Bernd Schubert
> I can certainly sympathise with you, but I am not sure that data
> checksumming would help here. If your hardware raid sends out
> nonsense,
> then it is going to be very difficult to get anything trustworthy. The
> obvious answer here is to throw out the broken hardware raid and use a
> system that works - but it is equally obvious that that is easier said
> than done! But I would find it hard to believe that this is a common
> issue with hardware raid systems - it goes against the whole point of
> data storage.
>
> There is always a chance of undetected read errors - the question is
> if
> the chances of such read errors, and the consequences of them, justify
> the costs of extra checking. And if they /do/ justify extra checking,
> are data checksums the right way?
The chance of silent corruption is rather small with your average 3TB home storage. On the other hand, if you had a petabyte or five, the chances of getting silent corruption would be very high indeed (see the CERN study done in 2007). In my last job, I worked with ZFS on ~350TiB of storage, and there we saw errors happen rather frequently, but since ZFS checksums data and uses the checksums to deal with errors, we never saw any data loss. That is, except on an older machine running ZFS on a hardware RAID controlled storage unit (a NexSAN SATABeast). We had data corruption on that one as well, after a disk failure, and had to resort to restoring from tape, since ZFS couldn't control the RAID.
> I agree with Neil's post that
> end-to-end checksums (such as CRCs in a gzip file, or GPG integrity
> checks) are the best check when they are possible, but they are not
> always possible because they are not transparent.
The problem with end-to-end checksums at the application level is that they can only detect the error, not fix it - similar to the issues I mentioned above.
Vennlige hilsener / Best regards
roy
* Re: Checksumming RAID?
From: Bernd Schubert @ 2012-11-27 12:31 UTC (permalink / raw)
To: David Brown, Roy Sigurd Karlsbakk, Linux Raid
On 11/27/2012 12:20 PM, David Brown wrote:
> I can certainly sympathise with you, but I am not sure that data
> checksumming would help here. If your hardware raid sends out nonsense,
> then it is going to be very difficult to get anything trustworthy. The
When a single hardware unit (any kind of block device) in a raid
level > 0 decides to send wrong data, the correct data can always be
reconstructed. You only need to know which unit it is - checksums help
to figure that out.
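A toy illustration of the point, with simple XOR parity and a
deliberately weak per-chunk checksum standing in for a real one (all
names hypothetical): the checksum locates the lying device, the parity
rebuilds it.

#include <stdio.h>
#include <string.h>

#define NDATA 3
#define CHUNK 8

/* Stand-in for a real checksum (crc32c etc.) - just a byte hash here. */
static unsigned chunk_csum(const unsigned char *b, int n)
{
    unsigned s = 0;
    while (n--) s = s * 31 + *b++;
    return s;
}

int main(void)
{
    unsigned char data[NDATA][CHUNK] = { "chunk-A", "chunk-B", "chunk-C" };
    unsigned char parity[CHUNK] = { 0 };
    unsigned csum[NDATA];

    for (int d = 0; d < NDATA; d++) {
        csum[d] = chunk_csum(data[d], CHUNK);  /* stored at write time */
        for (int i = 0; i < CHUNK; i++)
            parity[i] ^= data[d][i];
    }

    data[1][3] ^= 0x08;                        /* device 1 returns bad data */

    for (int d = 0; d < NDATA; d++) {
        if (chunk_csum(data[d], CHUNK) == csum[d])
            continue;
        /* The checksum tells us *which* chunk is bad; the parity and
         * the other chunks let us reconstruct it. */
        printf("chunk %d bad, rebuilding\n", d);
        memcpy(data[d], parity, CHUNK);
        for (int o = 0; o < NDATA; o++)
            if (o != d)
                for (int i = 0; i < CHUNK; i++)
                    data[d][i] ^= data[o][i];
    }
    printf("recovered: %s\n", data[1]);
    return 0;
}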
> obvious answer here is to throw out the broken hardware raid and use a
> system that works - but it is equally obvious that that is easier said
> than done! But I would find it hard to believe that this is a common
> issue with hardware raid systems - it goes against the whole point of
> data storage.
With disks it is not that uncommon. But yes, hardware raid controllers
usually do not scramble data.
>
> There is always a chance of undetected read errors - the question is if
> the chances of such read errors, and the consequences of them, justify
> the costs of extra checking. And if they /do/ justify extra checking,
> are data checksums the right way? I agree with Neil's post that
> end-to-end checksums (such as CRCs in a gzip file, or GPG integrity
> checks) are the best check when they are possible, but they are not
> always possible because they are not transparent.
Everything below block or filesystem level is too late. Just remember,
writing less than a complete stripe implies reads in order to update the
p and q parity blocks. So even if your application could later detect
that (do your applications usually verify checksums? In HPC I don't know
of a single application that does...), the file system metadata would
already be broken.
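A one-byte sketch of why that is (the identity below is the RMW path
for the XOR parity P; Q behaves the same way with its GF(2^8)
coefficients):

#include <stdio.h>

int main(void)
{
    /* One byte stands in for a whole chunk. */
    unsigned char d0 = 0x11, d1 = 0x22, d2 = 0x33;
    unsigned char p  = d0 ^ d1 ^ d2;           /* parity on disk */

    /* RMW update of d1: read old d1 and old p, then
     * new_p = old_p ^ old_d1 ^ new_d1 - no other chunks touched. */
    unsigned char new_d1 = 0x7e;
    unsigned char new_p  = p ^ d1 ^ new_d1;

    /* If the old d1 read back wrong (silent corruption), the new
     * parity is wrong too - the corruption propagates into p. */
    printf("full recompute: %02x, rmw: %02x\n",
           (unsigned)(d0 ^ new_d1 ^ d2), (unsigned)new_p);
    return 0;
}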
Cheers,
Bernd
* Re: Checksumming RAID?
From: David Brown @ 2012-11-27 12:37 UTC (permalink / raw)
To: Roy Sigurd Karlsbakk; +Cc: Linux Raid, Bernd Schubert
On 27/11/2012 12:39, Roy Sigurd Karlsbakk wrote:
>> I can certainly sympathise with you, but I am not sure that data
>> checksumming would help here. If your hardware raid sends out
>> nonsense, then it is going to be very difficult to get anything
>> trustworthy. The obvious answer here is to throw out the broken
>> hardware raid and use a system that works - but it is equally
>> obvious that that is easier said than done! But I would find it
>> hard to believe that this is a common issue with hardware raid
>> systems - it goes against the whole point of data storage.
>>
>> There is always a chance of undetected read errors - the question
>> is if the chances of such read errors, and the consequences of
>> them, justify the costs of extra checking. And if they /do/ justify
>> extra checking, are data checksums the right way?
>
> The chance of silent corruption is rather small with your average
> 3TB home storage. On the other hand, if you had a petabyte or five,
> the chances of getting silent corruption would be very high indeed
> (see the CERN study done in 2007). In my last job, I worked with ZFS
> on ~350TiB of storage, and there we saw errors happen rather
> frequently, but since ZFS checksums data and uses the checksums to
> deal with errors, we never saw any data loss. That is, except on an
> older machine running ZFS on a hardware RAID controlled storage unit
> (a NexSAN SATABeast). We had data corruption on that one as well,
> after a disk failure, and had to resort to restoring from tape, since
> ZFS couldn't control the RAID.
Of course even a small chance-per-bit turns into a significant total
chance when you have enough bits! There is always a chance of
undetected issues - your aim is to reduce that chance until it is no
longer relevant (or until the chance is under 1 in 150 million per year
- then you should worry more about being killed by lightning).
>
>> I agree with Neil's post that end-to-end checksums (such as CRCs in
>> a gzip file, or GPG integrity checks) are the best check when they
>> are possible, but they are not always possible because they are not
>> transparent.
>
> The problem with end-to-end checksums at the application level is
> that they can only detect the error, not fix it - similar to the
> issues I mentioned above.
>
Checksumming, as suggested by the originally mentioned paper, will not
be able to correct anything either. At first glance, it might seem that
it would tell you which block was wrong, and therefore let you re-build
that block from the rest of the raid stripe. But that will not be the
case if there are issues while writing, such as unexpected power
failures - it could just as easily be the data blocks that are correctly
written while the checksum block is wrong. And exactly as discussed in
Neil's post on "smart" recovery, the principle of least surprise
suggests giving the data blocks back unchanged is the least harmful.
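A tiny sketch of that failure mode (hypothetical names, a toy
checksum): after a torn update, the checksum block is stale, and the
mismatch points at perfectly good data.

#include <stdio.h>
#include <string.h>

#define CHUNK 8

static unsigned cksum(const unsigned char *b, int n)
{
    unsigned s = 0;
    while (n--) s = s * 31 + *b++;
    return s;
}

int main(void)
{
    unsigned char data[CHUNK] = "old-dat";
    unsigned stored = cksum(data, CHUNK);      /* checksum block */

    /* Torn update: the new data chunk reaches the disk, but power
     * fails before the checksum chunk is rewritten. */
    memcpy(data, "new-dat", CHUNK);

    /* On the next read the checksum mismatches even though the data
     * chunk holds exactly what the application wrote - "correcting"
     * the data here would destroy a good block. */
    printf("%s\n", cksum(data, CHUNK) == stored ? "ok" : "mismatch");
    return 0;
}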
To do checksumming (and in particular, recovery), requires higher level
knowledge of the data. The filesystem can track when it writes a file,
and update metadata (including, if desired, a data checksum) once it
knows the file is correctly stored. But I don't think it can sensibly
be done at the block device level - the recovery procedure doesn't know
what is old data, what is new data, or which bit is important to the
filesystem.
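A rough userspace illustration of that ordering (hypothetical file
names; a real filesystem would do this through its journal and use a
proper checksum such as crc32c):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Stand-in checksum; see caveat above. */
static unsigned cksum(const unsigned char *b, size_t n)
{
    unsigned s = 0;
    while (n--) s = s * 31 + *b++;
    return s;
}

int main(void)
{
    unsigned char block[4096];
    memset(block, 'x', sizeof(block));

    int data_fd = open("data.img", O_RDWR | O_CREAT, 0644);
    int meta_fd = open("meta.img", O_RDWR | O_CREAT, 0644);
    if (data_fd < 0 || meta_fd < 0)
        return 1;

    /* 1. Write the data block and make sure it is on stable storage. */
    pwrite(data_fd, block, sizeof(block), 0);
    fsync(data_fd);

    /* 2. Only then record the checksum in the metadata; if we crash
     * before this point, the old metadata (and old checksum) still
     * describes consistent data. */
    unsigned c = cksum(block, sizeof(block));
    pwrite(meta_fd, &c, sizeof(c), 0);
    fsync(meta_fd);

    printf("stored checksum %08x\n", c);
    return 0;
}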
So I think it can make sense to use a filesystem like ZFS or BTRFS that
can do checksumming - that is a reasonable level to add the checksum.
One way to handle this at md block level would be to have an option for
raid arrays to always do a full stripe read and consistency check
whenever a block is read. If the consistency check fails (without any
errors being indicated from the drives), the array should simply return
a read error - it should /not/ attempt to recover the data (since it
can't tell which parts are the real problem). If arrays with this
option are used as first-level arrays, with a "normal" md raid array
(raid1, raid5, etc.) on top, then the normal raid recovery process will
replace the bad data and initiate a new write to correct the undetected
read error. I think this would perhaps give you the level of
reliability you are looking for, and be suitable for big arrays (indeed,
it would be unsuitable for small arrays as you need at least two levels).
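A toy model of the two-level arrangement (hypothetical names; -EIO
stands in for the read error handed up to the mirror layer):

#include <errno.h>
#include <stdio.h>
#include <string.h>

#define CHUNK 8

/* Lower level: a tiny raid5-like leg, two data chunks + XOR parity. */
struct leg {
    unsigned char d[2][CHUNK];
    unsigned char p[CHUNK];
};

/* Verify-on-read: any parity mismatch is reported as a read error,
 * with no attempt to guess which chunk is wrong. */
static int leg_read(const struct leg *l, unsigned char out[2][CHUNK])
{
    for (int i = 0; i < CHUNK; i++)
        if ((l->d[0][i] ^ l->d[1][i]) != l->p[i])
            return -EIO;
    memcpy(out, l->d, sizeof(l->d));
    return 0;
}

static void leg_write(struct leg *l, unsigned char in[2][CHUNK])
{
    memcpy(l->d, in, sizeof(l->d));
    for (int i = 0; i < CHUNK; i++)
        l->p[i] = in[0][i] ^ in[1][i];
}

int main(void)
{
    unsigned char buf[2][CHUNK] = { "chunk-0", "chunk-1" };
    struct leg mirror[2];                /* upper level: raid1 */
    leg_write(&mirror[0], buf);
    leg_write(&mirror[1], buf);

    mirror[0].d[1][0] ^= 0x01;           /* silent corruption in leg 0 */

    unsigned char out[2][CHUNK];
    if (leg_read(&mirror[0], out) == -EIO) {
        /* Upper layer: fetch from the other mirror and rewrite. */
        leg_read(&mirror[1], out);
        leg_write(&mirror[0], out);
        printf("leg 0 repaired from leg 1\n");
    }
    printf("leg 0 now %s\n",
           leg_read(&mirror[0], out) ? "bad" : "consistent");
    return 0;
}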
mvh.,
David
* Re: Checksumming RAID?
From: David Brown @ 2012-11-27 13:05 UTC (permalink / raw)
To: Bernd Schubert; +Cc: Roy Sigurd Karlsbakk, Linux Raid
On 27/11/2012 13:31, Bernd Schubert wrote:
> On 11/27/2012 12:20 PM, David Brown wrote:
>> I can certainly sympathise with you, but I am not sure that data
>> checksumming would help here. If your hardware raid sends out nonsense,
>> then it is going to be very difficult to get anything trustworthy. The
>
> When a single hardware unit (any kind of block device) in a raid
> level > 0 decides to send wrong data, the correct data can always be
> reconstructed. You only need to know which unit it is - checksums help
> to figure that out.
If checksums (as described in the paper) only "help" to figure that out,
then they are not good enough - you can only do automatic on-the-fly
correction if you are /sure/ you know which device is the problem (at
least for a very high probability of "sure"). I think that adding an
extra checksum block to the stripe only gives an indication of the
problem disk (or lower-level raid) - without being sure of the order
that data hits the different disks (or lower-level raids), I don't think
it is reliable enough. (I could be wrong in all this - I'm just waving
around ideas, and have no experience with big arrays.)
>
>> obvious answer here is to throw out the broken hardware raid and use a
>> system that works - but it is equally obvious that that is easier said
>> than done! But I would find it hard to believe that this is a common
>> issue with hardware raid systems - it goes against the whole point of
>> data storage.
>
> With disks it is not that uncommon. But yes, hardware raid controllers
> usually do not scramble data.
With disks it /is/ uncommon. /Detected/ disk errors are not a problem -
the disks's own ECC system finds it has an unrecoverable error, and
returns a read error, and the raid system replaces the data using the
rest of the stripe. It is /undetected/ disk errors that are a problem.
Typical figures I have seen are around 1 in 1e12 4KB blocks - or 1 in
3e16 bits. If you've got a 1 PB disk array, that's one error for every
four full reads - which is certainly enough to be relevant, but I
wouldn't say it is "not that uncommon".
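(Working through those figures: 1e12 blocks x 4 KiB is about 4e15
bytes, i.e. roughly 4 PB read per expected undetected error -
equivalently 1 bit in about 3e16, since a 4 KiB block is 32768 bits.
Reading a 1 PB array end-to-end four times moves about 4 PB, hence the
one expected error per four full reads.)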
>
>>
>> There is always a chance of undetected read errors - the question is if
>> the chances of such read errors, and the consequences of them, justify
>> the costs of extra checking. And if they /do/ justify extra checking,
>> are data checksums the right way? I agree with Neil's post that
>> end-to-end checksums (such as CRCs in a gzip file, or GPG integrity
>> checks) are the best check when they are possible, but they are not
>> always possible because they are not transparent.
>
> Everything below block or filesystem level is too late. Just remember,
> writing less than a complete stripe implies reads in order to update the
> p and q parity blocks. So even if your application could later detect
> that (do your applications usually verify checksums? In HPC I don't know
> of a single application that does...), the file system metadata would
> already be broken.
>
When you say "below block or filesystem level", I presume you mean such
as "application level"? I always think of that as above the filesystem,
which is above the block level. I certainly agree that it is often not
practical to verify checksums at the application level.
As I mentioned in another post, I think there are times when filesystem
checksumming can make sense. I also described another idea at block
level - I am curious as to what you think of that.
mvh.,
David
* Re: Checksumming RAID?
From: Roy Sigurd Karlsbakk @ 2012-11-27 13:09 UTC (permalink / raw)
To: David Brown; +Cc: Linux Raid, Bernd Schubert
> One way to handle this at md block level would be to have an option
> for
> raid arrays to always do a full stripe read and consistency check
> whenever a block is read. If the consistency check fails (without any
> errors being indicated from the drives), the array should simply
> return
> a read error - it should /not/ attempt to recover the data (since it
> can't tell which parts are the real problem). If arrays with this
> option are used as first-level arrays, with a "normal" md raid array
> (raid1, raid5, etc.) on top, then the normal raid recovery process
> will
> replace the bad data and initiate a new write to correct the
> undetected
> read error. I think this would perhaps give you the level of
> reliability you are looking for, and be suitable for big arrays
> (indeed,
> it would be unsuitable for small arrays as you need at least two
> levels).
If this system is running RAID-6, it should be possible to check both parity chunks during recovery, right?
Vennlige hilsener / Best regards
roy
* Re: Checksumming RAID?
From: David Brown @ 2012-11-27 13:20 UTC (permalink / raw)
To: Roy Sigurd Karlsbakk; +Cc: Linux Raid, Bernd Schubert
On 27/11/2012 14:09, Roy Sigurd Karlsbakk wrote:
>> One way to handle this at md block level would be to have an
>> option for raid arrays to always do a full stripe read and
>> consistency check whenever a block is read. If the consistency
>> check fails (without any errors being indicated from the drives),
>> the array should simply return a read error - it should /not/
>> attempt to recover the data (since it can't tell which parts are
>> the real problem). If arrays with this option are used as
>> first-level arrays, with a "normal" md raid array (raid1, raid5,
>> etc.) on top, then the normal raid recovery process will replace
>> the bad data and initiate a new write to correct the undetected
>> read error. I think this would perhaps give you the level of
>> reliability you are looking for, and be suitable for big arrays
>> (indeed, it would be unsuitable for small arrays as you need at
>> least two levels).
>
> If this system is running RAID-6, it should be possible to check
> both parity chunks during recovery, right?
Yes, of course. (And if anyone ever needs it, it is possible to extend
raid6 to 3 parity chunks. I've done the maths, but it is not
implemented - there doesn't seem to be a big need for it.) But - again
referring back to Neil's blog - if the low-level raid spots a
consistency error, it still cannot correct it reliably even with 2
parity chunks, and should pass on a read error to the higher level raid.
Using raid6 at the low level would let you do a good consistency check
even in the case of a failed drive (or a known read error on a drive) -
or two simultaneous undetected read errors. And raid6 on the higher
level raid would let you correct such errors, even when there are other
errors around. You'd soon reach the point where it is more likely for
your disks to spontaneously turn into a bowl of petunias than for read
errors to be undetected or unrecoverable.
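For the curious, here is a small userspace sketch of the two raid6
syndromes (the usual P and Q over GF(2^8); toy sizes, not the kernel's
optimised code). With both syndromes checked on every read, a single
silently-corrupted chunk shows up as a joint P and Q mismatch - though,
as above, acting on that automatically is another matter.

#include <stdio.h>

#define NDATA 4
#define CHUNK 8

/* Multiply by the RAID-6 generator in GF(2^8), polynomial 0x11d. */
static unsigned char mul2(unsigned char a)
{
    return (unsigned char)((a << 1) ^ ((a & 0x80) ? 0x1d : 0x00));
}

/* Compute the two raid6 syndromes over one stripe:
 * P = d0 ^ d1 ^ ... and Q = d0 ^ g*d1 ^ g^2*d2 ^ ... (Horner form). */
static void pq(unsigned char d[NDATA][CHUNK],
               unsigned char p[CHUNK], unsigned char q[CHUNK])
{
    for (int i = 0; i < CHUNK; i++) {
        unsigned char pv = 0, qv = 0;
        for (int disk = NDATA - 1; disk >= 0; disk--) {
            pv ^= d[disk][i];
            qv = mul2(qv) ^ d[disk][i];
        }
        p[i] = pv;
        q[i] = qv;
    }
}

int main(void)
{
    unsigned char d[NDATA][CHUNK] = { "AAAAAAA", "BBBBBBB",
                                      "CCCCCCC", "DDDDDDD" };
    unsigned char p[CHUNK], q[CHUNK], p2[CHUNK], q2[CHUNK];

    pq(d, p, q);                      /* parity as written */
    d[2][5] ^= 0x10;                  /* silent corruption */
    pq(d, p2, q2);                    /* parity as recomputed on read */

    for (int i = 0; i < CHUNK; i++)
        if (p[i] != p2[i] || q[i] != q2[i]) {
            printf("P/Q mismatch at byte %d\n", i);
            break;
        }
    return 0;
}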
* Re: Checksumming RAID?
From: Joe Landman @ 2012-11-27 13:54 UTC (permalink / raw)
To: David Brown; +Cc: Bernd Schubert, Roy Sigurd Karlsbakk, Linux Raid
On 11/27/2012 06:20 AM, David Brown wrote:
> On 27/11/2012 11:17, Bernd Schubert wrote:
[...]
>> I will send patches to better handle parity mismatches during the next
>> few weeks (for performance reasons, only for background checks).
>>
>> Cheers,
>> Bernd
Give me a heads up when they are ready, and I can get some testing in
for you.
>>
>>
>
> I can certainly sympathise with you, but I am not sure that data
> checksumming would help here. If your hardware raid sends out nonsense,
Well, unfortunately, Bernd (and DDN et al.) are right - it is helpful. It
has to be engineered correctly to be of use. T10-DIF and PI are efforts
in this direction. This can be implemented in software.
> then it is going to be very difficult to get anything trustworthy. The
> obvious answer here is to throw out the broken hardware raid and use a
> system that works - but it is equally obvious that that is easier said
... which is not so obvious when you start dealing with hundreds of TB
to PB of data, and you have hardware which works perfectly most of the
time (apart from a random cosmic ray, power fluctuation/surge, ...)
> than done! But I would find it hard to believe that this is a common
> issue with hardware raid systems - it goes against the whole point of
> data storage.
http://storagemojo.com/2007/09/19/cerns-data-corruption-research/
(and no, I don't work for Robin, he largely ignores us :( )
> There is always a chance of undetected read errors - the question is if
> the chances of such read errors, and the consequences of them, justify
> the costs of extra checking. And if they /do/ justify extra checking,
I guess the real question is: how valuable is your data? If you took
the trouble to store it, I am guessing that you'd like to know a) it's
stored correctly, b) it is retrievable, and c) what you retrieve is
correct.
Hardware (and software) RAID help with b, and sometimes a. C is what
T10-DIF/PI and related efforts are trying to solve.
> are data checksums the right way? I agree with Neil's post that
> end-to-end checksums (such as CRCs in a gzip file, or GPG integrity
> checks) are the best check when they are possible, but they are not
> always possible because they are not transparent.
I personally would like to push the checking more into the file system
layers than the disk block layers. Though I expect strong resistance
to that as well. File systems assume perfectly operating underlying
storage blocks in most cases, and breaking that model (or breaking it
any more than we are doing now) would be troubling to many.
Adding in CRC verify on read (and CRC generation and storage on write)
shouldn't be too painful at the block layer. Much of the infrastructure
is in place. I've been thinking of building a pluggable connection into
MD, so we could experiment with different mechanisms (without rebuilding
the kernel or MD each time). Though this is (unfortunately) pretty far
down our list of priorities at the moment.
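As a sketch of the shape such a plug-in could take (userspace toy,
hypothetical names - not a proposal for the actual md interfaces): CRCs
are generated on write and checked on read, and a mismatch surfaces as
a plain read error for the layer above to handle.

#include <errno.h>
#include <stdio.h>
#include <string.h>

#define BLKSZ 512
#define NBLK 4

/* Plain bitwise CRC-32 so the sketch is self-contained. */
static unsigned crc32_sw(const unsigned char *b, size_t n)
{
    unsigned c = 0xffffffffu;
    while (n--) {
        c ^= *b++;
        for (int k = 0; k < 8; k++)
            c = (c >> 1) ^ (0xedb88320u & (0u - (c & 1)));
    }
    return ~c;
}

/* A block device wrapped with per-block CRCs: generate on write,
 * verify on read. */
struct csum_dev {
    unsigned char blk[NBLK][BLKSZ];
    unsigned crc[NBLK];
};

static void dev_write(struct csum_dev *d, int i, const unsigned char *buf)
{
    memcpy(d->blk[i], buf, BLKSZ);
    d->crc[i] = crc32_sw(buf, BLKSZ);
}

static int dev_read(struct csum_dev *d, int i, unsigned char *buf)
{
    if (crc32_sw(d->blk[i], BLKSZ) != d->crc[i])
        return -EIO;                  /* let the layer above recover */
    memcpy(buf, d->blk[i], BLKSZ);
    return 0;
}

int main(void)
{
    static struct csum_dev dev;
    unsigned char buf[BLKSZ];
    memset(buf, 'j', BLKSZ);

    dev_write(&dev, 0, buf);
    dev.blk[0][100] ^= 0x04;          /* silent media corruption */
    printf("read: %d\n", dev_read(&dev, 0, buf)); /* -5 (-EIO) on Linux */
    return 0;
}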
* Re: Checksumming RAID?
From: Roy Sigurd Karlsbakk @ 2012-11-27 13:56 UTC (permalink / raw)
To: David Brown; +Cc: Linux Raid, Bernd Schubert
> > If this system is running RAID-6, it should be possible to check
> > both parity chunks during recovery, right?
>
> Yes, of course. (And if anyone ever needs it, it is possible to extend
> raid6 to 3 parity chunks. I've done the maths, but it is not
> implemented - there doesn't seem to be a big need for it.) But - again
> referring back to Neil's blog - if the low-level raid spots a
> consistency error, it still cannot correct it reliably even with 2
> parity chunks, and should pass on a read error to the higher level
> raid.
> Using raid6 at the low level would let you do a good consistency check
> even in the case of a failed drive (or a known read error on a drive)
> -
> or two simultaneous undetected read errors. And raid6 on the higher
> level raid would let you correct such errors, even when there are
> other
> errors around. You'd soon reach the point where it is more likely for
> your disks to spontaneously turn into a bowl of petunias than for read
> errors to be undetected or unrecoverable.
That would be nice. So what should be done here in the first place is to change the code to allow parity data to be read and checked on reads as well?
Vennlige hilsener / Best regards
roy
* Re: Checksumming RAID?
From: David Brown @ 2012-11-27 14:34 UTC (permalink / raw)
To: Roy Sigurd Karlsbakk; +Cc: Linux Raid, Bernd Schubert
On 27/11/2012 14:56, Roy Sigurd Karlsbakk wrote:
>>> If this system is running RAID-6, it should be possible to check
>>> both parity chunks during recovery, right?
>>
>> Yes, of course. (And if anyone ever needs it, it is possible to
>> extend raid6 to 3 parity chunks. I've done the maths, but it is
>> not implemented - there doesn't seem to be a big need for it.) But
>> - again referring back to Neil's blog - if the low-level raid spots
>> a consistency error, it still cannot correct it reliably even with
>> 2 parity chunks, and should pass on a read error to the higher
>> level raid. Using raid6 at the low level would let you do a good
>> consistency check even in the case of a failed drive (or a known
>> read error on a drive) - or two simultaneous undetected read
>> errors. And raid6 on the higher level raid would let you correct
>> such errors, even when there are other errors around. You'd soon
>> reach the point where it is more likely for your disks to
>> spontaneously turn into a bowl of petunias than for read errors to
>> be undetected or unrecoverable.
>
> That would be nice. So what should be done here in the first place
> is to change the code to allow parity data to be read and checked on
> reads as well?
Well, what should be done /first/ is to hope that some of the more
experienced md raid experts express an opinion on the idea - is it
possible, is it useful, and is it practical to implement?
The main aim would be to add an option to md arrays that will turn
each read into an implicit scrub or check of the whole stripe, and that
a consistency error there would return a read error to the next layer of
md raid.
I can see plenty of scope for complications here, such as what to do on
normal (detected) read errors, or how to ensure that the upper layer
re-writes the whole stripe and not just part of it (or perhaps partial
re-writes would be enough). I am fully aware that I'm just giving a
rough idea here - it needs a lot more thought before anyone can start
changing code. But if my theory here is correct, and if it is practical
to implement, then it might be a useful tool for big data producers.
mvh.,
David
* Re: Checksumming RAID?
From: Chris Murphy @ 2012-11-27 18:48 UTC (permalink / raw)
To: Linux Raid
On Nov 27, 2012, at 2:45 AM, David Brown <david.brown@hesbynett.no> wrote:
> My first thought when reading the paper in question is that it doesn't really add much that is actually useful. md does not need checksums - it already has a more powerful system for error detection and correction through the parity blocks. If you want more checksumming than raid5 gives you, then use raid6.
parity != checksum
Reading a data block in, computing a checksum, and comparing it to a previously written checksum for that same data block is much less expensive than the choices you have with parity: reading the data block in, computing parity, reading the previously written parity, and comparing the two; or reading the data block, using parity to reconstruct the data, and comparing the two data streams.
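(To put rough numbers on it: on an 8-data-disk stripe with 64 KiB chunks - sizes picked purely for illustration - verifying a 4 KiB read against a stored checksum costs those 4 KiB plus a few checksum bytes, while any of the parity options costs the rest of the stripe, upwards of half a megabyte: around a hundred times the I/O.)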
Further, in the case of raid5's single parity, it's ambiguous whether a bit error is attributable to the parity or to the data. Which is wrong? You don't actually know. With raid6's dual parity there is no ambiguity, if both sets of parity are used to reconstruct the data and the results are compared to each other rather than one being blindly trusted; or, more cheaply, the data can be reconstructed once from either set of parity, a checksum computed, and compared to the written checksum.
So there is still a role for checksums; the question is whether this is better or equally well managed at the file system level vs the md level. The checksumming-with-RAID implementations today are at the file system level, in ZFS and btrfs (and maybe ReFS).
Chris Murphy
* Re: Checksumming RAID?
From: Chris Murphy @ 2012-11-27 18:53 UTC (permalink / raw)
To: Linux Raid
On Nov 27, 2012, at 6:05 AM, David Brown <david.brown@hesbynett.no> wrote:
>
> It is /undetected/ disk errors that are a problem. Typical figures I have seen are around 1 in 1e12 4KB blocks - or 1 in 3e16 bits.
Is this with consumer SATA? Or other?
Chris Murphy
* Re: Checksumming RAID?
From: Roy Sigurd Karlsbakk @ 2012-11-27 19:27 UTC (permalink / raw)
To: Chris Murphy; +Cc: Linux Raid
----- Original message -----
> On Nov 27, 2012, at 6:05 AM, David Brown <david.brown@hesbynett.no>
> wrote:
> >
> > It is /undetected/ disk errors that are a problem. Typical figures I
> > have seen are around 1 in 1e12 4KB blocks - or 1 in 3e16 bits.
>
> Is this with consumer SATA? Or other?
According to a few studies I've read, the number of silent errors is identical on both consumer SATA drives and SAS drives. It's the density that causes the errors, and the platters and disk heads are produced alike.
Vennlige hilsener / Best regards
roy
* Re: Checksumming RAID?
From: Chris Murphy @ 2012-11-27 19:36 UTC (permalink / raw)
To: Linux Raid
On Nov 27, 2012, at 11:48 AM, Chris Murphy <lists@colorremedies.com> wrote:
> Further, in the case of raid5's single parity, it's ambiguous whether a bit error is attributable to the parity or to the data. Which is wrong? You don't actually know. With raid6's dual parity there is no ambiguity, if both sets of parity are used to reconstruct the data and the results are compared to each other rather than one being blindly trusted; or, more cheaply, the data can be reconstructed once from either set of parity, a checksum computed, and compared to the written checksum.
Also, a parity-only solution is in effect a RAID 6+ (dual+ parity) solution; RAID5 and lower are left out. And since RAID 10 and RAID 1+linear are scalable while parity solutions aren't, I think a solution that works for them from the outset needs consideration.
Chris Murphy
* Re: Checksumming RAID?
From: Chris Murphy @ 2012-11-27 19:50 UTC (permalink / raw)
To: Linux Raid
On Nov 27, 2012, at 12:27 PM, Roy Sigurd Karlsbakk <roy@karlsbakk.net> wrote:
> ----- Opprinnelig melding -----
>> On Nov 27, 2012, at 6:05 AM, David Brown <david.brown@hesbynett.no>
>> wrote:
>>>
>>> It is /undetected/ disk errors that are a problem. Typical figures I
>>> have seen are around 1 in 1e12 4KB blocks - or 1 in 3e16 bits.
>>
>> Is this with consumer SATA? Or other?
>
> According to a few studies I've read, the number of silent errors is identical on both consumer SATA drives and SAS drives. It's the density that causes the errors, and the platters and disk heads are produced alike.
That's not what I'm reading:
Corruption detected in 8.5% of nearline disks, and 1.9% in enterprise disks.
http://www.pdsi-scidac.org/events/PDSW09/resources/pdsw09_slides13.pdf
In this study it's 0.86% of nearline, and 0.065% of enterprise disks developing checksum mismatches, out of 1.53 million disks.
http://static.usenix.org/event/fast08/tech/full_papers/bairavasundaram/bairavasundaram.pdf
The ratios are the same.
Chris Murphy
* Re: Checksumming RAID?
From: Stan Hoeppner @ 2012-11-27 20:49 UTC (permalink / raw)
To: David Brown; +Cc: Roy Sigurd Karlsbakk, Linux Raid, Bernd Schubert
On 11/27/2012 6:37 AM, David Brown wrote:
> To do checksumming (and in particular, recovery), requires higher level
> knowledge of the data. The filesystem can track when it writes a file,
> and update metadata (including, if desired, a data checksum) once it
> knows the file is correctly stored. But I don't think it can sensibly
> be done at the block device level - the recovery procedure doesn't know
> what is old data, what is new data, or which bit is important to the
> filesystem.
>
> So I think it can make sense to use a filesystem like ZFS or BTRFS that
> can do checksumming - that is a reasonable level to add the checksum.
You'll see CRC in XFS in the future as well. Some of the foundation is
already laid to allow it, but IIRC it requires an on-disk format change
for full implementation. On-disk format changes are a big deal and are
taken with great care. IIRC XFS has only seen one or two in 18 years.
--
Stan
* Re: Checksumming RAID?
From: Roy Sigurd Karlsbakk @ 2012-11-28 10:56 UTC (permalink / raw)
To: Chris Murphy; +Cc: Linux Raid
> > According to a few studies I've read, the number of silent errors
> > is identical on both consumer SATA drives and SAS drives. It's the
> > density that causes the errors, and the platters and disk heads are
> > produced alike.
>
> That's not what I'm reading:
>
> Corruption detected in 8.5% of nearline disks, and 1.9% in enterprise
> disks.
> http://www.pdsi-scidac.org/events/PDSW09/resources/pdsw09_slides13.pdf
That doesn't make sense. Nearline and enterprise drives are large 7k2 drives that come in consumer and enterprise models, such as Hitachi Deskstar and Ultrastar. The latter is said to have better bearings etc., although I don't know the difference for sure. With WD's large drives, most of the difference is said to be in the firmware (such as TLER in the enterprise/RAID drives). Comparing "nearline" and "enterprise" as done here is merely a comparison of small and large drives, where it's well known that the smaller (less dense) drives have fewer errors.
Vennlige hilsener / Best regards
roy
* Re: Checksumming RAID?
From: Roy Sigurd Karlsbakk @ 2012-11-28 10:58 UTC (permalink / raw)
To: stan; +Cc: Linux Raid, Bernd Schubert, David Brown
> > To do checksumming (and in particular, recovery), requires higher
> > level
> > knowledge of the data. The filesystem can track when it writes a
> > file,
> > and update metadata (including, if desired, a data checksum) once it
> > knows the file is correctly stored. But I don't think it can
> > sensibly
> > be done at the block device level - the recovery procedure doesn't
> > know
> > what is old data, what is new data, or which bit is important to the
> > filesystem.
> >
> > So I think it can make sense to use a filesystem like ZFS or BTRFS
> > that
> > can do checksumming - that is a reasonable level to add the
> > checksum.
>
> You'll see CRC in XFS in the future as well. Some of the foundation is
> already laid to allow it, but IIRC it requires an on-disk format change
> for full implementation. On-disk format changes are a big deal and are
> taken with great care. IIRC XFS has only seen one or two in 18 years.
I'm afraid that won't help much either, since it'll only allow for *detecting* the errors, not *fixing* them (as ZFS and Btrfs can). Or perhaps it could, if XFS can integrate with MD somehow?
Vennlige hilsener / Best regards
roy
* Re: Checksumming RAID?
From: Roy Sigurd Karlsbakk @ 2012-11-28 10:59 UTC (permalink / raw)
To: Chris Murphy; +Cc: Linux Raid
> That doesn't make sense. Nearline and enterprise drives are large 7k2
> drives that come in consumer and enterprise models, such as Hitachi
> Deskstar and Ultrastar.
Small typo here. Change to "Nearline drives are large 7k2 drives that come in consumer and enterprise models…"
Vennlige hilsener / Best regards
roy
* Re: Checksumming RAID?
From: Drew @ 2012-11-28 13:25 UTC (permalink / raw)
To: Roy Sigurd Karlsbakk; +Cc: Chris Murphy, Linux Raid
> That doesn't make sense. Nearline and enterprise drives are large 7k2 drives that come in consumer and enterprise models, such as Hitachi Deskstar and Ultrastar. The latter is said to have better bearings etc., although I don't know the difference for sure. With WD's large drives, most of the difference is said to be in the firmware (such as TLER in the enterprise/RAID drives). Comparing "nearline" and "enterprise" as done here is merely a comparison of small and large drives, where it's well known that the smaller (less dense) drives have fewer errors.
>
Less dense, but the drives are also probably the 10k & 15k RPM drives,
which have to be built to better tolerances given the expectation that
they'll be spinning for most of their five-year life.
--
Drew
"Nothing in life is to be feared. It is only to be understood."
--Marie Curie
"This started out as a hobby and spun horribly out of control."
-Unknown
* Re: Checksumming RAID?
From: Roy Sigurd Karlsbakk @ 2012-11-28 17:51 UTC (permalink / raw)
To: Drew; +Cc: Chris Murphy, Linux Raid
> > That doesn't make sense. Nearline and enterprise drives are large
> > 7k2 drives that come in consumer and enterprise models, such as
> > Hitachi Deskstar and Ultrastar. The latter is said to have better
> > bearings etc., although I don't know the difference for sure. With
> > WD's large drives, most of the difference is said to be in the
> > firmware (such as TLER in the enterprise/RAID drives). Comparing
> > "nearline" and "enterprise" as done here is merely a comparison
> > of small and large drives, where it's well known that the smaller
> > (less dense) drives have fewer errors.
> >
>
> Less dense, but the drives are also probably the 10k & 15k RPM drives,
> which have to be built to better tolerances given the expectation that
> they'll be spinning for most of their five-year life.
So what you're saying is that the "enterprise" drives that spin at 7k2 are just overpriced desktop drives? That really doesn't make sense. In my vocabulary, "nearline" means "slower", not worse. It'd be really interesting to get some opinions from someone who actually knows this, or perhaps some real-world statistics. Google had a study some years ago, which at that point showed desktop drives and enterprise drives had about the same error rate. That may be slightly outdated now, but still…
Vennlige hilsener / Best regards
roy
* Re: Checksumming RAID?
From: Chris Murphy @ 2012-11-28 19:08 UTC (permalink / raw)
To: Linux Raid
On Nov 28, 2012, at 3:56 AM, Roy Sigurd Karlsbakk <roy@karlsbakk.net> wrote:
> That doesn't make sense. Nearline and enterprise drives are large 7k2 drives that come in consumer and enterprise models,
Nearline drives are in between consumer and enterprise.
> Comparing "nearline" and "enterprise" like done here, is merely a comparison of small and large drives, where it's well known that the smaller (less dense) drives have less errors.
The paper I cited doesn't support this. The nearline SATA drives used hardware adapters to convert the interface to Fibre Channel; the enterprise drives were already Fibre Channel. The paper discusses the contribution of the adapter to checksum mismatches, but this alone doesn't account for all of the higher error rate in nearline SATA.
Further, the paper says "There is no clear indication that disk size affects the probability of developing checksum mismatches."
At least when it comes to UREs (unrecoverable read errors), which are not SDC (silent data corruption), we note that the manufacturer specs are the same for a model regardless of disk size.
Chris Murphy
* Re: Checksumming RAID?
2012-11-28 17:51 ` Roy Sigurd Karlsbakk
@ 2012-11-28 19:16 ` Chris Murphy
0 siblings, 0 replies; 32+ messages in thread
From: Chris Murphy @ 2012-11-28 19:16 UTC (permalink / raw)
To: Linux Raid
On Nov 28, 2012, at 10:51 AM, Roy Sigurd Karlsbakk <roy@karlsbakk.net> wrote:
>
> Google had a study some years ago, which at that point showed desktop drives and enterprise drives had about the same error rate. That may be slightly outdated now, but still…
The only Google drive study I'm aware of is this one:
http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/disk_failures.pdf
That study is about drive failures, not SDC. They weren't looking at corruption, but rather at SMART errors relating to disk failure. And the disks were consumer grade; none were enterprise.
The main finding of the study is the correlation (or lack thereof) of SMART's prediction of drive failure (the health status of the drive) with reality. It turns out SMART is not so great at this if you only trust the health status. You can get better prediction of drive failure by taking specific attributes into account, but even that isn't completely reliable. The takeaway is that new attributes are needed to better predict drive failures.
Chris Murphy
* Re: Checksumming RAID?
2012-11-28 19:08 ` Chris Murphy
@ 2012-11-28 19:18 ` Roy Sigurd Karlsbakk
2012-11-28 20:02 ` Chris Murphy
0 siblings, 1 reply; 32+ messages in thread
From: Roy Sigurd Karlsbakk @ 2012-11-28 19:18 UTC (permalink / raw)
To: Chris Murphy; +Cc: Linux Raid
> > That doesn't make sense. Nearline and enterprise drives are large
> > 7k2 drives that come in consumer and enterprise models,
>
> Nearline drives are in between consumer and enterprise.
Well, they are named enterprise…
> > Comparing "nearline" and "enterprise" as is done here is merely a
> > comparison of small and large drives, and it's well known that the
> > smaller (less dense) drives have fewer errors.
>
> The paper I cited doesn't support this. The nearline SATA drives used
> hardware adapters to convert the interface to Fibre Channel; the
> enterprise drives were already Fibre Channel. The paper discusses the
> contribution of the adapter to checksum mismatches, but this alone
> doesn't account for all of the higher error rate in nearline SATA.
7k2 "nearline" SAS drives exist from several vendors, with SAS interfaces without the need for adapters. Also, the interface, wheather SAS or SATA or FC, doesn't really mean much when the drive is the same. SAS (and FC?) has better timing and better controllers, but a SAS drive is a SAS drive nonetheless if it has a SAS interface.
> Further, the paper says "There is no clear indication that disk size
> affects the probability of developing checksum mismatches."
>
> At least when it comes to UREs (unrecoverable read errors), which are
> not SDC (silent data corruption), we note that the manufacturer specs
> are the same for a model regardless of disk size.
Then, if this is true, perhaps they should call their 4TB drives something other than "enterprise", which is what they call them today. Do you have more references on this part?
Vennlige hilsener / Best regards
roy
--
Roy Sigurd Karlsbakk
(+47) 98013356
roy@karlsbakk.net
http://blogg.karlsbakk.net/
GPG Public key: http://karlsbakk.net/roysigurdkarlsbakk.pubkey.txt
--
In all pedagogy it is essential that the curriculum be presented intelligibly. It is an elementary imperative for all pedagogues to avoid excessive use of idioms of xenotype etymology. In most cases, adequate and relevant synonyms exist in Norwegian.
* Re: Checksumming RAID?
2012-11-28 19:18 ` Roy Sigurd Karlsbakk
@ 2012-11-28 20:02 ` Chris Murphy
0 siblings, 0 replies; 32+ messages in thread
From: Chris Murphy @ 2012-11-28 20:02 UTC (permalink / raw)
To: Linux Raid
On Nov 28, 2012, at 12:18 PM, Roy Sigurd Karlsbakk <roy@karlsbakk.net> wrote:
>>> That doesn't make sense. Nearline and enterprise drives are large
>>> 7k2 drives that come in consumer and enterprise models,
>>
>> Nearline drives are in between consumer and enterprise.
>
> Well, they are named enterprise…
Yep, it's misleading. SNIA has a slide show describing the differences between consumer, nearline and enterprise.
> 7k2 "nearline" SAS drives exist from several vendors, with SAS interfaces without the need for adapters. Also, the interface, wheather SAS or SATA or FC, doesn't really mean much when the drive is the same. SAS (and FC?) has better timing and better controllers, but a SAS drive is a SAS drive nonetheless if it has a SAS interface.
Nearline SAS is effectively a consumer SATA drive (you don't get the better mechanics), with a SAS interface.
> Then, if this is true, perhaps they should call their 4TB drives something other than "enterprise", which is what they call them today. Do you have more references on this part?
Basically you have to look at the specs. Maybe a reasonable rule of thumb is the URE rate, but even that gets perverted sometimes: <1 error in 1E14 bits read is consumer, <1 in 1E15 is nearline, and <1 in 1E16 is enterprise.
For example:
http://www.wdc.com/wdproducts/library/SpecSheet/ENG/2879-771386.pdf
<10 in 10^16 = <1 in 10^15, which means this is nearline. It's clearly the same as the RE SATA in every way (all mechanical specs) except that it has a SAS interface.
Whereas this:
http://www.wdc.com/wdproducts/library/SpecSheet/ENG/2879-771429.pdf
<10 in 10^17 = <1 in 10^16 is enterprise.
Notice the size and speed difference, which is where "nearline" is being used: more capacity and slower than enterprise is what makes it nearline, more so than the URE rate. But the URE rate makes it easier to distinguish between consumer and nearline.
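To put those numbers in perspective, here's a quick back-of-the-envelope sketch in Python. It treats the spec-sheet URE rate as a uniform per-bit error rate (a simplification) and uses a hypothetical 4TB drive:

import math

def p_ure(capacity_bytes, ure_spec_bits):
    # P(at least one URE) for one end-to-end read of the drive;
    # Poisson approximation of the spec's "<1 error per N bits read".
    expected = capacity_bytes * 8.0 / ure_spec_bits
    return 1.0 - math.exp(-expected)

for label, spec in [("consumer", 1e14), ("nearline", 1e15), ("enterprise", 1e16)]:
    print("%-10s <1 in %.0e bits -> %5.1f%%" % (label, spec, 100 * p_ure(4e12, spec)))

# consumer   <1 in 1e+14 bits ->  27.4%
# nearline   <1 in 1e+15 bits ->   3.1%
# enterprise <1 in 1e+16 bits ->   0.3%

Roughly, that's the chance a full-drive read (think rebuild) trips over an unreadable sector in each class, which is why the URE class matters more than it looks on the spec sheet.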
Chris Murphy
* Re: Checksumming RAID?
2012-11-26 13:27 Checksumming RAID? Roy Sigurd Karlsbakk
2012-11-27 9:45 ` David Brown
@ 2012-12-03 12:24 ` Pasi Kärkkäinen
2012-12-03 14:09 ` Checksumming RAID? / SCSI SAS T10 PI and DIF/DIX / T13 SATA EPP Pasi Kärkkäinen
1 sibling, 1 reply; 32+ messages in thread
From: Pasi Kärkkäinen @ 2012-12-03 12:24 UTC (permalink / raw)
To: Roy Sigurd Karlsbakk; +Cc: Linux Raid
On Mon, Nov 26, 2012 at 02:27:30PM +0100, Roy Sigurd Karlsbakk wrote:
> Hi all
>
> I see from an article at http://pages.cs.wisc.edu/~bpkroth/cs736/md-checksums/md-checksums-paper.pdf that an implementation has been made to allow for ZFS-like checksumming inside Linux MD. However, this code doesn't seem to exist in any kernel trees. Does anyone know the current status for data checksumming in MD?
>
Afaik Linux md-raid raid0 and raid1 already support T10 PI (Protection Information) and DIF (Data Integrity Fields) / DIX (Data Integrity Extensions).
I wonder if it'd be possible to utilize this mdadm code with normal disks that don't have T10 PI support? I.e., implement custom checksums in the "backend" with normal non-PI disks.
T10 PI and DIF/DIX are SCSI/SAS features, so they're mostly available in new(ish) enterprise SAS disks and not available in SATA, but I think there are plans to implement something similar in SATA as well in the future.
A couple of SAS HBAs (mpt2sas, for example) and some FC HBAs support DIF/DIX in Linux.
The purpose of DIF/DIX is to allow passing and verifying end-to-end checksums all the way from the applications to the disks; in the more common case that'd be checksums all the way from the filesystem to the disks.
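For what it's worth, the protection information itself is small: 8 extra bytes per 512-byte sector. A rough Python sketch of the tuple layout follows (guard tag = CRC16 of the sector data with the T10-DIF polynomial 0x8BB7, an application tag, and for Type 1 a reference tag holding the low 32 bits of the LBA; this is my best reading of the spec, so treat the details as an assumption rather than authoritative):

import struct

def crc16_t10dif(data):
    # Bitwise CRC16 with the T10-DIF polynomial 0x8BB7,
    # initial value 0, no bit reflection, no final XOR.
    crc = 0
    for byte in bytearray(data):
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x8BB7) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def pi_tuple(sector, lba, app_tag=0):
    # The 8-byte PI tuple appended to each 512-byte sector:
    # 2-byte guard tag, 2-byte application tag, 4-byte reference tag.
    assert len(sector) == 512
    return struct.pack(">HHI", crc16_t10dif(sector), app_tag, lba & 0xFFFFFFFF)

DIX, as I understand it, is what lets the HBA generate/verify these tuples on the host's behalf instead of the application doing it; the links below have the details.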
Some more links about T10 PI and DIF/DIX:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=blob_plain;f=Documentation/block/data-integrity.txt;hb=HEAD
https://oss.oracle.com/~mkp/docs/dix.pdf
https://oss.oracle.com/~mkp/docs/osd2008-data-integrity.pdf
https://oss.oracle.com/~mkp/docs/ols2008-petersen.pdf
https://oss.oracle.com/~mkp/docs/ppdc.pdf
-- Pasi
* Re: Checksumming RAID? / SCSI SAS T10 PI and DIF/DIX / T13 SATA EPP
2012-12-03 12:24 ` Pasi Kärkkäinen
@ 2012-12-03 14:09 ` Pasi Kärkkäinen
2012-12-05 19:05 ` Martin K. Petersen
0 siblings, 1 reply; 32+ messages in thread
From: Pasi Kärkkäinen @ 2012-12-03 14:09 UTC (permalink / raw)
To: Roy Sigurd Karlsbakk; +Cc: Linux Raid, Martin K. Petersen
On Mon, Dec 03, 2012 at 02:24:45PM +0200, Pasi Kärkkäinen wrote:
> On Mon, Nov 26, 2012 at 02:27:30PM +0100, Roy Sigurd Karlsbakk wrote:
> > Hi all
> >
> > I see from an article at http://pages.cs.wisc.edu/~bpkroth/cs736/md-checksums/md-checksums-paper.pdf that an implementation has been made to allow for ZFS-like checksumming inside Linux MD. However, this code doesn't seem to exist in any kernel trees. Does anyone know the current status for data checksumming in MD?
> >
>
> Afaik Linux md-raid raid0 and raid1 already support T10 PI (Protection Information) and DIF (Data Integrity Fields) / DIX (Data Integrity Extensions).
>
> I wonder if it'd be possible to utilize this mdadm code with normal disks that don't have T10 PI support? I.e., implement custom checksums in the "backend" with normal non-PI disks.
>
> T10 PI and DIF/DIX are SCSI/SAS features, so they're mostly available in new(ish) enterprise SAS disks and not available in SATA, but I think there are plans to implement something similar in SATA as well in the future.
>
Now I remember: it's the SATA T13 "External Path Protection" (EPP).
(Added a CC to Martin in case he has some thoughts about generic Linux checksumming RAID without T10 PI disks.)
-- Pasi
> A couple of SAS HBAs (mpt2sas, for example) and some FC HBAs support DIF/DIX in Linux.
>
> The purpose of DIF/DIX is to allow passing and verifying end-to-end checksums all the way from the applications to the disks; in the more common case that'd be checksums all the way from the filesystem to the disks.
>
> Some more links about T10 PI and DIF/DIX:
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=blob_plain;f=Documentation/block/data-integrity.txt;hb=HEAD
> https://oss.oracle.com/~mkp/docs/dix.pdf
> https://oss.oracle.com/~mkp/docs/osd2008-data-integrity.pdf
> https://oss.oracle.com/~mkp/docs/ols2008-petersen.pdf
> https://oss.oracle.com/~mkp/docs/ppdc.pdf
>
>
> -- Pasi
>
* Re: Checksumming RAID? / SCSI SAS T10 PI and DIF/DIX / T13 SATA EPP
2012-12-03 14:09 ` Checksumming RAID? / SCSI SAS T10 PI and DIF/DIX / T13 SATA EPP Pasi Kärkkäinen
@ 2012-12-05 19:05 ` Martin K. Petersen
2012-12-06 11:10 ` John Robinson
0 siblings, 1 reply; 32+ messages in thread
From: Martin K. Petersen @ 2012-12-05 19:05 UTC (permalink / raw)
To: Pasi Kärkkäinen
Cc: Roy Sigurd Karlsbakk, Linux Raid, Martin K. Petersen
>>>>> "Pasi" == Pasi Kärkkäinen <pasik@iki.fi> writes:
Pasi> Now I remember: it's the SATA T13 "External Path Protection"
Pasi> (EPP).
EPP is dead. So is SATA. Not going to happen.
Pasi> (Added a CC to Martin in case he has some thoughts about generic
Pasi> Linux checksumming RAID without T10 PI disks.)
There have been a few attempts at a checksumming DM target. However, I
think btrfs is a much better solution for this stuff.
--
Martin K. Petersen Oracle Linux Engineering
* Re: Checksumming RAID? / SCSI SAS T10 PI and DIF/DIX / T13 SATA EPP
2012-12-05 19:05 ` Martin K. Petersen
@ 2012-12-06 11:10 ` John Robinson
0 siblings, 0 replies; 32+ messages in thread
From: John Robinson @ 2012-12-06 11:10 UTC (permalink / raw)
To: Martin K. Petersen
Cc: Pasi Kärkkäinen, Roy Sigurd Karlsbakk, Linux Raid
On 05/12/2012 19:05, Martin K. Petersen wrote:
>>>>>> "Pasi" == Pasi Kärkkäinen <pasik@iki.fi> writes:
> Pasi> (Added a CC to Martin in case he has some thoughts about generic
> Pasi> Linux checksumming RAID without T10 PI disks.)
>
> There have been a few attempts at a checksumming DM target. However, I
> think btrfs is a much better solution for this stuff.
I think there's room for both. Checksumming at the block level, below md
RAID (so presumably in a DM target), could help avoid silent data
corruption in such a way that the md layer could reconstruct valid data.
Checksumming at the filesystem level doesn't give you reconstruction,
unless you add redundancy/RAID functions into the filesystem, which is a
whole other discussion.
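To make the point concrete, here's a toy sketch in Python (purely illustrative; nothing like the real md/DM code, and the block size, truncated hash and in-memory "devices" are all made up for the example) of why checksums kept below the redundancy layer turn detection into reconstruction:

import hashlib

BLOCK = 4096

def tag(block):
    # Truncated SHA-256 standing in for whatever per-block checksum
    # a real target would store out-of-band.
    return hashlib.sha256(block).digest()[:8]

class CheckedMirror:
    # Toy two-way mirror with a per-block checksum table below it.
    def __init__(self, nblocks):
        self.devs = [bytearray(nblocks * BLOCK) for _ in range(2)]
        self.tags = [None] * nblocks

    def write(self, i, data):
        assert len(data) == BLOCK
        for dev in self.devs:
            dev[i * BLOCK:(i + 1) * BLOCK] = data
        self.tags[i] = tag(data)

    def read(self, i):
        # Try each copy until one matches its checksum; a plain mirror
        # would have no way to know which copy to believe.
        for n, dev in enumerate(self.devs):
            data = bytes(dev[i * BLOCK:(i + 1) * BLOCK])
            if tag(data) == self.tags[i]:
                if n > 0:  # copy 0 was silently corrupt: repair it
                    self.devs[0][i * BLOCK:(i + 1) * BLOCK] = data
                return data
        raise IOError("block %d: all copies fail their checksum" % i)

Flip a byte in devs[0] after a write and read() still returns good data from the second copy and rewrites the first. A checksum kept only in the filesystem, above the mirror, could report the mismatch but couldn't tell the mirror which copy to trust.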
Cheers,
John.
Thread overview: 32+ messages
2012-11-26 13:27 Checksumming RAID? Roy Sigurd Karlsbakk
2012-11-27 9:45 ` David Brown
2012-11-27 10:17 ` Bernd Schubert
2012-11-27 11:20 ` David Brown
2012-11-27 11:39 ` Roy Sigurd Karlsbakk
2012-11-27 12:37 ` David Brown
2012-11-27 13:09 ` Roy Sigurd Karlsbakk
2012-11-27 13:20 ` David Brown
2012-11-27 13:56 ` Roy Sigurd Karlsbakk
2012-11-27 14:34 ` David Brown
2012-11-27 20:49 ` Stan Hoeppner
2012-11-28 10:58 ` Roy Sigurd Karlsbakk
2012-11-27 12:31 ` Bernd Schubert
2012-11-27 13:05 ` David Brown
2012-11-27 18:53 ` Chris Murphy
2012-11-27 19:27 ` Roy Sigurd Karlsbakk
2012-11-27 19:50 ` Chris Murphy
2012-11-28 10:56 ` Roy Sigurd Karlsbakk
2012-11-28 10:59 ` Roy Sigurd Karlsbakk
2012-11-28 13:25 ` Drew
2012-11-28 17:51 ` Roy Sigurd Karlsbakk
2012-11-28 19:16 ` Chris Murphy
2012-11-28 19:08 ` Chris Murphy
2012-11-28 19:18 ` Roy Sigurd Karlsbakk
2012-11-28 20:02 ` Chris Murphy
2012-11-27 13:54 ` Joe Landman
2012-11-27 18:48 ` Chris Murphy
2012-11-27 19:36 ` Chris Murphy
2012-12-03 12:24 ` Pasi Kärkkäinen
2012-12-03 14:09 ` Checksumming RAID? / SCSI SAS T10 PI and DIF/DIX / T13 SATA EPP Pasi Kärkkäinen
2012-12-05 19:05 ` Martin K. Petersen
2012-12-06 11:10 ` John Robinson