* Questions about bitrot and RAID 5/6
@ 2014-01-20 20:34 Mason Loring Bliss
2014-01-20 21:46 ` NeilBrown
0 siblings, 1 reply; 32+ messages in thread
From: Mason Loring Bliss @ 2014-01-20 20:34 UTC (permalink / raw)
To: linux-raid
I was initially writing to HPA, and he noted the existence of this list, so
I'm going to boil down what I've got so far for the list. In short, I'm
trying to understand if there's a reasonable way to get something equivalent
to ZFS/BTRFS on-a-mirror-with-scrubbing if I'm using MD RAID 6.
I recently read (or attempted to read, for those sections that exceeded my
background in math) HPA's paper "The mathematics of RAID-6", and I was
particularly interested in section four, "Single-disk corruption recovery".
What I'm wondering is whether he's describing something theoretically possible given
the redundant data RAID 6 stores, or something that's actually been
implemented in (specifically) MD RAID 6 on Linux.
The world is in a rush to adopt ZFS and BTRFS, but there are dinosaurs among
us that would love to maintain proper layering with the RAID layer being able
to correct for bitrot itself. A common scenario that would benefit from this
is having an encrypted layer sitting atop RAID, with LVM atop that.
I just looked through the code for the first time today, and I'd love to know
if my understanding is correct. My current read of the code is as follows:
linux-source/lib/raid6/recov.c suggests that for a single-disk failure,
recovery is handled by the RAID 5 code. In raid5.c, if I'm reading it
correctly, raid5_end_read_request will request a rewrite attempt if uptodate
is not true, which can call md_error, which can initiate recovery.
I'm struggling a little to trace recovery, but it does seem like MD maintains
a list of bad blocks and can map out bad sectors rather than marking a whole
drive as being dead.
Am I correct in assuming that bitrot will show up as a bad read, making the
read fail and triggering a rewrite attempt, which will mark the sector in
question as bad and write the data somewhere else? If
this is the case then there's a very viable, already deployed option for
catching bitrot that doesn't require complete upheaval of how people manage
disk space and volumes nowadays.
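
(As an aside, this detection path can be exercised on demand through md's scrub
interface in sysfs. Below is a minimal Python sketch; the array name /dev/md0 is
an assumption, and it needs root. Writing "check" asks md to read every stripe
and verify parity without rewriting anything, and mismatch_cnt afterwards
reports how many sectors the pass found inconsistent; see md(4) for the exact
semantics.)

    import time
    from pathlib import Path

    MD = Path("/sys/block/md0/md")   # assumption: the array in question is md0

    def scrub_and_report():
        # "check" makes md read every stripe and verify parity without
        # rewriting anything; requires root.
        (MD / "sync_action").write_text("check\n")

        # sync_action reads back "idle" once the pass has finished.
        while (MD / "sync_action").read_text().strip() != "idle":
            time.sleep(10)

        # mismatch_cnt counts sectors the pass found inconsistent.
        print("mismatch_cnt:", (MD / "mismatch_cnt").read_text().strip())

    if __name__ == "__main__":
        scrub_and_report()

(Writing "repair" instead of "check" additionally rewrites whatever md computes
to be inconsistent.)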
On a related note, raid6check was mentioned to me. I don't see it available
on Debian or RHEL stable, but I found a man page:
https://github.com/neilbrown/mdadm/blob/master/raid6check.8
The man page says, "No write operations are performed on the array or the
components," but my reading of the code makes it seem like a read error will
trigger a write implicitly. Am I misunderstanding this? Overall, am I barking
up the wrong tree in thinking that RAID 6 might let me preserve proper
layering while giving me the data integrity safeguards I'd otherwise get from
ZFS or BTRFS?
Thanks in advance for clarifications and pointers!
--
Mason Loring Bliss mason@blisses.org Ewige Blumenkraft!
(if awake 'sleep (aref #(sleep dream) (random 2))) -- Hamlet, Act III, Scene I
^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: Questions about bitrot and RAID 5/6
  2014-01-20 20:34 Questions about bitrot and RAID 5/6 Mason Loring Bliss
@ 2014-01-20 21:46 ` NeilBrown
  2014-01-20 22:55   ` Peter Grandi
  ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: NeilBrown @ 2014-01-20 21:46 UTC (permalink / raw)
To: Mason Loring Bliss; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 5866 bytes --]

On Mon, 20 Jan 2014 15:34:33 -0500 Mason Loring Bliss <mason@blisses.org>
wrote:

> I was initially writing to HPA, and he noted the existence of this list, so
> I'm going to boil down what I've got so far for the list. In short, I'm
> trying to understand if there's a reasonable way to get something equivalent
> to ZFS/BTRFS on-a-mirror-with-scrubbing if I'm using MD RAID 6.
>
> I recently read (or attempted to read, for those sections that exceeded my
> background in math) HPA's paper "The mathematics of RAID-6", and I was
> particularly interested in section four, "Single-disk corruption recovery".
> What I'm wondering is whether he's describing something theoretically possible
> given the redundant data RAID 6 stores, or something that's actually been
> implemented in (specifically) MD RAID 6 on Linux.
>
> The world is in a rush to adopt ZFS and BTRFS, but there are dinosaurs among
> us that would love to maintain proper layering with the RAID layer being able
> to correct for bitrot itself. A common scenario that would benefit from this
> is having an encrypted layer sitting atop RAID, with LVM atop that.
>
> I just looked through the code for the first time today, and I'd love to know
> if my understanding is correct. My current read of the code is as follows:
>
> linux-source/lib/raid6/recov.c suggests that for a single-disk failure,
> recovery is handled by the RAID 5 code. In raid5.c, if I'm reading it
> correctly, raid5_end_read_request will request a rewrite attempt if uptodate
> is not true, which can call md_error, which can initiate recovery.
>
> I'm struggling a little to trace recovery, but it does seem like MD maintains
> a list of bad blocks and can map out bad sectors rather than marking a whole
> drive as being dead.
>
> Am I correct in assuming that bitrot will show up as a bad read, making the
> read fail and triggering a rewrite attempt, which will mark the sector in
> question as bad and write the data somewhere else? If this is the case then
> there's a very viable, already deployed option for catching bitrot that
> doesn't require complete upheaval of how people manage disk space and
> volumes nowadays.

ars technica recently had an article about "Bitrot and atomic COWs: Inside
"next-gen" filesystems."

http://feeds.arstechnica.com/~r/arstechnica/everything/~3/Cb4ylzECYVQ/

Early on it talks about creating a btrfs filesystem with RAID1 configured and
then binary-editing one of the devices to flip one bit. Then magically btrfs
survives while some other filesystem suffers data corruption.
That is where I stopped reading, because that is *not* how bitrot happens.

Drives have sophisticated error checking and correcting codes. If a bit on
the media changes, the device will either fix it transparently or report an
error - just like you suggest. It is extremely unlikely to return bad data
as though it were good data. And the codes that btrfs uses have roughly the
same probability of reporting bad data as good - infinitesimal but not zero.

i.e. that clever stuff done by btrfs is already done by the drive!

To be fair to btrfs, there are other possible sources of corruption than just
media defects. On the path from the CCD which captures the photo of the cat,
to the LCD which displays the image, there are lots of memory buffers and
busses which carry the data. Any one of those could theoretically flip one
or more bits. Each of them *should* have appropriate error detecting and
correcting codes. Apparently not all of them do.

So the magic in btrfs doesn't really protect against media errors (though if
your drive is buggy it could help there) but against errors in some (but not
all) other buffers or paths.

i.e. it sounds like a really cool idea but I find it very hard to evaluate
how useful it really is and whether it is worth the cost. My gut feeling is
that for data it probably isn't. For metadata it might be.

So to answer your question: yes - raid6 on reasonable-quality drives already
protects you against media errors. There are however theoretically possible
sources of corruption that md/raid6 does not protect you against. btrfs
might protect you against some of those. Nothing can protect you against all
of them.

As is true for any form of security (and here we are talking about data
security), you can only evaluate how safe you are against some specific threat
model. Without a clear threat model it is all just hand waving.

I had a drive once which had a dodgy memory buffer. When reading a 4k block,
one specific bit would often be set when it should be clear. md would not
help with that (and in fact was helpfully copying the corruption from the
source drive to a space in a RAID1 for me :-). btrfs would have caught that
particular corruption if checksumming were enabled on all data and metadata.

md could conceivably read the whole "stripe" on every read and verify all
parity blocks before releasing any data. This has been suggested several
times, but no one has provided code or performance analysis yet.

NeilBrown

> On a related note, raid6check was mentioned to me. I don't see it available
> on Debian or RHEL stable, but I found a man page:
>
> https://github.com/neilbrown/mdadm/blob/master/raid6check.8
>
> The man page says, "No write operations are performed on the array or the
> components," but my reading of the code makes it seem like a read error will
> trigger a write implicitly. Am I misunderstanding this? Overall, am I barking
> up the wrong tree in thinking that RAID 6 might let me preserve proper
> layering while giving me the data integrity safeguards I'd otherwise get from
> ZFS or BTRFS?
>
> Thanks in advance for clarifications and pointers!

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 32+ messages in thread
* Re: Questions about bitrot and RAID 5/6 2014-01-20 21:46 ` NeilBrown @ 2014-01-20 22:55 ` Peter Grandi 2014-01-21 9:18 ` David Brown 2014-01-21 17:19 ` Mason Loring Bliss 2 siblings, 0 replies; 32+ messages in thread From: Peter Grandi @ 2014-01-20 22:55 UTC (permalink / raw) To: Linux RAID [ ... ] >> In short, I'm trying to understand if there's a reasonable way to >> get something equivlant to ZFS/BTRFS on-a-mirror-with-scrubbing >> if I'm using MD RAID 6. [ ... ] "Single-disk corruption >> recovery". What I'm wondering if he's describing something >> theoretically possible given the redundant data RAID 6 stores, This seems to me a stupid idea that comes up occasionally on this list, and the answer is always the same: the redundancy in RAID is designed for *reconstruction* of data, not for integrity *checking* of data, and RAID assumes that the underlying storage system reports *every* error, that is there are never undetected errors from the lower layer. When an error is reported, RAID uses redundancy to reconstruct the lost data. That's how it was designed, and for good reasons including simplicity (also see later). It might be possible to design RAID systems that provide protection against otherwise undetected storage errors, but it would cost a lot in time and complexity (issues with both BTRFS and ZFS) and would be rather pointless in many if not most cases. Existing facilities like 'check' in MD RAID are there for extra convenience, as opportunistic little hints, and should not be relied upon for data integrity; they are mostly there to exercise the storage layer, not to detect otherwise undetected errors. > ars technica recently had an article about "Bitrot and atomics > COWs: Inside "next-gen" filesystems." > http://feeds.arstechnica.com/~r/arstechnica/everything/~3/Cb4ylzECYVQ/ > Early on it talks about creating a brtfs filesystem with RAID1 > configured and then binary-editing one of the device to flip one > bit. Then magically btrfs survives while some other filesystem > suffered data corruption. That is where I stopped reading > because that is *not* how bitrot happens. Indeed, and "bitrot" happens for example as reported here: http://w3.hepix.org/storage/hep_pdf/2007/Spring/kelemen-2007-HEPiX-Silent_Corruptions.pdf > Drives have sophisticated error checking and correcting codes. > If a pbit on the media changes, the device will either fix it > transparently or report an error [ ... ] That's also because storage manufacturers understand that RAID systems and filesystems are designed to absolutely rely on error reporting by the storage layer... > On the path from the CCD which captures the photo of the cat, > to the LCD which displays the image, there are lots of memory > buffers and busses which carry the data. Any one of those > could theoretically flip one or more bits. That's part of what the CERN study above reports: a significant number of otherwise undetected error not because of failing hardware, but pretty obviously from bugs in the Linux kernel, in drivers, in host adapter firmware, in buses, in drive firmware. Note: I have seen situations where "bad" devices on a PCI bus would corrupt random memory locations *after* the storage layer and filesystem had verified the checksums... Note that in th CERN tests *all* disks were modern devices with extensive ECC, and all servers were "enterprise" class stuff. > Each of them *should* have appropriate error detecting and > correcting codes. That's more than arguable, especially as to "correcting". 
For much data even error detection is not that important, and for a large amount of content correction is even less important. A lot of disk drives are full of graphical or audio content where uncorrected errors are unnoticeable, for example. After all essentially all consumer devices don't have RAM ECC and nobody seems to complain about the inevitable undetected errors... In general the "end-to-end" argument applies: if some data really needs strong error detection and/or correction, put it in the file format itself, so that the relevant costs are only paid in the specific cases, and it is portable across filesystems and storage layers, so that those extremely delicate and critical filesystems and storage layers can stay skinny and simple. [ ... ] ^ permalink raw reply [flat|nested] 32+ messages in thread
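
(To illustrate the end-to-end argument in concrete terms, here is a small
sketch in Python. The file names and the sidecar-file convention are made up
for the example. The producer of the data stores a checksum alongside it and
the consumer verifies it on read, so corruption introduced anywhere underneath,
whether in the filesystem, the RAID layer, the drive or a bus, is detected
without any of those layers having to cooperate. Note that this detects but
does not correct, a trade-off discussed further below.)

    import hashlib
    from pathlib import Path

    def write_with_checksum(path: str, payload: bytes) -> None:
        # The producer stores the data and a checksum of it side by side.
        Path(path).write_bytes(payload)
        Path(path + ".sha256").write_text(hashlib.sha256(payload).hexdigest())

    def read_with_checksum(path: str) -> bytes:
        # The consumer recomputes the checksum and refuses corrupted data.
        payload = Path(path).read_bytes()
        expected = Path(path + ".sha256").read_text().strip()
        if hashlib.sha256(payload).hexdigest() != expected:
            raise IOError("checksum mismatch reading " + path)
        return payload

    if __name__ == "__main__":
        write_with_checksum("/tmp/demo.dat", b"critical payload")
        print(read_with_checksum("/tmp/demo.dat"))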
* Re: Questions about bitrot and RAID 5/6 2014-01-20 21:46 ` NeilBrown 2014-01-20 22:55 ` Peter Grandi @ 2014-01-21 9:18 ` David Brown 2014-01-21 17:19 ` Mason Loring Bliss 2 siblings, 0 replies; 32+ messages in thread From: David Brown @ 2014-01-21 9:18 UTC (permalink / raw) To: NeilBrown, Mason Loring Bliss; +Cc: linux-raid On 20/01/14 22:46, NeilBrown wrote: > On Mon, 20 Jan 2014 15:34:33 -0500 Mason Loring Bliss <mason@blisses.org> > wrote: > >> I was initially writing to HPA, and he noted the existence of this list, so >> I'm going to boil down what I've got so far for the list. In short, I'm >> trying to understand if there's a reasonable way to get something equivlant >> to ZFS/BTRFS on-a-mirror-with-scrubbing if I'm using MD RAID 6. >> >> >> >> I recently read (or attempted to read, for those sections that exceeded my >> background in math) HPA's paper "The mathematics of RAID-6", and I was >> particularly interested in section four, "Single-disk corruption recovery". >> What I'm wondering if he's describing something theoretically possible given >> the redundant data RAID 6 stores, or something that's actually been >> implemented in (specifically) MD RAID 6 on Linux. >> >> The world is in a rush to adopt ZFS and BTRFS, but there are dinosaurs among >> us that would love to maintain proper layering with the RAID layer being able >> to correct for bitrot itself. A common scenario that would benefit from this >> is having an encrypted layer sitting atop RAID, with LVM atop that. >> >> >> >> I just looked through the code for the first time today, and I'd love to know >> if my understanding is correct. My current read of the code is as follows: >> >> linux-source/lib/raid6/recov.c suggests that for a single-disk failure, >> recovery is handled by the RAID 5 code. In raid5.c, if I'm reading it >> correctly, raid5_end_read_request will request a rewrite attempt if uptodate >> is not true, which can call md_error, which can initiate recovery. >> >> I'm struggling a little to trace recovery, but it does seem like MD maintains >> a list of bad blocks and can map out bad sectors rather than marking a whole >> drive as being dead. >> >> Am I correct in assuming that bitrot will show up as a bad read, thus making >> the read check fail and causing a rewrite attempt, which will mark the sector >> in question as bad and write the data somewhere else if it's detected? If >> this is the case then there's a very viable, already deployed option for >> catching bitrot that doesn't require complete upheaval of how people manage >> disk space and volumes nowadays. > > ars technica recently had an article about "Bitrot and atomics COWs: Inside > "next-gen" filesystems." > > http://feeds.arstechnica.com/~r/arstechnica/everything/~3/Cb4ylzECYVQ/ > > Early on it talks about creating a brtfs filesystem with RAID1 configured and > then binary-editing one of the device to flip one bit. Then magically btrfs > survives while some other filesystem suffered data corruption. > That is where I stopped reading because that is *not* how bitrot happens. That is certainly true - their fake "bitrot" was very unrealistic, at least as a disk error. Undetected disk read errors are incredibly rare, even on cheap disks, and you will not get them without warning (very high /detectable/ disk read error rates). However, as Peter points out there can be other sources of undetected errors, such as memory errors, bus errors, etc. 
I've read your blog on this topic, and I fully agree that checksumming or read-time verification should not be part of the raid layer. The ideal place is whatever is generating the data generates the checksum, and whatever is reading the data checks it - then /any/ error in the storage path will be detected. But that is unrealistic to achieve - you can't change every program. Putting the checksums in the filesystem, as btrfs does, is the next best thing - it is the highest layer where this is practical. Of course it comes at a cost - checksums have to be calculated and stored - but that cost is small on modern cpus. Another nice thing that is easier and faster with filesystem checksums is deduplication, which is not really something you want on the raid layer. David > > Drives have sophisticated error checking and correcting codes. If a bit on > the media changes, the device will either fix it transparently or report an > error - just like you suggest. It is extremely unlikely to return bad data > as though it were good data. And the codes that btrfs use have roughly the > same probability of reporting bad data as good - infinitesimal but not zero. > > i.e. that clever stuff done by btrfs is already done by the drive! > > To be fair to btrfs there are other possible sources of corruption than just > media defect. On the path from the CCD which captures the photo of the cat, > to the LCD which displays the image, there are lots of memory buffers and > busses which carry the data. Any one of those could theoretically flip one > or more bits. Each of them *should* have appropriate error detecting and > correcting codes. Apparently not all of them do. > So the magic in btrfs doesn't really protect against media errors (though if > your drive is buggy it could help there) but against errors in some (but not > all) other buffers or paths. > > i.e. it sounds like a really cool idea but I find it very hard to evaluate > how useful it really is and whether it is worth the cost. My gut feeling is > that for data it probably isn't. For metadata it might be. > > So to answer your question: yes- raid6 on reasonable-quality drives already > protects you against media errors. There are however theoretically possible > sources of corruption that md/raid6 does not protect you against. btrfs > might protect you against some of those. Nothing can protect you against all > of them. > > As is true for any form of security (and here are at talking about data > security) you can only evaluate how safe you are against some specific threat > model. Without a clear threat model it is all just hand waving. > > I had a drive one which had a dodgy memory buffer. When reading a 4k block, > one specific bit would often be set when it should be clear. md would not > help with that (and in fact was helpfully copying the corruption from the > source drive to a space in a RAID1 for me :-). btrfs would have caught that > particular corruption if checksumming were enabled on all data and metadata. > > md could conceivably read the whole "stripe" on every read and verify all > parity blocks before releasing any data. This has been suggested several > times, but no one has provided code or performance analysis yet. > > NeilBrown > > >> >> On a related note, raid6check was mention to me. 
I don't see that available >> on Debian or RHEL stable, but I found a man page: >> >> https://github.com/neilbrown/mdadm/blob/master/raid6check.8 >> >> The man page says, "No write operations are performed on the array or the >> components," but my reading of the code makes it seem like a read error will >> trigger a write implicitly. Am I misunderstanding this? Overall, am I barking >> up the wrong tree in thinking that RAID 6 might let me preserve proper >> layering while giving me the data integrity safeguards I'd otherwise get from >> ZFS or BTRFS? >> >> Thanks in advance for clarifications and pointers! >> > ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Questions about bitrot and RAID 5/6 2014-01-20 21:46 ` NeilBrown 2014-01-20 22:55 ` Peter Grandi 2014-01-21 9:18 ` David Brown @ 2014-01-21 17:19 ` Mason Loring Bliss 2014-01-22 10:40 ` David Brown 2 siblings, 1 reply; 32+ messages in thread From: Mason Loring Bliss @ 2014-01-21 17:19 UTC (permalink / raw) To: linux-raid On Tue, Jan 21, 2014 at 08:46:17AM +1100, NeilBrown wrote: > ars technica recently had an article about "Bitrot and atomics COWs: Inside > "next-gen" filesystems." [...] > That is where I stopped reading because that is *not* how bitrot happens. I'm not finding the specific things I've read to this effect, and some of it was on ephemeral media (IRC), but one of the justifications I've seen for the ZFS/BTRFS approach is that some drives might not consistently report errors. I think it's likely the case that one is in somewhat bad trouble in that situation, but paranoia isn't strictly a bad thing. > i.e. that clever stuff done by btrfs is already done by the drive! The Ars Technica article shook my faith in this a little, and I'm appreciating the balanced view. (And, I'm spinning up smartd anywhere where it's not now running.) On Mon, Jan 20, 2014 at 10:55:06PM +0000, Peter Grandi wrote: > This seems to me a stupid idea that comes up occasionally on this list, and > the answer is always the same: the redundancy in RAID is designed for > *reconstruction* of data, not for integrity *checking* of data, And yet, one person's stupid is another person's glaringly obvious. The RAID layer is the only one where you can have redundant data available from distinct devices. If it's desired, fault-tolerance ought to exist at every level. > and RAID assumes that the underlying storage system reports *every* error, > that is there are never undetected errors from the lower layer. I wouldn't want to force extra processing and storage onto everyone, but it seems like something that doesn't muddy the design or complicate things at all. It seems like a perfect option for the paranoid - think of ordered data mode in EXT4. You don't have to turn it on if you don't want it. On Tue, Jan 21, 2014 at 10:18:14AM +0100, David Brown wrote: > I've read your blog on this topic, and I fully agree that checksumming or > read-time verification should not be part of the raid layer. Can you provide a link, please? > The ideal place is whatever is generating the data generates the checksum, > and whatever is reading the data checks it - then /any/ error in the > storage path will be detected. Detected, but not corrected. Again, fault tolerance means that the system works around errors. As has been pointed out, there are potential sources of error at every level. It's not at all unreasonable for each layer to take advantage of available information to ensure correct operation. Hell, in a past life when I was working on embedded medical devices, I wrote code to store critical variables in reprodicibly-mutated form so that on accessing them I could verify that the hardware wasn't faulty and that nothing was randomly spraying memory. Certainly it cost a tiny bit of extra processing. The goal wasn't fault tolerance there, it was detection, but the point is that we didn't have to trust the substrate, so we did what we could to use it without trust. > Putting the checksums in the filesystem, as btrfs does, is the next best > thing - it is the highest layer where this is practical. Again, depending on the goal. 
It's practical error detection, but doesn't add to the reliability of the overall system at all if there's no source of redundant data for a quorum. -- The creatures outside looked from pig to man, and from man to pig, and from pig to man again; but already it was impossible to say which was which. - G. Orwell ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Questions about bitrot and RAID 5/6 2014-01-21 17:19 ` Mason Loring Bliss @ 2014-01-22 10:40 ` David Brown 2014-01-23 0:48 ` Chris Murphy 0 siblings, 1 reply; 32+ messages in thread From: David Brown @ 2014-01-22 10:40 UTC (permalink / raw) To: Mason Loring Bliss, linux-raid On 21/01/14 18:19, Mason Loring Bliss wrote: > On Tue, Jan 21, 2014 at 08:46:17AM +1100, NeilBrown wrote: > >> ars technica recently had an article about "Bitrot and atomics COWs: Inside >> "next-gen" filesystems." > [...] >> That is where I stopped reading because that is *not* how bitrot happens. > > I'm not finding the specific things I've read to this effect, and some of it > was on ephemeral media (IRC), but one of the justifications I've seen for the > ZFS/BTRFS approach is that some drives might not consistently report errors. > I think it's likely the case that one is in somewhat bad trouble in that > situation, but paranoia isn't strictly a bad thing. > > >> i.e. that clever stuff done by btrfs is already done by the drive! > > The Ars Technica article shook my faith in this a little, and I'm > appreciating the balanced view. (And, I'm spinning up smartd anywhere where > it's not now running.) > > > > On Mon, Jan 20, 2014 at 10:55:06PM +0000, Peter Grandi wrote: > >> This seems to me a stupid idea that comes up occasionally on this list, and >> the answer is always the same: the redundancy in RAID is designed for >> *reconstruction* of data, not for integrity *checking* of data, > > And yet, one person's stupid is another person's glaringly obvious. The RAID > layer is the only one where you can have redundant data available from > distinct devices. If it's desired, fault-tolerance ought to exist at every > level. > > >> and RAID assumes that the underlying storage system reports *every* error, >> that is there are never undetected errors from the lower layer. > > I wouldn't want to force extra processing and storage onto everyone, but it > seems like something that doesn't muddy the design or complicate things at > all. It seems like a perfect option for the paranoid - think of ordered data > mode in EXT4. You don't have to turn it on if you don't want it. > If the raid system reads in the whole stripe, and finds that the parities don't match, what should it do? Before considering what checks can be done, you need to think through what could cause those checks to fail - and what should be done about it. If the stripe's parities don't match, then something /very/ bad has happened - either a disk has a read error that it is not reporting, or you've got hardware problems with memory, buses, etc., or the software has a serious bug. In any case, you have to question the accuracy of anything you read off the array - you certainly have no way of knowing which disk is causing the trouble. Probably the best you could do is report the whole stripe read as failed, and hope that the filesystem can recover. > > > On Tue, Jan 21, 2014 at 10:18:14AM +0100, David Brown wrote: > >> I've read your blog on this topic, and I fully agree that checksumming or >> read-time verification should not be part of the raid layer. > > Can you provide a link, please? > <http://neil.brown.name/blog/20110227114201> <http://neil.brown.name/blog/20100211050355> > >> The ideal place is whatever is generating the data generates the checksum, >> and whatever is reading the data checks it - then /any/ error in the >> storage path will be detected. > > Detected, but not corrected. 
Again, fault tolerance means that the system > works around errors. That's true - but the same applies to checking raid stripes for consistency. You can only detect an error, not correct it. To be able to correct the error, you need to put the checking mechanism below the layer of the redundancy. This is what btrfs does - the checksum is on the file block or extent, and that block or extent is stored redundantly (for raid1, dup, etc.) as is its checksum. You cannot do the /correcting/ above the redundancy layer unless you are talking about hamming codes or other forward error correction, which would be massively invasive for performance. So if you want to /correct/ errors at the raid level, you need checksumming (or other detection mechanisms) just below the raid layer - and that is the block layer, typically the disk layer. But the disk layer already has such a mechanism - it is the ECC system built into the disk. Another checksum on the block layer is just a duplication of the work already done by the disk - at best, you are checking the connections and buffers along the way. These are, as with everything else, a potential source of error - but they are definitely a low-risk point. > As has been pointed out, there are potential sources of > error at every level. It's not at all unreasonable for each layer to take > advantage of available information to ensure correct operation. > It is certainly not unreasonable to consider it - but you always have to balance the probability of something going wrong, the consequences of such errors, and the costs of correcting them. The types of error that btrfs checksumming can detect (and correct, given redundant copies) are extremely rare - the huge majority of unrecoverable disk read errors are detected and reported by the drive. But it turns out that this checksumming is relatively inexpensive when it is done as part of the filesystem, and the checksums have other potential benefits (such as for deduplication, smarter rsyncs, etc.). So it is worth doing here. Why would you then want to spend additional effort for a less useful, more expensive checking at the raid level that covers fewer possible errors? I know I can't give a proper analysis without relevant statistics, but my gut feeling is that the cost/benefit ratio is very much against trying to correct failures - or even do stripe checking - at the raid level. > Hell, in a past life when I was working on embedded medical devices, I wrote > code to store critical variables in reprodicibly-mutated form so that on > accessing them I could verify that the hardware wasn't faulty and that > nothing was randomly spraying memory. Certainly it cost a tiny bit of extra > processing. The goal wasn't fault tolerance there, it was detection, but the > point is that we didn't have to trust the substrate, so we did what we could > to use it without trust. Yes, and that is why there is ECC in the disks, and ECC memory. High reliability systems, where the cost is justified, use ECC techniques at many other levels too - some processors even have whole cores redundant and fault tolerant. But you have to suit your redundancy, checks, and corrections to your threat model and your cost model. And while I agree that /checking/ at the raid level is not too expensive, /correcting/ would be much more demanding. > > >> Putting the checksums in the filesystem, as btrfs does, is the next best >> thing - it is the highest layer where this is practical. > > Again, depending on the goal. 
It's practical error detection, but doesn't add > to the reliability of the overall system at all if there's no source of > redundant data for a quorum. > Yes, without redundancy then the btrfs checksum is error detection, but not correction. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Questions about bitrot and RAID 5/6 2014-01-22 10:40 ` David Brown @ 2014-01-23 0:48 ` Chris Murphy 2014-01-23 8:18 ` David Brown 0 siblings, 1 reply; 32+ messages in thread From: Chris Murphy @ 2014-01-23 0:48 UTC (permalink / raw) To: linux-raid@vger.kernel.org List On Jan 22, 2014, at 3:40 AM, David Brown <david.brown@hesbynett.no> wrote: > > If the raid system reads in the whole stripe, and finds that the > parities don't match, what should it do? https://www.kernel.org/pub/linux/kernel/people/hpa/raid6.pdf page 8 shows how it can be determined whether data, or P, or Q are corrupt. Multiple corruptions could indicate if a particular physical drive is the only source of corruptions and then treat it as an erasure. Using normal reconstruction code, the problem is correctable. But I'm uncertain if this enables determination of the specific device/chunk when there is data corruption within a single stripe. It seems there's still an assumption that if data chunks produce P' and Q' which do not match P or Q, that P and Q are both correct which might not be true. > Before considering what checks > can be done, you need to think through what could cause those checks to > fail - and what should be done about it. If the stripe's parities don't > match, then something /very/ bad has happened - either a disk has a read > error that it is not reporting, or you've got hardware problems with > memory, buses, etc., or the software has a serious bug. Yes but we know that these things actually happen, even if rare. I don't know how common ECC fails to detect error, or detects but wrongly corrects, but we know that there are (rarely) misdirected writes. That not lonly obliterates data that might have been stored where the data landed, but it also means it's missing where it's expected. Neither drive nor controller ECC helps in such cases. > In any case, > you have to question the accuracy of anything you read off the array - > you certainly have no way of knowing which disk is causing the trouble. I'm not certain. From the Anvin paper, equation 27 suggests it's possible to know which disk is causing the trouble. But I don't know if that equation is intended for physical drives corrupting a mix of data, P and Q parities - or if it works to isolate the specific corrupt data chunk in a single (or more correctly, isolated) stripe data/parity mismatch event. I think in the case of a single, non-overlapping corruption in a data chunk, that RS parity can be used to localize the error. If that's true, then it can be treated as a "read error" and the normal reconstruction for that chunk applies. > Probably the best you could do is report the whole stripe read as > failed, and hope that the filesystem can recover. With default chunk size of 512KB that's quite a bit of data loss for a file system that doesn't use checksummed metadata. Chris Murphy ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Questions about bitrot and RAID 5/6 2014-01-23 0:48 ` Chris Murphy @ 2014-01-23 8:18 ` David Brown 2014-01-23 17:28 ` Chris Murphy 0 siblings, 1 reply; 32+ messages in thread From: David Brown @ 2014-01-23 8:18 UTC (permalink / raw) To: Chris Murphy, linux-raid@vger.kernel.org List On 23/01/14 01:48, Chris Murphy wrote: > > On Jan 22, 2014, at 3:40 AM, David Brown <david.brown@hesbynett.no> > wrote: >> >> If the raid system reads in the whole stripe, and finds that the >> parities don't match, what should it do? > > https://www.kernel.org/pub/linux/kernel/people/hpa/raid6.pdf page 8 > shows how it can be determined whether data, or P, or Q are corrupt. > Multiple corruptions could indicate if a particular physical drive is > the only source of corruptions and then treat it as an erasure. Using > normal reconstruction code, the problem is correctable. But I'm > uncertain if this enables determination of the specific device/chunk > when there is data corruption within a single stripe. That's true - but (as pointed out in Neil's blog) there can be other reasons why one block is "wrong" compared to the others. Supposing you need to change a single block in a raid 6 stripe. That means you will change that block and both parity blocks. If the disk system happens to write out the data disk, but there is a crash before the parities are written, then you will get a stripe that is consistent if you "erase" the new data block - when in fact it is the parity blocks that are wrong. Another reason for avoiding "correcting" data blocks is that it can confuse the filesystem layer if it has previously read in that block (and the raid layer cannot know for sure that it has not done so), and then the raid layer were to "correct" it without the filesystem's knowledge. So automatic "correction" here would be hard, expensive (erasure needs a lot more computation than generating or checking parities), and will sometimes make problems worse. There are good arguments for using such erasure during an offline check/scrub (especially once the 3+ parity raids are in place), but not online. For online error correction, you need more sophistication, such as battery backed memory to track the write orders. > > It seems there's still an assumption that if data chunks produce P' > and Q' which do not match P or Q, that P and Q are both correct which > might not be true. > >> Before considering what checks can be done, you need to think >> through what could cause those checks to fail - and what should be >> done about it. If the stripe's parities don't match, then >> something /very/ bad has happened - either a disk has a read error >> that it is not reporting, or you've got hardware problems with >> memory, buses, etc., or the software has a serious bug. > > Yes but we know that these things actually happen, even if rare. I > don't know how common ECC fails to detect error, or detects but > wrongly corrects, but we know that there are (rarely) misdirected > writes. That not lonly obliterates data that might have been stored > where the data landed, but it also means it's missing where it's > expected. Neither drive nor controller ECC helps in such cases. > I have no disagreement about adding extra checking (and correcting, if possible) into the system - but I think btrfs is the right place, not the raid layer. Btrfs will spot exactly these cases, and correct them if it has redundant copies of the data. 
And because it is at the filesystem level, it has more knowledge and can do more sophisticated error checking for a lot less effort than is done at the raid level. It would be /nice/ if this could be done well - reliably and cheaply - at the raid level. But it can't. >> In any case, you have to question the accuracy of anything you read >> off the array - you certainly have no way of knowing which disk is >> causing the trouble. > > I'm not certain. From the Anvin paper, equation 27 suggests it's > possible to know which disk is causing the trouble. But I don't know > if that equation is intended for physical drives corrupting a mix of > data, P and Q parities - or if it works to isolate the specific > corrupt data chunk in a single (or more correctly, isolated) stripe > data/parity mismatch event. The principle is quite simple, although it involves quite a bit of calculations. Read the whole stripe - D0, D1, ..., Dn, P, Q at once. We can assume that the drive reports all reads as "good" - if not (and this is the usual case on read errors), we know which block is bad. Use the read D0, ..., Dn to calculate new P' and Q'. If these match the read P and Q, we are happy. If not, then something is wrong. If P matches, then assume the Q block is bad - if Q matches, assume the P block is bad. Failing that, try assuming that D0 is bad - recreate D0' from D1, ..., Dn, P. Calculate a new Q'. If this matches the read Q, then we can make the stripe consistent by replacing D0. We have no guarantees that D0 is the problem, but it is the best bet statistically. If we still don't have a match for Q' = Q, then keep the read in D0 and guess that D1 is wrong. If we make it through all the drives without getting a match, there is more than one inconsistency and we have no chance. > > I think in the case of a single, non-overlapping corruption in a data > chunk, that RS parity can be used to localize the error. If that's > true, then it can be treated as a "read error" and the normal > reconstruction for that chunk applies. It /could/ be done - but as noted above it might not help (even though statistically speaking it's a good guess), and it would involve very significant calculations on every read. At best, it would mean that every read involves reading a whole stripe (crippling small read performance) and parity calculations - making reads as slow as writes. This is a very big cost for detecting an error that is /incredibly/ rare. (The link posted earlier in this thread suggested 1000 incidents in 41 PB of data. At that rate, I know that it is far more likely that my company building will burn down, losing everything, than that I will ever see such an error in the company servers. And I've got a backup.) Checksuming at the btrfs level, on the other hand, is cheap - because the filesystem already has the checksumming data on hand as part of the metadata for the file. This is a type of "shortcut" that the raid level cannot possibly do with the current structure, because it knows nothing about the structure of the data on the disks. Of course, if the computer had a nice block of fast non-volatile memory, md raid could use it to store things like block checksums, write bitmaps, logs, etc., and make a safer and faster system than we have today. But there is no such convenient memory available for now. So if you worry about these sorts of errors, use btrfs or zfs. Or combine regular backups with extra checks (such as running your files through sha256sum and comparing to old copies). 
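
(The probing procedure described above is small enough to sketch. The Python
below is a toy model only, not md's code and not the closed-form locator from
the RAID-6 paper: P is the plain XOR of the data blocks, Q is the RAID-6
syndrome over GF(2^8) with generator 2 and the usual polynomial 0x11d, and
when both recomputed syndromes disagree with what is on disk it tries each
data block in turn as an erasure, rebuilds it from P, and reports the one
whose substitution makes Q consistent again. Block contents and the disk
count in the demo are arbitrary.)

    # Toy model of RAID-6 P/Q checking and single-bad-block location.
    # GF(2^8) arithmetic with the RAID-6 polynomial x^8 + x^4 + x^3 + x^2 + 1.

    def gf_mul(a: int, b: int) -> int:
        r = 0
        for _ in range(8):
            if b & 1:
                r ^= a
            b >>= 1
            carry = a & 0x80
            a = (a << 1) & 0xFF
            if carry:
                a ^= 0x1D
        return r

    def gf_pow(a: int, n: int) -> int:
        r = 1
        for _ in range(n):
            r = gf_mul(r, a)
        return r

    def syndromes(data):
        # data: list of equal-length byte strings, one per data disk.
        p = bytearray(len(data[0]))
        q = bytearray(len(data[0]))
        for i, chunk in enumerate(data):
            coeff = gf_pow(2, i)             # generator g = 2, weight g^i
            for j, byte in enumerate(chunk):
                p[j] ^= byte                 # P is plain XOR parity
                q[j] ^= gf_mul(coeff, byte)  # Q is the Reed-Solomon syndrome
        return bytes(p), bytes(q)

    def locate_inconsistency(data, p_disk, q_disk):
        # Returns "ok", "P", "Q", a data-disk index, or None if no single
        # block substitution explains the mismatch.
        p_calc, q_calc = syndromes(data)
        if p_calc == p_disk and q_calc == q_disk:
            return "ok"
        if p_calc == p_disk:
            return "Q"                       # only Q disagrees: assume Q bad
        if q_calc == q_disk:
            return "P"                       # only P disagrees: assume P bad
        # Both disagree: treat each data disk in turn as an erasure, rebuild
        # it from P and the others, and see whether Q then becomes consistent.
        for i in range(len(data)):
            rebuilt = bytearray(p_disk)
            for j, chunk in enumerate(data):
                if j != i:
                    for k, byte in enumerate(chunk):
                        rebuilt[k] ^= byte
            trial = list(data)
            trial[i] = bytes(rebuilt)
            if syndromes(trial)[1] == q_disk:
                return i                     # best statistical guess, not proof
        return None

    if __name__ == "__main__":
        disks = [bytes([0x10, 0x20]), bytes([0x30, 0x40]), bytes([0x50, 0x60])]
        p, q = syndromes(disks)
        damaged = list(disks)
        damaged[1] = bytes([0x31, 0x40])     # flip a bit on the second data disk
        print(locate_inconsistency(damaged, p, q))   # prints 1

(As noted above, a match only says which single-block substitution restores
consistency; it cannot prove that block is the one that was corrupted.)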
> >> Probably the best you could do is report the whole stripe read as >> failed, and hope that the filesystem can recover. > > With default chunk size of 512KB that's quite a bit of data loss for > a file system that doesn't use checksummed metadata. > > > Chris Murphy > > -- To unsubscribe from this list: send the line "unsubscribe > linux-raid" in the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Questions about bitrot and RAID 5/6 2014-01-23 8:18 ` David Brown @ 2014-01-23 17:28 ` Chris Murphy 2014-01-23 18:53 ` Phil Turmel 2014-01-23 22:02 ` David Brown 0 siblings, 2 replies; 32+ messages in thread From: Chris Murphy @ 2014-01-23 17:28 UTC (permalink / raw) To: linux-raid@vger.kernel.org List On Jan 23, 2014, at 1:18 AM, David Brown <david.brown@hesbynett.no> wrote: > > That's true - but (as pointed out in Neil's blog) there can be other > reasons why one block is "wrong" compared to the others. Supposing you > need to change a single block in a raid 6 stripe. That means you will > change that block and both parity blocks. If the disk system happens to > write out the data disk, but there is a crash before the parities are > written, then you will get a stripe that is consistent if you "erase" > the new data block - when in fact it is the parity blocks that are wrong. Sure but I think that's an idealized scenario of a bad scenario in that if there's a crash it's entirely likely that we end up with one or more torn writes to a chunk, rather than completely correctly written data chunk, and parities that aren't written at all. Chances are we do in fact end up with corruption in this case, and there's simply not enough information to unwind it. The state of the data chunk is questionable, and the state of P+Q are questionable. There's really not a lot to do here, although it seems better to have the parities recomputed from the data chunks *such as they are* rather than permit parity reconstruction to effectively rollback just one chunk. > Another reason for avoiding "correcting" data blocks is that it can > confuse the filesystem layer if it has previously read in that block > (and the raid layer cannot know for sure that it has not done so), and > then the raid layer were to "correct" it without the filesystem's knowledge. In this hypothetical implementation, I'm suggesting that data chunks have P' and Q' computed, and compared to on-disk P and Q, for all reads. So there wouldn't be a condition as you suggest. If whatever was previously read in was "OK" but then somehow a bit flips on the next read, is detect, and corrected, it's exactly what you'd want to have happen. > So automatic "correction" here would be hard, expensive (erasure needs a > lot more computation than generating or checking parities), and will > sometimes make problems worse. I could see a particularly reliable implementation (ECC memory, good quality components including the right drives, all correctly configured, and on UPS) where this would statistically do more good than bad. And for all I know there are proprietary hardware raid6 implementations that do this. But it's still not really fixing the problem we want fixed, so it's understandable the effort goes elsewhere. > >> >> I think in the case of a single, non-overlapping corruption in a data >> chunk, that RS parity can be used to localize the error. If that's >> true, then it can be treated as a "read error" and the normal >> reconstruction for that chunk applies. > > It /could/ be done - but as noted above it might not help (even though > statistically speaking it's a good guess), and it would involve very > significant calculations on every read. At best, it would mean that > every read involves reading a whole stripe (crippling small read > performance) and parity calculations - making reads as slow as writes. > This is a very big cost for detecting an error that is /incredibly/ > rare. 
It mostly means that the default chunk size needs to be reduced, a long standing argument, to avoid this very problem. Those who need big chunk sizes for large streaming (media) writes, get less of a penalty for a too small chunk size in this hypothetical implementation than the general purpose case would. Btrfs computes crc32c for every extent read and compares with what's stored in metadata, and its reads are not meaningfully faster with the nodatasum option. And granted that's not apples to apples, because it's only computing a checksum for the extent read, not the equivalent of a whole stripe. So it's always efficient. Also I don't know to what degree the Q computation is hardware accelerated, whereas Btrfs crc32c checksum is hardware accelerated (SSE 4.2) for some time now. > (The link posted earlier in this thread suggested 1000 incidents > in 41 PB of data. At that rate, I know that it is far more likely that > my company building will burn down, losing everything, than that I will > ever see such an error in the company servers. And I've got a backup.) It's a fair point. I've recently run across some claims on a separate forum with hardware raid5 arrays containing all enterprise drives, with regularly scrubs, yet with such excessive implosions that some integrators have moved to raid6 and completely discount the use of raid5. The use case is video production. This sounds suspiciously like microcode or raid firmware bugs to me. I just don't see how ~6-8 enterprise drives in a raid5 translates into significantly higher array collapses that then essentially vanish when it's raid6. Chris Murphy ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Questions about bitrot and RAID 5/6
  2014-01-23 17:28 ` Chris Murphy
@ 2014-01-23 18:53 ` Phil Turmel
  2014-01-23 21:38 ` Chris Murphy
  2014-01-23 22:06 ` David Brown
  2014-01-23 22:02 ` David Brown
  1 sibling, 2 replies; 32+ messages in thread
From: Phil Turmel @ 2014-01-23 18:53 UTC (permalink / raw)
To: Chris Murphy, linux-raid@vger.kernel.org List

Hi Chris,

On 01/23/2014 12:28 PM, Chris Murphy wrote:
> It's a fair point. I've recently run across some claims on a separate
> forum with hardware raid5 arrays containing all enterprise drives,
> with regularly scrubs, yet with such excessive implosions that some
> integrators have moved to raid6 and completely discount the use of
> raid5. The use case is video production. This sounds suspiciously
> like microcode or raid firmware bugs to me. I just don't see how ~6-8
> enterprise drives in a raid5 translates into significantly higher
> array collapses that then essentially vanish when it's raid6.

I just wanted to address this one point. Raid6 is many orders of
magnitude more robust than raid5 in the rebuild case. Let me illustrate:

How to lose data in a raid5:

1) Experience unrecoverable read errors on two of the N drives at the
same *time* and same *sector offset* of the two drives. Absurdly
improbable. On the order of 1x10^-36 for 1T consumer-grade drives.

2a) Experience hardware failure on one drive followed by 2b) an
unrecoverable read error in another drive. You can expect a hardware
failure rate of a few percent per year. Then, when rebuilding on the
replacement drive, the odds skyrocket. On large arrays, the odds of
data loss are little different from the odds of a hardware failure in
the first place.

How to lose data in a raid6:

1) Experience unrecoverable read errors on *three* of the N drives at
the same *time* and same *sector offset* of the drives. Even more
absurdly improbable. On the order of 1x10^-58 for 1T consumer-grade drives.

2) Experience hardware failure on one drive followed by unrecoverable
read errors on two of the remaining drives at the same *time* and same
*sector offset* of the two drives. Again, absurdly improbable. Same as
for the raid5 case "1".

3) Experience hardware failure on two drives followed by an
unrecoverable read error in another drive. As with raid5 on large
arrays, you probably can't complete the rebuild error-free. But the
odds of this event are subject to management--quick response to case
"2" greatly reduces the odds of case "3".

It is no accident that raid5 is becoming much less popular.

Phil

^ permalink raw reply	[flat|nested] 32+ messages in thread
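
(To put rough numbers on the rebuild case, here is a back-of-the-envelope
model in Python; it is illustrative only, and it treats the published 10^-14
figure as an independent per-bit probability, a simplification that is debated
later in this thread. It estimates the chance of hitting at least one URE
while reading every surviving drive during a rebuild, the event that loses
data on a degraded raid5 but that a degraded raid6 can still reconstruct
around. The drive size and counts are arbitrary examples.)

    # Probability of at least one unrecoverable read error (URE) while reading
    # every surviving drive end to end during a rebuild.  Assumes the spec'd
    # 1e-14 errors/bit behaves like an independent per-bit probability.

    def p_at_least_one_ure(tb_read: float, ure_per_bit: float = 1e-14) -> float:
        bits = tb_read * 1e12 * 8
        return 1.0 - (1.0 - ure_per_bit) ** bits

    if __name__ == "__main__":
        drive_tb = 4
        for n_drives in (4, 8, 12):
            surviving_tb = (n_drives - 1) * drive_tb
            p = p_at_least_one_ure(surviving_tb)
            # Degraded raid5: any URE here costs data.  Degraded raid6: a
            # single URE can still be reconstructed from the remaining parity.
            print(f"{n_drives} x {drive_tb}TB drives, one failed: "
                  f"P(>=1 URE during rebuild) = {p:.0%}")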
* Re: Questions about bitrot and RAID 5/6 2014-01-23 18:53 ` Phil Turmel @ 2014-01-23 21:38 ` Chris Murphy 2014-01-24 13:22 ` Phil Turmel 2014-01-23 22:06 ` David Brown 1 sibling, 1 reply; 32+ messages in thread From: Chris Murphy @ 2014-01-23 21:38 UTC (permalink / raw) To: linux-raid@vger.kernel.org List On Jan 23, 2014, at 11:53 AM, Phil Turmel <philip@turmel.org> wrote: > Hi Chris, > > On 01/23/2014 12:28 PM, Chris Murphy wrote: >> It's a fair point. I've recently run across some claims on a separate >> forum with hardware raid5 arrays containing all enterprise drives, >> with regularly scrubs, yet with such excessive implosions that some >> integrators have moved to raid6 and completely discount the use of >> raid5. The use case is video production. This sounds suspiciously >> like microcode or raid firmware bugs to me. I just don't see how ~6-8 >> enterprise drives in a raid5 translates into significantly higher >> array collapses that then essentially vanish when it's raid6. > > I just wanted to address this one point. Raid6 is many orders of > magnitude more robust than raid5 in the rebuild case. Let me illustrate: > > How to lose data in a raid5: > > 1) Experience unrecoverable read errors on two of the N drives at the > same *time* and same *sector offset* of the two drives. Absurdly > improbable. On the order of 1x10^-36 for 1T consumer-grade drives. > > 2a) Experience hardware failure on one drive followed by 2b) an > unrecoverable read error in another drive. You can expect a hardware > failure rate of a few percent per year. Then, when rebuilding on the > replacement drive, the odds skyrocket. On large arrays, the odds of > data loss are little different from the odds of a hardware failure in > the first place. Yes I understand this, but 2a and 2b occurring at the same time also seems very improbable with enterprise drives and regularly scheduled scrubs. That's the context I'm coming from. What are the odds of a latent sector error resulting in a read failure, within ~14 days from the most recent scrub? And with enterprise drives that by design have the proper SCT ERC value? And at the same time as a single disk failure? It seems like a rather low probability. I'd sooner expect to see a 2nd disk failure before the rebuild completes. > > It is no accident that raid5 is becoming much less popular. Sure and I don't mean to indicate raid6 isn't orders of magnitude safer. I'm suggesting that massive safety margin is being used to paper over common improper configurations of raid5 arrays. e.g. using drives with the wrong SCT ERC timeout for either controller or SCSI block layer, and also not performing any sort of raid or SMART scrubbing enabling latent sector errors to develop. The accumulation of latent sector errors makes raid5 collapse only somewhat less likely than the probability of a single drive failure. So raid5 is particularly sensitive to failure in the case of bad setups, whereas dual parity can in-effect mitigate the consequences of bad setups. But that's not really what it's designed for. If we're talking about exactly correctly configured setups, the comparison is overwhelmingly about (multiple) drive failure probability. Chris Murphy ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Questions about bitrot and RAID 5/6 2014-01-23 21:38 ` Chris Murphy @ 2014-01-24 13:22 ` Phil Turmel 2014-01-24 16:11 ` Chris Murphy 0 siblings, 1 reply; 32+ messages in thread From: Phil Turmel @ 2014-01-24 13:22 UTC (permalink / raw) To: Chris Murphy, linux-raid@vger.kernel.org List Hi Chris, [BTW, reply-to-all is proper etiquette on kernel.org lists. You keep dropping CCs.] On 01/23/2014 04:38 PM, Chris Murphy wrote: > > On Jan 23, 2014, at 11:53 AM, Phil Turmel <philip@turmel.org> wrote: > >> 2a) Experience hardware failure on one drive followed by 2b) an >> unrecoverable read error in another drive. You can expect a >> hardware failure rate of a few percent per year. Then, when >> rebuilding on the replacement drive, the odds skyrocket. On large >> arrays, the odds of data loss are little different from the odds of >> a hardware failure in the first place. > > Yes I understand this, but 2a and 2b occurring at the same time also > seems very improbable with enterprise drives and regularly scheduled > scrubs. That's the context I'm coming from. No, they aren't improbable. That's my point. For consumer drives, you can expect a new URE every 12T or so read, on average. (Based on claimed URE rates.) So big arrays (tens of terabytes) are likely find a *new* URE on *every* scrub, even if they are back-to-back. And on rebuild after a hardware failure, which also reads the entire array. > What are the odds of a latent sector error resulting in a read > failure, within ~14 days from the most recent scrub? And with > enterprise drives that by design have the proper SCT ERC value? And > at the same time as a single disk failure? It seems like a rather low > probability. I'd sooner expect to see a 2nd disk failure before the > rebuild completes. It's not even close. The URE on rebuild is near *certain* on very large arrays. Enterprise drives push the URE rate down another factor of ten, so the problem is most apparent on arrays of high tens of T or hundreds of T. But enterprise customers are even more concerned with data loss, moving the threshold right back. And if you are a data center with thousands of drives, the hardware failure rate is noticeable. Also, all of my analysis presumes proper error-recovery configuration. Without it, you're toast. >> It is no accident that raid5 is becoming much less popular. > > Sure and I don't mean to indicate raid6 isn't orders of magnitude > safer. I'm suggesting that massive safety margin is being used to > paper over common improper configurations of raid5 arrays. e.g. > using drives with the wrong SCT ERC timeout for either controller or > SCSI block layer, and also not performing any sort of raid or SMART > scrubbing enabling latent sector errors to develop. No, the problem is much more serious than that. Improper ERC just causes a dramatic array collapse that confuses the hobbyist. That's why it gets a lot of attention on linux-raid. > The accumulation of latent sector errors makes raid5 collapse only > somewhat less likely than the probability of a single drive failure. > So raid5 is particularly sensitive to failure in the case of bad > setups, whereas dual parity can in-effect mitigate the consequences > of bad setups. But that's not really what it's designed for. If we're > talking about exactly correctly configured setups, the comparison is > overwhelmingly about (multiple) drive failure probability. No, improper ERC setup will take out a raid6 almost as fast as raid5, since any URE kicks the drive out. 
It happens mostly to hobbyists who haven't scheduled scrubs, since anyone doing scrubs finds this out relatively quickly. (Because they are afflicted with a rash of drive "failures" that aren't.) Your comments suggest you've completely discounted the fact that published URE rates are now close to, or within, drive capacities. Spend some time with the math and you will be very concerned. Phil ^ permalink raw reply [flat|nested] 32+ messages in thread
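To put numbers on Phil's point, here is a minimal back-of-envelope sketch in Python. It treats the published consumer-drive figure of 1 unrecoverable read error per 1e14 bits read as the actual mean rate and assumes independent errors, a deliberately simple model, and exactly the assumption that is questioned later in the thread:

    import math

    URE_RATE = 1e-14          # assumed mean UREs per bit read (consumer-drive spec ceiling)

    def ure_stats(terabytes_read):
        bits = terabytes_read * 1e12 * 8
        expected = bits * URE_RATE        # mean number of UREs in one full pass
        p_clean = math.exp(-expected)     # Poisson probability of zero UREs
        return expected, p_clean

    for tb in (1, 4, 12, 24, 48):
        expected, p_clean = ure_stats(tb)
        print(f"{tb:3d} TB read: expected UREs = {expected:.2f}, P(no URE) = {p_clean:.1%}")

Under that assumption a 12 TB pass averages about one URE, and a 48 TB scrub or rebuild completes cleanly only a few percent of the time, which is the sense in which a URE during a large raid5 rebuild becomes near certain.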
* Re: Questions about bitrot and RAID 5/6 2014-01-24 13:22 ` Phil Turmel @ 2014-01-24 16:11 ` Chris Murphy 2014-01-24 17:03 ` Phil Turmel 0 siblings, 1 reply; 32+ messages in thread From: Chris Murphy @ 2014-01-24 16:11 UTC (permalink / raw) To: Phil Turmel; +Cc: linux-raid@vger.kernel.org List On Jan 24, 2014, at 6:22 AM, Phil Turmel <philip@turmel.org> wrote: > > No, they aren't improbable. That's my point. For consumer drives, you > can expect a new URE every 12T or so read, on average. - Define URE. Western Digital, HGST, and Seagate don't use the term URE/unrecoverable read error. They use, respectively: non-recoverable read error per bits read error rate, non-recoverable, per bits read nonrecoverable Read Errors per Bits Read, Max These are all identical terms? - How does the URE manifest? That is, does the drive always report a read error such as this? ata3.00: cmd c8/00:08:55:e8:8d/00:00:00:00:00/e2 tag 0 dma 4096 in es 51/40:00:56:e8:8d/00:00:00:00:00/02 Emask 0x9 (media error) ata3.00: status: { DRDY ERR } ata3.00: error: { UNC } Or does URE include silent data corruption, and disk failure? - How many bits of loss occur with one URE? > > Your comments suggest you've completely discounted the fact that > published URE rates are now close to, or within, drive capacities. > > Spend some time with the math and you will be very concerned. Yeah I tried that a year ago and when it came to really super basic questions, no one was willing to answer them and the thread died as if we don't actually know what we're talking about. So I think some rather basic definitions are in order and an agreement that we don't get to redefine mathematics by saying a max error rate is a mean. http://www.spinics.net/lists/raid/msg41669.html Chris Murphy ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Questions about bitrot and RAID 5/6 2014-01-24 16:11 ` Chris Murphy @ 2014-01-24 17:03 ` Phil Turmel 2014-01-24 17:59 ` Chris Murphy 0 siblings, 1 reply; 32+ messages in thread From: Phil Turmel @ 2014-01-24 17:03 UTC (permalink / raw) To: Chris Murphy; +Cc: linux-raid@vger.kernel.org List On 01/24/2014 11:11 AM, Chris Murphy wrote: > > On Jan 24, 2014, at 6:22 AM, Phil Turmel <philip@turmel.org> wrote: >> >> No, they aren't improbable. That's my point. For consumer drives, you >> can expect a new URE every 12T or so read, on average. > > - Define URE. Unrecoverable Read Error. Also known as a non-recoverable read error or an uncorrectable read. > Western Digital, HGST, and Seagate don't use the term URE/unrecoverable read error. They use, respectively: > > non-recoverable read error per bits read > error rate, non-recoverable, per bits read > nonrecoverable Read Errors per Bits Read, Max > > These are all identical terms? These are statements about *rates* of UREs. But yes, identical. > - How does the URE manifest? That is, does the drive always report a read error such as this? > > ata3.00: cmd c8/00:08:55:e8:8d/00:00:00:00:00/e2 tag 0 dma 4096 in > es 51/40:00:56:e8:8d/00:00:00:00:00/02 Emask 0x9 (media error) > ata3.00: status: { DRDY ERR } > ata3.00: error: { UNC } Yes. I'm not sure if { DRDY ERR } is always present. > Or does URE include silent data corruption, and disk failure? No, and no. > - How many bits of loss occur with one URE? Complete physical sector. The error correction codes on the market operate on entire physical sectors. Once the correcting capacity of the code is exceeded, the math involved can no longer identify which bits in the sector were corrupted, so the whole sector must be declared unknown. Google "Reed-Solomon" for an introduction to such codes. >> Your comments suggest you've completely discounted the fact that >> published URE rates are now close to, or within, drive capacities. >> >> Spend some time with the math and you will be very concerned. > > Yeah I tried that a year ago and when it came to really super basic questions, no one was willing to answer them and the thread died as if we don't actually know what we're talking about. So I think some rather basic definitions are in order and an agreement that we don't get to redefine mathematics by saying a max error rate is a mean. > > http://www.spinics.net/lists/raid/msg41669.html I participated in that thread. Some of your comments there imply that the math is simple. It's not (unless you are whiz with statistics). Look at the Poisson distribution I referenced and the computation examples I gave. Note that a statement about the rate of a randomly occurring error is implicitly stating an average. The specification sheets state that the rate (an average) will not exceed (max) a certain value within the warranteed life of the drive. Two UREs occurring much less than 10^14 bits apart don't violate the spec. A long series of UREs averaging out to less than 10^14 bits apart would be a violation. Note that the rate does change over time. A brand new drive in good condition can have a rate much less than the per 10^14 bits spec. But a drive that is approaching or past its warranty life can be expected to be close to it. (Or the manufacturers would claim that better performance due to marketing pressure.) Regards, Phil ^ permalink raw reply [flat|nested] 32+ messages in thread
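A small sketch of the Poisson model Phil refers to, for readers who want to see why a rate specification is implicitly a statement about an average. It again assumes, purely for illustration, that the quoted 1e-14 figure is the true mean rate and that errors are independent:

    import math

    def poisson_pmf(k, lam):
        return math.exp(-lam) * lam ** k / math.factorial(k)

    rate = 1e-14            # assumed mean UREs per bit read
    bits_read = 20e12 * 8   # one full pass over 20 TB
    lam = rate * bits_read  # expected URE count for that pass (about 1.6)

    for k in range(5):
        print(f"P({k} UREs in a 20 TB pass) = {poisson_pmf(k, lam):.3f}")

The specification only constrains lam, the long-run average; any single pass can see zero errors or several, which is why single events on a single drive neither prove nor disprove the spec.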
* Re: Questions about bitrot and RAID 5/6 2014-01-24 17:03 ` Phil Turmel @ 2014-01-24 17:59 ` Chris Murphy 2014-01-24 18:12 ` Phil Turmel 0 siblings, 1 reply; 32+ messages in thread From: Chris Murphy @ 2014-01-24 17:59 UTC (permalink / raw) To: Phil Turmel; +Cc: linux-raid@vger.kernel.org List On Jan 24, 2014, at 10:03 AM, Phil Turmel <philip@turmel.org> wrote: >> w many bits of loss occur with one URE? > > Complete physical sector. A complete physical sector represents 512 bytes / 4096 bits, or in the case of AF disks 4096 bytes / 32768 bits, of loss for one URE. Correct? So a URE is either 4096 bits nonrecoverable, or 32768 bits nonrecoverable, for HDDs. Correct? >>> Your comments suggest you've completely discounted the fact that >>> published URE rates are now close to, or within, drive capacities. >>> >>> Spend some time with the math and you will be very concerned. >> >> Yeah I tried that a year ago and when it came to really super basic questions, no one was willing to answer them and the thread died as if we don't actually know what we're talking about. So I think some rather basic definitions are in order and an agreement that we don't get to redefine mathematics by saying a max error rate is a mean. >> >> http://www.spinics.net/lists/raid/msg41669.html > > I participated in that thread. Some of your comments there imply that > the math is simple. It's not (unless you are whiz with statistics). > Look at the Poisson distribution I referenced and the computation > examples I gave. At the moment a Poisson distribution is out of scope because my questions have nothing to do with how often, when, or how many, such URE's will occur. At the moment I only want complete utter clarity on what a URE/nonrecoverable error (not even the rate) is in terms of quantity. That's my main problem. > > Note that a statement about the rate of a randomly occurring error is > implicitly stating an average. Except that it has only one limiter, with the next notch a whole order magnitude less error. So I don't see how you get an average unless you're willing to just make assumptions about the bottom end. It doesn't make sense that a manufacturer would state a maximum error rate of X and then target that as an average. The average is certainly well below the max. Chris Murphy ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Questions about bitrot and RAID 5/6 2014-01-24 17:59 ` Chris Murphy @ 2014-01-24 18:12 ` Phil Turmel 2014-01-24 19:32 ` Chris Murphy 0 siblings, 1 reply; 32+ messages in thread From: Phil Turmel @ 2014-01-24 18:12 UTC (permalink / raw) To: Chris Murphy; +Cc: linux-raid@vger.kernel.org List On 01/24/2014 12:59 PM, Chris Murphy wrote: > > On Jan 24, 2014, at 10:03 AM, Phil Turmel <philip@turmel.org> wrote: >>> w many bits of loss occur with one URE? >> >> Complete physical sector. > > > A complete physical sector represents 512 bytes / 4096 bits, or in > the case of AF disks 4096 bytes / 32768 bits, of loss for one URE. > Correct? > > So a URE is either 4096 bits nonrecoverable, or 32768 bits > nonrecoverable, for HDDs. Correct? Yes. Note that the specification is for an *event*, not for a specific number of bits lost. The error rate is not "bits lost per bits read", it is "bits lost event per bits read". >>>> Your comments suggest you've completely discounted the fact >>>> that published URE rates are now close to, or within, drive >>>> capacities. >>>> >>>> Spend some time with the math and you will be very concerned. >>> >>> Yeah I tried that a year ago and when it came to really super >>> basic questions, no one was willing to answer them and the thread >>> died as if we don't actually know what we're talking about. So I >>> think some rather basic definitions are in order and an agreement >>> that we don't get to redefine mathematics by saying a max error >>> rate is a mean. >>> >>> http://www.spinics.net/lists/raid/msg41669.html >> >> I participated in that thread. Some of your comments there imply >> that the math is simple. It's not (unless you are whiz with >> statistics). Look at the Poisson distribution I referenced and the >> computation examples I gave. > > At the moment a Poisson distribution is out of scope because my > questions have nothing to do with how often, when, or how many, such > URE's will occur. At the moment I only want complete utter clarity on > what a URE/nonrecoverable error (not even the rate) is in terms of > quantity. That's my main problem. Ok, but the earlier arguments in this thread over the relative merits of raid5 versus raid6 very much depend on the error rate. >> Note that a statement about the rate of a randomly occurring error >> is implicitly stating an average. > > Except that it has only one limiter, with the next notch a whole > order magnitude less error. So I don't see how you get an average > unless you're willing to just make assumptions about the bottom end. > It doesn't make sense that a manufacturer would state a maximum error > rate of X and then target that as an average. The average is > certainly well below the max. You are confused. The specification is a maximum of an average. An average that changes with time, and cannot be measured from single events. Phil ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Questions about bitrot and RAID 5/6 2014-01-24 18:12 ` Phil Turmel @ 2014-01-24 19:32 ` Chris Murphy 2014-01-24 19:57 ` Phil Turmel 0 siblings, 1 reply; 32+ messages in thread From: Chris Murphy @ 2014-01-24 19:32 UTC (permalink / raw) To: Phil Turmel; +Cc: linux-raid@vger.kernel.org List On Jan 24, 2014, at 11:12 AM, Phil Turmel <philip@turmel.org> wrote: > On 01/24/2014 12:59 PM, Chris Murphy wrote: >> >> On Jan 24, 2014, at 10:03 AM, Phil Turmel <philip@turmel.org> wrote: >>>> w many bits of loss occur with one URE? >>> >>> Complete physical sector. >> >> >> A complete physical sector represents 512 bytes / 4096 bits, or in >> the case of AF disks 4096 bytes / 32768 bits, of loss for one URE. >> Correct? >> >> So a URE is either 4096 bits nonrecoverable, or 32768 bits >> nonrecoverable, for HDDs. Correct? > > Yes. Note that the specification is for an *event*, not for a specific > number of bits lost. The error rate is not "bits lost per bits read", > it is "bits lost event per bits read". I don't understand this. You're saying it's a "1 URE event in 10^14 bits read" spec? Not a "1 bit nonrecoverable in 10^14 bits read" spec? It seems that a nonrecoverable read error rate of 1 in 2 would mean, 1 bit nonrecoverable per 2 bits read. Same as 512 bits nonrecoverable per 1024 bits read. Same as 1 sector nonrecoverable per 2 sectors read. >> At the moment a Poisson distribution is out of scope because m >> questions have nothing to do with how often, when, or how many, such >> URE's will occur. At the moment I only want complete utter clarity on >> what a URE/nonrecoverable error (not even the rate) is in terms of >> quantity. That's my main problem. > > Ok, but the earlier arguments in this thread over the relative merits of > raid5 versus raid6 very much depend on the error rate. Absolutely. But if I get the much earlier math wrong, then my understanding of the risk will be one or more orders of magnitude wrong. Whether underestimating or overestimating the risk, there are bad consequences. > >>> Note that a statement about the rate of a randomly occurring error >>> is implicitly stating an average. >> >> Except that it has only one limiter, with the next notch a whole >> order magnitude less error. So I don't see how you get an average >> unless you're willing to just make assumptions about the bottom end. >> It doesn't make sense that a manufacturer would state a maximum error >> rate of X and then target that as an average. The average is >> certainly well below the max. > > You are confused. Be specific, because…. > The specification is a maximum of an average. Stating the average rate is below the max specified rate, is consistent with the spec being a maximum of an average. I don't see where you're getting the average from when there isn't even an X < Y < Z published. All we have is X < Z. > An > average that changes with time, and cannot be measured from single events. On that point we agree. But with identical publish error rate specs we routinely see model drives give us more problems than others, even among the same manufacturer, even sometimes within a model varying by batch. So obviously the spec has a rather massive range to it. Chris Murphy -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Questions about bitrot and RAID 5/6 2014-01-24 19:32 ` Chris Murphy @ 2014-01-24 19:57 ` Phil Turmel 2014-01-24 20:54 ` Chris Murphy 0 siblings, 1 reply; 32+ messages in thread From: Phil Turmel @ 2014-01-24 19:57 UTC (permalink / raw) To: Chris Murphy; +Cc: linux-raid@vger.kernel.org List On 01/24/2014 02:32 PM, Chris Murphy wrote: >>> So a URE is either 4096 bits nonrecoverable, or 32768 bits >>> nonrecoverable, for HDDs. Correct? >> >> Yes. Note that the specification is for an *event*, not for a >> specific number of bits lost. The error rate is not "bits lost per >> bits read", it is "bits lost event per bits read". > > I don't understand this. You're saying it's a "1 URE event in 10^14 > bits read" spec? Not a "1 bit nonrecoverable in 10^14 bits read" > spec? > > It seems that a nonrecoverable read error rate of 1 in 2 would mean, > 1 bit nonrecoverable per 2 bits read. Same as 512 bits nonrecoverable > per 1024 bits read. Same as 1 sector nonrecoverable per 2 sectors > read. I don't know what more to say here. Your "seems" is not. [trim /] >> You are confused. > > Be specific, because…. > >> The specification is a maximum of an average. > > Stating the average rate is below the max specified rate, is > consistent with the spec being a maximum of an average. I don't see > where you're getting the average from when there isn't even an X < Y > < Z published. All we have is X < Z. I think you are also struggling with the fact the rate, on a single drive, aside from any specification, is *itself* an average. The manufacturer is stating that that average, which cannot be clearly understood without grasping how a Poisson distribution works (or similar distributions), won't exceed a certain value within the warranty life (a maximum). To achieve this, the manufacturer will certainly arrange to keep the average of these averages below the maximum. >> An average that changes with time, and cannot be measured from >> single events. > > On that point we agree. But with identical publish error rate specs > we routinely see model drives give us more problems than others, even > among the same manufacturer, even sometimes within a model varying by > batch. So obviously the spec has a rather massive range to it. To some extent, manufacturers have to make educated guesses about future performance on new products. They pay real $ penalties in warranty claims if they err greatly in one direction, and real $ penalties in "unnecessary" process equipment if the err greatly in the other direction. Obviously, some manufacturers have better knowledge of their own production facilities than others. Um, I think we're drifting off-topic now. Phil -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Questions about bitrot and RAID 5/6 2014-01-24 19:57 ` Phil Turmel @ 2014-01-24 20:54 ` Chris Murphy 2014-01-25 10:23 ` Dag Nygren ` (2 more replies) 0 siblings, 3 replies; 32+ messages in thread From: Chris Murphy @ 2014-01-24 20:54 UTC (permalink / raw) To: Phil Turmel; +Cc: linux-raid@vger.kernel.org List On Jan 24, 2014, at 12:57 PM, Phil Turmel <philip@turmel.org> wrote: > On 01/24/2014 02:32 PM, Chris Murphy wrote: >>>> So a URE is either 4096 bits nonrecoverable, or 32768 bits >>>> nonrecoverable, for HDDs. Correct? >>> >>> Yes. Note that the specification is for an *event*, not for a >>> specific number of bits lost. The error rate is not "bits lost per >>> bits read", it is "bits lost event per bits read". >> >> I don't understand this. You're saying it's a "1 URE event in 10^14 >> bits read" spec? Not a "1 bit nonrecoverable in 10^14 bits read" >> spec? >> >> It seems that a nonrecoverable read error rate of 1 in 2 would mean, >> 1 bit nonrecoverable per 2 bits read. Same as 512 bits nonrecoverable >> per 1024 bits read. Same as 1 sector nonrecoverable per 2 sectors >> read. > > I don't know what more to say here. Your "seems" is not. Please define "bits lost event" and cite some reference. Google returns exactly ONE hit on that, which is this thread. If we cannot agree on the units, we aren't talking about the same thing, at all, with a commensurately huge misunderstanding of the problem and thus the solution. So please to not merely respond to the 2nd paragraph you disagree with. Answer the two questions above that paragraph. If the spec is "1 URE event in 1E14 bits read" that is "1 bit nonrecoverable in 2.4E10 bits read" for a 512 byte physical sector drive, and hilariously becomes far worse at "1 bit nonrecoverable in 3E9 bits read" for 4096 byte physical sector drives. A very simple misunderstanding should have a very simple corrective answer rather than hand waiving and giving up. Chris Murphy ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Questions about bitrot and RAID 5/6 2014-01-24 20:54 ` Chris Murphy @ 2014-01-25 10:23 ` Dag Nygren 2014-01-25 15:48 ` Phil Turmel 2014-01-25 17:56 ` Wilson Jonathan 2 siblings, 0 replies; 32+ messages in thread From: Dag Nygren @ 2014-01-25 10:23 UTC (permalink / raw) To: Chris Murphy; +Cc: Phil Turmel, linux-raid@vger.kernel.org List On Friday 24 January 2014 13:54:35 Chris Murphy wrote: > > On Jan 24, 2014, at 12:57 PM, Phil Turmel <philip@turmel.org> wrote: > > > On 01/24/2014 02:32 PM, Chris Murphy wrote: > >>>> So a URE is either 4096 bits nonrecoverable, or 32768 bits > >>>> nonrecoverable, for HDDs. Correct? > >>> > >>> Yes. Note that the specification is for an *event*, not for a > >>> specific number of bits lost. The error rate is not "bits lost per > >>> bits read", it is "bits lost event per bits read". > >> > >> I don't understand this. You're saying it's a "1 URE event in 10^14 > >> bits read" spec? Not a "1 bit nonrecoverable in 10^14 bits read" > >> spec? > >> > >> It seems that a nonrecoverable read error rate of 1 in 2 would mean, > >> 1 bit nonrecoverable per 2 bits read. Same as 512 bits nonrecoverable > >> per 1024 bits read. Same as 1 sector nonrecoverable per 2 sectors > >> read. > > > > I don't know what more to say here. Your "seems" is not. > > Please define "bits lost event" and cite some reference. Google returns exactly ONE hit on that, which is this thread. If we cannot agree on the units, we aren't talking about the same thing, at all, with a commensurately huge misunderstanding of the problem and thus the solution. > > So please to not merely respond to the 2nd paragraph you disagree with. Answer the two questions above that paragraph. > > If the spec is "1 URE event in 1E14 bits read" that is "1 bit nonrecoverable in 2.4E10 bits read" for a 512 byte physical sector drive, and hilariously becomes far worse at "1 bit nonrecoverable in 3E9 bits read" for 4096 byte physical sector drives. > > A very simple misunderstanding should have a very simple corrective answer rather than hand waiving and giving up. I don't see your problem? 1. 1 bit unrecoverable = Data is wrong 2. 1 URE = Data is wrong They are the same thing! And that will give you the average probablility of getting a read error? Best Dag ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Questions about bitrot and RAID 5/6 2014-01-24 20:54 ` Chris Murphy 2014-01-25 10:23 ` Dag Nygren @ 2014-01-25 15:48 ` Phil Turmel 2014-01-25 17:44 ` Stan Hoeppner 2014-01-27 3:20 ` Chris Murphy 2014-01-25 17:56 ` Wilson Jonathan 2 siblings, 2 replies; 32+ messages in thread From: Phil Turmel @ 2014-01-25 15:48 UTC (permalink / raw) To: Chris Murphy; +Cc: linux-raid@vger.kernel.org List Hi Chris, I sat on my reply for a day so I could make sure my response was suitably professional. On 01/24/2014 03:54 PM, Chris Murphy wrote: > > On Jan 24, 2014, at 12:57 PM, Phil Turmel <philip@turmel.org> wrote: > >> On 01/24/2014 02:32 PM, Chris Murphy wrote: >>>>> So a URE is either 4096 bits nonrecoverable, or 32768 bits >>>>> nonrecoverable, for HDDs. Correct? >>>> >>>> Yes. Note that the specification is for an *event*, not for a >>>> specific number of bits lost. The error rate is not "bits >>>> lost per bits read", it is "bits lost event per bits read". >>> >>> I don't understand this. You're saying it's a "1 URE event in >>> 10^14 bits read" spec? Not a "1 bit nonrecoverable in 10^14 bits >>> read" spec? >>> >>> It seems that a nonrecoverable read error rate of 1 in 2 would >>> mean, 1 bit nonrecoverable per 2 bits read. Same as 512 bits >>> nonrecoverable per 1024 bits read. Same as 1 sector >>> nonrecoverable per 2 sectors read. >> >> I don't know what more to say here. Your "seems" is not. > > Please define "bits lost event" and cite some reference. Google > returns exactly ONE hit on that, which is this thread. If we cannot > agree on the units, we aren't talking about the same thing, at all, > with a commensurately huge misunderstanding of the problem and thus > the solution. I am not trying to define terminology, nor do I intend to. I have been paraphrasing and rephrasing in an attempt to help you understand the published terminology. It's hardly surprising that this thread is the only hit. As this list is *the* reference for linux raid technology, and is a reference for raid technology in general, I hope this helps future googlers understand the issue. > So please to not merely respond to the 2nd paragraph you disagree > with. Answer the two questions above that paragraph. The paired questions simply restated my previous answer with a few substitutions. I skipped what I presumed was a rhetorical form, and replied to your commentary in answer to the whole. > If the spec is "1 URE event in 1E14 bits read" that is "1 bit > nonrecoverable in 2.4E10 bits read" for a 512 byte physical sector > drive, and hilariously becomes far worse at "1 bit nonrecoverable in > 3E9 bits read" for 4096 byte physical sector drives. It is only hilariously far worse in *your* mind. > A very simple misunderstanding should have a very simple corrective > answer rather than hand waiving and giving up. I'm sorry if you think my attempts to teach have been hand-waving. I'm giving up. I can't help you further. Regards, Phil Turmel ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Questions about bitrot and RAID 5/6 2014-01-25 15:48 ` Phil Turmel @ 2014-01-25 17:44 ` Stan Hoeppner 2014-01-27 3:34 ` Chris Murphy 2014-01-27 3:20 ` Chris Murphy 1 sibling, 1 reply; 32+ messages in thread From: Stan Hoeppner @ 2014-01-25 17:44 UTC (permalink / raw) To: Phil Turmel, Chris Murphy; +Cc: linux-raid@vger.kernel.org List Hadn't paid attention to this thread til now, as the posts kept piling up. On 1/25/2014 9:48 AM, Phil Turmel wrote: > On 01/24/2014 03:54 PM, Chris Murphy wrote: ... >> If the spec is "1 URE event in 1E14 bits read" that is "1 bit >> nonrecoverable in 2.4E10 bits read" for a 512 byte physical sector >> drive, and hilariously becomes far worse at "1 bit nonrecoverable in >> 3E9 bits read" for 4096 byte physical sector drives. First, there is no distinction between the terms "unrecoverable read error" and "non recoverable read error". They are two terms describing the same event, the former most often used when referring to what occurs on the host, the latter in drive. I'll simply call it a hard read error. Sector size has no bearing on hard read error probability. The fact that a whole sector is failed on a single bit hard error is simply an artifact of one sector being the smallest request size possible from the host. You buy a loaf of bread and find mold on one slice of the 512 in the package. The store will only exchange the whole loaf, not just one slice. The probability that one slice of the 2.5 million sold that day might have mold doesn't change with the quantity in each package. The probability is the same. The only difference is how much you have to throw away. -- Stan ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Questions about bitrot and RAID 5/6 2014-01-25 17:44 ` Stan Hoeppner @ 2014-01-27 3:34 ` Chris Murphy 2014-01-27 7:16 ` Mikael Abrahamsson 0 siblings, 1 reply; 32+ messages in thread From: Chris Murphy @ 2014-01-27 3:34 UTC (permalink / raw) To: stan; +Cc: Phil Turmel, linux-raid@vger.kernel.org List On Jan 25, 2014, at 10:44 AM, Stan Hoeppner <stan@hardwarefreak.com> wrote: > > Sector size has no bearing on hard read error probability. http://www.seagate.com/tech-insights/advanced-format-4k-sector-hard-drives-master-ti/ See Figure 2: Media Defects and Areal Density. And also the paragraph under Figure 6. They are making a claim that bigger sectors (assuming otherwise equal media density) translates into reduced potential for unrecoverable error. > The fact > that a whole sector is failed on a single bit hard error is simply an > artifact of one sector being the smallest request size possible from the > host. I accept that. But since there are two sector sizes, one URE does not represent the same amount of data loss. > The probability is the same. The only difference is how much you have > to throw away. Fine, but if you accept that the probability of URE is the same between conventional and AF disks, you accept that there are more bits being lost on AF disks than conventional disks. Yet the available data says the opposite is true. Chris Murphy ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Questions about bitrot and RAID 5/6 2014-01-27 3:34 ` Chris Murphy @ 2014-01-27 7:16 ` Mikael Abrahamsson 2014-01-27 18:20 ` Chris Murphy 0 siblings, 1 reply; 32+ messages in thread From: Mikael Abrahamsson @ 2014-01-27 7:16 UTC (permalink / raw) To: Chris Murphy; +Cc: stan, Phil Turmel, linux-raid@vger.kernel.org List On Sun, 26 Jan 2014, Chris Murphy wrote: > I accept that. But since there are two sector sizes, one URE does not > represent the same amount of data loss. This is your interpretation of the claim. My interpretation of the data is that you will get an URE for every 10^14 bits read. How many data bits are lost by this URE is not important, either you lose 512 bytes or 4096 bytes. The bit error rate is still calculated on the total amount read, regardless of how many bits are lost when this URE happens. In data communication (ethernet for instance), we say the bit error rate is 10^-12. When you get a bit error, you're going to lose the entire packet. The size of the packet doesn't count in the error rate calculation. I don't see why HDDs would be different. The bit error rate is one thing, the consequence of the bit error and how much data is lost is another thing. You insist that these are directly coupled. > Fine, but if you accept that the probability of URE is the same between > conventional and AF disks, you accept that there are more bits being > lost on AF disks than conventional disks. Yet the available data says > the opposite is true. Where exactly in the available data does it say that? -- Mikael Abrahamsson email: swmike@swm.pp.se ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Questions about bitrot and RAID 5/6 2014-01-27 7:16 ` Mikael Abrahamsson @ 2014-01-27 18:20 ` Chris Murphy 2014-01-30 10:22 ` Mikael Abrahamsson 0 siblings, 1 reply; 32+ messages in thread From: Chris Murphy @ 2014-01-27 18:20 UTC (permalink / raw) To: Mikael Abrahamsson; +Cc: stan, Phil Turmel, linux-raid@vger.kernel.org List On Jan 27, 2014, at 12:16 AM, Mikael Abrahamsson <swmike@swm.pp.se> wrote: > On Sun, 26 Jan 2014, Chris Murphy wrote: > >> I accept that. But since there are two sector sizes, one URE does not represent the same amount of data loss. > > This is your interpretation of the claim. My interpretation of the data is that you will get an URE for every 10^14 bits read. How does your interpretation differ if the < sign is removed from the stated spec? Linguistically, how do you alter the description if that symbol is and isn't present? > How many data bits that are lost by this URE is not important, either you lose 512 bytes or 4096 bytes. The bit error rate is still calcluated on total amount read, regardless of how many bits are when this URE happens. > > In data communication (ethernet for instance), we say the bit error rate is 10^-12. When you get a bit error, you're going to lose the entire packet. The size of the packet doesn't count in the error rate calculation. I don't see why HDDs would be different. This is a really good point. Data communication standards are more readily published and must be interoperable. But even the network specs describe BER in "less than X in Y" or "X in Y, max" terms, not as an error occurring every Y bits. And we also know that the size of the packet does affect error rates, just not within an order of magnitude, such is also the case with HDDs between conventional and AF disks. But the allowance of up to but not including an order of magnitude is necessarily implied by the less than sign or it wouldn't be there. It's a continuum, it's not a statement of what will happen on average. It's a statement that error will occur but won't exceed X errors in Y bits. > The bit error rate is one thing, the consequence of the bit error and how much data is lost is another thing. You insist that these are directly coupled. No, I'm saying that the actual break down of bits lost translates to an irrational consequence when reading the spec as if there isn't a less than symbol present. > >> Fine, but if you accept that the probability of URE is the same between conventional and AF disks, you accept that there are more bits being lost on AF disks than conventional disks. Yet the available data says the opposite is true. > > Where exactly in the available data does it say that? Drive manufacturers saying AF disks means better error correction both because of the larger size of sectors, and also more ECC bits. 
http://www.snia.org/sites/default/files2/SDC2011/presentations/wednesday/CurtisStevens_Advanced_Format_Legacy.pdf http://storage.toshiba.eu/export/sites/toshiba-sdd/media/downloads/advanced_format/4KWhitePaper_TEG.pdf ", if the data field to be protected in each sector is larger than 512 bytes, the ECC algorithm could be improved to correct for a higher number of bits in error" http://www.idema.org/wp-content/plugins/download-monitor/download.php?id=1244 page 3, 2nd figure "errors per 512 bytes vs physical block size" There's loads of information on this… Chris Murphy -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 32+ messages in thread
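The packet analogy can be made concrete with the standard relation between a raw bit error rate and the probability that a larger unit is hit. A sketch, assuming independent bit errors and ignoring any ECC or CRC correction layered on top (a real drive's per-sector ECC changes the resulting URE rate, which is the point of the AF white papers cited above):

    BER = 1e-12   # illustrative raw bit error rate, as in the ethernet example

    def unit_error_probability(bits_per_unit, ber=BER):
        # probability that at least one bit in the unit is hit
        return 1 - (1 - ber) ** bits_per_unit

    for label, nbits in (("1500-byte frame", 1500 * 8),
                         ("512-byte sector", 512 * 8),
                         ("4096-byte sector", 4096 * 8)):
        print(f"{label}: {unit_error_probability(nbits):.2e} per unit read")

Bigger units are hit more often per unit even though the raw BER is unchanged; a loss-event rate (a URE spec) and a raw bit error rate remain different quantities, which is the distinction both sides are circling.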
* Re: Questions about bitrot and RAID 5/6 2014-01-27 18:20 ` Chris Murphy @ 2014-01-30 10:22 ` Mikael Abrahamsson 2014-01-30 20:59 ` Chris Murphy 0 siblings, 1 reply; 32+ messages in thread From: Mikael Abrahamsson @ 2014-01-30 10:22 UTC (permalink / raw) To: Chris Murphy; +Cc: stan, Phil Turmel, linux-raid@vger.kernel.org List [-- Attachment #1: Type: TEXT/PLAIN, Size: 2605 bytes --] On Mon, 27 Jan 2014, Chris Murphy wrote: >> This is your interpretation of the claim. My interpretation of the data is that you will get an URE for every 10^14 bits read. > > How does your interpretation differ if the < sign is removed from the > stated spec? Linguistically, how do you alter the description if that > symbol is and isn't present? The spec is an SLA. The manufacturer will try to beat that number to keep the SLA. Sometimes they're a lot better, sometimes they're worse and then they have to compensate the customer. > And we also know that the size of the packet does affect error rates, > just not within an order of magnitude, such is also the case with HDDs > between conventional and AF disks. But the allowance of up to but not > including an order of magnitude is necessarily implied by the less than > sign or it wouldn't be there. It's a continuum, it's not a statement of > what will happen on average. It's a statement that error will occur but > won't exceed X errors in Y bits. If you run the connection full, the packet size doesn't affect the bit error rate, only the result of the bit error. >> Where exactly in the available data does it say that? > > Drive manufacturers saying AF disks means better error correction both because of the larger size of sectors, and also more ECC bits. > > http://www.snia.org/sites/default/files2/SDC2011/presentations/wednesday/CurtisStevens_Advanced_Format_Legacy.pdf > > http://storage.toshiba.eu/export/sites/toshiba-sdd/media/downloads/advanced_format/4KWhitePaper_TEG.pdf > ", if the data field to be protected in each sector is larger than 512 bytes, the ECC algorithm could be improved to correct for a higher number of bits in error" > > http://www.idema.org/wp-content/plugins/download-monitor/download.php?id=1244 > page 3, 2nd figure "errors per 512 bytes vs physical block size" > > There's loads of information on this… The 4k sector design is an internal design means to achieve the specified SLA. So while 4k ECC is better, the manufacturer might use a higher density with a higher bit error rate, but which end result is still within the offered SLA because of better error correction method. So we're back to what the 10^-14 means. This is all you have to go on, because internally the manufacturer is free to use 512b sector size, 4k sector size, or pixie dust to achieve the specs they're offering the end customer. There is nothing that says that you as a customer gets to partake in any improvement due to internal changes within the unit. -- Mikael Abrahamsson email: swmike@swm.pp.se ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Questions about bitrot and RAID 5/6 2014-01-30 10:22 ` Mikael Abrahamsson @ 2014-01-30 20:59 ` Chris Murphy 0 siblings, 0 replies; 32+ messages in thread From: Chris Murphy @ 2014-01-30 20:59 UTC (permalink / raw) To: Mikael Abrahamsson; +Cc: linux-raid@vger.kernel.org List On Jan 30, 2014, at 3:22 AM, Mikael Abrahamsson <swmike@swm.pp.se> wrote: > On Mon, 27 Jan 2014, Chris Murphy wrote: > >>> This is your interpretation of the claim. My interpretation of the data is that you will get an URE for every 10^14 bits read. >> >> How does your interpretation differ if the < sign is removed from the stated spec? Linguistically, how do you alter the description if that symbol is and isn't present? > > THe spec is an SLA. THe manufacturer will try to beat that number to keep the SLA. Sometimes they're a lot better, sometimes they're worse and then they have to compensate the customer. The spec being an agreement that activates a warrantied replacement is a plausible argument. And I agree with the characterization that a particular drive may be perform better or worse. But the spec says "less than 1 per 1E14 bits" not "less than or equal to". If we actually get 1 URE in 1E14 bits read, that's busting the spec. An average of 1 URE in 1E14 bits read is likewise busting the spec. And if that were an average across a population it would mean drive manufacturers are on the hook for some 50% of their drives being replaced, and we know that is definitely not happening. And I'm not reading this as "the first time you get 2 UREs in less than 1E14 read" you've hit the SLA, although someone could possibly make that argument. I'd say overwhelmingly drives are performing a lot better than this spec. The idea we should expect a URE on average just by reading a 4TB drive three times in a row makes no sense. People would be having all sorts of problems, which they aren't. And if it really were a mean, this should be readily reproducible yet it isn't. In 12x full reads of three 3TB drives, zero UREs. That's over 100TB read without a URE. And these are ~2 year old drives. > >> And we also know that the size of the packet does affect error rates, just not within an order of magnitude, such is also the case with HDDs between conventional and AF disks. But the allowance of up to but not including an order of magnitude is necessarily implied by the less than sign or it wouldn't be there. It's a continuum, it's not a statement of what will happen on average. It's a statement that error will occur but won't exceed X errors in Y bits. > > If you run the connection full, the packet size doesn't affect the bit error rate, only the result of the bit error. Packet size doesn't affect raw bit error rate, it does affect the packet error rate. Bigger packets means a higher packet error rate. The URE is argued to be "errors per bits read" not "bit errors per bits read" so comparing URE to BER is mixing units. The URE is more analogous to packet error rate. The limitation with that comparison is that network CRC is the same regardless of packet size. Whereas the ECC in 512 byte and 4096 byte sector drives is not the same. > >>> Where exactly in the available data does it say that? >> >> Drive manufacturers saying AF disks means better error correction both because of the larger size of sectors, and also more ECC bits. 
>> >> http://www.snia.org/sites/default/files2/SDC2011/presentations/wednesday/CurtisStevens_Advanced_Format_Legacy.pdf >> >> http://storage.toshiba.eu/export/sites/toshiba-sdd/media/downloads/advanced_format/4KWhitePaper_TEG.pdf >> ", if the data field to be protected in each sector is larger than 512 bytes, the ECC algorithm could be improved to correct for a higher number of bits in error" >> >> http://www.idema.org/wp-content/plugins/download-monitor/download.php?id=1244 >> page 3, 2nd figure "errors per 512 bytes vs physical block size" >> >> There's loads of information on this… > > The 4k sector design is an internal design means to achieve the specified SLA. So while 4k ECC is better, the manufacturer might use a higher density with a higher bit error rate, but which end result is still within the offered SLA because of better error correction method. > > So we're back to what the 10^-14 means. This is all you have to go on, because internally the manufacturer is free to use 512b sector size, 4k sector size, or pixie dust to achieve the specs they're offering the end customer. There is nothing that says that you as a customer gets to partake in any improvement due to internal changes within the unit. Agreed, insofar as we only know the max error rate anticipated by the spec. We do not know the average occurrence based on the spec. To compute that we need a scientific sample of drives, with all of the drives producing error rates greater than 1 URE in 1E14 bits discarded. An unweighted average would be useless because such drives should trigger a warranty replacement. And I don't know of any published studies that have done that - presumably this has been done by drive manufacturers though. Chris Murphy -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 32+ messages in thread
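Chris's 100+ TB of clean reads can be checked against the "spec rate is the mean" reading with the same simple model as before (independent errors, 1e-14 treated as the true mean rate):

    import math

    rate = 1e-14                 # assumed mean UREs per bit read
    bits = 12 * 3 * 3e12 * 8     # 12 full passes over three 3 TB drives, roughly 108 TB
    lam = rate * bits            # expected UREs under that assumption

    print(f"expected UREs: {lam:.1f}")            # about 8.6
    print(f"P(zero UREs): {math.exp(-lam):.1e}")  # about 2e-4

Seeing no errors at all would be roughly a 1-in-5000 outcome if the worst-case figure were the true mean, which is consistent with the view that healthy drives normally run well below the published ceiling.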
* Re: Questions about bitrot and RAID 5/6 2014-01-25 15:48 ` Phil Turmel 2014-01-25 17:44 ` Stan Hoeppner @ 2014-01-27 3:20 ` Chris Murphy 1 sibling, 0 replies; 32+ messages in thread From: Chris Murphy @ 2014-01-27 3:20 UTC (permalink / raw) To: Phil Turmel; +Cc: linux-raid@vger.kernel.org List On Jan 25, 2014, at 8:48 AM, Phil Turmel <philip@turmel.org> wrote: > >> If the spec is "1 URE event in 1E14 bits read" that is "1 bit >> nonrecoverable in 2.4E10 bits read" for a 512 byte physical sector >> drive, and hilariously becomes far worse at "1 bit nonrecoverable in >> 3E9 bits read" for 4096 byte physical sector drives. > > It is only hilariously far worse in *your* mind. How would you characterize a nearly one order magnitude difference in bit error rate? The claim that "you can expect a new URE every 12T or so read, on average" is only congruent with its corollary which is that we should then expect a greater error rate with AF disks, on average. It's simple math because there are two different values for URE loss depending on whether the sector is 4096 bits or 32768 bits. If the URE odds are identical regardless of whether the disk is AF or not, then that means we have greater rates of loss on AF disks. And we're told that can't be true because the disk manufacturers have said the reason for bigger sectors is to reduce the error rate. A 512 byte sector is smaller on today's higher density media, while defects are approximately the same size, which means conventional sized sectors are at increasing risk of a larger percentage being affected by defect to the point they can't be recovered. The way to rectify this apparent problem is to reject the claim "you can expect a new URE every 12TB or so read, on average." We don't know the average. We only know the max. It very well could be the average is one URE in every 100TB for AF disks and one URE every 50TB for conventional disks. Chris Murphy ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Questions about bitrot and RAID 5/6 2014-01-24 20:54 ` Chris Murphy 2014-01-25 10:23 ` Dag Nygren 2014-01-25 15:48 ` Phil Turmel @ 2014-01-25 17:56 ` Wilson Jonathan 2014-01-27 4:07 ` Chris Murphy 2 siblings, 1 reply; 32+ messages in thread From: Wilson Jonathan @ 2014-01-25 17:56 UTC (permalink / raw) To: Chris Murphy; +Cc: Phil Turmel, linux-raid@vger.kernel.org List On Fri, 2014-01-24 at 13:54 -0700, Chris Murphy wrote: > On Jan 24, 2014, at 12:57 PM, Phil Turmel <philip@turmel.org> wrote: > > Please define "bits lost event" and cite some reference. Google returns exactly ONE hit on that, which is this thread. If we cannot agree on the units, we aren't talking about the same thing, at all, with a commensurately huge misunderstanding of the problem and thus the solution. > > So please to not merely respond to the 2nd paragraph you disagree with. Answer the two questions above that paragraph. > > If the spec is "1 URE event in 1E14 bits read" that is "1 bit nonrecoverable in 2.4E10 bits read" for a 512 byte physical sector drive, and hilariously becomes far worse at "1 bit nonrecoverable in 3E9 bits read" for 4096 byte physical sector drives. > > A very simple misunderstanding should have a very simple corrective answer rather than hand waiving and giving up. As I understand it, its "1" error (of no determinate size) for every 10E14 bits read.... The size of sectors would make no difference to the raw amount of data read (although it does open an interesting question of what the 10E14 actually means, does it also include any check summing data, or is it purely "data") nor the fact that 1 URE statistically might happen. The amount of data corrupted is, I would have thought, variable depending on what forms of checksums etc. was used and is indeterminable without knowing the exact forms of work done on the raw data, how many checksum values there might be for a "block" and so on, to try and recover a meaningful, and valid, return... it could be that just 1 bit of data was corrupted or it could be that the entire sectors worth of data is garbage; it could also be that the 1 URE is in such a place that it causes multiple sectors to be invalid... Unless there is some industry standard document outlining what a "URE" is it would be impossible to know for sure, and even then it may not even define it to a specific amount of data corruption per data read; just that "an error" is statistically likely to have happened. > > > Chris Murphy > > -- > To unsubscribe from this list: send the line "unsubscribe linux-raid" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Questions about bitrot and RAID 5/6 2014-01-25 17:56 ` Wilson Jonathan @ 2014-01-27 4:07 ` Chris Murphy 0 siblings, 0 replies; 32+ messages in thread From: Chris Murphy @ 2014-01-27 4:07 UTC (permalink / raw) To: Wilson Jonathan; +Cc: linux-raid@vger.kernel.org List On Jan 25, 2014, at 10:56 AM, Wilson Jonathan <piercing_male@hotmail.com> wrote: > On Fri, 2014-01-24 at 13:54 -0700, Chris Murphy wrote: >> On Jan 24, 2014, at 12:57 PM, Phil Turmel <philip@turmel.org> wrote: >> >> Please define "bits lost event" and cite some reference. Google returns exactly ONE hit on that, which is this thread. If we cannot agree on the units, we aren't talking about the same thing, at all, with a commensurately huge misunderstanding of the problem and thus the solution. >> >> So please to not merely respond to the 2nd paragraph you disagree with. Answer the two questions above that paragraph. >> >> If the spec is "1 URE event in 1E14 bits read" that is "1 bit nonrecoverable in 2.4E10 bits read" for a 512 byte physical sector drive, and hilariously becomes far worse at "1 bit nonrecoverable in 3E9 bits read" for 4096 byte physical sector drives. >> >> A very simple misunderstanding should have a very simple corrective answer rather than hand waiving and giving up. > > As I understand it, its "1" error (of no determinate size) for every > 10E14 bits read…. Well as I understand it the < symbol is the "less than" sign, so if the rate is errors per bits, then it's less than 1 error for ever 10E14 bits read. > The size of sectors would make no difference to the raw amount of data > read (although it does open an interesting question of what the 10E14 > actually means, does it also include any check summing data, or is it > purely "data") nor the fact that 1 URE statistically might happen. It's an interesting question if "bits read" includes non-user data bits, such as the ECC bits. I'm also curious if there's an ATA or SCSI command that instructs the drive to hand over those 512 bytes, such as they are, despite a read error, or if we're just screwed. > The amount of data corrupted is, I would have thought, variable > depending on what forms of checksums etc. was used and is indeterminable > without knowing the exact forms of work done on the raw data, how many > checksum values there might be for a "block" and so on, to try and > recover a meaningful, and valid, return... it could be that just 1 bit > of data was corrupted or it could be that the entire sectors worth of > data is garbage; it could also be that the 1 URE is in such a place that > it causes multiple sectors to be invalid… I'm willing to bet dollars to donuts that every vendor has differences in the effectiveness of their ECC, yet all of them can detect and correct merely 1 bit in 512/4096 bytes, and actually probably quite a few more bit errors than this. Chris Murphy -- To unsubscribe from this list: send the line "unsubscribe linux-raid" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Questions about bitrot and RAID 5/6 2014-01-23 18:53 ` Phil Turmel 2014-01-23 21:38 ` Chris Murphy @ 2014-01-23 22:06 ` David Brown 1 sibling, 0 replies; 32+ messages in thread From: David Brown @ 2014-01-23 22:06 UTC (permalink / raw) To: Phil Turmel; +Cc: Chris Murphy, linux-raid@vger.kernel.org List On 23/01/14 19:53, Phil Turmel wrote: > Hi Chris, > > On 01/23/2014 12:28 PM, Chris Murphy wrote: >> It's a fair point. I've recently run across some claims on a separate >> forum with hardware raid5 arrays containing all enterprise drives, >> with regularly scrubs, yet with such excessive implosions that some >> integrators have moved to raid6 and completely discount the use of >> raid5. The use case is video production. This sounds suspiciously >> like microcode or raid firmware bugs to me. I just don't see how ~6-8 >> enterprise drives in a raid5 translates into significantly higher >> array collapses that then essentially vanish when it's raid6. > > I just wanted to address this one point. Raid6 is many orders of > magnitude more robust than raid5 in the rebuild case. Let me illustrate: > > How to lose data in a raid5: > > 1) Experience unrecoverable read errors on two of the N drives at the > same *time* and same *sector offset* of the two drives. Absurdly > improbable. On the order of 1x10^-36 for 1T consumer-grade drives. > > 2a) Experience hardware failure on one drive followed by 2b) an > unrecoverable read error in another drive. You can expect a hardware > failure rate of a few percent per year. Then, when rebuilding on the > replacement drive, the odds skyrocket. On large arrays, the odds of > data loss are little different from the odds of a hardware failure in > the first place. > > How to lose data in a raid6: > > 1) Experience unrecoverable read errors on *three* of the N drives at > the same *time* and same *sector offset* of the drives. Even more > absurdly improbable. On the order of 1x10^-58 for 1T consumer-grade drives. > > 2) Experience hardware failure on one drive followed by unrecoverable > read errors on two of the remaining drives at the same *time* and same > *sector offset* of the two drives. Again, absurdly improbable. Same as > for the raid5 case "1". > > 3) Experience hardware failure on two drives followed by an > unrecoverable read error in another drive. As with raid5 on large > arrays, you probably can't complete the rebuild error-free. But the > odds of this event are subject to management--quick reponse to case "2" > greatly reduces the odds of case "3". > > It is no accident that raid5 is becoming much less popular. > > Phil Don't forget the other possible cause of read errors - some muppet sees that one drive has a complete failure, and when going to replace it pulls out the wrong drive... Raid 6 gives extra protection against that most unbounded of error sources - human error! Anyway, the issue of checksums and the type of bitrot situation invented for the Ars Technica article is about /undetected/ errors - dealing with /detected/ unrecoverable read errors is easy. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: Questions about bitrot and RAID 5/6 2014-01-23 17:28 ` Chris Murphy 2014-01-23 18:53 ` Phil Turmel @ 2014-01-23 22:02 ` David Brown 1 sibling, 0 replies; 32+ messages in thread From: David Brown @ 2014-01-23 22:02 UTC (permalink / raw) To: Chris Murphy; +Cc: linux-raid@vger.kernel.org List On 23/01/14 18:28, Chris Murphy wrote: > > On Jan 23, 2014, at 1:18 AM, David Brown <david.brown@hesbynett.no> > wrote: >> >> That's true - but (as pointed out in Neil's blog) there can be >> other reasons why one block is "wrong" compared to the others. >> Supposing you need to change a single block in a raid 6 stripe. >> That means you will change that block and both parity blocks. If >> the disk system happens to write out the data disk, but there is a >> crash before the parities are written, then you will get a stripe >> that is consistent if you "erase" the new data block - when in fact >> it is the parity blocks that are wrong. > > Sure but I think that's an idealized scenario of a bad scenario in > that if there's a crash it's entirely likely that we end up with one > or more torn writes to a chunk, rather than completely correctly > written data chunk, and parities that aren't written at all. Chances > are we do in fact end up with corruption in this case, and there's > simply not enough information to unwind it. The state of the data > chunk is questionable, and the state of P+Q are questionable. There's > really not a lot to do here, although it seems better to have the > parities recomputed from the data chunks *such as they are* rather > than permit parity reconstruction to effectively rollback just one > chunk. Agreed. > >> Another reason for avoiding "correcting" data blocks is that it >> can confuse the filesystem layer if it has previously read in that >> block (and the raid layer cannot know for sure that it has not done >> so), and then the raid layer were to "correct" it without the >> filesystem's knowledge. > > In this hypothetical implementation, I'm suggesting that data chunks > have P' and Q' computed, and compared to on-disk P and Q, for all > reads. So there wouldn't be a condition as you suggest. If whatever > was previously read in was "OK" but then somehow a bit flips on the > next read, is detect, and corrected, it's exactly what you'd want to > have happen. > Yes, I guess if all reads were handled in this way, then it is very unlikely that you'd get something different in a latter read. > >> So automatic "correction" here would be hard, expensive (erasure >> needs a lot more computation than generating or checking parities), >> and will sometimes make problems worse. > > I could see a particularly reliable implementation (ECC memory, good > quality components including the right drives, all correctly > configured, and on UPS) where this would statistically do more good > than bad. And for all I know there are proprietary hardware raid6 > implementations that do this. But it's still not really fixing the > problem we want fixed, so it's understandable the effort goes > elsewhere. Indeed. It is not that I think the idea is so bad - given random failures it is likely to do more good than harm. I just don't think it would do enough good to be worth the effort, especially when alternatives like btrfs checksums are more useful for less work. Of course, btrfs checksums don't help if you want to use XFS or another filesystem! > > >> >>> >>> I think in the case of a single, non-overlapping corruption in a >>> data chunk, that RS parity can be used to localize the error. 
If >>> that's true, then it can be treated as a "read error" and the >>> normal reconstruction for that chunk applies. >> >> It /could/ be done - but as noted above it might not help (even >> though statistically speaking it's a good guess), and it would >> involve very significant calculations on every read. At best, it >> would mean that every read involves reading a whole stripe >> (crippling small read performance) and parity calculations - making >> reads as slow as writes. This is a very big cost for detecting an >> error that is /incredibly/ rare. > > It mostly means that the default chunk size needs to be reduced, a > long standing argument, to avoid this very problem. Those who need > big chunk sizes for large streaming (media) writes, get less of a > penalty for a too small chunk size in this hypothetical > implementation than the general purpose case would. > > Btrfs computes crc32c for every extent read and compares with what's > stored in metadata, and its reads are not meaningfully faster with > the nodatasum option. And granted that's not apples to apples, > because it's only computing a checksum for the extent read, not the > equivalent of a whole stripe. So it's always efficient. Also I don't > know to what degree the Q computation is hardware accelerated, > whereas Btrfs crc32c checksum is hardware accelerated (SSE 4.2) for > some time now. The Q checksum is fast on modern cpus (it uses SSE acceleration), but not as fast as crc32c. It is the read of the whole stripe that makes the real difference. If you have a 4+2 raid6 with 512 KB chunks, and you read a 20 KB file, you've got to read in 128 blocks from 6 drives, and calculate and compare 1 MB worth of parity from 2 MB worth of data. With btrfs, you've got to calculate and compare a 32-bit checksum from 20 KB of data. Even if the Q calculations were as fast per byte as the crc32c, that's still a factor of 1000 difference - and you also have the seek time of 6 drives rather than 1 drive. Smaller chunks would make this a little less terrible, but overall raid6 throughput can be affected by chunk size. > > >> (The link posted earlier in this thread suggested 1000 incidents in >> 41 PB of data. At that rate, I know that it is far more likely >> that my company building will burn down, losing everything, than >> that I will ever see such an error in the company servers. And >> I've got a backup.) > > It's a fair point. I've recently run across some claims on a separate > forum with hardware raid5 arrays containing all enterprise drives, > with regularly scrubs, yet with such excessive implosions that some > integrators have moved to raid6 and completely discount the use of > raid5. The use case is video production. This sounds suspiciously > like microcode or raid firmware bugs to me. I just don't see how ~6-8 > enterprise drives in a raid5 translates into significantly higher > array collapses that then essentially vanish when it's raid6. > > > Chris Murphy > > -- To unsubscribe from this list: send the line "unsubscribe > linux-raid" in the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 32+ messages in thread
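The cost David describes for a hypothetical verify-parity-on-every-read raid6 comes from read and compute amplification. A rough sketch of the arithmetic, using the 4+2 array, 512 KB chunk and 20 KB read from his example (the geometry and chunk size are simply the figures quoted above, not a recommendation):

    KB = 1024
    data_disks, parity_disks = 4, 2
    chunk = 512 * KB
    wanted = 20 * KB                          # the small read the application asked for

    stripe_data = data_disks * chunk          # 2 MB of data per full stripe
    stripe_parity = parity_disks * chunk      # 1 MB of P+Q per full stripe
    touched = stripe_data + stripe_parity     # bytes read to verify one stripe

    print(f"bytes wanted:          {wanted}")
    print(f"bytes read to verify:  {touched} ({touched / wanted:.0f}x amplification)")
    print(f"parity input per read: {stripe_data} vs {wanted} checksummed by a per-extent crc32c")

Two orders of magnitude more data touched and fed through the parity math per small read, plus seeks on six spindles instead of one, is where the small-read penalty comes from; smaller chunks shrink the factor but do not remove it.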