linux-raid.vger.kernel.org archive mirror
From: David Brown <david.brown@hesbynett.no>
To: Chris Murphy <lists@colorremedies.com>
Cc: "linux-raid@vger.kernel.org List" <linux-raid@vger.kernel.org>
Subject: Re: Questions about bitrot and RAID 5/6
Date: Thu, 23 Jan 2014 23:02:55 +0100	[thread overview]
Message-ID: <52E1918F.3080206@hesbynett.no> (raw)
In-Reply-To: <DE020E0C-E6EC-48E9-8D7B-09F5A65A2DF5@colorremedies.com>

On 23/01/14 18:28, Chris Murphy wrote:
>
> On Jan 23, 2014, at 1:18 AM, David Brown <david.brown@hesbynett.no>
> wrote:
>>
>> That's true - but (as pointed out in Neil's blog) there can be
>> other reasons why one block is "wrong" compared to the others.
>> Supposing you need to change a single block in a raid 6 stripe.
>> That means you will change that block and both parity blocks.  If
>> the disk system happens to write out the data disk, but there is a
>> crash before the parities are written, then you will get a stripe
>> that is consistent if you "erase" the new data block - when in fact
>> it is the parity blocks that are wrong.
>
> Sure, but I think that's an idealized version of a bad scenario, in
> that if there's a crash it's entirely likely that we end up with one
> or more torn writes to a chunk, rather than a completely correctly
> written data chunk and parities that aren't written at all. Chances
> are we do in fact end up with corruption in this case, and there's
> simply not enough information to unwind it. The state of the data
> chunk is questionable, and the state of P+Q are questionable. There's
> really not a lot to do here, although it seems better to have the
> parities recomputed from the data chunks *such as they are* rather
> than permit parity reconstruction to effectively rollback just one
> chunk.

Agreed.
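
As a toy sketch of that rollback hazard (plain Python, with a single
XOR parity standing in for P and Q, and small ints standing in for
whole blocks - the raid6 case is the same hazard with two parities):

```python
def parity(blocks):
    """Single XOR parity, standing in for raid5/6 parity."""
    p = 0
    for b in blocks:
        p ^= b
    return p

# A consistent stripe on disk: 3 data blocks + parity.
data = [0xAA, 0xBB, 0xCC]
p = parity(data)

# Update block 1, but "crash" before the new parity reaches disk.
data[1] = 0xEE          # new data is on disk
                        # p is still the OLD parity -> stripe inconsistent

# A scrubber that trusts the parity and "corrects" block 1 by
# reconstructing it from the other blocks plus the stale parity:
reconstructed = p ^ data[0] ^ data[2]
print(hex(reconstructed))   # -> 0xbb: the old value; the write is rolled back
```

The stripe ends up internally consistent again, but the successfully
written data block has been silently reverted.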

>
>> Another reason for avoiding "correcting" data blocks is that it
>> can confuse the filesystem layer if it has previously read in that
>> block (and the raid layer cannot know for sure that it has not done
>> so), and then the raid layer were to "correct" it without the
>> filesystem's knowledge.
>
> In this hypothetical implementation, I'm suggesting that data chunks
> have P' and Q' computed, and compared to on-disk P and Q, for all
> reads. So there wouldn't be a condition as you suggest. If whatever
> was previously read in was "OK" but then somehow a bit flips on the
> next read, and it is detected and corrected, that's exactly what
> you'd want to have happen.
>

Yes, I guess if all reads were handled in this way, then it is very 
unlikely that you'd get something different in a later read.

>
>> So automatic "correction" here would be hard, expensive (erasure
>> needs a lot more computation than generating or checking parities),
>> and will sometimes make problems worse.
>
> I could see a particularly reliable implementation (ECC memory, good
> quality components including the right drives, all correctly
> configured, and on UPS) where this would statistically do more good
> than bad. And for all I know there are proprietary hardware raid6
> implementations that do this. But it's still not really fixing the
> problem we want fixed, so it's understandable the effort goes
> elsewhere.

Indeed.  It is not that I think the idea is so bad - given random 
failures it is likely to do more good than harm.  I just don't think it 
would do enough good to be worth the effort, especially when 
alternatives like btrfs checksums are more useful for less work.  Of 
course, btrfs checksums don't help if you want to use XFS or another 
filesystem!

>
>
>>
>>>
>>> I think in the case of a single, non-overlapping corruption in a
>>> data chunk, that RS parity can be used to localize the error. If
>>> that's true, then it can be treated as a "read error" and the
>>> normal reconstruction for that chunk applies.
>>
>> It /could/ be done - but as noted above it might not help (even
>> though statistically speaking it's a good guess), and it would
>> involve very significant calculations on every read.  At best, it
>> would mean that every read involves reading a whole stripe
>> (crippling small read performance) and parity calculations - making
>> reads as slow as writes. This is a very big cost for detecting an
>> error that is /incredibly/ rare.
>
> It mostly means that the default chunk size needs to be reduced, a
> long-standing argument, to avoid this very problem. Those who need
> big chunk sizes for large streaming (media) writes get less of a
> penalty from a too-small chunk size in this hypothetical
> implementation than the general-purpose case would.
>
> Btrfs computes crc32c for every extent read and compares with what's
> stored in metadata, and its reads are not meaningfully faster with
> the nodatasum option. And granted that's not apples to apples,
> because it's only computing a checksum for the extent read, not the
> equivalent of a whole stripe. So it's always efficient. Also I don't
> know to what degree the Q computation is hardware accelerated,
> whereas the Btrfs crc32c checksum has been hardware accelerated
> (SSE 4.2) for some time now.

The Q checksum is fast on modern cpus (it uses SSE acceleration), but 
not as fast as crc32c.  It is the read of the whole stripe that makes 
the real difference.  If you have a 4+2 raid6 with 512 KB chunks, and 
you read a 20 KB file, you've got to read in 128 blocks from each of 6 
drives, and calculate and compare 1 MB worth of parity from 2 MB worth 
of data.  With btrfs, you've got to calculate and compare a 32-bit 
checksum from 20 KB of data.  Even if the Q calculations were as fast 
per byte as the crc32c, that's still a factor of 100 difference - and 
you also have the seek time of 6 drives rather than 1 drive.

Smaller chunks would make this a little less terrible, but the chunk 
size also affects overall raid6 throughput, so it cannot simply be 
made tiny.
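
On the localisation point from earlier in the thread: with both the P 
and Q syndromes, a single corrupt data chunk can indeed be pinned down. 
A rough Python sketch - the field and generator match md's raid6, but 
everything else here is a toy, not how the md driver is structured:

```python
# Single-error localisation using raid6-style P/Q syndromes.
# Arithmetic is in GF(2^8) with the 0x11D polynomial and generator 2,
# the same field md's raid6 uses; one byte stands in for a whole chunk.

def gf_mul(a, b):
    """Multiply in GF(2^8) mod x^8+x^4+x^3+x^2+1 (0x11D)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return r

def gf_pow(a, n):
    r = 1
    for _ in range(n):
        r = gf_mul(r, a)
    return r

LOG = {gf_pow(2, i): i for i in range(255)}   # discrete log, base 2

def pq(data):
    """raid6-style parities: P = xor of D_i, Q = xor of 2^i * D_i."""
    p = q = 0
    for i, d in enumerate(data):
        p ^= d
        q ^= gf_mul(gf_pow(2, i), d)
    return p, q

def locate_single_error(data, p, q):
    """Index of the single corrupt data chunk, or None if consistent."""
    p2, q2 = pq(data)
    sp, sq = p ^ p2, q ^ q2                   # syndromes
    if sp == 0 and sq == 0:
        return None
    # One bad chunk j with error e gives sp = e and sq = 2^j * e,
    # so j = log2(sq) - log2(sp) (mod 255).
    return (LOG[sq] - LOG[sp]) % 255

data = [0x11, 0x22, 0x33, 0x44]
p, q = pq(data)
data[2] ^= 0x5A                               # corrupt chunk 2
print(locate_single_error(data, p, q))        # -> 2
```

The catch is exactly the cost above: doing this on every read means 
reading and syndrome-checking the whole stripe, not just the requested 
chunk.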

>
>
>> (The link posted earlier in this thread suggested 1000 incidents in
>> 41 PB of data.  At that rate, I know that it is far more likely
>> that my company building will burn down, losing everything, than
>> that I will ever see such an error in the company servers.  And
>> I've got a backup.)
>
> It's a fair point. I've recently run across some claims on a separate
> forum about hardware raid5 arrays containing all enterprise drives,
> with regular scrubs, yet with such excessive implosions that some
> integrators have moved to raid6 and completely discount the use of
> raid5. The use case is video production. This sounds suspiciously
> like drive microcode or raid firmware bugs to me. I just don't see
> how ~6-8 enterprise drives in a raid5 translate into significantly
> more array collapses that then essentially vanish with raid6.
>
>
> Chris Murphy
>
>


