Linux Btrfs filesystem development
From: "Austin S. Hemmelgarn" <ahferroin7@gmail.com>
To: Duncan <1i5t5.duncan@cox.net>, linux-btrfs@vger.kernel.org
Subject: Re: btrfs check inconsistency with raid1, part 1
Date: Tue, 22 Dec 2015 10:44:11 -0500
Message-ID: <56796FCB.5080303@gmail.com>
In-Reply-To: <pan$30f15$f6f8b13b$d72dd248$aaee075f@cox.net>

On 2015-12-22 05:23, Duncan wrote:
> Kai Krakow posted on Tue, 22 Dec 2015 02:48:04 +0100 as excerpted:
>
>> I just wondered if btrfs allows for the case where both stripes could
>> have valid checksums despite of btrfs-RAID - just because a failure
>> occurred right on the spot.
>>
>> Is this possible? What happens then? If yes, it would mean not to
>> blindly trust the RAID without doing the homeworks.
>
> The one case where btrfs could get things wrong that I know of is as I
> discovered in my initial pre-btrfs-raid1-deployment testing...
I've had exactly one case where I got _really_ unlucky and had a bunch
of media errors on a BTRFS raid1 setup that happened to result in
something similar to this.  Things happened such that one copy of a
block (we'll call this one copy 1) had correct data, and the other
(we'll call this one copy 2) had incorrect data.  One copy of the
metadata had the correct checksum for copy 2, and the other metadata
copy had the correct checksum for copy 1, but, due to a hash collision,
the checksum for the metadata block was correct for both copies.  As a
result of this, I ended up getting a read error about 25% of the time
(which then forced a re-read of the data), the correct data about 37.5%
of the time, and incorrect data the remaining 37.5% of the time.  I
actually ran the numbers on how likely this was to happen (more than a
dozen errors on different disks in blocks that happened to reference
each other, and a hash collision involving a 4 byte difference between
two 16k blocks of data), and it's effectively a statistical
impossibility (it's more likely that one of Amazon's or Google's data
centers goes offline due to hardware failures than that this will
happen again).  Obviously it did happen, but I would say it's such an
unrealistic edge case that you probably don't need to worry about it
(although I learned _a lot_ about the internals of BTRFS in trying to
figure out what was going on).
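
For what it's worth, the 25% / 37.5% / 37.5% split is what you get from
a very simple two-attempt model.  The sketch below is only that
arithmetic, not anything taken from the BTRFS code: the function name,
the 1/2 mismatch probability per attempt, and the assumption that a
successful read is equally likely to return either copy are all
illustrative.

```python
from fractions import Fraction


def read_outcomes(p_mismatch=Fraction(1, 2), attempts=2):
    """Exact outcome probabilities for a read that retries on a mismatch.

    Assumption: every attempt mismatches independently with p_mismatch,
    and a read error is reported only if all attempts mismatch.
    """
    p_error = p_mismatch ** attempts
    p_success = 1 - p_error
    # Assumption: a successful read returns copy 1 or copy 2 equally often.
    return {
        "read_error": p_error,
        "copy_1": p_success / 2,
        "copy_2": p_success / 2,
    }


outcomes = read_outcomes()
for name, p in outcomes.items():
    print(f"{name}: {p} = {float(p):.1%}")
```

With the defaults this gives 1/4 for the read error and 3/8 for each
copy, which matches the percentages above.
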
>
[...snip...]
>
> From all I know and from everything others told me when I asked at the
> time, which copy you get then is entirely unpredictable, and worse yet,
> you might get btrfs acting on divergent metadata when writing to the
> other device.
>
This is indeed the case.  Because of how BTRFS verifies checksums,
there's a roughly 50% chance that the first read attempt will pick a
mismatched checksum and data block.  That triggers a re-read, which has
an independent 50% chance of picking a mismatch again, so about 25% of
the reads that actually go to the device return a read error.  The
remaining 75% of the time, you'll get either one block or the other.
These numbers of course get skewed by the VFS cache.  In my case above,
the affected file is almost never in cache when it gets accessed, so I
saw numbers relatively close to what you would get without the cache.
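
If you want to see those uncached numbers fall out empirically, here is
a small Monte Carlo sketch of the behaviour described above.  It is a
toy model under stated assumptions, not the kernel's read path: each
attempt is assumed to pick one of the two checksums and one of the two
data copies uniformly at random, and simulate_read() and simulate() are
made-up names.

```python
import random
from collections import Counter


def simulate_read(rng, max_attempts=2):
    """One uncached read in the toy model: retry while checksum and data disagree."""
    for _ in range(max_attempts):
        checksum_for = rng.choice((1, 2))  # data copy the chosen checksum matches
        data_copy = rng.choice((1, 2))     # data copy that was actually read
        if checksum_for == data_copy:
            return f"copy_{data_copy}"
    return "read_error"  # every attempt picked a mismatched pair


def simulate(n_reads=100_000, seed=0):
    """Return the observed frequency of each outcome over n_reads reads."""
    rng = random.Random(seed)
    counts = Counter(simulate_read(rng) for _ in range(n_reads))
    return {outcome: count / n_reads for outcome, count in counts.items()}


freqs = simulate()
for outcome in ("read_error", "copy_1", "copy_2"):
    print(f"{outcome}: {freqs.get(outcome, 0.0):.3f}")
```

The read_error frequency should land close to 0.25, with the rest split
roughly evenly between the two copies.  Nothing here models the VFS
cache, so it only corresponds to reads that actually hit the device.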
