From: Austin S Hemmelgarn <ahferroin7@gmail.com>
To: Duncan <1i5t5.duncan@cox.net>, linux-btrfs@vger.kernel.org
Subject: Re: Expected behavior of bad sectors on one drive in a RAID1
Date: Tue, 20 Oct 2015 15:48:07 -0400	[thread overview]
Message-ID: <56269A77.1080709@gmail.com> (raw)
In-Reply-To: <pan$312c1$94a948da$df1250c5$4706911@cox.net>


On 2015-10-20 14:54, Duncan wrote:
> But tho I'm a user not a dev, and thus haven't actually checked the
> source code itself, my belief here is with Russ and disagrees with
> Austin.  Based on what I've read on the wiki and seen here previously,
> btrfs at runtime (that is, not during scrub) actually repairs the
> problem on-hardware as well, from that second copy, not just fetching
> it for use without the repair.  The distinction between normal runtime
> error detection and scrub is thus that scrub systematically checks
> everything, while normal runtime on most systems will only check what
> it reads in normal usage, thus covering the data that's regularly used
> but not the data that's only stored and never read.
>
> *WARNING*:  In my experience, at least on initial mount, btrfs isn't
> particularly robust when the number of read errors on one device starts
> to go up dramatically.  Despite never seeing an error in scrub that it
> couldn't fix, twice I had enough reads fail during a mount that the
> mount itself failed, and I couldn't mount successfully despite repeated
> attempts.  In both cases I was able to use btrfs restore to restore the
> contents of the filesystem to some other place (as it happens, the
> reiserfs on spinning rust I use for my media filesystem, which, being
> meant for big media files, had enough space to hold the reasonably
> small btrfs mentioned above), and then ultimately recreate the
> filesystem with mkfs.btrfs.
>
> But given that, despite the failure to mount, neither SMART nor dmesg
> ever mentioned anything about the "good" device having errors, I'm left
> to conclude that btrfs itself ultimately crashed on attempting to mount
> the filesystem, even tho only the one copy was bad.  After a couple of
> those events I started scrubbing much more frequently, thus fixing the
> errors while btrfs could still mount the filesystem and /let/ me run a
> scrub.  It was actually those more frequent scrubs that quickly became
> a hassle and led me to give up on the device.  If btrfs had been able
> to fall back to the second/valid copy even in that case, as it really
> should have done, then I would quite possibly have waited a good bit
> longer to replace the dying device.
>
> So on that one I'd say, to be sure, get confirmation either directly
> from the code (if you can read it) or from a dev who has actually
> looked at it and is basing his post on that, tho I still /believe/
> btrfs runtime-corrects checksumming issues actually on-device, if
> there's a validating second copy it can use to do so.
>
FWIW, my assessment is based on some testing I did a while back (kernel 
3.14 IIRC) using a VM.  The (significantly summarized, of course) 
procedure I used was as follows, with a rough script sketch of the 
corruption step after the list:
1. Create a minimal Linux system in a VM (in my case, I just used a 
stage3 tarball for Gentoo, in a paravirtualized Xen domain) using BTRFS 
as the root filesystem with a raid1 setup.  Verify that it actually 
boots.
2. Shut down the VM, use btrfs-progs on the host to find the physical 
location of an arbitrary file (ideally one that is not touched at all 
during the boot process; IIRC I used one of the e2fsprogs binaries), and 
then intentionally clear the CRC of one of the copies of a block from 
the file.
3. Boot the VM, read the file.
4. Shut down the VM again.
5. Check whether the block whose checksum you cleared has a valid 
checksum now.
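
For reference, the corruption step looks roughly like the sketch below 
(Python, run as root on the host).  Note that this version corrupts the 
data block on one mirror rather than clearing the stored checksum the 
way I did, since the csum tree item is much harder to locate from a 
script, and that the mount point, file name, device name, and output 
parsing are illustrative assumptions for a typical two-device raid1, 
not verified against any particular btrfs-progs version:

#!/usr/bin/env python3
# Sketch: overwrite ONE mirror of a file's first data block on a
# two-device btrfs raid1.  Paths, devices, and parsing are assumptions.
import re
import subprocess

MOUNTPOINT = "/mnt/vmroot"               # VM's raid1 FS, mounted on the host
TARGET = MOUNTPOINT + "/usr/bin/chattr"  # a file not read during boot
DEVICE = "/dev/vdb"                      # either member device works here

# 1. Get the file's btrfs *logical* address while the FS is mounted.
#    FIEMAP (what filefrag uses) reports btrfs logical addresses, not
#    per-device offsets, in filesystem blocks (assumed 4096 bytes here).
#    The exact output format varies between e2fsprogs versions.
out = subprocess.run(["filefrag", "-v", TARGET],
                     capture_output=True, text=True, check=True).stdout
m = re.search(r"^\s*0:\s+\d+\.\.\s*\d+:\s+(\d+)\.\.", out, re.MULTILINE)
assert m, "could not parse filefrag output"
logical = int(m.group(1)) * 4096

# 2. Unmount before poking at the raw devices.
subprocess.run(["umount", MOUNTPOINT], check=True)

# 3. Map the logical address to each mirror's device + physical offset.
#    btrfs-map-logical ships with btrfs-progs.
out = subprocess.run(["btrfs-map-logical", "-l", str(logical), DEVICE],
                     capture_output=True, text=True, check=True).stdout
mirrors = re.findall(
    r"mirror\s+\d+\s+logical\s+\d+\s+physical\s+(\d+)\s+device\s+(\S+)", out)
assert mirrors, "could not parse btrfs-map-logical output"

# 4. Overwrite 4 KiB of the first mirror with garbage (assumes a plain,
#    uncompressed, non-inline extent).
phys, dev = int(mirrors[0][0]), mirrors[0][1]
with open(dev, "r+b") as f:
    f.seek(phys)
    f.write(b"\xde\xad\xbe\xef" * 1024)
print("corrupted mirror on", dev, "at physical offset", phys)

The mapping step is the important part: on a raid1 profile, filefrag 
only gives you the btrfs logical address, and you still need the chunk 
mapping (which btrfs-map-logical does for you) to find where each 
mirror actually lives on its device.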

I repeated this more than a dozen times using different files and 
different methods of reading the file, and each time the CRC I had 
cleared was untouched.  Based on this, unless BTRFS does some kind of 
deferred re-write that doesn't get forced during a clean unmount of the 
FS, I felt it was relatively safe to conclude that it did not 
automatically fix corrupted blocks.  I did not, however, test corrupting 
the block itself instead of the checksum, but I doubt that it would 
change anything in this case.
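
In the data-corruption variant sketched above, the equivalent of step 5 
is just comparing the raw bytes on the member device before and after 
the VM run.  A minimal sketch, using the (hypothetical) device and 
offset found by the mapping step in the earlier script:

#!/usr/bin/env python3
# Sketch: after shutting the VM down again, re-read the raw bytes at the
# same physical offset on the same member device and see whether btrfs
# rewrote them.  DEVICE and PHYS are assumed values carried over from
# the mapping step, not real output.

DEVICE = "/dev/vdb"
PHYS = 1094713344                     # offset reported by btrfs-map-logical
GARBAGE = b"\xde\xad\xbe\xef" * 1024  # what the corruption script wrote

with open(DEVICE, "rb") as f:
    f.seek(PHYS)
    ondisk = f.read(len(GARBAGE))

if ondisk == GARBAGE:
    print("mirror still corrupted: nothing rewrote this copy")
else:
    print("bytes changed: this copy was rewritten")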

As I mentioned, many veteran sysadmins would not want the FS driver 
fixing this automatically without at least some kind of notification.  
This preference largely dates back to traditional RAID1, 
where the system has no way to know for certain which copy is correct in 
the case of a mismatch, and therefore to safely fix mismatches, the 
admin needs to intervene.  While it is possible to fix this safely 
because of how BTRFS is designed, there is still the possibility of it 
getting things wrong.  There was one time I had a BTRFS raid1 filesystem 
where one copy of a block got corrupted but miraculously had a correct 
CRC (which should be statistically next to impossible), and the other 
copy of the block was correct, but the CRC for it was wrong (which, 
while unlikely, is very much possible).  In such a case (which was a 
serious pain to debug), automatically 'fixing' the supposedly bad block 
would have resulted in data loss.  Of course, the chance of that 
happening more than once in a lifetime is astronomically small, but it 
is still possible.
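
To make that failure mode concrete, here is a toy model of what 
checksum-driven repair logic would do with such a pair of copies.  This 
is purely illustrative and not btrfs code: btrfs uses crc32c over 4 KiB 
blocks, while this uses zlib's plain crc32 over short strings, and the 
"stored" checksums are made-up values:

#!/usr/bin/env python3
# Toy illustration only -- NOT btrfs's repair code.
import zlib

good_data = b"the data that was actually meant to be on disk"
bad_data  = b"the data after silent corruption on one mirror"

# Pathological case from the anecdote above: the checksum stored for
# copy 1 happens to match its *corrupted* data, while the checksum
# stored for copy 2 was itself damaged even though its data is fine.
copies = [
    ("copy 1", bad_data,  zlib.crc32(bad_data)),  # bad data, "valid" csum
    ("copy 2", good_data, 0xDEADBEEF),            # good data, bad csum
]

def pick_copy(copies):
    """Return the first copy whose data matches its stored checksum,
    the way any checksum-driven repair has to."""
    for name, data, csum in copies:
        if zlib.crc32(data) == csum:
            return name, data
    return None, None

name, data = pick_copy(copies)
print("repair would keep:", name)
# Prints "copy 1" -- the corrupted one.  An automatic fix here would
# overwrite the good copy with garbage, i.e. silent data loss.

With no checksums at all (traditional RAID1) you can't even rank the 
copies, which is why that world defaults to making the admin decide; 
with checksums you usually can, except in freak cases like this one.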

It's also worth noting that ZFS has been considered mature for more than 
a decade now, and the ZFS developers _still_ aren't willing to risk 
their users' data with something like this, which should be an immediate 
red flag for anyone developing a filesystem with a similar feature set.


