* Why does BTRFS (still) forget what device to write to?
From: waxhead @ 2017-03-05 16:26 UTC
To: linux-btrfs
I am doing some tests on BTRFS with both data and metadata in raid1.
uname -a
Linux daffy 4.9.0-1-amd64 #1 SMP Debian 4.9.6-3 (2017-01-28) x86_64
GNU/Linux
btrfs --version
btrfs-progs v4.7.3
01. mkfs.btrfs /dev/sd[fgh]1
02. mount /dev/sdf1 /btrfs_test/
03. btrfs balance start -dconvert=raid1 /btrfs_test/
04. copied a lot of 3-4 MB files to it (about 40 GB)...
05. Started to compress some of the files to create one larger file...
06. Pulled the (sata) plug on one of the drives... (sdf1)
07. dmesg shows that the kernel is rejecting I/O to the offline device
([sdf] killing request)
08. BTRFS error (device sdf1): bdev /dev/sdf1 errs: wr 0, rd 1, flush 0,
corrupt 0, gen 0
09. the previous line repeats, with the rd count increasing
10. Reconnecting the sdf1 drive again makes it show up as sdi1
11. btrfs fi sh /btrfs_test shows sdi1 as the correct device id (1).
12. Yet dmesg shows tons of errors like this: BTRFS error (device sdf1):
bdev /dev/sdi1 errs: wr 37182, rd 39851, flush 1, corrupt 0, gen 0...
13. and the above line repeats with increasing wr and rd errors.
14. BTRFS never seems to "get in tune again" while the filesystem is
mounted.
The conclusion appears to be that the device is back in the btrfs pool
under its correct device ID, so why does btrfs still try to write to the
wrong device node (or does it?!).
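
For reference, this is roughly what I keep looking at after replugging
the drive; the mount point and device names are the ones from the steps
above, the rest is just how I happen to check it, so treat it as a
sketch rather than gospel:

btrfs fi show /btrfs_test        # which node the pool reports for devid 1
btrfs device stats /btrfs_test   # per-device error counters; they keep climbing
dmesg | tail -n 20               # the kernel's view of the rejected I/O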
The good thing here is that BTRFS still works fine after an unmount and
a mount again. Running a scrub on the filesystem cleans up tons of
errors, but no uncorrectable errors.
However, it says total bytes scrubbed: 94.21GB with 75 errors ... and
further down it says corrected errors: 72, uncorrectable errors: 0,
unverified errors: 0.

Why 75 vs 72 errors?! Did it correct them all or not?
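
For completeness, the sequence that got it back into shape was
essentially this (sdg1 is just one of the surviving members, and -Bd
simply runs the scrub in the foreground with per-device stats, if I
read the man page right):

umount /btrfs_test
mount /dev/sdg1 /btrfs_test
btrfs scrub start -Bd /btrfs_test   # prints the error summary quoted above
btrfs scrub status /btrfs_test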
Just for the record: I have recently lost 1x 5-device BTRFS filesystem
as well as 2x 3-device BTRFS filesystems set up in RAID1 (both data and
metadata) by toying around with them. The 2x filesystems I lost were
using all bad disks (all 3 of them), but the one mentioned here uses
good (but old) 400GB drives.
By lost I mean that mount does not recognize the filesystem, but btrfs
fi sh does show that all devices are present. I did not make notes for
those filesystems, but it appears that RAID1 is a bit fragile.
I don't need to recover anything. This is just a "toy system" for
playing around with btrfs and doing some tests.
* Re: Why does BTRFS (still) forget what device to write to?
From: Duncan @ 2017-03-06 2:48 UTC
To: linux-btrfs
waxhead posted on Sun, 05 Mar 2017 17:26:36 +0100 as excerpted:
> I am doing some tests on BTRFS with both data and metadata in raid1.
>
> uname -a
> Linux daffy 4.9.0-1-amd64 #1 SMP Debian 4.9.6-3 (2017-01-28) x86_64
> GNU/Linux
>
> btrfs --version
> btrfs-progs v4.7.3
>
> 01. mkfs.btrfs /dev/sd[fgh]1
> 02. mount /dev/sdf1 /btrfs_test/
> 03. btrfs balance start -dconvert=raid1 /btrfs_test/
> 04. copied a lot of 3-4 MB files to it (about 40 GB)...
> 05. Started to compress some of the files to create one larger file...
> 06. Pulled the (sata) plug on one of the drives... (sdf1)
> 07. dmesg shows that the kernel is rejecting I/O to the offline device
> ([sdf] killing request)
> 08. BTRFS error (device sdf1): bdev /dev/sdf1 errs: wr 0, rd 1, flush 0,
> corrupt 0, gen 0
> 09. the previous line repeats, with the rd count increasing
> 10. Reconnecting the sdf1 drive again makes it show up as sdi1
> 11. btrfs fi sh /btrfs_test shows sdi1 as the correct device id (1).
> 12. Yet dmesg shows tons of errors like this: BTRFS error (device sdf1):
> bdev /dev/sdi1 errs: wr 37182, rd 39851, flush 1, corrupt 0, gen 0...
> 13. and the above line repeats with increasing wr and rd errors.
> 14. BTRFS never seems to "get in tune again" while the filesystem is
> mounted.
>
> The conclusion appears to be that the device is back in the btrfs pool
> under its correct device ID, so why does btrfs still try to write to the
> wrong device node (or does it?!).
The base problem is that btrfs doesn't (yet) have any concept of a device
disconnecting and reconnecting "live", only after unmount/remount.
When a device drops out, btrfs will continue to attempt to write to it.
Things will continue normally on all other devices, and only after some
time will btrfs actually give up on the device. (I /believe/ that
happens when the level of dirty memory exceeds some safety threshold,
with the unwritten writes taking up a larger and larger part of dirty
memory until something gives. However, I'm not a dev, just a user and
list regular, and this is just my supposition filling in the blanks, so
don't take it as gospel unless you get confirmation either directly from
the code or from an actual dev.)
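
If you want to watch that supposed mechanism from the outside (again,
my guess, not something I've confirmed in the code), the generic
counters are enough to see the backlog grow while the device is gone;
/btrfs_test is your mount point from the steps above:

watch -n 1 "grep -E 'Dirty|Writeback' /proc/meminfo"   # unwritten data piling up
watch -n 1 "btrfs device stats /btrfs_test"            # error counters climbing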
If the outage is short enough for the kernel to bring back the device as
the same device node, great, btrfs can and does resume writing to it.
However, once the outage is long enough that the kernel brings back the
physical device as a different device node, yes, btrfs filesystem show
will show the device back as its normal ID, but that information isn't
properly communicated to the "live" still-mounted filesystem, and it
continues to attempt writing to the old device node.
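
You can see that disconnect for yourself, I believe, by comparing what
the userspace tools scan against what the mounted filesystem is actually
holding open (I'm going from memory on the sysfs layout, and the UUID is
whatever btrfs fi show prints for your filesystem):

btrfs fi show /btrfs_test                        # userspace rescans and finds sdi1
ls -l /sys/fs/btrfs/<filesystem-UUID>/devices/   # the mounted fs' idea of its members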
There are plans for, and even patches that introduce limited support
for, live detection and automatic (re)integration of a new or
reintroduced device, but those patches are part of a long-term
development project and, last I read, weren't in a state where they even
applied cleanly to current kernels, as they've not been kept current and
have gone stale.
Of course it should be kept in mind that btrfs is still under heavy
development, and while stabilizing, it isn't considered, certainly not
by its devs, to be anywhere near feature complete and stable. At times
such as this that applies even to features generally regarded as being
about as stable and mature as btrfs itself is -- that is, still
stabilizING, not yet fully stable and mature. Keep backups and be
prepared to use them if you value your data, because you may indeed be
calling on them!
In that state, it's only to be expected that there will still be some
incomplete features such as this, where manual intervention may be
required that wouldn't be in more complete/stable/mature solutions.
Basically, it comes with the territory.
> The good thing here is that BTRFS still works fine after an unmount
> and a mount again. Running a scrub on the filesystem cleans up tons of
> errors, but no uncorrectable errors.
Correct. An unmount will leave all that data unwritten to the device it
still considers missing, so of course those checksums aren't going to
match. On remount, btrfs sees the device again, and should and AFAIK
consistently does note the difference in commit generations, pulling from
the updated device where they differ. A scrub can then be used to bring
the outdated device back in sync.
But be sure to do that scrub as soon as possible. Should further
instability continue to drop out devices, or further not entirely
graceful unmounts/shutdowns occur, the damage may get worse and not be
entirely repairable, certainly not with only a simple scrub.
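
If you want to convince yourself beforehand which copy is the stale one,
comparing superblock generations should do it; dump-super ships with
reasonably recent btrfs-progs (older ones have it as btrfs-show-super),
and the device names below are the ones from your report:

btrfs inspect-internal dump-super /dev/sdi1 | grep '^generation'   # reattached device, older
btrfs inspect-internal dump-super /dev/sdg1 | grep '^generation'   # up-to-date member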
> However, it says total bytes scrubbed: 94.21GB with 75 errors ... and
> further down it says corrected errors: 72, uncorrectable errors: 0,
> unverified errors: 0.
>
> Why 75 vs 72 errors?! Did it correct them all or not?
From my own experience (I deliberately ran btrfs raid1 with a failing
device for a while to test this sort of thing, and btrfs' checksumming
worked very well with scrub to fix things... as long as the remaining
device didn't start to fail with its mirror copy at the same places, of
course), I can quite confidently say it's fixing them all, as long as
unverified errors are 0 and you don't have some other source of errors,
say bad memory, introducing further problems, including some that
checksumming won't fix because the data's bad before it gets a checksum.

Of course you can rerun the scrub just to be sure, but here, the only
times it found more errors was when unverified errors popped up.
(Unverified errors are where an error at a higher level in the metadata
kept lower metadata blocks as well as data blocks from being checksum-
verified. Once the upper level errors were fixed, the lower level ones
could then be tested. Back when I was running with the gradually failing
device, this required manual rerun of the scrub if unverified errors
showed up. I believe patches have been introduced since then that rerun
the scrub on the unverified error blocks when necessary, once the upper
level blocks have been corrected, thus making it possible to verify the
lower level ones. So as long as there are no uncorrectable errors, as
there shouldn't be in raid1 unless both copies of a block end up failing
checksum verification, there should now be no unverified errors either.
Of course if both copies fail checksum verification, then there's going
to be uncorrectable errors, and if they're at the higher metadata levels,
there could then still be unverified errors as a result.)
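
When unverified errors do show up, the manual rerun I mentioned is easy
enough to script; a minimal sketch, assuming your /btrfs_test mount
point and that the summary wording ("unverified errors: N") is what your
btrfs-progs prints:

# rerun scrub until a pass reports zero unverified errors
while btrfs scrub start -B /btrfs_test | grep -q 'unverified errors: [1-9]'; do
    echo "unverified errors remain, scrubbing again"
done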
What I believe is going on in such cases (72 vs. 75 errors) is that some
blocks get counted more than once because they have multiple references.
Each such block is only fixed once, but that one fix clears multiple
counted errors, one for each time the block was referenced.
> Just for the record: I have recently lost 1x 5-device BTRFS filesystem
> as well as 2x 3-device BTRFS filesystems set up in RAID1 (both data and
> metadata) by toying around with them. The 2x filesystems I lost were
> using all bad disks (all 3 of them), but the one mentioned here uses
> good (but old) 400GB drives.
>
> By lost I mean that mount does not recognize the filesystem, but btrfs
> fi sh does show that all devices are present. I did not make notes for
> those filesystems, but it appears that RAID1 is a bit fragile.
>
> I don't need to recover anything. This is just a "toy system" for
> playing around with btrfs and doing some tests.
FWIW, I lost a couple some time ago, but none for over a year now, I
believe. However, I was lucky and was able to recover current data using
btrfs restore. (I had backups, but they weren't entirely current. Of
course, if you've read many of my posts you'll know I tend to strongly
emphasize backups if the data is of value, and I realize that I was in
reality defining the data in the delta between the current and backed-up
versions as worth less than the time and trouble necessary to update the
backup. So if I had lost that data, it would have been entirely my own
weighed decision that led to the loss. But btrfs restore was actually
able to restore the data for me, so I didn't have to deal with the loss
I was knowingly risking. I don't count on restore working /every/ time,
but if I need to try it, I can still be glad when it /does/ work. =:^)
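
If you ever do want to try it on one of those unmountable test
filesystems, restore works against the raw, unmounted devices; a minimal
sketch, where the device and the destination directory are just examples
(-D is the dry-run option, -v verbose):

btrfs restore -D -v /dev/sdf1 /tmp/ignored   # dry run: only lists what it would recover
btrfs restore -v /dev/sdf1 /mnt/recovery     # copy files out to some other filesystem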
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman