From: "Austin S. Hemmelgarn" <ahferroin7@gmail.com>
To: Dark Penguin <darkpenguin@yandex.ru>, linux-btrfs@vger.kernel.org
Subject: Re: Unexpected raid1 behaviour
Date: Mon, 18 Dec 2017 08:31:56 -0500 [thread overview]
Message-ID: <5e9df444-0cbc-5854-2058-3435efe78c59@gmail.com> (raw)
In-Reply-To: <5A357909.8010206@yandex.ru>
On 2017-12-16 14:50, Dark Penguin wrote:
> Could someone please point me towards something to read about how btrfs
> handles multiple devices? Namely, kicking faulty devices and re-adding them.
>
> I've been using btrfs on single devices for a while, but now I want to
> start using it in raid1 mode. I booted into an Ubuntu 17.10 LiveCD and
> tried to see how it handles various situations. The experience left
> me very surprised; I've tried a number of things, all of which produced
> unexpected results.
Expounding a bit on Duncan's answer with some more specific info.
>
> I create a btrfs raid1 filesystem on two hard drives and mount it.
>
> - When I pull one of the drives out (simulating a simple cable failure,
> which happens pretty often to me), the filesystem sometimes goes
> read-only. ???
> - But only after a while, and not always. ???
The filesystem won't go read-only until it hits an I/O error, and it's
non-deterministic how long it will be before that happens on an idle
filesystem that only sees read access (because all the files being read
may already be in the page cache, so no I/O hits the missing device).
> - When I fix the cable problem (plug the device back), it's immediately
> "re-added" back. But I see no replication of the data I've written onto
> a degraded filesystem... Nothing shows any problems, so "my filesystem
> must be ok". ???
One of two things happens in this case, and why there is no re-sync
depends on which one, but both ultimately come down to the fact that
BTRFS treats I/O errors as at worst transient device failures. Either:
1. The device reappears with the same name. This happens if the time it
was disconnected is less than the kernel's command timeout (30 seconds
by default). BTRFS may not even notice that the device was gone (and if
it doesn't, then a re-sync isn't necessary, since it will retry all the
writes it needs to). If it does notice, BTRFS assumes the I/O errors
were temporary, and keeps using the device after logging the errors. If
this happens, then you need to manually re-sync things by scrubbing the
filesystem (or balancing, but scrubbing is preferred, as it runs faster
and only rewrites what actually needs rewriting).
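To make the first case concrete, here's a rough sketch of checking the
kernel's command timeout and re-syncing with a scrub. The device name
/dev/sdb and mount point /mnt/data are hypothetical placeholders:

```shell
# The block layer's per-device command timeout, in seconds; a cable
# glitch shorter than this may go completely unnoticed by BTRFS.
cat /sys/block/sdb/device/timeout

# After the device comes back under the same name, re-sync the mirror
# by scrubbing; scrub only rewrites blocks whose checksums don't match.
btrfs scrub start /mnt/data

# Check progress and whether any errors were uncorrectable:
btrfs scrub status /mnt/data
```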
2. The device reappears with a different name. In this case, the device
was gone long enough that the block layer is certain it was
disconnected, and thus when it reappears and BTRFS still holds open
references to the old device node, it gets a new device node. In this
case, if the 'new' device is scanned, BTRFS will recognize it as part of
the FS, but will keep using the old device node. The correct fix here
is to unmount the filesystem, re-scan all devices, and then remount the
filesystem and manually re-sync with a scrub.
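The unmount/re-scan/remount sequence for the second case looks roughly
like this (again, /mnt/data is a hypothetical mount point):

```shell
# BTRFS still holds the stale device node open, so cycle the mount:
umount /mnt/data
btrfs device scan         # pick up the device under its new node
mount /mnt/data
btrfs scrub start /mnt/data   # re-sync copies that diverged while degraded
```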
> - If I unmount the filesystem and then mount it back, I see all my
> recent changes lost (everything I wrote during the "degraded" period).
I'm not quite sure about this, but I think BTRFS is rolling back to the
last common generation number for some reason.
> - If I continue working with a degraded raid1 filesystem (even without
> damaging it further by re-adding the faulty device), after a while it
> won't mount at all, even with "-o degraded".
This is (probably) a known bug relating to chunk handling. In a two
device volume using a raid1 profile with a missing device, older kernels
(I don't remember when the fix went in, but I could have sworn it was in
4.13) will (erroneously) generate single-profile chunks when they need
to allocate new chunks. When you then go to mount the filesystem again,
the check for degraded mount-ability of the FS fails, because there is
both a missing device and single-profile chunks present.
Now, even without that bug, it's never a good idea to run a storage
array degraded for any extended period of time, regardless of what type
of array it is (BTRFS, ZFS, MD, LVM, or even hardware RAID). By keeping
it in 'degraded' mode, you're essentially telling the system that the
array will be fixed in a reasonably short time-frame, which impacts how
it handles the array. If you're not going to fix it almost immediately,
you should almost always reshape the array to account for the missing
device if at all possible, as that will improve relative data safety and
generally get you better performance than running degraded will.
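One possible sequence for reshaping a two-device raid1 volume that has
permanently lost a device, rather than running it degraded long-term
(device and mount point names are hypothetical, and exact behavior
varies with kernel and btrfs-progs versions):

```shell
# Mount the surviving device writable in degraded mode:
mount -o degraded /dev/sda /mnt/data

# Convert data to the single profile and metadata to dup, so the
# profiles no longer require two devices:
btrfs balance start -dconvert=single -mconvert=dup /mnt/data

# Now the record of the missing device can be dropped:
btrfs device remove missing /mnt/data
```

The convert has to happen before the remove; with raid1 profiles still
present, removing the missing device would leave the profile constraints
unsatisfiable.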
>
> I can't wrap my head about all this. Either the kicked device should not
> be re-added, or it should be re-added "properly", or it should at least
> show some errors and not pretend nothing happened, right?..
BTRFS is not the best at error reporting at the moment. If you check
the output of `btrfs device stats` for that filesystem, though, it
should show non-zero values in the error counters (note that these
counters are cumulative: they count since the last time they were
reset, or since the FS was created if they have never been reset).
Similarly,
scrub should report errors, there should be error messages in the kernel
log, and switching the FS to read-only mode _is_ technically reporting
an error, as that's standard error behavior for most sensible
filesystems (ext[234] being the notable exception, they just continue as
if nothing happened).
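As a sketch of what "monitoring things manually" can look like, here's a
small parser that flags non-zero counters in `btrfs device stats`
output. The sample text follows the tool's line format (e.g.
`[/dev/sdb].write_io_errs   0`); the function name is my own:

```python
import re

def nonzero_error_counters(stats_output: str) -> dict:
    """Return {(device, counter): value} for every non-zero counter
    in `btrfs device stats` output."""
    errors = {}
    for line in stats_output.splitlines():
        # Lines look like: "[/dev/sdb].write_io_errs   3"
        m = re.match(r"\[([^\]]+)\]\.(\w+)\s+(\d+)", line.strip())
        if m and int(m.group(3)) != 0:
            errors[(m.group(1), m.group(2))] = int(m.group(3))
    return errors

sample = """\
[/dev/sdb].write_io_errs   3
[/dev/sdb].read_io_errs    0
[/dev/sdb].flush_io_errs   0
[/dev/sdb].corruption_errs 0
[/dev/sdb].generation_errs 0
"""
print(nonzero_error_counters(sample))
# {('/dev/sdb', 'write_io_errs'): 3}
```

Anything this returns is worth investigating, remembering that the
counters are cumulative until explicitly reset.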
>
> I must be missing something. Is there an explanation somewhere about
> what's really going on during those situations? Also, do I understand
> correctly that upon detecting a faulty device (a write error), nothing
> is done about it except logging an error into the 'btrfs device stats'
> report? No device kicking, no notification?.. And what about degraded
> filesystems - is it absolutely forbidden to work with them without
> converting them to a "single" filesystem first?..
As mentioned above, going read-only _is_ a notification that something
is wrong. Translating that (and the error counter increase, and the
kernel log messages) into a user visible notification is not really the
job of BTRFS, especially considering that no other filesystem or device
manager does so either (yes, you can get nice notifications from LVM,
but they aren't _from_ LVM itself, they're from other software that
watches for errors, and the same type of software works just fine for
BTRFS too). If you're this worried about it and don't want to keep on
top of it yourself by monitoring things manually, you really need to
look into a tool like monit [1] that can handle this for you.
[1] https://mmonit.com/monit/
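For illustration, a hypothetical monit service entry might look like
this, assuming you write a small wrapper script (the path and name here
are made up) that exits non-zero on trouble, e.g. by running
`btrfs device stats` and checking the counters:

```
# /etc/monit/conf.d/btrfs (hypothetical)
check program btrfs-data with path "/usr/local/bin/check-btrfs"
    if status != 0 then alert
```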