From: Nikolay Borisov <nborisov@suse.com>
To: Anand Jain <anand.jain@oracle.com>,
Peter Grandi <pg@btrfs.list.sabi.co.UK>,
Linux fs Btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: Unexpected raid1 behaviour
Date: Mon, 18 Dec 2017 14:10:47 +0200
Message-ID: <9a2f4ed4-26a0-833d-1225-5a5773ab7a61@suse.com>
In-Reply-To: <be7e5573-4247-89bd-00f0-7d84120ce582@oracle.com>
On 18.12.2017 10:49, Anand Jain wrote:
>
>
>> Put another way, the multi-device design is/was based on the
>> demented idea that block-devices that are missing are/should be
>> "remove"d, so that a 2-device volume with a 'raid1' profile
>> becomes a 1-device volume with a 'single'/'dup' profile, and not
>> a 2-device volume with a missing block-device and an incomplete
>> 'raid1' profile,
>
> Agreed. IMO degraded-raid1-single-chunk is an accidental feature
> caused by [1], which we should revert, since:
> - balance (to raid1 chunk) may fail if FS is near full
> - recovery (to raid1 chunk) will take more writes as compared
> to recovery under degraded raid1 chunks
>
> [1]
> commit 95669976bd7d30ae265db938ecb46a6b7f8cb893
> Btrfs: don't consider the missing device when allocating new chunks
>
> There is an attempt to fix it [2], but it will certainly take time as
> there are many things to fix around this.
>
> [2]
> [PATCH RFC] btrfs: create degraded-RAID1 chunks
>
>> even if things have been awkwardly moving in
>> that direction in recent years.
>> Note the above is not totally accurate today because various
>> hacks have been introduced to work around the various issues.
> Maybe you are talking about [3]. Please note it's a workaround
> patch (as I mentioned in the original patch). It's nice that
> we fixed the availability issue through this patch, and the
> helper function it added also helps other developments.
> But for the long term we need to work on [2].
>
> [3]
> btrfs: Introduce a function to check if all chunks a OK for degraded rw
> mount
>
>>> Thus, if a device disappears, to get it back you really have
>>> to reboot, or at least unload/reload the btrfs kernel module,
>>> in order to clear the stale device state and have btrfs
>>> rescan and reassociate devices with the matching filesystems.
>>
>> IIRC that is not quite accurate: a "missing" device can be
>> nowadays "replace"d (by "devid") or "remove"d, the latter
>> possibly implying profile changes:
>>
>>
>> https://btrfs.wiki.kernel.org/index.php/Using_Btrfs_with_Multiple_Devices#Using_add_and_delete
>>
>>
>> Terrible tricks like this also work:
>>
>> https://www.spinics.net/lists/linux-btrfs/msg48394.html
>
> It's replace, which isn't about bringing back a missing disk.
>
>
>>> Meanwhile, as mentioned above, there's active work on proper
>>> dynamic btrfs device tracking and management. It may or may
>>> not be ready for 4.16, but once it goes in, btrfs should
>>> properly detect a device going away and react accordingly,
>>
>> I haven't seen that, but I doubt that it is the radical redesign
>> of the multi-device layer of Btrfs that is needed to give it
>> operational semantics similar to those of MD RAID, and that I
>> have vaguely described previously.
>
> I agree that the btrfs volume manager is incomplete in view of
> data center RAS requisites; there are a couple of critical
> bugs and inconsistent design between raid profiles, but I
> doubt it needs a radical redesign.
>
> Please take a look at [4]; comments are appreciated as usual.
> I have experimented with two approaches and both are reasonable.
> In the 1st approach, there isn't any harm in leaving the failed disk
> opened (but stopping any new IO to it). And there will be a udev
> 'btrfs dev forget --mounted <dev>' call when the device disappears
> so that we can close the device.
> In the 2nd approach, close the failed device right away when a disk
> write fails, so that we continue to have only two device states.
> I like the latter.
>
>>> and it should detect a device coming back as a different
>>> device too.
>>
>> That is disagreeable because of poor terminology: I guess that
>> what was intended is that it should be able to detect a previous
>> member block-device becoming available again as a different
>> device inode, which currently is very dangerous in some vital
>> situations.
>
> If a device disappears, patch [4] will completely take the
> device out of btrfs and continue RW in degraded mode.
> When it reappears, [5] will bring it back to the RW list.
But [5] relies on someone from userspace (presumably udev) actually
invoking BTRFS_IOC_SCAN_DEV/BTRFS_IOC_DEVICES_READY, no? Because
device_list_add is only ever called from btrfs_scan_one_device, which in
turn is called by either of the aforementioned ioctls or during mount
(which is not at play here).
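
To make the userspace side concrete, here is a rough sketch of what
issuing that scan looks like from a script. It assumes the uapi layout
of struct btrfs_ioctl_vol_args (an __s64 fd followed by a 4088-byte
name), the 0x94 btrfs ioctl magic, and the generic _IOC encoding used on
x86 and arm (some architectures encode the direction bits differently).
The helper names pack_vol_args and scan_device are made up for
illustration; actually calling scan_device needs root and
/dev/btrfs-control.

```python
import fcntl
import os
import struct

BTRFS_IOCTL_MAGIC = 0x94
BTRFS_PATH_NAME_MAX = 4087
# Assumed layout: struct btrfs_ioctl_vol_args { __s64 fd; char name[4088]; }
VOL_ARGS = struct.Struct(f"=q{BTRFS_PATH_NAME_MAX + 1}s")

# Generic asm-generic/ioctl.h direction bits (assumption: x86/arm style).
_IOC_WRITE, _IOC_READ = 1, 2


def _ioc(direction, nr, size):
    """Compose an ioctl request number the way _IOW/_IOR do."""
    return (direction << 30) | (size << 16) | (BTRFS_IOCTL_MAGIC << 8) | nr


BTRFS_IOC_SCAN_DEV = _ioc(_IOC_WRITE, 4, VOL_ARGS.size)
BTRFS_IOC_DEVICES_READY = _ioc(_IOC_READ, 39, VOL_ARGS.size)


def pack_vol_args(dev_path):
    """Build the ioctl argument buffer for a device path (hypothetical helper)."""
    name = dev_path.encode()
    if len(name) > BTRFS_PATH_NAME_MAX:
        raise ValueError("device path too long")
    # A mutable buffer is used because fcntl.ioctl rejects read-only
    # buffers longer than 1024 bytes.
    return bytearray(VOL_ARGS.pack(0, name))


def scan_device(dev_path, control="/dev/btrfs-control"):
    """Ask the kernel to scan one device (hypothetical helper, needs root)."""
    fd = os.open(control, os.O_RDWR)
    try:
        fcntl.ioctl(fd, BTRFS_IOC_SCAN_DEV, pack_vol_args(dev_path))
    finally:
        os.close(fd)
```

Something like scan_device("/dev/sdb"), run from a udev rule when the
disk shows up again, is the step that has to happen before the kernel
reaches btrfs_scan_one_device for that device.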
>
> [4]
> btrfs: introduce device dynamic state transition to failed
> [5]
> btrfs: handle dynamically reappearing missing device
>
> In the original btrfs design, it always depends on the device SB
> fsid:uuid:devid, so the device path, device inode, or device
> transport layer does not matter. For example, you can dynamically
> bring a device under a different transport and it will work
> without any downtime.
>
>
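
As a toy model of the identity rule described above (this is not kernel
code, and Member, DeviceRegistry, scan and mark_missing are names made
up for illustration), a device list keyed on fsid:uuid:devid could look
like the sketch below. The path is only an attribute that gets refreshed
when the same superblock identity is seen again.

```python
from dataclasses import dataclass


@dataclass
class Member:
    fsid: str
    uuid: str
    devid: int
    path: str
    missing: bool = False


class DeviceRegistry:
    """Toy device list keyed on superblock identity, never on path."""

    def __init__(self):
        self._members = {}

    def scan(self, fsid, uuid, devid, path):
        """Register a scanned device, or re-attach a known one under a new path."""
        key = (fsid, uuid, devid)
        member = self._members.get(key)
        if member is None:
            member = self._members[key] = Member(fsid, uuid, devid, path)
        else:
            # Same identity seen again: only the path changes.
            member.path = path
            member.missing = False
        return member

    def mark_missing(self, fsid, uuid, devid):
        """Flag a known member as gone without forgetting its identity."""
        self._members[(fsid, uuid, devid)].missing = True
```

In this model a disk that drops off as /dev/sdb and comes back as
/dev/sdc maps to the same Member, which is the behaviour the quoted
paragraph describes for a device moved to a different transport.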
>> That would be trivial if the complete redesign of block-device
>> states of the Btrfs multi-device layer happened, adding an
>> "active" flag to an "accessible" flag to describe new member
>> states, for example.
>
> I think you are talking about BTRFS_DEV_STATE. But I think
> Duncan is talking about the patches which I included in my
> reply.
>
> Thanks, Anand
>