Re: Unexpected raid1 behaviour

Linux Btrfs filesystem development
 help / color / mirror / Atom feed

From: Nikolay Borisov <nborisov@suse.com>
To: Anand Jain <anand.jain@oracle.com>,
	Peter Grandi <pg@btrfs.list.sabi.co.UK>,
	Linux fs Btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: Unexpected raid1 behaviour
Date: Mon, 18 Dec 2017 14:10:47 +0200	[thread overview]
Message-ID: <9a2f4ed4-26a0-833d-1225-5a5773ab7a61@suse.com> (raw)
In-Reply-To: <be7e5573-4247-89bd-00f0-7d84120ce582@oracle.com>



On 18.12.2017 10:49, Anand Jain wrote:
> 
> 
>> Put another way, the multi-device design is/was based on the
>> demented idea that block-devices that are missing are/should be
>> "remove"d, so that a 2-device volume with a 'raid1' profile
>> becomes a 1-device volume with a 'single'/'dup' profile, and not
>> a 2-device volume with a missing block-device and an incomplete
>> 'raid1' profile, 
> 
>  Agreed. IMO degraded-raid1-single-chunk is an accidental feature
>  caused by [1], which we should revert back, since..
>    - balance (to raid1 chunk) may fail if FS is near full
>    - recovery (to raid1 chunk) will take more writes as compared
>      to recovery under degraded raid1 chunks
> 
>  [1]
>  commit 95669976bd7d30ae265db938ecb46a6b7f8cb893
>  Btrfs: don't consider the missing device when allocating new chunks
> 
>  There is an attempt to fix it [2], but will certainly takes time as
>  there are many things to fix around this.
> 
>  [2]
>  [PATCH RFC] btrfs: create degraded-RAID1 chunks
> 
>> even if things have been awkwardly moving in
>> that direction in recent years.
>> Note the above is not totally accurate today because various
>> hacks have been introduced to work around the various issues.
>  May be you are talking about [3]. Pls note its a workaround
>  patch (which I mentioned in its original patch). Its nice that
>  we fixed the availability issue through this patch and the
>  helper function it added also helps the other developments.
>  But for long term we need to work on [2].
> 
>  [3]
>  btrfs: Introduce a function to check if all chunks a OK for degraded rw
> mount
> 
>>> Thus, if a device disappears, to get it back you really have
>>> to reboot, or at least unload/reload the btrfs kernel module,
>>> in ordered to clear the stale device state and have btrfs
>>> rescan and reassociate devices with the matching filesystems.
>>
>> IIRC that is not quite accurate: a "missing" device can be
>> nowadays "replace"d (by "devid") or "remove"d, the latter
>> possibly implying profile changes:
>>
>>   
>> https://btrfs.wiki.kernel.org/index.php/Using_Btrfs_with_Multiple_Devices#Using_add_and_delete
>>
>>
>> Terrible tricks like this also work:
>>
>>    https://www.spinics.net/lists/linux-btrfs/msg48394.html
> 
>  Its replace, which isn't about bringing back a missing disk.
> 
> 
>>> Meanwhile, as mentioned above, there's active work on proper
>>> dynamic btrfs device tracking and management. It may or may
>>> not be ready for 4.16, but once it goes in, btrfs should
>>> properly detect a device going away and react accordingly,
>>
>> I haven't seen that, but I doubt that it is the radical redesign
>> of the multi-device layer of Btrfs that is needed to give it
>> operational semantics similar to those of MD RAID, and that I
>> have vaguely described previously.
> 
>  I agree that btrfs volume manager is incomplete in view of
>  data center RAS requisites, there are couple of critical
>  bugs and inconsistent design between raid profiles, but I
>  doubt if it needs a radical redesign.
> 
>  Pls take a look at [4], comments are appreciated as usual.
>  I have experimented with two approaches and both are reasonable. -
>  There isn't any harm to leave failed disk opened (but stop any
>  new IO to it). And there will be udev
>  'btrfs dev forget --mounted <dev>' call when device disappears
>  so that we can close the device.
>  In the 2nd approach, close the failed device right away when disk
>  write fails, so that we continue to have only two device states.
>  I like the latter.
> 
>>> and it should detect a device coming back as a different
>>> device too.
>>
>> That is disagreeable because of poor terminology: I guess that
>> what was intended that it should be able to detect a previous
>> member block-device becoming available again as a different
>> device inode, which currently is very dangerous in some vital
>> situations.
> 
>  If device disappears, the patch [4] will completely take out the
>  device from btrfs, and continues to RW in degraded mode.
>  When it reappears then [5] will bring it back to the RW list.

but [5] relies on someone from userspace (presumably udev) actually
invoking BTRFS_IOC_SCAN_DEV/IOSC_DEVICES_READY, no ? Because
device_list_add is only ever called from btrfs_scan_one_device, which in
turn is called by either of the aforementioned IOCTLS or during mount
(which is not at play here).

> 
>   [4]
>   btrfs: introduce device dynamic state transition to failed
>   [5]
>   btrfs: handle dynamically reappearing missing device
> 
>  From the btrfs original design, it always depends on device SB
>  fsid:uuid:devid so it does not matter about the device
>  path or device inode or device transport layer. For eg. Dynamically
>  you can bring a device under different transport and it will work
>  without any down time.
> 
> 
>> That would be trivial if the complete redesign of block-device
>> states of the Btrfs multi-device layer happened, adding an
>> "active" flag to an "accessible" flag to describe new member
>> states, for example.
> 
>  I think you are talking about BTRFS_DEV_STATE.. But I think
>  Duncan is talking about the patches which I included in my
>  reply.
> 
> Thanks, Anand
> 
> -- 
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

next prev parent reply	other threads:[~2017-12-18 12:10 UTC|newest]

Thread overview: 61+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-12-16 19:50 Unexpected raid1 behaviour Dark Penguin
2017-12-17 11:58 ` Duncan
2017-12-17 15:48   ` Peter Grandi
2017-12-17 20:42     ` Chris Murphy
2017-12-18  8:49       ` Anand Jain
2017-12-18  8:49     ` Anand Jain
2017-12-18 10:36       ` Peter Grandi
2017-12-18 12:10       ` Nikolay Borisov [this message]
2017-12-18 13:43         ` Anand Jain
2017-12-18 22:28       ` Chris Murphy
2017-12-18 22:29         ` Chris Murphy
2017-12-19 12:30         ` Adam Borowski
2017-12-19 12:54         ` Andrei Borzenkov
2017-12-19 12:59         ` Peter Grandi
2017-12-18 13:06     ` Austin S. Hemmelgarn
2017-12-18 19:43       ` Tomasz Pala
2017-12-18 22:01         ` Peter Grandi
2017-12-19 12:46           ` Austin S. Hemmelgarn
2017-12-19 12:25         ` Austin S. Hemmelgarn
2017-12-19 14:46           ` Tomasz Pala
2017-12-19 16:35             ` Austin S. Hemmelgarn
2017-12-19 17:56               ` Tomasz Pala
2017-12-19 19:47                 ` Chris Murphy
2017-12-19 21:17                   ` Tomasz Pala
2017-12-20  0:08                     ` Chris Murphy
2017-12-23  4:08                       ` Tomasz Pala
2017-12-23  5:23                         ` Duncan
2017-12-20 16:53                   ` Andrei Borzenkov
2017-12-20 16:57                     ` Austin S. Hemmelgarn
2017-12-20 20:02                     ` Chris Murphy
2017-12-20 20:07                       ` Chris Murphy
2017-12-20 20:14                         ` Austin S. Hemmelgarn
2017-12-21  1:34                           ` Chris Murphy
2017-12-21 11:49                         ` Andrei Borzenkov
2017-12-19 20:11                 ` Austin S. Hemmelgarn
2017-12-19 21:58                   ` Tomasz Pala
2017-12-20 13:10                     ` Austin S. Hemmelgarn
2017-12-19 23:53                   ` Chris Murphy
2017-12-20 13:12                     ` Austin S. Hemmelgarn
2017-12-19 18:31             ` George Mitchell
2017-12-19 20:28               ` Tomasz Pala
2017-12-19 19:35             ` Chris Murphy
2017-12-19 20:41               ` Tomasz Pala
2017-12-19 20:47                 ` Austin S. Hemmelgarn
2017-12-19 22:23                   ` Tomasz Pala
2017-12-20 13:33                     ` Austin S. Hemmelgarn
2017-12-20 17:28                       ` Duncan
2017-12-21 11:44                   ` Andrei Borzenkov
2017-12-21 12:27                     ` Austin S. Hemmelgarn
2017-12-22 16:05                       ` Tomasz Pala
2017-12-22 21:04                         ` Chris Murphy
2017-12-23  2:52                           ` Tomasz Pala
2017-12-23  5:40                             ` Duncan
2017-12-19 23:59                 ` Chris Murphy
2017-12-20  8:34                   ` Tomasz Pala
2017-12-20  8:51                     ` Tomasz Pala
2017-12-20 19:49                     ` Chris Murphy
2017-12-18  5:11   ` Anand Jain
2017-12-18  1:20 ` Qu Wenruo
2017-12-18 13:31 ` Austin S. Hemmelgarn
2018-01-12 12:26   ` Dark Penguin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=9a2f4ed4-26a0-833d-1225-5a5773ab7a61@suse.com \
    --to=nborisov@suse.com \
    --cc=anand.jain@oracle.com \
    --cc=linux-btrfs@vger.kernel.org \
    --cc=pg@btrfs.list.sabi.co.UK \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox