linux-btrfs.vger.kernel.org archive mirror
From: "Austin S. Hemmelgarn" <ahferroin7@gmail.com>
To: Peter Grandi <pg@btrfs.list.sabi.co.UK>,
	Linux Btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: raid1 degraded mount still produce single chunks, writeable mount not allowed
Date: Thu, 9 Mar 2017 08:54:38 -0500
Message-ID: <50fa13c0-1b38-098c-edab-8ad4b0195773@gmail.com>
In-Reply-To: <22721.9515.564892.221096@tree.ty.sabi.co.uk>

On 2017-03-09 04:49, Peter Grandi wrote:
>>> Consider the common case of a 3-member volume with a 'raid1'
>>> target profile: if the sysadm thinks that a drive should be
>>> replaced, the goal is to take it out *without* converting every
>>> chunk to 'single', because with 2-out-of-3 devices half of the
>>> chunks will still be fully mirrored.
>
>>> Also, removing the device to be replaced should really not be
>>> the same thing as balancing the chunks, if there is space, to be
>>> 'raid1' across remaining drives, because that's a completely
>>> different operation.
>
>> There is a command specifically for replacing devices.  It
>> operates very differently from the add+delete or delete+add
>> sequences. [ ... ]
>
> Perhaps it was not clear that I was talking about removing a
> device, as distinct from replacing it, and that I used "removed"
> instead of "deleted" deliberately, to avoid the confusion with
> the 'delete' command.
Ah, sorry, I misunderstood what you were saying.
>
> In the everyday practice of system administration it often
> happens that a device should be removed first, and replaced
> later, for example when it is suspected to be faulty, or is
> intermittently faulty. The replacement can be done with
> 'replace' or 'add+delete' or 'delete+add', but that's a
> different matter.
>
> Perhaps I should not have used the generic verb "remove",
> but written "make unavailable".
>
> This brings about again the topic of some "confusion" in the
> design of the Btrfs multidevice handling logic, where at least
> initially one could only expand the storage space of a
> multidevice by 'add' of a new device or shrink the storage space
> by 'delete' of an existing one, but I think it was not conceived
> at Btrfs design time of storage space being nominally constant
> but for a device (and the chunks on it) having a state of
> "available" ("present", "online", "enabled") or "unavailable"
> ("absent", "offline", "disabled"), either because of events or
> because of system administrator action.
>
> The 'missing' pseudo-device designator was added later, and
> 'replace' also later to avoid having to first expand then shrink
> (or vice versa) the storage space and the related copying.
>
> My impression is that it would be less "confused" if the Btrfs
> device handling logic were changed to allow for the state of
> "member of the multidevice set but not actually available" and
> the related consequent state for chunks that ought to be on it;
> that probably would be essential to fixing the confusing current
> aspects of recovery in a multidevice set. That would be very
> useful even if it may require a change in the on-disk format to
> distinguish the distinct states of membership and availability
> for devices and mark chunks as available or not (chunks of course
> being only possible on member devices).
>
> That is, it would also be nice to have the opposite state of "not
> member of the multidevice set but actually available to it", that
> is a spare device, and related logic.
OK, so expanding on this a bit, there are currently three functional
device states in BTRFS (note that the terms I use here aren't
official, they're just what I use to describe them):
1. Active/Online.  This is the normal state for a device: you can both
read from it and write to it.
2. Inactive/Replacing/Deleting.  This is the state a device is in when 
it's either being deleted or replaced.  Inactive devices don't count 
towards total volume size, and can't be written to, but can be read from 
if they weren't missing prior to becoming inactive.
3. Missing/Offline.  This is pretty self-explanatory.  A device in this
state can't be read from or written to, but it does count towards volume 
size.

Currently, the only transitions available to a sysadmin through BTRFS 
itself are temporary transitions from Active to Inactive (replace and 
delete).
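
As a rough sketch of the three states above (Python, purely
illustrative: none of these names exist in BTRFS or btrfs-progs, and
that an active device counts towards volume size is my assumption from
it being the normal state):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceState:
    """Hypothetical model of one functional device state."""
    name: str
    readable: bool
    writable: bool
    counts_toward_size: bool


# Property values follow the descriptions of states 1-3 above.
ACTIVE = DeviceState("active", readable=True, writable=True,
                     counts_toward_size=True)   # size flag is an assumption
INACTIVE = DeviceState("inactive", readable=True, writable=False,
                       counts_toward_size=False)
MISSING = DeviceState("missing", readable=False, writable=False,
                      counts_toward_size=True)


def can_read(state, was_missing_before=False):
    """An inactive device is only readable if it wasn't missing before
    it became inactive; other states use their fixed flag."""
    if state is INACTIVE:
        return not was_missing_before
    return state.readable


def volume_size(devices):
    """Sum the sizes of (size_in_bytes, DeviceState) pairs that count
    towards total volume size."""
    return sum(size for size, state in devices if state.counts_toward_size)
```

With one device of equal size in each state, only the active and
missing ones contribute to volume_size().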

In an ideal situation, there would be two other states:
4. Local hot-spare/Nearline.  Won't be read from and doesn't count 
towards total volume size, but may be written to (depending on how the 
FS is configured), and will be automatically used to replace a failed 
device in the filesystem it's associated with.
5. Global hot-spare.  Similar to local hot-spare, but can be used for 
any filesystem on the system, and won't be touched until it's needed.

The following manually initiated transitions would be possible for 
regular operation:
1. Active -> Inactive (persistently)
2. Inactive -> Active
3. Active -> Local hot-spare
4. Inactive -> Local hot-spare
5. Local hot-spare -> Active
6. Local hot-spare -> Inactive
7. Global hot-spare -> Active
8. Global hot-spare -> Inactive
9. Local hot-spare -> Global hot-spare
10. Global hot-spare -> Local hot-spare

And the following automatic transitions would be possible:
1. Local hot-spare -> Active
2. Global hot-spare -> Active
3. <any other state> -> Missing
4. Missing -> <any other state>

And there would be the option of manually triggering the automatic 
transitions for debugging purposes.
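
To make the proposal a bit more concrete, here is a minimal sketch of
the transition rules as lookup tables (Python, purely illustrative: the
names are hypothetical, nothing like this exists in BTRFS today, and
the 'debug' flag is just my stand-in for manually triggering the
automatic transitions):

```python
from enum import Enum


class State(Enum):
    ACTIVE = "active"
    INACTIVE = "inactive"
    MISSING = "missing"
    LOCAL_SPARE = "local hot-spare"
    GLOBAL_SPARE = "global hot-spare"


S = State

# The ten manually initiated transitions listed above, in order.
MANUAL = {
    (S.ACTIVE, S.INACTIVE),
    (S.INACTIVE, S.ACTIVE),
    (S.ACTIVE, S.LOCAL_SPARE),
    (S.INACTIVE, S.LOCAL_SPARE),
    (S.LOCAL_SPARE, S.ACTIVE),
    (S.LOCAL_SPARE, S.INACTIVE),
    (S.GLOBAL_SPARE, S.ACTIVE),
    (S.GLOBAL_SPARE, S.INACTIVE),
    (S.LOCAL_SPARE, S.GLOBAL_SPARE),
    (S.GLOBAL_SPARE, S.LOCAL_SPARE),
}

# Automatic transitions: spares becoming active, plus anything to or
# from Missing.
NOT_MISSING = [s for s in State if s is not S.MISSING]
AUTOMATIC = (
    {(S.LOCAL_SPARE, S.ACTIVE), (S.GLOBAL_SPARE, S.ACTIVE)}
    | {(s, S.MISSING) for s in NOT_MISSING}
    | {(S.MISSING, s) for s in NOT_MISSING}
)


def is_allowed(src, dst, *, manual, debug=False):
    """Return True if the src -> dst transition is permitted.

    manual=True models a sysadmin request; debug=True additionally
    lets a manual request trigger an automatic-only transition.
    """
    if manual:
        return (src, dst) in MANUAL or (debug and (src, dst) in AUTOMATIC)
    return (src, dst) in AUTOMATIC
```

For example, is_allowed(State.ACTIVE, State.MISSING, manual=True) is
False unless debug=True is also passed, since going Missing is only an
automatic transition in the lists above.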
>
> Note: simply setting '/sys/block/$DEV/device/delete' is not a
> good option, because that makes the device unavailable not just
> to Btrfs, but also to the whole system. In the ordinary practice
> of system administration it may well be useful to make a device
> unavailable to Btrfs but still available to the system, for
> example for testing, and anyhow they are logically distinct
> states. That also means a member device might well be available
> to the system, but marked as "not available" to Btrfs.

