Linux Btrfs filesystem development
From: "Austin S. Hemmelgarn" <ahferroin7@gmail.com>
To: Tomasz Pala <gotar@polanet.pl>,
	Linux fs Btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: Unexpected raid1 behaviour
Date: Wed, 20 Dec 2017 08:33:03 -0500
Message-ID: <862f03b0-9311-09f6-ee2c-d53d35d176bd@gmail.com>
In-Reply-To: <20171219222308.GE14726@polanet.pl>

On 2017-12-19 17:23, Tomasz Pala wrote:
> On Tue, Dec 19, 2017 at 15:47:03 -0500, Austin S. Hemmelgarn wrote:
> 
>>> Sth like this? I got such problem a few months ago, my solution was
>>> accepted upstream:
>>> https://github.com/systemd/systemd/commit/0e8856d25ab71764a279c2377ae593c0f2460d8f
>>>
>>> Rationale is in referred ticket, udev would not support any more btrfs
>>> logic, so unless btrfs handles this itself on kernel level (daemon?),
>>> that is all that can be done.
>> Or maybe systemd can quit trying to treat BTRFS like a volume manager
>> (which it isn't) and just try to mount the requested filesystem with the
>> requested options?
> 
> Tried that before ("just mount my filesystem, stupid"), it is a no-go.
> The problem source is not within systemd treating BTRFS differently, but
> in btrfs kernel logic that it uses. Just to show it:
> 
> 1. create 2-volume btrfs, e.g. /dev/sda and /dev/sdb,
> 2. reboot the system into clean state (init=/bin/sh), (or remove btrfs-scan tool),
> 3. try
> mount /dev/sda /test - fails
> mount /dev/sdb /test - works
> 4. reboot again and try in reversed order
> mount /dev/sdb /test - fails
> mount /dev/sda /test - works
> 
> THIS readiness is exposed via udev to systemd. And it must be used for
> multi-layer setups to work (consider stacked LUKS, LVM, MD, iSCSI, FC etc).
Except BTRFS _IS NOT MULTIPLE LAYERS_.  It's a single layer, at the
filesystem level, and it handles the other 'layers' internally.
> 
> In short: until *something* scans all the btrfs components, so the
> kernel makes it ready, systemd won't even try to mount it.
Which is the problem here.  Systemd needs to treat BTRFS differently,
even if the ioctl it's using gets 'fixed'.  Currently it treats BTRFS
like LVM or MD, when it needs to be treated as just a filesystem with an
extra wait condition prior to mount (and systemd needs to trust that the
user knows what they are doing when they mount something by hand).  The
ioctl systemd is using was poorly named: what it really does is say that
the FS is ready to mount normally (that is, without needing 'device=' or
'degraded' mount options).  Aside from this being problematic with
degraded volumes, it has an inherent TOCTOU race condition: the answer
can be stale by the time the mount is actually attempted (the checks for
all the other block layers you mentioned have the same race, FWIW).  If
systemd would just treat BTRFS like a filesystem instead of a volume
manager, and try to mount the volume with the specified options (after
waiting for udev to report that it's done scanning everything) instead
of asking the kernel if it's ready, none of this would be an issue.
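
To make the "just try the mount" approach concrete, here is a rough
sketch in Python.  It is not what systemd or any init script actually
does.  The function name, the 30 second timeout, and the injectable
`run` parameter are assumptions made for illustration.  The only
external commands used are `udevadm settle` and `mount`.

```python
import subprocess


def mount_after_settle(device, mountpoint, options=(), settle_timeout=30,
                       run=subprocess.run):
    """Wait for udev to go idle, then simply attempt the mount.

    Hypothetical helper: returns True if mount(8) exited with status 0.
    """
    # Let udev finish processing pending device events first.
    run(["udevadm", "settle", f"--timeout={settle_timeout}"], check=False)

    cmd = ["mount", "-t", "btrfs"]
    if options:
        # Pass the user's options through untouched (e.g. 'degraded').
        cmd += ["-o", ",".join(options)]
    cmd += [device, mountpoint]

    # The mount attempt itself is the readiness check: no separate
    # "is it ready?" query, so no window between check and use.
    result = run(cmd, check=False)
    return result.returncode == 0
```

Calling `mount_after_settle("/dev/sda", "/test", ["degraded"])` would
then succeed or fail purely on what the kernel says about the mount
itself.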

Put slightly differently:  I use OpenRC and sysv init.  I have a script 
that runs right after udev starts and directly scans all fixed disks for 
BTRFS signatures, and that's _all_ that I need to do to get multi-device 
BTRFS working properly with the standard local filesystem mount script 
in Gentoo.  I don't have to deal with any of this crap that systemd 
users do because Gentoo's OpenRC script for mounting local filesystems 
treats BTRFS like any other filesystem, and (sensibly) assumes that if 
the call to mount succeeds, things are ready and working.
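
For anyone curious what such a scan amounts to, the following is a
minimal sketch in Python and not my actual script.  It assumes the
fixed disks show up as /dev/sd*, looks for the BTRFS magic string in
the primary superblock (64 KiB into the device, magic at offset 0x40),
and hands every match to `btrfs device scan`.  The function names and
the glob pattern are made up for the example.

```python
import glob
import subprocess

BTRFS_MAGIC = b"_BHRfS_M"
# Primary superblock sits at 64 KiB; the magic is 0x40 bytes into it.
BTRFS_MAGIC_OFFSET = 0x10000 + 0x40


def has_btrfs_signature(path):
    """Return True if the device or file carries a BTRFS superblock magic."""
    try:
        with open(path, "rb") as dev:
            dev.seek(BTRFS_MAGIC_OFFSET)
            return dev.read(len(BTRFS_MAGIC)) == BTRFS_MAGIC
    except OSError:
        # Unreadable or too-small devices are simply not BTRFS members.
        return False


def scan_fixed_disks(pattern="/dev/sd*", run=subprocess.run):
    """Register every matching device that looks like BTRFS with the kernel."""
    found = [p for p in sorted(glob.glob(pattern)) if has_btrfs_signature(p)]
    for path in found:
        run(["btrfs", "device", "scan", path], check=False)
    return found
```

Once that has run, the ordinary local filesystem mount step needs no
BTRFS-specific handling.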
> 
>> Then you would just be able to specify 'degraded' in
>> your mount options, and you don't have to care that the kernel refuses
>> to mount degraded filesystems without being explicitly asked to.
> 
> Exactly. But since LP refused to try mounting despite kernel "not-ready"
> state - it is the kernel that must emit 'ready'. So the
> question is: how can I make kernel to mark degraded array as "ready"?
You can't, because the DEVICES_READY ioctl (BTRFS_IOC_DEVICES_READY) is
coded to mark the volume ready only when all component devices are
ready.  IOW, it's there to say 'this mount will work without needing
-o degraded or specifying any devices in the mount options'.
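
As an illustration of how little that ioctl tells you, here is a hedged
Python sketch of querying it by hand.  The helper names are invented
for the example.  The request number is built with the _IOR() encoding
used on x86 and ARM (other architectures may encode it differently),
and the 4096-byte argument mirrors struct btrfs_ioctl_vol_args (an s64
fd followed by a 4088-byte name).  A return value of 0 means every
component device has been seen, and anything else means 'not ready'.

```python
import fcntl
import os
import struct

BTRFS_IOCTL_MAGIC = 0x94
BTRFS_VOL_ARGS_SIZE = 4096  # s64 fd + char name[4088]


def _ior(ioc_type, nr, size):
    # Generic Linux _IOR() encoding (direction READ = 2); assumed to
    # match the running architecture, which is not true everywhere.
    return (2 << 30) | (size << 16) | (ioc_type << 8) | nr


BTRFS_IOC_DEVICES_READY = _ior(BTRFS_IOCTL_MAGIC, 39, BTRFS_VOL_ARGS_SIZE)


def pack_vol_args(device):
    """Build the btrfs_ioctl_vol_args buffer naming one member device."""
    return bytearray(struct.pack("=q4088s", 0, device.encode()))


def devices_ready(device, control="/dev/btrfs-control"):
    """Ask the kernel whether all devices of this filesystem were scanned."""
    fd = os.open(control, os.O_RDWR)
    try:
        # With a mutable buffer, fcntl.ioctl returns the ioctl's own
        # return code: 0 here means a non-degraded mount should work.
        return fcntl.ioctl(fd, BTRFS_IOC_DEVICES_READY,
                           pack_vol_args(device)) == 0
    finally:
        os.close(fd)
```

Note that there is no way to express 'ready enough for a degraded
mount' through this interface.  It is a plain yes or no on whether all
devices are present.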

The issue is the interaction here, not the kernel behavior by itself, 
since the kernel behavior produces no issues whatsoever for other init 
systems (though I will acknowledge that the ioctl itself is really only 
used by systemd, but I contend that that's because everything else is 
sensible enough to understand that the ioctl is functionally useless and 
just avoid it).
> 
> The obvious answer is: do it via kernel command line, just like mdadm
> does:
> rootflags=device=/dev/sda,device=/dev/sdb
> rootflags=device=/dev/sda,device=missing
> rootflags=device=/dev/sda,device=/dev/sdb,degraded
> 
> If only btrfs.ko recognized this, kernel would be able to assemble
> multivolume btrfs itself. Not only this would allow automated degraded
> mounts, it would also allow using initrd-less kernels on such volumes.
Last I checked, the 'device=' options work on upstream kernels just 
fine, though I've never tried the degraded option.  Of course, I'm also 
not using systemd, so it may be some interaction with systemd that's 
causing them to not work (and yes, I understand that I'm inclined to 
blame systemd most of the time based on significant past experience with 
systemd creating issues that never existed before).
> 
>>> It doesn't have to be default, might be kernel compile-time knob, module
>>> parameter or anything else to make the *R*aid work.
>> There's a mount option for it per-filesystem.  Just add that to all your
>> mount calls, and you get exactly the same effect.
> 
> If only they were passed...
> 


