From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-btrfs-owner@vger.kernel.org>
Received: from pepin.polanet.pl ([193.34.52.2]:48024 "EHLO pepin.polanet.pl"
        rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
        id S1752651AbeA1XNc (ORCPT <rfc822;linux-btrfs@vger.kernel.org>);
        Sun, 28 Jan 2018 18:13:32 -0500
Date: Mon, 29 Jan 2018 00:13:30 +0100
From: Tomasz Pala <gotar@polanet.pl>
To: Btrfs BTRFS <linux-btrfs@vger.kernel.org>
Subject: Re: degraded permanent mount option
Message-ID: <20180128231330.GB26726@polanet.pl>
References: <1517035210.1252874.1249880112.19FABD13@webmail.messagingengine.com>
 <8607255b-98e7-5623-6f62-75d6f7cf23db@gmail.com>
 <569AC15F-174E-4C78-8FE5-6CE9E0BED479@yayon.me>
 <E23AAC7C-6CAA-4290-9CF1-19285DB31D05@yayon.me>
 <111ca301-f631-694d-93eb-b73a790f57d4@gmail.com>
 <20180127110619.GA10472@polanet.pl>
 <20180127132641.mhmdhpokqrahgd4n@angband.pl>
 <pan$49d38$869e8d62$c0063169$584b8866@cox.net>
 <7c95b4ae-f65e-b31d-f907-5eae5c60c49a@gmail.com>
 <CAJCQCtSCC5RFEybWTpWVDeVS9MPwBsbY0-F4C_mB0Mq5EDhH9g@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-2
In-Reply-To: <CAJCQCtSCC5RFEybWTpWVDeVS9MPwBsbY0-F4C_mB0Mq5EDhH9g@mail.gmail.com>
Sender: linux-btrfs-owner@vger.kernel.org
List-ID: <linux-btrfs.vger.kernel.org>

On Sun, Jan 28, 2018 at 13:28:55 -0700, Chris Murphy wrote:

>> Are you sure you really understand the problem? No mount happens because
>> systemd waits for indication that it can mount and it never gets this
>> indication.
> 
> "not ready" is rather vague terminology but yes that's how systemd
> ends up using the ioctl this rule depends on, even though the rule has
> nothing to do with readiness per se. If all devices for a volume

If you avoid using THIS ioctl, then you'd have nothing to fire the rule
at all. One way or another, it is btrfs that must emit _some_ event or
be polled _somehow_.

> aren't found, we can correctly conclude a normal mount attempt *will*
> fail. But that's all we can conclude. What I can't parse in all of
> this is if the udev rule is a one shot, if the ioctl is a one shot, if
> something is constantly waiting for "not all devices are found" to
> transition to "all devices are found" or what. I can't actually parse

It's not a one-shot. It works like this:

sda1 appears -> udev catches event -> udev detects btrfs and IOCTLs => not ready
sdb1 appears -> udev catches event -> udev detects btrfs and IOCTLs => ready

The end.

If there were some other device appearing after assembly, like /dev/md1,
or if there were some event generated by btrfs code itself, udev could
catch this and follow. Now, if you unplug sdb1, there's no such event at
all.
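
To make the sequence above concrete, here is a toy simulation of it. This
is only a sketch: FakeBtrfsVolume, devices_ready() and udev_add_event() are
made-up names for illustration, not the kernel or udev API, and the
"kernel" is just a set of device names.

```python
class FakeBtrfsVolume:
    """Toy stand-in for the kernel's view of one multi-device filesystem."""

    def __init__(self, expected):
        self.expected = set(expected)  # members recorded in the superblock
        self.scanned = set()           # members registered so far

    def devices_ready(self, devnode):
        """Register devnode, then report whether every member has been seen."""
        self.scanned.add(devnode)
        return self.scanned >= self.expected


def udev_add_event(volume, devnode):
    """Mimic the two rule lines: ask once per device, record the answer."""
    env = {"ID_BTRFS_READY": "1" if volume.devices_ready(devnode) else "0"}
    if env["ID_BTRFS_READY"] == "0":
        env["SYSTEMD_READY"] = "0"     # systemd will not mount this device
    return env


volume = FakeBtrfsVolume({"sda1", "sdb1"})
events = [(dev, udev_add_event(volume, dev)) for dev in ("sda1", "sdb1")]
for dev, env in events:
    print(dev, env)
```

Each device gets exactly one question and one answer; nothing in this
model re-asks on its own.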

Since this IOCTL is the *only* thing that udev can rely on, it cannot be
removed from the logic. So even if you create a timer to force assembly,
you must do it by influencing the IOCTL response.

Or creating some other IOCTL for this purpose, or creating some
userspace daemon or whatever.

> the two critical lines in this rule. I
> 
> # let the kernel know about this btrfs filesystem, and check if it is complete
> IMPORT{builtin}="btrfs ready $devnode"

This sends the IOCTL.

> # mark the device as not ready to be used by the system
> ENV{ID_BTRFS_READY}=="0", ENV{SYSTEMD_READY}="0"
      ^^^^^^^^^^^^^^ this is the IOCTL response being checked

and SYSTEMD_READY set to 0 prevents systemd from mounting.

> I think the Btrfs ioctl is a one shot. Either they are all present or not.

The rules are called once per (block) device.
So once btrfs has scanned all the devices and returns READY, the volume
finally becomes systemd-ready. It is trivial to re-trigger the udev rule
(udevadm trigger), but there is no way to force btrfs to return READY
after any timeout.

> The waiting is a policy by systemd udev rule near as I can tell.

There is no problem in waiting or re-triggering. This can be done in ~10
lines of rules. The problem is that the IOCTL won't EVER return READY until
ALL the components are present.
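
A rough illustration of why re-triggering alone does not help. This is a
sketch under assumptions: retrigger_until_ready() and
all_members_present() are hypothetical helpers, and the readiness check is
modelled as a plain set comparison rather than the real IOCTL.

```python
def retrigger_until_ready(query_ready, attempts):
    """Re-ask the readiness question up to `attempts` times.

    Returns the 1-based attempt on which the answer was "ready",
    or None if it never was.
    """
    for attempt in range(1, attempts + 1):
        if query_ready():
            return attempt
    return None


expected = {"sda1", "sdb1"}
present = {"sda1"}                 # sdb1 is missing


def all_members_present():
    """Stand-in for the IOCTL: ready only when nothing is missing."""
    return present >= expected


# However often we re-trigger, the answer stays "not ready".
result_missing = retrigger_until_ready(all_members_present, attempts=10)

# Only the missing device showing up changes the answer.
present.add("sdb1")
result_complete = retrigger_until_ready(all_members_present, attempts=10)

print(result_missing, result_complete)
```

The retry loop is the easy part; the answer it polls for is what cannot
change without the missing device.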

It's as simple as that: there MUST be some mechanism at the device-manager
level that tells whether a compound device is mountable, degraded or not;
the upper layers (systemd-mount) do not care about degradation, and handling
redundancy/mirrors/chunks/stripes/spares is not their job.
It (systemd) can (easily!) handle an expiration timer to push a pending
compound device to be force-assembled, but currently there is no way to push.


If the IOCTL were extended to return TRYING_DEGRADED (when
instructed to do so after the timeout has expired), systemd could handle
additional per-filesystem fstab options, like x-systemd.allow-degraded.

Then it would be possible to have a best-effort policy for rootfs (to make
the machine boot), and a stricter one for crucial data (do not mount it
when there is no redundancy, wait for operator intervention).
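
A sketch of how such a policy could look. Everything here is hypothetical:
neither a TRYING_DEGRADED response nor an x-systemd.allow-degraded option
exists today, extended_ioctl() and should_mount() are invented names, and
the model crudely assumes any non-empty subset of devices can be mounted
degraded (in reality that depends on the profile).

```python
from enum import Enum


class Readiness(Enum):
    NOT_READY = 0
    READY = 1
    TRYING_DEGRADED = 2  # hypothetical response proposed above


def extended_ioctl(present, expected, timeout_expired):
    """Imagined IOCTL: may answer TRYING_DEGRADED once the timer expired."""
    if present >= expected:
        return Readiness.READY
    if timeout_expired and present:
        return Readiness.TRYING_DEGRADED
    return Readiness.NOT_READY


def should_mount(state, fstab_options):
    """Per-filesystem policy: degraded mounts need an explicit opt-in."""
    if state is Readiness.READY:
        return True
    if state is Readiness.TRYING_DEGRADED:
        return "x-systemd.allow-degraded" in fstab_options
    return False


expected = {"sda1", "sdb1"}
state = extended_ioctl({"sda1"}, expected, timeout_expired=True)

rootfs_mounts = should_mount(state, {"defaults", "x-systemd.allow-degraded"})
data_mounts = should_mount(state, {"defaults"})
print(state.name, rootfs_mounts, data_mounts)
```

With one device missing and the timer expired, the opted-in rootfs is
mounted while the data filesystem keeps waiting for the operator.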

-- 
Tomasz Pala <gotar@pld-linux.org>