From: Goffredo Baroncelli <kreijack@libero.it>
To: Anand Jain <anand.jain@oracle.com>,
Andrei Borzenkov <arvidjaar@gmail.com>
Cc: systemd-devel@lists.freedesktop.org,
"linux-btrfs@vger.kernel.org >> linux-btrfs"
<linux-btrfs@vger.kernel.org>,
dsterba@suse.cz
Subject: Re: [systemd-devel] [survey] BTRFS_IOC_DEVICES_READY return status
Date: Sat, 13 Jun 2015 17:09:19 +0200 [thread overview]
Message-ID: <557C479F.2070403@libero.it> (raw)
In-Reply-To: <557BF979.9040106@oracle.com>
On 2015-06-13 11:35, Anand Jain wrote:
>
> Thanks for your reply Andrei and Goffredo. more below...
>
> On 06/13/2015 04:08 AM, Goffredo Baroncelli wrote:
>> On 2015-06-12 20:04, Andrei Borzenkov wrote:
>>> В Fri, 12 Jun 2015 21:16:30 +0800 Anand Jain
>>> <anand.jain@oracle.com> пишет:
>>>
>>>>
>>>>
>>>> BTRFS_IOC_DEVICES_READY is to check if all the required
>>>> devices are known by the btrfs kernel, so that
>>>> admin/system-application could mount the FS. It is checked
>>>> against a device in the argument.
>>>>
>>>> However the actual implementation is bit more than just that,
>>>> in the way that it would also scan and register the device
>>>> provided in the argument (same as btrfs device scan subcommand
>>>> or BTRFS_IOC_SCAN_DEV ioctl).
>>>>
>>>> So BTRFS_IOC_DEVICES_READY ioctl isn't a read/view only ioctl,
>>>> but its a write command as well.
>>>>
>>>> Next, since in the kernel we only check if total_devices (read
>>>> from SB) is equal to num_devices (counted in the list) to
>>>> state the status as 0 (ready) or 1 (not ready). But this does
>>>> not work in rest of the device pool state like missing,
>>>> seeding, replacing since total_devices is actually not equal to
>>>> num_devices in these state but device pool is ready for the
>>>> mount and its a bug which is not part of this discussions.
>>>>
>>>>
>>>> Questions:
>>>>
>>>> - Do we want BTRFS_IOC_DEVICES_READY ioctl to also scan and
>>>> register the device provided (same as btrfs device scan command
>>>> or the BTRFS_IOC_SCAN_DEV ioctl) OR can BTRFS_IOC_DEVICES_READY
>>>> be read-only ioctl interface to check the state of the device
>>>> pool. ?
>>>>
>>>
>>> udev is using it to incrementally assemble multi-device btrfs, so
>>> in this case I think it should.
>
> Nice. Thanks for letting me know this.
>
>> I agree, the ioctl name is confusing, but unfortunately this is an
>> API and it has to be stay here forever. Udev uses it, so we know
>> for sure that it is widely used.
>
> ok. what goes in stays there forever. its time to update the man page
> rather.
>
>>> Are there any other users?
>>>
>>>> - If the the device in the argument is already mounted, can it
>>>> straightaway return 0 (ready) ? (as of now it would again
>>>> independently read the SB determine total_devices and check
>>>> against num_devices.
>>>>
>>>
>>> I think yes; obvious use case is btrfs mounted in initrd and
>>> later coldplug. There is no point to wait for anything as
>>> filesystem is obviously there.
>>>
>
> There is little difference. If the device is already mounted. And
> there are two device paths for the same device PA and PB. The path as
> last given to either 'btrfs dev scan (BTRFS_IOC_SCAN_DEV)' or 'btrfs
> device ready (BTRFS_IOC_DEVICES_READY)' will be shown in the 'btrfs
> filesystem show' or '/proc/self/mounts' output. It does not mean that
> btrfs kernel will close the first device path and reopen the 2nd
> given device path, it just updates the device path in the kernel.
>
> Further, the problem will be more intense in this eg. if you use dd
> and copy device A to device B. After you mount device A, by just
> providing device B in the above two commands you could let kernel
> update the device path, again all the IO (since device is mounted)
> are still going to the device A (not B), but /proc/self/mounts and
> 'btrfs fi show' shows it as device B (not A).
>
> Its a bug. very tricky to fix.
In the past [*] I proposed a mount.btrfs helper . I tried to move the logic outside the kernel.
I think that the problem is that we try to manage all these cases from a device point of view: when a device appears, we register the device and we try to mount the filesystem... This works very well when there is 1-volume filesystem. For the other cases there is a mess between the different layers:
- kernel
- udev/systemd
- initrd logic
My attempt followed a different idea: the mount helper waits the devices if needed, or if it is the case it mounts the filesystem in degraded mode. All devices are passed as mount arguments (--device=/dev/sdX), there is no a device registration: this avoids all these problems.
[*] http://permalink.gmane.org/gmane.comp.file-systems.btrfs/40767
back to your questions
> - we can't return -EBUSY for subsequent (after mount) calls for the
> above two ioctls (if a mounted device is used as an argument). Since
> admin/system-application might actually call again to mount subvols.
I am not sure that the two things are related: the mount doesn't use BTRFS_IOC_DEVICES_READY. After BTRFS_IOC_DEVICES_READY returns OK, all the filesystem belongs this FSID should be mounted; but it is a job of systemd/initramfs/sysv... a further failed BTRFS_IOC_DEVICES_READY shouldn't case any problem ...
>
> - we can return success (without updating the device path) but, we
> would be wrong when device A is copied into device B using dd. Since
> we would check against the on device SB's fsid/uuid/devid. Checking
> using strcmp the device paths is not practical since there can be
> different paths to the same device (lets says mapper).
>
> (any suggestion on how to check if its the same device in the
> kernel?).
check minor/major ?
>
> - Also if we don't let to update the device path after device is
> mounted, then are there chances that we would be stuck with the
> device path during initrd which does not make any sense to the user
> ?
>
>
>>>> - What should be the expected return when the FS is mounted and
>>>> there is a missing device.
>>
>> I suggest to not invest further energy on a ioctl API. If you want
>> these kind of information, you (we) should export these in sysfs:
>> In an ideal world:
>>
>> - a new btrfs device appears - udev register it with
>> BTRFS_IOC_SCAN_DEV: - udev (or mount ?) checks the status of the
>> filesystem reading the sysfs entries (total devices, present
>> devices, seed devices, raid level....); on the basis of the local
>> policy (allow degraded mount, device timeout, how many device are
>> missing, filesystem redundancy level.....) udev (mount) may mount
>> the filesystem with the appropriate parameter (ro, degraded, or
>> even insert a spare device to correct a missing device....)
>
> Yes. sysfs interface is coming. few framework patch were sent
> sometime back, any comments will help. On the ioctl part I am trying
> to fix the bug(s).
>
>>>>
>>>
>>> This is similar to problem mdadm had to solve. mdadm starts timer
>>> as soon as enough raid devices are present; if timer expires
>>> before raid is complete, raid is started in degraded mode. This
>>> avoids spurious rebuilds. So it would be good if btrfs could
>>> distinguish between enough devices to mount and all devices.
>
>> These are two different things: how export the filesystem
>> information (I am still convinced that these have to be exported
>> via sysfs), and what the system has to do in case of ... (a missing
>> device ?). The latter is a policy, and I think that it should be
>> not rely in the kernel.
>>
>>
>>> -- To unsubscribe from this list: send the line "unsubscribe
>>> linux-btrfs" in the body of a message to
>>> majordomo@vger.kernel.org More majordomo info at
>>> http://vger.kernel.org/majordomo-info.html
>>>
>>
>>
> -- To unsubscribe from this list: send the line "unsubscribe
> linux-btrfs" in the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
next prev parent reply other threads:[~2015-06-13 15:17 UTC|newest]
Thread overview: 19+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-06-12 13:16 [survey] BTRFS_IOC_DEVICES_READY return status Anand Jain
2015-06-12 18:04 ` [systemd-devel] " Andrei Borzenkov
2015-06-12 20:08 ` Goffredo Baroncelli
2015-06-13 9:35 ` Anand Jain
2015-06-13 15:09 ` Goffredo Baroncelli [this message]
[not found] ` <pan$63061$a3cdf5f6$a390adbd$e6097ad9@cox.net>
2015-06-14 19:44 ` Goffredo Baroncelli
2015-06-15 10:46 ` Lennart Poettering
2015-06-15 17:23 ` Goffredo Baroncelli
2015-06-15 17:38 ` Lennart Poettering
2015-06-17 19:10 ` Goffredo Baroncelli
2015-06-17 21:02 ` Lennart Poettering
2015-06-18 2:40 ` Andrei Borzenkov
2015-06-14 5:48 ` Andrei Borzenkov
2015-06-15 10:41 ` Lennart Poettering
2015-06-13 7:20 ` btrfs filesystem show confused when label is same as mountpoint Sjoerd
2015-06-13 9:51 ` Duncan
2015-06-25 16:37 ` David Sterba
2015-06-15 10:27 ` [survey] BTRFS_IOC_DEVICES_READY return status Lennart Poettering
2015-06-15 15:01 ` David Sterba
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=557C479F.2070403@libero.it \
--to=kreijack@libero.it \
--cc=anand.jain@oracle.com \
--cc=arvidjaar@gmail.com \
--cc=dsterba@suse.cz \
--cc=kreijack@inwind.it \
--cc=linux-btrfs@vger.kernel.org \
--cc=systemd-devel@lists.freedesktop.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox