* [PATCH] libblkid: fix spurious ext superblock checksum mismatches
@ 2024-11-18 20:35 Krister Johansen
2024-11-18 22:36 ` [systemd-devel] " Lennart Poettering
2024-11-19 8:15 ` [EXT] " Windl, Ulrich
0 siblings, 2 replies; 11+ messages in thread
From: Krister Johansen @ 2024-11-18 20:35 UTC (permalink / raw)
To: util-linux; +Cc: Karel Zak, systemd-devel, David Reaver, Theodore Ts'o
Reads of ext superblocks can race with updates. If libblkid observes a
checksum mismatch, re-read the superblock with O_DIRECT in order to get
a consistent view of its contents. Only if the O_DIRECT read fails the
checksum should it be reported to have failed.
This fixes a problem where devices that were named by filesystem label
failed to be found when systemd attempted to mount them on boot. The
problem was caused by systemd-udevd using libblkid. If a read of a
superblock resulted in a checksum mismatch, udev will remove the
by-label links which result in the mount call failing to find the
device. The checksum mismatch that was triggering the problem was
spurious, and when we use O_DIRECT, or even perform a subsequent retry,
the superblock is correctly read. This resulted in a failure to mount
/boot in one out of every 2,000 or so attempts in our environment.
e2fsprogs fixed[1] an identical version of this bug that afflicted
resize2fs during online grow operations when run from cloud-init. The
fix there was also to use O_DIRECT in order to read the superblock.
This patch uses a similar approach: read the superblock with O_DIRECT in
the case where a bad checksum is detected.
[1] https://lore.kernel.org/linux-ext4/20230609042239.GA1436857@mit.edu/
Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
---
libblkid/src/blkidP.h | 5 +++++
libblkid/src/probe.c | 27 +++++++++++++++++++++++++++
libblkid/src/superblocks/ext.c | 22 ++++++++++++++++++++--
3 files changed, 52 insertions(+), 2 deletions(-)
diff --git a/libblkid/src/blkidP.h b/libblkid/src/blkidP.h
index f71e13cb3..fa2379c4d 100644
--- a/libblkid/src/blkidP.h
+++ b/libblkid/src/blkidP.h
@@ -421,6 +421,11 @@ extern const unsigned char *blkid_probe_get_buffer(blkid_probe pr,
__attribute__((nonnull))
__attribute__((warn_unused_result));
+extern const unsigned char *blkid_probe_get_buffer_direct(blkid_probe pr,
+ uint64_t off, uint64_t len)
+ __attribute__((nonnull))
+ __attribute__((warn_unused_result));
+
extern const unsigned char *blkid_probe_get_sector(blkid_probe pr, unsigned int sector)
__attribute__((nonnull))
__attribute__((warn_unused_result));
diff --git a/libblkid/src/probe.c b/libblkid/src/probe.c
index 5a17e873d..e029f5649 100644
--- a/libblkid/src/probe.c
+++ b/libblkid/src/probe.c
@@ -791,6 +791,33 @@ const unsigned char *blkid_probe_get_buffer(blkid_probe pr, uint64_t off, uint64
return real_off ? bf->data + (real_off - bf->off + bias) : bf->data + bias;
}
+/*
+ * This is blkid_probe_get_buffer with the read done as an O_DIRECT operation.
+ * Note that @off is offset within probing area, the probing area is defined by
+ * pr->off and pr->size.
+ */
+const unsigned char *blkid_probe_get_buffer_direct(blkid_probe pr, uint64_t off, uint64_t len)
+{
+ const unsigned char *ret = NULL;
+ int flags, rc, olderrno;
+
+ flags = fcntl(pr->fd, F_GETFL);
+ rc = fcntl(pr->fd, F_SETFL, flags | O_DIRECT);
+ if (rc) {
+ DBG(LOWPROBE, ul_debug("fcntl F_SETFL failed to set O_DIRECT"));
+ errno = 0;
+ return NULL;
+ }
+ ret = blkid_probe_get_buffer(pr, off, len);
+ olderrno = errno;
+ rc = fcntl(pr->fd, F_SETFL, flags);
+ if (rc) {
+ DBG(LOWPROBE, ul_debug("fcntl F_SETFL failed to clear O_DIRECT"));
+ errno = olderrno;
+ }
+ return ret;
+}
+
/**
* blkid_probe_reset_buffers:
* @pr: prober
diff --git a/libblkid/src/superblocks/ext.c b/libblkid/src/superblocks/ext.c
index a5f34810f..7a9f8c9b9 100644
--- a/libblkid/src/superblocks/ext.c
+++ b/libblkid/src/superblocks/ext.c
@@ -156,8 +156,26 @@ static struct ext2_super_block *ext_get_super(
return NULL;
if (le32_to_cpu(es->s_feature_ro_compat) & EXT4_FEATURE_RO_COMPAT_METADATA_CSUM) {
uint32_t csum = crc32c(~0, es, offsetof(struct ext2_super_block, s_checksum));
- if (!blkid_probe_verify_csum(pr, csum, le32_to_cpu(es->s_checksum)))
- return NULL;
+ /*
+ * A read of the superblock can race with other updates to the
+ * same superblock. In the unlikely event that this occurs and
+ * we see a checksum failure, re-read the superblock with
+ * O_DIRECT to ensure that it's consistent. If it _still_ fails
+ * then declare a checksum mismatch.
+ */
+ if (!blkid_probe_verify_csum(pr, csum, le32_to_cpu(es->s_checksum))) {
+ if (blkid_probe_reset_buffers(pr))
+ return NULL;
+
+ es = (struct ext2_super_block *)
+ blkid_probe_get_buffer_direct(pr, EXT_SB_OFF, sizeof(struct ext2_super_block));
+ if (!es)
+ return NULL;
+
+ csum = crc32c(~0, es, offsetof(struct ext2_super_block, s_checksum));
+ if (!blkid_probe_verify_csum(pr, csum, le32_to_cpu(es->s_checksum)))
+ return NULL;
+ }
}
if (fc)
*fc = le32_to_cpu(es->s_feature_compat);
--
2.25.1
^ permalink raw reply related [flat|nested] 11+ messages in thread* Re: [systemd-devel] [PATCH] libblkid: fix spurious ext superblock checksum mismatches
2024-11-18 20:35 [PATCH] libblkid: fix spurious ext superblock checksum mismatches Krister Johansen
@ 2024-11-18 22:36 ` Lennart Poettering
2024-11-18 23:13 ` Krister Johansen
2024-11-19 8:15 ` [EXT] " Windl, Ulrich
1 sibling, 1 reply; 11+ messages in thread
From: Lennart Poettering @ 2024-11-18 22:36 UTC (permalink / raw)
To: Krister Johansen
Cc: util-linux, Karel Zak, systemd-devel, David Reaver,
Theodore Ts'o
On Mo, 18.11.24 12:35, Krister Johansen (kjlx@templeofstupid.com) wrote:
> Reads of ext superblocks can race with updates. If libblkid observes a
> checksum mismatch, re-read the superblock with O_DIRECT in order to get
> a consistent view of its contents. Only if the O_DIRECT read fails the
> checksum should it be reported to have failed.
>
> This fixes a problem where devices that were named by filesystem label
> failed to be found when systemd attempted to mount them on boot. The
> problem was caused by systemd-udevd using libblkid. If a read of a
> superblock resulted in a checksum mismatch, udev will remove the
> by-label links which result in the mount call failing to find the
> device. The checksum mismatch that was triggering the problem was
> spurious, and when we use O_DIRECT, or even perform a subsequent retry,
> the superblock is correctly read. This resulted in a failure to mount
> /boot in one out of every 2,000 or so attempts in our environment.
>
> e2fsprogs fixed[1] an identical version of this bug that afflicted
> resize2fs during online grow operations when run from cloud-init. The
> fix there was also to use O_DIRECT in order to read the superblock.
> This patch uses a similar approach: read the superblock with O_DIRECT in
> the case where a bad checksum is detected.
Umpf. udev has a clearly defined protocol to comprehensively avoid
such issues:
https://systemd.io/BLOCK_DEVICE_LOCKING
Partitioning tools should simply follow this logic, and udev and
programs downstream from it will not even be tempted to operate with
half-written superblocks, partition tables or such.
Hence, I personally am not convinced of that O_DIRECT approach. First
of all, it only works on superblocks that have a useful checksum
covering enough relevant data, and it can never really catch scenarios
where a disk is comprehensively repartitioned, i.e. one or more fs and
partition metadata changed at the same time...
Lennart
--
Lennart Poettering, Berlin
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [systemd-devel] [PATCH] libblkid: fix spurious ext superblock checksum mismatches
2024-11-18 22:36 ` [systemd-devel] " Lennart Poettering
@ 2024-11-18 23:13 ` Krister Johansen
2024-11-19 8:19 ` [EXT] " Windl, Ulrich
0 siblings, 1 reply; 11+ messages in thread
From: Krister Johansen @ 2024-11-18 23:13 UTC (permalink / raw)
To: Lennart Poettering
Cc: util-linux, Karel Zak, systemd-devel, David Reaver,
Theodore Ts'o
On Mon, Nov 18, 2024 at 11:36:48PM +0100, Lennart Poettering wrote:
> On Mo, 18.11.24 12:35, Krister Johansen (kjlx@templeofstupid.com) wrote:
>
> > Reads of ext superblocks can race with updates. If libblkid observes a
> > checksum mismatch, re-read the superblock with O_DIRECT in order to get
> > a consistent view of its contents. Only if the O_DIRECT read fails the
> > checksum should it be reported to have failed.
> >
> > This fixes a problem where devices that were named by filesystem label
> > failed to be found when systemd attempted to mount them on boot. The
> > problem was caused by systemd-udevd using libblkid. If a read of a
> > superblock resulted in a checksum mismatch, udev will remove the
> > by-label links which result in the mount call failing to find the
> > device. The checksum mismatch that was triggering the problem was
> > spurious, and when we use O_DIRECT, or even perform a subsequent retry,
> > the superblock is correctly read. This resulted in a failure to mount
> > /boot in one out of every 2,000 or so attempts in our environment.
> >
> > e2fsprogs fixed[1] an identical version of this bug that afflicted
> > resize2fs during online grow operations when run from cloud-init. The
> > fix there was also to use O_DIRECT in order to read the superblock.
> > This patch uses a similar approach: read the superblock with O_DIRECT in
> > the case where a bad checksum is detected.
>
> Umpf. udev has a clearly defined protocol to comprehensively avoid
> such issues:
>
> https://systemd.io/BLOCK_DEVICE_LOCKING
>
> Partitioning tools should simply follow this logic, and udev and
> programs downstream from it will not even be tempted to operate with
> half-written superblocks, partition tables or such.
>
> Hence, I personally am not convinced of that O_DIRECT approach. First
> of all, it only works on superblocks that have a useful checksum
> covering enough relevant data, and it can never really catch scenarios
> where a disk is comprehensively repartitioned, i.e. one or more fs and
> partition metadata changed at the same time...
I may have done a poor job of explaining this. This is ext writing its
own superblock from the kernel, but reads seeing an potentially
inconsistent view of that write. O_DIRECT causes us to seralize with
the locks ext4 holds when it writes the superblock, which prevents the
read from observing a partial update.
It's not necessarily the partitioning tools causing this, but any
filesystem level udpdate that modifies the contents of the superblock.
-K
^ permalink raw reply [flat|nested] 11+ messages in thread
* RE: [EXT] Re: [systemd-devel] [PATCH] libblkid: fix spurious ext superblock checksum mismatches
2024-11-18 23:13 ` Krister Johansen
@ 2024-11-19 8:19 ` Windl, Ulrich
0 siblings, 0 replies; 11+ messages in thread
From: Windl, Ulrich @ 2024-11-19 8:19 UTC (permalink / raw)
To: Krister Johansen, Lennart Poettering
Cc: util-linux@vger.kernel.org, Karel Zak,
systemd-devel@lists.freedesktop.org, David Reaver,
Theodore Ts'o
> -----Original Message-----
> From: systemd-devel <systemd-devel-bounces@lists.freedesktop.org> On
> Behalf Of Krister Johansen
> Sent: Tuesday, November 19, 2024 12:14 AM
> To: Lennart Poettering <lennart@poettering.net>
> Cc: util-linux@vger.kernel.org; Karel Zak <kzak@redhat.com>; systemd-
> devel@lists.freedesktop.org; David Reaver <me@davidreaver.com>;
> Theodore Ts'o <tytso@mit.edu>
> Subject: [EXT] Re: [systemd-devel] [PATCH] libblkid: fix spurious ext
> superblock checksum mismatches
>
...
> I may have done a poor job of explaining this. This is ext writing its
> own superblock from the kernel, but reads seeing an potentially
> inconsistent view of that write. O_DIRECT causes us to seralize with
> the locks ext4 holds when it writes the superblock, which prevents the
> read from observing a partial update.
>
> It's not necessarily the partitioning tools causing this, but any
> filesystem level udpdate that modifies the contents of the superblock.
As I wrote before: I don't think the needless O_DIRECT fixes things some other code broke.
^ permalink raw reply [flat|nested] 11+ messages in thread
* RE: [EXT] [systemd-devel] [PATCH] libblkid: fix spurious ext superblock checksum mismatches
2024-11-18 20:35 [PATCH] libblkid: fix spurious ext superblock checksum mismatches Krister Johansen
2024-11-18 22:36 ` [systemd-devel] " Lennart Poettering
@ 2024-11-19 8:15 ` Windl, Ulrich
2024-11-19 17:49 ` Theodore Ts'o
1 sibling, 1 reply; 11+ messages in thread
From: Windl, Ulrich @ 2024-11-19 8:15 UTC (permalink / raw)
To: Krister Johansen, util-linux@vger.kernel.org
Cc: Karel Zak, systemd-devel@lists.freedesktop.org, David Reaver,
Theodore Ts'o
> -----Original Message-----
> From: systemd-devel <systemd-devel-bounces@lists.freedesktop.org> On
> Behalf Of Krister Johansen
> Sent: Monday, November 18, 2024 9:35 PM
> To: util-linux@vger.kernel.org
> Cc: Karel Zak <kzak@redhat.com>; systemd-devel@lists.freedesktop.org;
> David Reaver <me@davidreaver.com>; Theodore Ts'o <tytso@mit.edu>
> Subject: [EXT] [systemd-devel] [PATCH] libblkid: fix spurious ext superblock
> checksum mismatches
>
> Reads of ext superblocks can race with updates. If libblkid observes a
[Windl, Ulrich]
I really wonder:
Can one single block be inconsistent when read, considering that the block on disk is not inconsistent?
That would mean that the block buffer you are reading is being modified by another process.
AFAIK the basic UNIX semantic guarantee that a block is read atomically; if it's not, something is severely broken, and I don't think that O_DIRECT fixes that.
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [EXT] [systemd-devel] [PATCH] libblkid: fix spurious ext superblock checksum mismatches
2024-11-19 8:15 ` [EXT] " Windl, Ulrich
@ 2024-11-19 17:49 ` Theodore Ts'o
2024-11-19 23:59 ` Krister Johansen
2024-11-21 10:44 ` Karel Zak
0 siblings, 2 replies; 11+ messages in thread
From: Theodore Ts'o @ 2024-11-19 17:49 UTC (permalink / raw)
To: Windl, Ulrich
Cc: Krister Johansen, util-linux@vger.kernel.org, Karel Zak,
systemd-devel@lists.freedesktop.org, David Reaver
On Tue, Nov 19, 2024 at 08:15:29AM +0000, Windl, Ulrich wrote:
> > Reads of ext superblocks can race with updates. If libblkid observes a
> [Windl, Ulrich]
>
> I really wonder:
>
> Can one single block be inconsistent when read, considering that the
> block on disk is not inconsistent? That would mean that the block
> buffer you are reading is being modified by another process. AFAIK
> the basic UNIX semantic guarantee that a block is read atomically;
> if it's not, something is severely broken, and I don't think that
> O_DIRECT fixes that.
Yes, this can happen if the file system is mounted. The reason for
this is that the kernel updates metadata blocks via the block buffer
cache, with the jbd2 (journaled block layer v2) subsystem managing the
atomic updates. The jbd2 layer will block buffer cache writebacks
until the changes are committed in a jbd2 transaction. So the version
on disk is guaranteed to be consistent.
However, a buffer cache read does not have any consistency guarantees,
and if the file system is being actively modified, it is possible that
you could a superblock where the checksum hasn't yet been updated.
The O_DIRECT read isn't a magic bullet. For example, if you have a
scratch file system which is guaranteed not to survive a Kubernetes or
Borg container getting aborted, you might decide to format the file
system without a jbd2 journal, since that would be more efficient, and
by definition you don't care about the contents of the file system
after a crash. So there are millions of ext4 file systems in
hyperscale computing environments that are created without a journal;
and in that case, O_DIRECT will not be sufficient for guaranteeing a
consistent read of the superblock.
In the long term, I'll probably be adding an ioctl which will allow
userspace to read the superblock consistently for a mounted file
system. We actually already have ioctls, EXT4_IOC_GETFSUUID and
FS_IOC_GETFSLABEL which will allow userspace to fetch the UUID and
Label for a mounted file system. So eventually, I'll probably end
up adding EXT4_IOC_GET_SUPERBLOCK. Let me know if this is something
that util-linux would very much want.
Note: this does require figuring out (a) whether the file system is
mounted, and (b) if so, where is it mounted. So if blkid wants to use
this, it would need to have something like the function
ext2fs_check_mount_point[1].
Cheers,
- Ted
[1] https://github.com/tytso/e2fsprogs/blob/950a0d69c82b585aba30118f01bf80151deffe8c/lib/ext2fs/ismounted.c#L363
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [EXT] [systemd-devel] [PATCH] libblkid: fix spurious ext superblock checksum mismatches
2024-11-19 17:49 ` Theodore Ts'o
@ 2024-11-19 23:59 ` Krister Johansen
2024-11-20 6:07 ` Theodore Ts'o
2024-11-21 10:44 ` Karel Zak
1 sibling, 1 reply; 11+ messages in thread
From: Krister Johansen @ 2024-11-19 23:59 UTC (permalink / raw)
To: Theodore Ts'o
Cc: Windl, Ulrich, util-linux@vger.kernel.org, Karel Zak,
systemd-devel@lists.freedesktop.org, David Reaver
On Tue, Nov 19, 2024 at 09:49:57AM -0800, Theodore Ts'o wrote:
> Yes, this can happen if the file system is mounted. The reason for
> this is that the kernel updates metadata blocks via the block buffer
> cache, with the jbd2 (journaled block layer v2) subsystem managing the
> atomic updates. The jbd2 layer will block buffer cache writebacks
> until the changes are committed in a jbd2 transaction. So the version
> on disk is guaranteed to be consistent.
>
> However, a buffer cache read does not have any consistency guarantees,
> and if the file system is being actively modified, it is possible that
> you could a superblock where the checksum hasn't yet been updated.
>
> The O_DIRECT read isn't a magic bullet. For example, if you have a
> scratch file system which is guaranteed not to survive a Kubernetes or
> Borg container getting aborted, you might decide to format the file
> system without a jbd2 journal, since that would be more efficient, and
> by definition you don't care about the contents of the file system
> after a crash. So there are millions of ext4 file systems in
> hyperscale computing environments that are created without a journal;
> and in that case, O_DIRECT will not be sufficient for guaranteeing a
> consistent read of the superblock.
Thanks for the additional detail on jbd2's involvement. When I
originally encountered this, it was on a 5.15 kernel where
ext4_commit_super() was still using mark_buffer_dirty() prior to
submitting the IO for the superblock write. I had managed to convince
myself that ext4_commit_super() holding the BH_lock combined with
O_DIRECT waiting for the dirty buffers associated with the superblock to
get written was sufficient to get a consistent read of the superblock.
I missed that this was changed as part of another bugfix[1].
The version of this fix that you applied for resize2fs has resulted in
no re-occurence of the problem in the environments where we had been
previously encountering the problem.
With libblkid, it's resulted in systemd-udevd removing
/dev/disk/by-label and /dev/disk/by-uuid links for devices when the
superblock checksum can't be read. This in turn has resulted in /boot
failing to mount (when it's on a separate filesystem), update-grub calls
failing because /boot isn't mounted, and we recently had a mkinitramfs
fail because the /dev/disk/by-uuid links were missing for the root
device.
The patch I sent has resolved the problems in our production
environments, and was also run through a battery of synthetic boot
tests. We've seen no re-occurence with it applied. I've also run the
change against the util-linux unit tests and observed no regressions.
I included systemd-devel on this in case other users were observing
disappearing /dev/disk/ links. I hoped I might save somebody else from
having to debug this a second time.
-K
[1] https://lore.kernel.org/all/20220520023216.3065073-1-yi.zhang@huawei.com/
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [EXT] [systemd-devel] [PATCH] libblkid: fix spurious ext superblock checksum mismatches
2024-11-19 23:59 ` Krister Johansen
@ 2024-11-20 6:07 ` Theodore Ts'o
0 siblings, 0 replies; 11+ messages in thread
From: Theodore Ts'o @ 2024-11-20 6:07 UTC (permalink / raw)
To: Krister Johansen
Cc: Windl, Ulrich, util-linux@vger.kernel.org, Karel Zak,
systemd-devel@lists.freedesktop.org, David Reaver
On Tue, Nov 19, 2024 at 03:59:53PM -0800, Krister Johansen wrote:
>
> Thanks for the additional detail on jbd2's involvement. When I
> originally encountered this, it was on a 5.15 kernel where
> ext4_commit_super() was still using mark_buffer_dirty() prior to
> submitting the IO for the superblock write. I had managed to convince
> myself that ext4_commit_super() holding the BH_lock combined with
> O_DIRECT waiting for the dirty buffers associated with the superblock to
> get written was sufficient to get a consistent read of the superblock.
> I missed that this was changed as part of another bugfix[1].
Actually, even in 5.15, ext4_commit_super() only gets used (a) during
the mount paths before the journal has been initialized, (b) in the
umount code paths after the journal has been shutdown, (c) for file
systems without a journal, or (d) when the journal update has failed ---
for example, if the journal was aborted due to catastrophic failure.
Most of the time during normal operation, say, when the file system is
being resized, or the orphan list is being actively manipulated during
a huge number of unlinks or truncates, or changing the UUID using
EXT4_IOC_SETFSUUID, etc., the kernel updates the superblock using a
jbd2 transaction, and not ext4_commit_super(). So the change which
you cited in [1] doesn't make a change in practice unless the journal
can't be used for some reason.
If I remember correctly, the patch submitter for [1] hit the problem
they were trying to fix after a journal abort while they were doing
fault injection to test I/O error handling. (In other words, case (d)
above.)
- Ted
> [1] https://lore.kernel.org/all/20220520023216.3065073-1-yi.zhang@huawei.com/
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [EXT] [systemd-devel] [PATCH] libblkid: fix spurious ext superblock checksum mismatches
2024-11-19 17:49 ` Theodore Ts'o
2024-11-19 23:59 ` Krister Johansen
@ 2024-11-21 10:44 ` Karel Zak
2024-11-21 15:55 ` Theodore Ts'o
2024-11-22 8:54 ` Krister Johansen
1 sibling, 2 replies; 11+ messages in thread
From: Karel Zak @ 2024-11-21 10:44 UTC (permalink / raw)
To: Theodore Ts'o
Cc: Windl, Ulrich, Krister Johansen, util-linux@vger.kernel.org,
systemd-devel@lists.freedesktop.org, David Reaver
On Tue, Nov 19, 2024 at 09:49:57AM GMT, Theodore Ts'o wrote:
> The O_DIRECT read isn't a magic bullet.
Hmm... the nice thing about this patch is that it only uses O_DIRECT
when necessary (if the ordinary method fails). This means that it does
not add complexity or overhead to regular FS probing.
> For example, if you have a
> scratch file system which is guaranteed not to survive a Kubernetes or
> Borg container getting aborted, you might decide to format the file
> system without a jbd2 journal, since that would be more efficient, and
> by definition you don't care about the contents of the file system
> after a crash. So there are millions of ext4 file systems in
> hyperscale computing environments that are created without a journal;
> and in that case, O_DIRECT will not be sufficient for guaranteeing a
> consistent read of the superblock.
>
> In the long term, I'll probably be adding an ioctl which will allow
> userspace to read the superblock consistently for a mounted file
> system. We actually already have ioctls, EXT4_IOC_GETFSUUID and
> FS_IOC_GETFSLABEL which will allow userspace to fetch the UUID and
> Label for a mounted file system. So eventually, I'll probably end
> up adding EXT4_IOC_GET_SUPERBLOCK. Let me know if this is something
> that util-linux would very much want.
I doubt it will be helpful for us.
I believe that EXT4_IOC_GET_SUPERBLOCK will be used with a mountpoint
file descriptor, but libblkid works directly with the block device
(e.g. open(/dev/sda1)) where it searches for valid filesystems.
Another issue is that libblkid does not check if the device is
mounted, so the FS prober can be triggered in all cases. It simply
calls seek()+read() and tries to be smart.
Ideally, we would have a generic ioctl (for block devices) to ask the
kernel if a superblock at a specific location is valid.
ioctl(fd, BLKVERIFYFS, { .fsname="ext4", .offset=123456 })
This would greatly simplify the complex processes currently in
userspace, and it would remove duplicity. Currently, we have many file
system checks in libblkid to verify that the on-disk data are not
random bytes. These same checks are also present in file system
drivers when reading the superblock.
Another userspace dream ... :-)
> Note: this does require figuring out (a) whether the file system is
> mounted, and (b) if so, where is it mounted. So if blkid wants to use
> this, it would need to have something like the function
> ext2fs_check_mount_point[1].
Yes, that's not good.
Karel
--
Karel Zak <kzak@redhat.com>
http://karelzak.blogspot.com
^ permalink raw reply [flat|nested] 11+ messages in thread* Re: [EXT] [systemd-devel] [PATCH] libblkid: fix spurious ext superblock checksum mismatches
2024-11-21 10:44 ` Karel Zak
@ 2024-11-21 15:55 ` Theodore Ts'o
2024-11-22 8:54 ` Krister Johansen
1 sibling, 0 replies; 11+ messages in thread
From: Theodore Ts'o @ 2024-11-21 15:55 UTC (permalink / raw)
To: Karel Zak
Cc: Windl, Ulrich, Krister Johansen, util-linux@vger.kernel.org,
systemd-devel@lists.freedesktop.org, David Reaver
On Thu, Nov 21, 2024 at 11:44:14AM +0100, Karel Zak wrote:
> I doubt it will be helpful for us.
>
> I believe that EXT4_IOC_GET_SUPERBLOCK will be used with a mountpoint
> file descriptor, but libblkid works directly with the block device
> (e.g. open(/dev/sda1)) where it searches for valid filesystems.
Yeah, that's why I haven't prioritized implementing it. Higher
priority is to implement ioctls so that tune2fs will no longer need to
modify the superblock while the file system is mounted, so we can
allow prohibiting read/write access to the block device while it is
mounted. (Well, after waiting a decent interval so that distros
everywhere can update to a sufficiently new version of e2fsprogs.)
I had guessed that libblkid wouldn't be excited about trying to
determine the mountpoint and using an ioctl that required an open fd
on the mountpoint, but if I had been wrong, I would have been happy to
priotize EXT4_IOC_GET_SUPERBLOCK higher on my todo list.
> Another issue is that libblkid does not check if the device is
> mounted, so the FS prober can be triggered in all cases. It simply
> calls seek()+read() and tries to be smart.
Well, what I had been proposing was something that could be used in
by the ext[234] specific probe code.
> Ideally, we would have a generic ioctl (for block devices) to ask the
> kernel if a superblock at a specific location is valid.
>
> ioctl(fd, BLKVERIFYFS, { .fsname="ext4", .offset=123456 })
I wouldn't have thought this to be that useful since there are plenty
of file systems known by libblkid that the kernel doesn't support ---
and even if the source code exists in the kernel, there is no
guarantee that it is actually compiled into a particular kernel image.
(Exhibit 1: Reiserfs)
So would it really simplify libblkid all that much?
Cheers,
- Ted
^ permalink raw reply [flat|nested] 11+ messages in thread* Re: [EXT] [systemd-devel] [PATCH] libblkid: fix spurious ext superblock checksum mismatches
2024-11-21 10:44 ` Karel Zak
2024-11-21 15:55 ` Theodore Ts'o
@ 2024-11-22 8:54 ` Krister Johansen
1 sibling, 0 replies; 11+ messages in thread
From: Krister Johansen @ 2024-11-22 8:54 UTC (permalink / raw)
To: Karel Zak
Cc: Theodore Ts'o, util-linux@vger.kernel.org,
systemd-devel@lists.freedesktop.org, David Reaver
On Thu, Nov 21, 2024 at 11:44:14AM +0100, Karel Zak wrote:
> On Tue, Nov 19, 2024 at 09:49:57AM GMT, Theodore Ts'o wrote:
> > The O_DIRECT read isn't a magic bullet.
>
> Hmm... the nice thing about this patch is that it only uses O_DIRECT
> when necessary (if the ordinary method fails). This means that it does
> not add complexity or overhead to regular FS probing.
Let me know if there are additional changes you'd like me to make to
this patch. I hope that it's not too unappealing. It's resolved the
problem in our environment where we went from seeing this about 20ish
times a day to none.
-K
^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2024-11-22 8:54 UTC | newest]
Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-11-18 20:35 [PATCH] libblkid: fix spurious ext superblock checksum mismatches Krister Johansen
2024-11-18 22:36 ` [systemd-devel] " Lennart Poettering
2024-11-18 23:13 ` Krister Johansen
2024-11-19 8:19 ` [EXT] " Windl, Ulrich
2024-11-19 8:15 ` [EXT] " Windl, Ulrich
2024-11-19 17:49 ` Theodore Ts'o
2024-11-19 23:59 ` Krister Johansen
2024-11-20 6:07 ` Theodore Ts'o
2024-11-21 10:44 ` Karel Zak
2024-11-21 15:55 ` Theodore Ts'o
2024-11-22 8:54 ` Krister Johansen
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox