From: Krister Johansen <kjlx@templeofstupid.com>
To: Theodore Ts'o <tytso@mit.edu>
Cc: "Windl, Ulrich" <u.windl@ukr.de>,
"util-linux@vger.kernel.org" <util-linux@vger.kernel.org>,
Karel Zak <kzak@redhat.com>,
"systemd-devel@lists.freedesktop.org"
<systemd-devel@lists.freedesktop.org>,
David Reaver <me@davidreaver.com>
Subject: Re: [EXT] [systemd-devel] [PATCH] libblkid: fix spurious ext superblock checksum mismatches
Date: Tue, 19 Nov 2024 15:59:53 -0800 [thread overview]
Message-ID: <20241119235953.GA1865@templeofstupid.com> (raw)
In-Reply-To: <20241119174957.GA3484088@mit.edu>
On Tue, Nov 19, 2024 at 09:49:57AM -0800, Theodore Ts'o wrote:
> Yes, this can happen if the file system is mounted. The reason for
> this is that the kernel updates metadata blocks via the block buffer
> cache, with the jbd2 (journaled block layer v2) subsystem managing the
> atomic updates. The jbd2 layer will block buffer cache writebacks
> until the changes are committed in a jbd2 transaction. So the version
> on disk is guaranteed to be consistent.
>
> However, a buffer cache read does not have any consistency guarantees,
> and if the file system is being actively modified, it is possible that
> you could a superblock where the checksum hasn't yet been updated.
>
> The O_DIRECT read isn't a magic bullet. For example, if you have a
> scratch file system which is guaranteed not to survive a Kubernetes or
> Borg container getting aborted, you might decide to format the file
> system without a jbd2 journal, since that would be more efficient, and
> by definition you don't care about the contents of the file system
> after a crash. So there are millions of ext4 file systems in
> hyperscale computing environments that are created without a journal;
> and in that case, O_DIRECT will not be sufficient for guaranteeing a
> consistent read of the superblock.
Thanks for the additional detail on jbd2's involvement. When I
originally encountered this, it was on a 5.15 kernel where
ext4_commit_super() was still using mark_buffer_dirty() prior to
submitting the IO for the superblock write. I had managed to convince
myself that ext4_commit_super() holding the BH_lock combined with
O_DIRECT waiting for the dirty buffers associated with the superblock to
get written was sufficient to get a consistent read of the superblock.
I missed that this was changed as part of another bugfix[1].
The version of this fix that you applied for resize2fs has resulted in
no re-occurence of the problem in the environments where we had been
previously encountering the problem.
With libblkid, it's resulted in systemd-udevd removing
/dev/disk/by-label and /dev/disk/by-uuid links for devices when the
superblock checksum can't be read. This in turn has resulted in /boot
failing to mount (when it's on a separate filesystem), update-grub calls
failing because /boot isn't mounted, and we recently had a mkinitramfs
fail because the /dev/disk/by-uuid links were missing for the root
device.
The patch I sent has resolved the problems in our production
environments, and was also run through a battery of synthetic boot
tests. We've seen no re-occurence with it applied. I've also run the
change against the util-linux unit tests and observed no regressions.
I included systemd-devel on this in case other users were observing
disappearing /dev/disk/ links. I hoped I might save somebody else from
having to debug this a second time.
-K
[1] https://lore.kernel.org/all/20220520023216.3065073-1-yi.zhang@huawei.com/
next prev parent reply other threads:[~2024-11-20 0:08 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-11-18 20:35 [PATCH] libblkid: fix spurious ext superblock checksum mismatches Krister Johansen
2024-11-18 22:36 ` [systemd-devel] " Lennart Poettering
2024-11-18 23:13 ` Krister Johansen
2024-11-19 8:19 ` [EXT] " Windl, Ulrich
2024-11-19 8:15 ` [EXT] " Windl, Ulrich
2024-11-19 17:49 ` Theodore Ts'o
2024-11-19 23:59 ` Krister Johansen [this message]
2024-11-20 6:07 ` Theodore Ts'o
2024-11-21 10:44 ` Karel Zak
2024-11-21 15:55 ` Theodore Ts'o
2024-11-22 8:54 ` Krister Johansen
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20241119235953.GA1865@templeofstupid.com \
--to=kjlx@templeofstupid.com \
--cc=kzak@redhat.com \
--cc=me@davidreaver.com \
--cc=systemd-devel@lists.freedesktop.org \
--cc=tytso@mit.edu \
--cc=u.windl@ukr.de \
--cc=util-linux@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox