All of lore.kernel.org
 help / color / mirror / Atom feed
From: Krister Johansen <kjlx@templeofstupid.com>
To: Theodore Ts'o <tytso@mit.edu>
Cc: "Windl, Ulrich" <u.windl@ukr.de>,
	"util-linux@vger.kernel.org" <util-linux@vger.kernel.org>,
	Karel Zak <kzak@redhat.com>,
	"systemd-devel@lists.freedesktop.org"
	<systemd-devel@lists.freedesktop.org>,
	David Reaver <me@davidreaver.com>
Subject: Re: [EXT] [systemd-devel] [PATCH] libblkid: fix spurious ext superblock checksum mismatches
Date: Tue, 19 Nov 2024 15:59:53 -0800	[thread overview]
Message-ID: <20241119235953.GA1865@templeofstupid.com> (raw)
In-Reply-To: <20241119174957.GA3484088@mit.edu>

On Tue, Nov 19, 2024 at 09:49:57AM -0800, Theodore Ts'o wrote:
 
> Yes, this can happen if the file system is mounted.  The reason for
> this is that the kernel updates metadata blocks via the block buffer
> cache, with the jbd2 (journaled block layer v2) subsystem managing the
> atomic updates.  The jbd2 layer will block buffer cache writebacks
> until the changes are committed in a jbd2 transaction.  So the version
> on disk is guaranteed to be consistent.
> 
> However, a buffer cache read does not have any consistency guarantees,
> and if the file system is being actively modified, it is possible that
> you could a superblock where the checksum hasn't yet been updated.
> 
> The O_DIRECT read isn't a magic bullet.  For example, if you have a
> scratch file system which is guaranteed not to survive a Kubernetes or
> Borg container getting aborted, you might decide to format the file
> system without a jbd2 journal, since that would be more efficient, and
> by definition you don't care about the contents of the file system
> after a crash.  So there are millions of ext4 file systems in
> hyperscale computing environments that are created without a journal;
> and in that case, O_DIRECT will not be sufficient for guaranteeing a
> consistent read of the superblock.

Thanks for the additional detail on jbd2's involvement.  When I
originally encountered this, it was on a 5.15 kernel where
ext4_commit_super() was still using mark_buffer_dirty() prior to
submitting the IO for the superblock write. I had managed to convince
myself that ext4_commit_super() holding the BH_lock combined with
O_DIRECT waiting for the dirty buffers associated with the superblock to
get written was sufficient to get a consistent read of the superblock.
I missed that this was changed as part of another bugfix[1].

The version of this fix that you applied for resize2fs has resulted in
no re-occurence of the problem in the environments where we had been
previously encountering the problem.

With libblkid, it's resulted in systemd-udevd removing
/dev/disk/by-label and /dev/disk/by-uuid links for devices when the
superblock checksum can't be read.  This in turn has resulted in /boot
failing to mount (when it's on a separate filesystem), update-grub calls
failing because /boot isn't mounted, and we recently had a mkinitramfs
fail because the /dev/disk/by-uuid links were missing for the root
device.

The patch I sent has resolved the problems in our production
environments, and was also run through a battery of synthetic boot
tests.  We've seen no re-occurence with it applied.  I've also run the
change against the util-linux unit tests and observed no regressions.

I included systemd-devel on this in case other users were observing
disappearing /dev/disk/ links.  I hoped I might save somebody else from
having to debug this a second time.

-K


[1] https://lore.kernel.org/all/20220520023216.3065073-1-yi.zhang@huawei.com/


  reply	other threads:[~2024-11-20  0:08 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-11-18 20:35 [PATCH] libblkid: fix spurious ext superblock checksum mismatches Krister Johansen
2024-11-18 22:36 ` [systemd-devel] " Lennart Poettering
2024-11-18 23:13   ` Krister Johansen
2024-11-19  8:19     ` [EXT] " Windl, Ulrich
2024-11-19  8:15 ` [EXT] " Windl, Ulrich
2024-11-19 17:49   ` Theodore Ts'o
2024-11-19 23:59     ` Krister Johansen [this message]
2024-11-20  6:07       ` Theodore Ts'o
2024-11-21 10:44     ` Karel Zak
2024-11-21 15:55       ` Theodore Ts'o
2024-11-22  8:54       ` Krister Johansen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20241119235953.GA1865@templeofstupid.com \
    --to=kjlx@templeofstupid.com \
    --cc=kzak@redhat.com \
    --cc=me@davidreaver.com \
    --cc=systemd-devel@lists.freedesktop.org \
    --cc=tytso@mit.edu \
    --cc=u.windl@ukr.de \
    --cc=util-linux@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.