From: Christian Brauner <brauner@kernel.org>
To: Jan Kara <jack@suse.cz>
Cc: Christoph Hellwig <hch@lst.de>, linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH RFC 0/6] fs,block: yield devices
Date: Thu, 26 Oct 2023 14:07:38 +0200
Message-ID: <20231026-marsch-tierzucht-0221d75b18ea@brauner>
In-Reply-To: <20231026103503.ldupo3nhynjjkz45@quack3>

On Thu, Oct 26, 2023 at 12:35:03PM +0200, Jan Kara wrote:
> On Wed 25-10-23 22:46:29, Christian Brauner wrote:
> > On Wed, Oct 25, 2023 at 07:20:57PM +0200, Jan Kara wrote:
> > > Hello!
> > > 
> > > On Tue 24-10-23 16:53:38, Christian Brauner wrote:
> > > > This is a mechanism that allows the holder of a block device to yield
> > > > device access before actually closing the block device.
> > > > 
> > > > If someone yields a device then any concurrent opener claiming the
> > > > device exclusively with the same blk_holder_ops as the current owner can
> > > > wait for the device to be given up. Filesystems by default use
> > > > fs_holder_ops and so can wait on each other.
> > > > 
> > > > This mechanism allows us to simplify superblock handling quite a bit at
> > > > the expense of requiring filesystems to yield devices. A filesystem must
> > > > yield its devices under s_umount; the yield itself is cheap, which
> > > > allows the costly release work to be done outside of s_umount.
> > > > 
> > > > There's nothing wrong with the way we currently do things but this does
> > > > allow us to simplify things and kill a whole class of theoretical UAFs
> > > > when walking the superblock list.
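
For concreteness, the filesystem side would look roughly like this. A
sketch only, not the actual patches: the ->yield_devices() hook is what
patch 6/6 adds, but the bdev_yield() helper name and the extra journal
device are made up for illustration:

struct example_fs_info {
        struct block_device *journal_bdev;
};

static void example_yield_devices(struct super_block *sb)
{
        struct example_fs_info *info = sb->s_fs_info;

        /*
         * Called under sb->s_umount: only mark the devices as about to
         * be released (cf. patch 5/6) so that a concurrent exclusive
         * opener with the same holder ops waits instead of failing
         * with -EBUSY. The potentially expensive release itself
         * happens later, outside of s_umount.
         */
        if (info->journal_bdev)
                bdev_yield(info->journal_bdev);  /* hypothetical helper */
        bdev_yield(sb->s_bdev);
}

static const struct super_operations example_sops = {
        /* ... */
        .yield_devices  = example_yield_devices,  /* new hook, patch 6/6 */
};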
> > > 
> > > I'm not sure why it is better to create a new ->yield callback called
> > > under sb->s_umount rather than just moving blkdev_put() calls back into
> > > ->put_super? Or at least the yielding could be done in ->put_super instead of
> > 
> > The main reason was to not call potentially expensive
> > blkdev_put()/bdev_release() under s_umount. If we don't care about this
> > though then this shouldn't be a problem.
> 
> So I would not really be concerned about performance here. The superblock
> is dying, nobody can do anything about that until the superblock is fully
> dead and cleaned up. Maybe some places could skip such superblocks more
> quickly but so far I'm not convinced it matters in practice (e.g. writeback
> holds s_umount over the whole sync(1) time and nobody complains). And as
> you mention below, we have been doing this for a long time without anybody
> really complaining.

Ok, that's a good point.

> 
> > And yes, then we need to move
> > blkdev_put()/bdev_release() under s_umount including the main block
> > device. IOW, we need to ensure that all bdev calls are done under
> > s_umount before we remove the superblock from the instance list.
> 
> This is about those seemingly spurious "device busy" errors when the
> superblock hasn't closed its devices yet, isn't it?  But we now remove

Yes, because we tie superblock and block device neatly together.

> the superblock from the s_instances list in kill_super_notify() and until that
> moment SB_DYING is protecting us from racing mounts. So is there some other
> problem?

No, there isn't a problem at all. It's all working fine, though it was
initially a little annoying as we had to update filesystems to ensure
that sb->s_fs_info is kept alive. But it's all fixed now.
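
To recap how the current shutdown ordering works (heavily simplified
from fs/super.c and from memory, so don't quote me on the details):

void deactivate_locked_super(struct super_block *sb)
{
        if (atomic_dec_and_test(&sb->s_active)) {
                /*
                 * ->kill_sb() ends up in generic_shutdown_super(),
                 * which marks the sb SB_DYING under s_umount before
                 * dropping the lock; from then on a racing mount in
                 * sget_fc() waits instead of failing. The fs closes
                 * its bdevs here too.
                 */
                sb->s_type->kill_sb(sb);

                /*
                 * Only now unhash the sb from s_instances and wake any
                 * waiters via SB_DEAD; until this point SB_DYING
                 * protects against racing mounts.
                 */
                kill_super_notify(sb);
                put_super(sb);
        } else {
                up_write(&sb->s_umount);
        }
}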

The possible advantage is that if we drop all block devices under
s_umount then we can remove the superblock from fs_type->s_instances in
the old location again. I'm not convinced it's worth it but it's a
possible simplification. I'm not even arguing it's objectively better;
I think it's a matter of taste in the end.

> 
> > I think
> > that should be fine but I wanted to propose an alternative to that as
> > well: cheap mark-for-release under s_umount and heavy-duty without
> > s_umount. But I guess it doesn't matter because most filesystems used
> > to close devices under s_umount before anyway. Let me know what you
> > think makes the most sense.
> 
> I think we should make it as simple as possible for filesystems. As I said

Yes, I fully agree.

The really good thing about the current mechanism is that it's
completely vfs-only. With the s_umount/open_mutex ordering fixed,
filesystems can now close block devices wherever they like and the vfs
will ensure that you don't get spurious EBUSY errors. And the
filesystem doesn't have to know a thing about it or take any special
care.
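
Roughly, the mounter side of that guarantee (the sget_fc() loop,
heavily simplified and from memory):

retry:
        spin_lock(&sb_lock);
        hlist_for_each_entry(old, &fc->fs_type->fs_supers, s_instances) {
                if (!test(old, fc))
                        continue;
                /*
                 * grab_super() sees SB_DYING and waits until the dying
                 * sb is marked SB_DEAD, i.e. ->kill_sb() has finished
                 * and the devices were closed. On retry the sb is gone
                 * from fs_supers, so there is no window in which we
                 * could hit the bdev while the old owner still holds
                 * it, hence no spurious EBUSY.
                 */
                if (!grab_super(old))
                        goto retry;
                return old;
        }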

> above I don't think s_umount hold time really matters, so the only thing
> that limits us is locking constraints. During superblock shutdown the only
> thing that currently cannot be done under s_umount (that I'm aware of) is the
> teardown of the sb->s_bdi because that waits for writeback threads and
> those can be blocked waiting for s_umount.

Which we don't need to change if it's working fine.
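
(For the record, that constraint is visible at the tail of
generic_shutdown_super(), simplified from memory: s_umount is dropped
before the bdi teardown precisely because bdi_unregister() waits for
writeback workers that can themselves block on s_umount.)

        super_wake(sb, SB_DYING);        /* racing mounts now wait */
        super_unlock_excl(sb);           /* drop s_umount first ... */
        if (sb->s_bdi != &noop_backing_dev_info) {
                /* ... because this flushes writeback workers */
                if (sb->s_iflags & SB_I_PERSB_BDI)
                        bdi_unregister(sb->s_bdi);
                bdi_put(sb->s_bdi);
                sb->s_bdi = &noop_backing_dev_info;
        }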

> 
> So if we wanted we could just move ext4 & xfs bdev closing back into
> ->put_super to avoid ext4_kill_sb() and somewhat slim down xfs_mount_free()
> but otherwise I don't see much space for simplification or need for adding
> more callbacks :)

I mean, I would only consider that useful if we really wanted to close
all block devices under s_umount so we could possibly remove the waiting
mechanism we have right now. Otherwise I think it's churn for the sake
of churn, and I quite like what we have right now.


Thread overview: 24+ messages
2023-10-24 14:53 [PATCH RFC 0/6] fs,block: yield devices Christian Brauner
2023-10-24 14:53 ` [PATCH RFC 1/6] fs: simplify setup_bdev_super() calls Christian Brauner
2023-10-25 15:29   ` Jan Kara
2023-10-27  6:42   ` Christoph Hellwig
2023-10-24 14:53 ` [PATCH RFC 2/6] xfs: simplify device handling Christian Brauner
2023-10-25 15:30   ` Jan Kara
2023-10-27  6:42   ` Christoph Hellwig
2023-10-24 14:53 ` [PATCH RFC 3/6] ext4: " Christian Brauner
2023-10-25 15:30   ` Jan Kara
2023-10-27  6:42   ` Christoph Hellwig
2023-10-24 14:53 ` [PATCH RFC 4/6] bdev: simplify waiting for concurrent claimers Christian Brauner
2023-10-25 15:54   ` Jan Kara
2023-10-27  7:21     ` Christoph Hellwig
2023-10-24 14:53 ` [PATCH RFC 5/6] block: mark device as about to be released Christian Brauner
2023-10-24 14:53 ` [PATCH RFC 6/6] fs: add ->yield_devices() Christian Brauner
2023-10-25 17:20 ` [PATCH RFC 0/6] fs,block: yield devices Jan Kara
2023-10-25 20:46   ` Christian Brauner
2023-10-26 10:35     ` Jan Kara
2023-10-26 12:07       ` Christian Brauner [this message]
2023-10-26 13:04         ` Jan Kara
2023-10-26 15:08           ` Christian Brauner
2023-10-26 15:58             ` Jan Kara
2023-10-27  7:24             ` Christoph Hellwig
2023-10-26 11:50 ` (subset) " Christian Brauner
