linux-bcachefs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: [PATCH v4 1/6] fs: enhance and rename shutdown() callback to remove_bdev()
@ 2025-07-09 17:23 Jan Kara
  2025-07-09 17:49 ` Kent Overstreet
  0 siblings, 1 reply; 5+ messages in thread
From: Jan Kara @ 2025-07-09 17:23 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Christian Brauner, Darrick J. Wong, Qu Wenruo, Qu Wenruo,
	linux-btrfs, linux-fsdevel, viro, jack, linux-ext4,
	linux-f2fs-devel, ntfs3, linux-xfs, Kent Overstreet,
	linux-bcachefs

Bcc:
Subject: Re: [PATCH v4 1/6] fs: enhance and rename shutdown() callback to
 remove_bdev()
Reply-To:
In-Reply-To: <aG2i3qP01m-vmFVE@dread.disaster.area>

On Wed 09-07-25 08:59:42, Dave Chinner wrote:
> On Tue, Jul 08, 2025 at 09:55:14AM +0200, Christian Brauner wrote:
> > On Mon, Jul 07, 2025 at 05:45:32PM -0700, Darrick J. Wong wrote:
> > > On Tue, Jul 08, 2025 at 08:52:47AM +0930, Qu Wenruo wrote:
> > > > 在 2025/7/8 08:32, Dave Chinner 写道:
> > > > > On Fri, Jul 04, 2025 at 10:12:29AM +0930, Qu Wenruo wrote:
> > > > > > Currently all the filesystems implementing the
> > > > > > super_opearations::shutdown() callback can not afford losing a device.
> > > > > > 
> > > > > > Thus fs_bdev_mark_dead() will just call the shutdown() callback for the
> > > > > > involved filesystem.
> > > > > > 
> > > > > > But it will no longer be the case, with multi-device filesystems like
> > > > > > btrfs and bcachefs the filesystem can handle certain device loss without
> > > > > > shutting down the whole filesystem.
> > > > > > 
> > > > > > To allow those multi-device filesystems to be integrated to use
> > > > > > fs_holder_ops:
> > > > > > 
> > > > > > - Replace super_opearation::shutdown() with
> > > > > >    super_opearations::remove_bdev()
> > > > > >    To better describe when the callback is called.
> > > > > 
> > > > > This conflates cause with action.
> > > > > 
> > > > > The shutdown callout is an action that the filesystem must execute,
> > > > > whilst "remove bdev" is a cause notification that might require an
> > > > > action to be take.
> > > > > 
> > > > > Yes, the cause could be someone doing hot-unplug of the block
> > > > > device, but it could also be something going wrong in software
> > > > > layers below the filesystem. e.g. dm-thinp having an unrecoverable
> > > > > corruption or ENOSPC errors.
> > > > > 
> > > > > We already have a "cause" notification: blk_holder_ops->mark_dead().
> > > > > 
> > > > > The generic fs action that is taken by this notification is
> > > > > fs_bdev_mark_dead().  That action is to invalidate caches and shut
> > > > > down the filesystem.
> > > > > 
> > > > > btrfs needs to do something different to a blk_holder_ops->mark_dead
> > > > > notification. i.e. it needs an action that is different to
> > > > > fs_bdev_mark_dead().
> > > > > 
> > > > > Indeed, this is how bcachefs already handles "single device
> > > > > died" events for multi-device filesystems - see
> > > > > bch2_fs_bdev_mark_dead().
> > > > 
> > > > I do not think it's the correct way to go, especially when there is already
> > > > fs_holder_ops.
> > > > 
> > > > We're always going towards a more generic solution, other than letting the
> > > > individual fs to do the same thing slightly differently.
> > > 
> > > On second thought -- it's weird that you'd flush the filesystem and
> > > shrink the inode/dentry caches in a "your device went away" handler.
> > > Fancy filesystems like bcachefs and btrfs would likely just shift IO to
> > > a different bdev, right?  And there's no good reason to run shrinkers on
> > > either of those fses, right?
> > > 
> > > > Yes, the naming is not perfect and mixing cause and action, but the end
> > > > result is still a more generic and less duplicated code base.
> > > 
> > > I think dchinner makes a good point that if your filesystem can do
> > > something clever on device removal, it should provide its own block
> > > device holder ops instead of using fs_holder_ops.  I don't understand
> > > why you need a "generic" solution for btrfs when it's not going to do
> > > what the others do anyway.
> > 
> > I think letting filesystems implement their own holder ops should be
> > avoided if we can. Christoph may chime in here. I have no appettite for
> > exporting stuff like get_bdev_super() unless absolutely necessary. We
> > tried to move all that handling into the VFS to eliminate a slew of
> > deadlocks we detected and fixed. I have no appetite to repeat that
> > cycle.
> 
> Except it isn't actually necessary.
> 
> Everyone here seems to be assuming that the filesystem *must* take
> an active superblock reference to process a device removal event,
> and that is *simply not true*.
> 
> bcachefs does not use get_bdev_super() or an active superblock
> reference to process ->mark_dead events.
>
> It has it's own internal reference counting on the struct bch_fs
> attached to the bdev that ensures the filesystem structures can't go
> away whilst ->mark_dead is being processed.  i.e. bcachefs is only
> dependent on the bdev->bd_holder_lock() being held when
> ->mark_dead() is called and does not rely on the VFS for anything.

Right, they have their own refcount which effectively blocks umount
in .put_super AFAICS and they use it instead of VFS refcounts for this.

> This means that device removal processing can be performed
> without global filesystem/VFS locks needing to be held. Hence issues
> like re-entrancy deadlocks when there are concurrent/cascading
> device failures (e.g. a HBA dies, taking out multiple devices
> simultaneously) are completely avoided...

Funnily enough how about:

bch2_fs_bdev_mark_dead()		umount()
  bdev_get_fs()
    bch2_ro_ref_tryget() -> grabs bch_fs->ro_ref
    mutex_unlock(&bdev->bd_holder_lock);
					deactivate_super()
					  down_write(&sb->s_umount);
					  deactivate_locked_super()
					    bch2_kill_sb()
					      generic_shutdown_super()
					        bch2_put_super()
						  __bch2_fs_stop()
						    bch2_ro_ref_put()
						    wait_event(c->ro_ref_wait, !refcount_read(&c->ro_ref));
  sb = c->vfs_sb;
  down_read(&sb->s_umount); -> deadlock

Which is a case in point why I would like to have a shared infrastructure
for bdev -> sb transition that's used as widely as possible. Because it
isn't easy to get the lock ordering right given all the constraints in the
VFS and block layer code paths for this transition that's going contrary to
the usual ordering sb -> bdev. And yes I do realize bcachefs grabs s_umount
not because it itself needs it but because it calls some VFS helpers
(sync_filesystem()) which expect it to be held so the pain is inflicted
by VFS here but that just demostrates the fact that VFS and FS locking are
deeply intertwined and you can hardly avoid dealing with VFS locking rules
in the filesystem itself.

> It also avoids the problem of ->mark_dead events being generated
> from a context that holds filesystem/vfs locks and then deadlocking
> waiting for those locks to be released.
> 
> IOWs, a multi-device filesystem should really be implementing
> ->mark_dead itself, and should not be depending on being able to
> lock the superblock to take an active reference to it.
> 
> It should be pretty clear that these are not issues that the generic
> filesystem ->mark_dead implementation should be trying to
> handle.....

Well, IMO every fs implementation needs to do the bdev -> sb transition and
make sb somehow stable. It may be that grabbing s_umount and active sb
reference is not what everybody wants but AFAIU btrfs as the second
multi-device filesystem would be fine with that and for bcachefs this
doesn't work only because they have special superblock instantiation
behavior on mount for independent reasons (i.e., not because active ref
+ s_umount would be problematic for them) if I understand Kent right.
So I'm still not fully convinced each multi-device filesystem should be
shipping their special method to get from device to stable sb reference.

> > The shutdown method is implemented only by block-based filesystems and
> > arguably shutdown was always a misnomer because it assumed that the
> > filesystem needs to actually shut down when it is called.
> 
> Shutdown was not -assumed- as the operation that needed to be
> performed. That was the feature that was *required* to fix
> filesystem level problems that occur when the device underneath it
> disappears.
> 
> ->mark_dead() is the abstract filesystem notification from the block
> device, fs_bdfev_mark_dead() is the -generic implementation- of the
> functionality required by single block device filesystems. Part of
> that functionality is shutting down the filesystem because it can
> *no longer function without a backing device*.
> 
> multi-block device filesystems require compeltely different
> implementations, and we already have one that -does not use active
> superblock references-. IOWs, even if we add ->remove_bdev(sb)
> callout, bcachefs will continue to use ->mark_dead() because low
> level filesystem device management isn't (and shouldn't be!)
> dependent on high level VFS structure reference counting....

I have to admit I don't get why device management shouldn't be dependent on
VFS refcounts / locking. IMO it is often dependent although I agree with
multiple devices you likely have to do *additional* locking. And yes, I can
imagine VFS locking could get in your way but the only tangible example we
have is bcachefs and btrfs seems to be a counter example showing even multi
device filesystem can live with VFS locking. So I don't think the case is
as clear as you try to frame it.

So conceptually I agree making filesystems as bdev holders implement their
own holder ops makes a lot of sense but because of lock ordering rules it
is not quite practical or easily maintainable choice I'm afraid.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v4 1/6] fs: enhance and rename shutdown() callback to remove_bdev()
  2025-07-09 17:23 [PATCH v4 1/6] fs: enhance and rename shutdown() callback to remove_bdev() Jan Kara
@ 2025-07-09 17:49 ` Kent Overstreet
  2025-07-10 13:10   ` Jan Kara
  0 siblings, 1 reply; 5+ messages in thread
From: Kent Overstreet @ 2025-07-09 17:49 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dave Chinner, Christian Brauner, Darrick J. Wong, Qu Wenruo,
	Qu Wenruo, linux-btrfs, linux-fsdevel, viro, linux-ext4,
	linux-f2fs-devel, ntfs3, linux-xfs, linux-bcachefs

On Wed, Jul 09, 2025 at 07:23:07PM +0200, Jan Kara wrote:
> On Wed 09-07-25 08:59:42, Dave Chinner wrote:
> > This means that device removal processing can be performed
> > without global filesystem/VFS locks needing to be held. Hence issues
> > like re-entrancy deadlocks when there are concurrent/cascading
> > device failures (e.g. a HBA dies, taking out multiple devices
> > simultaneously) are completely avoided...
> 
> Funnily enough how about:
> 
> bch2_fs_bdev_mark_dead()		umount()
>   bdev_get_fs()
>     bch2_ro_ref_tryget() -> grabs bch_fs->ro_ref
>     mutex_unlock(&bdev->bd_holder_lock);
> 					deactivate_super()
> 					  down_write(&sb->s_umount);
> 					  deactivate_locked_super()
> 					    bch2_kill_sb()
> 					      generic_shutdown_super()
> 					        bch2_put_super()
> 						  __bch2_fs_stop()
> 						    bch2_ro_ref_put()
> 						    wait_event(c->ro_ref_wait, !refcount_read(&c->ro_ref));
>   sb = c->vfs_sb;
>   down_read(&sb->s_umount); -> deadlock
> 
> Which is a case in point why I would like to have a shared infrastructure
> for bdev -> sb transition that's used as widely as possible. Because it
> isn't easy to get the lock ordering right given all the constraints in the
> VFS and block layer code paths for this transition that's going contrary to
> the usual ordering sb -> bdev. And yes I do realize bcachefs grabs s_umount
> not because it itself needs it but because it calls some VFS helpers
> (sync_filesystem()) which expect it to be held so the pain is inflicted
> by VFS here but that just demostrates the fact that VFS and FS locking are
> deeply intertwined and you can hardly avoid dealing with VFS locking rules
> in the filesystem itself.

Getting rid of the s_umount use looks like the much saner and easier
fix - like the comment notes, it's only taken to avoid the warning in
sync_filesystem, we don't actually need it.

Locking gets easier when locks are private to individual subsystems,
protecting specific data structures that are private to those
subsystems.

> > It also avoids the problem of ->mark_dead events being generated
> > from a context that holds filesystem/vfs locks and then deadlocking
> > waiting for those locks to be released.
> > 
> > IOWs, a multi-device filesystem should really be implementing
> > ->mark_dead itself, and should not be depending on being able to
> > lock the superblock to take an active reference to it.
> > 
> > It should be pretty clear that these are not issues that the generic
> > filesystem ->mark_dead implementation should be trying to
> > handle.....
> 
> Well, IMO every fs implementation needs to do the bdev -> sb transition and
> make sb somehow stable. It may be that grabbing s_umount and active sb
> reference is not what everybody wants but AFAIU btrfs as the second
> multi-device filesystem would be fine with that and for bcachefs this
> doesn't work only because they have special superblock instantiation
> behavior on mount for independent reasons (i.e., not because active ref
> + s_umount would be problematic for them) if I understand Kent right.
> So I'm still not fully convinced each multi-device filesystem should be
> shipping their special method to get from device to stable sb reference.

Honestly, the sync_filesystem() call seems bogus.

If the block device is truly dead, what's it going to accomplish?

It's not like we get callbacks for "this device is going to be going
away soon", we only get that in reaction to something that's already
happened.

> > > The shutdown method is implemented only by block-based filesystems and
> > > arguably shutdown was always a misnomer because it assumed that the
> > > filesystem needs to actually shut down when it is called.
> > 
> > Shutdown was not -assumed- as the operation that needed to be
> > performed. That was the feature that was *required* to fix
> > filesystem level problems that occur when the device underneath it
> > disappears.
> > 
> > ->mark_dead() is the abstract filesystem notification from the block
> > device, fs_bdfev_mark_dead() is the -generic implementation- of the
> > functionality required by single block device filesystems. Part of
> > that functionality is shutting down the filesystem because it can
> > *no longer function without a backing device*.
> > 
> > multi-block device filesystems require compeltely different
> > implementations, and we already have one that -does not use active
> > superblock references-. IOWs, even if we add ->remove_bdev(sb)
> > callout, bcachefs will continue to use ->mark_dead() because low
> > level filesystem device management isn't (and shouldn't be!)
> > dependent on high level VFS structure reference counting....
> 
> I have to admit I don't get why device management shouldn't be dependent on
> VFS refcounts / locking. IMO it is often dependent although I agree with
> multiple devices you likely have to do *additional* locking. And yes, I can
> imagine VFS locking could get in your way but the only tangible example we
> have is bcachefs and btrfs seems to be a counter example showing even multi
> device filesystem can live with VFS locking. So I don't think the case is
> as clear as you try to frame it.

Individual devices coming and going has nothing to do with the VFS. If a
single device goes away and we're continuing in RW mode, _no_ VFS state
is affected whatsoever.

The only thing that's needed is a ref to prevent the filesystem from
going away, not a lock. But again given that a bch_fs doesn't
necessarily even have a VFS superblock it's not something we'd use
directly in .mark_dead, that synchronization is handled directly via
kill_sb -> generic_shutdown_super -> and all that...

We don't want bch_fs to outlive the VFS superblock if we do have a VFS
sb, because asynchronous shutdown and releasing of resources causes very
real problems (which already exist for other reasons...)

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v4 1/6] fs: enhance and rename shutdown() callback to remove_bdev()
  2025-07-09 17:49 ` Kent Overstreet
@ 2025-07-10 13:10   ` Jan Kara
  2025-07-10 18:41     ` Kent Overstreet
  0 siblings, 1 reply; 5+ messages in thread
From: Jan Kara @ 2025-07-10 13:10 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Jan Kara, Dave Chinner, Christian Brauner, Darrick J. Wong,
	Qu Wenruo, Qu Wenruo, linux-btrfs, linux-fsdevel, viro,
	linux-ext4, linux-f2fs-devel, ntfs3, linux-xfs, linux-bcachefs

On Wed 09-07-25 13:49:12, Kent Overstreet wrote:
> On Wed, Jul 09, 2025 at 07:23:07PM +0200, Jan Kara wrote:
> > > It also avoids the problem of ->mark_dead events being generated
> > > from a context that holds filesystem/vfs locks and then deadlocking
> > > waiting for those locks to be released.
> > > 
> > > IOWs, a multi-device filesystem should really be implementing
> > > ->mark_dead itself, and should not be depending on being able to
> > > lock the superblock to take an active reference to it.
> > > 
> > > It should be pretty clear that these are not issues that the generic
> > > filesystem ->mark_dead implementation should be trying to
> > > handle.....
> > 
> > Well, IMO every fs implementation needs to do the bdev -> sb transition and
> > make sb somehow stable. It may be that grabbing s_umount and active sb
> > reference is not what everybody wants but AFAIU btrfs as the second
> > multi-device filesystem would be fine with that and for bcachefs this
> > doesn't work only because they have special superblock instantiation
> > behavior on mount for independent reasons (i.e., not because active ref
> > + s_umount would be problematic for them) if I understand Kent right.
> > So I'm still not fully convinced each multi-device filesystem should be
> > shipping their special method to get from device to stable sb reference.
> 
> Honestly, the sync_filesystem() call seems bogus.
> 
> If the block device is truly dead, what's it going to accomplish?

Notice that fs_bdev_mark_dead() calls sync_filesystem() only in case
'surprise' argument is false - meaning this is actually a notification
*before* the device is going away. I.e., graceful device hot unplug when
you can access the device to clean up as much as possible.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v4 1/6] fs: enhance and rename shutdown() callback to remove_bdev()
  2025-07-10 13:10   ` Jan Kara
@ 2025-07-10 18:41     ` Kent Overstreet
  2025-07-11 14:20       ` Jan Kara
  0 siblings, 1 reply; 5+ messages in thread
From: Kent Overstreet @ 2025-07-10 18:41 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dave Chinner, Christian Brauner, Darrick J. Wong, Qu Wenruo,
	Qu Wenruo, linux-btrfs, linux-fsdevel, viro, linux-ext4,
	linux-f2fs-devel, ntfs3, linux-xfs, linux-bcachefs

On Thu, Jul 10, 2025 at 03:10:04PM +0200, Jan Kara wrote:
> On Wed 09-07-25 13:49:12, Kent Overstreet wrote:
> > On Wed, Jul 09, 2025 at 07:23:07PM +0200, Jan Kara wrote:
> > > > It also avoids the problem of ->mark_dead events being generated
> > > > from a context that holds filesystem/vfs locks and then deadlocking
> > > > waiting for those locks to be released.
> > > > 
> > > > IOWs, a multi-device filesystem should really be implementing
> > > > ->mark_dead itself, and should not be depending on being able to
> > > > lock the superblock to take an active reference to it.
> > > > 
> > > > It should be pretty clear that these are not issues that the generic
> > > > filesystem ->mark_dead implementation should be trying to
> > > > handle.....
> > > 
> > > Well, IMO every fs implementation needs to do the bdev -> sb transition and
> > > make sb somehow stable. It may be that grabbing s_umount and active sb
> > > reference is not what everybody wants but AFAIU btrfs as the second
> > > multi-device filesystem would be fine with that and for bcachefs this
> > > doesn't work only because they have special superblock instantiation
> > > behavior on mount for independent reasons (i.e., not because active ref
> > > + s_umount would be problematic for them) if I understand Kent right.
> > > So I'm still not fully convinced each multi-device filesystem should be
> > > shipping their special method to get from device to stable sb reference.
> > 
> > Honestly, the sync_filesystem() call seems bogus.
> > 
> > If the block device is truly dead, what's it going to accomplish?
> 
> Notice that fs_bdev_mark_dead() calls sync_filesystem() only in case
> 'surprise' argument is false - meaning this is actually a notification
> *before* the device is going away. I.e., graceful device hot unplug when
> you can access the device to clean up as much as possible.

That doesn't seem to be hooked up to anything?

blk_mark_disk_dead() -> blk_report_disk_dead(), surprise is always true

disk_force_media_change(), same

The only call where it's falso is in s390 code. If we know that a disk
is going away, that would be a userspace thing, and they can just
unmount.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v4 1/6] fs: enhance and rename shutdown() callback to remove_bdev()
  2025-07-10 18:41     ` Kent Overstreet
@ 2025-07-11 14:20       ` Jan Kara
  0 siblings, 0 replies; 5+ messages in thread
From: Jan Kara @ 2025-07-11 14:20 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Jan Kara, Dave Chinner, Christian Brauner, Darrick J. Wong,
	Qu Wenruo, Qu Wenruo, linux-btrfs, linux-fsdevel, viro,
	linux-ext4, linux-f2fs-devel, ntfs3, linux-xfs, linux-bcachefs

On Thu 10-07-25 14:41:18, Kent Overstreet wrote:
> On Thu, Jul 10, 2025 at 03:10:04PM +0200, Jan Kara wrote:
> > On Wed 09-07-25 13:49:12, Kent Overstreet wrote:
> > > On Wed, Jul 09, 2025 at 07:23:07PM +0200, Jan Kara wrote:
> > > > > It also avoids the problem of ->mark_dead events being generated
> > > > > from a context that holds filesystem/vfs locks and then deadlocking
> > > > > waiting for those locks to be released.
> > > > > 
> > > > > IOWs, a multi-device filesystem should really be implementing
> > > > > ->mark_dead itself, and should not be depending on being able to
> > > > > lock the superblock to take an active reference to it.
> > > > > 
> > > > > It should be pretty clear that these are not issues that the generic
> > > > > filesystem ->mark_dead implementation should be trying to
> > > > > handle.....
> > > > 
> > > > Well, IMO every fs implementation needs to do the bdev -> sb transition and
> > > > make sb somehow stable. It may be that grabbing s_umount and active sb
> > > > reference is not what everybody wants but AFAIU btrfs as the second
> > > > multi-device filesystem would be fine with that and for bcachefs this
> > > > doesn't work only because they have special superblock instantiation
> > > > behavior on mount for independent reasons (i.e., not because active ref
> > > > + s_umount would be problematic for them) if I understand Kent right.
> > > > So I'm still not fully convinced each multi-device filesystem should be
> > > > shipping their special method to get from device to stable sb reference.
> > > 
> > > Honestly, the sync_filesystem() call seems bogus.
> > > 
> > > If the block device is truly dead, what's it going to accomplish?
> > 
> > Notice that fs_bdev_mark_dead() calls sync_filesystem() only in case
> > 'surprise' argument is false - meaning this is actually a notification
> > *before* the device is going away. I.e., graceful device hot unplug when
> > you can access the device to clean up as much as possible.
> 
> That doesn't seem to be hooked up to anything?

__del_gendisk()
  if (!test_bit(GD_DEAD, &disk->state))
    blk_report_disk_dead(disk, false);

Is the path which results in "surprise" to be false. I have to admit I
didn't check deeper into drivers whether this is hooked up properly but
del_gendisk() is a standard call to tear down a disk so it would seem so
from the first glance.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2025-07-11 14:20 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-09 17:23 [PATCH v4 1/6] fs: enhance and rename shutdown() callback to remove_bdev() Jan Kara
2025-07-09 17:49 ` Kent Overstreet
2025-07-10 13:10   ` Jan Kara
2025-07-10 18:41     ` Kent Overstreet
2025-07-11 14:20       ` Jan Kara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).