* [PATCH] fs: Avoid grabbing sb->s_umount under bdev->bd_holder_lock
From: Jan Kara @ 2023-10-18 15:29 UTC
To: Christoph Hellwig; +Cc: Christian Brauner, linux-fsdevel, Jan Kara
The implementations of bdev holder operations such as fs_bdev_mark_dead()
and fs_bdev_sync() grab the sb->s_umount semaphore under
bdev->bd_holder_lock. This is problematic because it leads to a
disk->open_mutex -> sb->s_umount lock ordering, which is counterintuitive
(usually we grab higher-level (e.g. filesystem) locks first and
lower-level (e.g. block layer) locks later) and indeed makes lockdep
complain about possible locking cycles whenever we open a block device
while holding the sb->s_umount semaphore. Implement a function
bdev_super_lock_shared() which safely transitions from holding
bdev->bd_holder_lock to holding sb->s_umount on an alive superblock
without introducing the problematic lock dependency. We use this function
in fs_bdev_sync() and fs_bdev_mark_dead().
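For illustration, the cycle lockdep can construct looks roughly like this
(a sketch of the interleaving, not a literal lockdep report):

    CPU0 (e.g. mount path)            CPU1 (holder op under open_mutex)
    down_write(&sb->s_umount);
                                      mutex_lock(&disk->open_mutex);
                                      mutex_lock(&bdev->bd_holder_lock);
                                      down_read(&sb->s_umount);   /* blocks */
    /* open a block device */
    mutex_lock(&disk->open_mutex);    /* blocks -> potential ABBA deadlock */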
Signed-off-by: Jan Kara <jack@suse.cz>
---
block/bdev.c | 5 +++--
block/ioctl.c | 5 +++--
fs/super.c | 48 ++++++++++++++++++++++++++++++------------------
3 files changed, 36 insertions(+), 22 deletions(-)
diff --git a/block/bdev.c b/block/bdev.c
index f3b13aa1b7d4..a9a485aae6b0 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -961,9 +961,10 @@ void bdev_mark_dead(struct block_device *bdev, bool surprise)
mutex_lock(&bdev->bd_holder_lock);
if (bdev->bd_holder_ops && bdev->bd_holder_ops->mark_dead)
bdev->bd_holder_ops->mark_dead(bdev, surprise);
- else
+ else {
+ mutex_unlock(&bdev->bd_holder_lock);
sync_blockdev(bdev);
- mutex_unlock(&bdev->bd_holder_lock);
+ }
invalidate_bdev(bdev);
}
diff --git a/block/ioctl.c b/block/ioctl.c
index d5f5cd61efd7..fc492f9d34ae 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -370,9 +370,10 @@ static int blkdev_flushbuf(struct block_device *bdev, unsigned cmd,
mutex_lock(&bdev->bd_holder_lock);
if (bdev->bd_holder_ops && bdev->bd_holder_ops->sync)
bdev->bd_holder_ops->sync(bdev);
- else
+ else {
+ mutex_unlock(&bdev->bd_holder_lock);
sync_blockdev(bdev);
- mutex_unlock(&bdev->bd_holder_lock);
+ }
invalidate_bdev(bdev);
return 0;
diff --git a/fs/super.c b/fs/super.c
index 2d762ce67f6e..8b80d03e7cb4 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1419,32 +1419,45 @@ EXPORT_SYMBOL(sget_dev);
#ifdef CONFIG_BLOCK
/*
- * Lock a super block that the callers holds a reference to.
+ * Lock the superblock that is the holder of the bdev. Returns the superblock
+ * pointer if we successfully locked the superblock and it is alive. Otherwise
+ * we return NULL and just unlock bdev->bd_holder_lock.
*
- * The caller needs to ensure that the super_block isn't being freed while
- * calling this function, e.g. by holding a lock over the call to this function
- * and the place that clears the pointer to the superblock used by this function
- * before freeing the superblock.
+ * The function must be called with bdev->bd_holder_lock held and releases it.
*/
-static bool super_lock_shared_active(struct super_block *sb)
+static struct super_block *bdev_super_lock_shared(struct block_device *bdev)
+ __releases(&bdev->bd_holder_lock)
{
- bool born = super_lock_shared(sb);
+ struct super_block *sb = bdev->bd_holder;
+ bool born;
+ lockdep_assert_held(&bdev->bd_holder_lock);
+ /* Make sure sb doesn't go away from under us */
+ spin_lock(&sb_lock);
+ sb->s_count++;
+ spin_unlock(&sb_lock);
+ mutex_unlock(&bdev->bd_holder_lock);
+
+ born = super_lock_shared(sb);
if (!born || !sb->s_root || !(sb->s_flags & SB_ACTIVE)) {
super_unlock_shared(sb);
- return false;
+ put_super(sb);
+ return NULL;
}
- return true;
+ /*
+ * The superblock is active and we hold s_umount, we can drop our
+ * temporary reference now.
+ */
+ put_super(sb);
+ return sb;
}
static void fs_bdev_mark_dead(struct block_device *bdev, bool surprise)
{
- struct super_block *sb = bdev->bd_holder;
-
- /* bd_holder_lock ensures that the sb isn't freed */
- lockdep_assert_held(&bdev->bd_holder_lock);
+ struct super_block *sb;
- if (!super_lock_shared_active(sb))
+ sb = bdev_super_lock_shared(bdev);
+ if (!sb)
return;
if (!surprise)
@@ -1459,11 +1472,10 @@ static void fs_bdev_mark_dead(struct block_device *bdev, bool surprise)
static void fs_bdev_sync(struct block_device *bdev)
{
- struct super_block *sb = bdev->bd_holder;
-
- lockdep_assert_held(&bdev->bd_holder_lock);
+ struct super_block *sb;
- if (!super_lock_shared_active(sb))
+ sb = bdev_super_lock_shared(bdev);
+ if (!sb)
return;
sync_filesystem(sb);
super_unlock_shared(sb);
--
2.35.3
* Re: [PATCH] fs: Avoid grabbing sb->s_umount under bdev->bd_holder_lock
From: Christoph Hellwig @ 2023-10-18 15:46 UTC
To: Jan Kara; +Cc: Christoph Hellwig, Christian Brauner, linux-fsdevel
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCH] fs: Avoid grabbing sb->s_umount under bdev->bd_holder_lock
From: Christian Brauner @ 2023-10-19 8:16 UTC
To: Jan Kara; +Cc: Christian Brauner, linux-fsdevel, Christoph Hellwig
On Wed, 18 Oct 2023 17:29:24 +0200, Jan Kara wrote:
> The implementations of bdev holder operations such as fs_bdev_mark_dead()
> and fs_bdev_sync() grab the sb->s_umount semaphore under
> bdev->bd_holder_lock. This is problematic because it leads to a
> disk->open_mutex -> sb->s_umount lock ordering, which is counterintuitive
> (usually we grab higher-level (e.g. filesystem) locks first and
> lower-level (e.g. block layer) locks later) and indeed makes lockdep
> complain about possible locking cycles whenever we open a block device
> while holding the sb->s_umount semaphore. Implement a function
> bdev_super_lock_shared() which safely transitions from holding
> bdev->bd_holder_lock to holding sb->s_umount on an alive superblock
> without introducing the problematic lock dependency. We use this function
> in fs_bdev_sync() and fs_bdev_mark_dead().
>
> [...]
Thanks!
---
Applied to the vfs.super branch of the vfs/vfs.git tree.
Patches in the vfs.super branch should appear in linux-next soon.
Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.
It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.
Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.
tree: https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.super
[1/1] fs: Avoid grabbing sb->s_umount under bdev->bd_holder_lock
https://git.kernel.org/vfs/vfs/c/4f4f1b3da625
* Re: [PATCH] fs: Avoid grabbing sb->s_umount under bdev->bd_holder_lock
From: Christian Brauner @ 2023-10-19 8:33 UTC
To: Jan Kara; +Cc: Christoph Hellwig, linux-fsdevel
On Wed, Oct 18, 2023 at 05:29:24PM +0200, Jan Kara wrote:
> The implementations of bdev holder operations such as fs_bdev_mark_dead()
> and fs_bdev_sync() grab the sb->s_umount semaphore under
> bdev->bd_holder_lock. This is problematic because it leads to a
> disk->open_mutex -> sb->s_umount lock ordering, which is counterintuitive
> (usually we grab higher-level (e.g. filesystem) locks first and
> lower-level (e.g. block layer) locks later) and indeed makes lockdep
> complain about possible locking cycles whenever we open a block device
> while holding the sb->s_umount semaphore. Implement a function
This patch, together with my series that Christoph sent out for me
Link: https://lore.kernel.org/r/20231017184823.1383356-1-hch@lst.de
two days ago (tyvm!), makes the lockdep false positives go away entirely
and also eliminates the counterintuitive ordering requirement that forced
us to give up s_umount before opening block devices.
I've verified that yesterday and did a bunch of testing via sudden
device removal.
Christoph had thankfully added generic/730 and generic/731 to emulate
some device removal. I also messed around with the loop code and
specifically used LOOP_CHANGE_FD to force a disk_force_media_change() on
a filesystem.
* Re: [PATCH] fs: Avoid grabbing sb->s_umount under bdev->bd_holder_lock
2023-10-19 8:33 ` Christian Brauner
@ 2023-10-19 10:57 ` Jan Kara
2023-10-20 11:18 ` Christian Brauner
2023-10-19 13:40 ` Christoph Hellwig
1 sibling, 1 reply; 16+ messages in thread
From: Jan Kara @ 2023-10-19 10:57 UTC (permalink / raw)
To: Christian Brauner; +Cc: Jan Kara, Christoph Hellwig, linux-fsdevel
On Thu 19-10-23 10:33:36, Christian Brauner wrote:
> On Wed, Oct 18, 2023 at 05:29:24PM +0200, Jan Kara wrote:
> > The implementations of bdev holder operations such as fs_bdev_mark_dead()
> > and fs_bdev_sync() grab the sb->s_umount semaphore under
> > bdev->bd_holder_lock. This is problematic because it leads to a
> > disk->open_mutex -> sb->s_umount lock ordering, which is counterintuitive
> > (usually we grab higher-level (e.g. filesystem) locks first and
> > lower-level (e.g. block layer) locks later) and indeed makes lockdep
> > complain about possible locking cycles whenever we open a block device
> > while holding the sb->s_umount semaphore. Implement a function
>
> This patch, together with my series that Christoph sent out for me
> Link: https://lore.kernel.org/r/20231017184823.1383356-1-hch@lst.de
> two days ago (tyvm!), makes the lockdep false positives go away entirely
> and also eliminates the counterintuitive ordering requirement that forced
> us to give up s_umount before opening block devices.
>
> I've verified that yesterday and did a bunch of testing via sudden
> device removal.
>
> Christoph had thankfully added generic/730 and generic/731 to emulate
> some device removal. I also messed around with the loop code and
> specifically used LOOP_CHANGE_FD to force a disk_force_media_change() on
> a filesystem.
Ah, glad to hear that! So now we can also slowly work on undoing the unlock
s_umount, open bdev, lock s_umount games we have introduced in several
places. But I guess let's wait a bit for the dust to settle :)
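The games in question look roughly like this (just a sketch of the
pattern, not a quote of any particular call site; blkdev_get_by_dev()
stands in for whatever bdev-opening helper the call site actually uses):

    up_write(&sb->s_umount);    /* drop s_umount to respect the old ordering */
    bdev = blkdev_get_by_dev(dev, mode, holder, hops); /* takes open_mutex */
    down_write(&sb->s_umount);  /* retake s_umount once the bdev is open */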
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH] fs: Avoid grabbing sb->s_umount under bdev->bd_holder_lock
From: Christoph Hellwig @ 2023-10-19 13:40 UTC
To: Christian Brauner; +Cc: Jan Kara, Christoph Hellwig, linux-fsdevel
On Thu, Oct 19, 2023 at 10:33:36AM +0200, Christian Brauner wrote:
> some device removal. I also messed around with the loop code and
> specifically used LOOP_CHANGE_FD to force a disk_force_media_change() on
> a filesystem.
Can you wire that up for either blktests or xfstests as well?
* Re: [PATCH] fs: Avoid grabbing sb->s_umount under bdev->bd_holder_lock
From: Christian Brauner @ 2023-10-20 11:18 UTC
To: Jan Kara; +Cc: Christoph Hellwig, linux-fsdevel
On Thu, Oct 19, 2023 at 12:57:17PM +0200, Jan Kara wrote:
> On Thu 19-10-23 10:33:36, Christian Brauner wrote:
> > On Wed, Oct 18, 2023 at 05:29:24PM +0200, Jan Kara wrote:
> > > The implementations of bdev holder operations such as fs_bdev_mark_dead()
> > > and fs_bdev_sync() grab the sb->s_umount semaphore under
> > > bdev->bd_holder_lock. This is problematic because it leads to a
> > > disk->open_mutex -> sb->s_umount lock ordering, which is counterintuitive
> > > (usually we grab higher-level (e.g. filesystem) locks first and
> > > lower-level (e.g. block layer) locks later) and indeed makes lockdep
> > > complain about possible locking cycles whenever we open a block device
> > > while holding the sb->s_umount semaphore. Implement a function
> >
> > This patch, together with my series that Christoph sent out for me
> > Link: https://lore.kernel.org/r/20231017184823.1383356-1-hch@lst.de
> > two days ago (tyvm!), makes the lockdep false positives go away entirely
> > and also eliminates the counterintuitive ordering requirement that forced
> > us to give up s_umount before opening block devices.
> >
> > I've verified that yesterday and did a bunch of testing via sudden
> > device removal.
> >
> > Christoph had thankfully added generic/730 and generic/731 to emulate
> > some device removal. I also messed around with the loop code and
> > specifically used LOOP_CHANGE_FD to force a disk_force_media_change() on
> > a filesystem.
>
> Ah, glad to hear that! So now we can also slowly work on undoing the unlock
> s_umount, open bdev, lock s_umount games we have introduced in several
> places. But I guess let's wait a bit for the dust to settle :)
Yeah, I had thought about this right away as well. And agreed, that can
happen later. :)
* Re: [PATCH] fs: Avoid grabbing sb->s_umount under bdev->bd_holder_lock
From: Christian Brauner @ 2023-10-20 11:31 UTC
To: Christoph Hellwig; +Cc: Jan Kara, linux-fsdevel
On Thu, Oct 19, 2023 at 06:40:27AM -0700, Christoph Hellwig wrote:
> On Thu, Oct 19, 2023 at 10:33:36AM +0200, Christian Brauner wrote:
> > some device removal. I also messed around with the loop code and
> > specifically used LOOP_CHANGE_FD to force a disk_force_media_change() on
> > a filesystem.
>
> Can you wire that up for either blktests or xfstests as well?
Yeah, I'll try to find some time to do this.
So while I was testing this I realized that the behavior of
LOOP_CHANGE_FD changed in commit 9f65c489b68d ("loop: raise media_change
event") and I'm not clear whether this is actually correct or not.
loop(4) states
"Switch the backing store of the loop device to the new file identified
file descriptor specified in the (third) ioctl(2) argument, which is an
integer. This operation is possible only if the loop device is
read-only and the new backing store is the same size and type as the
old backing store."
So the original use-case for this ioctl seems to have been to silently
swap out the backing store. Specifically, it seems to have been used once
upon a time for live images together with device mapper over read-only
loop devices, where device mapper could direct the writes to some other
location.
Assuming that's correct, I think as long as you have something like
device mapper that doesn't use blk_holder_ops it would still work. But
that commit changed behavior for filesystems because we now do:
loop_change_fd()
-> disk_force_media_change()
-> bdev_mark_dead()
-> bdev->bd_holder_ops->mark_dead::fs_bdev_mark_dead()
So in essence if you have a filesystem on a loop device via:
truncate -s 10G img1
mkfs.xfs -f img1
LOOP_DEV=$(sudo losetup -f --read-only --show img1)
truncate -s 10G img2
sudo ./loop_change_fd $LOOP_DEV ./img2
We'll shut down that filesystem. I personally think that's correct and
it's buggy not to do that when we have the ability to inform the fs
about media removal but technically it kinda defeats the original
use-case for LOOP_CHANGE_FD.
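For reference, ./loop_change_fd is just a thin wrapper around the ioctl.
A minimal sketch (the actual helper wasn't posted, but it amounts to
something like this; note the loop device must be read-only for the ioctl
to succeed):

    /* loop_change_fd.c - swap the backing file of a read-only loop device */
    #include <fcntl.h>
    #include <linux/loop.h>
    #include <stdio.h>
    #include <sys/ioctl.h>

    int main(int argc, char **argv)
    {
            int loop_fd, file_fd;

            if (argc != 3) {
                    fprintf(stderr, "usage: %s <loopdev> <backing-file>\n",
                            argv[0]);
                    return 1;
            }
            loop_fd = open(argv[1], O_RDONLY);
            file_fd = open(argv[2], O_RDONLY);
            if (loop_fd < 0 || file_fd < 0) {
                    perror("open");
                    return 1;
            }
            /* the third ioctl(2) argument is the new backing fd */
            if (ioctl(loop_fd, LOOP_CHANGE_FD, file_fd) < 0) {
                    perror("ioctl(LOOP_CHANGE_FD)");
                    return 1;
            }
            return 0;
    }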
In practice however, I don't think it matters because I think no one is
using LOOP_CHANGE_FD in that way. Right now all this is useful for is a
bdev_mark_dead() test.
And one final question:
dev_set_uevent_suppress(disk_to_dev(lo->lo_disk), 1);
disk_force_media_change(lo->lo_disk);
/* more stuff */
dev_set_uevent_suppress(disk_to_dev(lo->lo_disk), 0);
What exactly does that achieve? Does it just delay the delivery of the
uevent until after the disk sequence number has been changed in
disk_force_media_change()? Because it doesn't seem to actually prevent
uevent generation.
* Re: [PATCH] fs: Avoid grabbing sb->s_umount under bdev->bd_holder_lock
From: Jan Kara @ 2023-10-20 12:04 UTC
To: Christian Brauner; +Cc: Christoph Hellwig, Jan Kara, linux-fsdevel
On Fri 20-10-23 13:31:20, Christian Brauner wrote:
> On Thu, Oct 19, 2023 at 06:40:27AM -0700, Christoph Hellwig wrote:
> > On Thu, Oct 19, 2023 at 10:33:36AM +0200, Christian Brauner wrote:
> > > some device removal. I also messed around with the loop code and
> > > specifically used LOOP_CHANGE_FD to force a disk_force_media_change() on
> > > a filesystem.
> >
> > Can you wire that up for either blktests or xfstests as well?
>
> Yeah, I'll try to find some time to do this.
>
> So while I was testing this I realized that the behavior of
> LOOP_CHANGE_FD changed in commit 9f65c489b68d ("loop: raise media_change
> event") and I'm not clear whether this is actually correct or not.
>
> loop(4) states
>
> "Switch the backing store of the loop device to the new file identified
> file descriptor specified in the (third) ioctl(2) argument, which is an
> integer. This operation is possible only if the loop device is
> read-only and the new backing store is the same size and type as the
> old backing store."
>
> So the original use-case for this ioctl seems to have been to silently
> swap out the backing store. Specifically, it seems to have been used once
> upon a time for live images together with device mapper over read-only
> loop devices, where device mapper could direct the writes to some other
> location.
LOOP_CHANGE_FD was an old hacky way to switch from a read-only
installation image to a full-blown RW filesystem without a reboot. I'm not
sure about the details of how it was supposed to work, but reportedly Al
Viro implemented it for Fedora installations back in the old days.
> Assuming that's correct, I think as long as you have something like
> device mapper that doesn't use blk_holder_ops it would still work. But
> that commit changed behavior for filesystems because we now do:
>
> loop_change_fd()
> -> disk_force_media_change()
> -> bdev_mark_dead()
> -> bdev->bd_holder_ops->mark_dead::fs_bdev_mark_dead()
>
> So in essence if you have a filesystem on a loop device via:
>
> truncate -s 10G img1
> mkfs.xfs -f img1
> LOOP_DEV=$(sudo losetup -f --read-only --show img1)
>
> truncate -s 10G img2
> sudo ./loop_change_fd $LOOP_DEV ./img2
>
> We'll shut down that filesystem. I personally think that's correct and
> it's buggy not to do that when we have the ability to inform the fs
> about media removal but technically it kinda defeats the original
> use-case for LOOP_CHANGE_FD.
I agree that it breaks the original usecase for LOOP_CHANGE_FD. I'd say it
also shows nobody is likely using LOOP_CHANGE_FD anymore. Maybe time to try
deprecating it?
> In practice however, I don't think it matters because I think no one is
> using LOOP_CHANGE_FD in that way. Right now all this is useful for is a
> bdev_mark_dead() test.
:)
> And one final question:
>
> dev_set_uevent_suppress(disk_to_dev(lo->lo_disk), 1);
> disk_force_media_change(lo->lo_disk);
> /* more stuff */
> dev_set_uevent_suppress(disk_to_dev(lo->lo_disk), 0);
>
> What exactly does that achieve? Does it just delay the delivery of the
> uevent until after the disk sequence number has been changed in
> disk_force_media_change()? Because it doesn't seem to actually prevent
> uevent generation.
Well, if you grep for dev_get_uevent_suppress() you'll notice there is
exactly *one* place looking at it - the generation of the ADD event when
adding a partition bdev. I'm not sure what the rationale behind this
functionality is.
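From memory, that lone user sits at the end of add_partition() in
block/partitions/core.c and looks roughly like this (quoting from memory,
so check the tree):

    /* suppress uevent if the disk suppresses it */
    if (!dev_get_uevent_suppress(ddev))
            kobject_uevent(&pdev->kobj, KOBJ_ADD);

where ddev is the whole-disk device and pdev the freshly added partition's
device.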
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
* Re: [PATCH] fs: Avoid grabbing sb->s_umount under bdev->bd_holder_lock
From: Christian Brauner @ 2023-10-23 7:40 UTC
To: Jan Kara; +Cc: Christoph Hellwig, linux-fsdevel
> I agree that it breaks the original usecase for LOOP_CHANGE_FD. I'd say it
> also shows nobody is likely using LOOP_CHANGE_FD anymore. Maybe time to try
> deprecating it?
Yeah, why don't we try that. In its current form the concept isn't very
useful and is arguably broken, which no one has really noticed for years.
* codesearch.debian.net has zero users
* codesearch on github has zero users
* cs.android.com has zero users
Only definitions turn up, plus strace, ltp, and stress-ng, which sort of
test this functionality. I say we try to deprecate this.
* LOOP_CONFIGURE uevents
From: Christian Brauner @ 2023-10-23 14:08 UTC
To: Jan Kara, Christoph Hellwig; +Cc: linux-fsdevel, linux-block
> > And one final question:
> >
> > dev_set_uevent_suppress(disk_to_dev(lo->lo_disk), 1);
> > disk_force_media_change(lo->lo_disk);
> > /* more stuff */
> > dev_set_uevent_suppress(disk_to_dev(lo->lo_disk), 0);
> >
> > What exactly does that achieve? Does it just delay the delivery of the
> > uevent until after the disk sequence number has been changed in
> > disk_force_media_change()? Because it doesn't seem to actually prevent
> > uevent generation.
>
> Well, if you grep for dev_get_uevent_suppress() you'll notice there is
> exactly *one* place looking at it - the generation of the ADD event when
> adding a partition bdev. I'm not sure what the rationale behind this
> functionality is.
I looked at dev_set_uevent_suppress() before, and what it does is fully
prevent the generation of uevents for the kobject. It doesn't just hold
them back, as the "uncork" comments in loop_change_fd() and
loop_configure() suggest:
static inline void dev_set_uevent_suppress(struct device *dev, int val)
{
dev->kobj.uevent_suppress = val;
}
and then
int kobject_uevent_env(struct kobject *kobj, enum kobject_action action,
char *envp_ext[])
{
[...]
/* skip the event, if uevent_suppress is set*/
if (kobj->uevent_suppress) {
pr_debug("kobject: '%s' (%p): %s: uevent_suppress "
"caused the event to drop!\n",
kobject_name(kobj), kobj, __func__);
return 0;
}
So commit 498ef5c777d9 ("loop: suppress uevents while reconfiguring the device")
tried to fix a problem where uevents were generated for LOOP_SET_FD
before LOOP_SET_STATUS* was called.
That broke LOOP_CONFIGURE because LOOP_CONFIGURE is supposed to be
LOOP_SET_FD + LOOP_SET_STATUS in one shot.
Then commit bb430b694226 ("loop: LOOP_CONFIGURE: send uevents for partitions")
fixed that by moving loop_reread_partitions() out of the uevent
suppression.
Now you get uevents if you trigger a partition rescan, but only if there
actually are partitions. What you never get, however, is a media change
event, even though we do increment the disk sequence number and attach an
image to the loop device.
This seems problematic because we leave userspace unable to listen for
attaching images to a loop device. Shouldn't we regenerate the media
change event after we're done setting up the device and before the
partition rescan for LOOP_CONFIGURE?
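Concretely, I'm thinking of something like this right after we uncork (a
rough sketch only; whether a bare KOBJ_CHANGE event is the right vehicle
here is exactly the question):

    dev_set_uevent_suppress(disk_to_dev(lo->lo_disk), 0);
    /* re-announce the media change that was swallowed while corked */
    kobject_uevent(&disk_to_dev(lo->lo_disk)->kobj, KOBJ_CHANGE);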
* loop change deprecation bdev->bd_holder_lock
From: Christian Brauner @ 2023-10-23 15:35 UTC
To: Jan Kara, Christoph Hellwig; +Cc: linux-fsdevel, linux-block, Jens Axboe
On Mon, Oct 23, 2023 at 09:40:44AM +0200, Christian Brauner wrote:
> > I agree that it breaks the original usecase for LOOP_CHANGE_FD. I'd say it
> > also shows nobody is likely using LOOP_CHANGE_FD anymore. Maybe time to try
> > deprecating it?
>
> Yeah, why don't we try that. In its current form the concept isn't very
> useful and is arguably broken, which no one has really noticed for years.
>
> * codesearch.debian.net has zero users
> * codesearch on github has zero users
> * cs.android.com has zero users
>
> Only definitions turn up, plus strace, ltp, and stress-ng, which sort of
> test this functionality. I say we try to deprecate this.
I just realized that if we're able to deprecate LOOP_CHANGE_FD we remove
one of the most problematic/weird cases for partitions and filesystems.
So right now, we can attach a filesystem image to a loop device and then
have partitions be claimed by different filesystems:
* create an image with two partitions
* format first partition as xfs, second as ext4
sudo losetup -f --show --read-only -P img1
sudo mount /dev/loop0p1 /mnt1
sudo mount /dev/loop0p2 /mnt2
What happens here is all very wrong afaict. When we issue, e.g., a
change fd event on the first partition:
sudo ./loop_change_fd /dev/loop0p1 img2
we call disk_force_media_change() but that only works on disk->part0
which means that we don't even cleanly shut down the filesystem on the
partition we're trying to mess around with.
Later on we then completely fall on our faces when we fail to remove the
partitions because they're still in active use.
So I guess, ideally, we'd really force removal of the partitions and
shut down the filesystem but we can't easily do that because of locking
requirements where we can't acquire s_umount beneath open_mutex. Plus,
this wouldn't fit with the original semantics.
There's probably ways around this but I don't think that's actually
worth it for LOOP_CHANGE_FD. The places where forced removal of
partitions really matters is del_gendisk(). And there we've got it
working correctly and are able to actually remove partitions that still
have active users.
For now, we should give up any pretense that disk_force_media_change()
does anything useful for loop change fd and simply remove it completely.
It's either useless, or it breaks the original semantics of loop change
fd although I don't think anyone's ever used it the way I described
above.
So?
From 0d9b5c4963791f0c3ba8529ca56233be87122c59 Mon Sep 17 00:00:00 2001
From: Christian Brauner <brauner@kernel.org>
Date: Mon, 23 Oct 2023 17:24:56 +0200
Subject: [PATCH] loop: drop disk_force_media_change()
The disk_force_media_change() call currently only matters for
filesystems, and even there it's not very useful. Consider attaching a
filesystem image to a loop device and then have partitions be claimed by
different filesystems:
* create an image with two partitions
* format first partition as xfs, second as ext4
sudo losetup -f --show --read-only -P img1
sudo mount /dev/loop0p1 /mnt1
sudo mount /dev/loop0p2 /mnt2
When we issue, e.g., a loop change fd event on the first partition:
sudo ./loop_change_fd /dev/loop0p1 img2
we call disk_force_media_change() but that only works on disk->part0
which means that we don't even cleanly shut down the filesystem on the
partition we're trying to mess around with.
Later on we then completely fall on our faces when we fail to remove the
partitions because they're still in active use.
Give up any pretense that disk_force_media_change() does anything useful
for loop change fd and simply remove it completely. It's either useless,
or it's meaningless for the original semantics of loop change fd. Anyone
who uses this is on their own anyway.
Signed-off-by: Christian Brauner <brauner@kernel.org>
---
drivers/block/loop.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 9f2d412fc560..87c98e35abac 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -603,7 +603,8 @@ static int loop_change_fd(struct loop_device *lo, struct block_device *bdev,
goto out_err;
/* and ... switch */
- disk_force_media_change(lo->lo_disk);
+ invalidate_bdev(lo->lo_disk->part0);
+ set_bit(GD_NEED_PART_SCAN, &lo->lo_disk->state);
blk_mq_freeze_queue(lo->lo_queue);
mapping_set_gfp_mask(old_file->f_mapping, lo->old_gfp_mask);
lo->lo_backing_file = file;
--
2.34.1
* Re: loop change deprecation bdev->bd_holder_lock
From: Christoph Hellwig @ 2023-10-24 7:03 UTC
To: Christian Brauner
Cc: Jan Kara, Christoph Hellwig, linux-fsdevel, linux-block,
Jens Axboe
On Mon, Oct 23, 2023 at 05:35:25PM +0200, Christian Brauner wrote:
> I just realized that if we're able to deprecate LOOP_CHANGE_FD we remove
> one of the most problematic/weird cases for partitions and filesystems.
> change fd event on the first partition:
>
> sudo ./loop_change_fd /dev/loop0p1 img2
>
> we call disk_force_media_change() but that only works on disk->part0
> which means that we don't even cleanly shut down the filesystem on the
> partition we're trying to mess around with.
Yes, disk_force_media_change has had that general problem since the
early Linux days (it had a different name back then, though). I think
it is because traditionally removable media in Linux never had
partitions, e.g. the CDROM drivers typically only allocated a single
minor number so they could not be scanned. But that has changed because
the interfaces got used for different use cases, and we also had
dynamic majors for a long time that now allow partitions. And there
are real use cases even for traditional removable media, e.g. MacOS
CDROMs traditionally did have partitions.
> For now, we should give up any pretense that disk_force_media_change()
> does anything useful for loop change fd and simply remove it completely.
> It's either useless, or it breaks the original semantics of loop change
> fd although I don't think anyone's ever used it the way I described
> above.
Maybe we can just drop the CHANGE_FD ioctl and see if anyone screams?
* Re: LOOP_CONFIGURE uevents
From: Christoph Hellwig @ 2023-10-24 7:06 UTC
To: Christian Brauner; +Cc: Jan Kara, Christoph Hellwig, linux-fsdevel, linux-block
On Mon, Oct 23, 2023 at 04:08:21PM +0200, Christian Brauner wrote:
> Now you get uevents if you trigger a partition rescan, but only if there
> actually are partitions. What you never get, however, is a media change
> event, even though we do increment the disk sequence number and attach an
> image to the loop device.
>
> This seems problematic because we leave userspace unable to listen for
> attaching images to a loop device. Shouldn't we regenerate the media
> change event after we're done setting up the device and before the
> partition rescan for LOOP_CONFIGURE?
Maybe. I think people mostly didn't care about the event anyway, but
about the changing sequence number, to check that the content hasn't
changed.
* Re: LOOP_CONFIGURE uevents
From: Christian Brauner @ 2023-10-24 8:42 UTC
To: Christoph Hellwig; +Cc: Jan Kara, linux-fsdevel, linux-block
On Tue, Oct 24, 2023 at 12:06:26AM -0700, Christoph Hellwig wrote:
> On Mon, Oct 23, 2023 at 04:08:21PM +0200, Christian Brauner wrote:
> > Now you get uevents if you trigger a partition rescan, but only if there
> > actually are partitions. What you never get, however, is a media change
> > event, even though we do increment the disk sequence number and attach an
> > image to the loop device.
> >
> > This seems problematic because we leave userspace unable to listen for
> > attaching images to a loop device. Shouldn't we regenerate the media
> > change event after we're done setting up the device and before the
> > partition rescan for LOOP_CONFIGURE?
>
> Maybe. I think people mostly didn't care about the event anyway, but
> about the changing sequence number, to check that the content hasn't
> changed.
Yes, exactly. The core issue is the changed sequence number, but you
don't get that notification if you don't have any partitions, even when
you request a partition rescan from LOOP_CONFIGURE.
So I think we should always send the media change event for the sequence
number independent of the partition notification.
* Re: loop change deprecation
From: Christian Brauner @ 2023-10-24 8:44 UTC
To: Christoph Hellwig; +Cc: Jan Kara, linux-fsdevel, linux-block, Jens Axboe
(Sorry for the broken "Subject:" btw in the first mail.)
On Tue, Oct 24, 2023 at 12:03:25AM -0700, Christoph Hellwig wrote:
> On Mon, Oct 23, 2023 at 05:35:25PM +0200, Christian Brauner wrote:
> > I just realized that if we're able to deprecate LOOP_CHANGE_FD we remove
> > one of the most problematic/weird cases for partitions and filesystems.
>
> > change fd event on the first partition:
> >
> > sudo ./loop_change_fd /dev/loop0p1 img2
> >
> > we call disk_force_media_change() but that only works on disk->part0
> > which means that we don't even cleanly shut down the filesystem on the
> > partition we're trying to mess around with.
>
> Yes, disk_force_media_change has had that general problem since the
> early Linux days (it had a different name back then, though). I think
> it is because traditionally removable media in Linux never had
> partitions, e.g. the CDROM drivers typically only allocated a single
> minor number so they could not be scanned. But that has changed because
> the interfaces got used for different use cases, and we also had
> dynamic majors for a long time that now allow partitions. And there
> are real use cases even for traditional removable media, e.g. MacOS
> CDROMs traditionally did have partitions.
>
> > For now, we should give up any pretense that disk_force_media_change()
> > does anything useful for loop change fd and simply remove it completely.
> > It's either useless, or it breaks the original semantics of loop change
> > fd although I don't think anyone's ever used it the way I described
> > above.
>
> Maybe we can just drop the CHANGE_FD ioctl and see if anyone screams?
Yes, I suggested that in the prior mail. I think we should do that.
We'd need changes to LTP and blktests but there are no active users in
either codesearch.debian, cs.github, or cs.android.