* [PATCH] fs: open the block device after allocation the super_block
@ 2023-07-24 17:51 Christoph Hellwig
  2023-07-25 12:35 ` Christian Brauner
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Christoph Hellwig @ 2023-07-24 17:51 UTC (permalink / raw)
  To: viro, brauner, jack; +Cc: linux-fsdevel
From: Jan Kara <jack@suse.cz>
Currently get_tree_bdev and mount_bdev open the block device before
commiting to allocating a super block.  This means the block device
is opened even for bind mounts and other reuses of the super_block.
That creates problems for restricting the number of writers to a device,
and also leads to a unusual and not very helpful holder (the fs_type).
Reorganize the mount code to first look whether the superblock for a
particular device is already mounted and open the block device only if
it is not.
Signed-off-by: Jan Kara <jack@suse.cz>
[hch: port to before the bdev_handle changes,
      duplicate the bdev read-only check from blkdev_get_by_path,
      extend the fsfree_mutex coverage to protect against freezes,
      fix an open bdev leak when the bdev is frozen,
      use the bdev local variable more,
      rename the s variable to sb to be more descriptive]
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
So I promised to get a series that builds on top of this ready, but
I'm way to busy and this will take a while.  Getting this reworked
version of Jan's patch out for everyone to use it as a based given
that Christian is back from vacation, and I think Jan should be about
back now as well.
 fs/super.c | 194 +++++++++++++++++++++++++++--------------------------
 1 file changed, 98 insertions(+), 96 deletions(-)
diff --git a/fs/super.c b/fs/super.c
index e781226e28800c..3ef39df5bec506 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1228,12 +1228,7 @@ static const struct blk_holder_ops fs_holder_ops = {
 
 static int set_bdev_super(struct super_block *s, void *data)
 {
-	s->s_bdev = data;
-	s->s_dev = s->s_bdev->bd_dev;
-	s->s_bdi = bdi_get(s->s_bdev->bd_disk->bdi);
-
-	if (bdev_stable_writes(s->s_bdev))
-		s->s_iflags |= SB_I_STABLE_WRITES;
+	s->s_dev = *(dev_t *)data;
 	return 0;
 }
 
@@ -1244,7 +1239,61 @@ static int set_bdev_super_fc(struct super_block *s, struct fs_context *fc)
 
 static int test_bdev_super_fc(struct super_block *s, struct fs_context *fc)
 {
-	return !(s->s_iflags & SB_I_RETIRED) && s->s_bdev == fc->sget_key;
+	return !(s->s_iflags & SB_I_RETIRED) &&
+		s->s_dev == *(dev_t *)fc->sget_key;
+}
+
+static int setup_bdev_super(struct super_block *sb, int sb_flags,
+		struct fs_context *fc)
+{
+	blk_mode_t mode = sb_open_mode(sb_flags);
+	struct block_device *bdev;
+
+	bdev = blkdev_get_by_dev(sb->s_dev, mode, sb->s_type, &fs_holder_ops);
+	if (IS_ERR(bdev)) {
+		if (fc)
+			errorf(fc, "%s: Can't open blockdev", fc->source);
+		return PTR_ERR(bdev);
+	}
+
+	/*
+	 * This really should be in blkdev_get_by_dev, but right now can't due
+	 * to legacy issues that require us to allow opening a block device node
+	 * writable from userspace even for a read-only block device.
+	 */
+	if ((mode & BLK_OPEN_WRITE) && bdev_read_only(bdev)) {
+		blkdev_put(bdev, sb->s_type);
+		return -EACCES;
+	}
+
+	/*
+	 * Until SB_BORN flag is set, there can be no active superblock
+	 * references and thus no filesystem freezing. get_active_super() will
+	 * just loop waiting for SB_BORN so even freeze_bdev() cannot proceed.
+	 *
+	 * It is enough to check bdev was not frozen before we set s_bdev.
+	 */
+	mutex_lock(&bdev->bd_fsfreeze_mutex);
+	if (bdev->bd_fsfreeze_count > 0) {
+		mutex_unlock(&bdev->bd_fsfreeze_mutex);
+		if (fc)
+			warnf(fc, "%pg: Can't mount, blockdev is frozen", bdev);
+		blkdev_put(bdev, sb->s_type);
+		return -EBUSY;
+	}
+	spin_lock(&sb_lock);
+	sb->s_bdev = bdev;
+	sb->s_bdi = bdi_get(bdev->bd_disk->bdi);
+	if (bdev_stable_writes(bdev))
+		sb->s_iflags |= SB_I_STABLE_WRITES;
+	spin_unlock(&sb_lock);
+	mutex_unlock(&bdev->bd_fsfreeze_mutex);
+
+	snprintf(sb->s_id, sizeof(sb->s_id), "%pg", bdev);
+	shrinker_debugfs_rename(&sb->s_shrink, "sb-%s:%s", sb->s_type->name,
+				sb->s_id);
+	sb_set_blocksize(sb, block_size(bdev));
+	return 0;
 }
 
 /**
@@ -1256,73 +1305,50 @@ int get_tree_bdev(struct fs_context *fc,
 		int (*fill_super)(struct super_block *,
 				  struct fs_context *))
 {
-	struct block_device *bdev;
 	struct super_block *s;
 	int error = 0;
+	dev_t dev;
 
 	if (!fc->source)
 		return invalf(fc, "No source specified");
 
-	bdev = blkdev_get_by_path(fc->source, sb_open_mode(fc->sb_flags),
-				  fc->fs_type, &fs_holder_ops);
-	if (IS_ERR(bdev)) {
-		errorf(fc, "%s: Can't open blockdev", fc->source);
-		return PTR_ERR(bdev);
-	}
-
-	/* Once the superblock is inserted into the list by sget_fc(), s_umount
-	 * will protect the lockfs code from trying to start a snapshot while
-	 * we are mounting
-	 */
-	mutex_lock(&bdev->bd_fsfreeze_mutex);
-	if (bdev->bd_fsfreeze_count > 0) {
-		mutex_unlock(&bdev->bd_fsfreeze_mutex);
-		warnf(fc, "%pg: Can't mount, blockdev is frozen", bdev);
-		blkdev_put(bdev, fc->fs_type);
-		return -EBUSY;
+	error = lookup_bdev(fc->source, &dev);
+	if (error) {
+		errorf(fc, "%s: Can't lookup blockdev", fc->source);
+		return error;
 	}
 
 	fc->sb_flags |= SB_NOSEC;
-	fc->sget_key = bdev;
+	fc->sget_key = &dev;
 	s = sget_fc(fc, test_bdev_super_fc, set_bdev_super_fc);
-	mutex_unlock(&bdev->bd_fsfreeze_mutex);
-	if (IS_ERR(s)) {
-		blkdev_put(bdev, fc->fs_type);
+	if (IS_ERR(s))
 		return PTR_ERR(s);
-	}
 
 	if (s->s_root) {
 		/* Don't summarily change the RO/RW state. */
 		if ((fc->sb_flags ^ s->s_flags) & SB_RDONLY) {
-			warnf(fc, "%pg: Can't mount, would change RO state", bdev);
+			warnf(fc, "%pg: Can't mount, would change RO state", s->s_bdev);
 			deactivate_locked_super(s);
-			blkdev_put(bdev, fc->fs_type);
 			return -EBUSY;
 		}
-
+	} else {
 		/*
-		 * s_umount nests inside open_mutex during
-		 * __invalidate_device().  blkdev_put() acquires
-		 * open_mutex and can't be called under s_umount.  Drop
-		 * s_umount temporarily.  This is safe as we're
-		 * holding an active reference.
+		 * We drop s_umount here because we need to open the bdev and
+		 * bdev->open_mutex ranks above s_umount (blkdev_put() ->
+		 * __invalidate_device()). It is safe because we have active sb
+		 * reference and SB_BORN is not set yet.
 		 */
 		up_write(&s->s_umount);
-		blkdev_put(bdev, fc->fs_type);
+		error = setup_bdev_super(s, fc->sb_flags, fc);
 		down_write(&s->s_umount);
-	} else {
-		snprintf(s->s_id, sizeof(s->s_id), "%pg", bdev);
-		shrinker_debugfs_rename(&s->s_shrink, "sb-%s:%s",
-					fc->fs_type->name, s->s_id);
-		sb_set_blocksize(s, block_size(bdev));
-		error = fill_super(s, fc);
+		if (!error)
+			error = fill_super(s, fc);
 		if (error) {
 			deactivate_locked_super(s);
 			return error;
 		}
-
 		s->s_flags |= SB_ACTIVE;
-		bdev->bd_super = s;
+		s->s_bdev->bd_super = s;
 	}
 
 	BUG_ON(fc->root);
@@ -1333,79 +1359,53 @@ EXPORT_SYMBOL(get_tree_bdev);
 
 static int test_bdev_super(struct super_block *s, void *data)
 {
-	return !(s->s_iflags & SB_I_RETIRED) && (void *)s->s_bdev == data;
+	return !(s->s_iflags & SB_I_RETIRED) && s->s_dev == *(dev_t *)data;
 }
 
 struct dentry *mount_bdev(struct file_system_type *fs_type,
 	int flags, const char *dev_name, void *data,
 	int (*fill_super)(struct super_block *, void *, int))
 {
-	struct block_device *bdev;
 	struct super_block *s;
-	int error = 0;
+	int error;
+	dev_t dev;
 
-	bdev = blkdev_get_by_path(dev_name, sb_open_mode(flags), fs_type,
-				  &fs_holder_ops);
-	if (IS_ERR(bdev))
-		return ERR_CAST(bdev);
+	error = lookup_bdev(dev_name, &dev);
+	if (error)
+		return ERR_PTR(error);
 
-	/*
-	 * once the super is inserted into the list by sget, s_umount
-	 * will protect the lockfs code from trying to start a snapshot
-	 * while we are mounting
-	 */
-	mutex_lock(&bdev->bd_fsfreeze_mutex);
-	if (bdev->bd_fsfreeze_count > 0) {
-		mutex_unlock(&bdev->bd_fsfreeze_mutex);
-		error = -EBUSY;
-		goto error_bdev;
-	}
-	s = sget(fs_type, test_bdev_super, set_bdev_super, flags | SB_NOSEC,
-		 bdev);
-	mutex_unlock(&bdev->bd_fsfreeze_mutex);
+	flags |= SB_NOSEC;
+	s = sget(fs_type, test_bdev_super, set_bdev_super, flags, &dev);
 	if (IS_ERR(s))
-		goto error_s;
+		return ERR_CAST(s);
 
 	if (s->s_root) {
 		if ((flags ^ s->s_flags) & SB_RDONLY) {
 			deactivate_locked_super(s);
-			error = -EBUSY;
-			goto error_bdev;
+			return ERR_PTR(-EBUSY);
 		}
-
+	} else {
 		/*
-		 * s_umount nests inside open_mutex during
-		 * __invalidate_device().  blkdev_put() acquires
-		 * open_mutex and can't be called under s_umount.  Drop
-		 * s_umount temporarily.  This is safe as we're
-		 * holding an active reference.
+		 * We drop s_umount here because we need to open the bdev and
+		 * bdev->open_mutex ranks above s_umount (blkdev_put() ->
+		 * __invalidate_device()). It is safe because we have active sb
+		 * reference and SB_BORN is not set yet.
 		 */
 		up_write(&s->s_umount);
-		blkdev_put(bdev, fs_type);
+		error = setup_bdev_super(s, flags, NULL);
 		down_write(&s->s_umount);
-	} else {
-		snprintf(s->s_id, sizeof(s->s_id), "%pg", bdev);
-		shrinker_debugfs_rename(&s->s_shrink, "sb-%s:%s",
-					fs_type->name, s->s_id);
-		sb_set_blocksize(s, block_size(bdev));
-		error = fill_super(s, data, flags & SB_SILENT ? 1 : 0);
+		if (!error)
+			error = fill_super(s, data, flags & SB_SILENT ? 1 : 0);
 		if (error) {
 			deactivate_locked_super(s);
-			goto error;
+			return ERR_PTR(error);
 		}
 
 		s->s_flags |= SB_ACTIVE;
-		bdev->bd_super = s;
+		s->s_bdev->bd_super = s;
 	}
 
 	return dget(s->s_root);
-
-error_s:
-	error = PTR_ERR(s);
-error_bdev:
-	blkdev_put(bdev, fs_type);
-error:
-	return ERR_PTR(error);
 }
 EXPORT_SYMBOL(mount_bdev);
 
@@ -1413,10 +1413,12 @@ void kill_block_super(struct super_block *sb)
 {
 	struct block_device *bdev = sb->s_bdev;
 
-	bdev->bd_super = NULL;
 	generic_shutdown_super(sb);
-	sync_blockdev(bdev);
-	blkdev_put(bdev, sb->s_type);
+	if (bdev) {
+		bdev->bd_super = NULL;
+		sync_blockdev(bdev);
+		blkdev_put(bdev, sb->s_type);
+	}
 }
 
 EXPORT_SYMBOL(kill_block_super);
-- 
2.39.2
^ permalink raw reply related	[flat|nested] 9+ messages in thread- * Re: [PATCH] fs: open the block device after allocation the super_block
  2023-07-24 17:51 [PATCH] fs: open the block device after allocation the super_block Christoph Hellwig
@ 2023-07-25 12:35 ` Christian Brauner
  2023-07-25 16:32   ` Christian Brauner
  2023-07-25 15:53 ` Christian Brauner
  2023-08-15 14:43 ` Christian Brauner
  2 siblings, 1 reply; 9+ messages in thread
From: Christian Brauner @ 2023-07-25 12:35 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: viro, jack, linux-fsdevel
On Mon, Jul 24, 2023 at 10:51:45AM -0700, Christoph Hellwig wrote:
> From: Jan Kara <jack@suse.cz>
> 
> Currently get_tree_bdev and mount_bdev open the block device before
> commiting to allocating a super block.  This means the block device
> is opened even for bind mounts and other reuses of the super_block.
> 
> That creates problems for restricting the number of writers to a device,
> and also leads to a unusual and not very helpful holder (the fs_type).
> 
> Reorganize the mount code to first look whether the superblock for a
> particular device is already mounted and open the block device only if
> it is not.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>
> [hch: port to before the bdev_handle changes,
>       duplicate the bdev read-only check from blkdev_get_by_path,
>       extend the fsfree_mutex coverage to protect against freezes,
>       fix an open bdev leak when the bdev is frozen,
>       use the bdev local variable more,
>       rename the s variable to sb to be more descriptive]
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
> 
> So I promised to get a series that builds on top of this ready, but
> I'm way to busy and this will take a while.  Getting this reworked
> version of Jan's patch out for everyone to use it as a based given
> that Christian is back from vacation, and I think Jan should be about
> back now as well.
I'm in the middle of reviewing this. You're probably aware, but both
btrfs and nilfs at least still open the devices first since they
open-code their bdev and sb handling.
^ permalink raw reply	[flat|nested] 9+ messages in thread 
- * Re: [PATCH] fs: open the block device after allocation the super_block
  2023-07-25 12:35 ` Christian Brauner
@ 2023-07-25 16:32   ` Christian Brauner
  2023-07-26 12:51     ` Christoph Hellwig
  0 siblings, 1 reply; 9+ messages in thread
From: Christian Brauner @ 2023-07-25 16:32 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: viro, jack, linux-fsdevel
On Tue, Jul 25, 2023 at 02:35:17PM +0200, Christian Brauner wrote:
> On Mon, Jul 24, 2023 at 10:51:45AM -0700, Christoph Hellwig wrote:
> > From: Jan Kara <jack@suse.cz>
> > 
> > Currently get_tree_bdev and mount_bdev open the block device before
> > commiting to allocating a super block.  This means the block device
> > is opened even for bind mounts and other reuses of the super_block.
> > 
> > That creates problems for restricting the number of writers to a device,
> > and also leads to a unusual and not very helpful holder (the fs_type).
> > 
> > Reorganize the mount code to first look whether the superblock for a
> > particular device is already mounted and open the block device only if
> > it is not.
> > 
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > [hch: port to before the bdev_handle changes,
> >       duplicate the bdev read-only check from blkdev_get_by_path,
> >       extend the fsfree_mutex coverage to protect against freezes,
> >       fix an open bdev leak when the bdev is frozen,
> >       use the bdev local variable more,
> >       rename the s variable to sb to be more descriptive]
> > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > ---
> > 
> > So I promised to get a series that builds on top of this ready, but
> > I'm way to busy and this will take a while.  Getting this reworked
> > version of Jan's patch out for everyone to use it as a based given
> > that Christian is back from vacation, and I think Jan should be about
> > back now as well.
> 
> I'm in the middle of reviewing this. You're probably aware, but both
> btrfs and nilfs at least still open the devices first since they
> open-code their bdev and sb handling.
I've removed the references to bind mounts from the commit message.
I mentioned in [1] and [2] that this problem is really related to
superblocks at it's core. It's just that technically a bind-mount would
be created in the following scenario where two processes race to create
a superblock:
P1								P2
fd_fs = fsopen("ext4");						fd_fs = fsopen("ext4");
fsconfig(fd_fs, FSCONFIG_SET_STRING, "source", "/dev/sda");	fsconfig(fd_fs, FSCONFIG_SET_STRING, "source", "/dev/sda");
// wins and creates superblock
fsconfig(fd_fs, FSCONFIG_CMD_CREATE, ...)
								// finds compatible superblock of P1
								// spins until P1 sets SB_BORN and grabs a reference
								fsconfig(fd_fs, FSCONFIG_CMD_CREATE, ...)
P1								P2
fd_mnt1 = fsmount(fd_fs);					fd_mnt2 = fsmount(fd_fs);
move_mount(fd_mnt1, "/A")                                       move_mount(fd_mnt2, "/B")
If we forbid writes to a mounted block device like Jan's other patch did
then before your's and Jan's patch P2 would fail at FSCONFIG_CMD_CREATE
iirc.
But bind-mounting itself isn't really broken. For example, even after P2
failed to create the superblock it could very well still do:
mount --bind /A /C
mount --bind /A /D
or whatever as that never goes into get_tree again. The problem really
is that stuff like:
mount -t ext4 /dev/sda /E
would be broken but the problem is that this request is arguably
ambiguous when seen from userspace. It either means:
(1) create a superblock and mount it at /E
(2) create a bind-mount for an existing superblock at /E
For P1 (1) is the case but for P2 (2) is the case.
Most of the time users really want (1). Right now, we have no way
for a user to be sure that (1) did take place aka that they did indeed
create the superblock. That can be a problem in environments where you
need to be sure that you did create the superblock with the correct
options. Because for P2 all mount options that they set may well be
completely ignored unless e.g., P1 did request rw and P2 did request ro.
This is why I'd like to add something like FSCONFIG_CMD_CREATE_EXCL
which would fail if the caller didn't actually create the superblock but
found an existing one instead.
[1]: https://lore.kernel.org/linux-block/20230704-fasching-wertarbeit-7c6ffb01c83d@brauner
[2]: https://lore.kernel.org/linux-block/20230705-pumpwerk-vielversprechend-a4b1fd947b65@brauner
^ permalink raw reply	[flat|nested] 9+ messages in thread
- * Re: [PATCH] fs: open the block device after allocation the super_block
  2023-07-25 16:32   ` Christian Brauner
@ 2023-07-26 12:51     ` Christoph Hellwig
  2023-07-26 12:57       ` Christian Brauner
  0 siblings, 1 reply; 9+ messages in thread
From: Christoph Hellwig @ 2023-07-26 12:51 UTC (permalink / raw)
  To: Christian Brauner; +Cc: Christoph Hellwig, viro, jack, linux-fsdevel
On Tue, Jul 25, 2023 at 06:32:05PM +0200, Christian Brauner wrote:
> I've removed the references to bind mounts from the commit message.
> I mentioned in [1] and [2] that this problem is really related to
> superblocks at it's core. It's just that technically a bind-mount would
> be created in the following scenario where two processes race to create
> a superblock:
I wanted to keep some of Jan's original logic.  In the end a bind mount
is just one of many reuses of a super block so I think your updated
log is fine.
Btw, it might make sense to place this on a separate branch, and Jan's
block work will have to pull it in, and it might be good to not
require the entire vfs misc tree to be pult in.
^ permalink raw reply	[flat|nested] 9+ messages in thread 
- * Re: [PATCH] fs: open the block device after allocation the super_block
  2023-07-26 12:51     ` Christoph Hellwig
@ 2023-07-26 12:57       ` Christian Brauner
  0 siblings, 0 replies; 9+ messages in thread
From: Christian Brauner @ 2023-07-26 12:57 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: viro, jack, linux-fsdevel
On Wed, Jul 26, 2023 at 02:51:06PM +0200, Christoph Hellwig wrote:
> On Tue, Jul 25, 2023 at 06:32:05PM +0200, Christian Brauner wrote:
> > I've removed the references to bind mounts from the commit message.
> > I mentioned in [1] and [2] that this problem is really related to
> > superblocks at it's core. It's just that technically a bind-mount would
> > be created in the following scenario where two processes race to create
> > a superblock:
> 
> I wanted to keep some of Jan's original logic.  In the end a bind mount
> is just one of many reuses of a super block so I think your updated
> log is fine.
> 
> Btw, it might make sense to place this on a separate branch, and Jan's
> block work will have to pull it in, and it might be good to not
> require the entire vfs misc tree to be pult in.
Ok, now on the vfs.super branch.
^ permalink raw reply	[flat|nested] 9+ messages in thread 
 
 
 
- * Re: [PATCH] fs: open the block device after allocation the super_block
  2023-07-24 17:51 [PATCH] fs: open the block device after allocation the super_block Christoph Hellwig
  2023-07-25 12:35 ` Christian Brauner
@ 2023-07-25 15:53 ` Christian Brauner
  2023-08-15 14:43 ` Christian Brauner
  2 siblings, 0 replies; 9+ messages in thread
From: Christian Brauner @ 2023-07-25 15:53 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Christian Brauner, linux-fsdevel, viro, jack
On Mon, 24 Jul 2023 10:51:45 -0700, Christoph Hellwig wrote:
> Currently get_tree_bdev and mount_bdev open the block device before
> commiting to allocating a super block.  This means the block device
> is opened even for bind mounts and other reuses of the super_block.
> 
> That creates problems for restricting the number of writers to a device,
> and also leads to a unusual and not very helpful holder (the fs_type).
> 
> [...]
Applied to the vfs.misc branch of the vfs/vfs.git tree.
Patches in the vfs.misc branch should appear in linux-next soon.
Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.
It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.
Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.
tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.misc
[1/1] fs: open block device after superblock creation
      https://git.kernel.org/vfs/vfs/c/787388e88395
^ permalink raw reply	[flat|nested] 9+ messages in thread
- * Re: [PATCH] fs: open the block device after allocation the super_block
  2023-07-24 17:51 [PATCH] fs: open the block device after allocation the super_block Christoph Hellwig
  2023-07-25 12:35 ` Christian Brauner
  2023-07-25 15:53 ` Christian Brauner
@ 2023-08-15 14:43 ` Christian Brauner
  2023-08-16  7:29   ` Christian Brauner
  2 siblings, 1 reply; 9+ messages in thread
From: Christian Brauner @ 2023-08-15 14:43 UTC (permalink / raw)
  To: Christoph Hellwig, jack; +Cc: viro, linux-fsdevel
>  		up_write(&s->s_umount);
> -		blkdev_put(bdev, fs_type);
> +		error = setup_bdev_super(s, flags, NULL);
>  		down_write(&s->s_umount);
So I've been looking through the branches to see what's ready for v6.6
and what needs some more time. While doing so I went over this again and
realized that we have an issue here.
While it looks like dropping s_umount here and calling
setup_bdev_super() is fine I think it isn't. Consider two processes
racing to create the same mount:
P1                                                                    P2
vfs_get_tree()                                                        vfs_get_tree()
-> get_tree() == get_tree_bdev()                                      -> get_tree() == get_tree_bdev()
   -> sget_fc()                                                          -> sget_fc()
        // allocate new sb; no matching sb found
      -> sb_p1 = alloc_super()
      -> hlist_add_head(&s->s_instances, &s->s_type->fs_supers)
      -> spin_unlock(&sb_lock)                                              
      // yield s_umount to avoid deadlocks
   -> up_write(&sb->s_umount)
                                                                            -> spin_lock(&sb_lock)
                                                                               // find sb_p1
                                                                               if (test(old, fc))
                                                                                       goto share_extant_sb;
      // Assume P1 sleeps on bdev_lock or open_mutex
      // in blkdev_get_by_dev().
   -> setup_bdev_super()
   -> down_write(&sb->s_umount)
Now P2 jumps to the share_extant_sb label and calls:
grab_super(sb_p1)
-> spin_unlock(&sb_lock)
-> down_write(&s->s_umount)
Since s_umount is unlocked P2 doesn't go to sleep and instead immediately
goes to retry by jumping to the "retry" label. If P1 is still sleeping
on a a bdev mutex the same thing happens again.
So if you have a range of processes P{1,n} that all try to mount the
same device you're hammering endlessly on sb_lock without ever going to
sleep like we used to. The same problem exists for all iterate_supers()
and user_get_dev() variants afaict. So that needs fixing unless I'm
wrong. grab_super() and other variants need to sleep until
setup_bdev_super() is done and not busy loop. Am I wrong here?
^ permalink raw reply	[flat|nested] 9+ messages in thread
- * Re: [PATCH] fs: open the block device after allocation the super_block
  2023-08-15 14:43 ` Christian Brauner
@ 2023-08-16  7:29   ` Christian Brauner
  2023-08-16 21:39     ` Jan Kara
  0 siblings, 1 reply; 9+ messages in thread
From: Christian Brauner @ 2023-08-16  7:29 UTC (permalink / raw)
  To: Christoph Hellwig, jack; +Cc: viro, linux-fsdevel
On Tue, Aug 15, 2023 at 04:43:12PM +0200, Christian Brauner wrote:
> >  		up_write(&s->s_umount);
> > -		blkdev_put(bdev, fs_type);
> > +		error = setup_bdev_super(s, flags, NULL);
> >  		down_write(&s->s_umount);
> 
> So I've been looking through the branches to see what's ready for v6.6
> and what needs some more time. While doing so I went over this again and
> realized that we have an issue here.
> 
> While it looks like dropping s_umount here and calling
> setup_bdev_super() is fine I think it isn't. Consider two processes
> racing to create the same mount:
> 
> P1                                                                    P2
> vfs_get_tree()                                                        vfs_get_tree()
> -> get_tree() == get_tree_bdev()                                      -> get_tree() == get_tree_bdev()
>    -> sget_fc()                                                          -> sget_fc()
>         // allocate new sb; no matching sb found
>       -> sb_p1 = alloc_super()
>       -> hlist_add_head(&s->s_instances, &s->s_type->fs_supers)
>       -> spin_unlock(&sb_lock)                                              
>       // yield s_umount to avoid deadlocks
>    -> up_write(&sb->s_umount)
>                                                                             -> spin_lock(&sb_lock)
>                                                                                // find sb_p1
>                                                                                if (test(old, fc))
>                                                                                        goto share_extant_sb;
>       // Assume P1 sleeps on bdev_lock or open_mutex
>       // in blkdev_get_by_dev().
>    -> setup_bdev_super()
>    -> down_write(&sb->s_umount)
> 
> Now P2 jumps to the share_extant_sb label and calls:
> 
> grab_super(sb_p1)
> -> spin_unlock(&sb_lock)
> -> down_write(&s->s_umount)
> 
> Since s_umount is unlocked P2 doesn't go to sleep and instead immediately
> goes to retry by jumping to the "retry" label. If P1 is still sleeping
> on a a bdev mutex the same thing happens again.
> 
> So if you have a range of processes P{1,n} that all try to mount the
> same device you're hammering endlessly on sb_lock without ever going to
> sleep like we used to. The same problem exists for all iterate_supers()
That part is wrong. If you have P{1,n} and P1 takes s_umount exclusively
then P{2,n} will sleep on s_umount until P1 is done. But there's still
at least on process spinning through sget_fc() for no good reason.
^ permalink raw reply	[flat|nested] 9+ messages in thread
- * Re: [PATCH] fs: open the block device after allocation the super_block
  2023-08-16  7:29   ` Christian Brauner
@ 2023-08-16 21:39     ` Jan Kara
  0 siblings, 0 replies; 9+ messages in thread
From: Jan Kara @ 2023-08-16 21:39 UTC (permalink / raw)
  To: Christian Brauner; +Cc: Christoph Hellwig, jack, viro, linux-fsdevel
On Wed 16-08-23 09:29:08, Christian Brauner wrote:
> On Tue, Aug 15, 2023 at 04:43:12PM +0200, Christian Brauner wrote:
> > >  		up_write(&s->s_umount);
> > > -		blkdev_put(bdev, fs_type);
> > > +		error = setup_bdev_super(s, flags, NULL);
> > >  		down_write(&s->s_umount);
> > 
> > So I've been looking through the branches to see what's ready for v6.6
> > and what needs some more time. While doing so I went over this again and
> > realized that we have an issue here.
> > 
> > While it looks like dropping s_umount here and calling
> > setup_bdev_super() is fine I think it isn't. Consider two processes
> > racing to create the same mount:
> > 
> > P1                                                                    P2
> > vfs_get_tree()                                                        vfs_get_tree()
> > -> get_tree() == get_tree_bdev()                                      -> get_tree() == get_tree_bdev()
> >    -> sget_fc()                                                          -> sget_fc()
> >         // allocate new sb; no matching sb found
> >       -> sb_p1 = alloc_super()
> >       -> hlist_add_head(&s->s_instances, &s->s_type->fs_supers)
> >       -> spin_unlock(&sb_lock)                                              
> >       // yield s_umount to avoid deadlocks
> >    -> up_write(&sb->s_umount)
> >                                                                             -> spin_lock(&sb_lock)
> >                                                                                // find sb_p1
> >                                                                                if (test(old, fc))
> >                                                                                        goto share_extant_sb;
> >       // Assume P1 sleeps on bdev_lock or open_mutex
> >       // in blkdev_get_by_dev().
> >    -> setup_bdev_super()
> >    -> down_write(&sb->s_umount)
> > 
> > Now P2 jumps to the share_extant_sb label and calls:
> > 
> > grab_super(sb_p1)
> > -> spin_unlock(&sb_lock)
> > -> down_write(&s->s_umount)
> > 
> > Since s_umount is unlocked P2 doesn't go to sleep and instead immediately
> > goes to retry by jumping to the "retry" label. If P1 is still sleeping
> > on a a bdev mutex the same thing happens again.
> > 
> > So if you have a range of processes P{1,n} that all try to mount the
> > same device you're hammering endlessly on sb_lock without ever going to
> > sleep like we used to. The same problem exists for all iterate_supers()
> 
> That part is wrong. If you have P{1,n} and P1 takes s_umount exclusively
> then P{2,n} will sleep on s_umount until P1 is done. But there's still
> at least on process spinning through sget_fc() for no good reason.
No, you're right that the second process is going to effectively busyloop
waiting for SB_BORN to be set. I agree we should add some sleeping wait to
the loop to avoid pointlessly burning CPU cycles. I'll look into some
elegant solution tomorrow.
								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply	[flat|nested] 9+ messages in thread
 
 
end of thread, other threads:[~2023-08-16 21:40 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-07-24 17:51 [PATCH] fs: open the block device after allocation the super_block Christoph Hellwig
2023-07-25 12:35 ` Christian Brauner
2023-07-25 16:32   ` Christian Brauner
2023-07-26 12:51     ` Christoph Hellwig
2023-07-26 12:57       ` Christian Brauner
2023-07-25 15:53 ` Christian Brauner
2023-08-15 14:43 ` Christian Brauner
2023-08-16  7:29   ` Christian Brauner
2023-08-16 21:39     ` Jan Kara
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).