* [RFC[RAP] 1/2] block: fix race between set_blocksize and read paths
@ 2025-04-15 0:14 Darrick J. Wong
2025-04-15 0:33 ` [RF[CRAP] 2/2] xfs: stop using set_blocksize Darrick J. Wong
2025-04-16 4:41 ` [RFC[RAP] 1/2] block: fix race between set_blocksize and read paths Christoph Hellwig
From: Darrick J. Wong @ 2025-04-15 0:14 UTC (permalink / raw)
To: axboe
Cc: Christoph Hellwig, Luis Chamberlain, Matthew Wilcox, linux-block,
linux-fsdevel, xfs, Jack Vogel
From: Darrick J. Wong <djwong@kernel.org>
With the new large sector size support, it's now the case that
set_blocksize can change i_blksize and the folio order in a manner that
conflicts with a concurrent reader and causes a kernel crash.
Specifically, let's say that udev-worker calls libblkid to detect the
labels on a block device. The read call can create an order-0 folio to
read the first 4096 bytes from the disk. But then udev is preempted.
Next, someone tries to mount an 8k-sectorsize filesystem from the same
block device. The filesystem calls set_blksize, which sets i_blksize to
8192 and the minimum folio order to 1.
Now udev resumes, still holding the order-0 folio it allocated. It then
tries to schedule a read bio and do_mpage_readahead tries to create
bufferheads for the folio. Unfortunately, blocks_per_folio == 0 because
the page size is 4096 but the blocksize is 8192 so no bufferheads are
attached and the bh walk never sets bdev. We then submit the bio with a
NULL block device and crash.
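For illustration only (a rough sketch of the arithmetic, not the actual mpage
code), the per-folio block count computed against the stale order-0 folio works
out to zero:

	/* order-0 folio allocated before the blocksize change */
	unsigned int blkbits = blksize_bits(8192);		/* 13 */
	unsigned int blocks_per_folio = 4096 >> blkbits;	/* 4096 >> 13 == 0 */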
Therefore, truncate the page cache after flushing but before updating
i_blksize. However, that's not enough -- we also need to lock out file
IO and page faults during the update. Take both the i_rwsem and the
invalidate_lock in exclusive mode for invalidations, and in shared mode
for read/write operations.
I don't know if this is the correct fix, but xfs/259 found it.
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
block/bdev.c | 12 ++++++++++++
block/blk-zoned.c | 5 ++++-
block/fops.c | 7 +++++++
block/ioctl.c | 6 ++++++
4 files changed, 29 insertions(+), 1 deletion(-)
diff --git a/block/bdev.c b/block/bdev.c
index 7b4e35a661b0c9..0cbdac46d98d86 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -169,11 +169,23 @@ int set_blocksize(struct file *file, int size)
/* Don't change the size if it is same as current */
if (inode->i_blkbits != blksize_bits(size)) {
+ /* Prevent concurrent IO operations */
+ inode_lock(inode);
+ filemap_invalidate_lock(inode->i_mapping);
+
+ /*
+ * Flush and truncate the pagecache before we reconfigure the
+ * mapping geometry because folio sizes are variable now.
+ */
sync_blockdev(bdev);
+ kill_bdev(bdev);
+
inode->i_blkbits = blksize_bits(size);
mapping_set_folio_order_range(inode->i_mapping,
get_order(size), get_order(size));
kill_bdev(bdev);
+ filemap_invalidate_unlock(inode->i_mapping);
+ inode_unlock(inode);
}
return 0;
}
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 0c77244a35c92e..8f15d1aa6eb89a 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -343,6 +343,7 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, blk_mode_t mode,
op = REQ_OP_ZONE_RESET;
/* Invalidate the page cache, including dirty pages. */
+ inode_lock(bdev->bd_mapping->host);
filemap_invalidate_lock(bdev->bd_mapping);
ret = blkdev_truncate_zone_range(bdev, mode, &zrange);
if (ret)
@@ -364,8 +365,10 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, blk_mode_t mode,
ret = blkdev_zone_mgmt(bdev, op, zrange.sector, zrange.nr_sectors);
fail:
- if (cmd == BLKRESETZONE)
+ if (cmd == BLKRESETZONE) {
filemap_invalidate_unlock(bdev->bd_mapping);
+ inode_unlock(bdev->bd_mapping->host);
+ }
return ret;
}
diff --git a/block/fops.c b/block/fops.c
index be9f1dbea9ce0a..f46ae08fac33dd 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -746,7 +746,9 @@ static ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
ret = direct_write_fallback(iocb, from, ret,
blkdev_buffered_write(iocb, from));
} else {
+ inode_lock_shared(bd_inode);
ret = blkdev_buffered_write(iocb, from);
+ inode_unlock_shared(bd_inode);
}
if (ret > 0)
@@ -757,6 +759,7 @@ static ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
+ struct inode *bd_inode = bdev_file_inode(iocb->ki_filp);
struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host);
loff_t size = bdev_nr_bytes(bdev);
loff_t pos = iocb->ki_pos;
@@ -793,7 +796,9 @@ static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to)
goto reexpand;
}
+ inode_lock_shared(bd_inode);
ret = filemap_read(iocb, to, ret);
+ inode_unlock_shared(bd_inode);
reexpand:
if (unlikely(shorted))
@@ -836,6 +841,7 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
if ((start | len) & (bdev_logical_block_size(bdev) - 1))
return -EINVAL;
+ inode_lock(inode);
filemap_invalidate_lock(inode->i_mapping);
/*
@@ -868,6 +874,7 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
fail:
filemap_invalidate_unlock(inode->i_mapping);
+ inode_unlock(inode);
return error;
}
diff --git a/block/ioctl.c b/block/ioctl.c
index faa40f383e2736..e472cc1030c60c 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -142,6 +142,7 @@ static int blk_ioctl_discard(struct block_device *bdev, blk_mode_t mode,
if (err)
return err;
+ inode_lock(bdev->bd_mapping->host);
filemap_invalidate_lock(bdev->bd_mapping);
err = truncate_bdev_range(bdev, mode, start, start + len - 1);
if (err)
@@ -174,6 +175,7 @@ static int blk_ioctl_discard(struct block_device *bdev, blk_mode_t mode,
blk_finish_plug(&plug);
fail:
filemap_invalidate_unlock(bdev->bd_mapping);
+ inode_unlock(bdev->bd_mapping->host);
return err;
}
@@ -199,12 +201,14 @@ static int blk_ioctl_secure_erase(struct block_device *bdev, blk_mode_t mode,
end > bdev_nr_bytes(bdev))
return -EINVAL;
+ inode_lock(bdev->bd_mapping->host);
filemap_invalidate_lock(bdev->bd_mapping);
err = truncate_bdev_range(bdev, mode, start, end - 1);
if (!err)
err = blkdev_issue_secure_erase(bdev, start >> 9, len >> 9,
GFP_KERNEL);
filemap_invalidate_unlock(bdev->bd_mapping);
+ inode_unlock(bdev->bd_mapping->host);
return err;
}
@@ -236,6 +240,7 @@ static int blk_ioctl_zeroout(struct block_device *bdev, blk_mode_t mode,
return -EINVAL;
/* Invalidate the page cache, including dirty pages */
+ inode_lock(bdev->bd_mapping->host);
filemap_invalidate_lock(bdev->bd_mapping);
err = truncate_bdev_range(bdev, mode, start, end);
if (err)
@@ -246,6 +251,7 @@ static int blk_ioctl_zeroout(struct block_device *bdev, blk_mode_t mode,
fail:
filemap_invalidate_unlock(bdev->bd_mapping);
+ inode_unlock(bdev->bd_mapping->host);
return err;
}
* [RF[CRAP] 2/2] xfs: stop using set_blocksize
2025-04-15 0:14 [RFC[RAP] 1/2] block: fix race between set_blocksize and read paths Darrick J. Wong
@ 2025-04-15 0:33 ` Darrick J. Wong
2025-04-16 4:46 ` Christoph Hellwig
2025-04-16 4:41 ` [RFC[RAP] 1/2] block: fix race between set_blocksize and read paths Christoph Hellwig
From: Darrick J. Wong @ 2025-04-15 0:33 UTC (permalink / raw)
To: Carlos Maiolino
Cc: Christoph Hellwig, axboe, Luis Chamberlain, Matthew Wilcox,
linux-block, linux-fsdevel, xfs, Jack Vogel
From: Darrick J. Wong <djwong@kernel.org>
XFS has its own buffer cache for metadata that uses submit_bio, which
means that it no longer uses the block device pagecache for anything.
Create a more lightweight helper that runs the blocksize checks and
flushes dirty data and use that instead. No more truncating the
pagecache because why would XFS care? ;)
Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>
---
include/linux/blkdev.h | 1 +
block/bdev.c | 23 +++++++++++++++++++++++
fs/xfs/xfs_buf.c | 9 ++++++---
3 files changed, 30 insertions(+), 3 deletions(-)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index f442639dfae224..ae83dd12351c2e 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1618,6 +1618,7 @@ static inline void bio_end_io_acct(struct bio *bio, unsigned long start_time)
return bio_end_io_acct_remapped(bio, start_time, bio->bi_bdev);
}
+int bdev_use_blocksize(struct file *file, int size);
int set_blocksize(struct file *file, int size);
int lookup_bdev(const char *pathname, dev_t *dev);
diff --git a/block/bdev.c b/block/bdev.c
index 0cbdac46d98d86..201d61d743592e 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -152,6 +152,29 @@ static void set_init_blocksize(struct block_device *bdev)
get_order(bsize), get_order(bsize));
}
+/*
+ * For bdev filesystems that do not use buffer heads, check that this block
+ * size is acceptable and flush dirty pagecache to disk.
+ */
+int bdev_use_blocksize(struct file *file, int size)
+{
+ struct inode *inode = file->f_mapping->host;
+ struct block_device *bdev = I_BDEV(inode);
+
+ if (blk_validate_block_size(size))
+ return -EINVAL;
+
+ /* Size cannot be smaller than the size supported by the device */
+ if (size < bdev_logical_block_size(bdev))
+ return -EINVAL;
+
+ if (!file->private_data)
+ return -EINVAL;
+
+ return sync_blockdev(bdev);
+}
+EXPORT_SYMBOL_GPL(bdev_use_blocksize);
+
int set_blocksize(struct file *file, int size)
{
struct inode *inode = file->f_mapping->host;
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 8e7f1b324b3bea..2c8531103c01bb 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1718,14 +1718,17 @@ xfs_setsize_buftarg(
struct xfs_buftarg *btp,
unsigned int sectorsize)
{
+ int error;
+
/* Set up metadata sector size info */
btp->bt_meta_sectorsize = sectorsize;
btp->bt_meta_sectormask = sectorsize - 1;
- if (set_blocksize(btp->bt_bdev_file, sectorsize)) {
+ error = bdev_use_blocksize(btp->bt_bdev_file, sectorsize);
+ if (error) {
xfs_warn(btp->bt_mount,
- "Cannot set_blocksize to %u on device %pg",
- sectorsize, btp->bt_bdev);
+ "Cannot use blocksize %u on device %pg, err %d",
+ sectorsize, btp->bt_bdev, error);
return -EINVAL;
}
* Re: [RFC[RAP] 1/2] block: fix race between set_blocksize and read paths
2025-04-15 0:14 [RFC[RAP] 1/2] block: fix race between set_blocksize and read paths Darrick J. Wong
2025-04-15 0:33 ` [RF[CRAP] 2/2] xfs: stop using set_blocksize Darrick J. Wong
@ 2025-04-16 4:41 ` Christoph Hellwig
2025-04-16 5:01 ` Darrick J. Wong
From: Christoph Hellwig @ 2025-04-16 4:41 UTC (permalink / raw)
To: Darrick J. Wong
Cc: axboe, Christoph Hellwig, Luis Chamberlain, Matthew Wilcox,
linux-block, linux-fsdevel, xfs, Jack Vogel
On Mon, Apr 14, 2025 at 05:14:05PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> With the new large sector size support, it's now the case that
> set_blocksize can change i_blksize and the folio order in a manner that
> conflicts with a concurrent reader and causes a kernel crash.
>
> Specifically, let's say that udev-worker calls libblkid to detect the
> labels on a block device. The read call can create an order-0 folio to
> read the first 4096 bytes from the disk. But then udev is preempted.
>
> Next, someone tries to mount an 8k-sectorsize filesystem from the same
> block device. The filesystem calls set_blksize, which sets i_blksize to
> 8192 and the minimum folio order to 1.
>
> Now udev resumes, still holding the order-0 folio it allocated. It then
> tries to schedule a read bio and do_mpage_readahead tries to create
> bufferheads for the folio. Unfortunately, blocks_per_folio == 0 because
> the page size is 4096 but the blocksize is 8192 so no bufferheads are
> attached and the bh walk never sets bdev. We then submit the bio with a
> NULL block device and crash.
>
Do we have a testcase for blktests or xfstests for this? The issue is
subtle and some of the code in the patch looks easy to accidentally
break again (not the fault of this patch primarily).
> } else {
> + inode_lock_shared(bd_inode);
> ret = blkdev_buffered_write(iocb, from);
> + inode_unlock_shared(bd_inode);
Does this need a comment why we take i_rwsem?
> + inode_lock_shared(bd_inode);
> ret = filemap_read(iocb, to, ret);
> + inode_unlock_shared(bd_inode);
Same here. Especially as the protection is now heavier than for most
file systems.
I also wonder if we need locking asserts in some of the write side
functions that expect the shared inode lock and invalidate lock now?
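Something along the lines of this hypothetical sketch near the top of the
buffered write/zeroing helpers (assuming the locking contract this patch
introduces; whether the locks are still visible at that level is exactly the
question):

	/* hypothetical: assert the shared i_rwsem the patch now takes */
	lockdep_assert_held_read(&bd_inode->i_rwsem);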
* Re: [RF[CRAP] 2/2] xfs: stop using set_blocksize
2025-04-15 0:33 ` [RF[CRAP] 2/2] xfs: stop using set_blocksize Darrick J. Wong
@ 2025-04-16 4:46 ` Christoph Hellwig
2025-04-16 5:06 ` Darrick J. Wong
From: Christoph Hellwig @ 2025-04-16 4:46 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Carlos Maiolino, Christoph Hellwig, axboe, Luis Chamberlain,
Matthew Wilcox, linux-block, linux-fsdevel, xfs, Jack Vogel
On Mon, Apr 14, 2025 at 05:33:08PM -0700, Darrick J. Wong wrote:
> +/*
> + * For bdev filesystems that do not use buffer heads, check that this block
> + * size is acceptable and flush dirty pagecache to disk.
> + */
Can you turn this into a full fledged kerneldoc comment?
> +int bdev_use_blocksize(struct file *file, int size)
> +{
> + struct inode *inode = file->f_mapping->host;
> + struct block_device *bdev = I_BDEV(inode);
> +
> + if (blk_validate_block_size(size))
> + return -EINVAL;
> +
> + /* Size cannot be smaller than the size supported by the device */
> + if (size < bdev_logical_block_size(bdev))
> + return -EINVAL;
> +
> + if (!file->private_data)
> + return -EINVAL;
This private_data check looks really confusing. Looking it up I see
that it is directly copied from set_blocksize, but it could really
use a comment. Or in fact be removed here and kept in set_blocksize
only as we don't care about an exclusive opener at all. Even there
a comment would be rather helpful, though.
> +
> + return sync_blockdev(bdev);
> +}
I don't think we need sync_blockdev here as we don't touch the
bdev page cache. Maybe XFS wants to still call it, but it feels
wrong in a helper just validating the block size.
So maybe drop it, rename the helper to bdev_validate_block_size
and use it in set_blocksize instead of duplicating the logic?
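I.e. roughly something like this untested sketch (the name and the bdev-based
signature are just my suggestion, not existing code):

int bdev_validate_block_size(struct block_device *bdev, int size)
{
	if (blk_validate_block_size(size))
		return -EINVAL;

	/* Size cannot be smaller than the size supported by the device */
	if (size < bdev_logical_block_size(bdev))
		return -EINVAL;

	return 0;
}

set_blocksize() could then call it and keep the private_data/exclusive-opener
check and the pagecache work to itself.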
> + error = bdev_use_blocksize(btp->bt_bdev_file, sectorsize);
.. and then split using it in XFS into a separate patch from adding
the block layer helper.
* Re: [RFC[RAP] 1/2] block: fix race between set_blocksize and read paths
2025-04-16 4:41 ` [RFC[RAP] 1/2] block: fix race between set_blocksize and read paths Christoph Hellwig
@ 2025-04-16 5:01 ` Darrick J. Wong
2025-04-16 5:14 ` Christoph Hellwig
From: Darrick J. Wong @ 2025-04-16 5:01 UTC (permalink / raw)
To: Christoph Hellwig
Cc: axboe, Luis Chamberlain, Matthew Wilcox, linux-block,
linux-fsdevel, xfs, Jack Vogel
On Tue, Apr 15, 2025 at 09:41:32PM -0700, Christoph Hellwig wrote:
> On Mon, Apr 14, 2025 at 05:14:05PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > With the new large sector size support, it's now the case that
> > set_blocksize can change i_blksize and the folio order in a manner that
> > conflicts with a concurrent reader and causes a kernel crash.
> >
> > Specifically, let's say that udev-worker calls libblkid to detect the
> > labels on a block device. The read call can create an order-0 folio to
> > read the first 4096 bytes from the disk. But then udev is preempted.
> >
> > Next, someone tries to mount an 8k-sectorsize filesystem from the same
> > block device. The filesystem calls set_blksize, which sets i_blksize to
> > 8192 and the minimum folio order to 1.
> >
> > Now udev resumes, still holding the order-0 folio it allocated. It then
> > tries to schedule a read bio and do_mpage_readahead tries to create
> > bufferheads for the folio. Unfortunately, blocks_per_folio == 0 because
> > the page size is 4096 but the blocksize is 8192 so no bufferheads are
> > attached and the bh walk never sets bdev. We then submit the bio with a
> > NULL block device and crash.
> >
>
> Do we have a testcase for blktests or xfstests for this? The issue is
> subtle and some of the code in the patch looks easy to accidentally
> break again (not the fault of this patch primarily).
It's the same patch as:
https://lore.kernel.org/linux-fsdevel/20250408175125.GL6266@frogsfrogsfrogs/
which is to say, xfs/032 with while true; do blkid; done running in the
background to increase the chances of a collision.
> > } else {
> > + inode_lock_shared(bd_inode);
> > ret = blkdev_buffered_write(iocb, from);
> > + inode_unlock_shared(bd_inode);
>
> Does this need a comment why we take i_rwsem?
>
> > + inode_lock_shared(bd_inode);
> > ret = filemap_read(iocb, to, ret);
> > + inode_unlock_shared(bd_inode);
>
> Same here. Especially as the protection is now heavier than for most
> file systems.
Yeah, somewhere we need a better comment. How about this for
set_blocksize:
/*
* Flush and truncate the pagecache before we reconfigure the
* mapping geometry because folio sizes are variable now. If
* a reader has already allocated a folio whose size is smaller
* than the new min_order but invokes readahead after the new
* min_order becomes visible, readahead will think there are
* "zero" blocks per folio and crash.
*/
And then the read/write paths can say something simpler:
/*
* Take i_rwsem and invalidate_lock to avoid racing with a
* blocksize change punching out the pagecache.
*/
> I also wonder if we need locking asserts in some of the write side
> functions that expect the shared inode lock and invalidate lock now?
Probably. Do you have specific places in mind?
--D
* Re: [RF[CRAP] 2/2] xfs: stop using set_blocksize
2025-04-16 4:46 ` Christoph Hellwig
@ 2025-04-16 5:06 ` Darrick J. Wong
From: Darrick J. Wong @ 2025-04-16 5:06 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Carlos Maiolino, axboe, Luis Chamberlain, Matthew Wilcox,
linux-block, linux-fsdevel, xfs, Jack Vogel
On Tue, Apr 15, 2025 at 09:46:09PM -0700, Christoph Hellwig wrote:
> On Mon, Apr 14, 2025 at 05:33:08PM -0700, Darrick J. Wong wrote:
> > +/*
> > + * For bdev filesystems that do not use buffer heads, check that this block
> > + * size is acceptable and flush dirty pagecache to disk.
> > + */
>
> Can you turn this into a full fledged kerneldoc comment?
Ok.
> > +int bdev_use_blocksize(struct file *file, int size)
> > +{
> > + struct inode *inode = file->f_mapping->host;
> > + struct block_device *bdev = I_BDEV(inode);
> > +
> > + if (blk_validate_block_size(size))
> > + return -EINVAL;
> > +
> > + /* Size cannot be smaller than the size supported by the device */
> > + if (size < bdev_logical_block_size(bdev))
> > + return -EINVAL;
> > +
> > + if (!file->private_data)
> > + return -EINVAL;
>
> This private_data check looks really confusing. Looking it up I see
> that it is directly copied from set_blocksize, but it could really
> use a comment. Or in fact be removed here and kept in set_blocksize
> only as we don't care about an exclusive opener at all. Even there
> a comment would be rather helpful, though.
When is it even NULL? I thought it would be either the holder or the bdev_inode
if not.
> > +
> > + return sync_blockdev(bdev);
> > +}
>
> I don't think we need sync_blockdev here as we don't touch the
> bdev page cache. Maybe XFS wants to still call it, but it feels
> wrong in a helper just validating the block size.
>
> So maybe drop it, rename the helper to bdev_validate_block_size
> and use it in set_blocksize instead of duplicating the logic?
Ok. bdev_validate_block_size is a much better name for a tighter
function...
> > + error = bdev_use_blocksize(btp->bt_bdev_file, sectorsize);
>
> .. and then split using it in XFS into a separate patch from adding
> the block layer helper.
...and xfs can call sync_blockdev directly from xfs_setsize_buftarg.
I imagine we still want any dirty pagecache to get flushed before we
start submitting our own read bios.
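Something like this sketch, I think (untested; assumes the
bdev_validate_block_size rename suggested above):

	/* Set up metadata sector size info */
	btp->bt_meta_sectorsize = sectorsize;
	btp->bt_meta_sectormask = sectorsize - 1;

	error = bdev_validate_block_size(btp->bt_bdev, sectorsize);
	if (!error)
		error = sync_blockdev(btp->bt_bdev);
	if (error) {
		xfs_warn(btp->bt_mount,
	"Cannot use blocksize %u on device %pg, err %d",
			sectorsize, btp->bt_bdev, error);
		return -EINVAL;
	}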
--D
* Re: [RFC[RAP] 1/2] block: fix race between set_blocksize and read paths
2025-04-16 5:01 ` Darrick J. Wong
@ 2025-04-16 5:14 ` Christoph Hellwig
2025-04-18 7:51 ` Shinichiro Kawasaki
From: Christoph Hellwig @ 2025-04-16 5:14 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Christoph Hellwig, axboe, Luis Chamberlain, Matthew Wilcox,
linux-block, linux-fsdevel, xfs, Jack Vogel
On Tue, Apr 15, 2025 at 10:01:44PM -0700, Darrick J. Wong wrote:
> It's the same patch as:
> https://lore.kernel.org/linux-fsdevel/20250408175125.GL6266@frogsfrogsfrogs/
>
> which is to say, xfs/032 with while true; do blkid; done running in the
> background to increase the chances of a collision.
I think the xfs-zoned CI actually hit this with xfs/032 without any extra
action.
> /*
> * Flush and truncate the pagecache before we reconfigure the
> * mapping geometry because folio sizes are variable now. If
> * a reader has already allocated a folio whose size is smaller
> * than the new min_order but invokes readahead after the new
> * min_order becomes visible, readahead will think there are
> * "zero" blocks per folio and crash.
> */
>
> And then the read/write paths can say something simpler:
>
> /*
> * Take i_rwsem and invalidate_lock to avoid racing with a
> * blocksize change punching out the pagecache.
> */
Sounds reasonable.
> > I also wonder if we need locking asserts in some of the write side
> > functions that expect the shared inode lock and invalidate lock now?
>
> Probably. Do you have specific places in mind?
Looking at it more closely: no. We're only calling very low-level
helpers, so this might not actually be feasible.
* Re: [RFC[RAP] 1/2] block: fix race between set_blocksize and read paths
2025-04-16 5:14 ` Christoph Hellwig
@ 2025-04-18 7:51 ` Shinichiro Kawasaki
2025-04-18 15:29 ` Darrick J. Wong
From: Shinichiro Kawasaki @ 2025-04-18 7:51 UTC (permalink / raw)
To: hch@infradead.org
Cc: Darrick J. Wong, axboe@kernel.dk, Luis Chamberlain,
Matthew Wilcox, linux-block, linux-fsdevel, xfs, Jack Vogel
On Apr 15, 2025 / 22:14, Christoph Hellwig wrote:
> On Tue, Apr 15, 2025 at 10:01:44PM -0700, Darrick J. Wong wrote:
> > It's the same patch as:
> > https://lore.kernel.org/linux-fsdevel/20250408175125.GL6266@frogsfrogsfrogs/
> >
> > which is to say, xfs/032 with while true; do blkid; done running in the
> > background to increase the chances of a collision.
>
> I think the xfs-zoned CI actually hit this with xfs/032 without any extra
> action.
I observed xfs/032 hang using the kernel on the linux-xfs/for-next branch with
git hash 71700ac47ad8. Before the hang, the kernel reported the messages below:
Oops: general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
CPU: 21 UID: 0 PID: 3187783 Comm: (udev-worker) Not tainted 6.15.0-rc1-kts-xfs-g71700ac47ad+ #1 PREEMPT(lazy)
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014
RIP: 0010:guard_bio_eod+0x52/0x5b0
The failure was recreated in a stable manner. I applied this patch series and
confirmed that the failure disappears. Good. (I needed to resolve conflicts, though.)
This patch fixes the block layer, so IMO it's better to have a test case in
blktests to confirm the fix. I created a blktests test case which recreates the
failure using the blockdev and fio commands. I will post it soon.
* Re: [RFC[RAP] 1/2] block: fix race between set_blocksize and read paths
2025-04-18 7:51 ` Shinichiro Kawasaki
@ 2025-04-18 15:29 ` Darrick J. Wong
From: Darrick J. Wong @ 2025-04-18 15:29 UTC (permalink / raw)
To: Shinichiro Kawasaki
Cc: hch@infradead.org, axboe@kernel.dk, Luis Chamberlain,
Matthew Wilcox, linux-block, linux-fsdevel, xfs, Jack Vogel
On Fri, Apr 18, 2025 at 07:51:58AM +0000, Shinichiro Kawasaki wrote:
> On Apr 15, 2025 / 22:14, Christoph Hellwig wrote:
> > On Tue, Apr 15, 2025 at 10:01:44PM -0700, Darrick J. Wong wrote:
> > > It's the same patch as:
> > > https://lore.kernel.org/linux-fsdevel/20250408175125.GL6266@frogsfrogsfrogs/
> > >
> > > which is to say, xfs/032 with while true; do blkid; done running in the
> > > background to increase the chances of a collision.
> >
> > I think the xfs-zoned CI actually hit this with xfs/032 without any extra
> > action.
>
> I observed xfs/032 hang using the kernel on the linux-xfs/for-next branch with
> git hash 71700ac47ad8. Before the hang, the kernel reported the messages below:
>
> Oops: general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] SMP KASAN NOPTI
> KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
> CPU: 21 UID: 0 PID: 3187783 Comm: (udev-worker) Not tainted 6.15.0-rc1-kts-xfs-g71700ac47ad+ #1 PREEMPT(lazy)
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014
> RIP: 0010:guard_bio_eod+0x52/0x5b0
>
> The failure was recreated in a stable manner. I applied this patch series and
> confirmed that the failure disappears. Good. (I needed to resolve conflicts, though.)
>
> This patch fixes the block layer, so IMO it's better to have a test case in
> blktests to confirm the fix. I created a blktests test case which recreates the
> failure using the blockdev and fio commands. I will post it soon.
Ok. I'll post a non-rfcrap version of the series shortly. Thank you
for writing a regression test! :)
--D