Re: [PATCH 1/2] block: fix race between set_blocksize and read paths

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "Darrick J. Wong" <djwong@kernel.org>
To: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>,
	Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>,
	Luis Chamberlain <mcgrof@kernel.org>,
	Matthew Wilcox <willy@infradead.org>,
	linux-block <linux-block@vger.kernel.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	xfs <linux-xfs@vger.kernel.org>
Subject: Re: [PATCH 1/2] block: fix race between set_blocksize and read paths
Date: Fri, 18 Apr 2025 09:02:34 -0700	[thread overview]
Message-ID: <20250418160234.GT25675@frogsfrogsfrogs> (raw)
In-Reply-To: <20250418155458.GR25675@frogsfrogsfrogs>

On Fri, Apr 18, 2025 at 08:54:58AM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> With the new large sector size support, it's now the case that
> set_blocksize can change i_blksize and the folio order in a manner that
> conflicts with a concurrent reader and causes a kernel crash.
> 
> Specifically, let's say that udev-worker calls libblkid to detect the
> labels on a block device.  The read call can create an order-0 folio to
> read the first 4096 bytes from the disk.  But then udev is preempted.
> 
> Next, someone tries to mount an 8k-sectorsize filesystem from the same
> block device.  The filesystem calls set_blksize, which sets i_blksize to
> 8192 and the minimum folio order to 1.
> 
> Now udev resumes, still holding the order-0 folio it allocated.  It then
> tries to schedule a read bio and do_mpage_readahead tries to create
> bufferheads for the folio.  Unfortunately, blocks_per_folio == 0 because
> the page size is 4096 but the blocksize is 8192 so no bufferheads are
> attached and the bh walk never sets bdev.  We then submit the bio with a
> NULL block device and crash.
> 
> Therefore, truncate the page cache after flushing but before updating
> i_blksize.  However, that's not enough -- we also need to lock out file
> IO and page faults during the update.  Take both the i_rwsem and the
> invalidate_lock in exclusive mode for invalidations, and in shared mode
> for read/write operations.
> 
> I don't know if this is the correct fix, but xfs/259 found it.
> 
> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>

I think this could also have the tag:
Fixes: 3c20917120ce61 ("block/bdev: enable large folio support for large logical block sizes")

Not sure anyone cares about that for a fix for 6.15-rc1 though.

--D

> ---
>  block/bdev.c      |   17 +++++++++++++++++
>  block/blk-zoned.c |    5 ++++-
>  block/fops.c      |   16 ++++++++++++++++
>  block/ioctl.c     |    6 ++++++
>  4 files changed, 43 insertions(+), 1 deletion(-)
> 
> diff --git a/block/bdev.c b/block/bdev.c
> index 7b4e35a661b0c9..1313ad256593c5 100644
> --- a/block/bdev.c
> +++ b/block/bdev.c
> @@ -169,11 +169,28 @@ int set_blocksize(struct file *file, int size)
>  
>  	/* Don't change the size if it is same as current */
>  	if (inode->i_blkbits != blksize_bits(size)) {
> +		/*
> +		 * Flush and truncate the pagecache before we reconfigure the
> +		 * mapping geometry because folio sizes are variable now.  If a
> +		 * reader has already allocated a folio whose size is smaller
> +		 * than the new min_order but invokes readahead after the new
> +		 * min_order becomes visible, readahead will think there are
> +		 * "zero" blocks per folio and crash.  Take the inode and
> +		 * invalidation locks to avoid racing with
> +		 * read/write/fallocate.
> +		 */
> +		inode_lock(inode);
> +		filemap_invalidate_lock(inode->i_mapping);
> +
>  		sync_blockdev(bdev);
> +		kill_bdev(bdev);
> +
>  		inode->i_blkbits = blksize_bits(size);
>  		mapping_set_folio_order_range(inode->i_mapping,
>  				get_order(size), get_order(size));
>  		kill_bdev(bdev);
> +		filemap_invalidate_unlock(inode->i_mapping);
> +		inode_unlock(inode);
>  	}
>  	return 0;
>  }
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index 0c77244a35c92e..8f15d1aa6eb89a 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -343,6 +343,7 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, blk_mode_t mode,
>  		op = REQ_OP_ZONE_RESET;
>  
>  		/* Invalidate the page cache, including dirty pages. */
> +		inode_lock(bdev->bd_mapping->host);
>  		filemap_invalidate_lock(bdev->bd_mapping);
>  		ret = blkdev_truncate_zone_range(bdev, mode, &zrange);
>  		if (ret)
> @@ -364,8 +365,10 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, blk_mode_t mode,
>  	ret = blkdev_zone_mgmt(bdev, op, zrange.sector, zrange.nr_sectors);
>  
>  fail:
> -	if (cmd == BLKRESETZONE)
> +	if (cmd == BLKRESETZONE) {
>  		filemap_invalidate_unlock(bdev->bd_mapping);
> +		inode_unlock(bdev->bd_mapping->host);
> +	}
>  
>  	return ret;
>  }
> diff --git a/block/fops.c b/block/fops.c
> index be9f1dbea9ce0a..e221fdcaa8aaf8 100644
> --- a/block/fops.c
> +++ b/block/fops.c
> @@ -746,7 +746,14 @@ static ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  			ret = direct_write_fallback(iocb, from, ret,
>  					blkdev_buffered_write(iocb, from));
>  	} else {
> +		/*
> +		 * Take i_rwsem and invalidate_lock to avoid racing with
> +		 * set_blocksize changing i_blkbits/folio order and punching
> +		 * out the pagecache.
> +		 */
> +		inode_lock_shared(bd_inode);
>  		ret = blkdev_buffered_write(iocb, from);
> +		inode_unlock_shared(bd_inode);
>  	}
>  
>  	if (ret > 0)
> @@ -757,6 +764,7 @@ static ssize_t blkdev_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  
>  static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to)
>  {
> +	struct inode *bd_inode = bdev_file_inode(iocb->ki_filp);
>  	struct block_device *bdev = I_BDEV(iocb->ki_filp->f_mapping->host);
>  	loff_t size = bdev_nr_bytes(bdev);
>  	loff_t pos = iocb->ki_pos;
> @@ -793,7 +801,13 @@ static ssize_t blkdev_read_iter(struct kiocb *iocb, struct iov_iter *to)
>  			goto reexpand;
>  	}
>  
> +	/*
> +	 * Take i_rwsem and invalidate_lock to avoid racing with set_blocksize
> +	 * changing i_blkbits/folio order and punching out the pagecache.
> +	 */
> +	inode_lock_shared(bd_inode);
>  	ret = filemap_read(iocb, to, ret);
> +	inode_unlock_shared(bd_inode);
>  
>  reexpand:
>  	if (unlikely(shorted))
> @@ -836,6 +850,7 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
>  	if ((start | len) & (bdev_logical_block_size(bdev) - 1))
>  		return -EINVAL;
>  
> +	inode_lock(inode);
>  	filemap_invalidate_lock(inode->i_mapping);
>  
>  	/*
> @@ -868,6 +883,7 @@ static long blkdev_fallocate(struct file *file, int mode, loff_t start,
>  
>   fail:
>  	filemap_invalidate_unlock(inode->i_mapping);
> +	inode_unlock(inode);
>  	return error;
>  }
>  
> diff --git a/block/ioctl.c b/block/ioctl.c
> index faa40f383e2736..e472cc1030c60c 100644
> --- a/block/ioctl.c
> +++ b/block/ioctl.c
> @@ -142,6 +142,7 @@ static int blk_ioctl_discard(struct block_device *bdev, blk_mode_t mode,
>  	if (err)
>  		return err;
>  
> +	inode_lock(bdev->bd_mapping->host);
>  	filemap_invalidate_lock(bdev->bd_mapping);
>  	err = truncate_bdev_range(bdev, mode, start, start + len - 1);
>  	if (err)
> @@ -174,6 +175,7 @@ static int blk_ioctl_discard(struct block_device *bdev, blk_mode_t mode,
>  	blk_finish_plug(&plug);
>  fail:
>  	filemap_invalidate_unlock(bdev->bd_mapping);
> +	inode_unlock(bdev->bd_mapping->host);
>  	return err;
>  }
>  
> @@ -199,12 +201,14 @@ static int blk_ioctl_secure_erase(struct block_device *bdev, blk_mode_t mode,
>  	    end > bdev_nr_bytes(bdev))
>  		return -EINVAL;
>  
> +	inode_lock(bdev->bd_mapping->host);
>  	filemap_invalidate_lock(bdev->bd_mapping);
>  	err = truncate_bdev_range(bdev, mode, start, end - 1);
>  	if (!err)
>  		err = blkdev_issue_secure_erase(bdev, start >> 9, len >> 9,
>  						GFP_KERNEL);
>  	filemap_invalidate_unlock(bdev->bd_mapping);
> +	inode_unlock(bdev->bd_mapping->host);
>  	return err;
>  }
>  
> @@ -236,6 +240,7 @@ static int blk_ioctl_zeroout(struct block_device *bdev, blk_mode_t mode,
>  		return -EINVAL;
>  
>  	/* Invalidate the page cache, including dirty pages */
> +	inode_lock(bdev->bd_mapping->host);
>  	filemap_invalidate_lock(bdev->bd_mapping);
>  	err = truncate_bdev_range(bdev, mode, start, end);
>  	if (err)
> @@ -246,6 +251,7 @@ static int blk_ioctl_zeroout(struct block_device *bdev, blk_mode_t mode,
>  
>  fail:
>  	filemap_invalidate_unlock(bdev->bd_mapping);
> +	inode_unlock(bdev->bd_mapping->host);
>  	return err;
>  }
>  
>

next prev parent reply	other threads:[~2025-04-18 16:02 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-04-18 15:54 [PATCH 1/2] block: fix race between set_blocksize and read paths Darrick J. Wong
2025-04-18 15:58 ` [PATCH 2/2] xfs: stop using set_blocksize Darrick J. Wong
2025-04-18 17:56   ` Luis Chamberlain
2025-04-21  7:59   ` Christoph Hellwig
2025-04-18 16:02 ` Darrick J. Wong [this message]
2025-04-18 17:56   ` [PATCH 1/2] block: fix race between set_blocksize and read paths Luis Chamberlain
2025-04-18 17:55 ` Luis Chamberlain
2025-04-21  7:58 ` Christoph Hellwig

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250418160234.GT25675@frogsfrogsfrogs \
    --to=djwong@kernel.org \
    --cc=axboe@kernel.dk \
    --cc=hch@infradead.org \
    --cc=linux-block@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=mcgrof@kernel.org \
    --cc=shinichiro.kawasaki@wdc.com \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.