All of lore.kernel.org
 help / color / mirror / Atom feed
From: Richard Weinberger <richard@sigma-star.at>
To: Richard Weinberger <richard@nod.at>,
	linux-nvme@lists.infradead.org, upstream@sigma-star.at
Cc: linux-kernel@vger.kernel.org, kch@nvidia.com, sagi@grimberg.me,
	hch@lst.de, upstream+nvme@sigma-star.at,
	Damien Le Moal <dlemoal@kernel.org>
Subject: Re: [PATCH v2] nvmet: Make blksize_shift configurable
Date: Tue, 01 Jul 2025 09:09:42 +0200	[thread overview]
Message-ID: <2920993.eCsiYhmirv@nailgun> (raw)
In-Reply-To: <132c1bdf-e100-4e3a-883f-27f9e9b78020@kernel.org>

On Dienstag, 1. Juli 2025 02:34 'Damien Le Moal' via upstream wrote:
> On 7/1/25 4:13 AM, Richard Weinberger wrote:
> > Currently, the block size is automatically configured, and for
> > file-backed namespaces it is likely to be 4K.
> > While this is a reasonable default for modern storage, it can
> > cause confusion if someone wants to export a pre-created disk image
> > that uses a 512-byte block size.
> > As a result, partition parsing will fail.
> > 
> > So, just like we already do for the loop block device, let the user
> > configure the block size if they know better.
> 
> Hmm... That fine with me but this explanation does not match what the patch
> does: you allow configuring the block size bit shift, not the size. That is not
> super user friendly :)
> 
> Even if internally you use the block size bit shift, I think it would be better
> if the user facing interface is the block size as that is much easier to
> manipulate without having to remember the exponent for powers of 2 values :)

The initial intention of this patch was exposing the blksize_shift property.
If we want to expose this as more user friendly, I'm fine with it.
Maybe "minimum_io_size"?

> 
> > 
> > Signed-off-by: Richard Weinberger <richard@nod.at>
> > ---
> > Changes since v1 (RFC)[0]:
> > 
> > - Make sure blksize_shift is in general within reason
> > - In the bdev case and when using direct IO, blksize_shift has to be
> >   smaller than the logical block it the device
> > - In the file case and when using direct IO try to use STATX_DIOALIGN,
> >   just like the loop device does
> > 
> > [0] https://lore.kernel.org/linux-nvme/20250418090834.2755289-1-richard@nod.at/
> > 
> > Thanks,
> > //richard
> > ---
> >  drivers/nvme/target/configfs.c    | 37 +++++++++++++++++++++++++++++++
> >  drivers/nvme/target/io-cmd-bdev.c | 13 ++++++++++-
> >  drivers/nvme/target/io-cmd-file.c | 28 ++++++++++++++++++-----
> >  3 files changed, 71 insertions(+), 7 deletions(-)
> > 
> > diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
> > index e44ef69dffc24..26175c37374ab 100644
> > --- a/drivers/nvme/target/configfs.c
> > +++ b/drivers/nvme/target/configfs.c
> > @@ -797,6 +797,42 @@ static ssize_t nvmet_ns_resv_enable_store(struct config_item *item,
> >  }
> >  CONFIGFS_ATTR(nvmet_ns_, resv_enable);
> >  
> > +static ssize_t nvmet_ns_blksize_shift_show(struct config_item *item, char *page)
> 
> As mentioned above, I think this should be nvmet_ns_blksize_show().
> 
> > +{
> > +	return sysfs_emit(page, "%u\n", to_nvmet_ns(item)->blksize_shift);
> 
> And you can do:
> 
> 	return sysfs_emit(page, "%u\n",
> 			  1U << to_nvmet_ns(item)->blksize_shift);
> 
> > +}
> > +
> > +static ssize_t nvmet_ns_blksize_shift_store(struct config_item *item,
> > +		const char *page, size_t count)
> 
> Similar here: nvmet_ns_blksize_store()
> 
> > +{
> > +	struct nvmet_ns *ns = to_nvmet_ns(item);
> > +	u32 shift;
> > +	int ret;
> > +
> > +	ret = kstrtou32(page, 0, &shift);
> > +	if (ret)
> > +		return ret;
> > +
> > +	/*
> > +	 * Make sure block size is within reason, something between 512 and
> > +	 * BLK_MAX_BLOCK_SIZE.
> > +	 */
> > +	if (shift < 9 || shift > 16)
> > +		return -EINVAL;
> 
> And this would be simpler:
> 
> 	if (blksz < SECTOR_SIZE || blksz > BLK_MAX_BLOCK_SIZE ||
> 	    !is_power_of_2(blksz))
> 		return -EINVAL;
> 
> > +
> > +	mutex_lock(&ns->subsys->lock);
> > +	if (ns->enabled) {
> > +		pr_err("the ns:%d is already enabled.\n", ns->nsid);
> > +		mutex_unlock(&ns->subsys->lock);
> > +		return -EINVAL;
> > +	}
> > +	ns->blksize_shift = shift;
> 
> and here:
> 
> 	ns->blksize_shift = ilog2(blksz);
> 
> > +	mutex_unlock(&ns->subsys->lock);
> > +
> > +	return count;
> > +}
> > +CONFIGFS_ATTR(nvmet_ns_, blksize_shift);
> > +
> >  static struct configfs_attribute *nvmet_ns_attrs[] = {
> >  	&nvmet_ns_attr_device_path,
> >  	&nvmet_ns_attr_device_nguid,
> > @@ -806,6 +842,7 @@ static struct configfs_attribute *nvmet_ns_attrs[] = {
> >  	&nvmet_ns_attr_buffered_io,
> >  	&nvmet_ns_attr_revalidate_size,
> >  	&nvmet_ns_attr_resv_enable,
> > +	&nvmet_ns_attr_blksize_shift,
> >  #ifdef CONFIG_PCI_P2PDMA
> >  	&nvmet_ns_attr_p2pmem,
> >  #endif
> > diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c
> > index eba42df2f8215..be39837d4d792 100644
> > --- a/drivers/nvme/target/io-cmd-bdev.c
> > +++ b/drivers/nvme/target/io-cmd-bdev.c
> > @@ -77,6 +77,7 @@ static void nvmet_bdev_ns_enable_integrity(struct nvmet_ns *ns)
> >  
> >  int nvmet_bdev_ns_enable(struct nvmet_ns *ns)
> >  {
> > +	int bdev_blksize_shift;
> >  	int ret;
> >  
> >  	/*
> > @@ -100,7 +101,17 @@ int nvmet_bdev_ns_enable(struct nvmet_ns *ns)
> >  	}
> >  	ns->bdev = file_bdev(ns->bdev_file);
> >  	ns->size = bdev_nr_bytes(ns->bdev);
> > -	ns->blksize_shift = blksize_bits(bdev_logical_block_size(ns->bdev));
> > +	bdev_blksize_shift = blksize_bits(bdev_logical_block_size(ns->bdev));
> > +
> > +	if (ns->blksize_shift) {
> > +		if (ns->blksize_shift < bdev_blksize_shift) {
> > +			pr_err("Configured blksize_shift needs to be at least %d for device %s\n",
> > +				bdev_blksize_shift, ns->device_path);
> > +			return -EINVAL;
> > +		}
> > +	} else {
> > +		ns->blksize_shift = bdev_blksize_shift;
> > +	}
> 
> Nit: to avoid the indented if, may be write this like this: ?
> 
> 	if (!ns->blksize_shift)
> 		ns->blksize_shift = bdev_blksize_shift;
> 
> 	if (ns->blksize_shift < bdev_blksize_shift) {
> 		pr_err("Configured blksize needs to be at least %u for device %s\n",
> 			bdev_logical_block_size(ns->bdev),
> 			ns->device_path);
> 		return -EINVAL;
> 	}

It's a matter of taste, but yes...

> Also, if the backend is an HDD, do we want to allow the user to configure a
> block size that is less than the *physical* block size ? Performance will
> suffer on regular HDDs and writes may fail with SMR HDDs.

I'm not sure whether it's worth putting more smartness into this logic.
 
> >  
> >  	ns->pi_type = 0;
> >  	ns->metadata_size = 0;
> > diff --git a/drivers/nvme/target/io-cmd-file.c b/drivers/nvme/target/io-cmd-file.c
> > index 2d068439b129c..a4066b5a1dc97 100644
> > --- a/drivers/nvme/target/io-cmd-file.c
> > +++ b/drivers/nvme/target/io-cmd-file.c
> > @@ -49,12 +49,28 @@ int nvmet_file_ns_enable(struct nvmet_ns *ns)
> >  
> >  	nvmet_file_ns_revalidate(ns);
> >  
> > -	/*
> > -	 * i_blkbits can be greater than the universally accepted upper bound,
> > -	 * so make sure we export a sane namespace lba_shift.
> > -	 */
> > -	ns->blksize_shift = min_t(u8,
> > -			file_inode(ns->file)->i_blkbits, 12);
> > +	if (ns->blksize_shift) {
> > +		if (!ns->buffered_io) {
> > +			struct block_device *sb_bdev = ns->file->f_mapping->host->i_sb->s_bdev;
> > +			struct kstat st;
> > +
> > +			if (!vfs_getattr(&ns->file->f_path, &st, STATX_DIOALIGN, 0) &&
> > +			    (st.result_mask & STATX_DIOALIGN) &&
> > +			    (1 << ns->blksize_shift) < st.dio_offset_align)
> > +				return -EINVAL;
> > +
> > +			if (sb_bdev && (1 << ns->blksize_shift < bdev_logical_block_size(sb_bdev)))
> > +				return -EINVAL;
> 
> I am confused... This is going to check both... But if you got STATX_DIOALIGN
> and it is OK, you do not need (and probably should not) do the second if, no ?

I was not sure about that.
Is it guaranteed that STATX_DIOALIGN returns something sane?

> 
> Also, the second condition of the second if is essentially the same check as
> for the block dev case. So maybe reuse that by creating a small helper function ?

Ok.
 
> > +		}
> > +	} else {
> > +		/*
> > +		 * i_blkbits can be greater than the universally accepted
> > +		 * upper bound, so make sure we export a sane namespace
> > +		 * lba_shift.
> > +		 */
> > +		ns->blksize_shift = min_t(u8,
> > +				file_inode(ns->file)->i_blkbits, 12);
> > +	}
> 
> It feels like this entire hunk should be a helper function as that would allow
> making it a lot more readable with early returns. This code here whould be
> something like:

Ok.

Thanks,
//richard

-- 
​​​​​sigma star gmbh | Eduard-Bodem-Gasse 6, 6020 Innsbruck, AUT UID/VAT Nr:
ATU 66964118 | FN: 374287y




  reply	other threads:[~2025-07-01  7:45 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-06-30 19:13 [PATCH v2] nvmet: Make blksize_shift configurable Richard Weinberger
2025-07-01  0:34 ` Damien Le Moal
2025-07-01  7:09   ` Richard Weinberger [this message]
2025-07-01  7:32     ` Damien Le Moal
2025-07-01 15:00       ` Richard Weinberger
2025-07-03  8:54   ` Christoph Hellwig
2025-07-03  9:29     ` Richard Weinberger

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=2920993.eCsiYhmirv@nailgun \
    --to=richard@sigma-star.at \
    --cc=dlemoal@kernel.org \
    --cc=hch@lst.de \
    --cc=kch@nvidia.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-nvme@lists.infradead.org \
    --cc=richard@nod.at \
    --cc=sagi@grimberg.me \
    --cc=upstream+nvme@sigma-star.at \
    --cc=upstream@sigma-star.at \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.