[PATCH v2] nvmet: Make blksize

linux-nvme.lists.infradead.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v2] nvmet: Make blksize_shift configurable
@ 2025-06-30 19:13 Richard Weinberger
  2025-07-01  0:34 ` Damien Le Moal
  0 siblings, 1 reply; 7+ messages in thread
From: Richard Weinberger @ 2025-06-30 19:13 UTC (permalink / raw)
  To: linux-nvme
  Cc: linux-kernel, kch, sagi, hch, dlemoal, upstream+nvme,
	Richard Weinberger

Currently, the block size is automatically configured, and for
file-backed namespaces it is likely to be 4K.
While this is a reasonable default for modern storage, it can
cause confusion if someone wants to export a pre-created disk image
that uses a 512-byte block size.
As a result, partition parsing will fail.

So, just like we already do for the loop block device, let the user
configure the block size if they know better.

Signed-off-by: Richard Weinberger <richard@nod.at>
---
Changes since v1 (RFC)[0]:

- Make sure blksize_shift is in general within reason
- In the bdev case and when using direct IO, blksize_shift has to be
  smaller than the logical block it the device
- In the file case and when using direct IO try to use STATX_DIOALIGN,
  just like the loop device does

[0] https://lore.kernel.org/linux-nvme/20250418090834.2755289-1-richard@nod.at/

Thanks,
//richard
---
 drivers/nvme/target/configfs.c    | 37 +++++++++++++++++++++++++++++++
 drivers/nvme/target/io-cmd-bdev.c | 13 ++++++++++-
 drivers/nvme/target/io-cmd-file.c | 28 ++++++++++++++++++-----
 3 files changed, 71 insertions(+), 7 deletions(-)

diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
index e44ef69dffc24..26175c37374ab 100644
--- a/drivers/nvme/target/configfs.c
+++ b/drivers/nvme/target/configfs.c
@@ -797,6 +797,42 @@ static ssize_t nvmet_ns_resv_enable_store(struct config_item *item,
 }
 CONFIGFS_ATTR(nvmet_ns_, resv_enable);
 
+static ssize_t nvmet_ns_blksize_shift_show(struct config_item *item, char *page)
+{
+	return sysfs_emit(page, "%u\n", to_nvmet_ns(item)->blksize_shift);
+}
+
+static ssize_t nvmet_ns_blksize_shift_store(struct config_item *item,
+		const char *page, size_t count)
+{
+	struct nvmet_ns *ns = to_nvmet_ns(item);
+	u32 shift;
+	int ret;
+
+	ret = kstrtou32(page, 0, &shift);
+	if (ret)
+		return ret;
+
+	/*
+	 * Make sure block size is within reason, something between 512 and
+	 * BLK_MAX_BLOCK_SIZE.
+	 */
+	if (shift < 9 || shift > 16)
+		return -EINVAL;
+
+	mutex_lock(&ns->subsys->lock);
+	if (ns->enabled) {
+		pr_err("the ns:%d is already enabled.\n", ns->nsid);
+		mutex_unlock(&ns->subsys->lock);
+		return -EINVAL;
+	}
+	ns->blksize_shift = shift;
+	mutex_unlock(&ns->subsys->lock);
+
+	return count;
+}
+CONFIGFS_ATTR(nvmet_ns_, blksize_shift);
+
 static struct configfs_attribute *nvmet_ns_attrs[] = {
 	&nvmet_ns_attr_device_path,
 	&nvmet_ns_attr_device_nguid,
@@ -806,6 +842,7 @@ static struct configfs_attribute *nvmet_ns_attrs[] = {
 	&nvmet_ns_attr_buffered_io,
 	&nvmet_ns_attr_revalidate_size,
 	&nvmet_ns_attr_resv_enable,
+	&nvmet_ns_attr_blksize_shift,
 #ifdef CONFIG_PCI_P2PDMA
 	&nvmet_ns_attr_p2pmem,
 #endif
diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c
index eba42df2f8215..be39837d4d792 100644
--- a/drivers/nvme/target/io-cmd-bdev.c
+++ b/drivers/nvme/target/io-cmd-bdev.c
@@ -77,6 +77,7 @@ static void nvmet_bdev_ns_enable_integrity(struct nvmet_ns *ns)
 
 int nvmet_bdev_ns_enable(struct nvmet_ns *ns)
 {
+	int bdev_blksize_shift;
 	int ret;
 
 	/*
@@ -100,7 +101,17 @@ int nvmet_bdev_ns_enable(struct nvmet_ns *ns)
 	}
 	ns->bdev = file_bdev(ns->bdev_file);
 	ns->size = bdev_nr_bytes(ns->bdev);
-	ns->blksize_shift = blksize_bits(bdev_logical_block_size(ns->bdev));
+	bdev_blksize_shift = blksize_bits(bdev_logical_block_size(ns->bdev));
+
+	if (ns->blksize_shift) {
+		if (ns->blksize_shift < bdev_blksize_shift) {
+			pr_err("Configured blksize_shift needs to be at least %d for device %s\n",
+				bdev_blksize_shift, ns->device_path);
+			return -EINVAL;
+		}
+	} else {
+		ns->blksize_shift = bdev_blksize_shift;
+	}
 
 	ns->pi_type = 0;
 	ns->metadata_size = 0;
diff --git a/drivers/nvme/target/io-cmd-file.c b/drivers/nvme/target/io-cmd-file.c
index 2d068439b129c..a4066b5a1dc97 100644
--- a/drivers/nvme/target/io-cmd-file.c
+++ b/drivers/nvme/target/io-cmd-file.c
@@ -49,12 +49,28 @@ int nvmet_file_ns_enable(struct nvmet_ns *ns)
 
 	nvmet_file_ns_revalidate(ns);
 
-	/*
-	 * i_blkbits can be greater than the universally accepted upper bound,
-	 * so make sure we export a sane namespace lba_shift.
-	 */
-	ns->blksize_shift = min_t(u8,
-			file_inode(ns->file)->i_blkbits, 12);
+	if (ns->blksize_shift) {
+		if (!ns->buffered_io) {
+			struct block_device *sb_bdev = ns->file->f_mapping->host->i_sb->s_bdev;
+			struct kstat st;
+
+			if (!vfs_getattr(&ns->file->f_path, &st, STATX_DIOALIGN, 0) &&
+			    (st.result_mask & STATX_DIOALIGN) &&
+			    (1 << ns->blksize_shift) < st.dio_offset_align)
+				return -EINVAL;
+
+			if (sb_bdev && (1 << ns->blksize_shift < bdev_logical_block_size(sb_bdev)))
+				return -EINVAL;
+		}
+	} else {
+		/*
+		 * i_blkbits can be greater than the universally accepted
+		 * upper bound, so make sure we export a sane namespace
+		 * lba_shift.
+		 */
+		ns->blksize_shift = min_t(u8,
+				file_inode(ns->file)->i_blkbits, 12);
+	}
 
 	ns->bvec_pool = mempool_create(NVMET_MIN_MPOOL_OBJ, mempool_alloc_slab,
 			mempool_free_slab, nvmet_bvec_cache);
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] nvmet: Make blksize_shift configurable
  2025-06-30 19:13 [PATCH v2] nvmet: Make blksize_shift configurable Richard Weinberger
@ 2025-07-01  0:34 ` Damien Le Moal
  2025-07-01  7:09   ` Richard Weinberger
  2025-07-03  8:54   ` Christoph Hellwig
  0 siblings, 2 replies; 7+ messages in thread
From: Damien Le Moal @ 2025-07-01  0:34 UTC (permalink / raw)
  To: Richard Weinberger, linux-nvme
  Cc: linux-kernel, kch, sagi, hch, upstream+nvme

On 7/1/25 4:13 AM, Richard Weinberger wrote:
> Currently, the block size is automatically configured, and for
> file-backed namespaces it is likely to be 4K.
> While this is a reasonable default for modern storage, it can
> cause confusion if someone wants to export a pre-created disk image
> that uses a 512-byte block size.
> As a result, partition parsing will fail.
> 
> So, just like we already do for the loop block device, let the user
> configure the block size if they know better.

Hmm... That fine with me but this explanation does not match what the patch
does: you allow configuring the block size bit shift, not the size. That is not
super user friendly :)

Even if internally you use the block size bit shift, I think it would be better
if the user facing interface is the block size as that is much easier to
manipulate without having to remember the exponent for powers of 2 values :)

> 
> Signed-off-by: Richard Weinberger <richard@nod.at>
> ---
> Changes since v1 (RFC)[0]:
> 
> - Make sure blksize_shift is in general within reason
> - In the bdev case and when using direct IO, blksize_shift has to be
>   smaller than the logical block it the device
> - In the file case and when using direct IO try to use STATX_DIOALIGN,
>   just like the loop device does
> 
> [0] https://lore.kernel.org/linux-nvme/20250418090834.2755289-1-richard@nod.at/
> 
> Thanks,
> //richard
> ---
>  drivers/nvme/target/configfs.c    | 37 +++++++++++++++++++++++++++++++
>  drivers/nvme/target/io-cmd-bdev.c | 13 ++++++++++-
>  drivers/nvme/target/io-cmd-file.c | 28 ++++++++++++++++++-----
>  3 files changed, 71 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
> index e44ef69dffc24..26175c37374ab 100644
> --- a/drivers/nvme/target/configfs.c
> +++ b/drivers/nvme/target/configfs.c
> @@ -797,6 +797,42 @@ static ssize_t nvmet_ns_resv_enable_store(struct config_item *item,
>  }
>  CONFIGFS_ATTR(nvmet_ns_, resv_enable);
>  
> +static ssize_t nvmet_ns_blksize_shift_show(struct config_item *item, char *page)

As mentioned above, I think this should be nvmet_ns_blksize_show().

> +{
> +	return sysfs_emit(page, "%u\n", to_nvmet_ns(item)->blksize_shift);

And you can do:

	return sysfs_emit(page, "%u\n",
			  1U << to_nvmet_ns(item)->blksize_shift);

> +}
> +
> +static ssize_t nvmet_ns_blksize_shift_store(struct config_item *item,
> +		const char *page, size_t count)

Similar here: nvmet_ns_blksize_store()

> +{
> +	struct nvmet_ns *ns = to_nvmet_ns(item);
> +	u32 shift;
> +	int ret;
> +
> +	ret = kstrtou32(page, 0, &shift);
> +	if (ret)
> +		return ret;
> +
> +	/*
> +	 * Make sure block size is within reason, something between 512 and
> +	 * BLK_MAX_BLOCK_SIZE.
> +	 */
> +	if (shift < 9 || shift > 16)
> +		return -EINVAL;

And this would be simpler:

	if (blksz < SECTOR_SIZE || blksz > BLK_MAX_BLOCK_SIZE ||
	    !is_power_of_2(blksz))
		return -EINVAL;

> +
> +	mutex_lock(&ns->subsys->lock);
> +	if (ns->enabled) {
> +		pr_err("the ns:%d is already enabled.\n", ns->nsid);
> +		mutex_unlock(&ns->subsys->lock);
> +		return -EINVAL;
> +	}
> +	ns->blksize_shift = shift;

and here:

	ns->blksize_shift = ilog2(blksz);

> +	mutex_unlock(&ns->subsys->lock);
> +
> +	return count;
> +}
> +CONFIGFS_ATTR(nvmet_ns_, blksize_shift);
> +
>  static struct configfs_attribute *nvmet_ns_attrs[] = {
>  	&nvmet_ns_attr_device_path,
>  	&nvmet_ns_attr_device_nguid,
> @@ -806,6 +842,7 @@ static struct configfs_attribute *nvmet_ns_attrs[] = {
>  	&nvmet_ns_attr_buffered_io,
>  	&nvmet_ns_attr_revalidate_size,
>  	&nvmet_ns_attr_resv_enable,
> +	&nvmet_ns_attr_blksize_shift,
>  #ifdef CONFIG_PCI_P2PDMA
>  	&nvmet_ns_attr_p2pmem,
>  #endif
> diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c
> index eba42df2f8215..be39837d4d792 100644
> --- a/drivers/nvme/target/io-cmd-bdev.c
> +++ b/drivers/nvme/target/io-cmd-bdev.c
> @@ -77,6 +77,7 @@ static void nvmet_bdev_ns_enable_integrity(struct nvmet_ns *ns)
>  
>  int nvmet_bdev_ns_enable(struct nvmet_ns *ns)
>  {
> +	int bdev_blksize_shift;
>  	int ret;
>  
>  	/*
> @@ -100,7 +101,17 @@ int nvmet_bdev_ns_enable(struct nvmet_ns *ns)
>  	}
>  	ns->bdev = file_bdev(ns->bdev_file);
>  	ns->size = bdev_nr_bytes(ns->bdev);
> -	ns->blksize_shift = blksize_bits(bdev_logical_block_size(ns->bdev));
> +	bdev_blksize_shift = blksize_bits(bdev_logical_block_size(ns->bdev));
> +
> +	if (ns->blksize_shift) {
> +		if (ns->blksize_shift < bdev_blksize_shift) {
> +			pr_err("Configured blksize_shift needs to be at least %d for device %s\n",
> +				bdev_blksize_shift, ns->device_path);
> +			return -EINVAL;
> +		}
> +	} else {
> +		ns->blksize_shift = bdev_blksize_shift;
> +	}

Nit: to avoid the indented if, may be write this like this: ?

	if (!ns->blksize_shift)
		ns->blksize_shift = bdev_blksize_shift;

	if (ns->blksize_shift < bdev_blksize_shift) {
		pr_err("Configured blksize needs to be at least %u for device %s\n",
			bdev_logical_block_size(ns->bdev),
			ns->device_path);
		return -EINVAL;
	}

Also, if the backend is an HDD, do we want to allow the user to configure a
block size that is less than the *physical* block size ? Performance will
suffer on regular HDDs and writes may fail with SMR HDDs.

>  
>  	ns->pi_type = 0;
>  	ns->metadata_size = 0;
> diff --git a/drivers/nvme/target/io-cmd-file.c b/drivers/nvme/target/io-cmd-file.c
> index 2d068439b129c..a4066b5a1dc97 100644
> --- a/drivers/nvme/target/io-cmd-file.c
> +++ b/drivers/nvme/target/io-cmd-file.c
> @@ -49,12 +49,28 @@ int nvmet_file_ns_enable(struct nvmet_ns *ns)
>  
>  	nvmet_file_ns_revalidate(ns);
>  
> -	/*
> -	 * i_blkbits can be greater than the universally accepted upper bound,
> -	 * so make sure we export a sane namespace lba_shift.
> -	 */
> -	ns->blksize_shift = min_t(u8,
> -			file_inode(ns->file)->i_blkbits, 12);
> +	if (ns->blksize_shift) {
> +		if (!ns->buffered_io) {
> +			struct block_device *sb_bdev = ns->file->f_mapping->host->i_sb->s_bdev;
> +			struct kstat st;
> +
> +			if (!vfs_getattr(&ns->file->f_path, &st, STATX_DIOALIGN, 0) &&
> +			    (st.result_mask & STATX_DIOALIGN) &&
> +			    (1 << ns->blksize_shift) < st.dio_offset_align)
> +				return -EINVAL;
> +
> +			if (sb_bdev && (1 << ns->blksize_shift < bdev_logical_block_size(sb_bdev)))
> +				return -EINVAL;

I am confused... This is going to check both... But if you got STATX_DIOALIGN
and it is OK, you do not need (and probably should not) do the second if, no ?

Also, the second condition of the second if is essentially the same check as
for the block dev case. So maybe reuse that by creating a small helper function ?

> +		}
> +	} else {
> +		/*
> +		 * i_blkbits can be greater than the universally accepted
> +		 * upper bound, so make sure we export a sane namespace
> +		 * lba_shift.
> +		 */
> +		ns->blksize_shift = min_t(u8,
> +				file_inode(ns->file)->i_blkbits, 12);
> +	}

It feels like this entire hunk should be a helper function as that would allow
making it a lot more readable with early returns. This code here whould be
something like:

	ret = nvmet_file_set_ns_blksize_shift(ns);
	if (ret)
		return ret;

>  
>  	ns->bvec_pool = mempool_create(NVMET_MIN_MPOOL_OBJ, mempool_alloc_slab,
>  			mempool_free_slab, nvmet_bvec_cache);


-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] nvmet: Make blksize_shift configurable
  2025-07-01  0:34 ` Damien Le Moal
@ 2025-07-01  7:09   ` Richard Weinberger
  2025-07-01  7:32     ` Damien Le Moal
  2025-07-03  8:54   ` Christoph Hellwig
  1 sibling, 1 reply; 7+ messages in thread
From: Richard Weinberger @ 2025-07-01  7:09 UTC (permalink / raw)
  To: Richard Weinberger, linux-nvme, upstream
  Cc: linux-kernel, kch, sagi, hch, upstream+nvme, Damien Le Moal

On Dienstag, 1. Juli 2025 02:34 'Damien Le Moal' via upstream wrote:
> On 7/1/25 4:13 AM, Richard Weinberger wrote:
> > Currently, the block size is automatically configured, and for
> > file-backed namespaces it is likely to be 4K.
> > While this is a reasonable default for modern storage, it can
> > cause confusion if someone wants to export a pre-created disk image
> > that uses a 512-byte block size.
> > As a result, partition parsing will fail.
> > 
> > So, just like we already do for the loop block device, let the user
> > configure the block size if they know better.
> 
> Hmm... That fine with me but this explanation does not match what the patch
> does: you allow configuring the block size bit shift, not the size. That is not
> super user friendly :)
> 
> Even if internally you use the block size bit shift, I think it would be better
> if the user facing interface is the block size as that is much easier to
> manipulate without having to remember the exponent for powers of 2 values :)

The initial intention of this patch was exposing the blksize_shift property.
If we want to expose this as more user friendly, I'm fine with it.
Maybe "minimum_io_size"?

> 
> > 
> > Signed-off-by: Richard Weinberger <richard@nod.at>
> > ---
> > Changes since v1 (RFC)[0]:
> > 
> > - Make sure blksize_shift is in general within reason
> > - In the bdev case and when using direct IO, blksize_shift has to be
> >   smaller than the logical block it the device
> > - In the file case and when using direct IO try to use STATX_DIOALIGN,
> >   just like the loop device does
> > 
> > [0] https://lore.kernel.org/linux-nvme/20250418090834.2755289-1-richard@nod.at/
> > 
> > Thanks,
> > //richard
> > ---
> >  drivers/nvme/target/configfs.c    | 37 +++++++++++++++++++++++++++++++
> >  drivers/nvme/target/io-cmd-bdev.c | 13 ++++++++++-
> >  drivers/nvme/target/io-cmd-file.c | 28 ++++++++++++++++++-----
> >  3 files changed, 71 insertions(+), 7 deletions(-)
> > 
> > diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
> > index e44ef69dffc24..26175c37374ab 100644
> > --- a/drivers/nvme/target/configfs.c
> > +++ b/drivers/nvme/target/configfs.c
> > @@ -797,6 +797,42 @@ static ssize_t nvmet_ns_resv_enable_store(struct config_item *item,
> >  }
> >  CONFIGFS_ATTR(nvmet_ns_, resv_enable);
> >  
> > +static ssize_t nvmet_ns_blksize_shift_show(struct config_item *item, char *page)
> 
> As mentioned above, I think this should be nvmet_ns_blksize_show().
> 
> > +{
> > +	return sysfs_emit(page, "%u\n", to_nvmet_ns(item)->blksize_shift);
> 
> And you can do:
> 
> 	return sysfs_emit(page, "%u\n",
> 			  1U << to_nvmet_ns(item)->blksize_shift);
> 
> > +}
> > +
> > +static ssize_t nvmet_ns_blksize_shift_store(struct config_item *item,
> > +		const char *page, size_t count)
> 
> Similar here: nvmet_ns_blksize_store()
> 
> > +{
> > +	struct nvmet_ns *ns = to_nvmet_ns(item);
> > +	u32 shift;
> > +	int ret;
> > +
> > +	ret = kstrtou32(page, 0, &shift);
> > +	if (ret)
> > +		return ret;
> > +
> > +	/*
> > +	 * Make sure block size is within reason, something between 512 and
> > +	 * BLK_MAX_BLOCK_SIZE.
> > +	 */
> > +	if (shift < 9 || shift > 16)
> > +		return -EINVAL;
> 
> And this would be simpler:
> 
> 	if (blksz < SECTOR_SIZE || blksz > BLK_MAX_BLOCK_SIZE ||
> 	    !is_power_of_2(blksz))
> 		return -EINVAL;
> 
> > +
> > +	mutex_lock(&ns->subsys->lock);
> > +	if (ns->enabled) {
> > +		pr_err("the ns:%d is already enabled.\n", ns->nsid);
> > +		mutex_unlock(&ns->subsys->lock);
> > +		return -EINVAL;
> > +	}
> > +	ns->blksize_shift = shift;
> 
> and here:
> 
> 	ns->blksize_shift = ilog2(blksz);
> 
> > +	mutex_unlock(&ns->subsys->lock);
> > +
> > +	return count;
> > +}
> > +CONFIGFS_ATTR(nvmet_ns_, blksize_shift);
> > +
> >  static struct configfs_attribute *nvmet_ns_attrs[] = {
> >  	&nvmet_ns_attr_device_path,
> >  	&nvmet_ns_attr_device_nguid,
> > @@ -806,6 +842,7 @@ static struct configfs_attribute *nvmet_ns_attrs[] = {
> >  	&nvmet_ns_attr_buffered_io,
> >  	&nvmet_ns_attr_revalidate_size,
> >  	&nvmet_ns_attr_resv_enable,
> > +	&nvmet_ns_attr_blksize_shift,
> >  #ifdef CONFIG_PCI_P2PDMA
> >  	&nvmet_ns_attr_p2pmem,
> >  #endif
> > diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c
> > index eba42df2f8215..be39837d4d792 100644
> > --- a/drivers/nvme/target/io-cmd-bdev.c
> > +++ b/drivers/nvme/target/io-cmd-bdev.c
> > @@ -77,6 +77,7 @@ static void nvmet_bdev_ns_enable_integrity(struct nvmet_ns *ns)
> >  
> >  int nvmet_bdev_ns_enable(struct nvmet_ns *ns)
> >  {
> > +	int bdev_blksize_shift;
> >  	int ret;
> >  
> >  	/*
> > @@ -100,7 +101,17 @@ int nvmet_bdev_ns_enable(struct nvmet_ns *ns)
> >  	}
> >  	ns->bdev = file_bdev(ns->bdev_file);
> >  	ns->size = bdev_nr_bytes(ns->bdev);
> > -	ns->blksize_shift = blksize_bits(bdev_logical_block_size(ns->bdev));
> > +	bdev_blksize_shift = blksize_bits(bdev_logical_block_size(ns->bdev));
> > +
> > +	if (ns->blksize_shift) {
> > +		if (ns->blksize_shift < bdev_blksize_shift) {
> > +			pr_err("Configured blksize_shift needs to be at least %d for device %s\n",
> > +				bdev_blksize_shift, ns->device_path);
> > +			return -EINVAL;
> > +		}
> > +	} else {
> > +		ns->blksize_shift = bdev_blksize_shift;
> > +	}
> 
> Nit: to avoid the indented if, may be write this like this: ?
> 
> 	if (!ns->blksize_shift)
> 		ns->blksize_shift = bdev_blksize_shift;
> 
> 	if (ns->blksize_shift < bdev_blksize_shift) {
> 		pr_err("Configured blksize needs to be at least %u for device %s\n",
> 			bdev_logical_block_size(ns->bdev),
> 			ns->device_path);
> 		return -EINVAL;
> 	}

It's a matter of taste, but yes...

> Also, if the backend is an HDD, do we want to allow the user to configure a
> block size that is less than the *physical* block size ? Performance will
> suffer on regular HDDs and writes may fail with SMR HDDs.

I'm not sure whether it's worth putting more smartness into this logic.
 
> >  
> >  	ns->pi_type = 0;
> >  	ns->metadata_size = 0;
> > diff --git a/drivers/nvme/target/io-cmd-file.c b/drivers/nvme/target/io-cmd-file.c
> > index 2d068439b129c..a4066b5a1dc97 100644
> > --- a/drivers/nvme/target/io-cmd-file.c
> > +++ b/drivers/nvme/target/io-cmd-file.c
> > @@ -49,12 +49,28 @@ int nvmet_file_ns_enable(struct nvmet_ns *ns)
> >  
> >  	nvmet_file_ns_revalidate(ns);
> >  
> > -	/*
> > -	 * i_blkbits can be greater than the universally accepted upper bound,
> > -	 * so make sure we export a sane namespace lba_shift.
> > -	 */
> > -	ns->blksize_shift = min_t(u8,
> > -			file_inode(ns->file)->i_blkbits, 12);
> > +	if (ns->blksize_shift) {
> > +		if (!ns->buffered_io) {
> > +			struct block_device *sb_bdev = ns->file->f_mapping->host->i_sb->s_bdev;
> > +			struct kstat st;
> > +
> > +			if (!vfs_getattr(&ns->file->f_path, &st, STATX_DIOALIGN, 0) &&
> > +			    (st.result_mask & STATX_DIOALIGN) &&
> > +			    (1 << ns->blksize_shift) < st.dio_offset_align)
> > +				return -EINVAL;
> > +
> > +			if (sb_bdev && (1 << ns->blksize_shift < bdev_logical_block_size(sb_bdev)))
> > +				return -EINVAL;
> 
> I am confused... This is going to check both... But if you got STATX_DIOALIGN
> and it is OK, you do not need (and probably should not) do the second if, no ?

I was not sure about that.
Is it guaranteed that STATX_DIOALIGN returns something sane?

> 
> Also, the second condition of the second if is essentially the same check as
> for the block dev case. So maybe reuse that by creating a small helper function ?

Ok.
 
> > +		}
> > +	} else {
> > +		/*
> > +		 * i_blkbits can be greater than the universally accepted
> > +		 * upper bound, so make sure we export a sane namespace
> > +		 * lba_shift.
> > +		 */
> > +		ns->blksize_shift = min_t(u8,
> > +				file_inode(ns->file)->i_blkbits, 12);
> > +	}
> 
> It feels like this entire hunk should be a helper function as that would allow
> making it a lot more readable with early returns. This code here whould be
> something like:

Ok.

Thanks,
//richard

-- 
sigma star gmbh | Eduard-Bodem-Gasse 6, 6020 Innsbruck, AUT UID/VAT Nr:
ATU 66964118 | FN: 374287y




^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] nvmet: Make blksize_shift configurable
  2025-07-01  7:09   ` Richard Weinberger
@ 2025-07-01  7:32     ` Damien Le Moal
  2025-07-01 15:00       ` Richard Weinberger
  0 siblings, 1 reply; 7+ messages in thread
From: Damien Le Moal @ 2025-07-01  7:32 UTC (permalink / raw)
  To: Richard Weinberger, Richard Weinberger, linux-nvme, upstream
  Cc: linux-kernel, kch, sagi, hch, upstream+nvme

On 7/1/25 4:09 PM, Richard Weinberger wrote:
> On Dienstag, 1. Juli 2025 02:34 'Damien Le Moal' via upstream wrote:
>> On 7/1/25 4:13 AM, Richard Weinberger wrote:
>>> Currently, the block size is automatically configured, and for
>>> file-backed namespaces it is likely to be 4K.
>>> While this is a reasonable default for modern storage, it can
>>> cause confusion if someone wants to export a pre-created disk image
>>> that uses a 512-byte block size.
>>> As a result, partition parsing will fail.
>>>
>>> So, just like we already do for the loop block device, let the user
>>> configure the block size if they know better.
>>
>> Hmm... That fine with me but this explanation does not match what the patch
>> does: you allow configuring the block size bit shift, not the size. That is not
>> super user friendly :)
>>
>> Even if internally you use the block size bit shift, I think it would be better
>> if the user facing interface is the block size as that is much easier to
>> manipulate without having to remember the exponent for powers of 2 values :)
> 
> The initial intention of this patch was exposing the blksize_shift property.
> If we want to expose this as more user friendly, I'm fine with it.
> Maybe "minimum_io_size"?

That likely will be confusing with the existing device limit io_min. I think
block_size is clear.

>> Nit: to avoid the indented if, may be write this like this: ?
>>
>> 	if (!ns->blksize_shift)
>> 		ns->blksize_shift = bdev_blksize_shift;
>>
>> 	if (ns->blksize_shift < bdev_blksize_shift) {
>> 		pr_err("Configured blksize needs to be at least %u for device %s\n",
>> 			bdev_logical_block_size(ns->bdev),
>> 			ns->device_path);
>> 		return -EINVAL;
>> 	}
> 
> It's a matter of taste, but yes...

Absolutely :)

> 
>> Also, if the backend is an HDD, do we want to allow the user to configure a
>> block size that is less than the *physical* block size ? Performance will
>> suffer on regular HDDs and writes may fail with SMR HDDs.
> 
> I'm not sure whether it's worth putting more smartness into this logic.

This may be nice to avoid users shooting themselves in the foot with a bad
setup and us having to deal with bad performance complaints...
If we do not do anything special, we will be stuck with it as a more
restrictive setup later may break some (bad) user setups. That is why I raised
the point :)

>>>  	ns->pi_type = 0;
>>>  	ns->metadata_size = 0;
>>> diff --git a/drivers/nvme/target/io-cmd-file.c b/drivers/nvme/target/io-cmd-file.c
>>> index 2d068439b129c..a4066b5a1dc97 100644
>>> --- a/drivers/nvme/target/io-cmd-file.c
>>> +++ b/drivers/nvme/target/io-cmd-file.c
>>> @@ -49,12 +49,28 @@ int nvmet_file_ns_enable(struct nvmet_ns *ns)
>>>  
>>>  	nvmet_file_ns_revalidate(ns);
>>>  
>>> -	/*
>>> -	 * i_blkbits can be greater than the universally accepted upper bound,
>>> -	 * so make sure we export a sane namespace lba_shift.
>>> -	 */
>>> -	ns->blksize_shift = min_t(u8,
>>> -			file_inode(ns->file)->i_blkbits, 12);
>>> +	if (ns->blksize_shift) {
>>> +		if (!ns->buffered_io) {
>>> +			struct block_device *sb_bdev = ns->file->f_mapping->host->i_sb->s_bdev;
>>> +			struct kstat st;
>>> +
>>> +			if (!vfs_getattr(&ns->file->f_path, &st, STATX_DIOALIGN, 0) &&
>>> +			    (st.result_mask & STATX_DIOALIGN) &&
>>> +			    (1 << ns->blksize_shift) < st.dio_offset_align)
>>> +				return -EINVAL;
>>> +
>>> +			if (sb_bdev && (1 << ns->blksize_shift < bdev_logical_block_size(sb_bdev)))
>>> +				return -EINVAL;
>>
>> I am confused... This is going to check both... But if you got STATX_DIOALIGN
>> and it is OK, you do not need (and probably should not) do the second if, no ?
> 
> I was not sure about that.
> Is it guaranteed that STATX_DIOALIGN returns something sane?

If it is defined by the FS, yes. But it may not be defined, so in that case,
you have to use the fallback of the bdev block size.


-- 
Damien Le Moal
Western Digital Research


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] nvmet: Make blksize_shift configurable
  2025-07-01  7:32     ` Damien Le Moal
@ 2025-07-01 15:00       ` Richard Weinberger
  0 siblings, 0 replies; 7+ messages in thread
From: Richard Weinberger @ 2025-07-01 15:00 UTC (permalink / raw)
  To: Richard Weinberger, linux-nvme, upstream, Damien Le Moal
  Cc: linux-kernel, kch, sagi, hch, upstream+nvme

On Dienstag, 1. Juli 2025 09:32 Damien Le Moal wrote:
> > The initial intention of this patch was exposing the blksize_shift property.
> > If we want to expose this as more user friendly, I'm fine with it.
> > Maybe "minimum_io_size"?
> 
> That likely will be confusing with the existing device limit io_min. I think
> block_size is clear.

Ok!

> > 
> >> Also, if the backend is an HDD, do we want to allow the user to configure a
> >> block size that is less than the *physical* block size ? Performance will
> >> suffer on regular HDDs and writes may fail with SMR HDDs.
> > 
> > I'm not sure whether it's worth putting more smartness into this logic.
> 
> This may be nice to avoid users shooting themselves in the foot with a bad
> setup and us having to deal with bad performance complaints...
> If we do not do anything special, we will be stuck with it as a more
> restrictive setup later may break some (bad) user setups. That is why I raised
> the point :)

Detecting whether the backend is an HDD should work via bdev_nonrot().
So, we could issue a warning if bdev_nonrot() is true and the configured
block size is less than bdev_physical_block_size()?

> >> I am confused... This is going to check both... But if you got STATX_DIOALIGN
> >> and it is OK, you do not need (and probably should not) do the second if, no ?
> > 
> > I was not sure about that.
> > Is it guaranteed that STATX_DIOALIGN returns something sane?
> 
> If it is defined by the FS, yes. But it may not be defined, so in that case,
> you have to use the fallback of the bdev block size.

Ok!

LG,
//richard

-- 
sigma star gmbh | Eduard-Bodem-Gasse 6, 6020 Innsbruck, AUT UID/VAT Nr:
ATU 66964118 | FN: 374287y




^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] nvmet: Make blksize_shift configurable
  2025-07-01  0:34 ` Damien Le Moal
  2025-07-01  7:09   ` Richard Weinberger
@ 2025-07-03  8:54   ` Christoph Hellwig
  2025-07-03  9:29     ` Richard Weinberger
  1 sibling, 1 reply; 7+ messages in thread
From: Christoph Hellwig @ 2025-07-03  8:54 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Richard Weinberger, linux-nvme, linux-kernel, kch, sagi, hch,
	upstream+nvme

On Tue, Jul 01, 2025 at 09:34:00AM +0900, Damien Le Moal wrote:
> Even if internally you use the block size bit shift, I think it would be better
> if the user facing interface is the block size as that is much easier to
> manipulate without having to remember the exponent for powers of 2 values :)

Yeah, block sizes are probably a nice user interface indeed.

> 		pr_err("Configured blksize needs to be at least %u for device %s\n",
> 			bdev_logical_block_size(ns->bdev),
> 			ns->device_path);
> 		return -EINVAL;
> 	}
> 
> Also, if the backend is an HDD, do we want to allow the user to configure a
> block size that is less than the *physical* block size ? Performance will
> suffer on regular HDDs and writes may fail with SMR HDDs.

I don't think we should babysit the user like that, just like we allow
creating file systems with block size smaller than the physical block
size.

> > +			if (!vfs_getattr(&ns->file->f_path, &st, STATX_DIOALIGN, 0) &&
> > +			    (st.result_mask & STATX_DIOALIGN) &&
> > +			    (1 << ns->blksize_shift) < st.dio_offset_align)
> > +				return -EINVAL;
> > +
> > +			if (sb_bdev && (1 << ns->blksize_shift < bdev_logical_block_size(sb_bdev)))
> > +				return -EINVAL;
> 
> I am confused... This is going to check both... But if you got STATX_DIOALIGN
> and it is OK, you do not need (and probably should not) do the second if, no ?
> 
> Also, the second condition of the second if is essentially the same check as
> for the block dev case. So maybe reuse that by creating a small helper function ?

This code is copy and pasted from loop, so it's originally my fault.
It just missed the comment that explains why it is there:

	/*
         * In a perfect world this wouldn't be needed, but as of Linux 6.13 only
         * a handful of file systems support the STATX_DIOALIGN flag.
         */

The situation has unfortunately not improved since 6.13.  Maybe we
just need to do a sweep and fix this up?



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH v2] nvmet: Make blksize_shift configurable
  2025-07-03  8:54   ` Christoph Hellwig
@ 2025-07-03  9:29     ` Richard Weinberger
  0 siblings, 0 replies; 7+ messages in thread
From: Richard Weinberger @ 2025-07-03  9:29 UTC (permalink / raw)
  To: Damien Le Moal, upstream
  Cc: Richard Weinberger, linux-nvme, linux-kernel, kch, sagi, hch,
	upstream+nvme, Christoph Hellwig

On Donnerstag, 3. Juli 2025 10:54 Christoph Hellwig wrote:
> On Tue, Jul 01, 2025 at 09:34:00AM +0900, Damien Le Moal wrote:
> > Even if internally you use the block size bit shift, I think it would be better
> > if the user facing interface is the block size as that is much easier to
> > manipulate without having to remember the exponent for powers of 2 values :)
> 
> Yeah, block sizes are probably a nice user interface indeed.

Ok!

> 
> > 		pr_err("Configured blksize needs to be at least %u for device %s\n",
> > 			bdev_logical_block_size(ns->bdev),
> > 			ns->device_path);
> > 		return -EINVAL;
> > 	}
> > 
> > Also, if the backend is an HDD, do we want to allow the user to configure a
> > block size that is less than the *physical* block size ? Performance will
> > suffer on regular HDDs and writes may fail with SMR HDDs.
> 
> I don't think we should babysit the user like that, just like we allow
> creating file systems with block size smaller than the physical block
> size.

I'm fine with either way.

> 
> > > +			if (!vfs_getattr(&ns->file->f_path, &st, STATX_DIOALIGN, 0) &&
> > > +			    (st.result_mask & STATX_DIOALIGN) &&
> > > +			    (1 << ns->blksize_shift) < st.dio_offset_align)
> > > +				return -EINVAL;
> > > +
> > > +			if (sb_bdev && (1 << ns->blksize_shift < bdev_logical_block_size(sb_bdev)))
> > > +				return -EINVAL;
> > 
> > I am confused... This is going to check both... But if you got STATX_DIOALIGN
> > and it is OK, you do not need (and probably should not) do the second if, no ?
> > 
> > Also, the second condition of the second if is essentially the same check as
> > for the block dev case. So maybe reuse that by creating a small helper function ?
> 
> This code is copy and pasted from loop, so it's originally my fault.
> It just missed the comment that explains why it is there:
> 
> 	/*
>          * In a perfect world this wouldn't be needed, but as of Linux 6.13 only
>          * a handful of file systems support the STATX_DIOALIGN flag.
>          */

Well, my code is the other way around. I checks the logical block size of a device
even if STATX_DIOALIGN succeeded, which is a bit too paranoid I guess.

Thanks,
//richard

-- 
sigma star gmbh | Eduard-Bodem-Gasse 6, 6020 Innsbruck, AUT UID/VAT Nr:
ATU 66964118 | FN: 374287y




^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2025-07-03  9:37 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-30 19:13 [PATCH v2] nvmet: Make blksize_shift configurable Richard Weinberger
2025-07-01  0:34 ` Damien Le Moal
2025-07-01  7:09   ` Richard Weinberger
2025-07-01  7:32     ` Damien Le Moal
2025-07-01 15:00       ` Richard Weinberger
2025-07-03  8:54   ` Christoph Hellwig
2025-07-03  9:29     ` Richard Weinberger

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).