[RFC PATCH] nvmet: Make blksize

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [RFC PATCH] nvmet: Make blksize_shift configurable
@ 2025-04-18  9:08 Richard Weinberger
  2025-04-18  9:37 ` Damien Le Moal
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Richard Weinberger @ 2025-04-18  9:08 UTC (permalink / raw)
  To: linux-nvme
  Cc: linux-kernel, kch, sagi, hch, upstream+nvme, Richard Weinberger

Currently, the block size is automatically configured, and for
file-backed namespaces it is likely to be 4K.
While this is a reasonable default for modern storage, it can
cause confusion if someone wants to export a pre-created disk image
that uses a 512-byte block size.
As a result, partition parsing will fail.

So, just like we already do for the loop block device, let the user
configure the block size if they know better.

Signed-off-by: Richard Weinberger <richard@nod.at>
---
 drivers/nvme/target/configfs.c    | 30 ++++++++++++++++++++++++++++++
 drivers/nvme/target/io-cmd-bdev.c |  4 +++-
 drivers/nvme/target/io-cmd-file.c | 15 +++++++++------
 3 files changed, 42 insertions(+), 7 deletions(-)

diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
index e44ef69dffc2..2fd9cc3b1d00 100644
--- a/drivers/nvme/target/configfs.c
+++ b/drivers/nvme/target/configfs.c
@@ -797,6 +797,35 @@ static ssize_t nvmet_ns_resv_enable_store(struct config_item *item,
 }
 CONFIGFS_ATTR(nvmet_ns_, resv_enable);
 
+static ssize_t nvmet_ns_blksize_shift_show(struct config_item *item, char *page)
+{
+	return sysfs_emit(page, "%d\n", to_nvmet_ns(item)->blksize_shift);
+}
+
+static ssize_t nvmet_ns_blksize_shift_store(struct config_item *item,
+		const char *page, size_t count)
+{
+	struct nvmet_ns *ns = to_nvmet_ns(item);
+	u32 shift;
+	int ret;
+
+	ret = kstrtou32(page, 0, &shift);
+	if (ret)
+		return ret;
+
+	mutex_lock(&ns->subsys->lock);
+	if (ns->enabled) {
+		pr_err("the ns:%d is already enabled.\n", ns->nsid);
+		mutex_unlock(&ns->subsys->lock);
+		return -EINVAL;
+	}
+	ns->blksize_shift = shift;
+	mutex_unlock(&ns->subsys->lock);
+
+	return count;
+}
+CONFIGFS_ATTR(nvmet_ns_, blksize_shift);
+
 static struct configfs_attribute *nvmet_ns_attrs[] = {
 	&nvmet_ns_attr_device_path,
 	&nvmet_ns_attr_device_nguid,
@@ -806,6 +835,7 @@ static struct configfs_attribute *nvmet_ns_attrs[] = {
 	&nvmet_ns_attr_buffered_io,
 	&nvmet_ns_attr_revalidate_size,
 	&nvmet_ns_attr_resv_enable,
+	&nvmet_ns_attr_blksize_shift,
 #ifdef CONFIG_PCI_P2PDMA
 	&nvmet_ns_attr_p2pmem,
 #endif
diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c
index 83be0657e6df..a86010af4670 100644
--- a/drivers/nvme/target/io-cmd-bdev.c
+++ b/drivers/nvme/target/io-cmd-bdev.c
@@ -100,7 +100,9 @@ int nvmet_bdev_ns_enable(struct nvmet_ns *ns)
 	}
 	ns->bdev = file_bdev(ns->bdev_file);
 	ns->size = bdev_nr_bytes(ns->bdev);
-	ns->blksize_shift = blksize_bits(bdev_logical_block_size(ns->bdev));
+
+	if (!ns->blksize_shift)
+		ns->blksize_shift = blksize_bits(bdev_logical_block_size(ns->bdev));
 
 	ns->pi_type = 0;
 	ns->metadata_size = 0;
diff --git a/drivers/nvme/target/io-cmd-file.c b/drivers/nvme/target/io-cmd-file.c
index 2d068439b129..5893b64179fb 100644
--- a/drivers/nvme/target/io-cmd-file.c
+++ b/drivers/nvme/target/io-cmd-file.c
@@ -49,12 +49,15 @@ int nvmet_file_ns_enable(struct nvmet_ns *ns)
 
 	nvmet_file_ns_revalidate(ns);
 
-	/*
-	 * i_blkbits can be greater than the universally accepted upper bound,
-	 * so make sure we export a sane namespace lba_shift.
-	 */
-	ns->blksize_shift = min_t(u8,
-			file_inode(ns->file)->i_blkbits, 12);
+	if (!ns->blksize_shift) {
+		/*
+		 * i_blkbits can be greater than the universally accepted
+		 * upper bound, so make sure we export a sane namespace
+		 * lba_shift.
+		 */
+		ns->blksize_shift = min_t(u8,
+				file_inode(ns->file)->i_blkbits, 12);
+	}
 
 	ns->bvec_pool = mempool_create(NVMET_MIN_MPOOL_OBJ, mempool_alloc_slab,
 			mempool_free_slab, nvmet_bvec_cache);
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH] nvmet: Make blksize_shift configurable
  2025-04-18  9:08 [RFC PATCH] nvmet: Make blksize_shift configurable Richard Weinberger
@ 2025-04-18  9:37 ` Damien Le Moal
  2025-04-18  9:56   ` Richard Weinberger
  2025-04-22  6:47 ` Chaitanya Kulkarni
  2025-04-22  6:50 ` Chaitanya Kulkarni
  2 siblings, 1 reply; 8+ messages in thread
From: Damien Le Moal @ 2025-04-18  9:37 UTC (permalink / raw)
  To: Richard Weinberger, linux-nvme
  Cc: linux-kernel, kch, sagi, hch, upstream+nvme

On 4/18/25 18:08, Richard Weinberger wrote:
> Currently, the block size is automatically configured, and for
> file-backed namespaces it is likely to be 4K.
> While this is a reasonable default for modern storage, it can
> cause confusion if someone wants to export a pre-created disk image
> that uses a 512-byte block size.
> As a result, partition parsing will fail.
> 
> So, just like we already do for the loop block device, let the user
> configure the block size if they know better.
> 
> Signed-off-by: Richard Weinberger <richard@nod.at>
> ---
>  drivers/nvme/target/configfs.c    | 30 ++++++++++++++++++++++++++++++
>  drivers/nvme/target/io-cmd-bdev.c |  4 +++-
>  drivers/nvme/target/io-cmd-file.c | 15 +++++++++------
>  3 files changed, 42 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
> index e44ef69dffc2..2fd9cc3b1d00 100644
> --- a/drivers/nvme/target/configfs.c
> +++ b/drivers/nvme/target/configfs.c
> @@ -797,6 +797,35 @@ static ssize_t nvmet_ns_resv_enable_store(struct config_item *item,
>  }
>  CONFIGFS_ATTR(nvmet_ns_, resv_enable);
>  
> +static ssize_t nvmet_ns_blksize_shift_show(struct config_item *item, char *page)
> +{
> +	return sysfs_emit(page, "%d\n", to_nvmet_ns(item)->blksize_shift);
> +}
> +
> +static ssize_t nvmet_ns_blksize_shift_store(struct config_item *item,
> +		const char *page, size_t count)
> +{
> +	struct nvmet_ns *ns = to_nvmet_ns(item);
> +	u32 shift;
> +	int ret;
> +
> +	ret = kstrtou32(page, 0, &shift);
> +	if (ret)
> +		return ret;
> +
> +	mutex_lock(&ns->subsys->lock);
> +	if (ns->enabled) {
> +		pr_err("the ns:%d is already enabled.\n", ns->nsid);
> +		mutex_unlock(&ns->subsys->lock);
> +		return -EINVAL;
> +	}
> +	ns->blksize_shift = shift;
> +	mutex_unlock(&ns->subsys->lock);
> +
> +	return count;
> +}
> +CONFIGFS_ATTR(nvmet_ns_, blksize_shift);
> +
>  static struct configfs_attribute *nvmet_ns_attrs[] = {
>  	&nvmet_ns_attr_device_path,
>  	&nvmet_ns_attr_device_nguid,
> @@ -806,6 +835,7 @@ static struct configfs_attribute *nvmet_ns_attrs[] = {
>  	&nvmet_ns_attr_buffered_io,
>  	&nvmet_ns_attr_revalidate_size,
>  	&nvmet_ns_attr_resv_enable,
> +	&nvmet_ns_attr_blksize_shift,
>  #ifdef CONFIG_PCI_P2PDMA
>  	&nvmet_ns_attr_p2pmem,
>  #endif
> diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c
> index 83be0657e6df..a86010af4670 100644
> --- a/drivers/nvme/target/io-cmd-bdev.c
> +++ b/drivers/nvme/target/io-cmd-bdev.c
> @@ -100,7 +100,9 @@ int nvmet_bdev_ns_enable(struct nvmet_ns *ns)
>  	}
>  	ns->bdev = file_bdev(ns->bdev_file);
>  	ns->size = bdev_nr_bytes(ns->bdev);
> -	ns->blksize_shift = blksize_bits(bdev_logical_block_size(ns->bdev));
> +
> +	if (!ns->blksize_shift)
> +		ns->blksize_shift = blksize_bits(bdev_logical_block_size(ns->bdev));

If the user set logical block size is smaller than the block dev logical block
size, this is not going to work... No ? Am I missing something ?

>  
>  	ns->pi_type = 0;
>  	ns->metadata_size = 0;
> diff --git a/drivers/nvme/target/io-cmd-file.c b/drivers/nvme/target/io-cmd-file.c
> index 2d068439b129..5893b64179fb 100644
> --- a/drivers/nvme/target/io-cmd-file.c
> +++ b/drivers/nvme/target/io-cmd-file.c
> @@ -49,12 +49,15 @@ int nvmet_file_ns_enable(struct nvmet_ns *ns)
>  
>  	nvmet_file_ns_revalidate(ns);
>  
> -	/*
> -	 * i_blkbits can be greater than the universally accepted upper bound,
> -	 * so make sure we export a sane namespace lba_shift.
> -	 */
> -	ns->blksize_shift = min_t(u8,
> -			file_inode(ns->file)->i_blkbits, 12);
> +	if (!ns->blksize_shift) {
> +		/*
> +		 * i_blkbits can be greater than the universally accepted
> +		 * upper bound, so make sure we export a sane namespace
> +		 * lba_shift.
> +		 */
> +		ns->blksize_shift = min_t(u8,
> +				file_inode(ns->file)->i_blkbits, 12);

This will work for any block size, regardless of the FS block size, but only if
ns->buffered_io is true. Doesn't this require some more checks with regards to
O_DIRECT (!ns->buffered_io case) ?

> +	}
>  
>  	ns->bvec_pool = mempool_create(NVMET_MIN_MPOOL_OBJ, mempool_alloc_slab,
>  			mempool_free_slab, nvmet_bvec_cache);


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH] nvmet: Make blksize_shift configurable
  2025-04-18  9:37 ` Damien Le Moal
@ 2025-04-18  9:56   ` Richard Weinberger
  2025-04-18 10:23     ` Damien Le Moal
  2025-04-22  8:16     ` Christoph Hellwig
  0 siblings, 2 replies; 8+ messages in thread
From: Richard Weinberger @ 2025-04-18  9:56 UTC (permalink / raw)
  To: linux-nvme, linux-kernel
  Cc: Richard Weinberger, kch, sagi, hch, upstream+nvme, Damien Le Moal

On Freitag, 18. April 2025 11:37 'Damien Le Moal' via upstream wrote:
> > +	if (!ns->blksize_shift)
> > +		ns->blksize_shift = blksize_bits(bdev_logical_block_size(ns->bdev));
> 
> If the user set logical block size is smaller than the block dev logical block
> size, this is not going to work... No ? Am I missing something ?

Likely, yes.
TBH, I'm not sure whether it makes actually sense for the bdev case to make
blksize_shift configurable.
The case I see most benefit is the backing file case.

> > +	if (!ns->blksize_shift) {
> > +		/*
> > +		 * i_blkbits can be greater than the universally accepted
> > +		 * upper bound, so make sure we export a sane namespace
> > +		 * lba_shift.
> > +		 */
> > +		ns->blksize_shift = min_t(u8,
> > +				file_inode(ns->file)->i_blkbits, 12);
> 
> This will work for any block size, regardless of the FS block size, but only if
> ns->buffered_io is true. Doesn't this require some more checks with regards to
> O_DIRECT (!ns->buffered_io case) ?

Good catch. I'll add a check.

It's also worth discussing whether we should limit blksize_shift to a specific
range. Right now, any shift is accepted, and it is up to the user to
use a sane value.

Thanks,
//richard

-- 
sigma star gmbh | Eduard-Bodem-Gasse 6, 6020 Innsbruck, AUT UID/VAT Nr:
ATU 66964118 | FN: 374287y



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH] nvmet: Make blksize_shift configurable
  2025-04-18  9:56   ` Richard Weinberger
@ 2025-04-18 10:23     ` Damien Le Moal
  2025-04-18 10:53       ` Richard Weinberger
  2025-04-22  8:16     ` Christoph Hellwig
  1 sibling, 1 reply; 8+ messages in thread
From: Damien Le Moal @ 2025-04-18 10:23 UTC (permalink / raw)
  To: Richard Weinberger, linux-nvme, linux-kernel
  Cc: Richard Weinberger, kch, sagi, hch, upstream+nvme

On 4/18/25 18:56, Richard Weinberger wrote:
> On Freitag, 18. April 2025 11:37 'Damien Le Moal' via upstream wrote:
>>> +	if (!ns->blksize_shift)
>>> +		ns->blksize_shift = blksize_bits(bdev_logical_block_size(ns->bdev));
>>
>> If the user set logical block size is smaller than the block dev logical block
>> size, this is not going to work... No ? Am I missing something ?
> 
> Likely, yes.
> TBH, I'm not sure whether it makes actually sense for the bdev case to make
> blksize_shift configurable.

Probably not... I do understand the value for the file case though.

> The case I see most benefit is the backing file case.
> 
>>> +	if (!ns->blksize_shift) {
>>> +		/*
>>> +		 * i_blkbits can be greater than the universally accepted
>>> +		 * upper bound, so make sure we export a sane namespace
>>> +		 * lba_shift.
>>> +		 */
>>> +		ns->blksize_shift = min_t(u8,
>>> +				file_inode(ns->file)->i_blkbits, 12);
>>
>> This will work for any block size, regardless of the FS block size, but only if
>> ns->buffered_io is true. Doesn't this require some more checks with regards to
>> O_DIRECT (!ns->buffered_io case) ?
> 
> Good catch. I'll add a check.

And by the way, you need to check for STATX_DIOALIGN since some FS (e.g. xfs)
can handle direct IOs that are not aligned to the FS block size. See the recent
changes in drivers/block/loop.c to improve direct IO handling, specifically, the
function loop_query_min_dio_size().

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH] nvmet: Make blksize_shift configurable
  2025-04-18 10:23     ` Damien Le Moal
@ 2025-04-18 10:53       ` Richard Weinberger
  0 siblings, 0 replies; 8+ messages in thread
From: Richard Weinberger @ 2025-04-18 10:53 UTC (permalink / raw)
  To: linux-nvme, linux-kernel, Damien Le Moal
  Cc: Richard Weinberger, kch, sagi, hch, upstream+nvme

On Freitag, 18. April 2025 12:23 Damien Le Moal wrote:
> On 4/18/25 18:56, Richard Weinberger wrote:
> > On Freitag, 18. April 2025 11:37 'Damien Le Moal' via upstream wrote:
> >>> +	if (!ns->blksize_shift)
> >>> +		ns->blksize_shift = blksize_bits(bdev_logical_block_size(ns->bdev));
> >>
> >> If the user set logical block size is smaller than the block dev logical block
> >> size, this is not going to work... No ? Am I missing something ?
> > 
> > Likely, yes.
> > TBH, I'm not sure whether it makes actually sense for the bdev case to make
> > blksize_shift configurable.
> 
> Probably not... I do understand the value for the file case though.

The use case is exposing ready-to-use cloud images like:
https://cloud.debian.org/images/cloud/bookworm/20250416-2084/debian-12-generic-amd64-20250416-2084.raw

via NVme-of TCP.
Yesterday I did so and figured that no GPT partitions got detected because of different block sizes.
Setting the block size in nvmet to 512 made it work.

If there are better ways to achieve the same, I'm open for suggestions.

> 
> > The case I see most benefit is the backing file case.
> > 
> >>> +	if (!ns->blksize_shift) {
> >>> +		/*
> >>> +		 * i_blkbits can be greater than the universally accepted
> >>> +		 * upper bound, so make sure we export a sane namespace
> >>> +		 * lba_shift.
> >>> +		 */
> >>> +		ns->blksize_shift = min_t(u8,
> >>> +				file_inode(ns->file)->i_blkbits, 12);
> >>
> >> This will work for any block size, regardless of the FS block size, but only if
> >> ns->buffered_io is true. Doesn't this require some more checks with regards to
> >> O_DIRECT (!ns->buffered_io case) ?
> > 
> > Good catch. I'll add a check.
> 
> And by the way, you need to check for STATX_DIOALIGN since some FS (e.g. xfs)
> can handle direct IOs that are not aligned to the FS block size. See the recent
> changes in drivers/block/loop.c to improve direct IO handling, specifically, the
> function loop_query_min_dio_size().

Ok!

Thanks,
//richard

-- 
sigma star gmbh | Eduard-Bodem-Gasse 6, 6020 Innsbruck, AUT UID/VAT Nr:
ATU 66964118 | FN: 374287y



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH] nvmet: Make blksize_shift configurable
  2025-04-18  9:08 [RFC PATCH] nvmet: Make blksize_shift configurable Richard Weinberger
  2025-04-18  9:37 ` Damien Le Moal
@ 2025-04-22  6:47 ` Chaitanya Kulkarni
  2025-04-22  6:50 ` Chaitanya Kulkarni
  2 siblings, 0 replies; 8+ messages in thread
From: Chaitanya Kulkarni @ 2025-04-22  6:47 UTC (permalink / raw)
  To: Richard Weinberger, linux-nvme@lists.infradead.org
  Cc: linux-kernel@vger.kernel.org, Chaitanya Kulkarni,
	sagi@grimberg.me, hch@lst.de, upstream+nvme@sigma-star.at

On 4/18/25 02:08, Richard Weinberger wrote:
> +static ssize_t nvmet_ns_blksize_shift_store(struct config_item *item,
> +		const char *page, size_t count)
> +{
> +	struct nvmet_ns *ns = to_nvmet_ns(item);
> +	u32 shift;
> +	int ret;
> +
> +	ret = kstrtou32(page, 0, &shift);
> +	if (ret)
> +		return ret;
> +
> +	mutex_lock(&ns->subsys->lock);
> +	if (ns->enabled) {
> +		pr_err("the ns:%d is already enabled.\n", ns->nsid);
> +		mutex_unlock(&ns->subsys->lock);
> +		return -EINVAL;
> +	}
> +	ns->blksize_shift = shift;
> +	mutex_unlock(&ns->subsys->lock);
> +
> +	return count;
> +}

before we overwrite the old value in ns->blksize_shift don't we need
to make sure new value shift is in the acceptable range ? in case value
is not within the acceptable range then we need to return -EINVAL ?

Also, do we need to #define min and max values for the acceptable
range?

-ck



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH] nvmet: Make blksize_shift configurable
  2025-04-18  9:08 [RFC PATCH] nvmet: Make blksize_shift configurable Richard Weinberger
  2025-04-18  9:37 ` Damien Le Moal
  2025-04-22  6:47 ` Chaitanya Kulkarni
@ 2025-04-22  6:50 ` Chaitanya Kulkarni
  2 siblings, 0 replies; 8+ messages in thread
From: Chaitanya Kulkarni @ 2025-04-22  6:50 UTC (permalink / raw)
  To: Richard Weinberger, linux-nvme@lists.infradead.org
  Cc: linux-kernel@vger.kernel.org, Chaitanya Kulkarni,
	sagi@grimberg.me, hch@lst.de, upstream+nvme@sigma-star.at

On 4/18/25 02:08, Richard Weinberger wrote:
> +static ssize_t nvmet_ns_blksize_shift_show(struct config_item *item, char *page)
> +{
> +	return sysfs_emit(page, "%d\n", to_nvmet_ns(item)->blksize_shift);
> +}
> +

I believe ns->blksize_shift is u32 to do we need %u ? unless we need to
re-interpret that value in int which I failed to understand.

if not please ignore this comment ...

-ck



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC PATCH] nvmet: Make blksize_shift configurable
  2025-04-18  9:56   ` Richard Weinberger
  2025-04-18 10:23     ` Damien Le Moal
@ 2025-04-22  8:16     ` Christoph Hellwig
  1 sibling, 0 replies; 8+ messages in thread
From: Christoph Hellwig @ 2025-04-22  8:16 UTC (permalink / raw)
  To: Richard Weinberger
  Cc: linux-nvme, linux-kernel, Richard Weinberger, kch, sagi, hch,
	upstream+nvme, Damien Le Moal

On Fri, Apr 18, 2025 at 11:56:30AM +0200, Richard Weinberger wrote:
> TBH, I'm not sure whether it makes actually sense for the bdev case to make
> blksize_shift configurable.

> It's also worth discussing whether we should limit blksize_shift to a specific
> range. Right now, any shift is accepted, and it is up to the user to
> use a sane value.

It should have a hard lower bound of 512 bytes (9) and for direct I/O and
the bdev backend the reported dio alignment.

The upper bound is capped by the nvme_lbaf field being a 8-bit value on
the wire, no real need to require anything lower than that probably.


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2025-04-22  8:16 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-04-18  9:08 [RFC PATCH] nvmet: Make blksize_shift configurable Richard Weinberger
2025-04-18  9:37 ` Damien Le Moal
2025-04-18  9:56   ` Richard Weinberger
2025-04-18 10:23     ` Damien Le Moal
2025-04-18 10:53       ` Richard Weinberger
2025-04-22  8:16     ` Christoph Hellwig
2025-04-22  6:47 ` Chaitanya Kulkarni
2025-04-22  6:50 ` Chaitanya Kulkarni

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox