From: Damien Le Moal <dlemoal@kernel.org>
To: Richard Weinberger <richard@nod.at>, linux-nvme@lists.infradead.org
Cc: linux-kernel@vger.kernel.org, kch@nvidia.com, sagi@grimberg.me,
hch@lst.de, upstream+nvme@sigma-star.at
Subject: Re: [PATCH v2] nvmet: Make blksize_shift configurable
Date: Tue, 1 Jul 2025 09:34:00 +0900 [thread overview]
Message-ID: <132c1bdf-e100-4e3a-883f-27f9e9b78020@kernel.org> (raw)
In-Reply-To: <20250630191341.1263000-1-richard@nod.at>
On 7/1/25 4:13 AM, Richard Weinberger wrote:
> Currently, the block size is automatically configured, and for
> file-backed namespaces it is likely to be 4K.
> While this is a reasonable default for modern storage, it can
> cause confusion if someone wants to export a pre-created disk image
> that uses a 512-byte block size.
> As a result, partition parsing will fail.
>
> So, just like we already do for the loop block device, let the user
> configure the block size if they know better.
Hmm... That fine with me but this explanation does not match what the patch
does: you allow configuring the block size bit shift, not the size. That is not
super user friendly :)
Even if internally you use the block size bit shift, I think it would be better
if the user facing interface is the block size as that is much easier to
manipulate without having to remember the exponent for powers of 2 values :)
>
> Signed-off-by: Richard Weinberger <richard@nod.at>
> ---
> Changes since v1 (RFC)[0]:
>
> - Make sure blksize_shift is in general within reason
> - In the bdev case and when using direct IO, blksize_shift has to be
> smaller than the logical block it the device
> - In the file case and when using direct IO try to use STATX_DIOALIGN,
> just like the loop device does
>
> [0] https://lore.kernel.org/linux-nvme/20250418090834.2755289-1-richard@nod.at/
>
> Thanks,
> //richard
> ---
> drivers/nvme/target/configfs.c | 37 +++++++++++++++++++++++++++++++
> drivers/nvme/target/io-cmd-bdev.c | 13 ++++++++++-
> drivers/nvme/target/io-cmd-file.c | 28 ++++++++++++++++++-----
> 3 files changed, 71 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/nvme/target/configfs.c b/drivers/nvme/target/configfs.c
> index e44ef69dffc24..26175c37374ab 100644
> --- a/drivers/nvme/target/configfs.c
> +++ b/drivers/nvme/target/configfs.c
> @@ -797,6 +797,42 @@ static ssize_t nvmet_ns_resv_enable_store(struct config_item *item,
> }
> CONFIGFS_ATTR(nvmet_ns_, resv_enable);
>
> +static ssize_t nvmet_ns_blksize_shift_show(struct config_item *item, char *page)
As mentioned above, I think this should be nvmet_ns_blksize_show().
> +{
> + return sysfs_emit(page, "%u\n", to_nvmet_ns(item)->blksize_shift);
And you can do:
return sysfs_emit(page, "%u\n",
1U << to_nvmet_ns(item)->blksize_shift);
> +}
> +
> +static ssize_t nvmet_ns_blksize_shift_store(struct config_item *item,
> + const char *page, size_t count)
Similar here: nvmet_ns_blksize_store()
> +{
> + struct nvmet_ns *ns = to_nvmet_ns(item);
> + u32 shift;
> + int ret;
> +
> + ret = kstrtou32(page, 0, &shift);
> + if (ret)
> + return ret;
> +
> + /*
> + * Make sure block size is within reason, something between 512 and
> + * BLK_MAX_BLOCK_SIZE.
> + */
> + if (shift < 9 || shift > 16)
> + return -EINVAL;
And this would be simpler:
if (blksz < SECTOR_SIZE || blksz > BLK_MAX_BLOCK_SIZE ||
!is_power_of_2(blksz))
return -EINVAL;
> +
> + mutex_lock(&ns->subsys->lock);
> + if (ns->enabled) {
> + pr_err("the ns:%d is already enabled.\n", ns->nsid);
> + mutex_unlock(&ns->subsys->lock);
> + return -EINVAL;
> + }
> + ns->blksize_shift = shift;
and here:
ns->blksize_shift = ilog2(blksz);
> + mutex_unlock(&ns->subsys->lock);
> +
> + return count;
> +}
> +CONFIGFS_ATTR(nvmet_ns_, blksize_shift);
> +
> static struct configfs_attribute *nvmet_ns_attrs[] = {
> &nvmet_ns_attr_device_path,
> &nvmet_ns_attr_device_nguid,
> @@ -806,6 +842,7 @@ static struct configfs_attribute *nvmet_ns_attrs[] = {
> &nvmet_ns_attr_buffered_io,
> &nvmet_ns_attr_revalidate_size,
> &nvmet_ns_attr_resv_enable,
> + &nvmet_ns_attr_blksize_shift,
> #ifdef CONFIG_PCI_P2PDMA
> &nvmet_ns_attr_p2pmem,
> #endif
> diff --git a/drivers/nvme/target/io-cmd-bdev.c b/drivers/nvme/target/io-cmd-bdev.c
> index eba42df2f8215..be39837d4d792 100644
> --- a/drivers/nvme/target/io-cmd-bdev.c
> +++ b/drivers/nvme/target/io-cmd-bdev.c
> @@ -77,6 +77,7 @@ static void nvmet_bdev_ns_enable_integrity(struct nvmet_ns *ns)
>
> int nvmet_bdev_ns_enable(struct nvmet_ns *ns)
> {
> + int bdev_blksize_shift;
> int ret;
>
> /*
> @@ -100,7 +101,17 @@ int nvmet_bdev_ns_enable(struct nvmet_ns *ns)
> }
> ns->bdev = file_bdev(ns->bdev_file);
> ns->size = bdev_nr_bytes(ns->bdev);
> - ns->blksize_shift = blksize_bits(bdev_logical_block_size(ns->bdev));
> + bdev_blksize_shift = blksize_bits(bdev_logical_block_size(ns->bdev));
> +
> + if (ns->blksize_shift) {
> + if (ns->blksize_shift < bdev_blksize_shift) {
> + pr_err("Configured blksize_shift needs to be at least %d for device %s\n",
> + bdev_blksize_shift, ns->device_path);
> + return -EINVAL;
> + }
> + } else {
> + ns->blksize_shift = bdev_blksize_shift;
> + }
Nit: to avoid the indented if, may be write this like this: ?
if (!ns->blksize_shift)
ns->blksize_shift = bdev_blksize_shift;
if (ns->blksize_shift < bdev_blksize_shift) {
pr_err("Configured blksize needs to be at least %u for device %s\n",
bdev_logical_block_size(ns->bdev),
ns->device_path);
return -EINVAL;
}
Also, if the backend is an HDD, do we want to allow the user to configure a
block size that is less than the *physical* block size ? Performance will
suffer on regular HDDs and writes may fail with SMR HDDs.
>
> ns->pi_type = 0;
> ns->metadata_size = 0;
> diff --git a/drivers/nvme/target/io-cmd-file.c b/drivers/nvme/target/io-cmd-file.c
> index 2d068439b129c..a4066b5a1dc97 100644
> --- a/drivers/nvme/target/io-cmd-file.c
> +++ b/drivers/nvme/target/io-cmd-file.c
> @@ -49,12 +49,28 @@ int nvmet_file_ns_enable(struct nvmet_ns *ns)
>
> nvmet_file_ns_revalidate(ns);
>
> - /*
> - * i_blkbits can be greater than the universally accepted upper bound,
> - * so make sure we export a sane namespace lba_shift.
> - */
> - ns->blksize_shift = min_t(u8,
> - file_inode(ns->file)->i_blkbits, 12);
> + if (ns->blksize_shift) {
> + if (!ns->buffered_io) {
> + struct block_device *sb_bdev = ns->file->f_mapping->host->i_sb->s_bdev;
> + struct kstat st;
> +
> + if (!vfs_getattr(&ns->file->f_path, &st, STATX_DIOALIGN, 0) &&
> + (st.result_mask & STATX_DIOALIGN) &&
> + (1 << ns->blksize_shift) < st.dio_offset_align)
> + return -EINVAL;
> +
> + if (sb_bdev && (1 << ns->blksize_shift < bdev_logical_block_size(sb_bdev)))
> + return -EINVAL;
I am confused... This is going to check both... But if you got STATX_DIOALIGN
and it is OK, you do not need (and probably should not) do the second if, no ?
Also, the second condition of the second if is essentially the same check as
for the block dev case. So maybe reuse that by creating a small helper function ?
> + }
> + } else {
> + /*
> + * i_blkbits can be greater than the universally accepted
> + * upper bound, so make sure we export a sane namespace
> + * lba_shift.
> + */
> + ns->blksize_shift = min_t(u8,
> + file_inode(ns->file)->i_blkbits, 12);
> + }
It feels like this entire hunk should be a helper function as that would allow
making it a lot more readable with early returns. This code here whould be
something like:
ret = nvmet_file_set_ns_blksize_shift(ns);
if (ret)
return ret;
>
> ns->bvec_pool = mempool_create(NVMET_MIN_MPOOL_OBJ, mempool_alloc_slab,
> mempool_free_slab, nvmet_bvec_cache);
--
Damien Le Moal
Western Digital Research
next prev parent reply other threads:[~2025-07-01 0:36 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-06-30 19:13 [PATCH v2] nvmet: Make blksize_shift configurable Richard Weinberger
2025-07-01 0:34 ` Damien Le Moal [this message]
2025-07-01 7:09 ` Richard Weinberger
2025-07-01 7:32 ` Damien Le Moal
2025-07-01 15:00 ` Richard Weinberger
2025-07-03 8:54 ` Christoph Hellwig
2025-07-03 9:29 ` Richard Weinberger
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=132c1bdf-e100-4e3a-883f-27f9e9b78020@kernel.org \
--to=dlemoal@kernel.org \
--cc=hch@lst.de \
--cc=kch@nvidia.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-nvme@lists.infradead.org \
--cc=richard@nod.at \
--cc=sagi@grimberg.me \
--cc=upstream+nvme@sigma-star.at \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.