* [PATCH v1 0/8] Enforce uniform metadata settings for ns head
@ 2024-01-22 14:56 Max Gurtovoy
2024-01-22 14:56 ` [PATCH 1/8] nvme: use Independent ID-NS only for unknown cmd sets Max Gurtovoy
` (7 more replies)
0 siblings, 8 replies; 36+ messages in thread
From: Max Gurtovoy @ 2024-01-22 14:56 UTC (permalink / raw)
To: kbusch, hch, sagi, linux-nvme
Cc: oren, israelr, dwagner, oevron, Max Gurtovoy
Hi Christoph/Sagi/Keith,
This patch series makes the metadata capabilities and definitions of a
namespace and its head uniform. It fixes a situation where a namespace
with metadata capabilities is exposed via one port to a host controller
without metadata offload support and via a second port to a host
controller with metadata offload support. In this case, we will create
separate ns_heads and won't use the nvme-multipath umbrella for these
namespaces.
We've also developed patches that allow the user more flexible control
over the metadata offload enablement for a specific controller (this
control uses a new nvme-cli flag that is introduced in a separate
series).
For example:
"nvme connect --disallow_pi --transport=rdma --traddr=10.0.1.1 --trsvcid=4420 --nqn=test-nvme"
We also handle the case where the metadata capabilities of the target
were changed for a namespace during the lifecycle of the controller.
In this case we will remove the namespace from the ns_head and try to
create a new one (this will create a new ns_head as well).
While we're here, we also fixed the transfer length calculation for the
case where the block layer doesn't generate/verify metadata (in this
case only the HW will do it, so the newly calculated on-the-wire
transfer length should be written into the command's SGL length field).
This series is based on nvme-6.8 branch with 2 fixes added manually:
1. nvme-core: fix a memory leak in nvme_ns_info_from_identify() [from Maurizio Lombardi]
2. nvme: check for valid nvme_identify_ns() before using it [from Ewan D. Milne]
Israel Rukshin (3):
nvme-rdma: Fix transfer length when write_generate/read_verify are 0
nvme-fabrics: add option to disallow T10-PI offload
nvme-rdma: enable user to disallow T10-PI offload
Max Gurtovoy (4):
nvme: use Independent ID-NS only for unknown cmd sets
nvme: set uniform metadata settings for ns head
nvme: allocate a new namespace if validation fail
nvme: add nvme_queue_scan_sync helper
Ori Evron (1):
nvme: sync the namespace scanning during ctrl start
drivers/nvme/host/core.c | 430 +++++++++++++++++++++++-------------
drivers/nvme/host/fabrics.c | 7 +
drivers/nvme/host/fabrics.h | 3 +
drivers/nvme/host/rdma.c | 18 +-
4 files changed, 295 insertions(+), 163 deletions(-)
--
2.18.1
^ permalink raw reply [flat|nested] 36+ messages in thread
* [PATCH 1/8] nvme: use Independent ID-NS only for unknown cmd sets
2024-01-22 14:56 [PATCH v1 0/8] Enforce uniform metadata settings for ns head Max Gurtovoy
@ 2024-01-22 14:56 ` Max Gurtovoy
2024-01-23 8:58 ` Christoph Hellwig
2024-01-22 14:56 ` [PATCH 2/8] nvme: set uniform metadata settings for ns head Max Gurtovoy
` (6 subsequent siblings)
7 siblings, 1 reply; 36+ messages in thread
From: Max Gurtovoy @ 2024-01-22 14:56 UTC (permalink / raw)
To: kbusch, hch, sagi, linux-nvme
Cc: oren, israelr, dwagner, oevron, Max Gurtovoy
According to the specification, the Identify command is not among the
Admin commands permitted to return a status code of "Admin Command
Media Not Ready" for the "Controller Ready Independent of Media"
feature (represented by the NVME_CAP_CRMS_CRIMS capability). Therefore,
use the legacy ID-NS even if this capability is supported by the
controller. Keep using the limited Independent ID-NS mechanism only for
unknown command sets.
Reviewed-by: Israel Rukshin <israelr@nvidia.com>
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
---
drivers/nvme/host/core.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 1a17616fbae0..ff18c263c62c 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3808,12 +3808,12 @@ static void nvme_scan_ns(struct nvme_ctrl *ctrl, unsigned nsid)
}
/*
- * If available try to use the Command Set Idependent Identify Namespace
- * data structure to find all the generic information that is needed to
- * set up a namespace. If not fall back to the legacy version.
+ * Try to use the Command Set Independent Identify Namespace data
+ * structure to find all the generic information that is needed to set
+ * up a namespace only for unknown command sets. Otherwise, fall back
+ * to the legacy version.
*/
- if ((ctrl->cap & NVME_CAP_CRMS_CRIMS) ||
- (info.ids.csi != NVME_CSI_NVM && info.ids.csi != NVME_CSI_ZNS))
+ if (info.ids.csi != NVME_CSI_NVM && info.ids.csi != NVME_CSI_ZNS)
ret = nvme_ns_info_from_id_cs_indep(ctrl, &info);
else
ret = nvme_ns_info_from_identify(ctrl, &info);
--
2.18.1
* [PATCH 2/8] nvme: set uniform metadata settings for ns head
2024-01-22 14:56 [PATCH v1 0/8] Enforce uniform metadata settings for ns head Max Gurtovoy
2024-01-22 14:56 ` [PATCH 1/8] nvme: use Independent ID-NS only for unknown cmd sets Max Gurtovoy
@ 2024-01-22 14:56 ` Max Gurtovoy
2024-01-23 9:01 ` Christoph Hellwig
2024-01-22 14:56 ` [PATCH 3/8] nvme: allocate a new namespace if validation fail Max Gurtovoy
` (5 subsequent siblings)
7 siblings, 1 reply; 36+ messages in thread
From: Max Gurtovoy @ 2024-01-22 14:56 UTC (permalink / raw)
To: kbusch, hch, sagi, linux-nvme
Cc: oren, israelr, dwagner, oevron, Max Gurtovoy
Do not allow different metadata settings for a namespace and its
associated head.
If different metadata settings are discovered for a shared namespace,
which can be caused by differing controller or path capabilities,
create a dedicated namespace head per controller for that namespace.
For example, in the fabrics case there might be a multipath
configuration in which one controller/path supports metadata offloading
and a second path doesn't. In that case, we don't want to put these
paths and their namespaces under the same multipath umbrella.
Therefore, if the metadata settings are not the same, we will create a
separate namespace head for each such namespace.
While we're here, also make sure that all the namespaces under a
namespace head have the same LBA shift size.
Reviewed-by: Israel Rukshin <israelr@nvidia.com>
Signed-off-by: Ori Evron <oevron@nvidia.com>
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
---
drivers/nvme/host/core.c | 372 +++++++++++++++++++++++++--------------
1 file changed, 236 insertions(+), 136 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index ff18c263c62c..9d46acf5b6cb 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -35,6 +35,14 @@
struct nvme_ns_info {
struct nvme_ns_ids ids;
u32 nsid;
+ unsigned long features;
+ u64 nuse;
+ u16 ms;
+ u16 pi_size;
+ u8 pi_type;
+ u8 guard_type;
+ u8 flbas;
+ u8 ds;
__le32 anagrpid;
bool is_shared;
bool is_readonly;
@@ -1473,6 +1481,126 @@ int nvme_identify_ns(struct nvme_ctrl *ctrl, unsigned nsid,
return error;
}
+static int nvme_elbaf_from_id_ns_nvm(struct nvme_ctrl *ctrl, u32 nsid,
+ unsigned lbaf, u32 *elbaf)
+{
+ struct nvme_command c = { };
+ struct nvme_id_ns_nvm *nvm;
+ int ret;
+
+ nvm = kzalloc(sizeof(*nvm), GFP_KERNEL);
+ if (!nvm)
+ return -ENOMEM;
+
+ c.identify.opcode = nvme_admin_identify;
+ c.identify.nsid = cpu_to_le32(nsid);
+ c.identify.cns = NVME_ID_CNS_CS_NS;
+ c.identify.csi = NVME_CSI_NVM;
+
+ ret = nvme_submit_sync_cmd(ctrl->admin_q, &c, nvm, sizeof(*nvm));
+ if (ret)
+ goto free_data;
+
+ *elbaf = le32_to_cpu(nvm->elbaf[lbaf]);
+
+free_data:
+ kfree(nvm);
+
+ return ret;
+}
+
+static int nvme_info_init(struct nvme_ctrl *ctrl, struct nvme_ns_info *info,
+ struct nvme_id_ns *id)
+{
+ bool first = id->dps & NVME_NS_DPS_PI_FIRST;
+ unsigned lbaf = nvme_lbaf_index(id->flbas);
+ int ret = 0;
+ u32 elbaf;
+
+ info->nuse = le64_to_cpu(id->nuse);
+ info->flbas = id->flbas;
+ info->ds = id->lbaf[lbaf].ds;
+ info->pi_size = 0;
+ info->ms = le16_to_cpu(id->lbaf[lbaf].ms);
+ if (!(ctrl->ctratt & NVME_CTRL_ATTR_ELBAS)) {
+ info->pi_size = sizeof(struct t10_pi_tuple);
+ info->guard_type = NVME_NVM_NS_16B_GUARD;
+ goto set_pi;
+ }
+
+ ret = nvme_elbaf_from_id_ns_nvm(ctrl, info->nsid, lbaf, &elbaf);
+ if (ret)
+ goto set_pi;
+
+ /* no support for storage tag formats right now */
+ if (nvme_elbaf_sts(elbaf))
+ goto set_pi;
+
+ info->guard_type = nvme_elbaf_guard_type(elbaf);
+ switch (info->guard_type) {
+ case NVME_NVM_NS_64B_GUARD:
+ info->pi_size = sizeof(struct crc64_pi_tuple);
+ break;
+ case NVME_NVM_NS_16B_GUARD:
+ info->pi_size = sizeof(struct t10_pi_tuple);
+ break;
+ default:
+ break;
+ }
+
+set_pi:
+ if (info->pi_size && (first || info->ms == info->pi_size))
+ info->pi_type = id->dps & NVME_NS_DPS_PI_MASK;
+ else
+ info->pi_type = 0;
+
+ return ret;
+}
+
+static void nvme_configure_metadata(struct nvme_ctrl *ctrl,
+ struct nvme_ns_info *info)
+{
+ info->features &= ~(NVME_NS_METADATA_SUPPORTED | NVME_NS_EXT_LBAS);
+ if (!info->ms || !(ctrl->ops->flags & NVME_F_METADATA_SUPPORTED))
+ return;
+
+ if (ctrl->ops->flags & NVME_F_FABRICS) {
+ /*
+ * The NVMe over Fabrics specification only supports metadata as
+ * part of the extended data LBA. We rely on HCA/HBA support to
+ * remap the separate metadata buffer from the block layer.
+ */
+ if (WARN_ON_ONCE(!(info->flbas & NVME_NS_FLBAS_META_EXT)))
+ return;
+
+ info->features |= NVME_NS_EXT_LBAS;
+
+ /*
+ * The current fabrics transport drivers support namespace
+ * metadata formats only if pi_type != 0 and ms == pi_size.
+ * Suppress support for all other formats so the namespace will
+ * have a 0 capacity and not be usable through the block stack.
+ *
+ * Note, this check will need to be modified if any drivers
+ * gain the ability to use other metadata formats.
+ */
+ if (ctrl->max_integrity_segments && info->pi_type &&
+ info->ms == info->pi_size)
+ info->features |= NVME_NS_METADATA_SUPPORTED;
+ } else {
+ /*
+ * For PCIe controllers, we can't easily remap the separate
+ * metadata buffer from the block layer and thus require a
+ * separate metadata buffer for block layer metadata/PI support.
+ * We allow extended LBAs for the passthrough interface, though.
+ */
+ if (info->flbas & NVME_NS_FLBAS_META_EXT)
+ info->features |= NVME_NS_EXT_LBAS;
+ else
+ info->features |= NVME_NS_METADATA_SUPPORTED;
+ }
+}
+
static int nvme_ns_info_from_identify(struct nvme_ctrl *ctrl,
struct nvme_ns_info *info)
{
@@ -1491,6 +1619,21 @@ static int nvme_ns_info_from_identify(struct nvme_ctrl *ctrl,
goto error;
}
+ ret = nvme_info_init(ctrl, info, id);
+ if (ret)
+ goto error;
+
+ nvme_configure_metadata(ctrl, info);
+
+ /*
+ * Only set the DEAC bit if the device guarantees that reads from
+ * deallocated data return zeroes. While the DEAC bit does not
+ * require that, it must be a no-op if reads from deallocated data
+ * do not return zeroes.
+ */
+ if ((id->dlfeat & 0x7) == 0x1 && (id->dlfeat & (1 << 3)))
+ info->features |= NVME_NS_DEAC;
+
info->anagrpid = id->anagrpid;
info->is_shared = id->nmic & NVME_NS_NMIC_SHARED;
info->is_readonly = id->nsattr & NVME_NS_ATTR_RO;
@@ -1773,116 +1916,6 @@ static bool nvme_ns_ids_equal(struct nvme_ns_ids *a, struct nvme_ns_ids *b)
a->csi == b->csi;
}
-static int nvme_init_ms(struct nvme_ctrl *ctrl, struct nvme_ns_head *head,
- struct nvme_id_ns *id)
-{
- bool first = id->dps & NVME_NS_DPS_PI_FIRST;
- unsigned lbaf = nvme_lbaf_index(id->flbas);
- struct nvme_command c = { };
- struct nvme_id_ns_nvm *nvm;
- int ret = 0;
- u32 elbaf;
-
- head->pi_size = 0;
- head->ms = le16_to_cpu(id->lbaf[lbaf].ms);
- if (!(ctrl->ctratt & NVME_CTRL_ATTR_ELBAS)) {
- head->pi_size = sizeof(struct t10_pi_tuple);
- head->guard_type = NVME_NVM_NS_16B_GUARD;
- goto set_pi;
- }
-
- nvm = kzalloc(sizeof(*nvm), GFP_KERNEL);
- if (!nvm)
- return -ENOMEM;
-
- c.identify.opcode = nvme_admin_identify;
- c.identify.nsid = cpu_to_le32(head->ns_id);
- c.identify.cns = NVME_ID_CNS_CS_NS;
- c.identify.csi = NVME_CSI_NVM;
-
- ret = nvme_submit_sync_cmd(ctrl->admin_q, &c, nvm, sizeof(*nvm));
- if (ret)
- goto free_data;
-
- elbaf = le32_to_cpu(nvm->elbaf[lbaf]);
-
- /* no support for storage tag formats right now */
- if (nvme_elbaf_sts(elbaf))
- goto free_data;
-
- head->guard_type = nvme_elbaf_guard_type(elbaf);
- switch (head->guard_type) {
- case NVME_NVM_NS_64B_GUARD:
- head->pi_size = sizeof(struct crc64_pi_tuple);
- break;
- case NVME_NVM_NS_16B_GUARD:
- head->pi_size = sizeof(struct t10_pi_tuple);
- break;
- default:
- break;
- }
-
-free_data:
- kfree(nvm);
-set_pi:
- if (head->pi_size && (first || head->ms == head->pi_size))
- head->pi_type = id->dps & NVME_NS_DPS_PI_MASK;
- else
- head->pi_type = 0;
-
- return ret;
-}
-
-static int nvme_configure_metadata(struct nvme_ctrl *ctrl,
- struct nvme_ns_head *head, struct nvme_id_ns *id)
-{
- int ret;
-
- ret = nvme_init_ms(ctrl, head, id);
- if (ret)
- return ret;
-
- head->features &= ~(NVME_NS_METADATA_SUPPORTED | NVME_NS_EXT_LBAS);
- if (!head->ms || !(ctrl->ops->flags & NVME_F_METADATA_SUPPORTED))
- return 0;
-
- if (ctrl->ops->flags & NVME_F_FABRICS) {
- /*
- * The NVMe over Fabrics specification only supports metadata as
- * part of the extended data LBA. We rely on HCA/HBA support to
- * remap the separate metadata buffer from the block layer.
- */
- if (WARN_ON_ONCE(!(id->flbas & NVME_NS_FLBAS_META_EXT)))
- return 0;
-
- head->features |= NVME_NS_EXT_LBAS;
-
- /*
- * The current fabrics transport drivers support namespace
- * metadata formats only if nvme_ns_has_pi() returns true.
- * Suppress support for all other formats so the namespace will
- * have a 0 capacity and not be usable through the block stack.
- *
- * Note, this check will need to be modified if any drivers
- * gain the ability to use other metadata formats.
- */
- if (ctrl->max_integrity_segments && nvme_ns_has_pi(head))
- head->features |= NVME_NS_METADATA_SUPPORTED;
- } else {
- /*
- * For PCIe controllers, we can't easily remap the separate
- * metadata buffer from the block layer and thus require a
- * separate metadata buffer for block layer metadata/PI support.
- * We allow extended LBAs for the passthrough interface, though.
- */
- if (id->flbas & NVME_NS_FLBAS_META_EXT)
- head->features |= NVME_NS_EXT_LBAS;
- else
- head->features |= NVME_NS_METADATA_SUPPORTED;
- }
- return 0;
-}
-
static void nvme_set_queue_limits(struct nvme_ctrl *ctrl,
struct request_queue *q)
{
@@ -2058,15 +2091,14 @@ static int nvme_update_ns_info_block(struct nvme_ns *ns,
blk_mq_freeze_queue(ns->disk->queue);
lbaf = nvme_lbaf_index(id->flbas);
- ns->head->lba_shift = id->lbaf[lbaf].ds;
- ns->head->nuse = le64_to_cpu(id->nuse);
- nvme_set_queue_limits(ns->ctrl, ns->queue);
- ret = nvme_configure_metadata(ns->ctrl, ns->head, id);
- if (ret < 0) {
+ if (info->flbas != id->flbas || info->ds != id->lbaf[lbaf].ds) {
blk_mq_unfreeze_queue(ns->disk->queue);
- goto out;
+ ret = -EINVAL;
+ goto error;
}
+
+ nvme_set_queue_limits(ns->ctrl, ns->queue);
nvme_set_chunk_sectors(ns, id);
nvme_update_disk_info(ns->ctrl, ns->disk, ns->head, id);
@@ -2078,14 +2110,6 @@ static int nvme_update_ns_info_block(struct nvme_ns *ns,
}
}
- /*
- * Only set the DEAC bit if the device guarantees that reads from
- * deallocated data return zeroes. While the DEAC bit does not
- * require that, it must be a no-op if reads from deallocated data
- * do not return zeroes.
- */
- if ((id->dlfeat & 0x7) == 0x1 && (id->dlfeat & (1 << 3)))
- ns->head->features |= NVME_NS_DEAC;
set_disk_ro(ns->disk, nvme_ns_is_readonly(ns, info));
set_bit(NVME_NS_READY, &ns->flags);
blk_mq_unfreeze_queue(ns->disk->queue);
@@ -3294,8 +3318,18 @@ static const struct file_operations nvme_dev_fops = {
.uring_cmd = nvme_dev_uring_cmd,
};
+static bool nvme_head_is_same_metadata(struct nvme_ns_head *h,
+ struct nvme_ns_info *info)
+{
+ unsigned int mask = NVME_NS_METADATA_SUPPORTED | NVME_NS_EXT_LBAS;
+
+ return (h->features & mask) == (info->features & mask) &&
+ h->pi_size == info->pi_size && h->pi_type == info->pi_type &&
+ h->guard_type == info->guard_type && h->ms == info->ms;
+}
+
static struct nvme_ns_head *nvme_find_ns_head(struct nvme_ctrl *ctrl,
- unsigned nsid)
+ struct nvme_ns_info *info)
{
struct nvme_ns_head *h;
@@ -3307,7 +3341,11 @@ static struct nvme_ns_head *nvme_find_ns_head(struct nvme_ctrl *ctrl,
* In that case we can't use the same ns_head for namespaces
* with the same NSID.
*/
- if (h->ns_id != nsid || !nvme_is_unique_nsid(ctrl, h))
+ if (h->ns_id != info->nsid || !nvme_is_unique_nsid(ctrl, h))
+ continue;
+ if (!nvme_head_is_same_metadata(h, info))
+ continue;
+ if (h->lba_shift != info->ds)
continue;
if (!list_empty(&h->list) && nvme_tryget_ns_head(h))
return h;
@@ -3316,24 +3354,51 @@ static struct nvme_ns_head *nvme_find_ns_head(struct nvme_ctrl *ctrl,
return NULL;
}
-static int nvme_subsys_check_duplicate_ids(struct nvme_subsystem *subsys,
+static bool nvme_head_is_duplicate_ids(struct nvme_ns_head *h,
struct nvme_ns_ids *ids)
{
bool has_uuid = !uuid_is_null(&ids->uuid);
bool has_nguid = memchr_inv(ids->nguid, 0, sizeof(ids->nguid));
bool has_eui64 = memchr_inv(ids->eui64, 0, sizeof(ids->eui64));
+
+ if (has_uuid && uuid_equal(&ids->uuid, &h->ids.uuid))
+ return true;
+ if (has_nguid &&
+ memcmp(&ids->nguid, &h->ids.nguid, sizeof(ids->nguid)) == 0)
+ return true;
+ if (has_eui64 &&
+ memcmp(&ids->eui64, &h->ids.eui64, sizeof(ids->eui64)) == 0)
+ return true;
+
+ return false;
+}
+
+static int nvme_subsys_check_duplicate_ids(struct nvme_subsystem *subsys,
+ struct nvme_ns_ids *ids)
+{
struct nvme_ns_head *h;
lockdep_assert_held(&subsys->lock);
list_for_each_entry(h, &subsys->nsheads, entry) {
- if (has_uuid && uuid_equal(&ids->uuid, &h->ids.uuid))
- return -EINVAL;
- if (has_nguid &&
- memcmp(&ids->nguid, &h->ids.nguid, sizeof(ids->nguid)) == 0)
+ if (nvme_head_is_duplicate_ids(h, ids))
return -EINVAL;
- if (has_eui64 &&
- memcmp(&ids->eui64, &h->ids.eui64, sizeof(ids->eui64)) == 0)
+ }
+
+ return 0;
+}
+
+static int nvme_subsys_check_duplicate_info(struct nvme_subsystem *subsys,
+ struct nvme_ns_info *info)
+{
+ struct nvme_ns_head *h;
+
+ lockdep_assert_held(&subsys->lock);
+
+ list_for_each_entry(h, &subsys->nsheads, entry) {
+ if (nvme_head_is_same_metadata(h, info) &&
+ nvme_head_is_duplicate_ids(h, &info->ids) &&
+ h->lba_shift == info->ds)
return -EINVAL;
}
@@ -3432,6 +3497,13 @@ static struct nvme_ns_head *nvme_alloc_ns_head(struct nvme_ctrl *ctrl,
head->subsys = ctrl->subsys;
head->ns_id = info->nsid;
head->ids = info->ids;
+ head->nuse = info->nuse;
+ head->pi_size = info->pi_size;
+ head->pi_type = info->pi_type;
+ head->guard_type = info->guard_type;
+ head->ms = info->ms;
+ head->lba_shift = info->ds;
+ head->features = info->features;
head->shared = info->is_shared;
ratelimit_state_init(&head->rs_nuse, 5 * HZ, 1);
ratelimit_set_flags(&head->rs_nuse, RATELIMIT_MSG_ON_RELEASE);
@@ -3536,12 +3608,12 @@ static int nvme_init_ns_head(struct nvme_ns *ns, struct nvme_ns_info *info)
}
mutex_lock(&ctrl->subsys->lock);
- head = nvme_find_ns_head(ctrl, info->nsid);
+ head = nvme_find_ns_head(ctrl, info);
if (!head) {
- ret = nvme_subsys_check_duplicate_ids(ctrl->subsys, &info->ids);
+ ret = nvme_subsys_check_duplicate_info(ctrl->subsys, info);
if (ret) {
dev_err(ctrl->device,
- "duplicate IDs in subsystem for nsid %d\n",
+ "duplicate NS Info in subsystem for nsid %d\n",
info->nsid);
goto out_unlock;
}
@@ -3565,6 +3637,21 @@ static int nvme_init_ns_head(struct nvme_ns *ns, struct nvme_ns_info *info)
goto out_put_ns_head;
}
+ if (!nvme_head_is_same_metadata(head, info)) {
+ dev_err(ctrl->device,
+ "Metadata doesn't match for shared namespace %d\n",
+ info->nsid);
+ goto out_put_ns_head;
+ }
+
+ if (head->lba_shift != info->ds) {
+ dev_err(ctrl->device,
+ "LBA data size doesn't match for shared namespace %d\n",
+ info->nsid);
+ goto out_put_ns_head;
+
+ }
+
if (!multipath) {
dev_warn(ctrl->device,
"Found shared namespace %d, but multipathing not supported.\n",
@@ -3780,6 +3867,19 @@ static void nvme_validate_ns(struct nvme_ns *ns, struct nvme_ns_info *info)
goto out;
}
+ if (!nvme_head_is_same_metadata(ns->head, info)) {
+ dev_err(ns->ctrl->device,
+ "Metadata changed for nsid %d\n", ns->head->ns_id);
+ goto out;
+ }
+
+ if (ns->head->lba_shift != info->ds) {
+ dev_err(ns->ctrl->device,
+ "LBA data size changed for nsid %d\n",
+ ns->head->ns_id);
+ goto out;
+ }
+
ret = nvme_update_ns_info(ns, info);
out:
/*
--
2.18.1
* [PATCH 3/8] nvme: allocate a new namespace if validation fail
2024-01-22 14:56 [PATCH v1 0/8] Enforce uniform metadata settings for ns head Max Gurtovoy
2024-01-22 14:56 ` [PATCH 1/8] nvme: use Independent ID-NS only for unknown cmd sets Max Gurtovoy
2024-01-22 14:56 ` [PATCH 2/8] nvme: set uniform metadata settings for ns head Max Gurtovoy
@ 2024-01-22 14:56 ` Max Gurtovoy
2024-01-23 9:02 ` Christoph Hellwig
2024-01-22 14:56 ` [PATCH 4/8] nvme: add nvme_queue_scan_sync helper Max Gurtovoy
` (4 subsequent siblings)
7 siblings, 1 reply; 36+ messages in thread
From: Max Gurtovoy @ 2024-01-22 14:56 UTC (permalink / raw)
To: kbusch, hch, sagi, linux-nvme
Cc: oren, israelr, dwagner, oevron, Max Gurtovoy
Namespace validation can fail in case some of the namespace identifiers
were changed. In this case, we will give up and remove the namespace.
Add a mechanism to retry by allocating a new namespace instance that
will be set up according to the new namespace identifiers.
Reviewed-by: Israel Rukshin <israelr@nvidia.com>
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
---
drivers/nvme/host/core.c | 20 ++++++++++++++++----
1 file changed, 16 insertions(+), 4 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 9d46acf5b6cb..4fd3612b7570 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3857,7 +3857,14 @@ static void nvme_ns_remove_by_nsid(struct nvme_ctrl *ctrl, u32 nsid)
}
}
-static void nvme_validate_ns(struct nvme_ns *ns, struct nvme_ns_info *info)
+/*
+ * Validate the namespace according to the updated namespace info.
+ * Remove the namespace from the controller in case of a mismatch with previous
+ * identifiers.
+ *
+ * Returns true if namespace was removed, false otherwise.
+ */
+static bool nvme_validate_ns(struct nvme_ns *ns, struct nvme_ns_info *info)
{
int ret = NVME_SC_INVALID_NS | NVME_SC_DNR;
@@ -3886,10 +3893,13 @@ static void nvme_validate_ns(struct nvme_ns *ns, struct nvme_ns_info *info)
* Only remove the namespace if we got a fatal error back from the
* device, otherwise ignore the error and just move on.
*
- * TODO: we should probably schedule a delayed retry here.
+ * Caller should probably retry namespace allocation.
*/
- if (ret > 0 && (ret & NVME_SC_DNR))
+ if (ret > 0 && (ret & NVME_SC_DNR)) {
nvme_ns_remove(ns);
+ return true;
+ }
+ return false;
}
static void nvme_scan_ns(struct nvme_ctrl *ctrl, unsigned nsid)
@@ -3930,8 +3940,10 @@ static void nvme_scan_ns(struct nvme_ctrl *ctrl, unsigned nsid)
ns = nvme_find_get_ns(ctrl, nsid);
if (ns) {
- nvme_validate_ns(ns, &info);
+ bool removed = nvme_validate_ns(ns, &info);
nvme_put_ns(ns);
+ if (removed)
+ nvme_alloc_ns(ctrl, &info);
} else {
nvme_alloc_ns(ctrl, &info);
}
--
2.18.1
* [PATCH 4/8] nvme: add nvme_queue_scan_sync helper
2024-01-22 14:56 [PATCH v1 0/8] Enforce uniform metadata settings for ns head Max Gurtovoy
` (2 preceding siblings ...)
2024-01-22 14:56 ` [PATCH 3/8] nvme: allocate a new namespace if validation fail Max Gurtovoy
@ 2024-01-22 14:56 ` Max Gurtovoy
2024-01-22 14:56 ` [PATCH 5/8] nvme: sync the namespace scanning during ctrl start Max Gurtovoy
` (3 subsequent siblings)
7 siblings, 0 replies; 36+ messages in thread
From: Max Gurtovoy @ 2024-01-22 14:56 UTC (permalink / raw)
To: kbusch, hch, sagi, linux-nvme
Cc: oren, israelr, dwagner, oevron, Max Gurtovoy
This is a preparation patch for synchronizing the scanning during the
start of the controller.
This patch doesn't change any logic.
Reviewed-by: Israel Rukshin <israelr@nvidia.com>
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
---
drivers/nvme/host/core.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 4fd3612b7570..d0e4888041aa 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -144,6 +144,12 @@ void nvme_queue_scan(struct nvme_ctrl *ctrl)
queue_work(nvme_wq, &ctrl->scan_work);
}
+static void nvme_queue_scan_sync(struct nvme_ctrl *ctrl)
+{
+ nvme_queue_scan(ctrl);
+ flush_work(&ctrl->scan_work);
+}
+
/*
* Use this function to proceed with scheduling reset_work for a controller
* that had previously been set to the resetting state. This is intended for
@@ -1156,10 +1162,8 @@ void nvme_passthru_end(struct nvme_ctrl *ctrl, struct nvme_ns *ns, u32 effects,
"controller capabilities changed, reset may be required to take effect.\n");
}
}
- if (effects & (NVME_CMD_EFFECTS_NIC | NVME_CMD_EFFECTS_NCC)) {
- nvme_queue_scan(ctrl);
- flush_work(&ctrl->scan_work);
- }
+ if (effects & (NVME_CMD_EFFECTS_NIC | NVME_CMD_EFFECTS_NCC))
+ nvme_queue_scan_sync(ctrl);
if (ns)
return;
--
2.18.1
* [PATCH 5/8] nvme: sync the namespace scanning during ctrl start
2024-01-22 14:56 [PATCH v1 0/8] Enforce uniform metadata settings for ns head Max Gurtovoy
` (3 preceding siblings ...)
2024-01-22 14:56 ` [PATCH 4/8] nvme: add nvme_queue_scan_sync helper Max Gurtovoy
@ 2024-01-22 14:56 ` Max Gurtovoy
2024-01-23 9:02 ` Christoph Hellwig
2024-01-24 12:58 ` Sagi Grimberg
2024-01-22 14:56 ` [PATCH 6/8] nvme-rdma: Fix transfer length when write_generate/read_verify are 0 Max Gurtovoy
` (2 subsequent siblings)
7 siblings, 2 replies; 36+ messages in thread
From: Max Gurtovoy @ 2024-01-22 14:56 UTC (permalink / raw)
To: kbusch, hch, sagi, linux-nvme
Cc: oren, israelr, dwagner, oevron, Max Gurtovoy
From: Ori Evron <oevron@nvidia.com>
The namespace identifiers may change during the re-connection flow.
Therefore, wait for the scanning to finish before finalizing the start
of the controller.
Also move the re-queueing of the old requests to the stage where we've
finished the namespace scanning; otherwise we may issue a request to a
namespace that might have changed its identifiers.
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Signed-off-by: Ori Evron <oevron@nvidia.com>
---
drivers/nvme/host/core.c | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index d0e4888041aa..90ffb3db1ed9 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -593,14 +593,12 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
if (!changed)
return false;
- if (ctrl->state == NVME_CTRL_LIVE) {
- if (old_state == NVME_CTRL_CONNECTING)
- nvme_stop_failfast_work(ctrl);
- nvme_kick_requeue_lists(ctrl);
- } else if (ctrl->state == NVME_CTRL_CONNECTING &&
- old_state == NVME_CTRL_RESETTING) {
+ if (ctrl->state == NVME_CTRL_LIVE && old_state == NVME_CTRL_CONNECTING)
+ nvme_stop_failfast_work(ctrl);
+ else if (ctrl->state == NVME_CTRL_CONNECTING &&
+ old_state == NVME_CTRL_RESETTING)
nvme_start_failfast_work(ctrl);
- }
+
return changed;
}
EXPORT_SYMBOL_GPL(nvme_change_ctrl_state);
@@ -4287,6 +4285,7 @@ static void nvme_fw_act_work(struct work_struct *work)
return;
nvme_unquiesce_io_queues(ctrl);
+ nvme_kick_requeue_lists(ctrl);
/* read FW slot information to clear the AER */
nvme_get_fw_slot_info(ctrl);
@@ -4537,9 +4536,10 @@ void nvme_start_ctrl(struct nvme_ctrl *ctrl)
nvme_change_uevent(ctrl, "NVME_EVENT=rediscover");
if (ctrl->queue_count > 1) {
- nvme_queue_scan(ctrl);
+ nvme_queue_scan_sync(ctrl);
nvme_unquiesce_io_queues(ctrl);
nvme_mpath_update(ctrl);
+ nvme_kick_requeue_lists(ctrl);
}
nvme_change_uevent(ctrl, "NVME_EVENT=connected");
--
2.18.1
* [PATCH 6/8] nvme-rdma: Fix transfer length when write_generate/read_verify are 0
2024-01-22 14:56 [PATCH v1 0/8] Enforce uniform metadata settings for ns head Max Gurtovoy
` (4 preceding siblings ...)
2024-01-22 14:56 ` [PATCH 5/8] nvme: sync the namespace scanning during ctrl start Max Gurtovoy
@ 2024-01-22 14:56 ` Max Gurtovoy
2024-01-23 9:02 ` Christoph Hellwig
2024-01-22 14:56 ` [PATCH 7/8] nvme-fabrics: add option to disallow T10-PI offload Max Gurtovoy
2024-01-22 14:56 ` [PATCH 8/8] nvme-rdma: enable user " Max Gurtovoy
7 siblings, 1 reply; 36+ messages in thread
From: Max Gurtovoy @ 2024-01-22 14:56 UTC (permalink / raw)
To: kbusch, hch, sagi, linux-nvme
Cc: oren, israelr, dwagner, oevron, Max Gurtovoy
From: Israel Rukshin <israelr@nvidia.com>
When the block layer doesn't generate/verify metadata, the SG length is
smaller than the transfer length, because the SG length doesn't include
the metadata length that the HW adds on the wire. The target fails
those commands with "Data SGL Length Invalid" when comparing the
transfer length and the SG length. Fix it by adding the metadata length
to the transfer length when there is no metadata SGL. The bug
reproduces when setting the read_verify/write_generate configs to 0 at
the child multipath device, or at the primary device when NVMe
multipath is disabled.
Note that setting those configs to 0 on the multipath device (ns_head)
doesn't have any impact on the I/Os.
Fixes: 5ec5d3bddc6b ("nvme-rdma: add metadata/T10-PI support")
Signed-off-by: Israel Rukshin <israelr@nvidia.com>
Signed-off-by: Max Gurtovoy <mgurtovoy@nvidia.com>
---
drivers/nvme/host/rdma.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 2e77c0f25f71..a380bafbed08 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1405,6 +1405,8 @@ static int nvme_rdma_map_sg_pi(struct nvme_rdma_queue *queue,
struct nvme_ns *ns = rq->q->queuedata;
struct bio *bio = rq->bio;
struct nvme_keyed_sgl_desc *sg = &c->common.dptr.ksgl;
+ struct blk_integrity *bi = blk_get_integrity(bio->bi_bdev->bd_disk);
+ u32 xfer_len;
int nr;
req->mr = ib_mr_pool_get(queue->qp, &queue->qp->sig_mrs);
@@ -1417,8 +1419,7 @@ static int nvme_rdma_map_sg_pi(struct nvme_rdma_queue *queue,
if (unlikely(nr))
goto mr_put;
- nvme_rdma_set_sig_attrs(blk_get_integrity(bio->bi_bdev->bd_disk), c,
- req->mr->sig_attrs, ns->head->pi_type);
+ nvme_rdma_set_sig_attrs(bi, c, req->mr->sig_attrs, ns->head->pi_type);
nvme_rdma_set_prot_checks(c, &req->mr->sig_attrs->check_mask);
ib_update_fast_reg_key(req->mr, ib_inc_rkey(req->mr->rkey));
@@ -1436,7 +1437,11 @@ static int nvme_rdma_map_sg_pi(struct nvme_rdma_queue *queue,
IB_ACCESS_REMOTE_WRITE;
sg->addr = cpu_to_le64(req->mr->iova);
- put_unaligned_le24(req->mr->length, sg->length);
+ xfer_len = req->mr->length;
+ /* Check if PI is added by the HW */
+ if (!pi_count)
+ xfer_len += (xfer_len >> bi->interval_exp) * ns->head->pi_size;
+ put_unaligned_le24(xfer_len, sg->length);
put_unaligned_le32(req->mr->rkey, sg->key);
sg->type = NVME_KEY_SGL_FMT_DATA_DESC << 4;
--
2.18.1
* [PATCH 7/8] nvme-fabrics: add option to disallow T10-PI offload
2024-01-22 14:56 [PATCH v1 0/8] Enforce uniform metadata settings for ns head Max Gurtovoy
` (5 preceding siblings ...)
2024-01-22 14:56 ` [PATCH 6/8] nvme-rdma: Fix transfer length when write_generate/read_verify are 0 Max Gurtovoy
@ 2024-01-22 14:56 ` Max Gurtovoy
2024-01-22 15:13 ` Daniel Wagner
2024-01-22 14:56 ` [PATCH 8/8] nvme-rdma: enable user " Max Gurtovoy
7 siblings, 1 reply; 36+ messages in thread
From: Max Gurtovoy @ 2024-01-22 14:56 UTC (permalink / raw)
To: kbusch, hch, sagi, linux-nvme; +Cc: oren, israelr, dwagner, oevron
From: Israel Rukshin <israelr@nvidia.com>
Today, T10-PI offload is supported only by nvme-rdma, but nothing
prevents other transports from reusing the concept, so do not associate
it solely with the rdma transport. The flag disallows T10-PI offload,
letting users save system resources and allowing a more flexible
configuration.
Signed-off-by: Israel Rukshin <israelr@nvidia.com>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
---
drivers/nvme/host/fabrics.c | 7 +++++++
drivers/nvme/host/fabrics.h | 3 +++
2 files changed, 10 insertions(+)
diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
index aa88606a44c4..046f5a466576 100644
--- a/drivers/nvme/host/fabrics.c
+++ b/drivers/nvme/host/fabrics.c
@@ -673,6 +673,9 @@ static const match_table_t opt_tokens = {
#endif
#ifdef CONFIG_NVME_TCP_TLS
{ NVMF_OPT_TLS, "tls" },
+#endif
+#ifdef CONFIG_BLK_DEV_INTEGRITY
+ { NVMF_OPT_DISALLOW_PI, "disallow_pi" },
#endif
{ NVMF_OPT_ERR, NULL }
};
@@ -702,6 +705,7 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
opts->tls = false;
opts->tls_key = NULL;
opts->keyring = NULL;
+ opts->disallow_pi = false;
options = o = kstrdup(buf, GFP_KERNEL);
if (!options)
@@ -915,6 +919,9 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
case NVMF_OPT_DATA_DIGEST:
opts->data_digest = true;
break;
+ case NVMF_OPT_DISALLOW_PI:
+ opts->disallow_pi = true;
+ break;
case NVMF_OPT_NR_WRITE_QUEUES:
if (match_int(args, &token)) {
ret = -EINVAL;
diff --git a/drivers/nvme/host/fabrics.h b/drivers/nvme/host/fabrics.h
index fbaee5a7be19..cd0140a2abe7 100644
--- a/drivers/nvme/host/fabrics.h
+++ b/drivers/nvme/host/fabrics.h
@@ -73,6 +73,7 @@ enum {
NVMF_OPT_TLS = 1 << 25,
NVMF_OPT_KEYRING = 1 << 26,
NVMF_OPT_TLS_KEY = 1 << 27,
+ NVMF_OPT_DISALLOW_PI = 1 << 28,
};
/**
@@ -115,6 +116,7 @@ enum {
* @nr_poll_queues: number of queues for polling I/O
* @tos: type of service
* @fast_io_fail_tmo: Fast I/O fail timeout in seconds
+ * @disallow_pi: Disallow metadata (T10-PI) offload support
*/
struct nvmf_ctrl_options {
unsigned mask;
@@ -140,6 +142,7 @@ struct nvmf_ctrl_options {
bool disable_sqflow;
bool hdr_digest;
bool data_digest;
+ bool disallow_pi;
unsigned int nr_write_queues;
unsigned int nr_poll_queues;
int tos;
--
2.18.1
^ permalink raw reply related [flat|nested] 36+ messages in thread
* [PATCH 8/8] nvme-rdma: enable user to disallow T10-PI offload
2024-01-22 14:56 [PATCH v1 0/8] Enforce uniform metadata settings for ns head Max Gurtovoy
` (6 preceding siblings ...)
2024-01-22 14:56 ` [PATCH 7/8] nvme-fabrics: add option to disallow T10-PI offload Max Gurtovoy
@ 2024-01-22 14:56 ` Max Gurtovoy
7 siblings, 0 replies; 36+ messages in thread
From: Max Gurtovoy @ 2024-01-22 14:56 UTC (permalink / raw)
To: kbusch, hch, sagi, linux-nvme; +Cc: oren, israelr, dwagner, oevron
From: Israel Rukshin <israelr@nvidia.com>
Disallowing metadata support helps the user save system resources
and allows a more flexible configuration. For example, in the
multipath case, when PI offload is supported on only one of the
paths (one running on a ConnectX-4, which supports PI offload, and
one on a ConnectX-3, which doesn't), two NS heads will be created
by default and multipath can't be enabled on this setup. But when
the disallow_pi flag is used on the path that supports PI (the
ConnectX-4 path), only a single NS head will be created and
multipath will be set up properly.
Signed-off-by: Israel Rukshin <israelr@nvidia.com>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
---
drivers/nvme/host/rdma.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index a380bafbed08..0c5671832b50 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -791,8 +791,9 @@ static int nvme_rdma_configure_admin_queue(struct nvme_rdma_ctrl *ctrl,
ctrl->ctrl.numa_node = ibdev_to_node(ctrl->device->dev);
/* T10-PI support */
- if (ctrl->device->dev->attrs.kernel_cap_flags &
- IBK_INTEGRITY_HANDOVER)
+ if (!ctrl->ctrl.opts->disallow_pi &&
+ (ctrl->device->dev->attrs.kernel_cap_flags &
+ IBK_INTEGRITY_HANDOVER))
pi_capable = true;
ctrl->max_fr_pages = nvme_rdma_get_max_fr_pages(ctrl->device->dev,
@@ -2325,7 +2326,7 @@ static struct nvmf_transport_ops nvme_rdma_transport = {
.allowed_opts = NVMF_OPT_TRSVCID | NVMF_OPT_RECONNECT_DELAY |
NVMF_OPT_HOST_TRADDR | NVMF_OPT_CTRL_LOSS_TMO |
NVMF_OPT_NR_WRITE_QUEUES | NVMF_OPT_NR_POLL_QUEUES |
- NVMF_OPT_TOS,
+ NVMF_OPT_TOS | NVMF_OPT_DISALLOW_PI,
.create_ctrl = nvme_rdma_create_ctrl,
};
--
2.18.1
^ permalink raw reply related [flat|nested] 36+ messages in thread
* Re: [PATCH 7/8] nvme-fabrics: add option to disallow T10-PI offload
2024-01-22 14:56 ` [PATCH 7/8] nvme-fabrics: add option to disallow T10-PI offload Max Gurtovoy
@ 2024-01-22 15:13 ` Daniel Wagner
2024-01-22 15:17 ` Max Gurtovoy
0 siblings, 1 reply; 36+ messages in thread
From: Daniel Wagner @ 2024-01-22 15:13 UTC (permalink / raw)
To: Max Gurtovoy; +Cc: kbusch, hch, sagi, linux-nvme, oren, israelr, oevron
On Mon, Jan 22, 2024 at 04:56:58PM +0200, Max Gurtovoy wrote:
> --- a/drivers/nvme/host/fabrics.c
> +++ b/drivers/nvme/host/fabrics.c
> @@ -673,6 +673,9 @@ static const match_table_t opt_tokens = {
> #endif
> #ifdef CONFIG_NVME_TCP_TLS
> { NVMF_OPT_TLS, "tls" },
> +#endif
> +#ifdef CONFIG_BLK_DEV_INTEGRITY
> + { NVMF_OPT_DISALLOW_PI, "disallow_pi" },
What about 'enable_pi' instead of the inverted boolean logic?
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 7/8] nvme-fabrics: add option to disallow T10-PI offload
2024-01-22 15:13 ` Daniel Wagner
@ 2024-01-22 15:17 ` Max Gurtovoy
2024-01-22 15:27 ` Daniel Wagner
0 siblings, 1 reply; 36+ messages in thread
From: Max Gurtovoy @ 2024-01-22 15:17 UTC (permalink / raw)
To: Daniel Wagner; +Cc: kbusch, hch, sagi, linux-nvme, oren, israelr, oevron
On 22/01/2024 17:13, Daniel Wagner wrote:
> On Mon, Jan 22, 2024 at 04:56:58PM +0200, Max Gurtovoy wrote:
>> --- a/drivers/nvme/host/fabrics.c
>> +++ b/drivers/nvme/host/fabrics.c
>> @@ -673,6 +673,9 @@ static const match_table_t opt_tokens = {
>> #endif
>> #ifdef CONFIG_NVME_TCP_TLS
>> { NVMF_OPT_TLS, "tls" },
>> +#endif
>> +#ifdef CONFIG_BLK_DEV_INTEGRITY
>> + { NVMF_OPT_DISALLOW_PI, "disallow_pi" },
>
> What about 'enable_pi' instead of the inverted boolean logic?
we can't change the default behavior (that is "enable" by default).
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Re: [PATCH 7/8] nvme-fabrics: add option to disallow T10-PI offload
2024-01-22 15:17 ` Max Gurtovoy
@ 2024-01-22 15:27 ` Daniel Wagner
2024-01-22 15:28 ` Daniel Wagner
0 siblings, 1 reply; 36+ messages in thread
From: Daniel Wagner @ 2024-01-22 15:27 UTC (permalink / raw)
To: Max Gurtovoy; +Cc: kbusch, hch, sagi, linux-nvme, oren, israelr, oevron
On Mon, Jan 22, 2024 at 05:17:11PM +0200, Max Gurtovoy wrote:
> > What about 'enable_pi' instead of the inverted boolean logic?
>
> we can't change the default behavior (that is "enable" by default).
But couldn't we just have
enable_pi = true
as default then? This should be the same as
disallow_pi = false
?
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: Re: Re: [PATCH 7/8] nvme-fabrics: add option to disallow T10-PI offload
2024-01-22 15:27 ` Daniel Wagner
@ 2024-01-22 15:28 ` Daniel Wagner
2024-01-23 9:04 ` Christoph Hellwig
0 siblings, 1 reply; 36+ messages in thread
From: Daniel Wagner @ 2024-01-22 15:28 UTC (permalink / raw)
To: Max Gurtovoy; +Cc: kbusch, hch, sagi, linux-nvme, oren, israelr, oevron
On Mon, Jan 22, 2024 at 04:27:08PM +0100, Daniel Wagner wrote:
> On Mon, Jan 22, 2024 at 05:17:11PM +0200, Max Gurtovoy wrote:
> > > What about 'enable_pi' instead of the inverted boolean logic?
> >
> > we can't change the default behavior (that is "enable" by default).
>
> But couldn't we just have
>
> enable_pi = true
>
> as default then? This should be the same as
>
> disallow_pi = false
>
> ?
Ah, I get it now. This interface doesn't work this way.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 1/8] nvme: use Independent ID-NS only for unknown cmd sets
2024-01-22 14:56 ` [PATCH 1/8] nvme: use Independent ID-NS only for unknown cmd sets Max Gurtovoy
@ 2024-01-23 8:58 ` Christoph Hellwig
0 siblings, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2024-01-23 8:58 UTC (permalink / raw)
To: Max Gurtovoy
Cc: kbusch, hch, sagi, linux-nvme, oren, israelr, dwagner, oevron
On Mon, Jan 22, 2024 at 04:56:52PM +0200, Max Gurtovoy wrote:
> According to the specification, the Admin Identify command is not
> part of the Admin commands permitted to return a status code of "Admin
> Command Media Not Ready" for "Controller Ready Independent of Media"
> feature (represented by NVME_CAP_CRMS_CRIMS capability). Therefore, use
> the legacy ID-NS also if this capability is supported by the controller.
Please just file an erratum to fix this; not using the CS-independent
command in the CS-independent path is just silly.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 2/8] nvme: set uniform metadata settings for ns head
2024-01-22 14:56 ` [PATCH 2/8] nvme: set uniform metadata settings for ns head Max Gurtovoy
@ 2024-01-23 9:01 ` Christoph Hellwig
0 siblings, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2024-01-23 9:01 UTC (permalink / raw)
To: Max Gurtovoy
Cc: kbusch, hch, sagi, linux-nvme, oren, israelr, dwagner, oevron
On Mon, Jan 22, 2024 at 04:56:53PM +0200, Max Gurtovoy wrote:
> While we're here, also make sure that all the namespaces under the
> namespace head have the same lba shift size.
Please split this into a separate prep patch as it is not related
to PI.
> struct nvme_ns_info {
> struct nvme_ns_ids ids;
> u32 nsid;
> + unsigned long features;
> + u64 nuse;
> + u16 ms;
> + u16 pi_size;
> + u8 pi_type;
> + u8 guard_type;
> + u8 flbas;
> + u8 ds;
nvme_ns_info is used for the command set independent scanning, so
it should not grow NVM command set data like this.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 3/8] nvme: allocate a new namespace if validation fail
2024-01-22 14:56 ` [PATCH 3/8] nvme: allocate a new namespace if validation fail Max Gurtovoy
@ 2024-01-23 9:02 ` Christoph Hellwig
0 siblings, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2024-01-23 9:02 UTC (permalink / raw)
To: Max Gurtovoy
Cc: kbusch, hch, sagi, linux-nvme, oren, israelr, dwagner, oevron
On Mon, Jan 22, 2024 at 04:56:54PM +0200, Max Gurtovoy wrote:
> Namespace validation can fail in case some of the namespace identifiers
> were changed. In this case, we will give up and remove the namespace.
> Add a mechanism to retry the allocation of a new namespace instance that
> will be setup according to the new namespace identifiers.
Can you come up with a testcase for blktests to exercise this path?
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 5/8] nvme: sync the namespace scanning during ctrl start
2024-01-22 14:56 ` [PATCH 5/8] nvme: sync the namespace scanning during ctrl start Max Gurtovoy
@ 2024-01-23 9:02 ` Christoph Hellwig
2024-01-24 0:47 ` Max Gurtovoy
2024-01-24 12:58 ` Sagi Grimberg
1 sibling, 1 reply; 36+ messages in thread
From: Christoph Hellwig @ 2024-01-23 9:02 UTC (permalink / raw)
To: Max Gurtovoy
Cc: kbusch, hch, sagi, linux-nvme, oren, israelr, dwagner, oevron
On Mon, Jan 22, 2024 at 04:56:56PM +0200, Max Gurtovoy wrote:
> From: Ori Evron <oevron@nvidia.com>
>
> The namespace identifiers may change during the re-connection flow.
How?
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 6/8] nvme-rdma: Fix transfer length when write_generate/read_verify are 0
2024-01-22 14:56 ` [PATCH 6/8] nvme-rdma: Fix transfer length when write_generate/read_verify are 0 Max Gurtovoy
@ 2024-01-23 9:02 ` Christoph Hellwig
0 siblings, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2024-01-23 9:02 UTC (permalink / raw)
To: Max Gurtovoy
Cc: kbusch, hch, sagi, linux-nvme, oren, israelr, dwagner, oevron
Please send this separately for Linux 6.8-rc.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 7/8] nvme-fabrics: add option to disallow T10-PI offload
2024-01-22 15:28 ` Daniel Wagner
@ 2024-01-23 9:04 ` Christoph Hellwig
2024-02-01 10:40 ` Israel Rukshin
[not found] ` <fdd1c81f-caf3-4f34-96e8-f4d8ffc26203@nvidia.com>
0 siblings, 2 replies; 36+ messages in thread
From: Christoph Hellwig @ 2024-01-23 9:04 UTC (permalink / raw)
To: Daniel Wagner
Cc: Max Gurtovoy, kbusch, hch, sagi, linux-nvme, oren, israelr,
oevron
On Mon, Jan 22, 2024 at 04:28:53PM +0100, Daniel Wagner wrote:
> On Mon, Jan 22, 2024 at 04:27:08PM +0100, Daniel Wagner wrote:
> > On Mon, Jan 22, 2024 at 05:17:11PM +0200, Max Gurtovoy wrote:
> > > > What about 'enable_pi' instead of the inverted boolean logic?
> > >
> > > we can't change the default behavior (that is "enable" by default).
> >
> > But couldn't we just have
> >
> > enable_pi = true
> >
> > as default then? This should be the same as
> >
> > disallow_pi = false
> >
> > ?
>
> Ah, I get it now. This interface doesn't work this way.
The match_table_t infrastructure does allow checking for parameter
values, including boolean ones, and this would indeed be a good
use case for that. Extending the existing boolean-only flags to
that would be nice as well.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 5/8] nvme: sync the namespace scanning during ctrl start
2024-01-23 9:02 ` Christoph Hellwig
@ 2024-01-24 0:47 ` Max Gurtovoy
2024-01-24 9:48 ` Christoph Hellwig
0 siblings, 1 reply; 36+ messages in thread
From: Max Gurtovoy @ 2024-01-24 0:47 UTC (permalink / raw)
To: Christoph Hellwig
Cc: kbusch, sagi, linux-nvme, oren, israelr, dwagner, oevron
On 23/01/2024 11:02, Christoph Hellwig wrote:
> On Mon, Jan 22, 2024 at 04:56:56PM +0200, Max Gurtovoy wrote:
>> From: Ori Evron <oevron@nvidia.com>
>>
>> The namespace identifiers may change during the re-connection flow.
>
> How?
>
For example, if PI was supported on the port before re-connection
and wasn't supported after the re-connection.
We must first identify the namespace format/identifiers before kicking
the old requests (otherwise we will get IO errors).
We must make sure that the identifiers are equal to the ns_head
identifiers (otherwise we will remove the path from mpath and allocate a
new ns_head).
Maybe it will be simpler to reproduce if the namespace uuid/nguid
changes after re-connection (I didn't test this scenario).
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 5/8] nvme: sync the namespace scanning during ctrl start
2024-01-24 0:47 ` Max Gurtovoy
@ 2024-01-24 9:48 ` Christoph Hellwig
2024-01-24 10:23 ` Max Gurtovoy
0 siblings, 1 reply; 36+ messages in thread
From: Christoph Hellwig @ 2024-01-24 9:48 UTC (permalink / raw)
To: Max Gurtovoy
Cc: Christoph Hellwig, kbusch, sagi, linux-nvme, oren, israelr,
dwagner, oevron
On Wed, Jan 24, 2024 at 02:47:30AM +0200, Max Gurtovoy wrote:
> for example if PI was supported on the port before re-connection and
> wasn't supported after the re-connection.
> We must first identify the namespace format/identifiers before kicking the
> old requests (otherwise we will get IO errors).
Why would the namespace identifier change when you reformat? They
should only change when the namespace is deleted and re-created.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 5/8] nvme: sync the namespace scanning during ctrl start
2024-01-24 9:48 ` Christoph Hellwig
@ 2024-01-24 10:23 ` Max Gurtovoy
2024-01-25 14:41 ` Christoph Hellwig
0 siblings, 1 reply; 36+ messages in thread
From: Max Gurtovoy @ 2024-01-24 10:23 UTC (permalink / raw)
To: Christoph Hellwig
Cc: kbusch, sagi, linux-nvme, oren, israelr, dwagner, oevron
On 24/01/2024 11:48, Christoph Hellwig wrote:
> On Wed, Jan 24, 2024 at 02:47:30AM +0200, Max Gurtovoy wrote:
>> for example if PI was supported on the port before re-connection and
>> wasn't supported after the re-connection.
>> We must first identify the namespace format/identifiers before kicking the
>> old requests (otherwise we will get IO errors).
>
> Why would the namespace identifier change when you reformat? They
> should only change when the namespace is deleted and re-created.
Even in this case of re-creation we must sync the namespace identification.
For example, if the port went down and the target was re-configured
but with PI offload disabled (it was enabled previously), or someone
replaced the HCA with one that does not support PI. The queued IOs in
the host would have an integrity ctx even though the new ns/disk
doesn't have ms/integrity settings, so they will immediately fail
during nvme_kick_requeue_lists.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 5/8] nvme: sync the namespace scanning during ctrl start
2024-01-22 14:56 ` [PATCH 5/8] nvme: sync the namespace scanning during ctrl start Max Gurtovoy
2024-01-23 9:02 ` Christoph Hellwig
@ 2024-01-24 12:58 ` Sagi Grimberg
2024-01-24 13:04 ` Max Gurtovoy
1 sibling, 1 reply; 36+ messages in thread
From: Sagi Grimberg @ 2024-01-24 12:58 UTC (permalink / raw)
To: Max Gurtovoy, kbusch, hch, linux-nvme; +Cc: oren, israelr, dwagner, oevron
> @@ -4537,9 +4536,10 @@ void nvme_start_ctrl(struct nvme_ctrl *ctrl)
> nvme_change_uevent(ctrl, "NVME_EVENT=rediscover");
>
> if (ctrl->queue_count > 1) {
> - nvme_queue_scan(ctrl);
> + nvme_queue_scan_sync(ctrl);
> nvme_unquiesce_io_queues(ctrl);
> nvme_mpath_update(ctrl);
> + nvme_kick_requeue_lists(ctrl);
> }
I really don't think it's a good idea to block ctrl start
like that.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 5/8] nvme: sync the namespace scanning during ctrl start
2024-01-24 12:58 ` Sagi Grimberg
@ 2024-01-24 13:04 ` Max Gurtovoy
2024-01-24 13:10 ` Sagi Grimberg
0 siblings, 1 reply; 36+ messages in thread
From: Max Gurtovoy @ 2024-01-24 13:04 UTC (permalink / raw)
To: Sagi Grimberg, kbusch, hch, linux-nvme; +Cc: oren, israelr, dwagner, oevron
On 24/01/2024 14:58, Sagi Grimberg wrote:
>
>> @@ -4537,9 +4536,10 @@ void nvme_start_ctrl(struct nvme_ctrl *ctrl)
>> nvme_change_uevent(ctrl, "NVME_EVENT=rediscover");
>> if (ctrl->queue_count > 1) {
>> - nvme_queue_scan(ctrl);
>> + nvme_queue_scan_sync(ctrl);
>> nvme_unquiesce_io_queues(ctrl);
>> nvme_mpath_update(ctrl);
>> + nvme_kick_requeue_lists(ctrl);
>> }
>
> I really don't think its a good idea to block ctrl start
> like that.
like how ? scan_sync ?
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 5/8] nvme: sync the namespace scanning during ctrl start
2024-01-24 13:04 ` Max Gurtovoy
@ 2024-01-24 13:10 ` Sagi Grimberg
2024-01-24 13:17 ` Max Gurtovoy
0 siblings, 1 reply; 36+ messages in thread
From: Sagi Grimberg @ 2024-01-24 13:10 UTC (permalink / raw)
To: Max Gurtovoy, kbusch, hch, linux-nvme; +Cc: oren, israelr, dwagner, oevron
>>> @@ -4537,9 +4536,10 @@ void nvme_start_ctrl(struct nvme_ctrl *ctrl)
>>> nvme_change_uevent(ctrl, "NVME_EVENT=rediscover");
>>> if (ctrl->queue_count > 1) {
>>> - nvme_queue_scan(ctrl);
>>> + nvme_queue_scan_sync(ctrl);
>>> nvme_unquiesce_io_queues(ctrl);
>>> nvme_mpath_update(ctrl);
>>> + nvme_kick_requeue_lists(ctrl);
>>> }
>>
>> I really don't think its a good idea to block ctrl start
>> like that.
>
> like how ? scan_sync ?
Yes
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 5/8] nvme: sync the namespace scanning during ctrl start
2024-01-24 13:10 ` Sagi Grimberg
@ 2024-01-24 13:17 ` Max Gurtovoy
2024-01-24 13:54 ` Sagi Grimberg
0 siblings, 1 reply; 36+ messages in thread
From: Max Gurtovoy @ 2024-01-24 13:17 UTC (permalink / raw)
To: Sagi Grimberg, kbusch, hch, linux-nvme; +Cc: oren, israelr, dwagner, oevron
On 24/01/2024 15:10, Sagi Grimberg wrote:
>
>>>> @@ -4537,9 +4536,10 @@ void nvme_start_ctrl(struct nvme_ctrl *ctrl)
>>>> nvme_change_uevent(ctrl, "NVME_EVENT=rediscover");
>>>> if (ctrl->queue_count > 1) {
>>>> - nvme_queue_scan(ctrl);
>>>> + nvme_queue_scan_sync(ctrl);
>>>> nvme_unquiesce_io_queues(ctrl);
>>>> nvme_mpath_update(ctrl);
>>>> + nvme_kick_requeue_lists(ctrl);
>>>> }
>>>
>>> I really don't think its a good idea to block ctrl start
>>> like that.
>>
>> like how ? scan_sync ?
>
> Yes
Why not? This is the control path.
We have to make sure we issue commands to a validated namespace.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 5/8] nvme: sync the namespace scanning during ctrl start
2024-01-24 13:17 ` Max Gurtovoy
@ 2024-01-24 13:54 ` Sagi Grimberg
2024-01-24 14:15 ` Max Gurtovoy
0 siblings, 1 reply; 36+ messages in thread
From: Sagi Grimberg @ 2024-01-24 13:54 UTC (permalink / raw)
To: Max Gurtovoy, kbusch, hch, linux-nvme; +Cc: oren, israelr, dwagner, oevron
>>>>> @@ -4537,9 +4536,10 @@ void nvme_start_ctrl(struct nvme_ctrl *ctrl)
>>>>> nvme_change_uevent(ctrl, "NVME_EVENT=rediscover");
>>>>> if (ctrl->queue_count > 1) {
>>>>> - nvme_queue_scan(ctrl);
>>>>> + nvme_queue_scan_sync(ctrl);
>>>>> nvme_unquiesce_io_queues(ctrl);
>>>>> nvme_mpath_update(ctrl);
>>>>> + nvme_kick_requeue_lists(ctrl);
>>>>> }
>>>>
>>>> I really don't think its a good idea to block ctrl start
>>>> like that.
>>>
>>> like how ? scan_sync ?
>>
>> Yes
>
> why not ? this is control path..
Because in the real world, namespaces (or HCAs) will not change their
attributes in 99.999999% of the cases, and when they do, in 99.999999%
of the cases the inflight IO has already failed over to a different
path.
So no, I don't think that preventing the ctrl start from making forward
progress until a full namespace scan completes makes any sense.
> we have to make sure we issue commands to a validated namespace.
I think we should simply refuse to create the ns when it differs between
paths, or remove it in the highly unlikely case where it suddenly
changes its attributes when reconnecting.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 5/8] nvme: sync the namespace scanning during ctrl start
2024-01-24 13:54 ` Sagi Grimberg
@ 2024-01-24 14:15 ` Max Gurtovoy
0 siblings, 0 replies; 36+ messages in thread
From: Max Gurtovoy @ 2024-01-24 14:15 UTC (permalink / raw)
To: Sagi Grimberg, kbusch, hch, linux-nvme; +Cc: oren, israelr, dwagner, oevron
On 24/01/2024 15:54, Sagi Grimberg wrote:
>
>>>>>> @@ -4537,9 +4536,10 @@ void nvme_start_ctrl(struct nvme_ctrl *ctrl)
>>>>>> nvme_change_uevent(ctrl, "NVME_EVENT=rediscover");
>>>>>> if (ctrl->queue_count > 1) {
>>>>>> - nvme_queue_scan(ctrl);
>>>>>> + nvme_queue_scan_sync(ctrl);
>>>>>> nvme_unquiesce_io_queues(ctrl);
>>>>>> nvme_mpath_update(ctrl);
>>>>>> + nvme_kick_requeue_lists(ctrl);
>>>>>> }
>>>>>
>>>>> I really don't think its a good idea to block ctrl start
>>>>> like that.
>>>>
>>>> like how ? scan_sync ?
>>>
>>> Yes
>>
>> why not ? this is control path..
>
> Because in the real world, namespaces (or HCAs) will not change their
> attributes in 99.999999% of the cases, and when they do, in 99.999999%
> of the cases the inflight IO has already failed over to a different
> path.
>
> So no, I don't think that preventing the ctrl start from making forward
> progress until a full namespaces scan completes makes any sense.
>
The correctness is important. Namespaces in NVMe can change dynamically.
Issuing a request to a non-identified namespace doesn't sound right to me.
For a real-world use case with 1-5 namespaces per controller, this
sync will not cause any harm.
Also, a patch was recently sent to perform parallel scanning of
namespaces, so it will not stall much even in the 1K-namespace case.
>> we have to make sure we issue commands to a validated namespace.
>
> I think we should simply refuse to create the ns when it differs between
> paths, or remove it in the highly unlikely case where it suddenly
> changes its attributes when reconnecting.
How will you refuse to create a namespace if you didn't finish scanning it?
This is exactly the logic we implemented, but we need to get the new
identifiers and understand whether we need to remove the old ns and
create a new one.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 5/8] nvme: sync the namespace scanning during ctrl start
2024-01-24 10:23 ` Max Gurtovoy
@ 2024-01-25 14:41 ` Christoph Hellwig
2024-01-25 15:55 ` Max Gurtovoy
0 siblings, 1 reply; 36+ messages in thread
From: Christoph Hellwig @ 2024-01-25 14:41 UTC (permalink / raw)
To: Max Gurtovoy
Cc: Christoph Hellwig, kbusch, sagi, linux-nvme, oren, israelr,
dwagner, oevron
On Wed, Jan 24, 2024 at 12:23:33PM +0200, Max Gurtovoy wrote:
> Even in this case of re-creation we must sync the namespace identification.
But a format is not a re-creation.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 5/8] nvme: sync the namespace scanning during ctrl start
2024-01-25 14:41 ` Christoph Hellwig
@ 2024-01-25 15:55 ` Max Gurtovoy
2024-01-29 10:48 ` Sagi Grimberg
0 siblings, 1 reply; 36+ messages in thread
From: Max Gurtovoy @ 2024-01-25 15:55 UTC (permalink / raw)
To: Christoph Hellwig
Cc: kbusch, sagi, linux-nvme, oren, israelr, dwagner, oevron
On 25/01/2024 16:41, Christoph Hellwig wrote:
> On Wed, Jan 24, 2024 at 12:23:33PM +0200, Max Gurtovoy wrote:
>> Even in this case of re-creation we must sync the namespace identification.
>
> But a format is not a re-creation.
>
We currently don't support Format NVM in fabrics, so maybe I didn't use
the correct word in my answer.
Because of the dynamic nature of controllers and namespaces we have the
Identify ctrl/ns commands in the specification, and we need to use them.
Re-queuing requests to a non-identified namespace is bad behavior. It
might even be a spec violation (I probably need to dig into the spec
and try to find some description of that scenario).
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 5/8] nvme: sync the namespace scanning during ctrl start
2024-01-25 15:55 ` Max Gurtovoy
@ 2024-01-29 10:48 ` Sagi Grimberg
2024-01-29 12:37 ` Max Gurtovoy
0 siblings, 1 reply; 36+ messages in thread
From: Sagi Grimberg @ 2024-01-29 10:48 UTC (permalink / raw)
To: Max Gurtovoy, Christoph Hellwig
Cc: kbusch, linux-nvme, oren, israelr, dwagner, oevron
>>> Even in this case of re-creation we must sync the namespace
>>> identification.
>>
>> But a format is not a re-creation.
>>
>
> we currently don't support format in fabrics so maybe I didn't use the
> correct word during my answer.
>
> because of the dynamic nature of controllers and namespaces we have the
> identify ctrl/ns commands in the specification and we need to use them.
ns scanning has been notorious for causing issues when executing
concurrently with controller resets. So I don't think it is a good idea
to make ctrl start block until it completes, especially not for the
esoteric case of a ns that was suddenly reformatted between reconnects.
> Re-queue requests to a non identified namespace is bad behavior. It
> might be even a spec violation (I need to dig into the spec for it
> probably and try to find some description on that scenario).
It can't be a spec violation; the host cannot know if there is any
AEN pending (for example NS_CHANGE) when sending *ANY* command to the
controller.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 5/8] nvme: sync the namespace scanning during ctrl start
2024-01-29 10:48 ` Sagi Grimberg
@ 2024-01-29 12:37 ` Max Gurtovoy
2024-01-31 12:40 ` Sagi Grimberg
0 siblings, 1 reply; 36+ messages in thread
From: Max Gurtovoy @ 2024-01-29 12:37 UTC (permalink / raw)
To: Sagi Grimberg, Christoph Hellwig
Cc: kbusch, linux-nvme, oren, israelr, dwagner, oevron
On 29/01/2024 12:48, Sagi Grimberg wrote:
>
>>>> Even in this case of re-creation we must sync the namespace
>>>> identification.
>>>
>>> But a format is not a re-creation.
>>>
>>
>> we currently don't support format in fabrics so maybe I didn't use the
>> correct word during my answer.
>>
>> because of the dynamic nature of controllers and namespaces we have
>> the identify ctrl/ns commands in the specification and we need to use
>> them.
>
> ns scanning has been notorious to cause issues when executing with
> controller resets. So I don't think it is a good idea to make ctrl start
> block until it completes, especially not for an esoteric case that
> a ns that suddenly was reformatted between reconnects.
>
I think that making sure that the namespace is active and validated is
not esoteric.
What is the concern? Is it the time it takes to validate a namespace?
>> Re-queue requests to a non identified namespace is bad behavior. It
>> might be even a spec violation (I need to dig into the spec for it
>> probably and try to find some description on that scenario).
>
> It can't be a spec violation, the host cannot know if there is any
> aen pending (for example NS_CHANGE) when sending *ANY* command to the
> controller.
I think that AEN and reset flows are a bit different. The controller is
free to be modified at any time (even when no Admin queue is open from
the host). It may have an out-of-band management interface (for example
the nvmet CLI) and may not have any association (AQ) to the host to
notify it of changes.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 5/8] nvme: sync the namespace scanning during ctrl start
2024-01-29 12:37 ` Max Gurtovoy
@ 2024-01-31 12:40 ` Sagi Grimberg
2024-01-31 13:10 ` Christoph Hellwig
0 siblings, 1 reply; 36+ messages in thread
From: Sagi Grimberg @ 2024-01-31 12:40 UTC (permalink / raw)
To: Max Gurtovoy, Christoph Hellwig
Cc: kbusch, linux-nvme, oren, israelr, dwagner, oevron
On 1/29/24 14:37, Max Gurtovoy wrote:
>
>
> On 29/01/2024 12:48, Sagi Grimberg wrote:
>>
>>>>> Even in this case of re-creation we must sync the namespace
>>>>> identification.
>>>>
>>>> But a format is not a re-creation.
>>>>
>>>
>>> we currently don't support format in fabrics so maybe I didn't use
>>> the correct word during my answer.
>>>
>>> because of the dynamic nature of controllers and namespaces we have
>>> the identify ctrl/ns commands in the specification and we need to use
>>> them.
>>
>> ns scanning has been notorious to cause issues when executing with
>> controller resets. So I don't think it is a good idea to make ctrl start
>> block until it completes, especially not for an esoteric case that
>> a ns that suddenly was reformatted between reconnects.
>>
>
> I think that making sure that the namespace is active and validated is
> not esoteric.
The use-case you describe is very niche and esoteric IMO.
> What is the concern ? is it the time it takes to validate a namespace ?
Max, I mentioned that I don't think it is a good idea to change the
existing behavior: because of the time it takes, because it prevents
forward progress, because it doesn't fix an existing issue, etc.
I also don't think it's needed.
I think the original intent of this, as discussed in the past,
was to simply fail upon PI mismatch in nvme_init_ns_head and
nvme_validate_ns.
> I think that AEN and reset flows are a bit different. The controller is
> free to be modified at any time (even when no Admin queue is opened from
> the host). It may have out-of-band management interface (for example the
> nvmet cli) and may not have any association (AQ) to the host to notify
> on changes.
Fundamentally, things can change before the host knows about it, hence
I don't get the claim that somehow the host violates the spec.
Do others think that namespace scanning should now block
nvme_start_ctrl? I personally don't, but I'm open to justification
for why it is the right thing to do.
^ permalink raw reply [flat|nested] 36+ messages in thread
* Re: [PATCH 5/8] nvme: sync the namespace scanning during ctrl start
2024-01-31 12:40 ` Sagi Grimberg
@ 2024-01-31 13:10 ` Christoph Hellwig
0 siblings, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2024-01-31 13:10 UTC (permalink / raw)
To: Sagi Grimberg
Cc: Max Gurtovoy, Christoph Hellwig, kbusch, linux-nvme, oren,
israelr, dwagner, oevron
On Wed, Jan 31, 2024 at 02:40:07PM +0200, Sagi Grimberg wrote:
> Max, I mentioned that I don't think it is a good idea to change the
> existing behavior: because of the time it takes, because it's preventing
> forward progress, because it doesn't fix an existing issue, etc.
> I also don't think it's needed.
>
> I think the original intent of this, as discussed in the past, was
> to simply fail upon a PI mismatch in nvme_init_ns_head and
> nvme_validate_ns.
Yes.
>> I think that AEN and reset flows are a bit different. The controller is
>> free to be modified at any time (even when no Admin queue is opened from
>> the host). It may have an out-of-band management interface (for example
>> the nvmet CLI) and may not have any association (AQ) with the host to
>> notify it of changes.
>
> Fundamentally things can change before the host knows about them, hence
> I don't get the claim that somehow the host violates the spec.
>
> Do others think that namespace scanning should now block
> nvme_ctrl_start? I personally don't, but I'm open to justification
> for why it is the right thing to do.
Hell no.
* Re: [PATCH 7/8] nvme-fabrics: add option to disallow T10-PI offload
2024-01-23 9:04 ` Christoph Hellwig
@ 2024-02-01 10:40 ` Israel Rukshin
[not found] ` <fdd1c81f-caf3-4f34-96e8-f4d8ffc26203@nvidia.com>
1 sibling, 0 replies; 36+ messages in thread
From: Israel Rukshin @ 2024-02-01 10:40 UTC (permalink / raw)
To: Christoph Hellwig, Daniel Wagner
Cc: Max Gurtovoy, kbusch, sagi, linux-nvme, oren, oevron
On 1/23/2024 11:04 AM, Christoph Hellwig wrote:
> On Mon, Jan 22, 2024 at 04:28:53PM +0100, Daniel Wagner wrote:
>> On Mon, Jan 22, 2024 at 04:27:08PM +0100, Daniel Wagner wrote:
>>> On Mon, Jan 22, 2024 at 05:17:11PM +0200, Max Gurtovoy wrote:
>>>>> What about 'enable_pi' instead of the inverted boolean logic?
>>>> we can't change the default behavior (that is "enable" by default).
>>> But couldn't we just have
>>>
>>> enable_pi = true
>>>
>>> as default then? This should be the same as
>>>
>>> disallow_pi = false
>>>
>>> ?
>> Ah, I get it now. This interface doesn't work this way.
> The match_table_t infrastructure does allow checking for parameter
> values, including boolean ones. And this would indeed be a good
> use case for that. Extending the existing boolean-only flags
> to take a value would be nice as well.
Do you mean to do something like the below diff?
Regarding extending the existing boolean flags, are you talking about
hdr_digest, disable_sqflow, data_digest, discovery, duplicate_connect
and tls flags?
I will try to do it without breaking the existing nvme-cli, but I am not
sure if it is possible.
diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
index aa88606a44c4..6467b32e5d50 100644
--- a/drivers/nvme/host/fabrics.c
+++ b/drivers/nvme/host/fabrics.c
@@ -673,6 +673,9 @@ static const match_table_t opt_tokens = {
 #endif
 #ifdef CONFIG_NVME_TCP_TLS
 	{ NVMF_OPT_TLS,		"tls"		},
+#endif
+#ifdef CONFIG_BLK_DEV_INTEGRITY
+	{ NVMF_OPT_ENABLE_PI,	"enable_pi=%d"	},
 #endif
 	{ NVMF_OPT_ERR,		NULL		}
 };
@@ -1020,6 +1024,13 @@ static int nvmf_parse_options(struct nvmf_ctrl_options *opts,
 		}
 		opts->tls = true;
 		break;
+	case NVMF_OPT_ENABLE_PI:
+		if (match_int(args, &token)) {
+			ret = -EINVAL;
+			goto out;
+		}
+		opts->enable_pi = !!token;
+		break;
 	default:
 		pr_warn("unknown parameter or missing value '%s' in ctrl creation request\n",
 			p);
--
* Re: [PATCH 7/8] nvme-fabrics: add option to disallow T10-PI offload
[not found] ` <fdd1c81f-caf3-4f34-96e8-f4d8ffc26203@nvidia.com>
@ 2024-02-13 7:16 ` Christoph Hellwig
0 siblings, 0 replies; 36+ messages in thread
From: Christoph Hellwig @ 2024-02-13 7:16 UTC (permalink / raw)
To: Israel Rukshin
Cc: Christoph Hellwig, Daniel Wagner, Max Gurtovoy, kbusch, sagi,
linux-nvme, oren, oevron
On Thu, Feb 01, 2024 at 10:28:50AM +0200, Israel Rukshin wrote:
> Do you mean to do something like the below diff?
Yes.
> Regarding extending the existing boolean flags, are you talking about
> hdr_digest, disable_sqflow, data_digest, discovery, duplicate_connect and
> tls flags?
>
> I will try to do it without breaking the existing nvme-cli, but I am not
> sure if it is possible.
I'm pretty sure we have a bunch of fs mount options that allow a version
with parameters and without. dax is an example, but you might have to
look at older kernels, as the dax-supporting file systems were converted
to the new mount API in the meantime.
end of thread, other threads: [~2024-02-13 7:16 UTC | newest]
Thread overview: 36+ messages
-- links below jump to the message on this page --
2024-01-22 14:56 [PATCH v1 0/8] Enforce uniform metadata settings for ns head Max Gurtovoy
2024-01-22 14:56 ` [PATCH 1/8] nvme: use Independent ID-NS only for unknown cmd sets Max Gurtovoy
2024-01-23 8:58 ` Christoph Hellwig
2024-01-22 14:56 ` [PATCH 2/8] nvme: set uniform metadata settings for ns head Max Gurtovoy
2024-01-23 9:01 ` Christoph Hellwig
2024-01-22 14:56 ` [PATCH 3/8] nvme: allocate a new namespace if validation fail Max Gurtovoy
2024-01-23 9:02 ` Christoph Hellwig
2024-01-22 14:56 ` [PATCH 4/8] nvme: add nvme_queue_scan_sync helper Max Gurtovoy
2024-01-22 14:56 ` [PATCH 5/8] nvme: sync the namespace scanning during ctrl start Max Gurtovoy
2024-01-23 9:02 ` Christoph Hellwig
2024-01-24 0:47 ` Max Gurtovoy
2024-01-24 9:48 ` Christoph Hellwig
2024-01-24 10:23 ` Max Gurtovoy
2024-01-25 14:41 ` Christoph Hellwig
2024-01-25 15:55 ` Max Gurtovoy
2024-01-29 10:48 ` Sagi Grimberg
2024-01-29 12:37 ` Max Gurtovoy
2024-01-31 12:40 ` Sagi Grimberg
2024-01-31 13:10 ` Christoph Hellwig
2024-01-24 12:58 ` Sagi Grimberg
2024-01-24 13:04 ` Max Gurtovoy
2024-01-24 13:10 ` Sagi Grimberg
2024-01-24 13:17 ` Max Gurtovoy
2024-01-24 13:54 ` Sagi Grimberg
2024-01-24 14:15 ` Max Gurtovoy
2024-01-22 14:56 ` [PATCH 6/8] nvme-rdma: Fix transfer length when write_generate/read_verify are 0 Max Gurtovoy
2024-01-23 9:02 ` Christoph Hellwig
2024-01-22 14:56 ` [PATCH 7/8] nvme-fabrics: add option to disallow T10-PI offload Max Gurtovoy
2024-01-22 15:13 ` Daniel Wagner
2024-01-22 15:17 ` Max Gurtovoy
2024-01-22 15:27 ` Daniel Wagner
2024-01-22 15:28 ` Daniel Wagner
2024-01-23 9:04 ` Christoph Hellwig
2024-02-01 10:40 ` Israel Rukshin
[not found] ` <fdd1c81f-caf3-4f34-96e8-f4d8ffc26203@nvidia.com>
2024-02-13 7:16 ` Christoph Hellwig
2024-01-22 14:56 ` [PATCH 8/8] nvme-rdma: enable user " Max Gurtovoy