linux-nvme.lists.infradead.org archive mirror
* [PATCH v7 0/6] nvme-fc: FPIN link integrity handling
@ 2025-06-24 20:20 Bryan Gurney
  2025-06-24 20:20 ` [PATCH v7 1/6] fc_els: use 'union fc_tlv_desc' Bryan Gurney
                   ` (8 more replies)
  0 siblings, 9 replies; 21+ messages in thread
From: Bryan Gurney @ 2025-06-24 20:20 UTC (permalink / raw)
  To: linux-nvme, kbusch, hch, sagi, axboe
  Cc: james.smart, dick.kennedy, njavali, linux-scsi, hare, bgurney,
	jmeneghi

FPIN LI (link integrity) messages are received when the attached
fabric detects hardware errors. In response to these messages I/O
should be directed away from the affected ports, and only used
if the 'optimized' paths are unavailable.
Upon port reset the paths should be put back in service as the
affected hardware might have been replaced.
This patch adds a new controller flag 'NVME_CTRL_MARGINAL'
which will be checked during multipath path selection, causing the
path to be skipped when checking for 'optimized' paths. If no
optimized paths are available the 'marginal' paths are considered
for path selection alongside the 'non-optimized' paths.
It also introduces a new nvme-fc callback 'nvme_fc_fpin_rcv()' to
evaluate the FPIN LI TLV payload and set the 'marginal' state on
all affected rports.
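
For reference, the core of the path-selection change (condensed from
patch 2/6 below; not a complete function) is a flag test plus a
fallthrough, so an optimized-but-marginal path competes with the
non-optimized ones:

    static inline bool nvme_ctrl_is_marginal(struct nvme_ctrl *ctrl)
    {
            return test_bit(NVME_CTRL_MARGINAL, &ctrl->flags);
    }

    /* in __nvme_find_path() */
    case NVME_ANA_OPTIMIZED:
            if (!nvme_ctrl_is_marginal(ns->ctrl)) {
                    /* regular 'optimized' distance bookkeeping */
                    break;
            }
            fallthrough;
    case NVME_ANA_NONOPTIMIZED:
            ...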

The testing for this patch set was performed by Bryan Gurney, using the
process outlined by John Meneghini's presentation at LSFMM 2024, where
the fibre channel switch sends an FPIN notification on a specific switch
port, and the following is checked on the initiator:

1. The controllers corresponding to the paths on the port that has
received the notification are showing a set NVME_CTRL_MARGINAL flag.

   \
    +- nvme4 fc traddr=c,host_traddr=e live optimized
    +- nvme5 fc traddr=8,host_traddr=e live non-optimized
    +- nvme8 fc traddr=e,host_traddr=f marginal optimized
    +- nvme9 fc traddr=a,host_traddr=f marginal non-optimized

2. The I/O statistics of the test namespace show no I/O activity on the
controllers with NVME_CTRL_MARGINAL set.

   Device             tps    MB_read/s    MB_wrtn/s    MB_dscd/s
   nvme4c4n1         0.00         0.00         0.00         0.00
   nvme4c5n1     25001.00         0.00        97.66         0.00
   nvme4c9n1     25000.00         0.00        97.66         0.00
   nvme4n1       50011.00         0.00       195.36         0.00
   
   
   Device             tps    MB_read/s    MB_wrtn/s    MB_dscd/s
   nvme4c4n1         0.00         0.00         0.00         0.00
   nvme4c5n1     48360.00         0.00       188.91         0.00
   nvme4c9n1      1642.00         0.00         6.41         0.00
   nvme4n1       49981.00         0.00       195.24         0.00
   
   
   Device             tps    MB_read/s    MB_wrtn/s    MB_dscd/s
   nvme4c4n1         0.00         0.00         0.00         0.00
   nvme4c5n1     50001.00         0.00       195.32         0.00
   nvme4c9n1         0.00         0.00         0.00         0.00
   nvme4n1       50016.00         0.00       195.38         0.00

Link: https://people.redhat.com/jmeneghi/LSFMM_2024/LSFMM_2024_NVMe_Cancel_and_FPIN.pdf

More rigorous testing was also performed to ensure proper path migration
on each of the eight different FPIN link integrity events, particularly
during a scenario where there are only non-optimized paths available, in
a state where all paths are marginal.  On a configuration with a
round-robin iopolicy, when all paths on the host show as marginal, I/O
continues on the optimized path that was most recently non-marginal.
From this point, if both of the optimized paths are down, I/O properly
continues on the remaining paths.

Changes to the original submission:
- Changed flag name to 'marginal'
- Do not block marginal path; influence path selection instead
  to de-prioritize marginal paths

Changes to v2:
- Split off driver-specific modifications
- Introduce 'union fc_tlv_desc' to avoid casts

Changes to v3:
- Include reviews from Justin Tee
- Split marginal path handling patch

Changes to v4:
- Change 'u8' to '__u8' on fc_tlv_desc to fix a failure to build
- Print 'marginal' instead of 'live' in the state of controllers
  when they are marginal

Changes to v5:
- Minor spelling corrections to patch descriptions

Changes to v6:
- No code changes; added note about additional testing 

Hannes Reinecke (5):
  fc_els: use 'union fc_tlv_desc'
  nvme-fc: marginal path handling
  nvme-fc: nvme_fc_fpin_rcv() callback
  lpfc: enable FPIN notification for NVMe
  qla2xxx: enable FPIN notification for NVMe

Bryan Gurney (1):
  nvme: sysfs: emit the marginal path state in show_state()

 drivers/nvme/host/core.c         |   1 +
 drivers/nvme/host/fc.c           |  99 +++++++++++++++++++
 drivers/nvme/host/multipath.c    |  17 ++--
 drivers/nvme/host/nvme.h         |   6 ++
 drivers/nvme/host/sysfs.c        |   4 +-
 drivers/scsi/lpfc/lpfc_els.c     |  84 ++++++++--------
 drivers/scsi/qla2xxx/qla_isr.c   |   3 +
 drivers/scsi/scsi_transport_fc.c |  27 +++--
 include/linux/nvme-fc-driver.h   |   3 +
 include/uapi/scsi/fc/fc_els.h    | 165 +++++++++++++++++--------------
 10 files changed, 269 insertions(+), 140 deletions(-)

-- 
2.49.0



^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v7 1/6] fc_els: use 'union fc_tlv_desc'
  2025-06-24 20:20 [PATCH v7 0/6] nvme-fc: FPIN link integrity handling Bryan Gurney
@ 2025-06-24 20:20 ` Bryan Gurney
  2025-06-24 20:20 ` [PATCH v7 2/6] nvme-fc: marginal path handling Bryan Gurney
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 21+ messages in thread
From: Bryan Gurney @ 2025-06-24 20:20 UTC (permalink / raw)
  To: linux-nvme, kbusch, hch, sagi, axboe
  Cc: james.smart, dick.kennedy, njavali, linux-scsi, hare, bgurney,
	jmeneghi

From: Hannes Reinecke <hare@kernel.org>

Introduce 'union fc_tlv_desc' to have a common structure for all FC
ELS TLV structures and avoid type casts.

[bgurney: The cast inside the union fc_tlv_next_desc() has "u8",
which causes a failure to build.  Use "__u8" instead.]
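
The effect on callers, condensed from the hunks below, is to replace
casts on a generic descriptor pointer with an access to the union
member that matches the parsed tag:

    /* before */
    struct fc_tlv_desc *tlv;
    dtag = be32_to_cpu(tlv->desc_tag);
    li = (struct fc_fn_li_desc *)tlv;

    /* after */
    union fc_tlv_desc *tlv;
    dtag = be32_to_cpu(tlv->hdr.desc_tag);
    li = &tlv->li;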

Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Justin Tee <justin.tee@broadcom.com>
Tested-by: Bryan Gurney <bgurney@redhat.com>
---
 drivers/scsi/lpfc/lpfc_els.c     |  75 +++++++-------
 drivers/scsi/scsi_transport_fc.c |  27 +++--
 include/uapi/scsi/fc/fc_els.h    | 165 +++++++++++++++++--------------
 3 files changed, 135 insertions(+), 132 deletions(-)

diff --git a/drivers/scsi/lpfc/lpfc_els.c b/drivers/scsi/lpfc/lpfc_els.c
index 375a879c31f1..959603ab939a 100644
--- a/drivers/scsi/lpfc/lpfc_els.c
+++ b/drivers/scsi/lpfc/lpfc_els.c
@@ -3937,7 +3937,7 @@ lpfc_cmpl_els_edc(struct lpfc_hba *phba, struct lpfc_iocbq *cmdiocb,
 {
 	IOCB_t *irsp_iocb;
 	struct fc_els_edc_resp *edc_rsp;
-	struct fc_tlv_desc *tlv;
+	union fc_tlv_desc *tlv;
 	struct fc_diag_cg_sig_desc *pcgd;
 	struct fc_diag_lnkflt_desc *plnkflt;
 	struct lpfc_dmabuf *pcmd, *prsp;
@@ -4028,7 +4028,7 @@ lpfc_cmpl_els_edc(struct lpfc_hba *phba, struct lpfc_iocbq *cmdiocb,
 			goto out;
 		}
 
-		dtag = be32_to_cpu(tlv->desc_tag);
+		dtag = be32_to_cpu(tlv->hdr.desc_tag);
 		switch (dtag) {
 		case ELS_DTAG_LNK_FAULT_CAP:
 			if (bytes_remain < FC_TLV_DESC_SZ_FROM_LENGTH(tlv) ||
@@ -4043,7 +4043,7 @@ lpfc_cmpl_els_edc(struct lpfc_hba *phba, struct lpfc_iocbq *cmdiocb,
 					sizeof(struct fc_diag_lnkflt_desc));
 				goto out;
 			}
-			plnkflt = (struct fc_diag_lnkflt_desc *)tlv;
+			plnkflt = &tlv->lnkflt;
 			lpfc_printf_log(phba, KERN_INFO,
 				LOG_ELS | LOG_LDS_EVENT,
 				"4617 Link Fault Desc Data: 0x%08x 0x%08x "
@@ -4070,7 +4070,7 @@ lpfc_cmpl_els_edc(struct lpfc_hba *phba, struct lpfc_iocbq *cmdiocb,
 				goto out;
 			}
 
-			pcgd = (struct fc_diag_cg_sig_desc *)tlv;
+			pcgd = &tlv->cg_sig;
 			lpfc_printf_log(
 				phba, KERN_INFO, LOG_ELS | LOG_CGN_MGMT,
 				"4616 CGN Desc Data: 0x%08x 0x%08x "
@@ -4125,10 +4125,8 @@ lpfc_cmpl_els_edc(struct lpfc_hba *phba, struct lpfc_iocbq *cmdiocb,
 }
 
 static void
-lpfc_format_edc_lft_desc(struct lpfc_hba *phba, struct fc_tlv_desc *tlv)
+lpfc_format_edc_lft_desc(struct lpfc_hba *phba, struct fc_diag_lnkflt_desc *lft)
 {
-	struct fc_diag_lnkflt_desc *lft = (struct fc_diag_lnkflt_desc *)tlv;
-
 	lft->desc_tag = cpu_to_be32(ELS_DTAG_LNK_FAULT_CAP);
 	lft->desc_len = cpu_to_be32(
 		FC_TLV_DESC_LENGTH_FROM_SZ(struct fc_diag_lnkflt_desc));
@@ -4141,10 +4139,8 @@ lpfc_format_edc_lft_desc(struct lpfc_hba *phba, struct fc_tlv_desc *tlv)
 }
 
 static void
-lpfc_format_edc_cgn_desc(struct lpfc_hba *phba, struct fc_tlv_desc *tlv)
+lpfc_format_edc_cgn_desc(struct lpfc_hba *phba, struct fc_diag_cg_sig_desc *cgd)
 {
-	struct fc_diag_cg_sig_desc *cgd = (struct fc_diag_cg_sig_desc *)tlv;
-
 	/* We are assuming cgd was zero'ed before calling this routine */
 
 	/* Configure the congestion detection capability */
@@ -4233,7 +4229,7 @@ lpfc_issue_els_edc(struct lpfc_vport *vport, uint8_t retry)
 	struct lpfc_hba  *phba = vport->phba;
 	struct lpfc_iocbq *elsiocb;
 	struct fc_els_edc *edc_req;
-	struct fc_tlv_desc *tlv;
+	union fc_tlv_desc *tlv;
 	u16 cmdsize;
 	struct lpfc_nodelist *ndlp;
 	u8 *pcmd = NULL;
@@ -4272,13 +4268,13 @@ lpfc_issue_els_edc(struct lpfc_vport *vport, uint8_t retry)
 	tlv = edc_req->desc;
 
 	if (cgn_desc_size) {
-		lpfc_format_edc_cgn_desc(phba, tlv);
+		lpfc_format_edc_cgn_desc(phba, &tlv->cg_sig);
 		phba->cgn_sig_freq = lpfc_fabric_cgn_frequency;
 		tlv = fc_tlv_next_desc(tlv);
 	}
 
 	if (lft_desc_size)
-		lpfc_format_edc_lft_desc(phba, tlv);
+		lpfc_format_edc_lft_desc(phba, &tlv->lnkflt);
 
 	lpfc_printf_vlog(vport, KERN_INFO, LOG_ELS | LOG_CGN_MGMT,
 			 "4623 Xmit EDC to remote "
@@ -5823,7 +5819,7 @@ lpfc_issue_els_edc_rsp(struct lpfc_vport *vport, struct lpfc_iocbq *cmdiocb,
 {
 	struct lpfc_hba  *phba = vport->phba;
 	struct fc_els_edc_resp *edc_rsp;
-	struct fc_tlv_desc *tlv;
+	union fc_tlv_desc *tlv;
 	struct lpfc_iocbq *elsiocb;
 	IOCB_t *icmd, *cmd;
 	union lpfc_wqe128 *wqe;
@@ -5867,10 +5863,10 @@ lpfc_issue_els_edc_rsp(struct lpfc_vport *vport, struct lpfc_iocbq *cmdiocb,
 		FC_TLV_DESC_LENGTH_FROM_SZ(struct fc_els_lsri_desc));
 	edc_rsp->lsri.rqst_w0.cmd = ELS_EDC;
 	tlv = edc_rsp->desc;
-	lpfc_format_edc_cgn_desc(phba, tlv);
+	lpfc_format_edc_cgn_desc(phba, &tlv->cg_sig);
 	tlv = fc_tlv_next_desc(tlv);
 	if (lft_desc_size)
-		lpfc_format_edc_lft_desc(phba, tlv);
+		lpfc_format_edc_lft_desc(phba, &tlv->lnkflt);
 
 	lpfc_debugfs_disc_trc(vport, LPFC_DISC_TRC_ELS_RSP,
 			      "Issue EDC ACC:      did:x%x flg:x%lx refcnt %d",
@@ -9255,7 +9251,7 @@ lpfc_els_rcv_edc(struct lpfc_vport *vport, struct lpfc_iocbq *cmdiocb,
 {
 	struct lpfc_hba  *phba = vport->phba;
 	struct fc_els_edc *edc_req;
-	struct fc_tlv_desc *tlv;
+	union fc_tlv_desc *tlv;
 	uint8_t *payload;
 	uint32_t *ptr, dtag;
 	const char *dtag_nm;
@@ -9298,7 +9294,7 @@ lpfc_els_rcv_edc(struct lpfc_vport *vport, struct lpfc_iocbq *cmdiocb,
 			goto out;
 		}
 
-		dtag = be32_to_cpu(tlv->desc_tag);
+		dtag = be32_to_cpu(tlv->hdr.desc_tag);
 		switch (dtag) {
 		case ELS_DTAG_LNK_FAULT_CAP:
 			if (bytes_remain < FC_TLV_DESC_SZ_FROM_LENGTH(tlv) ||
@@ -9313,7 +9309,7 @@ lpfc_els_rcv_edc(struct lpfc_vport *vport, struct lpfc_iocbq *cmdiocb,
 					sizeof(struct fc_diag_lnkflt_desc));
 				goto out;
 			}
-			plnkflt = (struct fc_diag_lnkflt_desc *)tlv;
+			plnkflt = &tlv->lnkflt;
 			lpfc_printf_log(phba, KERN_INFO,
 				LOG_ELS | LOG_LDS_EVENT,
 				"4626 Link Fault Desc Data: x%08x len x%x "
@@ -9350,7 +9346,7 @@ lpfc_els_rcv_edc(struct lpfc_vport *vport, struct lpfc_iocbq *cmdiocb,
 			phba->cgn_sig_freq = lpfc_fabric_cgn_frequency;
 
 			lpfc_least_capable_settings(
-				phba, (struct fc_diag_cg_sig_desc *)tlv);
+				phba, &tlv->cg_sig);
 			break;
 		default:
 			dtag_nm = lpfc_get_tlv_dtag_nm(dtag);
@@ -9941,14 +9937,13 @@ lpfc_display_fpin_wwpn(struct lpfc_hba *phba, __be64 *wwnlist, u32 cnt)
 /**
  * lpfc_els_rcv_fpin_li - Process an FPIN Link Integrity Event.
  * @phba: Pointer to phba object.
- * @tlv:  Pointer to the Link Integrity Notification Descriptor.
+ * @li:  Pointer to the Link Integrity Notification Descriptor.
  *
  * This function processes a Link Integrity FPIN event by logging a message.
  **/
 static void
-lpfc_els_rcv_fpin_li(struct lpfc_hba *phba, struct fc_tlv_desc *tlv)
+lpfc_els_rcv_fpin_li(struct lpfc_hba *phba, struct fc_fn_li_desc *li)
 {
-	struct fc_fn_li_desc *li = (struct fc_fn_li_desc *)tlv;
 	const char *li_evt_str;
 	u32 li_evt, cnt;
 
@@ -9972,14 +9967,13 @@ lpfc_els_rcv_fpin_li(struct lpfc_hba *phba, struct fc_tlv_desc *tlv)
 /**
  * lpfc_els_rcv_fpin_del - Process an FPIN Delivery Event.
  * @phba: Pointer to hba object.
- * @tlv:  Pointer to the Delivery Notification Descriptor TLV
+ * @del:  Pointer to the Delivery Notification Descriptor TLV
  *
  * This function processes a Delivery FPIN event by logging a message.
  **/
 static void
-lpfc_els_rcv_fpin_del(struct lpfc_hba *phba, struct fc_tlv_desc *tlv)
+lpfc_els_rcv_fpin_del(struct lpfc_hba *phba, struct fc_fn_deli_desc *del)
 {
-	struct fc_fn_deli_desc *del = (struct fc_fn_deli_desc *)tlv;
 	const char *del_rsn_str;
 	u32 del_rsn;
 	__be32 *frame;
@@ -10010,14 +10004,14 @@ lpfc_els_rcv_fpin_del(struct lpfc_hba *phba, struct fc_tlv_desc *tlv)
 /**
  * lpfc_els_rcv_fpin_peer_cgn - Process a FPIN Peer Congestion Event.
  * @phba: Pointer to hba object.
- * @tlv:  Pointer to the Peer Congestion Notification Descriptor TLV
+ * @pc:  Pointer to the Peer Congestion Notification Descriptor TLV
  *
  * This function processes a Peer Congestion FPIN event by logging a message.
  **/
 static void
-lpfc_els_rcv_fpin_peer_cgn(struct lpfc_hba *phba, struct fc_tlv_desc *tlv)
+lpfc_els_rcv_fpin_peer_cgn(struct lpfc_hba *phba,
+			   struct fc_fn_peer_congn_desc *pc)
 {
-	struct fc_fn_peer_congn_desc *pc = (struct fc_fn_peer_congn_desc *)tlv;
 	const char *pc_evt_str;
 	u32 pc_evt, cnt;
 
@@ -10045,7 +10039,7 @@ lpfc_els_rcv_fpin_peer_cgn(struct lpfc_hba *phba, struct fc_tlv_desc *tlv)
 /**
  * lpfc_els_rcv_fpin_cgn - Process an FPIN Congestion notification
  * @phba: Pointer to hba object.
- * @tlv:  Pointer to the Congestion Notification Descriptor TLV
+ * @cgn:  Pointer to the Congestion Notification Descriptor TLV
  *
  * This function processes an FPIN Congestion Notifiction.  The notification
  * could be an Alarm or Warning.  This routine feeds that data into driver's
@@ -10054,10 +10048,9 @@ lpfc_els_rcv_fpin_peer_cgn(struct lpfc_hba *phba, struct fc_tlv_desc *tlv)
  * to the upper layer or 0 to indicate don't deliver it.
  **/
 static int
-lpfc_els_rcv_fpin_cgn(struct lpfc_hba *phba, struct fc_tlv_desc *tlv)
+lpfc_els_rcv_fpin_cgn(struct lpfc_hba *phba, struct fc_fn_congn_desc *cgn)
 {
 	struct lpfc_cgn_info *cp;
-	struct fc_fn_congn_desc *cgn = (struct fc_fn_congn_desc *)tlv;
 	const char *cgn_evt_str;
 	u32 cgn_evt;
 	const char *cgn_sev_str;
@@ -10160,7 +10153,7 @@ lpfc_els_rcv_fpin(struct lpfc_vport *vport, void *p, u32 fpin_length)
 {
 	struct lpfc_hba *phba = vport->phba;
 	struct fc_els_fpin *fpin = (struct fc_els_fpin *)p;
-	struct fc_tlv_desc *tlv, *first_tlv, *current_tlv;
+	union fc_tlv_desc *tlv, *first_tlv, *current_tlv;
 	const char *dtag_nm;
 	int desc_cnt = 0, bytes_remain, cnt;
 	u32 dtag, deliver = 0;
@@ -10185,7 +10178,7 @@ lpfc_els_rcv_fpin(struct lpfc_vport *vport, void *p, u32 fpin_length)
 		return;
 	}
 
-	tlv = (struct fc_tlv_desc *)&fpin->fpin_desc[0];
+	tlv = &fpin->fpin_desc[0];
 	first_tlv = tlv;
 	bytes_remain = fpin_length - offsetof(struct fc_els_fpin, fpin_desc);
 	bytes_remain = min_t(u32, bytes_remain, be32_to_cpu(fpin->desc_len));
@@ -10193,22 +10186,22 @@ lpfc_els_rcv_fpin(struct lpfc_vport *vport, void *p, u32 fpin_length)
 	/* process each descriptor separately */
 	while (bytes_remain >= FC_TLV_DESC_HDR_SZ &&
 	       bytes_remain >= FC_TLV_DESC_SZ_FROM_LENGTH(tlv)) {
-		dtag = be32_to_cpu(tlv->desc_tag);
+		dtag = be32_to_cpu(tlv->hdr.desc_tag);
 		switch (dtag) {
 		case ELS_DTAG_LNK_INTEGRITY:
-			lpfc_els_rcv_fpin_li(phba, tlv);
+			lpfc_els_rcv_fpin_li(phba, &tlv->li);
 			deliver = 1;
 			break;
 		case ELS_DTAG_DELIVERY:
-			lpfc_els_rcv_fpin_del(phba, tlv);
+			lpfc_els_rcv_fpin_del(phba, &tlv->deli);
 			deliver = 1;
 			break;
 		case ELS_DTAG_PEER_CONGEST:
-			lpfc_els_rcv_fpin_peer_cgn(phba, tlv);
+			lpfc_els_rcv_fpin_peer_cgn(phba, &tlv->peer_congn);
 			deliver = 1;
 			break;
 		case ELS_DTAG_CONGESTION:
-			deliver = lpfc_els_rcv_fpin_cgn(phba, tlv);
+			deliver = lpfc_els_rcv_fpin_cgn(phba, &tlv->congn);
 			break;
 		default:
 			dtag_nm = lpfc_get_tlv_dtag_nm(dtag);
@@ -10221,12 +10214,12 @@ lpfc_els_rcv_fpin(struct lpfc_vport *vport, void *p, u32 fpin_length)
 			return;
 		}
 		lpfc_cgn_update_stat(phba, dtag);
-		cnt = be32_to_cpu(tlv->desc_len);
+		cnt = be32_to_cpu(tlv->hdr.desc_len);
 
 		/* Sanity check descriptor length. The desc_len value does not
 		 * include space for the desc_tag and the desc_len fields.
 		 */
-		len -= (cnt + sizeof(struct fc_tlv_desc));
+		len -= (cnt + sizeof(struct fc_tlv_desc_hdr));
 		if (len < 0) {
 			dtag_nm = lpfc_get_tlv_dtag_nm(dtag);
 			lpfc_printf_log(phba, KERN_WARNING, LOG_CGN_MGMT,
diff --git a/drivers/scsi/scsi_transport_fc.c b/drivers/scsi/scsi_transport_fc.c
index 082f76e76721..a62636c6f708 100644
--- a/drivers/scsi/scsi_transport_fc.c
+++ b/drivers/scsi/scsi_transport_fc.c
@@ -750,13 +750,12 @@ fc_cn_stats_update(u16 event_type, struct fc_fpin_stats *stats)
  *
  */
 static void
-fc_fpin_li_stats_update(struct Scsi_Host *shost, struct fc_tlv_desc *tlv)
+fc_fpin_li_stats_update(struct Scsi_Host *shost, struct fc_fn_li_desc *li_desc)
 {
 	u8 i;
 	struct fc_rport *rport = NULL;
 	struct fc_rport *attach_rport = NULL;
 	struct fc_host_attrs *fc_host = shost_to_fc_host(shost);
-	struct fc_fn_li_desc *li_desc = (struct fc_fn_li_desc *)tlv;
 	u16 event_type = be16_to_cpu(li_desc->event_type);
 	u64 wwpn;
 
@@ -799,12 +798,11 @@ fc_fpin_li_stats_update(struct Scsi_Host *shost, struct fc_tlv_desc *tlv)
  */
 static void
 fc_fpin_delivery_stats_update(struct Scsi_Host *shost,
-			      struct fc_tlv_desc *tlv)
+			      struct fc_fn_deli_desc *dn_desc)
 {
 	struct fc_rport *rport = NULL;
 	struct fc_rport *attach_rport = NULL;
 	struct fc_host_attrs *fc_host = shost_to_fc_host(shost);
-	struct fc_fn_deli_desc *dn_desc = (struct fc_fn_deli_desc *)tlv;
 	u32 reason_code = be32_to_cpu(dn_desc->deli_reason_code);
 
 	rport = fc_find_rport_by_wwpn(shost,
@@ -830,13 +828,11 @@ fc_fpin_delivery_stats_update(struct Scsi_Host *shost,
  */
 static void
 fc_fpin_peer_congn_stats_update(struct Scsi_Host *shost,
-				struct fc_tlv_desc *tlv)
+				struct fc_fn_peer_congn_desc *pc_desc)
 {
 	u8 i;
 	struct fc_rport *rport = NULL;
 	struct fc_rport *attach_rport = NULL;
-	struct fc_fn_peer_congn_desc *pc_desc =
-	    (struct fc_fn_peer_congn_desc *)tlv;
 	u16 event_type = be16_to_cpu(pc_desc->event_type);
 	u64 wwpn;
 
@@ -876,10 +872,9 @@ fc_fpin_peer_congn_stats_update(struct Scsi_Host *shost,
  */
 static void
 fc_fpin_congn_stats_update(struct Scsi_Host *shost,
-			   struct fc_tlv_desc *tlv)
+			   struct fc_fn_congn_desc *congn)
 {
 	struct fc_host_attrs *fc_host = shost_to_fc_host(shost);
-	struct fc_fn_congn_desc *congn = (struct fc_fn_congn_desc *)tlv;
 
 	fc_cn_stats_update(be16_to_cpu(congn->event_type),
 			   &fc_host->fpin_stats);
@@ -899,32 +894,32 @@ fc_host_fpin_rcv(struct Scsi_Host *shost, u32 fpin_len, char *fpin_buf,
 		u8 event_acknowledge)
 {
 	struct fc_els_fpin *fpin = (struct fc_els_fpin *)fpin_buf;
-	struct fc_tlv_desc *tlv;
+	union fc_tlv_desc *tlv;
 	u32 bytes_remain;
 	u32 dtag;
 	enum fc_host_event_code event_code =
 		event_acknowledge ? FCH_EVT_LINK_FPIN_ACK : FCH_EVT_LINK_FPIN;
 
 	/* Update Statistics */
-	tlv = (struct fc_tlv_desc *)&fpin->fpin_desc[0];
+	tlv = &fpin->fpin_desc[0];
 	bytes_remain = fpin_len - offsetof(struct fc_els_fpin, fpin_desc);
 	bytes_remain = min_t(u32, bytes_remain, be32_to_cpu(fpin->desc_len));
 
 	while (bytes_remain >= FC_TLV_DESC_HDR_SZ &&
 	       bytes_remain >= FC_TLV_DESC_SZ_FROM_LENGTH(tlv)) {
-		dtag = be32_to_cpu(tlv->desc_tag);
+		dtag = be32_to_cpu(tlv->hdr.desc_tag);
 		switch (dtag) {
 		case ELS_DTAG_LNK_INTEGRITY:
-			fc_fpin_li_stats_update(shost, tlv);
+			fc_fpin_li_stats_update(shost, &tlv->li);
 			break;
 		case ELS_DTAG_DELIVERY:
-			fc_fpin_delivery_stats_update(shost, tlv);
+			fc_fpin_delivery_stats_update(shost, &tlv->deli);
 			break;
 		case ELS_DTAG_PEER_CONGEST:
-			fc_fpin_peer_congn_stats_update(shost, tlv);
+			fc_fpin_peer_congn_stats_update(shost, &tlv->peer_congn);
 			break;
 		case ELS_DTAG_CONGESTION:
-			fc_fpin_congn_stats_update(shost, tlv);
+			fc_fpin_congn_stats_update(shost, &tlv->congn);
 		}
 
 		bytes_remain -= FC_TLV_DESC_SZ_FROM_LENGTH(tlv);
diff --git a/include/uapi/scsi/fc/fc_els.h b/include/uapi/scsi/fc/fc_els.h
index 16782c360de3..3598dc553f4d 100644
--- a/include/uapi/scsi/fc/fc_els.h
+++ b/include/uapi/scsi/fc/fc_els.h
@@ -253,12 +253,12 @@ enum fc_ls_tlv_dtag {
 
 
 /*
- * Generic Link Service TLV Descriptor format
+ * Generic Link Service TLV Descriptor header
  *
  * This structure, as it defines no payload, will also be referred to
  * as the "tlv header" - which contains the tag and len fields.
  */
-struct fc_tlv_desc {
+struct fc_tlv_desc_hdr {
 	__be32		desc_tag;	/* Notification Descriptor Tag */
 	__be32		desc_len;	/* Length of Descriptor (in bytes).
 					 * Size of descriptor excluding
@@ -267,36 +267,6 @@ struct fc_tlv_desc {
 	__u8		desc_value[];  /* Descriptor Value */
 };
 
-/* Descriptor tag and len fields are considered the mandatory header
- * for a descriptor
- */
-#define FC_TLV_DESC_HDR_SZ	sizeof(struct fc_tlv_desc)
-
-/*
- * Macro, used when initializing payloads, to return the descriptor length.
- * Length is size of descriptor minus the tag and len fields.
- */
-#define FC_TLV_DESC_LENGTH_FROM_SZ(desc)	\
-		(sizeof(desc) - FC_TLV_DESC_HDR_SZ)
-
-/* Macro, used on received payloads, to return the descriptor length */
-#define FC_TLV_DESC_SZ_FROM_LENGTH(tlv)		\
-		(__be32_to_cpu((tlv)->desc_len) + FC_TLV_DESC_HDR_SZ)
-
-/*
- * This helper is used to walk descriptors in a descriptor list.
- * Given the address of the current descriptor, which minimally contains a
- * tag and len field, calculate the address of the next descriptor based
- * on the len field.
- */
-static inline void *fc_tlv_next_desc(void *desc)
-{
-	struct fc_tlv_desc *tlv = desc;
-
-	return (desc + FC_TLV_DESC_SZ_FROM_LENGTH(tlv));
-}
-
-
 /*
  * Link Service Request Information Descriptor
  */
@@ -1094,19 +1064,6 @@ struct fc_fn_congn_desc {
 	__u8		resv[3];	/* reserved - must be zero */
 };
 
-/*
- * ELS_FPIN - Fabric Performance Impact Notification
- */
-struct fc_els_fpin {
-	__u8		fpin_cmd;	/* command (0x16) */
-	__u8		fpin_zero[3];	/* specified as zero - part of cmd */
-	__be32		desc_len;	/* Length of Descriptor List (in bytes).
-					 * Size of ELS excluding fpin_cmd,
-					 * fpin_zero and desc_len fields.
-					 */
-	struct fc_tlv_desc	fpin_desc[];	/* Descriptor list */
-};
-
 /* Diagnostic Function Descriptor - FPIN Registration */
 struct fc_df_desc_fpin_reg {
 	__be32		desc_tag;	/* FPIN Registration (0x00030001) */
@@ -1125,33 +1082,6 @@ struct fc_df_desc_fpin_reg {
 					 */
 };
 
-/*
- * ELS_RDF - Register Diagnostic Functions
- */
-struct fc_els_rdf {
-	__u8		fpin_cmd;	/* command (0x19) */
-	__u8		fpin_zero[3];	/* specified as zero - part of cmd */
-	__be32		desc_len;	/* Length of Descriptor List (in bytes).
-					 * Size of ELS excluding fpin_cmd,
-					 * fpin_zero and desc_len fields.
-					 */
-	struct fc_tlv_desc	desc[];	/* Descriptor list */
-};
-
-/*
- * ELS RDF LS_ACC Response.
- */
-struct fc_els_rdf_resp {
-	struct fc_els_ls_acc	acc_hdr;
-	__be32			desc_list_len;	/* Length of response (in
-						 * bytes). Excludes acc_hdr
-						 * and desc_list_len fields.
-						 */
-	struct fc_els_lsri_desc	lsri;
-	struct fc_tlv_desc	desc[];	/* Supported Descriptor list */
-};
-
-
 /*
  * Diagnostic Capability Descriptors for EDC ELS
  */
@@ -1221,6 +1151,65 @@ struct fc_diag_cg_sig_desc {
 	struct fc_diag_cg_sig_freq	rcv_signal_frequency;
 };
 
+/*
+ * Generic Link Service TLV Descriptor format
+ *
+ * This structure, as it defines no payload, will also be referred to
+ * as the "tlv header" - which contains the tag and len fields.
+ */
+union fc_tlv_desc {
+	struct fc_tlv_desc_hdr hdr;
+	struct fc_els_lsri_desc lsri;
+	struct fc_fn_li_desc li;
+	struct fc_fn_deli_desc deli;
+	struct fc_fn_peer_congn_desc peer_congn;
+	struct fc_fn_congn_desc congn;
+	struct fc_df_desc_fpin_reg fpin_reg;
+	struct fc_diag_lnkflt_desc lnkflt;
+	struct fc_diag_cg_sig_desc cg_sig;
+};
+
+/* Descriptor tag and len fields are considered the mandatory header
+ * for a descriptor
+ */
+#define FC_TLV_DESC_HDR_SZ	sizeof(struct fc_tlv_desc_hdr)
+
+/*
+ * Macro, used when initializing payloads, to return the descriptor length.
+ * Length is size of descriptor minus the tag and len fields.
+ */
+#define FC_TLV_DESC_LENGTH_FROM_SZ(desc)	\
+		(sizeof(desc) - FC_TLV_DESC_HDR_SZ)
+
+/* Macro, used on received payloads, to return the descriptor length */
+#define FC_TLV_DESC_SZ_FROM_LENGTH(tlv)		\
+		(__be32_to_cpu((tlv)->hdr.desc_len) + FC_TLV_DESC_HDR_SZ)
+
+/*
+ * This helper is used to walk descriptors in a descriptor list.
+ * Given the address of the current descriptor, which minimally contains a
+ * tag and len field, calculate the address of the next descriptor based
+ * on the len field.
+ */
+static inline union fc_tlv_desc *fc_tlv_next_desc(union fc_tlv_desc *desc)
+{
+	return (union fc_tlv_desc *)((__u8 *)desc + FC_TLV_DESC_SZ_FROM_LENGTH(desc));
+}
+
+
+/*
+ * ELS_FPIN - Fabric Performance Impact Notification
+ */
+struct fc_els_fpin {
+	__u8		fpin_cmd;	/* command (0x16) */
+	__u8		fpin_zero[3];	/* specified as zero - part of cmd */
+	__be32		desc_len;	/* Length of Descriptor List (in bytes).
+					 * Size of ELS excluding fpin_cmd,
+					 * fpin_zero and desc_len fields.
+					 */
+	union fc_tlv_desc	fpin_desc[];	/* Descriptor list */
+};
+
 /*
  * ELS_EDC - Exchange Diagnostic Capabilities
  */
@@ -1231,10 +1220,37 @@ struct fc_els_edc {
 					 * Size of ELS excluding edc_cmd,
 					 * edc_zero and desc_len fields.
 					 */
-	struct fc_tlv_desc	desc[];
+	union fc_tlv_desc	desc[];
 					/* Diagnostic Descriptor list */
 };
 
+/*
+ * ELS_RDF - Register Diagnostic Functions
+ */
+struct fc_els_rdf {
+	__u8		fpin_cmd;	/* command (0x19) */
+	__u8		fpin_zero[3];	/* specified as zero - part of cmd */
+	__be32		desc_len;	/* Length of Descriptor List (in bytes).
+					 * Size of ELS excluding fpin_cmd,
+					 * fpin_zero and desc_len fields.
+					 */
+	union fc_tlv_desc	desc[];	/* Descriptor list */
+};
+
+/*
+ * ELS RDF LS_ACC Response.
+ */
+struct fc_els_rdf_resp {
+	struct fc_els_ls_acc	acc_hdr;
+	__be32			desc_list_len;	/* Length of response (in
+						 * bytes). Excludes acc_hdr
+						 * and desc_list_len fields.
+						 */
+	struct fc_els_lsri_desc	lsri;
+	union fc_tlv_desc	desc[];	/* Supported Descriptor list */
+};
+
+
 /*
  * ELS EDC LS_ACC Response.
  */
@@ -1245,9 +1261,8 @@ struct fc_els_edc_resp {
 						 * and desc_list_len fields.
 						 */
 	struct fc_els_lsri_desc	lsri;
-	struct fc_tlv_desc	desc[];
+	union fc_tlv_desc	desc[];
 				    /* Supported Diagnostic Descriptor list */
 };
 
-
 #endif /* _FC_ELS_H_ */
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v7 2/6] nvme-fc: marginal path handling
  2025-06-24 20:20 [PATCH v7 0/6] nvme-fc: FPIN link integrity handling Bryan Gurney
  2025-06-24 20:20 ` [PATCH v7 1/6] fc_els: use 'union fc_tlv_desc' Bryan Gurney
@ 2025-06-24 20:20 ` Bryan Gurney
  2025-07-03 11:47   ` Christoph Hellwig
  2025-06-24 20:20 ` [PATCH v7 3/6] nvme-fc: nvme_fc_fpin_rcv() callback Bryan Gurney
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 21+ messages in thread
From: Bryan Gurney @ 2025-06-24 20:20 UTC (permalink / raw)
  To: linux-nvme, kbusch, hch, sagi, axboe
  Cc: james.smart, dick.kennedy, njavali, linux-scsi, hare, bgurney,
	jmeneghi

From: Hannes Reinecke <hare@kernel.org>

FPIN LI (link integrity) messages are received when the attached
fabric detects hardware errors. In response to these messages I/O
should be directed away from the affected ports, and only used
if the 'optimized' paths are unavailable.
To handle this a new controller flag 'NVME_CTRL_MARGINAL' is added
which will cause the multipath scheduler to skip these paths when
checking for 'optimized' paths. They are, however, still eligible
for non-optimized path selection. The flag is cleared upon reset, as
the faulty hardware might have been replaced by then.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
Tested-by: Bryan Gurney <bgurney@redhat.com>
---
 drivers/nvme/host/core.c      |  1 +
 drivers/nvme/host/fc.c        |  4 ++++
 drivers/nvme/host/multipath.c | 17 +++++++++++------
 drivers/nvme/host/nvme.h      |  6 ++++++
 4 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 3da5ac71a9b0..ac03ef7baab9 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -5040,6 +5040,7 @@ int nvme_init_ctrl(struct nvme_ctrl *ctrl, struct device *dev,
 	WRITE_ONCE(ctrl->state, NVME_CTRL_NEW);
 	ctrl->passthru_err_log_enabled = false;
 	clear_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags);
+	clear_bit(NVME_CTRL_MARGINAL, &ctrl->flags);
 	spin_lock_init(&ctrl->lock);
 	mutex_init(&ctrl->namespaces_lock);
 
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index 014b387f1e8b..7e81c815bb83 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -786,6 +786,10 @@ nvme_fc_ctrl_connectivity_loss(struct nvme_fc_ctrl *ctrl)
 		"Reconnect", ctrl->cnum);
 
 	set_bit(ASSOC_FAILED, &ctrl->flags);
+
+	/* clear 'marginal' flag as controller will be reset */
+	clear_bit(NVME_CTRL_MARGINAL, &ctrl->flags);
+
 	nvme_reset_ctrl(&ctrl->ctrl);
 }
 
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 1062467595f3..003954985675 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -324,11 +324,14 @@ static struct nvme_ns *__nvme_find_path(struct nvme_ns_head *head, int node)
 
 		switch (ns->ana_state) {
 		case NVME_ANA_OPTIMIZED:
-			if (distance < found_distance) {
-				found_distance = distance;
-				found = ns;
+			if (!nvme_ctrl_is_marginal(ns->ctrl)) {
+				if (distance < found_distance) {
+					found_distance = distance;
+					found = ns;
+				}
+				break;
 			}
-			break;
+			fallthrough;
 		case NVME_ANA_NONOPTIMIZED:
 			if (distance < fallback_distance) {
 				fallback_distance = distance;
@@ -381,7 +384,8 @@ static struct nvme_ns *nvme_round_robin_path(struct nvme_ns_head *head)
 
 		if (ns->ana_state == NVME_ANA_OPTIMIZED) {
 			found = ns;
-			goto out;
+			if (!nvme_ctrl_is_marginal(ns->ctrl))
+				goto out;
 		}
 		if (ns->ana_state == NVME_ANA_NONOPTIMIZED)
 			found = ns;
@@ -445,7 +449,8 @@ static struct nvme_ns *nvme_queue_depth_path(struct nvme_ns_head *head)
 static inline bool nvme_path_is_optimized(struct nvme_ns *ns)
 {
 	return nvme_ctrl_state(ns->ctrl) == NVME_CTRL_LIVE &&
-		ns->ana_state == NVME_ANA_OPTIMIZED;
+		ns->ana_state == NVME_ANA_OPTIMIZED &&
+		!nvme_ctrl_is_marginal(ns->ctrl);
 }
 
 static struct nvme_ns *nvme_numa_path(struct nvme_ns_head *head)
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 7df2ea21851f..71a5c5f87db6 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -275,6 +275,7 @@ enum nvme_ctrl_flags {
 	NVME_CTRL_SKIP_ID_CNS_CS	= 4,
 	NVME_CTRL_DIRTY_CAPABILITY	= 5,
 	NVME_CTRL_FROZEN		= 6,
+	NVME_CTRL_MARGINAL		= 7,
 };
 
 struct nvme_ctrl {
@@ -417,6 +418,11 @@ static inline enum nvme_ctrl_state nvme_ctrl_state(struct nvme_ctrl *ctrl)
 	return READ_ONCE(ctrl->state);
 }
 
+static inline bool nvme_ctrl_is_marginal(struct nvme_ctrl *ctrl)
+{
+	return test_bit(NVME_CTRL_MARGINAL, &ctrl->flags);
+}
+
 enum nvme_iopolicy {
 	NVME_IOPOLICY_NUMA,
 	NVME_IOPOLICY_RR,
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v7 3/6] nvme-fc: nvme_fc_fpin_rcv() callback
  2025-06-24 20:20 [PATCH v7 0/6] nvme-fc: FPIN link integrity handling Bryan Gurney
  2025-06-24 20:20 ` [PATCH v7 1/6] fc_els: use 'union fc_tlv_desc' Bryan Gurney
  2025-06-24 20:20 ` [PATCH v7 2/6] nvme-fc: marginal path handling Bryan Gurney
@ 2025-06-24 20:20 ` Bryan Gurney
  2025-06-24 20:20 ` [PATCH v7 4/6] lpfc: enable FPIN notification for NVMe Bryan Gurney
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 21+ messages in thread
From: Bryan Gurney @ 2025-06-24 20:20 UTC (permalink / raw)
  To: linux-nvme, kbusch, hch, sagi, axboe
  Cc: james.smart, dick.kennedy, njavali, linux-scsi, hare, bgurney,
	jmeneghi

From: Hannes Reinecke <hare@kernel.org>

Add a callback nvme_fc_fpin_rcv() to evaluate the FPIN LI TLV
information and set the 'marginal' path status for all affected
rports.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
Tested-by: Bryan Gurney <bgurney@redhat.com>
---
 drivers/nvme/host/fc.c         | 95 ++++++++++++++++++++++++++++++++++
 include/linux/nvme-fc-driver.h |  3 ++
 2 files changed, 98 insertions(+)

diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index 7e81c815bb83..71a77917a14f 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -3724,6 +3724,101 @@ static struct nvmf_transport_ops nvme_fc_transport = {
 	.create_ctrl	= nvme_fc_create_ctrl,
 };
 
+static struct nvme_fc_rport *nvme_fc_rport_from_wwpn(struct nvme_fc_lport *lport,
+		u64 rport_wwpn)
+{
+	struct nvme_fc_rport *rport;
+
+	list_for_each_entry(rport, &lport->endp_list, endp_list) {
+		if (!nvme_fc_rport_get(rport))
+			continue;
+		if (rport->remoteport.port_name == rport_wwpn &&
+		    rport->remoteport.port_role & FC_PORT_ROLE_NVME_TARGET)
+			return rport;
+		nvme_fc_rport_put(rport);
+	}
+	return NULL;
+}
+
+static void
+nvme_fc_fpin_li_lport_update(struct nvme_fc_lport *lport, struct fc_fn_li_desc *li)
+{
+	unsigned int i, pname_count = be32_to_cpu(li->pname_count);
+	u64 attached_wwpn = be64_to_cpu(li->attached_wwpn);
+	struct nvme_fc_rport *attached_rport;
+
+	for (i = 0; i < pname_count; i++) {
+		struct nvme_fc_rport *rport;
+		u64 wwpn = be64_to_cpu(li->pname_list[i]);
+
+		rport = nvme_fc_rport_from_wwpn(lport, wwpn);
+		if (!rport)
+			continue;
+		if (wwpn != attached_wwpn) {
+			struct nvme_fc_ctrl *ctrl;
+
+			spin_lock_irq(&rport->lock);
+			list_for_each_entry(ctrl, &rport->ctrl_list, ctrl_list)
+				set_bit(NVME_CTRL_MARGINAL, &ctrl->ctrl.flags);
+			spin_unlock_irq(&rport->lock);
+		}
+		nvme_fc_rport_put(rport);
+	}
+
+	attached_rport = nvme_fc_rport_from_wwpn(lport, attached_wwpn);
+	if (attached_rport) {
+		struct nvme_fc_ctrl *ctrl;
+
+		spin_lock_irq(&attached_rport->lock);
+		list_for_each_entry(ctrl, &attached_rport->ctrl_list, ctrl_list)
+			set_bit(NVME_CTRL_MARGINAL, &ctrl->ctrl.flags);
+		spin_unlock_irq(&attached_rport->lock);
+		nvme_fc_rport_put(attached_rport);
+	}
+}
+
+/**
+ * fc_host_fpin_rcv() - Process a received FPIN.
+ * @localport:		local port the FPIN was received on
+ * @fpin_len:		length of FPIN payload, in bytes
+ * @fpin_buf:		pointer to FPIN payload
+ * Notes:
+ *	This routine assumes no locks are held on entry.
+ */
+void
+nvme_fc_fpin_rcv(struct nvme_fc_local_port *localport,
+		 u32 fpin_len, char *fpin_buf)
+{
+	struct nvme_fc_lport *lport;
+	struct fc_els_fpin *fpin = (struct fc_els_fpin *)fpin_buf;
+	union fc_tlv_desc *tlv;
+	u32 bytes_remain;
+	u32 dtag;
+
+	if (!localport)
+		return;
+	lport = localport_to_lport(localport);
+	tlv = &fpin->fpin_desc[0];
+	bytes_remain = fpin_len - offsetof(struct fc_els_fpin, fpin_desc);
+	bytes_remain = min_t(u32, bytes_remain, be32_to_cpu(fpin->desc_len));
+
+	while (bytes_remain >= FC_TLV_DESC_HDR_SZ &&
+	       bytes_remain >= FC_TLV_DESC_SZ_FROM_LENGTH(tlv)) {
+		dtag = be32_to_cpu(tlv->hdr.desc_tag);
+		switch (dtag) {
+		case ELS_DTAG_LNK_INTEGRITY:
+			nvme_fc_fpin_li_lport_update(lport, &tlv->li);
+			break;
+		default:
+			break;
+		}
+
+		bytes_remain -= FC_TLV_DESC_SZ_FROM_LENGTH(tlv);
+		tlv = fc_tlv_next_desc(tlv);
+	}
+}
+EXPORT_SYMBOL(nvme_fc_fpin_rcv);
+
 /* Arbitrary successive failures max. With lots of subsystems could be high */
 #define DISCOVERY_MAX_FAIL	20
 
diff --git a/include/linux/nvme-fc-driver.h b/include/linux/nvme-fc-driver.h
index 9f6acadfe0c8..bcd3b1e5a256 100644
--- a/include/linux/nvme-fc-driver.h
+++ b/include/linux/nvme-fc-driver.h
@@ -536,6 +536,9 @@ void nvme_fc_rescan_remoteport(struct nvme_fc_remote_port *remoteport);
 int nvme_fc_set_remoteport_devloss(struct nvme_fc_remote_port *remoteport,
 			u32 dev_loss_tmo);
 
+void nvme_fc_fpin_rcv(struct nvme_fc_local_port *localport,
+		      u32 fpin_len, char *fpin_buf);
+
 /*
  * Routine called to pass a NVME-FC LS request, received by the lldd,
  * to the nvme-fc transport.
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v7 4/6] lpfc: enable FPIN notification for NVMe
  2025-06-24 20:20 [PATCH v7 0/6] nvme-fc: FPIN link integrity handling Bryan Gurney
                   ` (2 preceding siblings ...)
  2025-06-24 20:20 ` [PATCH v7 3/6] nvme-fc: nvme_fc_fpin_rcv() callback Bryan Gurney
@ 2025-06-24 20:20 ` Bryan Gurney
  2025-06-24 20:20 ` [PATCH v7 5/6] qla2xxx: " Bryan Gurney
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 21+ messages in thread
From: Bryan Gurney @ 2025-06-24 20:20 UTC (permalink / raw)
  To: linux-nvme, kbusch, hch, sagi, axboe
  Cc: james.smart, dick.kennedy, njavali, linux-scsi, hare, bgurney,
	jmeneghi

From: Hannes Reinecke <hare@kernel.org>

Call 'nvme_fc_fpin_rcv()' to enable FPIN notifications for NVMe.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Justin Tee <justin.tee@broadcom.com>
Tested-by: Bryan Gurney <bgurney@redhat.com>
---
 drivers/scsi/lpfc/lpfc_els.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/lpfc/lpfc_els.c b/drivers/scsi/lpfc/lpfc_els.c
index 959603ab939a..396ed1a05bc9 100644
--- a/drivers/scsi/lpfc/lpfc_els.c
+++ b/drivers/scsi/lpfc/lpfc_els.c
@@ -33,6 +33,7 @@
 #include <scsi/scsi_transport_fc.h>
 #include <uapi/scsi/fc/fc_fs.h>
 #include <uapi/scsi/fc/fc_els.h>
+#include <linux/nvme-fc-driver.h>
 
 #include "lpfc_hw4.h"
 #include "lpfc_hw.h"
@@ -10248,9 +10249,15 @@ lpfc_els_rcv_fpin(struct lpfc_vport *vport, void *p, u32 fpin_length)
 		fpin_length += sizeof(struct fc_els_fpin); /* the entire FPIN */
 
 		/* Send every descriptor individually to the upper layer */
-		if (deliver)
+		if (deliver) {
 			fc_host_fpin_rcv(lpfc_shost_from_vport(vport),
 					 fpin_length, (char *)fpin, 0);
+#if (IS_ENABLED(CONFIG_NVME_FC))
+			if (vport->cfg_enable_fc4_type & LPFC_ENABLE_NVME)
+				nvme_fc_fpin_rcv(vport->localport,
+						 fpin_length, (char *)fpin);
+#endif
+		}
 		desc_cnt++;
 	}
 }
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v7 5/6] qla2xxx: enable FPIN notification for NVMe
  2025-06-24 20:20 [PATCH v7 0/6] nvme-fc: FPIN link integrity handling Bryan Gurney
                   ` (3 preceding siblings ...)
  2025-06-24 20:20 ` [PATCH v7 4/6] lpfc: enable FPIN notification for NVMe Bryan Gurney
@ 2025-06-24 20:20 ` Bryan Gurney
  2025-06-24 20:20 ` [PATCH v7 6/6] nvme: sysfs: emit the marginal path state in show_state() Bryan Gurney
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 21+ messages in thread
From: Bryan Gurney @ 2025-06-24 20:20 UTC (permalink / raw)
  To: linux-nvme, kbusch, hch, sagi, axboe
  Cc: james.smart, dick.kennedy, njavali, linux-scsi, hare, bgurney,
	jmeneghi

From: Hannes Reinecke <hare@kernel.org>

Call 'nvme_fc_fpin_rcv()' to enable FPIN notifications for NVMe.

Signed-off-by: Hannes Reinecke <hare@kernel.org>
---
 drivers/scsi/qla2xxx/qla_isr.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/scsi/qla2xxx/qla_isr.c b/drivers/scsi/qla2xxx/qla_isr.c
index fe98c76e9be3..cfe7afc905b4 100644
--- a/drivers/scsi/qla2xxx/qla_isr.c
+++ b/drivers/scsi/qla2xxx/qla_isr.c
@@ -46,6 +46,9 @@ qla27xx_process_purex_fpin(struct scsi_qla_host *vha, struct purex_item *item)
 		       pkt, pkt_size);
 
 	fc_host_fpin_rcv(vha->host, pkt_size, (char *)pkt, 0);
+#if (IS_ENABLED(CONFIG_NVME_FC))
+	nvme_fc_fpin_rcv(vha->nvme_local_port, pkt_size, (char *)pkt);
+#endif
 }
 
 const char *const port_state_str[] = {
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH v7 6/6] nvme: sysfs: emit the marginal path state in show_state()
  2025-06-24 20:20 [PATCH v7 0/6] nvme-fc: FPIN link integrity handling Bryan Gurney
                   ` (4 preceding siblings ...)
  2025-06-24 20:20 ` [PATCH v7 5/6] qla2xxx: " Bryan Gurney
@ 2025-06-24 20:20 ` Bryan Gurney
  2025-07-01  0:27 ` [PATCHv7 0/6] nvme-fc: FPIN link integrity handling Muneendra Kumar
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 21+ messages in thread
From: Bryan Gurney @ 2025-06-24 20:20 UTC (permalink / raw)
  To: linux-nvme, kbusch, hch, sagi, axboe
  Cc: james.smart, dick.kennedy, njavali, linux-scsi, hare, bgurney,
	jmeneghi

If a controller has received a link integrity or congestion event, and
has the NVME_CTRL_MARGINAL flag set, emit "marginal" in the state
instead of "live", to identify the marginal paths.

Co-developed-by: John Meneghini <jmeneghi@redhat.com>
Signed-off-by: John Meneghini <jmeneghi@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Bryan Gurney <bgurney@redhat.com>
---
 drivers/nvme/host/sysfs.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/sysfs.c b/drivers/nvme/host/sysfs.c
index 29430949ce2f..4a6135c2f9cb 100644
--- a/drivers/nvme/host/sysfs.c
+++ b/drivers/nvme/host/sysfs.c
@@ -430,7 +430,9 @@ static ssize_t nvme_sysfs_show_state(struct device *dev,
 	};
 
 	if (state < ARRAY_SIZE(state_name) && state_name[state])
-		return sysfs_emit(buf, "%s\n", state_name[state]);
+		return sysfs_emit(buf, "%s\n",
+			(nvme_ctrl_is_marginal(ctrl)) ? "marginal" :
+			state_name[state]);
 
 	return sysfs_emit(buf, "unknown state\n");
 }
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCHv7 0/6] nvme-fc: FPIN link integrity handling
  2025-06-24 20:20 [PATCH v7 0/6] nvme-fc: FPIN link integrity handling Bryan Gurney
                   ` (5 preceding siblings ...)
  2025-06-24 20:20 ` [PATCH v7 6/6] nvme: sysfs: emit the marginal path state in show_state() Bryan Gurney
@ 2025-07-01  0:27 ` Muneendra Kumar
  2025-07-01 20:32 ` [PATCH v7 " Bryan Gurney
  2025-07-11  2:59 ` [PATCH v8 8/8] nvme-multipath: queue-depth support for marginal paths Muneendra Kumar
  8 siblings, 0 replies; 21+ messages in thread
From: Muneendra Kumar @ 2025-07-01  0:27 UTC (permalink / raw)
  To: bgurney
  Cc: axboe, dick.kennedy, hare, hch, james.smart, jmeneghi, kbusch,
	linux-nvme, linux-scsi, njavali, sagi, muneendra737,
	Muneendra Kumar

Thanks for the patch.
We have tested this patch and it is working as designed.
This functionality is a crucial enhancement for Fibre Channel solutions.
This feature carries forward, for NVMe, the capabilities that were
previously available through dm-multipath.
With this patch we have a solution for both SCSI and NVMe.

Tested-by: Muneendra Kumar <muneendra.kumar@broadcom.com>


Regards,
Muneendra.




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v7 0/6] nvme-fc: FPIN link integrity handling
  2025-06-24 20:20 [PATCH v7 0/6] nvme-fc: FPIN link integrity handling Bryan Gurney
                   ` (6 preceding siblings ...)
  2025-07-01  0:27 ` [PATCHv7 0/6] nvme-fc: FPIN link integrity handling Muneendra Kumar
@ 2025-07-01 20:32 ` Bryan Gurney
  2025-07-02  6:10   ` Hannes Reinecke
  2025-07-11  2:59 ` [PATCH v8 8/8] nvme-multipath: queue-depth support for marginal paths Muneendra Kumar
  8 siblings, 1 reply; 21+ messages in thread
From: Bryan Gurney @ 2025-07-01 20:32 UTC (permalink / raw)
  To: linux-nvme, kbusch, hch, sagi, axboe
  Cc: james.smart, dick.kennedy, njavali, linux-scsi, hare, jmeneghi

On Tue, Jun 24, 2025 at 4:20 PM Bryan Gurney <bgurney@redhat.com> wrote:
>
> FPIN LI (link integrity) messages are received when the attached
> fabric detects hardware errors. In response to these messages I/O
> should be directed away from the affected ports, and only used
> if the 'optimized' paths are unavailable.
> Upon port reset the paths should be put back in service as the
> affected hardware might have been replaced.
> This patch adds a new controller flag 'NVME_CTRL_MARGINAL'
> which will be checked during multipath path selection, causing the
> path to be skipped when checking for 'optimized' paths. If no
> optimized paths are available the 'marginal' paths are considered
> for path selection alongside the 'non-optimized' paths.
> It also introduces a new nvme-fc callback 'nvme_fc_fpin_rcv()' to
> evaluate the FPIN LI TLV payload and set the 'marginal' state on
> all affected rports.
>
> The testing for this patch set was performed by Bryan Gurney, using the
> process outlined by John Meneghini's presentation at LSFMM 2024, where
> the fibre channel switch sends an FPIN notification on a specific switch
> port, and the following is checked on the initiator:
>
> 1. The controllers corresponding to the paths on the port that has
> received the notification are showing a set NVME_CTRL_MARGINAL flag.
>
>    \
>     +- nvme4 fc traddr=c,host_traddr=e live optimized
>     +- nvme5 fc traddr=8,host_traddr=e live non-optimized
>     +- nvme8 fc traddr=e,host_traddr=f marginal optimized
>     +- nvme9 fc traddr=a,host_traddr=f marginal non-optimized
>
> 2. The I/O statistics of the test namespace show no I/O activity on the
> controllers with NVME_CTRL_MARGINAL set.
>
>    Device             tps    MB_read/s    MB_wrtn/s    MB_dscd/s
>    nvme4c4n1         0.00         0.00         0.00         0.00
>    nvme4c5n1     25001.00         0.00        97.66         0.00
>    nvme4c9n1     25000.00         0.00        97.66         0.00
>    nvme4n1       50011.00         0.00       195.36         0.00
>
>
>    Device             tps    MB_read/s    MB_wrtn/s    MB_dscd/s
>    nvme4c4n1         0.00         0.00         0.00         0.00
>    nvme4c5n1     48360.00         0.00       188.91         0.00
>    nvme4c9n1      1642.00         0.00         6.41         0.00
>    nvme4n1       49981.00         0.00       195.24         0.00
>
>
>    Device             tps    MB_read/s    MB_wrtn/s    MB_dscd/s
>    nvme4c4n1         0.00         0.00         0.00         0.00
>    nvme4c5n1     50001.00         0.00       195.32         0.00
>    nvme4c9n1         0.00         0.00         0.00         0.00
>    nvme4n1       50016.00         0.00       195.38         0.00
>
> Link: https://people.redhat.com/jmeneghi/LSFMM_2024/LSFMM_2024_NVMe_Cancel_and_FPIN.pdf
>
> More rigorous testing was also performed to ensure proper path migration
> on each of the eight different FPIN link integrity events, particularly
> during a scenario where there are only non-optimized paths available, in
> a state where all paths are marginal.  On a configuration with a
> round-robin iopolicy, when all paths on the host show as marginal, I/O
> continues on the optimized path that was most recently non-marginal.
> From this point, if both of the optimized paths are down, I/O properly
> continues on the remaining paths.
>
> Changes to the original submission:
> - Changed flag name to 'marginal'
> - Do not block marginal path; influence path selection instead
>   to de-prioritize marginal paths
>
> Changes to v2:
> - Split off driver-specific modifications
> - Introduce 'union fc_tlv_desc' to avoid casts
>
> Changes to v3:
> - Include reviews from Justin Tee
> - Split marginal path handling patch
>
> Changes to v4:
> - Change 'u8' to '__u8' on fc_tlv_desc to fix a failure to build
> - Print 'marginal' instead of 'live' in the state of controllers
>   when they are marginal
>
> Changes to v5:
> - Minor spelling corrections to patch descriptions
>
> Changes to v6:
> - No code changes; added note about additional testing
>
> Hannes Reinecke (5):
>   fc_els: use 'union fc_tlv_desc'
>   nvme-fc: marginal path handling
>   nvme-fc: nvme_fc_fpin_rcv() callback
>   lpfc: enable FPIN notification for NVMe
>   qla2xxx: enable FPIN notification for NVMe
>
> Bryan Gurney (1):
>   nvme: sysfs: emit the marginal path state in show_state()
>
>  drivers/nvme/host/core.c         |   1 +
>  drivers/nvme/host/fc.c           |  99 +++++++++++++++++++
>  drivers/nvme/host/multipath.c    |  17 ++--
>  drivers/nvme/host/nvme.h         |   6 ++
>  drivers/nvme/host/sysfs.c        |   4 +-
>  drivers/scsi/lpfc/lpfc_els.c     |  84 ++++++++--------
>  drivers/scsi/qla2xxx/qla_isr.c   |   3 +
>  drivers/scsi/scsi_transport_fc.c |  27 +++--
>  include/linux/nvme-fc-driver.h   |   3 +
>  include/uapi/scsi/fc/fc_els.h    | 165 +++++++++++++++++--------------
>  10 files changed, 269 insertions(+), 140 deletions(-)
>
> --
> 2.49.0
>


We're going to be working on follow-up patches to address some things
that I found in additional testing:

During path fail testing on the numa iopolicy, I found that I/O moves
off of the marginal path after a first link integrity event is
received, but if the non-marginal path the I/O is on is disconnected,
the I/O is transferred onto a marginal path (in testing, sometimes
I've seen it go to a "marginal optimized" path, and sometimes
"marginal non-optimized").

The queue-depth iopolicy doesn't change its path selection based on
the marginal flag, but looking at nvme_queue_depth_path(), I can see
that there's currently no logic to handle marginal paths.  We're
developing a patch to address that issue in queue-depth, but we need
to do more testing.


Thanks,

Bryan



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v7 0/6] nvme-fc: FPIN link integrity handling
  2025-07-01 20:32 ` [PATCH v7 " Bryan Gurney
@ 2025-07-02  6:10   ` Hannes Reinecke
  2025-07-08 19:56     ` John Meneghini
  0 siblings, 1 reply; 21+ messages in thread
From: Hannes Reinecke @ 2025-07-02  6:10 UTC (permalink / raw)
  To: Bryan Gurney, linux-nvme, kbusch, hch, sagi, axboe
  Cc: james.smart, dick.kennedy, njavali, linux-scsi, jmeneghi

On 7/1/25 22:32, Bryan Gurney wrote:
> On Tue, Jun 24, 2025 at 4:20 PM Bryan Gurney <bgurney@redhat.com> wrote:
>>
>> FPIN LI (link integrity) messages are received when the attached
>> fabric detects hardware errors. In response to these messages I/O
>> should be directed away from the affected ports, and only used
>> if the 'optimized' paths are unavailable.
>> Upon port reset the paths should be put back in service as the
>> affected hardware might have been replaced.
>> This patch adds a new controller flag 'NVME_CTRL_MARGINAL'
>> which will be checked during multipath path selection, causing the
>> path to be skipped when checking for 'optimized' paths. If no
>> optimized paths are available the 'marginal' paths are considered
>> for path selection alongside the 'non-optimized' paths.
>> It also introduces a new nvme-fc callback 'nvme_fc_fpin_rcv()' to
>> evaluate the FPIN LI TLV payload and set the 'marginal' state on
>> all affected rports.
>>
>> The testing for this patch set was performed by Bryan Gurney, using the
>> process outlined by John Meneghini's presentation at LSFMM 2024, where
>> the fibre channel switch sends an FPIN notification on a specific switch
>> port, and the following is checked on the initiator:
>>
>> 1. The controllers corresponding to the paths on the port that has
>> received the notification are showing a set NVME_CTRL_MARGINAL flag.
>>
>>     \
>>      +- nvme4 fc traddr=c,host_traddr=e live optimized
>>      +- nvme5 fc traddr=8,host_traddr=e live non-optimized
>>      +- nvme8 fc traddr=e,host_traddr=f marginal optimized
>>      +- nvme9 fc traddr=a,host_traddr=f marginal non-optimized
>>
>> 2. The I/O statistics of the test namespace show no I/O activity on the
>> controllers with NVME_CTRL_MARGINAL set.
>>
>>     Device             tps    MB_read/s    MB_wrtn/s    MB_dscd/s
>>     nvme4c4n1         0.00         0.00         0.00         0.00
>>     nvme4c5n1     25001.00         0.00        97.66         0.00
>>     nvme4c9n1     25000.00         0.00        97.66         0.00
>>     nvme4n1       50011.00         0.00       195.36         0.00
>>
>>
>>     Device             tps    MB_read/s    MB_wrtn/s    MB_dscd/s
>>     nvme4c4n1         0.00         0.00         0.00         0.00
>>     nvme4c5n1     48360.00         0.00       188.91         0.00
>>     nvme4c9n1      1642.00         0.00         6.41         0.00
>>     nvme4n1       49981.00         0.00       195.24         0.00
>>
>>
>>     Device             tps    MB_read/s    MB_wrtn/s    MB_dscd/s
>>     nvme4c4n1         0.00         0.00         0.00         0.00
>>     nvme4c5n1     50001.00         0.00       195.32         0.00
>>     nvme4c9n1         0.00         0.00         0.00         0.00
>>     nvme4n1       50016.00         0.00       195.38         0.00
>>
>> Link: https://people.redhat.com/jmeneghi/LSFMM_2024/LSFMM_2024_NVMe_Cancel_and_FPIN.pdf
>>
>> More rigorous testing was also performed to ensure proper path migration
>> on each of the eight different FPIN link integrity events, particularly
>> during a scenario where there are only non-optimized paths available, in
>> a state where all paths are marginal.  On a configuration with a
>> round-robin iopolicy, when all paths on the host show as marginal, I/O
>> continues on the optimized path that was most recently non-marginal.
>>  From this point, if both of the optimized paths are down, I/O properly
>> continues on the remaining paths.
>>
>> Changes to the original submission:
>> - Changed flag name to 'marginal'
>> - Do not block marginal path; influence path selection instead
>>    to de-prioritize marginal paths
>>
>> Changes to v2:
>> - Split off driver-specific modifications
>> - Introduce 'union fc_tlv_desc' to avoid casts
>>
>> Changes to v3:
>> - Include reviews from Justin Tee
>> - Split marginal path handling patch
>>
>> Changes to v4:
>> - Change 'u8' to '__u8' on fc_tlv_desc to fix a failure to build
>> - Print 'marginal' instead of 'live' in the state of controllers
>>    when they are marginal
>>
>> Changes to v5:
>> - Minor spelling corrections to patch descriptions
>>
>> Changes to v6:
>> - No code changes; added note about additional testing
>>
>> Hannes Reinecke (5):
>>    fc_els: use 'union fc_tlv_desc'
>>    nvme-fc: marginal path handling
>>    nvme-fc: nvme_fc_fpin_rcv() callback
>>    lpfc: enable FPIN notification for NVMe
>>    qla2xxx: enable FPIN notification for NVMe
>>
>> Bryan Gurney (1):
>>    nvme: sysfs: emit the marginal path state in show_state()
>>
>>   drivers/nvme/host/core.c         |   1 +
>>   drivers/nvme/host/fc.c           |  99 +++++++++++++++++++
>>   drivers/nvme/host/multipath.c    |  17 ++--
>>   drivers/nvme/host/nvme.h         |   6 ++
>>   drivers/nvme/host/sysfs.c        |   4 +-
>>   drivers/scsi/lpfc/lpfc_els.c     |  84 ++++++++--------
>>   drivers/scsi/qla2xxx/qla_isr.c   |   3 +
>>   drivers/scsi/scsi_transport_fc.c |  27 +++--
>>   include/linux/nvme-fc-driver.h   |   3 +
>>   include/uapi/scsi/fc/fc_els.h    | 165 +++++++++++++++++--------------
>>   10 files changed, 269 insertions(+), 140 deletions(-)
>>
>> --
>> 2.49.0
>>
> 
> 
> We're going to be working on follow-up patches to address some things
> that I found in additional testing:
> 
> During path fail testing on the numa iopolicy, I found that I/O moves
> off of the marginal path after a first link integrity event is
> received, but if the non-marginal path the I/O is on is disconnected,
> the I/O is transferred onto a marginal path (in testing, sometimes
> I've seen it go to a "marginal optimized" path, and sometimes
> "marginal non-optimized").
> 
That is by design.
'marginal' paths are only evaluated for the 'optimized' path selection,
where it's obvious that 'marginal' paths should not be selected as
'optimized'.
For 'non-optimized' the situation is less clear; is the 'non-optimized'
path preferable to 'marginal'? Or the other way round?
So once the 'optimized' path selection returns no paths, _any_ of the
remaining paths are eligible.
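To make that concrete, here is a minimal sketch of the two-stage
selection. This is illustrative only: nvme_ctrl_is_marginal(),
nvme_path_is_disabled() and the ANA states are the names used in this
series, but the loop and locking details are simplified and not taken
from the actual patches.

static struct nvme_ns *pick_path_sketch(struct nvme_ns_head *head)
{
	struct nvme_ns *ns, *fallback = NULL;

	list_for_each_entry(ns, &head->list, siblings) {
		if (nvme_path_is_disabled(ns))
			continue;
		if (ns->ana_state != NVME_ANA_OPTIMIZED &&
		    ns->ana_state != NVME_ANA_NONOPTIMIZED)
			continue;
		/* Stage 1: only healthy (non-marginal) optimized paths. */
		if (ns->ana_state == NVME_ANA_OPTIMIZED &&
		    !nvme_ctrl_is_marginal(ns->ctrl))
			return ns;
		/* Stage 2 pool: marginal and/or non-optimized paths. */
		if (!fallback)
			fallback = ns;
	}
	/* No healthy optimized path left: any remaining path is eligible. */
	return fallback;
}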

> The queue-depth iopolicy doesn't change its path selection based on
> the marginal flag, but looking at nvme_queue_depth_path(), I can see
> that there's currently no logic to handle marginal paths.  We're
> developing a patch to address that issue in queue-depth, but we need
> to do more testing.
> 
Again, by design.
The whole point of the marginal path patchset is that I/O should
be steered away from the marginal path, but the path itself should
not be completely shut off (otherwise we could just have declared the
path as 'faulty' and be done with it).
Any I/O on 'marginal' paths should have higher latencies, and higher
latencies should result in higher queue depths on these paths. So
the queue-depth based IO scheduler should do the right thing
automatically.
That is always assuming that marginal paths do have higher latencies.
If they don't, they will happily be selected for I/O.
But then again, if the marginal path does _not_ have higher
latencies, why shouldn't we select it for I/O?
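To spell out that assumption in code, here is a simplified sketch of
the queue-depth pick (not the actual nvme_queue_depth_path()
implementation; the per-controller nr_active counter from the
queue-depth iopolicy is what encodes the latency difference):

static struct nvme_ns *queue_depth_sketch(struct nvme_ns_head *head)
{
	struct nvme_ns *ns, *best = NULL;
	unsigned int depth, min_depth = UINT_MAX;

	list_for_each_entry(ns, &head->list, siblings) {
		if (nvme_path_is_disabled(ns))
			continue;
		/*
		 * A slower (marginal) path completes I/O later, so its
		 * nr_active stays higher and it loses this comparison
		 * without any explicit marginal check.
		 */
		depth = atomic_read(&ns->ctrl->nr_active);
		if (depth < min_depth) {
			min_depth = depth;
			best = ns;
		}
	}
	return best;
}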

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v7 2/6] nvme-fc: marginal path handling
  2025-06-24 20:20 ` [PATCH v7 2/6] nvme-fc: marginal path handling Bryan Gurney
@ 2025-07-03 11:47   ` Christoph Hellwig
  0 siblings, 0 replies; 21+ messages in thread
From: Christoph Hellwig @ 2025-07-03 11:47 UTC (permalink / raw)
  To: Bryan Gurney
  Cc: linux-nvme, kbusch, hch, sagi, axboe, james.smart, dick.kennedy,
	njavali, linux-scsi, hare, jmeneghi

On Tue, Jun 24, 2025 at 04:20:16PM -0400, Bryan Gurney wrote:
>  drivers/nvme/host/core.c      |  1 +
>  drivers/nvme/host/fc.c        |  4 ++++
>  drivers/nvme/host/multipath.c | 17 +++++++++++------
>  drivers/nvme/host/nvme.h      |  6 ++++++

This is not a nvme-fc patch.  Please split out the core patches into a
separate well described patch.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v7 0/6] nvme-fc: FPIN link integrity handling
  2025-07-02  6:10   ` Hannes Reinecke
@ 2025-07-08 19:56     ` John Meneghini
  2025-07-09  6:21       ` Hannes Reinecke
  0 siblings, 1 reply; 21+ messages in thread
From: John Meneghini @ 2025-07-08 19:56 UTC (permalink / raw)
  To: Hannes Reinecke, Bryan Gurney, linux-nvme, kbusch, hch, sagi,
	axboe
  Cc: james.smart, dick.kennedy, linux-scsi, njavali, muneendra737

On 7/2/25 2:10 AM, Hannes Reinecke wrote:
>> During path fail testing on the numa iopolicy, I found that I/O moves
>> off of the marginal path after a first link integrity event is
>> received, but if the non-marginal path the I/O is on is disconnected,
>> the I/O is transferred onto a marginal path (in testing, sometimes
>> I've seen it go to a "marginal optimized" path, and sometimes
>> "marginal non-optimized").
>>
> That is by design.
> 'marginal' paths are only evaluated for the 'optimized' path selection,
> where it's obvious that 'marginal' paths should not be selected as
> 'optimized'.

I think we might want to change this.  With the NUMA scheduler you can
end up using a non-optimized marginal path.  This happens when we test
with 4 paths (2 optimized and 2 non-optimized) and set all 4 paths to
marginal.  In this case the NUMA scheduler should simply choose the
optimized marginal path on the closest numa node.  However, that's not
what happens.  It consistently chooses the first non-optimized path.

> For 'non-optimized' the situation is less clear; is the 'non-optimized'
> path preferrable to 'marginal'? Or the other way round?
> So once the 'optimized' path selection returns no paths, _any_ of the
> remaining paths are eligible.

This is a good question for Broadcom.  I think, with all IO schedulers, as long
as there is a non-marginal path available, that path should be used.  So
a non-marginal non-optimized path should take precedence over a marginal optimized path.

In the case where all paths are marginal, I think the scheduler should
simply use the first optimized path on the closest numa node.

>> The queue-depth iopolicy doesn't change its path selection based on
>> the marginal flag, but looking at nvme_queue_depth_path(), I can see
>> that there's currently no logic to handle marginal paths.  We're
>> developing a patch to address that issue in queue-depth, but we need
>> to do more testing.
>>
> Again, by design.
> The whole point of the marginal path patchset is that I/O should
> be steered away from the marginal path, but the path itself should
> not completely shut off (otherwise we just could have declared the
> path as 'faulty' and be done with).
> Any I/O on 'marginal' paths should have higher latencies, and higher
> latencies should result in higher queue depths on these paths. So
> the queue-depth based IO scheduler should to the right thing
> automatically.

I don't understand this.  The round-robin scheduler removes marginal
paths; why shouldn't the queue-depth and numa schedulers do the same?

> Always assuming that marginal paths should have higher latencies.
> If they haven't then they will be happily selection for I/O.
> But then again, if the marginal path does _not_ have highert
> latencies why shouldn't we select it for I/O?

This may be true with FPIN Congestion Signal, but we are testing Link Integrity.  With FPIN LI I think we want to simply stop using the path.
LI has nothing to do with latency.  So unless ALL paths are marginal, the IO scheduler should not be using any marginal path.

Do we need another state here?  There is an ask to support FPIN CS, so maybe using the term "marginal" to describe LI is wrong.

Maybe we need something like "marginal_li" and "marginal_cs" to describe the difference.
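If we went that route it could be as small as tracking the reason
alongside the flag.  Purely an illustrative sketch for discussion;
nothing like this exists in the posted series, and the marginal_reason
field is made up:

/* Hypothetical, for discussion only: record why a path is marginal. */
enum nvme_marginal_reason {
	NVME_MARGINAL_NONE = 0,
	NVME_MARGINAL_LI,	/* FPIN Link Integrity: avoid the path   */
	NVME_MARGINAL_CS,	/* FPIN Congestion Signal: de-prioritize */
};

static inline bool nvme_marginal_avoid(struct nvme_ctrl *ctrl)
{
	/* 'ctrl->marginal_reason' is a hypothetical field, not in the series. */
	return ctrl->marginal_reason == NVME_MARGINAL_LI;
}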

/John



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v7 0/6] nvme-fc: FPIN link integrity handling
  2025-07-08 19:56     ` John Meneghini
@ 2025-07-09  6:21       ` Hannes Reinecke
  2025-07-09 13:42         ` Bryan Gurney
  0 siblings, 1 reply; 21+ messages in thread
From: Hannes Reinecke @ 2025-07-09  6:21 UTC (permalink / raw)
  To: John Meneghini, Bryan Gurney, linux-nvme, kbusch, hch, sagi,
	axboe
  Cc: james.smart, dick.kennedy, linux-scsi, njavali, muneendra737

On 7/8/25 21:56, John Meneghini wrote:
> On 7/2/25 2:10 AM, Hannes Reinecke wrote:
>>> During path fail testing on the numa iopolicy, I found that I/O moves
>>> off of the marginal path after a first link integrity event is
>>> received, but if the non-marginal path the I/O is on is disconnected,
>>> the I/O is transferred onto a marginal path (in testing, sometimes
>>> I've seen it go to a "marginal optimized" path, and sometimes
>>> "marginal non-optimized").
>>>
>> That is by design.
>> 'marginal' paths are only evaluated for the 'optimized' path selection,
>> where it's obvious that 'marginal' paths should not be selected as
>> 'optimized'.
> 
> I think we might want to change this.  With the NUMA scheduler you can
> end up using a non-optimized marginal path.  This happens when we test
> with 4 paths (2 optimized and 2 non-optimized) and set all 4 paths to
> marginal.  In this case the NUMA scheduler should simply choose the
> optimized marginal path on the closest numa node.  However, that's not
> what happens.  It consistently chooses the first non-optimized path.
>
Ah. So it seems that the NUMA scheduler needs to be fixed.
I'll have a look there.

>> For 'non-optimized' the situation is less clear; is the 'non-optimized'
>> path preferrable to 'marginal'? Or the other way round?
>> So once the 'optimized' path selection returns no paths, _any_ of the
>> remaining paths are eligible.
> 
> This is a good question for Broadcom.  I think, with all IO schedulers, 
> as long
> as there is a non-marginal path available, that path should be used.  So
> a non-marginal non-optimized path should take precedence over a marginal 
> optimized path.
> 
> In the case were all paths are marginal, I think the scheduler should 
> simply use the first optimized path on the closest numa node.

For the NUMA case, yes. But as I said above, it seems that the NUMA
scheduler needs to be fixed.

>>> The queue-depth iopolicy doesn't change its path selection based on
>>> the marginal flag, but looking at nvme_queue_depth_path(), I can see
>>> that there's currently no logic to handle marginal paths.  We're
>>> developing a patch to address that issue in queue-depth, but we need
>>> to do more testing.
>>>
>> Again, by design.
>> The whole point of the marginal path patchset is that I/O should
>> be steered away from the marginal path, but the path itself should
>> not completely shut off (otherwise we just could have declared the
>> path as 'faulty' and be done with).
>> Any I/O on 'marginal' paths should have higher latencies, and higher
>> latencies should result in higher queue depths on these paths. So
>> the queue-depth based IO scheduler should to the right thing
>> automatically.
> 
> I don't understand this.  The Round-robin scheduler removes marginal 
> paths, why shouldn't the queue-depth and numa scheduler do the same?
> 
The NUMA scheduler should, that's correct.

>> Always assuming that marginal paths should have higher latencies.
>> If they haven't then they will be happily selection for I/O.
>> But then again, if the marginal path does _not_ have highert
>> latencies why shouldn't we select it for I/O?
> 
> This may be true with FPIN Congestion Signal, but we are testing Link 
> Integrity.  With FPIN LI I think we want to simply stop using the path.
> LI has nothing to do with latency.  So unless ALL paths are marginal, 
> the IO scheduler should not be using any marginal path.
> 
For FPIN LI the paths should be marked as 'faulty', true.

> Do we need another state here?  There is an ask to support FPIN CS, so 
> maybe using the term "marginal" to describe LI is wrong.
> 
> Maybe we need something like "marginal_li" and "marginal_cs" to describe 
> the difference.
> 
Really not so sure. I really wonder how an FPIN LI event reflects back on
the actual I/O. Will the I/O be aborted with an error? Or does the I/O
continue at a slower pace?
I would think the latter, and that's the design assumption for this
patchset. If it's the former and I/O is aborted with an error we are in
a similar situation to the one we have with a faulty cable, and we need
to come up with a different solution.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v7 0/6] nvme-fc: FPIN link integrity handling
  2025-07-09  6:21       ` Hannes Reinecke
@ 2025-07-09 13:42         ` Bryan Gurney
  2025-07-09 17:44           ` Hannes Reinecke
  0 siblings, 1 reply; 21+ messages in thread
From: Bryan Gurney @ 2025-07-09 13:42 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: John Meneghini, linux-nvme, kbusch, hch, sagi, axboe, james.smart,
	dick.kennedy, linux-scsi, njavali, muneendra737

On Wed, Jul 9, 2025 at 2:21 AM Hannes Reinecke <hare@suse.de> wrote:
>
> On 7/8/25 21:56, John Meneghini wrote:
> > On 7/2/25 2:10 AM, Hannes Reinecke wrote:
> >>> During path fail testing on the numa iopolicy, I found that I/O moves
> >>> off of the marginal path after a first link integrity event is
> >>> received, but if the non-marginal path the I/O is on is disconnected,
> >>> the I/O is transferred onto a marginal path (in testing, sometimes
> >>> I've seen it go to a "marginal optimized" path, and sometimes
> >>> "marginal non-optimized").
> >>>
> >> That is by design.
> >> 'marginal' paths are only evaluated for the 'optimized' path selection,
> >> where it's obvious that 'marginal' paths should not be selected as
> >> 'optimized'.
> >
> > I think we might want to change this.  With the NUMA scheduler you can
> > end up using a non-optimized marginal path.  This happens when we test
> > with 4 paths (2 optimized and 2 non-optimized) and set all 4 paths to
> > marginal.  In this case the NUMA scheduler should simply choose the
> > optimized marginal path on the closest numa node.  However, that's not
> > what happens.  It consistently chooses the first non-optimized path.
> >
> Ah. So it seems that the NUMA scheduler needs to be fixed.
> I'll have a look there.
>
> >> For 'non-optimized' the situation is less clear; is the 'non-optimized'
> >> path preferrable to 'marginal'? Or the other way round?
> >> So once the 'optimized' path selection returns no paths, _any_ of the
> >> remaining paths are eligible.
> >
> > This is a good question for Broadcom.  I think, with all IO schedulers,
> > as long
> > as there is a non-marginal path available, that path should be used.  So
> > a non-marginal non-optimized path should take precedence over a marginal
> > optimized path.
> >
> > In the case were all paths are marginal, I think the scheduler should
> > simply use the first optimized path on the closest numa node.
>
> For the NUMA case, yes. But as I said above, it seems that the NUMA
> scheduler needs to fixes.
>
> >>> The queue-depth iopolicy doesn't change its path selection based on
> >>> the marginal flag, but looking at nvme_queue_depth_path(), I can see
> >>> that there's currently no logic to handle marginal paths.  We're
> >>> developing a patch to address that issue in queue-depth, but we need
> >>> to do more testing.
> >>>
> >> Again, by design.
> >> The whole point of the marginal path patchset is that I/O should
> >> be steered away from the marginal path, but the path itself should
> >> not completely shut off (otherwise we just could have declared the
> >> path as 'faulty' and be done with).
> >> Any I/O on 'marginal' paths should have higher latencies, and higher
> >> latencies should result in higher queue depths on these paths. So
> >> the queue-depth based IO scheduler should to the right thing
> >> automatically.
> >
> > I don't understand this.  The Round-robin scheduler removes marginal
> > paths, why shouldn't the queue-depth and numa scheduler do the same?
> >
> The NUMA scheduler should, that's correct.
>
> >> Always assuming that marginal paths should have higher latencies.
> >> If they haven't then they will be happily selection for I/O.
> >> But then again, if the marginal path does _not_ have highert
> >> latencies why shouldn't we select it for I/O?
> >
> > This may be true with FPIN Congestion Signal, but we are testing Link
> > Integrity.  With FPIN LI I think we want to simply stop using the path.
> > LI has nothing to do with latency.  So unless ALL paths are marginal,
> > the IO scheduler should not be using any marginal path.
> >
> For FPIN LI the paths should be marked as 'faulty', true.
>
> > Do we need another state here?  There is an ask to support FPIN CS, so
> > maybe using the term "marginal" to describe LI is wrong.
> >
> > Maybe we need something like "marginal_li" and "marginal_cs" to describe
> > the difference.
> >
> Really not so sure. I really wonder how a FPIN LI event reflect back on
> the actual I/O. Will the I/O be aborted with an error? Or does the I/O
> continue at a slower pace?
> I would think the latter, and that's the design assumption for this
> patchset. If it's the former and I/O is aborted with an error we are in
> a similar situation like we have with a faulty cable, and we need
> to come up with a different solution.
>

During my testing, I was watching the logs on the test host as I was
about to run the command on the switch to generate the FPIN LI event.
I didn't see any I/O errors, and the I/O continued at the normally
expected throughput and latency.  But "if this had been an actual
emergency..." as the saying goes, there would probably be some kind of
disruption that the event itself would be expected to cause (e.g.:
"loss sync", "loss signal", "link failure").

There was a Storage Developer Conference 21 presentation slide deck on
the FPIN LI events that's hosted on the SNIA website [1]; slide 4
shows the problem statements addressed by the notifications.

In my previous career as a system administrator, I remember seeing
strange performance slowdowns on high-volume database servers, and on
searching through the logs, I might find an event from the database
engine about an I/O operation taking over 30 seconds to complete.
Meanwhile, the application using the database was backlogged because
its queries were taking longer, all due to what ended up being a faulty
SFP.  After replacing that, we could get the application running again
to process its replication and workload backlogs.  The link integrity
events could help alert on these link problems before they turn into
application disruptions.


Thanks,

Bryan

[1] https://www.snia.org/sites/default/files/SDC/2021/pdfs/SNIA-SDC21-Johnson-Introducing-Fabric-Notifications-From-Awareness-to-Action.pdf

> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke                  Kernel Storage Architect
> hare@suse.de                                +49 911 74053 688
> SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
> HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
>



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v7 0/6] nvme-fc: FPIN link integrity handling
  2025-07-09 13:42         ` Bryan Gurney
@ 2025-07-09 17:44           ` Hannes Reinecke
       [not found]             ` <CAJtR8wVupqRK3t0+7shrNCTZ9qZC7gHAR2X8Ov0AR-RJamxcWg@mail.gmail.com>
  0 siblings, 1 reply; 21+ messages in thread
From: Hannes Reinecke @ 2025-07-09 17:44 UTC (permalink / raw)
  To: Bryan Gurney
  Cc: John Meneghini, linux-nvme, kbusch, hch, sagi, axboe, james.smart,
	dick.kennedy, linux-scsi, njavali, muneendra737

On 7/9/25 15:42, Bryan Gurney wrote:
> On Wed, Jul 9, 2025 at 2:21 AM Hannes Reinecke <hare@suse.de> wrote:
>>
[ .. ]
>>> This may be true with FPIN Congestion Signal, but we are testing Link
>>> Integrity.  With FPIN LI I think we want to simply stop using the path.
>>> LI has nothing to do with latency.  So unless ALL paths are marginal,
>>> the IO scheduler should not be using any marginal path.
>>>
>> For FPIN LI the paths should be marked as 'faulty', true.
>>
>>> Do we need another state here?  There is an ask to support FPIN CS, so
>>> maybe using the term "marginal" to describe LI is wrong.
>>>
>>> Maybe we need something like "marginal_li" and "marginal_cs" to describe
>>> the difference.
>>>
>> Really not so sure. I really wonder how a FPIN LI event reflect back on
>> the actual I/O. Will the I/O be aborted with an error? Or does the I/O
>> continue at a slower pace?
>> I would think the latter, and that's the design assumption for this
>> patchset. If it's the former and I/O is aborted with an error we are in
>> a similar situation like we have with a faulty cable, and we need
>> to come up with a different solution.
>>
> 
> During my testing, I was watching the logs on the test host as I was
> about to run the command on the switch to generate the FPIN LI event.
> I didn't see any I/O errors, and the I/O continued at the normally
> expected throughput and latency.  But "if this had been an actual
> emergency..." as the saying goes, there would probably be some kind of
> disruption that the event itself would be expected to cause (e.g.:
> "loss sync", "loss signal", "link failure"), but
> 
> There was a Storage Developer Conference 21 presentation slide deck on
> the FPIN LI events that's hosted on the SNIA website [1]; slide 4
> shows the problem statements addressed by the notifications.
> 
> In my previous career as a system administrator, I remember seeing
> strange performance slowdowns on high-volume database servers, and on
> searching through the logs, I might find an event from the database
> engine about an I/O operating taking over 30 seconds to complete.
> Meanwhile, the application using the database was backlogged due to
> its queries taking longer, for what ended up being a faulty SFP.
> After replacing that, we could get the application running again, to
> process its replication and workload backlogs.  The link integrity
> events could help alert on these link problems before they turn into
> application disruptions.
> 
But that's precisely it, isn't it?
If it's a straight error the path/controller is being reset, and
really there's nothing for us to do.
If it's an FPIN LI _without_ any performance impact, why shouldn't
we continue to use that path? Would there be any impact if we do?
And if it's an FPIN LI with _any_ sort of performance impact
(or a performance impact which might happen eventually), the
current approach of steering I/O away should be fine.
Am I missing something?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH v8 8/8] nvme-multipath: queue-depth support for marginal paths
@ 2025-07-09 21:26 Bryan Gurney
  2025-07-09 22:03 ` John Meneghini
  0 siblings, 1 reply; 21+ messages in thread
From: Bryan Gurney @ 2025-07-09 21:26 UTC (permalink / raw)
  To: linux-nvme, kbusch, hch, sagi, axboe
  Cc: james.smart, dick.kennedy, njavali, linux-scsi, hare, bgurney,
	jmeneghi

From: John Meneghini <jmeneghi@redhat.com>

Exclude marginal paths from queue-depth io policy. In the case where all
paths are marginal and no optimized or non-optimized path is found, we
fall back to __nvme_find_path which selects the best marginal path.

Tested-by: Bryan Gurney <bgurney@redhat.com>
Signed-off-by: John Meneghini <jmeneghi@redhat.com>
---
 drivers/nvme/host/multipath.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 8d4e54bb4261..767583e8454b 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -420,6 +420,9 @@ static struct nvme_ns *nvme_queue_depth_path(struct nvme_ns_head *head)
 		if (nvme_path_is_disabled(ns))
 			continue;
 
+		if (nvme_ctrl_is_marginal(ns->ctrl))
+			continue;
+
 		depth = atomic_read(&ns->ctrl->nr_active);
 
 		switch (ns->ana_state) {
@@ -443,7 +446,9 @@ static struct nvme_ns *nvme_queue_depth_path(struct nvme_ns_head *head)
 			return best_opt;
 	}
 
-	return best_opt ? best_opt : best_nonopt;
+	best_opt = (best_opt) ? best_opt : best_nonopt;
+
+	return best_opt ? best_opt : __nvme_find_path(head, numa_node_id());
 }
 
 static inline bool nvme_path_is_optimized(struct nvme_ns *ns)
-- 
2.50.0



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH v8 8/8] nvme-multipath: queue-depth support for marginal paths
  2025-07-09 21:26 Bryan Gurney
@ 2025-07-09 22:03 ` John Meneghini
  2025-07-10  6:36   ` Hannes Reinecke
  0 siblings, 1 reply; 21+ messages in thread
From: John Meneghini @ 2025-07-09 22:03 UTC (permalink / raw)
  To: hare
  Cc: james.smart, dick.kennedy, njavali, linux-scsi, axboe, sagi, hch,
	kbusch, linux-nvme, Bryan Gurney

Hannes, this patch fixes the queue-depth scheduler.  Please take a look.

On 7/9/25 5:26 PM, Bryan Gurney wrote:
> From: John Meneghini <jmeneghi@redhat.com>
> 
> Exclude marginal paths from queue-depth io policy. In the case where all
> paths are marginal and no optimized or non-optimized path is found, we
> fall back to __nvme_find_path which selects the best marginal path.
> 
> Tested-by: Bryan Gurney <bgurney@redhat.com>
> Signed-off-by: John Meneghini <jmeneghi@redhat.com>
> ---
>   drivers/nvme/host/multipath.c | 7 ++++++-
>   1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index 8d4e54bb4261..767583e8454b 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -420,6 +420,9 @@ static struct nvme_ns *nvme_queue_depth_path(struct nvme_ns_head *head)
>   		if (nvme_path_is_disabled(ns))
>   			continue;
>   
> +		if (nvme_ctrl_is_marginal(ns->ctrl))
> +			continue;
> +
>   		depth = atomic_read(&ns->ctrl->nr_active);
>   
>   		switch (ns->ana_state) {
> @@ -443,7 +446,9 @@ static struct nvme_ns *nvme_queue_depth_path(struct nvme_ns_head *head)
>   			return best_opt;
>   	}
>   
> -	return best_opt ? best_opt : best_nonopt;
> +	best_opt = (best_opt) ? best_opt : best_nonopt;
> +
> +	return best_opt ? best_opt : __nvme_find_path(head, numa_node_id());
>   }
>   
>   static inline bool nvme_path_is_optimized(struct nvme_ns *ns)



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v8 8/8] nvme-multipath: queue-depth support for marginal paths
  2025-07-09 22:03 ` John Meneghini
@ 2025-07-10  6:36   ` Hannes Reinecke
  0 siblings, 0 replies; 21+ messages in thread
From: Hannes Reinecke @ 2025-07-10  6:36 UTC (permalink / raw)
  To: John Meneghini
  Cc: james.smart, dick.kennedy, njavali, linux-scsi, axboe, sagi, hch,
	kbusch, linux-nvme, Bryan Gurney

On 7/10/25 00:03, John Meneghini wrote:
> Hannes, this patch fixes the queue-depth scheduler.  Please take a look.
> 
> On 7/9/25 5:26 PM, Bryan Gurney wrote:
>> From: John Meneghini <jmeneghi@redhat.com>
>>
>> Exclude marginal paths from queue-depth io policy. In the case where all
>> paths are marginal and no optimized or non-optimized path is found, we
>> fall back to __nvme_find_path which selects the best marginal path.
>>
>> Tested-by: Bryan Gurney <bgurney@redhat.com>
>> Signed-off-by: John Meneghini <jmeneghi@redhat.com>
>> ---
>>   drivers/nvme/host/multipath.c | 7 ++++++-
>>   1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/ 
>> multipath.c
>> index 8d4e54bb4261..767583e8454b 100644
>> --- a/drivers/nvme/host/multipath.c
>> +++ b/drivers/nvme/host/multipath.c
>> @@ -420,6 +420,9 @@ static struct nvme_ns 
>> *nvme_queue_depth_path(struct nvme_ns_head *head)
>>           if (nvme_path_is_disabled(ns))
>>               continue;
>> +        if (nvme_ctrl_is_marginal(ns->ctrl))
>> +            continue;
>> +
>>           depth = atomic_read(&ns->ctrl->nr_active);
>>           switch (ns->ana_state) {
>> @@ -443,7 +446,9 @@ static struct nvme_ns 
>> *nvme_queue_depth_path(struct nvme_ns_head *head)
>>               return best_opt;
>>       }
>> -    return best_opt ? best_opt : best_nonopt;
>> +    best_opt = (best_opt) ? best_opt : best_nonopt;
>> +
>> +    return best_opt ? best_opt : __nvme_find_path(head, numa_node_id());
>>   }
>>   static inline bool nvme_path_is_optimized(struct nvme_ns *ns)
> 

Hmm. Not convinced. I would expect a 'marginal' path to behave differently
(performance-wise) from unaffected paths. And the queue-depth scheduler
should be able to handle paths with different performance
characteristics just fine.
(Is it possible that your results are test artifacts? I guess
your tool just injects FPIN messages with no performance impact,
resulting in this behaviour...)

But if you want to exclude marginal paths from queue depth:
by all means, go for it.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


^ permalink raw reply	[flat|nested] 21+ messages in thread

* RE: [PATCH v8 8/8] nvme-multipath: queue-depth support for marginal paths
  2025-06-24 20:20 [PATCH v7 0/6] nvme-fc: FPIN link integrity handling Bryan Gurney
                   ` (7 preceding siblings ...)
  2025-07-01 20:32 ` [PATCH v7 " Bryan Gurney
@ 2025-07-11  2:59 ` Muneendra Kumar
  2025-07-11 14:53   ` John Meneghini
  8 siblings, 1 reply; 21+ messages in thread
From: Muneendra Kumar @ 2025-07-11  2:59 UTC (permalink / raw)
  To: bgurney
  Cc: axboe, dick.kennedy, hare, hch, james.smart, jmeneghi, kbusch,
	linux-nvme, linux-scsi, njavali, sagi, muneendra737

Correct me if I am wrong.
>>. In the case where 
>> all paths are marginal and no optimized or non-optimized path is 
>> found, we fall back to __nvme_find_path which selects the best marginal path

With the current patch, will __nvme_find_path always pick a non-optimized path?

Regards,
Muneendra


>On 7/10/25 00:03, John Meneghini wrote:
>> Hannes, this patch fixes the queue-depth scheduler.  Please take a look.
>> 
>> On 7/9/25 5:26 PM, Bryan Gurney wrote:
>>> From: John Meneghini <jmeneghi@redhat.com>
>>>
>>> Exclude marginal paths from queue-depth io policy. In the case where 
>>> all paths are marginal and no optimized or non-optimized path is 
>>> found, we fall back to __nvme_find_path which selects the best marginal path.
>>>
>>> Tested-by: Bryan Gurney <bgurney@redhat.com>
>>> Signed-off-by: John Meneghini <jmeneghi@redhat.com>
>>> ---
>>>   drivers/nvme/host/multipath.c | 7 ++++++-
>>>   1 file changed, 6 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/ 
>>> multipath.c index 8d4e54bb4261..767583e8454b 100644
>>> --- a/drivers/nvme/host/multipath.c
>>> +++ b/drivers/nvme/host/multipath.c
>>> @@ -420,6 +420,9 @@ static struct nvme_ns 
>>> *nvme_queue_depth_path(struct nvme_ns_head *head)
>>>           if (nvme_path_is_disabled(ns))
>>>               continue;
>>> +        if (nvme_ctrl_is_marginal(ns->ctrl))
>>> +            continue;
>>> +
>>>           depth = atomic_read(&ns->ctrl->nr_active);
>>>           switch (ns->ana_state) {
>>> @@ -443,7 +446,9 @@ static struct nvme_ns 
>>> *nvme_queue_depth_path(struct nvme_ns_head *head)
>>>               return best_opt;
>>>       }
>>> -    return best_opt ? best_opt : best_nonopt;
>>> +    best_opt = (best_opt) ? best_opt : best_nonopt;
>>> +
>>> +    return best_opt ? best_opt : __nvme_find_path(head, 
>>> +numa_node_id());
>>>   }
>>>   static inline bool nvme_path_is_optimized(struct nvme_ns *ns)
>> 
>
>Hmm. Not convinced. I would expect a 'marginal' path to behave different
>(performance-wise) than unaffected paths. And the queue-depth scheduler should be able to handle paths with different performance characteristics just fine.
>(Is is possible that your results are test artifacts? I guess your tool just injects FPIN messages with no performance impact, resulting in this behaviour...)
>
>But if you want to exclude marginal paths from queue depth:
>by all means, go for it.
>





^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v8 8/8] nvme-multipath: queue-depth support for marginal paths
  2025-07-11  2:59 ` [PATCH v8 8/8] nvme-multipath: queue-depth support for marginal paths Muneendra Kumar
@ 2025-07-11 14:53   ` John Meneghini
  0 siblings, 0 replies; 21+ messages in thread
From: John Meneghini @ 2025-07-11 14:53 UTC (permalink / raw)
  To: Muneendra Kumar, bgurney
  Cc: axboe, dick.kennedy, hare, hch, james.smart, kbusch, linux-nvme,
	linux-scsi, njavali, sagi, muneendra737


On 7/10/25 10:59 PM, Muneendra Kumar wrote:
> Correct me if iam wrong.
>>> . In the case where
>>> all paths are marginal and no optimized or non-optimized path is
>>> found, we fall back to __nvme_find_path which selects the best marginal path
> 
> With the current patch __nvme_find_path will allways picks the path from non-optimized path ?

Not necessarily. I think it all comes down to this code:

                switch (ns->ana_state) {
                 case NVME_ANA_OPTIMIZED:
                         if (!nvme_ctrl_is_marginal(ns->ctrl)) {
                                 if (distance < found_distance) {
                                         found_distance = distance;
                                         found = ns;
                                 }
                                 break;
                         }
                         fallthrough;
                 case NVME_ANA_NONOPTIMIZED:
                         if (distance < fallback_distance) {
                                 fallback_distance = distance;
                                 fallback = ns;
                         }
                         break;

Any NVME_ANA_OPTIMIZED path that is marginal becomes a part of the fallback ns algorithm.

In the case where there is at least one NVME_ANA_OPTIMIZED path, it works correctly.  You will always find the NVME_ANA_OPTIMIZED
path.  In the case where there are no NVME_ANA_OPTIMIZED paths it turns into kind of a crap shoot.  You end up with the first fallback
ns that's found.  That could be an NVME_ANA_OPTIMIZED path or an NVME_ANA_NONOPTIMIZED path.  It all depends on how the head->list is
sorted and whether there are any disabled paths.

In our testing I've seen that this sometimes selects the NVME_ANA_OPTIMIZED path and sometimes the NVME_ANA_NONOPTIMIZED path.

In the simple test case, when the first two paths are optimized, and only one is marginal, this algorithm always selects the NVME_ANA_NONOPTIMIZED path.

It's only in the more complicated test, when all NVME_ANA_NONOPTIMIZED paths are marginal, that I see some unpredictability.
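One way to make that deterministic would be to keep separate optimized and non-optimized fallbacks instead of a single one, so a marginal
optimized path always wins over a marginal non-optimized one.  Just a sketch of the idea, not a tested patch; the distance handling follows
the numa iopolicy, and the helper name is made up:

static struct nvme_ns *marginal_fallback_sketch(struct nvme_ns_head *head,
						int node)
{
	struct nvme_ns *ns, *fb_opt = NULL, *fb_nonopt = NULL;
	int d, d_opt = INT_MAX, d_nonopt = INT_MAX;

	list_for_each_entry(ns, &head->list, siblings) {
		if (nvme_path_is_disabled(ns))
			continue;
		d = node_distance(node, ns->ctrl->numa_node);
		if (ns->ana_state == NVME_ANA_OPTIMIZED && d < d_opt) {
			d_opt = d;
			fb_opt = ns;	/* closest optimized candidate */
		} else if (ns->ana_state == NVME_ANA_NONOPTIMIZED &&
			   d < d_nonopt) {
			d_nonopt = d;
			fb_nonopt = ns;	/* closest non-optimized candidate */
		}
	}
	/* Only fall back to non-optimized when no optimized path exists. */
	return fb_opt ? fb_opt : fb_nonopt;
}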

/John




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v7 0/6] nvme-fc: FPIN link integrity handling
       [not found]             ` <CAJtR8wVupqRK3t0+7shrNCTZ9qZC7gHAR2X8Ov0AR-RJamxcWg@mail.gmail.com>
@ 2025-07-11 16:49               ` John Meneghini
  0 siblings, 0 replies; 21+ messages in thread
From: John Meneghini @ 2025-07-11 16:49 UTC (permalink / raw)
  To: Muneendra Kumar M, Hannes Reinecke
  Cc: Bryan Gurney, linux-nvme, kbusch, hch, sagi, axboe, james.smart,
	dick.kennedy, linux-scsi, njavali, muneendra737, Howard Johnson

Adding Howard Johnson.

On 7/11/25 4:54 AM, Muneendra Kumar M wrote:
>  >>But that's precisely it, isn't it?
>  >>If it's a straight error the path/controller is being reset, and
>  >>really there's nothing for us to be done.
>  >>If it's an FPIN LI _without_ any performance impact, why shouldn't
>  >>we continue to use that path? Would there be any impact if we do?
>  >>And if it's an FPIN LI with _any_ sort of performance impact
>  >>(or a performance impact which might happen eventually) the
>  >>current approach of steering away I/O should be fine.
> [Muneendra]
> 
> With FPIN/Link Integrity (LI) - there is still connectivity, but the FPIN is identifying the link to the target (could be multiple remoteports if the target is doing NPIV) that had some error.  It is *not* indicating that I/O won't complete. True, some I/O may not, due to the error that affected it.  And it is true, but not likely, that all I/O hits the same problem.  What we have seen with flaky links is that most I/O does complete, but a few I/Os don't.
> It's actually a rather funky condition, a kind of sick-but-not-dead scenario.
> FPIN-LI indicates that the path is "flaky" and using this path further will have a performance impact.
> And the current approach of steering I/O away is fine for FPIN-LI.

OK, then can we all agree that the current patch series, including the patch for the queue-depth handler, does the correct thing for FPIN LI?

Right now this patch series will "disable" a controller and remove it from being actively used by the multipath scheduler once that controller/path receives an FPIN LI event.
This is true for all three multi-path schedulers: round-robin, queue-depth and numa.  Once a controller/path has received an LI event it reports a state of "marginal" in the
controller state field (e.g.: /sys/devices/virtual/nvme-subsystem/nvme-subsys6/nvme4/state).  While in the marginal state the controller can still be used; it's only the path
selection policy that avoids it in the nvme multipath scheduling code.

These patches also prefer non-marginal paths over marginal paths, and optimized paths over non-optimized paths.  If all optimized paths are marginal, then the
non-optimized paths are used.  If all paths are marginal then the first available marginal optimized path is used, otherwise the first available marginal non-optimized path is used.
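In other words the effective precedence looks like this (illustrative helper only, not code from the patches; lower value means more preferred):

static int path_preference_sketch(struct nvme_ns *ns)
{
	bool marginal = nvme_ctrl_is_marginal(ns->ctrl);
	bool optimized = ns->ana_state == NVME_ANA_OPTIMIZED;

	if (!marginal && optimized)
		return 0;	/* live optimized */
	if (!marginal)
		return 1;	/* live non-optimized */
	if (optimized)
		return 2;	/* marginal optimized */
	return 3;		/* marginal non-optimized */
}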

To clear the marginal state a controller must be disconnected, allowing the /dev to be removed, and then reconnected.  (We might want to change this, but that can be discussed in a future patch)

Bryan and I have tested these patches with all of the above configurations and conditions and, at this point, they are working as described above while using an LPFC adapter.

You can see the test plan Bryan and I used is at https://bugzilla.kernel.org/show_bug.cgi?id=220329#c1

We observed a problem while testing this with a QLA adapter.

So I am hoping one more update to this patch series, to fix the QLA problem, will complete this work.

/John









^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2025-07-11 17:38 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-06-24 20:20 [PATCH v7 0/6] nvme-fc: FPIN link integrity handling Bryan Gurney
2025-06-24 20:20 ` [PATCH v7 1/6] fc_els: use 'union fc_tlv_desc' Bryan Gurney
2025-06-24 20:20 ` [PATCH v7 2/6] nvme-fc: marginal path handling Bryan Gurney
2025-07-03 11:47   ` Christoph Hellwig
2025-06-24 20:20 ` [PATCH v7 3/6] nvme-fc: nvme_fc_fpin_rcv() callback Bryan Gurney
2025-06-24 20:20 ` [PATCH v7 4/6] lpfc: enable FPIN notification for NVMe Bryan Gurney
2025-06-24 20:20 ` [PATCH v7 5/6] qla2xxx: " Bryan Gurney
2025-06-24 20:20 ` [PATCH v7 6/6] nvme: sysfs: emit the marginal path state in show_state() Bryan Gurney
2025-07-01  0:27 ` [PATCHv7 0/6] nvme-fc: FPIN link integrity handling Muneendra Kumar
2025-07-01 20:32 ` [PATCH v7 " Bryan Gurney
2025-07-02  6:10   ` Hannes Reinecke
2025-07-08 19:56     ` John Meneghini
2025-07-09  6:21       ` Hannes Reinecke
2025-07-09 13:42         ` Bryan Gurney
2025-07-09 17:44           ` Hannes Reinecke
     [not found]             ` <CAJtR8wVupqRK3t0+7shrNCTZ9qZC7gHAR2X8Ov0AR-RJamxcWg@mail.gmail.com>
2025-07-11 16:49               ` John Meneghini
2025-07-11  2:59 ` [PATCH v8 8/8] nvme-multipath: queue-depth support for marginal paths Muneendra Kumar
2025-07-11 14:53   ` John Meneghini
  -- strict thread matches above, loose matches on Subject: below --
2025-07-09 21:26 Bryan Gurney
2025-07-09 22:03 ` John Meneghini
2025-07-10  6:36   ` Hannes Reinecke

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).