* [PATCH v8 8/8] nvme-multipath: queue-depth support for marginal paths
@ 2025-07-09 21:26 Bryan Gurney
2025-07-09 22:03 ` John Meneghini
0 siblings, 1 reply; 5+ messages in thread
From: Bryan Gurney @ 2025-07-09 21:26 UTC (permalink / raw)
To: linux-nvme, kbusch, hch, sagi, axboe
Cc: james.smart, dick.kennedy, njavali, linux-scsi, hare, bgurney,
jmeneghi
From: John Meneghini <jmeneghi@redhat.com>
Exclude marginal paths from queue-depth io policy. In the case where all
paths are marginal and no optimized or non-optimized path is found, we
fall back to __nvme_find_path which selects the best marginal path.
Tested-by: Bryan Gurney <bgurney@redhat.com>
Signed-off-by: John Meneghini <jmeneghi@redhat.com>
---
drivers/nvme/host/multipath.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 8d4e54bb4261..767583e8454b 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -420,6 +420,9 @@ static struct nvme_ns *nvme_queue_depth_path(struct nvme_ns_head *head)
                 if (nvme_path_is_disabled(ns))
                         continue;
 
+                if (nvme_ctrl_is_marginal(ns->ctrl))
+                        continue;
+
                 depth = atomic_read(&ns->ctrl->nr_active);
 
                 switch (ns->ana_state) {
@@ -443,7 +446,9 @@ static struct nvme_ns *nvme_queue_depth_path(struct nvme_ns_head *head)
                         return best_opt;
         }
 
-        return best_opt ? best_opt : best_nonopt;
+        best_opt = (best_opt) ? best_opt : best_nonopt;
+
+        return best_opt ? best_opt : __nvme_find_path(head, numa_node_id());
 }
 
 static inline bool nvme_path_is_optimized(struct nvme_ns *ns)
--
2.50.0
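
For readers following along, the tail of nvme_queue_depth_path() with this
patch applied looks roughly like the sketch below. Only the lines that also
appear as diff context above are verbatim; the loop header, the per-state
bookkeeping, and the min_depth_opt early-exit are paraphrased from context
and may differ slightly from the kernel sources:

        list_for_each_entry(ns, &head->list, siblings) {  /* SRCU-protected iteration in the real code */
                if (nvme_path_is_disabled(ns))
                        continue;

                /* New with this patch: marginal paths never compete in the
                 * queue-depth selection. */
                if (nvme_ctrl_is_marginal(ns->ctrl))
                        continue;

                depth = atomic_read(&ns->ctrl->nr_active);

                switch (ns->ana_state) {
                /* ... remember the least-busy optimized path in best_opt and
                 * the least-busy non-optimized path in best_nonopt ... */
                }

                if (min_depth_opt == 0)  /* an idle optimized path: take it immediately */
                        return best_opt;
        }

        /* Prefer optimized, then non-optimized.  If every usable path was
         * marginal, both are NULL and we fall back to __nvme_find_path(),
         * which selects the best marginal path. */
        best_opt = (best_opt) ? best_opt : best_nonopt;

        return best_opt ? best_opt : __nvme_find_path(head, numa_node_id());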
* Re: [PATCH v8 8/8] nvme-multipath: queue-depth support for marginal paths
2025-07-09 21:26 [PATCH v8 8/8] nvme-multipath: queue-depth support for marginal paths Bryan Gurney
@ 2025-07-09 22:03 ` John Meneghini
2025-07-10 6:36 ` Hannes Reinecke
0 siblings, 1 reply; 5+ messages in thread
From: John Meneghini @ 2025-07-09 22:03 UTC (permalink / raw)
To: hare
Cc: james.smart, dick.kennedy, njavali, linux-scsi, axboe, sagi, hch,
kbusch, linux-nvme, Bryan Gurney
Hannes, this patch fixes the queue-depth scheduler. Please take a look.
On 7/9/25 5:26 PM, Bryan Gurney wrote:
> From: John Meneghini <jmeneghi@redhat.com>
>
> Exclude marginal paths from queue-depth io policy. In the case where all
> paths are marginal and no optimized or non-optimized path is found, we
> fall back to __nvme_find_path which selects the best marginal path.
>
> Tested-by: Bryan Gurney <bgurney@redhat.com>
> Signed-off-by: John Meneghini <jmeneghi@redhat.com>
> ---
> drivers/nvme/host/multipath.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> index 8d4e54bb4261..767583e8454b 100644
> --- a/drivers/nvme/host/multipath.c
> +++ b/drivers/nvme/host/multipath.c
> @@ -420,6 +420,9 @@ static struct nvme_ns *nvme_queue_depth_path(struct nvme_ns_head *head)
> if (nvme_path_is_disabled(ns))
> continue;
>
> + if (nvme_ctrl_is_marginal(ns->ctrl))
> + continue;
> +
> depth = atomic_read(&ns->ctrl->nr_active);
>
> switch (ns->ana_state) {
> @@ -443,7 +446,9 @@ static struct nvme_ns *nvme_queue_depth_path(struct nvme_ns_head *head)
> return best_opt;
> }
>
> - return best_opt ? best_opt : best_nonopt;
> + best_opt = (best_opt) ? best_opt : best_nonopt;
> +
> + return best_opt ? best_opt : __nvme_find_path(head, numa_node_id());
> }
>
> static inline bool nvme_path_is_optimized(struct nvme_ns *ns)
* Re: [PATCH v8 8/8] nvme-multipath: queue-depth support for marginal paths
2025-07-09 22:03 ` John Meneghini
@ 2025-07-10 6:36 ` Hannes Reinecke
0 siblings, 0 replies; 5+ messages in thread
From: Hannes Reinecke @ 2025-07-10 6:36 UTC (permalink / raw)
To: John Meneghini
Cc: james.smart, dick.kennedy, njavali, linux-scsi, axboe, sagi, hch,
kbusch, linux-nvme, Bryan Gurney
On 7/10/25 00:03, John Meneghini wrote:
> Hannes, this patch fixes the queue-depth scheduler. Please take a look.
>
> On 7/9/25 5:26 PM, Bryan Gurney wrote:
>> From: John Meneghini <jmeneghi@redhat.com>
>>
>> Exclude marginal paths from queue-depth io policy. In the case where all
>> paths are marginal and no optimized or non-optimized path is found, we
>> fall back to __nvme_find_path which selects the best marginal path.
>>
>> Tested-by: Bryan Gurney <bgurney@redhat.com>
>> Signed-off-by: John Meneghini <jmeneghi@redhat.com>
>> ---
>> drivers/nvme/host/multipath.c | 7 ++++++-
>> 1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/
>> multipath.c
>> index 8d4e54bb4261..767583e8454b 100644
>> --- a/drivers/nvme/host/multipath.c
>> +++ b/drivers/nvme/host/multipath.c
>> @@ -420,6 +420,9 @@ static struct nvme_ns
>> *nvme_queue_depth_path(struct nvme_ns_head *head)
>> if (nvme_path_is_disabled(ns))
>> continue;
>> + if (nvme_ctrl_is_marginal(ns->ctrl))
>> + continue;
>> +
>> depth = atomic_read(&ns->ctrl->nr_active);
>> switch (ns->ana_state) {
>> @@ -443,7 +446,9 @@ static struct nvme_ns
>> *nvme_queue_depth_path(struct nvme_ns_head *head)
>> return best_opt;
>> }
>> - return best_opt ? best_opt : best_nonopt;
>> + best_opt = (best_opt) ? best_opt : best_nonopt;
>> +
>> + return best_opt ? best_opt : __nvme_find_path(head, numa_node_id());
>> }
>> static inline bool nvme_path_is_optimized(struct nvme_ns *ns)
>
Hmm. Not convinced. I would expect a 'marginal' path to behave differently
(performance-wise) than unaffected paths. And the queue-depth scheduler
should be able to handle paths with different performance
characteristics just fine.
(Is it possible that your results are test artifacts? I guess
your tool just injects FPIN messages with no performance impact,
resulting in this behaviour...)
But if you want to exclude marginal paths from queue depth:
by all means, go for it.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* [PATCH v7 0/6] nvme-fc: FPIN link integrity handling
@ 2025-06-24 20:20 Bryan Gurney
2025-07-11 2:59 ` [PATCH v8 8/8] nvme-multipath: queue-depth support for marginal paths Muneendra Kumar
0 siblings, 1 reply; 5+ messages in thread
From: Bryan Gurney @ 2025-06-24 20:20 UTC (permalink / raw)
To: linux-nvme, kbusch, hch, sagi, axboe
Cc: james.smart, dick.kennedy, njavali, linux-scsi, hare, bgurney,
jmeneghi
FPIN LI (link integrity) messages are received when the attached
fabric detects hardware errors. In response to these messages, I/O
should be directed away from the affected ports, which should only
be used if the 'optimized' paths are unavailable.
Upon port reset, the paths should be put back in service, as the
affected hardware might have been replaced.
This patch adds a new controller flag 'NVME_CTRL_MARGINAL'
which will be checked during multipath path selection, causing the
path to be skipped when checking for 'optimized' paths. If no
optimized paths are available the 'marginal' paths are considered
for path selection alongside the 'non-optimized' paths.
It also introduces a new nvme-fc callback 'nvme_fc_fpin_rcv()' to
evaluate the FPIN LI TLV payload and set the 'marginal' state on
all affected rports.
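
To make the flag handling concrete, the check can be pictured as in the
sketch below. This is an illustration only: nvme_ctrl_is_marginal() and
NVME_CTRL_MARGINAL are introduced by this series, but storing the flag as a
bit in ctrl->flags and the exact shape of the selector loop are assumptions
made here for clarity rather than quotes from the patches. The same demotion
pattern appears in the __nvme_find_path() excerpt quoted later in this thread.

        /* Illustrative sketch, not verbatim from the series. */
        static inline bool nvme_ctrl_is_marginal(struct nvme_ctrl *ctrl)
        {
                return test_bit(NVME_CTRL_MARGINAL, &ctrl->flags);  /* assumed storage */
        }

        /* In a path selector's per-namespace loop, a marginal path never wins
         * the 'optimized' slot; it only competes for the fallback slot
         * together with the non-optimized paths. */
        switch (ns->ana_state) {
        case NVME_ANA_OPTIMIZED:
                if (!nvme_ctrl_is_marginal(ns->ctrl)) {
                        /* track as best optimized candidate */
                        break;
                }
                fallthrough;
        case NVME_ANA_NONOPTIMIZED:
                /* track as fallback candidate */
                break;
        default:
                break;
        }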
The testing for this patch set was performed by Bryan Gurney, using the
process outlined by John Meneghini's presentation at LSFMM 2024, where
the fibre channel switch sends an FPIN notification on a specific switch
port, and the following is checked on the initiator:
1. The controllers corresponding to the paths on the port that has
received the notification are showing a set NVME_CTRL_MARGINAL flag.
\
+- nvme4 fc traddr=c,host_traddr=e live optimized
+- nvme5 fc traddr=8,host_traddr=e live non-optimized
+- nvme8 fc traddr=e,host_traddr=f marginal optimized
+- nvme9 fc traddr=a,host_traddr=f marginal non-optimized
2. The I/O statistics of the test namespace show no I/O activity on the
controllers with NVME_CTRL_MARGINAL set.
Device tps MB_read/s MB_wrtn/s MB_dscd/s
nvme4c4n1 0.00 0.00 0.00 0.00
nvme4c5n1 25001.00 0.00 97.66 0.00
nvme4c9n1 25000.00 0.00 97.66 0.00
nvme4n1 50011.00 0.00 195.36 0.00
Device tps MB_read/s MB_wrtn/s MB_dscd/s
nvme4c4n1 0.00 0.00 0.00 0.00
nvme4c5n1 48360.00 0.00 188.91 0.00
nvme4c9n1 1642.00 0.00 6.41 0.00
nvme4n1 49981.00 0.00 195.24 0.00
Device tps MB_read/s MB_wrtn/s MB_dscd/s
nvme4c4n1 0.00 0.00 0.00 0.00
nvme4c5n1 50001.00 0.00 195.32 0.00
nvme4c9n1 0.00 0.00 0.00 0.00
nvme4n1 50016.00 0.00 195.38 0.00
Link: https://people.redhat.com/jmeneghi/LSFMM_2024/LSFMM_2024_NVMe_Cancel_and_FPIN.pdf
More rigorous testing was also performed to ensure proper path migration
on each of the eight different FPIN link integrity events, particularly
during a scenario where there are only non-optimized paths available, in
a state where all paths are marginal. On a configuration with a
round-robin iopolicy, when all paths on the host show as marginal, I/O
continues on the optimized path that was most recently non-marginal.
From this point, if both of the optimized paths are down, I/O properly
continues on the remaining paths.
Changes to the original submission:
- Changed flag name to 'marginal'
- Do not block marginal path; influence path selection instead
to de-prioritize marginal paths
Changes to v2:
- Split off driver-specific modifications
- Introduce 'union fc_tlv_desc' to avoid casts
Changes to v3:
- Include reviews from Justin Tee
- Split marginal path handling patch
Changes to v4:
- Change 'u8' to '__u8' on fc_tlv_desc to fix a failure to build
- Print 'marginal' instead of 'live' in the state of controllers
when they are marginal
Changes to v5:
- Minor spelling corrections to patch descriptions
Changes to v6:
- No code changes; added note about additional testing
Hannes Reinecke (5):
fc_els: use 'union fc_tlv_desc'
nvme-fc: marginal path handling
nvme-fc: nvme_fc_fpin_rcv() callback
lpfc: enable FPIN notification for NVMe
qla2xxx: enable FPIN notification for NVMe
Bryan Gurney (1):
nvme: sysfs: emit the marginal path state in show_state()
drivers/nvme/host/core.c | 1 +
drivers/nvme/host/fc.c | 99 +++++++++++++++++++
drivers/nvme/host/multipath.c | 17 ++--
drivers/nvme/host/nvme.h | 6 ++
drivers/nvme/host/sysfs.c | 4 +-
drivers/scsi/lpfc/lpfc_els.c | 84 ++++++++--------
drivers/scsi/qla2xxx/qla_isr.c | 3 +
drivers/scsi/scsi_transport_fc.c | 27 +++--
include/linux/nvme-fc-driver.h | 3 +
include/uapi/scsi/fc/fc_els.h | 165 +++++++++++++++++--------------
10 files changed, 269 insertions(+), 140 deletions(-)
--
2.49.0
* RE: [PATCH v8 8/8] nvme-multipath: queue-depth support for marginal paths
2025-06-24 20:20 [PATCH v7 0/6] nvme-fc: FPIN link integrity handling Bryan Gurney
@ 2025-07-11 2:59 ` Muneendra Kumar
2025-07-11 14:53 ` John Meneghini
0 siblings, 1 reply; 5+ messages in thread
From: Muneendra Kumar @ 2025-07-11 2:59 UTC (permalink / raw)
To: bgurney
Cc: axboe, dick.kennedy, hare, hch, james.smart, jmeneghi, kbusch,
linux-nvme, linux-scsi, njavali, sagi, muneendra737
Correct me if I am wrong.
>>. In the case where
>> all paths are marginal and no optimized or non-optimized path is
>> found, we fall back to __nvme_find_path which selects the best marginal path
With the current patch, __nvme_find_path will always pick the path from the non-optimized paths?
Regards,
Muneendra
>On 7/10/25 00:03, John Meneghini wrote:
>> Hannes, this patch fixes the queue-depth scheduler. Please take a look.
>>
>> On 7/9/25 5:26 PM, Bryan Gurney wrote:
>>> From: John Meneghini <jmeneghi@redhat.com>
>>>
>>> Exclude marginal paths from queue-depth io policy. In the case where
>>> all paths are marginal and no optimized or non-optimized path is
>>> found, we fall back to __nvme_find_path which selects the best marginal path.
>>>
>>> Tested-by: Bryan Gurney <bgurney@redhat.com>
>>> Signed-off-by: John Meneghini <jmeneghi@redhat.com>
>>> ---
>>> drivers/nvme/host/multipath.c | 7 ++++++-
>>> 1 file changed, 6 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/
>>> multipath.c index 8d4e54bb4261..767583e8454b 100644
>>> --- a/drivers/nvme/host/multipath.c
>>> +++ b/drivers/nvme/host/multipath.c
>>> @@ -420,6 +420,9 @@ static struct nvme_ns
>>> *nvme_queue_depth_path(struct nvme_ns_head *head)
>>> if (nvme_path_is_disabled(ns))
>>> continue;
>>> + if (nvme_ctrl_is_marginal(ns->ctrl))
>>> + continue;
>>> +
>>> depth = atomic_read(&ns->ctrl->nr_active);
>>> switch (ns->ana_state) {
>>> @@ -443,7 +446,9 @@ static struct nvme_ns
>>> *nvme_queue_depth_path(struct nvme_ns_head *head)
>>> return best_opt;
>>> }
>>> - return best_opt ? best_opt : best_nonopt;
>>> + best_opt = (best_opt) ? best_opt : best_nonopt;
>>> +
>>> + return best_opt ? best_opt : __nvme_find_path(head,
>>> +numa_node_id());
>>> }
>>> static inline bool nvme_path_is_optimized(struct nvme_ns *ns)
>>
>
>Hmm. Not convinced. I would expect a 'marginal' path to behave differently
>(performance-wise) than unaffected paths. And the queue-depth scheduler should be able to handle paths with different performance characteristics just fine.
>(Is it possible that your results are test artifacts? I guess your tool just injects FPIN messages with no performance impact, resulting in this behaviour...)
>
>But if you want to exclude marginal paths from queue depth:
>by all means, go for it.
>
* Re: [PATCH v8 8/8] nvme-multipath: queue-depth support for marginal paths
2025-07-11 2:59 ` [PATCH v8 8/8] nvme-multipath: queue-depth support for marginal paths Muneendra Kumar
@ 2025-07-11 14:53 ` John Meneghini
0 siblings, 0 replies; 5+ messages in thread
From: John Meneghini @ 2025-07-11 14:53 UTC (permalink / raw)
To: Muneendra Kumar, bgurney
Cc: axboe, dick.kennedy, hare, hch, james.smart, kbusch, linux-nvme,
linux-scsi, njavali, sagi, muneendra737
On 7/10/25 10:59 PM, Muneendra Kumar wrote:
> Correct me if I am wrong.
>>> . In the case where
>>> all paths are marginal and no optimized or non-optimized path is
>>> found, we fall back to __nvme_find_path which selects the best marginal path
>
> With the current patch, __nvme_find_path will always pick the path from the non-optimized paths?
Not necessarily. I think it all comes down to this code:
switch (ns->ana_state) {
case NVME_ANA_OPTIMIZED:
        if (!nvme_ctrl_is_marginal(ns->ctrl)) {
                if (distance < found_distance) {
                        found_distance = distance;
                        found = ns;
                }
                break;
        }
        fallthrough;
case NVME_ANA_NONOPTIMIZED:
        if (distance < fallback_distance) {
                fallback_distance = distance;
                fallback = ns;
        }
        break;
Any NVME_ANA_OPTIMIZED path that is marginal becomes a part of the fallback ns algorithm.
In the case where there is at least one NVME_ANA_OPTIMIZED path, it works correctly. You will always find the NVME_ANA_OPTIMIZED
path. In the case where there are no NVME_ANA_OPTIMIZED paths, it turns into kind of a crapshoot. You end up with the first fallback
ns that's found. That could be an NVME_ANA_OPTIMIZED path or an NVME_ANA_NONOPTIMIZED path. It all depends upon how the head->list is
sorted and if there are any disabled paths.
In our testing I've seen that this sometimes selects the NVME_ANA_OPTIMIZED path and sometimes the NVME_ANA_NONOPTIMIZED path.
In the simple test case, when the first two paths are optimized, and only one is marginal, this algorithm always selects the NVME_ANA_NONOPTIMIZED path.
It's only the more complicated test when all NVME_ANA_NONOPTIMIZED paths are marginal that I see some unpredictability.
/John
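
A toy model of the order-dependence described above (standalone C, not kernel
code; every name below is made up for illustration): once all candidates land
in the fallback bucket and tie on distance, the first usable entry on the
list wins, because later entries only displace it on a strictly smaller
distance.

#include <limits.h>
#include <stdbool.h>

struct toy_path {
        bool disabled;
        int distance;   /* stand-in for the NUMA distance / iopolicy counter */
};

/* Returns the index of the chosen fallback path, or -1 if none is usable. */
static int pick_fallback(const struct toy_path *paths, int n)
{
        int best = -1, best_distance = INT_MAX;

        for (int i = 0; i < n; i++) {
                if (paths[i].disabled)
                        continue;
                /* Strict '<' means a later path with an equal distance never
                 * displaces an earlier one, so list order decides ties between
                 * marginal optimized and marginal non-optimized paths. */
                if (paths[i].distance < best_distance) {
                        best_distance = paths[i].distance;
                        best = i;
                }
        }
        return best;
}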