From: Hannes Reinecke <hare@suse.de>
To: John Meneghini <jmeneghi@redhat.com>,
Bryan Gurney <bgurney@redhat.com>,
linux-nvme@lists.infradead.org, kbusch@kernel.org, hch@lst.de,
sagi@grimberg.me, axboe@kernel.dk
Cc: james.smart@broadcom.com, dick.kennedy@broadcom.com,
linux-scsi@vger.kernel.org, njavali@marvell.com,
muneendra737@gmail.com
Subject: Re: [PATCH v7 0/6] nvme-fc: FPIN link integrity handling
Date: Wed, 9 Jul 2025 08:21:02 +0200 [thread overview]
Message-ID: <0db83ea3-83aa-45ef-bafd-6b37da541d22@suse.de> (raw)
In-Reply-To: <b0817e2a-1c4e-4d7d-9a84-633913d9e0e4@redhat.com>
On 7/8/25 21:56, John Meneghini wrote:
> On 7/2/25 2:10 AM, Hannes Reinecke wrote:
>>> During path fail testing on the numa iopolicy, I found that I/O moves
>>> off of the marginal path after a first link integrity event is
>>> received, but if the non-marginal path the I/O is on is disconnected,
>>> the I/O is transferred onto a marginal path (in testing, sometimes
>>> I've seen it go to a "marginal optimized" path, and sometimes
>>> "marginal non-optimized").
>>>
>> That is by design.
>> 'marginal' paths are only evaluated for the 'optimized' path selection,
>> where it's obvious that 'marginal' paths should not be selected as
>> 'optimized'.
>
> I think we might want to change this. With the NUMA scheduler you can
> end up using the non-optimized marginal path. This happens when
> we test with 4 paths (2 optimized and 2 non-optimized) and set all 4
> paths to marginal. In this case the NUMA scheduler should simply
> choose the optimized marginal path on the closest numa node. However,
> that's not what happens. It consistently chooses the first
> non-optimized path.
Ah. So it seems that the NUMA scheduler needs to be fixed.
I'll have a look there.
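Rough sketch of the direction I'm thinking of (untested, and
nvme_path_is_marginal() is just a placeholder for whatever check this
series ends up providing): do the distance-based selection in two
passes, and only fall back to a marginal path once no healthy path is
left. The optimized vs. non-optimized distinction is left out for
brevity:

/*
 * Untested sketch, not the real __nvme_find_path();
 * nvme_path_is_marginal() is a placeholder for the marginal check
 * from this series.
 */
static struct nvme_ns *numa_find_path(struct nvme_ns_head *head, int node)
{
        struct nvme_ns *best = NULL, *marginal = NULL, *ns;
        int dist, best_dist = INT_MAX, marginal_dist = INT_MAX;

        list_for_each_entry_rcu(ns, &head->list, siblings) {
                if (nvme_path_is_disabled(ns))
                        continue;
                if (ns->ana_state != NVME_ANA_OPTIMIZED &&
                    ns->ana_state != NVME_ANA_NONOPTIMIZED)
                        continue;
                dist = node_distance(node, ns->ctrl->numa_node);
                if (nvme_path_is_marginal(ns)) {
                        /* remember the closest marginal path as a last resort */
                        if (dist < marginal_dist) {
                                marginal_dist = dist;
                                marginal = ns;
                        }
                        continue;
                }
                if (dist < best_dist) {
                        best_dist = dist;
                        best = ns;
                }
        }
        /* all usable paths are marginal: pick the closest one anyway */
        return best ? best : marginal;
}

With that the 'all paths marginal' case degrades to the plain
closest-node selection instead of ending up on a non-optimized path.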
>> For 'non-optimized' the situation is less clear; is the 'non-optimized'
>> path preferable to 'marginal'? Or the other way round?
>> So once the 'optimized' path selection returns no paths, _any_ of the
>> remaining paths are eligible.
>
> This is a good question for Broadcom. I think, with all IO schedulers,
> as long
> as there is a non-marginal path available, that path should be used. So
> a non-marginal non-optimized path should take precedence over a marginal
> optimized path.
>
> In the case where all paths are marginal, I think the scheduler should
> simply use the first optimized path on the closest numa node.
For the NUMA case, yes. But as I said above, it seems that the NUMA
scheduler needs to be fixed.
>>> The queue-depth iopolicy doesn't change its path selection based on
>>> the marginal flag, but looking at nvme_queue_depth_path(), I can see
>>> that there's currently no logic to handle marginal paths. We're
>>> developing a patch to address that issue in queue-depth, but we need
>>> to do more testing.
>>>
>> Again, by design.
>> The whole point of the marginal path patchset is that I/O should
>> be steered away from the marginal path, but the path itself should
>> not be completely shut off (otherwise we could just have declared the
>> path as 'faulty' and been done with it).
>> Any I/O on 'marginal' paths should have higher latencies, and higher
>> latencies should result in higher queue depths on these paths. So
>> the queue-depth based IO scheduler should do the right thing
>> automatically.
>
> I don't understand this. The round-robin scheduler removes marginal
> paths, so why shouldn't the queue-depth and numa schedulers do the same?
>
The NUMA scheduler should, that's correct.
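If we do end up teaching the queue-depth selector about marginal paths
as well (the queue-depth patch mentioned above), I'd expect the same
pattern: pick the least-loaded healthy path, and only fall back to
marginal paths once none is left. Again an untested sketch with the
same placeholder helper:

/*
 * Untested sketch, not the real nvme_queue_depth_path();
 * nvme_path_is_marginal() is a placeholder.
 */
static struct nvme_ns *qd_find_path(struct nvme_ns_head *head)
{
        struct nvme_ns *best = NULL, *marginal = NULL, *ns;
        int depth, best_depth = INT_MAX, marginal_depth = INT_MAX;

        list_for_each_entry_rcu(ns, &head->list, siblings) {
                if (nvme_path_is_disabled(ns))
                        continue;
                depth = atomic_read(&ns->ctrl->nr_active);
                if (nvme_path_is_marginal(ns)) {
                        /* track the least-loaded marginal path separately */
                        if (depth < marginal_depth) {
                                marginal_depth = depth;
                                marginal = ns;
                        }
                        continue;
                }
                if (depth < best_depth) {
                        best_depth = depth;
                        best = ns;
                }
        }
        /* only use a marginal path if no healthy path is left */
        return best ? best : marginal;
}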
>> Always assuming that marginal paths should have higher latencies.
>> If they don't, then they will happily be selected for I/O.
>> But then again, if the marginal path does _not_ have higher
>> latencies, why shouldn't we select it for I/O?
>
> This may be true with FPIN Congestion Signal, but we are testing Link
> Integrity. With FPIN LI I think we want to simply stop using the path.
> LI has nothing to do with latency. So unless ALL paths are marginal,
> the IO scheduler should not be using any marginal path.
>
For FPIN LI the paths should be marked as 'faulty', true.
> Do we need another state here? There is an ask to support FPIN CS, so
> maybe using the term "marginal" to describe LI is wrong.
>
> Maybe we need something like "marginal_li" and "marginal_cs" to describe
> the difference.
>
Really not so sure. I really wonder how an FPIN LI event reflects back on
the actual I/O. Will the I/O be aborted with an error? Or does the I/O
continue at a slower pace?
I would think the latter, and that's the design assumption for this
patchset. If it's the former and the I/O is aborted with an error, we are
in a situation similar to the one we have with a faulty cable, and we need
to come up with a different solution.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich