linux-scsi.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Hannes Reinecke <hare@suse.de>
To: John Meneghini <jmeneghi@redhat.com>,
	Bryan Gurney <bgurney@redhat.com>,
	linux-nvme@lists.infradead.org, kbusch@kernel.org, hch@lst.de,
	sagi@grimberg.me, axboe@kernel.dk
Cc: james.smart@broadcom.com, dick.kennedy@broadcom.com,
	linux-scsi@vger.kernel.org, njavali@marvell.com,
	muneendra737@gmail.com
Subject: Re: [PATCH v7 0/6] nvme-fc: FPIN link integrity handling
Date: Wed, 9 Jul 2025 08:21:02 +0200	[thread overview]
Message-ID: <0db83ea3-83aa-45ef-bafd-6b37da541d22@suse.de> (raw)
In-Reply-To: <b0817e2a-1c4e-4d7d-9a84-633913d9e0e4@redhat.com>

On 7/8/25 21:56, John Meneghini wrote:
> On 7/2/25 2:10 AM, Hannes Reinecke wrote:
>>> During path fail testing on the numa iopolicy, I found that I/O moves
>>> off of the marginal path after a first link integrity event is
>>> received, but if the non-marginal path the I/O is on is disconnected,
>>> the I/O is transferred onto a marginal path (in testing, sometimes
>>> I've seen it go to a "marginal optimized" path, and sometimes
>>> "marginal non-optimized").
>>>
>> That is by design.
>> 'marginal' paths are only evaluated for the 'optimized' path selection,
>> where it's obvious that 'marginal' paths should not be selected as
>> 'optimized'.
> 
> I think we might want to change this.  With the NUMA scheduler you can 
> end up with using the non-optimized marginal path.  This happens when
 > we test with 4 paths (2 optimized and 2 non-optimized) and set all 4
 > paths to marginal.  In this case> the NUMA scheduler should simply 
choose the optimized marginal path on
> the closest numa node.  However, that's not what happens. It consistently
 > chooses the non-optimized first non-optimized path.>
Ah. So it seems that the NUMA scheduler needs to be fixed.
I'll have a look there.

>> For 'non-optimized' the situation is less clear; is the 'non-optimized'
>> path preferrable to 'marginal'? Or the other way round?
>> So once the 'optimized' path selection returns no paths, _any_ of the
>> remaining paths are eligible.
> 
> This is a good question for Broadcom.  I think, with all IO schedulers, 
> as long
> as there is a non-marginal path available, that path should be used.  So
> a non-marginal non-optimized path should take precedence over a marginal 
> optimized path.
> 
> In the case were all paths are marginal, I think the scheduler should 
> simply use the first optimized path on the closest numa node.

For the NUMA case, yes. But as I said above, it seems that the NUMA
scheduler needs to fixes.

>>> The queue-depth iopolicy doesn't change its path selection based on
>>> the marginal flag, but looking at nvme_queue_depth_path(), I can see
>>> that there's currently no logic to handle marginal paths.  We're
>>> developing a patch to address that issue in queue-depth, but we need
>>> to do more testing.
>>>
>> Again, by design.
>> The whole point of the marginal path patchset is that I/O should
>> be steered away from the marginal path, but the path itself should
>> not completely shut off (otherwise we just could have declared the
>> path as 'faulty' and be done with).
>> Any I/O on 'marginal' paths should have higher latencies, and higher
>> latencies should result in higher queue depths on these paths. So
>> the queue-depth based IO scheduler should to the right thing
>> automatically.
> 
> I don't understand this.  The Round-robin scheduler removes marginal 
> paths, why shouldn't the queue-depth and numa scheduler do the same?
> 
The NUMA scheduler should, that's correct.

>> Always assuming that marginal paths should have higher latencies.
>> If they haven't then they will be happily selection for I/O.
>> But then again, if the marginal path does _not_ have highert
>> latencies why shouldn't we select it for I/O?
> 
> This may be true with FPIN Congestion Signal, but we are testing Link 
> Integrity.  With FPIN LI I think we want to simply stop using the path.
> LI has nothing to do with latency.  So unless ALL paths are marginal, 
> the IO scheduler should not be using any marginal path.
> 
For FPIN LI the paths should be marked as 'faulty', true.

> Do we need another state here?  There is an ask to support FPIN CS, so 
> maybe using the term "marginal" to describe LI is wrong.
> 
> Maybe we need something like "marginal_li" and "marginal_cs" to describe 
> the difference.
> 
Really not so sure. I really wonder how a FPIN LI event reflect back on
the actual I/O. Will the I/O be aborted with an error? Or does the I/O
continue at a slower pace?
I would think the latter, and that's the design assumption for this
patchset. If it's the former and I/O is aborted with an error we are in
a similar situation like we have with a faulty cable, and we need
to come up with a different solution.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich

  reply	other threads:[~2025-07-09  6:21 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-06-24 20:20 [PATCH v7 0/6] nvme-fc: FPIN link integrity handling Bryan Gurney
2025-06-24 20:20 ` [PATCH v7 1/6] fc_els: use 'union fc_tlv_desc' Bryan Gurney
2025-06-24 20:20 ` [PATCH v7 2/6] nvme-fc: marginal path handling Bryan Gurney
2025-07-03 11:47   ` Christoph Hellwig
2025-06-24 20:20 ` [PATCH v7 3/6] nvme-fc: nvme_fc_fpin_rcv() callback Bryan Gurney
2025-06-24 20:20 ` [PATCH v7 4/6] lpfc: enable FPIN notification for NVMe Bryan Gurney
2025-06-24 20:20 ` [PATCH v7 5/6] qla2xxx: " Bryan Gurney
2025-06-24 20:20 ` [PATCH v7 6/6] nvme: sysfs: emit the marginal path state in show_state() Bryan Gurney
2025-07-01  0:27 ` [PATCHv7 0/6] nvme-fc: FPIN link integrity handling Muneendra Kumar
2025-07-01 20:32 ` [PATCH v7 " Bryan Gurney
2025-07-02  6:10   ` Hannes Reinecke
2025-07-08 19:56     ` John Meneghini
2025-07-09  6:21       ` Hannes Reinecke [this message]
2025-07-09 13:42         ` Bryan Gurney
2025-07-09 17:44           ` Hannes Reinecke
2025-07-11  8:54             ` Muneendra Kumar M
2025-07-11 16:49               ` John Meneghini
2025-07-11  2:59 ` [PATCH v8 8/8] nvme-multipath: queue-depth support for marginal paths Muneendra Kumar
2025-07-11 14:53   ` John Meneghini

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=0db83ea3-83aa-45ef-bafd-6b37da541d22@suse.de \
    --to=hare@suse.de \
    --cc=axboe@kernel.dk \
    --cc=bgurney@redhat.com \
    --cc=dick.kennedy@broadcom.com \
    --cc=hch@lst.de \
    --cc=james.smart@broadcom.com \
    --cc=jmeneghi@redhat.com \
    --cc=kbusch@kernel.org \
    --cc=linux-nvme@lists.infradead.org \
    --cc=linux-scsi@vger.kernel.org \
    --cc=muneendra737@gmail.com \
    --cc=njavali@marvell.com \
    --cc=sagi@grimberg.me \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).