From: Nilay Shroff <nilay@linux.ibm.com>
To: Hannes Reinecke <hare@suse.de>,
linux-nvme@lists.infradead.org, linux-block@vger.kernel.org
Cc: hch@lst.de, kbusch@kernel.org, sagi@grimberg.me,
jmeneghi@redhat.com, axboe@kernel.dk, martin.petersen@oracle.com,
gjoyce@ibm.com
Subject: Re: [RFC PATCHv2 2/3] nvme: introduce multipath_head_always module param
Date: Tue, 29 Apr 2025 12:45:49 +0530 [thread overview]
Message-ID: <b9bb4b91-a4a0-4cbd-85ae-969efffe0951@linux.ibm.com> (raw)
In-Reply-To: <10ba7fa9-15e9-48b9-a8ac-e7c3982a211c@suse.de>
On 4/29/25 12:31 PM, Hannes Reinecke wrote:
> On 4/29/25 08:24, Nilay Shroff wrote:
>>
>>
>> On 4/29/25 11:19 AM, Hannes Reinecke wrote:
>>> On 4/28/25 09:39, Nilay Shroff wrote:
>>>>
>>>>
>>>> On 4/28/25 12:27 PM, Hannes Reinecke wrote:
>>>>> On 4/25/25 12:33, Nilay Shroff wrote:
>>>>>> Currently, a multipath head disk node is not created for single-ported
>>>>>> NVMe adapters or private namespaces. However, creating a head node in
>>>>>> these cases can help transparently handle transient PCIe link failures.
>>>>>> Without a head node, features like delayed removal cannot be leveraged,
>>>>>> making it difficult to tolerate such link failures. To address this,
>>>>>> this commit introduces the nvme_core module parameter multipath_head_always.
>>>>>>
>>>>>> When this param is set to true, it forces the creation of a multipath
>>>>>> head node regardless of the NVMe disk or namespace type. This option
>>>>>> allows the use of the delayed head node removal functionality even for
>>>>>> single-ported NVMe disks and private namespaces, and thus helps to
>>>>>> transparently handle transient PCIe link failures.
>>>>>>
>>>>>> By default multipath_head_always is set to false, thus preserving the
>>>>>> existing behavior. Setting it to true enables improved fault tolerance
>>>>>> in PCIe setups. Moreover, please note that enabling this option would
>>>>>> also implicitly enable nvme_core.multipath.
>>>>>>
>>>>>> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
>>>>>> ---
>>>>>> drivers/nvme/host/multipath.c | 70 +++++++++++++++++++++++++++++++----
>>>>>> 1 file changed, 63 insertions(+), 7 deletions(-)
>>>>>>
>>>>> I really would model this according to dm-multipath, where we have
>>>>> the 'fail_if_no_path' flag.
>>>>> This can be set for PCIe devices to retain the current behaviour
>>>>> (which we need for things like 'md' on top of NVMe) whenever
>>>>> this flag is set.
>>>>>
>>>> Okay, so you mean that when the sysfs attribute "delayed_removal_secs"
>>>> under the head disk node is _NOT_ configured (or delayed_removal_secs
>>>> is set to zero), the internal flag "fail_if_no_path" is set to true.
>>>> However, when "delayed_removal_secs" is set to a non-zero value, we
>>>> set "fail_if_no_path" to false. Is that correct?
>>>>
>>> Don't make it overly complicated.
>>> 'fail_if_no_path' (and the inverse 'queue_if_no_path') can both be
>>> mapped onto delayed_removal_secs; if the value is '0' then the head
>>> disk is immediately removed (the 'fail_if_no_path' case), and if it's
>>> -1 it is never removed (the 'queue_if_no_path' case).
>>>
>> Yes, if the value of delayed_removal_secs is 0 then the head is immediately
>> removed; however, if the value of delayed_removal_secs is anything but zero
>> (i.e. greater than zero, as delayed_removal_secs is unsigned), the head is
>> removed only after delayed_removal_secs has elapsed, so the disk can
>> recover from a transient link failure. We never pin the head node
>> indefinitely.
>>
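Just to make those semantics concrete, here is a minimal userspace sketch of the removal decision (the helper name and parameters are made up for illustration; this is not the actual kernel code):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model of the head-node removal policy discussed above:
 *   delayed_removal_secs == 0 -> remove the head node immediately
 *                                (the 'fail_if_no_path' case);
 *   delayed_removal_secs  > 0 -> keep the head node for that many seconds,
 *                                giving a transient PCIe link failure a
 *                                chance to recover. The value is unsigned,
 *                                so the head is never pinned indefinitely.
 */
static bool remove_head_now(unsigned int delayed_removal_secs,
			    unsigned int secs_since_last_path)
{
	if (delayed_removal_secs == 0)
		return true;	/* fail_if_no_path: no grace period */

	/* Head survives until the configured grace period elapses. */
	return secs_since_last_path >= delayed_removal_secs;
}
```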
>>> Question, though: How does it interact with the existing 'ctrl_loss_tmo'? Both describe essentially the same situation...
>>>
>> The delayed_removal_secs attribute is modeled for NVMe PCIe adapters, so
>> it really doesn't interact or interfere with ctrl_loss_tmo, which is a
>> fabrics controller option.
>>
> Not so sure here.
> You _could_ expand the scope for ctrl_loss_tmo to PCI, too;
> as most PCI devices will only ever have one controller 'ctrl_loss_tmo'
> will be identical to 'delayed_removal_secs'.
>
> So I guess my question is: is there a value for fabrics to control
> the lifetime of struct ns_head independent on the lifetime of the
> controller?
>
The ctrl_loss_tmo option doesn't actually control the lifetime of the
ns_head. In fact, ctrl_loss_tmo allows fabric I/O commands to fail
fast so that they don't get stuck while the host NVMe-oF controller
is in the reconnecting state. A user may not want to wait out the full
reconnect period when the fabrics controller loses its connection to
the target: the default reconnect timeout is 10 minutes, which is way
longer than the expected timeout of 30 seconds for any I/O command to
fail.
You may find more details in commit 8c4dfea97f15 ("nvme-fabrics:
reject I/O to offline device"), which implements ctrl_loss_tmo.
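Roughly, as a userspace model (the helper name and parameters are made up; this is not the actual fabrics code), ctrl_loss_tmo bounds the reconnect window like this:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model of the ctrl_loss_tmo semantics described above:
 * while a fabrics controller is reconnecting, I/O is held and reconnect
 * attempts continue; once the controller has been offline for longer
 * than ctrl_loss_tmo seconds, the host gives up, deletes the controller
 * and fails pending I/O fast instead of waiting out the full reconnect
 * period (10 minutes by default). A negative ctrl_loss_tmo means
 * "retry forever".
 */
static bool give_up_reconnect(int ctrl_loss_tmo, int secs_offline)
{
	if (ctrl_loss_tmo < 0)
		return false;	/* never stop reconnecting */

	return secs_offline > ctrl_loss_tmo;
}
```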
Thanks,
--Nilay