public inbox for linux-nvme@lists.infradead.org
* [RFC PATCHv4 0/3] improve NVMe multipath handling
@ 2025-05-09 17:51 Nilay Shroff
From: Nilay Shroff @ 2025-05-09 17:51 UTC
  To: linux-nvme
  Cc: hch, hare, kbusch, sagi, jmeneghi, axboe, martin.petersen, gjoyce

Hi,

This patch series introduces improvements to NVMe multipath handling by
refining the removal behavior of the multipath head node and simplifying
configuration options. The idea/POC for this change was originally
proposed by Christoph[1] and Keith[2]. I built on their original
idea/POC to implement this series.

The first patch in the series addresses an issue where the multipath
head node of a PCIe NVMe disk is removed immediately when all disk paths
are lost. This can cause problems in scenarios such as:
- Hot removal and re-addition of a disk.
- Transient PCIe link failures that trigger re-enumeration,
  briefly removing and restoring the disk.

In such cases, premature removal of the head node may result in a device
node name change, requiring applications to reopen device handles if
they were performing I/O during the failure. To mitigate this, we
introduce a delayed removal mechanism. Instead of removing the head node
immediately, the system waits for a configurable timeout, allowing the
disk to recover. If the disk comes back online within this window, the
head node remains unchanged, ensuring uninterrupted workloads.
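
The delayed removal behaviour described above can be modelled in a few
lines of userspace code. This is only an illustrative sketch, not the
kernel implementation: the HeadNode class, its method names, and the
explicit "now" timestamps are all hypothetical and exist purely to show
the intended state transitions.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class HeadNode:
    """Toy model of a multipath head node (hypothetical, not kernel code)."""
    delayed_removal_secs: int = 0            # 0 means remove immediately
    paths: Set[str] = field(default_factory=set)
    removal_deadline: Optional[float] = None
    removed: bool = False

    def add_path(self, name: str) -> None:
        if self.removed:
            # Once the head is gone, a returning disk would get a new head.
            raise RuntimeError("head node already removed")
        self.paths.add(name)
        self.removal_deadline = None         # a returning path cancels removal

    def remove_path(self, name: str, now: float) -> None:
        self.paths.discard(name)
        if self.paths:
            return                           # other paths are still alive
        if self.delayed_removal_secs == 0:
            self.removed = True              # immediate-removal behaviour
        else:
            self.removal_deadline = now + self.delayed_removal_secs

    def tick(self, now: float) -> None:
        """Remove the head node once the grace period has expired."""
        if self.removal_deadline is not None and now >= self.removal_deadline:
            self.removed = True
            self.removal_deadline = None
```

In this model, a path that comes back before the deadline leaves the
same HeadNode object in place, which corresponds to the device node
name staying stable for applications.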

A new sysfs attribute, delayed_removal_secs, allows users to configure
this timeout. By default, it is set to 0 seconds, preserving the
existing behavior unless explicitly changed.
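
As a rough illustration of how the attribute might be set from
userspace, the helper below writes a timeout to a sysfs-style file. The
exact attribute path is not given in this cover letter, so the caller
must supply it; something like /sys/block/nvme0n1/delayed_removal_secs
is an assumption, and the helper names are hypothetical.

```python
from pathlib import Path


def set_delayed_removal_secs(attr_path: str, secs: int) -> None:
    """Write the timeout (in seconds) to the given attribute file."""
    if secs < 0:
        raise ValueError("timeout must be non-negative")
    Path(attr_path).write_text(f"{secs}\n")


def get_delayed_removal_secs(attr_path: str) -> int:
    """Read the current timeout back from the attribute file."""
    return int(Path(attr_path).read_text().strip())
```

Writing 0 through this helper would correspond to the default
immediate-removal behaviour.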

The second patch in the series introduces the multipath_always_on module
parameter. When this option is set, it forces creation of the multipath
head disk node even for single-ported NVMe disks or private namespaces,
and thus allows delayed head node removal. This helps handle transient
PCIe link failures transparently even for a single-ported NVMe disk or a
private namespace.
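
The effect of the new parameter on head node creation can be summarised
as a small decision function. This is a simplified sketch, not the
driver's code: the function name and the shared_namespace flag are
hypothetical, and it assumes that multipath_always_on only has an
effect while the existing multipath parameter is enabled.

```python
def should_create_head_node(multipath: bool,
                            multipath_always_on: bool,
                            shared_namespace: bool) -> bool:
    """Decide whether a multipath head disk node is created (toy model)."""
    if not multipath:
        # Assumption: with multipath disabled, no head node is created.
        return False
    # A shared or multi-ported namespace gets a head node as before;
    # multipath_always_on extends this to single-ported disks and
    # private namespaces.
    return shared_namespace or multipath_always_on
```

With multipath_always_on unset, a private namespace yields False here,
which matches the existing default behaviour.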

The third patch in the series makes no functional changes; it renames a
few functions, which improves code readability and better aligns
function names with their actual roles.

These changes should help improve NVMe multipath reliability and simplify
configuration. Feedback and testing are welcome!

[1] https://lore.kernel.org/linux-nvme/Y9oGTKCFlOscbPc2@infradead.org/
[2] https://lore.kernel.org/linux-nvme/Y+1aKcQgbskA2tra@kbusch-mbp.dhcp.thefacebook.com/

Changes from v3:
    - Removed special case for fabric handling and unified head node 
      delayed removal behavior across PCIe and fabric controllers (hch)

Link to v3: https://lore.kernel.org/all/20250504175051.2208162-1-nilay@linux.ibm.com/      

Changes from v2:
    - Rename multipath_head_always to multipath_always_on (Hannes Reinecke)
    - Map delayed_removal_secs to queue_if_no_path internally; if
      delayed_removal_secs is non-zero then queue_if_no_path is set,
      otherwise it is unset (Hannes Reinecke)
    - A few minor code readability improvements in the second patch while
      handling multipath_param_set and multipath_always_on_set (hch)
    - Avoid the race in shutdown namespace removal by deleting head->entry
      during the first critical section of nvme_ns_remove for the case
      where head delayed removal is not configured (hch)
    - Use ctrl->ops->flags & NVME_F_FABRICS to determine whether the 
      ctrl uses fabric setup (Sagi)
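
The mapping of delayed_removal_secs to queue_if_no_path listed in the
v2 changes can be sketched as follows. This is a toy model only: the
Head class and submit_without_paths are hypothetical, and the "queued"
versus "failed" outcome is an assumption about what queue_if_no_path
implies for I/O issued while no path is available.

```python
from typing import List


class Head:
    """Toy model of the delayed_removal_secs -> queue_if_no_path mapping."""

    def __init__(self, delayed_removal_secs: int = 0) -> None:
        self.pending: List[str] = []
        self.set_delayed_removal_secs(delayed_removal_secs)

    def set_delayed_removal_secs(self, secs: int) -> None:
        self.delayed_removal_secs = secs
        # Non-zero timeout sets the flag; zero clears it.
        self.queue_if_no_path = secs != 0

    def submit_without_paths(self, io: str) -> str:
        """Handle an I/O submitted while no path is available."""
        if self.queue_if_no_path:
            self.pending.append(io)   # assumed: held until a path returns
            return "queued"
        return "failed"
```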

Link to v2: https://lore.kernel.org/all/20250425103319.1185884-1-nilay@linux.ibm.com/

Changes from v1:
    - Renamed delayed_shutdown_sec to delayed_removal_secs as "shutdown"
      has a special meaning when used with NVMe device (Martin Petersen)
    - Instead of always adding the mpath head disk node by default, added
      a new module option nvme_core.multipath_head_always which, when set,
      creates the mpath head disk node (even for a private namespace or a
      namespace backed by a single-ported nvme disk). This way we can
      preserve the default old behavior. (hch)
    - Renamed the nvme_mpath_shutdown_disk function to
      nvme_mpath_remove_disk, as in the NVMe context the term "shutdown"
      has a specific technical meaning. (hch)
    - Undo changes which removed multipath module param as this param is
      still useful and used for many different things.

Link to v1: https://lore.kernel.org/all/20250321063901.747605-1-nilay@linux.ibm.com/

Nilay Shroff (3):
  nvme-multipath: introduce delayed removal of the multipath head node
  nvme: introduce multipath_always_on module param
  nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_disk

 drivers/nvme/host/core.c      |  12 +-
 drivers/nvme/host/multipath.c | 206 ++++++++++++++++++++++++++++++----
 drivers/nvme/host/nvme.h      |  24 +++-
 drivers/nvme/host/sysfs.c     |   7 ++
 4 files changed, 220 insertions(+), 29 deletions(-)

-- 
2.49.0




end of thread, other threads:[~2025-05-14 13:54 UTC | newest]

Thread overview: 7+ messages
2025-05-09 17:51 [RFC PATCHv4 0/3] improve NVMe multipath handling Nilay Shroff
2025-05-09 17:51 ` [RFC PATCHv4 1/3] nvme-multipath: introduce delayed removal of the multipath head node Nilay Shroff
2025-05-12  5:51   ` Hannes Reinecke
2025-05-09 17:51 ` [RFC PATCHv4 2/3] nvme: introduce multipath_always_on module param Nilay Shroff
2025-05-14  5:42   ` Christoph Hellwig
2025-05-14 13:03     ` Nilay Shroff
2025-05-09 17:51 ` [RFC PATCHv4 3/3] nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_disk Nilay Shroff
