* [PATCHv7 RFC 0/3] Add visibility for native NVMe multipath using sysfs
@ 2025-01-12 12:41 Nilay Shroff
From: Nilay Shroff @ 2025-01-12 12:41 UTC
  To: linux-nvme; +Cc: kbusch, sagi, hch, dwagner, hare, chaitanyak, axboe, gjoyce

Hi,

This RFC proposes adding new sysfs attributes that provide visibility into
NVMe native multipath I/O.

The changes are divided into three patches.
The first patch adds visibility for round-robin io-policy.
The second patch adds visibility for numa io-policy.
The third patch adds visibility for queue-depth io-policy.

NVMe native multipath supports three I/O policies (numa, round-robin and
queue-depth) for selecting an I/O path. However, there is currently no way
to see which path the multipath code selects for forwarding I/O. This RFC
adds that visibility through new sysfs attribute files named "numa_nodes"
and "queue_depth" under each namespace path block device,
/sys/block/nvmeXcYnZ/. It also creates a "multipath" sysfs directory under
the head disk node and adds, in that directory, a link to each namespace
path device that the head disk node points to.

The output below was generated with this proposed RFC patch series applied
on a system with two multi-controller PCIe NVMe disks attached. The system
is also an NVMf-TCP host connected to an NVMf-TCP target over two NICs.
Four NUMA nodes were online when the output was captured:

# cat /sys/devices/system/node/online
0-3

# lscpu
<snip>
NUMA:
  NUMA node(s):           4
  NUMA node0 CPU(s):
  NUMA node1 CPU(s):      0-7
  NUMA node2 CPU(s):      8-31
  NUMA node3 CPU(s):      32-63
<snip>

Please note that NUMA node 0, though online, doesn't currently have any
CPU assigned to it.

# nvme list -v
Subsystem        Subsystem-NQN                                                                                    Controllers
---------------- ------------------------------------------------------------------------------------------------ ----------------
nvme-subsys1     nqn.1994-11.com.samsung:nvme:PM1735a:2.5-inch:S6RTNE0R900057                                     nvme0, nvme1
nvme-subsys3     nqn.2019-10.com.kioxia:KCM7DRUG1T92:3D60A04906N1                                                 nvme2, nvme3
nvme-subsys4     nvmet_subsystem                                                                                  nvme4, nvme5

Device           Cntlid SN                   MN                                       FR       TxPort Address        Slot   Subsystem    Namespaces
---------------- ------ -------------------- ---------------------------------------- -------- ------ -------------- ------ ------------ ----------------
nvme0    66     S6RTNE0R900057       3.2TB NVMe Gen4 U.2 SSD III              REV.SN66 pcie   052e:78:00.0   U50EE.001.WZS000E-P3-C14-R1 nvme-subsys1 nvme1n1
nvme1    65     S6RTNE0R900057       3.2TB NVMe Gen4 U.2 SSD III              REV.SN66 pcie   058e:78:00.0   U50EE.001.WZS000E-P3-C14-R2 nvme-subsys1 nvme1n1
nvme2    2      3D60A04906N1         1.6TB NVMe Gen4 U.2 SSD IV               REV.CAS2 pcie   0524:28:00.0   U50EE.001.WZS000E-P3-C4-R1 nvme-subsys3 nvme3n1
nvme3    1      3D60A04906N1         1.6TB NVMe Gen4 U.2 SSD IV               REV.CAS2 pcie   0584:28:00.0   U50EE.001.WZS000E-P3-C4-R2 nvme-subsys3 nvme3n1
nvme4    1      a224673364d1dcb6fab9 Linux                                    6.9.0-rc tcp    traddr=20.0.0.200,trsvcid=4420,src_addr=20.0.0.100        nvme-subsys4 nvme4n1
nvme5    2      a224673364d1dcb6fab9 Linux                                    6.9.0-rc tcp    traddr=10.0.0.200,trsvcid=4420,src_addr=10.0.0.100        nvme-subsys4 nvme4n1

Device            Generic           NSID       Usage                      Format           Controllers
----------------- ----------------- ---------- -------------------------- ---------------- ----------------
/dev/nvme1n1 /dev/ng1n1   0x1          5.75  GB /   5.75  GB      4 KiB +  0 B   nvme0, nvme1
/dev/nvme3n1 /dev/ng3n1   0x2          0.00   B /   5.75  GB      4 KiB +  0 B   nvme2, nvme3
/dev/nvme4n1 /dev/ng4n1   0x1          5.75  GB /   5.75  GB      4 KiB +  0 B   nvme4, nvme5


# nvme show-topology
nvme-subsys1 - NQN=nqn.1994-11.com.samsung:nvme:PM1735a:2.5-inch:S6RTNE0R900057
               hostnqn=nqn.2014-08.org.nvmexpress:uuid:41528538-e8ad-4eaf-84a7-9c552917d988
               iopolicy=numa
\
 +- ns 1
 \
  +- nvme0 pcie 052e:78:00.0 live optimized
  +- nvme1 pcie 058e:78:00.0 live optimized

nvme-subsys3 - NQN=nqn.2019-10.com.kioxia:KCM7DRUG1T92:3D60A04906N1
               hostnqn=nqn.2014-08.org.nvmexpress:uuid:41528538-e8ad-4eaf-84a7-9c552917d988
               iopolicy=round-robin
\
 +- ns 2
 \
  +- nvme2 pcie 0524:28:00.0 live optimized
  +- nvme3 pcie 0584:28:00.0 live optimized

nvme-subsys4 - NQN=nvmet_subsystem
               hostnqn=nqn.2014-08.org.nvmexpress:uuid:41528538-e8ad-4eaf-84a7-9c552917d988
               iopolicy=queue-depth
\
 +- ns 1
 \
  +- nvme4 tcp traddr=20.0.0.200,trsvcid=4420,src_addr=20.0.0.100 live optimized
  +- nvme5 tcp traddr=10.0.0.200,trsvcid=4420,src_addr=10.0.0.100 live optimized

As seen above, three shared namespaces are created. In terms of iopolicy,
"numa" is configured for nvme-subsys1, "round-robin" for nvme-subsys3 and
"queue-depth" for nvme-subsys4.

Now, under each namespace "head disk node", we create a sysfs attribute
group named "multipath". The "multipath" group then links to each path
that this head disk node points to:

# tree /sys/block/nvme1n1/multipath/
/sys/block/nvme1n1/multipath/
├── nvme1c0n1 -> ../../../../../pci052e:78/052e:78:00.0/nvme/nvme0/nvme1c0n1
└── nvme1c1n1 -> ../../../../../pci058e:78/058e:78:00.0/nvme/nvme1/nvme1c1n1

# tree /sys/block/nvme3n1/multipath/
/sys/block/nvme3n1/multipath/
├── nvme3c2n1 -> ../../../../../pci0524:28/0524:28:00.0/nvme/nvme2/nvme3c2n1
└── nvme3c3n1 -> ../../../../../pci0584:28/0584:28:00.0/nvme/nvme3/nvme3c3n1

# tree /sys/block/nvme4n1/multipath/
/sys/block/nvme4n1/multipath/
├── nvme4c4n1 -> ../../../../nvme-fabrics/ctl/nvme4/nvme4c4n1
└── nvme4c5n1 -> ../../../../nvme-fabrics/ctl/nvme5/nvme4c5n1
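The layout shown by tree above can also be walked from a script. The
following is a minimal sketch, assuming the "multipath" directory proposed
by this RFC exists; the function name list_multipath_paths and the
sysfs_block parameter are hypothetical and only serve the illustration.

```python
import os


def list_multipath_paths(head, sysfs_block="/sys/block"):
    """Return {path_device_name: resolved_target} for one head disk node.

    `head` is a head disk node name such as "nvme1n1". The "multipath"
    directory is the one proposed by this RFC; on a kernel without the
    series, os.listdir() raises FileNotFoundError.
    """
    mp_dir = os.path.join(sysfs_block, head, "multipath")
    paths = {}
    for name in sorted(os.listdir(mp_dir)):
        link = os.path.join(mp_dir, name)
        # Every entry is expected to be a symlink to a namespace path device.
        if os.path.islink(link):
            paths[name] = os.path.realpath(link)
    return paths
```

Calling list_multipath_paths("nvme1n1") on the system above would be
expected to return the two entries nvme1c0n1 and nvme1c1n1, each mapped to
its resolved device directory.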

From the above output, one can infer that for the "round-robin" I/O
policy, configured under nvme-subsys3, an I/O workload targeted at nvme3n1
would alternate between nvme3c2n1 and nvme3c3n1, assuming the ANA state of
each path is optimized (as can be seen in the output of show-topology).
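The alternation described above can be pictured with a toy model. This is
not the kernel's path selector; it is a sketch that assumes only paths in
the "optimized" ANA state are eligible, and the function name round_robin
is hypothetical.

```python
from itertools import cycle


def round_robin(paths):
    """Return an endless iterator over the names of the optimized paths.

    `paths` is a list of (name, ana_state) tuples, e.g. as read from the
    show-topology output. Toy model only: non-optimized paths are skipped.
    """
    eligible = [name for name, state in paths if state == "optimized"]
    if not eligible:
        raise ValueError("no optimized path available")
    return cycle(eligible)
```

With the two paths of nvme3n1, successive next() calls on the returned
iterator alternate between nvme3c2n1 and nvme3c3n1.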

For the numa I/O policy, configured under nvme-subsys1, the "numa_nodes"
attribute file shows the NUMA nodes preferred by the respective namespace
path. The value is a comma-delimited list of nodes or A-B ranges of nodes.

# cat  /sys/block/nvme1n1/multipath/nvme1c0n1/numa_nodes
0-1

# cat  /sys/block/nvme1n1/multipath/nvme1c1n1/numa_nodes
2-3

From the above output, one can infer that an I/O workload targeted at
nvme1n1 and running on NUMA node 0 or 1 would use path nvme1c0n1.
Similarly, an I/O workload running on NUMA node 2 or 3 would use path
nvme1c1n1.
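The inference above can be made mechanical. The following is a minimal
sketch that parses the "numa_nodes" format (comma-delimited nodes or A-B
ranges) and inverts it into a node-to-path map. The function names
parse_node_list and node_to_path are hypothetical, and empty input is
treated as "no nodes", since the attribute emits nothing when the
configured iopolicy is not numa.

```python
def parse_node_list(text):
    """Parse a node list such as "0-1" or "0,2-3" into a sorted list of ints."""
    nodes = set()
    for chunk in text.strip().split(","):
        chunk = chunk.strip()
        if not chunk:
            # Empty attribute: nothing is shown when iopolicy is not numa.
            continue
        if "-" in chunk:
            first, last = chunk.split("-", 1)
            nodes.update(range(int(first), int(last) + 1))
        else:
            nodes.add(int(chunk))
    return sorted(nodes)


def node_to_path(numa_nodes_by_path):
    """Invert {path_name: numa_nodes_text} into {node_id: path_name}."""
    mapping = {}
    for path, text in numa_nodes_by_path.items():
        for node in parse_node_list(text):
            mapping[node] = path
    return mapping
```

Feeding it the two values shown above, {"nvme1c0n1": "0-1", "nvme1c1n1":
"2-3"}, maps nodes 0 and 1 to nvme1c0n1 and nodes 2 and 3 to nvme1c1n1.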

For the queue-depth I/O policy, configured under nvme-subsys4, the
"queue_depth" attribute file shows the number of active/in-flight I/O
requests currently queued for each path.

# cat  /sys/block/nvme4n1/multipath/nvme4c4n1/queue_depth
518

# cat  /sys/block/nvme4n1/multipath/nvme4c5n1/queue_depth
504

From the above output, one can infer that an I/O workload targeted at
nvme4n1 uses two paths, nvme4c4n1 and nvme4c5n1, whose current queue
depths are 518 and 504 respectively.
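The per-path values can be collected in one pass. The following is a
minimal sketch, assuming the "queue_depth" attribute proposed by this RFC;
the function names read_queue_depths and least_loaded are hypothetical.
The least_loaded helper further assumes that the queue-depth policy favours
the path with the fewest in-flight requests, and the values are only a
snapshot that changes as I/O completes.

```python
import os


def read_queue_depths(head, sysfs_block="/sys/block"):
    """Return {path_name: queue_depth or None} for one head disk node."""
    mp_dir = os.path.join(sysfs_block, head, "multipath")
    depths = {}
    for name in sorted(os.listdir(mp_dir)):
        attr = os.path.join(mp_dir, name, "queue_depth")
        try:
            with open(attr) as f:
                text = f.read().strip()
        except OSError:
            continue  # attribute not present for this entry
        # The file is empty when the configured iopolicy is not queue-depth.
        depths[name] = int(text) if text else None
    return depths


def least_loaded(depths):
    """Return the path with the smallest known queue depth, or None."""
    known = {path: d for path, d in depths.items() if d is not None}
    return min(known, key=known.get) if known else None
```

For the snapshot above, read_queue_depths("nvme4n1") would be expected to
report 518 for nvme4c4n1 and 504 for nvme4c5n1, so least_loaded() would
name nvme4c5n1.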

Changes since v6:
    - Fix sysfs link warning manifested while running blktests nvme/058
      (Keith Busch, Chaitanya Kulkarni)
    - Link to v6: https://lore.kernel.org/all/20241213041908.1381196-1-nilay@linux.ibm.com/

Changes since v5:
    - Fix typo in the subject line of the first patch in the series (Daniel
      Wagner)
    - Link to v5: https://lore.kernel.org/all/20241030104156.747675-1-nilay@linux.ibm.com/

Changes since v4:
    - Ensure that we create sysfs link from head gendisk node to each path
      device irrespective of the ANA state of the path (Hannes Reinecke)
    - Split the patch into a three-patch series and add comments to the
      code so that the core logic is easy to read and understand (Sagi
      Grimberg)
    - Don't show any output if user reads "numa_nodes" file and configured
      iopolicy is anything but numa; similarly don't emit any output if user
      reads "queue_depth" file and configured iopolicy is anything but
      queue-depth (Sagi Grimberg)
    - Link to v4: https://lore.kernel.org/all/20240911062653.1060056-1-nilay@linux.ibm.com/

Changes since v3:
    - Protect the namespace dereference code with srcu read lock (Daniel Wagner)
    - Link to v3: https://lore.kernel.org/all/20240903135228.283820-1-nilay@linux.ibm.com/

Changes since v2:
    - Use one value per one sysfs attribute (Keith Busch)
    - Link to v2: https://lore.kernel.org/all/20240809173030.2281021-1-nilay@linux.ibm.com/

Changes since v1:
    - Use sysfs to export multipath I/O information instead of debugfs
    - Link to v1: https://lore.kernel.org/all/20240722093124.42581-1-nilay@linux.ibm.com/

Nilay Shroff (3):
  nvme-multipath: Add visibility for round-robin io-policy
  nvme-multipath: Add visibility for numa io-policy
  nvme-multipath: Add visibility for queue-depth io-policy

 drivers/nvme/host/core.c      |   3 +
 drivers/nvme/host/multipath.c | 138 ++++++++++++++++++++++++++++++++++
 drivers/nvme/host/nvme.h      |  20 ++++-
 drivers/nvme/host/sysfs.c     |  20 +++++
 4 files changed, 177 insertions(+), 4 deletions(-)

-- 
2.47.1


