Linux-NVME Archive on lore.kernel.org
From: Nilay Shroff <nilay@linux.ibm.com>
To: linux-nvme@lists.infradead.org
Cc: hch@lst.de, kbusch@kernel.org, sagi@grimberg.me, axboe@fb.com,
	gjoyce@linux.ibm.com, Nilay Shroff <nilay@linux.ibm.com>
Subject: [PATCH RFC 0/1] Add visibility for native NVMe multipath using debugfs
Date: Mon, 22 Jul 2024 15:01:08 +0530	[thread overview]
Message-ID: <20240722093124.42581-1-nilay@linux.ibm.com> (raw)

Hi,

This patch proposes adding a new debugfs file entry for NVMe native
multipath. NVMe native multipath today supports three different
io-policies (numa, round-robin and queue-depth) for selecting the optimal
I/O path and forwarding data. However, we do not yet have any visibility
into which I/O path the NVMe native multipath code actually selects.

IMO, it'd be nice to have this visibility information available under
debugfs, which could help a user validate that the I/O path being chosen
is optimal for a given io-policy. This patch proposes adding a debugfs
file for each head disk node on the system: a file named "multipath"
under "/sys/kernel/debug/block/nvmeXnY/".
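
As a quick illustration, the shell sketch below emulates the proposed
layout in a temporary directory and dumps every head disk's "multipath"
file. The directory layout follows the patch description; the file content
used here is made up for the demo:

```shell
# Emulate the proposed debugfs layout under a temp dir (illustrative only;
# on a real system the files live under /sys/kernel/debug/block/).
root=$(mktemp -d)
mkdir -p "$root/block/nvme2n2"
printf 'io-policy: numa\n' > "$root/block/nvme2n2/multipath"

# Dump the multipath file of every head disk node found.
for f in "$root"/block/nvme*/multipath; do
    echo "== ${f#"$root"/} =="
    cat "$f"
done
rm -rf "$root"
```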

Please find below the output generated with this patch applied on a system
with a multi-controller PCIe NVMe disk attached. This system is also an
NVMf-TCP host connected to an NVMf-TCP target over two NICs. The system
had two numa nodes online when the below output was captured:

# cat /sys/devices/system/node/online
2-3

# nvme list -v
Subsystem        Subsystem-NQN                                                                                    Controllers
---------------- ------------------------------------------------------------------------------------------------ ----------------
nvme-subsys1     nvmet_subsystem                                                                                  nvme1, nvme3
nvme-subsys2     nqn.2019-10.com.kioxia:KCM7DRUG1T92:3D60A04906N1                                                 nvme0, nvme2

Device           Cntlid SN                   MN                                       FR       TxPort Address        Slot   Subsystem    Namespaces
---------------- ------ -------------------- ---------------------------------------- -------- ------ -------------- ------ ------------ ----------------
nvme0    2      3D60A04906N1         1.6TB NVMe Gen4 U.2 SSD IV               REV.CAS2 pcie   0524:28:00.0   U50EE.001.WZS000E-P3-C4-R1 nvme-subsys2 nvme2n2
nvme2    1      3D60A04906N1         1.6TB NVMe Gen4 U.2 SSD IV               REV.CAS2 pcie   0584:28:00.0   U50EE.001.WZS000E-P3-C4-R2 nvme-subsys2 nvme2n2
nvme1    1      a224673364d1dcb6fab9 Linux                                    6.9.0    tcp    traddr=10.0.0.200,trsvcid=4420,src_addr=10.0.0.100        nvme-subsys1 nvme1n1
nvme3    2      a224673364d1dcb6fab9 Linux                                    6.9.0    tcp    traddr=20.0.0.200,trsvcid=4420,src_addr=20.0.0.100        nvme-subsys1 nvme1n1

Device            Generic           NSID       Usage                      Format           Controllers
----------------- ----------------- ---------- -------------------------- ---------------- ----------------
/dev/nvme1n1 /dev/ng1n1   0x1          5.75  GB /   5.75  GB      4 KiB +  0 B   nvme1, nvme3
/dev/nvme2n2 /dev/ng2n2   0x2          0.00   B /   5.75  GB      4 KiB +  0 B   nvme0, nvme2


# cat /sys/class/nvme-subsystem/nvme-subsys2/iopolicy
numa

# cat /sys/kernel/debug/block/nvme2n2/multipath
io-policy: numa
io-path:
--------
node  current-path  ctrl    ana-state
2     nvme2c2n2     nvme2   optimized
3     nvme2c0n2     nvme0   optimized

The above output shows that the currently selected iopolicy is numa. When
a workload runs I/O on numa node 2 against namespace "nvme2n2", it uses
path nvme2c2n2 and controller nvme2 for forwarding data. Moreover, the
current ana-state for this path is optimized. Similarly, an I/O workload
running on numa node 3 would use path nvme2c0n2 and controller nvme0.
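
For a given node, the reported current path can be pulled out of this
format with a small awk one-liner. The sample text below is copied from
the numa output above; the parsing itself is only a sketch, not part of
the patch:

```shell
# Select the current-path column for numa node 2 from the sample output.
sample='io-policy: numa
io-path:
--------
node  current-path  ctrl    ana-state
2     nvme2c2n2     nvme2   optimized
3     nvme2c0n2     nvme0   optimized'
echo "$sample" | awk -v node=2 '$1 == node { print $2 }'   # prints nvme2c2n2
```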

Now, changing the iopolicy to round-robin:

# echo "round-robin" > /sys/class/nvme-subsystem/nvme-subsys2/iopolicy

# cat /sys/kernel/debug/block/nvme2n2/multipath
io-policy: round-robin
io-path:
--------
node  rr-path       ctrl    ana-state
2     nvme2c2n2     nvme2   optimized
2     nvme2c0n2     nvme0   optimized
3     nvme2c2n2     nvme2   optimized
3     nvme2c0n2     nvme0   optimized

The above output shows that the currently selected iopolicy is
round-robin. When an I/O workload runs on numa node 2 against namespace
"nvme2n2", the I/O path toggles between nvme2c2n2/nvme2 and
nvme2c0n2/nvme0. The same is true for an I/O workload running on node 3.
Both I/O paths are currently optimized.
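
One way to confirm that each node sees two round-robin candidates is to
count table rows per node. Again, the sample is copied from the output
above and the parsing is only an illustrative sketch:

```shell
# Count candidate paths per numa node from the round-robin sample
# (skip the 4 header lines, then tally rows by the node column).
sample='io-policy: round-robin
io-path:
--------
node  rr-path       ctrl    ana-state
2     nvme2c2n2     nvme2   optimized
2     nvme2c0n2     nvme0   optimized
3     nvme2c2n2     nvme2   optimized
3     nvme2c0n2     nvme0   optimized'
echo "$sample" | awk 'NR > 4 { n[$1]++ } END { for (k in n) print k, n[k] }' | sort
```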

The namespace "nvme1n1" is accessible over fabrics (NVMf-TCP).

# cat /sys/kernel/debug/block/nvme1n1/multipath
io-policy: queue-depth
io-path:
--------
node  path          ctrl    qdepth      ana-state
2     nvme1c1n1     nvme1   1328        optimized
2     nvme1c3n1     nvme3   1324        optimized
3     nvme1c1n1     nvme1   1328        optimized
3     nvme1c3n1     nvme3   1324        optimized

The above output was captured while I/O was running against namespace
nvme1n1. From it, we see that the iopolicy is set to "queue-depth". For
an I/O workload running on numa node 2 against namespace "nvme1n1", the
I/O path nvme1c1n1/nvme1 has a queue depth of 1328 and the other I/O path
nvme1c3n1/nvme3 has a queue depth of 1324. Both paths are optimized, and
it appears that both paths are being utilized roughly equally for
forwarding I/O. The same can be said for a workload running on numa
node 3.
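
The queue-depth policy steers new I/O toward the path with the fewest
outstanding requests; the sketch below picks that path for a node from the
sample above (illustrative parsing only, not part of the patch):

```shell
# Pick the least-loaded path for numa node 2 from the queue-depth sample:
# filter rows for the node, sort numerically by the qdepth column, take the top.
sample='io-policy: queue-depth
io-path:
--------
node  path          ctrl    qdepth      ana-state
2     nvme1c1n1     nvme1   1328        optimized
2     nvme1c3n1     nvme3   1324        optimized
3     nvme1c1n1     nvme1   1328        optimized
3     nvme1c3n1     nvme3   1324        optimized'
echo "$sample" | awk -v node=2 '$1 == node' | sort -n -k4 | head -n1 \
    | awk '{ print $2 }'   # prints nvme1c3n1 (qdepth 1324 < 1328)
```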

Nilay Shroff (1):
  nvme-multipath: Add debugfs entry for showing multipath info

 drivers/nvme/host/multipath.c | 92 +++++++++++++++++++++++++++++++++++
 drivers/nvme/host/nvme.h      |  1 +
 2 files changed, 93 insertions(+)

-- 
2.45.2




Thread overview: 11+ messages
2024-07-22  9:31 Nilay Shroff [this message]
2024-07-22  9:31 ` [PATCH RFC 1/1] nvme-multipath: Add debugfs entry for showing multipath info Nilay Shroff
2024-07-22 14:18 ` [PATCH RFC 0/1] Add visibility for native NVMe multipath using debugfs Daniel Wagner
2024-07-23  5:18   ` Nilay Shroff
2024-07-23  7:40     ` Daniel Wagner
2024-07-24 13:41       ` Christoph Hellwig
2024-07-25  6:23         ` Nilay Shroff
2024-07-24 14:37 ` Keith Busch
2024-07-25  6:20   ` Nilay Shroff
2024-07-28 20:47 ` Sagi Grimberg
2024-07-29  4:50   ` Nilay Shroff
