public inbox for linux-nvme@lists.infradead.org
From: Nilay Shroff <nilay@linux.ibm.com>
To: linux-nvme@lists.infradead.org
Cc: dwagner@suse.de, hch@lst.de, kbusch@kernel.org, sagi@grimberg.me,
	axboe@fb.com, gjoyce@linux.ibm.com
Subject: Re: [PATCHv4 RFC 0/1] Add visibility for native NVMe multipath using sysfs
Date: Tue, 24 Sep 2024 12:11:51 +0530	[thread overview]
Message-ID: <37866374-26a0-485b-82ac-bfc2c23def0b@linux.ibm.com> (raw)
In-Reply-To: <20240911062653.1060056-1-nilay@linux.ibm.com>

A gentle ping about this RFC. Does it look okay, or are there any further comments?

Thanks,
--Nilay

On 9/11/24 11:56, Nilay Shroff wrote:
> Hi,
> 
> This patch proposes adding new sysfs attributes to provide visibility into
> native multipath I/O.
> 
> The first version of this RFC[1] proposed using debugfs for visibility;
> however, the general feedback was to instead export the multipath I/O
> information via sysfs attributes and then later parse and format those
> sysfs attributes using libnvme/nvme-cli.
> 
> The second version of this RFC[2] used sysfs; however, the sysfs attribute
> file contained multiple lines of output, and the feedback was to instead
> follow the principle of one value per attribute.
> 
> The third version of this RFC[3] follows the one-value-per-attribute
> principle. There was a review comment about taking the SRCU read lock
> while dereferencing each namespace node, since the namespace list is
> protected by SRCU.
> 
> So the fourth version of this RFC ensures that the namespace dereference
> code is protected by the SRCU read lock.
> 
> As we know, NVMe native multipath supports three different I/O policies
> (numa, round-robin and queue-depth) for selecting the I/O path; however,
> we have no visibility into which path the multipath code selects for
> forwarding I/O. This RFC adds that visibility through new
> sysfs attribute files named "numa_nodes" and "queue_depth" under each 
> namespace block device path /sys/block/nvmeXcYnZ/. We also create a 
> "multipath" sysfs directory under head disk node and then from this 
> directory add a link to each namespace path device this head disk node 
> points to.
> 
> Please find below the output generated with this proposed RFC patch
> applied on a system with two multi-controller PCIe NVMe disks attached.
> The system is also an NVMf-TCP host connected to an NVMf-TCP target over
> two NICs. Four NUMA nodes were online when the below output was captured:
> 
> # cat /sys/devices/system/node/online 
> 0-3
> 
> # lscpu
> <snip>
> NUMA:
>   NUMA node(s):           4
>   NUMA node0 CPU(s):
>   NUMA node1 CPU(s):      0-7
>   NUMA node2 CPU(s):      8-31
>   NUMA node3 CPU(s):      32-63
> <snip>
> 
> Please note that NUMA node 0, though online, doesn't have any CPU
> currently assigned to it.
> 
> # nvme list -v 
> Subsystem        Subsystem-NQN                                                                                    Controllers
> ---------------- ------------------------------------------------------------------------------------------------ ----------------
> nvme-subsys1     nqn.1994-11.com.samsung:nvme:PM1735a:2.5-inch:S6RTNE0R900057                                     nvme0, nvme1
> nvme-subsys3     nqn.2019-10.com.kioxia:KCM7DRUG1T92:3D60A04906N1                                                 nvme2, nvme3
> nvme-subsys4     nvmet_subsystem                                                                                  nvme4, nvme5
> 
> Device           Cntlid SN                   MN                                       FR       TxPort Address        Slot   Subsystem    Namespaces
> ---------------- ------ -------------------- ---------------------------------------- -------- ------ -------------- ------ ------------ ----------------
> nvme0    66     S6RTNE0R900057       3.2TB NVMe Gen4 U.2 SSD III              REV.SN66 pcie   052e:78:00.0   U50EE.001.WZS000E-P3-C14-R1 nvme-subsys1 nvme1n1
> nvme1    65     S6RTNE0R900057       3.2TB NVMe Gen4 U.2 SSD III              REV.SN66 pcie   058e:78:00.0   U50EE.001.WZS000E-P3-C14-R2 nvme-subsys1 nvme1n1
> nvme2    2      3D60A04906N1         1.6TB NVMe Gen4 U.2 SSD IV               REV.CAS2 pcie   0524:28:00.0   U50EE.001.WZS000E-P3-C4-R1 nvme-subsys3 nvme3n1
> nvme3    1      3D60A04906N1         1.6TB NVMe Gen4 U.2 SSD IV               REV.CAS2 pcie   0584:28:00.0   U50EE.001.WZS000E-P3-C4-R2 nvme-subsys3 nvme3n1
> nvme4    1      a224673364d1dcb6fab9 Linux                                    6.9.0-rc tcp    traddr=20.0.0.200,trsvcid=4420,src_addr=20.0.0.100        nvme-subsys4 nvme4n1
> nvme5    2      a224673364d1dcb6fab9 Linux                                    6.9.0-rc tcp    traddr=10.0.0.200,trsvcid=4420,src_addr=10.0.0.100        nvme-subsys4 nvme4n1
> 
> Device            Generic           NSID       Usage                      Format           Controllers
> ----------------- ----------------- ---------- -------------------------- ---------------- ----------------
> /dev/nvme1n1 /dev/ng1n1   0x1          5.75  GB /   5.75  GB      4 KiB +  0 B   nvme0, nvme1
> /dev/nvme3n1 /dev/ng3n1   0x2          0.00   B /   5.75  GB      4 KiB +  0 B   nvme2, nvme3
> /dev/nvme4n1 /dev/ng4n1   0x1          5.75  GB /   5.75  GB      4 KiB +  0 B   nvme4, nvme5
> 
> 
> # nvme show-topology
> nvme-subsys1 - NQN=nqn.1994-11.com.samsung:nvme:PM1735a:2.5-inch:S6RTNE0R900057
>                hostnqn=nqn.2014-08.org.nvmexpress:uuid:41528538-e8ad-4eaf-84a7-9c552917d988
>                iopolicy=numa
> \
>  +- ns 1
>  \
>   +- nvme0 pcie 052e:78:00.0 live optimized
>   +- nvme1 pcie 058e:78:00.0 live optimized
> 
> nvme-subsys3 - NQN=nqn.2019-10.com.kioxia:KCM7DRUG1T92:3D60A04906N1
>                hostnqn=nqn.2014-08.org.nvmexpress:uuid:41528538-e8ad-4eaf-84a7-9c552917d988
>                iopolicy=round-robin
> \
>  +- ns 2
>  \
>   +- nvme2 pcie 0524:28:00.0 live optimized
>   +- nvme3 pcie 0584:28:00.0 live optimized
> 
> nvme-subsys4 - NQN=nvmet_subsystem
>                hostnqn=nqn.2014-08.org.nvmexpress:uuid:41528538-e8ad-4eaf-84a7-9c552917d988
>                iopolicy=queue-depth
> \
>  +- ns 1
>  \
>   +- nvme4 tcp traddr=20.0.0.200,trsvcid=4420,src_addr=20.0.0.100 live optimized
>   +- nvme5 tcp traddr=10.0.0.200,trsvcid=4420,src_addr=10.0.0.100 live optimized
> 
> As seen above, we have three shared namespaces. In terms of iopolicy,
> "numa" is configured for nvme-subsys1, "round-robin" for nvme-subsys3 and
> "queue-depth" for nvme-subsys4.
> 
> Now, under each namespace head disk node, we create a sysfs attribute
> group named "multipath". The "multipath" group contains a link to each
> path device this head disk node points to:
> 
> # tree /sys/block/nvme1n1/multipath/
> /sys/block/nvme1n1/multipath/
> ├── nvme1c0n1 -> ../../../../../pci052e:78/052e:78:00.0/nvme/nvme0/nvme1c0n1
> └── nvme1c1n1 -> ../../../../../pci058e:78/058e:78:00.0/nvme/nvme1/nvme1c1n1
> 
> # tree /sys/block/nvme3n1/multipath/
> /sys/block/nvme3n1/multipath/
> ├── nvme3c2n1 -> ../../../../../pci0524:28/0524:28:00.0/nvme/nvme2/nvme3c2n1
> └── nvme3c3n1 -> ../../../../../pci0584:28/0584:28:00.0/nvme/nvme3/nvme3c3n1
> 
> # tree /sys/block/nvme4n1/multipath/
> /sys/block/nvme4n1/multipath/
> ├── nvme4c4n1 -> ../../../../nvme-fabrics/ctl/nvme4/nvme4c4n1
> └── nvme4c5n1 -> ../../../../nvme-fabrics/ctl/nvme5/nvme4c5n1
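The symlink layout shown above lends itself to simple enumeration by a userspace tool. Below is a minimal Python sketch, not part of libnvme/nvme-cli; the function name and the configurable `sysfs_root` parameter are illustrative assumptions added to make the snippet testable outside a real sysfs:

```python
import os

def list_multipath_paths(head_disk, sysfs_root="/sys/block"):
    """Return the path device names (e.g. nvme1c0n1) linked under the
    head disk's "multipath" directory, or [] if the directory is absent
    (e.g. native multipath disabled or kernel without this patch)."""
    mpath_dir = os.path.join(sysfs_root, head_disk, "multipath")
    if not os.path.isdir(mpath_dir):
        return []
    return sorted(os.listdir(mpath_dir))
```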
> 
> One can easily infer from the above output that for the "round-robin"
> I/O policy, configured under nvme-subsys3, I/O targeted at nvme3n1 would
> toggle between nvme3c2n1 and nvme3c3n1, assuming the ANA state of each
> path is optimized (as seen in the show-topology output).
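The toggling behavior described above can be modeled in a few lines. This is a userspace sketch of the round-robin idea, not the kernel's path selector; the tuple format and function name are assumptions for illustration:

```python
from itertools import cycle

def round_robin(paths):
    """Given (name, ana_state) tuples, return an iterator that yields the
    optimized paths in turn, mimicking how round-robin toggles I/O across
    the available optimized paths."""
    optimized = [name for name, state in paths if state == "optimized"]
    return cycle(optimized)
```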
> 
> For the numa I/O policy, configured under nvme-subsys1, the "numa_nodes"
> attribute file shows the NUMA nodes preferred by the respective namespace
> path. The value is a comma-delimited list of nodes or A-B ranges of nodes.
> 
> # cat  /sys/block/nvme1n1/multipath/nvme1c0n1/numa_nodes 
> 0-1
> 
> # cat  /sys/block/nvme1n1/multipath/nvme1c1n1/numa_nodes
> 2-3
> 
> From the above output, one can easily infer that an I/O workload targeted
> at nvme1n1 and running on NUMA nodes 0 and 1 would use path nvme1c0n1.
> Similarly, an I/O workload running on NUMA nodes 2 and 3 would use path
> nvme1c1n1.
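A consumer of the "numa_nodes" attribute would need to expand the list/range syntax. Here is a minimal Python sketch of such a parser; the helper names are hypothetical, and the format assumed is the standard sysfs node-list syntax ("0-1", "0,2-3") described above:

```python
def parse_node_list(s):
    """Expand a sysfs node list such as "0-1" or "0,2-3" into a set of ints."""
    nodes = set()
    for part in s.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            nodes.update(range(int(lo), int(hi) + 1))
        else:
            nodes.add(int(part))
    return nodes

def path_for_node(node, path_nodes):
    """Given a mapping of path name -> numa_nodes string, return the path
    that prefers the given NUMA node, or None if no path lists it."""
    for path, spec in path_nodes.items():
        if node in parse_node_list(spec):
            return path
    return None
```

With the example values above, a workload on node 3 would map to nvme1c1n1.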
> 
> For the queue-depth I/O policy, configured under nvme-subsys4, the
> "queue_depth" attribute file shows the number of in-flight I/O requests
> currently queued on each path.
> 
> # cat  /sys/block/nvme4n1/multipath/nvme4c4n1/queue_depth 
> 518
> 
> # cat  /sys/block/nvme4n1/multipath/nvme4c5n1/queue_depth
> 504
> 
> From the above output, one can easily infer that I/O targeted at nvme4n1
> uses the two paths nvme4c4n1 and nvme4c5n1, whose current queue depths
> are 518 and 504 respectively.
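The queue-depth policy prefers the less loaded path. The core of that preference can be sketched in one line of userspace Python; this is an illustration of the selection criterion, not the kernel implementation, and the function name is made up:

```python
def pick_least_busy(queue_depths):
    """Given a mapping of path name -> in-flight request count (as read
    from the "queue_depth" attribute), return the least loaded path."""
    return min(queue_depths, key=queue_depths.get)
```

With the snapshot above, nvme4c5n1 (depth 504) would be preferred over nvme4c4n1 (depth 518).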
> 
> [1] https://lore.kernel.org/all/20240722093124.42581-1-nilay@linux.ibm.com/
> [2] https://lore.kernel.org/all/20240809173030.2281021-2-nilay@linux.ibm.com/
> [3] https://lore.kernel.org/all/20240903135228.283820-1-nilay@linux.ibm.com/
> 
> Changes since v3:
>     - Protect the namespace dereference code with srcu read lock (Daniel Wagner)
> 
> Changes since v2:
>     - Use one value per one sysfs attribute (Keith Busch)
> 
> Changes since v1:
>     - Use sysfs to export multipath I/O information instead of debugfs
> 
> Nilay Shroff (1):
>   nvme-multipath: Add sysfs attributes for showing multipath info
> 
>  drivers/nvme/host/core.c      |  3 ++
>  drivers/nvme/host/multipath.c | 69 +++++++++++++++++++++++++++++++++++
>  drivers/nvme/host/nvme.h      | 20 ++++++++--
>  drivers/nvme/host/sysfs.c     | 20 ++++++++++
>  4 files changed, 108 insertions(+), 4 deletions(-)
> 



Thread overview: 14+ messages
2024-09-11  6:26 [PATCHv4 RFC 0/1] Add visibility for native NVMe multipath using sysfs Nilay Shroff
2024-09-11  6:26 ` [PATCHv4 RFC 1/1] nvme-multipath: Add sysfs attributes for showing multipath info Nilay Shroff
2024-10-07 10:14   ` Hannes Reinecke
2024-10-07 13:47     ` Nilay Shroff
2024-10-07 14:04       ` Hannes Reinecke
2024-10-07 15:33         ` Nilay Shroff
2024-10-16  3:19           ` Nilay Shroff
2024-10-16  6:52             ` Hannes Reinecke
2024-10-21 12:24               ` Nilay Shroff
2024-10-20 23:17   ` Sagi Grimberg
2024-10-21 13:37     ` Nilay Shroff
2024-10-23  9:58       ` Sagi Grimberg
2024-10-23 13:31         ` Nilay Shroff
2024-09-24  6:41 ` Nilay Shroff [this message]
