Linux-NVME Archive on lore.kernel.org
From: Nilay Shroff <nilay@linux.ibm.com>
To: Sagi Grimberg <sagi@grimberg.me>, linux-nvme@lists.infradead.org
Cc: hch@lst.de, kbusch@kernel.org, gjoyce@linux.ibm.com, axboe@fb.com
Subject: Re: [PATCH] nvme: find numa distance only if controller has valid numa id
Date: Sun, 14 Apr 2024 16:32:30 +0530	[thread overview]
Message-ID: <ffcfbe15-8e6f-422a-87cd-52c30ebf726c@linux.ibm.com>
In-Reply-To: <afcd5e05-f4af-40de-8d13-9bacf5fadd80@grimberg.me>



On 4/14/24 14:00, Sagi Grimberg wrote:
> 
> 
> On 13/04/2024 12:04, Nilay Shroff wrote:
>> On numa aware system where native nvme multipath is configured and
>> iopolicy is set to numa but the nvme controller numa node id is
>> undefined or -1 (NUMA_NO_NODE) then avoid calculating node distance
>> for finding optimal io path. In such case we may access numa distance
>> table with invalid index and that may potentially refer to incorrect
>> memory. So this patch ensures that if the nvme controller numa node
>> id is -1 then instead of calculating node distance for finding optimal
>> io path, we set the numa node distance of such controller to default 10
>> (LOCAL_DISTANCE).
> 
> Patch looks ok to me, but it is not clear whether this fixes a real issue or not.
> 
I think this patch does fix a real issue. I have a NUMA-aware system with a
multi-port/multi-controller NVMe PCIe disk attached. On this system I found
that the nvme controller numa node id is sometimes set to -1 (NUMA_NO_NODE).
The reason is that the processors and memory belong to one or more NUMA
nodes while the NVMe PCIe device belongs to a different node. For example,
the processors may come from node 0 and node 1 while the PCIe device comes
from node 2; since there is no processor on node 2, Linux has no way to
affinitize the PCIe device with a processor, so while enumerating the PCIe
device the kernel sets its numa node id to -1. If we later hotplug a CPU on
node 2, the kernel would then assign numa node id 2 to the PCIe device.
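
To make the risk described in the patch concrete, here is a minimal
user-space sketch (my own toy model, not the kernel implementation; the
flat 2x2 array merely stands in for an arch-specific NUMA distance table)
of how a lookup indexed with NUMA_NO_NODE (-1) would step outside the
table, and how the guard avoids it:

#include <stdio.h>

#define NUMA_NO_NODE   (-1)
#define LOCAL_DISTANCE 10

/* Toy stand-in for an arch-specific NUMA distance table (2 online nodes). */
static int toy_distance_table[2][2] = {
	{ 10, 20 },
	{ 20, 10 },
};

/*
 * Unguarded lookup: calling this with to == NUMA_NO_NODE would read
 * toy_distance_table[from][-1], i.e. memory outside the row -- the
 * "invalid index" problem described in the patch.
 */
static int toy_node_distance(int from, int to)
{
	return toy_distance_table[from][to];
}

/* Guarded lookup, mirroring the idea of the patch. */
static int toy_node_distance_guarded(int from, int to)
{
	if (to == NUMA_NO_NODE)
		return LOCAL_DISTANCE;
	return toy_node_distance(from, to);
}

int main(void)
{
	/* A controller whose numa_node is -1 simply gets LOCAL_DISTANCE. */
	printf("%d\n", toy_node_distance_guarded(0, NUMA_NO_NODE));	/* prints 10 */
	printf("%d\n", toy_node_distance_guarded(0, 1));		/* prints 20 */
	return 0;
}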

For instance, I have a system with two NUMA nodes currently online and a
multi-controller NVMe PCIe disk attached to it:

# numactl -H 
available: 2 nodes (2-3)
node 2 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 2 size: 15290 MB
node 2 free: 14200 MB
node 3 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 3 size: 16336 MB
node 3 free: 15075 MB
node distances:
node   2   3 
  2:  10  20 
  3:  20  10 

As we can see above, NUMA nodes 2 and 3 are currently online on this system,
and both of those nodes have CPUs.

# lspci 
052e:78:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM173Xa
058e:78:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM173Xa 

# nvme list -v
Subsystem        Subsystem-NQN                                                                                    Controllers
---------------- ------------------------------------------------------------------------------------------------ ----------------
nvme-subsys3     nqn.1994-11.com.samsung:nvme:PM1735a:2.5-inch:S6RTNE0R900057                                     nvme1, nvme3

Device   SN                   MN                                       FR       TxPort Address        Slot   Subsystem    Namespaces      
-------- -------------------- ---------------------------------------- -------- ------ -------------- ------ ------------ ----------------
nvme1    S6RTNE0R900057       3.2TB NVMe Gen4 U.2 SSD III              REV.SN66 pcie   052e:78:00.0          nvme-subsys3 nvme3n1
nvme3    S6RTNE0R900057       3.2TB NVMe Gen4 U.2 SSD III              REV.SN66 pcie   058e:78:00.0          nvme-subsys3 nvme3n1, nvme3n2

Device       Generic      NSID       Usage                      Format           Controllers     
------------ ------------ ---------- -------------------------- ---------------- ----------------
/dev/nvme3n1 /dev/ng3n1   0x1          5.75  GB /   5.75  GB      4 KiB +  0 B   nvme1, nvme3
/dev/nvme3n2 /dev/ng3n2   0x2          5.75  GB /   5.75  GB      4 KiB +  0 B   nvme3

# cat ./sys/devices/pci058e:78/058e:78:00.0/numa_node
2
# cat ./sys/devices/pci052e:78/052e:78:00.0/numa_node
-1

# cat /sys/class/nvme/nvme3/numa_node 
2
# cat /sys/class/nvme/nvme1/numa_node 
-1  

As we can see above, the multi-controller NVMe disk attached to this system
has two controllers, but the numa node id assigned to one of them (nvme1) is
-1. This is because the system currently has no processor on a NUMA node to
which the nvme1 controller could be affinitized.
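
To tie this back to the patch, below is a simplified user-space model (my
own sketch; the ANA state and disabled-path checks of __nvme_find_path()
are deliberately left out) of how the numa iopolicy would choose between
these two controllers once a NUMA_NO_NODE controller is scored with
LOCAL_DISTANCE instead of being looked up in the distance table:

#include <limits.h>
#include <stdio.h>

#define NUMA_NO_NODE   (-1)
#define LOCAL_DISTANCE 10

/* Toy distance for the two online nodes (2 and 3) shown above. */
static int toy_node_distance(int from, int to)
{
	return from == to ? 10 : 20;
}

struct toy_path {
	const char *name;
	int numa_node;		/* controller numa_node as read from sysfs */
};

/* Pick the path with the smallest distance from the submitting node. */
static const char *toy_pick_path(const struct toy_path *paths, int nr, int node)
{
	int best_distance = INT_MAX;
	const char *best = NULL;

	for (int i = 0; i < nr; i++) {
		int distance = LOCAL_DISTANCE;

		/* The guard from the patch: never index the table with -1. */
		if (paths[i].numa_node != NUMA_NO_NODE)
			distance = toy_node_distance(node, paths[i].numa_node);

		if (distance < best_distance) {
			best_distance = distance;
			best = paths[i].name;
		}
	}
	return best;
}

int main(void)
{
	const struct toy_path paths[] = {
		{ "nvme1", NUMA_NO_NODE },	/* controller with numa_node == -1 */
		{ "nvme3", 2 },			/* controller local to node 2 */
	};

	/* From node 2 both paths score 10, so the first one found wins;
	 * from node 3, nvme1 (10) is preferred over nvme3 (20). */
	printf("from node 2: %s\n", toy_pick_path(paths, 2, 2));
	printf("from node 3: %s\n", toy_pick_path(paths, 2, 3));
	return 0;
}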

Thanks,
--Nilay

>>
>> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
>> ---
>>   drivers/nvme/host/multipath.c | 12 +++++++-----
>>   1 file changed, 7 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
>> index 5397fb428b24..4c73a8038978 100644
>> --- a/drivers/nvme/host/multipath.c
>> +++ b/drivers/nvme/host/multipath.c
>> @@ -240,17 +240,19 @@ static bool nvme_path_is_disabled(struct nvme_ns *ns)
>>     static struct nvme_ns *__nvme_find_path(struct nvme_ns_head *head, int node)
>>   {
>> -    int found_distance = INT_MAX, fallback_distance = INT_MAX, distance;
>> +    int found_distance = INT_MAX, fallback_distance = INT_MAX;
>>       struct nvme_ns *found = NULL, *fallback = NULL, *ns;
>>         list_for_each_entry_rcu(ns, &head->list, siblings) {
>> +        int distance = LOCAL_DISTANCE;
>> +
>>           if (nvme_path_is_disabled(ns))
>>               continue;
>>   -        if (READ_ONCE(head->subsys->iopolicy) == NVME_IOPOLICY_NUMA)
>> -            distance = node_distance(node, ns->ctrl->numa_node);
>> -        else
>> -            distance = LOCAL_DISTANCE;
>> +        if (READ_ONCE(head->subsys->iopolicy) == NVME_IOPOLICY_NUMA) {
>> +            if (ns->ctrl->numa_node != NUMA_NO_NODE)
>> +                distance = node_distance(node, ns->ctrl->numa_node);
>> +        }
>>             switch (ns->ana_state) {
>>           case NVME_ANA_OPTIMIZED:
> 



Thread overview: 11+ messages
2024-04-13  9:04 [PATCH] nvme: find numa distance only if controller has valid numa id Nilay Shroff
2024-04-14  8:30 ` Sagi Grimberg
2024-04-14 11:02   ` Nilay Shroff [this message]
2024-04-15  8:55     ` Sagi Grimberg
2024-04-15  9:30       ` Nilay Shroff
2024-04-15 10:04         ` Sagi Grimberg
2024-04-15 14:39         ` Hannes Reinecke
2024-04-15 16:56           ` Keith Busch
2024-04-16  8:06           ` Nilay Shroff
2024-04-15  7:25 ` Christoph Hellwig
2024-04-15  7:54   ` Nilay Shroff
