Linux-NVME Archive on lore.kernel.org
 help / color / mirror / Atom feed
From: Sagi Grimberg <sagi@grimberg.me>
To: Nilay Shroff <nilay@linux.ibm.com>, linux-nvme@lists.infradead.org
Cc: hch@lst.de, kbusch@kernel.org, gjoyce@linux.ibm.com, axboe@fb.com
Subject: Re: [PATCH] nvme: find numa distance only if controller has valid numa id
Date: Mon, 15 Apr 2024 13:04:34 +0300	[thread overview]
Message-ID: <c9c454ea-075a-40f0-bee9-42eaa5f1b427@grimberg.me> (raw)
In-Reply-To: <7b188849-5c3f-45ff-9747-096ffdaff6ee@linux.ibm.com>



On 15/04/2024 12:30, Nilay Shroff wrote:
>
> On 4/15/24 14:25, Sagi Grimberg wrote:
>>
>> On 14/04/2024 14:02, Nilay Shroff wrote:
>>> On 4/14/24 14:00, Sagi Grimberg wrote:
>>>> On 13/04/2024 12:04, Nilay Shroff wrote:
>>>>> On numa aware system where native nvme multipath is configured and
>>>>> iopolicy is set to numa but the nvme controller numa node id is
>>>>> undefined or -1 (NUMA_NO_NODE) then avoid calculating node distance
>>>>> for finding optimal io path. In such case we may access numa distance
>>>>> table with invalid index and that may potentially refer to incorrect
>>>>> memory. So this patch ensures that if the nvme controller numa node
>>>>> id is -1 then instead of calculating node distance for finding optimal
>>>>> io path, we set the numa node distance of such controller to default 10
>>>>> (LOCAL_DISTANCE).
>>>> Patch looks ok to me, but it is not clear weather this fixes a real issue or not.
>>>>
>>> I think this patch does help fix a real issue. I have a numa aware system where
>>> I have a multi port/controller NNVMe PCIe disk attached. On this system, I found
>>> that sometimes the nvme controller numa id is set to -1 (NUMA_NO_NODE). And the
>>> reason being, my system has processors and memory coming from one or more NUMA nodes
>>> and the NVMe PCIe device is coming from a NUMA node which is different. For example,
>>> we could have processors coming from node 0 and node 1, but the PCIe device coming from
>>> node 2, and we don't have any processor coming from node 2, so there would be no way for
>>> Linux to affinitize the PCIe device with a processor and hence while enumerating PCIe
>>> device kernel sets the numa id of such device to -1. Later if we hotplug CPU on node 2
>>> then kernel would assign the numa node id 2 to the PCIe device.
>>>
>>> For instance, I have a system with two numa nodes currently online. I also have
>>> a multi controller NVMe PCIe disk attached to this system:
>>>
>>> # numactl -H
>>> available: 2 nodes (2-3)
>>> node 2 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
>>> node 2 size: 15290 MB
>>> node 2 free: 14200 MB
>>> node 3 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
>>> node 3 size: 16336 MB
>>> node 3 free: 15075 MB
>>> node distances:
>>> node   2   3
>>>     2:  10  20
>>>     3:  20  10
>>>
>>> As we could see above on this system I have currently numa node 2 and 3 online.
>>> And I have CPUs coming from node 2 and 3.
>>>
>>> # lspci
>>> 052e:78:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM173Xa
>>> 058e:78:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM173Xa
>>>
>>> # nvme list -v
>>> Subsystem        Subsystem-NQN                                                                                    Controllers
>>> ---------------- ------------------------------------------------------------------------------------------------ ----------------
>>> nvme-subsys3     nqn.1994-11.com.samsung:nvme:PM1735a:2.5-inch:S6RTNE0R900057                                     nvme1, nvme3
>>>
>>> Device   SN                   MN                                       FR       TxPort Asdress        Slot   Subsystem    Namespaces
>>> -------- -------------------- ---------------------------------------- -------- ------ -------------- ------ ------------ ----------------
>>> nvme1    S6RTNE0R900057       3.2TB NVMe Gen4 U.2 SSD III              REV.SN66 pcie   052e:78:00.0          nvme-subsys3 nvme3n1
>>> nvme3    S6RTNE0R900057       3.2TB NVMe Gen4 U.2 SSD III              REV.SN66 pcie   058e:78:00.0          nvme-subsys3 nvme3n1, nvme3n2
>>>
>>> Device       Generic      NSID       Usage                      Format           Controllers
>>> ------------ ------------ ---------- -------------------------- ---------------- ----------------
>>> /dev/nvme3n1 /dev/ng3n1   0x1          5.75  GB /   5.75  GB      4 KiB +  0 B   nvme1, nvme3
>>> /dev/nvme3n2 /dev/ng3n2   0x2          5.75  GB /   5.75  GB      4 KiB +  0 B   nvme3
>>>
>>> # cat ./sys/devices/pci058e:78/058e:78:00.0/numa_node
>>> 2
>>> # cat ./sys/devices/pci052e:78/052e:78:00.0/numa_node
>>> -1
>>>
>>> # cat /sys/class/nvme/nvme3/numa_node
>>> 2
>>> # cat /sys/class/nvme/nvme1/numa_node
>>> -1
>>>
>>> As we could see above I have multi controller NVMe disk atatched to this system. This disk
>>> has 2 controllers. However the numa node id assigned to one of the controller (nvme1) is -1.
>>> This is because on this system, currently I don't have any processor coming from a numa node
>>> where nvme1 controller numa node could be be affinitized.
>> Thanks for the explanation. But what is the bug you see in this configuration? panic? suboptimal performance?
>> which is it? it is not clear from the patch description.
>>
> I didn't encounter panic, however the issue here is with accessing numa distance table with incorrect index.

Yes, I agree its not guaranteed that all arch implementation would have 
a correct bounds check.


  reply	other threads:[~2024-04-15 10:05 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-04-13  9:04 [PATCH] nvme: find numa distance only if controller has valid numa id Nilay Shroff
2024-04-14  8:30 ` Sagi Grimberg
2024-04-14 11:02   ` Nilay Shroff
2024-04-15  8:55     ` Sagi Grimberg
2024-04-15  9:30       ` Nilay Shroff
2024-04-15 10:04         ` Sagi Grimberg [this message]
2024-04-15 14:39         ` Hannes Reinecke
2024-04-15 16:56           ` Keith Busch
2024-04-16  8:06           ` Nilay Shroff
2024-04-15  7:25 ` Christoph Hellwig
2024-04-15  7:54   ` Nilay Shroff

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=c9c454ea-075a-40f0-bee9-42eaa5f1b427@grimberg.me \
    --to=sagi@grimberg.me \
    --cc=axboe@fb.com \
    --cc=gjoyce@linux.ibm.com \
    --cc=hch@lst.de \
    --cc=kbusch@kernel.org \
    --cc=linux-nvme@lists.infradead.org \
    --cc=nilay@linux.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox