* [PATCH] nvme: find numa distance only if controller has valid numa id
@ 2024-04-13  9:04 Nilay Shroff
  2024-04-14  8:30 ` Sagi Grimberg
  2024-04-15  7:25 ` Christoph Hellwig
  0 siblings, 2 replies; 11+ messages in thread
From: Nilay Shroff @ 2024-04-13  9:04 UTC
To: linux-nvme; +Cc: hch, kbusch, sagi, gjoyce, axboe, Nilay Shroff

On a NUMA-aware system where native NVMe multipath is configured and
the iopolicy is set to numa, avoid calculating the node distance for a
controller whose NUMA node id is undefined, i.e. -1 (NUMA_NO_NODE). In
that case we may access the NUMA distance table with an invalid index
and potentially read incorrect memory. So if the controller's NUMA node
id is -1, do not calculate the node distance when looking for the
optimal I/O path; instead set the distance for that controller to the
default of 10 (LOCAL_DISTANCE).

Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
 drivers/nvme/host/multipath.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 5397fb428b24..4c73a8038978 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -240,17 +240,19 @@ static bool nvme_path_is_disabled(struct nvme_ns *ns)
 
 static struct nvme_ns *__nvme_find_path(struct nvme_ns_head *head, int node)
 {
-	int found_distance = INT_MAX, fallback_distance = INT_MAX, distance;
+	int found_distance = INT_MAX, fallback_distance = INT_MAX;
 	struct nvme_ns *found = NULL, *fallback = NULL, *ns;
 
 	list_for_each_entry_rcu(ns, &head->list, siblings) {
+		int distance = LOCAL_DISTANCE;
+
 		if (nvme_path_is_disabled(ns))
 			continue;
 
-		if (READ_ONCE(head->subsys->iopolicy) == NVME_IOPOLICY_NUMA)
-			distance = node_distance(node, ns->ctrl->numa_node);
-		else
-			distance = LOCAL_DISTANCE;
+		if (READ_ONCE(head->subsys->iopolicy) == NVME_IOPOLICY_NUMA) {
+			if (ns->ctrl->numa_node != NUMA_NO_NODE)
+				distance = node_distance(node, ns->ctrl->numa_node);
+		}
 
 		switch (ns->ana_state) {
 		case NVME_ANA_OPTIMIZED:
-- 
2.44.0
* Re: [PATCH] nvme: find numa distance only if controller has valid numa id
  2024-04-13  9:04 [PATCH] nvme: find numa distance only if controller has valid numa id Nilay Shroff
@ 2024-04-14  8:30 ` Sagi Grimberg
  2024-04-14 11:02   ` Nilay Shroff
  2024-04-15  7:25 ` Christoph Hellwig
  1 sibling, 1 reply; 11+ messages in thread
From: Sagi Grimberg @ 2024-04-14  8:30 UTC
To: Nilay Shroff, linux-nvme; +Cc: hch, kbusch, gjoyce, axboe

On 13/04/2024 12:04, Nilay Shroff wrote:
> On numa aware system where native nvme multipath is configured and
> iopolicy is set to numa but the nvme controller numa node id is
> undefined or -1 (NUMA_NO_NODE) then avoid calculating node distance
> for finding optimal io path. In such case we may access numa distance
> table with invalid index and that may potentially refer to incorrect
> memory. So this patch ensures that if the nvme controller numa node
> id is -1 then instead of calculating node distance for finding optimal
> io path, we set the numa node distance of such controller to default 10
> (LOCAL_DISTANCE).

Patch looks ok to me, but it is not clear whether this fixes a real issue or not.
* Re: [PATCH] nvme: find numa distance only if controller has valid numa id
  2024-04-14  8:30 ` Sagi Grimberg
@ 2024-04-14 11:02   ` Nilay Shroff
  2024-04-15  8:55     ` Sagi Grimberg
  0 siblings, 1 reply; 11+ messages in thread
From: Nilay Shroff @ 2024-04-14 11:02 UTC
To: Sagi Grimberg, linux-nvme; +Cc: hch, kbusch, gjoyce, axboe

On 4/14/24 14:00, Sagi Grimberg wrote:
> On 13/04/2024 12:04, Nilay Shroff wrote:
>> On numa aware system where native nvme multipath is configured and
>> iopolicy is set to numa but the nvme controller numa node id is
>> undefined or -1 (NUMA_NO_NODE) then avoid calculating node distance
>> for finding optimal io path. In such case we may access numa distance
>> table with invalid index and that may potentially refer to incorrect
>> memory. So this patch ensures that if the nvme controller numa node
>> id is -1 then instead of calculating node distance for finding optimal
>> io path, we set the numa node distance of such controller to default 10
>> (LOCAL_DISTANCE).
>
> Patch looks ok to me, but it is not clear whether this fixes a real issue or not.
>

I think this patch does fix a real issue. I have a NUMA-aware system with a
multi-port/multi-controller NVMe PCIe disk attached. On this system I found that
the NVMe controller's NUMA node id is sometimes set to -1 (NUMA_NO_NODE). The
reason is that the processors and memory come from one or more NUMA nodes, while
the NVMe PCIe device comes from a different NUMA node. For example, the
processors could come from node 0 and node 1 while the PCIe device comes from
node 2. With no processor on node 2, Linux has no way to affinitize the PCIe
device with a processor, so the kernel sets the NUMA node id of the device to -1
while enumerating it. If a CPU is later hotplugged on node 2, the kernel assigns
NUMA node id 2 to the PCIe device.

For instance, I have a system with two NUMA nodes currently online, and a
multi-controller NVMe PCIe disk attached to it:

# numactl -H
available: 2 nodes (2-3)
node 2 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 2 size: 15290 MB
node 2 free: 14200 MB
node 3 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 3 size: 16336 MB
node 3 free: 15075 MB
node distances:
node   2   3
  2:  10  20
  3:  20  10

As shown above, NUMA nodes 2 and 3 are currently online on this system, and the
CPUs come from those two nodes.

# lspci
052e:78:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM173Xa
058e:78:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM173Xa

# nvme list -v
Subsystem        Subsystem-NQN                                                                                    Controllers
---------------- ------------------------------------------------------------------------------------------------ ----------------
nvme-subsys3     nqn.1994-11.com.samsung:nvme:PM1735a:2.5-inch:S6RTNE0R900057                                     nvme1, nvme3

Device   SN                   MN                                       FR       TxPort Asdress        Slot   Subsystem    Namespaces
-------- -------------------- ---------------------------------------- -------- ------ -------------- ------ ------------ ----------------
nvme1    S6RTNE0R900057       3.2TB NVMe Gen4 U.2 SSD III              REV.SN66 pcie   052e:78:00.0          nvme-subsys3 nvme3n1
nvme3    S6RTNE0R900057       3.2TB NVMe Gen4 U.2 SSD III              REV.SN66 pcie   058e:78:00.0          nvme-subsys3 nvme3n1, nvme3n2

Device       Generic      NSID       Usage                      Format           Controllers
------------ ------------ ---------- -------------------------- ---------------- ----------------
/dev/nvme3n1 /dev/ng3n1   0x1        5.75 GB / 5.75 GB          4 KiB + 0 B      nvme1, nvme3
/dev/nvme3n2 /dev/ng3n2   0x2        5.75 GB / 5.75 GB          4 KiB + 0 B      nvme3

# cat ./sys/devices/pci058e:78/058e:78:00.0/numa_node
2
# cat ./sys/devices/pci052e:78/052e:78:00.0/numa_node
-1

# cat /sys/class/nvme/nvme3/numa_node
2
# cat /sys/class/nvme/nvme1/numa_node
-1

As shown above, a multi-controller NVMe disk is attached to this system. This disk
has 2 controllers.
However, the NUMA node id assigned to one of the controllers (nvme1) is -1.
This is because the system currently has no processor on a NUMA node to which
the nvme1 controller could be affinitized.

Thanks,
--Nilay
* Re: [PATCH] nvme: find numa distance only if controller has valid numa id
  2024-04-14 11:02   ` Nilay Shroff
@ 2024-04-15  8:55     ` Sagi Grimberg
  2024-04-15  9:30       ` Nilay Shroff
  0 siblings, 1 reply; 11+ messages in thread
From: Sagi Grimberg @ 2024-04-15  8:55 UTC
To: Nilay Shroff, linux-nvme; +Cc: hch, kbusch, gjoyce, axboe

On 14/04/2024 14:02, Nilay Shroff wrote:
> [...]
> # cat /sys/class/nvme/nvme3/numa_node
> 2
> # cat /sys/class/nvme/nvme1/numa_node
> -1
>
> As we could see above I have multi controller NVMe disk attached to this system. This disk
> has 2 controllers. However the numa node id assigned to one of the controller (nvme1) is -1.
> This is because on this system, currently I don't have any processor coming from a numa node
> where nvme1 controller numa node could be affinitized.

Thanks for the explanation. But what is the bug you see in this configuration? A panic?
Suboptimal performance? Which is it? It is not clear from the patch description.
* Re: [PATCH] nvme: find numa distance only if controller has valid numa id
  2024-04-15  8:55     ` Sagi Grimberg
@ 2024-04-15  9:30       ` Nilay Shroff
  2024-04-15 10:04         ` Sagi Grimberg
  2024-04-15 14:39         ` Hannes Reinecke
  0 siblings, 2 replies; 11+ messages in thread
From: Nilay Shroff @ 2024-04-15  9:30 UTC
To: Sagi Grimberg, linux-nvme; +Cc: hch, kbusch, gjoyce, axboe

On 4/15/24 14:25, Sagi Grimberg wrote:
> On 14/04/2024 14:02, Nilay Shroff wrote:
>> [...]
>> As we could see above I have multi controller NVMe disk attached to this system. This disk
>> has 2 controllers. However the numa node id assigned to one of the controller (nvme1) is -1.
>> This is because on this system, currently I don't have any processor coming from a numa node
>> where nvme1 controller numa node could be affinitized.
>
> Thanks for the explanation. But what is the bug you see in this configuration? panic? suboptimal performance?
> which is it? it is not clear from the patch description.
>

I didn't encounter a panic. The issue here is that the NUMA distance table is
accessed with an incorrect index.

To calculate the distance between two nodes we invoke the function __node_distance().
This function accesses the NUMA distance table, which is typically an array whose
valid indexes start at 0. Accessing this table with an index of -1 would therefore
dereference an incorrect memory location, which might have side effects including a
panic (though I didn't encounter one). Furthermore, in such a case the calculated
node distance could be incorrect, and that might cause NVMe multipath to choose a
suboptimal I/O path.

This patch may not help choose the optimal I/O path (we assume the node distance is
LOCAL_DISTANCE when the controller's NUMA node id is -1), but it ensures that we
don't access an invalid memory location when calculating the node distance.

Thanks,
--Nilay
* Re: [PATCH] nvme: find numa distance only if controller has valid numa id
  2024-04-15  9:30       ` Nilay Shroff
@ 2024-04-15 10:04         ` Sagi Grimberg
  2024-04-15 14:39         ` Hannes Reinecke
  1 sibling, 0 replies; 11+ messages in thread
From: Sagi Grimberg @ 2024-04-15 10:04 UTC
To: Nilay Shroff, linux-nvme; +Cc: hch, kbusch, gjoyce, axboe

On 15/04/2024 12:30, Nilay Shroff wrote:
> On 4/15/24 14:25, Sagi Grimberg wrote:
>> Thanks for the explanation. But what is the bug you see in this configuration? panic? suboptimal performance?
>> which is it? it is not clear from the patch description.
>>
> I didn't encounter panic, however the issue here is with accessing numa distance table with incorrect index.

Yes, I agree it's not guaranteed that all arch implementations would have a correct
bounds check.
* Re: [PATCH] nvme: find numa distance only if controller has valid numa id
  2024-04-15  9:30       ` Nilay Shroff
  2024-04-15 10:04         ` Sagi Grimberg
@ 2024-04-15 14:39         ` Hannes Reinecke
  2024-04-15 16:56           ` Keith Busch
  2024-04-16  8:06           ` Nilay Shroff
  1 sibling, 2 replies; 11+ messages in thread
From: Hannes Reinecke @ 2024-04-15 14:39 UTC
To: Nilay Shroff, Sagi Grimberg, linux-nvme; +Cc: hch, kbusch, gjoyce, axboe

On 4/15/24 11:30, Nilay Shroff wrote:
> [...]
> I didn't encounter panic, however the issue here is with accessing numa distance table
> with incorrect index.
>
> For calculating the distance between two nodes we invoke the function __node_distance().
> This function would then access the numa distance table, which is typically an array with
> valid index starting from 0. So obviously accessing this table with index of -1 would
> dereference incorrect memory location. De-referencing incorrect memory location might have
> side effects including panic (though I didn't encounter panic). Furthermore in such a case,
> the calculated node distance could potentially be incorrect and that might cause the nvme
> multipath to choose a suboptimal IO path.
>
> This patch may not help choosing the optimal IO path (as we assume that the node distance would be
> LOCAL_DISTANCE in case nvme controller numa node id is -1) but it ensures that we don't access the
> invalid memory location for calculating node distance.
>
Hmm. One wonders: how does such a system work?
The systems I know always have the PCI slots attached to the CPU
sockets, so if the CPU is not present the NVMe device on that
slot will be non-functional. In fact, it wouldn't be visible at
all as the PCI lanes are not powered up.
In your system the PCI lanes clearly are powered up, as the NVMe
device shows up in the PCI enumeration.
Which means you are running a rather different PCI configuration.
Question now is: does the NVMe device _work_?
If it does, shouldn't the NUMA node continue to be present (some kind of
memory-less, CPU-less NUMA node ...)?

As a side-note, we'll need this kind of configuration anyway once
CXL switches become available...

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare@suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [PATCH] nvme: find numa distance only if controller has valid numa id
  2024-04-15 14:39 ` Hannes Reinecke
@ 2024-04-15 16:56 ` Keith Busch
  2024-04-16 8:06 ` Nilay Shroff
  1 sibling, 0 replies; 11+ messages in thread
From: Keith Busch @ 2024-04-15 16:56 UTC (permalink / raw)
  To: Hannes Reinecke
  Cc: Nilay Shroff, Sagi Grimberg, linux-nvme, hch, gjoyce, axboe

On Mon, Apr 15, 2024 at 04:39:45PM +0200, Hannes Reinecke wrote:
> > For calculating the distance between two nodes we invoke the function __node_distance().
> > This function would then access the numa distance table, which is typically an array with
> > valid index starting from 0. So obviously accessing this table with index of -1 would
> > deference incorrect memory location. De-referencing incorrect memory location might have
> > side effects including panic (though I didn't encounter panic). Furthermore in such a case,
> > the calculated node distance could potentially be incorrect and that might cause the nvme
> > multipath to choose a suboptimal IO path.
> >
> > This patch may not help choosing the optimal IO path (as we assume that the node distance would be
> > LOCAL_DISTANCE in case nvme controller numa node id is -1) but it ensures that we don't access the
> > invalid memory location for calculating node distance.
> >
> Hmm. One wonders: how does such a system work?
> The systems I know always have the PCI slots attached to the CPU
> sockets, so if the CPU is not present the NVMe device on that
> slot will be non-functional. In fact, it wouldn't be visible at
> all as the PCI lanes are not powered up.
> In your system the PCI lanes clearly are powered up, as the NVMe
> device shows up in the PCI enumeration.
> Which means you are running a rather different PCI configuration.
> Question now is: does the NVMe device _work_?
> If it does, shouldn't the NUMA node continue to be present (some kind of
> memory-less, CPU-less NUMA node ...)?
> As a side-note, we'll need these kind of configuration anyway once
> CXL switches become available...

I recall systems with the IO controller attached in a shared manner to all
sockets, so memory is UMA from the IO device's perspective (it may still be
NUMA from the CPU's). I don't think you need to consider memory-only NUMA
nodes unless there are additional distances to consider (at which point
it's no longer UMA).

^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH] nvme: find numa distance only if controller has valid numa id 2024-04-15 14:39 ` Hannes Reinecke 2024-04-15 16:56 ` Keith Busch @ 2024-04-16 8:06 ` Nilay Shroff 1 sibling, 0 replies; 11+ messages in thread From: Nilay Shroff @ 2024-04-16 8:06 UTC (permalink / raw) To: linux-nvme On 4/15/24 20:09, Hannes Reinecke wrote: > On 4/15/24 11:30, Nilay Shroff wrote: >> >> >> On 4/15/24 14:25, Sagi Grimberg wrote: >>> >>> >>> On 14/04/2024 14:02, Nilay Shroff wrote: >>>> >>>> On 4/14/24 14:00, Sagi Grimberg wrote: >>>>> >>>>> On 13/04/2024 12:04, Nilay Shroff wrote: >>>>>> On numa aware system where native nvme multipath is configured and >>>>>> iopolicy is set to numa but the nvme controller numa node id is >>>>>> undefined or -1 (NUMA_NO_NODE) then avoid calculating node distance >>>>>> for finding optimal io path. In such case we may access numa distance >>>>>> table with invalid index and that may potentially refer to incorrect >>>>>> memory. So this patch ensures that if the nvme controller numa node >>>>>> id is -1 then instead of calculating node distance for finding optimal >>>>>> io path, we set the numa node distance of such controller to default 10 >>>>>> (LOCAL_DISTANCE). >>>>> Patch looks ok to me, but it is not clear weather this fixes a real issue or not. >>>>> >>>> I think this patch does help fix a real issue. I have a numa aware system where >>>> I have a multi port/controller NNVMe PCIe disk attached. On this system, I found >>>> that sometimes the nvme controller numa id is set to -1 (NUMA_NO_NODE). And the >>>> reason being, my system has processors and memory coming from one or more NUMA nodes >>>> and the NVMe PCIe device is coming from a NUMA node which is different. 
For example, >>>> we could have processors coming from node 0 and node 1, but the PCIe device coming from >>>> node 2, and we don't have any processor coming from node 2, so there would be no way for >>>> Linux to affinitize the PCIe device with a processor and hence while enumerating PCIe >>>> device kernel sets the numa id of such device to -1. Later if we hotplug CPU on node 2 >>>> then kernel would assign the numa node id 2 to the PCIe device. >>>> >>>> For instance, I have a system with two numa nodes currently online. I also have >>>> a multi controller NVMe PCIe disk attached to this system: >>>> >>>> # numactl -H >>>> available: 2 nodes (2-3) >>>> node 2 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 >>>> node 2 size: 15290 MB >>>> node 2 free: 14200 MB >>>> node 3 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 >>>> node 3 size: 16336 MB >>>> node 3 free: 15075 MB >>>> node distances: >>>> node 2 3 >>>> 2: 10 20 >>>> 3: 20 10 >>>> >>>> As we could see above on this system I have currently numa node 2 and 3 online. >>>> And I have CPUs coming from node 2 and 3. 
>>>> >>>> # lspci >>>> 052e:78:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM173Xa >>>> 058e:78:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller PM173Xa >>>> >>>> # nvme list -v >>>> Subsystem Subsystem-NQN Controllers >>>> ---------------- ------------------------------------------------------------------------------------------------ ---------------- >>>> nvme-subsys3 nqn.1994-11.com.samsung:nvme:PM1735a:2.5-inch:S6RTNE0R900057 nvme1, nvme3 >>>> >>>> Device SN MN FR TxPort Asdress Slot Subsystem Namespaces >>>> -------- -------------------- ---------------------------------------- -------- ------ -------------- ------ ------------ ---------------- >>>> nvme1 S6RTNE0R900057 3.2TB NVMe Gen4 U.2 SSD III REV.SN66 pcie 052e:78:00.0 nvme-subsys3 nvme3n1 >>>> nvme3 S6RTNE0R900057 3.2TB NVMe Gen4 U.2 SSD III REV.SN66 pcie 058e:78:00.0 nvme-subsys3 nvme3n1, nvme3n2 >>>> >>>> Device Generic NSID Usage Format Controllers >>>> ------------ ------------ ---------- -------------------------- ---------------- ---------------- >>>> /dev/nvme3n1 /dev/ng3n1 0x1 5.75 GB / 5.75 GB 4 KiB + 0 B nvme1, nvme3 >>>> /dev/nvme3n2 /dev/ng3n2 0x2 5.75 GB / 5.75 GB 4 KiB + 0 B nvme3 >>>> >>>> # cat ./sys/devices/pci058e:78/058e:78:00.0/numa_node >>>> 2 >>>> # cat ./sys/devices/pci052e:78/052e:78:00.0/numa_node >>>> -1 >>>> >>>> # cat /sys/class/nvme/nvme3/numa_node >>>> 2 >>>> # cat /sys/class/nvme/nvme1/numa_node >>>> -1 >>>> >>>> As we could see above I have multi controller NVMe disk atatched to this system. This disk >>>> has 2 controllers. However the numa node id assigned to one of the controller (nvme1) is -1. >>>> This is because on this system, currently I don't have any processor coming from a numa node >>>> where nvme1 controller numa node could be be affinitized. >>> >>> Thanks for the explanation. But what is the bug you see in this configuration? panic? >>> suboptimal performance? >>> which is it? 
it is not clear from the patch description.
>>>
>> I didn't encounter panic, however the issue here is with accessing numa distance table
>> with incorrect index.
>>
>> For calculating the distance between two nodes we invoke the function __node_distance().
>> This function would then access the numa distance table, which is typically an array with
>> valid index starting from 0. So obviously accessing this table with index of -1 would
>> deference incorrect memory location. De-referencing incorrect memory location might have
>> side effects including panic (though I didn't encounter panic). Furthermore in such a case,
>> the calculated node distance could potentially be incorrect and that might cause the nvme
>> multipath to choose a suboptimal IO path.
>>
>> This patch may not help choosing the optimal IO path (as we assume that the node distance would be
>> LOCAL_DISTANCE in case nvme controller numa node id is -1) but it ensures that we don't access the
>> invalid memory location for calculating node distance.
>>
> Hmm. One wonders: how does such a system work?
> The systems I know always have the PCI slots attached to the CPU
> sockets, so if the CPU is not present the NVMe device on that
> slot will be non-functional. In fact, it wouldn't be visible at
> all as the PCI lanes are not powered up.
> In your system the PCI lanes clearly are powered up, as the NVMe
> device shows up in the PCI enumeration.
> Which means you are running a rather different PCI configuration.
> Question now is: does the NVMe device _work_?

Yes, on my system the NVMe device works as expected even when the numa node
id assigned to the NVMe controller is -1. The only side effect of such a
configuration is that, if multipath is configured, we may not be able to
find the optimal IO path.

> If it does, shouldn't the NUMA node continue to be present (some kind of memory-less, CPU-less NUMA node ...)?

I don't think a NUMA node always has to be present/online just because a
PCI device is affinitized to it. At least on the system I have, that's not
the case. In fact, a NUMA node id is assigned to each PCI device in the
system; however, during PCI enumeration the kernel first validates whether
the NUMA node assigned to the PCI device is online. If the respective NUMA
node is NOT online, the PCI enumeration code sets the numa node id of that
PCI device to NUMA_NO_NODE (i.e. -1).

> As a side-note, we'll need these kind of configuration anyway once
> CXL switches become available...

Please ping me if I can be of any help with CXL switches.

Thanks,
--Nilay

^ permalink raw reply [flat|nested] 11+ messages in thread
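The enumeration behaviour described in the message above can be sketched in user space as follows. This is only an illustration of the described rule (keep the firmware-reported node when it is online, otherwise fall back to NUMA_NO_NODE). The bitmask representation of online nodes, SKETCH_MAX_NODES, and the sketch_* function names are assumptions made for the example and do not correspond to actual kernel interfaces.

```c
#include <assert.h>

#define NUMA_NO_NODE     (-1)
#define SKETCH_MAX_NODES 8	/* hypothetical upper bound on node ids */

/* Nonzero when bit 'nid' is set in the (hypothetical) online-node mask. */
static int sketch_node_online(unsigned int online_mask, int nid)
{
	return nid >= 0 && nid < SKETCH_MAX_NODES &&
	       ((online_mask >> nid) & 1u);
}

/*
 * Node id a device ends up with after enumeration in this sketch:
 * the firmware-reported node if it is online, NUMA_NO_NODE otherwise.
 */
static int sketch_effective_node(unsigned int online_mask, int firmware_nid)
{
	/* A running system has at least one online node. */
	assert(online_mask != 0);

	return sketch_node_online(online_mask, firmware_nid) ?
	       firmware_nid : NUMA_NO_NODE;
}
```

For a system like the one in the numactl output quoted earlier, where only nodes 2 and 3 are online, the mask would be 0xC: a device reported on node 2 keeps node 2, while a device reported on any offline node ends up with -1.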
* Re: [PATCH] nvme: find numa distance only if controller has valid numa id
  2024-04-13 9:04 [PATCH] nvme: find numa distance only if controller has valid numa id Nilay Shroff
  2024-04-14 8:30 ` Sagi Grimberg
@ 2024-04-15 7:25 ` Christoph Hellwig
  2024-04-15 7:54 ` Nilay Shroff
  1 sibling, 1 reply; 11+ messages in thread
From: Christoph Hellwig @ 2024-04-15 7:25 UTC (permalink / raw)
  To: Nilay Shroff; +Cc: linux-nvme, hch, kbusch, sagi, gjoyce, axboe

On Sat, Apr 13, 2024 at 02:34:36PM +0530, Nilay Shroff wrote:
>  			continue;
>  
> -		if (READ_ONCE(head->subsys->iopolicy) == NVME_IOPOLICY_NUMA)
> -			distance = node_distance(node, ns->ctrl->numa_node);
> -		else
> -			distance = LOCAL_DISTANCE;
> +		if (READ_ONCE(head->subsys->iopolicy) == NVME_IOPOLICY_NUMA) {
> +			if (ns->ctrl->numa_node != NUMA_NO_NODE)
> +				distance = node_distance(node, ns->ctrl->numa_node);
> +		}

Please avoid the overly long line. This could be easily done by keeping
the old code structure, which IMHO is more readable anyway:

		if (ns->ctrl->numa_node != NUMA_NO_NODE &&
		    READ_ONCE(head->subsys->iopolicy) == NVME_IOPOLICY_NUMA)
			distance = node_distance(node, ns->ctrl->numa_node);
		else
			distance = LOCAL_DISTANCE;

^ permalink raw reply [flat|nested] 11+ messages in thread
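For readers who want to try the suggested condition outside the kernel, here is a small user-space sketch of the same decision. It is not the driver code: the constants are redefined locally with the values mentioned in the thread (-1 and 10), and the REMOTE value of 20, the enum, the table size, and the sketch_* helpers are hypothetical stand-ins for head->subsys->iopolicy and node_distance().

```c
#include <assert.h>

/* Local stand-ins for the kernel constants named in the thread. */
#define NUMA_NO_NODE		(-1)
#define LOCAL_DISTANCE		10
#define SKETCH_REMOTE_DISTANCE	20	/* hypothetical value for any other node */
#define SKETCH_NR_NODES		4	/* hypothetical number of nodes */

/* Hypothetical stand-in for the subsystem iopolicy setting. */
enum sketch_iopolicy { SKETCH_IOPOLICY_NUMA, SKETCH_IOPOLICY_OTHER };

/* Hypothetical distance helper; it only accepts valid node ids. */
static int sketch_node_distance(int from, int to)
{
	assert(from >= 0 && from < SKETCH_NR_NODES);
	assert(to >= 0 && to < SKETCH_NR_NODES);
	return from == to ? LOCAL_DISTANCE : SKETCH_REMOTE_DISTANCE;
}

/*
 * Same shape as the suggested condition: consult the distance helper
 * only when the controller has a valid node and the policy is NUMA,
 * otherwise fall back to LOCAL_DISTANCE.
 */
static int sketch_path_distance(enum sketch_iopolicy policy, int node,
				int ctrl_node)
{
	if (ctrl_node != NUMA_NO_NODE && policy == SKETCH_IOPOLICY_NUMA)
		return sketch_node_distance(node, ctrl_node);
	return LOCAL_DISTANCE;
}
```

Because the NUMA_NO_NODE test comes first, the helper is never reached with -1 as the controller node, so the assertions inside it cannot fire for that case. A controller without a node is treated the same as a local one, which matches the trade-off discussed earlier in the thread.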
* Re: [PATCH] nvme: find numa distance only if controller has valid numa id
  2024-04-15 7:25 ` Christoph Hellwig
@ 2024-04-15 7:54 ` Nilay Shroff
  0 siblings, 0 replies; 11+ messages in thread
From: Nilay Shroff @ 2024-04-15 7:54 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-nvme, kbusch, sagi, gjoyce, axboe

On 4/15/24 12:55, Christoph Hellwig wrote:
> On Sat, Apr 13, 2024 at 02:34:36PM +0530, Nilay Shroff wrote:
>>  			continue;
>>  
>> -		if (READ_ONCE(head->subsys->iopolicy) == NVME_IOPOLICY_NUMA)
>> -			distance = node_distance(node, ns->ctrl->numa_node);
>> -		else
>> -			distance = LOCAL_DISTANCE;
>> +		if (READ_ONCE(head->subsys->iopolicy) == NVME_IOPOLICY_NUMA) {
>> +			if (ns->ctrl->numa_node != NUMA_NO_NODE)
>> +				distance = node_distance(node, ns->ctrl->numa_node);
>> +		}
>
> Please avoid the overly long line. This could be easily done by keeping
> the old code structure, which IMHO is more readable anyway:
>
> 		if (ns->ctrl->numa_node != NUMA_NO_NODE &&
> 		    READ_ONCE(head->subsys->iopolicy) == NVME_IOPOLICY_NUMA)
> 			distance = node_distance(node, ns->ctrl->numa_node);
> 		else
> 			distance = LOCAL_DISTANCE;
>
>
Sure. I will incorporate the above comment in the next version of the patch.

Thanks,
--Nilay

^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2024-04-16 8:06 UTC | newest]

Thread overview: 11+ messages
2024-04-13 9:04 [PATCH] nvme: find numa distance only if controller has valid numa id Nilay Shroff
2024-04-14 8:30 ` Sagi Grimberg
2024-04-14 11:02   ` Nilay Shroff
2024-04-15 8:55     ` Sagi Grimberg
2024-04-15 9:30       ` Nilay Shroff
2024-04-15 10:04         ` Sagi Grimberg
2024-04-15 14:39         ` Hannes Reinecke
2024-04-15 16:56           ` Keith Busch
2024-04-16 8:06           ` Nilay Shroff
2024-04-15 7:25 ` Christoph Hellwig
2024-04-15 7:54   ` Nilay Shroff