* [PATCH v2 0/1] powerpc/numa: Make cpu/memory less numa-node online
@ 2024-05-17 14:25 Nilay Shroff
2024-05-17 14:25 ` [PATCH v2 1/1] powerpc/numa: Online a node if PHB is attached Nilay Shroff
2024-06-20 12:49 ` [PATCH v2 0/1] powerpc/numa: Make cpu/memory less numa-node online Michael Ellerman
0 siblings, 2 replies; 5+ messages in thread
From: Nilay Shroff @ 2024-05-17 14:25 UTC (permalink / raw)
To: mpe, npiggin, christophe.leroy, naveen.n.rao
Cc: gjoyce, srikar, Nilay Shroff, linuxppc-dev, sshegde
Hi,
On a NUMA-aware system, we make a NUMA node online only if that node is
attached to cpu/memory. However, it's possible that some PCI/IO device is
affinitized to a NUMA node which is not currently online. In such a case
we set the NUMA node id of the corresponding PCI device to -1
(NUMA_NO_NODE). Not assigning the correct NUMA node id to a PCI device may
impact its performance. For instance, consider a multi-controller NVMe
disk where each controller of the disk is attached to a different PHB
(PCI host bridge). Each of these PHBs gets a NUMA node id assigned during
PCI enumeration; if the node is found to be offline at that point, the
NUMA node id of the PHB is set to -1. If we create a shared namespace on
such a multi-controller NVMe disk, the namespace can be accessed through
each controller, and as each controller is connected to a different PHB,
the same namespace is reachable over multiple PCI channels. While sending
IO to a shared namespace, the NVMe driver calculates the optimal IO path
using NUMA node distance. However, if the NUMA node id is not correctly
assigned to an NVMe PCIe controller, the driver may calculate an incorrect
NUMA distance and hence select a non-optimal path for the IO; if this
happens we can observe degraded IO performance.
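To make the path-selection logic concrete, here is a minimal, self-contained
user-space sketch of this kind of distance-based selection. This is not the
in-kernel NVMe multipath code: struct ctrl_path, pick_numa_path() and
MAX_NODES are invented for illustration, and the real driver also considers
per-path state and supports other iopolicies (e.g. round-robin).

/*
 * Illustrative sketch of NUMA-distance based IO path selection.
 * Each controller carries the NUMA node id of the PHB it sits behind,
 * and the "numa" iopolicy prefers the controller whose node is closest
 * to the node issuing the IO.  A controller left at NUMA_NO_NODE (-1)
 * cannot be compared meaningfully.
 */
#include <stdio.h>
#include <limits.h>

#define NUMA_NO_NODE	(-1)
#define MAX_NODES	4

struct ctrl_path {
	const char *name;	/* e.g. "nvme0", "nvme1" */
	int numa_node;		/* NUMA node of the PHB behind this controller */
};

/* stand-in for the kernel's node_distance() */
static int distance(const int dist[MAX_NODES][MAX_NODES], int a, int b)
{
	if (a == NUMA_NO_NODE || b == NUMA_NO_NODE)
		return INT_MAX;		/* distance is unknown */
	return dist[a][b];
}

static const struct ctrl_path *
pick_numa_path(const struct ctrl_path *paths, int nr,
	       const int dist[MAX_NODES][MAX_NODES], int io_node)
{
	/* fall back to the first path if no distance is usable */
	const struct ctrl_path *best = &paths[0];
	int best_dist = INT_MAX;

	for (int i = 0; i < nr; i++) {
		int d = distance(dist, io_node, paths[i].numa_node);

		if (d < best_dist) {
			best_dist = d;
			best = &paths[i];
		}
	}
	return best;
}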
Please find below the performance of a multi-controller NVMe disk w/ and
w/o the proposed patch applied:
# lspci
0524:28:00.0 Non-Volatile memory controller: KIOXIA Corporation NVMe SSD Controller CM7 2.5" (rev 01)
0584:28:00.0 Non-Volatile memory controller: KIOXIA Corporation NVMe SSD Controller CM7 2.5" (rev 01)
# nvme list -v
Subsystem Subsystem-NQN Controllers
---------------- ------------------------------------------------------------------------------------------------ ----------------
nvme-subsys1 nqn.2019-10.com.kioxia:KCM7DRUG1T92:3D60A04906N1 nvme0, nvme1
Device   SN                   MN                                       FR       TxPort Address        Slot   Subsystem    Namespaces
-------- -------------------- ---------------------------------------- -------- ------ -------------- ------ ------------ ----------------
nvme0 3D60A04906N1 1.6TB NVMe Gen4 U.2 SSD IV REV.CAS2 pcie 0524:28:00.0 nvme-subsys1 nvme1n3
nvme1 3D60A04906N1 1.6TB NVMe Gen4 U.2 SSD IV REV.CAS2 pcie 0584:28:00.0 nvme-subsys1 nvme1n3
Device Generic NSID Usage Format Controllers
------------ ------------ ---------- -------------------------- ---------------- ----------------
/dev/nvme1n3 /dev/ng1n3 0x3 5.75 GB / 5.75 GB 4 KiB + 0 B nvme0, nvme1
We can see above that the NVMe disk has two controllers, nvme0 and
nvme1. Both controllers can be accessed through two different PCI
channels (0524:28 and 0584:28). I have also created a shared namespace
(/dev/nvme1n3) which is reachable behind both controllers, nvme0 and
nvme1.
Test-1: Measure IO performance w/o proposed patch:
--------------------------------------------------
# numactl -H
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 0 size: 31565 MB
node 0 free: 28452 MB
node distances:
node 0
0: 10
On this machine we only have node 0 online.
# cat /sys/class/nvme/nvme1/numa_node
-1
# cat /sys/class/nvme/nvme0/numa_node
0
# cat /sys/class/nvme-subsystem/nvme-subsys1/iopolicy
numa
We can see above that the NUMA node id assigned to nvme1 is -1, whereas
the NUMA node id assigned to nvme0 is 0. Also, the iopolicy is set to
numa. Now we run the IO perf test and measure the performance:
# fio --filename=/dev/nvme1n3 --direct=1 --rw=randwrite --bs=4k --ioengine=io_uring --iodepth=512 --runtime=60 --numjobs=4 --time_based --group_reporting --name=iops-test-job --eta-newline=1 --cpus_allowed=0-3
iops-test-job: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=512
...
fio-3.35
Starting 4 processes
[...]
[...]
iops-test-job: (groupid=0, jobs=4): err= 0: pid=5665: Tue Apr 30 04:07:31 2024
write: IOPS=632k, BW=2469MiB/s (2589MB/s)(145GiB/60003msec); 0 zone resets
slat (usec): min=2, max=10031, avg= 4.62, stdev= 5.40
clat (usec): min=12, max=15687, avg=3233.58, stdev=877.78
lat (usec): min=16, max=15693, avg=3238.19, stdev=879.06
clat percentiles (usec):
| 1.00th=[ 2868], 5.00th=[ 2900], 10.00th=[ 2900], 20.00th=[ 2900],
| 30.00th=[ 2933], 40.00th=[ 2933], 50.00th=[ 2933], 60.00th=[ 2933],
| 70.00th=[ 2933], 80.00th=[ 2966], 90.00th=[ 5604], 95.00th=[ 5669],
| 99.00th=[ 5735], 99.50th=[ 5735], 99.90th=[ 5866], 99.95th=[ 6456],
| 99.99th=[15533]
bw ( MiB/s): min= 1305, max= 2739, per=99.94%, avg=2467.92, stdev=130.72, samples=476
iops : min=334100, max=701270, avg=631786.39, stdev=33464.48, samples=476
lat (usec) : 20=0.01%, 50=0.01%, 100=0.01%, 250=0.01%, 500=0.01%
lat (usec) : 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=88.87%, 10=11.10%, 20=0.02%
cpu : usr=37.15%, sys=62.78%, ctx=638, majf=0, minf=50
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=0,37932685,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=512
Run status group 0 (all jobs):
WRITE: bw=2469MiB/s (2589MB/s), 2469MiB/s-2469MiB/s (2589MB/s-2589MB/s), io=145GiB (155GB), run=60003-60003msec
Disk stats (read/write):
nvme0n3: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=99.87%
While the test is running, we can enable the nvme_setup_cmd trace event
to capture which controller the driver uses to perform the IO.
# tail -5 /sys/kernel/debug/tracing/trace
fio-5665 [002] ..... 508.635554: nvme_setup_cmd: nvme1: disk=nvme0c1n3, qid=3, cmdid=57856, nsid=3, flags=0x0, meta=0x0, cmd=(nvme_cmd_write slba=748098, len=0, ctrl=0x0, dsmgmt=0, reftag=0)
fio-5666 [000] ..... 508.635554: nvme_setup_cmd: nvme1: disk=nvme0c1n3, qid=1, cmdid=8385, nsid=3, flags=0x0, meta=0x0, cmd=(nvme_cmd_write slba=139215, len=0, ctrl=0x0, dsmgmt=0, reftag=0)
fio-5667 [001] ..... 508.635557: nvme_setup_cmd: nvme1: disk=nvme0c1n3, qid=2, cmdid=21440, nsid=3, flags=0x0, meta=0x0, cmd=(nvme_cmd_write slba=815508, len=0, ctrl=0x0, dsmgmt=0, reftag=0)
fio-5668 [003] ..... 508.635558: nvme_setup_cmd: nvme1: disk=nvme0c1n3, qid=4, cmdid=33089, nsid=3, flags=0x0, meta=0x0, cmd=(nvme_cmd_write slba=405932, len=0, ctrl=0x0, dsmgmt=0, reftag=0)
fio-5665 [002] ..... 508.635771: nvme_setup_cmd: nvme1: disk=nvme0c1n3, qid=3, cmdid=37376, nsid=3, flags=0x0, meta=0x0, cmd=(nvme_cmd_write slba=497267, len=0, ctrl=0x0, dsmgmt=0, reftag=0)
From the above output we can see that the driver is using controller
nvme1 to perform the IO. This IO path may be sub-optimal: the NUMA node
id assigned to nvme1 is -1, so the driver cannot accurately calculate
the NUMA node distance for this controller with respect to CPU node 0,
where this test is running. Ideally, the driver would have used nvme0
for the optimal IO path.
In this fio test we measured 632k write IOPS and 2589 MB/s of write
bandwidth.
Test-2: Measure IO performance w/ proposed patch:
-------------------------------------------------
# numactl -H
available: 3 nodes (0,2-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 0 size: 31565 MB
node 0 free: 28740 MB
node 2 cpus:
node 2 size: 0 MB
node 2 free: 0 MB
node distances:
node 0 2
0: 10 40
2: 40 10
# cat /sys/class/nvme/nvme0/numa_node
0
# cat /sys/class/nvme/nvme1/numa_node
2
# cat /sys/class/nvme-subsystem/nvme-subsys1/iopolicy
numa
We can now see above that NUMA node 2 has been made online. Node 2 is
cpu/memory-less, and the nvme1 controller is now assigned NUMA node id 2.
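Plugging this topology into the earlier illustrative sketch (again, not the
driver's actual code) shows why nvme0 now wins for IO issued from node 0,
and why the comparison was meaningless before the patch, when nvme1 sat at
NUMA_NO_NODE:

/*
 * Worked example, reusing the definitions from the sketch in the
 * introduction above: with the distance table reported by numactl -H
 * and fio pinned to cpus 0-3 (node 0), the "numa" iopolicy picks
 * nvme0 (distance 10) over nvme1 (distance 40).
 */
int main(void)
{
	/* distance table as reported by numactl -H above */
	static const int dist[MAX_NODES][MAX_NODES] = {
		[0][0] = 10, [0][2] = 40,
		[2][0] = 40, [2][2] = 10,
	};
	const struct ctrl_path paths[] = {
		{ "nvme0", 0 },		/* PHB 0524:28 -> node 0 */
		{ "nvme1", 2 },		/* PHB 0584:28 -> node 2 */
	};
	const struct ctrl_path *p = pick_numa_path(paths, 2, dist, 0);

	printf("optimal path: %s\n", p->name);	/* prints "nvme0" */
	return 0;
}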
Let's run the IO perf test again and measure the performance:
# fio --filename=/dev/nvme1n3 --direct=1 --rw=randwrite --bs=4k --ioengine=io_uring --iodepth=512 --runtime=60 --numjobs=4 --time_based --group_reporting --name=iops-test-job --eta-newline=1 --cpus_allowed=0-3
iops-test-job: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=512
...
fio-3.35
Starting 4 processes
[...]
[...]
iops-test-job: (groupid=0, jobs=4): err= 0: pid=5661: Tue Apr 30 04:33:46 2024
write: IOPS=715k, BW=2792MiB/s (2928MB/s)(164GiB/60001msec); 0 zone resets
slat (usec): min=2, max=10023, avg= 4.09, stdev= 4.40
clat (usec): min=11, max=12874, avg=2859.70, stdev=109.44
lat (usec): min=15, max=12878, avg=2863.78, stdev=109.54
clat percentiles (usec):
| 1.00th=[ 2737], 5.00th=[ 2835], 10.00th=[ 2835], 20.00th=[ 2835],
| 30.00th=[ 2835], 40.00th=[ 2868], 50.00th=[ 2868], 60.00th=[ 2868],
| 70.00th=[ 2868], 80.00th=[ 2868], 90.00th=[ 2900], 95.00th=[ 2900],
| 99.00th=[ 2966], 99.50th=[ 2999], 99.90th=[ 3064], 99.95th=[ 3097],
| 99.99th=[12780]
bw ( MiB/s): min= 2656, max= 2834, per=100.00%, avg=2792.81, stdev= 4.73, samples=476
iops : min=680078, max=725670, avg=714959.61, stdev=1209.66, samples=476
lat (usec) : 20=0.01%, 50=0.01%, 100=0.01%, 250=0.01%, 500=0.01%
lat (usec) : 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=99.99%, 10=0.01%, 20=0.01%
cpu : usr=36.22%, sys=63.73%, ctx=838, majf=0, minf=50
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
issued rwts: total=0,42891699,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=512
Run status group 0 (all jobs):
WRITE: bw=2792MiB/s (2928MB/s), 2792MiB/s-2792MiB/s (2928MB/s-2928MB/s), io=164GiB (176GB), run=60001-60001msec
Disk stats (read/write):
nvme1n3: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=99.87%
While the test is running, we can again enable the nvme_setup_cmd trace
event to capture which controller the driver uses to perform the IO.
# tail -5 /sys/kernel/debug/tracing/trace
fio-5661 [000] ..... 673.238805: nvme_setup_cmd: nvme0: disk=nvme0c0n3, qid=1, cmdid=61953, nsid=3, flags=0x0, meta=0x0, cmd=(nvme_cmd_write slba=589070, len=0, ctrl=0x0, dsmgmt=0, reftag=0)
fio-5664 [003] ..... 673.238807: nvme_setup_cmd: nvme0: disk=nvme0c0n3, qid=4, cmdid=12802, nsid=3, flags=0x0, meta=0x0, cmd=(nvme_cmd_write slba=1235913, len=0, ctrl=0x0, dsmgmt=0, reftag=0)
fio-5661 [000] ..... 673.238809: nvme_setup_cmd: nvme0: disk=nvme0c0n3, qid=1, cmdid=57858, nsid=3, flags=0x0, meta=0x0, cmd=(nvme_cmd_write slba=798690, len=0, ctrl=0x0, dsmgmt=0, reftag=0)
fio-5664 [003] ..... 673.238814: nvme_setup_cmd: nvme0: disk=nvme0c0n3, qid=4, cmdid=37376, nsid=3, flags=0x0, meta=0x0, cmd=(nvme_cmd_write slba=643839, len=0, ctrl=0x0, dsmgmt=0, reftag=0)
fio-5661 [000] ..... 673.238814: nvme_setup_cmd: nvme0: disk=nvme0c0n3, qid=1, cmdid=4608, nsid=3, flags=0x0, meta=0x0, cmd=(nvme_cmd_write slba=1319701, len=0, ctrl=0x0, dsmgmt=0, reftag=0)
We can see above that the driver is now using nvme0 for the IO. As this
test is running on CPU node 0 and the NUMA node id assigned to nvme0 is
also 0, nvme0 is the optimal IO path. With this patch, the driver can
accurately calculate the NUMA node distance and selects nvme0 as the
optimal IO path.
In this fio test we measured 715k write IOPS and 2928 MB/s of write
bandwidth.
Summary:
--------
In summary, after comparing both test results, it's apparent
that with the proposed patch the driver chooses the optimal
IO path when the iopolicy is set to numa, and we get better
IO performance: roughly a 12% improvement (632k -> 715k IOPS,
2589 MB/s -> 2928 MB/s).
Changes since v1:
- Fixed warning reported by kernel test robot
https://lore.kernel.org/oe-kbuild-all/202405171615.NBRa8Poe-lkp@intel.com/
Nilay Shroff (1):
powerpc/numa: Online a node if PHB is attached.
arch/powerpc/mm/numa.c | 14 +++++++++++++-
arch/powerpc/platforms/pseries/pci_dlpar.c | 14 ++++++++++++++
2 files changed, 27 insertions(+), 1 deletion(-)
--
2.44.0
* [PATCH v2 1/1] powerpc/numa: Online a node if PHB is attached.
2024-05-17 14:25 [PATCH v2 0/1] powerpc/numa: Make cpu/memory less numa-node online Nilay Shroff
@ 2024-05-17 14:25 ` Nilay Shroff
2024-05-20 17:16 ` Srikar Dronamraju
2024-05-24 13:31 ` Krishna Kumar
2024-06-20 12:49 ` [PATCH v2 0/1] powerpc/numa: Make cpu/memory less numa-node online Michael Ellerman
1 sibling, 2 replies; 5+ messages in thread
From: Nilay Shroff @ 2024-05-17 14:25 UTC (permalink / raw)
To: mpe, npiggin, christophe.leroy, naveen.n.rao
Cc: gjoyce, srikar, Nilay Shroff, linuxppc-dev, sshegde
In the current design, a NUMA node is made online only if
that node is attached to cpu/memory. With this design, if
any PCI/IO device is found to be attached to a NUMA node
which is not online, then the NUMA node id of the
corresponding PCI/IO device is set to NUMA_NO_NODE (-1).
This may negatively impact the performance of a PCIe
device whose assigned NUMA node id is -1, because in that
case we may not be able to accurately calculate the
distance between two nodes.
A multi-controller NVMe PCIe disk has an issue with
calculating the node distance if the PCIe NVMe controller
is attached to a PCI host bridge whose NUMA node id is set
to NUMA_NO_NODE. This patch helps fix this by ensuring
that a cpu/memory-less NUMA node is made online if it is
attached to a PCI host bridge.
Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
arch/powerpc/mm/numa.c | 14 +++++++++++++-
arch/powerpc/platforms/pseries/pci_dlpar.c | 14 ++++++++++++++
2 files changed, 27 insertions(+), 1 deletion(-)
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index a490724e84ad..aa89899f0c1a 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -896,7 +896,7 @@ static int __init numa_setup_drmem_lmb(struct drmem_lmb *lmb,
static int __init parse_numa_properties(void)
{
- struct device_node *memory;
+ struct device_node *memory, *pci;
int default_nid = 0;
unsigned long i;
const __be32 *associativity;
@@ -1010,6 +1010,18 @@ static int __init parse_numa_properties(void)
goto new_range;
}
+ for_each_node_by_name(pci, "pci") {
+ int nid = NUMA_NO_NODE;
+
+ associativity = of_get_associativity(pci);
+ if (associativity) {
+ nid = associativity_to_nid(associativity);
+ initialize_form1_numa_distance(associativity);
+ }
+ if (likely(nid >= 0) && !node_online(nid))
+ node_set_online(nid);
+ }
+
/*
* Now do the same thing for each MEMBLOCK listed in the
* ibm,dynamic-memory property in the
diff --git a/arch/powerpc/platforms/pseries/pci_dlpar.c b/arch/powerpc/platforms/pseries/pci_dlpar.c
index 4448386268d9..52e2623a741d 100644
--- a/arch/powerpc/platforms/pseries/pci_dlpar.c
+++ b/arch/powerpc/platforms/pseries/pci_dlpar.c
@@ -11,6 +11,7 @@
#include <linux/pci.h>
#include <linux/export.h>
+#include <linux/node.h>
#include <asm/pci-bridge.h>
#include <asm/ppc-pci.h>
#include <asm/firmware.h>
@@ -21,9 +22,22 @@
struct pci_controller *init_phb_dynamic(struct device_node *dn)
{
struct pci_controller *phb;
+ int nid;
pr_debug("PCI: Initializing new hotplug PHB %pOF\n", dn);
+ nid = of_node_to_nid(dn);
+ if (likely((nid) >= 0)) {
+ if (!node_online(nid)) {
+ if (__register_one_node(nid)) {
+ pr_err("PCI: Failed to register node %d\n", nid);
+ } else {
+ update_numa_distance(dn);
+ node_set_online(nid);
+ }
+ }
+ }
+
phb = pcibios_alloc_controller(dn);
if (!phb)
return NULL;
--
2.44.0
* Re: [PATCH v2 1/1] powerpc/numa: Online a node if PHB is attached.
2024-05-17 14:25 ` [PATCH v2 1/1] powerpc/numa: Online a node if PHB is attached Nilay Shroff
@ 2024-05-20 17:16 ` Srikar Dronamraju
2024-05-24 13:31 ` Krishna Kumar
1 sibling, 0 replies; 5+ messages in thread
From: Srikar Dronamraju @ 2024-05-20 17:16 UTC (permalink / raw)
To: Nilay Shroff; +Cc: sshegde, gjoyce, npiggin, naveen.n.rao, linuxppc-dev
* Nilay Shroff <nilay@linux.ibm.com> [2024-05-17 19:55:23]:
Hi Nilay,
> In the current design, a NUMA node is made online only if
> that node is attached to cpu/memory. With this design, if
> any PCI/IO device is found to be attached to a NUMA node
> which is not online, then the NUMA node id of the
> corresponding PCI/IO device is set to NUMA_NO_NODE (-1).
> This may negatively impact the performance of a PCIe
> device whose assigned NUMA node id is -1, because in that
> case we may not be able to accurately calculate the
> distance between two nodes.
> A multi-controller NVMe PCIe disk has an issue with
> calculating the node distance if the PCIe NVMe controller
> is attached to a PCI host bridge whose NUMA node id is set
> to NUMA_NO_NODE. This patch helps fix this by ensuring
> that a cpu/memory-less NUMA node is made online if it is
> attached to a PCI host bridge.
>
Looks good to me.
Reviewed-by: Srikar Dronamraju <srikar@linux.ibm.com>
--
Thanks and Regards
Srikar Dronamraju
* Re: [PATCH v2 1/1] powerpc/numa: Online a node if PHB is attached.
2024-05-17 14:25 ` [PATCH v2 1/1] powerpc/numa: Online a node if PHB is attached Nilay Shroff
2024-05-20 17:16 ` Srikar Dronamraju
@ 2024-05-24 13:31 ` Krishna Kumar
1 sibling, 0 replies; 5+ messages in thread
From: Krishna Kumar @ 2024-05-24 13:31 UTC (permalink / raw)
To: Nilay Shroff, mpe, npiggin, christophe.leroy, naveen.n.rao
Cc: gjoyce, srikar, linuxppc-dev, sshegde
On 5/17/24 19:55, Nilay Shroff wrote:
> In the current design, a NUMA node is made online only if
> that node is attached to cpu/memory. With this design, if
> any PCI/IO device is found to be attached to a NUMA node
> which is not online, then the NUMA node id of the
> corresponding PCI/IO device is set to NUMA_NO_NODE (-1).
> This may negatively impact the performance of a PCIe
> device whose assigned NUMA node id is -1, because in that
> case we may not be able to accurately calculate the
> distance between two nodes.
> A multi-controller NVMe PCIe disk has an issue with
> calculating the node distance if the PCIe NVMe controller
> is attached to a PCI host bridge whose NUMA node id is set
> to NUMA_NO_NODE. This patch helps fix this by ensuring
> that a cpu/memory-less NUMA node is made online if it is
> attached to a PCI host bridge.
>
> Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
Thanks for fixing this. Looks good to me.
Reviewed-by: Krishna Kumar <krishnak@linux.ibm.com>
* Re: [PATCH v2 0/1] powerpc/numa: Make cpu/memory less numa-node online
2024-05-17 14:25 [PATCH v2 0/1] powerpc/numa: Make cpu/memory less numa-node online Nilay Shroff
2024-05-17 14:25 ` [PATCH v2 1/1] powerpc/numa: Online a node if PHB is attached Nilay Shroff
@ 2024-06-20 12:49 ` Michael Ellerman
1 sibling, 0 replies; 5+ messages in thread
From: Michael Ellerman @ 2024-06-20 12:49 UTC (permalink / raw)
To: mpe, npiggin, christophe.leroy, naveen.n.rao, Nilay Shroff
Cc: gjoyce, srikar, linuxppc-dev, sshegde
On Fri, 17 May 2024 19:55:21 +0530, Nilay Shroff wrote:
> On a NUMA-aware system, we make a NUMA node online only if that node is
> attached to cpu/memory. However, it's possible that some PCI/IO device is
> affinitized to a NUMA node which is not currently online. In such a case
> we set the NUMA node id of the corresponding PCI device to -1
> (NUMA_NO_NODE). Not assigning the correct NUMA node id to a PCI device may
> impact its performance. For instance, consider a multi-controller NVMe
> disk where each controller of the disk is attached to a different PHB
> (PCI host bridge). Each of these PHBs gets a NUMA node id assigned during
> PCI enumeration; if the node is found to be offline at that point, the
> NUMA node id of the PHB is set to -1. If we create a shared namespace on
> such a multi-controller NVMe disk, the namespace can be accessed through
> each controller, and as each controller is connected to a different PHB,
> the same namespace is reachable over multiple PCI channels. While sending
> IO to a shared namespace, the NVMe driver calculates the optimal IO path
> using NUMA node distance. However, if the NUMA node id is not correctly
> assigned to an NVMe PCIe controller, the driver may calculate an incorrect
> NUMA distance and hence select a non-optimal path for the IO; if this
> happens we can observe degraded IO performance.
>
> [...]
Applied to powerpc/next.
[1/1] powerpc/numa: Online a node if PHB is attached.
https://git.kernel.org/powerpc/c/11981816e3614156a1fe14a1e8e77094ea46c7d5
cheers