linuxppc-dev.lists.ozlabs.org archive mirror
* [PATCH 0/1] powerpc/numa: Make cpu/memory less numa-node online
@ 2024-05-16  8:12 Nilay Shroff
  2024-05-16  8:12 ` [PATCH 1/1] powerpc/numa: Online a node if PHB is attached Nilay Shroff
  0 siblings, 1 reply; 3+ messages in thread
From: Nilay Shroff @ 2024-05-16  8:12 UTC (permalink / raw)
  To: mpe, npiggin, christophe.leroy, naveen.n.rao
  Cc: gjoyce, srikar, Nilay Shroff, linuxppc-dev, sshegde

Hi,

On a NUMA-aware system, we make a numa-node online only if that node is 
attached to cpu/memory. However, it's possible to have a PCI/IO device 
affinitized to a numa-node which is not currently online. In such a case 
we set the numa-node id of the corresponding PCI device to -1 
(NUMA_NO_NODE). Not assigning the correct numa-node id to a PCI device 
may impact the performance of that device. For instance, consider a 
multi-controller NVMe disk where each controller of the disk is attached 
to a different PHB (PCI host bridge). Each of these PHBs has a numa-node 
id assigned during PCI enumeration. During PCI enumeration, if we find 
that the numa-node is not online, we set the numa-node id of the PHB to 
-1. If we create a shared namespace and attach it to the multi-controller 
NVMe disk, that namespace can be accessed through each controller, and as 
each controller is connected to a different PHB it's possible to access 
the same namespace over multiple PCI channels. While sending IO to a 
shared namespace, the NVMe driver calculates the optimal IO path using 
numa-node distance. However, if the numa-node id is not correctly 
assigned to an NVMe PCIe controller, the driver may calculate an 
incorrect NUMA distance and hence select a non-optimal path for sending 
IO. If this happens, we could observe degraded IO performance.
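
For context, below is a minimal userspace sketch (not the actual kernel
implementation) of the idea behind the nvme "numa" iopolicy: among the
available controllers, the one with the smallest node distance from the
submitting cpu's node wins. The node ids 0 and 2 and the 10/40 distances
mirror the test setup later in this mail; everything else in the sketch
is made up for illustration.

/*
 * Sketch only: models distance-based path selection, not the driver code.
 */
#include <limits.h>
#include <stdio.h>

struct ctrl {
	const char *name;
	int numa_node;		/* -1 == NUMA_NO_NODE */
};

static int node_distance(int from, int to)
{
	/* two-node example: local distance 10, remote distance 40 */
	if (from < 0 || to < 0)
		return INT_MAX;	/* unknown node: comparison is meaningless */
	return (from == to) ? 10 : 40;
}

int main(void)
{
	struct ctrl ctrls[] = { { "nvme0", 0 }, { "nvme1", 2 } };
	int cpu_node = 0, best = INT_MAX, i;
	const char *pick = "none";

	for (i = 0; i < 2; i++) {
		int d = node_distance(cpu_node, ctrls[i].numa_node);

		if (d < best) {
			best = d;
			pick = ctrls[i].name;
		}
	}
	/* with both node ids valid this prints nvme0 (distance 10) */
	printf("optimal path from node %d: %s (distance %d)\n",
	       cpu_node, pick, best);
	return 0;
}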

Please find below the performance of a multi-controller NVMe disk w/ and 
w/o the proposed patch applied:

# lspci 
0524:28:00.0 Non-Volatile memory controller: KIOXIA Corporation NVMe SSD Controller CM7 2.5" (rev 01)
0584:28:00.0 Non-Volatile memory controller: KIOXIA Corporation NVMe SSD Controller CM7 2.5" (rev 01)

# nvme list -v 
Subsystem        Subsystem-NQN                                                                                    Controllers
---------------- ------------------------------------------------------------------------------------------------ ----------------
nvme-subsys1     nqn.2019-10.com.kioxia:KCM7DRUG1T92:3D60A04906N1                                                 nvme0, nvme1

Device   SN                   MN                                       FR       TxPort Address        Slot   Subsystem    Namespaces      
-------- -------------------- ---------------------------------------- -------- ------ -------------- ------ ------------ ----------------
nvme0    3D60A04906N1         1.6TB NVMe Gen4 U.2 SSD IV               REV.CAS2 pcie   0524:28:00.0          nvme-subsys1 nvme1n3
nvme1    3D60A04906N1         1.6TB NVMe Gen4 U.2 SSD IV               REV.CAS2 pcie   0584:28:00.0          nvme-subsys1 nvme1n3

Device       Generic      NSID       Usage                      Format           Controllers     
------------ ------------ ---------- -------------------------- ---------------- ----------------
/dev/nvme1n3 /dev/ng1n3   0x3          5.75  GB /   5.75  GB      4 KiB +  0 B   nvme0, nvme1

We can see above that the nvme disk has two controllers, nvme0 and 
nvme1. Both controllers can be accessed through two different PCI 
channels (0524:28 and 0584:28). I have also created a shared namespace 
(/dev/nvme1n3) which is connected behind controllers nvme0 and nvme1.
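
For reference, a shared namespace like this can be created and attached
to both controllers with nvme-cli along the following lines (the size
values and controller IDs below are placeholders, not the exact ones
used for this test); --nmic=1 marks the namespace shareable so it can be
attached to more than one controller:

# nvme create-ns /dev/nvme1 --nsze=<blocks> --ncap=<blocks> --flbas=0 --nmic=1
# nvme attach-ns /dev/nvme1 --namespace-id=3 --controllers=<cntlid-of-nvme0>,<cntlid-of-nvme1>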

Test-1: Measure IO performance w/o proposed patch:
--------------------------------------------------
# numactl -H 
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 0 size: 31565 MB
node 0 free: 28452 MB
node distances:
node   0 
  0:  10 

On this machine we only have node 0 online. 

# cat /sys/class/nvme/nvme1/numa_node 
-1
# cat /sys/class/nvme/nvme0/numa_node 
0 
# cat /sys/class/nvme-subsystem/nvme-subsys1/iopolicy 
numa

We can see above that the numa node id assigned to nvme1 is -1 while the 
numa node id assigned to nvme0 is 0. Also, the iopolicy is set to numa.
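
For reference, the iopolicy can be changed at runtime through the same
sysfs attribute, for example:

# echo numa > /sys/class/nvme-subsystem/nvme-subsys1/iopolicy

(round-robin is the other commonly supported value here.)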

Now let's run the IO perf test and measure the performance:

# fio --filename=/dev/nvme1n3 --direct=1 --rw=randwrite  --bs=4k --ioengine=io_uring --iodepth=512 --runtime=60 --numjobs=4 --time_based --group_reporting --name=iops-test-job --eta-newline=1 --cpus_allowed=0-3 
iops-test-job: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=512
...
fio-3.35
Starting 4 processes
[...]
[...]
iops-test-job: (groupid=0, jobs=4): err= 0: pid=5665: Tue Apr 30 04:07:31 2024
  write: IOPS=632k, BW=2469MiB/s (2589MB/s)(145GiB/60003msec); 0 zone resets
    slat (usec): min=2, max=10031, avg= 4.62, stdev= 5.40
    clat (usec): min=12, max=15687, avg=3233.58, stdev=877.78
     lat (usec): min=16, max=15693, avg=3238.19, stdev=879.06
    clat percentiles (usec):
     |  1.00th=[ 2868],  5.00th=[ 2900], 10.00th=[ 2900], 20.00th=[ 2900],
     | 30.00th=[ 2933], 40.00th=[ 2933], 50.00th=[ 2933], 60.00th=[ 2933],
     | 70.00th=[ 2933], 80.00th=[ 2966], 90.00th=[ 5604], 95.00th=[ 5669],
     | 99.00th=[ 5735], 99.50th=[ 5735], 99.90th=[ 5866], 99.95th=[ 6456],
     | 99.99th=[15533]
   bw (  MiB/s): min= 1305, max= 2739, per=99.94%, avg=2467.92, stdev=130.72, samples=476
   iops        : min=334100, max=701270, avg=631786.39, stdev=33464.48, samples=476
  lat (usec)   : 20=0.01%, 50=0.01%, 100=0.01%, 250=0.01%, 500=0.01%
  lat (usec)   : 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=88.87%, 10=11.10%, 20=0.02%
  cpu          : usr=37.15%, sys=62.78%, ctx=638, majf=0, minf=50
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=0,37932685,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=512

Run status group 0 (all jobs):
  WRITE: bw=2469MiB/s (2589MB/s), 2469MiB/s-2469MiB/s (2589MB/s-2589MB/s), io=145GiB (155GB), run=60003-60003msec

Disk stats (read/write):
  nvme0n3: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=99.87%

While the test is running we can enable the nvme trace events to capture 
and see which controller the driver is using to perform the IO.
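
One way to enable that trace event (assuming tracefs is mounted under
/sys/kernel/debug/tracing; on newer kernels it may also be available at
/sys/kernel/tracing):

# echo 1 > /sys/kernel/debug/tracing/events/nvme/nvme_setup_cmd/enable
# echo 1 > /sys/kernel/debug/tracing/tracing_on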

# tail -5  /sys/kernel/debug/tracing/trace
             fio-5665    [002] .....   508.635554: nvme_setup_cmd: nvme1: disk=nvme0c1n3, qid=3, cmdid=57856, nsid=3, flags=0x0, meta=0x0, cmd=(nvme_cmd_write slba=748098, len=0, ctrl=0x0, dsmgmt=0, reftag=0)
             fio-5666    [000] .....   508.635554: nvme_setup_cmd: nvme1: disk=nvme0c1n3, qid=1, cmdid=8385, nsid=3, flags=0x0, meta=0x0, cmd=(nvme_cmd_write slba=139215, len=0, ctrl=0x0, dsmgmt=0, reftag=0)
             fio-5667    [001] .....   508.635557: nvme_setup_cmd: nvme1: disk=nvme0c1n3, qid=2, cmdid=21440, nsid=3, flags=0x0, meta=0x0, cmd=(nvme_cmd_write slba=815508, len=0, ctrl=0x0, dsmgmt=0, reftag=0)
             fio-5668    [003] .....   508.635558: nvme_setup_cmd: nvme1: disk=nvme0c1n3, qid=4, cmdid=33089, nsid=3, flags=0x0, meta=0x0, cmd=(nvme_cmd_write slba=405932, len=0, ctrl=0x0, dsmgmt=0, reftag=0)
             fio-5665    [002] .....   508.635771: nvme_setup_cmd: nvme1: disk=nvme0c1n3, qid=3, cmdid=37376, nsid=3, flags=0x0, meta=0x0, cmd=(nvme_cmd_write slba=497267, len=0, ctrl=0x0, dsmgmt=0, reftag=0)

From the above output we can see that the driver is using controller 
nvme1 to perform the IO. However, this IO path could be sub-optimal 
because the numa node id assigned to nvme1 is -1, so the driver can't 
accurately calculate the numa node distance for this controller with 
respect to cpu node 0 where this test is running. Ideally, the driver 
should have used nvme0 for the optimal IO path.

In the fio test result above we got 632k write IOPS and 2589MB/s of 
bandwidth.

Test-2: Measure IO performance w/ proposed patch:
-------------------------------------------------
# numactl -H 
available: 3 nodes (0,2-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 0 size: 31565 MB
node 0 free: 28740 MB
node 2 cpus:
node 2 size: 0 MB
node 2 free: 0 MB
node distances:
node   0   2  
  0:  10  40 
  2:  40  10

# cat /sys/class/nvme/nvme0/numa_node
0
# cat /sys/class/nvme/nvme1/numa_node
2
# cat /sys/class/nvme-subsystem/nvme-subsys1/iopolicy
numa

We can now see above that numa node 2 is online. Node 2 is 
cpu/memory-less, and its distance from node 0 is 40. The nvme1 controller 
is now assigned numa node id 2.

Let's run the same IO perf test again and measure the performance:

# fio --filename=/dev/nvme1n3 --direct=1 --rw=randwrite  --bs=4k --ioengine=io_uring --iodepth=512 --runtime=60 --numjobs=4 --time_based --group_reporting --name=iops-test-job --eta-newline=1 --cpus_allowed=0-3 
iops-test-job: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=512
...
fio-3.35
Starting 4 processes
[...]
[...]
iops-test-job: (groupid=0, jobs=4): err= 0: pid=5661: Tue Apr 30 04:33:46 2024
  write: IOPS=715k, BW=2792MiB/s (2928MB/s)(164GiB/60001msec); 0 zone resets
    slat (usec): min=2, max=10023, avg= 4.09, stdev= 4.40
    clat (usec): min=11, max=12874, avg=2859.70, stdev=109.44
     lat (usec): min=15, max=12878, avg=2863.78, stdev=109.54
    clat percentiles (usec):
     |  1.00th=[ 2737],  5.00th=[ 2835], 10.00th=[ 2835], 20.00th=[ 2835],
     | 30.00th=[ 2835], 40.00th=[ 2868], 50.00th=[ 2868], 60.00th=[ 2868],
     | 70.00th=[ 2868], 80.00th=[ 2868], 90.00th=[ 2900], 95.00th=[ 2900],
     | 99.00th=[ 2966], 99.50th=[ 2999], 99.90th=[ 3064], 99.95th=[ 3097],
     | 99.99th=[12780]
   bw (  MiB/s): min= 2656, max= 2834, per=100.00%, avg=2792.81, stdev= 4.73, samples=476
   iops        : min=680078, max=725670, avg=714959.61, stdev=1209.66, samples=476
  lat (usec)   : 20=0.01%, 50=0.01%, 100=0.01%, 250=0.01%, 500=0.01%
  lat (usec)   : 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=99.99%, 10=0.01%, 20=0.01%
  cpu          : usr=36.22%, sys=63.73%, ctx=838, majf=0, minf=50
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=0,42891699,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=512

Run status group 0 (all jobs):
  WRITE: bw=2792MiB/s (2928MB/s), 2792MiB/s-2792MiB/s (2928MB/s-2928MB/s), io=164GiB (176GB), run=60001-60001msec

Disk stats (read/write):
  nvme1n3: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=99.87%

While the test is running we can again capture the trace events to see
which controller the driver is using to perform the IO.

# tail -5  /sys/kernel/debug/tracing/trace
             fio-5661    [000] .....   673.238805: nvme_setup_cmd: nvme0: disk=nvme0c0n3, qid=1, cmdid=61953, nsid=3, flags=0x0, meta=0x0, cmd=(nvme_cmd_write slba=589070, len=0, ctrl=0x0, dsmgmt=0, reftag=0)
             fio-5664    [003] .....   673.238807: nvme_setup_cmd: nvme0: disk=nvme0c0n3, qid=4, cmdid=12802, nsid=3, flags=0x0, meta=0x0, cmd=(nvme_cmd_write slba=1235913, len=0, ctrl=0x0, dsmgmt=0, reftag=0)
             fio-5661    [000] .....   673.238809: nvme_setup_cmd: nvme0: disk=nvme0c0n3, qid=1, cmdid=57858, nsid=3, flags=0x0, meta=0x0, cmd=(nvme_cmd_write slba=798690, len=0, ctrl=0x0, dsmgmt=0, reftag=0)
             fio-5664    [003] .....   673.238814: nvme_setup_cmd: nvme0: disk=nvme0c0n3, qid=4, cmdid=37376, nsid=3, flags=0x0, meta=0x0, cmd=(nvme_cmd_write slba=643839, len=0, ctrl=0x0, dsmgmt=0, reftag=0)
             fio-5661    [000] .....   673.238814: nvme_setup_cmd: nvme0: disk=nvme0c0n3, qid=1, cmdid=4608, nsid=3, flags=0x0, meta=0x0, cmd=(nvme_cmd_write slba=1319701, len=0, ctrl=0x0, dsmgmt=0, reftag=0)

We can see above that the driver is now using nvme0 for IO. As this test 
is running on cpu node 0 and the numa node id assigned to nvme0 is also 
0 (distance 10, versus 40 for nvme1 on node 2), nvme0 is the optimal IO 
path. With this patch the driver is able to accurately calculate the 
numa node distance and select nvme0 accordingly.

In the fio test result above we got 715k write IOPS and 2928MB/s of 
bandwidth.

Summary:
--------
Comparing both test results, it's apparent that with the proposed 
patch the driver chooses the optimal IO path when the iopolicy is 
set to numa, and we get better IO performance: write IOPS improved 
from 632k to 715k and bandwidth from 2589MB/s to 2928MB/s, roughly 
a 13% improvement.

Nilay Shroff (1):
  powerpc/numa: Online a node if PHB is attached.

 arch/powerpc/mm/numa.c                     | 14 +++++++++++++-
 arch/powerpc/platforms/pseries/pci_dlpar.c | 14 ++++++++++++++
 2 files changed, 27 insertions(+), 1 deletion(-)

-- 
2.44.0

* [PATCH 1/1] powerpc/numa: Online a node if PHB is attached.
  2024-05-16  8:12 [PATCH 0/1] powerpc/numa: Make cpu/memory less numa-node online Nilay Shroff
@ 2024-05-16  8:12 ` Nilay Shroff
  2024-05-17  8:58   ` kernel test robot
  0 siblings, 1 reply; 3+ messages in thread
From: Nilay Shroff @ 2024-05-16  8:12 UTC (permalink / raw)
  To: mpe, npiggin, christophe.leroy, naveen.n.rao
  Cc: gjoyce, srikar, Nilay Shroff, linuxppc-dev, sshegde

In the current design, a numa-node is made online only if
that node is attached to cpu/memory. With this design, if
any PCI/IO device is found to be attached to a numa-node
which is not online, the numa-node id of the corresponding
PCI/IO device is set to NUMA_NO_NODE (-1). This may
negatively impact the performance of a PCIe device if the
numa-node assigned to it is -1, because in such a case we
may not be able to accurately calculate the distance
between two nodes.
A multi-controller NVMe PCIe disk has an issue with
calculating the node distance if the PCIe NVMe controller
is attached to a PCI host bridge which has its numa-node id
set to NUMA_NO_NODE. This patch helps fix this by ensuring
that a cpu/memory-less numa node is made online if it's
attached to a PCI host bridge.

Signed-off-by: Nilay Shroff <nilay@linux.ibm.com>
---
 arch/powerpc/mm/numa.c                     | 14 +++++++++++++-
 arch/powerpc/platforms/pseries/pci_dlpar.c | 14 ++++++++++++++
 2 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index a490724e84ad..9e5e366cee43 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -896,7 +896,7 @@ static int __init numa_setup_drmem_lmb(struct drmem_lmb *lmb,
 
 static int __init parse_numa_properties(void)
 {
-	struct device_node *memory;
+	struct device_node *memory, *pci;
 	int default_nid = 0;
 	unsigned long i;
 	const __be32 *associativity;
@@ -1010,6 +1010,18 @@ static int __init parse_numa_properties(void)
 			goto new_range;
 	}
 
+	for_each_node_by_name(pci, "pci") {
+		int nid;
+
+		associativity = of_get_associativity(pci);
+		if (associativity) {
+			nid = associativity_to_nid(associativity);
+			initialize_form1_numa_distance(associativity);
+		}
+		if (likely(nid >= 0) && !node_online(nid))
+			node_set_online(nid);
+	}
+
 	/*
 	 * Now do the same thing for each MEMBLOCK listed in the
 	 * ibm,dynamic-memory property in the
diff --git a/arch/powerpc/platforms/pseries/pci_dlpar.c b/arch/powerpc/platforms/pseries/pci_dlpar.c
index 4448386268d9..52e2623a741d 100644
--- a/arch/powerpc/platforms/pseries/pci_dlpar.c
+++ b/arch/powerpc/platforms/pseries/pci_dlpar.c
@@ -11,6 +11,7 @@
 
 #include <linux/pci.h>
 #include <linux/export.h>
+#include <linux/node.h>
 #include <asm/pci-bridge.h>
 #include <asm/ppc-pci.h>
 #include <asm/firmware.h>
@@ -21,9 +22,22 @@
 struct pci_controller *init_phb_dynamic(struct device_node *dn)
 {
 	struct pci_controller *phb;
+	int nid;
 
 	pr_debug("PCI: Initializing new hotplug PHB %pOF\n", dn);
 
+	nid = of_node_to_nid(dn);
+	if (likely((nid) >= 0)) {
+		if (!node_online(nid)) {
+			if (__register_one_node(nid)) {
+				pr_err("PCI: Failed to register node %d\n", nid);
+			} else {
+				update_numa_distance(dn);
+				node_set_online(nid);
+			}
+		}
+	}
+
 	phb = pcibios_alloc_controller(dn);
 	if (!phb)
 		return NULL;
-- 
2.44.0


* Re: [PATCH 1/1] powerpc/numa: Online a node if PHB is attached.
  2024-05-16  8:12 ` [PATCH 1/1] powerpc/numa: Online a node if PHB is attached Nilay Shroff
@ 2024-05-17  8:58   ` kernel test robot
  0 siblings, 0 replies; 3+ messages in thread
From: kernel test robot @ 2024-05-17  8:58 UTC (permalink / raw)
  To: Nilay Shroff, mpe, npiggin, christophe.leroy, naveen.n.rao
  Cc: Nilay Shroff, llvm, sshegde, gjoyce, srikar, oe-kbuild-all,
	linuxppc-dev

Hi Nilay,

kernel test robot noticed the following build warnings:

[auto build test WARNING on powerpc/next]
[also build test WARNING on powerpc/fixes linus/master v6.9 next-20240517]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Nilay-Shroff/powerpc-numa-Online-a-node-if-PHB-is-attached/20240516-201619
base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
patch link:    https://lore.kernel.org/r/20240516081230.3119651-2-nilay%40linux.ibm.com
patch subject: [PATCH 1/1] powerpc/numa: Online a node if PHB is attached.
config: powerpc-allyesconfig (https://download.01.org/0day-ci/archive/20240517/202405171615.NBRa8Poe-lkp@intel.com/config)
compiler: clang version 19.0.0git (https://github.com/llvm/llvm-project d3455f4ddd16811401fa153298fadd2f59f6914e)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240517/202405171615.NBRa8Poe-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202405171615.NBRa8Poe-lkp@intel.com/

All warnings (new ones prefixed by >>):

   In file included from arch/powerpc/mm/numa.c:10:
   In file included from include/linux/memblock.h:12:
   In file included from include/linux/mm.h:2208:
   include/linux/vmstat.h:508:43: warning: arithmetic between different enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') [-Wenum-enum-conversion]
     508 |         return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
         |                            ~~~~~~~~~~~~~~~~~~~~~ ^
     509 |                            item];
         |                            ~~~~
   include/linux/vmstat.h:515:43: warning: arithmetic between different enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') [-Wenum-enum-conversion]
     515 |         return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
         |                            ~~~~~~~~~~~~~~~~~~~~~ ^
     516 |                            NR_VM_NUMA_EVENT_ITEMS +
         |                            ~~~~~~~~~~~~~~~~~~~~~~
   include/linux/vmstat.h:522:36: warning: arithmetic between different enumeration types ('enum node_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
     522 |         return node_stat_name(NR_LRU_BASE + lru) + 3; // skip "nr_"
         |                               ~~~~~~~~~~~ ^ ~~~
   include/linux/vmstat.h:527:43: warning: arithmetic between different enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') [-Wenum-enum-conversion]
     527 |         return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
         |                            ~~~~~~~~~~~~~~~~~~~~~ ^
     528 |                            NR_VM_NUMA_EVENT_ITEMS +
         |                            ~~~~~~~~~~~~~~~~~~~~~~
   include/linux/vmstat.h:536:43: warning: arithmetic between different enumeration types ('enum zone_stat_item' and 'enum numa_stat_item') [-Wenum-enum-conversion]
     536 |         return vmstat_text[NR_VM_ZONE_STAT_ITEMS +
         |                            ~~~~~~~~~~~~~~~~~~~~~ ^
     537 |                            NR_VM_NUMA_EVENT_ITEMS +
         |                            ~~~~~~~~~~~~~~~~~~~~~~
>> arch/powerpc/mm/numa.c:1017:7: warning: variable 'nid' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
    1017 |                 if (associativity) {
         |                     ^~~~~~~~~~~~~
   arch/powerpc/mm/numa.c:1021:14: note: uninitialized use occurs here
    1021 |                 if (likely(nid >= 0) && !node_online(nid))
         |                            ^~~
   include/linux/compiler.h:76:40: note: expanded from macro 'likely'
      76 | # define likely(x)      __builtin_expect(!!(x), 1)
         |                                             ^
   arch/powerpc/mm/numa.c:1017:3: note: remove the 'if' if its condition is always true
    1017 |                 if (associativity) {
         |                 ^~~~~~~~~~~~~~~~~~
   arch/powerpc/mm/numa.c:1014:10: note: initialize the variable 'nid' to silence this warning
    1014 |                 int nid;
         |                        ^
         |                         = 0
   6 warnings generated.


vim +1017 arch/powerpc/mm/numa.c

   896	
   897	static int __init parse_numa_properties(void)
   898	{
   899		struct device_node *memory, *pci;
   900		int default_nid = 0;
   901		unsigned long i;
   902		const __be32 *associativity;
   903	
   904		if (numa_enabled == 0) {
   905			pr_warn("disabled by user\n");
   906			return -1;
   907		}
   908	
   909		primary_domain_index = find_primary_domain_index();
   910	
   911		if (primary_domain_index < 0) {
   912			/*
   913			 * if we fail to parse primary_domain_index from device tree
   914			 * mark the numa disabled, boot with numa disabled.
   915			 */
   916			numa_enabled = false;
   917			return primary_domain_index;
   918		}
   919	
   920		pr_debug("associativity depth for CPU/Memory: %d\n", primary_domain_index);
   921	
   922		/*
   923		 * If it is FORM2 initialize the distance table here.
   924		 */
   925		if (affinity_form == FORM2_AFFINITY)
   926			initialize_form2_numa_distance_lookup_table();
   927	
   928		/*
   929		 * Even though we connect cpus to numa domains later in SMP
   930		 * init, we need to know the node ids now. This is because
   931		 * each node to be onlined must have NODE_DATA etc backing it.
   932		 */
   933		for_each_present_cpu(i) {
   934			__be32 vphn_assoc[VPHN_ASSOC_BUFSIZE];
   935			struct device_node *cpu;
   936			int nid = NUMA_NO_NODE;
   937	
   938			memset(vphn_assoc, 0, VPHN_ASSOC_BUFSIZE * sizeof(__be32));
   939	
   940			if (__vphn_get_associativity(i, vphn_assoc) == 0) {
   941				nid = associativity_to_nid(vphn_assoc);
   942				initialize_form1_numa_distance(vphn_assoc);
   943			} else {
   944	
   945				/*
   946				 * Don't fall back to default_nid yet -- we will plug
   947				 * cpus into nodes once the memory scan has discovered
   948				 * the topology.
   949				 */
   950				cpu = of_get_cpu_node(i, NULL);
   951				BUG_ON(!cpu);
   952	
   953				associativity = of_get_associativity(cpu);
   954				if (associativity) {
   955					nid = associativity_to_nid(associativity);
   956					initialize_form1_numa_distance(associativity);
   957				}
   958				of_node_put(cpu);
   959			}
   960	
   961			/* node_set_online() is an UB if 'nid' is negative */
   962			if (likely(nid >= 0))
   963				node_set_online(nid);
   964		}
   965	
   966		get_n_mem_cells(&n_mem_addr_cells, &n_mem_size_cells);
   967	
   968		for_each_node_by_type(memory, "memory") {
   969			unsigned long start;
   970			unsigned long size;
   971			int nid;
   972			int ranges;
   973			const __be32 *memcell_buf;
   974			unsigned int len;
   975	
   976			memcell_buf = of_get_property(memory,
   977				"linux,usable-memory", &len);
   978			if (!memcell_buf || len <= 0)
   979				memcell_buf = of_get_property(memory, "reg", &len);
   980			if (!memcell_buf || len <= 0)
   981				continue;
   982	
   983			/* ranges in cell */
   984			ranges = (len >> 2) / (n_mem_addr_cells + n_mem_size_cells);
   985	new_range:
   986			/* these are order-sensitive, and modify the buffer pointer */
   987			start = read_n_cells(n_mem_addr_cells, &memcell_buf);
   988			size = read_n_cells(n_mem_size_cells, &memcell_buf);
   989	
   990			/*
   991			 * Assumption: either all memory nodes or none will
   992			 * have associativity properties.  If none, then
   993			 * everything goes to default_nid.
   994			 */
   995			associativity = of_get_associativity(memory);
   996			if (associativity) {
   997				nid = associativity_to_nid(associativity);
   998				initialize_form1_numa_distance(associativity);
   999			} else
  1000				nid = default_nid;
  1001	
  1002			fake_numa_create_new_node(((start + size) >> PAGE_SHIFT), &nid);
  1003			node_set_online(nid);
  1004	
  1005			size = numa_enforce_memory_limit(start, size);
  1006			if (size)
  1007				memblock_set_node(start, size, &memblock.memory, nid);
  1008	
  1009			if (--ranges)
  1010				goto new_range;
  1011		}
  1012	
  1013		for_each_node_by_name(pci, "pci") {
  1014			int nid;
  1015	
  1016			associativity = of_get_associativity(pci);
> 1017			if (associativity) {
  1018				nid = associativity_to_nid(associativity);
  1019				initialize_form1_numa_distance(associativity);
  1020			}
  1021			if (likely(nid >= 0) && !node_online(nid))
  1022				node_set_online(nid);
  1023		}
  1024	
  1025		/*
  1026		 * Now do the same thing for each MEMBLOCK listed in the
  1027		 * ibm,dynamic-memory property in the
  1028		 * ibm,dynamic-reconfiguration-memory node.
  1029		 */
  1030		memory = of_find_node_by_path("/ibm,dynamic-reconfiguration-memory");
  1031		if (memory) {
  1032			walk_drmem_lmbs(memory, NULL, numa_setup_drmem_lmb);
  1033			of_node_put(memory);
  1034		}
  1035	
  1036		return 0;
  1037	}
  1038	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
