* [PATCH] mm: percpu: Introduce normalized CPU-to-NUMA node mapping to reduce max_distance
@ 2025-07-22 4:14 Jia He
2025-07-22 5:45 ` Greg Kroah-Hartman
` (2 more replies)
0 siblings, 3 replies; 7+ messages in thread
From: Jia He @ 2025-07-22 4:14 UTC (permalink / raw)
To: Greg Kroah-Hartman, Rafael J. Wysocki, Danilo Krummrich
Cc: linux-kernel, Jia He
pcpu_embed_first_chunk() allocates the first percpu chunk via
pcpu_fc_alloc() and uses it as-is, without mapping it into the vmalloc
area. On NUMA systems, this can lead to a sparse CPU->unit mapping,
resulting in a large physical address span (max_distance) and excessive
vmalloc space requirements.
For example, on an arm64 N2 server with 256 CPUs, the memory layout
includes:
[ 0.000000] NUMA: NODE_DATA [mem 0x100fffff0b00-0x100fffffffff]
[ 0.000000] NUMA: NODE_DATA [mem 0x500fffff0b00-0x500fffffffff]
[ 0.000000] NUMA: NODE_DATA [mem 0x600fffff0b00-0x600fffffffff]
[ 0.000000] NUMA: NODE_DATA [mem 0x700ffffbcb00-0x700ffffcbfff]
With the following NUMA distance matrix:
node distances:
node 0 1 2 3
0: 10 16 22 22
1: 16 10 22 22
2: 22 22 10 16
3: 22 22 16 10
In this configuration, pcpu_embed_first_chunk() computes a large
max_distance:
percpu: max_distance=0x5fffbfac0000 too large for vmalloc space 0x7bff70000000
As a result, the allocator falls back to pcpu_page_first_chunk(), which
uses page-by-page allocation with nr_groups = 1, leading to degraded
performance.
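For reference, here is a rough standalone sketch of what max_distance
measures (illustrative only, not the actual mm/percpu.c code; the struct and
function names below are made up for this example):

struct group_area {
        unsigned long base;     /* address of the group's first-chunk area */
        unsigned long size;     /* nr_units * unit_size for that group */
};

/* max_distance: span from the lowest group base to the highest group end */
static unsigned long embed_span(const struct group_area *g, int nr_groups)
{
        unsigned long lo = g[0].base, hi = g[0].base + g[0].size;

        for (int i = 1; i < nr_groups; i++) {
                if (g[i].base < lo)
                        lo = g[i].base;
                if (g[i].base + g[i].size > hi)
                        hi = g[i].base + g[i].size;
        }
        return hi - lo;
}

With one group per NUMA node and node memory placed at 0x1..., 0x5..., 0x6...
and 0x700... as in the NODE_DATA layout above, this span approaches
0x600000000000, which the embed allocator cannot cover with the available
vmalloc space, hence the fallback.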
This patch introduces a normalized CPU-to-NUMA node mapping to mitigate
the issue. Distances of 10 and 16 are treated as local (LOCAL_DISTANCE),
allowing CPUs from nearby nodes to be grouped together. Consequently,
nr_groups will be 2 and pcpu_fc_alloc() uses the normalized node ID to
allocate memory from a common node.
For example:
- cpu0 belongs to node 0
- cpu64 belongs to node 1
Both CPUs are considered local and will allocate memory from node 0.
This normalization reduces max_distance:
percpu: max_distance=0x500000380000, ~64% of vmalloc space 0x7bff70000000
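To make the grouping concrete, here is a small userspace sketch (hypothetical
and self-contained, not the patch code itself; the REMOTE_DISTANCE value
mirrors the kernel's usual definition of 20) that applies the same rule to
the distance matrix above:

#include <stdio.h>

#define REMOTE_DISTANCE 20

int main(void)
{
        int dist[4][4] = {
                { 10, 16, 22, 22 },
                { 16, 10, 22, 22 },
                { 22, 22, 10, 16 },
                { 22, 22, 16, 10 },
        };
        int norm[4] = { -1, -1, -1, -1 };       /* NUMA_NO_NODE stand-in */

        for (int from = 0; from < 4; from++)
                for (int to = 0; to < 4; to++)
                        if (dist[from][to] < REMOTE_DISTANCE) {
                                /* the first node of a "local" pair becomes
                                 * the representative for both */
                                if (norm[from] < 0)
                                        norm[from] = from;
                                if (norm[to] < 0)
                                        norm[to] = norm[from];
                        }

        for (int n = 0; n < 4; n++)
                printf("node %d -> normalized node %d\n", n, norm[n]);
        return 0;
}

With the matrix above this prints 0->0, 1->0, 2->2, 3->2: the four nodes
collapse into two groups, matching the nr_groups = 2 result.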
In addition, add a need_norm flag to indicate that normalization is needed,
i.e. only when cpu_to_norm_node_map[] would differ from cpu_to_node_map[].
Signed-off-by: Jia He <justin.he@arm.com>
---
drivers/base/arch_numa.c | 47 +++++++++++++++++++++++++++++++++++++++-
1 file changed, 46 insertions(+), 1 deletion(-)
diff --git a/drivers/base/arch_numa.c b/drivers/base/arch_numa.c
index c99f2ab105e5..f746d88239e9 100644
--- a/drivers/base/arch_numa.c
+++ b/drivers/base/arch_numa.c
@@ -17,6 +17,8 @@
#include <asm/sections.h>
static int cpu_to_node_map[NR_CPUS] = { [0 ... NR_CPUS-1] = NUMA_NO_NODE };
+static int cpu_to_norm_node_map[NR_CPUS] = { [0 ... NR_CPUS-1] = NUMA_NO_NODE };
+static bool need_norm;
bool numa_off;
@@ -149,9 +151,40 @@ int early_cpu_to_node(int cpu)
return cpu_to_node_map[cpu];
}
+int __init early_cpu_to_norm_node(int cpu)
+{
+ return cpu_to_norm_node_map[cpu];
+}
+
static int __init pcpu_cpu_distance(unsigned int from, unsigned int to)
{
- return node_distance(early_cpu_to_node(from), early_cpu_to_node(to));
+ int distance = node_distance(early_cpu_to_node(from), early_cpu_to_node(to));
+
+ if (distance > LOCAL_DISTANCE && distance < REMOTE_DISTANCE && !need_norm)
+ need_norm = true;
+
+ return distance;
+}
+
+static int __init pcpu_cpu_norm_distance(unsigned int from, unsigned int to)
+{
+ int distance = pcpu_cpu_distance(from, to);
+
+ if (distance >= REMOTE_DISTANCE)
+ return REMOTE_DISTANCE;
+
+ /*
+ * If the distance is in the range [LOCAL_DISTANCE, REMOTE_DISTANCE),
+ * normalize the node map, choose the first local numa node id as its
+ * normalized node id.
+ */
+ if (cpu_to_norm_node_map[from] == NUMA_NO_NODE)
+ cpu_to_norm_node_map[from] = cpu_to_node_map[from];
+
+ if (cpu_to_norm_node_map[to] == NUMA_NO_NODE)
+ cpu_to_norm_node_map[to] = cpu_to_norm_node_map[from];
+
+ return LOCAL_DISTANCE;
}
void __init setup_per_cpu_areas(void)
@@ -169,6 +202,18 @@ void __init setup_per_cpu_areas(void)
PERCPU_DYNAMIC_RESERVE, PAGE_SIZE,
pcpu_cpu_distance,
early_cpu_to_node);
+
+ if (rc < 0 && need_norm) {
+ /* Try the normalized node distance again */
+ pr_info("PERCPU: %s allocator, trying the normalization mode\n",
+ pcpu_fc_names[pcpu_chosen_fc]);
+
+ rc = pcpu_embed_first_chunk(PERCPU_MODULE_RESERVE,
+ PERCPU_DYNAMIC_RESERVE, PAGE_SIZE,
+ pcpu_cpu_norm_distance,
+ early_cpu_to_norm_node);
+ }
+
#ifdef CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK
if (rc < 0)
pr_warn("PERCPU: %s allocator failed (%d), falling back to page size\n",
--
2.34.1
* Re: [PATCH] mm: percpu: Introduce normalized CPU-to-NUMA node mapping to reduce max_distance
2025-07-22 4:14 [PATCH] mm: percpu: Introduce normalized CPU-to-NUMA node mapping to reduce max_distance Jia He
@ 2025-07-22 5:45 ` Greg Kroah-Hartman
2025-07-28 2:54 ` Justin He
2025-07-22 21:40 ` kernel test robot
2025-07-26 12:27 ` kernel test robot
2 siblings, 1 reply; 7+ messages in thread
From: Greg Kroah-Hartman @ 2025-07-22 5:45 UTC (permalink / raw)
To: Jia He; +Cc: Rafael J. Wysocki, Danilo Krummrich, linux-kernel
On Tue, Jul 22, 2025 at 04:14:18AM +0000, Jia He wrote:
> pcpu_embed_first_chunk() allocates the first percpu chunk via
> pcpu_fc_alloc() and used as-is without being mapped into vmalloc area. On
> NUMA systems, this can lead to a sparse CPU->unit mapping, resulting in a
> large physical address span (max_distance) and excessive vmalloc space
> requirements.
Why is the subject line "mm: percpu:" when this is driver-core code?
And if it is mm code, please cc: the mm maintainers and list please.
> For example, on an arm64 N2 server with 256 CPUs, the memory layout
> includes:
> [ 0.000000] NUMA: NODE_DATA [mem 0x100fffff0b00-0x100fffffffff]
> [ 0.000000] NUMA: NODE_DATA [mem 0x500fffff0b00-0x500fffffffff]
> [ 0.000000] NUMA: NODE_DATA [mem 0x600fffff0b00-0x600fffffffff]
> [ 0.000000] NUMA: NODE_DATA [mem 0x700ffffbcb00-0x700ffffcbfff]
>
> With the following NUMA distance matrix:
> node distances:
> node 0 1 2 3
> 0: 10 16 22 22
> 1: 16 10 22 22
> 2: 22 22 10 16
> 3: 22 22 16 10
>
> In this configuration, pcpu_embed_first_chunk() computes a large
> max_distance:
> percpu: max_distance=0x5fffbfac0000 too large for vmalloc space 0x7bff70000000
>
> As a result, the allocator falls back to pcpu_page_first_chunk(), which
> uses page-by-page allocation with nr_groups = 1, leading to degraded
> performance.
But that's intentional, you don't want to go across the nodes, right?
> This patch introduces a normalized CPU-to-NUMA node mapping to mitigate
> the issue. Distances of 10 and 16 are treated as local (LOCAL_DISTANCE),
Why? What is this going to now break on those systems that assumed that
those were NOT local?
> allowing CPUs from nearby nodes to be grouped together. Consequently,
> nr_groups will be 2 and pcpu_fc_alloc() uses the normalized node ID to
> allocate memory from a common node.
>
> For example:
> - cpu0 belongs to node 0
> - cpu64 belongs to node 1
> Both CPUs are considered local and will allocate memory from node 0.
> This normalization reduces max_distance:
> percpu: max_distance=0x500000380000, ~64% of vmalloc space 0x7bff70000000
>
> In addition, add a flag _need_norm_ to indicate the normalization is needed
> iff when cpu_to_norm_node_map[] is different from cpu_to_node_map[].
>
> Signed-off-by: Jia He <justin.he@arm.com>
I think this needs a lot of testing and verification and acks from
maintainers of other arches that can say "this also works for us" before
we can take it, as it has the potential to make major changes to
systems.
What did you test this on?
> ---
> drivers/base/arch_numa.c | 47 +++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 46 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/base/arch_numa.c b/drivers/base/arch_numa.c
> index c99f2ab105e5..f746d88239e9 100644
> --- a/drivers/base/arch_numa.c
> +++ b/drivers/base/arch_numa.c
> @@ -17,6 +17,8 @@
> #include <asm/sections.h>
>
> static int cpu_to_node_map[NR_CPUS] = { [0 ... NR_CPUS-1] = NUMA_NO_NODE };
> +static int cpu_to_norm_node_map[NR_CPUS] = { [0 ... NR_CPUS-1] = NUMA_NO_NODE };
> +static bool need_norm;
Shouldn't these be marked __initdata as you don't touch them afterward?
thanks,
greg k-h
* Re: [PATCH] mm: percpu: Introduce normalized CPU-to-NUMA node mapping to reduce max_distance
2025-07-22 4:14 [PATCH] mm: percpu: Introduce normalized CPU-to-NUMA node mapping to reduce max_distance Jia He
2025-07-22 5:45 ` Greg Kroah-Hartman
@ 2025-07-22 21:40 ` kernel test robot
2025-07-26 12:27 ` kernel test robot
2 siblings, 0 replies; 7+ messages in thread
From: kernel test robot @ 2025-07-22 21:40 UTC (permalink / raw)
To: Jia He, Greg Kroah-Hartman, Rafael J. Wysocki, Danilo Krummrich
Cc: llvm, oe-kbuild-all, linux-kernel, Jia He
Hi Jia,
kernel test robot noticed the following build warnings:
[auto build test WARNING on akpm-mm/mm-everything]
url: https://github.com/intel-lab-lkp/linux/commits/Jia-He/mm-percpu-Introduce-normalized-CPU-to-NUMA-node-mapping-to-reduce-max_distance/20250722-121559
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20250722041418.2024870-1-justin.he%40arm.com
patch subject: [PATCH] mm: percpu: Introduce normalized CPU-to-NUMA node mapping to reduce max_distance
config: arm64-randconfig-001-20250722 (https://download.01.org/0day-ci/archive/20250723/202507230509.juShbryQ-lkp@intel.com/config)
compiler: clang version 22.0.0git (https://github.com/llvm/llvm-project 853c343b45b3e83cc5eeef5a52fc8cc9d8a09252)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250723/202507230509.juShbryQ-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202507230509.juShbryQ-lkp@intel.com/
All warnings (new ones prefixed by >>):
>> drivers/base/arch_numa.c:154:12: warning: no previous prototype for function 'early_cpu_to_norm_node' [-Wmissing-prototypes]
154 | int __init early_cpu_to_norm_node(int cpu)
| ^
drivers/base/arch_numa.c:154:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
154 | int __init early_cpu_to_norm_node(int cpu)
| ^
| static
1 warning generated.
vim +/early_cpu_to_norm_node +154 drivers/base/arch_numa.c
153
> 154 int __init early_cpu_to_norm_node(int cpu)
155 {
156 return cpu_to_norm_node_map[cpu];
157 }
158
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH] mm: percpu: Introduce normalized CPU-to-NUMA node mapping to reduce max_distance
2025-07-22 4:14 [PATCH] mm: percpu: Introduce normalized CPU-to-NUMA node mapping to reduce max_distance Jia He
2025-07-22 5:45 ` Greg Kroah-Hartman
2025-07-22 21:40 ` kernel test robot
@ 2025-07-26 12:27 ` kernel test robot
2 siblings, 0 replies; 7+ messages in thread
From: kernel test robot @ 2025-07-26 12:27 UTC (permalink / raw)
To: Jia He, Greg Kroah-Hartman, Rafael J. Wysocki, Danilo Krummrich
Cc: oe-kbuild-all, linux-kernel, Jia He
Hi Jia,
kernel test robot noticed the following build warnings:
[auto build test WARNING on akpm-mm/mm-everything]
url: https://github.com/intel-lab-lkp/linux/commits/Jia-He/mm-percpu-Introduce-normalized-CPU-to-NUMA-node-mapping-to-reduce-max_distance/20250722-121559
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20250722041418.2024870-1-justin.he%40arm.com
patch subject: [PATCH] mm: percpu: Introduce normalized CPU-to-NUMA node mapping to reduce max_distance
config: arm64-randconfig-r113-20250725 (https://download.01.org/0day-ci/archive/20250726/202507262015.sw4niVFQ-lkp@intel.com/config)
compiler: aarch64-linux-gcc (GCC) 10.5.0
reproduce: (https://download.01.org/0day-ci/archive/20250726/202507262015.sw4niVFQ-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202507262015.sw4niVFQ-lkp@intel.com/
sparse warnings: (new ones prefixed by >>)
>> drivers/base/arch_numa.c:154:12: sparse: sparse: symbol 'early_cpu_to_norm_node' was not declared. Should it be static?
vim +/early_cpu_to_norm_node +154 drivers/base/arch_numa.c
153
> 154 int __init early_cpu_to_norm_node(int cpu)
155 {
156 return cpu_to_norm_node_map[cpu];
157 }
158
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* RE: [PATCH] mm: percpu: Introduce normalized CPU-to-NUMA node mapping to reduce max_distance
2025-07-22 5:45 ` Greg Kroah-Hartman
@ 2025-07-28 2:54 ` Justin He
2025-07-28 4:28 ` Greg Kroah-Hartman
0 siblings, 1 reply; 7+ messages in thread
From: Justin He @ 2025-07-28 2:54 UTC (permalink / raw)
To: Greg Kroah-Hartman
Cc: Rafael J. Wysocki, Danilo Krummrich, linux-kernel@vger.kernel.org
Hi Greg
> -----Original Message-----
> From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Sent: Tuesday, July 22, 2025 1:45 PM
> To: Justin He <Justin.He@arm.com>
> Cc: Rafael J. Wysocki <rafael@kernel.org>; Danilo Krummrich
> <dakr@kernel.org>; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH] mm: percpu: Introduce normalized CPU-to-NUMA node
> mapping to reduce max_distance
>
> On Tue, Jul 22, 2025 at 04:14:18AM +0000, Jia He wrote:
> > pcpu_embed_first_chunk() allocates the first percpu chunk via
> > pcpu_fc_alloc() and used as-is without being mapped into vmalloc area.
> > On NUMA systems, this can lead to a sparse CPU->unit mapping,
> > resulting in a large physical address span (max_distance) and
> > excessive vmalloc space requirements.
>
> Why is the subject line "mm: percpu:" when this is driver-core code?
>
> And if it is mm code, please cc: the mm maintainers and list please.
>
Ok, thanks
> > For example, on an arm64 N2 server with 256 CPUs, the memory layout
> > includes:
> > [ 0.000000] NUMA: NODE_DATA [mem 0x100fffff0b00-0x100fffffffff]
> > [ 0.000000] NUMA: NODE_DATA [mem 0x500fffff0b00-0x500fffffffff]
> > [ 0.000000] NUMA: NODE_DATA [mem 0x600fffff0b00-0x600fffffffff]
> > [ 0.000000] NUMA: NODE_DATA [mem 0x700ffffbcb00-0x700ffffcbfff]
> >
> > With the following NUMA distance matrix:
> > node distances:
> > node 0 1 2 3
> > 0: 10 16 22 22
> > 1: 16 10 22 22
> > 2: 22 22 10 16
> > 3: 22 22 16 10
> >
> > In this configuration, pcpu_embed_first_chunk() computes a large
> > max_distance:
> > percpu: max_distance=0x5fffbfac0000 too large for vmalloc space
> > 0x7bff70000000
> >
> > As a result, the allocator falls back to pcpu_page_first_chunk(),
> > which uses page-by-page allocation with nr_groups = 1, leading to
> > degraded performance.
>
> But that's intentional, you don't want to go across the nodes, right?
My intention is to
>
> > This patch introduces a normalized CPU-to-NUMA node mapping to
> > mitigate the issue. Distances of 10 and 16 are treated as local
> > (LOCAL_DISTANCE),
>
> Why? What is this going to now break on those systems that assumed that
> those were NOT local?
The normalization only affects percpu allocations - possibly only dynamic ones.
Other mechanisms, such as cpu_to_node_map, remain unaffected and continue
to function as before in those contexts.
>
> > allowing CPUs from nearby nodes to be grouped together. Consequently,
> > nr_groups will be 2 and pcpu_fc_alloc() uses the normalized node ID to
> > allocate memory from a common node.
> >
> > For example:
> > - cpu0 belongs to node 0
> > - cpu64 belongs to node 1
> > Both CPUs are considered local and will allocate memory from node 0.
> > This normalization reduces max_distance:
> > percpu: max_distance=0x500000380000, ~64% of vmalloc space
> > 0x7bff70000000
> >
> > In addition, add a flag _need_norm_ to indicate the normalization is
> > needed iff when cpu_to_norm_node_map[] is different from
> cpu_to_node_map[].
> >
> > Signed-off-by: Jia He <justin.he@arm.com>
>
> I think this needs a lot of testing and verification and acks from maintainers of
> other arches that can say "this also works for us" before we can take it, as it
> has the potential to make major changes to systems.
Ok, understood.
>
> What did you test this on?
>
This was conducted on an Arm64 N2 server with 256 CPUs and 64 GB of memory.
(Apologies, but I am not authorized to disclose the exact hardware specifications.)
> > ---
> > drivers/base/arch_numa.c | 47
> > +++++++++++++++++++++++++++++++++++++++-
> > 1 file changed, 46 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/base/arch_numa.c b/drivers/base/arch_numa.c index
> > c99f2ab105e5..f746d88239e9 100644
> > --- a/drivers/base/arch_numa.c
> > +++ b/drivers/base/arch_numa.c
> > @@ -17,6 +17,8 @@
> > #include <asm/sections.h>
> >
> > static int cpu_to_node_map[NR_CPUS] = { [0 ... NR_CPUS-1] =
> > NUMA_NO_NODE };
> > +static int cpu_to_norm_node_map[NR_CPUS] = { [0 ... NR_CPUS-1] =
> > +NUMA_NO_NODE }; static bool need_norm;
>
> Shouldn't these be marked __initdata as you don't touch them afterward?
Yes
---
Cheers,
Justin He(Jia He)
* Re: [PATCH] mm: percpu: Introduce normalized CPU-to-NUMA node mapping to reduce max_distance
2025-07-28 2:54 ` Justin He
@ 2025-07-28 4:28 ` Greg Kroah-Hartman
2025-07-28 6:14 ` Justin He
0 siblings, 1 reply; 7+ messages in thread
From: Greg Kroah-Hartman @ 2025-07-28 4:28 UTC (permalink / raw)
To: Justin He
Cc: Rafael J. Wysocki, Danilo Krummrich, linux-kernel@vger.kernel.org
On Mon, Jul 28, 2025 at 02:54:42AM +0000, Justin He wrote:
> Hi Greg
>
> > -----Original Message-----
> > From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> > Sent: Tuesday, July 22, 2025 1:45 PM
> > To: Justin He <Justin.He@arm.com>
> > Cc: Rafael J. Wysocki <rafael@kernel.org>; Danilo Krummrich
> > <dakr@kernel.org>; linux-kernel@vger.kernel.org
> > Subject: Re: [PATCH] mm: percpu: Introduce normalized CPU-to-NUMA node
Odd quoting, please fix your email client :(
> > > In this configuration, pcpu_embed_first_chunk() computes a large
> > > max_distance:
> > > percpu: max_distance=0x5fffbfac0000 too large for vmalloc space
> > > 0x7bff70000000
> > >
> > > As a result, the allocator falls back to pcpu_page_first_chunk(),
> > > which uses page-by-page allocation with nr_groups = 1, leading to
> > > degraded performance.
> >
> > But that's intentional, you don't want to go across the nodes, right?
> My intention is to
Did something get dropped?
> > > This patch introduces a normalized CPU-to-NUMA node mapping to
> > > mitigate the issue. Distances of 10 and 16 are treated as local
> > > (LOCAL_DISTANCE),
> >
> > Why? What is this going to now break on those systems that assumed that
> > those were NOT local?
> The normalization only affects percpu allocations - possibly only dynamic ones.
"possibly" doesn't instill much confidence here...
> Other mechanisms, such as cpu_to_node_map, remain unaffected and continue
> to function as before in those contexts.
percpu allocations are the "hottest" path we have, so without testing
this on systems that were working well before your change, I don't think
we could ever accept this, right?
> > What did you test this on?
> >
> This was conducted on an Arm64 N2 server with 256 CPUs and 64 GB of memory.
> (Apologies, but I am not authorized to disclose the exact hardware specifications.)
That's fine, but why didn't you test this on older systems that this
code was originally written for? You don't want to have regressions on
them, right?
thanks,
greg k-h
* RE: [PATCH] mm: percpu: Introduce normalized CPU-to-NUMA node mapping to reduce max_distance
2025-07-28 4:28 ` Greg Kroah-Hartman
@ 2025-07-28 6:14 ` Justin He
0 siblings, 0 replies; 7+ messages in thread
From: Justin He @ 2025-07-28 6:14 UTC (permalink / raw)
To: Greg Kroah-Hartman
Cc: Rafael J. Wysocki, Danilo Krummrich, linux-kernel@vger.kernel.org
Hi Greg,
> -----Original Message-----
> From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Sent: Monday, July 28, 2025 12:28 PM
> To: Justin He <Justin.He@arm.com>
> Cc: Rafael J. Wysocki <rafael@kernel.org>; Danilo Krummrich
> <dakr@kernel.org>; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH] mm: percpu: Introduce normalized CPU-to-NUMA node
> mapping to reduce max_distance
>
> On Mon, Jul 28, 2025 at 02:54:42AM +0000, Justin He wrote:
> > Hi Greg
> >
> > > -----Original Message-----
> > > From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> > > Sent: Tuesday, July 22, 2025 1:45 PM
> > > To: Justin He <Justin.He@arm.com>
> > > Cc: Rafael J. Wysocki <rafael@kernel.org>; Danilo Krummrich
> > > <dakr@kernel.org>; linux-kernel@vger.kernel.org
> > > Subject: Re: [PATCH] mm: percpu: Introduce normalized CPU-to-NUMA
> > > node
>
> Odd quoting, please fix your email client :(
>
> > > > In this configuration, pcpu_embed_first_chunk() computes a large
> > > > max_distance:
> > > > percpu: max_distance=0x5fffbfac0000 too large for vmalloc space
> > > > 0x7bff70000000
> > > >
> > > > As a result, the allocator falls back to pcpu_page_first_chunk(),
> > > > which uses page-by-page allocation with nr_groups = 1, leading to
> > > > degraded performance.
> > >
> > > But that's intentional, you don't want to go across the nodes, right?
> > My intention is to
>
> Did something get dropped?
>
Sorry, the previous text should be:
My intention is to optimize the percpu allocation path so that it retries
with the normalization before falling back to pcpu_page_first_chunk().
> > > > This patch introduces a normalized CPU-to-NUMA node mapping to
> > > > mitigate the issue. Distances of 10 and 16 are treated as local
> > > > (LOCAL_DISTANCE),
> > >
> > > Why? What is this going to now break on those systems that assumed
> > > that those were NOT local?
> > The normalization only affects percpu allocations - possibly only dynamic
> ones.
>
> "possibly" doesn't instill much confidence here...
>
> > Other mechanisms, such as cpu_to_node_map, remain unaffected and
> > continue to function as before in those contexts.
>
> percpu allocations are the "hottest" path we have, so without testing this on
> systems that were working well before your change, I don't think we could
> ever accept this, right?
>
> > > What did you test this on?
> > >
> > This was conducted on an Arm64 N2 server with 256 CPUs and 64 GB of
> memory.
> > (Apologies, but I am not authorized to disclose the exact hardware
> > specifications.)
>
> That's fine, but why didn't you test this on older systems that this code was
> originally written for? You don't want to have regressions on them, right?
Besides the N2 server I mentioned in the commit msg, I tested this on an
ARM64 N2 legacy system with 2 nodes, 128 CPUs, and 128 GB of memory.
It works well both with and without the patch.
The updated pseudo-code logic is as follows:
- Attempt pcpu_embed_first_chunk() — original logic
- If it fails and normalization is worthwhile, retry pcpu_embed_first_chunk() — with the patch, in normalization mode
- If it still fails, fall back to pcpu_page_first_chunk()
In practice, I believe most legacy systems won't enter normalization mode, except for my N2 server.
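Condensed into code, the flow described above is roughly the following (a
simplified sketch of the patched setup_per_cpu_areas(); error reporting and
the exact page-mode call are elided):

rc = pcpu_embed_first_chunk(PERCPU_MODULE_RESERVE, PERCPU_DYNAMIC_RESERVE,
                            PAGE_SIZE, pcpu_cpu_distance, early_cpu_to_node);

if (rc < 0 && need_norm)
        /* step 2: retry the embed allocator with normalized distances */
        rc = pcpu_embed_first_chunk(PERCPU_MODULE_RESERVE,
                                    PERCPU_DYNAMIC_RESERVE, PAGE_SIZE,
                                    pcpu_cpu_norm_distance,
                                    early_cpu_to_norm_node);

#ifdef CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK
if (rc < 0)
        /* step 3: page-by-page fallback (nr_groups = 1) */
        rc = pcpu_page_first_chunk(/* ... as in the existing code ... */);
#endif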
---
Cheers,
Justin He(Jia He)