* [RFC Patch 0/3] mm/slub: reduce contention for per-node list_lock for large systems
@ 2023-09-05 14:13 Feng Tang
  2023-09-05 14:13 ` [RFC Patch 1/3] mm/slub: increase the maximum slab order to 4 for big systems Feng Tang
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Feng Tang @ 2023-09-05 14:13 UTC (permalink / raw)
  To: Vlastimil Babka, Andrew Morton, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Roman Gushchin, Hyeonggon Yoo,
	linux-mm, linux-kernel
  Cc: Feng Tang

Hi All,

Please help to review the ideas and patches, thanks!

Problem
-------

The 0Day bot found a performance regression in 'hackbench' related to
slub's per-node 'list_lock' contention [1]. The same lock contention
is also seen when running the will-it-scale/mmap1 benchmark on a
rather big system with 2 sockets and 224 CPUs, where the lock
contention can take up to 76% of CPU cycles.

As the trend is for one processor (socket) to have more and more CPU
cores, the contention will only get more severe, and we need to
tackle it sooner or later.

Possible mitigations
--------------------

There are 3 directions we can try; they have no dependency on each
other and can be taken separately or combined:

1) increase the order of each slab (including raising the maximum
   slub order from 3 to 4)
2) increase the number of per-cpu partial slabs
3) increase MIN_PARTIAL and MAX_PARTIAL to let each node keep more
   (up to 64) partial slabs

For reducing the lock contention and improving performance, #1 is the
most effective, with #2 second.
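
For reference, a rough sketch of where these knobs live in mm/slub.c
(constants as referenced elsewhere in this thread; illustrative only,
not a patch):

    /* #1: maximum page order of a slab (raised from 3 to 4 by patch 1) */
    static unsigned int slub_max_order =
            IS_ENABLED(CONFIG_SLUB_TINY) ? 1 : PAGE_ALLOC_COSTLY_ORDER; /* 3 */

    /* #2: per-cpu partial budget, chosen per cache in set_cpu_partial()
     *     as an object count and then converted to a slab count
     *     (doubled by patch 2 for big systems)
     */

    /* #3: bounds for each node's partial list (scaled up by patch 3) */
    #define MIN_PARTIAL 5
    #define MAX_PARTIAL 10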

Please note that the 3 patches just present each idea separately to
collect review and comments first, and are NOT targeted for merge.
Patch 2 does not even apply on top of patch 1.

A similar regression related to 'list_lock' contention was found when
testing 'hackbench' with the new 'eevdf' scheduler patchset, and a
rough combination of these patches cures the performance drop [2].


Performance data
----------------

We showed some rough performance data in a previous discussion:
https://lore.kernel.org/all/ZO2smdi83wWwZBsm@feng-clx/

Following is performance data for the 'mmap1' case of 'will-it-scale'
and the 'hackbench' test mentioned in [1]. For the 'mmap1' case, we
run 3 configurations with parallel test threads at 25%, 50% and 100%
of the number of CPUs.

The test HW is a 2-socket Sapphire Rapids server (112 cores / 224
threads) with 256 GB DRAM; the base kernel is vanilla v6.5.

1) order increasing patch

   * will-it-scale/mmap1:
   
   		     base                      base+patch
   wis-mmap1-25%    223670           +33.3%     298205        per_process_ops
   wis-mmap1-50%    186020           +51.8%     282383        per_process_ops
   wis-mmap1-100%    89200           +65.0%     147139        per_process_ops
   
   The perf-profile comparison of the 50% test case shows greatly
   reduced lock contention:
   
         43.80           -30.8       13.04       pp.self.native_queued_spin_lock_slowpath
         0.85            -0.2        0.65        pp.self.___slab_alloc
         0.41            -0.1        0.27        pp.self.__unfreeze_partials
         0.20 ±  2%      -0.1        0.12 ±  4%  pp.self.get_any_partial
   
   * hackbench: 
   
   		     base                      base+patch
   hackbench	    759951           +10.5%     839601        hackbench.throughput
   
   perf-profile diff:
        22.20 ±  3%     -15.2        7.05        pp.self.native_queued_spin_lock_slowpath
         0.82            -0.2        0.59        pp.self.___slab_alloc
         0.33            -0.2        0.13        pp.self.__unfreeze_partials
   
2) increasing per-cpu partial patch

   The patch itself only doubles (2X) the per-cpu partial number; for
   better analysis, a 4X variant is also profiled.

   * will-it-scale/mmap1:

		  base             base + 2X patch        base + 4X patch
   wis-mmap1-25	 223670    +12.7%     251999     +34.9%     301749    per_process_ops
   wis-mmap1-50	 186020    +28.0%     238067     +55.6%     289521    per_process_ops
   wis-mmap1-100  89200    +40.7%     125478     +62.4%     144858    per_process_ops
   
   The perf-profile comparison of the 50% test case shows greatly
   reduced lock contention:
   
        43.80           -11.5       32.27           -27.9       15.91   pp.self.native_queued_spin_lock_slowpath
   
   * hackbench (no obvious improvement)
                 
		 base             base + 2X patch        base + 4X patch
   hackbench	759951      +0.2%    761506      +0.5%     763972     hackbench.throughput
 
3) increasing per-node partial patch

   The patch effectively changes MIN_PARTIAL/MAX_PARTIAL from 5/10
   to 64/128.

    * will-it-scale/mmap1:

   		     base                      base+patch
    wis-mmap1-25%    223670            +0.2%     224035        per_process_ops
    wis-mmap1-50%    186020           +13.0%     210248        per_process_ops
    wis-mmap1-100%    89200           +11.3%      99308        per_process_ops

4) combination patches


			base	            base+patch-3       base+patch-3,1        base+patch-3,1,2
     wis-mmap1-25%	223670      -0.0%     223641     +24.2%     277734    +37.7%     307991     per_process_ops
     wis-mmap1-50%	186172     +12.9%     210108     +42.4%     265028    +59.8%     297495     per_process_ops
     wis-mmap1-100%	 89289     +11.3%      99363     +47.4%     131571    +78.1%     158991     per_process_ops


Make the patch only affect large systems
----------------------------------------
      
In the real world there are many kinds of platforms with different
use cases: large systems with huge numbers of CPUs usually come with
huge amounts of memory, while small devices with limited memory may
care more about memory footprint.

So the idea is to treat them separately: keep the current
order/partial settings for systems with a small number of CPUs, and
scale those settings up according to the CPU count (there is similar
handling in the slub code already). A more aggressive idea would be
to bump them for all systems.
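
A minimal sketch of that gating, as used by the patches below (the
32-CPU threshold is simply the value the patches pick and is open to
discussion):

    /* from patch 1: a CPU count that also works on architectures
     * which only mark CPUs present when onlining them */
    static inline int num_cpus(void)
    {
            int nr_cpus = num_present_cpus();

            if (nr_cpus <= 1)
                    nr_cpus = nr_cpu_ids;

            return nr_cpus;
    }

    /* each tuning site then only scales up on big systems, e.g. in
     * calculate_order(): */
    if (num_cpus() >= 32)
            min_objects *= 2;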

[1]. https://lore.kernel.org/all/202307172140.3b34825a-oliver.sang@intel.com/
[2]. https://lore.kernel.org/lkml/ZORaUsd+So+tnyMV@chenyu5-mobl2/

Thanks,
Feng

Feng Tang (3):
  mm/slub: increase the maximum slab order to 4 for big systems
  mm/slub: setup maxim per-node partial according to cpu numbers
  mm/slub: double per-cpu partial number for large systems

 mm/slub.c | 7 +++++++
 1 file changed, 7 insertions(+)

-- 
2.27.0




* [RFC Patch 1/3] mm/slub: increase the maximum slab order to 4 for big systems
  2023-09-05 14:13 [RFC Patch 0/3] mm/slub: reduce contention for per-node list_lock for large systems Feng Tang
@ 2023-09-05 14:13 ` Feng Tang
  2023-09-12  4:52   ` Hyeonggon Yoo
  2023-09-05 14:13 ` [RFC Patch 2/3] mm/slub: double per-cpu partial number for large systems Feng Tang
  2023-09-05 14:13 ` [RFC Patch 3/3] mm/slub: setup maxim per-node partial according to cpu numbers Feng Tang
  2 siblings, 1 reply; 11+ messages in thread
From: Feng Tang @ 2023-09-05 14:13 UTC (permalink / raw)
  To: Vlastimil Babka, Andrew Morton, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Roman Gushchin, Hyeonggon Yoo,
	linux-mm, linux-kernel
  Cc: Feng Tang

There are reports of severe lock contention on slub's per-node
'list_lock' in the 'hackbench' test [1][2] on server systems, and
similar contention is also seen when running the 'mmap1' case of
will-it-scale on big systems. As the trend is for one processor
(socket) to have more and more CPUs (100+, 200+), the contention
could become much more severe and turn into a scalability issue.

One way to help reduce the contention is to increase the maximum
slab order from 3 to 4 for big systems.

Unconditionally increasing the order could bring trouble to client
devices with a very limited amount of memory, which may care more
about memory footprint, and allocating an order-4 page is also harder
under memory pressure. So the increase is only done for big systems
like servers, which are usually equipped with plenty of memory and
are more likely to hit lock contention issues.

Following is some performance data:

will-it-scale/mmap1
-------------------
Run the will-it-scale benchmark's 'mmap1' test case on a 2-socket
Sapphire Rapids server (112 cores / 224 threads) with 256 GB DRAM, in
3 configurations with parallel test threads at 25%, 50% and 100% of
the number of CPUs. The data is (base is the vanilla v6.5 kernel):

		     base                      base+patch
wis-mmap1-25%	    223670           +33.3%     298205        per_process_ops
wis-mmap1-50%	    186020           +51.8%     282383        per_process_ops
wis-mmap1-100%       89200           +65.0%     147139        per_process_ops

The perf-profile comparison of the 50% test case shows greatly
reduced lock contention:

      43.80           -30.8       13.04       pp.self.native_queued_spin_lock_slowpath
      0.85            -0.2        0.65        pp.self.___slab_alloc
      0.41            -0.1        0.27        pp.self.__unfreeze_partials
      0.20 ±  2%      -0.1        0.12 ±  4%  pp.self.get_any_partial

hackbench
---------

Run the same hackbench test case mentioned in [1], using the same HW/SW as for will-it-scale:

		     base                      base+patch
hackbench	    759951           +10.5%     839601        hackbench.throughput

perf-profile diff:
     22.20 ±  3%     -15.2        7.05        pp.self.native_queued_spin_lock_slowpath
      0.82            -0.2        0.59        pp.self.___slab_alloc
      0.33            -0.2        0.13        pp.self.__unfreeze_partials

[1]. https://lore.kernel.org/all/202307172140.3b34825a-oliver.sang@intel.com/
[2]. https://lore.kernel.org/lkml/ZORaUsd+So+tnyMV@chenyu5-mobl2/
Signed-off-by: Feng Tang <feng.tang@intel.com>
---
 mm/slub.c | 51 ++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 38 insertions(+), 13 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index f7940048138c..09ae1ed642b7 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4081,7 +4081,7 @@ EXPORT_SYMBOL(kmem_cache_alloc_bulk);
  */
 static unsigned int slub_min_order;
 static unsigned int slub_max_order =
-	IS_ENABLED(CONFIG_SLUB_TINY) ? 1 : PAGE_ALLOC_COSTLY_ORDER;
+	IS_ENABLED(CONFIG_SLUB_TINY) ? 1 : 4;
 static unsigned int slub_min_objects;
 
 /*
@@ -4134,6 +4134,26 @@ static inline unsigned int calc_slab_order(unsigned int size,
 	return order;
 }
 
+static inline int num_cpus(void)
+{
+	int nr_cpus;
+
+	/*
+	 * Some architectures will only update present cpus when
+	 * onlining them, so don't trust the number if it's just 1. But
+	 * we also don't want to use nr_cpu_ids always, as on some other
+	 * architectures, there can be many possible cpus, but never
+	 * onlined. Here we compromise between trying to avoid too high
+	 * order on systems that appear larger than they are, and too
+	 * low order on systems that appear smaller than they are.
+	 */
+	nr_cpus = num_present_cpus();
+	if (nr_cpus <= 1)
+		nr_cpus = nr_cpu_ids;
+
+	return nr_cpus;
+}
+
 static inline int calculate_order(unsigned int size)
 {
 	unsigned int order;
@@ -4151,19 +4171,17 @@ static inline int calculate_order(unsigned int size)
 	 */
 	min_objects = slub_min_objects;
 	if (!min_objects) {
-		/*
-		 * Some architectures will only update present cpus when
-		 * onlining them, so don't trust the number if it's just 1. But
-		 * we also don't want to use nr_cpu_ids always, as on some other
-		 * architectures, there can be many possible cpus, but never
-		 * onlined. Here we compromise between trying to avoid too high
-		 * order on systems that appear larger than they are, and too
-		 * low order on systems that appear smaller than they are.
-		 */
-		nr_cpus = num_present_cpus();
-		if (nr_cpus <= 1)
-			nr_cpus = nr_cpu_ids;
+		nr_cpus = num_cpus();
 		min_objects = 4 * (fls(nr_cpus) + 1);
+
+		/*
+		 * If nr_cpus >= 32, the platform is likely to be a server
+		 * which usually has much more memory, and is easier to be
+		 * hurt by scalability issue, so enlarge it to reduce the
+		 * possible contention of the per-node 'list_lock'.
+		 */
+		if (nr_cpus >= 32)
+			min_objects *= 2;
 	}
 	max_objects = order_objects(slub_max_order, size);
 	min_objects = min(min_objects, max_objects);
@@ -4361,6 +4379,13 @@ static void set_cpu_partial(struct kmem_cache *s)
 	else
 		nr_objects = 120;
 
+	/*
+	 * Give larger system more buffer to reduce scalability issue, like
+	 * the handling in calculate_order().
+	 */
+	if (num_cpus() >= 32)
+		nr_objects *= 2;
+
 	slub_set_cpu_partial(s, nr_objects);
 #endif
 }
-- 
2.27.0




* [RFC Patch 2/3] mm/slub: double per-cpu partial number for large systems
  2023-09-05 14:13 [RFC Patch 0/3] mm/slub: reduce contention for per-node list_lock for large systems Feng Tang
  2023-09-05 14:13 ` [RFC Patch 1/3] mm/slub: increase the maximum slab order to 4 for big systems Feng Tang
@ 2023-09-05 14:13 ` Feng Tang
  2023-09-05 14:13 ` [RFC Patch 3/3] mm/slub: setup maxim per-node partial according to cpu numbers Feng Tang
  2 siblings, 0 replies; 11+ messages in thread
From: Feng Tang @ 2023-09-05 14:13 UTC (permalink / raw)
  To: Vlastimil Babka, Andrew Morton, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Roman Gushchin, Hyeonggon Yoo,
	linux-mm, linux-kernel
  Cc: Feng Tang

There are reports of severe lock contention on slub's per-node
'list_lock' in the 'hackbench' test [1][2] on server systems, and
similar contention is also seen when running the 'mmap1' case of
will-it-scale on big systems. As the trend is for one processor
(socket) to have more and more CPUs (100+, 200+), the contention
could become much more severe and turn into a scalability issue.

One way to help reduce the contention is to double the per-cpu
partial number for large systems.

Following is some performance data, which shows a big improvement in
the will-it-scale/mmap1 case but no obvious change for the
'hackbench' test.

The patch itself only doubles (2X) the per-cpu partial number; for
better analysis, a 4X variant is also profiled.

will-it-scale/mmap1
-------------------
Run the will-it-scale benchmark's 'mmap1' test case on a 2-socket
Sapphire Rapids server (112 cores / 224 threads) with 256 GB DRAM, in
3 configurations with parallel test threads at 25%, 50% and 100% of
the number of CPUs. The data is (base is the vanilla v6.5 kernel):

		  base             base + 2X patch        base + 4X patch
wis-mmap1-25	 223670    +12.7%     251999     +34.9%     301749    per_process_ops
wis-mmap1-50	 186020    +28.0%     238067     +55.6%     289521    per_process_ops
wis-mmap1-100	  89200    +40.7%     125478     +62.4%     144858    per_process_ops

The perf-profile comparison of the 50% test case shows greatly
reduced lock contention:

     43.80           -11.5       32.27           -27.9       15.91   pp.self.native_queued_spin_lock_slowpath

hackbench
---------

Run the same hackbench test case mentioned in [1], using the same HW/SW as for will-it-scale:

		  base             base + 2X patch        base + 4X patch
hackbench	759951      +0.2%    761506      +0.5%     763972     hackbench.throughput

[1]. https://lore.kernel.org/all/202307172140.3b34825a-oliver.sang@intel.com/
[2]. https://lore.kernel.org/lkml/ZORaUsd+So+tnyMV@chenyu5-mobl2/

Signed-off-by: Feng Tang <feng.tang@intel.com>
---
 mm/slub.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/mm/slub.c b/mm/slub.c
index f7940048138c..51ca6dbaad09 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4361,6 +4361,13 @@ static void set_cpu_partial(struct kmem_cache *s)
 	else
 		nr_objects = 120;
 
+	/*
+	 * Give larger system more per-cpu partial slabs to reduce/postpone
+	 * contending per-node partial list.
+	 */
+	if (num_cpus() >= 32)
+		nr_objects *= 2;
+
 	slub_set_cpu_partial(s, nr_objects);
 #endif
 }
-- 
2.27.0




* [RFC Patch 3/3] mm/slub: setup maxim per-node partial according to cpu numbers
  2023-09-05 14:13 [RFC Patch 0/3] mm/slub: reduce contention for per-node list_lock for large systems Feng Tang
  2023-09-05 14:13 ` [RFC Patch 1/3] mm/slub: increase the maximum slab order to 4 for big systems Feng Tang
  2023-09-05 14:13 ` [RFC Patch 2/3] mm/slub: double per-cpu partial number for large systems Feng Tang
@ 2023-09-05 14:13 ` Feng Tang
  2023-09-12  4:48   ` Hyeonggon Yoo
  2 siblings, 1 reply; 11+ messages in thread
From: Feng Tang @ 2023-09-05 14:13 UTC (permalink / raw)
  To: Vlastimil Babka, Andrew Morton, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Roman Gushchin, Hyeonggon Yoo,
	linux-mm, linux-kernel
  Cc: Feng Tang

Currently most slabs' min_partial is set to 5 (as MIN_PARTIAL is 5).
This is fine for older or small systems, but could be too small for a
large system with hundreds of CPUs, where the per-node 'list_lock' is
contended when allocating from and freeing to the per-node partial
list.

So enlarge it based on the number of CPUs per node.
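
For example (plain arithmetic from the change below), on the 2-socket
/ 224-CPU test server with 2 NUMA nodes, num_cpus() / num_cpu_nodes()
is 112, which rounds down to the power of two 64, so min_partial goes
from the current 5 to 64 (and the effective upper bound in the min_t()
from 10 to 128), matching the 5/10 -> 64/128 numbers quoted in the
cover letter.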

Signed-off-by: Feng Tang <feng.tang@intel.com>
---
 include/linux/nodemask.h | 1 +
 mm/slub.c                | 9 +++++++--
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
index 8d07116caaf1..6e22caab186d 100644
--- a/include/linux/nodemask.h
+++ b/include/linux/nodemask.h
@@ -530,6 +530,7 @@ static inline int node_random(const nodemask_t *maskp)
 
 #define num_online_nodes()	num_node_state(N_ONLINE)
 #define num_possible_nodes()	num_node_state(N_POSSIBLE)
+#define num_cpu_nodes()		num_node_state(N_CPU)
 #define node_online(node)	node_state((node), N_ONLINE)
 #define node_possible(node)	node_state((node), N_POSSIBLE)
 
diff --git a/mm/slub.c b/mm/slub.c
index 09ae1ed642b7..984e012d7bbc 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -4533,6 +4533,7 @@ static int calculate_sizes(struct kmem_cache *s)
 
 static int kmem_cache_open(struct kmem_cache *s, slab_flags_t flags)
 {
+	unsigned long min_partial;
 	s->flags = kmem_cache_flags(s->size, flags, s->name);
 #ifdef CONFIG_SLAB_FREELIST_HARDENED
 	s->random = get_random_long();
@@ -4564,8 +4565,12 @@ static int kmem_cache_open(struct kmem_cache *s, slab_flags_t flags)
 	 * The larger the object size is, the more slabs we want on the partial
 	 * list to avoid pounding the page allocator excessively.
 	 */
-	s->min_partial = min_t(unsigned long, MAX_PARTIAL, ilog2(s->size) / 2);
-	s->min_partial = max_t(unsigned long, MIN_PARTIAL, s->min_partial);
+
+	min_partial = rounddown_pow_of_two(num_cpus() / num_cpu_nodes());
+	min_partial = max_t(unsigned long, MIN_PARTIAL, min_partial);
+
+	s->min_partial = min_t(unsigned long, min_partial * 2, ilog2(s->size) / 2);
+	s->min_partial = max_t(unsigned long, min_partial, s->min_partial);
 
 	set_cpu_partial(s);
 
-- 
2.27.0




* Re: [RFC Patch 3/3] mm/slub: setup maxim per-node partial according to cpu numbers
  2023-09-05 14:13 ` [RFC Patch 3/3] mm/slub: setup maxim per-node partial according to cpu numbers Feng Tang
@ 2023-09-12  4:48   ` Hyeonggon Yoo
  2023-09-14  7:05     ` Feng Tang
  0 siblings, 1 reply; 11+ messages in thread
From: Hyeonggon Yoo @ 2023-09-12  4:48 UTC (permalink / raw)
  To: Feng Tang
  Cc: Vlastimil Babka, Andrew Morton, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Roman Gushchin, linux-mm,
	linux-kernel

On Tue, Sep 5, 2023 at 11:07 PM Feng Tang <feng.tang@intel.com> wrote:
>
> Currently most of the slab's min_partial is set to 5 (as MIN_PARTIAL
> is 5). This is fine for older or small systesms, and could be too
> small for a large system with hundreds of CPUs, when per-node
> 'list_lock' is contended for allocating from and freeing to per-node
> partial list.
>
> So enlarge it based on the CPU numbers per node.
>
> Signed-off-by: Feng Tang <feng.tang@intel.com>
> ---
>  include/linux/nodemask.h | 1 +
>  mm/slub.c                | 9 +++++++--
>  2 files changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
> index 8d07116caaf1..6e22caab186d 100644
> --- a/include/linux/nodemask.h
> +++ b/include/linux/nodemask.h
> @@ -530,6 +530,7 @@ static inline int node_random(const nodemask_t *maskp)
>
>  #define num_online_nodes()     num_node_state(N_ONLINE)
>  #define num_possible_nodes()   num_node_state(N_POSSIBLE)
> +#define num_cpu_nodes()                num_node_state(N_CPU)
>  #define node_online(node)      node_state((node), N_ONLINE)
>  #define node_possible(node)    node_state((node), N_POSSIBLE)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index 09ae1ed642b7..984e012d7bbc 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -4533,6 +4533,7 @@ static int calculate_sizes(struct kmem_cache *s)
>
>  static int kmem_cache_open(struct kmem_cache *s, slab_flags_t flags)
>  {
> +       unsigned long min_partial;
>         s->flags = kmem_cache_flags(s->size, flags, s->name);
>  #ifdef CONFIG_SLAB_FREELIST_HARDENED
>         s->random = get_random_long();
> @@ -4564,8 +4565,12 @@ static int kmem_cache_open(struct kmem_cache *s, slab_flags_t flags)
>          * The larger the object size is, the more slabs we want on the partial
>          * list to avoid pounding the page allocator excessively.
>          */
> -       s->min_partial = min_t(unsigned long, MAX_PARTIAL, ilog2(s->size) / 2);
> -       s->min_partial = max_t(unsigned long, MIN_PARTIAL, s->min_partial);
> +
> +       min_partial = rounddown_pow_of_two(num_cpus() / num_cpu_nodes());
> +       min_partial = max_t(unsigned long, MIN_PARTIAL, min_partial);
> +
> +       s->min_partial = min_t(unsigned long, min_partial * 2, ilog2(s->size) / 2);
> +       s->min_partial = max_t(unsigned long, min_partial, s->min_partial);

Hello Feng,

How much memory is consumed by this change on your machine?

I won't argue that it would be huge for large machines, but it
increases the minimum value for every cache (even for those that are
not contended) and there is no way to reclaim this.

Maybe a way to reclaim a full slab on memory pressure (on buddy side)
wouldn't hurt?

>         set_cpu_partial(s);
>
> --
> 2.27.0
>



* Re: [RFC Patch 1/3] mm/slub: increase the maximum slab order to 4 for big systems
  2023-09-05 14:13 ` [RFC Patch 1/3] mm/slub: increase the maximum slab order to 4 for big systems Feng Tang
@ 2023-09-12  4:52   ` Hyeonggon Yoo
  2023-09-12 15:52     ` Feng Tang
  0 siblings, 1 reply; 11+ messages in thread
From: Hyeonggon Yoo @ 2023-09-12  4:52 UTC (permalink / raw)
  To: Feng Tang
  Cc: Vlastimil Babka, Andrew Morton, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Roman Gushchin, linux-mm,
	linux-kernel

On Tue, Sep 5, 2023 at 11:07 PM Feng Tang <feng.tang@intel.com> wrote:
>
> There are reports about severe lock contention for slub's per-node
> 'list_lock' in 'hackbench' test, [1][2], on server systems. And
> similar contention is also seen when running 'mmap1' case of
> will-it-scale on big systems. As the trend is one processor (socket)
> will have more and more CPUs (100+, 200+), the contention could be
> much more severe and becomes a scalability issue.
>
> One way to help reducing the contention is to increase the maximum
> slab order from 3 to 4, for big systems.

Hello Feng,

Increasing the order with a higher number of CPUs (and so with more
memory) makes sense to me. IIUC the contention here becomes worse
when the number of slabs increases, so it makes sense to decrease the
number of slabs by increasing the order.

By the way, my silly question here is:
In the first place, is it worth taking 1/2 of s->cpu_partial_slabs in
the slowpath when the slab is frequently used? Wouldn't the cpu
partial slab list be re-filled again by free if free operations are
frequently performed?

> Unconditionally increasing the order could  bring trouble to client
> devices with very limited size of memory, which may care more about
> memory footprint, also allocating order 4 page could be harder under
> memory pressure. So the increase will only be done for big systems
> like servers, which usually are equipped with plenty of memory and
> easier to hit lock contention issues.

Also, does it make sense not to increase the order when PAGE_SIZE > 4096?

> Following is some performance data:
>
> will-it-scale/mmap1
> -------------------
> Run will-it-scale benchmark's 'mmap1' test case on a 2 socket Sapphire
> Rapids server (112 cores / 224 threads) with 256 GB DRAM, run 3
> configurations with parallel test threads of 25%, 50% and 100% of
> number of CPUs, and the data is (base is vanilla v6.5 kernel):
>
>                      base                      base+patch
> wis-mmap1-25%       223670           +33.3%     298205        per_process_ops
> wis-mmap1-50%       186020           +51.8%     282383        per_process_ops
> wis-mmap1-100%       89200           +65.0%     147139        per_process_ops
>
> Take the perf-profile comparasion of 50% test case, the lock contention
> is greatly reduced:
>
>       43.80           -30.8       13.04       pp.self.native_queued_spin_lock_slowpath
>       0.85            -0.2        0.65        pp.self.___slab_alloc
>       0.41            -0.1        0.27        pp.self.__unfreeze_partials
>       0.20 ±  2%      -0.1        0.12 ±  4%  pp.self.get_any_partial
>
> hackbench
> ---------
>
> Run same hackbench testcase  mentioned in [1], use same HW/SW as will-it-scale:
>
>                      base                      base+patch
> hackbench           759951           +10.5%     839601        hackbench.throughput
>
> perf-profile diff:
>      22.20 ±  3%     -15.2        7.05        pp.self.native_queued_spin_lock_slowpath
>       0.82            -0.2        0.59        pp.self.___slab_alloc
>       0.33            -0.2        0.13        pp.self.__unfreeze_partials
>
> [1]. https://lore.kernel.org/all/202307172140.3b34825a-oliver.sang@intel.com/
> [2]. ttps://lore.kernel.org/lkml/ZORaUsd+So+tnyMV@chenyu5-mobl2/
> Signed-off-by: Feng Tang <feng.tang@intel.com>

> ---
>  mm/slub.c | 51 ++++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 38 insertions(+), 13 deletions(-)
>
> diff --git a/mm/slub.c b/mm/slub.c
> index f7940048138c..09ae1ed642b7 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -4081,7 +4081,7 @@ EXPORT_SYMBOL(kmem_cache_alloc_bulk);
>   */
>  static unsigned int slub_min_order;
>  static unsigned int slub_max_order =
> -       IS_ENABLED(CONFIG_SLUB_TINY) ? 1 : PAGE_ALLOC_COSTLY_ORDER;
> +       IS_ENABLED(CONFIG_SLUB_TINY) ? 1 : 4;
>  static unsigned int slub_min_objects;
>
>  /*
> @@ -4134,6 +4134,26 @@ static inline unsigned int calc_slab_order(unsigned int size,
>         return order;
>  }



* Re: [RFC Patch 1/3] mm/slub: increase the maximum slab order to 4 for big systems
  2023-09-12  4:52   ` Hyeonggon Yoo
@ 2023-09-12 15:52     ` Feng Tang
  0 siblings, 0 replies; 11+ messages in thread
From: Feng Tang @ 2023-09-12 15:52 UTC (permalink / raw)
  To: Hyeonggon Yoo
  Cc: Vlastimil Babka, Andrew Morton, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Roman Gushchin, linux-mm,
	linux-kernel

Hi Hyeonggon,

Many thanks for the review!

On Tue, Sep 12, 2023 at 01:52:19PM +0900, Hyeonggon Yoo wrote:
> On Tue, Sep 5, 2023 at 11:07 PM Feng Tang <feng.tang@intel.com> wrote:
> >
> > There are reports about severe lock contention for slub's per-node
> > 'list_lock' in 'hackbench' test, [1][2], on server systems. And
> > similar contention is also seen when running 'mmap1' case of
> > will-it-scale on big systems. As the trend is one processor (socket)
> > will have more and more CPUs (100+, 200+), the contention could be
> > much more severe and becomes a scalability issue.
> >
> > One way to help reducing the contention is to increase the maximum
> > slab order from 3 to 4, for big systems.
> 
> Hello Feng,
> 
> Increasing order with a higher number of CPUs (and so with more
> memory) makes sense to me.
> IIUC the contention here becomes worse when the number of slabs
> increases, so it makes sense to
> decrease the number of slabs by increasing order.
> 
> By the way, my silly question here is:
> In the first place, is it worth taking 1/2 of s->cpu_partial_slabs in
> the slowpath when slab is frequently used?
> wouldn't the cpu partial slab list be re-filled again by free if free
> operations are frequently performed?

My understanding is that the contention is related to the number of
objects available to each CPU (in the current slab and on the per-cpu
partial list); if they are used up more easily, the per-node lock
will be contended more.

This patch increases the order (I should have also considered the
CPU number) while keeping the per-cpu partial slab count unchanged,
as it doubles 'nr_objects' in set_cpu_partial().

Patch 2/3, by contrast, only increases the per-cpu partial number and
keeps the order unchanged. From the performance data in the cover
letter, 1/3 and 2/3 can each individually reduce the contention for
will-it-scale/mmap1, as they both increase the number of objects
available per CPU.

> 
> > Unconditionally increasing the order could  bring trouble to client
> > devices with very limited size of memory, which may care more about
> > memory footprint, also allocating order 4 page could be harder under
> > memory pressure. So the increase will only be done for big systems
> > like servers, which usually are equipped with plenty of memory and
> > easier to hit lock contention issues.
> 
> Also, does it make sense not to increase the order when PAGE_SIZE > 4096?

Good point! Some other discussion on the mm list earlier this week
also reminded me that there are architectures supporting bigger pages
like 64KB, and these patches need to take that into account.

> > Following is some performance data:
> >
> > will-it-scale/mmap1
> > -------------------
> > Run will-it-scale benchmark's 'mmap1' test case on a 2 socket Sapphire
> > Rapids server (112 cores / 224 threads) with 256 GB DRAM, run 3
> > configurations with parallel test threads of 25%, 50% and 100% of
> > number of CPUs, and the data is (base is vanilla v6.5 kernel):
> >
> >                      base                      base+patch
> > wis-mmap1-25%       223670           +33.3%     298205        per_process_ops
> > wis-mmap1-50%       186020           +51.8%     282383        per_process_ops
> > wis-mmap1-100%       89200           +65.0%     147139        per_process_ops
> >
> > Take the perf-profile comparasion of 50% test case, the lock contention
> > is greatly reduced:
> >
> >       43.80           -30.8       13.04       pp.self.native_queued_spin_lock_slowpath
> >       0.85            -0.2        0.65        pp.self.___slab_alloc
> >       0.41            -0.1        0.27        pp.self.__unfreeze_partials
> >       0.20 ±  2%      -0.1        0.12 ±  4%  pp.self.get_any_partial
> >
> > hackbench
> > ---------
> >
> > Run same hackbench testcase  mentioned in [1], use same HW/SW as will-it-scale:
> >
> >                      base                      base+patch
> > hackbench           759951           +10.5%     839601        hackbench.throughput
> >
> > perf-profile diff:
> >      22.20 ±  3%     -15.2        7.05        pp.self.native_queued_spin_lock_slowpath
> >       0.82            -0.2        0.59        pp.self.___slab_alloc
> >       0.33            -0.2        0.13        pp.self.__unfreeze_partials
> >
> > [1]. https://lore.kernel.org/all/202307172140.3b34825a-oliver.sang@intel.com/
> > [2]. ttps://lore.kernel.org/lkml/ZORaUsd+So+tnyMV@chenyu5-mobl2/
> > Signed-off-by: Feng Tang <feng.tang@intel.com>
> 
> > ---
> >  mm/slub.c | 51 ++++++++++++++++++++++++++++++++++++++-------------
> >  1 file changed, 38 insertions(+), 13 deletions(-)
> >
> > diff --git a/mm/slub.c b/mm/slub.c
> > index f7940048138c..09ae1ed642b7 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -4081,7 +4081,7 @@ EXPORT_SYMBOL(kmem_cache_alloc_bulk);
> >   */
> >  static unsigned int slub_min_order;
> >  static unsigned int slub_max_order =
> > -       IS_ENABLED(CONFIG_SLUB_TINY) ? 1 : PAGE_ALLOC_COSTLY_ORDER;
> > +       IS_ENABLED(CONFIG_SLUB_TINY) ? 1 : 4;
> >  static unsigned int slub_min_objects;
> >
> >  /*
> > @@ -4134,6 +4134,26 @@ static inline unsigned int calc_slab_order(unsigned int size,
> >         return order;
> >  }
> 



* Re: [RFC Patch 3/3] mm/slub: setup maxim per-node partial according to cpu numbers
  2023-09-12  4:48   ` Hyeonggon Yoo
@ 2023-09-14  7:05     ` Feng Tang
  2023-09-15  2:40       ` Lameter, Christopher
  0 siblings, 1 reply; 11+ messages in thread
From: Feng Tang @ 2023-09-14  7:05 UTC (permalink / raw)
  To: Hyeonggon Yoo
  Cc: Vlastimil Babka, Andrew Morton, Christoph Lameter, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Roman Gushchin, linux-mm,
	linux-kernel

Hi Hyeonggon,

On Tue, Sep 12, 2023 at 01:48:23PM +0900, Hyeonggon Yoo wrote:
> On Tue, Sep 5, 2023 at 11:07 PM Feng Tang <feng.tang@intel.com> wrote:
> >
> > Currently most of the slab's min_partial is set to 5 (as MIN_PARTIAL
> > is 5). This is fine for older or small systesms, and could be too
> > small for a large system with hundreds of CPUs, when per-node
> > 'list_lock' is contended for allocating from and freeing to per-node
> > partial list.
> >
> > So enlarge it based on the CPU numbers per node.
> >
> > Signed-off-by: Feng Tang <feng.tang@intel.com>
> > ---
> >  include/linux/nodemask.h | 1 +
> >  mm/slub.c                | 9 +++++++--
> >  2 files changed, 8 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/nodemask.h b/include/linux/nodemask.h
> > index 8d07116caaf1..6e22caab186d 100644
> > --- a/include/linux/nodemask.h
> > +++ b/include/linux/nodemask.h
> > @@ -530,6 +530,7 @@ static inline int node_random(const nodemask_t *maskp)
> >
> >  #define num_online_nodes()     num_node_state(N_ONLINE)
> >  #define num_possible_nodes()   num_node_state(N_POSSIBLE)
> > +#define num_cpu_nodes()                num_node_state(N_CPU)
> >  #define node_online(node)      node_state((node), N_ONLINE)
> >  #define node_possible(node)    node_state((node), N_POSSIBLE)
> >
> > diff --git a/mm/slub.c b/mm/slub.c
> > index 09ae1ed642b7..984e012d7bbc 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -4533,6 +4533,7 @@ static int calculate_sizes(struct kmem_cache *s)
> >
> >  static int kmem_cache_open(struct kmem_cache *s, slab_flags_t flags)
> >  {
> > +       unsigned long min_partial;
> >         s->flags = kmem_cache_flags(s->size, flags, s->name);
> >  #ifdef CONFIG_SLAB_FREELIST_HARDENED
> >         s->random = get_random_long();
> > @@ -4564,8 +4565,12 @@ static int kmem_cache_open(struct kmem_cache *s, slab_flags_t flags)
> >          * The larger the object size is, the more slabs we want on the partial
> >          * list to avoid pounding the page allocator excessively.
> >          */
> > -       s->min_partial = min_t(unsigned long, MAX_PARTIAL, ilog2(s->size) / 2);
> > -       s->min_partial = max_t(unsigned long, MIN_PARTIAL, s->min_partial);
> > +
> > +       min_partial = rounddown_pow_of_two(num_cpus() / num_cpu_nodes());
> > +       min_partial = max_t(unsigned long, MIN_PARTIAL, min_partial);
> > +
> > +       s->min_partial = min_t(unsigned long, min_partial * 2, ilog2(s->size) / 2);
> > +       s->min_partial = max_t(unsigned long, min_partial, s->min_partial);
> 
> Hello Feng,
> 
> How much memory is consumed by this change on your machine?

As the code mostly touches the per-node partial list, I did some
profiling by checking the 'partial' count of each slab under
/sys/kernel/slab/, both after boot and after running the
will-it-scale/mmap1 case with all CPUs.

The HW is a 2S 48C/96T platform running CentOS 9. The kernel is
6.6-rc1 with and without this patch (which effectively increases
MIN_PARTIAL to 32).

There are 246 slabs in total on the system, and after boot, 27 slabs
show a difference:

	6.6-rc1                         |    6.6-rc1 + node_partial patch
-----------------------------------------------------------------------------

anon_vma_chain/partial:8 N0=5 N1=3      | anon_vma_chain/partial:29 N0=22 N1=7
anon_vma/partial:1 N0=1                 | anon_vma/partial:22 N0=22
bio-184/partial:0                       | bio-184/partial:6 N0=6
buffer_head/partial:0                   | buffer_head/partial:29 N1=29
dentry/partial:2 N0=2                   | dentry/partial:3 N1=3
filp/partial:5 N0=5                     | filp/partial:44 N0=28 N1=16
ioat/partial:10 N0=5 N1=5               | ioat/partial:62 N0=31 N1=31
kmalloc-128/partial:0                   | kmalloc-128/partial:1 N0=1
kmalloc-16/partial:1 N1=1               | kmalloc-16/partial:0
kmalloc-1k/partial:5 N0=5               | kmalloc-1k/partial:12 N0=12
kmalloc-32/partial:2 N0=1 N1=1          | kmalloc-32/partial:0
kmalloc-512/partial:4 N0=4              | kmalloc-512/partial:5 N0=4 N1=1
kmalloc-64/partial:1 N0=1               | kmalloc-64/partial:0
kmalloc-8k/partial:6 N0=6               | kmalloc-8k/partial:28 N0=28
kmalloc-96/partial:24 N0=23 N1=1        | kmalloc-96/partial:44 N0=41 N1=3
kmalloc-cg-32/partial:1 N0=1            | kmalloc-cg-32/partial:0
maple_node/partial:10 N0=6 N1=4         | maple_node/partial:55 N0=27 N1=28
pool_workqueue/partial:1 N0=1           | pool_workqueue/partial:0
radix_tree_node/partial:0               | radix_tree_node/partial:2 N0=1 N1=1
sighand_cache/partial:4 N0=4            | sighand_cache/partial:0
signal_cache/partial:0                  | signal_cache/partial:2 N0=2
skbuff_head_cache/partial:4 N0=2 N1=2   | skbuff_head_cache/partial:27 N0=27
skbuff_small_head/partial:5 N0=5        | skbuff_small_head/partial:32 N0=32
task_struct/partial:1 N0=1              | task_struct/partial:17 N0=17
vma_lock/partial:6 N0=4 N1=2            | vma_lock/partial:32 N0=25 N1=7
vmap_area/partial:1 N0=1                | vmap_area/partial:53 N0=32 N1=21
vm_area_struct/partial:14 N0=8 N1=6     | vm_area_struct/partial:38 N0=15 N1=23


After running the will-it-scale/mmap1 case with 96 processes, 30
slabs show diffs:

	6.6-rc1                         |    6.6-rc1 + node_partial patch
-----------------------------------------------------------------------------

anon_vma_chain/partial:8 N0=5 N1=3      | anon_vma_chain/partial:29 N0=22 N1=7
anon_vma/partial:1 N0=1                 | anon_vma/partial:22 N0=22
bio-184/partial:0                       | bio-184/partial:6 N0=6
buffer_head/partial:0                   | buffer_head/partial:29 N1=29
cred_jar/partial:0                      | cred_jar/partial:6 N1=6
dentry/partial:8 N0=3 N1=5              | dentry/partial:22 N0=6 N1=16
filp/partial:6 N0=1 N1=5                | filp/partial:48 N0=28 N1=20
ioat/partial:10 N0=5 N1=5               | ioat/partial:62 N0=31 N1=31
kmalloc-128/partial:0                   | kmalloc-128/partial:1 N0=1
kmalloc-16/partial:2 N0=1 N1=1          | kmalloc-16/partial:3 N0=3
kmalloc-1k/partial:94 N0=49 N1=45       | kmalloc-1k/partial:100 N0=58 N1=42
kmalloc-32/partial:2 N0=1 N1=1          | kmalloc-32/partial:0
kmalloc-512/partial:209 N0=120 N1=89    | kmalloc-512/partial:205 N0=156 N1=49
kmalloc-64/partial:1 N0=1               | kmalloc-64/partial:0
kmalloc-8k/partial:6 N0=6               | kmalloc-8k/partial:28 N0=28
kmalloc-8/partial:0                     | kmalloc-8/partial:1 N0=1
kmalloc-96/partial:25 N0=23 N1=2        | kmalloc-96/partial:36 N0=33 N1=3
kmalloc-cg-32/partial:1 N0=1            | kmalloc-cg-32/partial:0
lsm_inode_cache/partial:0               | lsm_inode_cache/partial:8 N0=8
maple_node/partial:89 N0=46 N1=43       | maple_node/partial:116 N0=63 N1=53
pool_workqueue/partial:1 N0=1           | pool_workqueue/partial:0
radix_tree_node/partial:0               | radix_tree_node/partial:2 N0=1 N1=1
sighand_cache/partial:4 N0=4            | sighand_cache/partial:0
signal_cache/partial:0                  | signal_cache/partial:2 N0=2
skbuff_head_cache/partial:4 N0=2 N1=2   | skbuff_head_cache/partial:27 N0=27
skbuff_small_head/partial:5 N0=5        | skbuff_small_head/partial:32 N0=32
task_struct/partial:1 N0=1              | task_struct/partial:41 N0=32 N1=9
vma_lock/partial:71 N0=40 N1=31         | vma_lock/partial:110 N0=65 N1=45
vmap_area/partial:1 N0=1                | vmap_area/partial:59 N0=38 N1=21
vm_area_struct/partial:106 N0=58 N1=48  | vm_area_struct/partial:151 N0=88 N1=63

There is a measurable increase for some slabs, but not that much.

> I won't argue that it would be huge for large machines but it
> increases the minimum value for every
> cache (even for those that are not contended) and there is no way to
> reclaim this.

For slabs with less contention, the per-node partial list may also
be less likely to grow? From the above data, about 10% of slabs are
affected by the change. Maybe we can also limit the change to large
systems?

One reason I wanted to revisit MIN_PARTIAL is that it was changed
from 2 to 5 in 2007 by Christoph, in commit 76be895001f2 ("SLUB:
Improve hackbench speed"), and systems have grown much bigger since
then. Given that a per-cpu partial list can already hold 5 or more
slabs, the limit for a node with possibly 100+ CPUs could be
reconsidered.

> Maybe a way to reclaim a full slab on memory pressure (on buddy side)
> wouldn't hurt?


Sorry, I don't follow. Do you mean to reclaim a slab with 0 'inuse'
objects, like the work done in __kmem_cache_do_shrink()?

Thanks,
Feng

> 
> >         set_cpu_partial(s);
> >
> > --
> > 2.27.0
> >



* Re: [RFC Patch 3/3] mm/slub: setup maxim per-node partial according to cpu numbers
  2023-09-14  7:05     ` Feng Tang
@ 2023-09-15  2:40       ` Lameter, Christopher
  2023-09-15  5:05         ` Feng Tang
  0 siblings, 1 reply; 11+ messages in thread
From: Lameter, Christopher @ 2023-09-15  2:40 UTC (permalink / raw)
  To: Feng Tang
  Cc: Hyeonggon Yoo, Vlastimil Babka, Andrew Morton, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Roman Gushchin, linux-mm,
	linux-kernel

On Thu, 14 Sep 2023, Feng Tang wrote:

> One reason I wanted to revisit the MIN_PARTIAL is, it was changed from
> 2 to 5 in 2007 by Christoph, in commit 76be895001f2 ("SLUB: Improve
> hackbench speed"), the system has been much huger since then.
> Currently while a per-cpu partial can already have 5 or more slabs,
> the limit for a node with possible 100+ CPU could be reconsidered.

Well, the trick that I keep using on large systems with lots of
memory is to use huge-page-sized slab allocations. The applications
on those systems already use the same page size. Doing so usually
removes a lot of overhead and speeds things up significantly.

Try booting with "slab_min_order=9"



* Re: [RFC Patch 3/3] mm/slub: setup maxim per-node partial according to cpu numbers
  2023-09-15  2:40       ` Lameter, Christopher
@ 2023-09-15  5:05         ` Feng Tang
  2023-09-15 16:13           ` Lameter, Christopher
  0 siblings, 1 reply; 11+ messages in thread
From: Feng Tang @ 2023-09-15  5:05 UTC (permalink / raw)
  To: Lameter, Christopher
  Cc: Hyeonggon Yoo, Vlastimil Babka, Andrew Morton, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Roman Gushchin, linux-mm,
	linux-kernel

On Thu, Sep 14, 2023 at 07:40:22PM -0700, Lameter, Christopher wrote:
> On Thu, 14 Sep 2023, Feng Tang wrote:
> 
> > One reason I wanted to revisit the MIN_PARTIAL is, it was changed from
> > 2 to 5 in 2007 by Christoph, in commit 76be895001f2 ("SLUB: Improve
> > hackbench speed"), the system has been much huger since then.
> > Currently while a per-cpu partial can already have 5 or more slabs,
> > the limit for a node with possible 100+ CPU could be reconsidered.
> 
> Well the trick that I keep using in large systems with lots of memory is to
> use huge page sized page allocation. The applications on those already are
> using the same page size. Doing so usually removes a lot of overhead and
> speeds up things significantly.
> 
> Try booting with "slab_min_order=9"

Thanks for sharing the trick! I tried it and it works here. But it
is kind of extreme and fits some special use cases, while these
patches try to be useful for generic usage.

Thanks,
Feng



* Re: [RFC Patch 3/3] mm/slub: setup maxim per-node partial according to cpu numbers
  2023-09-15  5:05         ` Feng Tang
@ 2023-09-15 16:13           ` Lameter, Christopher
  0 siblings, 0 replies; 11+ messages in thread
From: Lameter, Christopher @ 2023-09-15 16:13 UTC (permalink / raw)
  To: Feng Tang
  Cc: Hyeonggon Yoo, Vlastimil Babka, Andrew Morton, Pekka Enberg,
	David Rientjes, Joonsoo Kim, Roman Gushchin, linux-mm,
	linux-kernel

On Fri, 15 Sep 2023, Feng Tang wrote:

> Thanks for sharing the trick! I tried and it works here. But this is
> kind of extreme and fit for some special use case, and these patches
> try to be useful for generic usage.

Having a couple of TB of main storage is becoming more and more
customary for servers.




