public inbox for linux-kernel@vger.kernel.org
* [patch] slub: default min_partial to at least highest cpus per node
@ 2009-04-06  0:52 David Rientjes
  2009-04-07 18:58 ` Pekka Enberg
  0 siblings, 1 reply; 17+ messages in thread
From: David Rientjes @ 2009-04-06  0:52 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: Christoph Lameter, linux-kernel

The pre-defined MIN_PARTIAL value may not be suitable for machines with a
large number of cpus per node.  To avoid excessively allocating new slabs
because there is not at least the same number of slabs on a node's
partial list as cpus, this will default a cache's min_partial value to be
at least the highest number of cpus per node on the system.

This default will never exceed MAX_PARTIAL, however, so very large
systems don't waste an excessive amount of memory.

When remote_node_defrag_ratio allows defragmenting remote nodes, it
ensures that nr_partial exceeds min_partial so there will always be a
local reserve when a cpu slab is filled to avoid allocating new slabs
locally as a result of a remote cpu stealing a partial slab.

The cache's min_partial setting may still be changed by writing to
/sys/kernel/slab/cache/min_partial.  The only restriction when doing so
is that the value lie between MIN_PARTIAL and MAX_PARTIAL.

Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 mm/slub.c |   50 ++++++++++++++++++++++++++++++++++++++++++--------
 1 files changed, 42 insertions(+), 8 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1941,6 +1941,15 @@ static unsigned long calculate_alignment(unsigned long flags,
 	return ALIGN(align, sizeof(void *));
 }
 
+static void set_min_partial(struct kmem_cache *s, unsigned long min)
+{
+	if (min < MIN_PARTIAL)
+		min = MIN_PARTIAL;
+	else if (min > MAX_PARTIAL)
+		min = MAX_PARTIAL;
+	s->min_partial = min;
+}
+
 static void init_kmem_cache_cpu(struct kmem_cache *s,
 			struct kmem_cache_cpu *c)
 {
@@ -2093,6 +2102,8 @@ static inline int alloc_kmem_cache_cpus(struct kmem_cache *s, gfp_t flags)
 #endif
 
 #ifdef CONFIG_NUMA
+static unsigned long min_partial_per_node;
+
 /*
  * No kmalloc_node yet so do it by hand. We know that this is the first
  * slab on the node for this slabcache. There are no concurrent accesses
@@ -2188,6 +2199,26 @@ static int init_kmem_cache_nodes(struct kmem_cache *s, gfp_t gfpflags)
 	}
 	return 1;
 }
+
+static void __cpuinit set_min_partial_per_node(int node)
+{
+	unsigned int nr_cpus;
+
+	/*
+	 * Add one cpu since this is called for CPU_UP_PREPARE and this cpu is
+	 * not online yet.  CPU_ONLINE would require taking slub_lock.
+	 */
+	nr_cpus = nr_cpus_node(node) + 1;
+	if (nr_cpus > min_partial_per_node)
+		min_partial_per_node = min_t(unsigned long,
+					     nr_cpus, MAX_PARTIAL);
+}
+
+static void set_cache_min_partial_per_node(struct kmem_cache *s)
+{
+	if (s->min_partial < min_partial_per_node)
+		set_min_partial(s, min_partial_per_node);
+}
 #else
 static void free_kmem_cache_nodes(struct kmem_cache *s)
 {
@@ -2198,16 +2229,15 @@ static int init_kmem_cache_nodes(struct kmem_cache *s, gfp_t gfpflags)
 	init_kmem_cache_node(&s->local_node, s);
 	return 1;
 }
-#endif
 
-static void set_min_partial(struct kmem_cache *s, unsigned long min)
+static inline void set_min_partial_per_node(int node)
+{
+}
+
+static inline void set_cache_min_partial_per_node(const struct kmem_cache *s)
 {
-	if (min < MIN_PARTIAL)
-		min = MIN_PARTIAL;
-	else if (min > MAX_PARTIAL)
-		min = MAX_PARTIAL;
-	s->min_partial = min;
 }
+#endif
 
 /*
  * calculate_sizes() determines the order and the distribution of data within
@@ -2352,6 +2382,7 @@ static int kmem_cache_open(struct kmem_cache *s, gfp_t gfpflags,
 	 * list to avoid pounding the page allocator excessively.
 	 */
 	set_min_partial(s, ilog2(s->size));
+	set_cache_min_partial_per_node(s);
 	s->refcount = 1;
 #ifdef CONFIG_NUMA
 	s->remote_node_defrag_ratio = 1000;
@@ -3239,10 +3270,13 @@ static int __cpuinit slab_cpuup_callback(struct notifier_block *nfb,
 	case CPU_UP_PREPARE:
 	case CPU_UP_PREPARE_FROZEN:
 		init_alloc_cpu_cpu(cpu);
+		set_min_partial_per_node(cpu_to_node(cpu));
 		down_read(&slub_lock);
-		list_for_each_entry(s, &slab_caches, list)
+		list_for_each_entry(s, &slab_caches, list) {
 			s->cpu_slab[cpu] = alloc_kmem_cache_cpu(s, cpu,
 							GFP_KERNEL);
+			set_cache_min_partial_per_node(s);
+		}
 		up_read(&slub_lock);
 		break;
 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [patch] slub: default min_partial to at least highest cpus per node
  2009-04-06  0:52 [patch] slub: default min_partial to at least highest cpus per node David Rientjes
@ 2009-04-07 18:58 ` Pekka Enberg
  2009-04-07 19:09   ` Pekka Enberg
  0 siblings, 1 reply; 17+ messages in thread
From: Pekka Enberg @ 2009-04-07 18:58 UTC (permalink / raw)
  To: David Rientjes; +Cc: Christoph Lameter, linux-kernel

Hi David,

David Rientjes wrote:
> The pre-defined MIN_PARTIAL value may not be suitable for machines with a
> large number of cpus per node.  To avoid excessively allocating new slabs
> because there is not at least the same number of slabs on a node's
> partial list as cpus, this will default a cache's min_partial value to be
> at least the highest number of cpus per node on the system.
> 
> This default will never exceed MAX_PARTIAL, however, so very large
> systems don't waste an excessive amount of memory.
> 
> When remote_node_defrag_ratio allows defragmenting remote nodes, it
> ensures that nr_partial exceeds min_partial so there will always be a
> local reserve when a cpu slab is filled to avoid allocating new slabs
> locally as a result of a remote cpu stealing a partial slab.
> 
> The cache's min_partial setting may still be changed by writing to
> /sys/kernel/slab/cache/min_partial.  The only restriction when doing so
> is that the value lie between MIN_PARTIAL and MAX_PARTIAL.
> 
> Cc: Christoph Lameter <cl@linux-foundation.org>
> Signed-off-by: David Rientjes <rientjes@google.com>

Hmm, partial lists are per-node, so wouldn't it be better to do the 
adjustment for every struct kmem_cache_node separately? The 
'min_partial_per_node' global seems just too ugly and confusing to live 
with.

			Pekka


* Re: [patch] slub: default min_partial to at least highest cpus per node
  2009-04-07 18:58 ` Pekka Enberg
@ 2009-04-07 19:09   ` Pekka Enberg
  2009-04-07 19:44     ` David Rientjes
  0 siblings, 1 reply; 17+ messages in thread
From: Pekka Enberg @ 2009-04-07 19:09 UTC (permalink / raw)
  To: David Rientjes; +Cc: Christoph Lameter, linux-kernel

David Rientjes wrote:
>> The pre-defined MIN_PARTIAL value may not be suitable for machines with a
>> large number of cpus per node.  To avoid excessively allocating new slabs
>> because there is not at least the same number of slabs on a node's
>> partial list as cpus, this will default a cache's min_partial value to be
>> at least the highest number of cpus per node on the system.
>>
>> This default will never exceed MAX_PARTIAL, however, so very large
>> systems don't waste an excessive amount of memory.
>>
>> When remote_node_defrag_ratio allows defragmenting remote nodes, it
>> ensures that nr_partial exceeds min_partial so there will always be a
>> local reserve when a cpu slab is filled to avoid allocating new slabs
>> locally as a result of a remote cpu stealing a partial slab.
>>
>> The cache's min_partial setting may still be changed by writing to
>> /sys/kernel/slab/cache/min_partial.  The only restriction when doing so
>> is that the value lie between MIN_PARTIAL and MAX_PARTIAL.
>>
>> Cc: Christoph Lameter <cl@linux-foundation.org>
>> Signed-off-by: David Rientjes <rientjes@google.com>

On Tue, Apr 7, 2009 at 9:58 PM, Pekka Enberg <penberg@cs.helsinki.fi> wrote:
> Hmm, partial lists are per-node, so wouldn't it be better to do the
> adjustment for every struct kmem_cache_node separately? The
> 'min_partial_per_node' global seems just too ugly and confusing to live
> with.

Btw, that requires moving ->min_partial to struct kmem_cache_node from
struct kmem_cache.  But I think that makes a whole lot of sense if
some nodes may have more CPUs than others.

And while the improvement is kinda obvious, I would be interested to
know what kind of workload benefits from this patch (and see numbers
if there are any).

                                  Pekka


* Re: [patch] slub: default min_partial to at least highest cpus per node
  2009-04-07 19:09   ` Pekka Enberg
@ 2009-04-07 19:44     ` David Rientjes
  2009-04-07 19:54       ` Pekka Enberg
  0 siblings, 1 reply; 17+ messages in thread
From: David Rientjes @ 2009-04-07 19:44 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: Christoph Lameter, linux-kernel

On Tue, 7 Apr 2009, Pekka Enberg wrote:

> > Hmm, partial lists are per-node, so wouldn't it be better to do the
> > adjustment for every struct kmem_cache_node separately? The
> > 'min_partial_per_node' global seems just too ugly and confusing to live
> > with.
> 
> Btw, that requires moving ->min_partial to struct kmem_cache_node from
> struct kmem_cache.  But I think that makes a whole lot of sense if
> some nodes may have more CPUs than others.
> 
> And while the improvement is kinda obvious, I would be interested to
> know what kind of workload benefits from this patch (and see numbers
> if there are any).
> 

It doesn't really depend on the workload; it depends on the type of NUMA 
machine it's running on (and whether that NUMA is asymmetric amongst 
cpus).

Since min_partial_per_node is capped at MAX_PARTIAL, this is only really 
relevant for remote node defragmentation if it's allowed (and not just 2% 
of the time like the default).  We want to avoid stealing partial slabs 
from remote nodes if there are fewer than the number of cpus on that node.

Otherwise, it's possible for each cpu on the victim node to try to 
allocate a single object and require nr_cpus_node(node) new slab 
allocations.  In this case it's entirely possible for the majority of cpus 
to have cpu slabs from remote nodes.  This change reduces the likelihood of 
that happening because we'll always have cpu slab replacements on our 
local partial list before allowing remote defragmentation.

I'd be just as happy with the following, although it would require 
changing MIN_PARTIAL to be greater than its default of 5 if a node 
supports more cpus for optimal performance (the old patch did that 
automatically up to MAX_PARTIAL).
---
diff --git a/mm/slub.c b/mm/slub.c
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1326,11 +1326,13 @@ static struct page *get_any_partial(struct kmem_cache *s, gfp_t flags)
 	zonelist = node_zonelist(slab_node(current->mempolicy), flags);
 	for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
 		struct kmem_cache_node *n;
+		int node;
 
-		n = get_node(s, zone_to_nid(zone));
+		node = zone_to_nid(zone);
+		n = get_node(s, node);
 
 		if (n && cpuset_zone_allowed_hardwall(zone, flags) &&
-				n->nr_partial > s->min_partial) {
+				n->nr_partial > nr_cpus_node(node)) {
 			page = get_partial_node(n);
 			if (page)
 				return page;


* Re: [patch] slub: default min_partial to at least highest cpus per node
  2009-04-07 19:44     ` David Rientjes
@ 2009-04-07 19:54       ` Pekka Enberg
  2009-04-07 20:09         ` David Rientjes
  0 siblings, 1 reply; 17+ messages in thread
From: Pekka Enberg @ 2009-04-07 19:54 UTC (permalink / raw)
  To: David Rientjes; +Cc: Christoph Lameter, linux-kernel

David Rientjes wrote:
> On Tue, 7 Apr 2009, Pekka Enberg wrote:
> 
>>> Hmm, partial lists are per-node, so wouldn't it be better to do the
>>> adjustment for every struct kmem_cache_node separately? The
>>> 'min_partial_per_node' global seems just too ugly and confusing to live
>>> with.
>> Btw, that requires moving ->min_partial to struct kmem_cache_node from
>> struct kmem_cache.  But I think that makes a whole lot of sense if
>> some nodes may have more CPUs than others.
>>
>> And while the improvement is kinda obvious, I would be interested to
>> know what kind of workload benefits from this patch (and see numbers
>> if there are any).
>>
> 
> It doesn't really depend on the workload, it depends on the type of NUMA 
> machine it's running on (and whether that NUMA is asymmetric amongst 
> cpus).
> 
> Since min_partial_per_node is capped at MAX_PARTIAL, this is only really 
> relevant for remote node defragmentation if it's allowed (and not just 2% 
> of the time like the default).  We want to avoid stealing partial slabs 
> from remote nodes if there are fewer than the number of cpus on that node.
> 
> Otherwise, it's possible for each cpu on the victim node to try to 
> allocate a single object and require nr_cpus_node(node) new slab 
> allocations.  In this case it's entirely possible for the majority of cpus 
> to have cpu slabs from remote nodes.  This change reduces the likelihood of 
> that happening because we'll always have cpu slab replacements on our 
> local partial list before allowing remote defragmentation.
> 
> I'd be just as happy with the following, although it would require 
> changing MIN_PARTIAL to be greater than its default of 5 if a node 
> supports more cpus for optimal performance (the old patch did that 
> automatically up to MAX_PARTIAL).

Hmm but why not move ->min_partial to struct kmem_cache_node as I 
suggested and make sure it's adjusted properly as with nr_cpus_node()?


* Re: [patch] slub: default min_partial to at least highest cpus per node
  2009-04-07 19:54       ` Pekka Enberg
@ 2009-04-07 20:09         ` David Rientjes
  2009-04-07 20:11           ` Pekka Enberg
  0 siblings, 1 reply; 17+ messages in thread
From: David Rientjes @ 2009-04-07 20:09 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: Christoph Lameter, linux-kernel

On Tue, 7 Apr 2009, Pekka Enberg wrote:

> > I'd be just as happy with the following, although it would require changing
> > MIN_PARTIAL to be greater than its default of 5 if a node supports more cpus
> > for optimal performance (the old patch did that automatically up to
> > MAX_PARTIAL).
> 
> Hmm but why not move ->min_partial to struct kmem_cache_node as I suggested
> and make sure it's adjusted properly as with nr_cpus_node()?
> 

Sure, that's also possible except we'd lose the ability to tune 
min_partial with /sys/kernel/slab/cache/min_partial, unless it would then 
change n->min_partial for each N_NORMAL_MEMORY node.  We lack an interface 
to change the per-node min_partial.

If you think that's acceptable, I'd be just as satisfied with that 
approach as long as all archs have valid cpu_to_node() mappings at the 
time of CPU_UP_PREPARE.

Aside: we're lacking documentation for these sysfs tunables, such as 
remote_node_defrag_ratio, to begin with; the only way to figure out what 
one does is by reading the code or making assumptions based on its name.  
I'd be happy to add some documentation, but it'd be good to keep it 
separate from Documentation/vm/slub.txt.


* Re: [patch] slub: default min_partial to at least highest cpus per node
  2009-04-07 20:09         ` David Rientjes
@ 2009-04-07 20:11           ` Pekka Enberg
  2009-04-07 20:22             ` David Rientjes
  0 siblings, 1 reply; 17+ messages in thread
From: Pekka Enberg @ 2009-04-07 20:11 UTC (permalink / raw)
  To: David Rientjes; +Cc: Christoph Lameter, linux-kernel

Hi David,

On Tue, 7 Apr 2009, Pekka Enberg wrote:
>>> I'd be just as happy with the following, although it would require changing
>>> MIN_PARTIAL to be greater than its default of 5 if a node supports more cpus
>>> for optimal performance (the old patch did that automatically up to
>>> MAX_PARTIAL).
>> Hmm but why not move ->min_partial to struct kmem_cache_node as I suggested
>> and make sure it's adjusted properly as with nr_cpus_node()?

David Rientjes wrote:
> Sure, that's also possible except we'd lose the ability to tune 
> min_partial with /sys/kernel/slab/cache/min_partial, unless it would then 
> change n->min_partial for each N_NORMAL_MEMORY node.  We lack an interface 
> to change the per-node min_partial.
> 
> If you think that's acceptable, I'd be just as satisfied with that 
> approach as long as all archs have valid cpu_to_node() mappings at the 
> time of CPU_UP_PREPARE.

Well, that doesn't change the current behavior, so sure, I think it's 
acceptable. And if the new defaults seem reasonable enough, we can 
probably get rid of the tunable altogether.

David Rientjes wrote:
> Aside: we're lacking in the documentation of these sysfs tunables such as 
> remote_node_defrag_ratio to begin with, the only way to figure out what it 
> does is by reading the code or making assumptions based on its name.  I'd 
> be happy to add some documentation but it'd be good to keep it separate 
> from Documentation/vm/slub.txt.

AFAICT, the Documentation/ABI directory is the right place for this kind 
of stuff.

			Pekka


* Re: [patch] slub: default min_partial to at least highest cpus per node
  2009-04-07 20:11           ` Pekka Enberg
@ 2009-04-07 20:22             ` David Rientjes
  2009-04-07 20:35               ` Pekka Enberg
  0 siblings, 1 reply; 17+ messages in thread
From: David Rientjes @ 2009-04-07 20:22 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: Christoph Lameter, linux-kernel

On Tue, 7 Apr 2009, Pekka Enberg wrote:

> > If you think that's acceptable, I'd be just as satisfied with that approach
> > as long as all archs have valid cpu_to_node() mappings at the time of
> > CPU_UP_PREPARE.
> 
> Well, that doesn't change the current behavior, so sure, I think it's
> acceptable. And if the new defaults seem reasonable enough, we can probably
> get rid of the tunable altogether.
> 

I'd like to remove MAX_PARTIAL and replace it with nr_cpu_ids.

The tunable still makes sense for improving slowpath performance by 
avoiding the page allocator on machines where memory is abundant.  That 
remains unchanged from when I added it.

> AFAICT, the Documentation/ABI directory is the right place for this kind of
> stuff.
> 

Ok.


* Re: [patch] slub: default min_partial to at least highest cpus per node
  2009-04-07 20:22             ` David Rientjes
@ 2009-04-07 20:35               ` Pekka Enberg
  2009-04-07 20:43                 ` Christoph Lameter
  2009-04-07 22:01                 ` David Rientjes
  0 siblings, 2 replies; 17+ messages in thread
From: Pekka Enberg @ 2009-04-07 20:35 UTC (permalink / raw)
  To: David Rientjes; +Cc: Christoph Lameter, linux-kernel

David Rientjes wrote:
> On Tue, 7 Apr 2009, Pekka Enberg wrote:
> 
>>> If you think that's acceptable, I'd be just as satisfied with that approach
>>> as long as all archs have valid cpu_to_node() mappings at the time of
>>> CPU_UP_PREPARE.
>> Well, that doesn't change the current behavior, so sure, I think it's
>> acceptable. And if the new defaults seem reasonable enough, we can probably
>> get rid of the tunable altogether.
> 
> I'd like to remove MAX_PARTIAL and replace it with nr_cpu_ids.

Yeah, something more dynamic makes sense. But we probably need to do 
ilog2(nr_cpu_ids) or something; otherwise we will have very long partial 
lists on big iron machines (think 4096 cpus here).


* Re: [patch] slub: default min_partial to at least highest cpus per node
  2009-04-07 20:35               ` Pekka Enberg
@ 2009-04-07 20:43                 ` Christoph Lameter
  2009-04-07 20:48                   ` Pekka Enberg
  2009-04-07 22:01                 ` David Rientjes
  1 sibling, 1 reply; 17+ messages in thread
From: Christoph Lameter @ 2009-04-07 20:43 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: David Rientjes, linux-kernel

I am not sure about the benefit of this. If the page allocator finally
improves in performance, then we can reduce the minimum number of partial
slabs kept and also do pass through for all sizes >2k.

Some numbers would be good to see if we should adopt this in the meantime.




* Re: [patch] slub: default min_partial to at least highest cpus per node
  2009-04-07 20:43                 ` Christoph Lameter
@ 2009-04-07 20:48                   ` Pekka Enberg
  2009-04-07 21:31                     ` Christoph Lameter
  0 siblings, 1 reply; 17+ messages in thread
From: Pekka Enberg @ 2009-04-07 20:48 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: David Rientjes, linux-kernel

Christoph Lameter wrote:
> I am not sure about the benefit of this. If the page allocator finally
> improves in performance then we can reduce the minimum number of partial
> slabs kept and also do pass through for all sizes >2k.

Sure, we can do that, but it still makes sense to adjust the partial 
list sizes based on per-node CPU count. So I'm inclined to apply the 
patch I suggested if it shows performance gains on one or more of the 
relevant benchmarks (sysbench, tbench, hackbench, netperf, etc.).


* Re: [patch] slub: default min_partial to at least highest cpus per node
  2009-04-07 20:48                   ` Pekka Enberg
@ 2009-04-07 21:31                     ` Christoph Lameter
  2009-04-07 21:46                       ` David Rientjes
  0 siblings, 1 reply; 17+ messages in thread
From: Christoph Lameter @ 2009-04-07 21:31 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: David Rientjes, linux-kernel

On Tue, 7 Apr 2009, Pekka Enberg wrote:

> Christoph Lameter wrote:
> > I am not sure about the benefit of this. If the page allocator finally
> > improves in performance then we can reduce the minimum number of partial
> > slabs kept and also do pass through for all sizes >2k.
>
> Sure, we can do that, but it still makes sense to adjust the partial list
> sizes based on per-node CPU count. So I'm inclined to apply the patch I
> suggested if it shows performance gains on one or more of the relevant
> benchmarks (sysbench, tbench, hackbench, netperf, etc.).

The partial list sizes cannot be adjusted. There is just a minimum size
that is set via MIN_PARTIAL. Currently we use that to avoid trips to the
page allocator by keeping the list artificially long. This should not be
done because it increases memory overhead by keeping slab pages around
with no objects. If the page allocator trips are not that expensive
anymore, then there is no reason to keep these slab pages around. A hot
page can be reused by any other subsystem in the kernel.



* Re: [patch] slub: default min_partial to at least highest cpus per node
  2009-04-07 21:31                     ` Christoph Lameter
@ 2009-04-07 21:46                       ` David Rientjes
  2009-04-07 21:48                         ` Christoph Lameter
  0 siblings, 1 reply; 17+ messages in thread
From: David Rientjes @ 2009-04-07 21:46 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Pekka Enberg, linux-kernel

On Tue, 7 Apr 2009, Christoph Lameter wrote:

> The partial list sizes cannot be adjusted.

It's adjustable via /sys/kernel/slab/cache/min_partial.

> There is just a minimum size
> that is set via MIN_PARTIAL. Currently we use that to avoid trips to the
> page allocator by keeping the list artificially long. This should not be
> done because it increases memory overhead by keeping slab pages around
> with no objects.

That memory is easily reclaimed with kmem_cache_shrink().

> If the page allocator trips are not that expensive
> anymore then there is no reason to keep these slab pages around. A hot
> page can be reused by any other subsystemm in the kernel.
> 

Right, but right now allocating new slabs is extremely slow compared to 
using a partial slab.


* Re: [patch] slub: default min_partial to at least highest cpus per node
  2009-04-07 21:46                       ` David Rientjes
@ 2009-04-07 21:48                         ` Christoph Lameter
  2009-04-07 22:03                           ` David Rientjes
  0 siblings, 1 reply; 17+ messages in thread
From: Christoph Lameter @ 2009-04-07 21:48 UTC (permalink / raw)
  To: David Rientjes; +Cc: Pekka Enberg, linux-kernel

On Tue, 7 Apr 2009, David Rientjes wrote:

> Right, but right now allocating new slabs is extremely slow compared to
> using a partial slab.

Work on the page allocator is in progress. So this is band aid until the
work on the page allocator is complete. Maybe better focus on helping Mel?



* Re: [patch] slub: default min_partial to at least highest cpus per node
  2009-04-07 20:35               ` Pekka Enberg
  2009-04-07 20:43                 ` Christoph Lameter
@ 2009-04-07 22:01                 ` David Rientjes
  1 sibling, 0 replies; 17+ messages in thread
From: David Rientjes @ 2009-04-07 22:01 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: Christoph Lameter, linux-kernel

On Tue, 7 Apr 2009, Pekka Enberg wrote:

> Yeah, something more dynamic makes sense. But we probably need to do
> ilog2(nr_cpu_ids) or something; otherwise we will have very long partial lists
> on big iron machines (think 4096 cpus here).
> 

The suggestion was to replace MAX_PARTIAL with nr_cpu_ids, not to use 
nr_cpu_ids as the default value for n->min_partial; that default is 
ilog2(size).  To exceed the current value of MAX_PARTIAL, which is 10, 
size would need to be 2K or greater.

I agree that the default n->min_partial would be better represented in 
terms of cpus than object size.  ilog2(nr_cpu_ids) certainly seems 
appropriate and free partial slabs can always be reclaimed by utilizing 
/sys/kernel/slab/cache/shrink.

The 4096 cpu example is certainly realistic, but probably not UMA.


* Re: [patch] slub: default min_partial to at least highest cpus per node
  2009-04-07 21:48                         ` Christoph Lameter
@ 2009-04-07 22:03                           ` David Rientjes
  2009-04-13 18:07                             ` Christoph Lameter
  0 siblings, 1 reply; 17+ messages in thread
From: David Rientjes @ 2009-04-07 22:03 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Pekka Enberg, linux-kernel

On Tue, 7 Apr 2009, Christoph Lameter wrote:

> > Right, but right now allocating new slabs is extremely slow compared to
> > using a partial slab.
> 
> Work on the page allocator is in progress. So this is band aid until the
> work on the page allocator is complete. Maybe better focus on helping Mel?
> 

The page allocator may be getting significantly faster, but that is _not_ 
the only reason why we use partial slabs.  Higher order slabs, such as the 
default for kmalloc-2048, kmalloc-4096, etc., increase the likelihood of 
having cpu slabs at that order as opposed to order fallback with a 
sufficiently long uptime.


* Re: [patch] slub: default min_partial to at least highest cpus per node
  2009-04-07 22:03                           ` David Rientjes
@ 2009-04-13 18:07                             ` Christoph Lameter
  0 siblings, 0 replies; 17+ messages in thread
From: Christoph Lameter @ 2009-04-13 18:07 UTC (permalink / raw)
  To: David Rientjes; +Cc: Pekka Enberg, linux-kernel

On Tue, 7 Apr 2009, David Rientjes wrote:

> The page allocator may be getting significantly faster, but that is _not_
> the only reason why we use partial slabs.  Higher order slabs, such as the
> default for kmalloc-2048, kmalloc-4096, etc., increase the likelihood of
> having cpu slabs at that order as opposed to order fallback with a
> sufficiently long uptime.

kmalloc-4096 will go away with a faster page allocator.



end of thread, other threads:[~2009-04-13 18:14 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-04-06  0:52 [patch] slub: default min_partial to at least highest cpus per node David Rientjes
2009-04-07 18:58 ` Pekka Enberg
2009-04-07 19:09   ` Pekka Enberg
2009-04-07 19:44     ` David Rientjes
2009-04-07 19:54       ` Pekka Enberg
2009-04-07 20:09         ` David Rientjes
2009-04-07 20:11           ` Pekka Enberg
2009-04-07 20:22             ` David Rientjes
2009-04-07 20:35               ` Pekka Enberg
2009-04-07 20:43                 ` Christoph Lameter
2009-04-07 20:48                   ` Pekka Enberg
2009-04-07 21:31                     ` Christoph Lameter
2009-04-07 21:46                       ` David Rientjes
2009-04-07 21:48                         ` Christoph Lameter
2009-04-07 22:03                           ` David Rientjes
2009-04-13 18:07                             ` Christoph Lameter
2009-04-07 22:01                 ` David Rientjes
