[patch -mm] cpusets: add memory_slab

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [patch -mm] cpusets: add memory_slab_hardwall flag
@ 2009-03-08 16:27 David Rientjes
  2009-03-08 16:53 ` Paul Menage
                   ` (3 more replies)
  0 siblings, 4 replies; 27+ messages in thread
From: David Rientjes @ 2009-03-08 16:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Pekka Enberg, Matt Mackall, Paul Menage,
	Randy Dunlap, linux-kernel

Adds a per-cpuset `memory_slab_hardwall' flag.

The slab allocator interface for determining whether an object is allowed
is

	int current_cpuset_object_allowed(int node, gfp_t flags)

This returns non-zero when the object is allowed, either because
current's cpuset does not have memory_slab_hardwall enabled or because
it allows allocation on the node.  Otherwise, it returns zero.

This interface is lockless because a task's cpuset can always be safely
dereferenced atomically.

For slab, if the physical node id of the cpu cache is not from an
allowable node, the allocation will fail.  If an allocation is targeted
for a node that is not allowed, we allocate from an appropriate one
instead of failing.

For slob, if the page from the slob list is not from an allowable node,
we continue to scan for an appropriate slab.  If none can be used, a new
slab is allocated.

For slub, if the cpu slab is not from an allowable node, the partial list
is scanned for a replacement.  If none can be used, a new slab is
allocated.

Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 Documentation/cgroups/cpusets.txt |   54 ++++++++++++++++++++++++-------------
 include/linux/cpuset.h            |    6 ++++
 kernel/cpuset.c                   |   34 +++++++++++++++++++++++
 mm/slab.c                         |    4 +++
 mm/slob.c                         |    6 +++-
 mm/slub.c                         |   12 +++++---
 6 files changed, 91 insertions(+), 25 deletions(-)

diff --git a/Documentation/cgroups/cpusets.txt b/Documentation/cgroups/cpusets.txt
--- a/Documentation/cgroups/cpusets.txt
+++ b/Documentation/cgroups/cpusets.txt
@@ -14,20 +14,21 @@ CONTENTS:
 =========
 
 1. Cpusets
-  1.1 What are cpusets ?
-  1.2 Why are cpusets needed ?
-  1.3 How are cpusets implemented ?
-  1.4 What are exclusive cpusets ?
-  1.5 What is memory_pressure ?
-  1.6 What is memory spread ?
-  1.7 What is sched_load_balance ?
-  1.8 What is sched_relax_domain_level ?
-  1.9 How do I use cpusets ?
+  1.1  What are cpusets ?
+  1.2  Why are cpusets needed ?
+  1.3  How are cpusets implemented ?
+  1.4  What are exclusive cpusets ?
+  1.5  What is memory_pressure ?
+  1.6  What is memory spread ?
+  1.7  What is sched_load_balance ?
+  1.8  What is sched_relax_domain_level ?
+  1.9  What is memory_slab_hardwall ?
+  1.10 How do I use cpusets ?
 2. Usage Examples and Syntax
-  2.1 Basic Usage
-  2.2 Adding/removing cpus
-  2.3 Setting flags
-  2.4 Attaching processes
+  2.1  Basic Usage
+  2.2  Adding/removing cpus
+  2.3  Setting flags
+  2.4  Attaching processes
 3. Questions
 4. Contact
 
@@ -581,8 +582,22 @@ If your situation is:
 then increasing 'sched_relax_domain_level' would benefit you.
 
 
-1.9 How do I use cpusets ?
---------------------------
+1.9 What is memory_slab_hardwall ?
+----------------------------------
+
+A cpuset may require that slab object allocations all originate from
+its set of mems, either for memory isolation or NUMA optimizations.  Slab
+allocators normally optimize allocations in the fastpath by returning
+objects from a cpu slab.  These objects do not necessarily originate from
+slabs allocated on a cpuset's mems.
+
+When memory_slab_hardwall is set, all objects are allocated from slabs on
+the cpuset's set of mems.  This may incur a performance penalty if the
+cpu slab must be swapped for a different slab.
+
+
+1.10 How do I use cpusets ?
+---------------------------
 
 In order to minimize the impact of cpusets on critical kernel
 code, such as the scheduler, and due to the fact that the kernel
@@ -725,10 +740,11 @@ Now you want to do something with this cpuset.
 
 In this directory you can find several files:
 # ls
-cpu_exclusive  memory_migrate      mems                      tasks
-cpus           memory_pressure     notify_on_release
-mem_exclusive  memory_spread_page  sched_load_balance
-mem_hardwall   memory_spread_slab  sched_relax_domain_level
+cpu_exclusive		memory_pressure			notify_on_release
+cpus			memory_slab_hardwall		sched_load_balance
+mem_exclusive		memory_spread_page		sched_relax_domain_level
+mem_hardwall		memory_spread_slab		tasks
+memory_migrate		mems
 
 Reading them will give you information about the state of this cpuset:
 the CPUs and Memory Nodes it can use, the processes that are using
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -87,6 +87,7 @@ static inline int cpuset_do_slab_mem_spread(void)
 }
 
 extern int current_cpuset_is_being_rebound(void);
+extern int current_cpuset_object_allowed(int node, gfp_t flags);
 
 extern void rebuild_sched_domains(void);
 
@@ -179,6 +180,11 @@ static inline int current_cpuset_is_being_rebound(void)
 	return 0;
 }
 
+static inline int current_cpuset_object_allowed(int node, gfp_t flags)
+{
+	return 1;
+}
+
 static inline void rebuild_sched_domains(void)
 {
 	partition_sched_domains(1, NULL, NULL);
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -142,6 +142,7 @@ typedef enum {
 	CS_SCHED_LOAD_BALANCE,
 	CS_SPREAD_PAGE,
 	CS_SPREAD_SLAB,
+	CS_SLAB_HARDWALL,
 } cpuset_flagbits_t;
 
 /* convenient tests for these bits */
@@ -180,6 +181,11 @@ static inline int is_spread_slab(const struct cpuset *cs)
 	return test_bit(CS_SPREAD_SLAB, &cs->flags);
 }
 
+static inline int is_slab_hardwall(const struct cpuset *cs)
+{
+	return test_bit(CS_SLAB_HARDWALL, &cs->flags);
+}
+
 /*
  * Increment this integer everytime any cpuset changes its
  * mems_allowed value.  Users of cpusets can track this generation
@@ -1190,6 +1196,19 @@ int current_cpuset_is_being_rebound(void)
 	return task_cs(current) == cpuset_being_rebound;
 }
 
+/**
+ * current_cpuset_object_allowed - can a slab object be allocated on a node?
+ * @node: the node for object allocation
+ * @flags: allocation flags
+ *
+ * Return non-zero if object is allowed, zero otherwise.
+ */
+int current_cpuset_object_allowed(int node, gfp_t flags)
+{
+	return !is_slab_hardwall(task_cs(current)) ||
+	       cpuset_node_allowed_hardwall(node, flags);
+}
+
 static int update_relax_domain_level(struct cpuset *cs, s64 val)
 {
 	if (val < -1 || val >= SD_LV_MAX)
@@ -1417,6 +1436,7 @@ typedef enum {
 	FILE_MEMORY_PRESSURE,
 	FILE_SPREAD_PAGE,
 	FILE_SPREAD_SLAB,
+	FILE_SLAB_HARDWALL,
 } cpuset_filetype_t;
 
 static int cpuset_write_u64(struct cgroup *cgrp, struct cftype *cft, u64 val)
@@ -1458,6 +1478,9 @@ static int cpuset_write_u64(struct cgroup *cgrp, struct cftype *cft, u64 val)
 		retval = update_flag(CS_SPREAD_SLAB, cs, val);
 		cs->mems_generation = cpuset_mems_generation++;
 		break;
+	case FILE_SLAB_HARDWALL:
+		retval = update_flag(CS_SLAB_HARDWALL, cs, val);
+		break;
 	default:
 		retval = -EINVAL;
 		break;
@@ -1614,6 +1637,8 @@ static u64 cpuset_read_u64(struct cgroup *cont, struct cftype *cft)
 		return is_spread_page(cs);
 	case FILE_SPREAD_SLAB:
 		return is_spread_slab(cs);
+	case FILE_SLAB_HARDWALL:
+		return is_slab_hardwall(cs);
 	default:
 		BUG();
 	}
@@ -1721,6 +1746,13 @@ static struct cftype files[] = {
 		.write_u64 = cpuset_write_u64,
 		.private = FILE_SPREAD_SLAB,
 	},
+
+	{
+		.name = "memory_slab_hardwall",
+		.read_u64 = cpuset_read_u64,
+		.write_u64 = cpuset_write_u64,
+		.private = FILE_SLAB_HARDWALL,
+	},
 };
 
 static struct cftype cft_memory_pressure_enabled = {
@@ -1814,6 +1846,8 @@ static struct cgroup_subsys_state *cpuset_create(
 		set_bit(CS_SPREAD_PAGE, &cs->flags);
 	if (is_spread_slab(parent))
 		set_bit(CS_SPREAD_SLAB, &cs->flags);
+	if (is_slab_hardwall(parent))
+		set_bit(CS_SLAB_HARDWALL, &cs->flags);
 	set_bit(CS_SCHED_LOAD_BALANCE, &cs->flags);
 	cpumask_clear(cs->cpus_allowed);
 	nodes_clear(cs->mems_allowed);
diff --git a/mm/slab.c b/mm/slab.c
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -3124,6 +3124,8 @@ static inline void *____cache_alloc(struct kmem_cache *cachep, gfp_t flags)
 	check_irq_off();
 
 	ac = cpu_cache_get(cachep);
+	if (!current_cpuset_object_allowed(numa_node_id(), flags))
+		return NULL;
 	if (likely(ac->avail)) {
 		STATS_INC_ALLOCHIT(cachep);
 		ac->touched = 1;
@@ -3249,6 +3251,8 @@ static void *____cache_alloc_node(struct kmem_cache *cachep, gfp_t flags,
 	void *obj;
 	int x;
 
+	if (!current_cpuset_object_allowed(nodeid, flags))
+		nodeid = cpuset_mem_spread_node();
 	l3 = cachep->nodelists[nodeid];
 	BUG_ON(!l3);
 
diff --git a/mm/slob.c b/mm/slob.c
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -319,14 +319,18 @@ static void *slob_alloc(size_t size, gfp_t gfp, int align, int node)
 	spin_lock_irqsave(&slob_lock, flags);
 	/* Iterate through each partially free page, try to find room */
 	list_for_each_entry(sp, slob_list, list) {
+		int slab_node = page_to_nid(&sp->page);
+
 #ifdef CONFIG_NUMA
 		/*
 		 * If there's a node specification, search for a partial
 		 * page with a matching node id in the freelist.
 		 */
-		if (node != -1 && page_to_nid(&sp->page) != node)
+		if (node != -1 && slab_node != node)
 			continue;
 #endif
+		if (!current_cpuset_object_allowed(slab_node, gfp))
+			continue;
 		/* Enough room on this page? */
 		if (sp->units < SLOB_UNITS(size))
 			continue;
diff --git a/mm/slub.c b/mm/slub.c
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1353,6 +1353,8 @@ static struct page *get_partial(struct kmem_cache *s, gfp_t flags, int node)
 	struct page *page;
 	int searchnode = (node == -1) ? numa_node_id() : node;
 
+	if (!current_cpuset_object_allowed(node, flags))
+		searchnode = cpuset_mem_spread_node();
 	page = get_partial_node(get_node(s, searchnode));
 	if (page || (flags & __GFP_THISNODE))
 		return page;
@@ -1475,15 +1477,15 @@ static void flush_all(struct kmem_cache *s)
 
 /*
  * Check if the objects in a per cpu structure fit numa
- * locality expectations.
+ * locality expectations and is allowed in current's cpuset.
  */
-static inline int node_match(struct kmem_cache_cpu *c, int node)
+static inline int check_node(struct kmem_cache_cpu *c, int node, gfp_t flags)
 {
 #ifdef CONFIG_NUMA
 	if (node != -1 && c->node != node)
 		return 0;
 #endif
-	return 1;
+	return current_cpuset_object_allowed(node, flags);
 }
 
 /*
@@ -1517,7 +1519,7 @@ static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 		goto new_slab;
 
 	slab_lock(c->page);
-	if (unlikely(!node_match(c, node)))
+	if (unlikely(!check_node(c, node, gfpflags)))
 		goto another_slab;
 
 	stat(c, ALLOC_REFILL);
@@ -1604,7 +1606,7 @@ static __always_inline void *slab_alloc(struct kmem_cache *s,
 	local_irq_save(flags);
 	c = get_cpu_slab(s, smp_processor_id());
 	objsize = c->objsize;
-	if (unlikely(!c->freelist || !node_match(c, node)))
+	if (unlikely(!c->freelist || !check_node(c, node, gfpflags)))
 
 		object = __slab_alloc(s, gfpflags, node, addr, c);
 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch -mm] cpusets: add memory_slab_hardwall flag
  2009-03-08 16:27 [patch -mm] cpusets: add memory_slab_hardwall flag David Rientjes
@ 2009-03-08 16:53 ` Paul Menage
  2009-03-08 21:38   ` David Rientjes
  2009-03-08 17:01 ` Matt Mackall
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 27+ messages in thread
From: Paul Menage @ 2009-03-08 16:53 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, Matt Mackall,
	Randy Dunlap, linux-kernel

On Sun, Mar 8, 2009 at 9:27 AM, David Rientjes <rientjes@google.com> wrote:
> +/**
> + * current_cpuset_object_allowed - can a slab object be allocated on a node?
> + * @node: the node for object allocation
> + * @flags: allocation flags
> + *
> + * Return non-zero if object is allowed, zero otherwise.
> + */
> +int current_cpuset_object_allowed(int node, gfp_t flags)
> +{
> +       return !is_slab_hardwall(task_cs(current)) ||
> +              cpuset_node_allowed_hardwall(node, flags);
> +}
> +

This should be in rcu_read_lock()/rcu_read_unlock() in order to safely
dereference the result of task_cs(current)

I'll leave the actual memory allocator changes for others to comment on .

Paul

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch -mm] cpusets: add memory_slab_hardwall flag
  2009-03-08 16:53 ` Paul Menage
@ 2009-03-08 21:38   ` David Rientjes
  2009-03-09  7:08     ` Paul Menage
  0 siblings, 1 reply; 27+ messages in thread
From: David Rientjes @ 2009-03-08 21:38 UTC (permalink / raw)
  To: Paul Menage
  Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, Matt Mackall,
	Randy Dunlap, linux-kernel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1332 bytes --]

On Sun, 8 Mar 2009, Paul Menage wrote:

> > +/**
> > + * current_cpuset_object_allowed - can a slab object be allocated on a node?
> > + * @node: the node for object allocation
> > + * @flags: allocation flags
> > + *
> > + * Return non-zero if object is allowed, zero otherwise.
> > + */
> > +int current_cpuset_object_allowed(int node, gfp_t flags)
> > +{
> > +       return !is_slab_hardwall(task_cs(current)) ||
> > +              cpuset_node_allowed_hardwall(node, flags);
> > +}
> > +
> 
> This should be in rcu_read_lock()/rcu_read_unlock() in order to safely
> dereference the result of task_cs(current)
> 

I've folded the following into the patch, thanks Paul.
---
 kernel/cpuset.c |    8 ++++++--
 1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1205,8 +1205,12 @@ int current_cpuset_is_being_rebound(void)
  */
 int current_cpuset_object_allowed(int node, gfp_t flags)
 {
-	return !is_slab_hardwall(task_cs(current)) ||
-	       cpuset_node_allowed_hardwall(node, flags);
+	int is_hardwall;
+
+	rcu_read_lock();
+	is_hardwall = is_slab_hardwall(task_cs(current));
+	rcu_read_unlock();
+	return !is_hardwall || cpuset_node_allowed_hardwall(node, flags);
 }
 
 static int update_relax_domain_level(struct cpuset *cs, s64 val)

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch -mm] cpusets: add memory_slab_hardwall flag
  2009-03-08 21:38   ` David Rientjes
@ 2009-03-09  7:08     ` Paul Menage
  2009-03-09  9:18       ` David Rientjes
  0 siblings, 1 reply; 27+ messages in thread
From: Paul Menage @ 2009-03-09  7:08 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, Matt Mackall,
	Randy Dunlap, linux-kernel

Another thought - it would probably be better to call this flag
kernel_mem_hardwall or mem_hardwall_kernel, to avoid hard-coding its
name to be slab-specific.

Paul

On Sun, Mar 8, 2009 at 2:38 PM, David Rientjes <rientjes@google.com> wrote:
> On Sun, 8 Mar 2009, Paul Menage wrote:
>
>> > +/**
>> > + * current_cpuset_object_allowed - can a slab object be allocated on a node?
>> > + * @node: the node for object allocation
>> > + * @flags: allocation flags
>> > + *
>> > + * Return non-zero if object is allowed, zero otherwise.
>> > + */
>> > +int current_cpuset_object_allowed(int node, gfp_t flags)
>> > +{
>> > +       return !is_slab_hardwall(task_cs(current)) ||
>> > +              cpuset_node_allowed_hardwall(node, flags);
>> > +}
>> > +
>>
>> This should be in rcu_read_lock()/rcu_read_unlock() in order to safely
>> dereference the result of task_cs(current)
>>
>
> I've folded the following into the patch, thanks Paul.
> ---
>  kernel/cpuset.c |    8 ++++++--
>  1 files changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/cpuset.c b/kernel/cpuset.c
> --- a/kernel/cpuset.c
> +++ b/kernel/cpuset.c
> @@ -1205,8 +1205,12 @@ int current_cpuset_is_being_rebound(void)
>  */
>  int current_cpuset_object_allowed(int node, gfp_t flags)
>  {
> -       return !is_slab_hardwall(task_cs(current)) ||
> -              cpuset_node_allowed_hardwall(node, flags);
> +       int is_hardwall;
> +
> +       rcu_read_lock();
> +       is_hardwall = is_slab_hardwall(task_cs(current));
> +       rcu_read_unlock();
> +       return !is_hardwall || cpuset_node_allowed_hardwall(node, flags);
>  }
>
>  static int update_relax_domain_level(struct cpuset *cs, s64 val)

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch -mm] cpusets: add memory_slab_hardwall flag
  2009-03-09  7:08     ` Paul Menage
@ 2009-03-09  9:18       ` David Rientjes
  0 siblings, 0 replies; 27+ messages in thread
From: David Rientjes @ 2009-03-09  9:18 UTC (permalink / raw)
  To: Paul Menage
  Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, Matt Mackall,
	Randy Dunlap, linux-kernel

On Mon, 9 Mar 2009, Paul Menage wrote:

> Another thought - it would probably be better to call this flag
> kernel_mem_hardwall or mem_hardwall_kernel, to avoid hard-coding its
> name to be slab-specific.
> 

The change only affects slab allocations, it doesn't affect all kernel 
memory allocations.  With slub, for example, allocations that are larger 
than SLUB_MAX_ORDER (was formerly PAGE_SIZE) simply use compound pages 
from the page allocator where the cpuset memory policy was already 
enforced.

While there are a few different options for slab allocators in mainline 
and slqb on the way, these are still generally referred to as "slab" 
allocations regardless of which one is configured.

Prefixing the name with `memory_' just seemed natural considering the 
tunables already in place such as memory_spread_page and 
memory_spread_slab.  The user also already understands `hardwall' since 
mem_hardwall has existed for quite some time.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch -mm] cpusets: add memory_slab_hardwall flag
  2009-03-08 16:27 [patch -mm] cpusets: add memory_slab_hardwall flag David Rientjes
  2009-03-08 16:53 ` Paul Menage
@ 2009-03-08 17:01 ` Matt Mackall
  2009-03-08 21:51   ` David Rientjes
  2009-03-09  4:49 ` KOSAKI Motohiro
  2009-03-09 18:50 ` Christoph Lameter
  3 siblings, 1 reply; 27+ messages in thread
From: Matt Mackall @ 2009-03-08 17:01 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, Paul Menage,
	Randy Dunlap, linux-kernel

On Sun, 2009-03-08 at 09:27 -0700, David Rientjes wrote:
> Adds a per-cpuset `memory_slab_hardwall' flag.
> 
> The slab allocator interface for determining whether an object is allowed
> is
> 
> 	int current_cpuset_object_allowed(int node, gfp_t flags)
> 
> This returns non-zero when the object is allowed, either because
> current's cpuset does not have memory_slab_hardwall enabled or because
> it allows allocation on the node.  Otherwise, it returns zero.
> 
> This interface is lockless because a task's cpuset can always be safely
> dereferenced atomically.
> 
> For slab, if the physical node id of the cpu cache is not from an
> allowable node, the allocation will fail.  If an allocation is targeted
> for a node that is not allowed, we allocate from an appropriate one
> instead of failing.
> 
> For slob, if the page from the slob list is not from an allowable node,
> we continue to scan for an appropriate slab.  If none can be used, a new
> slab is allocated.

Looks fine to me, if a little expensive. We'll be needing SLQB support
though.

-- 
http://selenic.com : development and support for Mercurial and Linux



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch -mm] cpusets: add memory_slab_hardwall flag
  2009-03-08 17:01 ` Matt Mackall
@ 2009-03-08 21:51   ` David Rientjes
  0 siblings, 0 replies; 27+ messages in thread
From: David Rientjes @ 2009-03-08 21:51 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, Paul Menage,
	Randy Dunlap, linux-kernel

On Sun, 8 Mar 2009, Matt Mackall wrote:

> > For slob, if the page from the slob list is not from an allowable node,
> > we continue to scan for an appropriate slab.  If none can be used, a new
> > slab is allocated.
> 
> Looks fine to me, if a little expensive.

Is that your acked-by? :)

It's not expensive for cpusets that do not set memory_slab_hardwall, which 
is disabled by default, other than some cacheline polluting.  If the 
option is set, then the performance penalty is described in the 
documentation and should be assumed by the user.

We currently have a couple different ways to check for a task's cpuset 
options:

 - per-task flags such as PF_SPREAD_PAGE and PF_SPREAD_SLAB used in the
   hotpath, and

 - rcu dereferencing current's cpuset and atomically checking a cpuset
   flag bit.

It would be nice to unify these to free up some task flag bits.

> We'll be needing SLQB support
> though.
> 

Yeah, I'd like to add the necessary slqb support from Pekka's git tree but 
this patch depends on 
cpusets-replace-zone-allowed-functions-with-node-allowed.patch in -mm, so 
we'll need to know the route by which this should be pushed.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch -mm] cpusets: add memory_slab_hardwall flag
  2009-03-08 16:27 [patch -mm] cpusets: add memory_slab_hardwall flag David Rientjes
  2009-03-08 16:53 ` Paul Menage
  2009-03-08 17:01 ` Matt Mackall
@ 2009-03-09  4:49 ` KOSAKI Motohiro
  2009-03-09  9:12   ` David Rientjes
  2009-03-09 18:50 ` Christoph Lameter
  3 siblings, 1 reply; 27+ messages in thread
From: KOSAKI Motohiro @ 2009-03-09  4:49 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Christoph Lameter, Pekka Enberg,
	Matt Mackall, Paul Menage, Randy Dunlap, linux-kernel

> Adds a per-cpuset `memory_slab_hardwall' flag.
> 
> The slab allocator interface for determining whether an object is allowed
> is
> 
> 	int current_cpuset_object_allowed(int node, gfp_t flags)
> 
> This returns non-zero when the object is allowed, either because
> current's cpuset does not have memory_slab_hardwall enabled or because
> it allows allocation on the node.  Otherwise, it returns zero.
> 
> This interface is lockless because a task's cpuset can always be safely
> dereferenced atomically.
> 
> For slab, if the physical node id of the cpu cache is not from an
> allowable node, the allocation will fail.  If an allocation is targeted
> for a node that is not allowed, we allocate from an appropriate one
> instead of failing.
> 
> For slob, if the page from the slob list is not from an allowable node,
> we continue to scan for an appropriate slab.  If none can be used, a new
> slab is allocated.
> 
> For slub, if the cpu slab is not from an allowable node, the partial list
> is scanned for a replacement.  If none can be used, a new slab is
> allocated.

Hmmm,
this description only explay how to implement this.
but no explain why this patch is useful.

Could you please who and why need it?




^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch -mm] cpusets: add memory_slab_hardwall flag
  2009-03-09  4:49 ` KOSAKI Motohiro
@ 2009-03-09  9:12   ` David Rientjes
  2009-03-09 10:19     ` KOSAKI Motohiro
  0 siblings, 1 reply; 27+ messages in thread
From: David Rientjes @ 2009-03-09  9:12 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, Matt Mackall,
	Paul Menage, Randy Dunlap, linux-kernel

On Mon, 9 Mar 2009, KOSAKI Motohiro wrote:

> Hmmm,
> this description only explay how to implement this.
> but no explain why this patch is useful.
> 
> Could you please who and why need it?
> 

The change to Documentation/cgroups/cpusets.txt should have explained it.

This is for two cases: true memory isolation (now including slab 
allocations at the object level) and NUMA optimizations.

Prior to this change, it was possible for slabs to be allocated in a 
cpuset while its objects were largely consumed by disjoint cpusets.  We 
can fix that by only allocating objects from slabs that are found on 
current->mems_allowed.  While this incurs a performance penalty, some 
users may find that true isolation outweighs the cache optimizations.

It is also helpful for long-lived objects that require NUMA affinity to a 
certain cpu or group of cpus.  That is, after all, the reasoning behind 
cpusets in the first place.  If slab objects were all allocated from a 
node with remote affinity to the cpus that will be addressing it, it 
negates a significant advantage that cpusets provides to the user.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch -mm] cpusets: add memory_slab_hardwall flag
  2009-03-09  9:12   ` David Rientjes
@ 2009-03-09 10:19     ` KOSAKI Motohiro
  2009-03-09 20:26       ` David Rientjes
  0 siblings, 1 reply; 27+ messages in thread
From: KOSAKI Motohiro @ 2009-03-09 10:19 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Andrew Morton, Christoph Lameter, Pekka Enberg,
	Matt Mackall, Paul Menage, Randy Dunlap, linux-kernel

> On Mon, 9 Mar 2009, KOSAKI Motohiro wrote:
> 
> > Hmmm,
> > this description only explay how to implement this.
> > but no explain why this patch is useful.
> > 
> > Could you please who and why need it?
> > 
> 
> The change to Documentation/cgroups/cpusets.txt should have explained it.
> 
> This is for two cases: true memory isolation (now including slab 
> allocations at the object level) and NUMA optimizations.
> 
> Prior to this change, it was possible for slabs to be allocated in a 
> cpuset while its objects were largely consumed by disjoint cpusets.  We 
> can fix that by only allocating objects from slabs that are found on 
> current->mems_allowed.  While this incurs a performance penalty, some 
> users may find that true isolation outweighs the cache optimizations.
> 
> It is also helpful for long-lived objects that require NUMA affinity to a 
> certain cpu or group of cpus.  That is, after all, the reasoning behind 
> cpusets in the first place.  If slab objects were all allocated from a 
> node with remote affinity to the cpus that will be addressing it, it 
> negates a significant advantage that cpusets provides to the user.

My question mean, Why anyone need isolation?
your patch insert new branch into hotpath.
then, it makes slower hotpath a abit although a user don't use this feature.

typically, slab cache don't need strict node binding because
inode/dentry touched from multiple cpus.

In addition, on large numa systems, slab cache is relatively small
than page cache. then this feature's improvement seems relatively small too.

if you have strongly reason, I don't oppose this proposal.
but I don't think your explanation is enough reasonable reason.

btw, have you seen "immediate values" patch series? I think it
can become make the patch zero cost for non-cpuset user.
after that patch merging, I don't oppose this patch although
your reason isn't so much.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch -mm] cpusets: add memory_slab_hardwall flag
  2009-03-09 10:19     ` KOSAKI Motohiro
@ 2009-03-09 20:26       ` David Rientjes
  2009-03-09 21:14         ` Christoph Lameter
  0 siblings, 1 reply; 27+ messages in thread
From: David Rientjes @ 2009-03-09 20:26 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, Matt Mackall,
	Paul Menage, Randy Dunlap, linux-kernel

On Mon, 9 Mar 2009, KOSAKI Motohiro wrote:

> My question mean, Why anyone need isolation?
> your patch insert new branch into hotpath.
> then, it makes slower hotpath a abit although a user don't use this feature.
> 

On large NUMA machines, it is currently possible for a very large 
percentage (if not all) of your slab allocations to come from memory that 
is distant from your application's set of allowable cpus.  Such 
allocations that are long-lived would benefit from having affinity to 
those processors.  Again, this is the typical use case for cpusets: to 
bind memory nodes to groups of cpus with affinity to it for the tasks 
attached to the cpuset.

> typically, slab cache don't need strict node binding because
> inode/dentry touched from multiple cpus.
> 

This change would obviously require inode and dentry objects to originate 
from a node on the cpuset's set of mems_allowed.  That would incur a 
performance penalty if the cpu slab is not from such a node, but that is 
assumed by the user who has enabled the option.

> In addition, on large numa systems, slab cache is relatively small
> than page cache. then this feature's improvement seems relatively small too.
> 

That's irrelevant, large NUMA machines may still require memory affinity 
to a specific group of cpus, the size of the global slab cache isn't 
important if that's the goal.  When the option is enabled for cpusets 
that require that memory locality, we happily trade off partial list 
fragmentation and increased slab allocations for the long-lived local 
allocations.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch -mm] cpusets: add memory_slab_hardwall flag
  2009-03-09 20:26       ` David Rientjes
@ 2009-03-09 21:14         ` Christoph Lameter
  2009-03-09 21:31           ` David Rientjes
  0 siblings, 1 reply; 27+ messages in thread
From: Christoph Lameter @ 2009-03-09 21:14 UTC (permalink / raw)
  To: David Rientjes
  Cc: KOSAKI Motohiro, Andrew Morton, Pekka Enberg, Matt Mackall,
	Paul Menage, Randy Dunlap, linux-kernel

On Mon, 9 Mar 2009, David Rientjes wrote:

> On Mon, 9 Mar 2009, KOSAKI Motohiro wrote:
>
> > My question mean, Why anyone need isolation?
> > your patch insert new branch into hotpath.
> > then, it makes slower hotpath a abit although a user don't use this feature.
> On large NUMA machines, it is currently possible for a very large
> percentage (if not all) of your slab allocations to come from memory that
> is distant from your application's set of allowable cpus.  Such
> allocations that are long-lived would benefit from having affinity to
> those processors.  Again, this is the typical use case for cpusets: to
> bind memory nodes to groups of cpus with affinity to it for the tasks
> attached to the cpuset.

Can you show us a real workload that suffers from this issue?

If you want to make sure that an allocation comes from a certain node then
specifying the node in kmalloc_node() will give you what you want.

> > typically, slab cache don't need strict node binding because
> > inode/dentry touched from multiple cpus.
> This change would obviously require inode and dentry objects to originate
> from a node on the cpuset's set of mems_allowed.  That would incur a
> performance penalty if the cpu slab is not from such a node, but that is
> assumed by the user who has enabled the option.

The usage of kernel objects may not be cpuset specific. This is true for
other objects than inode and dentries well.

> > In addition, on large numa systems, slab cache is relatively small
> > than page cache. then this feature's improvement seems relatively small too.
> That's irrelevant, large NUMA machines may still require memory affinity
> to a specific group of cpus, the size of the global slab cache isn't
> important if that's the goal.  When the option is enabled for cpusets
> that require that memory locality, we happily trade off partial list
> fragmentation and increased slab allocations for the long-lived local
> allocations.

Other memory may spill over too. F.e. two processes from disjunct cpu sets
cause faults in the same address range (its rather common for this to
happen to glibc code f.e.). Two processes may use another kernel feature
that buffers objects (are you going to want to search the LRU lists for objects
from the right node?)

NUMA affinity is there in the large picture. In detail the allocation
strategies over nodes etc etc may be disturbed by this and that in
particular if processes with disjoint cpusets run on the same processor.

Just dont do it. Dedicate a cpu to a cpuset. Overlapping cpusets can cause
other strange things as well.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch -mm] cpusets: add memory_slab_hardwall flag
  2009-03-09 21:14         ` Christoph Lameter
@ 2009-03-09 21:31           ` David Rientjes
  2009-03-10 20:50             ` Christoph Lameter
  0 siblings, 1 reply; 27+ messages in thread
From: David Rientjes @ 2009-03-09 21:31 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: KOSAKI Motohiro, Andrew Morton, Pekka Enberg, Matt Mackall,
	Paul Menage, Randy Dunlap, linux-kernel

On Mon, 9 Mar 2009, Christoph Lameter wrote:

> > On large NUMA machines, it is currently possible for a very large
> > percentage (if not all) of your slab allocations to come from memory that
> > is distant from your application's set of allowable cpus.  Such
> > allocations that are long-lived would benefit from having affinity to
> > those processors.  Again, this is the typical use case for cpusets: to
> > bind memory nodes to groups of cpus with affinity to it for the tasks
> > attached to the cpuset.
> 
> Can you show us a real workload that suffers from this issue?
> 

We're more interested in the isolation characteristic, but that also 
benefits large NUMA machines by keeping nodes free of egregious amounts of 
slab allocated for remote cpus.

> If you want to make sure that an allocation comes from a certain node then
> specifying the node in kmalloc_node() will give you what you want.
> 

That's essentially what the change does implicitly: it changes all 
kmalloc() calls to kmalloc_node() for current->mems_allowed.

> > This change would obviously require inode and dentry objects to originate
> > from a node on the cpuset's set of mems_allowed.  That would incur a
> > performance penalty if the cpu slab is not from such a node, but that is
> > assumed by the user who has enabled the option.
> 
> The usage of kernel objects may not be cpuset specific. This is true for
> other objects than inode and dentries well.
> 

Yes, and that's why we require the cpuset hardwall on a configurable 
per-cpuset basis.  If a cpuset has set this option for its workload, then 
it is demanding object allocations from local memory.  Other cpusets that 
do not have memory_slab_hardwall set can still allocate from any cpu slab 
or partial slab, including those allocated for the hardwall cpuset.

> Other memory may spill over too. F.e. two processes from disjunct cpu sets
> cause faults in the same address range (its rather common for this to
> happen to glibc code f.e.). Two processes may use another kernel feature
> that buffers objects (are you going to want to search the LRU lists for objects
> from the right node?)
> 

If a workload is demanding node local object allocation, then an object 
buffer probably isn't in its best interest if they are not all from nodes 
with affinity.

> NUMA affinity is there in the large picture.

It depends heavily on the allocation and freeing pattern, it is quite 
possible that NUMA affinity will never be realized through slub if all 
slabs are consistently allocated on a single node just because we get an 
alloc when the current cpu slab must be replaced.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch -mm] cpusets: add memory_slab_hardwall flag
  2009-03-09 21:31           ` David Rientjes
@ 2009-03-10 20:50             ` Christoph Lameter
  2009-03-10 21:08               ` Matt Mackall
  2009-03-10 22:12               ` Paul Menage
  0 siblings, 2 replies; 27+ messages in thread
From: Christoph Lameter @ 2009-03-10 20:50 UTC (permalink / raw)
  To: David Rientjes
  Cc: KOSAKI Motohiro, Andrew Morton, Pekka Enberg, Matt Mackall,
	Paul Menage, Randy Dunlap, linux-kernel

On Mon, 9 Mar 2009, David Rientjes wrote:

> On Mon, 9 Mar 2009, Christoph Lameter wrote:
>
> > > On large NUMA machines, it is currently possible for a very large
> > > percentage (if not all) of your slab allocations to come from memory that
> > > is distant from your application's set of allowable cpus.  Such
> > > allocations that are long-lived would benefit from having affinity to
> > > those processors.  Again, this is the typical use case for cpusets: to
> > > bind memory nodes to groups of cpus with affinity to it for the tasks
> > > attached to the cpuset.
> >
> > Can you show us a real workload that suffers from this issue?
> >
>
> We're more interested in the isolation characteristic, but that also
> benefits large NUMA machines by keeping nodes free of egregious amounts of
> slab allocated for remote cpus.

So no real workload just some isolation idea.

> > If you want to make sure that an allocation comes from a certain node then
> > specifying the node in kmalloc_node() will give you what you want.
> >
>
> That's essentially what the change does implicitly: it changes all
> kmalloc() calls to kmalloc_node() for current->mems_allowed.

Ok then you can use kmalloc_node?


> > The usage of kernel objects may not be cpuset specific. This is true for
> > other objects than inode and dentries well.
> >
>
> Yes, and that's why we require the cpuset hardwall on a configurable
> per-cpuset basis.  If a cpuset has set this option for its workload, then
> it is demanding object allocations from local memory.  Other cpusets that
> do not have memory_slab_hardwall set can still allocate from any cpu slab
> or partial slab, including those allocated for the hardwall cpuset.

You cannot hardwall something that is used in a shared way by processes in
multiple cpusets.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch -mm] cpusets: add memory_slab_hardwall flag
  2009-03-10 20:50             ` Christoph Lameter
@ 2009-03-10 21:08               ` Matt Mackall
  2009-03-12 16:08                 ` Christoph Lameter
  2009-03-10 22:12               ` Paul Menage
  1 sibling, 1 reply; 27+ messages in thread
From: Matt Mackall @ 2009-03-10 21:08 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: David Rientjes, KOSAKI Motohiro, Andrew Morton, Pekka Enberg,
	Paul Menage, Randy Dunlap, linux-kernel

On Tue, 2009-03-10 at 16:50 -0400, Christoph Lameter wrote:
> On Mon, 9 Mar 2009, David Rientjes wrote:
> 
> > On Mon, 9 Mar 2009, Christoph Lameter wrote:
> >
> > > > On large NUMA machines, it is currently possible for a very large
> > > > percentage (if not all) of your slab allocations to come from memory that
> > > > is distant from your application's set of allowable cpus.  Such
> > > > allocations that are long-lived would benefit from having affinity to
> > > > those processors.  Again, this is the typical use case for cpusets: to
> > > > bind memory nodes to groups of cpus with affinity to it for the tasks
> > > > attached to the cpuset.
> > >
> > > Can you show us a real workload that suffers from this issue?
> > >
> >
> > We're more interested in the isolation characteristic, but that also
> > benefits large NUMA machines by keeping nodes free of egregious amounts of
> > slab allocated for remote cpus.
> 
> So no real workload just some isolation idea.
> 
> > > If you want to make sure that an allocation comes from a certain node then
> > > specifying the node in kmalloc_node() will give you what you want.
> > >
> >
> > That's essentially what the change does implicitly: it changes all
> > kmalloc() calls to kmalloc_node() for current->mems_allowed.
> 
> Ok then you can use kmalloc_node?

Yes, he certainly could change every single kmalloc that a process might
ever reach to kmalloc_node. But I don't think that's optimal.

> 
> > > The usage of kernel objects may not be cpuset specific. This is true for
> > > other objects than inode and dentries well.
> > >
> >
> > Yes, and that's why we require the cpuset hardwall on a configurable
> > per-cpuset basis.  If a cpuset has set this option for its workload, then
> > it is demanding object allocations from local memory.  Other cpusets that
> > do not have memory_slab_hardwall set can still allocate from any cpu slab
> > or partial slab, including those allocated for the hardwall cpuset.
> 
> You cannot hardwall something that is used in a shared way by processes in
> multiple cpusets.

He can enforce that every allocation made when a given task is current
conforms. His patch demonstrates that.

-- 
http://selenic.com : development and support for Mercurial and Linux



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch -mm] cpusets: add memory_slab_hardwall flag
  2009-03-10 21:08               ` Matt Mackall
@ 2009-03-12 16:08                 ` Christoph Lameter
  2009-03-12 17:23                   ` Matt Mackall
  0 siblings, 1 reply; 27+ messages in thread
From: Christoph Lameter @ 2009-03-12 16:08 UTC (permalink / raw)
  To: Matt Mackall
  Cc: David Rientjes, KOSAKI Motohiro, Andrew Morton, Pekka Enberg,
	Paul Menage, Randy Dunlap, linux-kernel

On Tue, 10 Mar 2009, Matt Mackall wrote:

> > > Yes, and that's why we require the cpuset hardwall on a configurable
> > > per-cpuset basis.  If a cpuset has set this option for its workload, then
> > > it is demanding object allocations from local memory.  Other cpusets that
> > > do not have memory_slab_hardwall set can still allocate from any cpu slab
> > > or partial slab, including those allocated for the hardwall cpuset.
> >
> > You cannot hardwall something that is used in a shared way by processes in
> > multiple cpusets.
>
> He can enforce that every allocation made when a given task is current
> conforms. His patch demonstrates that.

Of course. But that may just be a subset of the data used by a task. If an
inode, dentry and so on was already allocated in the context of another
process then the locality of that allocation will not be changed. The
hardwall will have no effect.



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch -mm] cpusets: add memory_slab_hardwall flag
  2009-03-12 16:08                 ` Christoph Lameter
@ 2009-03-12 17:23                   ` Matt Mackall
  0 siblings, 0 replies; 27+ messages in thread
From: Matt Mackall @ 2009-03-12 17:23 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: David Rientjes, KOSAKI Motohiro, Andrew Morton, Pekka Enberg,
	Paul Menage, Randy Dunlap, linux-kernel

On Thu, 2009-03-12 at 12:08 -0400, Christoph Lameter wrote:
> On Tue, 10 Mar 2009, Matt Mackall wrote:
> 
> > > > Yes, and that's why we require the cpuset hardwall on a configurable
> > > > per-cpuset basis.  If a cpuset has set this option for its workload, then
> > > > it is demanding object allocations from local memory.  Other cpusets that
> > > > do not have memory_slab_hardwall set can still allocate from any cpu slab
> > > > or partial slab, including those allocated for the hardwall cpuset.
> > >
> > > You cannot hardwall something that is used in a shared way by processes in
> > > multiple cpusets.
> >
> > He can enforce that every allocation made when a given task is current
> > conforms. His patch demonstrates that.
> 
> Of course. But that may just be a subset of the data used by a task. If an
> inode, dentry and so on was already allocated in the context of another
> process then the locality of that allocation will not be changed. The
> hardwall will have no effect.

It will if he's also using a namespace. This is part of a larger puzzle.

-- 
http://selenic.com : development and support for Mercurial and Linux



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch -mm] cpusets: add memory_slab_hardwall flag
  2009-03-10 20:50             ` Christoph Lameter
  2009-03-10 21:08               ` Matt Mackall
@ 2009-03-10 22:12               ` Paul Menage
  2009-03-12 15:51                 ` Christoph Lameter
  1 sibling, 1 reply; 27+ messages in thread
From: Paul Menage @ 2009-03-10 22:12 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: David Rientjes, KOSAKI Motohiro, Andrew Morton, Pekka Enberg,
	Matt Mackall, Randy Dunlap, linux-kernel

On Tue, Mar 10, 2009 at 1:50 PM, Christoph Lameter
<cl@linux-foundation.org> wrote:
>
> So no real workload just some isolation idea.

We definitely  have real workloads where a job is allocating lots of
slab memory (e.g. network socket buffers, dentry/inode objects, etc)
and we want to be able to account the memory usage to each job rather
than having all the slab scattered around unidentifiably, and to
reduce fragmentation (so when a job finishes, all its sockets close
and all its files are deleted, there's a better chance that we'll be
able to reclaim some slab memory). We could probably turn those into
more synthetic benchmarkable loads if necessary for demonstration.

Paul

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch -mm] cpusets: add memory_slab_hardwall flag
  2009-03-10 22:12               ` Paul Menage
@ 2009-03-12 15:51                 ` Christoph Lameter
  0 siblings, 0 replies; 27+ messages in thread
From: Christoph Lameter @ 2009-03-12 15:51 UTC (permalink / raw)
  To: Paul Menage
  Cc: David Rientjes, KOSAKI Motohiro, Andrew Morton, Pekka Enberg,
	Matt Mackall, Randy Dunlap, linux-kernel

On Tue, 10 Mar 2009, Paul Menage wrote:

> We definitely  have real workloads where a job is allocating lots of
> slab memory (e.g. network socket buffers, dentry/inode objects, etc)
> and we want to be able to account the memory usage to each job rather
> than having all the slab scattered around unidentifiably, and to
> reduce fragmentation (so when a job finishes, all its sockets close
> and all its files are deleted, there's a better chance that we'll be
> able to reclaim some slab memory). We could probably turn those into
> more synthetic benchmarkable loads if necessary for demonstration.

So this is about memory accounting? The kernel tracks all memory used by a
process and releases it independantly from this patch.

The resources that you are mentioning are resources that are typically
shared by multiple processes. There no task owning these items. It is
accidental that a certain process is exclusively using one of these at a
time.

The real workloads are running in cpusets that are overlapping? Why would
this be done? The point of cpusets is typically to segment the
processors for a certain purpose.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch -mm] cpusets: add memory_slab_hardwall flag
  2009-03-08 16:27 [patch -mm] cpusets: add memory_slab_hardwall flag David Rientjes
                   ` (2 preceding siblings ...)
  2009-03-09  4:49 ` KOSAKI Motohiro
@ 2009-03-09 18:50 ` Christoph Lameter
  2009-03-09 20:13   ` David Rientjes
  2009-03-12  1:03   ` Paul E. McKenney
  3 siblings, 2 replies; 27+ messages in thread
From: Christoph Lameter @ 2009-03-09 18:50 UTC (permalink / raw)
  To: David Rientjes
  Cc: Andrew Morton, Pekka Enberg, Matt Mackall, Paul Menage,
	Randy Dunlap, linux-kernel

Again these are fastpath modifications.

Scanning the partial list for matching nodes is an expensive operation.

Adding RCU into the fast paths is also another big worry.


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch -mm] cpusets: add memory_slab_hardwall flag
  2009-03-09 18:50 ` Christoph Lameter
@ 2009-03-09 20:13   ` David Rientjes
  2009-03-10  1:35     ` KOSAKI Motohiro
  2009-03-12  1:03   ` Paul E. McKenney
  1 sibling, 1 reply; 27+ messages in thread
From: David Rientjes @ 2009-03-09 20:13 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andrew Morton, Pekka Enberg, Matt Mackall, Paul Menage,
	Randy Dunlap, linux-kernel

On Mon, 9 Mar 2009, Christoph Lameter wrote:

> Again these are fastpath modifications.
> 

The nature of the change requires the logic to be placed in the fastpath 
to determine whether a cpu slab's node is allowed by the allocating task's 
cpuset.

You have previously stated that you would prefer that this feature be 
tunable from userspace.  This patch adds the `memory_slab_hardwall' cpuset 
flag which defaults to off.

> Scanning the partial list for matching nodes is an expensive operation.
> 

It depends on how long you scan for a matching node, but again: this 
should be assumed by the user if the option has been enabled.

> Adding RCU into the fast paths is also another big worry.
> 

This could be mitigated by adding a PF_SLAB_HARDWALL flag similiar to 
PF_SPREAD_PAGE and PF_SPREAD_SLAB.  I'd prefer not to add additional 
cpuset-specific task flags, but this would address your concern.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch -mm] cpusets: add memory_slab_hardwall flag
  2009-03-09 20:13   ` David Rientjes
@ 2009-03-10  1:35     ` KOSAKI Motohiro
  2009-03-10  2:01       ` David Rientjes
  0 siblings, 1 reply; 27+ messages in thread
From: KOSAKI Motohiro @ 2009-03-10  1:35 UTC (permalink / raw)
  To: David Rientjes
  Cc: kosaki.motohiro, Christoph Lameter, Andrew Morton, Pekka Enberg,
	Matt Mackall, Paul Menage, Randy Dunlap, linux-kernel

> On Mon, 9 Mar 2009, Christoph Lameter wrote:
> 
> > Again these are fastpath modifications.
> > 
> 
> The nature of the change requires the logic to be placed in the fastpath 
> to determine whether a cpu slab's node is allowed by the allocating task's 
> cpuset.
> 
> You have previously stated that you would prefer that this feature be 
> tunable from userspace.  This patch adds the `memory_slab_hardwall' cpuset 
> flag which defaults to off.

That's pointless.
Again, any fastpath modification should have reasonable reason.
We are looking for your explanation.

I have each 6+ year experience on embedded, HPC, and high-end server area.
but I haven't hear this requirement. I still can't imazine who use this feature.




^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch -mm] cpusets: add memory_slab_hardwall flag
  2009-03-10  1:35     ` KOSAKI Motohiro
@ 2009-03-10  2:01       ` David Rientjes
  2009-03-10  4:05         ` Matt Mackall
  0 siblings, 1 reply; 27+ messages in thread
From: David Rientjes @ 2009-03-10  2:01 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Christoph Lameter, Andrew Morton, Pekka Enberg, Matt Mackall,
	Paul Menage, Randy Dunlap, linux-kernel

On Tue, 10 Mar 2009, KOSAKI Motohiro wrote:

> That's pointless.
> Again, any fastpath modification should have reasonable reason.
> We are looking for your explanation.
> 

The fastpath modification simply checks if the hardwall bit is set in the 
allocating task's cpuset flags.  If it's disabled, there is no additional 
overhead.

This requirement was mandated during the first review of the patch by 
Christoph, who requested that it be configurable.  Before that it was 
possible to simply check if the global `number_of_cpusets' count was > 1.  
If not, cpuset_node_allowed_hardwall() would always return true.  If the 
system had more than one cpuset, it would have reduced to checking

	return in_interrupt() || (gfp_mask & __GFP_THISNODE) ||
	       node_isset(node, current->mems_allowed);

As I already mentioned, if fastpath optimization is your only concern, 
that we could simply add PF_SLAB_HARDWALL task flags that would simply 
make this

	return current->flags & PF_SLAB_HARDWALL;

So the fastpath cost can be mitigated at the expense of an additional task 
flag.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch -mm] cpusets: add memory_slab_hardwall flag
  2009-03-10  2:01       ` David Rientjes
@ 2009-03-10  4:05         ` Matt Mackall
  2009-03-10  4:26           ` David Rientjes
  0 siblings, 1 reply; 27+ messages in thread
From: Matt Mackall @ 2009-03-10  4:05 UTC (permalink / raw)
  To: David Rientjes
  Cc: KOSAKI Motohiro, Christoph Lameter, Andrew Morton, Pekka Enberg,
	Paul Menage, Randy Dunlap, linux-kernel

On Mon, 2009-03-09 at 19:01 -0700, David Rientjes wrote:
> On Tue, 10 Mar 2009, KOSAKI Motohiro wrote:
> 
> > That's pointless.
> > Again, any fastpath modification should have reasonable reason.
> > We are looking for your explanation.
> > 
> 
> The fastpath modification simply checks if the hardwall bit is set in the 
> allocating task's cpuset flags.  If it's disabled, there is no additional 
> overhead.

Ok, I for one understand perfectly the desire for this feature.

But we are still extremely sensitive to adding potential branches to one
of the most important fast-paths in the kernel, especially for a feature
with a fairly narrow use case. We've invested an awful lot of time into
micro-optimizing SLAB (by rewriting it as SLUB/SLQB) so any steps
backward at this stage are cause for concern. Also, remember 99%+ of
users will never care about this feature.

For SLOB, I think the code is fine as it stands, but we probably want to
be a bit more clever for the others. At the very minimum, we'd like this
to be in an unlikely path. Better still if the initial test can somehow
be hidden with another test. It might also be possible to use the
patching code used by markers to enable the path only when one or more
tasks needs it.

-- 
http://selenic.com : development and support for Mercurial and Linux

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch -mm] cpusets: add memory_slab_hardwall flag
  2009-03-10  4:05         ` Matt Mackall
@ 2009-03-10  4:26           ` David Rientjes
  0 siblings, 0 replies; 27+ messages in thread
From: David Rientjes @ 2009-03-10  4:26 UTC (permalink / raw)
  To: Matt Mackall
  Cc: KOSAKI Motohiro, Christoph Lameter, Andrew Morton, Pekka Enberg,
	Paul Menage, Randy Dunlap, linux-kernel

On Mon, 9 Mar 2009, Matt Mackall wrote:

> But we are still extremely sensitive to adding potential branches to one
> of the most important fast-paths in the kernel, especially for a feature
> with a fairly narrow use case. We've invested an awful lot of time into
> micro-optimizing SLAB (by rewriting it as SLUB/SLQB) so any steps
> backward at this stage are cause for concern. Also, remember 99%+ of
> users will never care about this feature.
> 

My latest proposal simply checks for !(current->flags & PF_SLAB_HARDWALL) 
before determining whether the set of allowable nodes needs to be checked.  
For slub, this is in addition to the prexisting logic that checks whether 
the object can be from any node (node == -1) in slab_alloc() or the cpu 
slab is from the node requested for kmalloc_node() users for CONFIG_NUMA 
kernels.

You could argue that, in the slub example, check_node() should do this:

	static inline int check_node(struct kmem_cache_cpu *c, int node,
				     gfp_t flags)
	{
	#ifdef CONFIG_NUMA
		if (node != -1 && c->node != node)
			return 0;
		if (likely(!(current->flags & PF_SLAB_HARDWALL)))
			return 1;
	#endif
		return current_cpuset_object_allowed(node, flags);
	}

Although this would penalize the case where current's cpuset has 
memory_slab_hardwall enabled, yet the cpu slab is still allowed because it 
originated from current->mems_allowed.

If checking for the PF_SLAB_HARDWALL bit in current->flags really is 
unacceptable in my latest proposal, then a viable solution probably 
doesn't exist for such workloads that want hardwall object allocations.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch -mm] cpusets: add memory_slab_hardwall flag
  2009-03-09 18:50 ` Christoph Lameter
  2009-03-09 20:13   ` David Rientjes
@ 2009-03-12  1:03   ` Paul E. McKenney
  2009-03-12  5:51     ` David Rientjes
  1 sibling, 1 reply; 27+ messages in thread
From: Paul E. McKenney @ 2009-03-12  1:03 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: David Rientjes, Andrew Morton, Pekka Enberg, Matt Mackall,
	Paul Menage, Randy Dunlap, linux-kernel

On Mon, Mar 09, 2009 at 02:50:06PM -0400, Christoph Lameter wrote:
> Again these are fastpath modifications.
> 
> Scanning the partial list for matching nodes is an expensive operation.
> 
> Adding RCU into the fast paths is also another big worry.

Hello, Christoph,

Adding synchronize_rcu() into a fast path would certainly be a problem,
but call_rcu() should be OK.  If the data structure is updated often
(old elements removed and new elements added), then the cache misses
from elements that were removed, went cache-cold, and then were added
again could potentially cause trouble, but read-mostly data structures
should be OK.

Or were you worried about some other aspect of RCU overhead?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [patch -mm] cpusets: add memory_slab_hardwall flag
  2009-03-12  1:03   ` Paul E. McKenney
@ 2009-03-12  5:51     ` David Rientjes
  0 siblings, 0 replies; 27+ messages in thread
From: David Rientjes @ 2009-03-12  5:51 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Christoph Lameter, Andrew Morton, Pekka Enberg, Matt Mackall,
	Paul Menage, Randy Dunlap, linux-kernel

On Wed, 11 Mar 2009, Paul E. McKenney wrote:

> Adding synchronize_rcu() into a fast path would certainly be a problem,
> but call_rcu() should be OK.  If the data structure is updated often
> (old elements removed and new elements added), then the cache misses
> from elements that were removed, went cache-cold, and then were added
> again could potentially cause trouble, but read-mostly data structures
> should be OK.
> 
> Or were you worried about some other aspect of RCU overhead?
> 

Thanks for looking at this, Paul.  My latest proposal actually replaces 
the need for the rcu with a per-task flag called PF_SLAB_HARDWALL (see 
http://marc.info/?l=linux-kernel&m=123665181400366).

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2009-03-12 17:25 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-03-08 16:27 [patch -mm] cpusets: add memory_slab_hardwall flag David Rientjes
2009-03-08 16:53 ` Paul Menage
2009-03-08 21:38   ` David Rientjes
2009-03-09  7:08     ` Paul Menage
2009-03-09  9:18       ` David Rientjes
2009-03-08 17:01 ` Matt Mackall
2009-03-08 21:51   ` David Rientjes
2009-03-09  4:49 ` KOSAKI Motohiro
2009-03-09  9:12   ` David Rientjes
2009-03-09 10:19     ` KOSAKI Motohiro
2009-03-09 20:26       ` David Rientjes
2009-03-09 21:14         ` Christoph Lameter
2009-03-09 21:31           ` David Rientjes
2009-03-10 20:50             ` Christoph Lameter
2009-03-10 21:08               ` Matt Mackall
2009-03-12 16:08                 ` Christoph Lameter
2009-03-12 17:23                   ` Matt Mackall
2009-03-10 22:12               ` Paul Menage
2009-03-12 15:51                 ` Christoph Lameter
2009-03-09 18:50 ` Christoph Lameter
2009-03-09 20:13   ` David Rientjes
2009-03-10  1:35     ` KOSAKI Motohiro
2009-03-10  2:01       ` David Rientjes
2009-03-10  4:05         ` Matt Mackall
2009-03-10  4:26           ` David Rientjes
2009-03-12  1:03   ` Paul E. McKenney
2009-03-12  5:51     ` David Rientjes

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox