* [PATCH 00/10] Cpuset: rebind vma's and other refinements
@ 2005-12-10  8:18 Paul Jackson
  2005-12-10  8:18 ` [PATCH 01/10] Cpuset: remove marker_pid documentation Paul Jackson
                   ` (9 more replies)
  0 siblings, 10 replies; 12+ messages in thread
From: Paul Jackson @ 2005-12-10  8:18 UTC (permalink / raw)
  To: akpm; +Cc: Simon Derr, Paul Jackson, linux-kernel, Christoph Lameter

The following ten patches provide automatic rebinding of per-vma
mempolicies after a cpuset is migrated, along with a variety
of other cpuset tweaks, fixes, cleanup and optimizations.

  1. remove marker_pid documentation
  2. minor spacing initializer fixes
  3. update_nodemask code reformat
  4. fork hook fix
  5. combine refresh_mems and update_mems
  6. implement cpuset_mems_allowed
  7. numa_policy_rebind cleanup
  8. number_of_cpusets optimization
  9. rebind vma mempolicies fix
 10. migrate all tasks in cpuset at once

Andrew - this is exactly the same set of 10 patches that
I sent you a day ago, when I forgot to cc lkml.

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373


* [PATCH 01/10] Cpuset: remove marker_pid documentation
From: Paul Jackson @ 2005-12-10  8:18 UTC (permalink / raw)
  To: akpm; +Cc: Paul Jackson, Simon Derr, linux-kernel, Christoph Lameter

Remove documentation for the cpuset 'marker_pid' feature, which
was in the patch "cpuset: change marker for relative numbering".
That patch was previously pulled from *-mm at my (pj) request.

Signed-off-by: Paul Jackson <pj@sgi.com>

---

 Documentation/cpusets.txt |   50 +++-------------------------------------------
 1 files changed, 4 insertions(+), 46 deletions(-)

--- 2.6.15-rc3-mm1.orig/Documentation/cpusets.txt	2005-12-04 02:09:06.423921557 -0800
+++ 2.6.15-rc3-mm1/Documentation/cpusets.txt	2005-12-04 02:32:31.231015268 -0800
@@ -16,9 +16,8 @@ CONTENTS:
   1.3 How are cpusets implemented ?
   1.4 What are exclusive cpusets ?
   1.5 What does notify_on_release do ?
-  1.6 What is a marker_pid ?
-  1.7 What is memory_pressure ?
-  1.8 How do I use cpusets ?
+  1.6 What is memory_pressure ?
+  1.7 How do I use cpusets ?
 2. Usage Examples and Syntax
   2.1 Basic Usage
   2.2 Adding/removing cpus
@@ -178,7 +177,6 @@ containing the following files describin
  - mem_exclusive flag: is memory placement exclusive?
  - tasks: list of tasks (by pid) attached to that cpuset
  - notify_on_release flag: run /sbin/cpuset_release_agent on exit?
- - marker_pid: pid of user task in co-ordinated operation sequence
  - memory_pressure: measure of how much paging pressure in cpuset
 
 In addition, the root cpuset only has the following file:
@@ -260,47 +258,7 @@ boot is disabled (0).  The default value
 is the current value of their parents notify_on_release setting.
 
 
-1.6 What is a marker_pid ?
---------------------------
-
-The marker_pid helps manage cpuset changes safely from user space.
-
-The interface presented to user space for cpusets uses system wide
-numbering of CPUs and Memory Nodes.   It is the responsibility of
-user level code, presumably in a library, to present cpuset-relative
-numbering to applications when that would be more useful to them.
-
-However if a task is moved to a different cpuset, or if the 'cpus' or
-'mems' of a cpuset are changed, then we need a way for such library
-code to detect that its cpuset-relative numbering has changed, when
-expressed using system wide numbering.
-
-The kernel cannot safely allow user code to lock kernel resources.
-The kernel could deliver out-of-band notice of cpuset changes by
-such mechanisms as signals or usermodehelper callbacks, however
-this can't be synchronously delivered to library code linked in
-applications without intruding on the IPC mechanisms available to
-the app.  The kernel could require user level code to do all the work,
-tracking the cpuset state before and during changes, to verify no
-unexpected change occurred, but this becomes an onerous task.
-
-The "marker_pid" cpuset field provides a simple way to make this task
-less onerous on user library code.  A task writes its pid to a cpusets
-"marker_pid" at the start of a sequence of queries and updates,
-and check as it goes that the cpusets marker_pid doesn't change.
-The pread(2) system call does a seek and read in a single call.
-If the marker_pid changes, the user code should retry the required
-sequence of operations.
-
-Anytime that a task modifies the "cpus" or "mems" of a cpuset,
-unless it's pid is in the cpusets marker_pid field, the kernel zeros
-this field.
-
-The above was inspired by the load linked and store conditional
-(ll/sc) instructions in the MIPS II instruction set.
-
-
-1.7 What is memory_pressure ?
+1.6 What is memory_pressure ?
 -----------------------------
 The memory_pressure of a cpuset provides a simple per-cpuset metric
 of the rate that the tasks in a cpuset are attempting to free up in
@@ -357,7 +315,7 @@ the tasks in the cpuset, in units of rec
 times 1000.
 
 
-1.8 How do I use cpusets ?
+1.7 How do I use cpusets ?
 --------------------------
 
 In order to minimize the impact of cpusets on critical kernel


* [PATCH 02/10] Cpuset: minor spacing initializer fixes
From: Paul Jackson @ 2005-12-10  8:18 UTC (permalink / raw)
  To: akpm; +Cc: Simon Derr, Paul Jackson, linux-kernel, Christoph Lameter

Four trivial cpuset fixes: remove extra spaces, remove
useless initializers, mark one __read_mostly.

Signed-off-by: Paul Jackson <pj@sgi.com>

---

 kernel/cpuset.c |    9 +++------
 1 files changed, 3 insertions(+), 6 deletions(-)

--- 2.6.15-rc3-mm1.orig/kernel/cpuset.c	2005-12-07 19:17:14.783189624 -0800
+++ 2.6.15-rc3-mm1/kernel/cpuset.c	2005-12-07 19:18:14.776048770 -0800
@@ -54,7 +54,7 @@
 #include <asm/atomic.h>
 #include <asm/semaphore.h>
 
-#define CPUSET_SUPER_MAGIC 		0x27e0eb
+#define CPUSET_SUPER_MAGIC		0x27e0eb
 
 /* See "Frequency meter" comments, below. */
 
@@ -154,9 +154,6 @@ static struct cpuset top_cpuset = {
 	.count = ATOMIC_INIT(0),
 	.sibling = LIST_HEAD_INIT(top_cpuset.sibling),
 	.children = LIST_HEAD_INIT(top_cpuset.children),
-	.parent = NULL,
-	.dentry = NULL,
-	.mems_generation = 0,
 };
 
 static struct vfsmount *cpuset_mount;
@@ -1341,7 +1338,7 @@ static int cpuset_create_file(struct den
 
 /*
  *	cpuset_create_dir - create a directory for an object.
- *	cs: 	the cpuset we create the directory for.
+ *	cs:	the cpuset we create the directory for.
  *		It must have a valid ->parent field
  *		And we are going to fill its ->dentry field.
  *	name:	The name to give to the cpuset directory. Will be copied.
@@ -2049,7 +2046,7 @@ done:
  * cpuset file 'memory_pressure_enabled' in the root cpuset.
  */
 
-int cpuset_memory_pressure_enabled;
+int cpuset_memory_pressure_enabled __read_mostly;
 
 /**
  * cpuset_memory_pressure_bump - keep stats of per-cpuset reclaims.


* [PATCH 03/10] Cpuset: update_nodemask code reformat
From: Paul Jackson @ 2005-12-10  8:19 UTC (permalink / raw)
  To: akpm; +Cc: Paul Jackson, Simon Derr, linux-kernel, Christoph Lameter

Restructure code layout of the kernel/cpuset.c update_nodemask()
routine, removing embedded returns and nested if's in favor of
goto completion labels.  This is being done in anticipation of
adding more logic to this routine, which will favor the goto
style structure.

Signed-off-by: Paul Jackson <pj@sgi.com>

---
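For readers unfamiliar with the style, here is a minimal user-space
sketch of the single-exit "goto done" layout described above.  It is
not the kernel code: demo_set, demo_online, demo_validate() and
demo_update_mems() are hypothetical names invented for illustration,
and locking is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a cpuset: a bitmask of allowed memory nodes. */
struct demo_set {
	unsigned long mems_allowed;
	int generation;
};

static const unsigned long demo_online = 0x0fUL;	/* nodes 0-3 online */

/* Hypothetical validator: refuse any mask that drops node 0. */
static int demo_validate(unsigned long mask)
{
	return (mask & 1UL) ? 0 : -EINVAL;
}

/*
 * Every failure jumps to the single exit label, so the commit step
 * sits at the top nesting level and later steps can be added without
 * deepening the nesting.
 */
static int demo_update_mems(struct demo_set *s, unsigned long requested)
{
	unsigned long trial;
	int retval = 0;

	assert(s != NULL);

	trial = requested & demo_online;
	if (trial == 0) {
		retval = -ENOSPC;
		goto done;
	}
	retval = demo_validate(trial);
	if (retval < 0)
		goto done;

	/* Commit only after every check has passed. */
	s->mems_allowed = trial;
	s->generation++;
done:
	return retval;
}
```

A failed call leaves the set untouched; a successful one commits the
new mask and bumps the generation exactly once.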

 kernel/cpuset.c |   25 +++++++++++++++----------
 1 files changed, 15 insertions(+), 10 deletions(-)

--- 2.6.15-rc3-mm1.orig/kernel/cpuset.c	2005-12-07 19:15:37.420771718 -0800
+++ 2.6.15-rc3-mm1/kernel/cpuset.c	2005-12-07 19:16:31.104966322 -0800
@@ -799,18 +799,23 @@ static int update_nodemask(struct cpuset
 	trialcs = *cs;
 	retval = nodelist_parse(buf, trialcs.mems_allowed);
 	if (retval < 0)
-		return retval;
+		goto done;
 	nodes_and(trialcs.mems_allowed, trialcs.mems_allowed, node_online_map);
-	if (nodes_empty(trialcs.mems_allowed))
-		return -ENOSPC;
-	retval = validate_change(cs, &trialcs);
-	if (retval == 0) {
-		down(&callback_sem);
-		cs->mems_allowed = trialcs.mems_allowed;
-		atomic_inc(&cpuset_mems_generation);
-		cs->mems_generation = atomic_read(&cpuset_mems_generation);
-		up(&callback_sem);
+	if (nodes_empty(trialcs.mems_allowed)) {
+		retval = -ENOSPC;
+		goto done;
 	}
+	retval = validate_change(cs, &trialcs);
+	if (retval < 0)
+		goto done;
+
+	down(&callback_sem);
+	cs->mems_allowed = trialcs.mems_allowed;
+	atomic_inc(&cpuset_mems_generation);
+	cs->mems_generation = atomic_read(&cpuset_mems_generation);
+	up(&callback_sem);
+
+done:
 	return retval;
 }
 


* [PATCH 04/10] Cpuset: fork hook fix
From: Paul Jackson @ 2005-12-10  8:19 UTC (permalink / raw)
  To: akpm; +Cc: Simon Derr, Paul Jackson, linux-kernel, Christoph Lameter

Fix obscure, never seen in real life, cpuset fork race.
The cpuset_fork() call in fork.c was setting up the correct
task->cpuset pointer after the tasklist_lock was dropped,
which briefly exposed the newly forked process with an unsafe
(copied from parent without locks or usage counter increment)
cpuset pointer.

In theory, that exposed cpuset pointer could have been pointing
at a cpuset that was already freed and removed, and in theory
another task that had been sitting on the tasklist_lock waiting
to scan the task list could have raced down the entire tasklist,
found our new child at the far end, and dereferenced that bogus
cpuset pointer.

To fix, set up the correct cpuset pointer in the new child
by calling cpuset_fork() before the new task is linked into the
tasklist, and with that, add a fork failure case that drops the
reference to that cpuset if the fork fails along the way, after
cpuset_fork() was called.

Had to remove a BUG_ON() from cpuset_exit(), because it was
no longer valid - the call to cpuset_exit() from a failed fork
would not have PF_EXITING set.

Signed-off-by: Paul Jackson <pj@sgi.com>

---
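The ordering argument above can be sketched in user-space C, under
heavy simplification: the reference is taken before the child is
published on a list, and a failure label drops it again.  All names
here (demo_cpuset, demo_task, demo_tasklist, demo_copy_process()) are
hypothetical, the sketch is single-threaded, and the real locking
(tasklist_lock, task_lock) is omitted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical refcounted object standing in for a cpuset. */
struct demo_cpuset { int count; };

/* Hypothetical task with a singly linked "task list" hook. */
struct demo_task {
	struct demo_cpuset *cs;
	struct demo_task *next;
};

static struct demo_task *demo_tasklist;	/* visible to list scanners */

static void demo_cpuset_fork(struct demo_task *child, struct demo_cpuset *cs)
{
	cs->count++;		/* take the reference ... */
	child->cs = cs;		/* ... before anyone else can see the child */
}

static void demo_cpuset_exit(struct demo_task *t)
{
	struct demo_cpuset *cs = t->cs;

	t->cs = NULL;
	if (cs)
		cs->count--;
}

/* fail_late simulates an error after the reference was taken. */
static int demo_copy_process(struct demo_task *child,
			     struct demo_cpuset *parent_cs, int fail_late)
{
	int retval;

	child->cs = NULL;
	child->next = NULL;
	demo_cpuset_fork(child, parent_cs);

	if (fail_late) {
		retval = -1;
		goto bad_fork_cleanup_cpuset;
	}

	/* Publish only once child->cs is counted and valid. */
	assert(child->cs != NULL);
	child->next = demo_tasklist;
	demo_tasklist = child;
	return 0;

bad_fork_cleanup_cpuset:
	demo_cpuset_exit(child);	/* drop the reference on failure */
	return retval;
}
```

In this sketch a failed fork leaves the count where it started and
never links the child, so a list scanner can only ever find children
whose cpuset pointer is counted.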

 kernel/cpuset.c |    4 +---
 kernel/fork.c   |    6 ++++--
 2 files changed, 5 insertions(+), 5 deletions(-)

--- 2.6.15-rc3-mm1.orig/kernel/cpuset.c	2005-12-08 02:05:37.457685051 -0800
+++ 2.6.15-rc3-mm1/kernel/cpuset.c	2005-12-08 15:19:04.600207271 -0800
@@ -1821,15 +1821,13 @@ void cpuset_fork(struct task_struct *chi
  *
  * We don't need to task_lock() this reference to tsk->cpuset,
  * because tsk is already marked PF_EXITING, so attach_task() won't
- * mess with it.
+ * mess with it, or task is a failed fork, never visible to attach_task.
  **/
 
 void cpuset_exit(struct task_struct *tsk)
 {
 	struct cpuset *cs;
 
-	BUG_ON(!(tsk->flags & PF_EXITING));
-
 	cs = tsk->cpuset;
 	tsk->cpuset = NULL;
 
--- 2.6.15-rc3-mm1.orig/kernel/fork.c	2005-12-08 02:05:34.885390778 -0800
+++ 2.6.15-rc3-mm1/kernel/fork.c	2005-12-08 15:19:50.203259819 -0800
@@ -971,12 +971,13 @@ static task_t *copy_process(unsigned lon
 	p->io_context = NULL;
 	p->io_wait = NULL;
 	p->audit_context = NULL;
+	cpuset_fork(p);
 #ifdef CONFIG_NUMA
  	p->mempolicy = mpol_copy(p->mempolicy);
  	if (IS_ERR(p->mempolicy)) {
  		retval = PTR_ERR(p->mempolicy);
  		p->mempolicy = NULL;
- 		goto bad_fork_cleanup;
+ 		goto bad_fork_cleanup_cpuset;
  	}
 #endif
 
@@ -1147,7 +1148,6 @@ static task_t *copy_process(unsigned lon
 	total_forks++;
 	write_unlock_irq(&tasklist_lock);
 	proc_fork_connector(p);
-	cpuset_fork(p);
 	retval = 0;
 
 fork_out:
@@ -1179,7 +1179,9 @@ bad_fork_cleanup_security:
 bad_fork_cleanup_policy:
 #ifdef CONFIG_NUMA
 	mpol_free(p->mempolicy);
+bad_fork_cleanup_cpuset:
 #endif
+	cpuset_exit(p);
 bad_fork_cleanup:
 	if (p->binfmt)
 		module_put(p->binfmt->module);


* [PATCH 05/10] Cpuset: combine refresh_mems and update_mems
From: Paul Jackson @ 2005-12-10  8:19 UTC (permalink / raw)
  To: akpm; +Cc: Paul Jackson, Simon Derr, linux-kernel, Christoph Lameter

The important code paths through alloc_pages_current()
and alloc_page_vma(), by which most kernel page allocations
go, both called cpuset_update_current_mems_allowed(),
which in turn called refresh_mems().  -Both- of these
latter two routines took task_lock(), fetched the task's
cpuset pointer, and checked for an out-of-date
cpuset->mems_generation.

That was a silly duplication of code and waste of CPU cycles
on an important code path.

Consolidated those two routines into a single routine,
called cpuset_update_task_memory_state(), since it updates
more than just mems_allowed.

Changed all callers of either routine to call the new
consolidated routine.

Signed-off-by: Paul Jackson <pj@sgi.com>

---
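The generation check that the consolidated routine performs can be
sketched in user-space C as follows.  This is an illustration only,
assuming a single thread: demo_cpuset, demo_task, demo_set_mems() and
demo_update_task_memory_state() are hypothetical names, and the
task_lock()/callback_sem locking, mempolicy rebind and page migration
of the real routine are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical shared object: placement plus a change counter. */
struct demo_cpuset {
	unsigned long mems_allowed;
	int mems_generation;
};

/* Hypothetical per-task cache of the shared placement. */
struct demo_task {
	struct demo_cpuset *cs;
	unsigned long mems_allowed;
	int cached_generation;
	int refreshes;		/* counts slow-path runs, for illustration */
};

/* Writer side: change the placement and bump the generation. */
static void demo_set_mems(struct demo_cpuset *cs, unsigned long mask)
{
	cs->mems_allowed = mask;
	cs->mems_generation++;
}

/*
 * Reader side: one routine does both the cheap staleness check and,
 * only when needed, the refresh.  Locking is omitted in this sketch.
 */
static void demo_update_task_memory_state(struct demo_task *t)
{
	struct demo_cpuset *cs;

	assert(t != NULL);
	cs = t->cs;
	if (!cs)
		return;		/* no cpuset attached: nothing to refresh */
	if (cs->mems_generation == t->cached_generation)
		return;		/* fast path: nothing changed */

	t->mems_allowed = cs->mems_allowed;
	t->cached_generation = cs->mems_generation;
	t->refreshes++;
}
```

Calling the routine repeatedly is cheap: the slow path runs once per
change of the shared placement, however often the allocator path
invokes it.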

 include/linux/cpuset.h |    4 +-
 kernel/cpuset.c        |   95 +++++++++++++++++++++----------------------------
 mm/mempolicy.c         |   10 ++---
 3 files changed, 48 insertions(+), 61 deletions(-)

--- 2.6.15-rc3-mm1.orig/include/linux/cpuset.h	2005-12-07 22:00:40.525006978 -0800
+++ 2.6.15-rc3-mm1/include/linux/cpuset.h	2005-12-07 23:48:54.860211028 -0800
@@ -20,7 +20,7 @@ extern void cpuset_fork(struct task_stru
 extern void cpuset_exit(struct task_struct *p);
 extern cpumask_t cpuset_cpus_allowed(const struct task_struct *p);
 void cpuset_init_current_mems_allowed(void);
-void cpuset_update_current_mems_allowed(void);
+void cpuset_update_task_memory_state(void);
 #define cpuset_nodes_subset_current_mems_allowed(nodes) \
 		nodes_subset((nodes), current->mems_allowed)
 int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl);
@@ -51,7 +51,7 @@ static inline cpumask_t cpuset_cpus_allo
 }
 
 static inline void cpuset_init_current_mems_allowed(void) {}
-static inline void cpuset_update_current_mems_allowed(void) {}
+static inline void cpuset_update_task_memory_state(void) {}
 #define cpuset_nodes_subset_current_mems_allowed(nodes) (1)
 
 static inline int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl)
--- 2.6.15-rc3-mm1.orig/kernel/cpuset.c	2005-12-07 22:12:08.509137821 -0800
+++ 2.6.15-rc3-mm1/kernel/cpuset.c	2005-12-07 23:52:21.746290572 -0800
@@ -584,13 +584,26 @@ static void guarantee_online_mems(const 
 	BUG_ON(!nodes_intersects(*pmask, node_online_map));
 }
 
-/*
- * Refresh current tasks mems_allowed and mems_generation from current
- * tasks cpuset.
+/**
+ * cpuset_update_task_memory_state - update task memory placement
  *
- * Call without callback_sem or task_lock() held.  May be called with
- * or without manage_sem held.  Will acquire task_lock() and might
- * acquire callback_sem during call.
+ * If the current tasks cpusets mems_allowed changed behind our
+ * backs, update current->mems_allowed, mems_generation and task NUMA
+ * mempolicy to the new value.
+ *
+ * Task mempolicy is updated by rebinding it relative to the
+ * current->cpuset if a task has its memory placement changed.
+ * Do not call this routine if in_interrupt().
+ *
+ * Call without callback_sem or task_lock() held.  May be called
+ * with or without manage_sem held.  Except in early boot or
+ * an exiting task, when tsk->cpuset is NULL, this routine will
+ * acquire task_lock().  We don't need to use task_lock to guard
+ * against another task changing a non-NULL cpuset pointer to NULL,
+ * as that is only done by a task on itself, and if the current task
+ * is here, it is not simultaneously in the exit code NULL'ing its
+ * cpuset pointer.  This routine also might acquire callback_sem and
+ * current->mm->mmap_sem during call.
  *
  * The task_lock() is required to dereference current->cpuset safely.
  * Without it, we could pick up the pointer value of current->cpuset
@@ -605,32 +618,36 @@ static void guarantee_online_mems(const 
  * task has been modifying its cpuset.
  */
 
-static void refresh_mems(void)
+void cpuset_update_task_memory_state()
 {
 	int my_cpusets_mem_gen;
+	struct task_struct *tsk = current;
+	struct cpuset *cs = tsk->cpuset;
 
-	task_lock(current);
-	my_cpusets_mem_gen = current->cpuset->mems_generation;
-	task_unlock(current);
+	if (unlikely(!cs))
+		return;
+
+	task_lock(tsk);
+	my_cpusets_mem_gen = cs->mems_generation;
+	task_unlock(tsk);
 
-	if (current->cpuset_mems_generation != my_cpusets_mem_gen) {
-		struct cpuset *cs;
-		nodemask_t oldmem = current->mems_allowed;
+	if (my_cpusets_mem_gen != tsk->cpuset_mems_generation) {
+		nodemask_t oldmem = tsk->mems_allowed;
 		int migrate;
 
 		down(&callback_sem);
-		task_lock(current);
-		cs = current->cpuset;
+		task_lock(tsk);
+		cs = tsk->cpuset;	/* Maybe changed when task not locked */
 		migrate = is_memory_migrate(cs);
-		guarantee_online_mems(cs, &current->mems_allowed);
-		current->cpuset_mems_generation = cs->mems_generation;
-		task_unlock(current);
+		guarantee_online_mems(cs, &tsk->mems_allowed);
+		tsk->cpuset_mems_generation = cs->mems_generation;
+		task_unlock(tsk);
 		up(&callback_sem);
-		if (!nodes_equal(oldmem, current->mems_allowed)) {
-			numa_policy_rebind(&oldmem, &current->mems_allowed);
+		numa_policy_rebind(&oldmem, &tsk->mems_allowed);
+		if (!nodes_equal(oldmem, tsk->mems_allowed)) {
 			if (migrate) {
-				do_migrate_pages(current->mm, &oldmem,
-					&current->mems_allowed,
+				do_migrate_pages(tsk->mm, &oldmem,
+					&tsk->mems_allowed,
 					MPOL_MF_MOVE_ALL);
 			}
 		}
@@ -1630,7 +1647,7 @@ static long cpuset_create(struct cpuset 
 		return -ENOMEM;
 
 	down(&manage_sem);
-	refresh_mems();
+	cpuset_update_task_memory_state();
 	cs->flags = 0;
 	if (notify_on_release(parent))
 		set_bit(CS_NOTIFY_ON_RELEASE, &cs->flags);
@@ -1688,7 +1705,7 @@ static int cpuset_rmdir(struct inode *un
 	/* the vfs holds both inode->i_sem already */
 
 	down(&manage_sem);
-	refresh_mems();
+	cpuset_update_task_memory_state();
 	if (atomic_read(&cs->count) > 0) {
 		up(&manage_sem);
 		return -EBUSY;
@@ -1873,36 +1890,6 @@ void cpuset_init_current_mems_allowed(vo
 }
 
 /**
- * cpuset_update_current_mems_allowed - update mems parameters to new values
- *
- * If the current tasks cpusets mems_allowed changed behind our backs,
- * update current->mems_allowed and mems_generation to the new value.
- * Do not call this routine if in_interrupt().
- *
- * Call without callback_sem or task_lock() held.  May be called
- * with or without manage_sem held.  Unless exiting, it will acquire
- * task_lock().  Also might acquire callback_sem during call to
- * refresh_mems().
- */
-
-void cpuset_update_current_mems_allowed(void)
-{
-	struct cpuset *cs;
-	int need_to_refresh = 0;
-
-	task_lock(current);
-	cs = current->cpuset;
-	if (!cs)
-		goto done;
-	if (current->cpuset_mems_generation != cs->mems_generation)
-		need_to_refresh = 1;
-done:
-	task_unlock(current);
-	if (need_to_refresh)
-		refresh_mems();
-}
-
-/**
  * cpuset_zonelist_valid_mems_allowed - check zonelist vs. curremt mems_allowed
  * @zl: the zonelist to be checked
  *
--- 2.6.15-rc3-mm1.orig/mm/mempolicy.c	2005-12-07 22:00:40.525983551 -0800
+++ 2.6.15-rc3-mm1/mm/mempolicy.c	2005-12-07 23:48:54.994978144 -0800
@@ -389,7 +389,7 @@ static int contextualize_policy(int mode
 	if (!nodes)
 		return 0;
 
-	cpuset_update_current_mems_allowed();
+	cpuset_update_task_memory_state();
 	if (!cpuset_nodes_subset_current_mems_allowed(*nodes))
 		return -EINVAL;
 	return mpol_check_policy(mode, nodes);
@@ -463,7 +463,7 @@ long do_get_mempolicy(int *policy, nodem
 	struct vm_area_struct *vma = NULL;
 	struct mempolicy *pol = current->mempolicy;
 
-	cpuset_update_current_mems_allowed();
+	cpuset_update_task_memory_state();
 	if (flags & ~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR))
 		return -EINVAL;
 	if (flags & MPOL_F_ADDR) {
@@ -1118,7 +1118,7 @@ alloc_page_vma(gfp_t gfp, struct vm_area
 {
 	struct mempolicy *pol = get_vma_policy(current, vma, addr);
 
-	cpuset_update_current_mems_allowed();
+	cpuset_update_task_memory_state();
 
 	if (unlikely(pol->policy == MPOL_INTERLEAVE)) {
 		unsigned nid;
@@ -1144,7 +1144,7 @@ alloc_page_vma(gfp_t gfp, struct vm_area
  *	interrupt context and apply the current process NUMA policy.
  *	Returns NULL when no page can be allocated.
  *
- *	Don't call cpuset_update_current_mems_allowed() unless
+ *	Don't call cpuset_update_task_memory_state() unless
  *	1) it's ok to take cpuset_sem (can WAIT), and
  *	2) allocating for current task (not interrupt).
  */
@@ -1153,7 +1153,7 @@ struct page *alloc_pages_current(gfp_t g
 	struct mempolicy *pol = current->mempolicy;
 
 	if ((gfp & __GFP_WAIT) && !in_interrupt())
-		cpuset_update_current_mems_allowed();
+		cpuset_update_task_memory_state();
 	if (!pol || in_interrupt())
 		pol = &default_policy;
 	if (pol->policy == MPOL_INTERLEAVE)


* [PATCH 06/10] Cpuset: implement cpuset_mems_allowed
From: Paul Jackson @ 2005-12-10  8:19 UTC (permalink / raw)
  To: akpm; +Cc: Simon Derr, Paul Jackson, linux-kernel, Christoph Lameter

Provide a cpuset_mems_allowed() method, which the sys_migrate_pages()
code needs, to obtain the mems_allowed vector of a task's cpuset,
and replace the workaround in sys_migrate_pages() with a call to
this new method.

Signed-off-by: Paul Jackson <pj@sgi.com>

---

 include/linux/cpuset.h |    8 +++++++-
 kernel/cpuset.c        |   29 ++++++++++++++++++++++++++---
 mm/mempolicy.c         |    3 ---
 3 files changed, 33 insertions(+), 7 deletions(-)

--- 2.6.15-rc3-mm1.orig/include/linux/cpuset.h	2005-12-07 23:34:04.173695910 -0800
+++ 2.6.15-rc3-mm1/include/linux/cpuset.h	2005-12-07 23:36:26.159621364 -0800
@@ -18,7 +18,8 @@ extern int cpuset_init(void);
 extern void cpuset_init_smp(void);
 extern void cpuset_fork(struct task_struct *p);
 extern void cpuset_exit(struct task_struct *p);
-extern cpumask_t cpuset_cpus_allowed(const struct task_struct *p);
+extern cpumask_t cpuset_cpus_allowed(struct task_struct *p);
+extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
 void cpuset_init_current_mems_allowed(void);
 void cpuset_update_task_memory_state(void);
 #define cpuset_nodes_subset_current_mems_allowed(nodes) \
@@ -50,6 +51,11 @@ static inline cpumask_t cpuset_cpus_allo
 	return cpu_possible_map;
 }
 
+static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
+{
+	return node_possible_map;
+}
+
 static inline void cpuset_init_current_mems_allowed(void) {}
 static inline void cpuset_update_task_memory_state(void) {}
 #define cpuset_nodes_subset_current_mems_allowed(nodes) (1)
--- 2.6.15-rc3-mm1.orig/kernel/cpuset.c	2005-12-07 23:35:54.229585036 -0800
+++ 2.6.15-rc3-mm1/kernel/cpuset.c	2005-12-07 23:36:26.164504230 -0800
@@ -1871,14 +1871,14 @@ void cpuset_exit(struct task_struct *tsk
  * tasks cpuset.
  **/
 
-cpumask_t cpuset_cpus_allowed(const struct task_struct *tsk)
+cpumask_t cpuset_cpus_allowed(struct task_struct *tsk)
 {
 	cpumask_t mask;
 
 	down(&callback_sem);
-	task_lock((struct task_struct *)tsk);
+	task_lock(tsk);
 	guarantee_online_cpus(tsk->cpuset, &mask);
-	task_unlock((struct task_struct *)tsk);
+	task_unlock(tsk);
 	up(&callback_sem);
 
 	return mask;
@@ -1890,6 +1890,29 @@ void cpuset_init_current_mems_allowed(vo
 }
 
 /**
+ * cpuset_mems_allowed - return mems_allowed mask from a tasks cpuset.
+ * @tsk: pointer to task_struct from which to obtain cpuset->mems_allowed.
+ *
+ * Description: Returns the nodemask_t mems_allowed of the cpuset
+ * attached to the specified @tsk.  Guaranteed to return some non-empty
+ * subset of node_online_map, even if this means going outside the
+ * tasks cpuset.
+ **/
+
+nodemask_t cpuset_mems_allowed(struct task_struct *tsk)
+{
+	nodemask_t mask;
+
+	down(&callback_sem);
+	task_lock(tsk);
+	guarantee_online_mems(tsk->cpuset, &mask);
+	task_unlock(tsk);
+	up(&callback_sem);
+
+	return mask;
+}
+
+/**
  * cpuset_zonelist_valid_mems_allowed - check zonelist vs. curremt mems_allowed
  * @zl: the zonelist to be checked
  *
--- 2.6.15-rc3-mm1.orig/mm/mempolicy.c	2005-12-07 23:34:04.182485069 -0800
+++ 2.6.15-rc3-mm1/mm/mempolicy.c	2005-12-07 23:36:26.168410523 -0800
@@ -774,9 +774,6 @@ asmlinkage long sys_set_mempolicy(int mo
 	return do_set_mempolicy(mode, &nodes);
 }
 
-/* Macro needed until Paul implements this function in kernel/cpusets.c */
-#define cpuset_mems_allowed(task) node_online_map
-
 asmlinkage long sys_migrate_pages(pid_t pid, unsigned long maxnode,
 		const unsigned long __user *old_nodes,
 		const unsigned long __user *new_nodes)


* [PATCH 07/10] Cpuset: numa_policy_rebind cleanup
From: Paul Jackson @ 2005-12-10  8:19 UTC (permalink / raw)
  To: akpm; +Cc: Paul Jackson, Simon Derr, linux-kernel, Christoph Lameter

Clean up, reorganize and make more robust the mempolicy.c code
to rebind mempolicies relative to the containing cpuset after
a task's memory placement changes.

The real motivator for this cleanup patch is to lay more
groundwork for the upcoming patch to correctly rebind NUMA
mempolicies that are attached to vma's after the containing
cpuset memory placement changes.

NUMA mempolicies are constrained by the cpuset their task is
a member of.  When either (1) a task is moved to a different
cpuset, or (2) the 'mems' mems_allowed of a cpuset is changed,
then the NUMA mempolicies have embedded node numbers (for
MPOL_BIND, MPOL_INTERLEAVE and MPOL_PREFERRED) that need to be
recalculated, relative to their new cpuset placement.

The old code used an unreliable method of determining what was
the old mems_allowed constraining the mempolicy.  It just looked
at the task's mems_allowed value.  This sort of worked with the
present code, that just rebinds the -task- mempolicy, and leaves
any -vma- mempolicies broken, referring to the old nodes.  But in
an upcoming patch, the vma mempolicies will be rebound as well.
Then the order in which the various task and vma mempolicies
are updated will no longer be deterministic, and one can no
longer count on the task->mems_allowed holding the old value
for as long as needed.  It's not even clear if the current code
was guaranteed to work reliably for task mempolicies.

So I added a mems_allowed field to each mempolicy, stating
exactly what mems_allowed the policy is relative to, and updated
synchronously and reliably anytime that the mempolicy is rebound.

Also removed a useless wrapper routine, numa_policy_rebind(),
and had its caller, cpuset_update_task_memory_state(), call
directly to the rewritten policy_rebind() routine, and made that
rebind routine extern instead of static, and added a "mpol_"
prefix to its name, making it mpol_rebind_policy().

Signed-off-by: Paul Jackson <pj@sgi.com>

---
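To illustrate why recording the mask in the policy helps, here is a
user-space sketch of rebinding a policy relative to its own stored
mask.  It assumes single-word node masks and a remap rule that sends
the n-th node of the old mask to the n-th node of the new mask,
wrapping around; the kernel's nodes_remap() may differ in detail.  All
names (demo_remap(), demo_policy, demo_rebind(), ...) are hypothetical.

```c
#include <assert.h>

/* Node masks are single words here; bit n stands for node n. */
#define DEMO_MAX_NODES 16

static int demo_weight(unsigned long m)
{
	int n = 0;

	for (; m; m &= m - 1)
		n++;
	return n;
}

/* Number of set bits in m below position pos. */
static int demo_ordinal(unsigned long m, int pos)
{
	return demo_weight(m & ((1UL << pos) - 1));
}

/* Position of the ord-th (0-based) set bit in m, or -1 if none. */
static int demo_nth_bit(unsigned long m, int ord)
{
	int pos;

	for (pos = 0; pos < DEMO_MAX_NODES; pos++)
		if ((m & (1UL << pos)) && ord-- == 0)
			return pos;
	return -1;
}

/* Map the n-th node of oldm to the n-th node of newm, wrapping around. */
static unsigned long demo_remap(unsigned long src, unsigned long oldm,
				unsigned long newm)
{
	unsigned long dst = 0;
	int w = demo_weight(newm);
	int pos;

	for (pos = 0; pos < DEMO_MAX_NODES; pos++) {
		if (!(src & (1UL << pos)))
			continue;
		if (!(oldm & (1UL << pos)) || w == 0)
			dst |= 1UL << pos;	/* not in oldm: unchanged */
		else
			dst |= 1UL << demo_nth_bit(newm,
					demo_ordinal(oldm, pos) % w);
	}
	return dst;
}

/* Hypothetical policy that records the mask it is relative to. */
struct demo_policy {
	unsigned long nodes;
	unsigned long relative_to;
};

static void demo_rebind(struct demo_policy *pol, unsigned long newmask)
{
	assert(pol != 0);
	if (pol->relative_to == newmask)
		return;			/* already rebound: nothing to do */
	pol->nodes = demo_remap(pol->nodes, pol->relative_to, newmask);
	pol->relative_to = newmask;	/* record only after remapping */
}
```

Because each policy carries its own reference mask, a second rebind to
the same mask is a no-op, and the result does not depend on the order
in which several policies are rebound.  The sketch updates the stored
mask last, so every remap step still sees the old mask.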

 include/linux/mempolicy.h |   12 ++++++++++--
 kernel/cpuset.c           |    2 +-
 mm/mempolicy.c            |   31 +++++++++++++++++++------------
 3 files changed, 30 insertions(+), 15 deletions(-)

--- 2.6.15-rc3-mm1.orig/include/linux/mempolicy.h	2005-12-08 19:37:37.906709408 -0800
+++ 2.6.15-rc3-mm1/include/linux/mempolicy.h	2005-12-08 19:42:05.996633138 -0800
@@ -68,6 +68,7 @@ struct mempolicy {
 		nodemask_t	 nodes;		/* interleave */
 		/* undefined for default */
 	} v;
+	nodemask_t cpuset_mems_allowed;	/* mempolicy relative to these nodes */
 };
 
 /*
@@ -146,7 +147,9 @@ struct mempolicy *mpol_shared_policy_loo
 
 extern void numa_default_policy(void);
 extern void numa_policy_init(void);
-extern void numa_policy_rebind(const nodemask_t *old, const nodemask_t *new);
+extern void mpol_rebind_policy(struct mempolicy *pol, const nodemask_t *new);
+extern void mpol_rebind_task(struct task_struct *tsk,
+					const nodemask_t *new);
 extern struct mempolicy default_policy;
 extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
 		unsigned long addr);
@@ -214,7 +217,12 @@ static inline void numa_default_policy(v
 {
 }
 
-static inline void numa_policy_rebind(const nodemask_t *old,
+static inline void mpol_rebind_policy(struct mempolicy *pol,
+					const nodemask_t *new)
+{
+}
+
+static inline void mpol_rebind_task(struct task_struct *tsk,
 					const nodemask_t *new)
 {
 }
--- 2.6.15-rc3-mm1.orig/mm/mempolicy.c	2005-12-08 19:39:31.277122633 -0800
+++ 2.6.15-rc3-mm1/mm/mempolicy.c	2005-12-08 19:43:22.825620845 -0800
@@ -185,6 +185,7 @@ static struct mempolicy *mpol_new(int mo
 		break;
 	}
 	policy->policy = mode;
+	policy->cpuset_mems_allowed = cpuset_mems_allowed(current);
 	return policy;
 }
 
@@ -1440,25 +1441,31 @@ void numa_default_policy(void)
 }
 
 /* Migrate a policy to a different set of nodes */
-static void rebind_policy(struct mempolicy *pol, const nodemask_t *old,
-							const nodemask_t *new)
+void mpol_rebind_policy(struct mempolicy *pol, const nodemask_t *newmask)
 {
+	nodemask_t *mpolmask;
 	nodemask_t tmp;
 
 	if (!pol)
 		return;
+	mpolmask = &pol->cpuset_mems_allowed;
+	if (nodes_equal(*mpolmask, *newmask))
+		return;
 
 	switch (pol->policy) {
 	case MPOL_DEFAULT:
 		break;
 	case MPOL_INTERLEAVE:
-		nodes_remap(tmp, pol->v.nodes, *old, *new);
+		nodes_remap(tmp, pol->v.nodes, *mpolmask, *newmask);
 		pol->v.nodes = tmp;
-		current->il_next = node_remap(current->il_next, *old, *new);
+		current->il_next = node_remap(current->il_next,
+						*mpolmask, *newmask);
+		*mpolmask = *newmask;
 		break;
 	case MPOL_PREFERRED:
 		pol->v.preferred_node = node_remap(pol->v.preferred_node,
-								*old, *new);
+						*mpolmask, *newmask);
+		*mpolmask = *newmask;
 		break;
 	case MPOL_BIND: {
 		nodemask_t nodes;
@@ -1468,7 +1475,7 @@ static void rebind_policy(struct mempoli
 		nodes_clear(nodes);
 		for (z = pol->v.zonelist->zones; *z; z++)
 			node_set((*z)->zone_pgdat->node_id, nodes);
-		nodes_remap(tmp, nodes, *old, *new);
+		nodes_remap(tmp, nodes, *mpolmask, *newmask);
 		nodes = tmp;
 
 		zonelist = bind_zonelist(&nodes);
@@ -1483,6 +1490,7 @@ static void rebind_policy(struct mempoli
 			kfree(pol->v.zonelist);
 			pol->v.zonelist = zonelist;
 		}
+		*mpolmask = *newmask;
 		break;
 	}
 	default:
@@ -1492,14 +1500,13 @@ static void rebind_policy(struct mempoli
 }
 
 /*
- * Someone moved this task to different nodes.  Fixup mempolicies.
- *
- * TODO - fixup current->mm->vma and shmfs/tmpfs/hugetlbfs policies as well,
- * once we have a cpuset mechanism to mark which cpuset subtree is migrating.
+ * Wrapper for mpol_rebind_policy() that just requires task
+ * pointer, and updates task mempolicy.
  */
-void numa_policy_rebind(const nodemask_t *old, const nodemask_t *new)
+
+void mpol_rebind_task(struct task_struct *tsk, const nodemask_t *new)
 {
-	rebind_policy(current->mempolicy, old, new);
+	mpol_rebind_policy(tsk->mempolicy, new);
 }
 
 /*
--- 2.6.15-rc3-mm1.orig/kernel/cpuset.c	2005-12-08 19:39:31.273216339 -0800
+++ 2.6.15-rc3-mm1/kernel/cpuset.c	2005-12-08 19:44:26.519692537 -0800
@@ -643,7 +643,7 @@ void cpuset_update_task_memory_state()
 		tsk->cpuset_mems_generation = cs->mems_generation;
 		task_unlock(tsk);
 		up(&callback_sem);
-		numa_policy_rebind(&oldmem, &tsk->mems_allowed);
+		mpol_rebind_task(tsk, &tsk->mems_allowed);
 		if (!nodes_equal(oldmem, tsk->mems_allowed)) {
 			if (migrate) {
 				do_migrate_pages(tsk->mm, &oldmem,

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373


* [PATCH 08/10] Cpuset: number_of_cpusets optimization
  2005-12-10  8:18 [PATCH 00/10] Cpuset: rebind vma's and other refinements Paul Jackson
                   ` (6 preceding siblings ...)
  2005-12-10  8:19 ` [PATCH 07/10] Cpuset: numa_policy_rebind cleanup Paul Jackson
@ 2005-12-10  8:19 ` Paul Jackson
  2005-12-10  8:19 ` [PATCH 09/10] Cpuset: rebind vma mempolicies fix Paul Jackson
  2005-12-10  8:19 ` [PATCH 10/10] Cpuset: migrate all tasks in cpuset at once Paul Jackson
  9 siblings, 0 replies; 12+ messages in thread
From: Paul Jackson @ 2005-12-10  8:19 UTC (permalink / raw)
  To: akpm; +Cc: Simon Derr, Paul Jackson, linux-kernel, Christoph Lameter

Easy little optimization hack to avoid actually having to call
cpuset_zone_allowed() and check mems_allowed, in the main page
allocation routine, __alloc_pages().  This saves several CPU
cycles per page allocation on systems not using cpusets.

A counter is updated each time a cpuset is created or removed,
and whenever there is only one cpuset in the system, it must be
the root cpuset, which contains all CPUs and all Memory Nodes.
In that case, when the counter is one, all allocations are
allowed.
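
As an illustration of the short circuit described above, here is a
minimal userspace sketch.  It assumes a plain unsigned long stands in
for a nodemask; slow_zone_allowed(), zone_allowed() and the
slow_path_calls counter are hypothetical names for this example only,
not the kernel's cpuset_zone_allowed() code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel objects; not the real API. */
static int number_of_cpusets = 1;	/* the root cpuset always exists */
static int slow_path_calls;		/* instrumentation for this sketch */

/* Stand-in for the expensive per-zone check. */
static bool slow_zone_allowed(unsigned long mems_allowed, int node)
{
	slow_path_calls++;
	return (mems_allowed >> node) & 1UL;
}

/* Fast path: with only the root cpuset, every node is allowed. */
static inline bool zone_allowed(unsigned long mems_allowed, int node)
{
	assert(node >= 0 && node < (int)(sizeof(unsigned long) * 8));
	return number_of_cpusets <= 1 ||
		slow_zone_allowed(mems_allowed, node);
}
```

Because || short-circuits, the slow check is never entered while the
counter is one, which is where the saving on systems not using
cpusets would come from.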

Signed-off-by: Paul Jackson <pj@sgi.com>

---

 include/linux/cpuset.h |   10 +++++++++-
 kernel/cpuset.c        |   12 +++++++++++-
 2 files changed, 20 insertions(+), 2 deletions(-)

--- 2.6.15-rc3-mm1.orig/include/linux/cpuset.h	2005-12-05 23:11:38.389769102 -0800
+++ 2.6.15-rc3-mm1/include/linux/cpuset.h	2005-12-05 23:11:51.876246895 -0800
@@ -14,6 +14,8 @@
 
 #ifdef CONFIG_CPUSETS
 
+extern int number_of_cpusets;	/* How many cpusets are defined in system? */
+
 extern int cpuset_init(void);
 extern void cpuset_init_smp(void);
 extern void cpuset_fork(struct task_struct *p);
@@ -25,7 +27,13 @@ void cpuset_update_task_memory_state(voi
 #define cpuset_nodes_subset_current_mems_allowed(nodes) \
 		nodes_subset((nodes), current->mems_allowed)
 int cpuset_zonelist_valid_mems_allowed(struct zonelist *zl);
-extern int cpuset_zone_allowed(struct zone *z, gfp_t gfp_mask);
+
+extern int __cpuset_zone_allowed(struct zone *z, gfp_t gfp_mask);
+static int inline cpuset_zone_allowed(struct zone *z, gfp_t gfp_mask)
+{
+	return number_of_cpusets <= 1 || __cpuset_zone_allowed(z, gfp_mask);
+}
+
 extern int cpuset_excl_nodes_overlap(const struct task_struct *p);
 
 #define cpuset_memory_pressure_bump() 				\
--- 2.6.15-rc3-mm1.orig/kernel/cpuset.c	2005-12-05 23:11:45.502152719 -0800
+++ 2.6.15-rc3-mm1/kernel/cpuset.c	2005-12-05 23:11:51.880153188 -0800
@@ -56,6 +56,13 @@
 
 #define CPUSET_SUPER_MAGIC		0x27e0eb
 
+/*
+ * Tracks how many cpusets are currently defined in system.
+ * When there is only one cpuset (the root cpuset) we can
+ * short circuit some hooks.
+ */
+int number_of_cpusets;
+
 /* See "Frequency meter" comments, below. */
 
 struct fmeter {
@@ -1664,6 +1671,7 @@ static long cpuset_create(struct cpuset 
 
 	down(&callback_sem);
 	list_add(&cs->sibling, &cs->parent->children);
+	number_of_cpusets++;
 	up(&callback_sem);
 
 	err = cpuset_create_dir(cs, name, mode);
@@ -1726,6 +1734,7 @@ static int cpuset_rmdir(struct inode *un
 	spin_unlock(&d->d_lock);
 	cpuset_d_remove_dir(d);
 	dput(d);
+	number_of_cpusets--;
 	up(&callback_sem);
 	if (list_empty(&parent->children))
 		check_for_release(parent, &pathbuf);
@@ -1769,6 +1778,7 @@ int __init cpuset_init(void)
 	root->d_inode->i_nlink++;
 	top_cpuset.dentry = root;
 	root->d_inode->i_op = &cpuset_dir_inode_operations;
+	number_of_cpusets = 1;
 	err = cpuset_populate_dir(root);
 	/* memory_pressure_enabled is in root cpuset only */
 	if (err == 0)
@@ -1982,7 +1992,7 @@ static const struct cpuset *nearest_excl
  *	GFP_USER     - only nodes in current tasks mems allowed ok.
  **/
 
-int cpuset_zone_allowed(struct zone *z, gfp_t gfp_mask)
+int __cpuset_zone_allowed(struct zone *z, gfp_t gfp_mask)
 {
 	int node;			/* node that zone z is on */
 	const struct cpuset *cs;	/* current cpuset ancestors */

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373


* [PATCH 09/10] Cpuset: rebind vma mempolicies fix
  2005-12-10  8:18 [PATCH 00/10] Cpuset: rebind vma's and other refinements Paul Jackson
                   ` (7 preceding siblings ...)
  2005-12-10  8:19 ` [PATCH 08/10] Cpuset: number_of_cpusets optimization Paul Jackson
@ 2005-12-10  8:19 ` Paul Jackson
  2005-12-10  8:19 ` [PATCH 10/10] Cpuset: migrate all tasks in cpuset at once Paul Jackson
  9 siblings, 0 replies; 12+ messages in thread
From: Paul Jackson @ 2005-12-10  8:19 UTC (permalink / raw)
  To: akpm; +Cc: Paul Jackson, Simon Derr, linux-kernel, Christoph Lameter

Fix more of a longstanding bug in the cpuset/mempolicy interaction.

NUMA mempolicies (mm/mempolicy.c) are constrained by the current
task's cpuset to just the Memory Nodes allowed by that cpuset.
The kernel maintains internal state for each mempolicy, tracking
what nodes are used for the MPOL_INTERLEAVE, MPOL_BIND or
MPOL_PREFERRED policies.

When a task's cpuset memory placement changes, whether because the
cpuset changed, or because the task was attached to a different
cpuset, the task's mempolicies have to be rebound to the
new cpuset placement, so as to preserve the cpuset-relative
numbering of the nodes in that policy.
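
To illustrate what "cpuset-relative numbering" means here, a minimal
userspace sketch follows.  It assumes the remap rule is "the n-th
node of the old placement becomes the n-th node of the new placement,
wrapping if the new placement is smaller, and nodes outside the old
placement are left alone".  The helper names and the use of a plain
unsigned long as a node mask are assumptions for illustration; this
is not the kernel's node_remap()/nodes_remap() implementation.

```c
#include <assert.h>

#define MAX_NODES ((int)(sizeof(unsigned long) * 8))

/* Count of set bits in mask. */
static int weight(unsigned long mask)
{
	int n = 0;

	for (int i = 0; i < MAX_NODES; i++)
		if ((mask >> i) & 1UL)
			n++;
	return n;
}

/* How many set bits of mask lie below position pos. */
static int ordinal_of(unsigned long mask, int pos)
{
	int n = 0;

	for (int i = 0; i < pos; i++)
		if ((mask >> i) & 1UL)
			n++;
	return n;
}

/* Position of the ord-th (0-based) set bit of mask, or -1 if none. */
static int nth_set_bit(unsigned long mask, int ord)
{
	for (int i = 0; i < MAX_NODES; i++)
		if (((mask >> i) & 1UL) && ord-- == 0)
			return i;
	return -1;
}

/* Map one node from the old placement to the new placement. */
static int remap_node(int node, unsigned long oldmask, unsigned long newmask)
{
	int w = weight(newmask);

	assert(node >= 0 && node < MAX_NODES);
	if (!((oldmask >> node) & 1UL) || w == 0)
		return node;		/* not cpuset-relative: unchanged */
	return nth_set_bit(newmask, ordinal_of(oldmask, node) % w);
}

/* Map every node of a policy's node set. */
static unsigned long remap_nodes(unsigned long nodes, unsigned long oldmask,
				 unsigned long newmask)
{
	unsigned long out = 0;

	for (int i = 0; i < MAX_NODES; i++)
		if ((nodes >> i) & 1UL)
			out |= 1UL << remap_node(i, oldmask, newmask);
	return out;
}
```

For example, a policy on the 2nd and 4th nodes of a cpuset holding
nodes 0-3 would, under this rule, land on the 2nd and 4th nodes of
the new placement after the cpuset moves to nodes 4-7.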

An earlier fix handled such mempolicy rebinding for mempolicies
attached to a task.

This fix rebinds mempolicies attached to vma's (address
ranges in a task's address space).  Due to the need to hold
the task->mm->mmap_sem semaphore while updating vma's, the
rebinding of vma mempolicies has to be done when the cpuset
memory placement is changed, at which time mmap_sem can be
safely acquired.  The task's mempolicy is rebound later, when
the task next attempts to allocate memory and notices that its
task->cpuset_mems_generation is out-of-date with its cpuset's
mems_generation.

Because walking the tasklist to find all tasks attached to a
changing cpuset requires holding tasklist_lock, a spinlock,
one cannot update the vma's of the affected tasks while doing
the tasklist scan.  In general, one cannot acquire a semaphore
(which can sleep) while already holding a spinlock (such as
tasklist_lock).  So a list of mm references has to be built
up during the tasklist scan, then the tasklist lock dropped,
then for each mm, its mmap_sem acquired, and the vma's in that
mm rebound.
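
The two-phase pattern described above (collect references under the
spinlock, then do the sleeping work after dropping it) can be
sketched in userspace as follows.  Everything here is a hypothetical
stand-in: struct mm, struct task, the spinlock_held flag modelling
tasklist_lock, and rebind_cpuset() are assumptions for illustration,
not the code in this patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these are kernel types. */
struct mm { int refs; int rebound; };
struct task { int cpuset_id; struct mm *mm; };

static int spinlock_held;	/* models holding tasklist_lock */

/* Models work that takes a sleeping lock: illegal under the spinlock. */
static void rebind_mm(struct mm *mm)
{
	assert(!spinlock_held);
	mm->rebound++;
}

/*
 * Phase 1: under the "spinlock", only collect mm references.
 * Phase 2: after dropping it, do the work that may sleep.
 * Returns the number of mm's processed, or -1 if out[] is too small.
 */
static int rebind_cpuset(struct task *tasks, size_t ntasks, int cpuset_id,
			 struct mm **out, size_t cap)
{
	size_t n = 0;

	spinlock_held = 1;
	for (size_t i = 0; i < ntasks; i++) {
		if (tasks[i].cpuset_id != cpuset_id || !tasks[i].mm)
			continue;
		if (n == cap) {
			while (n > 0)		/* undo references taken */
				out[--n]->refs--;
			spinlock_held = 0;
			return -1;
		}
		tasks[i].mm->refs++;		/* plays the get_task_mm() role */
		out[n++] = tasks[i].mm;
	}
	spinlock_held = 0;

	for (size_t i = 0; i < n; i++) {
		rebind_mm(out[i]);
		out[i]->refs--;			/* plays the mmput() role */
	}
	return (int)n;
}
```

Holding a reference on each mm is what keeps it valid between the two
phases, once the lock that protected the task list has been dropped.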

Once the tasklist lock is dropped, affected tasks may fork
new tasks, before their mm's are rebound.  A kernel global
'cpuset_being_rebound' is set to point to the cpuset being
rebound (there can only be one; cpuset modifications are done
under a global 'manage_sem' semaphore), and the mpol_copy code
that is used to copy a task's mempolicies during fork catches
such forking tasks, and ensures their children are also rebound.

When a task is moved to a different cpuset, it is easier, as
there is only one task involved.  Its mm->vma's are scanned,
using the same mpol_rebind_policy() as used above.

It may happen that both the mpol_copy hook and the update done
via the tasklist scan update the same mm twice.  This is ok,
as the mempolicies of each vma in an mm keep track of what
mems_allowed they are relative to, and safely no-op a second
request to rebind to the same nodes.
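
A minimal userspace sketch of why a second rebind to the same nodes
is a safe no-op: each policy records the mask its nodes are relative
to, and a rebind to that same mask returns early.  The struct policy
layout, the rebinds counter and the simplified remap() helper are
assumptions for illustration, not the kernel's struct mempolicy or
mpol_rebind_policy().

```c
#include <assert.h>

#define NBITS ((int)(sizeof(unsigned long) * 8))

/* Hypothetical policy: nodes plus the mask they are relative to. */
struct policy {
	unsigned long nodes;
	unsigned long relative_to;
	int rebinds;			/* instrumentation for this sketch */
};

/* Simplified remap: n-th set bit of oldmask -> n-th of newmask, wrapping. */
static unsigned long remap(unsigned long nodes, unsigned long oldmask,
			   unsigned long newmask)
{
	int targets[sizeof(unsigned long) * 8];
	int w = 0, ord = 0;
	unsigned long out = 0;

	for (int i = 0; i < NBITS; i++)
		if ((newmask >> i) & 1UL)
			targets[w++] = i;
	if (w == 0)
		return nodes;
	for (int i = 0; i < NBITS; i++) {
		if (!((oldmask >> i) & 1UL)) {
			out |= nodes & (1UL << i);	/* outside oldmask */
			continue;
		}
		if ((nodes >> i) & 1UL)
			out |= 1UL << targets[ord % w];
		ord++;
	}
	return out;
}

/* Rebind once per distinct mask; repeated requests do nothing. */
static void rebind(struct policy *pol, unsigned long newmask)
{
	assert(pol);
	if (pol->relative_to == newmask)
		return;
	pol->nodes = remap(pol->nodes, pol->relative_to, newmask);
	pol->relative_to = newmask;
	pol->rebinds++;
}
```

With this shape, it does not matter whether the fork-time hook or the
tasklist scan reaches a given policy first; whichever comes second
finds the recorded mask already current.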

Signed-off-by: Paul Jackson <pj@sgi.com>

---

 include/linux/mempolicy.h |   18 ++++++++
 kernel/cpuset.c           |   94 ++++++++++++++++++++++++++++++++++++++++++++++
 mm/mempolicy.c            |   29 ++++++++++++++
 3 files changed, 141 insertions(+)

--- 2.6.15-rc3-mm1.orig/kernel/cpuset.c	2005-12-09 04:02:43.013074202 -0800
+++ 2.6.15-rc3-mm1/kernel/cpuset.c	2005-12-09 05:08:00.834840778 -0800
@@ -812,12 +812,24 @@ static int update_cpumask(struct cpuset 
 }
 
 /*
+ * Handle user request to change the 'mems' memory placement
+ * of a cpuset.  Needs to validate the request, update the
+ * cpusets mems_allowed and mems_generation, and for each
+ * task in the cpuset, rebind any vma mempolicies.
+ *
  * Call with manage_sem held.  May take callback_sem during call.
+ * Will take tasklist_lock, scan tasklist for tasks in cpuset cs,
+ * lock each such tasks mm->mmap_sem, scan its vma's and rebind
+ * their mempolicies to the cpusets new mems_allowed.
  */
 
 static int update_nodemask(struct cpuset *cs, char *buf)
 {
 	struct cpuset trialcs;
+	struct task_struct *g, *p;
+	struct mm_struct **mmarray;
+	int i, n, ntasks;
+	int fudge;
 	int retval;
 
 	trialcs = *cs;
@@ -839,6 +851,80 @@ static int update_nodemask(struct cpuset
 	cs->mems_generation = atomic_read(&cpuset_mems_generation);
 	up(&callback_sem);
 
+	set_cpuset_being_rebound(cs);		/* causes mpol_copy() rebind */
+
+	fudge = 10;				/* spare mmarray[] slots */
+	fudge += cpus_weight(cs->cpus_allowed);	/* imagine one fork-bomb/cpu */
+	retval = -ENOMEM;
+
+	/*
+	 * Allocate mmarray[] to hold mm reference for each task
+	 * in cpuset cs.  Can't kmalloc GFP_KERNEL while holding
+	 * tasklist_lock.  We could use GFP_ATOMIC, but with a
+	 * few more lines of code, we can retry until we get a big
+	 * enough mmarray[] w/o using GFP_ATOMIC.
+	 */
+	while (1) {
+		ntasks = atomic_read(&cs->count);	/* guess */
+		ntasks += fudge;
+		mmarray = kmalloc(ntasks * sizeof(*mmarray), GFP_KERNEL);
+		if (!mmarray)
+			goto done;
+		write_lock_irq(&tasklist_lock);		/* block fork */
+		if (atomic_read(&cs->count) <= ntasks)
+			break;				/* got enough */
+		kfree(mmarray);
+		write_unlock_irq(&tasklist_lock);	/* try again */
+	}
+
+	n = 0;
+
+	/* Load up mmarray[] with mm reference for each task in cpuset. */
+	do_each_thread(g, p) {
+		struct mm_struct *mm;
+
+		if (p->cpuset != cs)
+			continue;
+		mm = get_task_mm(p);
+		if (!mm)
+			continue;
+		if (n >= ntasks) {
+			if (printk_ratelimit()) {
+				printk (KERN_ERR
+					"Cpuset mempolicy rebind failed.\n");
+				BUG();
+			}
+			mmput(mm);
+			continue;
+		}
+		mmarray[n++] = mm;
+	} while_each_thread(g, p);
+	write_unlock_irq(&tasklist_lock);
+
+	/*
+	 * Now that we've dropped the tasklist spinlock, we can
+	 * rebind the vma mempolicies of each mm in mmarray[] to their
+	 * new cpuset, and release that mm.  The mpol_rebind_mm()
+	 * call takes mmap_sem, which we couldn't take while holding
+	 * tasklist_lock.  Forks can happen again now - the mpol_copy()
+	 * cpuset_being_rebound check will catch such forks, and rebind
+	 * their vma mempolicies too.  Because we still hold the global
+	 * cpuset manage_sem, we know that no other rebind effort will
+	 * be contending for the global variable cpuset_being_rebound.
+	 * It's ok if we rebind the same mm twice; mpol_rebind_mm()
+	 * is idempotent.
+	 */
+	for (i = 0; i < n; i++) {
+		struct mm_struct *mm = mmarray[i];
+
+		mpol_rebind_mm(mm, &cs->mems_allowed);
+		mmput(mm);
+	}
+
+	/* We're done rebinding vma's to this cpusets new mems_allowed. */
+	kfree(mmarray);
+	set_cpuset_being_rebound(NULL);
+	retval = 0;
 done:
 	return retval;
 }
@@ -1011,6 +1097,7 @@ static int attach_task(struct cpuset *cs
 	struct cpuset *oldcs;
 	cpumask_t cpus;
 	nodemask_t from, to;
+	struct mm_struct *mm;
 
 	if (sscanf(pidbuf, "%d", &pid) != 1)
 		return -EIO;
@@ -1060,6 +1147,13 @@ static int attach_task(struct cpuset *cs
 	to = cs->mems_allowed;
 
 	up(&callback_sem);
+
+	mm = get_task_mm(tsk);
+	if (mm) {
+		mpol_rebind_mm(mm, &to);
+		mmput(mm);
+	}
+
 	if (is_memory_migrate(cs))
 		do_migrate_pages(tsk->mm, &from, &to, MPOL_MF_MOVE_ALL);
 	put_task_struct(tsk);
--- 2.6.15-rc3-mm1.orig/mm/mempolicy.c	2005-12-09 04:02:42.980847269 -0800
+++ 2.6.15-rc3-mm1/mm/mempolicy.c	2005-12-09 04:02:43.092176676 -0800
@@ -1160,6 +1160,15 @@ struct page *alloc_pages_current(gfp_t g
 }
 EXPORT_SYMBOL(alloc_pages_current);
 
+/*
+ * If mpol_copy() sees current->cpuset == cpuset_being_rebound, then it
+ * rebinds the mempolicy its copying by calling mpol_rebind_policy()
+ * with the mems_allowed returned by cpuset_mems_allowed().  This
+ * keeps mempolicies cpuset relative after its cpuset moves.  See
+ * further kernel/cpuset.c update_nodemask().
+ */
+void *cpuset_being_rebound;
+
 /* Slow path of a mempolicy copy */
 struct mempolicy *__mpol_copy(struct mempolicy *old)
 {
@@ -1167,6 +1176,10 @@ struct mempolicy *__mpol_copy(struct mem
 
 	if (!new)
 		return ERR_PTR(-ENOMEM);
+	if (current_cpuset_is_being_rebound()) {
+		nodemask_t mems = cpuset_mems_allowed(current);
+		mpol_rebind_policy(old, &mems);
+	}
 	*new = *old;
 	atomic_set(&new->refcnt, 1);
 	if (new->policy == MPOL_BIND) {
@@ -1510,6 +1523,22 @@ void mpol_rebind_task(struct task_struct
 }
 
 /*
+ * Rebind each vma in mm to new nodemask.
+ *
+ * Call holding a reference to mm.  Takes mm->mmap_sem during call.
+ */
+
+void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new)
+{
+	struct vm_area_struct *vma;
+
+	down_write(&mm->mmap_sem);
+	for (vma = mm->mmap; vma; vma = vma->vm_next)
+		mpol_rebind_policy(vma->vm_policy, new);
+	up_write(&mm->mmap_sem);
+}
+
+/*
  * Display pages allocated per node and memory policy via /proc.
  */
 
--- 2.6.15-rc3-mm1.orig/include/linux/mempolicy.h	2005-12-09 04:02:42.925182565 -0800
+++ 2.6.15-rc3-mm1/include/linux/mempolicy.h	2005-12-09 04:02:43.095106397 -0800
@@ -150,6 +150,16 @@ extern void numa_policy_init(void);
 extern void mpol_rebind_policy(struct mempolicy *pol, const nodemask_t *new);
 extern void mpol_rebind_task(struct task_struct *tsk,
 					const nodemask_t *new);
+extern void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new);
+#define set_cpuset_being_rebound(x) (cpuset_being_rebound = (x))
+
+#ifdef CONFIG_CPUSETS
+#define current_cpuset_is_being_rebound() \
+				(cpuset_being_rebound == current->cpuset)
+#else
+#define current_cpuset_is_being_rebound() 0
+#endif
+
 extern struct mempolicy default_policy;
 extern struct zonelist *huge_zonelist(struct vm_area_struct *vma,
 		unsigned long addr);
@@ -158,6 +168,8 @@ extern unsigned slab_node(struct mempoli
 int do_migrate_pages(struct mm_struct *mm,
 	const nodemask_t *from_nodes, const nodemask_t *to_nodes, int flags);
 
+extern void *cpuset_being_rebound;	/* Trigger mpol_copy vma rebind */
+
 #else
 
 struct mempolicy {};
@@ -227,6 +239,12 @@ static inline void mpol_rebind_task(stru
 {
 }
 
+static inline void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new)
+{
+}
+
+#define set_cpuset_being_rebound(x) do {} while (0)
+
 static inline struct zonelist *huge_zonelist(struct vm_area_struct *vma,
 		unsigned long addr)
 {

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373


* [PATCH 10/10] Cpuset: migrate all tasks in cpuset at once
  2005-12-10  8:18 [PATCH 00/10] Cpuset: rebind vma's and other refinements Paul Jackson
                   ` (8 preceding siblings ...)
  2005-12-10  8:19 ` [PATCH 09/10] Cpuset: rebind vma mempolicies fix Paul Jackson
@ 2005-12-10  8:19 ` Paul Jackson
  9 siblings, 0 replies; 12+ messages in thread
From: Paul Jackson @ 2005-12-10  8:19 UTC (permalink / raw)
  To: akpm; +Cc: Simon Derr, Paul Jackson, linux-kernel, Christoph Lameter

Given the mechanism in the previous patch to handle rebinding
the per-vma mempolicies of all tasks in a cpuset that changes its
memory placement, it is now easier to handle the page migration
requirements of such tasks at the same time.

The previous code didn't actually attempt to migrate the pages
of the tasks in a cpuset whose memory placement changed until
the next time each such task tried to allocate memory.  This was
undesirable, as users invoking memory page migration expected it
to happen when the placement changed, not some unspecified time
later when the task needed more memory.

It is now trivial to handle the page migration at the same time
as the per-vma rebinding is done.

The routine cpuset.c:update_nodemask(), which handles changing a
cpuset's memory placement ('mems'), now checks for the special case
of being asked to write a placement that is the same as before.
It was harmless enough before to just recompute everything again,
even though nothing had changed.  But page migration is a
heavyweight operation - moving pages about.  So now it is worth
avoiding that if asked to move a cpuset to its current location.
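
The early-out described above can be sketched in userspace like
this.  The struct cpuset_sketch type, update_mems() and the
migrations counter are hypothetical names for illustration, with a
plain unsigned long standing in for a nodemask; this is not the
update_nodemask() code itself.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical cpuset: a node mask and a 'memory_migrate' style flag. */
struct cpuset_sketch {
	unsigned long mems_allowed;
	int migrate;
};

static int migrations;		/* counts the expensive operation */

/* Stand-in for the heavyweight page migration. */
static void migrate_pages_sketch(unsigned long from, unsigned long to)
{
	(void)from;
	(void)to;
	migrations++;
}

/* Returns 0 on success, -ENOSPC if the request leaves no online nodes. */
static int update_mems(struct cpuset_sketch *cs, unsigned long requested,
		       unsigned long online)
{
	unsigned long newmems = requested & online;
	unsigned long oldmems;

	assert(cs);
	oldmems = cs->mems_allowed;
	if (newmems == oldmems)
		return 0;		/* same placement: nothing to do */
	if (newmems == 0)
		return -ENOSPC;
	cs->mems_allowed = newmems;
	if (cs->migrate)
		migrate_pages_sketch(oldmems, newmems);
	return 0;
}
```

The comparison is against the request after masking with the online
nodes, so a write that only differs in offline nodes is also treated
as "no change" in this sketch.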

Signed-off-by: Paul Jackson <pj@sgi.com>

---

 kernel/cpuset.c |   29 ++++++++++++++++-------------
 1 files changed, 16 insertions(+), 13 deletions(-)

--- 2.6.15-rc3-mm1.orig/kernel/cpuset.c	2005-12-09 05:08:00.834840778 -0800
+++ 2.6.15-rc3-mm1/kernel/cpuset.c	2005-12-09 05:09:55.573472784 -0800
@@ -639,25 +639,14 @@ void cpuset_update_task_memory_state()
 	task_unlock(tsk);
 
 	if (my_cpusets_mem_gen != tsk->cpuset_mems_generation) {
-		nodemask_t oldmem = tsk->mems_allowed;
-		int migrate;
-
 		down(&callback_sem);
 		task_lock(tsk);
 		cs = tsk->cpuset;	/* Maybe changed when task not locked */
-		migrate = is_memory_migrate(cs);
 		guarantee_online_mems(cs, &tsk->mems_allowed);
 		tsk->cpuset_mems_generation = cs->mems_generation;
 		task_unlock(tsk);
 		up(&callback_sem);
 		mpol_rebind_task(tsk, &tsk->mems_allowed);
-		if (!nodes_equal(oldmem, tsk->mems_allowed)) {
-			if (migrate) {
-				do_migrate_pages(tsk->mm, &oldmem,
-					&tsk->mems_allowed,
-					MPOL_MF_MOVE_ALL);
-			}
-		}
 	}
 }
 
@@ -815,7 +804,9 @@ static int update_cpumask(struct cpuset 
  * Handle user request to change the 'mems' memory placement
  * of a cpuset.  Needs to validate the request, update the
  * cpusets mems_allowed and mems_generation, and for each
- * task in the cpuset, rebind any vma mempolicies.
+ * task in the cpuset, rebind any vma mempolicies and if
+ * the cpuset is marked 'memory_migrate', migrate the tasks
+ * pages to the new memory.
  *
  * Call with manage_sem held.  May take callback_sem during call.
  * Will take tasklist_lock, scan tasklist for tasks in cpuset cs,
@@ -826,9 +817,11 @@ static int update_cpumask(struct cpuset 
 static int update_nodemask(struct cpuset *cs, char *buf)
 {
 	struct cpuset trialcs;
+	nodemask_t oldmem;
 	struct task_struct *g, *p;
 	struct mm_struct **mmarray;
 	int i, n, ntasks;
+	int migrate;
 	int fudge;
 	int retval;
 
@@ -837,6 +830,11 @@ static int update_nodemask(struct cpuset
 	if (retval < 0)
 		goto done;
 	nodes_and(trialcs.mems_allowed, trialcs.mems_allowed, node_online_map);
+	oldmem = cs->mems_allowed;
+	if (nodes_equal(oldmem, trialcs.mems_allowed)) {
+		retval = 0;		/* Too easy - nothing to do */
+		goto done;
+	}
 	if (nodes_empty(trialcs.mems_allowed)) {
 		retval = -ENOSPC;
 		goto done;
@@ -912,12 +910,17 @@ static int update_nodemask(struct cpuset
 	 * cpuset manage_sem, we know that no other rebind effort will
 	 * be contending for the global variable cpuset_being_rebound.
 	 * It's ok if we rebind the same mm twice; mpol_rebind_mm()
-	 * is idempotent.
+	 * is idempotent.  Also migrate pages in each mm to new nodes.
 	 */
+	migrate = is_memory_migrate(cs);
 	for (i = 0; i < n; i++) {
 		struct mm_struct *mm = mmarray[i];
 
 		mpol_rebind_mm(mm, &cs->mems_allowed);
+		if (migrate) {
+			do_migrate_pages(mm, &oldmem, &cs->mems_allowed,
+							MPOL_MF_MOVE_ALL);
+		}
 		mmput(mm);
 	}
 

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373


* Re: [PATCH 07/10] Cpuset: numa_policy_rebind cleanup
  2005-12-10  8:19 ` [PATCH 07/10] Cpuset: numa_policy_rebind cleanup Paul Jackson
@ 2005-12-10 10:02   ` Paul Jackson
  0 siblings, 0 replies; 12+ messages in thread
From: Paul Jackson @ 2005-12-10 10:02 UTC (permalink / raw)
  To: Andi Kleen; +Cc: akpm, Simon.Derr, linux-kernel, clameter

Andi,

I should have included you on the CC list for this set of ten cpuset
patches, a couple of which also apply to mempolicy.c.  Sorry I missed
you ... here's your very own custom personalized ping ;).

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


Thread overview: 12+ messages
2005-12-10  8:18 [PATCH 00/10] Cpuset: rebind vma's and other refinements Paul Jackson
2005-12-10  8:18 ` [PATCH 01/10] Cpuset: remove marker_pid documentation Paul Jackson
2005-12-10  8:18 ` [PATCH 02/10] Cpuset: minor spacing initializer fixes Paul Jackson
2005-12-10  8:19 ` [PATCH 03/10] Cpuset: update_nodemask code reformat Paul Jackson
2005-12-10  8:19 ` [PATCH 04/10] Cpuset: fork hook fix Paul Jackson
2005-12-10  8:19 ` [PATCH 05/10] Cpuset: combine refresh_mems and update_mems Paul Jackson
2005-12-10  8:19 ` [PATCH 06/10] Cpuset: implement cpuset_mems_allowed Paul Jackson
2005-12-10  8:19 ` [PATCH 07/10] Cpuset: numa_policy_rebind cleanup Paul Jackson
2005-12-10 10:02   ` Paul Jackson
2005-12-10  8:19 ` [PATCH 08/10] Cpuset: number_of_cpusets optimization Paul Jackson
2005-12-10  8:19 ` [PATCH 09/10] Cpuset: rebind vma mempolicies fix Paul Jackson
2005-12-10  8:19 ` [PATCH 10/10] Cpuset: migrate all tasks in cpuset at once Paul Jackson
