Linux-mm Archive on lore.kernel.org
* Re: [PATCH 2.6.17-rc1-mm1 9/9] AutoPage Migration - V0.2 - hook automigration to migrate-on-fault
From: Lee Schermerhorn @ 2006-04-07 20:45 UTC (permalink / raw)
  To: linux-mm
In-Reply-To: <1144441946.5198.52.camel@localhost.localdomain>

AutoPage Migration - V0.2 - 9/9 hook automigration to migrate-on-fault

Add a /sys/kernel/migration control--auto_migrate_lazy--to use 
migrate-on-fault for auto-migration.

Modify migrate_to_node() to just unmap the eligible pages
via migrate_pages_unmap_only() when the MPOL_MF_LAZY flag is set.

This patch depends on the "migrate-on-fault" patch series that
defines the MPOL_MF_LAZY flag and the migrate_pages_unmap_only()
function.
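
The dispatch described above can be sketched in user space as follows.
This is only an illustration under stated assumptions: SKETCH_MF_LAZY
and the sketch_* names are hypothetical stand-ins, and the real value of
MPOL_MF_LAZY comes from the migrate-on-fault series, which is not shown
in this patch.

```c
/* Hypothetical flag bit for this sketch only; the real MPOL_MF_LAZY is
 * defined by the migrate-on-fault series. */
#define SKETCH_MF_LAZY 0x8u

enum sketch_action { SKETCH_UNMAP_ONLY, SKETCH_MIGRATE_NOW };

/* Build the flags the way auto_migrate_task_memory() does in this
 * patch: OR in the lazy bit only when the control is nonzero. */
static unsigned int sketch_build_flags(unsigned int base, int lazy_control)
{
	if (lazy_control)
		base |= SKETCH_MF_LAZY;
	return base;
}

/* Choose what to do with the isolated pages, mirroring the branch
 * added to migrate_to_node(). */
static enum sketch_action sketch_choose_action(unsigned int flags)
{
	if (flags & SKETCH_MF_LAZY)
		return SKETCH_UNMAP_ONLY;	/* pages move later, on fault */
	return SKETCH_MIGRATE_NOW;		/* pages move immediately */
}
```

With the control off, the existing direct-migration path is taken
unchanged, so the new flag only affects behavior when it is set.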

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

Index: linux-2.6.16-mm1/mm/mempolicy.c
===================================================================
--- linux-2.6.16-mm1.orig/mm/mempolicy.c	2006-03-23 16:50:30.000000000 -0500
+++ linux-2.6.16-mm1/mm/mempolicy.c	2006-03-23 16:50:36.000000000 -0500
@@ -635,7 +635,11 @@ int migrate_to_node(struct mm_struct *mm
 			flags | MPOL_MF_DISCONTIG_OK, &pagelist);
 
 	if (!list_empty(&pagelist)) {
-		err = migrate_pages_to(&pagelist, NULL, dest);
+		if (flags & MPOL_MF_LAZY)
+			err = migrate_pages_unmap_only(&pagelist);
+		else
+			err = migrate_pages_to(&pagelist, NULL, dest);
+
 		if (!list_empty(&pagelist))
 			putback_lru_pages(&pagelist);
 	}
@@ -744,6 +748,9 @@ void auto_migrate_task_memory(void)
 	 */
 	BUG_ON(!mm);
 
+	if (auto_migrate_lazy)
+		flags |= MPOL_MF_LAZY;
+
 	/*
 	 * Pass destination node as source node plus 'INVERT flag:
 	 *    Migrate all pages NOT on destination node.
@@ -1000,7 +1007,6 @@ out:
 	return err;
 }
 
-
 /* Retrieve NUMA policy */
 asmlinkage long sys_get_mempolicy(int __user *policy,
 				unsigned long __user *nmask,
Index: linux-2.6.16-mm1/mm/migrate.c
===================================================================
--- linux-2.6.16-mm1.orig/mm/migrate.c	2006-03-23 16:50:30.000000000 -0500
+++ linux-2.6.16-mm1/mm/migrate.c	2006-03-23 16:50:36.000000000 -0500
@@ -129,6 +129,37 @@ static ssize_t migrate_max_mapcount_stor
 }
 MIGRATION_ATTR_RW(migrate_max_mapcount);
 
+/*
+ * auto_migrate_lazy:  use "lazy migration"--i.e., migration-on-fault--
+ * for scheduler driven task memory migration.
+ */
+int auto_migrate_lazy = 0;
+
+static int __init set_auto_migrate_lazy(char *str)
+{
+	get_option(&str, &auto_migrate_lazy);
+	return 1;
+}
+
+__setup("auto_migrate_lazy", set_auto_migrate_lazy);
+
+static ssize_t auto_migrate_lazy_show(struct subsystem *subsys, char *page)
+{
+	return sprintf(page, "auto_migrate_lazy %s\n",
+			auto_migrate_lazy ? "on" : "off");
+}
+static ssize_t auto_migrate_lazy_store(struct subsystem *subsys,
+				      const char *page, size_t count)
+{
+	unsigned long n = simple_strtoul(page, NULL, 10);
+	if (n)
+		auto_migrate_lazy = 1;
+	else
+		auto_migrate_lazy = 0;
+	return count;
+}
+MIGRATION_ATTR_RW(auto_migrate_lazy);
+
 decl_subsys(migration, NULL, NULL);
 EXPORT_SYMBOL(migration_subsys);
 
@@ -136,6 +167,7 @@ static struct attribute *migration_attrs
 	&auto_migrate_enable_attr.attr,
 	&auto_migrate_interval_attr.attr,
 	&migrate_max_mapcount_attr.attr,
+	&auto_migrate_lazy_attr.attr,
 	NULL
 };
 
Index: linux-2.6.16-mm1/include/linux/auto-migrate.h
===================================================================
--- linux-2.6.16-mm1.orig/include/linux/auto-migrate.h	2006-03-23 16:50:30.000000000 -0500
+++ linux-2.6.16-mm1/include/linux/auto-migrate.h	2006-03-23 16:50:36.000000000 -0500
@@ -21,6 +21,7 @@ extern unsigned long auto_migrate_interv
 #define AUTO_MIGRATE_INTERVAL_MAX (300*HZ)
 
 extern unsigned int migrate_max_mapcount;
+extern int auto_migrate_lazy;
 
 #ifdef _LINUX_SCHED_H	/* only used where this is defined */
 static inline void check_internode_migration(task_t *task, int dest_cpu)
@@ -101,6 +102,7 @@ out:
 
 #define check_migrate_pending()		/* NOTHING */
 #define migrate_max_mapcount (1)
+#define auto_migrate_lazy (0)
 
 #endif	/* CONFIG_MIGRATION */
 


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>


* Re: [PATCH 2.6.17-rc1-mm1 8/9] AutoPage Migration - V0.2 - add max mapcount migration threshold
From: Lee Schermerhorn @ 2006-04-07 20:43 UTC (permalink / raw)
  To: linux-mm
In-Reply-To: <1144441946.5198.52.camel@localhost.localdomain>

AutoPage Migration - V0.2 - 8/9 add max mapcount migration threshold

This patch adds an additional migration control that allows one
to vary the page mapcount threshold above which pages will not
be migrated by MPOL_MF_MOVE.  The default value is 1, which yields
the same behavior as before this patch.
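
As a rough user-space sketch of the threshold check and of the clamping
done on store (the sketch_* names are hypothetical; the real code reads
page_mapcount() and the MPOL_MF_MOVE_ALL flag):

```c
/* Hypothetical stand-in for the tunable; default 1 as in the patch. */
static unsigned int sketch_max_mapcount = 1;

/* Mirror of the store handler's clamping: values below 1 become 1. */
static void sketch_set_max_mapcount(unsigned int n)
{
	sketch_max_mapcount = (n < 1) ? 1 : n;
}

/* A page is eligible if the caller asked to move all pages, or if its
 * mapcount does not exceed the threshold. */
static int sketch_page_eligible(int move_all, unsigned int mapcount)
{
	return move_all || mapcount <= sketch_max_mapcount;
}
```

With the default of 1, "mapcount <= 1" selects the same pages as the
old "mapcount == 1" test for any mapped page, which is why the default
preserves the previous behavior.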

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

Index: linux-2.6.16-mm1/include/linux/auto-migrate.h
===================================================================
--- linux-2.6.16-mm1.orig/include/linux/auto-migrate.h	2006-03-23 16:50:24.000000000 -0500
+++ linux-2.6.16-mm1/include/linux/auto-migrate.h	2006-03-23 16:50:30.000000000 -0500
@@ -20,6 +20,8 @@ extern unsigned long auto_migrate_interv
 #define AUTO_MIGRATE_INTERVAL_MIN (5*HZ)
 #define AUTO_MIGRATE_INTERVAL_MAX (300*HZ)
 
+extern unsigned int migrate_max_mapcount;
+
 #ifdef _LINUX_SCHED_H	/* only used where this is defined */
 static inline void check_internode_migration(task_t *task, int dest_cpu)
 {
@@ -98,6 +100,7 @@ out:
 #define too_soon_for_internode_migration(t,c) 0
 
 #define check_migrate_pending()		/* NOTHING */
+#define migrate_max_mapcount (1)
 
 #endif	/* CONFIG_MIGRATION */
 
Index: linux-2.6.16-mm1/mm/migrate.c
===================================================================
--- linux-2.6.16-mm1.orig/mm/migrate.c	2006-03-23 16:50:24.000000000 -0500
+++ linux-2.6.16-mm1/mm/migrate.c	2006-03-23 16:50:30.000000000 -0500
@@ -107,12 +107,35 @@ static ssize_t auto_migrate_interval_sto
 }
 MIGRATION_ATTR_RW(auto_migrate_interval);
 
+/*
+ * migrate_max_mapcount:  specify how many mappers allowed
+ * before we won't migrate a page via MPOL_MF_MOVE.
+ */
+unsigned int migrate_max_mapcount = 1;	/* default == minimum */
+
+static ssize_t migrate_max_mapcount_show(struct subsystem *subsys, char *page)
+{
+	return sprintf(page, "migrate_max_mapcount %u\n", migrate_max_mapcount);
+}
+static ssize_t migrate_max_mapcount_store(struct subsystem *subsys,
+				      const char *page, size_t count)
+{
+	unsigned int n = simple_strtoul(page, NULL, 10);
+	if (n < 1)
+		migrate_max_mapcount = 1;
+	else
+		migrate_max_mapcount = n;
+	return count;
+}
+MIGRATION_ATTR_RW(migrate_max_mapcount);
+
 decl_subsys(migration, NULL, NULL);
 EXPORT_SYMBOL(migration_subsys);
 
 static struct attribute *migration_attrs[] = {
 	&auto_migrate_enable_attr.attr,
 	&auto_migrate_interval_attr.attr,
+	&migrate_max_mapcount_attr.attr,
 	NULL
 };
 
Index: linux-2.6.16-mm1/mm/mempolicy.c
===================================================================
--- linux-2.6.16-mm1.orig/mm/mempolicy.c	2006-03-23 16:49:34.000000000 -0500
+++ linux-2.6.16-mm1/mm/mempolicy.c	2006-03-23 16:50:30.000000000 -0500
@@ -87,6 +87,7 @@
 #include <linux/seq_file.h>
 #include <linux/proc_fs.h>
 #include <linux/migrate.h>
+#include <linux/auto-migrate.h>
 
 #include <asm/tlbflush.h>
 #include <asm/uaccess.h>
@@ -452,7 +453,6 @@ static int contextualize_policy(int mode
 	return mpol_check_policy(mode, nodes);
 }
 
-
 /*
  * Update task->flags PF_MEMPOLICY bit: set iff non-default
  * mempolicy.  Allows more rapid checking of this (combined perhaps
@@ -611,9 +611,10 @@ static void migrate_page_add(struct page
 				unsigned long flags)
 {
 	/*
-	 * Avoid migrating a page that is shared with others.
+	 * Avoid migrating a page that is shared with [too many] others.
 	 */
-	if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(page) == 1)
+	if ((flags & MPOL_MF_MOVE_ALL) ||
+		page_mapcount(page) <= migrate_max_mapcount)
 		isolate_lru_page(page, pagelist);
 }
 



* Re: [PATCH 2.6.17-rc1-mm1 7/9] AutoPage Migration - V0.2 - add hysteresis to internode migration
From: Lee Schermerhorn @ 2006-04-07 20:42 UTC (permalink / raw)
  To: linux-mm
In-Reply-To: <1144441946.5198.52.camel@localhost.localdomain>

AutoPage Migration - V0.2 - 7/9 add hysteresis to internode migration

V0.2:	moved to mm/migrate.c; renamed to "auto_migrate_interval"

This patch adds hysteresis to the internode migration to prevent
page migration thrashing when automatic scheduler driven page migration
is enabled.

Add static inline function "too_soon_for_internode_migration"
[macro => 0 if !CONFIG_MIGRATION] to check for attempts to move
a task to a new node sooner than auto_migrate_interval jiffies
after the previous migration.
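
The check can be modeled in user space roughly as below. This is a
sketch, not the patch's code: the sketch_* names are hypothetical, node
ids are passed in directly instead of being derived from cpus, and the
task->mm / PF_BORROWED_MM test for "user task" is left out.

```c
/* Wraparound-safe "a is before b" for a free-running unsigned tick
 * counter. This is the same idea as the kernel's time_before(): look at
 * the sign of the difference rather than comparing the raw values.
 * (Relies on the usual two's-complement conversion to long.) */
static int sketch_time_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/* Returns 1 if an internode move should be refused for now. */
static int sketch_too_soon(int enabled, int cur_node, int dest_node,
			   int pending, unsigned long now,
			   unsigned long next_migrate)
{
	if (!enabled || cur_node == dest_node)
		return 0;	/* feature off, or not an internode move */

	/* Refuse if memory migration is still pending or the hysteresis
	 * interval has not yet expired. */
	return pending || sketch_time_before(now, next_migrate);
}
```

Comparing the signed difference keeps the result correct when the tick
counter wraps between the previous migration and the current check.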

Modify try_to_wake_up() to leave the task on its current cpu if it is
too soon to move it to a different node.

Modify can_migrate_task() to "just say no!" if the load balancer
proposes an internode migration too soon after previous internode
migration.

Added a control variable--auto_migrate_interval--to /sys/kernel/migration
to query/set the interval.  Provide some fairly arbitrary min, max and
default values.
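
A small sketch of the store-side conversion and clipping, assuming a
tick rate of 250 purely for illustration (HZ is a kernel configuration
value; SKETCH_HZ and the other sketch_* names are hypothetical):

```c
#define SKETCH_HZ		250UL	/* assumed tick rate for the sketch */
#define SKETCH_INTERVAL_MIN	(5 * SKETCH_HZ)
#define SKETCH_INTERVAL_MAX	(300 * SKETCH_HZ)

/* Convert a value written in seconds to ticks and silently clip it to
 * the [min, max] range, as the store handler in this patch does. */
static unsigned long sketch_store_interval(unsigned long seconds)
{
	unsigned long n = seconds * SKETCH_HZ;

	if (n < SKETCH_INTERVAL_MIN)
		return SKETCH_INTERVAL_MIN;
	if (n > SKETCH_INTERVAL_MAX)
		return SKETCH_INTERVAL_MAX;
	return n;
}
```

The control is written and read in seconds, while the variable itself
holds jiffies; the show handler divides by HZ on the way out.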

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

Index: linux-2.6.17-rc1-mm1/include/linux/sched.h
===================================================================
--- linux-2.6.17-rc1-mm1.orig/include/linux/sched.h	2006-04-05 10:15:00.000000000 -0400
+++ linux-2.6.17-rc1-mm1/include/linux/sched.h	2006-04-05 10:16:26.000000000 -0400
@@ -909,6 +909,7 @@ struct task_struct {
   	struct mempolicy *mempolicy;
 	short il_next;
 #ifdef CONFIG_MIGRATION
+	unsigned long next_migrate;	/* internode migration hysteresis */
 	int migrate_pending;		/* internode mem migration pending */
 #endif
 #endif
Index: linux-2.6.17-rc1-mm1/mm/migrate.c
===================================================================
--- linux-2.6.17-rc1-mm1.orig/mm/migrate.c	2006-04-05 10:14:58.000000000 -0400
+++ linux-2.6.17-rc1-mm1/mm/migrate.c	2006-04-05 10:16:26.000000000 -0400
@@ -26,6 +26,7 @@
 #include <linux/cpuset.h>
 #include <linux/swapops.h>
 #include <linux/sysfs.h>
+#include <linux/auto-migrate.h>
 
 #include "internal.h"
 
@@ -73,11 +74,45 @@ static ssize_t auto_migrate_enable_store
 }
 MIGRATION_ATTR_RW(auto_migrate_enable);
 
+/*
+ * auto_migrate_interval:  minimum interval between internode
+ * task migration when auto-migration enabled.
+ * units:  jiffies
+ */
+unsigned long auto_migrate_interval     = AUTO_MIGRATE_INTERVAL_DFLT;
+
+//TODO:  __setup function for boot command option
+
+static ssize_t auto_migrate_interval_show(struct subsystem *subsys,
+					 char *page)
+{
+	return sprintf(page, "auto_migrate_interval %lu\n",
+		 auto_migrate_interval/HZ);
+}
+static ssize_t auto_migrate_interval_store(struct subsystem *subsys,
+				      const char *page, size_t count)
+{
+	unsigned long n = simple_strtoul(page, NULL, 10) * HZ;
+
+	/*
+	 * silently clip to min/max
+	 */
+	if (n < AUTO_MIGRATE_INTERVAL_MIN)
+		auto_migrate_interval = AUTO_MIGRATE_INTERVAL_MIN;
+	else if (n > AUTO_MIGRATE_INTERVAL_MAX)
+		auto_migrate_interval = AUTO_MIGRATE_INTERVAL_MAX;
+	else
+		auto_migrate_interval = n;
+	return count;
+}
+MIGRATION_ATTR_RW(auto_migrate_interval);
+
 decl_subsys(migration, NULL, NULL);
 EXPORT_SYMBOL(migration_subsys);
 
 static struct attribute *migration_attrs[] = {
 	&auto_migrate_enable_attr.attr,
+	&auto_migrate_interval_attr.attr,
 	NULL
 };
 
Index: linux-2.6.17-rc1-mm1/include/linux/auto-migrate.h
===================================================================
--- linux-2.6.17-rc1-mm1.orig/include/linux/auto-migrate.h	2006-04-05 10:15:00.000000000 -0400
+++ linux-2.6.17-rc1-mm1/include/linux/auto-migrate.h	2006-04-05 10:16:26.000000000 -0400
@@ -15,6 +15,11 @@ extern void auto_migrate_task_memory(voi
 
 extern int auto_migrate_enable;
 
+extern unsigned long auto_migrate_interval;    /* seconds <=> jiffies */
+#define AUTO_MIGRATE_INTERVAL_DFLT (30*HZ)
+#define AUTO_MIGRATE_INTERVAL_MIN (5*HZ)
+#define AUTO_MIGRATE_INTERVAL_MAX (300*HZ)
+
 #ifdef _LINUX_SCHED_H	/* only used where this is defined */
 static inline void check_internode_migration(task_t *task, int dest_cpu)
 {
@@ -33,6 +38,25 @@ static inline void check_internode_migra
 	}
 }
 
+/*
+ * To avoid page migration thrashing when auto memory migration is enabled,
+ * check user task for too recent internode migration.
+ */
+static inline int too_soon_for_internode_migration(task_t *task,
+                                                         int this_cpu)
+{
+	if (auto_migrate_enable &&
+		task->mm && !(task->flags & PF_BORROWED_MM) &&
+		cpu_to_node(task_cpu(task)) != cpu_to_node(this_cpu)) {
+
+		if (task->migrate_pending ||
+			time_before(jiffies, task->next_migrate))
+			return 1;
+	}
+
+	return 0;
+}
+
 static inline void check_migrate_pending(void)
 {
 	if (!auto_migrate_enable)
@@ -55,6 +79,7 @@ static inline void check_migrate_pending
 		}
 
 		auto_migrate_task_memory();
+		current->next_migrate = jiffies + auto_migrate_interval;
 
 		if (likely(disable_irqs))
 			local_irq_disable();
@@ -70,6 +95,7 @@ out:
 #else	/* !CONFIG_MIGRATION */
 
 #define check_internode_migration(t,c)	/* NOTHING */
+#define too_soon_for_internode_migration(t,c) 0
 
 #define check_migrate_pending()		/* NOTHING */
 
Index: linux-2.6.17-rc1-mm1/kernel/sched.c
===================================================================
--- linux-2.6.17-rc1-mm1.orig/kernel/sched.c	2006-04-05 10:16:13.000000000 -0400
+++ linux-2.6.17-rc1-mm1/kernel/sched.c	2006-04-05 10:16:26.000000000 -0400
@@ -1378,7 +1378,8 @@ static int try_to_wake_up(task_t *p, uns
 		}
 	}
 
-	if (unlikely(!cpu_isset(this_cpu, p->cpus_allowed)))
+	if (unlikely(!cpu_isset(this_cpu, p->cpus_allowed)
+		|| too_soon_for_internode_migration(p, this_cpu)))
 		goto out_set_cpu;
 
 	/*
@@ -2013,6 +2014,7 @@ int can_migrate_task(task_t *p, runqueue
 	 * 1) running (obviously), or
 	 * 2) cannot be migrated to this CPU due to cpus_allowed, or
 	 * 3) are cache-hot on their current CPU.
+	 * 4) too soon since last internode migration
 	 */
 	if (!cpu_isset(this_cpu, p->cpus_allowed))
 		return 0;
@@ -2021,6 +2023,10 @@ int can_migrate_task(task_t *p, runqueue
 	if (task_running(rq, p))
 		return 0;
 
+// TODO:  should this be under Aggressive migration?
+	if (too_soon_for_internode_migration(p, this_cpu))
+		return 0;
+
 	/*
 	 * Aggressive migration if:
 	 * 1) task is cache cold, or



* Re: [PATCH 2.6.17-rc1-mm1 6/9] AutoPage Migration - V0.2 - hook sched migrate to memory migration
From: Lee Schermerhorn @ 2006-04-07 20:41 UTC (permalink / raw)
  To: linux-mm
In-Reply-To: <1144441946.5198.52.camel@localhost.localdomain>

AutoPage Migration - V0.2 - 6/9 hook sched migrate to memory migration

Add check for internode migration to scheduler -- in most places
where a new cpu is assigned via set_task_cpu().  If MIGRATION is
configured, and auto-migration is enabled [and this is a
user space task], the check will set "migration pending" for the
task if the destination cpu is on a different node from the last
cpu to which the task was assigned.  Migration of affected pages
[those with default policy] will occur when the task returns to
user space.

V0.2:
	only check/notify task of internode migration in migrate_task()
	if not in exec() path.  Walking task address space and unmapping
	pages is probably a waste of time in this case.  Note, however,
	that we won't give the task a chance to pull any resident text
	or library pages local to itself.  If we ever support replication
	or more aggressive migration, we can fix this.

	Thanks to Kamezawa Hiroyuki for pointing out this potential
	optimization.
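
The decision made before each set_task_cpu() can be sketched as below.
The sketch_* names and the "4 cpus per node" topology are hypothetical;
the real code uses cpu_to_node() and tests task->mm and PF_BORROWED_MM
to identify a user task. Note that the "execing" argument exists only
for migrate_task(); the other call sites always perform the check.

```c
/* Hypothetical topology for the sketch: 4 cpus per node. */
static int sketch_cpu_to_node(int cpu)
{
	return cpu / 4;
}

/* Returns 1 if moving the task to dest_cpu should mark it
 * "migration pending". */
static int sketch_should_mark(int enabled, int is_user_task,
			      int cur_cpu, int dest_cpu, int execing)
{
	if (execing)
		return 0;	/* V0.2: skip the check in the exec() path */
	if (!enabled || !is_user_task)
		return 0;
	return sketch_cpu_to_node(cur_cpu) != sketch_cpu_to_node(dest_cpu);
}
```

The check has to run before the cpu field is updated, because it
compares the node of the task's current cpu with that of the
destination cpu.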


Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

Index: linux-2.6.17-rc1-mm1/kernel/sched.c
===================================================================
--- linux-2.6.17-rc1-mm1.orig/kernel/sched.c	2006-04-05 10:14:36.000000000 -0400
+++ linux-2.6.17-rc1-mm1/kernel/sched.c	2006-04-05 10:16:13.000000000 -0400
@@ -52,8 +52,9 @@
 #include <linux/acct.h>
 #include <linux/kprobes.h>
 #include <linux/kgdb.h>
-#include <asm/tlb.h>
+#include <linux/auto-migrate.h>
 
+#include <asm/tlb.h>
 #include <asm/unistd.h>
 
 /*
@@ -1028,7 +1029,8 @@ typedef struct {
  * The task's runqueue lock must be held.
  * Returns true if you have to wait for migration thread.
  */
-static int migrate_task(task_t *p, int dest_cpu, migration_req_t *req)
+static int migrate_task(task_t *p, int dest_cpu, migration_req_t *req,
+			int execing)
 {
 	runqueue_t *rq = task_rq(p);
 
@@ -1037,6 +1039,8 @@ static int migrate_task(task_t *p, int d
 	 * it is sufficient to simply update the task's cpu field.
 	 */
 	if (!p->array && !task_running(rq, p)) {
+		if (!execing)
+			check_internode_migration(p, dest_cpu);
 		set_task_cpu(p, dest_cpu);
 		return 0;
 	}
@@ -1432,6 +1436,7 @@ static int try_to_wake_up(task_t *p, uns
 out_set_cpu:
 	new_cpu = wake_idle(new_cpu, p);
 	if (new_cpu != cpu) {
+		check_internode_migration(p, new_cpu);
 		set_task_cpu(p, new_cpu);
 		task_rq_unlock(rq, &flags);
 		/* might preempt at this point */
@@ -1944,7 +1949,7 @@ static void sched_migrate_task(task_t *p
 		goto out;
 
 	/* force the process onto the specified CPU */
-	if (migrate_task(p, dest_cpu, &req)) {
+	if (migrate_task(p, dest_cpu, &req, 1)) {
 		/* Need to wait for migration thread (might exit: take ref). */
 		struct task_struct *mt = rq->migration_thread;
 		get_task_struct(mt);
@@ -1981,6 +1986,7 @@ void pull_task(runqueue_t *src_rq, prio_
 {
 	dequeue_task(p, src_array);
 	dec_nr_running(p, src_rq);
+	check_internode_migration(p, this_cpu);
 	set_task_cpu(p, this_cpu);
 	inc_nr_running(p, this_rq);
 	enqueue_task(p, this_array);
@@ -4721,7 +4727,7 @@ int set_cpus_allowed(task_t *p, cpumask_
 	if (cpu_isset(task_cpu(p), new_mask))
 		goto out;
 
-	if (migrate_task(p, any_online_cpu(new_mask), &req)) {
+	if (migrate_task(p, any_online_cpu(new_mask), &req, 0)) {
 		/* Need help from migration thread: drop lock and wait. */
 		task_rq_unlock(rq, &flags);
 		wake_up_process(rq->migration_thread);
@@ -4763,6 +4769,7 @@ static void __migrate_task(struct task_s
 	if (!cpu_isset(dest_cpu, p->cpus_allowed))
 		goto out;
 
+	check_internode_migration(p, dest_cpu);
 	set_task_cpu(p, dest_cpu);
 	if (p->array) {
 		/*



* Re: [PATCH 2.6.17-rc1-mm1 5/9] AutoPage Migration - V0.2 - x86_64 check/notify internode migration
From: Lee Schermerhorn @ 2006-04-07 20:40 UTC (permalink / raw)
  To: linux-mm
In-Reply-To: <1144441946.5198.52.camel@localhost.localdomain>

AutoPage Migration - V0.2 - 5/9 x86_64 check/notify internode migration

Hook check for task memory migration for x86_64.

V0.1 -> V0.2:  fix typo in auto-migrate.h include.
		tested on quad-opteron platform

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

Index: linux-2.6.16-mm1/arch/x86_64/kernel/signal.c
===================================================================
--- linux-2.6.16-mm1.orig/arch/x86_64/kernel/signal.c	2006-03-23 11:00:44.000000000 -0500
+++ linux-2.6.16-mm1/arch/x86_64/kernel/signal.c	2006-03-23 16:50:04.000000000 -0500
@@ -24,6 +24,8 @@
 #include <linux/stddef.h>
 #include <linux/personality.h>
 #include <linux/compiler.h>
+#include <linux/auto-migrate.h>
+
 #include <asm/ucontext.h>
 #include <asm/uaccess.h>
 #include <asm/i387.h>
@@ -493,6 +495,12 @@ void do_notify_resume(struct pt_regs *re
 		clear_thread_flag(TIF_SINGLESTEP);
 	}
 
+	/*
+	 * check for task memory migration before delivering
+	 * signals so that handler[s] use memory in new node.
+	 */
+	check_migrate_pending();
+
 	/* deal with pending signal delivery */
 	if (thread_info_flags & _TIF_SIGPENDING)
 		do_signal(regs,oldset);



* Re: [PATCH 2.6.17-rc1-mm1 4/9] AutoPage Migration - V0.2 - ia64 check/notify internode migration
From: Lee Schermerhorn @ 2006-04-07 20:39 UTC (permalink / raw)
  To: linux-mm
In-Reply-To: <1144441946.5198.52.camel@localhost.localdomain>

AutoPage Migration - V0.2 - 4/9 ia64 check/notify internode migration

V0.2 - refresh only

This patch hooks the check for task memory migration pending 
into the ia64 do_notify_resume() function.

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

Index: linux-2.6.16-mm1/arch/ia64/kernel/process.c
===================================================================
--- linux-2.6.16-mm1.orig/arch/ia64/kernel/process.c	2006-03-23 11:00:43.000000000 -0500
+++ linux-2.6.16-mm1/arch/ia64/kernel/process.c	2006-03-23 16:49:58.000000000 -0500
@@ -30,6 +30,7 @@
 #include <linux/efi.h>
 #include <linux/interrupt.h>
 #include <linux/delay.h>
+#include <linux/auto-migrate.h>
 
 #include <asm/cpu.h>
 #include <asm/delay.h>
@@ -172,6 +173,12 @@ do_notify_resume_user (sigset_t *oldset,
 		pfm_handle_work();
 #endif
 
+	/*
+	 * check for task memory migration before delivering
+	 * signals so that handler[s] use memory in new node.
+	 */
+	check_migrate_pending();
+
 	/* deal with pending signal delivery */
 	if (test_thread_flag(TIF_SIGPENDING))
 		ia64_do_signal(oldset, scr, in_syscall);



* Re: [PATCH 2.6.17-rc1-mm1 3/9] AutoPage Migration - V0.2 - generic check/notify internode migration
From: Lee Schermerhorn @ 2006-04-07 20:38 UTC (permalink / raw)
  To: linux-mm
In-Reply-To: <1144441946.5198.52.camel@localhost.localdomain>

AutoPage Migration - V0.2 - 3/9 generic check/notify internode migration

V0.2: renamed migrate_task_memory() to auto_migrate_task_memory().
      renamed auto-migration enable control.

This patch adds the check for internode migration to be called
from scheduler load balancing, and the check for migration pending
to be called when a task returning to user space notices
TIF_NOTIFY_RESUME.

Check for internode migration:  if automatic memory migration
is enabled [auto_migrate_enable != 0] and this is a user task and the
destination cpu is on a different node from the task's current cpu,
the task will be marked for migration pending via a member added to the
task struct.  The TIF_NOTIFY_RESUME thread_info flag is set to cause the
task to enter do_notify_resume[_user]() to check for migration pending.

When a task is rescheduled to user space with TIF_NOTIFY_RESUME set,
it will check for migration pending, unless SIGKILL is pending.
If the task notices migration pending, it will call
auto_migrate_task_memory() to migrate pages in vma's with default
policy.  Only default policy is affected by migration to a new node.
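
The return-to-user-space decision can be summarized by a sketch like
the one below. The function name is hypothetical and the three inputs
stand in for auto_migrate_enable, the SIGKILL test on the pending
signal sets, and current->migrate_pending.

```c
/* Returns 1 if the task should migrate its memory now. In the patch,
 * the pending flag and the thread flag are cleared on every path
 * through check_migrate_pending(), whether or not migration ran. */
static int sketch_should_migrate(int enabled, int sigkill_pending,
				 int migrate_pending)
{
	if (!enabled)
		return 0;
	if (sigkill_pending)
		return 0;	/* task is about to die; skip the work */
	return migrate_pending != 0;
}
```

Because the flags are cleared unconditionally, a task that skips
migration here is not asked again until the scheduler next moves it
across nodes.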

Note that we can't call auto_migrate_task_memory() with interrupts
disabled.  Temporarily enable interrupts around the call.

These checks become empty macros when CONFIG_MIGRATION is not configured.

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

Index: linux-2.6.17-rc1-mm1/include/linux/sched.h
===================================================================
--- linux-2.6.17-rc1-mm1.orig/include/linux/sched.h	2006-04-05 10:14:36.000000000 -0400
+++ linux-2.6.17-rc1-mm1/include/linux/sched.h	2006-04-05 10:15:00.000000000 -0400
@@ -908,6 +908,9 @@ struct task_struct {
 #ifdef CONFIG_NUMA
   	struct mempolicy *mempolicy;
 	short il_next;
+#ifdef CONFIG_MIGRATION
+	int migrate_pending;		/* internode mem migration pending */
+#endif
 #endif
 #ifdef CONFIG_CPUSETS
 	struct cpuset *cpuset;
Index: linux-2.6.17-rc1-mm1/include/linux/auto-migrate.h
===================================================================
--- linux-2.6.17-rc1-mm1.orig/include/linux/auto-migrate.h	2006-04-05 10:14:58.000000000 -0400
+++ linux-2.6.17-rc1-mm1/include/linux/auto-migrate.h	2006-04-05 10:15:00.000000000 -0400
@@ -15,8 +15,64 @@ extern void auto_migrate_task_memory(voi
 
 extern int auto_migrate_enable;
 
+#ifdef _LINUX_SCHED_H	/* only used where this is defined */
+static inline void check_internode_migration(task_t *task, int dest_cpu)
+{
+	if (auto_migrate_enable &&
+		task->mm && !(task->flags & PF_BORROWED_MM)) {
+		int node = cpu_to_node(task_cpu(task));
+		if ((node != cpu_to_node(dest_cpu))) {
+			/*
+			 * migrating a user task to a new node.
+			 * mark for memory migration on return to user space.
+			 */
+			struct thread_info *info = task->thread_info;
+			task->migrate_pending = 1;
+			set_bit(TIF_NOTIFY_RESUME, &info->flags);
+		}
+	}
+}
+
+static inline void check_migrate_pending(void)
+{
+	if (!auto_migrate_enable)
+		goto out;
+
+	/*
+	 * Don't bother with memory migration prep if SIGKILL is pending
+	 */
+	if (test_thread_flag(TIF_SIGPENDING) &&
+		(sigismember(&current->pending.signal, SIGKILL) ||
+		sigismember(&current->signal->shared_pending.signal, SIGKILL)))
+		goto out;
+
+	if (unlikely(current->migrate_pending)) {
+		int disable_irqs = 0;
+
+		if (likely(irqs_disabled())) {
+			disable_irqs = 1;
+			local_irq_enable();
+		}
+
+		auto_migrate_task_memory();
+
+		if (likely(disable_irqs))
+			local_irq_disable();
+	}
+
+out:
+	current->migrate_pending = 0;
+	clear_thread_flag(TIF_NOTIFY_RESUME);
+	return;
+}
+#endif /* _LINUX_SCHED_H */
+
 #else	/* !CONFIG_MIGRATION */
 
+#define check_internode_migration(t,c)	/* NOTHING */
+
+#define check_migrate_pending()		/* NOTHING */
+
 #endif	/* CONFIG_MIGRATION */
 
 #endif



* Re: [PATCH 2.6.17-rc1-mm1 2/9] AutoPage Migration - V0.2 - add auto_migrate_enable sysctl
From: Lee Schermerhorn @ 2006-04-07 20:37 UTC (permalink / raw)
  To: linux-mm
In-Reply-To: <1144441946.5198.52.camel@localhost.localdomain>

AutoPage Migration - V0.2 - 2/9 add auto_migrate_enable sysctl

V0.2:  moved controls to mm/migrate.c
	renamed "sched_migrate_memory" to "auto_migrate_enable"

This patch adds the infrastructure for "migration controls" under
/sys/kernel/migration.  It also adds a single such control--
auto_migrate_enable--to enable/disable automatic, scheduler driven
task memory migration.  It may also be initialized from a boot
command-line option.

Default is disabled!
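
As a user-space approximation of how a written value is interpreted,
using the standard strtoul() in place of the kernel's simple_strtoul()
(sketch_parse_enable is a hypothetical name):

```c
#include <stdlib.h>

/* Any nonzero decimal value enables the control; zero, or text that
 * does not start with a decimal number, disables it. */
static int sketch_parse_enable(const char *buf)
{
	unsigned long n = strtoul(buf, NULL, 10);

	return n ? 1 : 0;
}
```

One consequence of parsing the input as a number is that writing the
string "on" leaves the control disabled, even though the show handler
reports the state as "on" or "off".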

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

Index: linux-2.6.16-mm1/mm/migrate.c
===================================================================
--- linux-2.6.16-mm1.orig/mm/migrate.c	2006-03-23 16:49:16.000000000 -0500
+++ linux-2.6.16-mm1/mm/migrate.c	2006-03-23 16:49:40.000000000 -0500
@@ -25,8 +25,7 @@
 #include <linux/cpu.h>
 #include <linux/cpuset.h>
 #include <linux/swapops.h>
-
-#include "internal.h"
+#include <linux/sysfs.h>
 
 #include "internal.h"
 
@@ -36,6 +35,76 @@
 #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
 
 /*
+ * System Controls for [auto] migration
+ */
+#define MIGRATION_ATTR_RW(_name) \
+static struct subsys_attribute _name##_attr = \
+	__ATTR(_name, 0644, _name##_show, _name##_store)
+
+/*
+ * auto_migrate_enable:  boot option and sysctl to enable/disable
+ * memory migration on inter-node task migration due to scheduler
+ * load balancing or change in cpu affinity.
+ */
+int auto_migrate_enable = 0;
+
+static int __init set_auto_migrate_enable(char *str)
+{
+	get_option(&str, &auto_migrate_enable);
+	return 1;
+}
+
+__setup("auto_migrate_enable", set_auto_migrate_enable);
+
+static ssize_t auto_migrate_enable_show(struct subsystem *subsys, char *page)
+{
+	return sprintf(page, "auto_migrate_enable %s\n",
+			auto_migrate_enable ? "on" : "off");
+}
+static ssize_t auto_migrate_enable_store(struct subsystem *subsys,
+				      const char *page, size_t count)
+{
+	unsigned long n = simple_strtoul(page, NULL, 10);
+	if (n)
+		auto_migrate_enable = 1;
+	else
+		auto_migrate_enable = 0;
+	return count;
+}
+MIGRATION_ATTR_RW(auto_migrate_enable);
+
+decl_subsys(migration, NULL, NULL);
+EXPORT_SYMBOL(migration_subsys);
+
+static struct attribute *migration_attrs[] = {
+	&auto_migrate_enable_attr.attr,
+	NULL
+};
+
+static struct attribute_group migration_attr_group = {
+	.attrs = migration_attrs,
+};
+
+static int __init migration_control_init(void)
+{
+	int error;
+
+	/*
+	 * child of kernel subsys
+	 */
+	kset_set_kset_s(&migration_subsys, kernel_subsys);
+	error = subsystem_register(&migration_subsys);
+	if (!error)
+		error = sysfs_create_group(&migration_subsys.kset.kobj,
+					   &migration_attr_group);
+	return error;
+}
+subsys_initcall(migration_control_init);
+/*
+ * end Migration System Controls
+ */
+
+/*
  * Isolate one page from the LRU lists. If successful put it onto
  * the indicated list with elevated page count.
  *
Index: linux-2.6.16-mm1/include/linux/auto-migrate.h
===================================================================
--- linux-2.6.16-mm1.orig/include/linux/auto-migrate.h	2006-03-23 16:49:34.000000000 -0500
+++ linux-2.6.16-mm1/include/linux/auto-migrate.h	2006-03-23 16:49:40.000000000 -0500
@@ -13,6 +13,8 @@
 
 extern void auto_migrate_task_memory(void);
 
+extern int auto_migrate_enable;
+
 #else	/* !CONFIG_MIGRATION */
 
 #endif	/* CONFIG_MIGRATION */



* Re: [PATCH 2.6.17-rc1-mm1 1/9] AutoPage Migration - V0.2 - migrate task memory with default policy
From: Lee Schermerhorn @ 2006-04-07 20:37 UTC (permalink / raw)
  To: linux-mm
In-Reply-To: <1144441946.5198.52.camel@localhost.localdomain>

AutoPage Migration - V0.2 - 1/9 migrate task memory with default policy

Define mempolicy.c internal flag for auto-migration.  This flag
will select auto-migration specific behavior in the existing 
page migration functions.

Add auto_migrate_task_memory() to mempolicy.c.  This function sets up 
to call migrate_to_node() with internal flags for auto-migration.

Modify vma_migratable(), called from check_range(), to skip VMAs that
don't have default policy when auto-migrating.  To do this,
vma_migratable() needs the MPOL flags.
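
The filtering can be sketched as below. All names and flag values are
hypothetical: the real code tests vm_flags for the excluded mappings,
derives the internal flag from MPOL_MF_INTERNAL, and obtains the
effective policy from get_vma_policy().

```c
enum sketch_policy {
	SKETCH_POL_DEFAULT,
	SKETCH_POL_BIND,
	SKETCH_POL_INTERLEAVE
};

/* Hypothetical bit layout for the sketch: internal flags are shifted
 * up from a base bit, as the MPOL_MF_* internal flags are. */
#define SKETCH_MF_INTERNAL	(1u << 4)
#define SKETCH_MF_AUTOMIGRATE	(SKETCH_MF_INTERNAL << 3)

/* Returns 1 if pages in the vma may be migrated. */
static int sketch_vma_migratable(int vma_excluded, enum sketch_policy pol,
				 unsigned int flags)
{
	if (vma_excluded)
		return 0;	/* e.g. locked, I/O or hugetlb mappings */

	/* When auto-migrating, leave vmas with an explicit policy alone. */
	if ((flags & SKETCH_MF_AUTOMIGRATE) && pol != SKETCH_POL_DEFAULT)
		return 0;

	return 1;
}
```

Without the auto-migration flag the policy is not consulted at all, so
explicit migration requests behave as they did before this patch.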

I had to move get_vma_policy() up in mempolicy.c so that I could reference
it from vma_migratable().  Should I have just added a forward ref?

Subsequent patches will arrange for auto_migrate_task_memory() to be
called when a task returns to user space after the scheduler migrates
it to a cpu on a node different from the node where it last executed.

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

Index: linux-2.6.16-mm1/mm/mempolicy.c
===================================================================
--- linux-2.6.16-mm1.orig/mm/mempolicy.c	2006-03-23 16:49:22.000000000 -0500
+++ linux-2.6.16-mm1/mm/mempolicy.c	2006-03-23 16:49:34.000000000 -0500
@@ -92,9 +92,14 @@
 #include <asm/uaccess.h>
 
 /* Internal flags */
-#define MPOL_MF_DISCONTIG_OK (MPOL_MF_INTERNAL << 0)	/* Skip checks for continuous vmas */
-#define MPOL_MF_INVERT (MPOL_MF_INTERNAL << 1)		/* Invert check for nodemask */
-#define MPOL_MF_STATS (MPOL_MF_INTERNAL << 2)		/* Gather statistics */
+#define MPOL_MF_DISCONTIG_OK \
+	(MPOL_MF_INTERNAL << 0)		/* Skip checks for continuous vmas */
+#define MPOL_MF_INVERT \
+	(MPOL_MF_INTERNAL << 1)		/* Invert check for nodemask */
+#define MPOL_MF_STATS \
+	(MPOL_MF_INTERNAL << 2)		/* Gather statistics */
+#define MPOL_MF_AUTOMIGRATE \
+	(MPOL_MF_INTERNAL << 3)		/* auto-migrating task memory */
 
 static struct kmem_cache *policy_cache;
 static struct kmem_cache *sn_cache;
@@ -110,6 +115,24 @@ struct mempolicy default_policy = {
 	.policy = MPOL_DEFAULT,
 };
 
+/* Return effective policy for a VMA */
+static struct mempolicy * get_vma_policy(struct task_struct *task,
+		struct vm_area_struct *vma, unsigned long addr)
+{
+	struct mempolicy *pol = task->mempolicy;
+
+	if (vma) {
+		if (vma->vm_ops && vma->vm_ops->get_policy)
+			pol = vma->vm_ops->get_policy(vma, addr);
+		else if (vma->vm_policy &&
+				vma->vm_policy->policy != MPOL_DEFAULT)
+			pol = vma->vm_policy;
+	}
+	if (!pol)
+		pol = &default_policy;
+	return pol;
+}
+
 /* Do sanity checking on a policy */
 static int mpol_check_policy(int mode, nodemask_t *nodes)
 {
@@ -309,11 +332,17 @@ static inline int check_pgd_range(struct
 }
 
 /* Check if a vma is migratable */
-static inline int vma_migratable(struct vm_area_struct *vma)
+static inline int vma_migratable(struct vm_area_struct *vma, int flags)
 {
 	if (vma->vm_flags & (
 		VM_LOCKED|VM_IO|VM_HUGETLB|VM_PFNMAP|VM_RESERVED))
 		return 0;
+	if (flags & MPOL_MF_AUTOMIGRATE) {
+		struct mempolicy *pol =
+			get_vma_policy(current, vma, vma->vm_start);
+		if (pol->policy != MPOL_DEFAULT)
+			return 0;
+	}
 	return 1;
 }
 
@@ -350,7 +379,7 @@ check_range(struct mm_struct *mm, unsign
 		if (!is_vm_hugetlb_page(vma) &&
 		    ((flags & MPOL_MF_STRICT) ||
 		     ((flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) &&
-				vma_migratable(vma)))) {
+				vma_migratable(vma, flags)))) {
 			unsigned long endvma = vma->vm_end;
 
 			if (endvma > end)
@@ -695,6 +724,33 @@ int do_migrate_pages(struct mm_struct *m
 
 }
 
+/**
+ * auto_migrate_task_memory()
+ *
+ * Called just before returning to user state when a task has been
+ * migrated to a new node by the scheduler and sched_migrate_memory
+ * is enabled.
+ */
+void auto_migrate_task_memory(void)
+{
+	struct mm_struct *mm = NULL;
+	int dest = cpu_to_node(task_cpu(current));
+	int flags = MPOL_MF_MOVE | MPOL_MF_INVERT | MPOL_MF_AUTOMIGRATE;
+
+	mm = current->mm;
+	/*
+	 * we're returning to user space, so mm must be non-NULL
+	 */
+	BUG_ON(!mm);
+
+	/*
+	 * Pass destination node as source node plus 'INVERT flag:
+	 *    Migrate all pages NOT on destination node.
+	 * 'AUTOMIGRATE flag selects only VMAs with default policy
+	 */
+	migrate_to_node(mm, dest, dest, flags);
+}
+
 #else
 
 static void migrate_page_add(struct page *page, struct list_head *pagelist,
@@ -1049,24 +1105,6 @@ asmlinkage long compat_sys_mbind(compat_
 
 #endif
 
-/* Return effective policy for a VMA */
-static struct mempolicy * get_vma_policy(struct task_struct *task,
-		struct vm_area_struct *vma, unsigned long addr)
-{
-	struct mempolicy *pol = task->mempolicy;
-
-	if (vma) {
-		if (vma->vm_ops && vma->vm_ops->get_policy)
-			pol = vma->vm_ops->get_policy(vma, addr);
-		else if (vma->vm_policy &&
-				vma->vm_policy->policy != MPOL_DEFAULT)
-			pol = vma->vm_policy;
-	}
-	if (!pol)
-		pol = &default_policy;
-	return pol;
-}
-
 /* Return a zonelist representing a mempolicy */
 static struct zonelist *zonelist_policy(gfp_t gfp, struct mempolicy *policy)
 {
Index: linux-2.6.16-mm1/include/linux/auto-migrate.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.16-mm1/include/linux/auto-migrate.h	2006-03-23 16:49:34.000000000 -0500
@@ -0,0 +1,20 @@
+#ifndef _LINUX_AUTO_MIGRATE_H
+#define _LINUX_AUTO_MIGRATE_H
+
+/*
+ * minimal memory migration definitions needed by scheduler,
+ * sysctl, ..., so that they don't need to drag in the entire
+ * migrate.h and all that it depends on.
+ */
+
+#include <linux/config.h>
+
+#ifdef CONFIG_MIGRATION
+
+extern void auto_migrate_task_memory(void);
+
+#else	/* !CONFIG_MIGRATION */
+
+#endif	/* CONFIG_MIGRATION */
+
+#endif



* [PATCH 2.6.17-rc1-mm1 0/9] AutoPage Migration - V0.2 - Overview
From: Lee Schermerhorn @ 2006-04-07 20:32 UTC (permalink / raw)
  To: linux-mm

This is a repost of the auto-migration series against 2.6.17-rc1-mm1.

I will post the rest of the series as responses to this message.

Lee
--------------------------------------------------------------------

AutoPage Migration - V0.2 - 0/9 Overview

V0.2 reworks the patches on 2.6.17-rc1-mm1, including Christoph's 
migration code reorg, moving much of the migration mechanism to 
mm/migrate.c.  Also, some of the individual patches address comments
from Christoph and others on the V0.1 series.

----------------

We have seen some workloads suffer decreases in performance on NUMA
platforms when the Linux scheduler moves the tasks away from their initial
memory footprint.  Some users--e.g., HPC--are motivated by this to go to
great lengths to ensure that tasks start up and stay on specific nodes.
2.6.16+ includes memory migration mechanisms that will allow these users
to move memory along with their tasks--either manually or under control
of a load scheduling program--in response to changing demands on the
resources.

Other users--e.g., "Enterprise" applications--would prefer that the system
just "do the right thing" in this respect.  One possible approach would
be to have the system automatically migrate a task's pages when it decides
to move the task to a different node from where it has executed in the
past.  In order to determine whether this approach would provide any 
benefit, we need working code to measure.

So, ....

This series of patches hooks up Linux 2.6.16+ direct page migration to the
task scheduler.  When load balancing moves a task to a cpu on a different
node from where the task last executed, the task is notified of the change
using the same mechanism that notifies a task of pending signals.  When the
task returns to user state, it attempts to migrate to the new node any pages
that are not already on that node and that belong to those of the task's vm
areas that are under control of default policy.
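
The two halves of this mechanism can be modelled in user space as shown below.
This is only a sketch under stated assumptions: the struct, the model_* names
and the boolean standing in for the thread-info notification flag are all
hypothetical, and the actual page migration is reduced to recording the
destination node.

```c
#include <stdbool.h>

struct model_task {
	int last_node;		/* node the task last ran on */
	bool migrate_pending;	/* stands in for the notify-on-resume flag */
	int pages_moved_to;	/* node the pages were last pulled to, -1 if never */
};

/* Models the enable control described in this overview; off by default. */
static bool model_auto_migrate_enable = false;

/* Scheduler side: called just before the task is given a new cpu. */
static void model_check_internode_migration(struct model_task *t,
					    int dest_node)
{
	if (model_auto_migrate_enable && dest_node != t->last_node)
		t->migrate_pending = true;
	t->last_node = dest_node;
}

/* Task side: called on the way back to user state. */
static void model_check_migrate_pending(struct model_task *t)
{
	if (!t->migrate_pending)
		return;
	t->migrate_pending = false;
	/* stands in for the call that migrates the task's pages */
	t->pages_moved_to = t->last_node;
}
```

The point of the split is that the scheduler path only sets a flag; the
expensive work runs later in the task's own context.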

This behavior is disabled by default, but can be enabled by writing a non-zero
value to /sys/kernel/migration/auto_migrate_enable.  Furthermore, to prevent
thrashing, a second control, auto_migrate_interval, has been implemented.
The load balancer will not move a task to a different node if it has moved
to a new node in the last auto_migrate_interval seconds.  [User interface
is in seconds; internally it's in HZ.]  The idea is to let the task
amortize the cost of the migration by giving it time to benefit from
local references to the migrated pages.
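
A rough sketch of the interval check, assuming a tick counter that may wrap.
The model_* names, the tick rate of 250 and the helper signatures are
illustrative assumptions, not the code in the patch.

```c
#include <stdbool.h>

#define MODEL_HZ 250UL	/* assumed tick rate, for illustration only */

/* The interval is set in seconds but kept internally in ticks. */
static unsigned long model_interval_ticks;

static void model_set_interval_seconds(unsigned long secs)
{
	model_interval_ticks = secs * MODEL_HZ;
}

/*
 * True if the last inter-node migration was less than one interval ago.
 * Unsigned subtraction keeps the comparison correct across counter wrap.
 */
static bool model_too_soon(unsigned long now, unsigned long last_migrate)
{
	return (now - last_migrate) < model_interval_ticks;
}
```

A caller in the load-balancing path would refuse the inter-node move while
model_too_soon() returns true and allow it afterwards.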

The controls, enable/disable and interval, will enable performance testing
of this mechanism to help decide whether it is worth inclusion.  Note: providing
these controls does not presuppose that these will be twiddled by human
administrators/users.  They may be useful to user space workload management
daemons or such...

The Patches:

Patches 01-06 apply to 2.6.17-rc1-mm1 with or without the previously
posted "migrate-on-fault" patches.   Most of my recent testing has
been done with this series layered on the "migrate-on-fault" patches.
So, some fixup may be necessary to apply the series directly to 
2.6.17-rc1-mm1 or beyond.
Patch 07 requires that the migrate-on-fault patches be applied first,
including the mbind/MPOL_MF_LAZY patch.

automigrate-01-prepare-mempolicy-for-automigrate.patch

	This patch adds the function auto_migrate_task_memory() to
	mempolicy.c.  In V0.2, this function sets up a call to
	migrate_to_node() with the appropriate [mempolicy internal]
	flags for auto-migration.  This addresses Christoph's comment
	about code duplication.

	This patch also modifies the vma_migratable() function, called
	from check_range(), to reject VMAs that don't have default
	policy when auto-migrating.

	Note that this mechanism uses non-aggressive migration--i.e.,
	MPOL_MF_MOVE rather than MPOL_MF_MOVE_ALL.  Therefore, it gives
	up rather easily.  E.g., anon pages still shared, copy-on-write,
	between ancestors and descendants will not be migrated.

automigrate-02-add-auto_migrate_enable-sysctl.patch

	This patch adds the infrastructure for the /sys/kernel/migration
	group as well as the auto_migrate_enable control.
	V0.2 of this series adds the control infrastructure to the new
	mm/migrate.c source file.

	TODO:  extract the basic control infrastructure for use by the
	migrate-on-fault series...

automigrate-03.0-check-notify-migrate-pending.patch

	The patch adds a static inline function to
	include/linux/auto-migrate.h for the scheduler to check for
	internode migration and notify the task [by setting the
	TIF_NOTIFY_RESUME thread info flag], if the task is migrating
	to a new node and auto-migration is enabled.

	The header also includes the function check_migrate_pending()
	that the task will call when returning to user state when it notices
	TIF_NOTIFY_RESUME set.  Both of these functions become a null macro
	when MIGRATION is not configured.

automigrate-03.1-ia64-check-notify-migrate-pending.patch

	This patch adds the call to check_migrate_pending() to the
	ia64 specific do_notify_resume_user() function.  Note that this
	is the same mechanism used to deliver signals and perfmon events
	to a task.  I have tested this patch on a 4-node, 16-cpu ia64 
	platform.

automigrate-03.2-x86_64-check-notify-migrate-pending.patch

	This patch adds the call to check_migrate_pending() to the x86_64
	specific do_notify_resume() function.  This is just an example
	for an arch other than ia64.  I have tested automigrate on a
	4-socket/dual-core Opteron platform.

	V0.2:  fixed auto-migrate.h header include

automigrate-04-hook-sched-internode-migration.patch

	This patch hooks the calls to check_internode_migration() into
	the scheduler [kernel/sched.c] in places where the scheduler
	sets a new cpu for the task--i.e., just before calls to
	set_task_cpu().  Because these are in migration paths, that are
	already relatively "heavy-weight", they don't add overhead to
	scheduler fast paths.  And, they become empty or constant
	macros when MIGRATION is not configured in.

	V0.2:  don't check/notify task of internode migration in 
	migrate_task() when migrating in exec() path.  Pointed out
	by Kamezawa Hiroyuki.

automigrate-05-add-internode-migration-hysteresis.patch

	This patch adds the auto_migrate_interval control to the
	/sys/kernel/migration group, and adds a function to the
	auto-migrate.h header--too_soon_for_internode_migration()--to
	check whether it's too soon for another internode migration.
	This function becomes a macro that evaluates to "false" [0],
	when MIGRATION is not configured.

	This check is added to try_to_wake_up() and can_migrate_task() to
	override internode migrations if the last one was less than
	auto_migrate_interval seconds [HZ] ago.

automigrate-06-max-mapcount-control.patch

	This patch adds an additional control:  migrate_max_mapcount.
	mempolicy.c:migrate_page_add() has been modified to allow
	pages with a mapcount <= this value to be migrated. The
	default of 1 results in the same behavior as without this
	patch.  Use of this patch will allow experimentation and
	measurement of the effect of different mapcount thresholds
	on workload performance.

automigrate-07-hook-to-migrate-on-fault.patch

	This patch, which requires the migrate-on-fault capability,
	hooks automigration up to migrate-on-fault, with an additional
	control--/sys/kernel/migration/auto_migrate_lazy--to enable
	it.
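
To illustrate the migrate_max_mapcount control from patch 06 above: the sketch
below counts how many pages in a made-up sample would pass the threshold.  The
function name and the sample data are hypothetical; only the "mapcount <=
threshold" rule comes from the description.

```c
#include <stdbool.h>
#include <stddef.h>

/* A page is eligible when it is mapped by at most max_mapcount page tables. */
static bool model_page_eligible(int mapcount, int max_mapcount)
{
	return mapcount <= max_mapcount;
}

/* Count the eligible pages in an array of per-page mapcounts. */
static size_t model_count_eligible(const int *mapcounts, size_t n,
				   int max_mapcount)
{
	size_t i, eligible = 0;

	for (i = 0; i < n; i++)
		if (model_page_eligible(mapcounts[i], max_mapcount))
			eligible++;
	return eligible;
}
```

With a threshold of 1 only privately mapped pages pass, which matches the
default behavior described above; raising it admits shared pages as well.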

TESTING:

I have tested this patch on a 16-cpu/4-node/32GB HP rx8620 [ia64] platform
and a 4 socket/dual-core/8GB HP Proliant dl585 Opteron platform with
everyone's favorite benchmark [kernel builds].   Patch seems stable.
Performance results for Opteron reported below.

I have also tested on ia64 with the McAlpin Streams benchmark.  These
results were reported previously:

http://marc.theaimsgroup.com/?l=linux-mm&m=114237540231833&w=4

Kernel builds [after make mrproper+make defconfig]
on 2.6.16-mm2 on dl585.  Times are avg of 10 runs.
Entire kernel source likely held in page cache.

No auto-migrate patches:

	40.69 real  226.40 user  41.77 system

With auto-migration patches, auto_migrate disabled:

	40.52 real  227.21 user  42.19 system

With auto-migration patches, auto_migrate enabled,
direct [!lazy]:

	40.90 real  227.10 user  42.45 system

With patch; auto-migration + lazy enabled:

	41.43 real  228.74 user  43.97 system

As mentioned in the previous posting of this series, the compiler
doesn't run long enough to amortize the cost of migrating the
pages.  But see the McAlpin Streams results linked above.
Also, the defconfig runs on x86_64 don't run all that long, 
anyway.  So, I tried allmodconfig builds.  The results are,
uh, interesting.  These are representative results from half
a dozen runs each.

no auto-migration patches:

	290 real  1740 user  344 system

	one run @ 316 real:  +26sec from typical

with patches; auto-migration disabled:

	287 real  1738 user  346 system

	basically the same as w/o patches.
	real and user slightly lower, system slightly higher.  

with patches;  auto-migration+lazy enabled:
	
	310 real  1800 user  386 system

	user and system times fairly consistent.
	did see 2 runs with real time +27sec from the typical runs,
	as I did with no patches.  System is running multiuser, so
	some daemon may jump in occasionally.

	In these runs, the cost of migrating pages really starts to
	impact the runtime.  Note that, on an Opteron, every
	inter-[phys]cpu task migration is an inter-node migration.
	I see LOTS more internode migrations and resulting triggering
	of page migrations in a kernel build on the Opteron platform
	than on the 16-cpu, 4-node ia64 platform--not that this is at
	all surprising.  E.g., from instrumented runs:


                               ia64        Opteron
inter-node task migrations     2109           4058
pages unmapped for migration   9898         163627
anon migration faults          3208          62518
attempt migrate misplaced page 3007          44973
actually migrate misplaced pg  3007          44968
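
For a sense of scale, the counters above work out to roughly 5 pages unmapped
per inter-node task migration on ia64 versus roughly 40 on the Opteron.  The
helper below just does that division; the struct and function names are
hypothetical, and the numbers are copied from the table.

```c
struct model_counts {
	double task_migrations;	/* inter-node task migrations */
	double pages_unmapped;	/* pages unmapped for migration */
	double pages_migrated;	/* misplaced pages actually migrated */
};

static const struct model_counts model_ia64    = { 2109.0,   9898.0,  3007.0 };
static const struct model_counts model_opteron = { 4058.0, 163627.0, 44968.0 };

/* Average number of pages unmapped each time a task changes node. */
static double model_unmapped_per_task_migration(const struct model_counts *c)
{
	return c->pages_unmapped / c->task_migrations;
}

/* Fraction of the unmapped pages that were later migrated on fault. */
static double model_migrated_fraction(const struct model_counts *c)
{
	return c->pages_migrated / c->pages_unmapped;
}
```

On both platforms fewer than a third of the unmapped pages end up being
migrated on fault in these runs.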




* Re: [PATCH 2.6.17-rc1-mm1 6/6] Migrate-on-fault - add MPOL_NOOP
From: Lee Schermerhorn @ 2006-04-07 20:27 UTC (permalink / raw)
  To: linux-mm
In-Reply-To: <1144441108.5198.36.camel@localhost.localdomain>

Migrate-on-fault prototype 6/6 V0.2 - add MPOL_NOOP

V0.2 -	this patch is new in the V0.2 series.  No change between
	2.6.16-mm1 and 2.6.17-rc1-mm1

This patch augments the MPOL_MF_LAZY feature by adding a "NOOP"
policy to mbind().  When the NOOP policy is used with the 'MOVE
and 'LAZY flags, mbind() [check_range()] will walk the specified
range and unmap eligible pages so that they will be migrated on
next touch.
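
The resulting decision in do_mbind() can be summarized with the small model
below.  It mirrors the two conditions in the diff that follows, but it is a
stand-alone sketch: the model_* names and the enum are hypothetical, and only
the constant values are taken from the patches in this series.

```c
#include <stdbool.h>

#define MODEL_MPOL_NOOP	4		/* value added by this patch */
#define MODEL_MF_MOVE	(1u << 1)
#define MODEL_MF_LAZY	(1u << 3)	/* value from the MPOL_MF_LAZY patch */

enum model_action {
	MODEL_MIGRATE_NOW,	/* direct migration of the collected pages */
	MODEL_UNMAP_ONLY,	/* unmap now, migrate on next touch */
};

/* NOOP leaves the range's existing policy alone. */
static bool model_calls_mbind_range(int mode)
{
	return mode != MODEL_MPOL_NOOP;
}

/* What happens to a non-empty pagelist gathered by check_range(). */
static enum model_action model_pagelist_action(int mode, unsigned int flags)
{
	if (mode != MODEL_MPOL_NOOP && !(flags & MODEL_MF_LAZY))
		return MODEL_MIGRATE_NOW;
	return MODEL_UNMAP_ONLY;
}
```

Note that, as the conditions are written, NOOP with a move flag takes the
unmap-only path even if the lazy flag is not set.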

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

Index: linux-2.6.16-mm1/include/linux/mempolicy.h
===================================================================
--- linux-2.6.16-mm1.orig/include/linux/mempolicy.h	2006-03-23 16:49:16.000000000 -0500
+++ linux-2.6.16-mm1/include/linux/mempolicy.h	2006-03-23 16:49:22.000000000 -0500
@@ -13,8 +13,9 @@
 #define MPOL_PREFERRED	1
 #define MPOL_BIND	2
 #define MPOL_INTERLEAVE	3
+#define MPOL_NOOP	4	/* retain existing policy for range */
 
-#define MPOL_MAX MPOL_INTERLEAVE
+#define MPOL_MAX MPOL_NOOP
 
 /* Flags for get_mem_policy */
 #define MPOL_F_NODE	(1<<0)	/* return next IL mode instead of node mask */
Index: linux-2.6.16-mm1/mm/mempolicy.c
===================================================================
--- linux-2.6.16-mm1.orig/mm/mempolicy.c	2006-03-23 16:49:16.000000000 -0500
+++ linux-2.6.16-mm1/mm/mempolicy.c	2006-03-23 16:49:22.000000000 -0500
@@ -117,6 +117,7 @@ static int mpol_check_policy(int mode, n
 
 	switch (mode) {
 	case MPOL_DEFAULT:
+	case MPOL_NOOP:
 		if (!empty)
 			return -EINVAL;
 		break;
@@ -163,7 +164,7 @@ static struct mempolicy *mpol_new(int mo
 	struct mempolicy *policy;
 
 	PDprintk("setting mode %d nodes[0] %lx\n", mode, nodes_addr(*nodes)[0]);
-	if (mode == MPOL_DEFAULT)
+	if (mode == MPOL_DEFAULT || mode == MPOL_NOOP)
 		return NULL;
 	policy = kmem_cache_alloc(policy_cache, GFP_KERNEL);
 	if (!policy)
@@ -726,7 +727,7 @@ long do_mbind(unsigned long start, unsig
 	if (start & ~PAGE_MASK)
 		return -EINVAL;
 
-	if (mode == MPOL_DEFAULT)
+	if (mode == MPOL_DEFAULT || mode == MPOL_NOOP)
 		flags &= ~MPOL_MF_STRICT;
 
 	len = (len + PAGE_SIZE - 1) & PAGE_MASK;
@@ -762,10 +763,13 @@ long do_mbind(unsigned long start, unsig
 	if (!IS_ERR(vma)) {
 		int nr_failed = 0;
 
-		err = mbind_range(vma, start, end, new);
+		if (mode == MPOL_NOOP)
+			err = 0;
+		else
+			err = mbind_range(vma, start, end, new);
 
 		if (!list_empty(&pagelist)) {
-			if (!(flags & MPOL_MF_LAZY))
+			if (mode != MPOL_NOOP && !(flags & MPOL_MF_LAZY))
 				nr_failed = migrate_pages_to(&pagelist,
 								 vma, -1);
 			else



* Re: [PATCH 2.6.17-rc1-mm1 5/6] Migrate-on-fault - add MPOL_MF_LAZY
From: Lee Schermerhorn @ 2006-04-07 20:26 UTC (permalink / raw)
  To: linux-mm
In-Reply-To: <1144441108.5198.36.camel@localhost.localdomain>

Migrate-on-fault prototype 5/6 V0.2 - add MPOL_MF_LAZY

V0.2 - reworked against 2.6.17-rc1 with Christoph's migration code
       reorg.  Moved migrate_pages_unmap_only() to mm/migrate.c

This patch adds another mbind() flag to request "lazy migration".
The flag, MPOL_MF_LAZY, modifies MPOL_MF_MOVE* such that the selected
pages are simply unmapped from the calling task's page table ['_MOVE]
or from all referencing page tables [_MOVE_ALL].  Anon pages will first
be added to the swap [or migration?] cache, if necessary.  The pages
will be migrated in the fault path on "first touch", if the policy
dictates at that time.
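
The argument checks that the new flag passes through in do_mbind() can be
sketched as below.  The flag values match the header change in this patch; the
function name, the boolean standing in for capable(CAP_SYS_NICE), and the
fixed maximum mode of 3 are assumptions made for the example.

```c
#include <errno.h>
#include <stdbool.h>

#define MODEL_MF_STRICT		(1u << 0)
#define MODEL_MF_MOVE		(1u << 1)
#define MODEL_MF_MOVE_ALL	(1u << 2)
#define MODEL_MF_LAZY		(1u << 3)
#define MODEL_MF_VALID \
	(MODEL_MF_STRICT | MODEL_MF_MOVE | MODEL_MF_MOVE_ALL | MODEL_MF_LAZY)

#define MODEL_MPOL_MAX		3	/* highest mode before MPOL_NOOP exists */

/* Returns 0 if the mode/flags pair would be accepted, else a -errno value. */
static int model_check_mbind_args(int mode, unsigned int flags,
				  bool has_cap_sys_nice)
{
	if ((flags & ~MODEL_MF_VALID) || mode > MODEL_MPOL_MAX)
		return -EINVAL;
	if ((flags & MODEL_MF_MOVE_ALL) && !has_cap_sys_nice)
		return -EPERM;
	return 0;
}
```

Keeping the accepted bits in one mask means a later flag only has to be added
in a single place.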

"Lazy Migration" will allow testing of migrate-on-fault.  If useful to
applications, it could become a permanent part of the mbind() interface. 
Yes, it does duplicate some of the code in migrate_pages().  However,
lazy migration doesn't need to do all that migrate_pages() does, nor
does it need to try as hard.  Trying to weave both functions into
migrate_pages() could probably be done, but that could result in fairly
ugly code. 

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

Index: linux-2.6.17-rc1/include/linux/mempolicy.h
===================================================================
--- linux-2.6.17-rc1.orig/include/linux/mempolicy.h	2006-04-03 12:10:45.000000000 -0400
+++ linux-2.6.17-rc1/include/linux/mempolicy.h	2006-04-03 12:12:30.000000000 -0400
@@ -22,9 +22,14 @@
 
 /* Flags for mbind */
 #define MPOL_MF_STRICT	(1<<0)	/* Verify existing pages in the mapping */
-#define MPOL_MF_MOVE	(1<<1)	/* Move pages owned by this process to conform to mapping */
-#define MPOL_MF_MOVE_ALL (1<<2)	/* Move every page to conform to mapping */
-#define MPOL_MF_INTERNAL (1<<3)	/* Internal flags start here */
+#define MPOL_MF_MOVE	(1<<1)	/* Move pages owned by this process to conform
+				   to policy */
+#define MPOL_MF_MOVE_ALL (1<<2)	/* Move every page to conform to policy */
+#define MPOL_MF_LAZY	(1<<3)	/* Modifies '_MOVE:  lazy migrate on fault */
+#define MPOL_MF_INTERNAL (1<<4)	/* Internal flags start here */
+
+#define MPOL_MF_VALID \
+	(MPOL_MF_STRICT | MPOL_MF_MOVE | MPOL_MF_MOVE_ALL | MPOL_MF_LAZY)
 
 #ifdef __KERNEL__
 
@@ -180,7 +185,7 @@ int do_migrate_pages(struct mm_struct *m
  */
 #define MPOL_MIGRATE_NONINTERLEAVED 1
 #define MPOL_MIGRATE_INTERLEAVED 2
-#define misplaced_is_interleaved(pol) (MPOL_MIGRATE_INTERLEAVED - 1)
+#define misplaced_is_interleaved(pol) (pol == MPOL_MIGRATE_INTERLEAVED)
 
 int mpol_misplaced(struct page *, struct vm_area_struct *,
 		unsigned long, int *);
Index: linux-2.6.17-rc1/mm/mempolicy.c
===================================================================
--- linux-2.6.17-rc1.orig/mm/mempolicy.c	2006-04-03 12:10:45.000000000 -0400
+++ linux-2.6.17-rc1/mm/mempolicy.c	2006-04-03 12:12:30.000000000 -0400
@@ -718,9 +718,7 @@ long do_mbind(unsigned long start, unsig
 	int err;
 	LIST_HEAD(pagelist);
 
-	if ((flags & ~(unsigned long)(MPOL_MF_STRICT |
-				      MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
-	    || mode > MPOL_MAX)
+	if ((flags & ~(unsigned long)MPOL_MF_VALID) || mode > MPOL_MAX)
 		return -EINVAL;
 	if ((flags & MPOL_MF_MOVE_ALL) && !capable(CAP_SYS_NICE))
 		return -EPERM;
@@ -766,8 +764,13 @@ long do_mbind(unsigned long start, unsig
 
 		err = mbind_range(vma, start, end, new);
 
-		if (!list_empty(&pagelist))
-			nr_failed = migrate_pages_to(&pagelist, vma, -1);
+		if (!list_empty(&pagelist)) {
+			if (!(flags & MPOL_MF_LAZY))
+				nr_failed = migrate_pages_to(&pagelist,
+								 vma, -1);
+			else
+				nr_failed = migrate_pages_unmap_only(&pagelist);
+		}
 
 		if (!err && nr_failed && (flags & MPOL_MF_STRICT))
 			err = -EIO;
Index: linux-2.6.17-rc1/include/linux/migrate.h
===================================================================
--- linux-2.6.17-rc1.orig/include/linux/migrate.h	2006-04-03 12:10:45.000000000 -0400
+++ linux-2.6.17-rc1/include/linux/migrate.h	2006-04-03 12:12:30.000000000 -0400
@@ -17,6 +17,7 @@ extern int migrate_pages(struct list_hea
 extern int migrate_pages_to(struct list_head *pagelist,
 			struct vm_area_struct *vma, int dest);
 struct page *migrate_misplaced_page(struct page *, int, int);
+extern int migrate_pages_unmap_only(struct list_head *);
 extern int fail_migrate_page(struct page *, struct page *, int);
 
 extern int migrate_prep(void);
Index: linux-2.6.17-rc1/mm/migrate.c
===================================================================
--- linux-2.6.17-rc1.orig/mm/migrate.c	2006-04-03 12:10:45.000000000 -0400
+++ linux-2.6.17-rc1/mm/migrate.c	2006-04-03 12:12:30.000000000 -0400
@@ -567,6 +567,66 @@ next:
 
 	return nr_failed + retry;
 }
+/*
+ * Lazy migration:  just unmap pages, moving anon pages to swap cache, if
+ * necessary.  Migration will occur, if policy dictates, when a task faults
+ * an unmapped page back into its page table--i.e., on "first touch" after
+ * unmapping.
+ *
+ * Successfully unmapped pages will be put back on the LRU.  Failed pages
+ * will be left on the argument pagelist for the caller to handle, like
+ * migrate_pages[_to]().
+ */
+int migrate_pages_unmap_only(struct list_head *pagelist)
+{
+	struct page *page;
+	struct page *page2;
+	int nr_failed = 0, nr_unmapped = 0;
+
+	list_for_each_entry_safe(page, page2, pagelist, lru) {
+		int nr_refs;
+
+		/*
+		 * Give up easily.  We are being lazy.
+		 */
+		if (page_count(page) == 1 || TestSetPageLocked(page))
+			continue;
+
+		if (PageWriteback(page))
+			goto unlock_page;
+
+		if (PageAnon(page) && !PageSwapCache(page)) {
+			if (!add_to_swap(page, GFP_KERNEL)) {
+				goto unlock_page;
+			}
+		}
+
+		if (page_has_buffers(page))
+			nr_refs = 3;	/* cache, bufs and current */
+		else
+			nr_refs = 2;	/* cache and current */
+
+		if (migrate_page_try_to_unmap(page, nr_refs)) {
+			++nr_failed;
+			goto unlock_page;
+		}
+
+		++nr_unmapped;
+		move_to_lru(page);
+
+	unlock_page:
+		unlock_page(page);
+
+	}
+
+	/*
+	 * so fault path can find them on lru
+	 */
+	if (nr_unmapped)
+		lru_add_drain_all();
+
+	return nr_failed;
+}
 
 /*
  * Migration function for pages with buffers. This function can only be used



* Re: [PATCH 2.6.17-rc1-mm1 4/6] Migrate-on-fault - handle misplaced anon pages
From: Lee Schermerhorn @ 2006-04-07 20:24 UTC (permalink / raw)
  To: linux-mm
In-Reply-To: <1144441108.5198.36.camel@localhost.localdomain>

Migrate-on-fault prototype 4/6 V0.2 - handle misplaced anon pages

V0.2 -- refreshed against 2.6.16-mm2 [no changes for 2.6.17-rc1-mm1]

This patch simply hooks the anon page fault handler [do_swap_page()]
to check for and migrate misplaced pages.

File and shmem fault paths will be addressed in separate patches.

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

Index: linux-2.6.16-mm2/mm/memory.c
===================================================================
--- linux-2.6.16-mm2.orig/mm/memory.c	2006-03-28 12:00:46.000000000 -0500
+++ linux-2.6.16-mm2/mm/memory.c	2006-03-28 12:01:07.000000000 -0500
@@ -48,6 +48,7 @@
 #include <linux/rmap.h>
 #include <linux/module.h>
 #include <linux/init.h>
+#include <linux/mempolicy.h>	/* check_migrate_misplaced_page() */
 
 #include <asm/pgalloc.h>
 #include <asm/uaccess.h>
@@ -1924,6 +1925,8 @@ again:
 
 	/* The page isn't present yet, go ahead with the fault. */
 
+	page = check_migrate_misplaced_page(page, vma, address);
+
 	inc_mm_counter(mm, anon_rss);
 	pte = mk_pte(page, vma->vm_page_prot);
 	if (write_access && can_share_swap_page(page)) {



* Re: [PATCH 2.6.17-rc1-mm1 3/6] Migrate-on-fault - migrate misplaced page
From: Lee Schermerhorn @ 2006-04-07 20:23 UTC (permalink / raw)
  To: linux-mm
In-Reply-To: <1144441108.5198.36.camel@localhost.localdomain>

Migrate-on-fault prototype 3/6 V0.2 - migrate misplaced page

V0.2 - reworked against 2.6.17-rc1-mm1 with Christoph's migration
       code reorg.

This patch adds a new function migrate_misplaced_page() to mm/migrate.c
[where most of the other page migration functions live] to migrate a
misplaced page to a specified destination node.  This function will be
called from the fault path.  Because we already know the destination
node for the migration, we allocate pages directly rather than rerunning
the policy node computation in alloc_page_vma().

migrate_misplaced_page() will need to put a single page [the old or
new page] back to the lru, so this patch also splits out a
"putback_lru_page()" function from move_to_lru().  This avoids having
to insert the page on a dummy list just to have move_to_lru() delete
it from the list.

The patch also updates the address space migratepage operations to
skip the attempt to unmap the page, if the operation is being called
in the fault path to migrate a misplaced page.  To accomplish this, I
added an additional boolean [int] argument "faulting" to the migratepage
op functions.   This argument also adjusts the # of expected page
references because we have an extra count when called in the fault
path.

The migratepage operations now use the migrate_page_try_to_unmap()
and migrate_page_replace_in_mapping() functions separated out in a
previous patch.

I believe that we can now delete migrate_page_remove_references().
But, I haven't, yet.

Finally, the patch adds the static inline function 
check_migrate_misplaced_page() to mempolicy.h to check whether a
page has no mappings [no pte references] and is "misplaced"--i.e.
on a node different from what the policy for (vma, address) dictates.
In this case, the page will be migrated to the "correct" node, if
possible.  If migration fails for any reason, we just use the
original page.
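
The check-and-fall-back behavior can be modelled as follows.  This is a
user-space sketch with hypothetical model_* names; the policy lookup is
reduced to a node id passed in by the caller, and the allocation of the
replacement page is represented by an optional spare page.

```c
#include <stdbool.h>
#include <stddef.h>

struct model_page {
	int nid;		/* node the page currently lives on */
	int mapcount;		/* number of pte references */
	bool writeback;
};

/*
 * Stand-in for the migration step: on success the spare page becomes the
 * page in use; with no spare [allocation failure] the original is kept.
 */
static struct model_page *model_migrate(struct model_page *page,
					struct model_page *spare, int dest_nid)
{
	if (!spare)
		return page;
	spare->nid = dest_nid;
	spare->mapcount = 0;
	spare->writeback = false;
	return spare;
}

/* Returns the page the fault handler should map. */
static struct model_page *model_check_migrate_misplaced(struct model_page *page,
		int policy_nid, struct model_page *spare)
{
	if (page->mapcount || page->writeback)
		return page;		/* still mapped or busy: leave it */
	if (page->nid == policy_nid)
		return page;		/* already where policy wants it */
	return model_migrate(page, spare, policy_nid);
}
```

Because every failure path returns the original page, the caller can use the
result unconditionally.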

Note that when NUMA or MIGRATION is not configured, the
check_migrate_misplaced_page() function becomes a macro that
evaluates to its page argument.

Subsequent patches will hook the fault handlers [anon, file, shmem]
to check_migrate_misplaced_page().

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

Index: linux-2.6.17-rc1-mm1/include/linux/mempolicy.h
===================================================================
--- linux-2.6.17-rc1-mm1.orig/include/linux/mempolicy.h	2006-04-05 10:14:39.000000000 -0400
+++ linux-2.6.17-rc1-mm1/include/linux/mempolicy.h	2006-04-05 10:14:41.000000000 -0400
@@ -34,6 +34,7 @@
 #include <linux/rbtree.h>
 #include <linux/spinlock.h>
 #include <linux/nodemask.h>
+#include <linux/migrate.h>
 
 struct vm_area_struct;
 
@@ -184,6 +185,31 @@ int do_migrate_pages(struct mm_struct *m
 int mpol_misplaced(struct page *, struct vm_area_struct *,
 		unsigned long, int *);
 
+#if defined(CONFIG_MIGRATION) && defined(_LINUX_MM_H)
+/*
+ * called in fault path, where _LINUX_MM_H will be defined.
+ * page is uptodate and locked.
+ */
+static inline struct page *check_migrate_misplaced_page(struct page *page,
+		struct vm_area_struct *vma, unsigned long address)
+{
+	int polnid, misplaced;
+
+	if (page_mapcount(page) || PageWriteback(page))
+		return page;
+
+	misplaced = mpol_misplaced(page, vma, address, &polnid);
+	if (!misplaced)
+		return page;
+
+	return migrate_misplaced_page(page, polnid,
+			misplaced_is_interleaved(misplaced));
+
+}
+#else
+#define check_migrate_misplaced_page(page, vma, address) (page)
+#endif
+
 extern void *cpuset_being_rebound;	/* Trigger mpol_copy vma rebind */
 
 #else
@@ -279,6 +305,8 @@ static inline int do_migrate_pages(struc
 	return 0;
 }
 
+#define check_migrate_misplaced_page(page, vma, address) (page)
+
 static inline void check_highest_zone(int k)
 {
 }
Index: linux-2.6.17-rc1-mm1/include/linux/fs.h
===================================================================
--- linux-2.6.17-rc1-mm1.orig/include/linux/fs.h	2006-04-05 10:14:36.000000000 -0400
+++ linux-2.6.17-rc1-mm1/include/linux/fs.h	2006-04-05 10:14:41.000000000 -0400
@@ -373,7 +373,7 @@ struct address_space_operations {
 	struct page* (*get_xip_page)(struct address_space *, sector_t,
 			int);
 	/* migrate the contents of a page to the specified target */
-	int (*migratepage) (struct page *, struct page *);
+	int (*migratepage) (struct page *, struct page *, int);
 };
 
 struct backing_dev_info;
@@ -1760,7 +1760,7 @@ extern void simple_release_fs(struct vfs
 extern ssize_t simple_read_from_buffer(void __user *, size_t, loff_t *, const void *, size_t);
 
 #ifdef CONFIG_MIGRATION
-extern int buffer_migrate_page(struct page *, struct page *);
+extern int buffer_migrate_page(struct page *, struct page *, int);
 #else
 #define buffer_migrate_page NULL
 #endif
Index: linux-2.6.17-rc1-mm1/include/linux/gfp.h
===================================================================
--- linux-2.6.17-rc1-mm1.orig/include/linux/gfp.h	2006-03-20 00:53:29.000000000 -0500
+++ linux-2.6.17-rc1-mm1/include/linux/gfp.h	2006-04-05 10:14:41.000000000 -0400
@@ -131,10 +131,13 @@ alloc_pages(gfp_t gfp_mask, unsigned int
 }
 extern struct page *alloc_page_vma(gfp_t gfp_mask,
 			struct vm_area_struct *vma, unsigned long addr);
+extern struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
+					unsigned nid);
 #else
 #define alloc_pages(gfp_mask, order) \
 		alloc_pages_node(numa_node_id(), gfp_mask, order)
 #define alloc_page_vma(gfp_mask, vma, addr) alloc_pages(gfp_mask, 0)
+#define alloc_page_interleave(gfp_mask, order, nid) alloc_pages(gfp_mask, 0)
 #endif
 #define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
 
Index: linux-2.6.17-rc1-mm1/mm/mempolicy.c
===================================================================
--- linux-2.6.17-rc1-mm1.orig/mm/mempolicy.c	2006-04-05 10:14:39.000000000 -0400
+++ linux-2.6.17-rc1-mm1/mm/mempolicy.c	2006-04-05 10:14:41.000000000 -0400
@@ -1179,7 +1179,7 @@ struct zonelist *huge_zonelist(struct vm
 
 /* Allocate a page in interleaved policy.
    Own path because it needs to do special accounting. */
-static struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
+struct page *alloc_page_interleave(gfp_t gfp, unsigned order,
 					unsigned nid)
 {
 	struct zonelist *zl;
Index: linux-2.6.17-rc1-mm1/mm/migrate.c
===================================================================
--- linux-2.6.17-rc1-mm1.orig/mm/migrate.c	2006-04-05 10:14:38.000000000 -0400
+++ linux-2.6.17-rc1-mm1/mm/migrate.c	2006-04-05 10:14:41.000000000 -0400
@@ -59,7 +59,8 @@ int isolate_lru_page(struct page *page, 
 				del_page_from_active_list(zone, page);
 			else
 				del_page_from_inactive_list(zone, page);
-			list_add_tail(&page->lru, pagelist);
+			if (pagelist)
+				list_add_tail(&page->lru, pagelist);
 		}
 		spin_unlock_irq(&zone->lru_lock);
 	}
@@ -88,9 +89,14 @@ int migrate_prep(void)
 	return 0;
 }
 
-static inline void move_to_lru(struct page *page)
+/*
+ * Put a single page back to appropriate lru list via cache.
+ * Removes page reference added by isolate_lru_page, but
+ * the lru_cache_add*() will add a temporary ref while the
+ * pages resides in the cache [pagevec].
+ */
+static inline void putback_lru_page(struct page *page)
 {
-	list_del(&page->lru);
 	if (PageActive(page)) {
 		/*
 		 * lru_cache_add_active checks that
@@ -104,6 +110,12 @@ static inline void move_to_lru(struct pa
 	put_page(page);
 }
 
+static inline void move_to_lru(struct page *page)
+{
+	list_del(&page->lru);
+	putback_lru_page(page);
+}
+
 /*
  * Add isolated pages on the list back to the LRU.
  *
@@ -125,7 +137,7 @@ int putback_lru_pages(struct list_head *
 /*
  * Non migratable page
  */
-int fail_migrate_page(struct page *newpage, struct page *page)
+int fail_migrate_page(struct page *newpage, struct page *page, int faulting)
 {
 	return -EIO;
 }
@@ -335,29 +347,35 @@ EXPORT_SYMBOL(migrate_page_copy);
  *
  * Pages are locked upon entry and exit.
  */
-int migrate_page(struct page *newpage, struct page *page)
+int migrate_page(struct page *newpage, struct page *page, int faulting)
 {
-	int rc;
-	int nr_refs = 2;	/* cache + current */
+	int rc = 0;
+	/*
+	 * nr_refs:  cache + current [+ fault path]
+	 */
+	int nr_refs = 2 + !!faulting;
 
 	BUG_ON(PageWriteback(page));	/* Writeback must be complete */
 
-	rc = migrate_page_unmap_and_replace(newpage, page, nr_refs);
-
+	if (!faulting)
+		rc = migrate_page_try_to_unmap(page, nr_refs);
+	if (!rc)
+		rc = migrate_page_replace_in_mapping(newpage, page, nr_refs);
 	if (rc)
 		return rc;
 
 	migrate_page_copy(newpage, page);
 
 	/*
-	 * Remove auxiliary swap entries and replace
-	 * them with real ptes.
+	 * If we are not already in the fault path, remove auxiliary swap
+	 * entries and replace them with real ptes.
 	 *
 	 * Note that a real pte entry will allow processes that are not
 	 * waiting on the page lock to use the new page via the page tables
 	 * before the new page is unlocked.
 	 */
-	remove_from_swap(newpage);
+	if (!faulting)
+		remove_from_swap(newpage);
 	return 0;
 }
 EXPORT_SYMBOL(migrate_page);
@@ -468,7 +486,7 @@ redo:
 			 * own migration function. This is the most common
 			 * path for page migration.
 			 */
-			rc = mapping->a_ops->migratepage(newpage, page);
+			rc = mapping->a_ops->migratepage(newpage, page, 0);
 			goto unlock_both;
                 }
 
@@ -498,7 +516,7 @@ redo:
 		 */
 		if (!page_has_buffers(page) ||
 		    try_to_release_page(page, GFP_KERNEL)) {
-			rc = migrate_page(newpage, page);
+			rc = migrate_page(newpage, page, 0);
 			goto unlock_both;
 		}
 
@@ -555,23 +573,28 @@ next:
  * if the underlying filesystem guarantees that no other references to "page"
  * exist.
  */
-int buffer_migrate_page(struct page *newpage, struct page *page)
+int buffer_migrate_page(struct page *newpage, struct page *page, int faulting)
 {
 	struct address_space *mapping = page->mapping;
 	struct buffer_head *bh, *head;
-	int nr_refs = 3;	/* cache + bufs + current */
-	int rc;
+	int rc = 0;
+	/*
+	 * nr_refs:  cache + bufs + current [+ fault path]
+	 */
+	int nr_refs = 3 + !!faulting;
 
 	if (!mapping)
 		return -EAGAIN;
 
 	if (!page_has_buffers(page))
-		return migrate_page(newpage, page);
+		return migrate_page(newpage, page, faulting);
 
 	head = page_buffers(page);
 
- 	rc = migrate_page_unmap_and_replace(newpage, page, nr_refs);
-
+	if (!faulting)
+		rc = migrate_page_try_to_unmap(page, nr_refs);
+	if (!rc)
+		rc = migrate_page_replace_in_mapping(newpage, page, nr_refs);
 	if (rc)
 		return rc;
 
@@ -683,3 +706,71 @@ out:
 		nr_pages++;
 	return nr_pages;
 }
+
+/*
+ * attempt to migrate a misplaced page to the specified destination
+ * node.  Page is already unmapped and locked by caller. Anon pages
+ * are in the swap cache.
+ *
+ * page refs on entry/exit:  cache + fault path [+ bufs]
+ */
+struct page *migrate_misplaced_page(struct page *page,
+				 int dest, int interleaved)
+{
+	struct page *newpage;
+	struct address_space *mapping = page_mapping(page);
+	unsigned int gfp;
+
+//TODO:  explicit assertions during debug/testing
+	BUG_ON(!PageLocked(page));
+	BUG_ON(page_mapcount(page));
+	if (PageAnon(page))
+		BUG_ON(!PageSwapCache(page));
+	BUG_ON(!mapping);
+
+	if (isolate_lru_page(page, NULL)) /* incrs page count on success */
+		goto out_nolru;	/* we lost */
+
+//TODO:  or just use GFP_HIGHUSER ?
+	gfp = (unsigned int)mapping_gfp_mask(mapping);
+
+	if (interleaved)
+		newpage = alloc_page_interleave(gfp, 0, dest);
+	else
+		newpage = alloc_pages_node(dest, gfp, 0);
+
+	if (!newpage)
+		goto out;	/* give up */
+	lock_page(newpage);
+
+	if (mapping->a_ops->migratepage) {
+		/*
+		 * migrating in fault path.
+		 * migrate a_op transfers cache [+ buf] refs
+		 */
+		int rc = mapping->a_ops->migratepage(newpage, page, 1);
+		if (rc) {
+			unlock_page(newpage);
+			__free_page(newpage);
+		} else {
+			get_page(newpage);	/* add isolate_lru_page ref */
+			put_page(page);		/* drop       "          "  */
+
+			unlock_page(page);
+			put_page(page);		/* drop fault path ref & free */
+
+			page = newpage;
+		}
+		goto out;
+	} else {
+//TODO:  for now, give up if no address space migrate op.
+//       later, handle w/ default mechanism, like migrate_pages?
+	}
+
+out:
+	putback_lru_page(page);		/* drops a page ref */
+
+out_nolru:
+	return page;
+
+}
Index: linux-2.6.17-rc1-mm1/include/linux/migrate.h
===================================================================
--- linux-2.6.17-rc1-mm1.orig/include/linux/migrate.h	2006-04-05 10:14:38.000000000 -0400
+++ linux-2.6.17-rc1-mm1/include/linux/migrate.h	2006-04-05 10:14:41.000000000 -0400
@@ -7,7 +7,7 @@
 #ifdef CONFIG_MIGRATION
 extern int isolate_lru_page(struct page *p, struct list_head *pagelist);
 extern int putback_lru_pages(struct list_head *l);
-extern int migrate_page(struct page *, struct page *);
+extern int migrate_page(struct page *, struct page *, int);
 extern void migrate_page_copy(struct page *, struct page *);
 extern int migrate_page_try_to_unmap(struct page *, int);
 extern int migrate_page_replace_in_mapping(struct page *, struct page *, int);
@@ -16,7 +16,8 @@ extern int migrate_pages(struct list_hea
 		struct list_head *moved, struct list_head *failed);
 extern int migrate_pages_to(struct list_head *pagelist,
 			struct vm_area_struct *vma, int dest);
-extern int fail_migrate_page(struct page *, struct page *);
+struct page *migrate_misplaced_page(struct page *, int, int);
+extern int fail_migrate_page(struct page *, struct page *, int);
 
 extern int migrate_prep(void);
 


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH 2.6.17-rc1-mm1 2/6] Migrate-on-fault - check for misplaced page
From: Lee Schermerhorn @ 2006-04-07 20:23 UTC (permalink / raw)
  To: linux-mm
In-Reply-To: <1144441108.5198.36.camel@localhost.localdomain>

Migrate-on-fault prototype 2/6 V0.2 - check for misplaced page

V0.2 -	reworked against 2.6.17-rc1-mm1 with Christoph's migration
	code reorg
	Also:	get vma policy after updating task's cpuset memory
		state.  Use mems_allowed in policy to vet nodes,
		but I'm not sure this check is necessary.

This patch provides a new function to test whether a page resides
on a node that is appropriate for the mempolicy for the vma and
address where the page is supposed to be mapped.  This involves
looking up the node where the page belongs.  So, the function
returns that node so that it may be used to allocate the page
without consulting the policy again.  Because interleaved and
non-interleaved allocations are accounted differently, the function
also returns whether or not the new node came from an interleaved
policy, if the page is misplaced.

A subsequent patch will call this function from the fault path for
stable pages with zero page_mapcount().  Because of this, I don't
want to go ahead and allocate the page, e.g., via alloc_page_vma()
only to have to free it if it has the correct policy.  So, I just
mimic the alloc_page_vma() node computation logic.

Note that for "process interleaving" the destination node depends
on the order of access to pages.  I.e., there is no fixed layout
for process interleaved pages, as there is for pages interleaved
via vma policy.  So, as long as the page resides on a node that
exists in the process's interleave set, no migration is indicated.
Having said that, we may never need to call this function without
a vma, so maybe we can lose that "feature".
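
As a rough user-space illustration of the difference described above, the
sketch below computes the fixed interleave node for a vma-interleaved
address and, for process interleaving, only tests set membership.  All
names prefixed with sketch_ are hypothetical, the 64-bit nodemask and the
4 KiB page size are assumptions, and the modulo selection is a model of
what offset_il_node() is expected to do, not a copy of it.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/* Hypothetical stand-in for a vma: only the fields the computation needs. */
struct sketch_vma {
	unsigned long vm_start;	/* first byte of the mapping */
	unsigned long vm_end;	/* one past the last byte */
	unsigned long vm_pgoff;	/* offset of vm_start in the object, in pages */
};

/* Page offset of addr within the mapped object, as in mpol_misplaced(). */
static unsigned long sketch_page_offset(const struct sketch_vma *vma,
					unsigned long addr)
{
	assert(addr >= vma->vm_start && addr < vma->vm_end);
	return vma->vm_pgoff + ((addr - vma->vm_start) >> SKETCH_PAGE_SHIFT);
}

/*
 * Pick the (off % nnodes)-th set bit of nodemask, counting from bit 0.
 * Returns -1 if the mask is empty.
 */
static int sketch_interleave_node(uint64_t nodemask, unsigned long off)
{
	int nnodes = 0, nid;
	unsigned long target;

	for (nid = 0; nid < 64; nid++)
		if (nodemask & (UINT64_C(1) << nid))
			nnodes++;
	if (nnodes == 0)
		return -1;

	target = off % (unsigned long)nnodes;
	for (nid = 0; nid < 64; nid++) {
		if (!(nodemask & (UINT64_C(1) << nid)))
			continue;
		if (target == 0)
			return nid;
		target--;
	}
	return -1;	/* not reached for a non-empty mask */
}

/* Process interleaving: any node in the set is acceptable (curnid in 0..63). */
static int sketch_in_interleave_set(uint64_t nodemask, int curnid)
{
	return (int)((nodemask >> curnid) & 1);
}
```

With a vma policy the same address always maps to the same node, so a
page on any other node is misplaced.  Without a vma, membership in the
set is enough and no migration is indicated.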

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

Index: linux-2.6.17-rc1-mm1/mm/mempolicy.c
===================================================================
--- linux-2.6.17-rc1-mm1.orig/mm/mempolicy.c	2006-04-06 16:45:13.000000000 -0400
+++ linux-2.6.17-rc1-mm1/mm/mempolicy.c	2006-04-06 16:47:14.000000000 -0400
@@ -1874,3 +1874,102 @@ out:
 	return 0;
 }
 
+/**
+ * mpol_misplaced - check whether current page node id valid in policy
+ *
+ * @page   - page to be checked
+ * @vma    - vm area where page mapped
+ * @addr   - virtual address where page mapped
+ * @newnid - [ptr to] node id to which page should be migrated
+ *
+ * lookup current policy node id for vma,addr and "compare to" page's
+ * node id.
+ * if same, return 0 -- reuse current page
+ * if different,
+ *     return destination nid via newnid
+ *     return MPOL_MIGRATE_NONINTERLEAVED for non-interleaved policy
+ *     return MPOL_MIGRATE_INTERLEAVED for interleaved policy.
+ * policy determination mimics alloc_page_vma()
+ */
+int mpol_misplaced(struct page *page, struct vm_area_struct *vma,
+			 unsigned long addr, int *newnid)
+{
+	struct mempolicy *pol;
+	struct zonelist *zl;
+	nodemask_t *mems;
+	int curnid = page_to_nid(page);
+	int polnid = -1, interleave = 0;
+	int i;
+
+//TODO:  can we call this here, in the fault path [with mmap_sem held?]
+//       do we want to?  applications and systems that could benefit from
+//       migrate-on-fault probably want cpusets as well.
+	cpuset_update_task_memory_state();
+	pol = get_vma_policy(current, vma, addr);
+
+	if (unlikely(pol->policy == MPOL_INTERLEAVE)) {
+		interleave = 1;	/* for accounting */
+		if (vma) {
+			unsigned long off;
+			BUG_ON(addr >= vma->vm_end);
+			BUG_ON(addr < vma->vm_start);
+			off = vma->vm_pgoff;
+			off += (addr - vma->vm_start) >> PAGE_SHIFT;
+			polnid = offset_il_node(pol, vma, off);
+		} else {
+//TODO:  can this ever happen?
+			/*
+			 * for process interleaving, just ensure that
+			 * curnid is in policy nodes -- to avoid thrashing
+			 */
+			if (node_isset(curnid, pol->v.nodes))
+				return 0;
+			polnid = interleave_nodes(pol);
+		}
+	} else
+		switch (pol->policy) {
+		case MPOL_PREFERRED:
+			polnid = pol->v.preferred_node;
+			if (polnid < 0)
+				polnid = numa_node_id();
+			break;
+		case MPOL_BIND:
+			/*
+			 * allows binding to multiple nodes.
+			 * use current page if in zonelist,
+			 * else select first allowed node
+			 */
+			mems = &pol->cpuset_mems_allowed;
+			zl = pol->v.zonelist;
+			for (i = 0; zl->zones[i]; i++) {
+				int nid = zl->zones[i]->zone_pgdat->node_id;
+
+				if (nid == curnid)
+					return 0;
+
+				if (polnid < 0 &&
+//TODO:  is this check necessary?
+					node_isset(nid, *mems))
+					polnid = nid;
+			}
+			if (polnid >= 0)
+				break;
+			/*FALL THROUGH*/
+		case MPOL_INTERLEAVE: /* should not happen */
+		case MPOL_DEFAULT:
+			polnid = numa_node_id();
+			break;
+		default:
+			polnid = 0;
+			BUG();
+		}
+
+	if (curnid == polnid)
+		return 0;
+
+	*newnid = polnid;
+	if (interleave)
+		return MPOL_MIGRATE_INTERLEAVED;
+
+	return MPOL_MIGRATE_NONINTERLEAVED;
+}
Index: linux-2.6.17-rc1-mm1/include/linux/mempolicy.h
===================================================================
--- linux-2.6.17-rc1-mm1.orig/include/linux/mempolicy.h	2006-04-06 16:45:13.000000000 -0400
+++ linux-2.6.17-rc1-mm1/include/linux/mempolicy.h	2006-04-06 16:46:17.000000000 -0400
@@ -173,6 +173,17 @@ static inline void check_highest_zone(in
 int do_migrate_pages(struct mm_struct *mm,
 	const nodemask_t *from_nodes, const nodemask_t *to_nodes, int flags);
 
+/*
+ * mm/vmscan.c doesn't include mempolicy.  Keep knowledge of these
+ * macros' values internal to mempolicy.[ch]
+ */
+#define MPOL_MIGRATE_NONINTERLEAVED 1
+#define MPOL_MIGRATE_INTERLEAVED 2
+#define misplaced_is_interleaved(pol) ((pol) == MPOL_MIGRATE_INTERLEAVED)
+
+int mpol_misplaced(struct page *, struct vm_area_struct *,
+		unsigned long, int *);
+
 extern void *cpuset_being_rebound;	/* Trigger mpol_copy vma rebind */
 
 #else



* Re: [PATCH 2.6.17-rc1-mm1 1/6] Migrate-on-fault - separate unmap from radix tree replace
From: Lee Schermerhorn @ 2006-04-07 20:22 UTC (permalink / raw)
  To: linux-mm
In-Reply-To: <1144441108.5198.36.camel@localhost.localdomain>

Migrate-on-fault prototype 1/6 V0.2 - separate unmap from radix tree replace

V0.2 - rework against 2.6.17-rc1, with Christoph's migration code
       reorg.  No change for 2.6.17-rc1-mm1

The migrate_page_remove_references() function performs two distinct
operations:  actually attempting to remove pte references from the
page via try_to_unmap() and replacing the page with a new page in
the page's mapping's radix tree.  This patch separates these 
operations into two functions so that they can be called separately.

Then, migrate_page_remove_references() is replaced with a function
named migrate_page_unmap_and_replace() to indicate the two operations,
and existing calls in mm/migrate.c:migrate_page() and
mm/migrate.c:buffer_migrate_page() are updated.

Note:  this results in each of the functions having to load the
mapping when called for direct migration.  Perhaps passing mapping as
an argument would be preferable?

Subsequent patches in the series will make use of the separate
operations. 

Eventually, we can remove migrate_page_unmap_and_replace()
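
The split can be pictured with a small user-space model.  This is only a
sketch under stated assumptions: struct sketch_page, the single "slot"
standing in for the radix tree entry, and all sketch_ names are
hypothetical, and the model propagates the error code from the first
step rather than reproducing the wrapper's exact return value.

```c
#include <errno.h>

/* Hypothetical user-space model of a page; not the kernel's struct page. */
struct sketch_page {
	int mapcount;	/* number of pte mappings */
	int count;	/* total references */
};

/* Step 1: drop pte references; fail if the count is not what we expect. */
static int sketch_try_to_unmap(struct sketch_page *page, int nr_refs)
{
	if (page->mapcount + nr_refs != page->count)
		return -EAGAIN;	/* someone else holds a reference */
	page->count -= page->mapcount;	/* each mapping held one reference */
	page->mapcount = 0;
	return 0;
}

/* Step 2: swap newpage into the slot that currently holds page. */
static int sketch_replace_in_mapping(struct sketch_page **slot,
				     struct sketch_page *newpage,
				     struct sketch_page *page, int nr_refs)
{
	if (*slot != page || page->count != nr_refs)
		return -EAGAIN;
	newpage->count++;	/* newpage gains the cache reference */
	*slot = newpage;
	page->count--;		/* old page drops the cache reference */
	return 0;
}

/* Direct migration: both steps.  A fault path would call only step 2. */
static int sketch_unmap_and_replace(struct sketch_page **slot,
				    struct sketch_page *newpage,
				    struct sketch_page *page, int nr_refs)
{
	int rc = sketch_try_to_unmap(page, nr_refs);

	if (rc)
		return rc;
	return sketch_replace_in_mapping(slot, newpage, page, nr_refs);
}
```

Note the argument order: the new page comes first and the old page
second in the replace step, matching the prototypes in migrate.h.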

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>

Index: linux-2.6.17-rc1/include/linux/migrate.h
===================================================================
--- linux-2.6.17-rc1.orig/include/linux/migrate.h	2006-04-03 08:51:08.000000000 -0400
+++ linux-2.6.17-rc1/include/linux/migrate.h	2006-04-03 12:09:57.000000000 -0400
@@ -9,7 +9,9 @@ extern int isolate_lru_page(struct page 
 extern int putback_lru_pages(struct list_head *l);
 extern int migrate_page(struct page *, struct page *);
 extern void migrate_page_copy(struct page *, struct page *);
-extern int migrate_page_remove_references(struct page *, struct page *, int);
+extern int migrate_page_try_to_unmap(struct page *, int);
+extern int migrate_page_replace_in_mapping(struct page *, struct page *, int);
+extern int migrate_page_unmap_and_replace(struct page *, struct page *, int);
 extern int migrate_pages(struct list_head *l, struct list_head *t,
 		struct list_head *moved, struct list_head *failed);
 extern int migrate_pages_to(struct list_head *pagelist,
Index: linux-2.6.17-rc1/mm/migrate.c
===================================================================
--- linux-2.6.17-rc1.orig/mm/migrate.c	2006-04-03 08:51:08.000000000 -0400
+++ linux-2.6.17-rc1/mm/migrate.c	2006-04-03 12:09:57.000000000 -0400
@@ -179,14 +179,12 @@ retry:
 EXPORT_SYMBOL(swap_page);
 
 /*
- * Remove references for a page and establish the new page with the correct
- * basic settings to be able to stop accesses to the page.
+ * Try to remove pte references from page in preparation to migrate to
+ * a new page.
  */
-int migrate_page_remove_references(struct page *newpage,
-				struct page *page, int nr_refs)
+int migrate_page_try_to_unmap(struct page *page, int nr_refs)
 {
 	struct address_space *mapping = page_mapping(page);
-	struct page **radix_pointer;
 
 	/*
 	 * Avoid doing any of the following work if the page count
@@ -225,6 +223,19 @@ int migrate_page_remove_references(struc
 	if (page_mapcount(page))
 		return -EAGAIN;
 
+	return 0;
+}
+EXPORT_SYMBOL(migrate_page_try_to_unmap);
+
+/*
+ * replace page in its mapping's radix tree with newpage
+ */
+int migrate_page_replace_in_mapping(struct page *newpage,
+		struct page *page, int nr_refs)
+{
+	struct address_space *mapping = page_mapping(page);
+        struct page **radix_pointer;
+
 	write_lock_irq(&mapping->tree_lock);
 
 	radix_pointer = (struct page **)radix_tree_lookup_slot(
@@ -254,12 +265,29 @@ int migrate_page_remove_references(struc
 	}
 
 	*radix_pointer = newpage;
-	__put_page(page);
+	__put_page(page);		/* drop cache ref */
 	write_unlock_irq(&mapping->tree_lock);
 
 	return 0;
 }
-EXPORT_SYMBOL(migrate_page_remove_references);
+EXPORT_SYMBOL(migrate_page_replace_in_mapping);
+
+/*
+ * Remove references for a page and establish the new page with the correct
+ * basic settings to be able to stop accesses to the page.
+ */
+int migrate_page_unmap_and_replace(struct page *newpage,
+				struct page *page, int nr_refs)
+{
+	/*
+	 * Give up if we were unable to remove all mappings.
+	 */
+	if (migrate_page_try_to_unmap(page, nr_refs))
+		return 1;
+
+	return migrate_page_replace_in_mapping(newpage, page, nr_refs);
+}
+EXPORT_SYMBOL(migrate_page_unmap_and_replace);
 
 /*
  * Copy the page to its new location
@@ -310,10 +338,11 @@ EXPORT_SYMBOL(migrate_page_copy);
 int migrate_page(struct page *newpage, struct page *page)
 {
 	int rc;
+	int nr_refs = 2;	/* cache + current */
 
 	BUG_ON(PageWriteback(page));	/* Writeback must be complete */
 
-	rc = migrate_page_remove_references(newpage, page, 2);
+	rc = migrate_page_unmap_and_replace(newpage, page, nr_refs);
 
 	if (rc)
 		return rc;
@@ -530,6 +559,7 @@ int buffer_migrate_page(struct page *new
 {
 	struct address_space *mapping = page->mapping;
 	struct buffer_head *bh, *head;
+	int nr_refs = 3;	/* cache + bufs + current */
 	int rc;
 
 	if (!mapping)
@@ -540,7 +570,7 @@ int buffer_migrate_page(struct page *new
 
 	head = page_buffers(page);
 
-	rc = migrate_page_remove_references(newpage, page, 3);
+ 	rc = migrate_page_unmap_and_replace(newpage, page, nr_refs);
 
 	if (rc)
 		return rc;
@@ -556,7 +586,7 @@ int buffer_migrate_page(struct page *new
 	ClearPagePrivate(page);
 	set_page_private(newpage, page_private(page));
 	set_page_private(page, 0);
-	put_page(page);
+	put_page(page);		/* transfer buf ref to newpage */
 	get_page(newpage);
 
 	bh = head;



* [PATCH 2.6.17-rc1-mm1 0/6] Migrate-on-fault - Overview
From: Lee Schermerhorn @ 2006-04-07 20:18 UTC (permalink / raw)
  To: linux-mm

This is a reposting of the migrate-on-fault series, against
the 2.6.17-rc1-mm1 tree.  I would love to get some feedback on 
these patches--especially regarding criteria for getting them
into the mm tree for wider testing.

I will send the remainder of the series as responses to this 
message.  Auto-migrate series V0.2 to follow.

Lee
----------------------------------------------------------------------

Migrate-on-fault prototype 0/6 V0.2 - Overview

V0.2 -	refreshed against 2.6.17-rc1-mm1 with Christoph's migration
	code reorg.
	Some rework to 'mpol_misplaced'.  See comments therein.

TODO:
	+ make a Kconfig sub-option of MIGRATION?
	+ add a sysctl to enable/disable migrate on fault?
		separate controls for anon, page cache?


This series of patches, against 2.6.17-rc1-mm1, implements page migration
in the fault path.  Based on discussions with Christoph Lameter, this 
seems like the next logical step in page migration.

The basic idea is that when a fault handler [do_swap_page, filemap_nopage,
...] finds a cached page with zero mappings that is otherwise "stable"--
i.e., no writebacks--this is a good opportunity to check whether the 
page resides on the node indicated by the policy in the current context.

We only want to check if there are zero mappings because 1) we can easily
migrate the page--don't have to go through the effort of removing all
mappings and 2) default policy--a common case--can give different answers
from different tasks running on different nodes.  Checking the policy
when there are zero mappings effectively implements a "first touch"
placement policy.
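
A minimal sketch of that check, in user-space C.  The structure and
function names are hypothetical, and the policy node is assumed to have
been computed elsewhere; only the three conditions [zero mappings,
stable, misplaced] come from the description above.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the state the fault path would look at. */
struct sketch_page_state {
	int mapcount;		/* pte mappings of the page */
	bool writeback;		/* page is under writeback */
	int page_nid;		/* node the page currently resides on */
};

/*
 * Returns true when the page is a candidate for migrate-on-fault:
 * no mappings, stable (no writeback), and not on the node that the
 * policy selects for the faulting context.
 */
static bool sketch_should_migrate(const struct sketch_page_state *p,
				  int policy_nid)
{
	if (p->mapcount != 0)
		return false;	/* mapped elsewhere: leave it alone */
	if (p->writeback)
		return false;	/* not stable */
	return p->page_nid != policy_nid;
}
```

Testing the cheap conditions first keeps the policy lookup off the
common path where the page is still mapped by another task.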

Note that this mechanism can be used to migrate page cache pages that 
were read in earlier, are no longer referenced, but are about to be
used by a new task on a node other than the one where the page resides.  The
same mechanism can be used to pull anon pages along with a task when
the load balancer decides to move it to another node.  However, that
will require a bit more mechanism, and is the subject of another
patch series.

The current [2.6.17-rc*] direct migration facility supports most of the
mechanism that is required to implement this "migration on fault".  
Some of the necessary operations are combined in functions with other
code that isn't required [must not be executed] in the fault path,
so these have been separated out in a couple of cases.

Then we need to add the function[s] to test the current page in the
fault path for zero mapping, no writebacks, misplacement; and the
function[s] to actually migrate the page contents to a newly
allocated page using the [modified] migratepage address space
operations of the direct migration mechanism.

The Patches:

The patches are broken out in the order I implemented them. Each
should build and boot on its own.  [at least they did at one time!]

migrate-on-fault-01-separate-unmap-replace.patch

	Separates the mm/migrate.c:migrate_page_remove_references()
	function into its 2 distinct operations:  removing references
	[try_to_unmap()], and replacing the old page in the radix 
	tree of the page's "mapping".  Only the second part is 
	needed in the fault path, as the page is already completely
	unmapped.

	A wrapper function that calls both operations is provided,
	and the 2 places that call migrate_page_remove_references()
	have been modified to call that wrapper.

migrate-on-fault-02-mpol_misplaced.patch

	This patch implements the function mpol_misplaced() in
	mm/mempolicy.c to check whether a page resides on the
	node indicated by the vma and address arguments.  If
	so, it returns 0 [!misplaced].  If not, it returns an
	indication of whether the policy was interleaved or not
	[for properly accounting later allocation] and passes the
	node indicated by the policy through a pointer argument.

	Because this will be called in the fault path, I don't 
	want to go through the effort of actually allocating a
	page--e.g., via alloc_page_vma()--only to find that the
	current page is on the correct node.  However, I wanted
	to come to the same answer that alloc_page_vma() would.
	So, mpol_misplaced() mimics the node computation logic
	of alloc_page_vma().
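
The calling convention can be sketched in user-space C as follows.  The
two return values are the ones defined in the patch; everything else
[the sketch_ names, the plain array standing in for the zonelist] is a
hypothetical simplification, and the cpuset mems_allowed check is left
out.

```c
#define SKETCH_MIGRATE_NONINTERLEAVED 1	/* same values as in the patch */
#define SKETCH_MIGRATE_INTERLEAVED    2

/*
 * MPOL_BIND-style selection: keep the page if its node is in the list,
 * else fall back to the first listed node.  Returns -1 for an empty list.
 */
static int sketch_bind_node(const int *nodes, int n, int curnid)
{
	int i;

	for (i = 0; i < n; i++)
		if (nodes[i] == curnid)
			return curnid;
	return n > 0 ? nodes[0] : -1;
}

/*
 * curnid:      node the page is on
 * polnid:      node the policy selects (computed elsewhere)
 * interleaved: non-zero if polnid came from an interleave policy
 * newnid:      written only when the page is misplaced
 */
static int sketch_misplaced(int curnid, int polnid, int interleaved,
			    int *newnid)
{
	if (curnid == polnid)
		return 0;	/* reuse current page */

	*newnid = polnid;
	return interleaved ? SKETCH_MIGRATE_INTERLEAVED
			   : SKETCH_MIGRATE_NONINTERLEAVED;
}
```

A caller would treat 0 as "not misplaced" and otherwise pass the node
stored through newnid, plus the interleaved/non-interleaved indication,
on to the allocation step.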

migrate-on-fault-03-migrate_misplaced_page.patch

	This patch contains the main migrate on fault functions:

	check_migrate_misplaced_page() is implemented as a static
	inline function in mempolicy.h when MIGRATION is configured.
	If the page has zero mappings, is stable and misplaced,
	check_*() will call migrate_misplaced_page() in mm/migrate.c
	to do the dirty work.  If for any reason the page can't
	or shouldn't be migrated, these functions will return the
	old page in the state it was found.

	Note that when a page is NOT found in the cache, and the fault
	handler has to allocate one and read it in, it will have zero
	mappings, so check_migrate_misplaced_page() WILL call
	mpol_misplaced() to see if it needs migration.  Of course, it
	should have been allocated on the correct node, so no migration
	should be necessary.  However, it's possible that the node 
	indicated by the policy has no free pages so the newly 
	allocated page may be on a different node.  In this case, I
	guess check_migrate_misplaced_page() will attempt to migrate
	it.  In either case, the "unnecessary" calls to mpol_misplaced()
	and to migrate_misplaced_page(), if the original allocation
	"overflowed", occur after an IO, so this is the slow path
	anyway.  

	When MIGRATION is NOT configured, check_migrate_misplaced_page()
	becomes a macro that evaluates to its argument page.

	More details with the patch.
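
The configured/unconfigured split follows a common pattern, sketched
here in user-space C.  SKETCH_CONFIG_MIGRATION, struct sketch_page and
the sketch_ function names are hypothetical stand-ins; only the idea
that the unconfigured macro evaluates to its page argument is taken
from the description above.

```c
struct sketch_page {
	int nid;	/* node the page resides on */
};

#ifdef SKETCH_CONFIG_MIGRATION
/* Hypothetical: would check placement and may return a replacement page. */
struct sketch_page *sketch_check_migrate(struct sketch_page *page,
					 int want_nid);
#else
/* Migration not configured: evaluate to the page argument, unchanged. */
#define sketch_check_migrate(page, want_nid) (page)
#endif

/* A caller looks the same in both configurations. */
static struct sketch_page *sketch_fault_lookup(struct sketch_page *cached,
					       int want_nid)
{
	return sketch_check_migrate(cached, want_nid);
}
```

Because the caller always uses the returned page, the fault handlers
need no #ifdefs of their own.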

migrate-on-fault-04.1-misplaced-anon-pages.patch

	This is a simple one-liner [OK, 2, counting an empty line]
	to call check_migrate_misplaced_page() from do_swap_page()
	in memory.c.  

	Patches to hook other fault paths [filemap_nopage(), etc.] 
	are still TBD.

migrate-on-fault-05-mbind-lazy-migrate.patch

	This patch adds an MPOL_MF_LAZY [maybe should be '_DEFERRED?]
	flag to modify the behavior of MPOL_MF_MOVE[_ALL].  When
	the 'LAZY flag is specified, mbind() simply unmaps eligible
	pages in the specified range, moving anon pages to the
	swap cache, if not already there.  Then, when the task
	touches the pages, or queries their location via 
	get_mempolicy(..., MPOL_F_NODE|MPOL_F_ADDR), it will take
	a fault, find the page in the cache and migrate it, if the
	policy so indicates.  Actually, this will only happen for
	anon pages, until additional fault paths are hooked up.

	This patch allows me to test the migrate on fault mechanism
	by forcing pages to be unmapped.
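
The lazy behavior can be modeled in a few lines of user-space C.  This
is a toy model, not the mbind() implementation: struct sketch_page and
the sketch_ functions are hypothetical, and changing the nid field
stands in for allocating a new page and copying the contents.

```c
#include <stdbool.h>
#include <stddef.h>

struct sketch_page {
	bool mapped;	/* page has a pte mapping in the task */
	int nid;	/* node the page resides on */
};

/* 'MOVE + 'LAZY: only unmap the pages; nothing is copied yet. */
static void sketch_lazy_unmap(struct sketch_page *pages, size_t n)
{
	for (size_t i = 0; i < n; i++)
		pages[i].mapped = false;
}

/*
 * Touching a page: if it is unmapped we "fault", and migrate only if
 * the page is on the wrong node.  Returns true if a migration happened.
 */
static bool sketch_touch(struct sketch_page *page, int policy_nid)
{
	bool migrated = false;

	if (!page->mapped) {
		if (page->nid != policy_nid) {
			page->nid = policy_nid;	/* stands in for the copy */
			migrated = true;
		}
		page->mapped = true;
	}
	return migrated;
}
```

Pages that are never touched again are never copied, which is the
point of deferring the work to the fault path.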

migrate-on-fault-06-mbind-noop-policy.patch

	This patch adds a "NO-OP" policy to mbind() so that the
	"'MOVE+'LAZY" unmap-only function can be performed on a
	range of task memory without changing the policy.


Testing:

I have tested migrate-on-fault of anon pages using the MPOL_MF_LAZY 
extension to mbind() discussed in patch 5 above on 2.6.17-rc1-mm1.
I have an ad hoc [odd hack?] test program, called memtoy, available at:

	http://free.linux.hp.com/~lts/Tools/memtoy-latest.tar.gz

The Xpm-tests subdirectory in the tarball contains memtoy test
scripts for "manual page migration"--i.e., the migrate_pages()
syscall, "direct migration" using mbind(MPOL_MF_MOVE) and
migrate-on-fault using mbind(MPOL_MF_MOVE+MPOL_MF_LAZY).

I have also tested with the "automigration" series layered on top
of this one.  In that environment, whenever the scheduler migrates
a task to a new node, the task unmaps pages with default policy and
migrates them, if necessary, on first touch after unmap.  Running
kernel builds in this environment provides a fairly good stress test
of the migrate-on-fault mechanism.




* Re: [PATCH] mm: limit lowmem_reserve
From: Nick Piggin @ 2006-04-07 12:40 UTC (permalink / raw)
  To: Con Kolivas; +Cc: Andrew Morton, ck, linux list, linux-mm
In-Reply-To: <200604071902.16011.kernel@kolivas.org>

Con Kolivas wrote:
> On Friday 07 April 2006 16:25, Nick Piggin wrote:
> 
>>Con Kolivas wrote:
>>
>>>It is possible with a low enough lowmem_reserve ratio to make
>>>zone_watermark_ok always fail if the lower_zone is small enough.
>>
>>I don't see how this would happen?
> 
> 
> 3GB lowmem and a reserve ratio of 180 is enough to do it.
> 

How would zone_watermark_ok always fail though?

-- 
SUSE Labs, Novell Inc.

* Re: [patch 1/3] mm: An enhancement of OVERCOMMIT_GUESS
From: Hideo AOKI @ 2006-04-07 11:49 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki; +Cc: akpm, linux-kernel, linux-mm
In-Reply-To: <20060406170851.1402c78d.kamezawa.hiroyu@jp.fujitsu.com>

[-- Attachment #1: Type: text/plain, Size: 1869 bytes --]

Hi Kamezawa-san,

Thank you for your quick response, and sorry for my slow reply.

KAMEZAWA Hiroyuki wrote:

> Hideo AOKI <haoki@redhat.com> wrote:
> 
>>Since __vm_enough_memory() doesn't know zone and cpuset information,
>>we have to guess proper value of lowmem_reserve in each zone
>>like I did in calculate_totalreserve_pages() in my patch.
>>Do you think that we can do this calculation every time?
>>
>>If it is good enough, I'll make revised patch.
>>
> 
> I just thought to show "how to calculate" in unified way is better.

I got it.

> Do you have a detailed comparison of test result with and without this patch ?

Yes. I have test logs and attach them to this e-mail.

The logs are verbose output of my test kernel module which I already
sent to lkml.
http://marc.theaimsgroup.com/?l=linux-kernel&m=114428121522349&w=2

Test machine was i386 4GB memory PC. I didn't use swap region.


Let me explain a few things about the log.

* 2.6.17-rc1-mm1

HIGH: <active 18220><inactive 12278><free 1419><sum 31917><present 622220>
NORMAL: <active 1618><inactive 2293><free 1397><sum 5308><present 225280>

   The test module consumes free pages until the number of free pages
   is less than pages_high.


<buf 3916><cache 31785><slab reclaim 1550><swap 0> <+ 1> <target 33336>

   This line shows the status of memory just before the module calls
   __vm_enough_memory(). Meaning of each item is below.

     buf:            bufferram
     cache:          page cache
     slab reclaim:   slab_reclaim_pages
     swap:           nr_swap_pages
     +:              margin
     target:         the number of pages to ask __vm_enough_memory()
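
As a worked check of the line above, the numbers are consistent with
target = cache + slab reclaim + swap + margin.  The sketch below only
restates that arithmetic; the struct and function names are
hypothetical, and the formula is an assumption read off the quoted
values, not taken from the module source.

```c
/* Fields of the "<buf ...><cache ...>..." status line, in pages. */
struct vm_status {
	long buf;		/* bufferram */
	long cache;		/* page cache */
	long slab_reclaim;	/* slab_reclaim_pages */
	long swap;		/* nr_swap_pages */
	long margin;		/* the "+" item */
};

/*
 * Assumption drawn from the quoted numbers only: the request is page
 * cache + reclaimable slab + swap + margin.  buf is not added here.
 */
static long guess_target(const struct vm_status *s)
{
	return s->cache + s->slab_reclaim + s->swap + s->margin;
}
```

With the values shown, 31785 + 1550 + 0 + 1 = 33336, the target in the
log line.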


Test MAY be <failed>.

   This line shows __vm_enough_memory() returned success.


Please let me know if you have any questions and suggestions.

Regards,
Hideo Aoki

---
Hideo Aoki, Hitachi Computer Products (America) Inc.

[-- Attachment #2: log-2.6.17-rc1-mm1.txt --]
[-- Type: text/plain, Size: 2233 bytes --]

* 2.6.17-rc1-mm1

Apr  6 20:33:33 dhcp1 kernel: Test module was loaded. <mode 1>
Apr  6 20:33:33 dhcp1 kernel: init ...<3>done
Apr  6 20:33:33 dhcp1 kernel:
Apr  6 20:33:33 dhcp1 kernel: HIGH: <active 18238><inactive 12278><free 590698><sum 621214><present 622220>
Apr  6 20:33:34 dhcp1 kernel: HighMem <target 589272>, <3>
Apr  6 20:33:34 dhcp1 kernel: HIGH: <active 18220><inactive 12278><free 1512><sum 32010><present 622220>
Apr  6 20:33:34 dhcp1 kernel:
Apr  6 20:33:34 dhcp1 kernel: HIGH: <active 18220><inactive 12278><free 1512><sum 32010><present 622220>
Apr  6 20:33:34 dhcp1 kernel: HighMem <target 86>, <3>
Apr  6 20:33:34 dhcp1 kernel: HIGH: <active 18220><inactive 12278><free 1419><sum 31917><present 622220>
Apr  6 20:33:34 dhcp1 kernel: already satisfied
Apr  6 20:33:34 dhcp1 kernel:
Apr  6 20:33:34 dhcp1 kernel: NORMAL: <active 1618><inactive 2277><free 205532><sum 209427><present 225280>
Apr  6 20:33:34 dhcp1 kernel: Normal <target 204124>, <3>
Apr  6 20:33:34 dhcp1 kernel: NORMAL: <active 1618><inactive 2291><free 1490><sum 5399><present 225280>
Apr  6 20:33:34 dhcp1 kernel:
Apr  6 20:33:34 dhcp1 kernel: NORMAL: <active 1618><inactive 2293><free 1490><sum 5401><present 225280>
Apr  6 20:33:34 dhcp1 kernel: Normal <target 82>, <3>
Apr  6 20:33:34 dhcp1 kernel: NORMAL: <active 1618><inactive 2293><free 1428><sum 5339><present 225280>
Apr  6 20:33:34 dhcp1 kernel:
Apr  6 20:33:34 dhcp1 kernel: NORMAL: <active 1618><inactive 2293><free 1428><sum 5339><present 225280>
Apr  6 20:33:34 dhcp1 kernel: Normal <target 20>, <3>
Apr  6 20:33:34 dhcp1 kernel: NORMAL: <active 1618><inactive 2293><free 1397><sum 5308><present 225280>
Apr  6 20:33:34 dhcp1 kernel: already satisfied
Apr  6 20:33:34 dhcp1 kernel: concrete test ...
Apr  6 20:33:34 dhcp1 kernel: <buf 3916><cache 31785><slab reclaim 1550><swap 0> <+ 1> <target 33336>
Apr  6 20:33:34 dhcp1 kernel: Test MAY be <failed>.
Apr  6 20:33:34 dhcp1 kernel: allocation failed: out of vmalloc space - use vmalloc=<size> to increase size.
Apr  6 20:33:35 dhcp1 kernel: allocation failed: out of vmalloc space - use vmalloc=<size> to increase size.
Apr  6 20:33:35 dhcp1 kernel: Test SURELY was <FAILED>.
Apr  6 20:33:35 dhcp1 kernel: concrete test ...done.

[-- Attachment #3: log-2.6.17-rc1-mm1+patch.txt --]
[-- Type: text/plain, Size: 4053 bytes --]

* 2.6.17-rc1-mm1 + patches

Apr  6 20:56:36 dhcp1 kernel: Test module was loaded. <mode 1>
Apr  6 20:56:36 dhcp1 kernel: init ...<3>done
Apr  6 20:56:36 dhcp1 kernel:
Apr  6 20:56:36 dhcp1 kernel: HIGH: <active 17074><inactive 13427><free 590727><sum 621228><present 622220>
Apr  6 20:56:36 dhcp1 kernel: HighMem <target 589301>, <3>
Apr  6 20:56:36 dhcp1 kernel: HIGH: <active 17074><inactive 13427><free 1479><sum 31980><present 622220>
Apr  6 20:56:36 dhcp1 kernel:
Apr  6 20:56:36 dhcp1 kernel: HIGH: <active 17074><inactive 13427><free 1479><sum 31980><present 622220>
Apr  6 20:56:36 dhcp1 kernel: HighMem <target 53>, <3>
Apr  6 20:56:36 dhcp1 kernel: HIGH: <active 17074><inactive 13427><free 1417><sum 31918><present 622220>
Apr  6 20:56:36 dhcp1 kernel: already satisfied
Apr  6 20:56:36 dhcp1 kernel:
Apr  6 20:56:36 dhcp1 kernel: NORMAL: <active 1626><inactive 2248><free 205669><sum 209543><present 225280>
Apr  6 20:56:36 dhcp1 kernel: Normal <target 204261>, <3>
Apr  6 20:56:36 dhcp1 kernel: NORMAL: <active 1626><inactive 2262><free 1441><sum 5329><present 225280>
Apr  6 20:56:36 dhcp1 kernel:
Apr  6 20:56:36 dhcp1 kernel: NORMAL: <active 1626><inactive 2264><free 1441><sum 5331><present 225280>
Apr  6 20:56:36 dhcp1 kernel: Normal <target 33>, <3>
Apr  6 20:56:36 dhcp1 kernel: NORMAL: <active 1626><inactive 2264><free 1410><sum 5300><present 225280>
Apr  6 20:56:36 dhcp1 kernel:
Apr  6 20:56:36 dhcp1 kernel: NORMAL: <active 1626><inactive 2264><free 1410><sum 5300><present 225280>
Apr  6 20:56:36 dhcp1 kernel: Normal <target 2>, <3>
Apr  6 20:56:36 dhcp1 kernel: NORMAL: <active 1626><inactive 2265><free 1410><sum 5301><present 225280>
Apr  6 20:56:36 dhcp1 kernel:
Apr  6 20:56:36 dhcp1 kernel: NORMAL: <active 1626><inactive 2265><free 1410><sum 5301><present 225280>
Apr  6 20:56:36 dhcp1 kernel: Normal <target 2>, <3>
Apr  6 20:56:36 dhcp1 kernel: NORMAL: <active 1626><inactive 2265><free 1410><sum 5301><present 225280>
Apr  6 20:56:36 dhcp1 kernel:
Apr  6 20:56:36 dhcp1 kernel: NORMAL: <active 1626><inactive 2265><free 1410><sum 5301><present 225280>
Apr  6 20:56:36 dhcp1 kernel: Normal <target 2>, <3>
Apr  6 20:56:36 dhcp1 kernel: NORMAL: <active 1626><inactive 2265><free 1410><sum 5301><present 225280>
Apr  6 20:56:36 dhcp1 kernel:
Apr  6 20:56:36 dhcp1 kernel: NORMAL: <active 1626><inactive 2265><free 1410><sum 5301><present 225280>
Apr  6 20:56:36 dhcp1 kernel: Normal <target 2>, <3>
Apr  6 20:56:36 dhcp1 kernel: NORMAL: <active 1626><inactive 2265><free 1410><sum 5301><present 225280>
Apr  6 20:56:36 dhcp1 kernel:
Apr  6 20:56:36 dhcp1 kernel: NORMAL: <active 1626><inactive 2265><free 1410><sum 5301><present 225280>
Apr  6 20:56:36 dhcp1 kernel: Normal <target 2>, <3>
Apr  6 20:56:36 dhcp1 kernel: NORMAL: <active 1626><inactive 2265><free 1410><sum 5301><present 225280>
Apr  6 20:56:36 dhcp1 kernel:
Apr  6 20:56:36 dhcp1 kernel: NORMAL: <active 1626><inactive 2265><free 1410><sum 5301><present 225280>
Apr  6 20:56:36 dhcp1 kernel: Normal <target 2>, <3>
Apr  6 20:56:36 dhcp1 kernel: NORMAL: <active 1626><inactive 2265><free 1410><sum 5301><present 225280>
Apr  6 20:56:36 dhcp1 kernel:
Apr  6 20:56:36 dhcp1 kernel: NORMAL: <active 1626><inactive 2265><free 1410><sum 5301><present 225280>
Apr  6 20:56:36 dhcp1 kernel: Normal <target 2>, <3>
Apr  6 20:56:36 dhcp1 kernel: NORMAL: <active 1626><inactive 2265><free 1410><sum 5301><present 225280>
Apr  6 20:56:36 dhcp1 kernel:
Apr  6 20:56:36 dhcp1 kernel: NORMAL: <active 1626><inactive 2265><free 1410><sum 5301><present 225280>
Apr  6 20:56:36 dhcp1 kernel: Normal <target 2>, <3>
Apr  6 20:56:36 dhcp1 kernel: NORMAL: <active 1626><inactive 2265><free 1379><sum 5270><present 225280>
Apr  6 20:56:36 dhcp1 kernel: already satisfied
Apr  6 20:56:36 dhcp1 kernel: concrete test ...
Apr  6 20:56:36 dhcp1 kernel: <buf 3902><cache 31720><slab reclaim 1538><swap 0> <+ 1> <target 33259>
Apr  6 20:56:36 dhcp1 kernel: Test was <PASSED>.
Apr  6 20:56:36 dhcp1 kernel: concrete test ...done.
Apr  6 20:56:48 dhcp1 kernel: Unloading module ...

^ permalink raw reply

* Re: [PATCH] mm: limit lowmem_reserve
From: Con Kolivas @ 2006-04-07  9:02 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, ck, linux list, linux-mm
In-Reply-To: <443605E1.7060203@yahoo.com.au>

On Friday 07 April 2006 16:25, Nick Piggin wrote:
> Con Kolivas wrote:
> > It is possible with a low enough lowmem_reserve ratio to make
> > zone_watermark_ok always fail if the lower_zone is small enough.
>
> I don't see how this would happen?

3GB lowmem and a reserve ratio of 180 is enough to do it.
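
One way to read those numbers is sketched below. It rests on assumptions
not stated in this mail: 4 KiB pages, a 16 MiB ZONE_DMA as the small lower
zone, and a simplified model in which the lower zone's reserve is the
higher zone's page count divided by the ratio. All names are hypothetical.

```c
#include <assert.h>

#define PAGE_SHIFT_SKETCH 12 /* assumption: 4 KiB pages */

/* Convert MiB to pages without overflowing a 32-bit unsigned long. */
unsigned long mib_to_pages(unsigned long mib)
{
	return mib << (20 - PAGE_SHIFT_SKETCH);
}

/* Simplified model: pages held back in the lower zone. */
unsigned long lowmem_reserve_pages(unsigned long higher_zone_pages,
				   unsigned long ratio)
{
	assert(ratio != 0);
	return higher_zone_pages / ratio;
}

/*
 * Nonzero when the reserve alone covers the whole lower zone, so a
 * "free pages > watermark + reserve" style check can never pass there.
 */
int reserve_swallows_zone(unsigned long lower_zone_pages,
			  unsigned long higher_zone_pages,
			  unsigned long ratio)
{
	return lowmem_reserve_pages(higher_zone_pages, ratio) >= lower_zone_pages;
}
```

Under these assumptions, 3GB is 786432 pages, and 786432 / 180 = 4369
pages of reserve against a 4096-page lower zone.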

Cheers,
Con

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH] mm: limit lowmem_reserve
From: Nick Piggin @ 2006-04-07  6:25 UTC (permalink / raw)
  To: Con Kolivas; +Cc: Andrew Morton, ck, linux list, linux-mm
In-Reply-To: <200604061110.35789.kernel@kolivas.org>

Con Kolivas wrote:
> It is possible with a low enough lowmem_reserve ratio to make
> zone_watermark_ok always fail if the lower_zone is small enough.

I don't see how this would happen?

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply

* Re: [Patch:003/004] wait_table and zonelist initializing for memory hotadd (wait_table initialization)
From: Dave Hansen @ 2006-04-07  3:12 UTC (permalink / raw)
  To: Yasunori Goto; +Cc: Andrew Morton, Linux Kernel ML, linux-mm
In-Reply-To: <20060407104859.EBED.Y-GOTO@jp.fujitsu.com>

On Fri, 2006-04-07 at 12:10 +0900, Yasunori Goto wrote:
> 
> This size doesn't mean bytes. It is hash table entry size.
> So, wait_table_hash_size() or wait_table_entry_size() might be better.

wait_table_hash_nr_entries() is pretty obvious, although a bit long.

-- Dave

^ permalink raw reply

* Re: [Patch:003/004] wait_table and zonelist initializing for memory hotadd (wait_table initialization)
From: Yasunori Goto @ 2006-04-07  3:10 UTC (permalink / raw)
  To: Dave Hansen; +Cc: Andrew Morton, Linux Kernel ML, linux-mm
In-Reply-To: <1144361104.9731.190.camel@localhost.localdomain>

> On Wed, 2006-04-05 at 20:01 +0900, Yasunori Goto wrote:
> > 
> > +#ifdef CONFIG_MEMORY_HOTPLUG
> >  static inline unsigned long wait_table_size(unsigned long pages)
> >  {
> >         unsigned long size = 1;
> > @@ -1803,6 +1804,17 @@ static inline unsigned long wait_table_s
> >  
> >         return max(size, 4UL);
> >  }
> > +#else
> > +/*
> > + * XXX: Because zone size might be changed by hot-add,
> > + *      It is hard to determine suitable size for wait_table as traditional.
> > + *      So, we use maximum size now.
> > + */
> > +static inline unsigned long wait_table_size(unsigned long pages)
> > +{
> > +       return 4096UL;
> > +}
> > +#endif 
> 
> Sorry for the slow response.  My IBM email is temporarily dead.
> 
> Couple of things.  
> 
> First, does appending UL to the constants do anything useful for the
> functions?  It ends up looking a little messy to me.

I would like to show that it is the maximum size returned by the original
wait_table_size(). The original uses 4096UL for it.

> Also, I thought you were going to put a big fat comment on there about
> doing it correctly in the future.  It would also be nice to quantify the
> wasted space in terms of bytes, just so that people get a feel for it.

Hmmm. Ok.

> Oh, and wait_table_size() needs a unit.  wait_table_size_bytes() sounds
> like a winner to me.

This size doesn't mean bytes. It is hash table entry size.
So, wait_table_hash_size() or wait_table_entry_size() might be better.
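
A userspace sketch of the entry-count calculation, to make the unit
concrete. The 4UL and 4096UL bounds come from the quoted patch; the value
of 256 pages per wait queue and the doubling loop are assumptions about
the part of the function the diff context does not show, and the
per-entry byte size is a caller-supplied, hypothetical parameter.

```c
#include <assert.h>
#include <stddef.h>

#define PAGES_PER_WAITQUEUE_SKETCH 256UL /* assumption, not in the quoted diff */
#define WAIT_TABLE_MIN_ENTRIES     4UL
#define WAIT_TABLE_MAX_ENTRIES     4096UL

/* Number of hash table entries (not bytes) for a zone of 'pages' pages. */
unsigned long wait_table_nr_entries(unsigned long pages)
{
	unsigned long size = 1;

	pages /= PAGES_PER_WAITQUEUE_SKETCH;
	while (size < pages)
		size <<= 1;

	if (size > WAIT_TABLE_MAX_ENTRIES)
		size = WAIT_TABLE_MAX_ENTRIES;
	if (size < WAIT_TABLE_MIN_ENTRIES)
		size = WAIT_TABLE_MIN_ENTRIES;
	return size;
}

/* Bytes spent beyond the computed size if the maximum is always used. */
size_t wait_table_extra_bytes(unsigned long pages, size_t entry_bytes)
{
	assert(entry_bytes > 0);
	return (WAIT_TABLE_MAX_ENTRIES - wait_table_nr_entries(pages)) * entry_bytes;
}
```

For a 225280-page zone this gives 1024 entries, so always using 4096
entries costs 3072 extra entries times whatever one entry occupies on the
target configuration.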

Thanks.

-- 
Yasunori Goto 


^ permalink raw reply

* Re: [Patch:003/004] wait_table and zonelist initializing for memory hotadd (wait_table initialization)
From: Dave Hansen @ 2006-04-06 22:05 UTC (permalink / raw)
  To: Yasunori Goto; +Cc: Andrew Morton, Linux Kernel ML, linux-mm
In-Reply-To: <20060405195913.3C45.Y-GOTO@jp.fujitsu.com>

On Wed, 2006-04-05 at 20:01 +0900, Yasunori Goto wrote:
> 
> +#ifdef CONFIG_MEMORY_HOTPLUG
>  static inline unsigned long wait_table_size(unsigned long pages)
>  {
>         unsigned long size = 1;
> @@ -1803,6 +1804,17 @@ static inline unsigned long wait_table_s
>  
>         return max(size, 4UL);
>  }
> +#else
> +/*
> + * XXX: Because zone size might be changed by hot-add,
> + *      It is hard to determine suitable size for wait_table as traditional.
> + *      So, we use maximum size now.
> + */
> +static inline unsigned long wait_table_size(unsigned long pages)
> +{
> +       return 4096UL;
> +}
> +#endif 

Sorry for the slow response.  My IBM email is temporarily dead.

Couple of things.  

First, does appending UL to the constants do anything useful for the
functions?  It ends up looking a little messy to me.

Also, I thought you were going to put a big fat comment on there about
doing it correctly in the future.  It would also be nice to quantify the
wasted space in terms of bytes, just so that people get a feel for it.

Oh, and wait_table_size() needs a unit.  wait_table_size_bytes() sounds
like a winner to me.

-- Dave

^ permalink raw reply

* Re: [patch 1/3] mm: An enhancement of OVERCOMMIT_GUESS
From: KAMEZAWA Hiroyuki @ 2006-04-06  8:08 UTC (permalink / raw)
  To: Hideo AOKI; +Cc: akpm, linux-kernel, linux-mm
In-Reply-To: <4434C12A.4000108@redhat.com>

On Thu, 06 Apr 2006 03:20:10 -0400
Hideo AOKI <haoki@redhat.com> wrote:

> Hi Kamezawa-san,
> 
> Thank you for your comments.
> 
> KAMEZAWA Hiroyuki wrote:
> > Hi, AOKI-san
> I like your idea. But, in the function, I think we need to care
> lowmem_reserve too.
> 
Ah, I see.

> Since __vm_enough_memory() doesn't know zone and cpuset information,
> we have to guess proper value of lowmem_reserve in each zone
> like I did in calculate_totalreserve_pages() in my patch.
> Do you think that we can do this calculation every time?
> 
> If it is good enough, I'll make revised patch.
> 
I just thought to show "how to calculate" in unified way is better.
But if things get ugly, please ignore my comment.
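
The per-zone guess discussed in the quoted text could take roughly the
following shape. This is only a sketch: the body of
calculate_totalreserve_pages() is not shown in this part of the thread, so
the struct, the field layout and the function name below are hypothetical
stand-ins, and treating pages_high as reserved is an assumption.

```c
#include <assert.h>

#define NR_ZONES_SKETCH 3

/* Hypothetical, trimmed-down stand-in for the kernel's struct zone. */
struct zone_sketch {
	unsigned long lowmem_reserve[NR_ZONES_SKETCH];
	unsigned long pages_high;
	unsigned long present_pages;
};

/*
 * For each zone, take the largest lowmem_reserve entry that applies to it,
 * add pages_high, cap the result at the zone's size, and sum over zones.
 */
unsigned long guess_total_reserve(const struct zone_sketch *zones, int nr_zones)
{
	unsigned long total = 0;
	int i, j;

	assert(nr_zones >= 0 && nr_zones <= NR_ZONES_SKETCH);

	for (i = 0; i < nr_zones; i++) {
		unsigned long max = 0;

		for (j = i; j < nr_zones; j++)
			if (zones[i].lowmem_reserve[j] > max)
				max = zones[i].lowmem_reserve[j];

		max += zones[i].pages_high;
		if (max > zones[i].present_pages)
			max = zones[i].present_pages;
		total += max;
	}
	return total;
}
```

Computing this once when the watermarks change, rather than on every
__vm_enough_memory() call, is one way to keep the cost off the hot path.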

Do you have a detailed comparison of test result with and without this patch ?
I'm interested in it.
I'm sorry if I missed your post with the results.


Cheers!
-Kame

^ permalink raw reply

