* [PATCH] NUMA Scheduler - rev 4
From: Michael Hohnbaum @ 2002-10-16 22:31 UTC (permalink / raw)
To: torvalds, mingo, linux-kernel; +Cc: Erich Focht, Martin J. Bligh
[-- Attachment #1: Type: text/plain, Size: 726 bytes --]
Linus, Ingo,
Attached is a small patch which provides NUMA awareness to the
O(1) scheduler. This patch retains the identical O(1) scheduler
behavior for SMP systems. For multi-node systems it favors
runqueues on the current node when looking for another runqueue
to pull tasks from. It also makes a balance decision at exec().
This patch is against 2.5.43.
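For readers who want the gist before wading into the diff, the selection
policy added to find_busiest_queue() boils down to the following. This is
only an illustrative, self-contained sketch; the simplified names
(pick_busiest, struct rq_sample) are not from the patch, and the real code
walks the kernel's runqueues under the usual locking rules.

/*
 * Sketch of the node-aware "busiest queue" selection: prefer a runqueue
 * on the current node when the usual ~25% imbalance test passes, and only
 * fall back to a remote node when its load is more than four times ours.
 */
struct rq_sample {
    int cpu;            /* logical CPU number */
    int node;           /* node this CPU belongs to */
    int nr_running;     /* runnable tasks on its runqueue */
};

/* Return the index of the runqueue to pull from, or -1 to skip balancing. */
int pick_busiest(const struct rq_sample *rqs, int nr, int this_cpu,
                 int this_node, int this_load, int idle)
{
    int best_on = -1, best_off = -1;
    int max_on = 1, max_off = 3;    /* same starting thresholds as the patch */
    int i;

    for (i = 0; i < nr; i++) {
        if (rqs[i].cpu == this_cpu)
            continue;
        if (rqs[i].node == this_node) {
            if (rqs[i].nr_running > max_on) {
                max_on = rqs[i].nr_running;
                best_on = i;
            }
        } else if (rqs[i].nr_running > max_off) {
            max_off = rqs[i].nr_running;
            best_off = i;
        }
    }

    /* On-node: pull if the imbalance is at least ~25% of the peak load. */
    if (best_on >= 0 &&
        (idle || (max_on - this_load) / 2 >= (max_on + 3) / 4))
        return best_on;

    /* Off-node: pull a single task only for a 4x (400%) imbalance. */
    if (best_off >= 0 && max_off > this_load * 4)
        return best_off;

    return -1;
}

On a single-node (plain SMP) system every CPU shares the same node, so only
the on-node path is ever taken and the policy degenerates to the existing
O(1) behavior.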
On NUMA systems these two changes have shown performance gains
in the 5-10% range, depending on the test. Some micro-benchmarks
provided by Erich Focht which stress the memory subsystem show
a doubling in performance.
Please consider applying this patch.
--
Michael Hohnbaum 503-578-5486
hohnbaum@us.ibm.com T/L 775-5486
[-- Attachment #2: sched43rev4.patch --]
[-- Type: text/plain, Size: 7479 bytes --]
--- clean-2.5.43/kernel/sched.c Wed Oct 16 13:53:11 2002
+++ linux-2.5.43/kernel/sched.c Wed Oct 16 13:35:12 2002
@@ -32,6 +32,9 @@
#include <linux/delay.h>
#include <linux/timer.h>
#include <linux/rcupdate.h>
+#include <asm/topology.h>
+#include <linux/percpu.h>
+
/*
* Convert user-nice values [ -20 ... 0 ... 19 ]
@@ -639,15 +642,35 @@ static inline unsigned int double_lock_b
*/
static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance)
{
- int nr_running, load, max_load, i;
- runqueue_t *busiest, *rq_src;
+ int nr_running, load, max_load_on_node, max_load_off_node, i;
+ runqueue_t *busiest, *busiest_on_node, *busiest_off_node, *rq_src;
/*
- * We search all runqueues to find the most busy one.
+ * This routine is designed to work on NUMA systems. For
+ * non-NUMA systems, everything ends up being on the same
+ * node and most of the NUMA specific logic is optimized
+ * away by the compiler.
+ *
+ * We must look at each runqueue to update prev_nr_running.
+ * As we do so, we find the busiest runqueue on the same
+ * node, and the busiest runqueue off the node. After
+ * determining the busiest of each, we first see if the
+ * busiest runqueue on node meets the imbalance criteria.
+ * If not, then we check the off node runqueue. Note that
+ * we require a higher level of imbalance for choosing an
+ * off node runqueue.
+ *
+ * For off-node, we only move at most one process, so imbalance
+ * is always set to one for off-node runqueues.
+ *
* We do this lockless to reduce cache-bouncing overhead,
* we re-check the 'best' source CPU later on again, with
* the lock held.
*
+ * Note that this routine is called with the current runqueue
+ * locked, and if a runqueue is found (return != NULL), then
+ * that runqueue is returned locked as well.
+ *
* We fend off statistical fluctuations in runqueue lengths by
* saving the runqueue length during the previous load-balancing
* operation and using the smaller of the current and saved lengths.
@@ -669,8 +692,12 @@ static inline runqueue_t *find_busiest_q
else
nr_running = this_rq->prev_nr_running[this_cpu];
+ busiest_on_node = NULL;
+ busiest_off_node = NULL;
busiest = NULL;
- max_load = 1;
+ max_load_on_node = 1;
+ max_load_off_node = 3;
+
for (i = 0; i < NR_CPUS; i++) {
if (!cpu_online(i))
continue;
@@ -682,33 +709,44 @@ static inline runqueue_t *find_busiest_q
load = this_rq->prev_nr_running[i];
this_rq->prev_nr_running[i] = rq_src->nr_running;
- if ((load > max_load) && (rq_src != this_rq)) {
- busiest = rq_src;
- max_load = load;
+ if (__cpu_to_node(i) == __cpu_to_node(this_cpu)) {
+ if ((load > max_load_on_node) && (rq_src != this_rq)) {
+ busiest_on_node = rq_src;
+ max_load_on_node = load;
+ }
+ } else {
+ if (load > max_load_off_node) {
+ busiest_off_node = rq_src;
+ max_load_off_node = load;
+ }
}
}
-
- if (likely(!busiest))
- goto out;
-
- *imbalance = (max_load - nr_running) / 2;
-
- /* It needs an at least ~25% imbalance to trigger balancing. */
- if (!idle && (*imbalance < (max_load + 3)/4)) {
- busiest = NULL;
- goto out;
+ if (busiest_on_node) {
+ /* on node balancing happens if there is > 125% difference */
+ *imbalance = (max_load_on_node - nr_running) / 2;
+ if (idle || (*imbalance >= (max_load_on_node + 3)/4)) {
+ busiest = busiest_on_node;
+ }
}
-
- nr_running = double_lock_balance(this_rq, busiest, this_cpu, idle, nr_running);
- /*
- * Make sure nothing changed since we checked the
- * runqueue length.
- */
- if (busiest->nr_running <= nr_running + 1) {
- spin_unlock(&busiest->lock);
- busiest = NULL;
+ if (busiest_off_node && !busiest) {
+ /* off node balancing requires 400% difference */
+ /*if ((nr_running*4 >= max_load_off_node) && !idle) */
+ if (nr_running*4 >= max_load_off_node)
+ return NULL;
+ busiest = busiest_off_node;
+ *imbalance = 1;
+ }
+ if (busiest) {
+ nr_running = double_lock_balance(this_rq, busiest, this_cpu, idle, nr_running);
+ /*
+ * Make sure nothing changed since we checked the
+ * runqueue length.
+ */
+ if (busiest->nr_running <= nr_running + 1) {
+ spin_unlock(&busiest->lock);
+ busiest = NULL;
+ }
}
-out:
return busiest;
}
@@ -2098,6 +2136,81 @@ __init int migration_init(void)
#endif
+#if CONFIG_NUMA
+/*
+ * If dest_cpu is allowed for this process, migrate the task to it.
+ * This is accomplished by forcing the cpus_allowed mask to only
+ * allow dest_cpu, which will force the task onto dest_cpu. Then
+ * the cpus_allowed mask is restored.
+ */
+static void sched_migrate_task(task_t *p, int dest_cpu)
+{
+ unsigned long old_mask;
+
+ old_mask = p->cpus_allowed;
+ if (!(old_mask & (1UL << dest_cpu)))
+ return;
+ /* force the process onto the specified CPU */
+ set_cpus_allowed(p, 1UL << dest_cpu);
+
+ /* restore the cpus allowed mask */
+ set_cpus_allowed(p, old_mask);
+}
+
+/*
+ * Keep track of the last CPU that we exec'd from. Use of this can
+ * be "fuzzy", as multiple processes may read and set it at more or
+ * less the same time. Those situations will balance out quite
+ * rapidly on a heavily loaded system (where they are more likely
+ * to occur).
+ */
+static DEFINE_PER_CPU(int, last_exec_cpu) = 0;
+/*
+ * Find the least loaded CPU. Slightly favor the current CPU by
+ * setting its runqueue length as the minimum to start. If
+ * current is lightly loaded, just stick with it.
+ */
+static int sched_best_cpu(struct task_struct *p)
+{
+ int i, minload, best_cpu, cur_cpu, node;
+ best_cpu = task_cpu(p);
+ if (cpu_rq(best_cpu)->nr_running <= 2)
+ return best_cpu;
+
+ node = __cpu_to_node(__get_cpu_var(last_exec_cpu));
+ if (++node >= numnodes)
+ node = 0;
+
+ cur_cpu = __node_to_first_cpu(node);
+ minload = cpu_rq(best_cpu)->nr_running;
+
+ for (i = 0; i < NR_CPUS; i++) {
+ if (cpu_online(cur_cpu) &&
+ minload > cpu_rq(cur_cpu)->nr_running) {
+ minload = cpu_rq(cur_cpu)->nr_running;
+ best_cpu = cur_cpu;
+ }
+ /* always advance cur_cpu so an offline CPU does not stall the scan */
+ if (++cur_cpu >= NR_CPUS)
+ cur_cpu = 0;
+ }
+ __get_cpu_var(last_exec_cpu) = best_cpu;
+ return best_cpu;
+}
+
+void sched_balance_exec(void)
+{
+ int new_cpu;
+
+ if (numnodes > 1) {
+ new_cpu = sched_best_cpu(current);
+ if (new_cpu != smp_processor_id())
+ sched_migrate_task(current, new_cpu);
+ }
+}
+#endif /* CONFIG_NUMA */
+
#if CONFIG_SMP || CONFIG_PREEMPT
/*
* The 'big kernel lock'
--- clean-2.5.43/fs/exec.c Wed Oct 16 13:52:33 2002
+++ linux-2.5.43/fs/exec.c Wed Oct 16 13:34:36 2002
@@ -1001,6 +1001,8 @@ int do_execve(char * filename, char ** a
int retval;
int i;
+ sched_balance_exec();
+
file = open_exec(filename);
retval = PTR_ERR(file);
--- clean-2.5.43/include/linux/sched.h Wed Oct 16 13:53:06 2002
+++ linux-2.5.43/include/linux/sched.h Wed Oct 16 13:34:37 2002
@@ -166,6 +166,11 @@ extern void update_one_process(struct ta
unsigned long system, int cpu);
extern void scheduler_tick(int user_tick, int system);
extern unsigned long cache_decay_ticks;
+#ifdef CONFIG_NUMA
+extern void sched_balance_exec(void);
+#else
+#define sched_balance_exec() {}
+#endif
#define MAX_SCHEDULE_TIMEOUT LONG_MAX
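
To put concrete numbers on the two thresholds in the find_busiest_queue()
changes above, here is a small worked example; the load figures are
hypothetical and not taken from any benchmark run.

/* Worked example of the two imbalance tests (hypothetical loads). */
#include <stdio.h>

int main(void)
{
    int this_load = 2;      /* runnable tasks on the current runqueue */
    int on_node_load = 4;   /* busiest runqueue on the same node */
    int off_node_load = 8;  /* busiest runqueue on another node */

    int imbalance = (on_node_load - this_load) / 2;            /* = 1 */
    int steal_on_node = imbalance >= (on_node_load + 3) / 4;   /* 1 >= 1 -> yes */
    int steal_off_node = !(this_load * 4 >= off_node_load);    /* 8 >= 8 -> no  */

    printf("steal on-node: %d, steal off-node: %d\n",
           steal_on_node, steal_off_node);
    return 0;
}

With these numbers the on-node test just passes while the off-node test does
not; the remote runqueue would need at least 9 runnable tasks before a single
task (imbalance is pinned to 1) is pulled across nodes.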
* Re: [PATCH] NUMA Scheduler - rev 4
From: Andrew Theurer @ 2002-10-17 1:21 UTC (permalink / raw)
To: Michael Hohnbaum, torvalds, mingo, linux-kernel
Cc: Erich Focht, Martin J. Bligh
> Linus, Ingo,
>
> Attached is a small patch which provides NUMA awareness to the
> O(1) scheduler. This patch retains the identical O(1) scheduler
> behavior for SMP systems. For multi-node systems it favors
> runqueues on the current node when looking for another runqueue
> to pull tasks from. It also makes a balance decision at exec().
> This patch is against 2.5.43.
>
> On NUMA systems these two changes have shown performance gains
> in the 5 - 10% range depending on tests. Some micro-benchmarks
> provided by Erich Focht which stress the memory subsystem show
> a doubling in performance.
>
> Please consider applying this patch.
FYI, to make sure there was no degradation for non-NUMA systems, I just benched
a kernel compile on an IBM 8500R, with and without your patch. Compile times
were 38.880 (vanilla) and 38.818 (NUMA sched).
Andrew Theurer