public inbox for linux-kernel@vger.kernel.org
* [RFC] scheduler issue & patch
@ 2006-06-12 15:30 Gerd Hoffmann
  2006-06-12 17:28 ` Siddha, Suresh B
  0 siblings, 1 reply; 3+ messages in thread
From: Gerd Hoffmann @ 2006-06-12 15:30 UTC (permalink / raw)
  To: linux kernel mailing list

[-- Attachment #1: Type: text/plain, Size: 1837 bytes --]

  Hi,

I'm looking into a scheduler issue with a NUMA box and scheduling
domains.  The machine is a dual-core opteron with two nodes, i.e.
four cpus.  cpu0+1 form node0, cpu2+3 form node1.

Now I have an application (benchmark) with two threads which performs
best when the two threads are running on different nodes (probably
because the cpus on each node share the L2 cache).  The scheduler tends
to keep threads on the local node though, which probably makes sense in
most cases because local memory is faster.

Ok, we have tools to give hints to the scheduler (taskset, numactl).
The problem is that it doesn't work well.  I can ask the scheduler to use
cpu1 (node0) and cpu3 (node1) only (via "taskset 0x0a").  But the
scheduler very often schedules both threads on the same cpu :-(

I think the reason is that the scheduler always checks the complete cpu
groups when calculating the group load, without looking at
task->cpus_allowed.  So we have the effect that the scheduler walks down
the scheduler domain tree, looks at the group for node0, looks at both
cpu0 and cpu1, finds node0 not overloaded because cpu0 is idle, and
decides to keep the thread on the local node.  Next it walks down
the tree and finds it isn't allowed to use the idle cpu0.  So both
threads get scheduled to cpu1.  Oops.

The patch attached takes the sledgehammer approach to fix it:  In case
we have a non-default cpumask in task->cpus_allowed the scheduler
ignores all the fancy scheduling domains and simply spreads the load
equally over the cpus allowed by task->cpus_allowed.  Not exactly
elegant, but it works.  Not every time, but very often.

Comments?  Ideas how to solve this better?  I've also tried to play with
the group load calculation, but it didn't work well.  I'm kinda lost in
all those scheduler tuning knobs ...

cheers,

  Gerd


[-- Attachment #2: sched.diff --]
[-- Type: text/x-patch, Size: 1024 bytes --]

--- kernel/sched.c.bug176738	2006-06-08 15:43:09.000000000 +0200
+++ kernel/sched.c	2006-06-12 14:56:51.000000000 +0200
@@ -1020,6 +1020,28 @@
 	return idlest;
 }
 
+static int
+find_idlest_cpu_nodomain(struct task_struct *p, int this_cpu)
+{
+	cpumask_t tmp;
+	unsigned long load, min_load = ULONG_MAX;
+	int idlest = -1;
+	int i;
+
+	/* Traverse only the allowed CPUs */
+	cpus_and(tmp, cpu_online_map, p->cpus_allowed);
+
+	for_each_cpu_mask(i, tmp) {
+		load = source_load(i, 0);
+
+		if (load < min_load || (load == min_load && i == this_cpu)) {
+			min_load = load;
+			idlest = i;
+		}
+	}
+	return idlest;
+}
+
 /*
  * sched_balance_self: balance the current task (running on cpu) in domains
  * that have the 'flag' flag set. In practice, this is SD_BALANCE_FORK and
@@ -1036,6 +1058,9 @@
 	struct task_struct *t = current;
 	struct sched_domain *tmp, *sd = NULL;
 
+	if (!cpus_full(t->cpus_allowed))
+		return find_idlest_cpu_nodomain(t, cpu);
+
 	for_each_domain(cpu, tmp)
 		if (tmp->flags & flag)
 			sd = tmp;


* Re: [RFC] scheduler issue & patch
  2006-06-12 15:30 [RFC] scheduler issue & patch Gerd Hoffmann
@ 2006-06-12 17:28 ` Siddha, Suresh B
  2006-06-12 17:52   ` Gerd Hoffmann
  0 siblings, 1 reply; 3+ messages in thread
From: Siddha, Suresh B @ 2006-06-12 17:28 UTC (permalink / raw)
  To: Gerd Hoffmann; +Cc: linux kernel mailing list

On Mon, Jun 12, 2006 at 05:30:42PM +0200, Gerd Hoffmann wrote:
>   Hi,
> 
> I'm looking into a scheduler issue with a NUMA box and scheduling
> domains.  The machine is a dual-core opteron with two nodes, i.e.
> four cpus.  cpu0+1 form node0, cpu2+3 form node1.
> 
> Now I have an application (benchmark) with two threads which performs
> best when the two threads are running on different nodes (probably
> because the cpus on each node share the L2 cache).  The scheduler tends
> to keep threads on the local node though, which probably makes sense in
> most cases because local memory is faster.
> 
> Ok, we have tools to give hints to the scheduler (taskset, numactl).
> The problem is that it doesn't work well.  I can ask the scheduler to use
> cpu1 (node0) and cpu3 (node1) only (via "taskset 0x0a").  But the
> scheduler very often schedules both threads on the same cpu :-(
> 
> I think the reason is that the scheduler always checks the complete cpu
> groups when calculating the group load, without looking at
> task->cpus_allowed.  So we have the effect that the scheduler walks down
> the scheduler domain tree, looks at the group for node0, looks at both
> cpu0 and cpu1, finds node0 not overloaded because cpu0 is idle, and
> decides to keep the thread on the local node.  Next it walks down
> the tree and finds it isn't allowed to use the idle cpu0.  So both
> threads get scheduled to cpu1.  Oops.

I don't think the problem is with sched_balance_self().  It is probably
doing the right thing based on the load present at the time of
fork/exec.  Once node-1 becomes idle, we expect the two threads on
node-0 cpu-1 to get distributed between the two nodes.

Perhaps the real issue is how cpu_power is calculated for the node
domain on these systems.  Because of the shared resources between the
cpus in a node, cpu_power for a group in the node domain should be
< 2 * SCHED_LOAD_SCALE.

Once this is the case, find_busiest_group() should detect the imbalance
and move one of the threads from cpu-1 (node-0) to cpu-3 (node-1).

> The patch attached takes the sledgehammer approach to fix it:  In case
> we have a non-default cpumask in task->cpus_allowed the scheduler
> ignores all the fancy scheduling domains and simply spreads the load
> equally over the cpus allowed by task->cpus_allowed.  Not exactly
> elegant, but it works.  Not every time, but very often.
> 
> Comments?  Ideas how to solve this better?  I've also tried to play with
> the group load calculation, but it didn't work well.  I'm kinda lost in
> all those scheduler tuning knobs ...

In my opinion, this patch is not the correct fix for the issue.

thanks,
suresh


* Re: [RFC] scheduler issue & patch
  2006-06-12 17:28 ` Siddha, Suresh B
@ 2006-06-12 17:52   ` Gerd Hoffmann
  0 siblings, 0 replies; 3+ messages in thread
From: Gerd Hoffmann @ 2006-06-12 17:52 UTC (permalink / raw)
  To: Siddha, Suresh B; +Cc: linux kernel mailing list

Siddha, Suresh B wrote:
> I don't think the problem is with sched_balance_self().  It is probably
> doing the right thing based on the load present at the time of
> fork/exec.  Once node-1 becomes idle, we expect the two threads on
> node-0 cpu-1 to get distributed between the two nodes.

That happens indeed.  The problem is that the thread which gets
migrated from cpu1 (node0) to cpu3 (node1) ends up with its working
set allocated from node0 memory, because it ran on node0 for a short
time.  That is a noticeable performance hit on a NUMA system.

I think the scheduler should try harder to spread the threads across
cpus in a way that they can stay on the initial cpu instead of migrating
them later on.

> In my opinion, this patch is not the correct fix for the issue.

Sure, it's a sort-of band-aid fix, that's why I'm trying to find
something better.

cheers,

  Gerd

-- 
Gerd Hoffmann <kraxel@suse.de>
http://www.suse.de/~kraxel/julika-dora.jpeg

