Hi,

I'm looking into a scheduler issue on a NUMA box with scheduling
domains.  The machine is a dual-core Opteron with two nodes, i.e.
four cpus.  cpu0+1 form node0, cpu2+3 form node1.

Now I have an application (benchmark) with two threads which performs
best when the two threads are running on different nodes (probably
because the cpus on each node share the L2 cache).  The scheduler tends
to keep threads on the local node though, which probably makes sense in
most cases because local memory is faster.

Ok, we have tools to give hints to the scheduler (taskset, numactl).
The problem is it doesn't work well.  I can ask the scheduler to use
cpu1 (node0) and cpu3 (node1) only (via "taskset 0x0a").  But the
scheduler very often schedules both threads on the same cpu :-(
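
For reference, the mask arithmetic behind "taskset 0x0a" can be sketched
as below.  This is only an illustration in Python; the helper name
cpus_from_mask is made up for this example and is not part of any tool.

```python
def cpus_from_mask(mask):
    """Return the sorted list of cpu numbers whose bit is set in mask."""
    cpus = []
    cpu = 0
    while mask:
        if mask & 1:          # lowest bit set: this cpu is allowed
            cpus.append(cpu)
        mask >>= 1            # move on to the next cpu's bit
        cpu += 1
    return cpus


# 0x0a is binary 1010, i.e. bits 1 and 3 are set
print(cpus_from_mask(0x0a))
```

On Linux the same restriction can be applied from Python with
os.sched_setaffinity(0, {1, 3}) for the calling process.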

I think the reason is that the scheduler always checks the complete cpu
groups when calculating the group load, without looking at
task->cpus_allowed.  So the effect is that the scheduler walks down
the scheduler domain tree, looks at the group for node0, looks at both
cpu0 and cpu1, finds that node0 is not overloaded because cpu0 is idle
and decides to keep the thread on the local node.  Next it walks down
the tree and finds it isn't allowed to use the idle cpu0.  So both
threads get scheduled to cpu1.  Oops.
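
The walk described above can be modelled roughly like this.  It is a toy
model in Python, not kernel code: the node layout, the "fewer tasks than
cpus means not overloaded" rule and all names (pick_cpu, counted, usable)
are assumptions made for the illustration.

```python
# Toy model of the two-step placement: pick a node group, then a cpu in it.
NODES = {"node0": [0, 1], "node1": [2, 3]}


def pick_cpu(load, allowed, local_node, respect_allowed):
    """Pick a cpu for a waking thread; load maps cpu -> running tasks."""

    def counted(node):
        # cpus that contribute to the group load of this node
        if respect_allowed:
            return [c for c in NODES[node] if c in allowed]
        return list(NODES[node])

    def usable(node):
        # cpus of this node the task may actually run on
        return [c for c in NODES[node] if c in allowed]

    def avg_load(node):
        cpus = counted(node)
        return sum(load[c] for c in cpus) / len(cpus)

    local = counted(local_node)
    # Step 1: stay local while the local group looks "not overloaded"
    if usable(local_node) and sum(load[c] for c in local) < len(local):
        node = local_node
    else:
        node = min((n for n in NODES if usable(n)), key=avg_load)
    # Step 2: only now is the allowed mask applied to pick the cpu
    return min(usable(node), key=lambda c: load[c])


load = {0: 0, 1: 1, 2: 0, 3: 0}   # first thread already runs on cpu1
allowed = {1, 3}                  # taskset 0x0a
print(pick_cpu(load, allowed, "node0", respect_allowed=False))
print(pick_cpu(load, allowed, "node0", respect_allowed=True))
```

In this model the first call ends up on cpu1 because the idle cpu0 makes
node0 look fine, while the second call, which counts only allowed cpus
in the group load, moves the thread to cpu3.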

The attached patch takes the sledgehammer approach to fix it:  If
we have a non-default cpumask in task->cpus_allowed the scheduler
ignores all the fancy scheduling domains and simply spreads the load
equally over the cpus allowed by task->cpus_allowed.  Not exactly
elegant, but it works.  Not every time, but very often.
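
The policy itself (not the patch, which is C inside the kernel) can be
modelled in a few lines of Python.  The function names and the
"lowest cpu number wins a tie" rule are assumptions for this sketch.

```python
def spread_over_allowed(load, allowed):
    """Return the least loaded allowed cpu; ties go to the lowest number."""
    return min(sorted(allowed), key=lambda c: load[c])


def place_threads(nthreads, allowed, ncpus=4):
    """Place threads one by one, ignoring any node/domain structure."""
    load = {c: 0 for c in range(ncpus)}
    placement = []
    for _ in range(nthreads):
        cpu = spread_over_allowed(load, allowed)
        load[cpu] += 1            # account for the thread just placed
        placement.append(cpu)
    return placement


# two threads, restricted to cpu1 and cpu3 as with "taskset 0x0a"
print(place_threads(2, {1, 3}))
```

With two threads and the mask from above, the model puts one thread on
cpu1 and the other on cpu3.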

Comments?  Ideas how to solve this better?  I've also tried to play with
the group load calculation, but it didn't work well.  I'm kinda lost in
all those scheduler tuning knobs ...

cheers,

  Gerd