* SD_SHARE_CPUPOWER breaks scheduler fairness
@ 2005-05-31 17:46 Steve Rotolo
2005-06-01 2:49 ` Con Kolivas
0 siblings, 1 reply; 15+ messages in thread
From: Steve Rotolo @ 2005-05-31 17:46 UTC (permalink / raw)
To: linux-kernel; +Cc: bugsy
The SD_SHARE_CPUPOWER flag in SMT scheduling domains (hyperthread
systems) can starve out sched_other tasks and even hang the system. A
long-running (or run-away) sched_fifo task causes sched_other tasks to
get stuck on the sibling cpu's runqueue without any chance to run. The
sibling cpu simply stays idle with tasks on its runqueue for as long as
the sched_fifo task runs on the other sibling cpu. The culprit is
dependent_sleeper() in sched.c.
I guess the SD_SHARE_CPUPOWER is supposed to cause the scheduler to
prohibit non-real-time tasks from running on a cpu while a real-time
task is running on the sibling cpu. The problem is that sched_other
tasks are not migrated to a different runqueue and essentially get stuck
on a dead runqueue until either the sched_fifo task yields or the
load-balancer moves it. Unfortunately, the load-balancer will never
migrate the task if the runqueue length is not sufficiently out of
balance. Even more unfortunate, the load-balancer will actually move
tasks *to* the dead runqueue if it is less busy. And still worse, since
SD_WAKE_IDLE is also set in the scheduling domain, the dead cpu will
actually attract waking tasks to it because it is idle! The cpu becomes
a sort-of black-hole sucking in innocent tasks so they can no longer
run.
The worst-case scenario is when there are N spinning sched_fifo tasks on
an N-way hyperthreaded system. This hangs the system since nothing can
run on the virtual cpus. If you turn off the SD_SHARE_CPUPOWER flag,
the system stays fully functional until you have N*2 spinners hogging
all the virtual cpus.
I get the same behavior from 2.6.9 to 2.6.12-rc5. So is this a bug or a
feature?
--
Steve Rotolo
Concurrent Computer Corporation
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: SD_SHARE_CPUPOWER breaks scheduler fairness
2005-05-31 17:46 SD_SHARE_CPUPOWER breaks scheduler fairness Steve Rotolo
@ 2005-06-01 2:49 ` Con Kolivas
2005-06-01 14:29 ` Steve Rotolo
0 siblings, 1 reply; 15+ messages in thread
From: Con Kolivas @ 2005-06-01 2:49 UTC (permalink / raw)
To: Steve Rotolo; +Cc: linux-kernel, bugsy
On Wed, 1 Jun 2005 03:46 am, Steve Rotolo wrote:
> The SD_SHARE_CPUPOWER flag in SMT scheduling domains (hyperthread
> systems) can starve out sched_other tasks and even hang the system. A
> long-running (or run-away) sched_fifo task causes sched_other tasks to
> get stuck on the sibling cpu's runqueue without any chance to run. The
> sibling cpu simply stays idle with tasks on its runqueue for as long as
> the sched_fifo task runs on the other sibling cpu. The culprit is
> dependent_sleeper() in sched.c.
>
> I guess the SD_SHARE_CPUPOWER is supposed to cause the scheduler to
> prohibit non-real-time tasks from running on a cpu while a real-time
> task is running on the sibling cpu. The problem is that sched_other
> tasks are not migrated to a different runqueue and essentially get stuck
> on a dead runqueue until either the sched_fifo task yields or the
> load-balancer moves it. Unfortunately, the load-balancer will never
> migrate the task if the runqueue length is not sufficiently out of
> balance. Even more unfortunate, the load-balancer will actually move
> tasks *to* the dead runqueue if it is less busy. And still worse, since
> SD_WAKE_IDLE is also set in the scheduling domain, the dead cpu will
> actually attract waking tasks to it because it is idle! The cpu becomes
> a sort-of black-hole sucking in innocent tasks so they can no longer
> run.
>
> The worst-case scenario is when there are N spinning sched_fifo tasks on
> an N-way hyperthreaded system. This hangs the system since nothing can
> run on the virtual cpus. If you turn off the SD_SHARE_CPUPOWER flag,
> the system stays fully functional until you have N*2 spinners hogging
> all the virtual cpus.
>
> I get the same behavior from 2.6.9 to 2.6.12-rc5. So is this a bug or a
> feature?
Sort of yes and yes. The idea that the sibling gets put to sleep if a real
time task is running is a workaround for the fact that you do share cpu power
(as you've correctly understood) and a real time task will slow down if a
SCHED_NORMAL task is running on its sibling which it should not. The
limitation is that, yes, for all intents you only have N hyperthreaded cpus
for spinning N rt tasks before nothing else runs, but you can actually run
N*2 rt tasks in this setting which you would not be able to if hyperthreading
was disabled.
For some time I've been thinking about changing the balance between the
siblings slightly to allow SCHED_NORMAL tasks to run a small proportion of
time when rt tasks are running on the sibling. The tricky part is that
SCHED_FIFO tasks have no timeslice so we can't proportion cpu out according
to the difference in size of the timeslices, which is currently how we
proportion out cpu across siblings with SCHED_NORMAL, and this maintains cpu
distribution very similarly to how 'nice' does on the same cpu.
Cheers,
Con
* Re: SD_SHARE_CPUPOWER breaks scheduler fairness
2005-06-01 2:49 ` Con Kolivas
@ 2005-06-01 14:29 ` Steve Rotolo
2005-06-01 14:47 ` Con Kolivas
0 siblings, 1 reply; 15+ messages in thread
From: Steve Rotolo @ 2005-06-01 14:29 UTC (permalink / raw)
To: Con Kolivas; +Cc: linux-kernel, bugsy
On Tue, 2005-05-31 at 22:49, Con Kolivas wrote:
> Sort of yes and yes. The idea that the sibling gets put to sleep if a real
> time task is running is a workaround for the fact that you do share cpu power
> (as you've correctly understood) and a real time task will slow down if a
> SCHED_NORMAL task is running on its sibling which it should not. The
> limitation is that, yes, for all intents you only have N hyperthreaded cpus
> for spinning N rt tasks before nothing else runs, but you can actually run
> N*2 rt tasks in this setting which you would not be able to if hyperthreading
> was disabled.
>
> For some time I've been thinking about changing the balance between the
> siblings slightly to allow SCHED_NORMAL tasks to run a small proportion of
> time when rt tasks are running on the sibling. The tricky part is that
> SCHED_FIFO tasks have no timeslice so we can't proportion cpu out according
> to the difference in size of the timeslices, which is currently how we
> proportion out cpu across siblings with SCHED_NORMAL, and this maintains cpu
> distribution very similarly to how 'nice' does on the same cpu.
Thanks for responding, Con. But I want to make sure that an important
point doesn't escape your attention. It appears that tasks get trapped
on the stalled sibling, even when they could run on some other cpu. The
load-balancer does not understand that the sibling is temporarily out of
service so it actually balances tasks to it. And since it's idle, it
may attract tasks to it more than other cpus (thanks to SD_WAKE_IDLE).
I think this is a serious bug.
--
Steve
* Re: SD_SHARE_CPUPOWER breaks scheduler fairness
2005-06-01 14:29 ` Steve Rotolo
@ 2005-06-01 14:47 ` Con Kolivas
2005-06-01 18:41 ` Steve Rotolo
0 siblings, 1 reply; 15+ messages in thread
From: Con Kolivas @ 2005-06-01 14:47 UTC (permalink / raw)
To: steve.rotolo; +Cc: linux-kernel, bugsy
On Thu, 2 Jun 2005 00:29, Steve Rotolo wrote:
> On Tue, 2005-05-31 at 22:49, Con Kolivas wrote:
> > Sort of yes and yes. The idea that the sibling gets put to sleep if a
> > real time task is running is a workaround for the fact that you do share
> > cpu power (as you've correctly understood) and a real time task will slow
> > down if a SCHED_NORMAL task is running on its sibling which it should
> > not. The limitation is that, yes, for all intents you only have N
> > hyperthreaded cpus for spinning N rt tasks before nothing else runs, but
> > you can actually run N*2 rt tasks in this setting which you would not be
> > able to if hyperthreading was disabled.
> >
> > For some time I've been thinking about changing the balance between the
> > siblings slightly to allow SCHED_NORMAL tasks to run a small proportion
> > of time when rt tasks are running on the sibling. The tricky part is that
> > SCHED_FIFO tasks have no timeslice so we can't proportion cpu out
> > according to the difference in size of the timeslices, which is currently
> > how we proportion out cpu across siblings with SCHED_NORMAL, and this
> > maintains cpu distribution very similarly to how 'nice' does on the same
> > cpu.
>
> Thanks for responding, Con. But I want to make sure that an important
> point doesn't escape your attention. It appears that tasks get trapped
> on the stalled sibling, even when they could run on some other cpu. The
> load-balancer does not understand that the sibling is temporarily out of
> service so it actually balances tasks to it. And since it's idle, it
> may attract tasks to it more than other cpus (thanks to SD_WAKE_IDLE).
> I think this is a serious bug.
I didn't miss the point, but I guess I should have made that clear too.
The number of tasks seen running on that sibling is still the same even if the
queue is forced to be idle (witnessed by top reporting a load of 1 on that
sibling even if it also shows quite a lot of idle time). It should therefore
not attract any more tasks to itself.
The task that is there will be trapped based on the fact that there is only
one task _only_ if the other sibling is indefinitely running real time tasks,
and _if_ there are other physical cpus we can use we should try to schedule
the trapped task away. If we have N physical cpus (and N*2 logical), and we
are running N real time threads I don't think we should expect to run
SCHED_NORMAL tasks as well. If we have <N real time tasks (where N > 1) then
we should still be able to run SCHED_NORMAL tasks, I agree. I'm a little
reluctant to tackle this at this stage with the number of SMP balancing
things already queued for -mm, but making a sibling appear more heavily laden
when "pegged" (nr_running + 1) should suffice.
Cheers,
Con
* Re: SD_SHARE_CPUPOWER breaks scheduler fairness
2005-06-01 14:47 ` Con Kolivas
@ 2005-06-01 18:41 ` Steve Rotolo
2005-06-01 21:37 ` Con Kolivas
0 siblings, 1 reply; 15+ messages in thread
From: Steve Rotolo @ 2005-06-01 18:41 UTC (permalink / raw)
To: Con Kolivas; +Cc: linux-kernel, bugsy
On Wed, 2005-06-01 at 10:47, Con Kolivas wrote:
> I didn't miss the point, but I guess I should have made that clear too.
>
> The number of tasks seen running on that sibling is still the same even if the
> queue is forced to be idle (witnessed by top reporting a load of 1 on that
> sibling even if it also shows quite a lot of idle time). It should therefore
> not attract any more tasks to itself.
> The task that is there will be trapped based on the fact that there is only
> one task _only_ if the other sibling is indefinitely running real time tasks,
> and _if_ there are other physical cpus we can use we should try to schedule
> the trapped task away. If we have N physical cpus (and N*2 logical), and we
> are running N real time threads I don't think we should expect to run
> SCHED_NORMAL tasks as well. If we have <N real time tasks (where N > 1) then
> we should still be able to run SCHED_NORMAL tasks, I agree. I'm a little
> reluctant to tackle this at this stage with the number of SMP balancing
> things already queued for -mm, but making a sibling appear more heavily laden
> when "pegged" (nr_running + 1) should suffice.
>
Consider what happens if:
- you have 2 physical cpus, 4 logical cpus
- you have 40 running SCHED_NORMAL tasks on a well balanced system --
roughly 10 on each runqueue
- start up a spinning SCHED_FIFO task on cpu 0
Assuming that cpu 1 is the sibling of 0, cpu 1 now has 10 SCHED_NORMAL
tasks that are totally screwed -- they will never, ever, run anywhere,
period.
Now consider what happens if I start up 40 more SCHED_NORMAL tasks. The
load-balancer will kindly place 10 of them on cpu 1's runqueue so they
too can be screwed for all eternity. Nice.
One more thing: I *think* wake_idle() tends to wake tasks to idle cpus
regardless of the idle cpu's runqueue length. This is why I say the
idle cpu becomes a magnet for even more tasks, until the balancer
straightens things out again.
I guess the bottom-line is: given N logical cpus, 1/N of all
SCHED_NORMAL tasks may get stuck on a sibling cpu with no chance to
run. All it takes is one spinning SCHED_FIFO task. Sounds like a bug.
--
Steve
* Re: SD_SHARE_CPUPOWER breaks scheduler fairness
2005-06-01 18:41 ` Steve Rotolo
@ 2005-06-01 21:37 ` Con Kolivas
2005-06-01 21:54 ` Con Kolivas
` (2 more replies)
0 siblings, 3 replies; 15+ messages in thread
From: Con Kolivas @ 2005-06-01 21:37 UTC (permalink / raw)
To: steve.rotolo; +Cc: linux-kernel, bugsy
On Thu, 2 Jun 2005 04:41, Steve Rotolo wrote:
> I guess the bottom-line is: given N logical cpus, 1/N of all
> SCHED_NORMAL tasks may get stuck on a sibling cpu with no chance to
> run. All it takes is one spinning SCHED_FIFO task. Sounds like a bug.
You're right, and excuse me for missing it. We have to let SCHED_NORMAL tasks
run for some period with rt tasks. There shouldn't be any combination of
mutually exclusive tasks for siblings.
I'll work on something.
Cheers,
Con
* Re: SD_SHARE_CPUPOWER breaks scheduler fairness
2005-06-01 21:37 ` Con Kolivas
@ 2005-06-01 21:54 ` Con Kolivas
2005-06-01 22:01 ` Steve Rotolo
2005-06-01 23:16 ` Joe Korty
2 siblings, 0 replies; 15+ messages in thread
From: Con Kolivas @ 2005-06-01 21:54 UTC (permalink / raw)
To: steve.rotolo; +Cc: linux-kernel, bugsy
On Thu, 2 Jun 2005 07:37, Con Kolivas wrote:
> On Thu, 2 Jun 2005 04:41, Steve Rotolo wrote:
> > I guess the bottom-line is: given N logical cpus, 1/N of all
> > SCHED_NORMAL tasks may get stuck on a sibling cpu with no chance to
> > run. All it takes is one spinning SCHED_FIFO task. Sounds like a bug.
>
> You're right, and excuse me for missing it.
Oh and thanks for picking it up!
Cheers,
Con
* Re: SD_SHARE_CPUPOWER breaks scheduler fairness
2005-06-01 21:37 ` Con Kolivas
2005-06-01 21:54 ` Con Kolivas
@ 2005-06-01 22:01 ` Steve Rotolo
2005-06-02 3:01 ` Con Kolivas
2005-06-01 23:16 ` Joe Korty
2 siblings, 1 reply; 15+ messages in thread
From: Steve Rotolo @ 2005-06-01 22:01 UTC (permalink / raw)
To: Con Kolivas; +Cc: linux-kernel, bugsy
On Wed, 2005-06-01 at 17:37, Con Kolivas wrote:
> I'll work on something.
Great! I'd be happy to test a patch for you. Thanks!!!
--
Steve
* Re: SD_SHARE_CPUPOWER breaks scheduler fairness
2005-06-01 21:37 ` Con Kolivas
2005-06-01 21:54 ` Con Kolivas
2005-06-01 22:01 ` Steve Rotolo
@ 2005-06-01 23:16 ` Joe Korty
2005-06-01 23:25 ` Con Kolivas
2 siblings, 1 reply; 15+ messages in thread
From: Joe Korty @ 2005-06-01 23:16 UTC (permalink / raw)
To: Con Kolivas; +Cc: steve.rotolo, linux-kernel, bugsy
> On Thu, 2 Jun 2005 04:41, Steve Rotolo wrote:
> > I guess the bottom-line is: given N logical cpus, 1/N of all
> > SCHED_NORMAL tasks may get stuck on a sibling cpu with no chance to
> > run. All it takes is one spinning SCHED_FIFO task. Sounds like a bug.
>
> You're right, and excuse me for missing it. We have to let SCHED_NORMAL tasks
> run for some period with rt tasks. There shouldn't be any combination of
> mutually exclusive tasks for siblings.
>
> I'll work on something.
Wild thought: how about doing this for the sibling ...
rq->nr_running += SOME_BIG_NUMBER
when a SCHED_FIFO task starts running on some cpu, and
undo the above when the cpu is released. This fools
the load balancer into _gradually_ moving tasks off the
sibling, when the cpu is hogged by some SCHED_FIFO task,
but should have little effect if a SCHED_FIFO task takes
little cpu time.
Regards,
Joe
--
"Money can buy bandwidth, but latency is forever" -- John Mashey
* Re: SD_SHARE_CPUPOWER breaks scheduler fairness
2005-06-01 23:16 ` Joe Korty
@ 2005-06-01 23:25 ` Con Kolivas
2005-06-02 13:30 ` Steve Rotolo
0 siblings, 1 reply; 15+ messages in thread
From: Con Kolivas @ 2005-06-01 23:25 UTC (permalink / raw)
To: joe.korty; +Cc: steve.rotolo, linux-kernel, bugsy
On Thu, 2 Jun 2005 09:16 am, Joe Korty wrote:
> > On Thu, 2 Jun 2005 04:41, Steve Rotolo wrote:
> > > I guess the bottom-line is: given N logical cpus, 1/N of all
> > > SCHED_NORMAL tasks may get stuck on a sibling cpu with no chance to
> > > run. All it takes is one spinning SCHED_FIFO task. Sounds like a bug.
> >
> > You're right, and excuse me for missing it. We have to let SCHED_NORMAL
> > tasks run for some period with rt tasks. There shouldn't be any
> > combination of mutually exclusive tasks for siblings.
> >
> > I'll work on something.
>
> Wild thought: how about doing this for the sibling ...
>
> rq->nr_running += SOME_BIG_NUMBER
>
> when a SCHED_FIFO task starts running on some cpu, and
> undo the above when the cpu is released. This fools
> the load balancer into _gradually_ moving tasks off the
> sibling, when the cpu is hogged by some SCHED_FIFO task,
> but should have little effect if a SCHED_FIFO task takes
> little cpu time.
A good thought, and one I had considered. SOME_BIG_NUMBER needs to be
meaningful for this to work. Ideally what we do is add the effective load from
the sibling cpu to the pegged cpu. However that's not as useful as it sounds
because we need to ensure both sibling runqueues are locked every time we
check the load value of one runqueue, and the last thing I want is to
introduce yet more locking. Also the value will vary wildly depending on
whether the task is pegged or not, and this changes in mainline many times in
less than 0.1s, which means it would throw load balancing way off as the value
will effectively become meaningless.
I already have a plan for this without really touching the load balancing.
Cheers,
Con
* Re: SD_SHARE_CPUPOWER breaks scheduler fairness
2005-06-01 22:01 ` Steve Rotolo
@ 2005-06-02 3:01 ` Con Kolivas
0 siblings, 0 replies; 15+ messages in thread
From: Con Kolivas @ 2005-06-02 3:01 UTC (permalink / raw)
To: steve.rotolo; +Cc: linux-kernel, bugsy, mingo
On Thu, 2 Jun 2005 08:01 am, Steve Rotolo wrote:
> On Wed, 2005-06-01 at 17:37, Con Kolivas wrote:
> > I'll work on something.
>
> Great! I'd be happy to test a patch for you. Thanks!!!
Ok, this patch is only compile tested, I'm sorry to say (I have no real HT
testing environment at the moment), but it uses jiffies and DEF_TIMESLICE to
allow SCHED_NORMAL tasks to run per_cpu_gain% of DEF_TIMESLICE at a time.
Thus it is not tied to timeslices at all for real time tasks, which is
appropriate. Can you try it please? It should apply to any recent kernel.
Cheers,
Con
P.S. cc'ed Ingo
Signed-off-by: Con Kolivas <kernel@kolivas.org>
[-- Attachment #2: sched-run_normal_with_rt_on_sibling.diff --]
[-- Type: text/x-diff, Size: 3122 bytes --]
Index: linux-2.6.12-rc5-mm2/kernel/sched.c
===================================================================
--- linux-2.6.12-rc5-mm2.orig/kernel/sched.c 2005-06-02 10:13:26.000000000 +1000
+++ linux-2.6.12-rc5-mm2/kernel/sched.c 2005-06-02 12:54:39.000000000 +1000
@@ -2656,6 +2656,13 @@ out:
}
#ifdef CONFIG_SCHED_SMT
+static inline void wakeup_busy_runqueue(runqueue_t *rq)
+{
+ /* If an SMT runqueue is sleeping due to priority reasons wake it up */
+ if (rq->curr == rq->idle && rq->nr_running)
+ resched_task(rq->idle);
+}
+
static inline void wake_sleeping_dependent(int this_cpu, runqueue_t *this_rq)
{
struct sched_domain *tmp, *sd = NULL;
@@ -2689,12 +2696,7 @@ static inline void wake_sleeping_depende
for_each_cpu_mask(i, sibling_map) {
runqueue_t *smt_rq = cpu_rq(i);
- /*
- * If an SMT sibling task is sleeping due to priority
- * reasons wake it up now.
- */
- if (smt_rq->curr == smt_rq->idle && smt_rq->nr_running)
- resched_task(smt_rq->idle);
+ wakeup_busy_runqueue(smt_rq);
}
for_each_cpu_mask(i, sibling_map)
@@ -2748,6 +2750,10 @@ static inline int dependent_sleeper(int
runqueue_t *smt_rq = cpu_rq(i);
task_t *smt_curr = smt_rq->curr;
+ /* Kernel threads do not participate in dependent sleeping */
+ if (!p->mm || !smt_curr->mm || rt_task(p))
+ goto check_smt_task;
+
/*
* If a user task with lower static priority than the
* running task on the SMT sibling is trying to schedule,
@@ -2756,21 +2762,44 @@ static inline int dependent_sleeper(int
* task from using an unfair proportion of the
* physical cpu's resources. -ck
*/
- if (((smt_curr->time_slice * (100 - sd->per_cpu_gain) / 100) >
- task_timeslice(p) || rt_task(smt_curr)) &&
- p->mm && smt_curr->mm && !rt_task(p))
- ret = 1;
+ if (rt_task(smt_curr)) {
+ /*
+ * With real time tasks we run non-rt tasks only
+ * per_cpu_gain% of the time.
+ */
+ if ((jiffies % DEF_TIMESLICE) >
+ (sd->per_cpu_gain * DEF_TIMESLICE / 100))
+ ret = 1;
+ } else
+ if (((smt_curr->time_slice * (100 - sd->per_cpu_gain) /
+ 100) > task_timeslice(p)))
+ ret = 1;
+
+check_smt_task:
+ if ((!smt_curr->mm && smt_curr != smt_rq->idle) ||
+ rt_task(smt_curr))
+ continue;
+ if (!p->mm) {
+ wakeup_busy_runqueue(smt_rq);
+ continue;
+ }
/*
* Reschedule a lower priority task on the SMT sibling,
* or wake it up if it has been put to sleep for priority
- * reasons.
+ * reasons to see if it should run now.
*/
- if ((((p->time_slice * (100 - sd->per_cpu_gain) / 100) >
- task_timeslice(smt_curr) || rt_task(p)) &&
- smt_curr->mm && p->mm && !rt_task(smt_curr)) ||
- (smt_curr == smt_rq->idle && smt_rq->nr_running))
- resched_task(smt_curr);
+ if (rt_task(p)) {
+ if ((jiffies % DEF_TIMESLICE) >
+ (sd->per_cpu_gain * DEF_TIMESLICE / 100))
+ resched_task(smt_curr);
+ } else {
+ if ((p->time_slice * (100 - sd->per_cpu_gain) / 100) >
+ task_timeslice(smt_curr))
+ resched_task(smt_curr);
+ else
+ wakeup_busy_runqueue(smt_rq);
+ }
}
out_unlock:
for_each_cpu_mask(i, sibling_map)
* Re: SD_SHARE_CPUPOWER breaks scheduler fairness
2005-06-01 23:25 ` Con Kolivas
@ 2005-06-02 13:30 ` Steve Rotolo
2005-06-02 13:34 ` Con Kolivas
0 siblings, 1 reply; 15+ messages in thread
From: Steve Rotolo @ 2005-06-02 13:30 UTC (permalink / raw)
To: Con Kolivas; +Cc: joe.korty, linux-kernel, bugsy
> > Wild thought: how about doing this for the sibling ...
> >
> > rq->nr_running += SOME_BIG_NUMBER
> >
> > when a SCHED_FIFO task starts running on some cpu, and
> > undo the above when the cpu is released. This fools
> > the load balancer into _gradually_ moving tasks off the
> > sibling, when the cpu is hogged by some SCHED_FIFO task,
> > but should have little effect if a SCHED_FIFO task takes
> > little cpu time.
>
> A good thought, and one I had considered. SOME_BIG_NUMBER needs to be
> meaningful for this to work. Ideally what we do is add the effective load from
> the sibling cpu to the pegged cpu. However that's not as useful as it sounds
> because we need to ensure both sibling runqueues are locked every time we
> check the load value of one runqueue, and the last thing I want is to
> introduce yet more locking. Also the value will vary wildly depending on
> whether the task is pegged or not, and this changes in mainline many times in
> less than 0.1s, which means it would throw load balancing way off as the value
> will effectively become meaningless.
>
Just a few more thoughts on this....
I can't help but wonder if a similar problem exists even without HT.
What if the load-balancer decides to keep a sched_normal task on a cpu
that is being dominated by a sched_fifo task. The sched_normal task
should really be "balanced" to a different cpu, but because nr_running is
the only balancing criterion, that may not happen. Runqueue busyness
ought to be weighted by the amount of cpu time that sched_fifo tasks on that
runqueue have recently used. So, load = rq->nr_running +
rq->recent_fifo_run_time. I think this would make load-balancing more
correct.
Now back to HT sched_domains... It seems to me that when
SD_SHARE_CPUPOWER is on, recent_fifo_run_time should apply to the whole
domain instead of a single runqueue, so that a cpu's load =
rq->nr_running + sd->recent_fifo_run_time. But I don't know if this
suffers from the same runqueue locking problem that you pointed out.
--
Steve
* Re: SD_SHARE_CPUPOWER breaks scheduler fairness
2005-06-02 13:30 ` Steve Rotolo
@ 2005-06-02 13:34 ` Con Kolivas
2005-06-02 15:48 ` Steve Rotolo
0 siblings, 1 reply; 15+ messages in thread
From: Con Kolivas @ 2005-06-02 13:34 UTC (permalink / raw)
To: Steve Rotolo; +Cc: joe.korty, linux-kernel, bugsy
On Thu, 2 Jun 2005 23:30, Steve Rotolo wrote:
> > > Wild thought: how about doing this for the sibling ...
> > >
> > > rq->nr_running += SOME_BIG_NUMBER
> > >
> > > when a SCHED_FIFO task starts running on some cpu, and
> > > undo the above when the cpu is released. This fools
> > > the load balancer into _gradually_ moving tasks off the
> > > sibling, when the cpu is hogged by some SCHED_FIFO task,
> > > but should have little effect if a SCHED_FIFO task takes
> > > little cpu time.
> >
> > A good thought, and one I had considered. SOME_BIG_NUMBER needs to be
> > meaningful for this to work. Ideally what we do is add the effective load
> > from the sibling cpu to the pegged cpu. However that's not as useful as
> > it sounds because we need to ensure both sibling runqueues are locked
> > every time we check the load value of one runqueue, and the last thing I
> > want is to introduce yet more locking. Also the value will vary wildly
> > depending on whether the task is pegged or not, and this changes in
> > mainline many times in less than 0.1s, which means it would throw load
> > balancing way off as the value will effectively become meaningless.
>
> Just a few more thoughts on this....
>
> I can't help but wonder if a similar problem exists even without HT.
> What if the load-balancer decides to keep a sched_normal task on a cpu
> that is being dominated by a sched_fifo task. The sched_normal task
> should really be "balanced" to a different cpu, but because nr_running is
> the only balancing criterion, that may not happen. Runqueue busyness
> ought to be weighted by the amount of cpu time that sched_fifo tasks on that
> runqueue have recently used. So, load = rq->nr_running +
> rq->recent_fifo_run_time. I think this would make load-balancing more
> correct.
Funny you should mention this. Check the latest -mm code and you'll see Andrew
has merged my smp nice code, which takes "nice" values into account and alters
balancing according to nice values, heavily biasing them when real time
tasks are running. So you are correct, and it is a problem common to any
scheduler designed around per-cpu runqueues (which, interestingly, there is
evidence windows went to in about 2003 because it exhibited this very problem).
However my code should make this behave better now.
Cheers,
Con
[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]
* Re: SD_SHARE_CPUPOWER breaks scheduler fairness
2005-06-02 13:34 ` Con Kolivas
@ 2005-06-02 15:48 ` Steve Rotolo
2005-06-03 0:43 ` [PATCH] SCHED: run SCHED_NORMAL tasks with real time tasks on SMT siblings Con Kolivas
0 siblings, 1 reply; 15+ messages in thread
From: Steve Rotolo @ 2005-06-02 15:48 UTC (permalink / raw)
To: Con Kolivas; +Cc: joe.korty, linux-kernel, bugsy
On Thu, 2005-06-02 at 09:34, Con Kolivas wrote:
> Funny you should mention this. Check the latest -mm code and you'll see Andrew
> has merged my smp nice code, which takes "nice" values into account and alters
> balancing according to nice values, heavily biasing them when real time
> tasks are running. So you are correct, and it is a problem common to any
> scheduler designed around per-cpu runqueues (which, interestingly, there is
> evidence windows went to in about 2003 because it exhibited this very problem).
> However my code should make this behave better now.
>
Glad to hear this is in -mm! And BTW, your patch works great with my
HT test case. Thanks -- good job.
--
Steve
* [PATCH] SCHED: run SCHED_NORMAL tasks with real time tasks on SMT siblings
2005-06-02 15:48 ` Steve Rotolo
@ 2005-06-03 0:43 ` Con Kolivas
0 siblings, 0 replies; 15+ messages in thread
From: Con Kolivas @ 2005-06-03 0:43 UTC (permalink / raw)
To: Andrew Morton
Cc: Steve Rotolo, joe.korty, linux-kernel, bugsy, Ingo Molnar, ck,
Peter Williams
On Fri, 3 Jun 2005 01:48, Steve Rotolo wrote:
> And BTW, your patch works great with my
> HT test case. Thanks -- good job.
Thanks. Cleaned up the patch comments a little.
Andrew can you queue this up in -mm please? This patch does not depend on any
other patches in -mm and should have only a short test cycle before being
pushed into mainline.
Con
----
[-- Attachment #1.2: sched-run_normal_with_rt_on_sibling.diff --]
[-- Type: text/x-diff, Size: 4141 bytes --]
The hyperthread-aware nice handling currently puts to sleep any non-real-time
task when a real-time task is running on its sibling cpu. This can lead to
prolonged starvation by having the non-real-time task pegged to the cpu, with
load balancing not pulling that task away.
Currently we force lower priority hyperthread tasks to run a percentage of
time difference based on timeslice differences, which is meaningless when
comparing real-time tasks to SCHED_NORMAL tasks. We can allow non-real-time
tasks to run with real-time tasks on the sibling up to per_cpu_gain% if we use
jiffies as a counter.
Cleanups and micro-optimisations to the relevant code section should make it
more understandable as well.
Signed-off-by: Con Kolivas <kernel@kolivas.org>
Index: linux-2.6.12-rc5-mm2/kernel/sched.c
===================================================================
--- linux-2.6.12-rc5-mm2.orig/kernel/sched.c 2005-06-03 10:10:37.000000000 +1000
+++ linux-2.6.12-rc5-mm2/kernel/sched.c 2005-06-03 10:25:19.000000000 +1000
@@ -2656,6 +2656,13 @@ out:
}
#ifdef CONFIG_SCHED_SMT
+static inline void wakeup_busy_runqueue(runqueue_t *rq)
+{
+ /* If an SMT runqueue is sleeping due to priority reasons wake it up */
+ if (rq->curr == rq->idle && rq->nr_running)
+ resched_task(rq->idle);
+}
+
static inline void wake_sleeping_dependent(int this_cpu, runqueue_t *this_rq)
{
struct sched_domain *tmp, *sd = NULL;
@@ -2689,12 +2696,7 @@ static inline void wake_sleeping_depende
for_each_cpu_mask(i, sibling_map) {
runqueue_t *smt_rq = cpu_rq(i);
- /*
- * If an SMT sibling task is sleeping due to priority
- * reasons wake it up now.
- */
- if (smt_rq->curr == smt_rq->idle && smt_rq->nr_running)
- resched_task(smt_rq->idle);
+ wakeup_busy_runqueue(smt_rq);
}
for_each_cpu_mask(i, sibling_map)
@@ -2748,6 +2750,10 @@ static inline int dependent_sleeper(int
runqueue_t *smt_rq = cpu_rq(i);
task_t *smt_curr = smt_rq->curr;
+ /* Kernel threads do not participate in dependent sleeping */
+ if (!p->mm || !smt_curr->mm || rt_task(p))
+ goto check_smt_task;
+
/*
* If a user task with lower static priority than the
* running task on the SMT sibling is trying to schedule,
@@ -2756,21 +2762,44 @@ static inline int dependent_sleeper(int
* task from using an unfair proportion of the
* physical cpu's resources. -ck
*/
- if (((smt_curr->time_slice * (100 - sd->per_cpu_gain) / 100) >
- task_timeslice(p) || rt_task(smt_curr)) &&
- p->mm && smt_curr->mm && !rt_task(p))
- ret = 1;
+ if (rt_task(smt_curr)) {
+ /*
+ * With real time tasks we run non-rt tasks only
+ * per_cpu_gain% of the time.
+ */
+ if ((jiffies % DEF_TIMESLICE) >
+ (sd->per_cpu_gain * DEF_TIMESLICE / 100))
+ ret = 1;
+ } else
+ if (((smt_curr->time_slice * (100 - sd->per_cpu_gain) /
+ 100) > task_timeslice(p)))
+ ret = 1;
+
+check_smt_task:
+ if ((!smt_curr->mm && smt_curr != smt_rq->idle) ||
+ rt_task(smt_curr))
+ continue;
+ if (!p->mm) {
+ wakeup_busy_runqueue(smt_rq);
+ continue;
+ }
/*
- * Reschedule a lower priority task on the SMT sibling,
- * or wake it up if it has been put to sleep for priority
- * reasons.
+ * Reschedule a lower priority task on the SMT sibling for
+ * it to be put to sleep, or wake it up if it has been put to
+ * sleep for priority reasons to see if it should run now.
*/
- if ((((p->time_slice * (100 - sd->per_cpu_gain) / 100) >
- task_timeslice(smt_curr) || rt_task(p)) &&
- smt_curr->mm && p->mm && !rt_task(smt_curr)) ||
- (smt_curr == smt_rq->idle && smt_rq->nr_running))
- resched_task(smt_curr);
+ if (rt_task(p)) {
+ if ((jiffies % DEF_TIMESLICE) >
+ (sd->per_cpu_gain * DEF_TIMESLICE / 100))
+ resched_task(smt_curr);
+ } else {
+ if ((p->time_slice * (100 - sd->per_cpu_gain) / 100) >
+ task_timeslice(smt_curr))
+ resched_task(smt_curr);
+ else
+ wakeup_busy_runqueue(smt_rq);
+ }
}
out_unlock:
for_each_cpu_mask(i, sibling_map)