* commit e9e9250b: sync wakeup bustage when waker is an RT task
@ 2010-05-15 11:57 Mike Galbraith

From: Mike Galbraith @ 2010-05-15 11:57 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Ingo Molnar, LKML

Hi Peter,

This commit excluded RT tasks from rq->load, was that intentional?  The
comment in struct rq states that load reflects *all* tasks, but since
this commit, that's no longer true.

Looking at lmbench lat_udp in a PREEMPT_RT kernel, I noticed that
wake_affine() is failing for sync wakeups when it should not.  It's
doing so because the waker in this case is an RT kernel thread
(sirq-net-rx) - we subtract the sync waker's weight, when it was never
added in the first place, resulting in this_load going gaga.  End result
is quite high latency numbers due to tasks jabbering cross-cache.

If the exclusion was intentional, I suppose I can do a waker class check
in wake_affine() to fix it.

	-Mike

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: commit e9e9250b: sync wakeup bustage when waker is an RT task

From: Peter Zijlstra @ 2010-05-15 12:04 UTC (permalink / raw)
To: Mike Galbraith; +Cc: Ingo Molnar, LKML

On Sat, 2010-05-15 at 13:57 +0200, Mike Galbraith wrote:
> Hi Peter,
>
> This commit excluded RT tasks from rq->load, was that intentional?  The
> comment in struct rq states that load reflects *all* tasks, but since
> this commit, that's no longer true.

Right, because a static load value does not accurately reflect an RT task,
which can run as long as it pretty well pleases.  So instead we measure
the time spent running !fair tasks and scale down the cpu_power
proportionally.

> Looking at lmbench lat_udp in a PREEMPT_RT kernel, I noticed that
> wake_affine() is failing for sync wakeups when it should not.  It's
> doing so because the waker in this case is an RT kernel thread
> (sirq-net-rx) - we subtract the sync waker's weight, when it was never
> added in the first place, resulting in this_load going gaga.  End result
> is quite high latency numbers due to tasks jabbering cross-cache.
>
> If the exclusion was intentional, I suppose I can do a waker class check
> in wake_affine() to fix it.

So basically make all RT wakeups sync?
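The cpu_power scaling Peter describes can be sketched outside the kernel roughly as follows.  This is a toy model, not the kernel's actual implementation: the function name, the minimum-capacity floor, and the fixed averaging period are illustrative stand-ins; the real code uses a decaying average of RT runtime.

```c
#include <assert.h>

/* Nominal capacity of one CPU, mirroring the kernel's SCHED_LOAD_SCALE. */
#define SCHED_LOAD_SCALE 1024UL

/*
 * Scale a CPU's capacity down by the fraction of a recent period that was
 * consumed by !fair (e.g. RT) tasks, leaving a minimum of 1 so the value
 * never reaches zero.  rt_time and period are in the same time unit.
 */
static unsigned long scale_power_for_rt(unsigned long rt_time,
					unsigned long period)
{
	unsigned long available;

	if (rt_time >= period)
		return 1;	/* RT consumed everything; keep a floor */

	available = period - rt_time;
	return SCHED_LOAD_SCALE * available / period;
}
```

With this shape, a CPU whose RT threads ate half the period advertises half its nominal capacity, so the fair-class load balancer steers proportionally less CFS weight to it.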
* Re: commit e9e9250b: sync wakeup bustage when waker is an RT task

From: Mike Galbraith @ 2010-05-15 17:07 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Ingo Molnar, LKML

On Sat, 2010-05-15 at 14:04 +0200, Peter Zijlstra wrote:
> On Sat, 2010-05-15 at 13:57 +0200, Mike Galbraith wrote:
> > Hi Peter,
> >
> > This commit excluded RT tasks from rq->load, was that intentional?  The
> > comment in struct rq states that load reflects *all* tasks, but since
> > this commit, that's no longer true.
>
> Right, because a static load value does not accurately reflect an RT task,
> which can run as long as it pretty well pleases.  So instead we measure
> the time spent running !fair tasks and scale down the cpu_power
> proportionally.
>
> > Looking at lmbench lat_udp in a PREEMPT_RT kernel, I noticed that
> > wake_affine() is failing for sync wakeups when it should not.  It's
> > doing so because the waker in this case is an RT kernel thread
> > (sirq-net-rx) - we subtract the sync waker's weight, when it was never
> > added in the first place, resulting in this_load going gaga.  End result
> > is quite high latency numbers due to tasks jabbering cross-cache.
> >
> > If the exclusion was intentional, I suppose I can do a waker class check
> > in wake_affine() to fix it.
>
> So basically make all RT wakeups sync?

I was going to just skip subtracting the waker's weight, ala

	/*
	 * If sync wakeup then subtract the (maximum possible)
	 * effect of the currently running task from the load
	 * of the current CPU:
	 */
	if (sync && !task_has_rt_policy(curr))
		...

	-Mike
* Re: commit e9e9250b: sync wakeup bustage when waker is an RT task

From: Mike Galbraith @ 2010-05-15 17:32 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Ingo Molnar, LKML

On Sat, 2010-05-15 at 19:07 +0200, Mike Galbraith wrote:
> On Sat, 2010-05-15 at 14:04 +0200, Peter Zijlstra wrote:
> > So basically make all RT wakeups sync?
>
> I was going to just skip subtracting the waker's weight, ala
>
> 	/*
> 	 * If sync wakeup then subtract the (maximum possible)
> 	 * effect of the currently running task from the load
> 	 * of the current CPU:
> 	 */
> 	if (sync && !task_has_rt_policy(curr))
> 		...

(not.  hohum, something to piddle with over java mañana)
* Re: commit e9e9250b: sync wakeup bustage when waker is an RT task

From: Mike Galbraith @ 2010-05-16  7:21 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Ingo Molnar, LKML, Thomas Gleixner

On Sat, 2010-05-15 at 19:07 +0200, Mike Galbraith wrote:
> On Sat, 2010-05-15 at 14:04 +0200, Peter Zijlstra wrote:
> > On Sat, 2010-05-15 at 13:57 +0200, Mike Galbraith wrote:
> > > Hi Peter,
> > >
> > > This commit excluded RT tasks from rq->load, was that intentional?  The
> > > comment in struct rq states that load reflects *all* tasks, but since
> > > this commit, that's no longer true.
> >
> > Right, because a static load value does not accurately reflect an RT task,
> > which can run as long as it pretty well pleases.  So instead we measure
> > the time spent running !fair tasks and scale down the cpu_power
> > proportionally.
> >
> > > Looking at lmbench lat_udp in a PREEMPT_RT kernel, I noticed that
> > > wake_affine() is failing for sync wakeups when it should not.  It's
> > > doing so because the waker in this case is an RT kernel thread
> > > (sirq-net-rx) - we subtract the sync waker's weight, when it was never
> > > added in the first place, resulting in this_load going gaga.  End result
> > > is quite high latency numbers due to tasks jabbering cross-cache.
> > >
> > > If the exclusion was intentional, I suppose I can do a waker class check
> > > in wake_affine() to fix it.
> >
> > So basically make all RT wakeups sync?
>
> I was going to just skip subtracting the waker's weight, ala
>
> 	/*
> 	 * If sync wakeup then subtract the (maximum possible)
> 	 * effect of the currently running task from the load
> 	 * of the current CPU:
> 	 */
> 	if (sync && !task_has_rt_policy(curr))

One-liner doesn't work.  We have one task on the cfs_rq, the one who is
the waker in !PREEMPT_RT, which is a fail case for wake_affine() if you
don't do the weight subtraction.  I did the below instead.

sched: RT waker sync wakeup bugfix

An RT waker's weight is not on the runqueue, but we try to subtract it
anyway in the sync wakeup case, sending this_load negative.  This leads
to affine wakeup failure in cases where it should succeed.  This was
found while testing a PREEMPT_RT kernel with lmbench's lat_udp.  In a
PREEMPT_RT kernel, softirq threads act as a ~proxy for the !RT buddy.
Approximate !PREEMPT_RT sync wakeup behavior by looking at the buddy
instead, and subtracting the maximum task weight that will not send
this_load negative.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
LKML-Reference: <new-submission>

 kernel/sched_fair.c |    9 +++++++++
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 5240469..cc40849 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1280,6 +1280,15 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 		tg = task_group(current);
 		weight = current->se.load.weight;

+		/*
+		 * An RT waker's weight is not on the runqueue.  Subtract the
+		 * maximum task weight that will not send this_load negative.
+		 */
+		if (task_has_rt_policy(current)) {
+			weight = max_t(unsigned long, NICE_0_LOAD, p->se.load.weight);
+			weight = min(weight, this_load);
+		}
+
 		this_load += effective_load(tg, this_cpu, -weight, -weight);
 		load += effective_load(tg, prev_cpu, 0, -weight);
 	}
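The clamp in the patch above can be sketched in isolation.  A minimal standalone model, with illustrative values (not kernel code): since an RT waker's weight was never added to this_load, the patch instead subtracts the wakee's weight, raised to at least NICE_0_LOAD, but capped so this_load can never underflow.

```c
#include <assert.h>

/* Weight of a nice-0 task, mirroring the kernel's NICE_0_LOAD. */
#define NICE_0_LOAD 1024UL

/*
 * Pick the weight to subtract for a sync wakeup by an RT waker:
 * at least NICE_0_LOAD, at most the wakee's own weight, and never
 * more than this_load itself (so this_load - weight >= 0).
 */
static unsigned long sync_weight(unsigned long wakee_weight,
				 unsigned long this_load)
{
	unsigned long weight = wakee_weight;

	if (weight < NICE_0_LOAD)
		weight = NICE_0_LOAD;	/* max_t(unsigned long, NICE_0_LOAD, ...) */
	if (weight > this_load)
		weight = this_load;	/* min(weight, this_load) */
	return weight;
}
```

The second clamp is the actual bug fix: with an unsigned this_load, subtracting a weight that was never added would wrap around to a huge value ("going gaga"), defeating the affine-wakeup check.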
* Re: commit e9e9250b: sync wakeup bustage when waker is an RT task

From: Mike Galbraith @ 2010-05-17  4:38 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Ingo Molnar, LKML, Thomas Gleixner

On Sun, 2010-05-16 at 09:21 +0200, Mike Galbraith wrote:
> On Sat, 2010-05-15 at 19:07 +0200, Mike Galbraith wrote:
> > On Sat, 2010-05-15 at 14:04 +0200, Peter Zijlstra wrote:
> > > On Sat, 2010-05-15 at 13:57 +0200, Mike Galbraith wrote:
> > > > Hi Peter,
> > > >
> > > > This commit excluded RT tasks from rq->load, was that intentional?  The
> > > > comment in struct rq states that load reflects *all* tasks, but since
> > > > this commit, that's no longer true.
> > >
> > > Right, because a static load value does not accurately reflect an RT task,
> > > which can run as long as it pretty well pleases.  So instead we measure
> > > the time spent running !fair tasks and scale down the cpu_power
> > > proportionally.
> > >
> > > > Looking at lmbench lat_udp in a PREEMPT_RT kernel, I noticed that
> > > > wake_affine() is failing for sync wakeups when it should not.  It's
> > > > doing so because the waker in this case is an RT kernel thread
> > > > (sirq-net-rx) - we subtract the sync waker's weight, when it was never
> > > > added in the first place, resulting in this_load going gaga.  End result
> > > > is quite high latency numbers due to tasks jabbering cross-cache.
> > > >
> > > > If the exclusion was intentional, I suppose I can do a waker class check
> > > > in wake_affine() to fix it.
> > >
> > > So basically make all RT wakeups sync?
> >
> > I was going to just skip subtracting the waker's weight, ala
> >
> > 	/*
> > 	 * If sync wakeup then subtract the (maximum possible)
> > 	 * effect of the currently running task from the load
> > 	 * of the current CPU:
> > 	 */
> > 	if (sync && !task_has_rt_policy(curr))
>
> One-liner doesn't work.  We have one task on the cfs_rq, the one who is
> the waker in !PREEMPT_RT, which is a fail case for wake_affine() if you
> don't do the weight subtraction.  I did the below instead.

(Which is kinda fugly, only redeeming factor is it works;).

What would be the harm/consequence of restoring RT tasks to rq->load so
the wake_affine()::sync logic just worked as before without hackery?

The weight is a more or less random number, but looking around, with
them excluded, avg_load_per_task is lowered when RT tasks enter the
system, and rq->load[] misses their weight.  (Dunno what effect it has
on tg shares).

	-Mike

> sched: RT waker sync wakeup bugfix
>
> An RT waker's weight is not on the runqueue, but we try to subtract it
> anyway in the sync wakeup case, sending this_load negative.  This leads
> to affine wakeup failure in cases where it should succeed.  This was
> found while testing a PREEMPT_RT kernel with lmbench's lat_udp.  In a
> PREEMPT_RT kernel, softirq threads act as a ~proxy for the !RT buddy.
> Approximate !PREEMPT_RT sync wakeup behavior by looking at the buddy
> instead, and subtracting the maximum task weight that will not send
> this_load negative.
>
> Signed-off-by: Mike Galbraith <efault@gmx.de>
> Cc: Ingo Molnar <mingo@elte.hu>
> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> LKML-Reference: <new-submission>
>
>  kernel/sched_fair.c |    9 +++++++++
>  1 files changed, 9 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index 5240469..cc40849 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -1280,6 +1280,15 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
>  		tg = task_group(current);
>  		weight = current->se.load.weight;
>
> +		/*
> +		 * An RT waker's weight is not on the runqueue.  Subtract the
> +		 * maximum task weight that will not send this_load negative.
> +		 */
> +		if (task_has_rt_policy(current)) {
> +			weight = max_t(unsigned long, NICE_0_LOAD, p->se.load.weight);
> +			weight = min(weight, this_load);
> +		}
> +
>  		this_load += effective_load(tg, this_cpu, -weight, -weight);
>  		load += effective_load(tg, prev_cpu, 0, -weight);
>  	}
* Re: commit e9e9250b: sync wakeup bustage when waker is an RT task

From: Peter Zijlstra @ 2010-05-17  8:49 UTC (permalink / raw)
To: Mike Galbraith; +Cc: Ingo Molnar, LKML, Thomas Gleixner

On Mon, 2010-05-17 at 06:38 +0200, Mike Galbraith wrote:
> What would be the harm/consequence of restoring RT tasks to rq->load so
> the wake_affine()::sync logic just worked as before without hackery?

Well, you'd have to constantly adjust the task weight of RT tasks to
reflect their actual consumption.  Not really feasible.

So the proportional stuff works like:

  slice_i = w_i / (\Sum_j w_j) * dt

Giving an RT task a sensible weight, we'd have to reverse that:

  w_i = slice_i/dt * (\Sum_j w_j)

which is something that depends on the rq->load, so every time you
change the rq->load you'd have to recompute the weight of all the RT
tasks, which again changes the rq->load (got a head-ache already? :-)

> The weight is a more or less random number, but looking around, with
> them excluded, avg_load_per_task is lowered when RT tasks enter the
> system, and rq->load[] misses their weight.  (Dunno what effect it has
> on tg shares).

Well, those things are more or less a 'good' thing, it makes it purely
about sched_fair.

So the thing to do I think is to teach wake_affine about cpu_power,
because that is what includes the RT tasks.

The proper comparison of rq weights (like the regular load balancer
already does) is:

  A->load / A->cpu_power ~ B->load / B->cpu_power

The lower the cpu_power of a particular cpu, the less processing
capacity it has, the smaller its share of the total weight should be to
provide equal work for each task.
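The capacity-normalized comparison Peter describes can be sketched with plain integers.  This is an illustrative model, not the kernel code: dividing each rq's load by its cpu_power is rewritten as a cross-multiplication, so the fast path needs no division and no floating point.

```c
#include <assert.h>

/*
 * Compare two runqueues' loads normalized by capacity:
 *
 *   this_load / this_power  <=  prev_load / prev_power
 *
 * cross-multiplied (both powers are positive) into
 *
 *   this_load * prev_power  <=  prev_load * this_power
 *
 * Returns nonzero when this_cpu is no more loaded, per unit of
 * capacity, than prev_cpu.
 */
static int this_cpu_preferred(unsigned long this_load, unsigned long this_power,
			      unsigned long prev_load, unsigned long prev_power)
{
	return this_load * prev_power <= prev_load * this_power;
}
```

With equal capacities this degenerates to a plain load comparison; when RT activity has shrunk one CPU's cpu_power, the same raw load weighs proportionally more on that CPU.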
* Re: commit e9e9250b: sync wakeup bustage when waker is an RT task

From: Peter Zijlstra @ 2010-05-17  8:52 UTC (permalink / raw)
To: Mike Galbraith; +Cc: Ingo Molnar, LKML, Thomas Gleixner

On Mon, 2010-05-17 at 10:49 +0200, Peter Zijlstra wrote:
> On Mon, 2010-05-17 at 06:38 +0200, Mike Galbraith wrote:
> > What would be the harm/consequence of restoring RT tasks to rq->load so
> > the wake_affine()::sync logic just worked as before without hackery?
>
> Well, you'd have to constantly adjust the task weight of RT tasks to
> reflect their actual consumption.  Not really feasible.
>
> So the proportional stuff works like:
>
>   slice_i = w_i / (\Sum_j w_j) * dt
>
> Giving an RT task a sensible weight, we'd have to reverse that:
>
>   w_i = slice_i/dt * (\Sum_j w_j)

Another point to note is that this requires we track per-RT-task usage
averages, whereas the cpu_power approach simply lumps everything !fair
(one of the things still on the TODO list is to account for IRQ
overhead) into a single large bucket and doesn't care.
* Re: commit e9e9250b: sync wakeup bustage when waker is an RT task

From: Mike Galbraith @ 2010-05-17  9:04 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Ingo Molnar, LKML, Thomas Gleixner

On Mon, 2010-05-17 at 10:49 +0200, Peter Zijlstra wrote:
> On Mon, 2010-05-17 at 06:38 +0200, Mike Galbraith wrote:
> > What would be the harm/consequence of restoring RT tasks to rq->load so
> > the wake_affine()::sync logic just worked as before without hackery?
>
> Well, you'd have to constantly adjust the task weight of RT tasks to
> reflect their actual consumption.  Not really feasible.

Egad, forget that.

> So the proportional stuff works like:
>
>   slice_i = w_i / (\Sum_j w_j) * dt
>
> Giving an RT task a sensible weight, we'd have to reverse that:
>
>   w_i = slice_i/dt * (\Sum_j w_j)
>
> which is something that depends on the rq->load, so every time you
> change the rq->load you'd have to recompute the weight of all the RT
> tasks, which again changes the rq->load (got a head-ache already? :-)

Yup.

> > The weight is a more or less random number, but looking around, with
> > them excluded, avg_load_per_task is lowered when RT tasks enter the
> > system, and rq->load[] misses their weight.  (Dunno what effect it has
> > on tg shares).
>
> Well, those things are more or less a 'good' thing, it makes it purely
> about sched_fair.

(Yeah, I was pondering up/down sides)

> So the thing to do I think is to teach wake_affine about cpu_power,
> because that is what includes the RT tasks.
>
> The proper comparison of rq weights (like the regular load balancer
> already does) is:
>
>   A->load / A->cpu_power ~ B->load / B->cpu_power
>
> The lower the cpu_power of a particular cpu, the less processing
> capacity it has, the smaller its share of the total weight should be to
> provide equal work for each task.

Hm, sounds kinda heavy/complicated for the fast path.  I think I like
the little hack better than trying to teach it about cpu_power :)

	-Mike
* Re: commit e9e9250b: sync wakeup bustage when waker is an RT task

From: Peter Zijlstra @ 2010-05-31 11:56 UTC (permalink / raw)
To: Mike Galbraith; +Cc: Ingo Molnar, LKML, Thomas Gleixner

On Sun, 2010-05-16 at 09:21 +0200, Mike Galbraith wrote:
> sched: RT waker sync wakeup bugfix
>
> An RT waker's weight is not on the runqueue, but we try to subtract it
> anyway in the sync wakeup case, sending this_load negative.  This leads
> to affine wakeup failure in cases where it should succeed.  This was
> found while testing a PREEMPT_RT kernel with lmbench's lat_udp.  In a
> PREEMPT_RT kernel, softirq threads act as a ~proxy for the !RT buddy.
> Approximate !PREEMPT_RT sync wakeup behavior by looking at the buddy
> instead, and subtracting the maximum task weight that will not send
> this_load negative.

Does the below work?
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c      |    4 ++--
 kernel/sched_fair.c |   36 ++++++++++++++++++++++++++++++------
 2 files changed, 32 insertions(+), 8 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -1877,8 +1877,8 @@ static void dec_nr_running(struct rq *rq
 static void set_load_weight(struct task_struct *p)
 {
 	if (task_has_rt_policy(p)) {
-		p->se.load.weight = prio_to_weight[0] * 2;
-		p->se.load.inv_weight = prio_to_wmult[0] >> 1;
+		p->se.load.weight = 0;
+		p->se.load.inv_weight = WMULT_CONST;
 		return;
 	}

Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -1220,12 +1220,26 @@ static inline unsigned long effective_lo

 #endif

+static unsigned long cpu_power(int cpu)
+{
+	struct sched_domain *sd;
+	struct sched_group *sg;
+
+	sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd);
+	if (!sd)
+		return 1024;
+	sg = sd->groups;
+	if (!sg)
+		return 1024;
+
+	return sg->cpu_power;
+}
+
 static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 {
 	unsigned long this_load, load;
 	int idx, this_cpu, prev_cpu;
 	unsigned long tl_per_task;
-	unsigned int imbalance;
 	struct task_group *tg;
 	unsigned long weight;
 	int balanced;
@@ -1252,8 +1266,6 @@ static int wake_affine(struct sched_doma
 	tg = task_group(p);
 	weight = p->se.load.weight;

-	imbalance = 100 + (sd->imbalance_pct - 100) / 2;
-
 	/*
 	 * In low-load situations, where prev_cpu is idle and this_cpu is idle
 	 * due to the sync cause above having dropped this_load to 0, we'll
@@ -1263,9 +1275,21 @@ static int wake_affine(struct sched_doma
 	 * Otherwise check if either cpus are near enough in load to allow this
 	 * task to be woken on this_cpu.
 	 */
-	balanced = !this_load ||
-		100*(this_load + effective_load(tg, this_cpu, weight, weight)) <=
-		imbalance*(load + effective_load(tg, prev_cpu, 0, weight));
+	if (this_load) {
+		unsigned long this_eff_load, prev_eff_load;
+
+		this_eff_load = 100;
+		this_eff_load *= cpu_power(prev_cpu);
+		this_eff_load *= this_load +
+			effective_load(tg, this_cpu, weight, weight);
+
+		prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
+		prev_eff_load *= cpu_power(this_cpu);
+		prev_eff_load *= load + effective_load(tg, prev_cpu, 0, weight);
+
+		balanced = this_eff_load <= prev_eff_load;
+	} else
+		balanced = true;

 	/*
 	 * If the currently running task will sleep within
* Re: commit e9e9250b: sync wakeup bustage when waker is an RT task

From: Mike Galbraith @ 2010-05-31 13:56 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Ingo Molnar, LKML, Thomas Gleixner

On Mon, 2010-05-31 at 13:56 +0200, Peter Zijlstra wrote:
> On Sun, 2010-05-16 at 09:21 +0200, Mike Galbraith wrote:
> > sched: RT waker sync wakeup bugfix
> >
> > An RT waker's weight is not on the runqueue, but we try to subtract it
> > anyway in the sync wakeup case, sending this_load negative.  This leads
> > to affine wakeup failure in cases where it should succeed.  This was
> > found while testing a PREEMPT_RT kernel with lmbench's lat_udp.  In a
> > PREEMPT_RT kernel, softirq threads act as a ~proxy for the !RT buddy.
> > Approximate !PREEMPT_RT sync wakeup behavior by looking at the buddy
> > instead, and subtracting the maximum task weight that will not send
> > this_load negative.
>
> Does the below work?

Had to make a dinky adjustment for x86-tip-rt33, but yeah, the horrid
latencies are gone.  (hm, skinnier than expected too)

	-Mike
* Re: commit e9e9250b: sync wakeup bustage when waker is an RT task

From: Peter Zijlstra @ 2010-05-31 14:28 UTC (permalink / raw)
To: Mike Galbraith; +Cc: Ingo Molnar, LKML, Thomas Gleixner

On Mon, 2010-05-31 at 15:56 +0200, Mike Galbraith wrote:
> On Mon, 2010-05-31 at 13:56 +0200, Peter Zijlstra wrote:
> > On Sun, 2010-05-16 at 09:21 +0200, Mike Galbraith wrote:
> > > sched: RT waker sync wakeup bugfix
> > >
> > > An RT waker's weight is not on the runqueue, but we try to subtract it
> > > anyway in the sync wakeup case, sending this_load negative.  This leads
> > > to affine wakeup failure in cases where it should succeed.  This was
> > > found while testing a PREEMPT_RT kernel with lmbench's lat_udp.  In a
> > > PREEMPT_RT kernel, softirq threads act as a ~proxy for the !RT buddy.
> > > Approximate !PREEMPT_RT sync wakeup behavior by looking at the buddy
> > > instead, and subtracting the maximum task weight that will not send
> > > this_load negative.
> >
> > Does the below work?
>
> Had to make a dinky adjustment for x86-tip-rt33, but yeah, the horrid
> latencies are gone.  (hm, skinnier than expected too)

Here's one that should be even skinnier ;-)  I knew I already had an
accessor: power_of().  Also, I duplicated the cpu_power variable inside
struct rq so we don't have to chase three pointers to get it.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c      |   24 ++++++------------------
 kernel/sched_fair.c |   22 ++++++++++++++++------
 2 files changed, 22 insertions(+), 24 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -545,6 +545,8 @@ struct rq {
 	struct root_domain *rd;
 	struct sched_domain *sd;

+	unsigned long cpu_power;
+
 	unsigned char idle_at_tick;
 	/* For active balancing */
 	int post_schedule;
@@ -1521,24 +1523,9 @@ static unsigned long target_load(int cpu
 	return max(rq->cpu_load[type-1], total);
 }

-static struct sched_group *group_of(int cpu)
-{
-	struct sched_domain *sd = rcu_dereference_sched(cpu_rq(cpu)->sd);
-
-	if (!sd)
-		return NULL;
-
-	return sd->groups;
-}
-
 static unsigned long power_of(int cpu)
 {
-	struct sched_group *group = group_of(cpu);
-
-	if (!group)
-		return SCHED_LOAD_SCALE;
-
-	return group->cpu_power;
+	return cpu_rq(cpu)->cpu_power;
 }

 static int task_hot(struct task_struct *p, u64 now, struct sched_domain *sd);
@@ -1877,8 +1864,8 @@ static void dec_nr_running(struct rq *rq
 static void set_load_weight(struct task_struct *p)
 {
 	if (task_has_rt_policy(p)) {
-		p->se.load.weight = prio_to_weight[0] * 2;
-		p->se.load.inv_weight = prio_to_wmult[0] >> 1;
+		p->se.load.weight = 0;
+		p->se.load.inv_weight = WMULT_CONST;
 		return;
 	}
@@ -7715,6 +7702,7 @@ void __init sched_init(void)
 #ifdef CONFIG_SMP
 		rq->sd = NULL;
 		rq->rd = NULL;
+		rq->cpu_power = SCHED_LOAD_SCALE;
 		rq->post_schedule = 0;
 		rq->active_balance = 0;
 		rq->next_balance = jiffies;

Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -1225,7 +1225,6 @@ static int wake_affine(struct sched_doma
 	unsigned long this_load, load;
 	int idx, this_cpu, prev_cpu;
 	unsigned long tl_per_task;
-	unsigned int imbalance;
 	struct task_group *tg;
 	unsigned long weight;
 	int balanced;
@@ -1252,8 +1251,6 @@ static int wake_affine(struct sched_doma
 	tg = task_group(p);
 	weight = p->se.load.weight;

-	imbalance = 100 + (sd->imbalance_pct - 100) / 2;
-
 	/*
 	 * In low-load situations, where prev_cpu is idle and this_cpu is idle
 	 * due to the sync cause above having dropped this_load to 0, we'll
@@ -1263,9 +1260,21 @@ static int wake_affine(struct sched_doma
 	 * Otherwise check if either cpus are near enough in load to allow this
 	 * task to be woken on this_cpu.
 	 */
-	balanced = !this_load ||
-		100*(this_load + effective_load(tg, this_cpu, weight, weight)) <=
-		imbalance*(load + effective_load(tg, prev_cpu, 0, weight));
+	if (this_load) {
+		unsigned long this_eff_load, prev_eff_load;
+
+		this_eff_load = 100;
+		this_eff_load *= power_of(prev_cpu);
+		this_eff_load *= this_load +
+			effective_load(tg, this_cpu, weight, weight);
+
+		prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
+		prev_eff_load *= power_of(this_cpu);
+		prev_eff_load *= load + effective_load(tg, prev_cpu, 0, weight);
+
+		balanced = this_eff_load <= prev_eff_load;
+	} else
+		balanced = true;

 	/*
 	 * If the currently running task will sleep within
@@ -2298,6 +2307,7 @@ static void update_cpu_power(struct sche
 	if (!power)
 		power = 1;

+	cpu_rq(cpu)->cpu_power = power;
 	sdg->cpu_power = power;
 }
* Re: commit e9e9250b: sync wakeup bustage when waker is an RT task

From: Mike Galbraith @ 2010-05-31 18:03 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Ingo Molnar, LKML, Thomas Gleixner

On Mon, 2010-05-31 at 16:28 +0200, Peter Zijlstra wrote:
> On Mon, 2010-05-31 at 15:56 +0200, Mike Galbraith wrote:
> > > Does the below work?
> >
> > Had to make a dinky adjustment for x86-tip-rt33, but yeah, the horrid
> > latencies are gone.  (hm, skinnier than expected too)
>
> Here's one that should be even skinnier ;-)

Cool, box is into skeletal.

> I knew I already had an accessor: power_of().  Also, I duplicated the
> cpu_power variable inside struct rq so we don't have to chase three
> pointers to get it.

Looks good, but I'll retest anyway.

	-Mike
* Re: commit e9e9250b: sync wakeup bustage when waker is an RT task

From: Mike Galbraith @ 2010-06-01  6:40 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Ingo Molnar, LKML, Thomas Gleixner

On Mon, 2010-05-31 at 20:03 +0200, Mike Galbraith wrote:
> Looks good, but I'll retest anyway.

I wedged the skinnier version into x86-tip-rt33, works fine too.
(todo: non-rt config testing)

	-Mike
* [tip:sched/urgent] sched: Fix wake_affine() vs RT tasks

From: tip-bot for Peter Zijlstra @ 2010-06-01  9:12 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, a.p.zijlstra, efault, tglx, mingo

Commit-ID:  e51fd5e22e12b39f49b1bb60b37b300b17378a43
Gitweb:     http://git.kernel.org/tip/e51fd5e22e12b39f49b1bb60b37b300b17378a43
Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
AuthorDate: Mon, 31 May 2010 12:37:30 +0200
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Tue, 1 Jun 2010 09:27:16 +0200

sched: Fix wake_affine() vs RT tasks

Mike reports that since e9e9250b (sched: Scale down cpu_power due to RT
tasks), wake_affine() goes funny on RT tasks due to them still having a
!0 weight and wake_affine() still subtracting that from the rq weight.

Since nobody should be using se->weight for RT tasks, set the value to
zero.  Also, since we now use ->cpu_power to normalize rq weights to
account for RT cpu usage, add that factor into the imbalance
computation.
Reported-by: Mike Galbraith <efault@gmx.de>
Tested-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1275316109.27810.22969.camel@twins>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c      |   24 ++++++------------------
 kernel/sched_fair.c |   22 ++++++++++++++++------
 2 files changed, 22 insertions(+), 24 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index d484081..f8b8996 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -544,6 +544,8 @@ struct rq {
 	struct root_domain *rd;
 	struct sched_domain *sd;

+	unsigned long cpu_power;
+
 	unsigned char idle_at_tick;
 	/* For active balancing */
 	int post_schedule;
@@ -1499,24 +1501,9 @@ static unsigned long target_load(int cpu, int type)
 	return max(rq->cpu_load[type-1], total);
 }

-static struct sched_group *group_of(int cpu)
-{
-	struct sched_domain *sd = rcu_dereference_sched(cpu_rq(cpu)->sd);
-
-	if (!sd)
-		return NULL;
-
-	return sd->groups;
-}
-
 static unsigned long power_of(int cpu)
 {
-	struct sched_group *group = group_of(cpu);
-
-	if (!group)
-		return SCHED_LOAD_SCALE;
-
-	return group->cpu_power;
+	return cpu_rq(cpu)->cpu_power;
 }

 static int task_hot(struct task_struct *p, u64 now, struct sched_domain *sd);
@@ -1854,8 +1841,8 @@ static void dec_nr_running(struct rq *rq)
 static void set_load_weight(struct task_struct *p)
 {
 	if (task_has_rt_policy(p)) {
-		p->se.load.weight = prio_to_weight[0] * 2;
-		p->se.load.inv_weight = prio_to_wmult[0] >> 1;
+		p->se.load.weight = 0;
+		p->se.load.inv_weight = WMULT_CONST;
 		return;
 	}
@@ -7605,6 +7592,7 @@ void __init sched_init(void)
 #ifdef CONFIG_SMP
 		rq->sd = NULL;
 		rq->rd = NULL;
+		rq->cpu_power = SCHED_LOAD_SCALE;
 		rq->post_schedule = 0;
 		rq->active_balance = 0;
 		rq->next_balance = jiffies;
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 217e4a9..eed35ed 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1225,7 +1225,6 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 	unsigned long this_load, load;
 	int idx, this_cpu, prev_cpu;
 	unsigned long tl_per_task;
-	unsigned int imbalance;
 	struct task_group *tg;
 	unsigned long weight;
 	int balanced;
@@ -1252,8 +1251,6 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 	tg = task_group(p);
 	weight = p->se.load.weight;

-	imbalance = 100 + (sd->imbalance_pct - 100) / 2;
-
 	/*
 	 * In low-load situations, where prev_cpu is idle and this_cpu is idle
 	 * due to the sync cause above having dropped this_load to 0, we'll
@@ -1263,9 +1260,21 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 	 * Otherwise check if either cpus are near enough in load to allow this
 	 * task to be woken on this_cpu.
 	 */
-	balanced = !this_load ||
-		100*(this_load + effective_load(tg, this_cpu, weight, weight)) <=
-		imbalance*(load + effective_load(tg, prev_cpu, 0, weight));
+	if (this_load) {
+		unsigned long this_eff_load, prev_eff_load;
+
+		this_eff_load = 100;
+		this_eff_load *= power_of(prev_cpu);
+		this_eff_load *= this_load +
+			effective_load(tg, this_cpu, weight, weight);
+
+		prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
+		prev_eff_load *= power_of(this_cpu);
+		prev_eff_load *= load + effective_load(tg, prev_cpu, 0, weight);
+
+		balanced = this_eff_load <= prev_eff_load;
+	} else
+		balanced = true;

 	/*
 	 * If the currently running task will sleep within
@@ -2298,6 +2307,7 @@ static void update_cpu_power(struct sched_domain *sd, int cpu)
 	if (!power)
 		power = 1;

+	cpu_rq(cpu)->cpu_power = power;
 	sdg->cpu_power = power;
 }