* commit e9e9250b: sync wakeup bustage when waker is an RT task
@ 2010-05-15 11:57 Mike Galbraith

From: Mike Galbraith @ 2010-05-15 11:57 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Ingo Molnar, LKML

Hi Peter,

This commit excluded RT tasks from rq->load, was that intentional?  The
comment in struct rq states that load reflects *all* tasks, but since
this commit, that's no longer true.

Looking at lmbench lat_udp in a PREEMPT_RT kernel, I noticed that
wake_affine() is failing for sync wakeups when it should not.  It's
doing so because the waker in this case is an RT kernel thread
(sirq-net-rx) - we subtract the sync waker's weight, when it was never
added in the first place, resulting in this_load going gaga.  End result
is quite high latency numbers due to tasks jabbering cross-cache.

If the exclusion was intentional, I suppose I can do a waker class check
in wake_affine() to fix it.

	-Mike

^ permalink raw reply	[flat|nested] 15+ messages in thread
* Re: commit e9e9250b: sync wakeup bustage when waker is an RT task

From: Peter Zijlstra @ 2010-05-15 12:04 UTC (permalink / raw)
To: Mike Galbraith; +Cc: Ingo Molnar, LKML

On Sat, 2010-05-15 at 13:57 +0200, Mike Galbraith wrote:
> Hi Peter,
>
> This commit excluded RT tasks from rq->load, was that intentional?  The
> comment in struct rq states that load reflects *all* tasks, but since
> this commit, that's no longer true.

Right, because a static load value does not accurately reflect an RT task,
which can run as long as it pretty well pleases.  So instead we measure
the time spent running !fair tasks and scale down the cpu_power
proportionally.

> Looking at lmbench lat_udp in a PREEMPT_RT kernel, I noticed that
> wake_affine() is failing for sync wakeups when it should not.  It's
> doing so because the waker in this case is an RT kernel thread
> (sirq-net-rx) - we subtract the sync waker's weight, when it was never
> added in the first place, resulting in this_load going gaga.  End result
> is quite high latency numbers due to tasks jabbering cross-cache.
>
> If the exclusion was intentional, I suppose I can do a waker class check
> in wake_affine() to fix it.

So basically make all RT wakeups sync?
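The cpu_power scaling Peter describes can be sketched outside the kernel roughly as follows.  This is a toy model, not the kernel's actual implementation: the function name, the minimum-capacity floor, and the fixed averaging period are illustrative stand-ins; the real code uses a decaying average of RT runtime.

```c
#include <assert.h>

/* Nominal capacity of one CPU, mirroring the kernel's SCHED_LOAD_SCALE. */
#define SCHED_LOAD_SCALE 1024UL

/*
 * Scale a CPU's capacity down by the fraction of a recent period that was
 * consumed by !fair (e.g. RT) tasks, leaving a minimum of 1 so the value
 * never reaches zero.  rt_time and period are in the same time unit.
 */
static unsigned long scale_power_for_rt(unsigned long rt_time,
					unsigned long period)
{
	unsigned long available;

	if (rt_time >= period)
		return 1;	/* RT consumed everything; keep a floor */

	available = period - rt_time;
	return SCHED_LOAD_SCALE * available / period;
}
```

With this shape, a CPU whose RT threads ate half the period advertises half its nominal capacity, so the fair-class load balancer steers proportionally less CFS weight to it.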
* Re: commit e9e9250b: sync wakeup bustage when waker is an RT task

From: Mike Galbraith @ 2010-05-15 17:07 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Ingo Molnar, LKML

On Sat, 2010-05-15 at 14:04 +0200, Peter Zijlstra wrote:
> On Sat, 2010-05-15 at 13:57 +0200, Mike Galbraith wrote:
> > Hi Peter,
> >
> > This commit excluded RT tasks from rq->load, was that intentional?  The
> > comment in struct rq states that load reflects *all* tasks, but since
> > this commit, that's no longer true.
>
> Right, because a static load value does not accurately reflect an RT task,
> which can run as long as it pretty well pleases.  So instead we measure
> the time spent running !fair tasks and scale down the cpu_power
> proportionally.
>
> > Looking at lmbench lat_udp in a PREEMPT_RT kernel, I noticed that
> > wake_affine() is failing for sync wakeups when it should not.  It's
> > doing so because the waker in this case is an RT kernel thread
> > (sirq-net-rx) - we subtract the sync waker's weight, when it was never
> > added in the first place, resulting in this_load going gaga.  End result
> > is quite high latency numbers due to tasks jabbering cross-cache.
> >
> > If the exclusion was intentional, I suppose I can do a waker class check
> > in wake_affine() to fix it.
>
> So basically make all RT wakeups sync?

I was going to just skip subtracting the waker's weight, ala

	/*
	 * If sync wakeup then subtract the (maximum possible)
	 * effect of the currently running task from the load
	 * of the current CPU:
	 */
	if (sync && !task_has_rt_policy(curr))
		...

	-Mike
* Re: commit e9e9250b: sync wakeup bustage when waker is an RT task

From: Mike Galbraith @ 2010-05-15 17:32 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Ingo Molnar, LKML

On Sat, 2010-05-15 at 19:07 +0200, Mike Galbraith wrote:
> On Sat, 2010-05-15 at 14:04 +0200, Peter Zijlstra wrote:
> > So basically make all RT wakeups sync?
>
> I was going to just skip subtracting the waker's weight, ala
>
> 	/*
> 	 * If sync wakeup then subtract the (maximum possible)
> 	 * effect of the currently running task from the load
> 	 * of the current CPU:
> 	 */
> 	if (sync && !task_has_rt_policy(curr))
> 		...

(not.  hohum, something to piddle with over java mañana)
* Re: commit e9e9250b: sync wakeup bustage when waker is an RT task

From: Mike Galbraith @ 2010-05-16  7:21 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Ingo Molnar, LKML, Thomas Gleixner

On Sat, 2010-05-15 at 19:07 +0200, Mike Galbraith wrote:
> On Sat, 2010-05-15 at 14:04 +0200, Peter Zijlstra wrote:
> > On Sat, 2010-05-15 at 13:57 +0200, Mike Galbraith wrote:
> > > Hi Peter,
> > >
> > > This commit excluded RT tasks from rq->load, was that intentional?  The
> > > comment in struct rq states that load reflects *all* tasks, but since
> > > this commit, that's no longer true.
> >
> > Right, because a static load value does not accurately reflect an RT task,
> > which can run as long as it pretty well pleases.  So instead we measure
> > the time spent running !fair tasks and scale down the cpu_power
> > proportionally.
> >
> > > Looking at lmbench lat_udp in a PREEMPT_RT kernel, I noticed that
> > > wake_affine() is failing for sync wakeups when it should not.  It's
> > > doing so because the waker in this case is an RT kernel thread
> > > (sirq-net-rx) - we subtract the sync waker's weight, when it was never
> > > added in the first place, resulting in this_load going gaga.  End result
> > > is quite high latency numbers due to tasks jabbering cross-cache.
> > >
> > > If the exclusion was intentional, I suppose I can do a waker class check
> > > in wake_affine() to fix it.
> >
> > So basically make all RT wakeups sync?
>
> I was going to just skip subtracting the waker's weight, ala
>
> 	/*
> 	 * If sync wakeup then subtract the (maximum possible)
> 	 * effect of the currently running task from the load
> 	 * of the current CPU:
> 	 */
> 	if (sync && !task_has_rt_policy(curr))

One-liner doesn't work.  We have one task on the cfs_rq, the one who is
the waker in !PREEMPT_RT, which is a fail case for wake_affine() if you
don't do the weight subtraction.  I did the below instead.

sched: RT waker sync wakeup bugfix

An RT waker's weight is not on the runqueue, but we try to subtract it
anyway in the sync wakeup case, sending this_load negative.  This leads
to affine wakeup failure in cases where it should succeed.  This was
found while testing a PREEMPT_RT kernel with lmbench's lat_udp.  In a
PREEMPT_RT kernel, softirq threads act as a ~proxy for the !RT buddy.
Approximate !PREEMPT_RT sync wakeup behavior by looking at the buddy
instead, and subtracting the maximum task weight that will not send
this_load negative.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
LKML-Reference: <new-submission>

 kernel/sched_fair.c |    9 +++++++++
 1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 5240469..cc40849 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1280,6 +1280,15 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 		tg = task_group(current);
 		weight = current->se.load.weight;

+		/*
+		 * An RT waker's weight is not on the runqueue.  Subtract the
+		 * maximum task weight that will not send this_load negative.
+		 */
+		if (task_has_rt_policy(current)) {
+			weight = max_t(unsigned long, NICE_0_LOAD, p->se.load.weight);
+			weight = min(weight, this_load);
+		}
+
 		this_load += effective_load(tg, this_cpu, -weight, -weight);
 		load += effective_load(tg, prev_cpu, 0, -weight);
 	}
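The clamp in the patch above can be sketched in isolation.  A minimal standalone model, with illustrative values (not kernel code): since an RT waker's weight was never added to this_load, the patch instead subtracts the wakee's weight, raised to at least NICE_0_LOAD, but capped so this_load can never underflow.

```c
#include <assert.h>

/* Weight of a nice-0 task, mirroring the kernel's NICE_0_LOAD. */
#define NICE_0_LOAD 1024UL

/*
 * Pick the weight to subtract for a sync wakeup by an RT waker:
 * at least NICE_0_LOAD, at most the wakee's own weight, and never
 * more than this_load itself (so this_load - weight >= 0).
 */
static unsigned long sync_weight(unsigned long wakee_weight,
				 unsigned long this_load)
{
	unsigned long weight = wakee_weight;

	if (weight < NICE_0_LOAD)
		weight = NICE_0_LOAD;	/* max_t(unsigned long, NICE_0_LOAD, ...) */
	if (weight > this_load)
		weight = this_load;	/* min(weight, this_load) */
	return weight;
}
```

The second clamp is the actual bug fix: with an unsigned this_load, subtracting a weight that was never added would wrap around to a huge value ("going gaga"), defeating the affine-wakeup check.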
* Re: commit e9e9250b: sync wakeup bustage when waker is an RT task

From: Mike Galbraith @ 2010-05-17  4:38 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Ingo Molnar, LKML, Thomas Gleixner

On Sun, 2010-05-16 at 09:21 +0200, Mike Galbraith wrote:
> On Sat, 2010-05-15 at 19:07 +0200, Mike Galbraith wrote:
> > On Sat, 2010-05-15 at 14:04 +0200, Peter Zijlstra wrote:
> > > On Sat, 2010-05-15 at 13:57 +0200, Mike Galbraith wrote:
> > > > Hi Peter,
> > > >
> > > > This commit excluded RT tasks from rq->load, was that intentional?  The
> > > > comment in struct rq states that load reflects *all* tasks, but since
> > > > this commit, that's no longer true.
> > >
> > > Right, because a static load value does not accurately reflect an RT task,
> > > which can run as long as it pretty well pleases.  So instead we measure
> > > the time spent running !fair tasks and scale down the cpu_power
> > > proportionally.
> > >
> > > > Looking at lmbench lat_udp in a PREEMPT_RT kernel, I noticed that
> > > > wake_affine() is failing for sync wakeups when it should not.  It's
> > > > doing so because the waker in this case is an RT kernel thread
> > > > (sirq-net-rx) - we subtract the sync waker's weight, when it was never
> > > > added in the first place, resulting in this_load going gaga.  End result
> > > > is quite high latency numbers due to tasks jabbering cross-cache.
> > > >
> > > > If the exclusion was intentional, I suppose I can do a waker class check
> > > > in wake_affine() to fix it.
> > >
> > > So basically make all RT wakeups sync?
> >
> > I was going to just skip subtracting the waker's weight, ala
> >
> > 	/*
> > 	 * If sync wakeup then subtract the (maximum possible)
> > 	 * effect of the currently running task from the load
> > 	 * of the current CPU:
> > 	 */
> > 	if (sync && !task_has_rt_policy(curr))
>
> One-liner doesn't work.  We have one task on the cfs_rq, the one who is
> the waker in !PREEMPT_RT, which is a fail case for wake_affine() if you
> don't do the weight subtraction.  I did the below instead.

(Which is kinda fugly, only redeeming factor is it works;).

What would be the harm/consequence of restoring RT tasks to rq->load so
the wake_affine()::sync logic just worked as before without hackery?

The weight is a more or less random number, but looking around, with
them excluded, avg_load_per_task is lowered when RT tasks enter the
system, and rq->load[] misses their weight.  (Dunno what effect it has
on tg shares).

	-Mike

> sched: RT waker sync wakeup bugfix
>
> An RT waker's weight is not on the runqueue, but we try to subtract it
> anyway in the sync wakeup case, sending this_load negative.  This leads
> to affine wakeup failure in cases where it should succeed.  This was
> found while testing a PREEMPT_RT kernel with lmbench's lat_udp.  In a
> PREEMPT_RT kernel, softirq threads act as a ~proxy for the !RT buddy.
> Approximate !PREEMPT_RT sync wakeup behavior by looking at the buddy
> instead, and subtracting the maximum task weight that will not send
> this_load negative.
>
> Signed-off-by: Mike Galbraith <efault@gmx.de>
> Cc: Ingo Molnar <mingo@elte.hu>
> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> LKML-Reference: <new-submission>
>
>  kernel/sched_fair.c |    9 +++++++++
>  1 files changed, 9 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index 5240469..cc40849 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -1280,6 +1280,15 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
>  		tg = task_group(current);
>  		weight = current->se.load.weight;
>
> +		/*
> +		 * An RT waker's weight is not on the runqueue.  Subtract the
> +		 * maximum task weight that will not send this_load negative.
> +		 */
> +		if (task_has_rt_policy(current)) {
> +			weight = max_t(unsigned long, NICE_0_LOAD, p->se.load.weight);
> +			weight = min(weight, this_load);
> +		}
> +
>  		this_load += effective_load(tg, this_cpu, -weight, -weight);
>  		load += effective_load(tg, prev_cpu, 0, -weight);
>  	}
* Re: commit e9e9250b: sync wakeup bustage when waker is an RT task

From: Peter Zijlstra @ 2010-05-17  8:49 UTC (permalink / raw)
To: Mike Galbraith; +Cc: Ingo Molnar, LKML, Thomas Gleixner

On Mon, 2010-05-17 at 06:38 +0200, Mike Galbraith wrote:
> What would be the harm/consequence of restoring RT tasks to rq->load so
> the wake_affine()::sync logic just worked as before without hackery?

Well, you'd have to constantly adjust the task weight of RT tasks to
reflect their actual consumption.  Not really feasible.

So the proportional stuff works like:

  slice_i = w_i / (\Sum_j w_j) * dt

Giving an RT task a sensible weight, we'd have to reverse that:

  w_i = slice_i/dt * (\Sum_j w_j)

which is something that depends on the rq->load, so every time you
change the rq->load you'd have to recompute the weight of all the RT
tasks, which again changes the rq->load (got a head-ache already? :-)

> The weight is a more or less random number, but looking around, with
> them excluded, avg_load_per_task is lowered when RT tasks enter the
> system, and rq->load[] misses their weight.  (Dunno what effect it has
> on tg shares).

Well, those things are more or less a 'good' thing, it makes it purely
about sched_fair.

So the thing to do I think is to teach wake_affine about cpu_power,
because that is what includes the RT tasks.

The proper comparison of rq weights (like the regular load balancer
already does) is:

  A->load / A->cpu_power ~ B->load / B->cpu_power

The lower the cpu_power of a particular cpu, the less processing
capacity it has, the smaller its share of the total weight should be to
provide equal work for each task.
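The capacity-normalized comparison Peter describes can be sketched with plain integers.  This is an illustrative model, not the kernel code: dividing each rq's load by its cpu_power is rewritten as a cross-multiplication, so the fast path needs no division and no floating point.

```c
#include <assert.h>

/*
 * Compare two runqueues' loads normalized by capacity:
 *
 *   this_load / this_power  <=  prev_load / prev_power
 *
 * cross-multiplied (both powers are positive) into
 *
 *   this_load * prev_power  <=  prev_load * this_power
 *
 * Returns nonzero when this_cpu is no more loaded, per unit of
 * capacity, than prev_cpu.
 */
static int this_cpu_preferred(unsigned long this_load, unsigned long this_power,
			      unsigned long prev_load, unsigned long prev_power)
{
	return this_load * prev_power <= prev_load * this_power;
}
```

With equal capacities this degenerates to a plain load comparison; when RT activity has shrunk one CPU's cpu_power, the same raw load weighs proportionally more on that CPU.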
* Re: commit e9e9250b: sync wakeup bustage when waker is an RT task

From: Peter Zijlstra @ 2010-05-17  8:52 UTC (permalink / raw)
To: Mike Galbraith; +Cc: Ingo Molnar, LKML, Thomas Gleixner

On Mon, 2010-05-17 at 10:49 +0200, Peter Zijlstra wrote:
> On Mon, 2010-05-17 at 06:38 +0200, Mike Galbraith wrote:
> > What would be the harm/consequence of restoring RT tasks to rq->load so
> > the wake_affine()::sync logic just worked as before without hackery?
>
> Well, you'd have to constantly adjust the task weight of RT tasks to
> reflect their actual consumption.  Not really feasible.
>
> So the proportional stuff works like:
>
>   slice_i = w_i / (\Sum_j w_j) * dt
>
> Giving an RT task a sensible weight, we'd have to reverse that:
>
>   w_i = slice_i/dt * (\Sum_j w_j)

Another point to note is that this requires we track per-RT-task usage
averages, whereas the cpu_power approach simply lumps everything !fair
(one of the things still on the TODO list is to account for IRQ
overhead) into a single large bucket and doesn't care.
* Re: commit e9e9250b: sync wakeup bustage when waker is an RT task

From: Mike Galbraith @ 2010-05-17  9:04 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Ingo Molnar, LKML, Thomas Gleixner

On Mon, 2010-05-17 at 10:49 +0200, Peter Zijlstra wrote:
> On Mon, 2010-05-17 at 06:38 +0200, Mike Galbraith wrote:
> > What would be the harm/consequence of restoring RT tasks to rq->load so
> > the wake_affine()::sync logic just worked as before without hackery?
>
> Well, you'd have to constantly adjust the task weight of RT tasks to
> reflect their actual consumption.  Not really feasible.

Egad, forget that.

> So the proportional stuff works like:
>
>   slice_i = w_i / (\Sum_j w_j) * dt
>
> Giving an RT task a sensible weight, we'd have to reverse that:
>
>   w_i = slice_i/dt * (\Sum_j w_j)
>
> which is something that depends on the rq->load, so every time you
> change the rq->load you'd have to recompute the weight of all the RT
> tasks, which again changes the rq->load (got a head-ache already? :-)

Yup.

> > The weight is a more or less random number, but looking around, with
> > them excluded, avg_load_per_task is lowered when RT tasks enter the
> > system, and rq->load[] misses their weight.  (Dunno what effect it has
> > on tg shares).
>
> Well, those things are more or less a 'good' thing, it makes it purely
> about sched_fair.

(Yeah, I was pondering up/down sides)

> So the thing to do I think is to teach wake_affine about cpu_power,
> because that is what includes the RT tasks.
>
> The proper comparison of rq weights (like the regular load balancer
> already does) is:
>
>   A->load / A->cpu_power ~ B->load / B->cpu_power
>
> The lower the cpu_power of a particular cpu, the less processing
> capacity it has, the smaller its share of the total weight should be to
> provide equal work for each task.

Hm, sounds kinda heavy/complicated for the fast path.  I think I like
the little hack better than trying to teach it about cpu_power :)

	-Mike
* Re: commit e9e9250b: sync wakeup bustage when waker is an RT task

From: Peter Zijlstra @ 2010-05-31 11:56 UTC (permalink / raw)
To: Mike Galbraith; +Cc: Ingo Molnar, LKML, Thomas Gleixner

On Sun, 2010-05-16 at 09:21 +0200, Mike Galbraith wrote:
> sched: RT waker sync wakeup bugfix
>
> An RT waker's weight is not on the runqueue, but we try to subtract it
> anyway in the sync wakeup case, sending this_load negative.  This leads
> to affine wakeup failure in cases where it should succeed.  This was
> found while testing a PREEMPT_RT kernel with lmbench's lat_udp.  In a
> PREEMPT_RT kernel, softirq threads act as a ~proxy for the !RT buddy.
> Approximate !PREEMPT_RT sync wakeup behavior by looking at the buddy
> instead, and subtracting the maximum task weight that will not send
> this_load negative.

Does the below work?
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c      |    4 ++--
 kernel/sched_fair.c |   36 ++++++++++++++++++++++++++++++------
 2 files changed, 32 insertions(+), 8 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -1877,8 +1877,8 @@ static void dec_nr_running(struct rq *rq
 static void set_load_weight(struct task_struct *p)
 {
 	if (task_has_rt_policy(p)) {
-		p->se.load.weight = prio_to_weight[0] * 2;
-		p->se.load.inv_weight = prio_to_wmult[0] >> 1;
+		p->se.load.weight = 0;
+		p->se.load.inv_weight = WMULT_CONST;
 		return;
 	}

Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -1220,12 +1220,26 @@ static inline unsigned long effective_lo

 #endif

+static unsigned long cpu_power(int cpu)
+{
+	struct sched_domain *sd;
+	struct sched_group *sg;
+
+	sd = rcu_dereference_check_sched_domain(cpu_rq(cpu)->sd);
+	if (!sd)
+		return 1024;
+	sg = sd->groups;
+	if (!sg)
+		return 1024;
+
+	return sg->cpu_power;
+}
+
 static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 {
 	unsigned long this_load, load;
 	int idx, this_cpu, prev_cpu;
 	unsigned long tl_per_task;
-	unsigned int imbalance;
 	struct task_group *tg;
 	unsigned long weight;
 	int balanced;
@@ -1252,8 +1266,6 @@ static int wake_affine(struct sched_doma
 	tg = task_group(p);
 	weight = p->se.load.weight;

-	imbalance = 100 + (sd->imbalance_pct - 100) / 2;
-
 	/*
 	 * In low-load situations, where prev_cpu is idle and this_cpu is idle
 	 * due to the sync cause above having dropped this_load to 0, we'll
@@ -1263,9 +1275,21 @@ static int wake_affine(struct sched_doma
 	 * Otherwise check if either cpus are near enough in load to allow this
 	 * task to be woken on this_cpu.
 	 */
-	balanced = !this_load ||
-		100*(this_load + effective_load(tg, this_cpu, weight, weight)) <=
-		imbalance*(load + effective_load(tg, prev_cpu, 0, weight));
+	if (this_load) {
+		unsigned long this_eff_load, prev_eff_load;
+
+		this_eff_load = 100;
+		this_eff_load *= cpu_power(prev_cpu);
+		this_eff_load *= this_load +
+			effective_load(tg, this_cpu, weight, weight);
+
+		prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
+		prev_eff_load *= cpu_power(this_cpu);
+		prev_eff_load *= load + effective_load(tg, prev_cpu, 0, weight);
+
+		balanced = this_eff_load <= prev_eff_load;
+	} else
+		balanced = true;

 	/*
 	 * If the currently running task will sleep within
* Re: commit e9e9250b: sync wakeup bustage when waker is an RT task

From: Mike Galbraith @ 2010-05-31 13:56 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Ingo Molnar, LKML, Thomas Gleixner

On Mon, 2010-05-31 at 13:56 +0200, Peter Zijlstra wrote:
> On Sun, 2010-05-16 at 09:21 +0200, Mike Galbraith wrote:
> > sched: RT waker sync wakeup bugfix
> >
> > An RT waker's weight is not on the runqueue, but we try to subtract it
> > anyway in the sync wakeup case, sending this_load negative.  This leads
> > to affine wakeup failure in cases where it should succeed.  This was
> > found while testing a PREEMPT_RT kernel with lmbench's lat_udp.  In a
> > PREEMPT_RT kernel, softirq threads act as a ~proxy for the !RT buddy.
> > Approximate !PREEMPT_RT sync wakeup behavior by looking at the buddy
> > instead, and subtracting the maximum task weight that will not send
> > this_load negative.
>
> Does the below work?

Had to make a dinky adjustment for x86-tip-rt33, but yeah, the horrid
latencies are gone.  (hm, skinnier than expected too)

	-Mike
* Re: commit e9e9250b: sync wakeup bustage when waker is an RT task

From: Peter Zijlstra @ 2010-05-31 14:28 UTC (permalink / raw)
To: Mike Galbraith; +Cc: Ingo Molnar, LKML, Thomas Gleixner

On Mon, 2010-05-31 at 15:56 +0200, Mike Galbraith wrote:
> On Mon, 2010-05-31 at 13:56 +0200, Peter Zijlstra wrote:
> > On Sun, 2010-05-16 at 09:21 +0200, Mike Galbraith wrote:
> > > sched: RT waker sync wakeup bugfix
> > >
> > > An RT waker's weight is not on the runqueue, but we try to subtract it
> > > anyway in the sync wakeup case, sending this_load negative.  This leads
> > > to affine wakeup failure in cases where it should succeed.  This was
> > > found while testing a PREEMPT_RT kernel with lmbench's lat_udp.  In a
> > > PREEMPT_RT kernel, softirq threads act as a ~proxy for the !RT buddy.
> > > Approximate !PREEMPT_RT sync wakeup behavior by looking at the buddy
> > > instead, and subtracting the maximum task weight that will not send
> > > this_load negative.
> >
> > Does the below work?
>
> Had to make a dinky adjustment for x86-tip-rt33, but yeah, the horrid
> latencies are gone.  (hm, skinnier than expected too)

Here's one that should be even skinnier ;-)  I knew I already had an
accessor: power_of().  Also, I duplicated the cpu_power variable inside
struct rq so we don't have to chase three pointers to get it.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/sched.c      |   24 ++++++------------------
 kernel/sched_fair.c |   22 ++++++++++++++++------
 2 files changed, 22 insertions(+), 24 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -545,6 +545,8 @@ struct rq {
 	struct root_domain *rd;
 	struct sched_domain *sd;

+	unsigned long cpu_power;
+
 	unsigned char idle_at_tick;
 	/* For active balancing */
 	int post_schedule;
@@ -1521,24 +1523,9 @@ static unsigned long target_load(int cpu
 	return max(rq->cpu_load[type-1], total);
 }

-static struct sched_group *group_of(int cpu)
-{
-	struct sched_domain *sd = rcu_dereference_sched(cpu_rq(cpu)->sd);
-
-	if (!sd)
-		return NULL;
-
-	return sd->groups;
-}
-
 static unsigned long power_of(int cpu)
 {
-	struct sched_group *group = group_of(cpu);
-
-	if (!group)
-		return SCHED_LOAD_SCALE;
-
-	return group->cpu_power;
+	return cpu_rq(cpu)->cpu_power;
 }

 static int task_hot(struct task_struct *p, u64 now, struct sched_domain *sd);
@@ -1877,8 +1864,8 @@ static void dec_nr_running(struct rq *rq
 static void set_load_weight(struct task_struct *p)
 {
 	if (task_has_rt_policy(p)) {
-		p->se.load.weight = prio_to_weight[0] * 2;
-		p->se.load.inv_weight = prio_to_wmult[0] >> 1;
+		p->se.load.weight = 0;
+		p->se.load.inv_weight = WMULT_CONST;
 		return;
 	}
@@ -7715,6 +7702,7 @@ void __init sched_init(void)
 #ifdef CONFIG_SMP
 		rq->sd = NULL;
 		rq->rd = NULL;
+		rq->cpu_power = SCHED_LOAD_SCALE;
 		rq->post_schedule = 0;
 		rq->active_balance = 0;
 		rq->next_balance = jiffies;

Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -1225,7 +1225,6 @@ static int wake_affine(struct sched_doma
 	unsigned long this_load, load;
 	int idx, this_cpu, prev_cpu;
 	unsigned long tl_per_task;
-	unsigned int imbalance;
 	struct task_group *tg;
 	unsigned long weight;
 	int balanced;
@@ -1252,8 +1251,6 @@ static int wake_affine(struct sched_doma
 	tg = task_group(p);
 	weight = p->se.load.weight;

-	imbalance = 100 + (sd->imbalance_pct - 100) / 2;
-
 	/*
 	 * In low-load situations, where prev_cpu is idle and this_cpu is idle
 	 * due to the sync cause above having dropped this_load to 0, we'll
@@ -1263,9 +1260,21 @@ static int wake_affine(struct sched_doma
 	 * Otherwise check if either cpus are near enough in load to allow this
 	 * task to be woken on this_cpu.
 	 */
-	balanced = !this_load ||
-		100*(this_load + effective_load(tg, this_cpu, weight, weight)) <=
-		imbalance*(load + effective_load(tg, prev_cpu, 0, weight));
+	if (this_load) {
+		unsigned long this_eff_load, prev_eff_load;
+
+		this_eff_load = 100;
+		this_eff_load *= power_of(prev_cpu);
+		this_eff_load *= this_load +
+			effective_load(tg, this_cpu, weight, weight);
+
+		prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
+		prev_eff_load *= power_of(this_cpu);
+		prev_eff_load *= load + effective_load(tg, prev_cpu, 0, weight);
+
+		balanced = this_eff_load <= prev_eff_load;
+	} else
+		balanced = true;

 	/*
 	 * If the currently running task will sleep within
@@ -2298,6 +2307,7 @@ static void update_cpu_power(struct sche
 	if (!power)
 		power = 1;

+	cpu_rq(cpu)->cpu_power = power;
 	sdg->cpu_power = power;
 }
* Re: commit e9e9250b: sync wakeup bustage when waker is an RT task

From: Mike Galbraith @ 2010-05-31 18:03 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Ingo Molnar, LKML, Thomas Gleixner

On Mon, 2010-05-31 at 16:28 +0200, Peter Zijlstra wrote:
> On Mon, 2010-05-31 at 15:56 +0200, Mike Galbraith wrote:
> > > Does the below work?
> >
> > Had to make a dinky adjustment for x86-tip-rt33, but yeah, the horrid
> > latencies are gone.  (hm, skinnier than expected too)
>
> Here's one that should be even skinnier ;-)

Cool, box is into skeletal.

> I knew I already had an accessor: power_of().  Also, I duplicated the
> cpu_power variable inside struct rq so we don't have to chase three
> pointers to get it.

Looks good, but I'll retest anyway.

	-Mike
* Re: commit e9e9250b: sync wakeup bustage when waker is an RT task

From: Mike Galbraith @ 2010-06-01  6:40 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Ingo Molnar, LKML, Thomas Gleixner

On Mon, 2010-05-31 at 20:03 +0200, Mike Galbraith wrote:
> Looks good, but I'll retest anyway.

I wedged the skinnier version into x86-tip-rt33, works fine too.
(todo: non-rt config testing)

	-Mike
* [tip:sched/urgent] sched: Fix wake_affine() vs RT tasks

From: tip-bot for Peter Zijlstra @ 2010-06-01  9:12 UTC (permalink / raw)
To: linux-tip-commits
Cc: linux-kernel, hpa, mingo, a.p.zijlstra, efault, tglx, mingo

Commit-ID:  e51fd5e22e12b39f49b1bb60b37b300b17378a43
Gitweb:     http://git.kernel.org/tip/e51fd5e22e12b39f49b1bb60b37b300b17378a43
Author:     Peter Zijlstra <a.p.zijlstra@chello.nl>
AuthorDate: Mon, 31 May 2010 12:37:30 +0200
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Tue, 1 Jun 2010 09:27:16 +0200

sched: Fix wake_affine() vs RT tasks

Mike reports that since e9e9250b (sched: Scale down cpu_power due to RT
tasks), wake_affine() goes funny on RT tasks due to them still having a
!0 weight and wake_affine() still subtracting that from the rq weight.

Since nobody should be using se->weight for RT tasks, set the value to
zero.  Also, since we now use ->cpu_power to normalize rq weights to
account for RT cpu usage, add that factor into the imbalance
computation.
Reported-by: Mike Galbraith <efault@gmx.de>
Tested-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1275316109.27810.22969.camel@twins>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c      |   24 ++++++------------------
 kernel/sched_fair.c |   22 ++++++++++++++++------
 2 files changed, 22 insertions(+), 24 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index d484081..f8b8996 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -544,6 +544,8 @@ struct rq {
 	struct root_domain *rd;
 	struct sched_domain *sd;

+	unsigned long cpu_power;
+
 	unsigned char idle_at_tick;
 	/* For active balancing */
 	int post_schedule;
@@ -1499,24 +1501,9 @@ static unsigned long target_load(int cpu, int type)
 	return max(rq->cpu_load[type-1], total);
 }

-static struct sched_group *group_of(int cpu)
-{
-	struct sched_domain *sd = rcu_dereference_sched(cpu_rq(cpu)->sd);
-
-	if (!sd)
-		return NULL;
-
-	return sd->groups;
-}
-
 static unsigned long power_of(int cpu)
 {
-	struct sched_group *group = group_of(cpu);
-
-	if (!group)
-		return SCHED_LOAD_SCALE;
-
-	return group->cpu_power;
+	return cpu_rq(cpu)->cpu_power;
 }

 static int task_hot(struct task_struct *p, u64 now, struct sched_domain *sd);
@@ -1854,8 +1841,8 @@ static void dec_nr_running(struct rq *rq)
 static void set_load_weight(struct task_struct *p)
 {
 	if (task_has_rt_policy(p)) {
-		p->se.load.weight = prio_to_weight[0] * 2;
-		p->se.load.inv_weight = prio_to_wmult[0] >> 1;
+		p->se.load.weight = 0;
+		p->se.load.inv_weight = WMULT_CONST;
 		return;
 	}
@@ -7605,6 +7592,7 @@ void __init sched_init(void)
 #ifdef CONFIG_SMP
 		rq->sd = NULL;
 		rq->rd = NULL;
+		rq->cpu_power = SCHED_LOAD_SCALE;
 		rq->post_schedule = 0;
 		rq->active_balance = 0;
 		rq->next_balance = jiffies;
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 217e4a9..eed35ed 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1225,7 +1225,6 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 	unsigned long this_load, load;
 	int idx, this_cpu, prev_cpu;
 	unsigned long tl_per_task;
-	unsigned int imbalance;
 	struct task_group *tg;
 	unsigned long weight;
 	int balanced;
@@ -1252,8 +1251,6 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 	tg = task_group(p);
 	weight = p->se.load.weight;

-	imbalance = 100 + (sd->imbalance_pct - 100) / 2;
-
 	/*
 	 * In low-load situations, where prev_cpu is idle and this_cpu is idle
 	 * due to the sync cause above having dropped this_load to 0, we'll
@@ -1263,9 +1260,21 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 	 * Otherwise check if either cpus are near enough in load to allow this
 	 * task to be woken on this_cpu.
 	 */
-	balanced = !this_load ||
-		100*(this_load + effective_load(tg, this_cpu, weight, weight)) <=
-		imbalance*(load + effective_load(tg, prev_cpu, 0, weight));
+	if (this_load) {
+		unsigned long this_eff_load, prev_eff_load;
+
+		this_eff_load = 100;
+		this_eff_load *= power_of(prev_cpu);
+		this_eff_load *= this_load +
+			effective_load(tg, this_cpu, weight, weight);
+
+		prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
+		prev_eff_load *= power_of(this_cpu);
+		prev_eff_load *= load + effective_load(tg, prev_cpu, 0, weight);
+
+		balanced = this_eff_load <= prev_eff_load;
+	} else
+		balanced = true;

 	/*
 	 * If the currently running task will sleep within
@@ -2298,6 +2307,7 @@ static void update_cpu_power(struct sched_domain *sd, int cpu)
 	if (!power)
 		power = 1;

+	cpu_rq(cpu)->cpu_power = power;
 	sdg->cpu_power = power;
 }