* sched: incorrect argument used in task_hot()
@ 2006-11-14 23:00 Chen, Kenneth W
2006-11-17 16:56 ` [rfc patch] " Mike Galbraith
0 siblings, 1 reply; 10+ messages in thread
From: Chen, Kenneth W @ 2006-11-14 23:00 UTC (permalink / raw)
To: nickpiggin, Ingo Molnar, Siddha, Suresh B; +Cc: linux-kernel
The argument used for task_hot in can_migrate_task() looks wrong:
int can_migrate_task()
{ ...
if (task_hot(p, rq->timestamp_last_tick, sd))
return 0;
return 1;
}
It is not using the current time to estimate whether a task is
cache-hot on a remote CPU; instead, it uses the timestamp the remote
CPU took at its last timer interrupt. With timer interrupts staggered,
this underestimates how long a task has been off the CPU by a wide
margin compared to the actual value. The ramification is that tasks
that should be considered cache-cold are being evaluated as cache-hot.
We've seen that the effect is especially annoying at the HT sched
domain, where load balancing (l-b) is not able to freely migrate tasks
between sibling CPUs, leaving idle time on the system.
One route to defend against that misbehavior is to override
sd->cache_hot_time at boot time. Intuitively, a sysadmin would set
that parameter to zero. But wait, that is not correct: it should be
set to -1 jiffy, because (rq->timestamp_last_tick - p->last_ran) can
be negative. On top of that, one has to convert jiffies to ns when
setting the parameter. All very unintuitive and undesirable.
I was very tempted to do:
now - this_rq->timestamp_last_tick + rq->timestamp_last_tick;
But it is equally flawed, in that timestamp_last_tick is not
synchronized between this_rq and the target rq. The adjustment is
simply inaccurate and not suitable for load-balance decisions at the
HT (or even core) domain.
There are a number of other usages of the above adjustment, and I
think they are all inaccurate. Though most of them are for
interactivity, and can withstand the inaccuracy because they make
decisions on a much larger scale.
So, back to the first observation of not enough l-b at the HT domain
because of inaccurate time calculation: what would be the best
solution to fix this?
- Ken
* [rfc patch] Re: sched: incorrect argument used in task_hot()
From: Mike Galbraith @ 2006-11-17 16:56 UTC (permalink / raw)
To: Chen, Kenneth W; +Cc: nickpiggin, Ingo Molnar, Siddha, Suresh B, linux-kernel

On Tue, 2006-11-14 at 15:00 -0800, Chen, Kenneth W wrote:
> The argument used for task_hot in can_migrate_task() looks wrong:
>
> int can_migrate_task()
> { ...
>     if (task_hot(p, rq->timestamp_last_tick, sd))
>         return 0;
>     return 1;
> }
>
> It is not using the current time to estimate whether a task is
> cache-hot on a remote CPU; instead, it uses the timestamp the remote
> CPU took at its last timer interrupt. With timer interrupts
> staggered, this underestimates how long a task has been off the CPU
> by a wide margin compared to the actual value. The ramification is
> that tasks that should be considered cache-cold are being evaluated
> as cache-hot.
>
> We've seen that the effect is especially annoying at the HT sched
> domain, where load balancing (l-b) is not able to freely migrate
> tasks between sibling CPUs, leaving idle time on the system.
>
> One route to defend against that misbehavior is to override
> sd->cache_hot_time at boot time. Intuitively, a sysadmin would set
> that parameter to zero. But wait, that is not correct: it should be
> set to -1 jiffy, because (rq->timestamp_last_tick - p->last_ran) can
> be negative. On top of that, one has to convert jiffies to ns when
> setting the parameter. All very unintuitive and undesirable.
>
> I was very tempted to do:
>     now - this_rq->timestamp_last_tick + rq->timestamp_last_tick;
>
> But it is equally flawed, in that timestamp_last_tick is not
> synchronized between this_rq and the target rq. The adjustment is
> simply inaccurate and not suitable for load-balance decisions at the
> HT (or even core) domain.
>
> There are a number of other usages of the above adjustment, and I
> think they are all inaccurate. Though most of them are for
> interactivity, and can withstand the inaccuracy because they make
> decisions on a much larger scale.
>
> So, back to the first observation of not enough l-b at the HT domain
> because of inaccurate time calculation: what would be the best
> solution to fix this?

One way to improve granularity, and eliminate the possibility of
p->last_ran being > rq->timestamp_last_tick (thereby short-circuiting
the evaluation of cache_hot_time), is to cache the last return of
sched_clock() at both tick and schedule times, and use that value as
our reference instead of the absolute time of the tick. It won't
totally eliminate skew, but it moves the reference point closer to the
current time on the remote cpu.

Looking for a good place to do this, I chose update_cpu_clock(). While
looking to see if I could hijack rq->timestamp_last_tick to avoid
adding to the runqueue, I noticed that while we update p->timestamp
and add its recent runtime to p->sched_time at every call to
schedule(), if we _don't_ task switch, and then a tick comes in, that
task will have the full time from the last tick until now added to its
sched_time, despite having already been charged for part of that,
because rq->timestamp_last_tick is older than p->timestamp. To avoid
that, I touch p->timestamp at tick time as well, and use only
p->timestamp to see what needs to be added. rq->timestamp_last_tick
hijacked.

I also noticed that we were updating p->last_ran at every schedule()
call, whereas we only consider moving tasks which aren't currently on
cpu, so I moved that to the "we absolutely will switch" spot.

Comments?
--- linux-2.6.19-rc5-mm2/kernel/sched.c.org	2006-11-15 10:14:07.000000000 +0100
+++ linux-2.6.19-rc5-mm2/kernel/sched.c	2006-11-17 15:42:46.000000000 +0100
@@ -226,7 +226,8 @@ struct rq {
 	unsigned long nr_uninterruptible;

 	unsigned long expired_timestamp;
-	unsigned long long timestamp_last_tick;
+	/* Cached timestamp set by update_cpu_clock() */
+	unsigned long long most_recent_timestamp;
 	struct task_struct *curr, *idle;
 	struct mm_struct *prev_mm;
 	struct prio_array *active, *expired, arrays[2];
@@ -947,8 +948,8 @@ static void activate_task(struct task_st
 	if (!local) {
 		/* Compensate for drifting sched_clock */
 		struct rq *this_rq = this_rq();
-		now = (now - this_rq->timestamp_last_tick)
-			+ rq->timestamp_last_tick;
+		now = (now - this_rq->most_recent_timestamp)
+			+ rq->most_recent_timestamp;
 	}
 #endif
@@ -1694,8 +1695,8 @@ void fastcall wake_up_new_task(struct ta
 		 * Not the local CPU - must adjust timestamp. This should
 		 * get optimised away in the !CONFIG_SMP case.
 		 */
-		p->timestamp = (p->timestamp - this_rq->timestamp_last_tick)
-					+ rq->timestamp_last_tick;
+		p->timestamp = (p->timestamp - this_rq->most_recent_timestamp)
+					+ rq->most_recent_timestamp;
 		__activate_task(p, rq);
 		if (TASK_PREEMPTS_CURR(p, rq))
 			resched_task(rq->curr);
@@ -2067,8 +2068,8 @@ static void pull_task(struct rq *src_rq,
 	set_task_cpu(p, this_cpu);
 	inc_nr_running(p, this_rq);
 	enqueue_task(p, this_array);
-	p->timestamp = (p->timestamp - src_rq->timestamp_last_tick)
-				+ this_rq->timestamp_last_tick;
+	p->timestamp = (p->timestamp - src_rq->most_recent_timestamp)
+				+ this_rq->most_recent_timestamp;
 	/*
 	 * Note that idle threads have a prio of MAX_PRIO, for this test
 	 * to be always true for them.
@@ -2107,7 +2108,7 @@ int can_migrate_task(struct task_struct
 	if (sd->nr_balance_failed > sd->cache_nice_tries)
 		return 1;

-	if (task_hot(p, rq->timestamp_last_tick, sd))
+	if (task_hot(p, rq->most_recent_timestamp, sd))
 		return 0;
 	return 1;
 }
@@ -2206,7 +2207,7 @@ skip_queue:
 	}

 #ifdef CONFIG_SCHEDSTATS
-	if (task_hot(tmp, busiest->timestamp_last_tick, sd))
+	if (task_hot(tmp, busiest->most_recent_timestamp, sd))
 		schedstat_inc(sd, lb_hot_gained[idle]);
 #endif
@@ -2943,7 +2944,8 @@ EXPORT_PER_CPU_SYMBOL(kstat);
 static inline void
 update_cpu_clock(struct task_struct *p, struct rq *rq,
 		 unsigned long long now)
 {
-	p->sched_time += now - max(p->timestamp, rq->timestamp_last_tick);
+	p->sched_time += now - p->timestamp;
+	p->timestamp = rq->most_recent_timestamp = now;
 }
@@ -2956,8 +2958,7 @@ unsigned long long current_sched_time(co
 	unsigned long flags;

 	local_irq_save(flags);
-	ns = max(p->timestamp, task_rq(p)->timestamp_last_tick);
-	ns = p->sched_time + sched_clock() - ns;
+	ns = p->sched_time + sched_clock() - p->timestamp;
 	local_irq_restore(flags);

 	return ns;
@@ -3073,8 +3074,6 @@ void scheduler_tick(void)
 	update_cpu_clock(p, rq, now);

-	rq->timestamp_last_tick = now;
-
 	if (p == rq->idle) {
 		if (wake_priority_sleeper(rq))
 			goto out;
@@ -3466,11 +3465,10 @@ switch_tasks:
 	prev->sleep_avg -= run_time;
 	if ((long)prev->sleep_avg <= 0)
 		prev->sleep_avg = 0;
-	prev->timestamp = prev->last_ran = now;

 	sched_info_switch(prev, next);
 	if (likely(prev != next)) {
-		next->timestamp = now;
+		next->timestamp = prev->last_ran = now;
 		rq->nr_switches++;
 		rq->curr = next;
 		++*switch_count;
@@ -5000,8 +4998,8 @@ static int __migrate_task(struct task_st
 	 * afterwards, and pretending it was a local activate.
 	 * This way is cleaner and logically correct.
 	 */
-	p->timestamp = p->timestamp - rq_src->timestamp_last_tick
-			+ rq_dest->timestamp_last_tick;
+	p->timestamp = p->timestamp - rq_src->most_recent_timestamp
+			+ rq_dest->most_recent_timestamp;
 	deactivate_task(p, rq_src);
 	__activate_task(p, rq_dest);
 	if (TASK_PREEMPTS_CURR(p, rq_dest))
* RE: [rfc patch] Re: sched: incorrect argument used in task_hot()
From: Chen, Kenneth W @ 2006-11-17 19:20 UTC (permalink / raw)
To: 'Mike Galbraith'
Cc: nickpiggin, Ingo Molnar, Siddha, Suresh B, linux-kernel

Mike Galbraith wrote on Friday, November 17, 2006 8:57 AM
> On Tue, 2006-11-14 at 15:00 -0800, Chen, Kenneth W wrote:
> > The argument used for task_hot in can_migrate_task() looks wrong:
> >
> > int can_migrate_task()
> > { ...
> >     if (task_hot(p, rq->timestamp_last_tick, sd))
> >         return 0;
> >     return 1;
> > }
> >
> > [....]
> >
> > So, back to the first observation of not enough l-b at the HT domain
> > because of inaccurate time calculation: what would be the best
> > solution to fix this?
>
> One way to improve granularity, and eliminate the possibility of
> p->last_ran being > rq->timestamp_last_tick (thereby short-circuiting
> the evaluation of cache_hot_time), is to cache the last return of
> sched_clock() at both tick and schedule times, and use that value as
> our reference instead of the absolute time of the tick. It won't
> totally eliminate skew, but it moves the reference point closer to
> the current time on the remote cpu.
>
> Comments?

I like it, and think it might do it. Just for the record, we are
tinkering with the following patch. The thinking is that logical CPUs
in the HT and multi-core domains are usually on the same CPU package,
and it is likely that their TSCs are synchronized, so we can take the
current timestamp and use it directly to compare with p->last_ran.

I'm planning on running a couple of experiments with both patches.

--- ./kernel/sched.c.orig	2006-11-07 18:24:20.000000000 -0800
+++ ./kernel/sched.c	2006-11-15 16:01:39.000000000 -0800
@@ -2068,6 +2068,8 @@ int can_migrate_task(struct task_struct
 		     struct sched_domain *sd, enum idle_type idle,
 		     int *all_pinned)
 {
+	unsigned long long now;
+
 	/*
 	 * We do not migrate tasks that are:
 	 * 1) running (obviously), or
@@ -2090,7 +2092,12 @@ int can_migrate_task(struct task_struct
 	if (sd->nr_balance_failed > sd->cache_nice_tries)
 		return 1;

-	if (task_hot(p, rq->timestamp_last_tick, sd))
+	if (sd->flags & (SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES))
+		now = sched_clock();
+	else
+		now = rq->timestamp_last_tick;
+
+	if (task_hot(p, now, sd))
 		return 0;
 	return 1;
 }
* Re: [rfc patch] Re: sched: incorrect argument used in task_hot()
From: Ingo Molnar @ 2006-11-17 19:20 UTC (permalink / raw)
To: Mike Galbraith
Cc: Chen, Kenneth W, nickpiggin, Siddha, Suresh B, linux-kernel, Andrew Morton

* Mike Galbraith <efault@gmx.de> wrote:

> One way to improve granularity, and eliminate the possibility of
> p->last_ran being > rq->timestamp_last_tick (thereby short-circuiting
> the evaluation of cache_hot_time), is to cache the last return of
> sched_clock() at both tick and schedule times, and use that value as
> our reference instead of the absolute time of the tick. It won't
> totally eliminate skew, but it moves the reference point closer to
> the current time on the remote cpu.
>
> Looking for a good place to do this, I chose update_cpu_clock().

looks good to me - thus we will update the timestamp not only in the
timer tick, but also upon every context switch (when we acquire the
sched_clock() value anyway). Let's try this in -mm?

Acked-by: Ingo Molnar <mingo@elte.hu>

	Ingo
* RE: [rfc patch] Re: sched: incorrect argument used in task_hot()
From: Chen, Kenneth W @ 2006-11-17 19:41 UTC (permalink / raw)
To: 'Ingo Molnar', Mike Galbraith
Cc: nickpiggin, Siddha, Suresh B, linux-kernel, Andrew Morton

Ingo Molnar wrote on Friday, November 17, 2006 11:21 AM
> * Mike Galbraith <efault@gmx.de> wrote:
> > One way to improve granularity, and eliminate the possibility of
> > p->last_ran being > rq->timestamp_last_tick (thereby
> > short-circuiting the evaluation of cache_hot_time), is to cache the
> > last return of sched_clock() at both tick and schedule times, and
> > use that value as our reference instead of the absolute time of the
> > tick. It won't totally eliminate skew, but it moves the reference
> > point closer to the current time on the remote cpu.
> >
> > Looking for a good place to do this, I chose update_cpu_clock().
>
> looks good to me - thus we will update the timestamp not only in the
> timer tick, but also upon every context switch (when we acquire the
> sched_clock() value anyway). Let's try this in -mm?

Certainly gets my vote. For my particular workload environment, there
is enough schedule activity on the remote CPU that, in theory, it
should make the time calculation a lot better than it is now. I will
run a couple of experiments to verify.

Acked-by: Ken Chen <kenneth.w.chen@intel.com>
* Re: [rfc patch] Re: sched: incorrect argument used in task_hot()
From: Mike Galbraith @ 2006-11-17 21:30 UTC (permalink / raw)
To: Ingo Molnar
Cc: Chen, Kenneth W, nickpiggin, Siddha, Suresh B, linux-kernel, Andrew Morton

On Fri, 2006-11-17 at 20:20 +0100, Ingo Molnar wrote:
> * Mike Galbraith <efault@gmx.de> wrote:
> > One way to improve granularity, and eliminate the possibility of
> > p->last_ran being > rq->timestamp_last_tick (thereby
> > short-circuiting the evaluation of cache_hot_time), is to cache the
> > last return of sched_clock() at both tick and schedule times, and
> > use that value as our reference instead of the absolute time of the
> > tick. It won't totally eliminate skew, but it moves the reference
> > point closer to the current time on the remote cpu.
> >
> > Looking for a good place to do this, I chose update_cpu_clock().
>
> looks good to me - thus we will update the timestamp not only in the
> timer tick, but also upon every context switch (when we acquire the
> sched_clock() value anyway). Let's try this in -mm?
>
> Acked-by: Ingo Molnar <mingo@elte.hu>

Then it needs a blame line.

Signed-off-by: Mike Galbraith <efault@gmx.de>

	-Mike
* Re: [rfc patch] Re: sched: incorrect argument used in task_hot()
From: Andrew Morton @ 2006-11-17 21:39 UTC (permalink / raw)
To: Mike Galbraith
Cc: Ingo Molnar, Chen, Kenneth W, nickpiggin, Siddha, Suresh B, linux-kernel

On Fri, 17 Nov 2006 22:30:34 +0100 Mike Galbraith <efault@gmx.de> wrote:

> On Fri, 2006-11-17 at 20:20 +0100, Ingo Molnar wrote:
> > * Mike Galbraith <efault@gmx.de> wrote:
> > > One way to improve granularity, and eliminate the possibility of
> > > p->last_ran being > rq->timestamp_last_tick (thereby
> > > short-circuiting the evaluation of cache_hot_time), is to cache
> > > the last return of sched_clock() at both tick and schedule times,
> > > and use that value as our reference instead of the absolute time
> > > of the tick. It won't totally eliminate skew, but it moves the
> > > reference point closer to the current time on the remote cpu.
> > >
> > > Looking for a good place to do this, I chose update_cpu_clock().
> >
> > looks good to me - thus we will update the timestamp not only in
> > the timer tick, but also upon every context switch (when we acquire
> > the sched_clock() value anyway). Let's try this in -mm?
> >
> > Acked-by: Ingo Molnar <mingo@elte.hu>
>
> Then it needs a blame line.
>
> Signed-off-by: Mike Galbraith <efault@gmx.de>

And a changelog, then we're all set!

Oh. And a patch, too.
* Re: [rfc patch] Re: sched: incorrect argument used in task_hot()
From: Mike Galbraith @ 2006-11-17 22:18 UTC (permalink / raw)
To: Andrew Morton
Cc: Ingo Molnar, Chen, Kenneth W, nickpiggin, Siddha, Suresh B, linux-kernel

On Fri, 2006-11-17 at 13:39 -0800, Andrew Morton wrote:
> On Fri, 17 Nov 2006 22:30:34 +0100 Mike Galbraith <efault@gmx.de> wrote:
>
> > On Fri, 2006-11-17 at 20:20 +0100, Ingo Molnar wrote:
> > > * Mike Galbraith <efault@gmx.de> wrote:
> > > > One way to improve granularity, and eliminate the possibility
> > > > of p->last_ran being > rq->timestamp_last_tick (thereby
> > > > short-circuiting the evaluation of cache_hot_time), is to cache
> > > > the last return of sched_clock() at both tick and schedule
> > > > times, and use that value as our reference instead of the
> > > > absolute time of the tick. It won't totally eliminate skew, but
> > > > it moves the reference point closer to the current time on the
> > > > remote cpu.
> > > >
> > > > Looking for a good place to do this, I chose update_cpu_clock().
> > >
> > > looks good to me - thus we will update the timestamp not only in
> > > the timer tick, but also upon every context switch (when we
> > > acquire the sched_clock() value anyway). Let's try this in -mm?
> > >
> > > Acked-by: Ingo Molnar <mingo@elte.hu>
> >
> > Then it needs a blame line.
> >
> > Signed-off-by: Mike Galbraith <efault@gmx.de>
>
> And a changelog, then we're all set!
>
> Oh. And a patch, too.
Co-opt rq->timestamp_last_tick to maintain a cache_hot_time evaluation
reference timestamp at both tick and schedule times, to prevent that
reference (formerly rq->timestamp_last_tick) from being behind
task->last_ran at evaluation time, and to move the reference closer to
the current time on the remote processor. The intent is to improve
cache-hot evaluation and timestamp adjustment accuracy for task
migration.

Also fix a minor sched_time double-accounting error, which occurs when
a task passing through schedule() does not schedule off, and then
takes the next timer tick.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Ken Chen <kenneth.w.chen@intel.com>

--- linux-2.6.19-rc5-mm2/kernel/sched.c.org	2006-11-15 10:14:07.000000000 +0100
+++ linux-2.6.19-rc5-mm2/kernel/sched.c	2006-11-17 15:42:46.000000000 +0100
@@ -226,7 +226,8 @@ struct rq {
 	unsigned long nr_uninterruptible;

 	unsigned long expired_timestamp;
-	unsigned long long timestamp_last_tick;
+	/* Cached timestamp set by update_cpu_clock() */
+	unsigned long long most_recent_timestamp;
 	struct task_struct *curr, *idle;
 	struct mm_struct *prev_mm;
 	struct prio_array *active, *expired, arrays[2];
@@ -947,8 +948,8 @@ static void activate_task(struct task_st
 	if (!local) {
 		/* Compensate for drifting sched_clock */
 		struct rq *this_rq = this_rq();
-		now = (now - this_rq->timestamp_last_tick)
-			+ rq->timestamp_last_tick;
+		now = (now - this_rq->most_recent_timestamp)
+			+ rq->most_recent_timestamp;
 	}
 #endif
@@ -1694,8 +1695,8 @@ void fastcall wake_up_new_task(struct ta
 		 * Not the local CPU - must adjust timestamp. This should
 		 * get optimised away in the !CONFIG_SMP case.
 		 */
-		p->timestamp = (p->timestamp - this_rq->timestamp_last_tick)
-					+ rq->timestamp_last_tick;
+		p->timestamp = (p->timestamp - this_rq->most_recent_timestamp)
+					+ rq->most_recent_timestamp;
 		__activate_task(p, rq);
 		if (TASK_PREEMPTS_CURR(p, rq))
 			resched_task(rq->curr);
@@ -2067,8 +2068,8 @@ static void pull_task(struct rq *src_rq,
 	set_task_cpu(p, this_cpu);
 	inc_nr_running(p, this_rq);
 	enqueue_task(p, this_array);
-	p->timestamp = (p->timestamp - src_rq->timestamp_last_tick)
-				+ this_rq->timestamp_last_tick;
+	p->timestamp = (p->timestamp - src_rq->most_recent_timestamp)
+				+ this_rq->most_recent_timestamp;
 	/*
 	 * Note that idle threads have a prio of MAX_PRIO, for this test
 	 * to be always true for them.
@@ -2107,7 +2108,7 @@ int can_migrate_task(struct task_struct
 	if (sd->nr_balance_failed > sd->cache_nice_tries)
 		return 1;

-	if (task_hot(p, rq->timestamp_last_tick, sd))
+	if (task_hot(p, rq->most_recent_timestamp, sd))
 		return 0;
 	return 1;
 }
@@ -2206,7 +2207,7 @@ skip_queue:
 	}

 #ifdef CONFIG_SCHEDSTATS
-	if (task_hot(tmp, busiest->timestamp_last_tick, sd))
+	if (task_hot(tmp, busiest->most_recent_timestamp, sd))
 		schedstat_inc(sd, lb_hot_gained[idle]);
 #endif
@@ -2943,7 +2944,8 @@ EXPORT_PER_CPU_SYMBOL(kstat);
 static inline void
 update_cpu_clock(struct task_struct *p, struct rq *rq,
 		 unsigned long long now)
 {
-	p->sched_time += now - max(p->timestamp, rq->timestamp_last_tick);
+	p->sched_time += now - p->timestamp;
+	p->timestamp = rq->most_recent_timestamp = now;
 }
@@ -2956,8 +2958,7 @@ unsigned long long current_sched_time(co
 	unsigned long flags;

 	local_irq_save(flags);
-	ns = max(p->timestamp, task_rq(p)->timestamp_last_tick);
-	ns = p->sched_time + sched_clock() - ns;
+	ns = p->sched_time + sched_clock() - p->timestamp;
 	local_irq_restore(flags);

 	return ns;
@@ -3073,8 +3074,6 @@ void scheduler_tick(void)
 	update_cpu_clock(p, rq, now);

-	rq->timestamp_last_tick = now;
-
 	if (p == rq->idle) {
 		if (wake_priority_sleeper(rq))
 			goto out;
@@ -3466,11 +3465,10 @@ switch_tasks:
 	prev->sleep_avg -= run_time;
 	if ((long)prev->sleep_avg <= 0)
 		prev->sleep_avg = 0;
-	prev->timestamp = prev->last_ran = now;

 	sched_info_switch(prev, next);
 	if (likely(prev != next)) {
-		next->timestamp = now;
+		next->timestamp = prev->last_ran = now;
 		rq->nr_switches++;
 		rq->curr = next;
 		++*switch_count;
@@ -5000,8 +4998,8 @@ static int __migrate_task(struct task_st
 	 * afterwards, and pretending it was a local activate.
 	 * This way is cleaner and logically correct.
 	 */
-	p->timestamp = p->timestamp - rq_src->timestamp_last_tick
-			+ rq_dest->timestamp_last_tick;
+	p->timestamp = p->timestamp - rq_src->most_recent_timestamp
+			+ rq_dest->most_recent_timestamp;
 	deactivate_task(p, rq_src);
 	__activate_task(p, rq_dest);
 	if (TASK_PREEMPTS_CURR(p, rq_dest))
* RE: [rfc patch] Re: sched: incorrect argument used in task_hot()
From: Chen, Kenneth W @ 2006-11-18 0:25 UTC (permalink / raw)
To: 'Mike Galbraith', Andrew Morton
Cc: Ingo Molnar, nickpiggin, Siddha, Suresh B, linux-kernel

Mike Galbraith wrote on Friday, November 17, 2006 2:19 PM
> > And a changelog, then we're all set!
> >
> > Oh. And a patch, too.
>
> Co-opt rq->timestamp_last_tick to maintain a cache_hot_time
> evaluation reference timestamp at both tick and schedule times, to
> prevent that reference from being behind task->last_ran at evaluation
> time, and to move the reference closer to the current time on the
> remote processor, the intent being to improve cache-hot evaluation
> and timestamp adjustment accuracy for task migration.
>
> Also fix a minor sched_time double-accounting error, which occurs
> when a task passing through schedule() does not schedule off, and
> then takes the next timer tick.
>
> [....]
>
> @@ -2206,7 +2207,7 @@ skip_queue:
>  	}
>
>  #ifdef CONFIG_SCHEDSTATS
> -	if (task_hot(tmp, busiest->timestamp_last_tick, sd))
> +	if (task_hot(tmp, busiest->most_recent_timestamp, sd))
>  		schedstat_inc(sd, lb_hot_gained[idle]);
>  #endif

While we're at it, let's clean up this hunk: task_hot() is evaluated
twice in the more common case of nr_balance_failed <= cache_nice_tries.
We should only test and increment the relevant stats for forced
migration. Patch on top of yours.

Signed-off-by: Ken Chen <kenneth.w.chen@intel.com>

--- ./kernel/sched.c.orig	2006-11-17 13:37:46.000000000 -0800
+++ ./kernel/sched.c	2006-11-17 14:20:34.000000000 -0800
@@ -2088,8 +2088,13 @@ int can_migrate_task(struct task_struct
 	 * 2) too many balance attempts have failed.
 	 */
-	if (sd->nr_balance_failed > sd->cache_nice_tries)
+	if (sd->nr_balance_failed > sd->cache_nice_tries) {
+	#ifdef CONFIG_SCHEDSTATS
+		if (task_hot(p, rq->most_recent_timestamp, sd))
+			schedstat_inc(sd, lb_hot_gained[idle]);
+	#endif
 		return 1;
+	}

 	if (task_hot(p, rq->most_recent_timestamp, sd))
 		return 0;
@@ -2189,11 +2194,6 @@ skip_queue:
 		goto skip_bitmap;
 	}

-#ifdef CONFIG_SCHEDSTATS
-	if (task_hot(tmp, busiest->most_recent_timestamp, sd))
-		schedstat_inc(sd, lb_hot_gained[idle]);
-#endif
-
 	pull_task(busiest, array, tmp, this_rq, dst_array, this_cpu);
 	pulled++;
 	rem_load_move -= tmp->load_weight;
* Re: [rfc patch] Re: sched: incorrect argument used in task_hot()
From: Ingo Molnar @ 2006-11-18 7:28 UTC (permalink / raw)
To: Chen, Kenneth W
Cc: 'Mike Galbraith', Andrew Morton, nickpiggin, Siddha, Suresh B, linux-kernel

* Chen, Kenneth W <kenneth.w.chen@intel.com> wrote:

> -	if (sd->nr_balance_failed > sd->cache_nice_tries)
> +	if (sd->nr_balance_failed > sd->cache_nice_tries) {
> +	#ifdef CONFIG_SCHEDSTATS
> +		if (task_hot(p, rq->most_recent_timestamp, sd))
> +			schedstat_inc(sd, lb_hot_gained[idle]);
> +	#endif
> 		return 1;
> +	}

minor nit: preprocessor directives should be aligned to the first
column.

	Ingo