* [question] sched: idle_avg and migration latency
@ 2013-12-10 11:30 Daniel Lezcano
2013-12-10 15:11 ` Mike Galbraith
2013-12-10 15:20 ` Alex Shi
0 siblings, 2 replies; 7+ messages in thread
From: Daniel Lezcano @ 2013-12-10 11:30 UTC
To: Linux Kernel Mailing List; +Cc: Alex Shi
Hi All,
I am trying to understand how idle_avg is computed and how it is
used with respect to the migration latency.
1. What is the sysctl_sched_migration_cost value ? It is initialized to
500000UL. Is it an arbitrarily chosen value ? Could it change depending
on the hardware performance ?
2. The idle_balance function checks:
    if (this_rq->avg_idle < sysctl_sched_migration_cost)
        return 0;
IIUC, it is not worth migrating a task to this cpu as we expect to run
another task before we can pull a task to the current cpu, right ?
Then if there is no task to balance we will enter idle, thus we
initialize the idle_stamp to the current clock.
When another task is woken up with the ttwu_do_wakeup, the duration of
the idle time is computed in there:
    if (rq->idle_stamp) {
        u64 delta = rq_clock(rq) - rq->idle_stamp;
        u64 max = 2*sysctl_sched_migration_cost;

        if (delta > max)
            rq->avg_idle = max;
        else
            update_avg(&rq->avg_idle, delta);
        rq->idle_stamp = 0;
    }
Why is the 'delta' capped by 'max' ?
3. And finally the function update_avg does:
    s64 diff = sample - *avg;
    *avg += diff >> 3;
Why is diff >> 3 used instead of dividing by the number of values ?
Thanks in advance for any answers
-- Daniel
--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog
* Re: [question] sched: idle_avg and migration latency
2013-12-10 11:30 [question] sched: idle_avg and migration latency Daniel Lezcano
@ 2013-12-10 15:11 ` Mike Galbraith
2013-12-10 18:31 ` Daniel Lezcano
2013-12-10 15:20 ` Alex Shi
1 sibling, 1 reply; 7+ messages in thread
From: Mike Galbraith @ 2013-12-10 15:11 UTC
To: Daniel Lezcano; +Cc: Linux Kernel Mailing List, Alex Shi

On Tue, 2013-12-10 at 12:30 +0100, Daniel Lezcano wrote:
> Hi All,
>
> I am trying to understand how idle_avg is computed and how it is
> used with respect to the migration latency.
>
> 1. What is the sysctl_sched_migration_cost value ? It is initialized to
> 500000UL. Is it an arbitrarily chosen value ? Could it change depending
> on the hardware performance ?

Yeah, it's a magic number. We used to use boot time measurements.

> 2. The idle_balance function checks:
>
>     if (this_rq->avg_idle < sysctl_sched_migration_cost)
>         return 0;
>
> IIUC, it is not worth migrating a task to this cpu as we expect to run
> another task before we can pull a task to the current cpu, right ?

No, that's all about not beating living hell outta ourselves on every
micro-idle. As with all load balancing, it's usually too much balancing
that creates a problem. You need it, but it's really expensive, so less
is more.

> Then if there is no task to balance we will enter idle, thus we
> initialize the idle_stamp to the current clock.
>
> When another task is woken up with the ttwu_do_wakeup, the duration of
> the idle time is computed in there:
>
>     if (rq->idle_stamp) {
>         u64 delta = rq_clock(rq) - rq->idle_stamp;
>         u64 max = 2*sysctl_sched_migration_cost;
>
>         if (delta > max)
>             rq->avg_idle = max;
>         else
>             update_avg(&rq->avg_idle, delta);
>         rq->idle_stamp = 0;
>     }
>
> Why is the 'delta' capped by 'max' ?

That has changed a little recently. I originally slammed avg_idle itself
straight to max to ensure that a bursty load would idle balance, and not
use stale data. If you start cross core switching at high frequency,
you'll still shut idle balancing down quickly.

> 3. And finally the function update_avg does:
>
>     s64 diff = sample - *avg;
>     *avg += diff >> 3;
>
> Why is diff >> 3 used instead of dividing by the number of values ?

Ingo's quick like bunny smooth average.

-Mike
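For anyone who wants to poke at this outside the kernel, here is a minimal userspace sketch of the wakeup-side bookkeeping and the idle_balance check quoted above. It is only a model under stated assumptions: struct rq_model, model_wakeup() and idle_balance_worthwhile() are hypothetical names, not kernel symbols, the clock is passed in by hand instead of coming from rq_clock(), and the migration cost is fixed at the 500000 ns default mentioned in the thread.

```c
#include <stdint.h>

/* Default quoted in the thread: 500000 ns (0.5 ms). */
#define MIGRATION_COST_NS 500000ULL

/* Hypothetical stand-in for the two runqueue fields used here. */
struct rq_model {
    uint64_t idle_stamp; /* clock value when the CPU went idle, 0 if unset */
    uint64_t avg_idle;   /* smoothed idle duration, in ns */
};

/*
 * Same arithmetic as the quoted update_avg(). Right-shifting a negative
 * value is implementation-defined in ISO C; gcc and clang shift
 * arithmetically, which is what this sketch assumes.
 */
void update_avg(uint64_t *avg, uint64_t sample)
{
    int64_t diff = (int64_t)(sample - *avg);

    *avg += (uint64_t)(diff >> 3);
}

/* Model of the quoted wakeup code; 'now' plays the role of rq_clock(rq). */
void model_wakeup(struct rq_model *rq, uint64_t now)
{
    if (rq->idle_stamp) {
        uint64_t delta = now - rq->idle_stamp;
        uint64_t max = 2 * MIGRATION_COST_NS;

        if (delta > max)
            rq->avg_idle = max;               /* long idle: drop stale history */
        else
            update_avg(&rq->avg_idle, delta); /* short idle: smooth it in */
        rq->idle_stamp = 0;
    }
}

/* Mirrors the quoted check: balancing is skipped when avg_idle is below the cost. */
int idle_balance_worthwhile(const struct rq_model *rq)
{
    return rq->avg_idle >= MIGRATION_COST_NS;
}
```

In this model a single idle period longer than 1 ms puts avg_idle straight at 1000000 ns, so the next idle_balance check passes, while a run of short idle periods pulls the average down by one eighth of the difference per wakeup.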
* Re: [question] sched: idle_avg and migration latency
2013-12-10 15:11 ` Mike Galbraith
@ 2013-12-10 18:31 ` Daniel Lezcano
2013-12-11 1:25 ` Alex Shi
2013-12-11 6:44 ` Mike Galbraith
0 siblings, 2 replies; 7+ messages in thread
From: Daniel Lezcano @ 2013-12-10 18:31 UTC
To: Mike Galbraith; +Cc: Linux Kernel Mailing List, Alex Shi, Ingo Molnar

On 12/10/2013 04:11 PM, Mike Galbraith wrote:
> On Tue, 2013-12-10 at 12:30 +0100, Daniel Lezcano wrote:
>> Hi All,
>>
>> I am trying to understand how idle_avg is computed and how it is
>> used with respect to the migration latency.
>>
>> 1. What is the sysctl_sched_migration_cost value ? It is initialized to
>> 500000UL. Is it an arbitrarily chosen value ? Could it change depending
>> on the hardware performance ?
>
> Yeah, it's a magic number. We used to use boot time measurements.
>
>> 2. The idle_balance function checks:
>>
>>     if (this_rq->avg_idle < sysctl_sched_migration_cost)
>>         return 0;
>>
>> IIUC, it is not worth migrating a task to this cpu as we expect to run
>> another task before we can pull a task to the current cpu, right ?
>
> No, that's all about not beating living hell outta ourselves on every
> micro-idle. As with all load balancing, it's usually too much balancing
> that creates a problem. You need it, but it's really expensive, so less
> is more.
>
>> Then if there is no task to balance we will enter idle, thus we
>> initialize the idle_stamp to the current clock.
>>
>> When another task is woken up with the ttwu_do_wakeup, the duration of
>> the idle time is computed in there:
>>
>>     if (rq->idle_stamp) {
>>         u64 delta = rq_clock(rq) - rq->idle_stamp;
>>         u64 max = 2*sysctl_sched_migration_cost;
>>
>>         if (delta > max)
>>             rq->avg_idle = max;
>>         else
>>             update_avg(&rq->avg_idle, delta);
>>         rq->idle_stamp = 0;
>>     }
>>
>> Why is the 'delta' capped by 'max' ?
>
> That has changed a little recently. I originally slammed avg_idle
> itself straight to max to ensure that a bursty load would idle balance,
> and not use stale data. If you start cross core switching at high
> frequency, you'll still shut idle balancing down quickly.

Ok, thanks for the explanation.

I think I am a bit puzzled by the 'idle_avg' name. I am guessing the
semantics of this variable is "how long this cpu has been idle".

The idle duration, with the no_hz, could be long, several seconds if the
work queues have been migrated and if the timer affinity is set to
another cpu. So if we fall in this case and there is a burst of activity
+ micro-idle and idle_avg is not clamped to max, it will stay high
for some time, thus pulling tasks at each micro idle period, right ?

>> 3. And finally the function update_avg does:
>>
>>     s64 diff = sample - *avg;
>>     *avg += diff >> 3;
>>
>> Why is diff >> 3 used instead of dividing by the number of values ?
>
> Ingo's quick like bunny smooth average.

Yeah, average computation on-the-fly. But why 'divide by 8' ? (Cc'ed Ingo).

Thanks for taking the time to answer.

-- Daniel
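On the 'divide by 8' question, the thread does not give a reason for the specific value, so the following is only an illustration of what the shift controls. `*avg += diff >> 3` moves the average one eighth of the way toward each new sample, which makes it an exponentially weighted moving average with weight 1/8 that needs no sample count and no history. The sketch below generalizes the shift so different values can be compared; ewma_step() and ewma_steps_until_close() are hypothetical helper names, and the arithmetic right shift of negative values is assumed (as with gcc and clang).

```c
#include <stdint.h>

/* One update of the running average: avg moves 1/2^shift of the way to sample. */
uint64_t ewma_step(uint64_t avg, uint64_t sample, unsigned int shift)
{
    int64_t diff = (int64_t)(sample - avg);

    return avg + (uint64_t)(diff >> shift);
}

/*
 * Count how many identical samples it takes for the average to come
 * within 'tol' of the sample value. The step cap guards against the
 * case where the remaining difference is too small for the shift to
 * move the average any further.
 */
unsigned int ewma_steps_until_close(uint64_t avg, uint64_t sample,
                                    unsigned int shift, uint64_t tol)
{
    unsigned int steps = 0;

    while ((avg > sample ? avg - sample : sample - avg) > tol &&
           steps < 10000) {
        avg = ewma_step(avg, sample, shift);
        steps++;
    }
    return steps;
}
```

A smaller shift follows new samples faster but is noisier; a larger shift is smoother but takes longer to notice that the idle pattern changed. Calling ewma_steps_until_close() with the same inputs and shifts of 1, 3 and 5 makes the trade-off visible.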
* Re: [question] sched: idle_avg and migration latency
2013-12-10 18:31 ` Daniel Lezcano
@ 2013-12-11 1:25 ` Alex Shi
2013-12-11 6:44 ` Mike Galbraith
1 sibling, 0 replies; 7+ messages in thread
From: Alex Shi @ 2013-12-11 1:25 UTC
To: Daniel Lezcano, Mike Galbraith; +Cc: Linux Kernel Mailing List, Ingo Molnar

On 12/11/2013 02:31 AM, Daniel Lezcano wrote:
>>
>> That has changed a little recently. I originally slammed avg_idle
>> itself straight to max to ensure that a bursty load would idle balance,
>> and not use stale data. If you start cross core switching at high
>> frequency, you'll still shut idle balancing down quickly.
>
> Ok, thanks for the explanation.
>
> I think I am a bit puzzled by the 'idle_avg' name. I am guessing the
> semantics of this variable is "how long this cpu has been idle".
>
> The idle duration, with the no_hz, could be long, several seconds if the
> work queues have been migrated and if the timer affinity is set to
> another cpu. So if we fall in this case and there is a burst of activity
> + micro-idle and idle_avg is not clamped to max, it will stay high
> for some time, thus pulling tasks at each micro idle period,
> right ?

Yes, I think so.

-- 
Thanks
Alex
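The scenario Daniel describes can be put into numbers with a small model. This is a sketch, not kernel code: it assumes the 500000 ns default from the thread, an arbitrary starting avg_idle of 200000 ns, a 5 s nohz idle, and 10 us micro-idles afterwards; after_idle() and micro_idles_until_gate_closes() are hypothetical helpers, and avg_step() repeats the quoted update_avg() arithmetic (arithmetic right shift assumed).

```c
#include <stdint.h>

#define MIGRATION_COST_NS 500000ULL              /* default quoted in the thread */
#define AVG_IDLE_MAX_NS   (2 * MIGRATION_COST_NS)

/* Same arithmetic as the quoted update_avg(): move 1/8 of the way to sample. */
uint64_t avg_step(uint64_t avg, uint64_t sample)
{
    int64_t diff = (int64_t)(sample - avg);

    return avg + (uint64_t)(diff >> 3);
}

/* avg_idle after one idle period of delta_ns, with or without the clamp. */
uint64_t after_idle(uint64_t avg_idle, uint64_t delta_ns, int clamp)
{
    if (clamp && delta_ns > AVG_IDLE_MAX_NS)
        return AVG_IDLE_MAX_NS;
    return avg_step(avg_idle, delta_ns);
}

/*
 * Number of short idle periods before avg_idle drops below the
 * migration cost, i.e. before the idle_balance check starts bailing
 * out. The cap keeps the loop finite if micro_idle_ns is itself at or
 * above the cost.
 */
unsigned int micro_idles_until_gate_closes(uint64_t avg_idle,
                                           uint64_t micro_idle_ns)
{
    unsigned int n = 0;

    while (avg_idle >= MIGRATION_COST_NS && n < 100000) {
        avg_idle = avg_step(avg_idle, micro_idle_ns);
        n++;
    }
    return n;
}
```

With the clamp, after_idle(200000, 5000000000ULL, 1) gives 1000000 ns and the model stops balancing after six 10 us micro-idles. Without it, the same 5 s idle leaves avg_idle at 625175000 ns, and it takes several times as many micro-idles before the check closes, which is the "stay high for some time" effect asked about above.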
* Re: [question] sched: idle_avg and migration latency
2013-12-10 18:31 ` Daniel Lezcano
2013-12-11 1:25 ` Alex Shi
@ 2013-12-11 6:44 ` Mike Galbraith
1 sibling, 0 replies; 7+ messages in thread
From: Mike Galbraith @ 2013-12-11 6:44 UTC
To: Daniel Lezcano; +Cc: Linux Kernel Mailing List, Alex Shi, Ingo Molnar

On Tue, 2013-12-10 at 19:31 +0100, Daniel Lezcano wrote:
> I think I am a bit puzzled by the 'idle_avg' name. I am guessing the
> semantics of this variable is "how long this cpu has been idle".

Average distance between idles.

> The idle duration, with the no_hz, could be long, several seconds if the
> work queues have been migrated and if the timer affinity is set to
> another cpu. So if we fall in this case and there is a burst of activity
> + micro-idle and idle_avg is not clamped to max, it will stay high
> for some time, thus pulling tasks at each micro idle period,
> right ?

Yeah, it cares about shutting the thing down when idle distance is too
small to be affordable, but cranking it back up quickly so as not to
damage generic bursty load utilization too much. It tries to be dirt
simple and cheap, not perfect.

For nohz_full loads, you'll likely want to kill most if not all wake and
idle balancing, or at least put some serious roadblocks up.. but then
you'll have isolated and pinned everything anyway if you deeply care
about perturbation. All load balancing totally sucks in that regard, as
do those darn workqueues you mentioned.

-Mike
* Re: [question] sched: idle_avg and migration latency
2013-12-10 11:30 [question] sched: idle_avg and migration latency Daniel Lezcano
2013-12-10 15:11 ` Mike Galbraith
@ 2013-12-10 15:20 ` Alex Shi
2013-12-10 19:07 ` Daniel Lezcano
1 sibling, 1 reply; 7+ messages in thread
From: Alex Shi @ 2013-12-10 15:20 UTC
To: Daniel Lezcano, Linux Kernel Mailing List, Mike Galbraith

CC to MikeG, he wrote this part. :)
I will try to explain what I know. I am sorry if my understanding is
incorrect.

On 12/10/2013 07:30 PM, Daniel Lezcano wrote:
>
> Hi All,
>
> I am trying to understand how idle_avg is computed and how it is
> used with respect to the migration latency.
>
> 1. What is the sysctl_sched_migration_cost value ? It is initialized to
> 500000UL. Is it an arbitrarily chosen value ? Could it change depending
> on the hardware performance ?

The current sysctl_sched_migration_cost is 0.5ms, used to limit
overscheduling. I guess it is kind of arbitrary. But it can be rewritten
at /proc/sys/kernel/sched_migration_cost_ns.
So if you find a more suitable value for a particular scenario, I guess
PeterZ would like to modify it. :)

>
>
> 2. The idle_balance function checks:
>
>     if (this_rq->avg_idle < sysctl_sched_migration_cost)
>         return 0;
>
> IIUC, it is not worth migrating a task to this cpu as we expect to run
> another task before we can pull a task to the current cpu, right ?

No, that is used to prevent every idle_balance from causing a task
migration if idle balance happens too often and too quickly, i.e. at a
frequency higher than the task migration limit.

>
> Then if there is no task to balance we will enter idle, thus we
> initialize the idle_stamp to the current clock.

If we pulled a task, we restart the frequency calculation by setting
idle_stamp = 0; or, if a new task is added to this rq, we allow more
idle_balance.

>
> When another task is woken up with the ttwu_do_wakeup, the duration of
> the idle time is computed in there:
>
>     if (rq->idle_stamp) {
>         u64 delta = rq_clock(rq) - rq->idle_stamp;
>         u64 max = 2*sysctl_sched_migration_cost;
>
>         if (delta > max)
>             rq->avg_idle = max;
>         else
>             update_avg(&rq->avg_idle, delta);
>         rq->idle_stamp = 0;
>     }
>
> Why is the 'delta' capped by 'max' ?
>
>
> 3. And finally the function update_avg does:
>
>     s64 diff = sample - *avg;
>     *avg += diff >> 3;
>
> Why is diff >> 3 used instead of dividing by the number of values ?

It is a kind of decay, but I have no idea why the value is '3'. I guess
MikeG has some reason.

>
> Thanks in advance for any answers
>
> -- Daniel
>

-- 
Thanks
Alex
* Re: [question] sched: idle_avg and migration latency
2013-12-10 15:20 ` Alex Shi
@ 2013-12-10 19:07 ` Daniel Lezcano
0 siblings, 0 replies; 7+ messages in thread
From: Daniel Lezcano @ 2013-12-10 19:07 UTC
To: Alex Shi, Linux Kernel Mailing List, Mike Galbraith

On 12/10/2013 04:20 PM, Alex Shi wrote:
> CC to MikeG, he wrote this part. :)
> I will try to explain what I know. I am sorry if my understanding is
> incorrect.
>
> On 12/10/2013 07:30 PM, Daniel Lezcano wrote:
>>
>> Hi All,
>>
>> I am trying to understand how idle_avg is computed and how it is
>> used with respect to the migration latency.
>>
>> 1. What is the sysctl_sched_migration_cost value ? It is initialized to
>> 500000UL. Is it an arbitrarily chosen value ? Could it change depending
>> on the hardware performance ?
>
> The current sysctl_sched_migration_cost is 0.5ms, used to limit
> overscheduling. I guess it is kind of arbitrary. But it can be rewritten
> at /proc/sys/kernel/sched_migration_cost_ns.
> So if you find a more suitable value for a particular scenario, I guess
> PeterZ would like to modify it. :)
>
>>
>>
>> 2. The idle_balance function checks:
>>
>>     if (this_rq->avg_idle < sysctl_sched_migration_cost)
>>         return 0;
>>
>> IIUC, it is not worth migrating a task to this cpu as we expect to run
>> another task before we can pull a task to the current cpu, right ?
>
> No, that is used to prevent every idle_balance from causing a task
> migration if idle balance happens too often and too quickly, i.e. at a
> frequency higher than the task migration limit.
>>
>> Then if there is no task to balance we will enter idle, thus we
>> initialize the idle_stamp to the current clock.
>
> If we pulled a task, we restart the frequency calculation by setting
> idle_stamp = 0; or, if a new task is added to this rq, we allow more
> idle_balance.

Thanks Alex for the explanation.

>> When another task is woken up with the ttwu_do_wakeup, the duration of
>> the idle time is computed in there:
>>
>>     if (rq->idle_stamp) {
>>         u64 delta = rq_clock(rq) - rq->idle_stamp;
>>         u64 max = 2*sysctl_sched_migration_cost;
>>
>>         if (delta > max)
>>             rq->avg_idle = max;
>>         else
>>             update_avg(&rq->avg_idle, delta);
>>         rq->idle_stamp = 0;
>>     }
>>
>> Why is the 'delta' capped by 'max' ?
>>
>>
>> 3. And finally the function update_avg does:
>>
>>     s64 diff = sample - *avg;
>>     *avg += diff >> 3;
>>
>> Why is diff >> 3 used instead of dividing by the number of values ?
>
> It is a kind of decay, but I have no idea why the value is '3'. I guess
> MikeG has some reason.
>>
>> Thanks in advance for any answers
>>
>> -- Daniel
>>