[question] sched: idle_avg and migration latency

public inbox for linux-kernel@vger.kernel.org

* [question] sched: idle_avg and migration latency
@ 2013-12-10 11:30 Daniel Lezcano
  2013-12-10 15:11 ` Mike Galbraith
  2013-12-10 15:20 ` Alex Shi
  0 siblings, 2 replies; 7+ messages in thread
From: Daniel Lezcano @ 2013-12-10 11:30 UTC
  To: Linux Kernel Mailing List; +Cc: Alex Shi

Hi All,

I am trying to understand how idle_avg is computed and how it is
used with regard to migration latency.

1. What is the sysctl_sched_migration_cost value? It is initialized to
500000UL. Is it an arbitrarily chosen value? Could it change depending
on hardware performance?

2. The idle_balance function checks:

         if (this_rq->avg_idle < sysctl_sched_migration_cost)
                 return 0;

IIUC, it is not worth migrating a task to this cpu, as we expect to run
another task before we could pull one to the current cpu, right?

Then if there is no task to balance we will enter idle, thus we 
initialize the idle_stamp to the current clock.

When another task is woken up via ttwu_do_wakeup, the duration of
the idle time is computed there:

	if (rq->idle_stamp) {
		u64 delta = rq_clock(rq) - rq->idle_stamp;
		u64 max = 2*sysctl_sched_migration_cost;

		if (delta > max)
			rq->avg_idle = max;
		else
			update_avg(&rq->avg_idle, delta);
		rq->idle_stamp = 0;
	}

Why is 'delta' capped by 'max'?

3. And finally the function update_avg does:

	s64 diff = sample - *avg;
	*avg += diff >> 3;

Why is diff >> 3 used instead of dividing by the number of samples?

Thanks in advance for any answers

   -- Daniel

-- 
  <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog


* Re: [question] sched: idle_avg and migration latency
  2013-12-10 11:30 [question] sched: idle_avg and migration latency Daniel Lezcano
@ 2013-12-10 15:11 ` Mike Galbraith
  2013-12-10 18:31   ` Daniel Lezcano
  2013-12-10 15:20 ` Alex Shi
  1 sibling, 1 reply; 7+ messages in thread
From: Mike Galbraith @ 2013-12-10 15:11 UTC
  To: Daniel Lezcano; +Cc: Linux Kernel Mailing List, Alex Shi

On Tue, 2013-12-10 at 12:30 +0100, Daniel Lezcano wrote: 
> Hi All,
> 
> I am trying to understand how is computed the idle_avg and how it is 
> used regarding the migration latency.
> 
> 1. What is the sysctl_sched_migration_cost value ? It is initialized to 
> 500000UL. Is it an arbitrarily chosen value ? Could it change depending 
> on the hardware performances ?

Yeah, it's a magic number.  We used to use boot time measurements.

> 2. The idle_balance function checks:
> 
>          if (this_rq->avg_idle < sysctl_sched_migration_cost)
>                  return 0;
> 
> IIUC, it is not worth to migrate a task to this cpu as we expect to run 
> another task before we can pull a task to the current cpu, right ?

No, that's all about not beating living hell outta ourselves on every
micro-idle.  As with all load balancing, it's usually too much balancing
that creates a problem.  You need it, but it's really expensive, so less
is more.

> Then if there is no task to balance we will enter idle, thus we 
> initialize the idle_stamp to the current clock.
> 
> When another task is woken up with the ttwu_do_wakeup, the duration of 
> the idle time is computed in there:
> 
> 	if (rq->idle_stamp) {
> 		u64 delta = rq_clock(rq) - rq->idle_stamp;
> 		u64 max = 2*sysctl_sched_migration_cost;
> 
> 		if (delta > max)
> 			rq->avg_idle = max;
> 		else
> 			update_avg(&rq->avg_idle, delta);
> 		rq->idle_stamp = 0;
> 	}
> 
> Why is the 'delta' leveraged by 'max' ?

That has changed a little recently.  I originally slammed avg_idle
itself straight to max to ensure that a bursty load would idle balance,
and not use stale data.  If you start cross core switching at high
frequency, you'll still shut idle balancing off quickly.
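
To make the clamping concrete, here is a minimal userspace sketch of the
wake-up path quoted above. The struct rq_sketch type and the
account_idle_on_wakeup() name are hypothetical stand-ins, not kernel
symbols; only the logic mirrors the quoted snippet. It also assumes the
compiler shifts negative values arithmetically, as gcc and clang do.

```c
#include <stdint.h>

/* Hypothetical stand-in for the two runqueue fields used by the snippet. */
struct rq_sketch {
	uint64_t idle_stamp;	/* clock value when the cpu went idle, 0 if not idle */
	uint64_t avg_idle;	/* smoothed idle duration, in ns */
};

/* Sketch of the accounting done at wake-up time. */
static void account_idle_on_wakeup(struct rq_sketch *rq, uint64_t now,
				   uint64_t migration_cost)
{
	uint64_t delta, max;

	if (!rq->idle_stamp)
		return;

	delta = now - rq->idle_stamp;
	max = 2 * migration_cost;

	if (delta > max) {
		/* A long idle overwrites the average instead of being blended in. */
		rq->avg_idle = max;
	} else {
		/* Short idles are folded in with weight 1/8. */
		int64_t diff = (int64_t)(delta - rq->avg_idle);

		rq->avg_idle += diff >> 3;
	}
	rq->idle_stamp = 0;
}
```

With this shape, one sufficiently long idle period is enough to bring
avg_idle back to its ceiling, while a run of short idles has to pull the
average down gradually.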

> 3. And finally the function update_avg does:
> 
> 	s64 diff = sample - *avg;
> 	*avg += diff >> 3;
> 
> Why is diff >> 3 used instead of the number of values ?

Ingo's quick like bunny smooth average.
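
As a rough illustration of what that means: the two lines amount to an
exponentially weighted moving average, avg += (sample - avg) / 8, so no
sample count has to be stored and no division is needed. Below is a
minimal userspace sketch, assuming the fixed-width types from <stdint.h>
in place of the kernel's u64/s64, and assuming the compiler shifts
negative values arithmetically (as gcc and clang do). The thread does not
say why 8 in particular was chosen.

```c
#include <stdint.h>

/*
 * Userspace copy of the update_avg() logic quoted in the thread: each new
 * sample contributes 1/8 of its distance from the current average.
 */
static void update_avg(uint64_t *avg, uint64_t sample)
{
	/* May be negative when the sample is below the current average. */
	int64_t diff = (int64_t)(sample - *avg);

	*avg += diff >> 3;
}
```

Older samples are never dropped outright; their weight shrinks by a
factor of 7/8 with every new sample.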

-Mike



* Re: [question] sched: idle_avg and migration latency
  2013-12-10 11:30 [question] sched: idle_avg and migration latency Daniel Lezcano
  2013-12-10 15:11 ` Mike Galbraith
@ 2013-12-10 15:20 ` Alex Shi
  2013-12-10 19:07   ` Daniel Lezcano
  1 sibling, 1 reply; 7+ messages in thread
From: Alex Shi @ 2013-12-10 15:20 UTC
  To: Daniel Lezcano, Linux Kernel Mailing List, Mike Galbraith

CC'ing MikeG, who wrote this part. :)
I'll try to explain what I know; sorry if my understanding is incorrect.

On 12/10/2013 07:30 PM, Daniel Lezcano wrote:
> 
> Hi All,
> 
> I am trying to understand how is computed the idle_avg and how it is
> used regarding the migration latency.
> 
> 1. What is the sysctl_sched_migration_cost value ? It is initialized to
> 500000UL. Is it an arbitrarily chosen value ? Could it change depending
> on the hardware performances ?

The current sysctl_sched_migration_cost is 0.5 ms; it is used to limit
overscheduling. I guess it is somewhat arbitrary, but it can be
overridden via /proc/sys/kernel/sched_migration_cost_ns.
So if you find a more suitable value for a particular scenario, I guess
PeterZ would be happy to change it. :)
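
A minimal sketch of reading that knob from userspace follows. The helper
name read_migration_cost_ns() is made up for illustration, and the path
is simply the one given above; it may not be present on every kernel
version or configuration.

```c
#include <stdio.h>

/*
 * Hypothetical helper: read one unsigned decimal value from a sysctl-style
 * file. Returns 0 on success, -1 if the file is missing or unparsable.
 */
static int read_migration_cost_ns(const char *path, unsigned long long *out)
{
	FILE *f = fopen(path, "r");
	int ok;

	if (!f)
		return -1;

	ok = (fscanf(f, "%llu", out) == 1);
	fclose(f);

	return ok ? 0 : -1;
}
```

Called with "/proc/sys/kernel/sched_migration_cost_ns", it would be
expected to report the 500000 ns default discussed here unless the value
has been changed.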

> 
> 
> 2. The idle_balance function checks:
> 
>         if (this_rq->avg_idle < sysctl_sched_migration_cost)
>                 return 0;
> 
> IIUC, it is not worth to migrate a task to this cpu as we expect to run
> another task before we can pull a task to the current cpu, right ?

No, that is there to prevent every idle_balance from causing a task
migration when idle balancing happens too often and too quickly, i.e. at
a frequency higher than the task migration limit.
> 
> Then if there is no task to balance we will enter idle, thus we
> initialize the idle_stamp to the current clock.

If we pulled a task, we restart the frequency calculation by setting
idle_stamp = 0;
or, if a new task is being added to this rq, we allow more idle_balance.
> 
> When another task is woken up with the ttwu_do_wakeup, the duration of
> the idle time is computed in there:
> 
>     if (rq->idle_stamp) {
>         u64 delta = rq_clock(rq) - rq->idle_stamp;
>         u64 max = 2*sysctl_sched_migration_cost;
> 
>         if (delta > max)
>             rq->avg_idle = max;
>         else
>             update_avg(&rq->avg_idle, delta);
>         rq->idle_stamp = 0;
>     }
> 
> Why is the 'delta' leveraged by 'max' ?
> 
> 
> 3. And finally the function update_avg does:
> 
>     s64 diff = sample - *avg;
>     *avg += diff >> 3;
> 
> Why is diff >> 3 used instead of the number of values ?

It is a kind of decay, but I have no idea why the value is '3'. I guess
MikeG has some reason.
> 
> Thanks in advance for any answers
> 
>   -- Daniel
> 


-- 
Thanks
    Alex


* Re: [question] sched: idle_avg and migration latency
  2013-12-10 15:11 ` Mike Galbraith
@ 2013-12-10 18:31   ` Daniel Lezcano
  2013-12-11  1:25     ` Alex Shi
  2013-12-11  6:44     ` Mike Galbraith
  0 siblings, 2 replies; 7+ messages in thread
From: Daniel Lezcano @ 2013-12-10 18:31 UTC
  To: Mike Galbraith; +Cc: Linux Kernel Mailing List, Alex Shi, Ingo Molnar

On 12/10/2013 04:11 PM, Mike Galbraith wrote:
> On Tue, 2013-12-10 at 12:30 +0100, Daniel Lezcano wrote:
>> Hi All,
>>
>> I am trying to understand how is computed the idle_avg and how it is
>> used regarding the migration latency.
>>
>> 1. What is the sysctl_sched_migration_cost value ? It is initialized to
>> 500000UL. Is it an arbitrarily chosen value ? Could it change depending
>> on the hardware performances ?
>
> Yeah, it's a magic number.  We used to use boot time measurements.
>
>> 2. The idle_balance function checks:
>>
>>           if (this_rq->avg_idle < sysctl_sched_migration_cost)
>>                   return 0;
>>
>> IIUC, it is not worth to migrate a task to this cpu as we expect to run
>> another task before we can pull a task to the current cpu, right ?
>
> No, that's all about not beating living hell outta ourselves on every
> micro-idle.  As with all load balancing, it's usually too much balancing
> that creates a problem.  You need it, but it's really expensive, so less
> is more.
>
>> Then if there is no task to balance we will enter idle, thus we
>> initialize the idle_stamp to the current clock.
>>
>> When another task is woken up with the ttwu_do_wakeup, the duration of
>> the idle time is computed in there:
>>
>> 	if (rq->idle_stamp) {
>> 		u64 delta = rq_clock(rq) - rq->idle_stamp;
>> 		u64 max = 2*sysctl_sched_migration_cost;
>>
>> 		if (delta > max)
>> 			rq->avg_idle = max;
>> 		else
>> 			update_avg(&rq->avg_idle, delta);
>> 		rq->idle_stamp = 0;
>> 	}
>>
>> Why is the 'delta' leveraged by 'max' ?
>
> That has changed a little recently.  I originally slammed avg_idle
> itself straight to max to ensure that a bursty load would idle balance,
> and not use stale data.  If you start cross core switching at high
> frequency, you'll still shut idle balancing quickly.

Ok, thanks for the explanation.

I think I am a bit puzzled by the 'idle_avg' name. I am guessing the
semantics of this variable is "how long this cpu has been idle".

The idle duration, with no_hz, could be long: several seconds if the
work queues have been migrated and the timer affinity is set to
another cpu. So if we are in this case and there is a burst of activity
+ micro-idle, and idle_avg is not clamped to max, it will stay high
for some time, thus pulling tasks at each micro-idle period,
right?

>> 3. And finally the function update_avg does:
>>
>> 	s64 diff = sample - *avg;
>> 	*avg += diff >> 3;
>>
>> Why is diff >> 3 used instead of the number of values ?
>
> Ingo's quick like bunny smooth average.

Yeah, average computation on the fly. But why 'divide by 8'? (Cc'ed Ingo).

Thanks for taking the time to answer.

   -- Daniel




* Re: [question] sched: idle_avg and migration latency
  2013-12-10 15:20 ` Alex Shi
@ 2013-12-10 19:07   ` Daniel Lezcano
  0 siblings, 0 replies; 7+ messages in thread
From: Daniel Lezcano @ 2013-12-10 19:07 UTC
  To: Alex Shi, Linux Kernel Mailing List, Mike Galbraith

On 12/10/2013 04:20 PM, Alex Shi wrote:
> CC to MikeG, he written this part. :)
> I try to explain sth I know. I am sorry if my understanding incorrect.
>
> On 12/10/2013 07:30 PM, Daniel Lezcano wrote:
>>
>> Hi All,
>>
>> I am trying to understand how is computed the idle_avg and how it is
>> used regarding the migration latency.
>>
>> 1. What is the sysctl_sched_migration_cost value ? It is initialized to
>> 500000UL. Is it an arbitrarily chosen value ? Could it change depending
>> on the hardware performances ?
>
> current sysctl_sched_mirgration_cost is 0.5ms, used to limit
> overscheduling. Guess it is a kind of arbitrary. But it can be rewrite
> at /proc/sys/kernel/sched_migration_cost_ns.
> So if you find some new suitable value in particular scenario. guess
> PeterZ like to modify it. :)
>
>>
>>
>> 2. The idle_balance function checks:
>>
>>          if (this_rq->avg_idle < sysctl_sched_migration_cost)
>>                  return 0;
>>
>> IIUC, it is not worth to migrate a task to this cpu as we expect to run
>> another task before we can pull a task to the current cpu, right ?
>
> No, that used to prevent every idle_balance cause a task migration if
> idle balance happens too much and too quick, -- frequency more than task
> migration limitation.
>>
>> Then if there is no task to balance we will enter idle, thus we
>> initialize the idle_stamp to the current clock.
>
> If we pulled task, we will restart frequency calculation by set
> idle_stamp = 0;
> or if new task adding this rq, allow more idle_balance.

Thanks Alex for the explanation.

>> When another task is woken up with the ttwu_do_wakeup, the duration of
>> the idle time is computed in there:
>>
>>      if (rq->idle_stamp) {
>>          u64 delta = rq_clock(rq) - rq->idle_stamp;
>>          u64 max = 2*sysctl_sched_migration_cost;
>>
>>          if (delta > max)
>>              rq->avg_idle = max;
>>          else
>>              update_avg(&rq->avg_idle, delta);
>>          rq->idle_stamp = 0;
>>      }
>>
>> Why is the 'delta' leveraged by 'max' ?
>>
>>
>> 3. And finally the function update_avg does:
>>
>>      s64 diff = sample - *avg;
>>      *avg += diff >> 3;
>>
>> Why is diff >> 3 used instead of the number of values ?
>
> It is a kind of decay. but has no idea of why this value '3'. Guess
> MikeG has some reason.
>>
>> Thanks in advance for any answers
>>
>>    -- Daniel
>>
>
>





* Re: [question] sched: idle_avg and migration latency
  2013-12-10 18:31   ` Daniel Lezcano
@ 2013-12-11  1:25     ` Alex Shi
  2013-12-11  6:44     ` Mike Galbraith
  1 sibling, 0 replies; 7+ messages in thread
From: Alex Shi @ 2013-12-11  1:25 UTC
  To: Daniel Lezcano, Mike Galbraith; +Cc: Linux Kernel Mailing List, Ingo Molnar

On 12/11/2013 02:31 AM, Daniel Lezcano wrote:
>>
>> That has changed a little recently.  I originally slammed avg_idle
>> itself straight to max to ensure that a bursty load would idle balance,
>> and not use stale data.  If you start cross core switching at high
>> frequency, you'll still shut idle balancing quickly.
> 
> Ok, thanks for the explanation.
> 
> I think I am a bit puzzled with the 'idle_avg' name. I am guessing the
> semantic of this variable is "how long this cpu has been idle".
> 
> The idle duration, with the no_hz, could be long, several seconds if the
> work queues have been migrated and if the timer affinity is set to
> another cpu. So if we fall in this case and there is a burst of activity
> + micro-idle and idle_avg is not leverage to max, it will stay high
> during an amount of time, thus pulling tasks at each micro idle period,
> right ?

yes, I think so.

-- 
Thanks
    Alex


* Re: [question] sched: idle_avg and migration latency
  2013-12-10 18:31   ` Daniel Lezcano
  2013-12-11  1:25     ` Alex Shi
@ 2013-12-11  6:44     ` Mike Galbraith
  1 sibling, 0 replies; 7+ messages in thread
From: Mike Galbraith @ 2013-12-11  6:44 UTC
  To: Daniel Lezcano; +Cc: Linux Kernel Mailing List, Alex Shi, Ingo Molnar

On Tue, 2013-12-10 at 19:31 +0100, Daniel Lezcano wrote:

> I think I am a bit puzzled with the 'idle_avg' name. I am guessing the 
> semantic of this variable is "how long this cpu has been idle".

Average distance between idles.

> The idle duration, with the no_hz, could be long, several seconds if the 
> work queues have been migrated and if the timer affinity is set to 
> another cpu. So if we fall in this case and there is a burst of activity 
> + micro-idle and idle_avg is not leverage to max, it will stay high 
> during an amount of time, thus pulling tasks at each micro idle period, 
> right ?

Yeah, it cares about shutting the thing down when idle distance is too
small to be affordable, but cranking it back up quickly so as not to
damage generic bursty load utilization too much.  It tries to be dirt
simple and cheap, not perfect.
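
The "shut down" half of that behaviour can be explored with a small
userspace sketch. All three function names below are hypothetical; the
arithmetic follows the update_avg() and idle_balance() snippets quoted
earlier in the thread, and an arithmetic right shift of negative values
is assumed (as with gcc and clang).

```c
#include <stdint.h>

/* Same arithmetic as the quoted update_avg(): blend in 1/8 of the sample. */
static uint64_t avg_step(uint64_t avg, uint64_t sample)
{
	int64_t diff = (int64_t)(sample - avg);

	return avg + (uint64_t)(diff >> 3);
}

/* Mirrors the quoted idle_balance() gate: skip balancing below the cost. */
static int idle_balance_allowed(uint64_t avg_idle, uint64_t cost)
{
	return avg_idle >= cost;
}

/*
 * Hypothetical helper: starting from avg_idle, count how many consecutive
 * idle periods of short_idle ns it takes before the gate closes. Gives up
 * after limit iterations so that it always terminates.
 */
static unsigned int wakeups_until_gate_closes(uint64_t avg_idle,
					      uint64_t short_idle,
					      uint64_t cost,
					      unsigned int limit)
{
	unsigned int n = 0;

	while (idle_balance_allowed(avg_idle, cost) && n < limit) {
		avg_idle = avg_step(avg_idle, short_idle);
		n++;
	}
	return n;
}
```

Starting from the ceiling of 2 * sysctl_sched_migration_cost and feeding
in very short idle periods, the gate closes after a handful of wake-ups.
The "crank it back up" half is the clamp in the wake-up path: a single
idle period longer than the ceiling resets avg_idle to the ceiling.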

For nohz_full loads, you'll likely want to kill most if not all wake and
idle balancing, or at least put some serious roadblocks up.. but then
you'll have isolated and pinned everything anyway if you deeply care
about perturbation.  All load balancing totally sucks in that regard, as
do those darn workqueues you mentioned.

-Mike

