public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/2] Per-task load tracking errors
@ 2013-12-09 12:59 Chris Redpath
  2013-12-09 12:59 ` [PATCH 1/2] sched: reset blocked load decay_count during synchronization Chris Redpath
  2013-12-09 12:59 ` [PATCH 2/2] sched: update runqueue clock before migrations away Chris Redpath
  0 siblings, 2 replies; 15+ messages in thread
From: Chris Redpath @ 2013-12-09 12:59 UTC (permalink / raw)
  To: pjt, mingo, peterz, alex.shi, morten.rasmussen, dietmar.eggemann
  Cc: linux-kernel, Chris Redpath

Hi Paul, Peter etc.

I've found a couple of bugs in the load tracking code and here
is my attempt to fix them. I have some test code available which
can trigger the issues if anyone is interested.


The first one is straightforward. A short sleep can leave a stale
value in se.avg.decay_count. If that task is later migrated while
runnable, then the left-over decay looks like unaccounted sleep
time, so the load is wrongly decayed.

The second one is similar. Here we are losing sleep time for a
task if it is migrated while sleeping and the CPU it previously
ran on has entered nohz mode.

I don't really like this fix much, but the root of the problem
is that load tracking more-or-less expects the runqueue's
decay_counter to be up to date, and when nohz is in use it is
not. The fix demonstrates the issue anyway; I haven't seen
other occasions where nohz CPUs distort the tracked load.

Chris Redpath (2):
  sched: reset blocked load decay_count during synchronization
  sched: update runqueue clock before migrations away

 kernel/sched/fair.c |   38 +++++++++++++++++++++++++++++++++-----
 1 file changed, 33 insertions(+), 5 deletions(-)

-- 
1.7.9.5




^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/2] sched: reset blocked load decay_count during synchronization
  2013-12-09 12:59 [PATCH 0/2] Per-task load tracking errors Chris Redpath
@ 2013-12-09 12:59 ` Chris Redpath
  2013-12-09 17:59   ` bsegall
  2013-12-09 12:59 ` [PATCH 2/2] sched: update runqueue clock before migrations away Chris Redpath
  1 sibling, 1 reply; 15+ messages in thread
From: Chris Redpath @ 2013-12-09 12:59 UTC (permalink / raw)
  To: pjt, mingo, peterz, alex.shi, morten.rasmussen, dietmar.eggemann
  Cc: linux-kernel, Chris Redpath

If an entity happens to sleep for less than one tick duration
the tracked load associated with that entity can be decayed by an
unexpectedly large amount if it is later migrated to a different
CPU. This can interfere with correct scheduling when entity load
is used for decision making.

The reason for this is that when an entity is dequeued and enqueued
quickly, such that se.avg.decay_count and cfs_rq.decay_counter
do not differ when that entity is enqueued again,
__synchronize_entity_decay skips the calculation step and also skips
clearing the decay_count. At a later time that entity may be
migrated and its load will be decayed incorrectly.

All users of this function expect decay_count to be zeroed after
use.

Signed-off-by: Chris Redpath <chris.redpath@arm.com>
---
 kernel/sched/fair.c |    8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e8b652e..b7e5945 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2142,12 +2142,10 @@ static inline u64 __synchronize_entity_decay(struct sched_entity *se)
 	u64 decays = atomic64_read(&cfs_rq->decay_counter);
 
 	decays -= se->avg.decay_count;
-	if (!decays)
-		return 0;
-
-	se->avg.load_avg_contrib = decay_load(se->avg.load_avg_contrib, decays);
+	if (decays)
+		se->avg.load_avg_contrib =
+			decay_load(se->avg.load_avg_contrib, decays);
 	se->avg.decay_count = 0;
-
 	return decays;
 }
 
-- 
1.7.9.5



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 2/2] sched: update runqueue clock before migrations away
  2013-12-09 12:59 [PATCH 0/2] Per-task load tracking errors Chris Redpath
  2013-12-09 12:59 ` [PATCH 1/2] sched: reset blocked load decay_count during synchronization Chris Redpath
@ 2013-12-09 12:59 ` Chris Redpath
  2013-12-09 18:13   ` bsegall
  2013-12-10 11:48   ` Peter Zijlstra
  1 sibling, 2 replies; 15+ messages in thread
From: Chris Redpath @ 2013-12-09 12:59 UTC (permalink / raw)
  To: pjt, mingo, peterz, alex.shi, morten.rasmussen, dietmar.eggemann
  Cc: linux-kernel, Chris Redpath

If we migrate a sleeping task away from a CPU which has the
tick stopped, then both the clock_task and decay_counter will
be out of date for that CPU and we will not decay load correctly
regardless of how often we update the blocked load.

This is only an issue for tasks which are not on a runqueue
(because otherwise that CPU would be awake) and simultaneously
the CPU the task previously ran on has had the tick stopped.

Signed-off-by: Chris Redpath <chris.redpath@arm.com>
---
 kernel/sched/fair.c |   30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b7e5945..0af1dc2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4324,6 +4324,7 @@ unlock:
 	return new_cpu;
 }
 
+static int nohz_test_cpu(int cpu);
 /*
  * Called immediately before a task is migrated to a new cpu; task_cpu(p) and
  * cfs_rq_of(p) references at time of call are still valid and identify the
@@ -4343,6 +4344,25 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
 	 * be negative here since on-rq tasks have decay-count == 0.
 	 */
 	if (se->avg.decay_count) {
+		/*
+		 * If we migrate a sleeping task away from a CPU
+		 * which has the tick stopped, then both the clock_task
+		 * and decay_counter will be out of date for that CPU
+		 * and we will not decay load correctly.
+		 */
+		if (!se->on_rq && nohz_test_cpu(task_cpu(p))) {
+			struct rq *rq = cpu_rq(task_cpu(p));
+			unsigned long flags;
+			/*
+			 * Current CPU cannot be holding rq->lock in this
+			 * circumstance, but another might be. We must hold
+			 * rq->lock before we go poking around in its clocks
+			 */
+			raw_spin_lock_irqsave(&rq->lock, flags);
+			update_rq_clock(rq);
+			update_cfs_rq_blocked_load(cfs_rq, 0);
+			raw_spin_unlock_irqrestore(&rq->lock, flags);
+		}
 		se->avg.decay_count = -__synchronize_entity_decay(se);
 		atomic_long_add(se->avg.load_avg_contrib,
 						&cfs_rq->removed_load);
@@ -6507,6 +6527,11 @@ static struct {
 	unsigned long next_balance;     /* in jiffy units */
 } nohz ____cacheline_aligned;
 
+static int nohz_test_cpu(int cpu)
+{
+	return cpumask_test_cpu(cpu, nohz.idle_cpus_mask);
+}
+
 static inline int find_new_ilb(int call_cpu)
 {
 	int ilb = cpumask_first(nohz.idle_cpus_mask);
@@ -6619,6 +6644,11 @@ static int sched_ilb_notifier(struct notifier_block *nfb,
 		return NOTIFY_DONE;
 	}
 }
+#else
+static int nohz_test_cpu(int cpu)
+{
+	return 0;
+}
 #endif
 
 static DEFINE_SPINLOCK(balancing);
-- 
1.7.9.5



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/2] sched: reset blocked load decay_count during synchronization
  2013-12-09 12:59 ` [PATCH 1/2] sched: reset blocked load decay_count during synchronization Chris Redpath
@ 2013-12-09 17:59   ` bsegall
  0 siblings, 0 replies; 15+ messages in thread
From: bsegall @ 2013-12-09 17:59 UTC (permalink / raw)
  To: Chris Redpath
  Cc: pjt, mingo, peterz, alex.shi, morten.rasmussen, dietmar.eggemann,
	linux-kernel

Chris Redpath <chris.redpath@arm.com> writes:

> If an entity happens to sleep for less than one tick duration
> the tracked load associated with that entity can be decayed by an
> unexpectedly large amount if it is later migrated to a different
> CPU. This can interfere with correct scheduling when entity load
> is used for decision making.
>
> The reason for this is that when an entity is dequeued and enqueued
> quickly, such that se.avg.decay_count and cfs_rq.decay_counter
> do not differ when that entity is enqueued again,
> __synchronize_entity_decay skips the calculation step and also skips
> clearing the decay_count. At a later time that entity may be
> migrated and its load will be decayed incorrectly.
>
> All users of this function expect decay_count to be zeroed after
> use.
>
> Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Reviewed-by: Ben Segall <bsegall@google.com>

> ---
>  kernel/sched/fair.c |    8 +++-----
>  1 file changed, 3 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e8b652e..b7e5945 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2142,12 +2142,10 @@ static inline u64 __synchronize_entity_decay(struct sched_entity *se)
>  	u64 decays = atomic64_read(&cfs_rq->decay_counter);
>  
>  	decays -= se->avg.decay_count;
> -	if (!decays)
> -		return 0;
> -
> -	se->avg.load_avg_contrib = decay_load(se->avg.load_avg_contrib, decays);
> +	if (decays)
> +		se->avg.load_avg_contrib =
> +			decay_load(se->avg.load_avg_contrib, decays);
>  	se->avg.decay_count = 0;
> -
>  	return decays;
>  }

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] sched: update runqueue clock before migrations away
  2013-12-09 12:59 ` [PATCH 2/2] sched: update runqueue clock before migrations away Chris Redpath
@ 2013-12-09 18:13   ` bsegall
  2013-12-10 11:48   ` Peter Zijlstra
  1 sibling, 0 replies; 15+ messages in thread
From: bsegall @ 2013-12-09 18:13 UTC (permalink / raw)
  To: Chris Redpath
  Cc: pjt, mingo, peterz, alex.shi, morten.rasmussen, dietmar.eggemann,
	linux-kernel

Chris Redpath <chris.redpath@arm.com> writes:

> If we migrate a sleeping task away from a CPU which has the
> tick stopped, then both the clock_task and decay_counter will
> be out of date for that CPU and we will not decay load correctly
> regardless of how often we update the blocked load.
>
> This is only an issue for tasks which are not on a runqueue
> (because otherwise that CPU would be awake) and simultaneously
> the CPU the task previously ran on has had the tick stopped.
>
> Signed-off-by: Chris Redpath <chris.redpath@arm.com>

This looks like it is basically correct, but it seems unfortunate to
take any rq lock for these ttwus. I don't know enough about the nohz
machinery to know if that's at all avoidable.


> ---
>  kernel/sched/fair.c |   30 ++++++++++++++++++++++++++++++
>  1 file changed, 30 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b7e5945..0af1dc2 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4324,6 +4324,7 @@ unlock:
>  	return new_cpu;
>  }
>  
> +static int nohz_test_cpu(int cpu);
>  /*
>   * Called immediately before a task is migrated to a new cpu; task_cpu(p) and
>   * cfs_rq_of(p) references at time of call are still valid and identify the
> @@ -4343,6 +4344,25 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
>  	 * be negative here since on-rq tasks have decay-count == 0.
>  	 */
>  	if (se->avg.decay_count) {
> +		/*
> +		 * If we migrate a sleeping task away from a CPU
> +		 * which has the tick stopped, then both the clock_task
> +		 * and decay_counter will be out of date for that CPU
> +		 * and we will not decay load correctly.
> +		 */
> +		if (!se->on_rq && nohz_test_cpu(task_cpu(p))) {
p->on_rq - se->on_rq must be false to call set_task_cpu at all. That
said, barring bugs like the one you fixed in patch 1 I think decay_count
!= 0 should also imply !p->on_rq.

> +			struct rq *rq = cpu_rq(task_cpu(p));
> +			unsigned long flags;
> +			/*
> +			 * Current CPU cannot be holding rq->lock in this
> +			 * circumstance, but another might be. We must hold
> +			 * rq->lock before we go poking around in its clocks
> +			 */
> +			raw_spin_lock_irqsave(&rq->lock, flags);
> +			update_rq_clock(rq);
> +			update_cfs_rq_blocked_load(cfs_rq, 0);
> +			raw_spin_unlock_irqrestore(&rq->lock, flags);
> +		}
>  		se->avg.decay_count = -__synchronize_entity_decay(se);
>  		atomic_long_add(se->avg.load_avg_contrib,
>  						&cfs_rq->removed_load);
> @@ -6507,6 +6527,11 @@ static struct {
>  	unsigned long next_balance;     /* in jiffy units */
>  } nohz ____cacheline_aligned;
>  
> +static int nohz_test_cpu(int cpu)
> +{
> +	return cpumask_test_cpu(cpu, nohz.idle_cpus_mask);
> +}
> +
>  static inline int find_new_ilb(int call_cpu)
>  {
>  	int ilb = cpumask_first(nohz.idle_cpus_mask);
> @@ -6619,6 +6644,11 @@ static int sched_ilb_notifier(struct notifier_block *nfb,
>  		return NOTIFY_DONE;
>  	}
>  }
> +#else
> +static int nohz_test_cpu(int cpu)
> +{
> +	return 0;
> +}
>  #endif
>  
>  static DEFINE_SPINLOCK(balancing);

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] sched: update runqueue clock before migrations away
  2013-12-09 12:59 ` [PATCH 2/2] sched: update runqueue clock before migrations away Chris Redpath
  2013-12-09 18:13   ` bsegall
@ 2013-12-10 11:48   ` Peter Zijlstra
  2013-12-10 13:24     ` Chris Redpath
  1 sibling, 1 reply; 15+ messages in thread
From: Peter Zijlstra @ 2013-12-10 11:48 UTC (permalink / raw)
  To: Chris Redpath
  Cc: pjt, mingo, alex.shi, morten.rasmussen, dietmar.eggemann,
	linux-kernel

On Mon, Dec 09, 2013 at 12:59:10PM +0000, Chris Redpath wrote:
> If we migrate a sleeping task away from a CPU which has the
> tick stopped, then both the clock_task and decay_counter will
> be out of date for that CPU and we will not decay load correctly
> regardless of how often we update the blocked load.
> 
> This is only an issue for tasks which are not on a runqueue
> (because otherwise that CPU would be awake) and simultaneously
> the CPU the task previously ran on has had the tick stopped.

OK, so the idiot in a hurry (me) isn't quite getting the issue.

Normally we update the blocked averages from the tick; clearly when no
tick, no update. So far so good.

Now, we also update blocked load from idle balance -- which would
include the CPUs without tick through nohz_idle_balance() -- however
this only appears to be done for CONFIG_FAIR_GROUP_SCHED.

Are you running without cgroup muck? If so should we make this
unconditional?

If you have cgroup muck enabled; what's the problem? Don't we run
nohz_idle_balance() frequently enough to be effective for updating the
blocked load?

You also seem to have overlooked NO_HZ_FULL, that can stop a tick even
when there's a running task and makes the situation even more fun.

> @@ -4343,6 +4344,25 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
>  	 * be negative here since on-rq tasks have decay-count == 0.
>  	 */
>  	if (se->avg.decay_count) {
> +		/*
> +		 * If we migrate a sleeping task away from a CPU
> +		 * which has the tick stopped, then both the clock_task
> +		 * and decay_counter will be out of date for that CPU
> +		 * and we will not decay load correctly.
> +		 */
> +		if (!se->on_rq && nohz_test_cpu(task_cpu(p))) {
> +			struct rq *rq = cpu_rq(task_cpu(p));
> +			unsigned long flags;
> +			/*
> +			 * Current CPU cannot be holding rq->lock in this
> +			 * circumstance, but another might be. We must hold
> +			 * rq->lock before we go poking around in its clocks
> +			 */
> +			raw_spin_lock_irqsave(&rq->lock, flags);
> +			update_rq_clock(rq);
> +			update_cfs_rq_blocked_load(cfs_rq, 0);
> +			raw_spin_unlock_irqrestore(&rq->lock, flags);
> +		}
>  		se->avg.decay_count = -__synchronize_entity_decay(se);
>  		atomic_long_add(se->avg.load_avg_contrib,
>  						&cfs_rq->removed_load);

Right, as Ben already said; taking a rq->lock there is unfortunate at
best.

So normally we 'throttle' the expense of decaying the blocked load to
ticks. But the above does it on every (suitable) task migration which
might be far more often.

So ideally we'd get it all sorted through the nohz_idle_balance() path;
what exactly are the problems with that?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] sched: update runqueue clock before migrations away
  2013-12-10 11:48   ` Peter Zijlstra
@ 2013-12-10 13:24     ` Chris Redpath
  2013-12-10 15:14       ` Peter Zijlstra
  0 siblings, 1 reply; 15+ messages in thread
From: Chris Redpath @ 2013-12-10 13:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: pjt@google.com, mingo@redhat.com, alex.shi@linaro.org,
	Morten Rasmussen, Dietmar Eggemann, linux-kernel@vger.kernel.org,
	bsegall

On 10/12/13 11:48, Peter Zijlstra wrote:
> On Mon, Dec 09, 2013 at 12:59:10PM +0000, Chris Redpath wrote:
>> If we migrate a sleeping task away from a CPU which has the
>> tick stopped, then both the clock_task and decay_counter will
>> be out of date for that CPU and we will not decay load correctly
>> regardless of how often we update the blocked load.
>>
>> This is only an issue for tasks which are not on a runqueue
>> (because otherwise that CPU would be awake) and simultaneously
>> the CPU the task previously ran on has had the tick stopped.
>
> OK, so the idiot in a hurry (me) isn't quite getting the issue.
>

Sorry Peter, I will expand a little. We are using runnable_avg_sum to 
drive task placement in a much more aggressive way than usual in our 
big.LITTLE MP scheduler patches. I spend a lot of time looking at 
individual task load signals.

What happens is that if you have a task which sleeps for a while and 
wakes on a different CPU and the previous CPU hasn't had a tick for a 
while, then that sleep time is lost. If the sleep time is short relative 
to the runtime the impact is small, but we have tasks on Android where 
the runtime is small relative to the sleep. These tasks look much bigger 
than they really are. Everything works, but we can use more energy than 
we need to meet the compute requirements. I guess load balancing 
could also be misled for a while.

I fully appreciate that it isn't so visible the way that load averages 
are used in the mainline scheduler, but it's definitely there.

> Normally we update the blocked averages from the tick; clearly when no
> tick, no update. So far so good.
>
> Now, we also update blocked load from idle balance -- which would
> include the CPUs without tick through nohz_idle_balance() -- however
> this only appears to be done for CONFIG_FAIR_GROUP_SCHED.
>
> Are you running without cgroup muck? If so should we make this
> unconditional?
>
> If you have cgroup muck enabled; what's the problem? Don't we run
> nohz_idle_balance() frequently enough to be effective for updating the
> blocked load?
>

I have to check this out a bit more, but yes, I do have the muck enabled 
:/ I will investigate how often we idle balance on this platform. 
However, to be correct, wouldn't we need an idle balance event to happen 
during sleep every time an entity was dequeued? And then we'd still lose 
the sleep time if we happen to wake between idle balance events?

> You also seem to have overlooked NO_HZ_FULL, that can stop a tick even
> when there's a running task and makes the situation even more fun.

You are right, I haven't considered NO_HZ_FULL at all, but I guess it's 
also going to have an impact. I'll look closer at it.

>
>> @@ -4343,6 +4344,25 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
>>   	 * be negative here since on-rq tasks have decay-count == 0.
>>   	 */
>>   	if (se->avg.decay_count) {
>> +		/*
>> +		 * If we migrate a sleeping task away from a CPU
>> +		 * which has the tick stopped, then both the clock_task
>> +		 * and decay_counter will be out of date for that CPU
>> +		 * and we will not decay load correctly.
>> +		 */
>> +		if (!se->on_rq && nohz_test_cpu(task_cpu(p))) {
>> +			struct rq *rq = cpu_rq(task_cpu(p));
>> +			unsigned long flags;
>> +			/*
>> +			 * Current CPU cannot be holding rq->lock in this
>> +			 * circumstance, but another might be. We must hold
>> +			 * rq->lock before we go poking around in its clocks
>> +			 */
>> +			raw_spin_lock_irqsave(&rq->lock, flags);
>> +			update_rq_clock(rq);
>> +			update_cfs_rq_blocked_load(cfs_rq, 0);
>> +			raw_spin_unlock_irqrestore(&rq->lock, flags);
>> +		}
>>   		se->avg.decay_count = -__synchronize_entity_decay(se);
>>   		atomic_long_add(se->avg.load_avg_contrib,
>>   						&cfs_rq->removed_load);
>
> Right, as Ben already said; taking a rq->lock there is unfortunate at
> best.
>

You're both being very polite. I hate it too :) The only reason for 
taking the lock is to be sure the decay_counter is updated; normal sleep 
accounting doesn't need all this and is always correct.

This stuff only exists in the first place so that the blocked load 
doesn't get out of whack when we are moving entities. Maybe I can 
separate entity load decay due to sleep from blocked load accounting. It 
will leave blocked load subject to rq clock updates, but that might be 
OK. It's only a half-formed idea at the moment; I will ponder a bit longer.

> So normally we 'throttle' the expense of decaying the blocked load to
> ticks. But the above does it on every (suitable) task migration which
> might be far more often.
>
> So ideally we'd get it all sorted through the nohz_idle_balance() path;
> what exactly are the problems with that?
>

What I want is for se.avg.runnable_avg_sum to correctly incorporate the 
sleep time. I care much less about runqueue blocked averages being 
instantaneously correct so long as they don't get out of sync when 
entities move around. If it's possible to do it in the idle balance path 
that is great - if you have any suggestions I'd be glad to have a look.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] sched: update runqueue clock before migrations away
  2013-12-10 13:24     ` Chris Redpath
@ 2013-12-10 15:14       ` Peter Zijlstra
  2013-12-10 15:55         ` Chris Redpath
  0 siblings, 1 reply; 15+ messages in thread
From: Peter Zijlstra @ 2013-12-10 15:14 UTC (permalink / raw)
  To: Chris Redpath
  Cc: pjt@google.com, mingo@redhat.com, alex.shi@linaro.org,
	Morten Rasmussen, Dietmar Eggemann, linux-kernel@vger.kernel.org,
	bsegall

On Tue, Dec 10, 2013 at 01:24:21PM +0000, Chris Redpath wrote:
> What happens is that if you have a task which sleeps for a while and wakes
> on a different CPU and the previous CPU hasn't had a tick for a while, then
> that sleep time is lost.

/me more confused now.

Where does update_cfs_rq_blocked_load() account sleep time? It's only
concerned with the blocked decay.

Or is it the decay_count <= 0 case in enqueue_entity_load_avg() that's
screwing you over?

That's guestimating the last_runnable_update based on decay_count, and
per the previous the decay count can get slightly out of sync.

If so, we should look to cure the issue enqueue_entity_load_avg() is
trying to work around, namely the fact that we cannot assume synced
clocks, nor can we read remote clocks atomically.




^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] sched: update runqueue clock before migrations away
  2013-12-10 15:14       ` Peter Zijlstra
@ 2013-12-10 15:55         ` Chris Redpath
  2013-12-12 18:24           ` Peter Zijlstra
  0 siblings, 1 reply; 15+ messages in thread
From: Chris Redpath @ 2013-12-10 15:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: pjt@google.com, mingo@redhat.com, alex.shi@linaro.org,
	Morten Rasmussen, Dietmar Eggemann, linux-kernel@vger.kernel.org,
	bsegall@google.com

On 10/12/13 15:14, Peter Zijlstra wrote:
> On Tue, Dec 10, 2013 at 01:24:21PM +0000, Chris Redpath wrote:
>> What happens is that if you have a task which sleeps for a while and wakes
>> on a different CPU and the previous CPU hasn't had a tick for a while, then
>> that sleep time is lost.
>
> /me more confused now.
>
> Where does update_cfs_rq_blocked_load() account sleep time? It's only
> concerned with the blocked decay.
>
> Or is it the decay_count <= 0 case in enqueue_entity_load_avg() that's
> screwing you over?
>
Sorry, I don't appear to have explained myself well.

We don't get a negative decay_count when we synchronize the entity in
this situation because the decay_counter of the runqueue has not been
updated since the last update. __synchronize_entity_decay will return 0,
and hence the entity load average does not decay and neither does
blocked load.

It doesn't seem to matter much that the blocked load doesn't decay
because we are about to remove the load from the now-unblocked entity
anyway, but it impacts the entity load average because that sleep time
has now disappeared.

The call to update_rq_clock() is needed to get rq->clock_task updated,
and then I use update_cfs_rq_blocked_load() to get the decay_counter
updated.

It has the side-effect of updating the blocked_load_avg to stay
consistent but the effect I am looking for is the decay_counter update.

> That's guestimating the last_runnable_update based on decay_count, and
> per the previous the decay count can get slightly out of sync.

The guesstimation works fine; the issue is only that we can't tell at
this point how much time that entity was asleep when the CPU it ran on
has no tick, and since it is too expensive to synchronize the clocks,
there isn't (currently) a way to find out without updating the rq we
came from.

I can't see anything handy lying around in struct rq, but is there a
copy of the jiffies held anywhere per-cpu, presuming it would stop being
updated when the tick stops? If not, I could store one somewhere as part
of turning the tick off, and then we could use the difference between
that and the current jiffies count to estimate the amount of time in
limbo. That would almost certainly be accurate enough for me - a few ms
won't hurt, but when we lose seconds it does.

>
> If so, we should look to cure the issue enqueue_entity_load_avg() is
> trying to work around, namely the fact that we cannot assume synced
> clocks, nor can we read remote clocks atomically.
>

That would be nice, but IMHO the estimation error is small and
self-consistent so the optimization appears good.



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] sched: update runqueue clock before migrations away
  2013-12-10 15:55         ` Chris Redpath
@ 2013-12-12 18:24           ` Peter Zijlstra
  2013-12-13  8:48             ` Vincent Guittot
  2013-12-17 14:09             ` Chris Redpath
  0 siblings, 2 replies; 15+ messages in thread
From: Peter Zijlstra @ 2013-12-12 18:24 UTC (permalink / raw)
  To: Chris Redpath
  Cc: pjt@google.com, mingo@redhat.com, alex.shi@linaro.org,
	Morten Rasmussen, Dietmar Eggemann, linux-kernel@vger.kernel.org,
	bsegall@google.com, Vincent Guittot, Frederic Weisbecker

On Tue, Dec 10, 2013 at 03:55:43PM +0000, Chris Redpath wrote:
> >That's guestimating the last_runnable_update based on decay_count, and
> >per the previous the decay count can get slightly out of sync.
> 
> The guesstimation works fine, the issue is only that we can't tell at
> this point how much time that entity was asleep when the CPU it ran on
> has no tick and since it is too expensive to synchronize the clocks,
> there isn't (currently) a way to find out without updating the rq we
> came from.
> 
> I can't see anything handy lying around in struct rq, but is there a
> copy of the jiffies held anywhere per-cpu presuming it would stop being
> updated when the tick stops? If not, I could store one somewhere as part
> of turning the tick off and then we could use the difference between
> that and the current jiffies count to estimate the amount of time in
> limbo. That would almost certainly be accurate enough for me - a few ms
> won't hurt but when we lose seconds it does.

Would pre_schedule_idle() -> rq_last_tick_reset() -> rq->last_sched_tick
be useful?

I suppose we could easily lift that to NO_HZ_COMMON.

Which raises another point; I dislike this idle pre/post business. Why
can't the post_schedule_idle() calls be done from pick_next_task_idle()
and pre_schedule_idle() done from put_prev_task_idle() ?

That would avoid the post_schedule() rq->lock dance.

Vince?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] sched: update runqueue clock before migrations away
  2013-12-12 18:24           ` Peter Zijlstra
@ 2013-12-13  8:48             ` Vincent Guittot
  2013-12-17 14:09             ` Chris Redpath
  1 sibling, 0 replies; 15+ messages in thread
From: Vincent Guittot @ 2013-12-13  8:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Redpath, pjt@google.com, mingo@redhat.com,
	alex.shi@linaro.org, Morten Rasmussen, Dietmar Eggemann,
	linux-kernel@vger.kernel.org, bsegall@google.com,
	Frederic Weisbecker

On 12 December 2013 19:24, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Dec 10, 2013 at 03:55:43PM +0000, Chris Redpath wrote:
>> >That's guestimating the last_runnable_update based on decay_count, and
>> >per the previous the decay count can get slightly out of sync.
>>
>> The guesstimation works fine, the issue is only that we can't tell at
>> this point how much time that entity was asleep when the CPU it ran on
>> has no tick and since it is too expensive to synchronize the clocks,
>> there isn't (currently) a way to find out without updating the rq we
>> came from.
>>
>> I can't see anything handy lying around in struct rq, but is there a
>> copy of the jiffies held anywhere per-cpu presuming it would stop being
>> updated when the tick stops? If not, I could store one somewhere as part
>> of turning the tick off and then we could use the difference between
>> that and the current jiffies count to estimate the amount of time in
>> limbo. That would almost certainly be accurate enough for me - a few ms
>> won't hurt but when we lose seconds it does.
>
> Would pre_schedule_idle() -> rq_last_tick_reset() -> rq->last_sched_tick
> be useful?
>
> I suppose we could easily lift that to NO_HZ_COMMON.
>
> Which raises another point; I dislike this idle pre/post business. Why
> can't the post_schedule_idle() calls be done from pick_next_task_idle()
> and pre_schedule_idle() done from put_prev_task_idle() ?

The update of runnable_avg for a cpu that becomes idle must be done
before the call to idle_balance, which will use it; so idle_enter_fair
is called in pre_schedule_idle.
idle_exit_fair has been put in the post_schedule function in order to
be symmetric and coherent, but nothing prevents moving it into
pick_next_task_idle.

Vincent

>
> That would avoid the post_schedule() rq->lock dance.
>
> Vince?

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/2] sched: update runqueue clock before migrations away
  2013-12-12 18:24           ` Peter Zijlstra
  2013-12-13  8:48             ` Vincent Guittot
@ 2013-12-17 14:09             ` Chris Redpath
  2013-12-17 15:51               ` Peter Zijlstra
  2013-12-17 18:03               ` bsegall
  1 sibling, 2 replies; 15+ messages in thread
From: Chris Redpath @ 2013-12-17 14:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: pjt@google.com, mingo@redhat.com, alex.shi@linaro.org,
	Morten Rasmussen, Dietmar Eggemann, linux-kernel@vger.kernel.org,
	bsegall@google.com, Vincent Guittot, Frederic Weisbecker

On 12/12/13 18:24, Peter Zijlstra wrote:
> Would pre_schedule_idle() -> rq_last_tick_reset() -> rq->last_sched_tick
> be useful?
>
> I suppose we could easily lift that to NO_HZ_COMMON.
>

Many thanks for the tip Peter, I have tried this out and it does provide 
enough information to be able to correct the problem. The new version 
doesn't update the rq, just carries the extra unaccounted time 
(estimated from the jiffies) over to be processed during enqueue.

However before I send a new patch set I have a question about the 
existing behavior. Ben, you may already know the answer to this?

During a wake migration we call __synchronize_entity_decay in 
migrate_task_rq_fair, which will decay avg.runnable_avg_sum. We also 
record the amount of periods we decayed for as a negative number in 
avg.decay_count.

We then enqueue the task on its target runqueue, and again we decay the 
load by the number of periods it has been off-rq.

if (unlikely(se->avg.decay_count <= 0)) {
	se->avg.last_runnable_update = rq_clock_task(rq_of(cfs_rq));
	if (se->avg.decay_count) {
		se->avg.last_runnable_update -= (-se->avg.decay_count)
							<< 20;
 >>>		update_entity_load_avg(se, 0);

Am I misunderstanding how this is supposed to work, or have we always 
been double-accounting sleep time for wake migrations?
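[Editor's note: the suspected double decay is easy to see with a toy model. The function below is a stand-in, not the kernel's: the real decay_load() decays by y^n with y^32 = 1/2 via a lookup table, while this sketch only applies whole 32-period halvings, which is enough to show the effect.]

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for the kernel's decay_load(): halve the tracked load
 * once per full 32 elapsed periods, ignoring fractional periods. */
static uint64_t decay_sketch(uint64_t load, unsigned int periods)
{
	return load >> (periods / 32);
}
```

If a task sleeps for 64 periods, a single decay gives decay_sketch(1024, 64) == 256. Decaying at migrate time and then again over the same window at enqueue gives decay_sketch(256, 64) == 64, i.e. the sleep would be accounted twice.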




* Re: [PATCH 2/2] sched: update runqueue clock before migrations away
  2013-12-17 14:09             ` Chris Redpath
@ 2013-12-17 15:51               ` Peter Zijlstra
  2013-12-17 18:03               ` bsegall
  1 sibling, 0 replies; 15+ messages in thread
From: Peter Zijlstra @ 2013-12-17 15:51 UTC (permalink / raw)
  To: Chris Redpath
  Cc: pjt@google.com, mingo@redhat.com, alex.shi@linaro.org,
	Morten Rasmussen, Dietmar Eggemann, linux-kernel@vger.kernel.org,
	bsegall@google.com, Vincent Guittot, Frederic Weisbecker

On Tue, Dec 17, 2013 at 02:09:13PM +0000, Chris Redpath wrote:
> On 12/12/13 18:24, Peter Zijlstra wrote:
> >Would pre_schedule_idle() -> rq_last_tick_reset() -> rq->last_sched_tick
> >be useful?
> >
> >I suppose we could easily lift that to NO_HZ_COMMON.
> >
> 
> Many thanks for the tip Peter, I have tried this out and it does provide
> enough information to be able to correct the problem. The new version
> doesn't update the rq, just carries the extra unaccounted time (estimated
> from the jiffies) over to be processed during enqueue.
> 
> However before I send a new patch set I have a question about the existing
> behavior. Ben, you may already know the answer to this?
> 
> During a wake migration we call __synchronize_entity_decay in
> migrate_task_rq_fair, which will decay avg.runnable_avg_sum. We also record
> the amount of periods we decayed for as a negative number in
> avg.decay_count.
> 
> We then enqueue the task on its target runqueue, and again we decay the load
> by the number of periods it has been off-rq.
> 
> if (unlikely(se->avg.decay_count <= 0)) {
> 	se->avg.last_runnable_update = rq_clock_task(rq_of(cfs_rq));
> 	if (se->avg.decay_count) {
> 		se->avg.last_runnable_update -= (-se->avg.decay_count)
> 							<< 20;
> >>>		update_entity_load_avg(se, 0);
> 
> Am I misunderstanding how this is supposed to work, or have we always
> been double-accounting sleep time for wake migrations?

The way I understood this is that on migrate we sync from dequeue to
migrate point, and subtract the 'current (at point of migrate)' blocked
load from the src rq, and add it to the dst rq.

Then on enqueue we sync again, but now from migrate to enqueue.
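[Editor's note: a minimal model of the two-phase sync described above. All names are illustrative, not the kernel's, and decay is simplified to whole 32-period halvings.]

```c
#include <assert.h>
#include <stdint.h>

/* Toy entity: contrib is the decayed load contribution, last_update
 * the period stamp of the last sync. */
struct toy_se {
	uint64_t contrib;
	uint64_t last_update;
};

/* Simplified decay: halve once per full 32 elapsed periods. */
static uint64_t toy_decay(uint64_t load, uint64_t periods)
{
	return load >> (periods / 32);
}

/* Phase 1, at migrate: sync from the dequeue point to the migrate
 * point. Returns the amount to subtract from the source rq's blocked
 * load (and add to the destination's). */
static uint64_t toy_migrate_away(struct toy_se *se, uint64_t now)
{
	se->contrib = toy_decay(se->contrib, now - se->last_update);
	se->last_update = now;
	return se->contrib;
}

/* Phase 2, at enqueue: sync again, but only from the migrate point to
 * the enqueue point, so no period is decayed twice. */
static void toy_enqueue(struct toy_se *se, uint64_t now)
{
	se->contrib = toy_decay(se->contrib, now - se->last_update);
	se->last_update = now;
}
```

A task dequeued at t=0 with contrib 1024, migrated at t=32 and enqueued at t=64 ends up with 256, the same as one decay over the whole 64-period sleep.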


* Re: [PATCH 2/2] sched: update runqueue clock before migrations away
  2013-12-17 14:09             ` Chris Redpath
  2013-12-17 15:51               ` Peter Zijlstra
@ 2013-12-17 18:03               ` bsegall
  2013-12-18 10:13                 ` Chris Redpath
  1 sibling, 1 reply; 15+ messages in thread
From: bsegall @ 2013-12-17 18:03 UTC (permalink / raw)
  To: Chris Redpath
  Cc: Peter Zijlstra, pjt@google.com, mingo@redhat.com,
	alex.shi@linaro.org, Morten Rasmussen, Dietmar Eggemann,
	linux-kernel@vger.kernel.org, Vincent Guittot,
	Frederic Weisbecker

Chris Redpath <chris.redpath@arm.com> writes:

> On 12/12/13 18:24, Peter Zijlstra wrote:
>> Would pre_schedule_idle() -> rq_last_tick_reset() -> rq->last_sched_tick
>> be useful?
>>
>> I suppose we could easily lift that to NO_HZ_COMMON.
>>
>
> Many thanks for the tip Peter, I have tried this out and it does provide enough
> information to be able to correct the problem. The new version doesn't update
> the rq, just carries the extra unaccounted time (estimated from the jiffies)
> over to be processed during enqueue.
>
> However before I send a new patch set I have a question about the existing
> behavior. Ben, you may already know the answer to this?
>
> During a wake migration we call __synchronize_entity_decay in
> migrate_task_rq_fair, which will decay avg.runnable_avg_sum. We also record the
> amount of periods we decayed for as a negative number in avg.decay_count.
>
> We then enqueue the task on its target runqueue, and again we decay the load by
> the number of periods it has been off-rq.
>
> if (unlikely(se->avg.decay_count <= 0)) {
> 	se->avg.last_runnable_update = rq_clock_task(rq_of(cfs_rq));
> 	if (se->avg.decay_count) {
> 		se->avg.last_runnable_update -= (-se->avg.decay_count)
> 							<< 20;
>>>>		update_entity_load_avg(se, 0);
>
> Am I misunderstanding how this is supposed to work, or have we always
> been double-accounting sleep time for wake migrations?

__synchronize_entity_decay will decay load_avg_contrib in order to
figure out how much to remove from old_cfs_rq->blocked_load.
update_entity_load_avg will update the underlying runnable_avg_sum/period that
is used to update load_avg_contrib.

(Normally we update runnable_avg_sum, which updates load_avg_contrib via
__update_entity_load_avg_contrib. Here we go in the reverse direction
because we don't hold the right rq locks at the right times.)
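[Editor's note: a sketch of the forward direction described above. The field names mirror the 3.13-era sched_avg fields, but treat the exact scaling as illustrative.]

```c
#include <assert.h>
#include <stdint.h>

/* Sketch: derive load_avg_contrib from the runnable averages, as the
 * normal (forward) path via __update_entity_load_avg_contrib() does:
 * the entity weight scaled by the fraction of tracked time it was
 * runnable. The +1 avoids division by zero for a fresh entity. */
static uint64_t contrib_sketch(uint64_t weight,
			       uint64_t runnable_avg_sum,
			       uint64_t runnable_avg_period)
{
	return (weight * runnable_avg_sum) / (runnable_avg_period + 1);
}
```

So an entity of weight 1024 that was runnable for 500 of the last 999 tracked periods contributes roughly 512. __synchronize_entity_decay() operates on the result of this computation directly, while update_entity_load_avg() operates on its inputs.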


* Re: [PATCH 2/2] sched: update runqueue clock before migrations away
  2013-12-17 18:03               ` bsegall
@ 2013-12-18 10:13                 ` Chris Redpath
  0 siblings, 0 replies; 15+ messages in thread
From: Chris Redpath @ 2013-12-18 10:13 UTC (permalink / raw)
  To: bsegall@google.com
  Cc: Peter Zijlstra, pjt@google.com, mingo@redhat.com,
	alex.shi@linaro.org, Morten Rasmussen, Dietmar Eggemann,
	linux-kernel@vger.kernel.org, Vincent Guittot,
	Frederic Weisbecker, Chris Redpath

On 17/12/13 18:03, bsegall@google.com wrote:
> __synchronize_entity_decay will decay load_avg_contrib in order to
> figure out how much to remove from old_cfs_rq->blocked_load.
> update_entity_load_avg will update the underlying runnable_avg_sum/period that
> is used to update load_avg_contrib.
>
> (Normally we update runnable_avg_sum, which updates load_avg_contrib via
> __update_entity_load_avg_contrib. Here we go in the reverse direction
> because we don't hold the right rq locks at the right times.)
>

Thanks Ben, got it now. The only question remaining for me to figure out 
is whether I need to include the missed tick time in the contrib decay or 
not - I definitely need to include it in the negative decay count we send 
through a migration. I'll go and check the places where we use the removed 
load and update the blocked load again.



Thread overview: 15+ messages
2013-12-09 12:59 [PATCH 0/2] Per-task load tracking errors Chris Redpath
2013-12-09 12:59 ` [PATCH 1/2] sched: reset blocked load decay_count during synchronization Chris Redpath
2013-12-09 17:59   ` bsegall
2013-12-09 12:59 ` [PATCH 2/2] sched: update runqueue clock before migrations away Chris Redpath
2013-12-09 18:13   ` bsegall
2013-12-10 11:48   ` Peter Zijlstra
2013-12-10 13:24     ` Chris Redpath
2013-12-10 15:14       ` Peter Zijlstra
2013-12-10 15:55         ` Chris Redpath
2013-12-12 18:24           ` Peter Zijlstra
2013-12-13  8:48             ` Vincent Guittot
2013-12-17 14:09             ` Chris Redpath
2013-12-17 15:51               ` Peter Zijlstra
2013-12-17 18:03               ` bsegall
2013-12-18 10:13                 ` Chris Redpath
