* [PATCH 0/2] Per-task load tracking errors
From: Chris Redpath @ 2013-12-09 12:59 UTC
To: pjt, mingo, peterz, alex.shi, morten.rasmussen, dietmar.eggemann
Cc: linux-kernel, Chris Redpath

Hi Paul, Peter etc.

I've found a couple of bugs in the load tracking code and here is my
attempt to fix them. I have some test code available which can trigger
the issues if anyone is interested.

The first one is straightforward. We can leave a stale value in
se.avg.decay_count after a short sleep. If that task is later migrated
while runnable, the left-over decay looks like unaccounted sleep time
and the load is decayed accordingly.

The second one is similar. Here we lose sleep time for a task if it is
migrated while sleeping and the CPU it previously ran on has entered
nohz mode. I don't really like this fix much, but the root of the
problem is that load tracking more-or-less expects the runqueue's
decay_counter to be up to date, and when nohz is in use it is not. The
fix demonstrates the issue anyway; I haven't seen other occasions where
nohz CPUs distort the tracked load.

Chris Redpath (2):
  sched: reset blocked load decay_count during synchronization
  sched: update runqueue clock before migrations away

 kernel/sched/fair.c | 38 +++++++++++++++++++++++++++++++++-----
 1 file changed, 33 insertions(+), 5 deletions(-)

--
1.7.9.5
* [PATCH 1/2] sched: reset blocked load decay_count during synchronization
From: Chris Redpath @ 2013-12-09 12:59 UTC
To: pjt, mingo, peterz, alex.shi, morten.rasmussen, dietmar.eggemann
Cc: linux-kernel, Chris Redpath

If an entity happens to sleep for less than one tick, the tracked load
associated with that entity can be decayed by an unexpectedly large
amount if it is later migrated to a different CPU. This can interfere
with correct scheduling when entity load is used for decision making.

The reason for this is that when an entity is dequeued and enqueued
quickly, such that se.avg.decay_count and cfs_rq.decay_counter do not
differ when that entity is enqueued again, __synchronize_entity_decay
skips the calculation step and also skips clearing the decay_count. At
a later time that entity may be migrated and its load will be decayed
incorrectly.

All users of this function expect decay_count to be zeroed after use.

Signed-off-by: Chris Redpath <chris.redpath@arm.com>
---
 kernel/sched/fair.c |    8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e8b652e..b7e5945 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2142,12 +2142,10 @@ static inline u64 __synchronize_entity_decay(struct sched_entity *se)
 	u64 decays = atomic64_read(&cfs_rq->decay_counter);
 
 	decays -= se->avg.decay_count;
-	if (!decays)
-		return 0;
-
-	se->avg.load_avg_contrib = decay_load(se->avg.load_avg_contrib, decays);
+	if (decays)
+		se->avg.load_avg_contrib =
+			decay_load(se->avg.load_avg_contrib, decays);
 	se->avg.decay_count = 0;
-
 	return decays;
 }

--
1.7.9.5
* Re: [PATCH 1/2] sched: reset blocked load decay_count during synchronization
From: bsegall @ 2013-12-09 17:59 UTC
To: Chris Redpath
Cc: pjt, mingo, peterz, alex.shi, morten.rasmussen, dietmar.eggemann, linux-kernel

Chris Redpath <chris.redpath@arm.com> writes:

> If an entity happens to sleep for less than one tick, the tracked load
> associated with that entity can be decayed by an unexpectedly large
> amount if it is later migrated to a different CPU. This can interfere
> with correct scheduling when entity load is used for decision making.
>
> The reason for this is that when an entity is dequeued and enqueued
> quickly, such that se.avg.decay_count and cfs_rq.decay_counter
> do not differ when that entity is enqueued again,
> __synchronize_entity_decay skips the calculation step and also skips
> clearing the decay_count. At a later time that entity may be
> migrated and its load will be decayed incorrectly.
>
> All users of this function expect decay_count to be zeroed after
> use.
>
> Signed-off-by: Chris Redpath <chris.redpath@arm.com>

Reviewed-by: Ben Segall <bsegall@google.com>

> ---
>  kernel/sched/fair.c |    8 +++-----
>  1 file changed, 3 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e8b652e..b7e5945 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2142,12 +2142,10 @@ static inline u64 __synchronize_entity_decay(struct sched_entity *se)
>  	u64 decays = atomic64_read(&cfs_rq->decay_counter);
>
>  	decays -= se->avg.decay_count;
> -	if (!decays)
> -		return 0;
> -
> -	se->avg.load_avg_contrib = decay_load(se->avg.load_avg_contrib, decays);
> +	if (decays)
> +		se->avg.load_avg_contrib =
> +			decay_load(se->avg.load_avg_contrib, decays);
>  	se->avg.decay_count = 0;
> -
>  	return decays;
>  }
* [PATCH 2/2] sched: update runqueue clock before migrations away
From: Chris Redpath @ 2013-12-09 12:59 UTC
To: pjt, mingo, peterz, alex.shi, morten.rasmussen, dietmar.eggemann
Cc: linux-kernel, Chris Redpath

If we migrate a sleeping task away from a CPU which has the tick
stopped, then both the clock_task and decay_counter will be out of date
for that CPU and we will not decay load correctly regardless of how
often we update the blocked load.

This is only an issue for tasks which are not on a runqueue (because
otherwise that CPU would be awake) and simultaneously the CPU the task
previously ran on has had the tick stopped.

Signed-off-by: Chris Redpath <chris.redpath@arm.com>
---
 kernel/sched/fair.c |   30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b7e5945..0af1dc2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4324,6 +4324,7 @@ unlock:
 	return new_cpu;
 }
 
+static int nohz_test_cpu(int cpu);
 /*
  * Called immediately before a task is migrated to a new cpu; task_cpu(p) and
  * cfs_rq_of(p) references at time of call are still valid and identify the
@@ -4343,6 +4344,25 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
 	 * be negative here since on-rq tasks have decay-count == 0.
 	 */
 	if (se->avg.decay_count) {
+		/*
+		 * If we migrate a sleeping task away from a CPU
+		 * which has the tick stopped, then both the clock_task
+		 * and decay_counter will be out of date for that CPU
+		 * and we will not decay load correctly.
+		 */
+		if (!se->on_rq && nohz_test_cpu(task_cpu(p))) {
+			struct rq *rq = cpu_rq(task_cpu(p));
+			unsigned long flags;
+			/*
+			 * Current CPU cannot be holding rq->lock in this
+			 * circumstance, but another might be. We must hold
+			 * rq->lock before we go poking around in its clocks
+			 */
+			raw_spin_lock_irqsave(&rq->lock, flags);
+			update_rq_clock(rq);
+			update_cfs_rq_blocked_load(cfs_rq, 0);
+			raw_spin_unlock_irqrestore(&rq->lock, flags);
+		}
 		se->avg.decay_count = -__synchronize_entity_decay(se);
 		atomic_long_add(se->avg.load_avg_contrib,
 				&cfs_rq->removed_load);
@@ -6507,6 +6527,11 @@ static struct {
 	unsigned long next_balance;	/* in jiffy units */
 } nohz ____cacheline_aligned;
 
+static int nohz_test_cpu(int cpu)
+{
+	return cpumask_test_cpu(cpu, nohz.idle_cpus_mask);
+}
+
 static inline int find_new_ilb(int call_cpu)
 {
 	int ilb = cpumask_first(nohz.idle_cpus_mask);
@@ -6619,6 +6644,11 @@ static int sched_ilb_notifier(struct notifier_block *nfb,
 		return NOTIFY_DONE;
 	}
 }
+#else
+static int nohz_test_cpu(int cpu)
+{
+	return 0;
+}
 #endif
 
 static DEFINE_SPINLOCK(balancing);

--
1.7.9.5
* Re: [PATCH 2/2] sched: update runqueue clock before migrations away
From: bsegall @ 2013-12-09 18:13 UTC
To: Chris Redpath
Cc: pjt, mingo, peterz, alex.shi, morten.rasmussen, dietmar.eggemann, linux-kernel

Chris Redpath <chris.redpath@arm.com> writes:

> If we migrate a sleeping task away from a CPU which has the
> tick stopped, then both the clock_task and decay_counter will
> be out of date for that CPU and we will not decay load correctly
> regardless of how often we update the blocked load.
>
> This is only an issue for tasks which are not on a runqueue
> (because otherwise that CPU would be awake) and simultaneously
> the CPU the task previously ran on has had the tick stopped.
>
> Signed-off-by: Chris Redpath <chris.redpath@arm.com>

This looks like it is basically correct, but it seems unfortunate to
take any rq lock for these ttwus. I don't know enough about the nohz
machinery to know if that's at all avoidable.

> ---
>  kernel/sched/fair.c |   30 ++++++++++++++++++++++++++++++
>  1 file changed, 30 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b7e5945..0af1dc2 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4324,6 +4324,7 @@ unlock:
>  	return new_cpu;
>  }
>
> +static int nohz_test_cpu(int cpu);
>  /*
>   * Called immediately before a task is migrated to a new cpu; task_cpu(p) and
>   * cfs_rq_of(p) references at time of call are still valid and identify the
> @@ -4343,6 +4344,25 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
>  	 * be negative here since on-rq tasks have decay-count == 0.
>  	 */
>  	if (se->avg.decay_count) {
> +		/*
> +		 * If we migrate a sleeping task away from a CPU
> +		 * which has the tick stopped, then both the clock_task
> +		 * and decay_counter will be out of date for that CPU
> +		 * and we will not decay load correctly.
> +		 */
> +		if (!se->on_rq && nohz_test_cpu(task_cpu(p))) {

p->on_rq - se->on_rq must be false to call set_task_cpu at all. That
said, barring bugs like the one you fixed in patch 1 I think
decay_count != 0 should also imply !p->on_rq.

> +			struct rq *rq = cpu_rq(task_cpu(p));
> +			unsigned long flags;
> +			/*
> +			 * Current CPU cannot be holding rq->lock in this
> +			 * circumstance, but another might be. We must hold
> +			 * rq->lock before we go poking around in its clocks
> +			 */
> +			raw_spin_lock_irqsave(&rq->lock, flags);
> +			update_rq_clock(rq);
> +			update_cfs_rq_blocked_load(cfs_rq, 0);
> +			raw_spin_unlock_irqrestore(&rq->lock, flags);
> +		}
>  		se->avg.decay_count = -__synchronize_entity_decay(se);
>  		atomic_long_add(se->avg.load_avg_contrib,
>  				&cfs_rq->removed_load);
> @@ -6507,6 +6527,11 @@ static struct {
>  	unsigned long next_balance;	/* in jiffy units */
>  } nohz ____cacheline_aligned;
>
> +static int nohz_test_cpu(int cpu)
> +{
> +	return cpumask_test_cpu(cpu, nohz.idle_cpus_mask);
> +}
> +
>  static inline int find_new_ilb(int call_cpu)
>  {
>  	int ilb = cpumask_first(nohz.idle_cpus_mask);
> @@ -6619,6 +6644,11 @@ static int sched_ilb_notifier(struct notifier_block *nfb,
>  		return NOTIFY_DONE;
>  	}
>  }
> +#else
> +static int nohz_test_cpu(int cpu)
> +{
> +	return 0;
> +}
>  #endif
>
>  static DEFINE_SPINLOCK(balancing);
* Re: [PATCH 2/2] sched: update runqueue clock before migrations away
From: Peter Zijlstra @ 2013-12-10 11:48 UTC
To: Chris Redpath
Cc: pjt, mingo, alex.shi, morten.rasmussen, dietmar.eggemann, linux-kernel

On Mon, Dec 09, 2013 at 12:59:10PM +0000, Chris Redpath wrote:
> If we migrate a sleeping task away from a CPU which has the
> tick stopped, then both the clock_task and decay_counter will
> be out of date for that CPU and we will not decay load correctly
> regardless of how often we update the blocked load.
>
> This is only an issue for tasks which are not on a runqueue
> (because otherwise that CPU would be awake) and simultaneously
> the CPU the task previously ran on has had the tick stopped.

OK, so the idiot in a hurry (me) isn't quite getting the issue.

Normally we update the blocked averages from the tick; clearly when no
tick, no update. So far so good.

Now, we also update blocked load from idle balance -- which would
include the CPUs without tick through nohz_idle_balance() -- however
this only appears to be done for CONFIG_FAIR_GROUP_SCHED.

Are you running without cgroup muck? If so should we make this
unconditional?

If you have cgroup muck enabled; what's the problem? Don't we run
nohz_idle_balance() frequently enough to be effective for updating the
blocked load?

You also seem to have overlooked NO_HZ_FULL, that can stop a tick even
when there's a running task and makes the situation even more fun.

> @@ -4343,6 +4344,25 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
>  	 * be negative here since on-rq tasks have decay-count == 0.
>  	 */
>  	if (se->avg.decay_count) {
> +		/*
> +		 * If we migrate a sleeping task away from a CPU
> +		 * which has the tick stopped, then both the clock_task
> +		 * and decay_counter will be out of date for that CPU
> +		 * and we will not decay load correctly.
> +		 */
> +		if (!se->on_rq && nohz_test_cpu(task_cpu(p))) {
> +			struct rq *rq = cpu_rq(task_cpu(p));
> +			unsigned long flags;
> +			/*
> +			 * Current CPU cannot be holding rq->lock in this
> +			 * circumstance, but another might be. We must hold
> +			 * rq->lock before we go poking around in its clocks
> +			 */
> +			raw_spin_lock_irqsave(&rq->lock, flags);
> +			update_rq_clock(rq);
> +			update_cfs_rq_blocked_load(cfs_rq, 0);
> +			raw_spin_unlock_irqrestore(&rq->lock, flags);
> +		}
>  		se->avg.decay_count = -__synchronize_entity_decay(se);
>  		atomic_long_add(se->avg.load_avg_contrib,
>  				&cfs_rq->removed_load);

Right, as Ben already said; taking a rq->lock there is unfortunate at
best.

So normally we 'throttle' the expense of decaying the blocked load to
ticks. But the above does it on every (suitable) task migration which
might be far more often.

So ideally we'd get it all sorted through the nohz_idle_balance() path;
what exactly are the problems with that?
* Re: [PATCH 2/2] sched: update runqueue clock before migrations away
From: Chris Redpath @ 2013-12-10 13:24 UTC
To: Peter Zijlstra
Cc: pjt@google.com, mingo@redhat.com, alex.shi@linaro.org, Morten Rasmussen, Dietmar Eggemann, linux-kernel@vger.kernel.org, bsegall

On 10/12/13 11:48, Peter Zijlstra wrote:
> On Mon, Dec 09, 2013 at 12:59:10PM +0000, Chris Redpath wrote:
>> If we migrate a sleeping task away from a CPU which has the
>> tick stopped, then both the clock_task and decay_counter will
>> be out of date for that CPU and we will not decay load correctly
>> regardless of how often we update the blocked load.
>>
>> This is only an issue for tasks which are not on a runqueue
>> (because otherwise that CPU would be awake) and simultaneously
>> the CPU the task previously ran on has had the tick stopped.
>
> OK, so the idiot in a hurry (me) isn't quite getting the issue.
>

Sorry Peter, I will expand a little.

We are using runnable_avg_sum to drive task placement in a much more
aggressive way than usual in our big.LITTLE MP scheduler patches. I
spend a lot of time looking at individual task load signals.

What happens is that if you have a task which sleeps for a while and
wakes on a different CPU and the previous CPU hasn't had a tick for a
while, then that sleep time is lost. If the sleep time is short
relative to the runtime the impact is small, but we have tasks on
Android where the runtime is small relative to the sleep. These tasks
look much bigger than they really are. Everything works, but we can use
more energy than necessary to meet the compute requirements. I guess
load balancing could also be misled for a while.

I fully appreciate that it isn't so visible the way that load averages
are used in the mainline scheduler, but it's definitely there.

> Normally we update the blocked averages from the tick; clearly when no
> tick, no update. So far so good.
>
> Now, we also update blocked load from idle balance -- which would
> include the CPUs without tick through nohz_idle_balance() -- however
> this only appears to be done for CONFIG_FAIR_GROUP_SCHED.
>
> Are you running without cgroup muck? If so should we make this
> unconditional?
>
> If you have cgroup muck enabled; what's the problem? Don't we run
> nohz_idle_balance() frequently enough to be effective for updating the
> blocked load?
>

I have to check this out a bit more, but yes I do have the muck
enabled :/ I will investigate how often we idle balance on this
platform; however, to be correct wouldn't we need an idle balance event
to happen during sleep every time an entity was dequeued? And then we'd
still lose the sleep time if we happen to wake between idle balance
events?

> You also seem to have overlooked NO_HZ_FULL, that can stop a tick even
> when there's a running task and makes the situation even more fun.

You are right, I haven't considered NO_HZ_FULL at all but I guess it's
also going to have an impact. I'll look closer at it.

>
>> @@ -4343,6 +4344,25 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
>>  	 * be negative here since on-rq tasks have decay-count == 0.
>>  	 */
>>  	if (se->avg.decay_count) {
>> +		/*
>> +		 * If we migrate a sleeping task away from a CPU
>> +		 * which has the tick stopped, then both the clock_task
>> +		 * and decay_counter will be out of date for that CPU
>> +		 * and we will not decay load correctly.
>> +		 */
>> +		if (!se->on_rq && nohz_test_cpu(task_cpu(p))) {
>> +			struct rq *rq = cpu_rq(task_cpu(p));
>> +			unsigned long flags;
>> +			/*
>> +			 * Current CPU cannot be holding rq->lock in this
>> +			 * circumstance, but another might be. We must hold
>> +			 * rq->lock before we go poking around in its clocks
>> +			 */
>> +			raw_spin_lock_irqsave(&rq->lock, flags);
>> +			update_rq_clock(rq);
>> +			update_cfs_rq_blocked_load(cfs_rq, 0);
>> +			raw_spin_unlock_irqrestore(&rq->lock, flags);
>> +		}
>>  		se->avg.decay_count = -__synchronize_entity_decay(se);
>>  		atomic_long_add(se->avg.load_avg_contrib,
>>  				&cfs_rq->removed_load);
>
> Right, as Ben already said; taking a rq->lock there is unfortunate at
> best.
>

You're both being very polite. I hate it too :)

The only reason for taking the lock is to be sure the decay_counter is
updated; normal sleep accounting doesn't need all this and is always
correct. This stuff only exists in the first place so that the blocked
load doesn't get out of whack when we are moving entities.

Maybe I can separate entity load decay due to sleep from blocked load
accounting. It will leave blocked load subject to rq clock updates but
that might be OK. It's only a half-formed idea at the moment, I will
ponder a bit longer.

> So normally we 'throttle' the expense of decaying the blocked load to
> ticks. But the above does it on every (suitable) task migration which
> might be far more often.
>
> So ideally we'd get it all sorted through the nohz_idle_balance() path;
> what exactly are the problems with that?
>

What I want is for se.avg.runnable_avg_sum to correctly incorporate the
sleep time. I care much less about runqueue blocked averages being
instantaneously correct so long as they don't get out of sync when
entities move around. If it's possible to do it in the idle balance
path that is great - if you have any suggestions I'd be glad to have a
look.
* Re: [PATCH 2/2] sched: update runqueue clock before migrations away
From: Peter Zijlstra @ 2013-12-10 15:14 UTC
To: Chris Redpath
Cc: pjt@google.com, mingo@redhat.com, alex.shi@linaro.org, Morten Rasmussen, Dietmar Eggemann, linux-kernel@vger.kernel.org, bsegall

On Tue, Dec 10, 2013 at 01:24:21PM +0000, Chris Redpath wrote:
> What happens is that if you have a task which sleeps for a while and wakes
> on a different CPU and the previous CPU hasn't had a tick for a while, then
> that sleep time is lost.

/me more confused now.

Where does update_cfs_rq_blocked_load() account sleep time? It's only
concerned with the blocked decay.

Or is it the decay_count <= 0 case in enqueue_entity_load_avg() that's
screwing you over?

That's guesstimating the last_runnable_update based on decay_count, and
per the previous the decay count can get slightly out of sync.

If so, we should look to cure the issue enqueue_entity_load_avg() is
trying to work around, namely the fact that we cannot assume synced
clocks, nor can we read remote clocks atomically.
* Re: [PATCH 2/2] sched: update runqueue clock before migrations away
From: Chris Redpath @ 2013-12-10 15:55 UTC
To: Peter Zijlstra
Cc: pjt@google.com, mingo@redhat.com, alex.shi@linaro.org, Morten Rasmussen, Dietmar Eggemann, linux-kernel@vger.kernel.org, bsegall@google.com

On 10/12/13 15:14, Peter Zijlstra wrote:
> On Tue, Dec 10, 2013 at 01:24:21PM +0000, Chris Redpath wrote:
>> What happens is that if you have a task which sleeps for a while and wakes
>> on a different CPU and the previous CPU hasn't had a tick for a while, then
>> that sleep time is lost.
>
> /me more confused now.
>
> Where does update_cfs_rq_blocked_load() account sleep time? It's only
> concerned with the blocked decay.
>
> Or is it the decay_count <= 0 case in enqueue_entity_load_avg() that's
> screwing you over?
>

Sorry, I don't appear to have explained myself well.

We don't get a negative decay_count when we synchronize the entity in
this situation because the decay_counter of the runqueue has not been
updated since the last update. __synchronize_entity_decay will return
0, and hence the entity load average does not decay and neither does
the blocked load.

It doesn't seem to matter much that the blocked load doesn't decay
because we are about to remove the load from the now-unblocked entity
anyway, but it impacts the entity load average because that sleep time
has now disappeared.

The call to update_rq_clock() is needed to get rq->clock_task updated,
and then I use update_cfs_rq_blocked_load() to get the decay_counter
updated. It has the side effect of updating the blocked_load_avg to
stay consistent, but the effect I am looking for is the decay_counter
update.

> That's guesstimating the last_runnable_update based on decay_count, and
> per the previous the decay count can get slightly out of sync.

The guesstimation works fine; the issue is only that we can't tell at
this point how much time that entity was asleep when the CPU it ran on
has no tick, and since it is too expensive to synchronize the clocks,
there isn't (currently) a way to find out without updating the rq we
came from.

I can't see anything handy lying around in struct rq, but is there a
copy of the jiffies held anywhere per-cpu, presuming it would stop
being updated when the tick stops? If not, I could store one somewhere
as part of turning the tick off and then we could use the difference
between that and the current jiffies count to estimate the amount of
time in limbo. That would almost certainly be accurate enough for me -
a few ms won't hurt, but when we lose seconds it does.

> If so, we should look to cure the issue enqueue_entity_load_avg() is
> trying to work around, namely the fact that we cannot assume synced
> clocks, nor can we read remote clocks atomically.
>

That would be nice, but IMHO the estimation error is small and
self-consistent so the optimization appears good.
* Re: [PATCH 2/2] sched: update runqueue clock before migrations away
From: Peter Zijlstra @ 2013-12-12 18:24 UTC
To: Chris Redpath
Cc: pjt@google.com, mingo@redhat.com, alex.shi@linaro.org, Morten Rasmussen, Dietmar Eggemann, linux-kernel@vger.kernel.org, bsegall@google.com, Vincent Guittot, Frederic Weisbecker

On Tue, Dec 10, 2013 at 03:55:43PM +0000, Chris Redpath wrote:
> > That's guesstimating the last_runnable_update based on decay_count, and
> > per the previous the decay count can get slightly out of sync.
>
> The guesstimation works fine; the issue is only that we can't tell at
> this point how much time that entity was asleep when the CPU it ran on
> has no tick, and since it is too expensive to synchronize the clocks,
> there isn't (currently) a way to find out without updating the rq we
> came from.
>
> I can't see anything handy lying around in struct rq, but is there a
> copy of the jiffies held anywhere per-cpu, presuming it would stop
> being updated when the tick stops? If not, I could store one somewhere
> as part of turning the tick off and then we could use the difference
> between that and the current jiffies count to estimate the amount of
> time in limbo. That would almost certainly be accurate enough for me -
> a few ms won't hurt, but when we lose seconds it does.

Would pre_schedule_idle() -> rq_last_tick_reset() -> rq->last_sched_tick
be useful?

I suppose we could easily lift that to NO_HZ_COMMON.

Which raises another point; I dislike this idle pre/post business. Why
can't the post_schedule_idle() calls be done from pick_next_task_idle()
and pre_schedule_idle() done from put_prev_task_idle()?

That would avoid the post_schedule() rq->lock dance.

Vince?
* Re: [PATCH 2/2] sched: update runqueue clock before migrations away
From: Vincent Guittot @ 2013-12-13 8:48 UTC
To: Peter Zijlstra
Cc: Chris Redpath, pjt@google.com, mingo@redhat.com, alex.shi@linaro.org, Morten Rasmussen, Dietmar Eggemann, linux-kernel@vger.kernel.org, bsegall@google.com, Frederic Weisbecker

On 12 December 2013 19:24, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Dec 10, 2013 at 03:55:43PM +0000, Chris Redpath wrote:
>> The guesstimation works fine; the issue is only that we can't tell at
>> this point how much time that entity was asleep when the CPU it ran on
>> has no tick, and since it is too expensive to synchronize the clocks,
>> there isn't (currently) a way to find out without updating the rq we
>> came from.
>>
>> I can't see anything handy lying around in struct rq, but is there a
>> copy of the jiffies held anywhere per-cpu, presuming it would stop
>> being updated when the tick stops? If not, I could store one somewhere
>> as part of turning the tick off and then we could use the difference
>> between that and the current jiffies count to estimate the amount of
>> time in limbo. That would almost certainly be accurate enough for me -
>> a few ms won't hurt, but when we lose seconds it does.
>
> Would pre_schedule_idle() -> rq_last_tick_reset() -> rq->last_sched_tick
> be useful?
>
> I suppose we could easily lift that to NO_HZ_COMMON.
>
> Which raises another point; I dislike this idle pre/post business. Why
> can't the post_schedule_idle() calls be done from pick_next_task_idle()
> and pre_schedule_idle() done from put_prev_task_idle()?

The update of runnable_avg for a cpu that becomes idle must be done
before the call to idle_balance, which will use it; so idle_enter_fair
is called in pre_schedule_idle. idle_exit_fair has been put in the
post_schedule function in order to be symmetric and coherent, but
nothing prevents moving it into pick_next_task_idle.

Vincent

> That would avoid the post_schedule() rq->lock dance.
>
> Vince?
* Re: [PATCH 2/2] sched: update runqueue clock before migrations away
From: Chris Redpath @ 2013-12-17 14:09 UTC
To: Peter Zijlstra
Cc: pjt@google.com, mingo@redhat.com, alex.shi@linaro.org, Morten Rasmussen, Dietmar Eggemann, linux-kernel@vger.kernel.org, bsegall@google.com, Vincent Guittot, Frederic Weisbecker

On 12/12/13 18:24, Peter Zijlstra wrote:
> Would pre_schedule_idle() -> rq_last_tick_reset() -> rq->last_sched_tick
> be useful?
>
> I suppose we could easily lift that to NO_HZ_COMMON.
>

Many thanks for the tip Peter, I have tried this out and it does
provide enough information to be able to correct the problem. The new
version doesn't update the rq, just carries the extra unaccounted time
(estimated from the jiffies) over to be processed during enqueue.

However, before I send a new patch set I have a question about the
existing behavior. Ben, you may already know the answer to this?

During a wake migration we call __synchronize_entity_decay in
migrate_task_rq_fair, which will decay avg.runnable_avg_sum. We also
record the number of periods we decayed for as a negative number in
avg.decay_count.

We then enqueue the task on its target runqueue, and again we decay the
load by the number of periods it has been off-rq:

	if (unlikely(se->avg.decay_count <= 0)) {
		se->avg.last_runnable_update = rq_clock_task(rq_of(cfs_rq));
		if (se->avg.decay_count) {
			se->avg.last_runnable_update -=
					(-se->avg.decay_count) << 20;
>>>			update_entity_load_avg(se, 0);

Am I misunderstanding how this is supposed to work or have we always
been double-accounting sleep time for wake migrations?
* Re: [PATCH 2/2] sched: update runqueue clock before migrations away
From: Peter Zijlstra @ 2013-12-17 15:51 UTC
To: Chris Redpath
Cc: pjt@google.com, mingo@redhat.com, alex.shi@linaro.org, Morten Rasmussen, Dietmar Eggemann, linux-kernel@vger.kernel.org, bsegall@google.com, Vincent Guittot, Frederic Weisbecker

On Tue, Dec 17, 2013 at 02:09:13PM +0000, Chris Redpath wrote:
> On 12/12/13 18:24, Peter Zijlstra wrote:
> > Would pre_schedule_idle() -> rq_last_tick_reset() -> rq->last_sched_tick
> > be useful?
> >
> > I suppose we could easily lift that to NO_HZ_COMMON.
> >
>
> Many thanks for the tip Peter, I have tried this out and it does provide
> enough information to be able to correct the problem. The new version
> doesn't update the rq, just carries the extra unaccounted time (estimated
> from the jiffies) over to be processed during enqueue.
>
> However, before I send a new patch set I have a question about the
> existing behavior. Ben, you may already know the answer to this?
>
> During a wake migration we call __synchronize_entity_decay in
> migrate_task_rq_fair, which will decay avg.runnable_avg_sum. We also
> record the number of periods we decayed for as a negative number in
> avg.decay_count.
>
> We then enqueue the task on its target runqueue, and again we decay the
> load by the number of periods it has been off-rq:
>
> 	if (unlikely(se->avg.decay_count <= 0)) {
> 		se->avg.last_runnable_update = rq_clock_task(rq_of(cfs_rq));
> 		if (se->avg.decay_count) {
> 			se->avg.last_runnable_update -=
> 					(-se->avg.decay_count) << 20;
> >>>			update_entity_load_avg(se, 0);
>
> Am I misunderstanding how this is supposed to work or have we always
> been double-accounting sleep time for wake migrations?

The way I understood this is that on migrate we sync from dequeue to
migrate point, and subtract the 'current (at point of migrate)' blocked
load from the src rq, and add it to the dst rq.

Then on enqueue we sync again, but now from migrate to enqueue.
* Re: [PATCH 2/2] sched: update runqueue clock before migrations away
From: bsegall@google.com @ 2013-12-17 18:03 UTC (permalink / raw)
To: Chris Redpath
Cc: Peter Zijlstra, pjt@google.com, mingo@redhat.com, alex.shi@linaro.org,
    Morten Rasmussen, Dietmar Eggemann, linux-kernel@vger.kernel.org,
    Vincent Guittot, Frederic Weisbecker

Chris Redpath <chris.redpath@arm.com> writes:
[quoted message trimmed]
> Am I misunderstanding how this is supposed to work or have we been
> always double-accounting sleep time for wake migrations?

__synchronize_entity_decay will decay load_avg_contrib in order to
figure out how much to remove from old_cfs_rq->blocked_load.
update_entity_load_avg will update the underlying
runnable_avg_sum/period that is used to update load_avg_contrib.

(Normally we update runnable_avg_sum, which updates load_avg_contrib via
__update_entity_load_avg_contrib. Here we go in the reverse direction
because we don't hold the right rq locks at the right times.)
* Re: [PATCH 2/2] sched: update runqueue clock before migrations away
From: Chris Redpath @ 2013-12-18 10:13 UTC (permalink / raw)
To: bsegall@google.com
Cc: Peter Zijlstra, pjt@google.com, mingo@redhat.com, alex.shi@linaro.org,
    Morten Rasmussen, Dietmar Eggemann, linux-kernel@vger.kernel.org,
    Vincent Guittot, Frederic Weisbecker, Chris Redpath

On 17/12/13 18:03, bsegall@google.com wrote:
> __synchronize_entity_decay will decay load_avg_contrib in order to
> figure out how much to remove from old_cfs_rq->blocked_load.
> update_entity_load_avg will update the underlying
> runnable_avg_sum/period that is used to update load_avg_contrib.
>
> (Normally we update runnable_avg_sum, which updates load_avg_contrib via
> __update_entity_load_avg_contrib. Here we go in the reverse direction
> because we don't hold the right rq locks at the right times.)
>

Thanks Ben, got it now. The only question remaining for me to figure out
is whether I need to include the missed tick time in the contrib decay
or not - I definitely need to include it in the negative decay count we
send through a migration. I'll go and check the places we use the
removed load and update blocked load again.
end of thread, other threads: [~2013-12-18 10:13 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-12-09 12:59 [PATCH 0/2] Per-task load tracking errors Chris Redpath
2013-12-09 12:59 ` [PATCH 1/2] sched: reset blocked load decay_count during synchronization Chris Redpath
2013-12-09 17:59 ` bsegall
2013-12-09 12:59 ` [PATCH 2/2] sched: update runqueue clock before migrations away Chris Redpath
2013-12-09 18:13 ` bsegall
2013-12-10 11:48 ` Peter Zijlstra
2013-12-10 13:24 ` Chris Redpath
2013-12-10 15:14 ` Peter Zijlstra
2013-12-10 15:55 ` Chris Redpath
2013-12-12 18:24 ` Peter Zijlstra
2013-12-13  8:48 ` Vincent Guittot
2013-12-17 14:09 ` Chris Redpath
2013-12-17 15:51 ` Peter Zijlstra
2013-12-17 18:03 ` bsegall
2013-12-18 10:13 ` Chris Redpath