* [PATCH] sched: fix group_entity's share update
From: Vincent Guittot @ 2016-12-01 16:38 UTC
To: peterz, mingo, linux-kernel; +Cc: pjt, Vincent Guittot, stable

The update of the share of a cfs_rq is done when its load_avg is updated
but before the group_entity's load_avg has been updated for the past time
slot. This generates wrong load_avg accounting which can be significant
when small tasks are involved in the scheduling.

Let's take the example of a task TA that is dequeued from its task group
TG1. TA was the only task in TG1, which therefore becomes idle.

We have the sequence:

- dequeue_entity TA->se
  - update_load_avg(TA->se)
  - dequeue_entity_load_avg(TG1->cfs_rq, TA->se)
  - account_entity_dequeue(TG1->cfs_rq, TA->se)
    TG1->cfs_rq->load.weight = 0
  - update_cfs_shares(TG1->cfs_rq)
    TG1->se->load.weight is updated with the new share of
    cfs_rq. TG1->se->load.weight = 0.
- dequeue_entity TG1->se
  - update_load_avg(TG1->se), but its weight is now zero, so the last time
    slot (up to a tick) will be accounted with its new weight (0 in our
    case) instead of its real weight. The last time slot is accounted as
    an idle one whereas it was a running one.

If the running time of TA is short enough that no tick happens while it
runs, all running time of TG1->se will be accounted as idle time.

Instead, we should update the share of a cfs_rq (in fact the weight of its
group entity) only after having updated the load_avg of the group_entity.

update_cfs_shares() now takes the sched_entity as parameter instead of the
cfs_rq, and the weight of the group_entity is updated only once its load_avg
has been synced with the current time.

Cc: <stable@vger.kernel.org>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
I have seen the problem on tip/sched/core, v4.8 and v4.7. Previous versions
might also have the problem but I haven't been able to test them yet.
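
The accounting error described in the changelog can be illustrated with a
toy model. This is only a sketch under loud assumptions: `struct toy_se`,
`toy_update_load()`, `toy_dequeue_old()` and `toy_dequeue_new()` are
hypothetical names that do not exist in the kernel, and the model sums
weight * time without the decay that the real load tracking applies.

```c
#include <assert.h>

/* Toy group entity: NOT the kernel's struct sched_entity. */
struct toy_se {
	unsigned long weight;      /* current load weight */
	unsigned long last_update; /* time of the last accounting, arbitrary units */
	unsigned long load_sum;    /* accumulated weight * time, no decay (assumption) */
};

/* Account the elapsed slot with whatever weight is set right now. */
static void toy_update_load(struct toy_se *se, unsigned long now)
{
	assert(now >= se->last_update);
	se->load_sum += se->weight * (now - se->last_update);
	se->last_update = now;
}

/* Old order: reweight first, then account. The slot is charged at weight 0. */
static unsigned long toy_dequeue_old(struct toy_se se, unsigned long now)
{
	se.weight = 0;
	toy_update_load(&se, now);
	return se.load_sum;
}

/* Patched order: account the slot with the old weight, then reweight. */
static unsigned long toy_dequeue_new(struct toy_se se, unsigned long now)
{
	toy_update_load(&se, now);
	se.weight = 0;
	return se.load_sum;
}
```

With an entity of weight 1024 that ran for 3 time units, the old order
records nothing for the slot while the patched order records 1024 * 3.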
kernel/sched/fair.c | 27 ++++++++++++++++-----------
1 file changed, 16 insertions(+), 11 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 18d9e75..19092fa 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2689,15 +2689,18 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 
 static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
 
-static void update_cfs_shares(struct cfs_rq *cfs_rq)
+static void update_cfs_shares(struct sched_entity *se)
 {
 	struct task_group *tg;
-	struct sched_entity *se;
+	struct cfs_rq *cfs_rq = group_cfs_rq(se);
 	long shares;
 
+	if (entity_is_task(se))
+		return;
+
 	tg = cfs_rq->tg;
-	se = tg->se[cpu_of(rq_of(cfs_rq))];
-	if (!se || throttled_hierarchy(cfs_rq))
+
+	if (throttled_hierarchy(cfs_rq))
 		return;
 #ifndef CONFIG_SMP
 	if (likely(se->load.weight == tg->shares))
@@ -2707,8 +2710,10 @@ static void update_cfs_shares(struct cfs_rq *cfs_rq)
 
 	reweight_entity(cfs_rq_of(se), se, shares);
 }
+
+
 #else /* CONFIG_FAIR_GROUP_SCHED */
-static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
+static inline void update_cfs_shares(struct sched_entity *se)
 {
 }
 #endif /* CONFIG_FAIR_GROUP_SCHED */
@@ -3583,9 +3588,9 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 		se->vruntime += cfs_rq->min_vruntime;
 
 	update_load_avg(se, UPDATE_TG);
+	update_cfs_shares(se);
 	enqueue_entity_load_avg(cfs_rq, se);
 	account_entity_enqueue(cfs_rq, se);
-	update_cfs_shares(cfs_rq);
 
 	if (flags & ENQUEUE_WAKEUP)
 		place_entity(cfs_rq, se, 0);
@@ -3681,7 +3686,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	/* return excess runtime on last dequeue */
 	return_cfs_rq_runtime(cfs_rq);
 
-	update_cfs_shares(cfs_rq);
+	update_cfs_shares(se);
 
 	/*
 	 * Now advance min_vruntime if @se was the entity holding it back,
@@ -3864,7 +3869,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 	 * Ensure that runnable average is periodically updated.
 	 */
 	update_load_avg(curr, UPDATE_TG);
-	update_cfs_shares(cfs_rq);
+	update_cfs_shares(curr);
 
 #ifdef CONFIG_SCHED_HRTICK
 	/*
@@ -4761,7 +4766,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 			break;
 
 		update_load_avg(se, UPDATE_TG);
-		update_cfs_shares(cfs_rq);
+		update_cfs_shares(se);
 	}
 
 	if (!se)
@@ -4820,7 +4825,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 			break;
 
 		update_load_avg(se, UPDATE_TG);
-		update_cfs_shares(cfs_rq);
+		update_cfs_shares(se);
 	}
 
 	if (!se)
@@ -9316,7 +9321,7 @@ int sched_group_set_shares(struct task_group *tg, unsigned long shares)
 		/* Possible calls to update_curr() need rq clock */
 		update_rq_clock(rq);
 		for_each_sched_entity(se)
-			update_cfs_shares(group_cfs_rq(se));
+			update_cfs_shares(se);
 		raw_spin_unlock_irqrestore(&rq->lock, flags);
 	}
 
--
2.7.4

* Re: [PATCH] sched: fix group_entity's share update
From: Vincent Guittot @ 2016-12-15 16:52 UTC
To: Peter Zijlstra, Ingo Molnar, linux-kernel; +Cc: Paul Turner, Vincent Guittot, stable

Gentle ping ...

Vincent

On 1 December 2016 at 17:38, Vincent Guittot <vincent.guittot@linaro.org> wrote:
> The update of the share of a cfs_rq is done when its load_avg is updated
> but before the group_entity's load_avg has been updated for the past time
> slot. This generates wrong load_avg accounting which can be significant
> when small tasks are involved in the scheduling.
>
> [...]

* Re: [PATCH] sched: fix group_entity's share update
From: Peter Zijlstra @ 2016-12-15 21:42 UTC
To: Vincent Guittot; +Cc: mingo, linux-kernel, pjt, stable

On Thu, Dec 01, 2016 at 05:38:53PM +0100, Vincent Guittot wrote:
> The update of the share of a cfs_rq is done when its load_avg is updated
> but before the group_entity's load_avg has been updated for the past time
> slot. This generates wrong load_avg accounting which can be significant
> when small tasks are involved in the scheduling.
>
> [...]
>
> update_cfs_shares() now takes the sched_entity as parameter instead of the
> cfs_rq and the weight of the group_entity is updated only once its load_avg
> has been synced with current time.

Urgh, brain hurt, also those names don't help; s/TG1/A/ s/TA/a/

So the problem is that in our for_each_sched_entity(se) loop we end up
changing the next se before we get there.

	root
	(cfs_rq)
	  \
	  (se)
	   A
	  (cfs_rq)
	    \
	    (se)
	     a

Starting at a's se, we update_cfs_shares() on A's cfs_rq, which then
updates A's se, which is the next se in our iteration and mucks with
state before we get there.

So you change update_cfs_shares() to go downward while we go upward,
ensuring we only update things that we've finished with.

Makes sense..

> kernel/sched/fair.c | 27 ++++++++++++++++-----------
> 1 file changed, 16 insertions(+), 11 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 18d9e75..19092fa 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2689,15 +2689,18 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
>
>  static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
>
> -static void update_cfs_shares(struct cfs_rq *cfs_rq)
> +static void update_cfs_shares(struct sched_entity *se)
>  {
>  	struct task_group *tg;
> -	struct sched_entity *se;
> +	struct cfs_rq *cfs_rq = group_cfs_rq(se);
>  	long shares;

please keep them ordered by length.

>
> +	if (entity_is_task(se))

can be: !cfs_rq, which is the same and we already done that load.

> +		return;
> +
>  	tg = cfs_rq->tg;

This load isn't needed here yet, can be moved down a bit.

> -	se = tg->se[cpu_of(rq_of(cfs_rq))];
> -	if (!se || throttled_hierarchy(cfs_rq))
> +
> +	if (throttled_hierarchy(cfs_rq))
>  		return;
>  #ifndef CONFIG_SMP
>  	if (likely(se->load.weight == tg->shares))

> @@ -3583,9 +3588,9 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  		se->vruntime += cfs_rq->min_vruntime;
>
>  	update_load_avg(se, UPDATE_TG);
> +	update_cfs_shares(se);
>  	enqueue_entity_load_avg(cfs_rq, se);
>  	account_entity_enqueue(cfs_rq, se);
> -	update_cfs_shares(cfs_rq);
>
>  	if (flags & ENQUEUE_WAKEUP)
>  		place_entity(cfs_rq, se, 0);

So here we need to update_cfs_shares() _before_ enqueue_entity, because
the update_cfs_shares() will affect this se's load, right?

> @@ -3681,7 +3686,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  	/* return excess runtime on last dequeue */
>  	return_cfs_rq_runtime(cfs_rq);
>
> -	update_cfs_shares(cfs_rq);
> +	update_cfs_shares(se);
>
>  	/*
>  	 * Now advance min_vruntime if @se was the entity holding it back,

But this one hurts my brain..

It must be done after dequeue_entity_load_avg() such that we subtract
the load as was seen until now.

Could we please add comments explaining this ordering, because I forever
need to think about this (both enqueue and dequeue).

> @@ -3864,7 +3869,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
>  	 * Ensure that runnable average is periodically updated.
>  	 */
>  	update_load_avg(curr, UPDATE_TG);
> -	update_cfs_shares(cfs_rq);
> +	update_cfs_shares(curr);
>
>  #ifdef CONFIG_SCHED_HRTICK
>  	/*
> @@ -4761,7 +4766,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  			break;
>
>  		update_load_avg(se, UPDATE_TG);
> -		update_cfs_shares(cfs_rq);
> +		update_cfs_shares(se);
>  	}
>
>  	if (!se)
> @@ -4820,7 +4825,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  			break;
>
>  		update_load_avg(se, UPDATE_TG);
> -		update_cfs_shares(cfs_rq);
> +		update_cfs_shares(se);
>  	}
>
>  	if (!se)

This has a distinct pattern to it though; should we think about
something like: UPDATE_SHARES for update_load_avg() or does that confuse
things?

> @@ -9316,7 +9321,7 @@ int sched_group_set_shares(struct task_group *tg, unsigned long shares)
>  		/* Possible calls to update_curr() need rq clock */
>  		update_rq_clock(rq);
>  		for_each_sched_entity(se)
> -			update_cfs_shares(group_cfs_rq(se));
> +			update_cfs_shares(se);

Should we not also catch up with our load before we frob the shares?

>  		raw_spin_unlock_irqrestore(&rq->lock, flags);
>  	}
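
The "changing the next se before we get there" problem can be sketched
with a two-level toy hierarchy (task a inside group A). Everything here is
an assumption made for illustration: `struct toy_ent`, `TOY_SHARES`,
`toy_update_shares()`, `toy_walk_old()` and `toy_walk_new()` are
hypothetical, the share rule is deliberately simplistic, and the walk always
climbs to the top, unlike the real dequeue path.

```c
#include <stddef.h>

#define TOY_SHARES 1024UL

/* Toy entity: NOT the kernel's struct sched_entity. */
struct toy_ent {
	struct toy_ent *parent;       /* next entity up the hierarchy, NULL at the top */
	int is_task;                  /* 1 for a task, 0 for a group entity */
	unsigned long weight;         /* current weight of this entity */
	unsigned long queued_below;   /* weight queued on the group's own runqueue */
	unsigned long accounted_with; /* weight in effect when the slot was accounted */
};

/* Hypothetical share rule: an empty group drops to 0, otherwise TOY_SHARES. */
static void toy_update_shares(struct toy_ent *e)
{
	if (e->is_task)
		return;
	e->weight = e->queued_below ? TOY_SHARES : 0;
}

/* Old behaviour: each step reweights the entity one level up. */
static void toy_walk_old(struct toy_ent *task)
{
	struct toy_ent *e;

	for (e = task; e; e = e->parent) {
		e->accounted_with = e->weight; /* stands in for update_load_avg() */
		if (e->parent) {
			e->parent->queued_below -= e->weight;
			toy_update_shares(e->parent); /* touches the NEXT entity */
		}
	}
}

/* Patched behaviour: each step reweights only the entity just accounted. */
static void toy_walk_new(struct toy_ent *task)
{
	struct toy_ent *e;

	for (e = task; e; e = e->parent) {
		e->accounted_with = e->weight; /* stands in for update_load_avg() */
		if (e->parent)
			e->parent->queued_below -= e->weight;
		toy_update_shares(e); /* touches only the current entity */
	}
}
```

In this model, dequeuing a with the old walk leaves A accounted with weight
0, while the patched walk accounts A with its previous weight and only then
drops it to 0.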

* Re: [PATCH] sched: fix group_entity's share update
From: Vincent Guittot @ 2016-12-16 8:55 UTC
To: Peter Zijlstra; +Cc: Ingo Molnar, linux-kernel, Paul Turner, stable

On 15 December 2016 at 22:42, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Dec 01, 2016 at 05:38:53PM +0100, Vincent Guittot wrote:
> > [...]
>
> Urgh, brain hurt, also those names don't help; s/TG1/A/ s/TA/a/
>
> So the problem is that in our for_each_sched_entity(se) loop we end up
> changing the next se before we get there.
>
> 	root
> 	(cfs_rq)
> 	  \
> 	  (se)
> 	   A
> 	  (cfs_rq)
> 	    \
> 	    (se)
> 	     a
>
> Starting at a's se, we update_cfs_shares() on A's cfs_rq, which then
> updates A's se, which is the next se in our iteration and mucks with
> state before we get there.
>
> So you change update_cfs_shares() to go downward while we go upward,
> ensuring we only update things that we've finished with.

Yes.

> Makes sense..
>
> > -static void update_cfs_shares(struct cfs_rq *cfs_rq)
> > +static void update_cfs_shares(struct sched_entity *se)
> >  {
> >  	struct task_group *tg;
> > -	struct sched_entity *se;
> > +	struct cfs_rq *cfs_rq = group_cfs_rq(se);
> >  	long shares;
>
> please keep them ordered by length.

OK.

> > +	if (entity_is_task(se))
>
> can be: !cfs_rq, which is the same and we already done that load.

Yes. My goal was to keep the meaning of the test readable, and I was
expecting the compiler to be smart enough to use a single load for both
cfs_rq = group_cfs_rq(se) and entity_is_task(se).

I can change it to !cfs_rq.

> > +		return;
> > +
> >  	tg = cfs_rq->tg;
>
> This load isn't needed here yet, can be moved down a bit.

Indeed.

> > @@ -3583,9 +3588,9 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> >  		se->vruntime += cfs_rq->min_vruntime;
> >
> >  	update_load_avg(se, UPDATE_TG);
> > +	update_cfs_shares(se);
> >  	enqueue_entity_load_avg(cfs_rq, se);
> >  	account_entity_enqueue(cfs_rq, se);
> > -	update_cfs_shares(cfs_rq);
>
> So here we need to update_cfs_shares() _before_ enqueue_entity, because
> the update_cfs_shares() will affect this se's load, right?

Exactly.

> > @@ -3681,7 +3686,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> >  	/* return excess runtime on last dequeue */
> >  	return_cfs_rq_runtime(cfs_rq);
> >
> > -	update_cfs_shares(cfs_rq);
> > +	update_cfs_shares(se);
>
> But this one hurts my brain..
>
> It must be done after dequeue_entity_load_avg() such that we subtract
> the load as was seen until now.

update_cfs_shares(A's se) must be done after update_load_avg(A's se,
UPDATE_TG), so that A's se->load_avg is updated for the previous time
slot using the previous weight.

update_cfs_shares(A's se) could be done before or after
dequeue_entity_load_avg(A's se) because the root's cfs_rq is kept in sync
during the reweight of A's se. Nevertheless, I see one advantage of
doing it after: reweight_entity() will be faster because A's se->on_rq
will have been cleared in the meantime.

> Could we please add comments explaining this ordering, because I forever
> need to think about this (both enqueue and dequeue).

OK.

> > @@ -4820,7 +4825,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >  			break;
> >
> >  		update_load_avg(se, UPDATE_TG);
> > -		update_cfs_shares(cfs_rq);
> > +		update_cfs_shares(se);
> >  	}
>
> This has a distinct pattern to it though; should we think about
> something like: UPDATE_SHARES for update_load_avg() or does that confuse
> things?

IMHO, keeping update_cfs_shares() separate from update_load_avg() makes
it clear when we update the shares, and enables some optimizations like
the one for dequeue_entity.

> > @@ -9316,7 +9321,7 @@ int sched_group_set_shares(struct task_group *tg, unsigned long shares)
> >  		for_each_sched_entity(se)
> > -			update_cfs_shares(group_cfs_rq(se));
> > +			update_cfs_shares(se);
>
> Should we not also catch up with our load before we frob the shares?

Yes, you're right, an update_load_avg() is missing.

> >  		raw_spin_unlock_irqrestore(&rq->lock, flags);
> >  	}

* Re: [PATCH] sched: fix group_entity's share update
From: Vincent Guittot @ 2016-12-19 17:37 UTC
To: Peter Zijlstra; +Cc: Ingo Molnar, linux-kernel, Paul Turner, stable

On 16 December 2016 at 09:55, Vincent Guittot <vincent.guittot@linaro.org> wrote:
> On 15 December 2016 at 22:42, Peter Zijlstra <peterz@infradead.org> wrote:
>> [...]
>> > @@ -3583,9 +3588,9 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>> >  		se->vruntime += cfs_rq->min_vruntime;
>> >
>> >  	update_load_avg(se, UPDATE_TG);
>> > +	update_cfs_shares(se);
>> >  	enqueue_entity_load_avg(cfs_rq, se);
>> >  	account_entity_enqueue(cfs_rq, se);
>> > -	update_cfs_shares(cfs_rq);
>>
>> So here we need to update_cfs_shares() _before_ enqueue_entity, because
>> the update_cfs_shares() will affect this se's load, right?
>
> exactly

In fact, the only constraint is that update_cfs_shares() must be done
before account_entity_enqueue(). But there is no constraint with
enqueue_entity_load_avg(), so it's probably better to put manipulation of
load together and manipulation of weight together:

	update_load_avg(se, UPDATE_TG);
	enqueue_entity_load_avg(cfs_rq, se);
	update_cfs_shares(se);
	account_entity_enqueue(cfs_rq, se);

> [...]
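
One possible reading of the "before account_entity_enqueue()" constraint
can be sketched as follows. The thread does not spell out the reason, so
this is an assumption: the sketch supposes that, as in the kernel's
reweight_entity(), the runqueue total is only fixed up for entities already
flagged on_rq. `struct toy_rq`, `struct toy_se`, `toy_reweight()` and the
two `toy_enqueue_*()` helpers are hypothetical names, and the "shares last"
variant is an invented counter-example, not code from any kernel version.

```c
#include <assert.h>

/* Toy runqueue and entity: NOT the kernel structures. */
struct toy_rq { unsigned long load; };
struct toy_se { unsigned long weight; int on_rq; };

/* Same shape as reweight_entity(): the rq total is fixed up only if on_rq. */
static void toy_reweight(struct toy_rq *rq, struct toy_se *se, unsigned long w)
{
	if (se->on_rq)
		rq->load -= se->weight;
	se->weight = w;
	if (se->on_rq)
		rq->load += se->weight;
}

/* Order suggested in the thread: settle the weight, then account it. */
static void toy_enqueue_shares_first(struct toy_rq *rq, struct toy_se *se,
				     unsigned long w)
{
	assert(!se->on_rq);
	toy_reweight(rq, se, w);  /* stands in for update_cfs_shares(se) */
	rq->load += se->weight;   /* stands in for account_entity_enqueue() */
	se->on_rq = 1;
}

/* Hypothetical wrong order: account the stale weight, then reweight. */
static void toy_enqueue_shares_last(struct toy_rq *rq, struct toy_se *se,
				    unsigned long w)
{
	assert(!se->on_rq);
	rq->load += se->weight;   /* stale weight enters the rq total */
	toy_reweight(rq, se, w);  /* on_rq is still 0: rq->load is not fixed up */
	se->on_rq = 1;
}
```

In this model the first helper leaves the runqueue total equal to the
entity's new weight, while the second leaves the total at the stale weight.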