[PATCH] sched: fix group_entity's share update

* [PATCH] sched: fix group_entity's share update
@ 2016-12-01 16:38 Vincent Guittot
  2016-12-15 16:52 ` Vincent Guittot
  2016-12-15 21:42 ` Peter Zijlstra
  0 siblings, 2 replies; 5+ messages in thread
From: Vincent Guittot @ 2016-12-01 16:38 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel; +Cc: pjt, Vincent Guittot, stable

The update of the share of a cfs_rq is done when its load_avg is updated
but before the group_entity's load_avg has been updated for the past time
slot. This generates wrong load_avg accounting which can be significant
when small tasks are involved in the scheduling.

Let's take the example of a task TA that is dequeued from its task group
TG1. TA was the only task in TG1, which therefore becomes idle.

We have the sequence:

- dequeue_entity TA->se
    - update_load_avg(TA->se)
    - dequeue_entity_load_avg(TG1->cfs_rq, TA->se)
    - account_entity_dequeue(TG1->cfs_rq, TA->se)
          TG1->cfs_rq->load.weight = 0
    - update_cfs_shares(TG1->cfs_rq)
          TG1->se->load.weight is updated with the new share of
          cfs_rq. TG1->se->load.weight = 0.
- dequeue_entity TG1->se
    - update_load_avg(TG1->se)
          Its weight is now zero, so the last time slot (up to a tick)
          is accounted with its new weight (0 in our case) instead of
          its real weight. The last time slot is accounted as an idle
          one whereas it was a running one.

If the running time of TA is short enough that no tick happens when it
runs, all running time of TG1->se will be accounted as idle time.

Instead, we should update the share of a cfs_rq (in fact the weight of its
group entity) only after having updated the load_avg of the group_entity.

update_cfs_shares() now takes the sched_entity as parameter instead of the
cfs_rq and the weight of the group_entity is updated only once its load_avg
has been synced with current time.

Cc: <stable@vger.kernel.org>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---

I have seen the problem on tip/sched/core, v4.8 and v4.7. Previous versions
might also have the problem but I haven't been able to test them yet.

 kernel/sched/fair.c | 27 ++++++++++++++++-----------
 1 file changed, 16 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 18d9e75..19092fa 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2689,15 +2689,18 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
 
 static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
 
-static void update_cfs_shares(struct cfs_rq *cfs_rq)
+static void update_cfs_shares(struct sched_entity *se)
 {
 	struct task_group *tg;
-	struct sched_entity *se;
+	struct cfs_rq *cfs_rq = group_cfs_rq(se);
 	long shares;
 
+	if (entity_is_task(se))
+		return;
+
 	tg = cfs_rq->tg;
-	se = tg->se[cpu_of(rq_of(cfs_rq))];
-	if (!se || throttled_hierarchy(cfs_rq))
+
+	if (throttled_hierarchy(cfs_rq))
 		return;
 #ifndef CONFIG_SMP
 	if (likely(se->load.weight == tg->shares))
@@ -2707,8 +2710,10 @@ static void update_cfs_shares(struct cfs_rq *cfs_rq)
 
 	reweight_entity(cfs_rq_of(se), se, shares);
 }
+
+
 #else /* CONFIG_FAIR_GROUP_SCHED */
-static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
+static inline void update_cfs_shares(struct sched_entity *se)
 {
 }
 #endif /* CONFIG_FAIR_GROUP_SCHED */
@@ -3583,9 +3588,9 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 		se->vruntime += cfs_rq->min_vruntime;
 
 	update_load_avg(se, UPDATE_TG);
+	update_cfs_shares(se);
 	enqueue_entity_load_avg(cfs_rq, se);
 	account_entity_enqueue(cfs_rq, se);
-	update_cfs_shares(cfs_rq);
 
 	if (flags & ENQUEUE_WAKEUP)
 		place_entity(cfs_rq, se, 0);
@@ -3681,7 +3686,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	/* return excess runtime on last dequeue */
 	return_cfs_rq_runtime(cfs_rq);
 
-	update_cfs_shares(cfs_rq);
+	update_cfs_shares(se);
 
 	/*
 	 * Now advance min_vruntime if @se was the entity holding it back,
@@ -3864,7 +3869,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
 	 * Ensure that runnable average is periodically updated.
 	 */
 	update_load_avg(curr, UPDATE_TG);
-	update_cfs_shares(cfs_rq);
+	update_cfs_shares(curr);
 
 #ifdef CONFIG_SCHED_HRTICK
 	/*
@@ -4761,7 +4766,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 			break;
 
 		update_load_avg(se, UPDATE_TG);
-		update_cfs_shares(cfs_rq);
+		update_cfs_shares(se);
 	}
 
 	if (!se)
@@ -4820,7 +4825,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 			break;
 
 		update_load_avg(se, UPDATE_TG);
-		update_cfs_shares(cfs_rq);
+		update_cfs_shares(se);
 	}
 
 	if (!se)
@@ -9316,7 +9321,7 @@ int sched_group_set_shares(struct task_group *tg, unsigned long shares)
 		/* Possible calls to update_curr() need rq clock */
 		update_rq_clock(rq);
 		for_each_sched_entity(se)
-			update_cfs_shares(group_cfs_rq(se));
+			update_cfs_shares(se);
 		raw_spin_unlock_irqrestore(&rq->lock, flags);
 	}
 
-- 
2.7.4
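
The accounting error described in the changelog can be illustrated with a
small user-space model. This is only a sketch under simplifying assumptions:
it is not the kernel's PELT code, there is no decay, and every toy_* name is
hypothetical. It accumulates weight * elapsed time and compares the old
ordering (weight zeroed before the accounting) with the patched ordering.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for a group sched_entity; not the kernel's struct. */
struct toy_se {
	uint64_t weight;      /* weight used for accounting */
	uint64_t last_update; /* time of the last accounting */
	uint64_t load_sum;    /* weight * time accumulated so far */
};

/* Account the time elapsed since last_update with the weight in effect now. */
static void toy_update_load_avg(struct toy_se *se, uint64_t now)
{
	se->load_sum += se->weight * (now - se->last_update);
	se->last_update = now;
}

/* Old ordering: the share update zeroes the weight before the accounting. */
static uint64_t toy_dequeue_old(uint64_t weight, uint64_t ran_for)
{
	struct toy_se se = { weight, 0, 0 };

	se.weight = 0;                     /* update_cfs_shares(TG1->cfs_rq) */
	toy_update_load_avg(&se, ran_for); /* slot is accounted as idle */
	return se.load_sum;
}

/* Patched ordering: account the slot first, then drop the weight. */
static uint64_t toy_dequeue_new(uint64_t weight, uint64_t ran_for)
{
	struct toy_se se = { weight, 0, 0 };

	toy_update_load_avg(&se, ran_for); /* slot accounted with its real weight */
	se.weight = 0;                     /* update_cfs_shares(TG1->se) */
	return se.load_sum;
}
```

With a weight of 1024 and a slot of 3 time units, toy_dequeue_old() reports 0
while toy_dequeue_new() reports 3072, which is the lost running time the
changelog refers to.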


* Re: [PATCH] sched: fix group_entity's share update
  2016-12-01 16:38 [PATCH] sched: fix group_entity's share update Vincent Guittot
@ 2016-12-15 16:52 ` Vincent Guittot
  2016-12-15 21:42 ` Peter Zijlstra
  1 sibling, 0 replies; 5+ messages in thread
From: Vincent Guittot @ 2016-12-15 16:52 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, linux-kernel
  Cc: Paul Turner, Vincent Guittot, stable

Gentle ping ...

Vincent

* Re: [PATCH] sched: fix group_entity's share update
  2016-12-01 16:38 [PATCH] sched: fix group_entity's share update Vincent Guittot
  2016-12-15 16:52 ` Vincent Guittot
@ 2016-12-15 21:42 ` Peter Zijlstra
  2016-12-16  8:55   ` Vincent Guittot
  1 sibling, 1 reply; 5+ messages in thread
From: Peter Zijlstra @ 2016-12-15 21:42 UTC (permalink / raw)
  To: Vincent Guittot; +Cc: mingo, linux-kernel, pjt, stable

On Thu, Dec 01, 2016 at 05:38:53PM +0100, Vincent Guittot wrote:
> The update of the share of a cfs_rq is done when its load_avg is updated
> but before the group_entity's load_avg has been updated for the past time
> slot. This generates wrong load_avg accounting which can be significant
> when small tasks are involved in the scheduling.
> 
> Let's take the example of a task TA that is dequeued from its task group
> TG1. TA was the only task in TG1, which therefore becomes idle.
> 
> We have the sequence:
> 
> - dequeue_entity TA->se
>     - update_load_avg(TA->se)
>     - dequeue_entity_load_avg(TG1->cfs_rq, TA->se)
>     - account_entity_dequeue(TG1->cfs_rq, TA->se)
>           TG1->cfs_rq->load.weight = 0
>     - update_cfs_shares(TG1->cfs_rq)
> 	        TG1->se->load.weight is updated with the new share of
> 		cfs_rq. TG1->se->load.weight = 0.
> - dequeue_entity TG1->se
>     - update_load_avg(TG1->se) but its weight is now null so the last time
> slot (up to a tick) will be accounted with its new weight (0 in our case)
> instead of its real weight. The last time slot is accounted as an idle one
> whereas it was a running one.
> 
> If the running time of TA is short enough that no tick happens when it
> runs, all running time of TG1->se will be accounted as idle time.
> 
> Instead, we should update the share of a cfs_rq (in fact the weight of its
> group entity) only after having updated the load_avg of the group_entity.
> 
> update_cfs_shares() now takes the sched_entity as parameter instead of the
> cfs_rq and the weight of the group_entity is updated only once its load_avg
> has been synced with current time.

Urgh, brain hurt, also those names don't help; s/TG1/A/ s/TA/a/

So the problem is that in our for_each_sched_entity(se) loop we end up
changing the next se before we get there.


		root
	      (cfs_rq)
		  \
		  (se)
		    A
		 (cfs_rq)
		      \
		      (se)
		       a


Starting at a's se, we update_cfs_shares() on A's cfs_rq, which then
updates A's se, which is the next se in our iteration and mucks with
state before we get there.

So you change update_cfs_shares() to go downward while we go upward,
ensuring we only update things that we've finished with.

Makes sense..
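
The walk described above can be modelled in user space. The following is a
minimal sketch, not kernel code: struct toy_se, TOY_SHARES and the toy_*
helpers are hypothetical, and the share computation is reduced to "full share
while something is queued, zero otherwise". It records the weight each entity
had at the moment it was accounted.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_SHARES 1024L

/* Toy entity: a task (is_group == 0) or a group entity that owns a queue. */
struct toy_se {
	struct toy_se *parent;  /* next entity in the upward walk, NULL at the top */
	int is_group;
	long nr_queued;         /* entities queued on the group's own cfs_rq */
	long weight;            /* weight currently in effect */
	long accounted_weight;  /* weight seen when the entity was accounted */
};

/* A group entity keeps its share while something is queued below it. */
static void toy_update_shares(struct toy_se *se)
{
	if (!se || !se->is_group)
		return;
	se->weight = se->nr_queued ? TOY_SHARES : 0;
}

/* Upward walk in the spirit of for_each_sched_entity() on dequeue. */
static void toy_dequeue_walk(struct toy_se *se, int patched)
{
	for (; se; se = se->parent) {
		se->accounted_weight = se->weight; /* update_load_avg(se) */
		if (se->parent)
			se->parent->nr_queued--;   /* account_entity_dequeue() */
		if (patched)
			toy_update_shares(se);         /* se is already accounted */
		else
			toy_update_shares(se->parent); /* touches the next se too early */
	}
}

/* Build root -> A -> a, dequeue a, report the weight A was accounted with. */
static long toy_group_accounted_weight(int patched)
{
	struct toy_se A = { NULL, 1, 1, TOY_SHARES, -1 };
	struct toy_se a = { &A, 0, 0, TOY_SHARES, -1 };

	toy_dequeue_walk(&a, patched);
	return A.accounted_weight;
}
```

In this model toy_group_accounted_weight(0) returns 0, because A's weight was
changed while the walk was still at a, and toy_group_accounted_weight(1)
returns 1024.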

>  kernel/sched/fair.c | 27 ++++++++++++++++-----------
>  1 file changed, 16 insertions(+), 11 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 18d9e75..19092fa 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2689,15 +2689,18 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
>  
>  static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
>  
> -static void update_cfs_shares(struct cfs_rq *cfs_rq)
> +static void update_cfs_shares(struct sched_entity *se)
>  {
>  	struct task_group *tg;
> -	struct sched_entity *se;
> +	struct cfs_rq *cfs_rq = group_cfs_rq(se);
>  	long shares;

please keep them ordered by length.

>  
> +	if (entity_is_task(se))

can be: !cfs_rq, which is the same, and we've already done that load.

> +		return;
> +
>  	tg = cfs_rq->tg;

This load isn't needed here yet, can be moved down a bit.

> -	se = tg->se[cpu_of(rq_of(cfs_rq))];
> -	if (!se || throttled_hierarchy(cfs_rq))
> +
> +	if (throttled_hierarchy(cfs_rq))
>  		return;
>  #ifndef CONFIG_SMP
>  	if (likely(se->load.weight == tg->shares))


> @@ -3583,9 +3588,9 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  		se->vruntime += cfs_rq->min_vruntime;
>  
>  	update_load_avg(se, UPDATE_TG);
> +	update_cfs_shares(se);
>  	enqueue_entity_load_avg(cfs_rq, se);
>  	account_entity_enqueue(cfs_rq, se);
> -	update_cfs_shares(cfs_rq);
>  
>  	if (flags & ENQUEUE_WAKEUP)
>  		place_entity(cfs_rq, se, 0);

So here we need to update_cfs_shares() _before_ enqueue_entity, because
the update_cfs_shares() will affect this se's load, right?

> @@ -3681,7 +3686,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  	/* return excess runtime on last dequeue */
>  	return_cfs_rq_runtime(cfs_rq);
>  
> -	update_cfs_shares(cfs_rq);
> +	update_cfs_shares(se);
>  
>  	/*
>  	 * Now advance min_vruntime if @se was the entity holding it back,

But this one hurts my brain..

It must be done after dequeue_entity_load_avg() such that we subtract
the load as was seen until now.

Could we please add comments explaining this ordering, because I forever
need to think about this (both enqueue and dequeue).

> @@ -3864,7 +3869,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
>  	 * Ensure that runnable average is periodically updated.
>  	 */
>  	update_load_avg(curr, UPDATE_TG);
> -	update_cfs_shares(cfs_rq);
> +	update_cfs_shares(curr);
>  
>  #ifdef CONFIG_SCHED_HRTICK
>  	/*
> @@ -4761,7 +4766,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  			break;
>  
>  		update_load_avg(se, UPDATE_TG);
> -		update_cfs_shares(cfs_rq);
> +		update_cfs_shares(se);
>  	}
>  
>  	if (!se)
> @@ -4820,7 +4825,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  			break;
>  
>  		update_load_avg(se, UPDATE_TG);
> -		update_cfs_shares(cfs_rq);
> +		update_cfs_shares(se);
>  	}
>  
>  	if (!se)

This has a distinct pattern to it though; should we think about
something like: UPDATE_SHARES for update_load_avg() or does that confuse
things?

> @@ -9316,7 +9321,7 @@ int sched_group_set_shares(struct task_group *tg, unsigned long shares)
>  		/* Possible calls to update_curr() need rq clock */
>  		update_rq_clock(rq);
>  		for_each_sched_entity(se)
> -			update_cfs_shares(group_cfs_rq(se));
> +			update_cfs_shares(se);

Should we not also catch up with our load before we frob the shares?

>  		raw_spin_unlock_irqrestore(&rq->lock, flags);
>  	}

* Re: [PATCH] sched: fix group_entity's share update
  2016-12-15 21:42 ` Peter Zijlstra
@ 2016-12-16  8:55   ` Vincent Guittot
  2016-12-19 17:37     ` Vincent Guittot
  0 siblings, 1 reply; 5+ messages in thread
From: Vincent Guittot @ 2016-12-16  8:55 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, linux-kernel, Paul Turner, stable

On 15 December 2016 at 22:42, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Dec 01, 2016 at 05:38:53PM +0100, Vincent Guittot wrote:
> > The update of the share of a cfs_rq is done when its load_avg is updated
> > but before the group_entity's load_avg has been updated for the past time
> > slot. This generates wrong load_avg accounting which can be significant
> > when small tasks are involved in the scheduling.
> >
> > Let's take the example of a task TA that is dequeued from its task group
> > TG1. TA was the only task in TG1, which therefore becomes idle.
> >
> > We have the sequence:
> >
> > - dequeue_entity TA->se
> >     - update_load_avg(TA->se)
> >     - dequeue_entity_load_avg(TG1->cfs_rq, TA->se)
> >     - account_entity_dequeue(TG1->cfs_rq, TA->se)
> >           TG1->cfs_rq->load.weight = 0
> >     - update_cfs_shares(TG1->cfs_rq)
> >               TG1->se->load.weight is updated with the new share of
> >               cfs_rq. TG1->se->load.weight = 0.
> > - dequeue_entity TG1->se
> >     - update_load_avg(TG1->se) but its weight is now null so the last time
> > slot (up to a tick) will be accounted with its new weight (0 in our case)
> > instead of its real weight. The last time slot is accounted as an idle one
> > whereas it was a running one.
> >
> > If the running time of TA is short enough that no tick happens when it
> > runs, all running time of TG1->se will be accounted as idle time.
> >
> > Instead, we should update the share of a cfs_rq (in fact the weight of its
> > group entity) only after having updated the load_avg of the group_entity.
> >
> > update_cfs_shares() now takes the sched_entity as parameter instead of the
> > cfs_rq and the weight of the group_entity is updated only once its load_avg
> > has been synced with current time.
>
> Urgh, brain hurt, also those names don't help; s/TG1/A/ s/TA/a/
>
> So the problem is that in our for_each_sched_entity(se) loop we end up
> changing the next se before we get there.
>
>
>                 root
>               (cfs_rq)
>                   \
>                   (se)
>                     A
>                  (cfs_rq)
>                       \
>                       (se)
>                        a
>
>
> Starting at a's se, we update_cfs_shares() on A's cfs_rq, which then
> updates A's se, which is the next se in our iteration and mucks with
> state before we get there.
>
> So you change update_cfs_shares() to go downward while we go upward,
> ensuring we only update things that we've finished with.

yes

>
> Makes sense..
>
> >  kernel/sched/fair.c | 27 ++++++++++++++++-----------
> >  1 file changed, 16 insertions(+), 11 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 18d9e75..19092fa 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -2689,15 +2689,18 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
> >
> >  static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
> >
> > -static void update_cfs_shares(struct cfs_rq *cfs_rq)
> > +static void update_cfs_shares(struct sched_entity *se)
> >  {
> >       struct task_group *tg;
> > -     struct sched_entity *se;
> > +     struct cfs_rq *cfs_rq = group_cfs_rq(se);
> >       long shares;
>
> please keep them ordered by length.

Ok

>
> >
> > +     if (entity_is_task(se))
>
> can be: !cfs_rq, which is the same, and we've already done that load.

Yes. My goal was to keep the meaning of the test readable, and I was
expecting the compiler to be smart enough to use a single load for
both cfs_rq = group_cfs_rq(se) and entity_is_task(se).

I can change it to !cfs_rq.
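
For illustration, here is a small user-space sketch of why the two tests agree.
It assumes, as in kernel/sched/fair.c with CONFIG_FAIR_GROUP_SCHED, that both
group_cfs_rq() and entity_is_task() are derived from se->my_q; the toy_* types
and helpers below are hypothetical stand-ins, not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

struct toy_cfs_rq {
	int nr_running; /* placeholder field */
};

struct toy_se {
	struct toy_cfs_rq *my_q; /* queue owned by a group entity, NULL for a task */
};

static struct toy_cfs_rq *toy_group_cfs_rq(const struct toy_se *se)
{
	return se->my_q;
}

static int toy_entity_is_task(const struct toy_se *se)
{
	return !se->my_q;
}

/* Returns 1 when the share update has nothing to do for this entity. */
static int toy_skip_share_update(const struct toy_se *se)
{
	struct toy_cfs_rq *cfs_rq = toy_group_cfs_rq(se);

	return !cfs_rq; /* same answer as toy_entity_is_task(se) */
}

/* One task entity and one group entity to exercise the helpers. */
static struct toy_cfs_rq toy_q;
static const struct toy_se toy_task = { NULL };
static const struct toy_se toy_group = { &toy_q };
```

Testing the already-loaded pointer keeps the early return to a single load of
my_q, at the cost of a slightly less self-describing condition.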

>
> > +             return;
> > +
> >       tg = cfs_rq->tg;
>
> This load isn't needed here yet, can be moved down a bit.

Indeed

>
> > -     se = tg->se[cpu_of(rq_of(cfs_rq))];
> > -     if (!se || throttled_hierarchy(cfs_rq))
> > +
> > +     if (throttled_hierarchy(cfs_rq))
> >               return;
> >  #ifndef CONFIG_SMP
> >       if (likely(se->load.weight == tg->shares))
>
>
> > @@ -3583,9 +3588,9 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> >               se->vruntime += cfs_rq->min_vruntime;
> >
> >       update_load_avg(se, UPDATE_TG);
> > +     update_cfs_shares(se);
> >       enqueue_entity_load_avg(cfs_rq, se);
> >       account_entity_enqueue(cfs_rq, se);
> > -     update_cfs_shares(cfs_rq);
> >
> >       if (flags & ENQUEUE_WAKEUP)
> >               place_entity(cfs_rq, se, 0);
>
> So here we need to update_cfs_shares() _before_ enqueue_entity, because
> the update_cfs_shares() will affect this se's load, right?

exactly

>
> > @@ -3681,7 +3686,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> >       /* return excess runtime on last dequeue */
> >       return_cfs_rq_runtime(cfs_rq);
> >
> > -     update_cfs_shares(cfs_rq);
> > +     update_cfs_shares(se);
> >
> >       /*
> >        * Now advance min_vruntime if @se was the entity holding it back,
>
> But this one hurts my brain..
>
> It must be done after dequeue_entity_load_avg() such that we subtract
> the load as was seen until now.

update_cfs_shares(A's se) must be done after update_load_avg(A's se,
UPDATE_TG), so that A's se->load_avg is updated for the previous time
slot using the previous weight.

update_cfs_shares(A's se) could be done before or after
dequeue_entity_load_avg(A's se), because the root's cfs_rq is kept in
sync during the reweight of A's se. Nevertheless, I see one advantage
in doing it after: reweight_entity() will be faster because A's
se->on_rq will have been cleared in the meantime.

>
> Could we please add comments explaining this ordering, because I forever
> need to think about this (both enqueue and dequeue).

OK

>
> > @@ -3864,7 +3869,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
> >        * Ensure that runnable average is periodically updated.
> >        */
> >       update_load_avg(curr, UPDATE_TG);
> > -     update_cfs_shares(cfs_rq);
> > +     update_cfs_shares(curr);
> >
> >  #ifdef CONFIG_SCHED_HRTICK
> >       /*
> > @@ -4761,7 +4766,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >                       break;
> >
> >               update_load_avg(se, UPDATE_TG);
> > -             update_cfs_shares(cfs_rq);
> > +             update_cfs_shares(se);
> >       }
> >
> >       if (!se)
> > @@ -4820,7 +4825,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >                       break;
> >
> >               update_load_avg(se, UPDATE_TG);
> > -             update_cfs_shares(cfs_rq);
> > +             update_cfs_shares(se);
> >       }
> >
> >       if (!se)
>
> This has a distinct pattern to it though; should we think about
> something like: UPDATE_SHARES for update_load_avg() or does that confuse
> things?

IMHO, keeping update_cfs_shares() separate from update_load_avg() makes
it clear when we update the shares and enables some optimizations,
as for dequeue_entity().

>
> > @@ -9316,7 +9321,7 @@ int sched_group_set_shares(struct task_group *tg, unsigned long shares)
> >               /* Possible calls to update_curr() need rq clock */
> >               update_rq_clock(rq);
> >               for_each_sched_entity(se)
> > -                     update_cfs_shares(group_cfs_rq(se));
> > +                     update_cfs_shares(se);
>
> Should we not also catch up with our load before we frob the shares?

yes you're right, an update_load_avg is missing

>
> >               raw_spin_unlock_irqrestore(&rq->lock, flags);
> >       }

* Re: [PATCH] sched: fix group_entity's share update
  2016-12-16  8:55   ` Vincent Guittot
@ 2016-12-19 17:37     ` Vincent Guittot
  0 siblings, 0 replies; 5+ messages in thread
From: Vincent Guittot @ 2016-12-19 17:37 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Ingo Molnar, linux-kernel, Paul Turner, stable

On 16 December 2016 at 09:55, Vincent Guittot
<vincent.guittot@linaro.org> wrote:
> On 15 December 2016 at 22:42, Peter Zijlstra <peterz@infradead.org> wrote:
>>
>> On Thu, Dec 01, 2016 at 05:38:53PM +0100, Vincent Guittot wrote:
>> > The update of the share of a cfs_rq is done when its load_avg is updated
>> > but before the group_entity's load_avg has been updated for the past time
>> > slot. This generates wrong load_avg accounting which can be significant
>> > when small tasks are involved in the scheduling.
>> >
>> > Let's take the example of a task TA that is dequeued from its task group
>> > TG1. TA was the only task in TG1, which therefore becomes idle.
>> >
>> > We have the sequence:
>> >
>> > - dequeue_entity TA->se
>> >     - update_load_avg(TA->se)
>> >     - dequeue_entity_load_avg(TG1->cfs_rq, TA->se)
>> >     - account_entity_dequeue(TG1->cfs_rq, TA->se)
>> >           TG1->cfs_rq->load.weight = 0
>> >     - update_cfs_shares(TG1->cfs_rq)
>> >               TG1->se->load.weight is updated with the new share of
>> >               cfs_rq. TG1->se->load.weight = 0.
>> > - dequeue_entity TG1->se
>> >     - update_load_avg(TG1->se) but its weight is now null so the last time
>> > slot (up to a tick) will be accounted with its new weight (0 in our case)
>> > instead of its real weight. The last time slot is accounted as an idle one
>> > whereas it was a running one.
>> >
>> > If the running time of TA is short enough that no tick happens when it
>> > runs, all running time of TG1->se will be accounted as idle time.
>> >
>> > Instead, we should update the share of a cfs_rq (in fact the weight of its
>> > group entity) only after having updated the load_avg of the group_entity.
>> >
>> > update_cfs_shares() now takes the sched_entity as parameter instead of the
>> > cfs_rq and the weight of the group_entity is updated only once its load_avg
>> > has been synced with current time.
>>
>> Urgh, brain hurt, also those names don't help; s/TG1/A/ s/TA/a/
>>
>> So the problem is that in our for_each_sched_entity(se) loop we end up
>> changing the next se before we get there.
>>
>>
>>                 root
>>               (cfs_rq)
>>                   \
>>                   (se)
>>                     A
>>                  (cfs_rq)
>>                       \
>>                       (se)
>>                        a
>>
>>
>> Starting at a's se, we update_cfs_shares() on A's cfs_rq, which then
>> updates A's se, which is the next se in our iteration and mucks with
>> state before we get there.
>>
>> So you change update_cfs_shares() to go downward while we go upward,
>> ensuring we only update things that we've finished with.
>
> yes
>
>>
>> Makes sense..
>>
>> >  kernel/sched/fair.c | 27 ++++++++++++++++-----------
>> >  1 file changed, 16 insertions(+), 11 deletions(-)
>> >
>> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> > index 18d9e75..19092fa 100644
>> > --- a/kernel/sched/fair.c
>> > +++ b/kernel/sched/fair.c
>> > @@ -2689,15 +2689,18 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
>> >
>> >  static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
>> >
>> > -static void update_cfs_shares(struct cfs_rq *cfs_rq)
>> > +static void update_cfs_shares(struct sched_entity *se)
>> >  {
>> >       struct task_group *tg;
>> > -     struct sched_entity *se;
>> > +     struct cfs_rq *cfs_rq = group_cfs_rq(se);
>> >       long shares;
>>
>> please keep them ordered by length.
>
> Ok
>
>>
>> >
>> > +     if (entity_is_task(se))
>>
>> can be: !cfs_rq, which is the same, and we've already done that load.
>
> Yes. My goal was to keep the meaning of the test readable, and I was
> expecting the compiler to be smart enough to use a single load for
> both cfs_rq = group_cfs_rq(se) and entity_is_task(se).
>
> I can change it to !cfs_rq.
>
>>
>> > +             return;
>> > +
>> >       tg = cfs_rq->tg;
>>
>> This load isn't needed here yet, can be moved down a bit.
>
> Indeed
>
>>
>> > -     se = tg->se[cpu_of(rq_of(cfs_rq))];
>> > -     if (!se || throttled_hierarchy(cfs_rq))
>> > +
>> > +     if (throttled_hierarchy(cfs_rq))
>> >               return;
>> >  #ifndef CONFIG_SMP
>> >       if (likely(se->load.weight == tg->shares))
>>
>>
>> > @@ -3583,9 +3588,9 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>> >               se->vruntime += cfs_rq->min_vruntime;
>> >
>> >       update_load_avg(se, UPDATE_TG);
>> > +     update_cfs_shares(se);
>> >       enqueue_entity_load_avg(cfs_rq, se);
>> >       account_entity_enqueue(cfs_rq, se);
>> > -     update_cfs_shares(cfs_rq);
>> >
>> >       if (flags & ENQUEUE_WAKEUP)
>> >               place_entity(cfs_rq, se, 0);
>>
>> So here we need to update_cfs_shares() _before_ enqueue_entity, because
>> the update_cfs_shares() will affect this se's load, right?
>
> exactly

In fact, the only constraint is that update_cfs_shares() must be done
before account_entity_enqueue(). There is no constraint with respect
to enqueue_entity_load_avg(), so it's probably better to keep the
load manipulation together and the weight manipulation together:

update_load_avg(se, UPDATE_TG);
enqueue_entity_load_avg(cfs_rq, se);
update_cfs_shares(se);
account_entity_enqueue(cfs_rq, se);
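
One way to see why the share update has to precede account_entity_enqueue()
is the user-space sketch below. It is a model, not the kernel code: it assumes
that a reweight only adjusts the queue's total weight when the entity is
already marked on_rq, and that on_rq is set after the enqueue accounting. All
toy_* names are hypothetical.

```c
#include <assert.h>

struct toy_cfs_rq {
	long load_weight; /* sum of the weights accounted on this queue */
};

struct toy_se {
	long weight;
	int on_rq;
};

/* Reweight in the spirit of reweight_entity(): fix up the queue only if on_rq. */
static void toy_reweight(struct toy_cfs_rq *rq, struct toy_se *se, long weight)
{
	if (se->on_rq)
		rq->load_weight -= se->weight;
	se->weight = weight;
	if (se->on_rq)
		rq->load_weight += se->weight;
}

/*
 * Enqueue one entity on an empty queue and return load_weight - weight.
 * A result of 0 means the queue total matches the entity's weight.
 */
static long toy_enqueue_mismatch(long old_w, long new_w, int shares_first)
{
	struct toy_cfs_rq rq = { 0 };
	struct toy_se se = { old_w, 0 };

	if (shares_first)
		toy_reweight(&rq, &se, new_w); /* update_cfs_shares(se) */
	rq.load_weight += se.weight;           /* account_entity_enqueue() */
	if (!shares_first)
		toy_reweight(&rq, &se, new_w); /* too late: se is not on_rq yet */
	se.on_rq = 1;

	return rq.load_weight - se.weight;
}
```

With old_w = 0 and new_w = 1024, updating the shares first leaves the queue
consistent (mismatch 0), while updating them after the accounting leaves the
queue total 1024 short of the entity's weight.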

>
>>
>> > @@ -3681,7 +3686,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>> >       /* return excess runtime on last dequeue */
>> >       return_cfs_rq_runtime(cfs_rq);
>> >
>> > -     update_cfs_shares(cfs_rq);
>> > +     update_cfs_shares(se);
>> >
>> >       /*
>> >        * Now advance min_vruntime if @se was the entity holding it back,
>>
>> But this one hurts my brain..
>>
>> It must be done after dequeue_entity_load_avg() such that we subtract
>> the load as was seen until now.
>
> update_cfs_shares(A's se) must be done after update_load_avg(A's se,
> UPDATE_TG), so that A's se->load_avg is updated for the previous time
> slot using the previous weight.
>
> update_cfs_shares(A's se) could be done before or after
> dequeue_entity_load_avg(A's se), because the root's cfs_rq is kept in
> sync during the reweight of A's se. Nevertheless, I see one advantage
> in doing it after: reweight_entity() will be faster because A's
> se->on_rq will have been cleared in the meantime.
>
>>
>> Could we please add comments explaining this ordering, because I forever
>> need to think about this (both enqueue and dequeue).
>
> OK
>
>>
>> > @@ -3864,7 +3869,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
>> >        * Ensure that runnable average is periodically updated.
>> >        */
>> >       update_load_avg(curr, UPDATE_TG);
>> > -     update_cfs_shares(cfs_rq);
>> > +     update_cfs_shares(curr);
>> >
>> >  #ifdef CONFIG_SCHED_HRTICK
>> >       /*
>> > @@ -4761,7 +4766,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>> >                       break;
>> >
>> >               update_load_avg(se, UPDATE_TG);
>> > -             update_cfs_shares(cfs_rq);
>> > +             update_cfs_shares(se);
>> >       }
>> >
>> >       if (!se)
>> > @@ -4820,7 +4825,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>> >                       break;
>> >
>> >               update_load_avg(se, UPDATE_TG);
>> > -             update_cfs_shares(cfs_rq);
>> > +             update_cfs_shares(se);
>> >       }
>> >
>> >       if (!se)
>>
>> This has a distinct pattern to it though; should we think about
>> something like: UPDATE_SHARES for update_load_avg() or does that confuse
>> things?
>
> IMHO, keeping update_cfs_shares() separate from update_load_avg() makes
> it clear when we update the shares and enables some optimizations,
> as for dequeue_entity().
>
>>
>> > @@ -9316,7 +9321,7 @@ int sched_group_set_shares(struct task_group *tg, unsigned long shares)
>> >               /* Possible calls to update_curr() need rq clock */
>> >               update_rq_clock(rq);
>> >               for_each_sched_entity(se)
>> > -                     update_cfs_shares(group_cfs_rq(se));
>> > +                     update_cfs_shares(se);
>>
>> Should we not also catch up with our load before we frob the shares?
>
> yes you're right, an update_load_avg is missing
>
>>
>> >               raw_spin_unlock_irqrestore(&rq->lock, flags);
>> >       }
