public inbox for linux-kernel@vger.kernel.org
* [PATCH] sched/fair: Increase max lag clamping
@ 2025-04-18 15:12 Vincent Guittot
  2025-04-18 15:51 ` Peter Zijlstra
  0 siblings, 1 reply; 6+ messages in thread
From: Vincent Guittot @ 2025-04-18 15:12 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, linux-kernel
  Cc: Vincent Guittot

sched_entity lag is currently limited to the maximum between the tick and
twice the slice. This is too short compared to the maximum custom slice
that can be set and accumulated by other tasks.
Clamp the lag to the maximum slice that a task can set. A task A can
accumulate up to its slice of negative lag while running to parity and
the other runnable tasks can accumulate the same positive lag while
waiting to run. This positive lag could be lost during dequeue when
clamping it to twice task's slice if task A's slice is 100ms and others
use a smaller value like the default 2.8ms.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a0c4cd26ee07..1c2c70decb20 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -683,15 +683,17 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
  * is possible -- by addition/removal/reweight to the tree -- to move V around
  * and end up with a larger lag than we started with.
  *
- * Limit this to either double the slice length with a minimum of TICK_NSEC
- * since that is the timing granularity.
- *
- * EEVDF gives the following limit for a steady state system:
+ * Limit this to the max allowed custom slice length which is higher than the
+ * timing granularity (the tick) and EEVDF gives the following limit for
+ * a steady state system:
  *
  *   -r_max < lag < max(r_max, q)
  *
  * XXX could add max_slice to the augmented data to track this.
  */
+#define SCHED_SLICE_MIN		(NSEC_PER_MSEC/10)  /* HZ=1000 * 10 */
+#define SCHED_SLICE_MAX		(NSEC_PER_MSEC*100) /* HZ=100  / 10 */
+
 static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
 	s64 vlag, limit;
@@ -699,7 +701,7 @@ static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	WARN_ON_ONCE(!se->on_rq);
 
 	vlag = avg_vruntime(cfs_rq) - se->vruntime;
-	limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
+	limit = calc_delta_fair(SCHED_SLICE_MAX, se);
 
 	se->vlag = clamp(vlag, -limit, limit);
 }
@@ -5189,8 +5191,8 @@ void __setparam_fair(struct task_struct *p, const struct sched_attr *attr)
 	if (attr->sched_runtime) {
 		se->custom_slice = 1;
 		se->slice = clamp_t(u64, attr->sched_runtime,
-				      NSEC_PER_MSEC/10,   /* HZ=1000 * 10 */
-				      NSEC_PER_MSEC*100); /* HZ=100  / 10 */
+				      SCHED_SLICE_MIN,
+				      SCHED_SLICE_MAX);
 	} else {
 		se->custom_slice = 0;
 		se->slice = sysctl_sched_base_slice;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] sched/fair: Increase max lag clamping
  2025-04-18 15:12 [PATCH] sched/fair: Increase max lag clamping Vincent Guittot
@ 2025-04-18 15:51 ` Peter Zijlstra
  2025-04-22 10:16   ` Peter Zijlstra
  0 siblings, 1 reply; 6+ messages in thread
From: Peter Zijlstra @ 2025-04-18 15:51 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman,
	vschneid, linux-kernel

On Fri, Apr 18, 2025 at 05:12:25PM +0200, Vincent Guittot wrote:
> sched_entity lag is currently limited to the maximum between the tick and
> twice the slice. This is too short compared to the maximum custom slice
> that can be set and accumulated by other tasks.
> Clamp the lag to the maximum slice that a task can set. A task A can
> accumulate up to its slice of negative lag while running to parity and
> the other runnable tasks can accumulate the same positive lag while
> waiting to run. This positive lag could be lost during dequeue when
> clamping it to twice task's slice if task A's slice is 100ms and others
> use a smaller value like the default 2.8ms.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/fair.c | 16 +++++++++-------
>  1 file changed, 9 insertions(+), 7 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a0c4cd26ee07..1c2c70decb20 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -683,15 +683,17 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
>   * is possible -- by addition/removal/reweight to the tree -- to move V around
>   * and end up with a larger lag than we started with.
>   *
> - * Limit this to either double the slice length with a minimum of TICK_NSEC
> - * since that is the timing granularity.
> - *
> - * EEVDF gives the following limit for a steady state system:
> + * Limit this to the max allowed custom slice length which is higher than the
> + * timing granularity (the tick) and EEVDF gives the following limit for
> + * a steady state system:
>   *
>   *   -r_max < lag < max(r_max, q)
>   *
>   * XXX could add max_slice to the augmented data to track this.
>   */

Right, it's that max_slice XXX there.

I think I've actually done that patch at some point, but I'm not sure
where I've placed it :-)

Anyway, I did poke at that other issue you mentioned at OSPM, where
PREEMPT_SHORT wasn't working quite right.

I've pushed out the patches to queue/sched/exp -- I meant to go post all
that, except I keep getting side-tracked and 0-day people are having
trouble with the hrtick bits in there :-/


* Re: [PATCH] sched/fair: Increase max lag clamping
  2025-04-18 15:51 ` Peter Zijlstra
@ 2025-04-22 10:16   ` Peter Zijlstra
  2025-04-23 10:19     ` Vincent Guittot
                       ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Peter Zijlstra @ 2025-04-22 10:16 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman,
	vschneid, linux-kernel, dhaval

On Fri, Apr 18, 2025 at 05:51:15PM +0200, Peter Zijlstra wrote:
> On Fri, Apr 18, 2025 at 05:12:25PM +0200, Vincent Guittot wrote:
> > sched_entity lag is currently limited to the maximum between the tick and
> > twice the slice. This is too short compared to the maximum custom slice
> > that can be set and accumulated by other tasks.
> > Clamp the lag to the maximum slice that a task can set. A task A can
> > accumulate up to its slice of negative lag while running to parity and
> > the other runnable tasks can accumulate the same positive lag while
> > waiting to run. This positive lag could be lost during dequeue when
> > clamping it to twice task's slice if task A's slice is 100ms and others
> > use a smaller value like the default 2.8ms.
> > 
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> >  kernel/sched/fair.c | 16 +++++++++-------
> >  1 file changed, 9 insertions(+), 7 deletions(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index a0c4cd26ee07..1c2c70decb20 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -683,15 +683,17 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
> >   * is possible -- by addition/removal/reweight to the tree -- to move V around
> >   * and end up with a larger lag than we started with.
> >   *
> > - * Limit this to either double the slice length with a minimum of TICK_NSEC
> > - * since that is the timing granularity.
> > - *
> > - * EEVDF gives the following limit for a steady state system:
> > + * Limit this to the max allowed custom slice length which is higher than the
> > + * timing granularity (the tick) and EEVDF gives the following limit for
> > + * a steady state system:
> >   *
> >   *   -r_max < lag < max(r_max, q)
> >   *
> >   * XXX could add max_slice to the augmented data to track this.
> >   */
> 
> Right, it's that max_slice XXX there.
> 
> I think I've actually done that patch at some point, but I'm not sure
> where I've placed it :-)

No matter, I've redone it by copy-paste from min_slice.

How's something like this then?

---
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f96ac1982893..9e90cd9023db 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -573,6 +573,7 @@ struct sched_entity {
 	u64				deadline;
 	u64				min_vruntime;
 	u64				min_slice;
+	u64				max_slice;
 
 	struct list_head		group_node;
 	unsigned char			on_rq;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 84916c865377..7c3c95f5cabd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -676,6 +676,8 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
 	return cfs_rq->min_vruntime + avg;
 }
 
+static inline u64 cfs_rq_max_slice(struct cfs_rq *cfs_rq);
+
 /*
  * lag_i = S - s_i = w_i * (V - v_i)
  *
@@ -689,17 +691,16 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
  * EEVDF gives the following limit for a steady state system:
  *
  *   -r_max < lag < max(r_max, q)
- *
- * XXX could add max_slice to the augmented data to track this.
  */
 static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+	u64 max_slice = cfs_rq_max_slice(cfs_rq) + TICK_NSEC;
 	s64 vlag, limit;
 
 	WARN_ON_ONCE(!se->on_rq);
 
 	vlag = avg_vruntime(cfs_rq) - se->vruntime;
-	limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
+	limit = calc_delta_fair(max_slice, se);
 
 	se->vlag = clamp(vlag, -limit, limit);
 }
@@ -795,6 +796,21 @@ static inline u64 cfs_rq_min_slice(struct cfs_rq *cfs_rq)
 	return min_slice;
 }
 
+static inline u64 cfs_rq_max_slice(struct cfs_rq *cfs_rq)
+{
+	struct sched_entity *root = __pick_root_entity(cfs_rq);
+	struct sched_entity *curr = cfs_rq->curr;
+	u64 max_slice = 0ULL;
+
+	if (curr && curr->on_rq)
+		max_slice = curr->slice;
+
+	if (root)
+		max_slice = min(max_slice, root->max_slice);
+
+	return max_slice;
+}
+
 static inline bool __entity_less(struct rb_node *a, const struct rb_node *b)
 {
 	return entity_before(__node_2_se(a), __node_2_se(b));
@@ -820,6 +836,15 @@ static inline void __min_slice_update(struct sched_entity *se, struct rb_node *n
 	}
 }
 
+static inline void __max_slice_update(struct sched_entity *se, struct rb_node *node)
+{
+	if (node) {
+		struct sched_entity *rse = __node_2_se(node);
+		if (rse->max_slice < se->max_slice)
+			se->max_slice = rse->max_slice;
+	}
+}
+
 /*
  * se->min_vruntime = min(se->vruntime, {left,right}->min_vruntime)
  */
@@ -827,6 +852,7 @@ static inline bool min_vruntime_update(struct sched_entity *se, bool exit)
 {
 	u64 old_min_vruntime = se->min_vruntime;
 	u64 old_min_slice = se->min_slice;
+	u64 old_max_slice = se->max_slice;
 	struct rb_node *node = &se->run_node;
 
 	se->min_vruntime = se->vruntime;
@@ -837,8 +863,13 @@ static inline bool min_vruntime_update(struct sched_entity *se, bool exit)
 	__min_slice_update(se, node->rb_right);
 	__min_slice_update(se, node->rb_left);
 
+	se->max_slice = se->slice;
+	__max_slice_update(se, node->rb_right);
+	__max_slice_update(se, node->rb_left);
+
 	return se->min_vruntime == old_min_vruntime &&
-	       se->min_slice == old_min_slice;
+	       se->min_slice == old_min_slice &&
+	       se->max_slice == old_max_slice;
 }
 
 RB_DECLARE_CALLBACKS(static, min_vruntime_cb, struct sched_entity,
@@ -852,6 +883,7 @@ static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	avg_vruntime_add(cfs_rq, se);
 	se->min_vruntime = se->vruntime;
 	se->min_slice = se->slice;
+	se->max_slice = se->slice;
 	rb_add_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
 				__entity_less, &min_vruntime_cb);
 }


* Re: [PATCH] sched/fair: Increase max lag clamping
  2025-04-22 10:16   ` Peter Zijlstra
@ 2025-04-23 10:19     ` Vincent Guittot
  2025-04-29  6:47     ` Vincent Guittot
  2026-02-23 10:25     ` [tip: sched/urgent] sched/fair: Fix lag clamp tip-bot2 for Peter Zijlstra
  2 siblings, 0 replies; 6+ messages in thread
From: Vincent Guittot @ 2025-04-23 10:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman,
	vschneid, linux-kernel, dhaval

Hi Peter,

On Tue, 22 Apr 2025 at 12:16, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Apr 18, 2025 at 05:51:15PM +0200, Peter Zijlstra wrote:
> > On Fri, Apr 18, 2025 at 05:12:25PM +0200, Vincent Guittot wrote:
> > > sched_entity lag is currently limited to the maximum between the tick and
> > > twice the slice. This is too short compared to the maximum custom slice
> > > that can be set and accumulated by other tasks.
> > > Clamp the lag to the maximum slice that a task can set. A task A can
> > > accumulate up to its slice of negative lag while running to parity and
> > > the other runnable tasks can accumulate the same positive lag while
> > > waiting to run. This positive lag could be lost during dequeue when
> > > clamping it to twice task's slice if task A's slice is 100ms and others
> > > use a smaller value like the default 2.8ms.
> > >
> > > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > > ---
> > >  kernel/sched/fair.c | 16 +++++++++-------
> > >  1 file changed, 9 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index a0c4cd26ee07..1c2c70decb20 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -683,15 +683,17 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
> > >   * is possible -- by addition/removal/reweight to the tree -- to move V around
> > >   * and end up with a larger lag than we started with.
> > >   *
> > > - * Limit this to either double the slice length with a minimum of TICK_NSEC
> > > - * since that is the timing granularity.
> > > - *
> > > - * EEVDF gives the following limit for a steady state system:
> > > + * Limit this to the max allowed custom slice length which is higher than the
> > > + * timing granularity (the tick) and EEVDF gives the following limit for
> > > + * a steady state system:
> > >   *
> > >   *   -r_max < lag < max(r_max, q)
> > >   *
> > >   * XXX could add max_slice to the augmented data to track this.
> > >   */
> >
> > Right, it's that max_slice XXX there.
> >
> > I think I've actually done that patch at some point, but I'm not sure
> > where I've placed it :-)
>
> No matter, I've redone it by copy-paste from min_slice.
>
> How's something like this then?

Thanks, I will test it

>
> ---
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index f96ac1982893..9e90cd9023db 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -573,6 +573,7 @@ struct sched_entity {
>         u64                             deadline;
>         u64                             min_vruntime;
>         u64                             min_slice;
> +       u64                             max_slice;
>
>         struct list_head                group_node;
>         unsigned char                   on_rq;
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 84916c865377..7c3c95f5cabd 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -676,6 +676,8 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
>         return cfs_rq->min_vruntime + avg;
>  }
>
> +static inline u64 cfs_rq_max_slice(struct cfs_rq *cfs_rq);
> +
>  /*
>   * lag_i = S - s_i = w_i * (V - v_i)
>   *
> @@ -689,17 +691,16 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
>   * EEVDF gives the following limit for a steady state system:
>   *
>   *   -r_max < lag < max(r_max, q)
> - *
> - * XXX could add max_slice to the augmented data to track this.
>   */
>  static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> +       u64 max_slice = cfs_rq_max_slice(cfs_rq) + TICK_NSEC;
>         s64 vlag, limit;
>
>         WARN_ON_ONCE(!se->on_rq);
>
>         vlag = avg_vruntime(cfs_rq) - se->vruntime;
> -       limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
> +       limit = calc_delta_fair(max_slice, se);
>
>         se->vlag = clamp(vlag, -limit, limit);
>  }
> @@ -795,6 +796,21 @@ static inline u64 cfs_rq_min_slice(struct cfs_rq *cfs_rq)
>         return min_slice;
>  }
>
> +static inline u64 cfs_rq_max_slice(struct cfs_rq *cfs_rq)
> +{
> +       struct sched_entity *root = __pick_root_entity(cfs_rq);
> +       struct sched_entity *curr = cfs_rq->curr;
> +       u64 max_slice = 0ULL;
> +
> +       if (curr && curr->on_rq)
> +               max_slice = curr->slice;
> +
> +       if (root)
> +               max_slice = min(max_slice, root->max_slice);
> +
> +       return max_slice;
> +}
> +
>  static inline bool __entity_less(struct rb_node *a, const struct rb_node *b)
>  {
>         return entity_before(__node_2_se(a), __node_2_se(b));
> @@ -820,6 +836,15 @@ static inline void __min_slice_update(struct sched_entity *se, struct rb_node *n
>         }
>  }
>
> +static inline void __max_slice_update(struct sched_entity *se, struct rb_node *node)
> +{
> +       if (node) {
> +               struct sched_entity *rse = __node_2_se(node);
> +               if (rse->max_slice < se->max_slice)
> +                       se->max_slice = rse->max_slice;
> +       }
> +}
> +
>  /*
>   * se->min_vruntime = min(se->vruntime, {left,right}->min_vruntime)
>   */
> @@ -827,6 +852,7 @@ static inline bool min_vruntime_update(struct sched_entity *se, bool exit)
>  {
>         u64 old_min_vruntime = se->min_vruntime;
>         u64 old_min_slice = se->min_slice;
> +       u64 old_max_slice = se->max_slice;
>         struct rb_node *node = &se->run_node;
>
>         se->min_vruntime = se->vruntime;
> @@ -837,8 +863,13 @@ static inline bool min_vruntime_update(struct sched_entity *se, bool exit)
>         __min_slice_update(se, node->rb_right);
>         __min_slice_update(se, node->rb_left);
>
> +       se->max_slice = se->slice;
> +       __max_slice_update(se, node->rb_right);
> +       __max_slice_update(se, node->rb_left);
> +
>         return se->min_vruntime == old_min_vruntime &&
> -              se->min_slice == old_min_slice;
> +              se->min_slice == old_min_slice &&
> +              se->max_slice == old_max_slice;
>  }
>
>  RB_DECLARE_CALLBACKS(static, min_vruntime_cb, struct sched_entity,
> @@ -852,6 +883,7 @@ static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
>         avg_vruntime_add(cfs_rq, se);
>         se->min_vruntime = se->vruntime;
>         se->min_slice = se->slice;
> +       se->max_slice = se->slice;
>         rb_add_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
>                                 __entity_less, &min_vruntime_cb);
>  }


* Re: [PATCH] sched/fair: Increase max lag clamping
  2025-04-22 10:16   ` Peter Zijlstra
  2025-04-23 10:19     ` Vincent Guittot
@ 2025-04-29  6:47     ` Vincent Guittot
  2026-02-23 10:25     ` [tip: sched/urgent] sched/fair: Fix lag clamp tip-bot2 for Peter Zijlstra
  2 siblings, 0 replies; 6+ messages in thread
From: Vincent Guittot @ 2025-04-29  6:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman,
	vschneid, linux-kernel, dhaval

On Tue, 22 Apr 2025 at 12:16, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Apr 18, 2025 at 05:51:15PM +0200, Peter Zijlstra wrote:
> > On Fri, Apr 18, 2025 at 05:12:25PM +0200, Vincent Guittot wrote:
> > > sched_entity lag is currently limited to the maximum between the tick and
> > > twice the slice. This is too short compared to the maximum custom slice
> > > that can be set and accumulated by other tasks.
> > > Clamp the lag to the maximum slice that a task can set. A task A can
> > > accumulate up to its slice of negative lag while running to parity and
> > > the other runnable tasks can accumulate the same positive lag while
> > > waiting to run. This positive lag could be lost during dequeue when
> > > clamping it to twice task's slice if task A's slice is 100ms and others
> > > use a smaller value like the default 2.8ms.
> > >
> > > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > > ---
> > >  kernel/sched/fair.c | 16 +++++++++-------
> > >  1 file changed, 9 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index a0c4cd26ee07..1c2c70decb20 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -683,15 +683,17 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
> > >   * is possible -- by addition/removal/reweight to the tree -- to move V around
> > >   * and end up with a larger lag than we started with.
> > >   *
> > > - * Limit this to either double the slice length with a minimum of TICK_NSEC
> > > - * since that is the timing granularity.
> > > - *
> > > - * EEVDF gives the following limit for a steady state system:
> > > + * Limit this to the max allowed custom slice length which is higher than the
> > > + * timing granularity (the tick) and EEVDF gives the following limit for
> > > + * a steady state system:
> > >   *
> > >   *   -r_max < lag < max(r_max, q)
> > >   *
> > >   * XXX could add max_slice to the augmented data to track this.
> > >   */
> >
> > Right, it's that max_slice XXX there.
> >
> > I think I've actually done that patch at some point, but I'm not sure
> > where I've placed it :-)
>
> No matter, I've redone it by copy-paste from min_slice.
>
> How's something like this then?

I tested the patch and the 1st tests look ok with the changes below,
which fix the copy-paste :-)

I will run a few more tests

>
> ---
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index f96ac1982893..9e90cd9023db 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -573,6 +573,7 @@ struct sched_entity {
>         u64                             deadline;
>         u64                             min_vruntime;
>         u64                             min_slice;
> +       u64                             max_slice;
>
>         struct list_head                group_node;
>         unsigned char                   on_rq;
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 84916c865377..7c3c95f5cabd 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -676,6 +676,8 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
>         return cfs_rq->min_vruntime + avg;
>  }
>
> +static inline u64 cfs_rq_max_slice(struct cfs_rq *cfs_rq);
> +
>  /*
>   * lag_i = S - s_i = w_i * (V - v_i)
>   *
> @@ -689,17 +691,16 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
>   * EEVDF gives the following limit for a steady state system:
>   *
>   *   -r_max < lag < max(r_max, q)
> - *
> - * XXX could add max_slice to the augmented data to track this.
>   */
>  static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> +       u64 max_slice = cfs_rq_max_slice(cfs_rq) + TICK_NSEC;
>         s64 vlag, limit;
>
>         WARN_ON_ONCE(!se->on_rq);
>
>         vlag = avg_vruntime(cfs_rq) - se->vruntime;
> -       limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
> +       limit = calc_delta_fair(max_slice, se);
>
>         se->vlag = clamp(vlag, -limit, limit);
>  }
> @@ -795,6 +796,21 @@ static inline u64 cfs_rq_min_slice(struct cfs_rq *cfs_rq)
>         return min_slice;
>  }
>
> +static inline u64 cfs_rq_max_slice(struct cfs_rq *cfs_rq)
> +{
> +       struct sched_entity *root = __pick_root_entity(cfs_rq);
> +       struct sched_entity *curr = cfs_rq->curr;
> +       u64 max_slice = 0ULL;
> +
> +       if (curr && curr->on_rq)
> +               max_slice = curr->slice;
> +
> +       if (root)
> +               max_slice = min(max_slice, root->max_slice);

max(max_slice, root->max_slice);

> +
> +       return max_slice;
> +}
> +
>  static inline bool __entity_less(struct rb_node *a, const struct rb_node *b)
>  {
>         return entity_before(__node_2_se(a), __node_2_se(b));
> @@ -820,6 +836,15 @@ static inline void __min_slice_update(struct sched_entity *se, struct rb_node *n
>         }
>  }
>
> +static inline void __max_slice_update(struct sched_entity *se, struct rb_node *node)
> +{
> +       if (node) {
> +               struct sched_entity *rse = __node_2_se(node);
> +               if (rse->max_slice < se->max_slice)

              if (rse->max_slice > se->max_slice)

> +                       se->max_slice = rse->max_slice;
> +       }
> +}
> +
>  /*
>   * se->min_vruntime = min(se->vruntime, {left,right}->min_vruntime)
>   */
> @@ -827,6 +852,7 @@ static inline bool min_vruntime_update(struct sched_entity *se, bool exit)
>  {
>         u64 old_min_vruntime = se->min_vruntime;
>         u64 old_min_slice = se->min_slice;
> +       u64 old_max_slice = se->max_slice;
>         struct rb_node *node = &se->run_node;
>
>         se->min_vruntime = se->vruntime;
> @@ -837,8 +863,13 @@ static inline bool min_vruntime_update(struct sched_entity *se, bool exit)
>         __min_slice_update(se, node->rb_right);
>         __min_slice_update(se, node->rb_left);
>
> +       se->max_slice = se->slice;
> +       __max_slice_update(se, node->rb_right);
> +       __max_slice_update(se, node->rb_left);
> +
>         return se->min_vruntime == old_min_vruntime &&
> -              se->min_slice == old_min_slice;
> +              se->min_slice == old_min_slice &&
> +              se->max_slice == old_max_slice;
>  }
>
>  RB_DECLARE_CALLBACKS(static, min_vruntime_cb, struct sched_entity,
> @@ -852,6 +883,7 @@ static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
>         avg_vruntime_add(cfs_rq, se);
>         se->min_vruntime = se->vruntime;
>         se->min_slice = se->slice;
> +       se->max_slice = se->slice;
>         rb_add_augmented_cached(&se->run_node, &cfs_rq->tasks_timeline,
>                                 __entity_less, &min_vruntime_cb);
>  }


* [tip: sched/urgent] sched/fair: Fix lag clamp
  2025-04-22 10:16   ` Peter Zijlstra
  2025-04-23 10:19     ` Vincent Guittot
  2025-04-29  6:47     ` Vincent Guittot
@ 2026-02-23 10:25     ` tip-bot2 for Peter Zijlstra
  2 siblings, 0 replies; 6+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2026-02-23 10:25 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel), Vincent Guittot, K Prateek Nayak,
	Shubhang Kaushik, x86, linux-kernel

The following commit has been merged into the sched/urgent branch of tip:

Commit-ID:     6e3c0a4e1ad1e0455b7880fad02b3ee179f56c09
Gitweb:        https://git.kernel.org/tip/6e3c0a4e1ad1e0455b7880fad02b3ee179f56c09
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Tue, 22 Apr 2025 12:16:28 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 23 Feb 2026 11:19:18 +01:00

sched/fair: Fix lag clamp

Vincent reported that he was seeing undue lag clamping in a mixed
slice workload. Implement the max_slice tracking as per the todo
comment.

Fixes: 147f3efaa241 ("sched/fair: Implement an EEVDF-like scheduling policy")
Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Shubhang Kaushik <shubhang@os.amperecomputing.com>
Link: https://patch.msgid.link/20250422101628.GA33555@noisy.programming.kicks-ass.net
---
 include/linux/sched.h |  1 +
 kernel/sched/fair.c   | 39 +++++++++++++++++++++++++++++++++++----
 2 files changed, 36 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 074ad4e..a7b4a98 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -579,6 +579,7 @@ struct sched_entity {
 	u64				deadline;
 	u64				min_vruntime;
 	u64				min_slice;
+	u64				max_slice;
 
 	struct list_head		group_node;
 	unsigned char			on_rq;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 93fa5b8..f4446cb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -748,6 +748,8 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
 	return cfs_rq->zero_vruntime;
 }
 
+static inline u64 cfs_rq_max_slice(struct cfs_rq *cfs_rq);
+
 /*
  * lag_i = S - s_i = w_i * (V - v_i)
  *
@@ -761,17 +763,16 @@ u64 avg_vruntime(struct cfs_rq *cfs_rq)
  * EEVDF gives the following limit for a steady state system:
  *
  *   -r_max < lag < max(r_max, q)
- *
- * XXX could add max_slice to the augmented data to track this.
  */
 static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+	u64 max_slice = cfs_rq_max_slice(cfs_rq) + TICK_NSEC;
 	s64 vlag, limit;
 
 	WARN_ON_ONCE(!se->on_rq);
 
 	vlag = avg_vruntime(cfs_rq) - se->vruntime;
-	limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
+	limit = calc_delta_fair(max_slice, se);
 
 	se->vlag = clamp(vlag, -limit, limit);
 }
@@ -829,6 +830,21 @@ static inline u64 cfs_rq_min_slice(struct cfs_rq *cfs_rq)
 	return min_slice;
 }
 
+static inline u64 cfs_rq_max_slice(struct cfs_rq *cfs_rq)
+{
+	struct sched_entity *root = __pick_root_entity(cfs_rq);
+	struct sched_entity *curr = cfs_rq->curr;
+	u64 max_slice = 0ULL;
+
+	if (curr && curr->on_rq)
+		max_slice = curr->slice;
+
+	if (root)
+		max_slice = max(max_slice, root->max_slice);
+
+	return max_slice;
+}
+
 static inline bool __entity_less(struct rb_node *a, const struct rb_node *b)
 {
 	return entity_before(__node_2_se(a), __node_2_se(b));
@@ -853,6 +869,15 @@ static inline void __min_slice_update(struct sched_entity *se, struct rb_node *n
 	}
 }
 
+static inline void __max_slice_update(struct sched_entity *se, struct rb_node *node)
+{
+	if (node) {
+		struct sched_entity *rse = __node_2_se(node);
+		if (rse->max_slice > se->max_slice)
+			se->max_slice = rse->max_slice;
+	}
+}
+
 /*
  * se->min_vruntime = min(se->vruntime, {left,right}->min_vruntime)
  */
@@ -860,6 +885,7 @@ static inline bool min_vruntime_update(struct sched_entity *se, bool exit)
 {
 	u64 old_min_vruntime = se->min_vruntime;
 	u64 old_min_slice = se->min_slice;
+	u64 old_max_slice = se->max_slice;
 	struct rb_node *node = &se->run_node;
 
 	se->min_vruntime = se->vruntime;
@@ -870,8 +896,13 @@ static inline bool min_vruntime_update(struct sched_entity *se, bool exit)
 	__min_slice_update(se, node->rb_right);
 	__min_slice_update(se, node->rb_left);
 
+	se->max_slice = se->slice;
+	__max_slice_update(se, node->rb_right);
+	__max_slice_update(se, node->rb_left);
+
 	return se->min_vruntime == old_min_vruntime &&
-	       se->min_slice == old_min_slice;
+	       se->min_slice == old_min_slice &&
+	       se->max_slice == old_max_slice;
 }
 
 RB_DECLARE_CALLBACKS(static, min_vruntime_cb, struct sched_entity,


end of thread, other threads:[~2026-02-23 10:25 UTC | newest]

Thread overview: 6+ messages
2025-04-18 15:12 [PATCH] sched/fair: Increase max lag clamping Vincent Guittot
2025-04-18 15:51 ` Peter Zijlstra
2025-04-22 10:16   ` Peter Zijlstra
2025-04-23 10:19     ` Vincent Guittot
2025-04-29  6:47     ` Vincent Guittot
2026-02-23 10:25     ` [tip: sched/urgent] sched/fair: Fix lag clamp tip-bot2 for Peter Zijlstra
