public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2] sched: fair: Prevent negative lag increase during delayed dequeue
@ 2026-03-31 16:23 Vincent Guittot
  2026-04-02 13:13 ` Peter Zijlstra
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Vincent Guittot @ 2026-03-31 16:23 UTC (permalink / raw)
  To: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, kprateek.nayak, linux-kernel, shubhang
  Cc: Vincent Guittot

The delayed dequeue feature aims to reduce the negative lag of a dequeued
task while it sleeps, but it can happen that newly enqueued tasks move the
avg vruntime backward and increase its negative lag.
When the delayed dequeued task wakes up, it has more negative lag than if
it had been dequeued immediately, or than other tasks that were dequeued
just before these new enqueues.

Ensure that the negative lag of a delayed dequeued task doesn't increase
during its delayed dequeue phase while waiting for its negative lag to
disappear. Similarly, remove any positive lag that the delayed dequeued
task could have gained during this period.

Short slice tasks are particularly impacted on overloaded systems.

Test on snapdragon rb5:

hackbench -T -p -l 16000000 -g 2 1> /dev/null &
cyclictest -t 1 -i 2777 -D 333 --policy=fair --mlock  -h 20000 -q

The scheduling latency of cyclictest is:

                       tip/sched/core  tip/sched/core    +this patch
cyclictest slice  (ms) (default)2.8             8               8
hackbench slice   (ms) (default)2.8            20              20
Total Samples          |   115632          119733          119806
Average           (us) |      364              64(-82%)        61(- 5%)
Median (P50)      (us) |       60              56(- 7%)        56(  0%)
90th Percentile   (us) |     1166              62(-95%)        62(  0%)
99th Percentile   (us) |     4192              73(-98%)        72(- 1%)
99.9th Percentile (us) |     8528            2707(-68%)      1300(-52%)
Maximum           (us) |    17735           14273(-20%)     13525(- 5%)

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---

Since v1:
- Embedded the check of lag evolution of delayed dequeue entities in
  update_entity_lag() to include all cases.

 kernel/sched/fair.c | 53 ++++++++++++++++++++++++++-------------------
 1 file changed, 31 insertions(+), 22 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 226509231e67..c1ffe86bf78d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -840,11 +840,30 @@ static s64 entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se, u64 avrunt
 	return clamp(vlag, -limit, limit);
 }
 
-static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
+/*
+ * Delayed dequeue aims to reduce the negative lag of a dequeued task.
+ * While updating the lag of an entity, check that negative lag didn't increase
+ * during the delayed dequeue period which would be unfair.
+ * Similarly, check that the entity didn't gain positive lag when DELAY_ZERO is
+ * set.
+ *
+ * Return true if the lag has been adjusted.
+ */
+static bool update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+	s64 vlag;
+
 	WARN_ON_ONCE(!se->on_rq);
 
-	se->vlag = entity_lag(cfs_rq, se, avg_vruntime(cfs_rq));
+	vlag = entity_lag(cfs_rq, se, avg_vruntime(cfs_rq));
+
+	if (se->sched_delayed)
+		/* previous vlag < 0 otherwise se would not be delayed */
+		se->vlag = clamp(vlag, se->vlag, sched_feat(DELAY_ZERO) ? 0 : S64_MAX);
+	else
+		se->vlag = vlag;
+
+	return (vlag != se->vlag);
 }
 
 /*
@@ -5563,13 +5582,6 @@ static void clear_delayed(struct sched_entity *se)
 	}
 }
 
-static inline void finish_delayed_dequeue_entity(struct sched_entity *se)
-{
-	clear_delayed(se);
-	if (sched_feat(DELAY_ZERO) && se->vlag > 0)
-		se->vlag = 0;
-}
-
 static bool
 dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
@@ -5595,6 +5607,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 		if (sched_feat(DELAY_DEQUEUE) && delay &&
 		    !entity_eligible(cfs_rq, se)) {
 			update_load_avg(cfs_rq, se, 0);
+			update_entity_lag(cfs_rq, se);
 			set_delayed(se);
 			return false;
 		}
@@ -5634,7 +5647,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	update_cfs_group(se);
 
 	if (flags & DEQUEUE_DELAYED)
-		finish_delayed_dequeue_entity(se);
+		clear_delayed(se);
 
 	if (cfs_rq->nr_queued == 0) {
 		update_idle_cfs_rq_clock_pelt(cfs_rq);
@@ -7088,18 +7101,14 @@ requeue_delayed_entity(struct sched_entity *se)
 	WARN_ON_ONCE(!se->sched_delayed);
 	WARN_ON_ONCE(!se->on_rq);
 
-	if (sched_feat(DELAY_ZERO)) {
-		update_entity_lag(cfs_rq, se);
-		if (se->vlag > 0) {
-			cfs_rq->nr_queued--;
-			if (se != cfs_rq->curr)
-				__dequeue_entity(cfs_rq, se);
-			se->vlag = 0;
-			place_entity(cfs_rq, se, 0);
-			if (se != cfs_rq->curr)
-				__enqueue_entity(cfs_rq, se);
-			cfs_rq->nr_queued++;
-		}
+	if (update_entity_lag(cfs_rq, se)) {
+		cfs_rq->nr_queued--;
+		if (se != cfs_rq->curr)
+			__dequeue_entity(cfs_rq, se);
+		place_entity(cfs_rq, se, 0);
+		if (se != cfs_rq->curr)
+			__enqueue_entity(cfs_rq, se);
+		cfs_rq->nr_queued++;
 	}
 
 	update_load_avg(cfs_rq, se, 0);
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH v2] sched: fair: Prevent negative lag increase during delayed dequeue
  2026-03-31 16:23 [PATCH v2] sched: fair: Prevent negative lag increase during delayed dequeue Vincent Guittot
@ 2026-04-02 13:13 ` Peter Zijlstra
  2026-04-02 13:17   ` Vincent Guittot
  2026-04-02 13:42 ` Peter Zijlstra
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 9+ messages in thread
From: Peter Zijlstra @ 2026-04-02 13:13 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman,
	vschneid, kprateek.nayak, linux-kernel, shubhang

On Tue, Mar 31, 2026 at 06:23:52PM +0200, Vincent Guittot wrote:
> +/*
> + * Delayed dequeue aims to reduce the negative lag of a dequeued task.
> + * While updating the lag of an entity, check that negative lag didn't increase
> + * during the delayed dequeue period which would be unfair.
> + * Similarly, check that the entity didn't gain positive lag when DELAY_ZERO is
> + * set.
> + *
> + * Return true if the lag has been adjusted.
> + */
> +static bool update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> +	s64 vlag;
> +
>  	WARN_ON_ONCE(!se->on_rq);
>  
> -	se->vlag = entity_lag(cfs_rq, se, avg_vruntime(cfs_rq));
> +	vlag = entity_lag(cfs_rq, se, avg_vruntime(cfs_rq));
> +
> +	if (se->sched_delayed)
> +		/* previous vlag < 0 otherwise se would not be delayed */
> +		se->vlag = clamp(vlag, se->vlag, sched_feat(DELAY_ZERO) ? 0 : S64_MAX);
> +	else
> +		se->vlag = vlag;
> +
> +	return (vlag != se->vlag);
>  }

Would you mind terribly if I write this like so?

---
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -841,29 +841,32 @@ static s64 entity_lag(struct cfs_rq *cfs
 }
 
 /*
- * Delayed dequeue aims to reduce the negative lag of a dequeued task.
- * While updating the lag of an entity, check that negative lag didn't increase
+ * Delayed dequeue aims to reduce the negative lag of a dequeued task. While
+ * updating the lag of an entity, check that negative lag didn't increase
  * during the delayed dequeue period which would be unfair.
- * Similarly, check that the entity didn't gain positive lag when DELAY_ZERO is
- * set.
+ * Similarly, check that the entity didn't gain positive lag when DELAY_ZERO
+ * is set.
  *
  * Return true if the lag has been adjusted.
  */
-static bool update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
+static __always_inline
+bool update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	s64 vlag;
+	s64 vlag = entity_lag(cfs_rq, se, avg_vruntime(cfs_rq));
+	bool ret;
 
 	WARN_ON_ONCE(!se->on_rq);
 
-	vlag = entity_lag(cfs_rq, se, avg_vruntime(cfs_rq));
-
-	if (se->sched_delayed)
+	if (se->sched_delayed) {
 		/* previous vlag < 0 otherwise se would not be delayed */
-		se->vlag = clamp(vlag, se->vlag, sched_feat(DELAY_ZERO) ? 0 : S64_MAX);
-	else
-		se->vlag = vlag;
+		vlag = max(vlag, se->vlag);
+		if (sched_feat(DELAY_ZERO))
+			vlag = min(vlag, 0);
+	}
+	ret = (vlag != se->vlag);
+	se->vlag = vlag;
 
-	return (vlag != se->vlag);
+	return ret;
 }
 
 /*


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2] sched: fair: Prevent negative lag increase during delayed dequeue
  2026-04-02 13:13 ` Peter Zijlstra
@ 2026-04-02 13:17   ` Vincent Guittot
  0 siblings, 0 replies; 9+ messages in thread
From: Vincent Guittot @ 2026-04-02 13:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman,
	vschneid, kprateek.nayak, linux-kernel, shubhang

On Thu, 2 Apr 2026 at 15:14, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Tue, Mar 31, 2026 at 06:23:52PM +0200, Vincent Guittot wrote:
> > +/*
> > + * Delayed dequeue aims to reduce the negative lag of a dequeued task.
> > + * While updating the lag of an entity, check that negative lag didn't increase
> > + * during the delayed dequeue period which would be unfair.
> > + * Similarly, check that the entity didn't gain positive lag when DELAY_ZERO is
> > + * set.
> > + *
> > + * Return true if the lag has been adjusted.
> > + */
> > +static bool update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
> >  {
> > +     s64 vlag;
> > +
> >       WARN_ON_ONCE(!se->on_rq);
> >
> > -     se->vlag = entity_lag(cfs_rq, se, avg_vruntime(cfs_rq));
> > +     vlag = entity_lag(cfs_rq, se, avg_vruntime(cfs_rq));
> > +
> > +     if (se->sched_delayed)
> > +             /* previous vlag < 0 otherwise se would not be delayed */
> > +             se->vlag = clamp(vlag, se->vlag, sched_feat(DELAY_ZERO) ? 0 : S64_MAX);
> > +     else
> > +             se->vlag = vlag;
> > +
> > +     return (vlag != se->vlag);
> >  }
>
> Would you mind terribly if I write this like so?

np, that looks good to me too

>
> ---
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -841,29 +841,32 @@ static s64 entity_lag(struct cfs_rq *cfs
>  }
>
>  /*
> - * Delayed dequeue aims to reduce the negative lag of a dequeued task.
> - * While updating the lag of an entity, check that negative lag didn't increase
> + * Delayed dequeue aims to reduce the negative lag of a dequeued task. While
> + * updating the lag of an entity, check that negative lag didn't increase
>   * during the delayed dequeue period which would be unfair.
> - * Similarly, check that the entity didn't gain positive lag when DELAY_ZERO is
> - * set.
> + * Similarly, check that the entity didn't gain positive lag when DELAY_ZERO
> + * is set.
>   *
>   * Return true if the lag has been adjusted.
>   */
> -static bool update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
> +static __always_inline
> +bool update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> -       s64 vlag;
> +       s64 vlag = entity_lag(cfs_rq, se, avg_vruntime(cfs_rq));
> +       bool ret;
>
>         WARN_ON_ONCE(!se->on_rq);
>
> -       vlag = entity_lag(cfs_rq, se, avg_vruntime(cfs_rq));
> -
> -       if (se->sched_delayed)
> +       if (se->sched_delayed) {
>                 /* previous vlag < 0 otherwise se would not be delayed */
> -               se->vlag = clamp(vlag, se->vlag, sched_feat(DELAY_ZERO) ? 0 : S64_MAX);
> -       else
> -               se->vlag = vlag;
> +               vlag = max(vlag, se->vlag);
> +               if (sched_feat(DELAY_ZERO))
> +                       vlag = min(vlag, 0);
> +       }
> +       ret = (vlag != se->vlag);
> +       se->vlag = vlag;
>
> -       return (vlag != se->vlag);
> +       return ret;
>  }
>
>  /*
>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2] sched: fair: Prevent negative lag increase during delayed dequeue
  2026-03-31 16:23 [PATCH v2] sched: fair: Prevent negative lag increase during delayed dequeue Vincent Guittot
  2026-04-02 13:13 ` Peter Zijlstra
@ 2026-04-02 13:42 ` Peter Zijlstra
  2026-04-02 19:27 ` Shubhang Kaushik
  2026-04-03 12:30 ` [tip: sched/core] sched/fair: " tip-bot2 for Vincent Guittot
  3 siblings, 0 replies; 9+ messages in thread
From: Peter Zijlstra @ 2026-04-02 13:42 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, juri.lelli, dietmar.eggemann, rostedt, bsegall, mgorman,
	vschneid, kprateek.nayak, linux-kernel, shubhang

On Tue, Mar 31, 2026 at 06:23:52PM +0200, Vincent Guittot wrote:
> The delayed dequeue feature aims to reduce the negative lag of a dequeued
> task while it sleeps, but it can happen that newly enqueued tasks move the
> avg vruntime backward and increase its negative lag.
> When the delayed dequeued task wakes up, it has more negative lag than if
> it had been dequeued immediately, or than other tasks that were dequeued
> just before these new enqueues.
> 
> Ensure that the negative lag of a delayed dequeued task doesn't increase
> during its delayed dequeue phase while waiting for its negative lag to
> disappear. Similarly, remove any positive lag that the delayed dequeued
> task could have gained during this period.
> 
> Short slice tasks are particularly impacted on overloaded systems.
> 
> Test on snapdragon rb5:
> 
> hackbench -T -p -l 16000000 -g 2 1> /dev/null &
> cyclictest -t 1 -i 2777 -D 333 --policy=fair --mlock  -h 20000 -q
> 
> The scheduling latency of cyclictest is:
> 
>                        tip/sched/core  tip/sched/core    +this patch
> cyclictest slice  (ms) (default)2.8             8               8
> hackbench slice   (ms) (default)2.8            20              20
> Total Samples          |   115632          119733          119806
> Average           (us) |      364              64(-82%)        61(- 5%)
> Median (P50)      (us) |       60              56(- 7%)        56(  0%)
> 90th Percentile   (us) |     1166              62(-95%)        62(  0%)
> 99th Percentile   (us) |     4192              73(-98%)        72(- 1%)
> 99.9th Percentile (us) |     8528            2707(-68%)      1300(-52%)
> Maximum           (us) |    17735           14273(-20%)     13525(- 5%)
> 

Anyway, I can confirm this works quite well. The latency-slice numbers
are far more stable now.

Thanks for digging into that!

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2] sched: fair: Prevent negative lag increase during delayed dequeue
  2026-03-31 16:23 [PATCH v2] sched: fair: Prevent negative lag increase during delayed dequeue Vincent Guittot
  2026-04-02 13:13 ` Peter Zijlstra
  2026-04-02 13:42 ` Peter Zijlstra
@ 2026-04-02 19:27 ` Shubhang Kaushik
  2026-04-03  8:37   ` Vincent Guittot
  2026-04-03  8:46   ` Vincent Guittot
  2026-04-03 12:30 ` [tip: sched/core] sched/fair: " tip-bot2 for Vincent Guittot
  3 siblings, 2 replies; 9+ messages in thread
From: Shubhang Kaushik @ 2026-04-02 19:27 UTC (permalink / raw)
  To: Vincent Guittot, mingo, peterz, juri.lelli, dietmar.eggemann,
	rostedt, bsegall, mgorman, vschneid, kprateek.nayak, linux-kernel

Hi Vincent,

I have been testing your v2 patch on my 80-core Ampere Altra (ARMv8 
Neoverse-N1) 1P system, running an idle tickless kernel on the latest 
tip/sched/core branch.

On Tue, 31 Mar 2026, Vincent Guittot wrote:

> The delayed dequeue feature aims to reduce the negative lag of a dequeued
> task while it sleeps, but it can happen that newly enqueued tasks move the
> avg vruntime backward and increase its negative lag.
> When the delayed dequeued task wakes up, it has more negative lag than if
> it had been dequeued immediately, or than other tasks that were dequeued
> just before these new enqueues.
>
> Ensure that the negative lag of a delayed dequeued task doesn't increase
> during its delayed dequeue phase while waiting for its negative lag to
> disappear. Similarly, remove any positive lag that the delayed dequeued
> task could have gained during this period.
>
> Short slice tasks are particularly impacted on overloaded systems.
>
> Test on snapdragon rb5:
>
> hackbench -T -p -l 16000000 -g 2 1> /dev/null &
> cyclictest -t 1 -i 2777 -D 333 --policy=fair --mlock  -h 20000 -q
>
> The scheduling latency of cyclictest is:
>
>                       tip/sched/core  tip/sched/core    +this patch
> cyclictest slice  (ms) (default)2.8             8               8
> hackbench slice   (ms) (default)2.8            20              20
> Total Samples          |   115632          119733          119806
> Average           (us) |      364              64(-82%)        61(- 5%)
> Median (P50)      (us) |       60              56(- 7%)        56(  0%)
> 90th Percentile   (us) |     1166              62(-95%)        62(  0%)
> 99th Percentile   (us) |     4192              73(-98%)        72(- 1%)
> 99.9th Percentile (us) |     8528            2707(-68%)      1300(-52%)
> Maximum           (us) |    17735           14273(-20%)     13525(- 5%)
>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>

I replicated this cyclictest environment scaled for 80 cores using a 
background hackbench load (-g 20). On Ampere Altra, I did not see the 
tail latency reduction that you observed on the 8-core Snapdragon. In 
fact, both average and max latencies increased slightly.

Metric       | Baseline | Patched  | Delta (%)
-------------|----------|----------|-----------
Max Latency  | 9141us   | 9426us   | +3.11%
Avg Latency  | 206us    | 217us    | +5.33%
Min Latency  | 14us     | 13us     | -7.14%

More concerning is the impact on throughput. At 8-16 threads, hackbench 
execution times increased by ~30%. I attempted to isolate this by 
disabling the DELAY_DEQUEUE sched_feature, but the regression persists 
even with NO_DELAY_DEQUEUE, pointing to overhead in the modified 
update_entity_lag() path itself.

Test Case    | Baseline | Patched  | Delta (%) | Patched(NO_DELAYDQ)
-------------|----------|----------|-----------|--------------------
4 Threads    | 13.77s   | 17.53s   | +27.3%    | 17.16s
8 Threads    | 24.39s   | 31.90s   | +30.8%    | 30.67s
16 Threads   | 47.92s   | 60.46s   | +26.2%    | 62.53s
32 Processes | 118.08s  | 103.16s  | -12.6%    | 101.87s

> Since v1:
> - Embedded the check of lag evolution of delayed dequeue entities in
>  update_entity_lag() to include all cases.
>

While the patch shows a ~12.6% improvement at high saturation (32 
processes), the throughput cost at mid-range scales appears to outweigh 
the fairness benefits on our high-core-count system, as even the 
worst-case wake-up latencies did not improve.

> kernel/sched/fair.c | 53 ++++++++++++++++++++++++++-------------------
> 1 file changed, 31 insertions(+), 22 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 226509231e67..c1ffe86bf78d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -840,11 +840,30 @@ static s64 entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se, u64 avrunt
> 	return clamp(vlag, -limit, limit);
> }
>
> -static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
> +/*
> + * Delayed dequeue aims to reduce the negative lag of a dequeued task.
> + * While updating the lag of an entity, check that negative lag didn't increase
> + * during the delayed dequeue period which would be unfair.
> + * Similarly, check that the entity didn't gain positive lag when DELAY_ZERO is
> + * set.
> + *
> + * Return true if the lag has been adjusted.
> + */
> +static bool update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
> {
> +	s64 vlag;
> +
> 	WARN_ON_ONCE(!se->on_rq);
>
> -	se->vlag = entity_lag(cfs_rq, se, avg_vruntime(cfs_rq));
> +	vlag = entity_lag(cfs_rq, se, avg_vruntime(cfs_rq));
> +
> +	if (se->sched_delayed)
> +		/* previous vlag < 0 otherwise se would not be delayed */
> +		se->vlag = clamp(vlag, se->vlag, sched_feat(DELAY_ZERO) ? 0 : S64_MAX);
> +	else
> +		se->vlag = vlag;
> +
> +	return (vlag != se->vlag);
> }
>
> /*
> @@ -5563,13 +5582,6 @@ static void clear_delayed(struct sched_entity *se)
> 	}
> }
>
> -static inline void finish_delayed_dequeue_entity(struct sched_entity *se)
> -{
> -	clear_delayed(se);
> -	if (sched_feat(DELAY_ZERO) && se->vlag > 0)
> -		se->vlag = 0;
> -}
> -
> static bool
> dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> {
> @@ -5595,6 +5607,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> 		if (sched_feat(DELAY_DEQUEUE) && delay &&
> 		    !entity_eligible(cfs_rq, se)) {
> 			update_load_avg(cfs_rq, se, 0);
> +			update_entity_lag(cfs_rq, se);

The regression persists even with NO_DELAY_DEQUEUE, likely because 
update_entity_lag() is now called unconditionally in dequeue_entity(), 
thereby adding avg_vruntime() overhead and cacheline contention for 
every dequeue.

Do consider guarding the update_entity_lag() call in dequeue_entity() 
with a sched_feat(DELAY_DEQUEUE) check to avoid this tax when the 
feature is disabled.

> 			set_delayed(se);
> 			return false;
> 		}
> @@ -5634,7 +5647,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> 	update_cfs_group(se);
>
> 	if (flags & DEQUEUE_DELAYED)
> -		finish_delayed_dequeue_entity(se);
> +		clear_delayed(se);
>
> 	if (cfs_rq->nr_queued == 0) {
> 		update_idle_cfs_rq_clock_pelt(cfs_rq);
> @@ -7088,18 +7101,14 @@ requeue_delayed_entity(struct sched_entity *se)
> 	WARN_ON_ONCE(!se->sched_delayed);
> 	WARN_ON_ONCE(!se->on_rq);
>
> -	if (sched_feat(DELAY_ZERO)) {
> -		update_entity_lag(cfs_rq, se);
> -		if (se->vlag > 0) {
> -			cfs_rq->nr_queued--;
> -			if (se != cfs_rq->curr)
> -				__dequeue_entity(cfs_rq, se);
> -			se->vlag = 0;
> -			place_entity(cfs_rq, se, 0);
> -			if (se != cfs_rq->curr)
> -				__enqueue_entity(cfs_rq, se);
> -			cfs_rq->nr_queued++;
> -		}
> +	if (update_entity_lag(cfs_rq, se)) {
> +		cfs_rq->nr_queued--;
> +		if (se != cfs_rq->curr)
> +			__dequeue_entity(cfs_rq, se);
> +		place_entity(cfs_rq, se, 0);
> +		if (se != cfs_rq->curr)
> +			__enqueue_entity(cfs_rq, se);
> +		cfs_rq->nr_queued++;

Triggering a full dequeue/enqueue cycle for every vlag adjustment appears 
to be a major bottleneck. Frequent RB-tree rebalancing here creates 
significant contention.

Could we preserve fairness while recovering throughput by only re-queuing 
when the lag sign changes or a significant eligibility threshold is 
crossed?

> 	}
>
> 	update_load_avg(cfs_rq, se, 0);
> -- 
> 2.43.0
>
>
Regards,
Shubhang Kaushik

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2] sched: fair: Prevent negative lag increase during delayed dequeue
  2026-04-02 19:27 ` Shubhang Kaushik
@ 2026-04-03  8:37   ` Vincent Guittot
  2026-04-04  8:08     ` Shubhang Kaushik
  2026-04-03  8:46   ` Vincent Guittot
  1 sibling, 1 reply; 9+ messages in thread
From: Vincent Guittot @ 2026-04-03  8:37 UTC (permalink / raw)
  To: Shubhang Kaushik
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, kprateek.nayak, linux-kernel

On Thu, 2 Apr 2026 at 21:27, Shubhang Kaushik
<shubhang@os.amperecomputing.com> wrote:
>
> Hi Vincent,
>
> I have been testing your v2 patch on my 80 core Ampere Altra (ARMv8
> Neoverse-N1) 1P system using an idle tickless kernel on the latest
> tip/sched/core branch.
>
> On Tue, 31 Mar 2026, Vincent Guittot wrote:
>
> > The delayed dequeue feature aims to reduce the negative lag of a dequeued
> > task while it sleeps, but it can happen that newly enqueued tasks move the
> > avg vruntime backward and increase its negative lag.
> > When the delayed dequeued task wakes up, it has more negative lag than if
> > it had been dequeued immediately, or than other tasks that were dequeued
> > just before these new enqueues.
> >
> > Ensure that the negative lag of a delayed dequeued task doesn't increase
> > during its delayed dequeue phase while waiting for its negative lag to
> > disappear. Similarly, remove any positive lag that the delayed dequeued
> > task could have gained during this period.
> >
> > Short slice tasks are particularly impacted on overloaded systems.
> >
> > Test on snapdragon rb5:
> >
> > hackbench -T -p -l 16000000 -g 2 1> /dev/null &
> > cyclictest -t 1 -i 2777 -D 333 --policy=fair --mlock  -h 20000 -q
> >
> > The scheduling latency of cyclictest is:
> >
> >                       tip/sched/core  tip/sched/core    +this patch
> > cyclictest slice  (ms) (default)2.8             8               8
> > hackbench slice   (ms) (default)2.8            20              20
> > Total Samples          |   115632          119733          119806
> > Average           (us) |      364              64(-82%)        61(- 5%)
> > Median (P50)      (us) |       60              56(- 7%)        56(  0%)
> > 90th Percentile   (us) |     1166              62(-95%)        62(  0%)
> > 99th Percentile   (us) |     4192              73(-98%)        72(- 1%)
> > 99.9th Percentile (us) |     8528            2707(-68%)      1300(-52%)
> > Maximum           (us) |    17735           14273(-20%)     13525(- 5%)
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> >
>
> I replicated this cyclictest environment scaled for 80 cores using a
> background hackbench load (-g 20). On Ampere Altra, I did not see the
> tail latency reduction that you observed on the 8-core Snapdragon. In
> fact, both average and max latencies increased slightly.
>
> Metric       | Baseline | Patched  | Delta (%)
> -------------|----------|----------|-----------
> Max Latency  | 9141us   | 9426us   | +3.11%
> Avg Latency  | 206us    | 217us    | +5.33%
> Min Latency  | 14us     | 13us     | -7.14%

Without setting a shorter custom slice for cyclictest, you will not
see any major difference. The difference shows up at p99 and p99.9
with a shorter slice

>
> More concerning is the impact on throughput. At 8-16 threads, hackbench
> execution times increased by ~30%. I attempted to isolate this by

Hmm, I ran some perf tests and haven't seen any difference for
hackbench with various numbers of groups

> disabling the DELAY_DEQUEUE sched_feature. But the regression persists
> even with NO_DELAY_DEQUEUE, pointing to overhead in the modified
> update_entity_lag() path itself.
>
> Test Case    | Baseline | Patched  | Delta (%) | Patched(NO_DELAYDQ)

By baseline, do you mean tip/sched/core or v7.0-rcx?

> -------------|----------|----------|-----------|--------------------
> 4 Threads    | 13.77s   | 17.53s   | +27.3%    | 17.16s
> 8 Threads    | 24.39s   | 31.90s   | +30.8%    | 30.67s
> 16 Threads   | 47.92s   | 60.46s   | +26.2%    | 62.53s
> 32 Processes | 118.08s  | 103.16s  | -12.6%    | 101.87s

That's surprising. I ran some perf tests with the patch and haven't
seen any differences

                                tip/sched/core  + patch
hackbench 1 process socket      0,581           0,580 (0,0 %)
                       stddev   2,7 %           2,5 %
hackbench 4 process socket      0,612           0,612 (0,0 %)
                       stddev   0,9 %           2,3 %
hackbench 8 process socket      0,662           0,659 (0,4 %)
                       stddev   1,0 %           1,8 %
hackbench 16 process socket     0,700           0,699 (0,3 %)
                       stddev   1,6 %           1,3 %
hackbench 1 process pipe        0,796           0,797 (-0,2 %)
                       stddev   1,5 %           1,9 %
hackbench 4 process pipe        0,699           0,694 (0,8 %)
                       stddev   3,7 %           2,5 %
hackbench 8 process pipe        0,631           0,636 (-0,9 %)
                       stddev   3,4 %           2,2 %
hackbench 16 process pipe       0,612           0,594 (2,9 %)
                       stddev   1,8 %           1,5 %
hackbench 1 thread socket       0,571           0,570 (0,1 %)
                       stddev   2,3 %           1,5 %
hackbench 4 thread socket       0,591           0,594 (-0,5 %)
                       stddev   1,2 %           0,7 %
hackbench 8 thread socket       0,621           0,628 (-1,2 %)
                       stddev   1,3 %           1,4 %
hackbench 16 thread socket      0,660           0,653 (1,0 %)
                       stddev   0,7 %           0,9 %
hackbench 1 thread pipe         0,860           0,864 (-0,6 %)
                       stddev   1,4 %           2,0 %
hackbench 4 thread pipe         0,828           0,821 (0,9 %)
                       stddev   3,5 %           4,7 %
hackbench 8 thread pipe         0,725           0,739 (-1,8 %)
                       stddev   2,3 %           8,6 %
hackbench 16 thread pipe        0,647           0,645 (0,4 %)
                       stddev   4,3 %           4,2 %

>
> > Since v1:
> > - Embedded the check of lag evolution of delayed dequeue entities in
> >  update_entity_lag() to include all cases.
> >
>
> While the patch shows a ~12.6% improvement at high saturation (32
> processes), the throughput cost at mid-range scales appears to outweigh
> the fairness benefits on our high core system, as even the worst-case
> wake-up latencies did not improve.
>
> > kernel/sched/fair.c | 53 ++++++++++++++++++++++++++-------------------
> > 1 file changed, 31 insertions(+), 22 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 226509231e67..c1ffe86bf78d 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -840,11 +840,30 @@ static s64 entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se, u64 avrunt
> >       return clamp(vlag, -limit, limit);
> > }
> >
> > -static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
> > +/*
> > + * Delayed dequeue aims to reduce the negative lag of a dequeued task.
> > + * While updating the lag of an entity, check that negative lag didn't increase
> > + * during the delayed dequeue period which would be unfair.
> > + * Similarly, check that the entity didn't gain positive lag when DELAY_ZERO is
> > + * set.
> > + *
> > + * Return true if the lag has been adjusted.
> > + */
> > +static bool update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
> > {
> > +     s64 vlag;
> > +
> >       WARN_ON_ONCE(!se->on_rq);
> >
> > -     se->vlag = entity_lag(cfs_rq, se, avg_vruntime(cfs_rq));
> > +     vlag = entity_lag(cfs_rq, se, avg_vruntime(cfs_rq));
> > +
> > +     if (se->sched_delayed)
> > +             /* previous vlag < 0 otherwise se would not be delayed */
> > +             se->vlag = clamp(vlag, se->vlag, sched_feat(DELAY_ZERO) ? 0 : S64_MAX);
> > +     else
> > +             se->vlag = vlag;
> > +
> > +     return (vlag != se->vlag);
> > }
> >
> > /*
> > @@ -5563,13 +5582,6 @@ static void clear_delayed(struct sched_entity *se)
> >       }
> > }
> >
> > -static inline void finish_delayed_dequeue_entity(struct sched_entity *se)
> > -{
> > -     clear_delayed(se);
> > -     if (sched_feat(DELAY_ZERO) && se->vlag > 0)
> > -             se->vlag = 0;
> > -}
> > -
> > static bool
> > dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> > {
> > @@ -5595,6 +5607,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> >               if (sched_feat(DELAY_DEQUEUE) && delay &&
> >                   !entity_eligible(cfs_rq, se)) {
> >                       update_load_avg(cfs_rq, se, 0);
> > +                     update_entity_lag(cfs_rq, se);
>
> The regression persists even with NO_DELAY_DEQUEUE, likely because
> update_entity_lag() is now called unconditionally in dequeue_entity()
> thereby adding avg_vruntime() overhead and cacheline contention for every
> dequeue.
>
> Do consider guarding the update_entity_lag() call in dequeue_entity()
> with sched_feat(DELAY_DEQUEUE) check to avoid this tax when the feature
> is disabled.
>
> >                       set_delayed(se);
> >                       return false;
> >               }
> > @@ -5634,7 +5647,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> >       update_cfs_group(se);
> >
> >       if (flags & DEQUEUE_DELAYED)
> > -             finish_delayed_dequeue_entity(se);
> > +             clear_delayed(se);
> >
> >       if (cfs_rq->nr_queued == 0) {
> >               update_idle_cfs_rq_clock_pelt(cfs_rq);
> > @@ -7088,18 +7101,14 @@ requeue_delayed_entity(struct sched_entity *se)
> >       WARN_ON_ONCE(!se->sched_delayed);
> >       WARN_ON_ONCE(!se->on_rq);
> >
> > -     if (sched_feat(DELAY_ZERO)) {
> > -             update_entity_lag(cfs_rq, se);
> > -             if (se->vlag > 0) {
> > -                     cfs_rq->nr_queued--;
> > -                     if (se != cfs_rq->curr)
> > -                             __dequeue_entity(cfs_rq, se);
> > -                     se->vlag = 0;
> > -                     place_entity(cfs_rq, se, 0);
> > -                     if (se != cfs_rq->curr)
> > -                             __enqueue_entity(cfs_rq, se);
> > -                     cfs_rq->nr_queued++;
> > -             }
> > +     if (update_entity_lag(cfs_rq, se)) {
> > +             cfs_rq->nr_queued--;
> > +             if (se != cfs_rq->curr)
> > +                     __dequeue_entity(cfs_rq, se);
> > +             place_entity(cfs_rq, se, 0);
> > +             if (se != cfs_rq->curr)
> > +                     __enqueue_entity(cfs_rq, se);
> > +             cfs_rq->nr_queued++;
>
> Triggering a full dequeue/enqueue cycle for every vlag adjustment appears
> to be a major bottleneck. Frequent RB-tree rebalancing here creates
> significant contention.

This adjustment is not supposed to happen

>
> Could we preserve fairness while recovering throughput by only re-queuing
> when the lag sign changes or a significant eligibility threshold is
> crossed?

Could you monitor how often we have to adjust the lag in your case? As
mentioned above, this shouldn't happen often, particularly the case
where the negative lag would increase.
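
For illustration, the intended clamp behaviour can be modelled in plain
userspace C (a sketch, not the kernel code; clamp_delayed_vlag() and the
delay_zero flag are made up for this example). The stored negative lag may
only move toward zero, and with DELAY_ZERO it may never become positive;
the function returns the adjusted lag:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace model (not kernel code) of the clamp applied to the lag of
 * a delayed-dequeue entity.  stored_vlag is the lag saved at dequeue
 * time, new_vlag is the freshly computed one.
 */
static int64_t clamp_delayed_vlag(int64_t new_vlag, int64_t stored_vlag,
				  int delay_zero)
{
	/* stored_vlag < 0, otherwise the entity would not be delayed */
	if (new_vlag < stored_vlag)	/* lag got more negative: keep old */
		new_vlag = stored_vlag;
	if (delay_zero && new_vlag > 0)	/* positive gain: cap at zero */
		new_vlag = 0;
	return new_vlag;
}
```

Only when one of the two caps actually fires does the raw value change,
which is why the dequeue/enqueue cycle in requeue_delayed_entity() should
be exceptional rather than per-wakeup.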

>
> >       }
> >
> >       update_load_avg(cfs_rq, se, 0);
> > --
> > 2.43.0
> >
> >
> Regards,
> Shubhang Kaushik

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v2] sched: fair: Prevent negative lag increase during delayed dequeue
  2026-04-02 19:27 ` Shubhang Kaushik
  2026-04-03  8:37   ` Vincent Guittot
@ 2026-04-03  8:46   ` Vincent Guittot
  1 sibling, 0 replies; 9+ messages in thread
From: Vincent Guittot @ 2026-04-03  8:46 UTC (permalink / raw)
  To: Shubhang Kaushik
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, kprateek.nayak, linux-kernel

On Thu, 2 Apr 2026 at 21:27, Shubhang Kaushik
<shubhang@os.amperecomputing.com> wrote:
>
> Hi Vincent,
>
> I have been testing your v2 patch on my 80 core Ampere Altra (ARMv8
> Neoverse-N1) 1P system using an idle tickless kernel on the latest
> tip/sched/core branch.
>
> On Tue, 31 Mar 2026, Vincent Guittot wrote:
>
> > The delayed dequeue feature aims to reduce the negative lag of a dequeued
> > task while it sleeps, but it can happen that newly enqueued tasks move
> > the avg vruntime backward and increase its negative lag.
> > When the delayed dequeued task wakes up, it has more negative lag than if
> > it had been dequeued immediately, or than other tasks that were dequeued
> > just before these new enqueues.
> >
> > Ensure that the negative lag of a delayed dequeued task doesn't increase
> > during its delayed dequeue phase while waiting for its negative lag to
> > disappear. Similarly, remove any positive lag that the delayed dequeued
> > task could have gained during this period.
> >
> > Short slice tasks are particularly impacted on overloaded systems.
> >
> > Test on snapdragon rb5:
> >
> > hackbench -T -p -l 16000000 -g 2 1> /dev/null &
> > cyclictest -t 1 -i 2777 -D 333 --policy=fair --mlock  -h 20000 -q
> >
> > The scheduling latency of cyclictest is:
> >
> >                       tip/sched/core  tip/sched/core    +this patch
> > cyclictest slice  (ms) (default)2.8             8               8
> > hackbench slice   (ms) (default)2.8            20              20
> > Total Samples          |   115632          119733          119806
> > Average           (us) |      364              64(-82%)        61(- 5%)
> > Median (P50)      (us) |       60              56(- 7%)        56(  0%)
> > 90th Percentile   (us) |     1166              62(-95%)        62(  0%)
> > 99th Percentile   (us) |     4192              73(-98%)        72(- 1%)
> > 99.9th Percentile (us) |     8528            2707(-68%)      1300(-52%)
> > Maximum           (us) |    17735           14273(-20%)     13525(- 5%)
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> >
>
> I replicated this cyclictest environment scaled for 80 cores using a
> background hackbench load (-g 20). On Ampere Altra, I did not see the
> tail latency reduction that you observed on the 8-core Snapdragon. In
> fact, both average and max latencies increased slightly.
>
> Metric       | Baseline | Patched  | Delta (%)
> -------------|----------|----------|-----------
> Max Latency  | 9141us   | 9426us   | +3.11%
> Avg Latency  | 206us    | 217us    | +5.33%
> Min Latency  | 14us     | 13us     | -7.14%
>
> More concerning is the impact on throughput. At 8-16 threads, hackbench
> execution times increased by ~30%. I attempted to isolate this by
> disabling the DELAY_DEQUEUE sched_feature. But the regression persists
> even with NO_DELAY_DEQUEUE, pointing to overhead in the modified

I didn't immediately realize that you have a problem even with
NO_DELAY_DEQUEUE whereas the patch doesn't change anything for this
case. Could it be something else?

> update_entity_lag() path itself.
>
> Test Case    | Baseline | Patched  | Delta (%) | Patched(NO_DELAYDQ)
> -------------|----------|----------|-----------|--------------------
> 4 Threads    | 13.77s   | 17.53s   | +27.3%    | 17.16s
> 8 Threads    | 24.39s   | 31.90s   | +30.8%    | 30.67s
> 16 Threads   | 47.92s   | 60.46s   | +26.2%    | 62.53s
> 32 Processes | 118.08s  | 103.16s  | -12.6%    | 101.87s
>
> > Since v1:
> > - Embedded the check of lag evolution of delayed dequeue entities in
> >  update_entity_lag() to include all cases.
> >
>
> While the patch shows a ~12.6% improvement at high saturation (32
> processes), the throughput cost at mid-range scales appears to outweigh
> the fairness benefits on our high core system, as even the worst-case
> wake-up latencies did not improve.
>
> > kernel/sched/fair.c | 53 ++++++++++++++++++++++++++-------------------
> > 1 file changed, 31 insertions(+), 22 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 226509231e67..c1ffe86bf78d 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -840,11 +840,30 @@ static s64 entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se, u64 avrunt
> >       return clamp(vlag, -limit, limit);
> > }
> >
> > -static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
> > +/*
> > + * Delayed dequeue aims to reduce the negative lag of a dequeued task.
> > + * While updating the lag of an entity, check that negative lag didn't increase
> > + * during the delayed dequeue period which would be unfair.
> > + * Similarly, check that the entity didn't gain positive lag when DELAY_ZERO is
> > + * set.
> > + *
> > + * Return true if the lag has been adjusted.
> > + */
> > +static bool update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
> > {
> > +     s64 vlag;
> > +
> >       WARN_ON_ONCE(!se->on_rq);
> >
> > -     se->vlag = entity_lag(cfs_rq, se, avg_vruntime(cfs_rq));
> > +     vlag = entity_lag(cfs_rq, se, avg_vruntime(cfs_rq));
> > +
> > +     if (se->sched_delayed)
> > +             /* previous vlag < 0 otherwise se would not be delayed */
> > +             se->vlag = clamp(vlag, se->vlag, sched_feat(DELAY_ZERO) ? 0 : S64_MAX);
> > +     else
> > +             se->vlag = vlag;
> > +
> > +     return (vlag != se->vlag);
> > }
> >
> > /*
> > @@ -5563,13 +5582,6 @@ static void clear_delayed(struct sched_entity *se)
> >       }
> > }
> >
> > -static inline void finish_delayed_dequeue_entity(struct sched_entity *se)
> > -{
> > -     clear_delayed(se);
> > -     if (sched_feat(DELAY_ZERO) && se->vlag > 0)
> > -             se->vlag = 0;
> > -}
> > -
> > static bool
> > dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> > {
> > @@ -5595,6 +5607,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> >               if (sched_feat(DELAY_DEQUEUE) && delay &&
> >                   !entity_eligible(cfs_rq, se)) {
> >                       update_load_avg(cfs_rq, se, 0);
> > +                     update_entity_lag(cfs_rq, se);
>
> The regression persists even with NO_DELAY_DEQUEUE, likely because
> update_entity_lag() is now called unconditionally in dequeue_entity()
> thereby adding avg_vruntime() overhead and cacheline contention for every
> dequeue.
>
> Do consider guarding the update_entity_lag() call in dequeue_entity()
> with sched_feat(DELAY_DEQUEUE) check to avoid this tax when the feature
> is disabled.
>
> >                       set_delayed(se);
> >                       return false;
> >               }
> > @@ -5634,7 +5647,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> >       update_cfs_group(se);
> >
> >       if (flags & DEQUEUE_DELAYED)
> > -             finish_delayed_dequeue_entity(se);
> > +             clear_delayed(se);
> >
> >       if (cfs_rq->nr_queued == 0) {
> >               update_idle_cfs_rq_clock_pelt(cfs_rq);
> > @@ -7088,18 +7101,14 @@ requeue_delayed_entity(struct sched_entity *se)
> >       WARN_ON_ONCE(!se->sched_delayed);
> >       WARN_ON_ONCE(!se->on_rq);
> >
> > -     if (sched_feat(DELAY_ZERO)) {
> > -             update_entity_lag(cfs_rq, se);
> > -             if (se->vlag > 0) {
> > -                     cfs_rq->nr_queued--;
> > -                     if (se != cfs_rq->curr)
> > -                             __dequeue_entity(cfs_rq, se);
> > -                     se->vlag = 0;
> > -                     place_entity(cfs_rq, se, 0);
> > -                     if (se != cfs_rq->curr)
> > -                             __enqueue_entity(cfs_rq, se);
> > -                     cfs_rq->nr_queued++;
> > -             }
> > +     if (update_entity_lag(cfs_rq, se)) {
> > +             cfs_rq->nr_queued--;
> > +             if (se != cfs_rq->curr)
> > +                     __dequeue_entity(cfs_rq, se);
> > +             place_entity(cfs_rq, se, 0);
> > +             if (se != cfs_rq->curr)
> > +                     __enqueue_entity(cfs_rq, se);
> > +             cfs_rq->nr_queued++;
>
> Triggering a full dequeue/enqueue cycle for every vlag adjustment appears
> to be a major bottleneck. Frequent RB-tree rebalancing here creates
> significant contention.
>
> Could we preserve fairness while recovering throughput by only re-queuing
> when the lag sign changes or a significant eligibility threshold is
> crossed?
>
> >       }
> >
> >       update_load_avg(cfs_rq, se, 0);
> > --
> > 2.43.0
> >
> >
> Regards,
> Shubhang Kaushik


* [tip: sched/core] sched/fair: Prevent negative lag increase during delayed dequeue
  2026-03-31 16:23 [PATCH v2] sched: fair: Prevent negative lag increase during delayed dequeue Vincent Guittot
                   ` (2 preceding siblings ...)
  2026-04-02 19:27 ` Shubhang Kaushik
@ 2026-04-03 12:30 ` tip-bot2 for Vincent Guittot
  3 siblings, 0 replies; 9+ messages in thread
From: tip-bot2 for Vincent Guittot @ 2026-04-03 12:30 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Vincent Guittot, Peter Zijlstra (Intel), x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     059258b0d424510202b6f2796279dbdbf0c6a83d
Gitweb:        https://git.kernel.org/tip/059258b0d424510202b6f2796279dbdbf0c6a83d
Author:        Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate:    Tue, 31 Mar 2026 18:23:52 +02:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 03 Apr 2026 14:23:41 +02:00

sched/fair: Prevent negative lag increase during delayed dequeue

The delayed dequeue feature aims to reduce the negative lag of a
dequeued task while it sleeps, but it can happen that newly enqueued
tasks move the avg vruntime backward and increase its negative lag.
When the delayed dequeued task wakes up, it has more negative lag
than if it had been dequeued immediately, or than other tasks that
were dequeued just before these new enqueues.

Ensure that the negative lag of a delayed dequeued task doesn't
increase during its delayed dequeue phase while waiting for its
negative lag to disappear. Similarly, remove any positive lag that
the delayed dequeued task could have gained during this period.

Short slice tasks are particularly impacted on overloaded systems.

Test on snapdragon rb5:

hackbench -T -p -l 16000000 -g 2 1> /dev/null &
cyclictest -t 1 -i 2777 -D 333 --policy=fair --mlock  -h 20000 -q

The scheduling latency of cyclictest is:

                       tip/sched/core  tip/sched/core    +this patch
cyclictest slice  (ms) (default)2.8             8               8
hackbench slice   (ms) (default)2.8            20              20
Total Samples          |   115632          119733          119806
Average           (us) |      364              64(-82%)        61(- 5%)
Median (P50)      (us) |       60              56(- 7%)        56(  0%)
90th Percentile   (us) |     1166              62(-95%)        62(  0%)
99th Percentile   (us) |     4192              73(-98%)        72(- 1%)
99.9th Percentile (us) |     8528            2707(-68%)      1300(-52%)
Maximum           (us) |    17735           14273(-20%)     13525(- 5%)

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://patch.msgid.link/20260331162352.551501-1-vincent.guittot@linaro.org
---
 kernel/sched/fair.c | 56 ++++++++++++++++++++++++++------------------
 1 file changed, 34 insertions(+), 22 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 41293d5..597ce5b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -840,11 +840,33 @@ static s64 entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se, u64 avrunt
 	return clamp(vlag, -limit, limit);
 }
 
-static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
+/*
+ * Delayed dequeue aims to reduce the negative lag of a dequeued task. While
+ * updating the lag of an entity, check that negative lag didn't increase
+ * during the delayed dequeue period which would be unfair.
+ * Similarly, check that the entity didn't gain positive lag when DELAY_ZERO
+ * is set.
+ *
+ * Return true if the lag has been adjusted.
+ */
+static __always_inline
+bool update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+	s64 vlag = entity_lag(cfs_rq, se, avg_vruntime(cfs_rq));
+	s64 raw_vlag = vlag;
+
 	WARN_ON_ONCE(!se->on_rq);
 
-	se->vlag = entity_lag(cfs_rq, se, avg_vruntime(cfs_rq));
+	if (se->sched_delayed) {
+		/* previous vlag < 0 otherwise se would not be delayed */
+		vlag = max(vlag, se->vlag);
+		if (sched_feat(DELAY_ZERO))
+			vlag = min(vlag, 0);
+	}
+	se->vlag = vlag;
+
+	/* true if the clamp above adjusted the lag */
+	return (vlag != raw_vlag);
 }
 
 /*
@@ -5564,13 +5586,6 @@ static void clear_delayed(struct sched_entity *se)
 	}
 }
 
-static inline void finish_delayed_dequeue_entity(struct sched_entity *se)
-{
-	clear_delayed(se);
-	if (sched_feat(DELAY_ZERO) && se->vlag > 0)
-		se->vlag = 0;
-}
-
 static bool
 dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 {
@@ -5596,6 +5611,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 		if (sched_feat(DELAY_DEQUEUE) && delay &&
 		    !entity_eligible(cfs_rq, se)) {
 			update_load_avg(cfs_rq, se, 0);
+			update_entity_lag(cfs_rq, se);
 			set_delayed(se);
 			return false;
 		}
@@ -5635,7 +5651,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	update_cfs_group(se);
 
 	if (flags & DEQUEUE_DELAYED)
-		finish_delayed_dequeue_entity(se);
+		clear_delayed(se);
 
 	if (cfs_rq->nr_queued == 0) {
 		update_idle_cfs_rq_clock_pelt(cfs_rq);
@@ -7084,18 +7100,14 @@ requeue_delayed_entity(struct sched_entity *se)
 	WARN_ON_ONCE(!se->sched_delayed);
 	WARN_ON_ONCE(!se->on_rq);
 
-	if (sched_feat(DELAY_ZERO)) {
-		update_entity_lag(cfs_rq, se);
-		if (se->vlag > 0) {
-			cfs_rq->nr_queued--;
-			if (se != cfs_rq->curr)
-				__dequeue_entity(cfs_rq, se);
-			se->vlag = 0;
-			place_entity(cfs_rq, se, 0);
-			if (se != cfs_rq->curr)
-				__enqueue_entity(cfs_rq, se);
-			cfs_rq->nr_queued++;
-		}
+	if (update_entity_lag(cfs_rq, se)) {
+		cfs_rq->nr_queued--;
+		if (se != cfs_rq->curr)
+			__dequeue_entity(cfs_rq, se);
+		place_entity(cfs_rq, se, 0);
+		if (se != cfs_rq->curr)
+			__enqueue_entity(cfs_rq, se);
+		cfs_rq->nr_queued++;
 	}
 
 	update_load_avg(cfs_rq, se, 0);


* Re: [PATCH v2] sched: fair: Prevent negative lag increase during delayed dequeue
  2026-04-03  8:37   ` Vincent Guittot
@ 2026-04-04  8:08     ` Shubhang Kaushik
  0 siblings, 0 replies; 9+ messages in thread
From: Shubhang Kaushik @ 2026-04-04  8:08 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, peterz, juri.lelli, dietmar.eggemann, rostedt, bsegall,
	mgorman, vschneid, kprateek.nayak, linux-kernel

Hi Vincent,

Thanks for the feedback. Previously, the baseline I referred to 
was the top of tip/sched/core. You were right about the slice tuning; 
the delta is much more apparent with a shorter preemption window. After 
setting base_slice_ns to 400us on the 80 core Ampere Altra, the results 
shifted significantly in favor of the patch.

The tail latency (P99.9) dropped from 4194us to 2205us (~47% reduction). 
While we see a slight increase in the P50 (from 62us to 88us), likely due 
to the additional instruction overhead in the update_entity_lag() 
hot-path, the overall distribution is much tighter under high contention.

The most notable impact is on system throughput. In a saturated hackbench 
run (32 groups/800 tasks), execution time dropped from 155.8s to 91.5s. 
This suggests that preventing the inflation of negative lag during delayed 
dequeue effectively mitigates runqueue logjams on high core count SMP. 
By ensuring short slice tasks aren't unfairly penalized upon wakeup, 
scheduling is noticeably smoother across the 80 cores.

System: Ampere Altra (80 Cores, 1P)
Baseline: tip/sched/core @ commit 2d4cc371baa5
Merged Patch: tip/sched/core @ commit 059258b0d424
Scheduler Tuning: base_slice_ns = 400,000 (0.4ms)

Hackbench results:- 
Background load: 32 groups / 800 tasks
Test Case	Baseline(sec)	Merged(sec)	Throughput
1 Thread	12.62		7.72		+38.8%
4 Threads	26.85		16.36		+39.1%
8 Threads	47.53		33.59		+29.3%
16 Processes	77.67		48.10		+38.1%
32 Processes	155.84		91.46		+41.3%

CyclicTest results:-
Background load: 20 groups / 800 tasks.
Metric		Baseline	Merged		Latency
P50 (Median)	62 us		88 us		+41.9%
P99		1956 us		1319 us		-32.5%
P99.9 (Tail)	4194 us		2205 us		-47.4%

Regarding the lag adjustment frequency, it seems to be an 
exceptional event. I monitored the logic using a kprobe on 
requeue_delayed_entity during the 32 process saturation test. Out of 
millions of scheduling events, the lag adjustment was triggered only a few 
times.

The patch provides an efficient guardrail that prevents EEVDF lag 
starvation at scale without imposing a frequent adjustment tax.
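
As a toy model of the lag inflation being guarded against (plain userspace
C with unit weights and made-up vruntime values; toy_avg_vruntime() and
toy_lag() are hypothetical helpers, not the kernel's load-weighted
implementation): enqueueing a task with a smaller vruntime pulls the
average backward, which makes the lag of a sleeping delayed entity more
negative.

```c
#include <assert.h>
#include <stdint.h>

/* Toy average vruntime over n entities, all with unit weight. */
static int64_t toy_avg_vruntime(const int64_t *vruntime, int n)
{
	int64_t sum = 0;

	for (int i = 0; i < n; i++)
		sum += vruntime[i];
	return sum / n;
}

/* Lag of an entity: avg_vruntime minus its vruntime (negative if it ran ahead). */
static int64_t toy_lag(int64_t se_vruntime, const int64_t *vruntime, int n)
{
	return toy_avg_vruntime(vruntime, n) - se_vruntime;
}
```

With entities at vruntimes {1000, 900}, the entity at 1000 has lag -50;
after a third task is enqueued at 600, its lag drops to -167 even though
it never ran, which is exactly the inflation the clamp prevents.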

Feel free to include:-
Tested-by: Shubhang Kaushik <shubhang@os.amperecomputing.com>

Regards,
Shubhang Kaushik

On Fri, 3 Apr 2026, Vincent Guittot wrote:

> On Thu, 2 Apr 2026 at 21:27, Shubhang Kaushik
> <shubhang@os.amperecomputing.com> wrote:
>>
>> Hi Vincent,
>>
>> I have been testing your v2 patch on my 80 core Ampere Altra (ARMv8
>> Neoverse-N1) 1P system using an idle tickless kernel on the latest
>> tip/sched/core branch.
>>
>> On Tue, 31 Mar 2026, Vincent Guittot wrote:
>>
>>> The delayed dequeue feature aims to reduce the negative lag of a dequeued
>>> task while it sleeps, but it can happen that newly enqueued tasks move
>>> the avg vruntime backward and increase its negative lag.
>>> When the delayed dequeued task wakes up, it has more negative lag than if
>>> it had been dequeued immediately, or than other tasks that were dequeued
>>> just before these new enqueues.
>>>
>>> Ensure that the negative lag of a delayed dequeued task doesn't increase
>>> during its delayed dequeue phase while waiting for its negative lag to
>>> disappear. Similarly, remove any positive lag that the delayed dequeued
>>> task could have gained during this period.
>>>
>>> Short slice tasks are particularly impacted on overloaded systems.
>>>
>>> Test on snapdragon rb5:
>>>
>>> hackbench -T -p -l 16000000 -g 2 1> /dev/null &
>>> cyclictest -t 1 -i 2777 -D 333 --policy=fair --mlock  -h 20000 -q
>>>
>>> The scheduling latency of cyclictest is:
>>>
>>>                       tip/sched/core  tip/sched/core    +this patch
>>> cyclictest slice  (ms) (default)2.8             8               8
>>> hackbench slice   (ms) (default)2.8            20              20
>>> Total Samples          |   115632          119733          119806
>>> Average           (us) |      364              64(-82%)        61(- 5%)
>>> Median (P50)      (us) |       60              56(- 7%)        56(  0%)
>>> 90th Percentile   (us) |     1166              62(-95%)        62(  0%)
>>> 99th Percentile   (us) |     4192              73(-98%)        72(- 1%)
>>> 99.9th Percentile (us) |     8528            2707(-68%)      1300(-52%)
>>> Maximum           (us) |    17735           14273(-20%)     13525(- 5%)
>>>
>>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>>> ---
>>>
>>
>> I replicated this cyclictest environment scaled for 80 cores using a
>> background hackbench load (-g 20). On Ampere Altra, I did not see the
>> tail latency reduction that you observed on the 8-core Snapdragon. In
>> fact, both average and max latencies increased slightly.
>>
>> Metric       | Baseline | Patched  | Delta (%)
>> -------------|----------|----------|-----------
>> Max Latency  | 9141us   | 9426us   | +3.11%
>> Avg Latency  | 206us    | 217us    | +5.33%
>> Min Latency  | 14us     | 13us     | -7.14%
>
> Without setting a shorter custom slice for cyclictest, you will not
> see any major differences. The difference happens in the p99 and
> p99.9 percentiles with a shorter slice.
>
>>
>> More concerning is the impact on throughput. At 8-16 threads, hackbench
>> execution times increased by ~30%. I attempted to isolate this by
>
> Hmm, I ran some perf tests and haven't seen any difference for
> hackbench with various numbers of groups.
>
>> disabling the DELAY_DEQUEUE sched_feature. But the regression persists
>> even with NO_DELAY_DEQUEUE, pointing to overhead in the modified
>> update_entity_lag() path itself.
>>
>> Test Case    | Baseline | Patched  | Delta (%) | Patched(NO_DELAYDQ)
>
> By baseline, do you mean tip/sched/core or v7.0-rcx ?
>
>> -------------|----------|----------|-----------|--------------------
>> 4 Threads    | 13.77s   | 17.53s   | +27.3%    | 17.16s
>> 8 Threads    | 24.39s   | 31.90s   | +30.8%    | 30.67s
>> 16 Threads   | 47.92s   | 60.46s   | +26.2%    | 62.53s
>> 32 Processes | 118.08s  | 103.16s  | -12.6%    | 101.87s
>
> That's surprising. I ran some perf tests with the patch and haven't
> seen any differences
>
>                                tip/sched/core  + patch
> hackbench 1 process socket      0,581           0,580 (0,0 %)
>                       stddev   2,7 %           2,5 %
> hackbench 4 process socket      0,612           0,612 (0,0 %)
>                       stddev   0,9 %           2,3 %
> hackbench 8 process socket      0,662           0,659 (0,4 %)
>                       stddev   1,0 %           1,8 %
> hackbench 16 process socket     0,700           0,699 (0,3 %)
>                       stddev   1,6 %           1,3 %
> hackbench 1 process pipe        0,796           0,797 (-0,2 %)
>                       stddev   1,5 %           1,9 %
> hackbench 4 process pipe        0,699           0,694 (0,8 %)
>                       stddev   3,7 %           2,5 %
> hackbench 8 process pipe        0,631           0,636 (-0,9 %)
>                       stddev   3,4 %           2,2 %
> hackbench 16 process pipe       0,612           0,594 (2,9 %)
>                       stddev   1,8 %           1,5 %
> hackbench 1 thread socket       0,571           0,570 (0,1 %)
>                       stddev   2,3 %           1,5 %
> hackbench 4 thread socket       0,591           0,594 (-0,5 %)
>                       stddev   1,2 %           0,7 %
> hackbench 8 thread socket       0,621           0,628 (-1,2 %)
>                       stddev   1,3 %           1,4 %
> hackbench 16 thread socket      0,660           0,653 (1,0 %)
>                       stddev   0,7 %           0,9 %
> hackbench 1 thread pipe         0,860           0,864 (-0,6 %)
>                       stddev   1,4 %           2,0 %
> hackbench 4 thread pipe         0,828           0,821 (0,9 %)
>                       stddev   3,5 %           4,7 %
> hackbench 8 thread pipe         0,725           0,739 (-1,8 %)
>                       stddev   2,3 %           8,6 %
> hackbench 16 thread pipe        0,647           0,645 (0,4 %)
>                       stddev   4,3 %           4,2 %
>
>>
>>> Since v1:
>>> - Embedded the check of lag evolution of delayed dequeue entities in
>>>  update_entity_lag() to include all cases.
>>>
>>
>> While the patch shows a ~12.6% improvement at high saturation (32
>> processes), the throughput cost at mid-range scales appears to outweigh
>> the fairness benefits on our high core system, as even the worst-case
>> wake-up latencies did not improve.
>>
>>> kernel/sched/fair.c | 53 ++++++++++++++++++++++++++-------------------
>>> 1 file changed, 31 insertions(+), 22 deletions(-)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 226509231e67..c1ffe86bf78d 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -840,11 +840,30 @@ static s64 entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se, u64 avrunt
>>>       return clamp(vlag, -limit, limit);
>>> }
>>>
>>> -static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
>>> +/*
>>> + * Delayed dequeue aims to reduce the negative lag of a dequeued task.
>>> + * While updating the lag of an entity, check that negative lag didn't increase
>>> + * during the delayed dequeue period which would be unfair.
>>> + * Similarly, check that the entity didn't gain positive lag when DELAY_ZERO is
>>> + * set.
>>> + *
>>> + * Return true if the lag has been adjusted.
>>> + */
>>> +static bool update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
>>> {
>>> +     s64 vlag;
>>> +
>>>       WARN_ON_ONCE(!se->on_rq);
>>>
>>> -     se->vlag = entity_lag(cfs_rq, se, avg_vruntime(cfs_rq));
>>> +     vlag = entity_lag(cfs_rq, se, avg_vruntime(cfs_rq));
>>> +
>>> +     if (se->sched_delayed)
>>> +             /* previous vlag < 0 otherwise se would not be delayed */
>>> +             se->vlag = clamp(vlag, se->vlag, sched_feat(DELAY_ZERO) ? 0 : S64_MAX);
>>> +     else
>>> +             se->vlag = vlag;
>>> +
>>> +     return (vlag != se->vlag);
>>> }
>>>
>>> /*
>>> @@ -5563,13 +5582,6 @@ static void clear_delayed(struct sched_entity *se)
>>>       }
>>> }
>>>
>>> -static inline void finish_delayed_dequeue_entity(struct sched_entity *se)
>>> -{
>>> -     clear_delayed(se);
>>> -     if (sched_feat(DELAY_ZERO) && se->vlag > 0)
>>> -             se->vlag = 0;
>>> -}
>>> -
>>> static bool
>>> dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>>> {
>>> @@ -5595,6 +5607,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>>>               if (sched_feat(DELAY_DEQUEUE) && delay &&
>>>                   !entity_eligible(cfs_rq, se)) {
>>>                       update_load_avg(cfs_rq, se, 0);
>>> +                     update_entity_lag(cfs_rq, se);
>>
>> The regression persists even with NO_DELAY_DEQUEUE, likely because
>> update_entity_lag() is now called unconditionally in dequeue_entity()
>> thereby adding avg_vruntime() overhead and cacheline contention for every
>> dequeue.
>>
>> Do consider guarding the update_entity_lag() call in dequeue_entity()
>> with sched_feat(DELAY_DEQUEUE) check to avoid this tax when the feature
>> is disabled.
>>
>>>                       set_delayed(se);
>>>                       return false;
>>>               }
>>> @@ -5634,7 +5647,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>>>       update_cfs_group(se);
>>>
>>>       if (flags & DEQUEUE_DELAYED)
>>> -             finish_delayed_dequeue_entity(se);
>>> +             clear_delayed(se);
>>>
>>>       if (cfs_rq->nr_queued == 0) {
>>>               update_idle_cfs_rq_clock_pelt(cfs_rq);
>>> @@ -7088,18 +7101,14 @@ requeue_delayed_entity(struct sched_entity *se)
>>>       WARN_ON_ONCE(!se->sched_delayed);
>>>       WARN_ON_ONCE(!se->on_rq);
>>>
>>> -     if (sched_feat(DELAY_ZERO)) {
>>> -             update_entity_lag(cfs_rq, se);
>>> -             if (se->vlag > 0) {
>>> -                     cfs_rq->nr_queued--;
>>> -                     if (se != cfs_rq->curr)
>>> -                             __dequeue_entity(cfs_rq, se);
>>> -                     se->vlag = 0;
>>> -                     place_entity(cfs_rq, se, 0);
>>> -                     if (se != cfs_rq->curr)
>>> -                             __enqueue_entity(cfs_rq, se);
>>> -                     cfs_rq->nr_queued++;
>>> -             }
>>> +     if (update_entity_lag(cfs_rq, se)) {
>>> +             cfs_rq->nr_queued--;
>>> +             if (se != cfs_rq->curr)
>>> +                     __dequeue_entity(cfs_rq, se);
>>> +             place_entity(cfs_rq, se, 0);
>>> +             if (se != cfs_rq->curr)
>>> +                     __enqueue_entity(cfs_rq, se);
>>> +             cfs_rq->nr_queued++;
>>
>> Triggering a full dequeue/enqueue cycle for every vlag adjustment appears
>> to be a major bottleneck. Frequent RB-tree rebalancing here creates
>> significant contention.
>
> This adjustment is not supposed to happen
>
>>
>> Could we preserve fairness while recovering throughput by only re-queuing
>> when the lag sign changes or a significant eligibility threshold is
>> crossed?
>
> Could you monitor how often we have to adjust the lag in your case? As
> mentioned above, this shouldn't happen often, particularly the case
> where the negative lag would increase.
>
>>
>>>       }
>>>
>>>       update_load_avg(cfs_rq, se, 0);
>>> --
>>> 2.43.0
>>>
>>>
>> Regards,
>> Shubhang Kaushik
>


end of thread, other threads:[~2026-04-04  8:08 UTC | newest]

Thread overview: 9+ messages
2026-03-31 16:23 [PATCH v2] sched: fair: Prevent negative lag increase during delayed dequeue Vincent Guittot
2026-04-02 13:13 ` Peter Zijlstra
2026-04-02 13:17   ` Vincent Guittot
2026-04-02 13:42 ` Peter Zijlstra
2026-04-02 19:27 ` Shubhang Kaushik
2026-04-03  8:37   ` Vincent Guittot
2026-04-04  8:08     ` Shubhang Kaushik
2026-04-03  8:46   ` Vincent Guittot
2026-04-03 12:30 ` [tip: sched/core] sched/fair: " tip-bot2 for Vincent Guittot
