public inbox for linux-kernel@vger.kernel.org
* [PATCH] sched/deadline: Fix stale dl_defer_running in dl_server else-branch
@ 2026-04-02 13:30 soolaugust
  2026-04-03  0:05 ` John Stultz
  0 siblings, 1 reply; 16+ messages in thread
From: soolaugust @ 2026-04-02 13:30 UTC
  To: jstultz, bristot; +Cc: peterz, mingo, linux-kernel, arighi, Zhidao Su

From: Zhidao Su <suzhidao@xiaomi.com>

Peter's fix (115135422562) cleared dl_defer_running in the if-branch of
update_dl_entity() (deadline expired/overflow). This ensures
replenish_dl_new_period() always arms the zero-laxity timer. However,
with PROXY_WAKING, re-activation hits the else-branch (same-period,
deadline not expired), where dl_defer_running from a prior starvation
episode can be stale.

During PROXY_WAKING CPU return-migration, proxy_force_return() migrates
the task to a new CPU via deactivate_task()+attach_one_task(). The
enqueue path on the new CPU triggers enqueue_task_fair() which calls
dl_server_start() for the fair_server. Crucially, this re-activation
does NOT call dl_server_stop() first, so dl_defer_running retains its
prior value. If a prior starvation episode left dl_defer_running=1,
and the server is re-activated within the same period:

  [4] D->A: dl_server_stop() clears flags but may be skipped when
            dl_server_active=0 (server was already stopped before
            return-migration triggered dl_server_start())
  [1] A->B: dl_server_start() -> enqueue_dl_entity(WAKEUP)
             -> update_dl_entity() enters else-branch
             -> 'if (!dl_defer_running)' is false, so
                dl_defer_armed=1 / dl_throttled=1 are skipped
             -> server enqueued into [D] state directly
             -> update_curr_dl_se() consumes runtime
             -> start_dl_timer() with dl_defer_armed=0 (slow path)
             -> boot time increases ~72%

Fix: in the else-branch, unconditionally clear dl_defer_running and always
set dl_defer_armed=1 / dl_throttled=1. This ensures every same-period
re-activation properly re-arms the zero-laxity timer, regardless of whether
a prior starvation episode had set dl_defer_running.
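
To illustrate the difference, here is a minimal userspace sketch of the
else-branch before and after this patch. It is not kernel code: struct
dl_flags_model and both helper names are hypothetical stand-ins that only
mirror the three flags touched in the diff.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the three sched_dl_entity flags discussed here. */
struct dl_flags_model {
	bool dl_defer_running;
	bool dl_defer_armed;
	bool dl_throttled;
};

/* Else-branch before this patch: re-arm only when dl_defer_running is clear. */
static void else_branch_before(struct dl_flags_model *se)
{
	if (!se->dl_defer_running) {
		se->dl_defer_armed = true;
		se->dl_throttled = true;
	}
}

/* Else-branch with this patch: drop the stale flag, then always re-arm. */
static void else_branch_after(struct dl_flags_model *se)
{
	se->dl_defer_running = false;
	se->dl_defer_armed = true;
	se->dl_throttled = true;
}
```

Starting from dl_defer_running=1, else_branch_before() leaves
dl_defer_armed and dl_throttled at 0, which corresponds to the slow path
described above, while else_branch_after() ends with the zero-laxity
timer armed.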

The if-branch (deadline expired) no longer clears dl_defer_running:
replenish_dl_new_period() contains its own guard ('if (!dl_defer_running)')
that arms the zero-laxity timer only when dl_defer_running=0. With
PROXY_WAKING, dl_defer_running=1 in the deadline-expired path means a
genuine starvation episode is ongoing, so the server can skip the
zero-laxity wait and enter [D] directly. Clearing dl_defer_running here
(as Peter's fix did) forces every PROXY_WAKING deadline-expired
re-activation through the ~950ms zero-laxity wait.
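
The if-branch reasoning can be sketched the same way. This is a rough
userspace model, not the kernel implementation: struct
dl_replenish_model, replenish_guard(), if_branch() and
zero_laxity_delay_ms() are hypothetical names, and the guard is reduced
to the single condition quoted above. The 50 ms runtime / 1000 ms period
used to reproduce the ~950ms figure is an assumption about the
fair_server defaults, not something measured in this patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the flags read and written on the if-branch path. */
struct dl_replenish_model {
	bool dl_defer_running;
	bool dl_defer_armed;
	bool dl_throttled;
};

/* Models only the guard quoted above: defer unless starvation is being handled. */
static void replenish_guard(struct dl_replenish_model *se)
{
	if (!se->dl_defer_running) {
		se->dl_throttled = true;
		se->dl_defer_armed = true;
	}
}

/* If-branch; clear_first=true models the unconditional clear from Peter's fix. */
static void if_branch(struct dl_replenish_model *se, bool clear_first)
{
	if (clear_first)
		se->dl_defer_running = false;
	replenish_guard(se);
}

/* Wait until zero laxity right after a replenish: relative deadline minus runtime. */
static uint64_t zero_laxity_delay_ms(uint64_t deadline_ms, uint64_t runtime_ms)
{
	return deadline_ms - runtime_ms;
}
```

With dl_defer_running=1, if_branch(se, false) leaves the server
unthrottled so it can enter [D] directly, whereas if_branch(se, true)
arms the deferral; under the assumed parameters,
zero_laxity_delay_ms(1000, 50) gives the 950 ms wait.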

Measured boot time to first ksched_football event (4 CPUs, 4G):
  This fix: ~15-20s
  Without fix (stale dl_defer_running): ~43-62s (+72-200%)

Note: Andrea Righi's v2 patch addresses the same symptom by clearing
dl_defer_running in dl_server_stop(). However, dl_server_stop() is not
called during PROXY_WAKING return-migration (proxy_force_return() calls
dl_server_start() directly without dl_server_stop()). This fix targets
the correct location: the else-branch of update_dl_entity().

Signed-off-by: Zhidao Su <suzhidao@xiaomi.com>
---
 kernel/sched/deadline.c | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 01754d699f0..b2bcd34f3ea 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1034,22 +1034,22 @@ static void update_dl_entity(struct sched_dl_entity *dl_se)
 			return;
 		}
 
-		/*
-		 * When [4] D->A is followed by [1] A->B, dl_defer_running
-		 * needs to be cleared, otherwise it will fail to properly
-		 * start the zero-laxity timer.
-		 */
-		dl_se->dl_defer_running = 0;
 		replenish_dl_new_period(dl_se, rq);
 	} else if (dl_server(dl_se) && dl_se->dl_defer) {
 		/*
-		 * The server can still use its previous deadline, so check if
-		 * it left the dl_defer_running state.
+		 * The server can still use its previous deadline. Clear
+		 * dl_defer_running unconditionally: a stale dl_defer_running=1
+		 * from a prior starvation episode (set in dl_server_timer() when
+		 * the zero-laxity timer fires) must not carry over to the next
+		 * activation. PROXY_WAKING return-migration (proxy_force_return)
+		 * re-activates the server via attach_one_task()->enqueue_task_fair()
+		 * without calling dl_server_stop() first, so the flag is not
+		 * cleared in the [4] D->A path for that case.
+		 * Always re-arm the zero-laxity timer on each re-activation.
 		 */
-		if (!dl_se->dl_defer_running) {
-			dl_se->dl_defer_armed = 1;
-			dl_se->dl_throttled = 1;
-		}
+		dl_se->dl_defer_running = 0;
+		dl_se->dl_defer_armed = 1;
+		dl_se->dl_throttled = 1;
 	}
 }
 
-- 
2.43.0


Thread overview: 16+ messages
2026-04-02 13:30 [PATCH] sched/deadline: Fix stale dl_defer_running in dl_server else-branch soolaugust
2026-04-03  0:05 ` John Stultz
2026-04-03  1:30   ` John Stultz
2026-04-03  8:12     ` [PATCH] sched/deadline: Fix stale dl_defer_running in update_dl_entity() if-branch soolaugust
2026-04-03 13:42       ` Peter Zijlstra
2026-04-03 13:58         ` Andrea Righi
2026-04-03 19:31         ` John Stultz
2026-04-03 22:46           ` Peter Zijlstra
2026-04-03 22:51             ` John Stultz
2026-04-03 22:54               ` John Stultz
2026-04-04 10:22             ` Peter Zijlstra
2026-04-05  8:37               ` zhidao su
2026-04-06 20:01               ` John Stultz
2026-04-06 20:03                 ` John Stultz
2026-04-07 12:22               ` Juri Lelli
2026-04-07 15:00                 ` Peter Zijlstra
