public inbox for linux-doc@vger.kernel.org
* [PATCH RFC 0/4] sched/deadline: Add soft/reclaim mode via SCHED_OTHER demotion
@ 2026-02-19 13:37 Juri Lelli
  2026-02-19 13:37 ` [PATCH RFC 1/4] sched/deadline: Implement reclaim/soft mode through " Juri Lelli
                   ` (3 more replies)
  0 siblings, 4 replies; 7+ messages in thread
From: Juri Lelli @ 2026-02-19 13:37 UTC
  To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Jonathan Corbet, Shuah Khan, Qais Yousef, Clark Williams,
	Gabriele Monaco, Tommaso Cucinotta, Luca Abeni
  Cc: linux-kernel, linux-doc, Juri Lelli

Hi All,

This RFC introduces a bandwidth reclaiming mechanism for SCHED_DEADLINE
tasks through temporary demotion to SCHED_NORMAL when runtime is
exhausted. This resurrects and refines the demotion concept from the
original SCHED_DEADLINE development circa 2010, focusing exclusively on
SCHED_NORMAL demotion.

Discussions about this feature have resurfaced repeatedly over the years,
so I wanted to check its feasibility and gauge real interest. I found a
little time to play with the idea, and this series is the result.

When a DEADLINE task with SCHED_FLAG_DL_DEMOTION exhausts its runtime
budget, the scheduler demotes it to SCHED_NORMAL rather than throttling
it until the next period. The task continues execution competing fairly
with other normal tasks, using the nice value specified in
sched_attr.sched_nice. At the next period boundary, the replenishment
timer automatically promotes the task back to SCHED_DEADLINE with a
fresh runtime budget.
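
For readers who want to try this from userspace, here is a minimal sketch
of how a request could be filled in. It is not part of the series. The
demo_* names are hypothetical, the struct is a local mirror of the UAPI
struct sched_attr layout, and the 0x80 flag value is the one proposed in
patch 1/4, so it may change.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE 6
#endif
#ifndef SCHED_FLAG_DL_DEMOTION
#define SCHED_FLAG_DL_DEMOTION 0x80 /* assumption: value proposed in patch 1/4 */
#endif

/* Local mirror of the UAPI struct sched_attr (hypothetical name). */
struct demo_sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
	uint32_t sched_util_min;
	uint32_t sched_util_max;
};

/*
 * Fill a SCHED_DEADLINE request with an implicit deadline (deadline ==
 * period) that asks for demotion on runtime exhaustion. All times are in
 * nanoseconds; nice is the nice value to use while demoted.
 */
static inline void demo_fill_attr(struct demo_sched_attr *attr,
				  uint64_t runtime_ns, uint64_t period_ns,
				  int32_t nice)
{
	assert(runtime_ns > 0 && runtime_ns <= period_ns);

	memset(attr, 0, sizeof(*attr)); /* no stale bits in unused fields */
	attr->size = sizeof(*attr);
	attr->sched_policy = SCHED_DEADLINE;
	attr->sched_flags = SCHED_FLAG_DL_DEMOTION;
	attr->sched_nice = nice;
	attr->sched_runtime = runtime_ns;
	attr->sched_deadline = period_ns;
	attr->sched_period = period_ns;
}

/* Thin wrapper around the raw syscall; returns 0 or -1 with errno set. */
static inline long demo_sched_setattr(pid_t pid,
				      const struct demo_sched_attr *attr)
{
	return syscall(SYS_sched_setattr, pid, attr, 0);
}
```

Calling demo_sched_setattr(0, &attr) needs the usual SCHED_DEADLINE
privileges, and on a kernel without this series the call is expected to
fail because the flag is unknown.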

This provides a "soft(er) real-time" mode where tasks get timing
guarantees when within budget but gracefully degrade to best-effort
execution during overruns rather than being suspended. The bandwidth
reservation remains in place during demotion, so from an admission
control perspective the mechanism is as transparent as throttling.

Key design aspects:

The implementation focuses solely on SCHED_NORMAL demotion, unlike
earlier proposals that suggested multiple demotion targets including RT
and DL postponement. Simpler and maybe enough?

The feature reuses the existing sched_attr.sched_nice field to specify
the nice value during demotion, avoiding new UAPI additions while
maintaining ABI compatibility. This is orthogonal to GRUB
(SCHED_FLAG_RECLAIM) - tasks can combine both mechanisms for
opportunistic reclaiming through accounting and continued execution
through demotion (at least in principle, didn't actually test it yet :).
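
As a rough illustration of the flag and nice handling described above,
the sketch below mirrors the range check that patch 1/4 adds to
__checkparam_dl(). The DEMO_* and demo_* names are hypothetical. The
priority helper assumes the usual NICE_TO_PRIO mapping (120 + nice) is
what ends up in static_prio, as the comment in dl_task_demote() says.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_FLAG_RECLAIM	0x02ULL	/* SCHED_FLAG_RECLAIM (GRUB) */
#define DEMO_FLAG_DL_DEMOTION	0x80ULL	/* assumption: value from patch 1/4 */
#define DEMO_MIN_NICE		(-20)
#define DEMO_MAX_NICE		19

/* Mirrors the nice-range check the patch adds to __checkparam_dl(). */
static inline bool demo_nice_ok(uint64_t flags, int32_t nice)
{
	if (!(flags & DEMO_FLAG_DL_DEMOTION))
		return true;	/* that check only applies with the flag set */

	return nice >= DEMO_MIN_NICE && nice <= DEMO_MAX_NICE;
}

/*
 * Priority a demoted task would run at, assuming the standard mapping
 * static_prio = 120 + nice.
 */
static inline int demo_demoted_prio(int32_t nice)
{
	assert(nice >= DEMO_MIN_NICE && nice <= DEMO_MAX_NICE);
	return 120 + nice;
}
```

Combining both mechanisms would then just mean passing
DEMO_FLAG_DL_DEMOTION | DEMO_FLAG_RECLAIM as the flags word.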

Demoted tasks cannot migrate between CPUs. This simplification keeps
bandwidth accounting straightforward by ensuring the reservation stays
on the original CPU throughout demotion. Migration is re-enabled after
promotion or explicit parameter changes via sched_setattr().

The bandwidth accounting follows the throttling model rather than full
class switching. Dequeue operations omit DEQUEUE_SAVE to keep the
reservation in this_bw (admission control bandwidth). Running bandwidth
(enforcement) is handled at 0-lag time for tasks that sleep while
demoted, maintaining correct GRUB accounting.

Explicit sched_setattr() calls on demoted tasks cancel the demotion
state and perform full bandwidth cleanup including inactive timer
handling and cpuset tracking. The replenishment timer remains armed but
fires harmlessly when it detects the task is no longer DEADLINE.
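
The accounting and cancellation rules above can be modelled with a toy
userspace state machine. This is only a sketch to make the intended
invariants explicit, not kernel code. All demo_* names are hypothetical,
and only the admission-control bandwidth (this_bw) is tracked;
running_bw and the inactive timer are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum demo_state { DEMO_NOT_DEMOTED, DEMO_DEMOTING, DEMO_DEMOTED, DEMO_PROMOTING };
enum demo_policy { DEMO_POLICY_NORMAL, DEMO_POLICY_DEADLINE };

struct demo_cpu {
	uint64_t this_bw;	/* admitted bandwidth on this CPU */
};

struct demo_task {
	enum demo_state state;
	enum demo_policy policy;
	uint64_t bw;		/* this task's reservation */
	bool wants_demotion;	/* models SCHED_FLAG_DL_DEMOTION */
};

/* Admission: a new DEADLINE task adds its reservation to this_bw. */
static inline void demo_admit(struct demo_cpu *cpu, struct demo_task *t,
			      uint64_t bw, bool wants_demotion)
{
	t->state = DEMO_NOT_DEMOTED;
	t->policy = DEMO_POLICY_DEADLINE;
	t->bw = bw;
	t->wants_demotion = wants_demotion;
	cpu->this_bw += bw;
}

/* Runtime exhausted: true if demoted, false if it would be throttled. */
static inline bool demo_runtime_exhausted(struct demo_task *t)
{
	if (t->state != DEMO_NOT_DEMOTED || !t->wants_demotion)
		return false;

	t->state = DEMO_DEMOTING;	/* class switch skips bw removal */
	t->policy = DEMO_POLICY_NORMAL;
	t->state = DEMO_DEMOTED;
	return true;
}

/* Replenishment timer: true if the task was promoted back. */
static inline bool demo_timer_fires(struct demo_task *t)
{
	if (t->state != DEMO_DEMOTED)
		return false;		/* nothing to promote, harmless */

	t->state = DEMO_PROMOTING;	/* class switch skips bw addition */
	t->policy = DEMO_POLICY_DEADLINE;
	t->state = DEMO_NOT_DEMOTED;
	return true;
}

/* Explicit switch to a normal policy: releases the reservation once. */
static inline void demo_setattr_normal(struct demo_cpu *cpu, struct demo_task *t)
{
	if (t->state == DEMO_DEMOTED || t->policy == DEMO_POLICY_DEADLINE) {
		assert(cpu->this_bw >= t->bw);
		cpu->this_bw -= t->bw;
	}
	t->state = DEMO_NOT_DEMOTED;
	t->policy = DEMO_POLICY_NORMAL;
}
```

The point of the model is that this_bw changes only on admission and on
an explicit parameter change, never on a demotion or promotion.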

This posting is very much experimental. I added AI-generated tests
(included here just for reference) that helped check a few cases
during implementation. However, I am quite sure I'm missing several
additional cases that can cause breakage. Test it at your own risk! :P

Based on original work by Dario Faggioli:
https://lore.kernel.org/lkml/1288334546.8661.161.camel@Palantir/

As always, comments and questions are more than welcome.

Series also available at

git@github.com:jlelli/linux.git upstream/deadline-demotion

Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
---
Juri Lelli (4):
      sched/deadline: Implement reclaim/soft mode through SCHED_OTHER demotion
      sched/doc: Document SCHED_DEADLINE demotion feature
      DEBUG selftests/sched: Add tests for SCHED_DEADLINE demotion feature
      DEBUG selftests/sched: Add simple demonstration of SCHED_DEADLINE demotion

 Documentation/scheduler/sched-deadline.rst         |  54 +++
 include/linux/sched.h                              |  10 +
 include/uapi/linux/sched.h                         |   4 +-
 include/uapi/linux/sched/types.h                   |   8 +
 kernel/sched/deadline.c                            | 213 +++++++++-
 kernel/sched/fair.c                                |   8 +
 kernel/sched/sched.h                               |  15 +-
 kernel/sched/syscalls.c                            |   8 +
 tools/testing/selftests/sched/.gitignore           |   3 +
 tools/testing/selftests/sched/Makefile             |   4 +-
 tools/testing/selftests/sched/README_dl_demotion   |  83 ++++
 tools/testing/selftests/sched/dl_demotion_demo.c   | 239 +++++++++++
 tools/testing/selftests/sched/dl_demotion_stress.c | 208 ++++++++++
 tools/testing/selftests/sched/dl_demotion_test.c   | 460 +++++++++++++++++++++
 .../selftests/sched/run_dl_demotion_with_trace.sh  |  71 ++++
 15 files changed, 1382 insertions(+), 6 deletions(-)
---
base-commit: e34881c84c255bc300f24d9fe685324be20da3d1
change-id: 20260218-upstream-deadline-demotion-19511e741055

Best regards,
--  
Juri Lelli <juri.lelli@redhat.com>



* [PATCH RFC 1/4] sched/deadline: Implement reclaim/soft mode through SCHED_OTHER demotion
  2026-02-19 13:37 [PATCH RFC 0/4] sched/deadline: Add soft/reclaim mode via SCHED_OTHER demotion Juri Lelli
@ 2026-02-19 13:37 ` Juri Lelli
  2026-02-20 19:47   ` Peter Zijlstra
  2026-02-19 13:37 ` [PATCH RFC 2/4] sched/doc: Document SCHED_DEADLINE demotion feature Juri Lelli
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 7+ messages in thread
From: Juri Lelli @ 2026-02-19 13:37 UTC
  To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Jonathan Corbet, Shuah Khan, Qais Yousef, Clark Williams,
	Gabriele Monaco, Tommaso Cucinotta, Luca Abeni
  Cc: linux-kernel, linux-doc, Juri Lelli

Add support for demoting deadline tasks to SCHED_OTHER when they exhaust
their runtime. This prevents starvation of lower priority tasks while still
allowing deadline tasks to utilize available CPU bandwidth.

This feature resurrects and refines the bandwidth reclaiming concept
from the original SCHED_DEADLINE development (circa 2010), focusing on a
single demotion mode: SCHED_OTHER.

When SCHED_FLAG_DL_DEMOTION is set and a task exhausts its runtime:
- Task is demoted from SCHED_DEADLINE to SCHED_OTHER
- Task continues executing with the nice value from sched_attr.sched_nice
- Cannot starve other SCHED_OTHER tasks (fair sharing)
- Cannot interfere with SCHED_DEADLINE or "SCHED_RT" tasks
- Automatically promoted back to SCHED_DEADLINE at replenishment

The feature is orthogonal to GRUB (SCHED_FLAG_RECLAIM). Tasks can use
GRUB for opportunistic bandwidth reclaiming through accounting, while
being demoted for continued execution as SCHED_OTHER after runtime
exhaustion.

Note that demotion is disabled for DL servers and PI-boosted tasks. Also,
demoted tasks cannot migrate, which keeps bandwidth accounting simple.

Based on original work by Dario Faggioli:
https://lore.kernel.org/lkml/1288334546.8661.161.camel@Palantir/

Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
---
 include/linux/sched.h            |  10 ++
 include/uapi/linux/sched.h       |   4 +-
 include/uapi/linux/sched/types.h |   8 ++
 kernel/sched/deadline.c          | 213 ++++++++++++++++++++++++++++++++++++++-
 kernel/sched/fair.c              |   8 ++
 kernel/sched/sched.h             |  15 ++-
 kernel/sched/syscalls.c          |   8 ++
 7 files changed, 262 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d834d0190c46f..680c178184260 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -696,6 +696,15 @@ struct sched_dl_entity {
 	 * running, skipping the defer phase.
 	 *
 	 * @dl_defer_idle tracks idle state
+	 *
+	 * @dl_demotion_state tracks the demotion state machine:
+	 *   DL_NOT_DEMOTED (0): Normal SCHED_DEADLINE execution
+	 *   DL_DEMOTING (1): Transition in progress (DL -> NORMAL), skip bw removal
+	 *   DL_DEMOTED (2): Running as SCHED_NORMAL, bandwidth still reserved
+	 *   DL_PROMOTING (3): Transition in progress (NORMAL -> DL), skip bw addition
+	 *
+	 * Demoted tasks cannot migrate (enforced by dl_task_can_migrate()), so
+	 * bandwidth reservation always stays on the current CPU.
 	 */
 	unsigned int			dl_throttled      : 1;
 	unsigned int			dl_yielded        : 1;
@@ -707,6 +716,7 @@ struct sched_dl_entity {
 	unsigned int			dl_defer_armed	  : 1;
 	unsigned int			dl_defer_running  : 1;
 	unsigned int			dl_defer_idle     : 1;
+	unsigned int			dl_demotion_state : 2;
 
 	/*
 	 * Bandwidth enforcement timer. Each -deadline task has its
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 359a14cc76a40..aeab67899ed30 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -133,6 +133,7 @@ struct clone_args {
 #define SCHED_FLAG_KEEP_PARAMS		0x10
 #define SCHED_FLAG_UTIL_CLAMP_MIN	0x20
 #define SCHED_FLAG_UTIL_CLAMP_MAX	0x40
+#define SCHED_FLAG_DL_DEMOTION		0x80
 
 #define SCHED_FLAG_KEEP_ALL	(SCHED_FLAG_KEEP_POLICY | \
 				 SCHED_FLAG_KEEP_PARAMS)
@@ -144,6 +145,7 @@ struct clone_args {
 			 SCHED_FLAG_RECLAIM		| \
 			 SCHED_FLAG_DL_OVERRUN		| \
 			 SCHED_FLAG_KEEP_ALL		| \
-			 SCHED_FLAG_UTIL_CLAMP)
+			 SCHED_FLAG_UTIL_CLAMP		| \
+			 SCHED_FLAG_DL_DEMOTION)
 
 #endif /* _UAPI_LINUX_SCHED_H */
diff --git a/include/uapi/linux/sched/types.h b/include/uapi/linux/sched/types.h
index bf6e9ae031c11..ca581b4fa3f93 100644
--- a/include/uapi/linux/sched/types.h
+++ b/include/uapi/linux/sched/types.h
@@ -94,6 +94,14 @@
  * scheduled on a CPU with no more capacity than the specified value.
  *
  * A task utilization boundary can be reset by setting the attribute to -1.
+ *
+ * SCHED_DEADLINE Demotion
+ * ========================
+ *
+ * When a SCHED_DEADLINE task exhausts its runtime and SCHED_FLAG_DL_DEMOTION
+ * is set, the task is demoted to SCHED_OTHER to continue executing at lower
+ * priority. The sched_nice value specifies the nice level when demoted.
+ * The task is automatically promoted back to SCHED_DEADLINE at replenish.
  */
 struct sched_attr {
 	__u32 size;
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index d08b004293234..d835945123c16 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -1204,6 +1204,9 @@ static enum hrtimer_restart dl_server_timer(struct hrtimer *timer, struct sched_
 	return HRTIMER_NORESTART;
 }
 
+static void
+dl_task_promote(struct rq *rq, struct task_struct *p);
+
 /*
  * This is the bandwidth enforcement timer callback. If here, we know
  * a task is not on its dl_rq, since the fact that the timer was running
@@ -1236,7 +1239,7 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
 	 * The task might have changed its scheduling policy to something
 	 * different than SCHED_DEADLINE (through switched_from_dl()).
 	 */
-	if (!dl_task(p))
+	if (!dl_task(p) && dl_se->dl_demotion_state == DL_NOT_DEMOTED)
 		goto unlock;
 
 	/*
@@ -1256,6 +1259,27 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
 	sched_clock_tick();
 	update_rq_clock(rq);
 
+	/*
+	 * If the task was demoted to SCHED_OTHER, promote it back to
+	 * SCHED_DEADLINE now that it's going to be replenished.
+	 *
+	 * Note: Demoted tasks cannot migrate (enforced by dl_task_can_migrate),
+	 * so bandwidth is guaranteed to still be on this CPU.
+	 */
+	if (dl_se->dl_demotion_state == DL_DEMOTED) {
+		/*
+		 * We're at 0-lag time by definition (replenish). The task went
+		 * to sleep as SCHED_NORMAL, so task_non_contending() was never
+		 * called and running_bw was never removed. Remove it now so that
+		 * when the task wakes up as DEADLINE, the normal enqueue path
+		 * can add it back.
+		 */
+		if (!task_on_rq_queued(p))
+			sub_running_bw(dl_se, &rq->dl);
+
+		dl_task_promote(rq, p);
+	}
+
 	/*
 	 * If the throttle happened during sched-out; like:
 	 *
@@ -1293,6 +1317,7 @@ static enum hrtimer_restart dl_task_timer(struct hrtimer *timer)
 	}
 
 	enqueue_task_dl(rq, p, ENQUEUE_REPLENISH);
+
 	if (dl_task(rq->donor))
 		wakeup_preempt_dl(rq, p, 0);
 	else
@@ -1419,6 +1444,84 @@ s64 dl_scaled_delta_exec(struct rq *rq, struct sched_dl_entity *dl_se, s64 delta
 	return scaled_delta_exec;
 }
 
+/*
+ * Check if a deadline task can be demoted when it exhausts its runtime.
+ * dl-servers and boosted tasks cannot be demoted.
+ *
+ * Returns true if demotion should happen, false otherwise.
+ */
+static inline bool dl_task_can_demote(struct sched_dl_entity *dl_se)
+{
+	if (dl_server(dl_se))
+		return false;
+
+	if (is_dl_boosted(dl_se))
+		return false;
+
+	return !!(dl_se->flags & SCHED_FLAG_DL_DEMOTION);
+}
+
+/*
+ * Promote a demoted task back to SCHED_DEADLINE.
+ * The task's runtime will be replenished by the caller.
+ */
+static void dl_task_promote(struct rq *rq, struct task_struct *p)
+{
+	struct sched_dl_entity *dl_se = &p->dl;
+	int queue_flags = DEQUEUE_MOVE | DEQUEUE_NOCLOCK | DEQUEUE_CLASS;
+
+	lockdep_assert_rq_held(rq);
+
+	if (dl_se->dl_demotion_state != DL_DEMOTED)
+		return;
+
+	dl_se->dl_demotion_state = DL_PROMOTING;
+
+	scoped_guard (sched_change, p, queue_flags) {
+		p->policy = SCHED_DEADLINE;
+		p->sched_class = &dl_sched_class;
+		p->prio = MAX_DL_PRIO - 1;
+		p->normal_prio = p->prio;
+	}
+
+	dl_se->dl_demotion_state = DL_NOT_DEMOTED;
+
+	__balance_callbacks(rq, NULL);
+}
+
+/*
+ * Demote a deadline task to SCHED_OTHER when it exhausts its runtime.
+ * The task will be promoted back to SCHED_DEADLINE at replenish.
+ */
+static void dl_task_demote(struct rq *rq, struct task_struct *p)
+{
+	struct sched_dl_entity *dl_se = &p->dl;
+	int queue_flags = DEQUEUE_MOVE | DEQUEUE_NOCLOCK | DEQUEUE_CLASS;
+
+	lockdep_assert_rq_held(rq);
+
+	if (dl_se->dl_demotion_state != DL_NOT_DEMOTED || !dl_task_can_demote(dl_se))
+		return;
+
+	dl_se->dl_demotion_state = DL_DEMOTING;
+
+	scoped_guard (sched_change, p, queue_flags) {
+		/*
+		 * The task's static_prio is already set from the sched_nice
+		 * value in sched_attr.
+		 */
+		p->policy = SCHED_NORMAL;
+		p->sched_class = &fair_sched_class;
+		p->prio = p->static_prio;
+		p->normal_prio = p->static_prio;
+	}
+
+	dl_se->dl_demotion_state = DL_DEMOTED;
+
+	__balance_callbacks(rq, NULL);
+	resched_curr(rq);
+}
+
 static inline void
 update_stats_dequeue_dl(struct dl_rq *dl_rq, struct sched_dl_entity *dl_se, int flags);
 
@@ -1521,11 +1624,36 @@ static void update_curr_dl_se(struct rq *rq, struct sched_dl_entity *dl_se, s64
 			dequeue_pushable_dl_task(rq, dl_task_of(dl_se));
 		}
 
+		/*
+		 * Check if we should demote to SCHED_OTHER instead of throttling.
+		 * Demotion only applies to non-dl-server non-pi-boosted tasks
+		 * that have exhausted their runtime (not yielded).
+		 */
+		if (!dl_server(dl_se) && dl_runtime_exceeded(dl_se) &&
+		    dl_task_can_demote(dl_se))
+			dl_task_demote(rq, dl_task_of(dl_se));
+
+		/*
+		 * Start the replenishment timer for both demoted and throttled tasks.
+		 * If boosted or if the timer fails to start, we need to handle it
+		 * immediately to avoid leaving tasks stuck.
+		 */
 		if (unlikely(is_dl_boosted(dl_se) || !start_dl_timer(dl_se))) {
+			/*
+			 * If this was a demoted task, promote it back to SCHED_DEADLINE
+			 * before enqueueing.
+			 */
+			if (dl_se->dl_demotion_state == DL_DEMOTED)
+				dl_task_promote(rq, dl_task_of(dl_se));
+
 			if (dl_server(dl_se)) {
 				replenish_dl_new_period(dl_se, rq);
 				start_dl_timer(dl_se);
 			} else {
+				/*
+				 * For regular tasks (including previously demoted ones),
+				 * enqueue with replenishment.
+				 */
 				enqueue_task_dl(rq, dl_task_of(dl_se), ENQUEUE_REPLENISH);
 			}
 		}
@@ -3266,8 +3394,17 @@ void dl_clear_root_domain_cpu(int cpu)
 	dl_clear_root_domain(cpu_rq(cpu)->rd);
 }
 
-static void switched_from_dl(struct rq *rq, struct task_struct *p)
+/*
+ * Common cleanup when a task leaves SCHED_DEADLINE.
+ * Handles inactive timer, cpuset tracking, and bandwidth accounting.
+ *
+ * This is used both when a task is explicitly switched away from DEADLINE
+ * and when a demoted task's demotion is cancelled via sched_setattr().
+ */
+static void __dl_cleanup_bandwidth(struct task_struct *p, struct rq *rq)
 {
+	lockdep_assert_rq_held(rq);
+
 	/*
 	 * task_non_contending() can start the "inactive timer" (if the 0-lag
 	 * time is in the future). If the task switches back to dl before
@@ -3304,6 +3441,18 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
 	 */
 	if (p->dl.dl_non_contending)
 		p->dl.dl_non_contending = 0;
+}
+
+static void switched_from_dl(struct rq *rq, struct task_struct *p)
+{
+	/*
+	 * If demoting, skip all bandwidth accounting. The bandwidth
+	 * reservation stays in place while the task executes as SCHED_NORMAL.
+	 */
+	if (p->dl.dl_demotion_state == DL_DEMOTING)
+		return;
+
+	__dl_cleanup_bandwidth(p, rq);
 
 	/*
 	 * Since this might be the only -deadline task on the rq,
@@ -3322,6 +3471,16 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
  */
 static void switched_to_dl(struct rq *rq, struct task_struct *p)
 {
+	/*
+	 * If promoting from demotion, skip bandwidth/cpuset accounting.
+	 */
+	if (p->dl.dl_demotion_state == DL_PROMOTING) {
+		if (!task_on_rq_queued(p))
+			return;
+
+		goto check_preempt;
+	}
+
 	cancel_inactive_timer(&p->dl);
 
 	/*
@@ -3337,6 +3496,7 @@ static void switched_to_dl(struct rq *rq, struct task_struct *p)
 		return;
 	}
 
+check_preempt:
 	if (rq->donor != p) {
 		if (p->nr_cpus_allowed > 1 && rq->dl.overloaded)
 			deadline_queue_push_tasks(rq);
@@ -3625,6 +3785,47 @@ void __getparam_dl(struct task_struct *p, struct sched_attr *attr)
 	attr->sched_flags |= dl_se->flags;
 }
 
+/*
+ * Check if a task can be migrated from DEADLINE perspective.
+ *
+ * Returns false if the task is a demoted DEADLINE task. Demoted tasks
+ * must stay on their demotion CPU because their bandwidth reservation
+ * is tied to that CPU. Migration will be allowed again after promotion.
+ */
+bool dl_task_can_migrate(struct task_struct *p)
+{
+	return p->dl.dl_demotion_state != DL_DEMOTED;
+}
+
+/*
+ * Cancel demotion for a demoted DEADLINE task when scheduling parameters
+ * are explicitly changed via sched_setattr().
+ *
+ * This performs the same cleanup as switched_from_dl() would do, releasing
+ * bandwidth reservation and clearing all DEADLINE-related state.
+ *
+ * The replenishment timer (dl_timer) is not cancelled - when it fires it will
+ * see the task is not DEADLINE and demotion state is cleared, and return early.
+ */
+void dl_cancel_demotion(struct task_struct *p)
+{
+	struct sched_dl_entity *dl_se = &p->dl;
+	struct rq *rq = task_rq(p);
+
+	lockdep_assert_rq_held(rq);
+
+	if (dl_se->dl_demotion_state != DL_DEMOTED)
+		return;
+
+	/*
+	 * Clear demotion state before cleanup. This allows the replenishment
+	 * timer to safely ignore the task when it fires.
+	 */
+	dl_se->dl_demotion_state = DL_NOT_DEMOTED;
+
+	__dl_cleanup_bandwidth(p, rq);
+}
+
 /*
  * This function validates the new parameters of a -deadline task.
  * We ask for the deadline not being zero, and greater or equal
@@ -3677,6 +3878,14 @@ bool __checkparam_dl(const struct sched_attr *attr)
 	if (period < min || period > max)
 		return false;
 
+	/*
+	 * Validate nice parameter if demotion flag is set.
+	 * The sched_nice value will be used when the task is demoted to SCHED_OTHER.
+	 */
+	if ((attr->sched_flags & SCHED_FLAG_DL_DEMOTION) &&
+	    (attr->sched_nice < MIN_NICE || attr->sched_nice > MAX_NICE))
+		return false;
+
 	return true;
 }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c16b5fd71b2d5..59e5459a75492 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9415,6 +9415,14 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
 	if (p->sched_task_hot)
 		p->sched_task_hot = 0;
 
+	/*
+	 * Demoted DEADLINE tasks cannot migrate. Their bandwidth reservation
+	 * is tied to the demotion CPU and will be released when the task is
+	 * promoted back to DEADLINE or explicitly switched to another policy.
+	 */
+	if (!dl_task_can_migrate(p))
+		return 0;
+
 	/*
 	 * We do not migrate tasks that are:
 	 * 1) delayed dequeued unless we migrate load, or
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a821cc8b2dd8f..6e5e98f3f4755 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -267,6 +267,16 @@ static inline unsigned long sched_weight_to_cgroup(unsigned long weight)
 		       CGROUP_WEIGHT_MIN, CGROUP_WEIGHT_MAX);
 }
 
+/*
+ * Deadline task demotion states
+ */
+enum dl_demotion_state {
+	DL_NOT_DEMOTED = 0,	/* Normal SCHED_DEADLINE execution */
+	DL_DEMOTING,		/* Transitioning DL -> NORMAL, skip bw removal */
+	DL_DEMOTED,		/* Running as SCHED_NORMAL, bw still reserved */
+	DL_PROMOTING,		/* Transitioning NORMAL -> DL, skip bw addition */
+};
+
 /*
  * !! For sched_setattr_nocheck() (kernel) only !!
  *
@@ -281,7 +291,8 @@ static inline unsigned long sched_weight_to_cgroup(unsigned long weight)
  */
 #define SCHED_FLAG_SUGOV	0x10000000
 
-#define SCHED_DL_FLAGS		(SCHED_FLAG_RECLAIM | SCHED_FLAG_DL_OVERRUN | SCHED_FLAG_SUGOV)
+#define SCHED_DL_FLAGS		(SCHED_FLAG_RECLAIM | SCHED_FLAG_DL_OVERRUN | \
+				 SCHED_FLAG_SUGOV | SCHED_FLAG_DL_DEMOTION)
 
 static inline bool dl_entity_is_special(const struct sched_dl_entity *dl_se)
 {
@@ -356,6 +367,8 @@ extern void __setparam_dl(struct task_struct *p, const struct sched_attr *attr);
 extern void __getparam_dl(struct task_struct *p, struct sched_attr *attr);
 extern bool __checkparam_dl(const struct sched_attr *attr);
 extern bool dl_param_changed(struct task_struct *p, const struct sched_attr *attr);
+extern bool dl_task_can_migrate(struct task_struct *p);
+extern void dl_cancel_demotion(struct task_struct *p);
 extern int  dl_cpuset_cpumask_can_shrink(const struct cpumask *cur, const struct cpumask *trial);
 extern int  dl_bw_deactivate(int cpu);
 extern s64 dl_scaled_delta_exec(struct rq *rq, struct sched_dl_entity *dl_se, s64 delta_exec);
diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
index 6f10db3646e7f..87bd4c821d97f 100644
--- a/kernel/sched/syscalls.c
+++ b/kernel/sched/syscalls.c
@@ -248,6 +248,8 @@ static void __setscheduler_params(struct task_struct *p,
 
 	p->policy = policy;
 
+	dl_cancel_demotion(p);
+
 	if (dl_policy(policy))
 		__setparam_dl(p, attr);
 	else if (fair_policy(policy))
@@ -569,6 +571,12 @@ int __sched_setscheduler(struct task_struct *p,
 			goto change;
 		if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)
 			goto change;
+		/*
+		 * If task is demoted, force through change path to cancel
+		 * demotion even if parameters are unchanged.
+		 */
+		if (p->dl.dl_demotion_state == DL_DEMOTED)
+			goto change;
 
 		p->sched_reset_on_fork = reset_on_fork;
 		retval = 0;

-- 
2.53.0



* [PATCH RFC 2/4] sched/doc: Document SCHED_DEADLINE demotion feature
  2026-02-19 13:37 [PATCH RFC 0/4] sched/deadline: Add soft/reclaim mode via SCHED_OTHER demotion Juri Lelli
  2026-02-19 13:37 ` [PATCH RFC 1/4] sched/deadline: Implement reclaim/soft mode through " Juri Lelli
@ 2026-02-19 13:37 ` Juri Lelli
  2026-02-19 13:37 ` [PATCH RFC 3/4] DEBUG selftests/sched: Add tests for " Juri Lelli
  2026-02-19 13:37 ` [PATCH RFC 4/4] DEBUG selftests/sched: Add simple demonstration of SCHED_DEADLINE demotion Juri Lelli
  3 siblings, 0 replies; 7+ messages in thread
From: Juri Lelli @ 2026-02-19 13:37 UTC
  To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Jonathan Corbet, Shuah Khan, Qais Yousef, Clark Williams,
	Gabriele Monaco, Tommaso Cucinotta, Luca Abeni
  Cc: linux-kernel, linux-doc, Juri Lelli

Add user-facing documentation for the SCHED_FLAG_DL_DEMOTION flag in
the SCHED_DEADLINE scheduler documentation.

The new section explains how tasks with this flag demote to SCHED_NORMAL
when runtime is exhausted and automatically promote back to SCHED_DEADLINE
at the next period. It also covers the bandwidth accounting behavior,
migration restrictions while demoted, and handling of explicit parameter
changes, and provides a usage example.

This enables users to leverage the demotion feature for soft real-time
workloads that can gracefully degrade to best-effort execution when
occasionally overrunning their reservation.

Assisted-by: Claude Code:Sonnet 4.5
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
---
 Documentation/scheduler/sched-deadline.rst | 54 ++++++++++++++++++++++++++++++
 1 file changed, 54 insertions(+)

diff --git a/Documentation/scheduler/sched-deadline.rst b/Documentation/scheduler/sched-deadline.rst
index ec543a12f848e..d4f25150648b4 100644
--- a/Documentation/scheduler/sched-deadline.rst
+++ b/Documentation/scheduler/sched-deadline.rst
@@ -9,6 +9,8 @@ Deadline Task Scheduling
     2. Scheduling algorithm
       2.1 Main algorithm
       2.2 Bandwidth reclaiming
+      2.3 Energy-aware scheduling
+      2.4 Deadline task demotion
     3. Scheduling Real-Time Tasks
       3.1 Definitions
       3.2 Schedulability Analysis for Uniprocessor Systems
@@ -300,6 +302,58 @@ Deadline Task Scheduling
  setting a fixed CPU frequency results in a lower amount of deadline misses.
 
 
+2.4 Deadline task demotion
+---------------------------
+
+ The SCHED_FLAG_DL_DEMOTION flag enables an alternative behavior when a
+ SCHED_DEADLINE task exhausts its runtime budget. Instead of being throttled
+ until its next period (as described in Section 2.1), a task with this flag
+ set is temporarily "demoted" to run as SCHED_NORMAL, allowing it to continue
+ execution while competing with other normal tasks. When the replenishment
+ timer fires at the next period, the task is automatically "promoted" back to
+ SCHED_DEADLINE with its runtime budget replenished.
+
+ This mechanism is useful for soft real-time workloads that need timing
+ guarantees most of the time but can gracefully degrade to best-effort
+ execution when they occasionally overrun their reservation, rather than
+ being completely suspended until the next period.
+
+ State transitions and behavior:
+
+  - **Demotion**: When a task with SCHED_FLAG_DL_DEMOTION exhausts its runtime
+    (remaining runtime <= 0), it is immediately switched to SCHED_NORMAL policy
+    and continues executing as a normal task. Its bandwidth reservation remains
+    in place - the task still "owns" its reserved bandwidth even while running
+    as SCHED_NORMAL.
+
+  - **Promotion**: When the replenishment timer fires (at the next period), the
+    task is automatically switched back to SCHED_DEADLINE policy with its
+    runtime budget replenished. The task then resumes real-time execution.
+
+  - **Migration restriction**: While demoted, a task cannot migrate to other
+    CPUs through the normal load balancer. This simplifies bandwidth accounting
+    by ensuring the reservation stays on the original CPU. Migration is
+    re-enabled after promotion or if the task's scheduling parameters are
+    explicitly changed.
+
+ Example usage::
+
+    struct sched_attr attr = { 0 };
+    attr.size = sizeof(attr);
+    attr.sched_policy = SCHED_DEADLINE;
+    attr.sched_runtime = 10 * 1000 * 1000;   /* 10ms */
+    attr.sched_deadline = 100 * 1000 * 1000; /* 100ms */
+    attr.sched_period = 100 * 1000 * 1000;   /* 100ms */
+    attr.sched_flags = SCHED_FLAG_DL_DEMOTION;
+    attr.sched_nice = 0;  /* Nice value when demoted */
+
+    sched_setattr(pid, &attr, 0);
+
+ When this task exhausts its 10ms budget within a period, it will be demoted
+ to SCHED_NORMAL (with nice value 0) rather than being throttled. It will be
+ promoted back to SCHED_DEADLINE at the start of the next period.
+
+
 3. Scheduling Real-Time Tasks
 =============================
 

-- 
2.53.0



* [PATCH RFC 3/4] DEBUG selftests/sched: Add tests for SCHED_DEADLINE demotion feature
  2026-02-19 13:37 [PATCH RFC 0/4] sched/deadline: Add soft/reclaim mode via SCHED_OTHER demotion Juri Lelli
  2026-02-19 13:37 ` [PATCH RFC 1/4] sched/deadline: Implement reclaim/soft mode through " Juri Lelli
  2026-02-19 13:37 ` [PATCH RFC 2/4] sched/doc: Document SCHED_DEADLINE demotion feature Juri Lelli
@ 2026-02-19 13:37 ` Juri Lelli
  2026-02-19 13:37 ` [PATCH RFC 4/4] DEBUG selftests/sched: Add simple demonstration of SCHED_DEADLINE demotion Juri Lelli
  3 siblings, 0 replies; 7+ messages in thread
From: Juri Lelli @ 2026-02-19 13:37 UTC
  To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Jonathan Corbet, Shuah Khan, Qais Yousef, Clark Williams,
	Gabriele Monaco, Tommaso Cucinotta, Luca Abeni
  Cc: linux-kernel, linux-doc, Juri Lelli

Add functional and stress tests for the SCHED_FLAG_DL_DEMOTION feature.

The functional test (dl_demotion_test.c) verifies:
- Basic demotion on runtime exhaustion
- Promotion when replenishment timer fires
- Explicit parameter change clears demotion state
- No demotion without SCHED_FLAG_DL_DEMOTION

The stress test (dl_demotion_stress.c) creates multiple demoting tasks
running concurrently to verify bandwidth accounting and state machine
correctness under load.

Also include a helper script for running tests with ftrace enabled to
aid in debugging bandwidth accounting issues.

Assisted-by: Claude Code:Sonnet 4.5
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
---
 tools/testing/selftests/sched/.gitignore           |   2 +
 tools/testing/selftests/sched/Makefile             |   4 +-
 tools/testing/selftests/sched/README_dl_demotion   |  83 ++++
 tools/testing/selftests/sched/dl_demotion_stress.c | 208 ++++++++++
 tools/testing/selftests/sched/dl_demotion_test.c   | 460 +++++++++++++++++++++
 .../selftests/sched/run_dl_demotion_with_trace.sh  |  71 ++++
 6 files changed, 826 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/sched/.gitignore b/tools/testing/selftests/sched/.gitignore
index 6996d4654d924..c8139d0067df4 100644
--- a/tools/testing/selftests/sched/.gitignore
+++ b/tools/testing/selftests/sched/.gitignore
@@ -1 +1,3 @@
 cs_prctl_test
+dl_demotion_test
+dl_demotion_stress
diff --git a/tools/testing/selftests/sched/Makefile b/tools/testing/selftests/sched/Makefile
index 099ee9213557a..0938acab18700 100644
--- a/tools/testing/selftests/sched/Makefile
+++ b/tools/testing/selftests/sched/Makefile
@@ -8,7 +8,7 @@ CFLAGS += -O2 -Wall -g -I./ $(KHDR_INCLUDES) -Wl,-rpath=./ \
 	  $(CLANG_FLAGS)
 LDLIBS += -lpthread
 
-TEST_GEN_FILES := cs_prctl_test
-TEST_PROGS := cs_prctl_test
+TEST_GEN_FILES := cs_prctl_test dl_demotion_test dl_demotion_stress
+TEST_PROGS := cs_prctl_test dl_demotion_test
 
 include ../lib.mk
diff --git a/tools/testing/selftests/sched/README_dl_demotion b/tools/testing/selftests/sched/README_dl_demotion
new file mode 100644
index 0000000000000..1cdd10fbbd7d1
--- /dev/null
+++ b/tools/testing/selftests/sched/README_dl_demotion
@@ -0,0 +1,83 @@
+SCHED_DEADLINE Demotion Tests
+==============================
+
+This test verifies the SCHED_FLAG_DL_DEMOTION feature which allows DEADLINE
+tasks to be demoted to SCHED_NORMAL when they exhaust their runtime budget.
+
+Building
+--------
+  make -C tools/testing/selftests/sched
+
+Running
+-------
+Requires root or CAP_SYS_NICE:
+
+  sudo ./tools/testing/selftests/sched/dl_demotion_test
+
+Or via kselftest framework:
+
+  sudo make -C tools/testing/selftests TARGETS=sched run_tests
+
+Tests
+-----
+
+Test 1: Basic demotion on runtime exhaustion
+  - Creates a DEADLINE task with SCHED_FLAG_DL_DEMOTION
+  - Runs until runtime is exhausted
+  - Verifies task is demoted to SCHED_NORMAL
+
+Test 2: Promotion on replenishment timer
+  - Gets demoted by exhausting runtime
+  - Waits for period to expire
+  - Verifies task is promoted back to SCHED_DEADLINE
+
+Test 3: Explicit parameter change while demoted
+  - Gets demoted
+  - Explicitly changes scheduling parameters
+  - Verifies demotion state is cleared (no automatic promotion)
+
+Test 4: No demotion without flag
+  - Creates DEADLINE task WITHOUT demotion flag
+  - Exhausts runtime
+  - Verifies task remains SCHED_DEADLINE (throttled but not demoted)
+
+Stress Test
+-----------
+The dl_demotion_stress test creates multiple threads that repeatedly go through
+demotion/promotion cycles. This is useful for stress testing the feature,
+especially migration scenarios.
+
+  sudo ./tools/testing/selftests/sched/dl_demotion_stress [threads] [duration]
+
+Arguments:
+  threads  - Number of worker threads (1-32, default: 4)
+  duration - Run duration in seconds (default: 10)
+
+Example:
+  sudo ./tools/testing/selftests/sched/dl_demotion_stress 8 30
+
+This test is NOT part of the automated test suite (not in TEST_PROGS) and
+must be run manually.
+
+Debugging
+---------
+To see the demotion/promotion state machine transitions, enable ftrace:
+
+  sudo su
+  cd /sys/kernel/debug/tracing
+  echo 1 > events/sched/enable
+  echo 1 > options/trace_printk
+  echo 1 > tracing_on
+
+Then run the test and check the trace:
+
+  cat trace
+
+Look for trace_printk messages showing state transitions:
+  - dl_demote: ... state: NOT_DEMOTED->DEMOTING
+  - dl_demote: ... state: DEMOTING->DEMOTED
+  - dl_promote: ... state: DEMOTED->PROMOTING
+  - dl_promote: ... state: PROMOTING->NOT_DEMOTED
+  - dl_timer: ... migrated_while_runnable/sleeping
+  - switched_from_dl: ... skip_bw_accounting
+  - switched_to_dl: ... skip_bw_accounting
diff --git a/tools/testing/selftests/sched/dl_demotion_stress.c b/tools/testing/selftests/sched/dl_demotion_stress.c
new file mode 100644
index 0000000000000..6e404d6b56af9
--- /dev/null
+++ b/tools/testing/selftests/sched/dl_demotion_stress.c
@@ -0,0 +1,208 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * SCHED_DEADLINE demotion stress test
+ *
+ * Creates multiple DEADLINE tasks with demotion enabled and runs them
+ * to stress test the demotion/promotion state machine, especially with
+ * migration scenarios.
+ */
+
+#define _GNU_SOURCE
+#include <sched.h>
+#include <sys/types.h>
+#include <sys/syscall.h>
+#include <sys/wait.h>
+#include <unistd.h>
+#include <time.h>
+#include <errno.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <stdint.h>
+#include <pthread.h>
+#include <signal.h>
+
+#ifndef SCHED_FLAG_DL_DEMOTION
+#define SCHED_FLAG_DL_DEMOTION 0x80
+#endif
+
+#define NSEC_PER_SEC 1000000000ULL
+
+static volatile sig_atomic_t keep_running = 1;
+
+/* Wrappers for sched_setattr/getattr - use syscall directly to avoid glibc conflicts */
+static int sys_sched_setattr(pid_t pid, struct sched_attr *attr,
+			     unsigned int flags)
+{
+	return syscall(__NR_sched_setattr, pid, attr, flags);
+}
+
+static int sys_sched_getattr(pid_t pid, struct sched_attr *attr,
+			     unsigned int size, unsigned int flags)
+{
+	return syscall(__NR_sched_getattr, pid, attr, size, flags);
+}
+
+/* Signal handler for clean shutdown */
+static void sigint_handler(int sig)
+{
+	(void)sig;
+	keep_running = 0;
+}
+
+/* Burn CPU cycles */
+static void burn_cpu(uint64_t nsec)
+{
+	struct timespec start, now;
+	uint64_t elapsed_ns;
+	volatile uint64_t dummy = 0;
+
+	clock_gettime(CLOCK_MONOTONIC, &start);
+	do {
+		for (int i = 0; i < 10000; i++)
+			dummy += i;
+		clock_gettime(CLOCK_MONOTONIC, &now);
+		elapsed_ns = (now.tv_sec - start.tv_sec) * NSEC_PER_SEC +
+			     (now.tv_nsec - start.tv_nsec);
+	} while (elapsed_ns < nsec);
+}
+
+/* Thread function - repeatedly exhaust runtime and get demoted/promoted */
+static void *worker_thread(void *arg)
+{
+	int thread_id = *(int *)arg;
+	struct sched_attr attr = {0};
+	int cycles = 0;
+	cpu_set_t cpuset;
+
+	/* Set CPU affinity to allow migration */
+	CPU_ZERO(&cpuset);
+	/* Allow running on CPUs 0-3 (adjust based on system) */
+	for (int i = 0; i < 4 && i < sysconf(_SC_NPROCESSORS_ONLN); i++)
+		CPU_SET(i, &cpuset);
+	pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset);
+
+	/* Set DEADLINE with demotion */
+	attr.size = sizeof(attr);
+	attr.sched_policy = SCHED_DEADLINE;
+	attr.sched_runtime = 20 * 1000 * 1000;   /* 20ms */
+	attr.sched_deadline = 100 * 1000 * 1000; /* 100ms */
+	attr.sched_period = 100 * 1000 * 1000;   /* 100ms */
+	attr.sched_flags = SCHED_FLAG_DL_DEMOTION;
+	attr.sched_nice = thread_id % 10;  /* Different nice values */
+
+	if (sys_sched_setattr(0, &attr, 0) < 0) {
+		perror("sched_setattr");
+		return NULL;
+	}
+
+	printf("Thread %d: Started with SCHED_DEADLINE (runtime=20ms, period=100ms, nice=%d)\n",
+	       thread_id, attr.sched_nice);
+
+	while (keep_running) {
+		/* Burn CPU to exhaust runtime and trigger demotion */
+		burn_cpu(25 * 1000 * 1000); /* 25ms - exceeds 20ms runtime */
+
+		/* Now we should be demoted - do some light work as NORMAL */
+		usleep(10 * 1000); /* 10ms */
+
+		/* Wait for promotion (period expiry) */
+		usleep(120 * 1000); /* 120ms - exceeds 100ms period */
+
+		cycles++;
+		if (cycles % 10 == 0) {
+			printf("Thread %d: Completed %d demotion/promotion cycles\n",
+			       thread_id, cycles);
+		}
+	}
+
+	printf("Thread %d: Exiting after %d cycles\n", thread_id, cycles);
+
+	/* Reset to normal before exiting */
+	attr.sched_policy = SCHED_NORMAL;
+	sys_sched_setattr(0, &attr, 0);
+
+	return NULL;
+}
+
+int main(int argc, char *argv[])
+{
+	int num_threads = 4;
+	pthread_t *threads;
+	int *thread_ids;
+	int duration = 10; /* seconds */
+
+	/* Parse arguments */
+	if (argc > 1)
+		num_threads = atoi(argv[1]);
+	if (argc > 2)
+		duration = atoi(argv[2]);
+
+	if (num_threads < 1 || num_threads > 32) {
+		fprintf(stderr, "Number of threads must be 1-32\n");
+		return 1;
+	}
+
+	printf("SCHED_DEADLINE Demotion Stress Test\n");
+	printf("====================================\n");
+	printf("Threads: %d\n", num_threads);
+	printf("Duration: %d seconds\n", duration);
+	printf("Press Ctrl+C to stop early\n\n");
+
+	/* Check permissions */
+	struct sched_attr attr = {0};
+	attr.size = sizeof(attr);
+	attr.sched_policy = SCHED_DEADLINE;
+	attr.sched_runtime = 10 * 1000 * 1000;
+	attr.sched_deadline = 100 * 1000 * 1000;
+	attr.sched_period = 100 * 1000 * 1000;
+
+	if (sys_sched_setattr(0, &attr, 0) < 0) {
+		if (errno == EPERM)
+			fprintf(stderr, "Need CAP_SYS_NICE or root privileges\n");
+		else if (errno == EINVAL)
+			fprintf(stderr, "SCHED_DEADLINE not supported or parameters rejected\n");
+		else
+			perror("sched_setattr");
+		return 1;
+	}
+	attr.sched_policy = SCHED_NORMAL;
+	sys_sched_setattr(0, &attr, 0);
+
+	/* Set up signal handler */
+	signal(SIGINT, sigint_handler);
+
+	/* Allocate thread arrays */
+	threads = malloc(num_threads * sizeof(pthread_t));
+	thread_ids = malloc(num_threads * sizeof(int));
+	if (!threads || !thread_ids) {
+		fprintf(stderr, "Memory allocation failed\n");
+		return 1;
+	}
+
+	/* Create threads */
+	for (int i = 0; i < num_threads; i++) {
+		thread_ids[i] = i;
+		if (pthread_create(&threads[i], NULL, worker_thread, &thread_ids[i]) != 0) {
+			perror("pthread_create");
+			keep_running = 0;
+			num_threads = i; /* join only the threads that started */
+			break;
+		}
+	}
+
+	/* Run for specified duration */
+	sleep(duration);
+	keep_running = 0;
+
+	/* Wait for threads to finish */
+	printf("\nWaiting for threads to finish...\n");
+	for (int i = 0; i < num_threads; i++)
+		pthread_join(threads[i], NULL);
+
+	free(threads);
+	free(thread_ids);
+
+	printf("\nStress test completed successfully\n");
+	return 0;
+}
diff --git a/tools/testing/selftests/sched/dl_demotion_test.c b/tools/testing/selftests/sched/dl_demotion_test.c
new file mode 100644
index 0000000000000..11ffe1c9ecbed
--- /dev/null
+++ b/tools/testing/selftests/sched/dl_demotion_test.c
@@ -0,0 +1,460 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * SCHED_DEADLINE demotion/promotion test
+ *
+ * Tests the SCHED_FLAG_DL_DEMOTION feature which allows DEADLINE tasks
+ * to be demoted to SCHED_NORMAL when they exhaust their runtime, and
+ * promoted back when the replenishment timer fires.
+ */
+
+#define _GNU_SOURCE
+#include <sched.h>
+#include <sys/types.h>
+#include <sys/syscall.h>
+#include <sys/wait.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <time.h>
+#include <errno.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <stdint.h>
+#include <stdarg.h>
+#include <pthread.h>
+
+#ifndef SCHED_FLAG_DL_DEMOTION
+#define SCHED_FLAG_DL_DEMOTION 0x80
+#endif
+
+#define NSEC_PER_SEC 1000000000ULL
+#define USEC_PER_SEC 1000000ULL
+
+/* Ftrace marker file */
+static int trace_marker_fd = -1;
+
+/* Wrappers for sched_setattr/getattr - use syscall directly to avoid glibc conflicts */
+static int sys_sched_setattr(pid_t pid, struct sched_attr *attr,
+			     unsigned int flags)
+{
+	return syscall(__NR_sched_setattr, pid, attr, flags);
+}
+
+static int sys_sched_getattr(pid_t pid, struct sched_attr *attr,
+			     unsigned int size, unsigned int flags)
+{
+	return syscall(__NR_sched_getattr, pid, attr, size, flags);
+}
+
+/* Initialize ftrace marker for userspace tracing */
+static void trace_marker_init(void)
+{
+	const char *paths[] = {
+		"/sys/kernel/tracing/trace_marker",
+		"/sys/kernel/debug/tracing/trace_marker",
+		NULL
+	};
+
+	for (int i = 0; paths[i]; i++) {
+		trace_marker_fd = open(paths[i], O_WRONLY);
+		if (trace_marker_fd >= 0)
+			break;
+	}
+}
+
+/* Write a message to ftrace buffer */
+static void trace_write(const char *fmt, ...)
+{
+	char buf[256];
+	va_list args;
+	int len;
+
+	if (trace_marker_fd < 0)
+		return;
+
+	va_start(args, fmt);
+	len = vsnprintf(buf, sizeof(buf), fmt, args);
+	va_end(args);
+
+	if (len > 0 && len < (int)sizeof(buf))
+		write(trace_marker_fd, buf, len);
+}
+
+/* Close ftrace marker */
+static void trace_marker_close(void)
+{
+	if (trace_marker_fd >= 0) {
+		close(trace_marker_fd);
+		trace_marker_fd = -1;
+	}
+}
+
+/* Burn CPU cycles for approximately nsec nanoseconds */
+static void burn_cpu(uint64_t nsec)
+{
+	struct timespec start, now;
+	uint64_t elapsed_ns;
+	volatile uint64_t dummy = 0;
+
+	clock_gettime(CLOCK_MONOTONIC, &start);
+	do {
+		for (int i = 0; i < 10000; i++)
+			dummy += i;
+		clock_gettime(CLOCK_MONOTONIC, &now);
+		elapsed_ns = (now.tv_sec - start.tv_sec) * NSEC_PER_SEC +
+			     (now.tv_nsec - start.tv_nsec);
+	} while (elapsed_ns < nsec);
+}
+
+/* Get current scheduling policy */
+static int get_current_policy(void)
+{
+	struct sched_attr attr = {0};
+	attr.size = sizeof(attr);
+
+	if (sys_sched_getattr(0, &attr, sizeof(attr), 0) < 0) {
+		perror("sys_sched_getattr");
+		return -1;
+	}
+
+	return attr.sched_policy;
+}
+
+/*
+ * Test 1: Basic demotion when runtime exhausted
+ *
+ * Create a DEADLINE task with demotion flag, run it until runtime
+ * is exhausted, verify it gets demoted to SCHED_NORMAL.
+ */
+static int test_basic_demotion(void)
+{
+	struct sched_attr attr = {0};
+	int policy_before, policy_after;
+
+	printf("Test 1: Basic demotion on runtime exhaustion\n");
+	trace_write("TEST1: START - Basic demotion on runtime exhaustion");
+
+	attr.size = sizeof(attr);
+	attr.sched_policy = SCHED_DEADLINE;
+	attr.sched_runtime = 10 * 1000 * 1000;   /* 10ms */
+	attr.sched_deadline = 100 * 1000 * 1000; /* 100ms */
+	attr.sched_period = 100 * 1000 * 1000;   /* 100ms */
+	attr.sched_flags = SCHED_FLAG_DL_DEMOTION;
+	attr.sched_nice = 0;  /* Nice value when demoted */
+
+	if (sys_sched_setattr(0, &attr, 0) < 0) {
+		if (errno == EPERM) {
+			printf("  SKIP: Need CAP_SYS_NICE or root privileges\n");
+			return 0;
+		}
+		if (errno == EINVAL) {
+			printf("  SKIP: SCHED_FLAG_DL_DEMOTION not supported\n");
+			return 0;
+		}
+		perror("  FAIL: sys_sched_setattr");
+		return -1;
+	}
+
+	policy_before = get_current_policy();
+	if (policy_before != SCHED_DEADLINE) {
+		printf("  FAIL: Not SCHED_DEADLINE after setattr (got %d)\n",
+		       policy_before);
+		return -1;
+	}
+
+	/* Burn more than the runtime to trigger demotion */
+	printf("  Burning CPU to exhaust runtime...\n");
+	trace_write("TEST1: Burning CPU to exhaust runtime (15ms)");
+	burn_cpu(15 * 1000 * 1000); /* 15ms, more than 10ms runtime */
+	trace_write("TEST1: CPU burn complete, checking policy");
+
+	/* Check if we got demoted */
+	policy_after = get_current_policy();
+	if (policy_after == SCHED_NORMAL) {
+		printf("  PASS: Demoted to SCHED_NORMAL after runtime exhaustion\n");
+		trace_write("TEST1: PASS - Task demoted to SCHED_NORMAL");
+		/* Reset to normal before returning */
+		attr.sched_policy = SCHED_NORMAL;
+		sys_sched_setattr(0, &attr, 0);
+		trace_write("TEST1: END");
+		return 0;
+	} else {
+		printf("  FAIL: Still policy %d after runtime exhaustion (expected SCHED_NORMAL)\n",
+		       policy_after);
+		trace_write("TEST1: FAIL - Task not demoted (policy=%d)", policy_after);
+		attr.sched_policy = SCHED_NORMAL;
+		sys_sched_setattr(0, &attr, 0);
+		trace_write("TEST1: END");
+		return -1;
+	}
+}
+
+/*
+ * Test 2: Promotion when replenishment timer fires
+ *
+ * Get demoted, then sleep until the period expires and verify
+ * we get promoted back to SCHED_DEADLINE.
+ */
+static int test_promotion_on_timer(void)
+{
+	struct sched_attr attr = {0};
+	int policy_before, policy_after;
+
+	printf("\nTest 2: Promotion on replenishment timer\n");
+	trace_write("TEST2: START - Promotion on replenishment timer");
+
+	/* Reset to SCHED_NORMAL before starting */
+	attr.size = sizeof(attr);
+	attr.sched_policy = SCHED_NORMAL;
+	sys_sched_setattr(0, &attr, 0);
+
+	attr.size = sizeof(attr);
+	attr.sched_policy = SCHED_DEADLINE;
+	attr.sched_runtime = 10 * 1000 * 1000;   /* 10ms */
+	attr.sched_deadline = 200 * 1000 * 1000; /* 200ms */
+	attr.sched_period = 200 * 1000 * 1000;   /* 200ms */
+	attr.sched_flags = SCHED_FLAG_DL_DEMOTION;
+	attr.sched_nice = 0;
+
+	if (sys_sched_setattr(0, &attr, 0) < 0) {
+		if (errno == EINVAL) {
+			printf("  SKIP: SCHED_FLAG_DL_DEMOTION not supported\n");
+			return 0;
+		}
+		perror("  FAIL: sys_sched_setattr");
+		return -1;
+	}
+
+	/* Exhaust runtime to get demoted */
+	printf("  Exhausting runtime...\n");
+	trace_write("TEST2: Exhausting runtime to trigger demotion");
+	burn_cpu(15 * 1000 * 1000); /* 15ms */
+	trace_write("TEST2: CPU burn complete, checking if demoted");
+
+	policy_before = get_current_policy();
+	if (policy_before != SCHED_NORMAL) {
+		printf("  FAIL: Not demoted (policy=%d)\n", policy_before);
+		attr.sched_policy = SCHED_NORMAL;
+		sys_sched_setattr(0, &attr, 0);
+		return -1;
+	}
+	printf("  Demoted to SCHED_NORMAL\n");
+	trace_write("TEST2: Confirmed demoted to SCHED_NORMAL");
+
+	/* Wait for period to expire (timer should promote us) */
+	printf("  Waiting for replenishment timer (250ms)...\n");
+	trace_write("TEST2: Waiting for replenishment timer (250ms)");
+	usleep(250 * 1000); /* 250ms, longer than 200ms period */
+	trace_write("TEST2: Wait complete, checking if promoted");
+
+	/* Check if promoted back */
+	policy_after = get_current_policy();
+	if (policy_after == SCHED_DEADLINE) {
+		printf("  PASS: Promoted back to SCHED_DEADLINE\n");
+		trace_write("TEST2: PASS - Promoted back to SCHED_DEADLINE");
+		attr.sched_policy = SCHED_NORMAL;
+		sys_sched_setattr(0, &attr, 0);
+		trace_write("TEST2: END");
+		return 0;
+	} else {
+		printf("  FAIL: Still policy %d after timer (expected SCHED_DEADLINE)\n",
+		       policy_after);
+		trace_write("TEST2: FAIL - Not promoted (policy=%d)", policy_after);
+		attr.sched_policy = SCHED_NORMAL;
+		sys_sched_setattr(0, &attr, 0);
+		trace_write("TEST2: END");
+		return -1;
+	}
+}
+
+/*
+ * Test 3: Explicit parameter change while demoted
+ *
+ * Get demoted, then explicitly change scheduling parameters.
+ * This should clear the demotion state and prevent automatic promotion.
+ */
+static int test_param_change_while_demoted(void)
+{
+	struct sched_attr attr = {0};
+	int policy;
+
+	printf("\nTest 3: Explicit parameter change while demoted\n");
+	trace_write("TEST3: START - Explicit parameter change while demoted");
+
+	/* Reset to SCHED_NORMAL before starting */
+	attr.size = sizeof(attr);
+	attr.sched_policy = SCHED_NORMAL;
+	sys_sched_setattr(0, &attr, 0);
+
+	attr.sched_policy = SCHED_DEADLINE;
+	attr.sched_runtime = 10 * 1000 * 1000;   /* 10ms */
+	attr.sched_deadline = 200 * 1000 * 1000; /* 200ms */
+	attr.sched_period = 200 * 1000 * 1000;   /* 200ms */
+	attr.sched_flags = SCHED_FLAG_DL_DEMOTION;
+	attr.sched_nice = 0;
+
+	if (sys_sched_setattr(0, &attr, 0) < 0) {
+		if (errno == EINVAL) {
+			printf("  SKIP: SCHED_FLAG_DL_DEMOTION not supported\n");
+			return 0;
+		}
+		perror("  FAIL: sys_sched_setattr");
+		return -1;
+	}
+
+	/* Exhaust runtime to get demoted */
+	printf("  Exhausting runtime...\n");
+	trace_write("TEST3: Exhausting runtime to trigger demotion");
+	burn_cpu(15 * 1000 * 1000);
+	trace_write("TEST3: Checking if demoted");
+
+	policy = get_current_policy();
+	if (policy != SCHED_NORMAL) {
+		printf("  FAIL: Not demoted (policy=%d)\n", policy);
+		attr.sched_policy = SCHED_NORMAL;
+		sys_sched_setattr(0, &attr, 0);
+		return -1;
+	}
+	printf("  Demoted to SCHED_NORMAL\n");
+	trace_write("TEST3: Confirmed demoted to SCHED_NORMAL");
+
+	/* Explicitly change to SCHED_NORMAL (should clear demotion state) */
+	printf("  Explicitly setting SCHED_NORMAL...\n");
+	trace_write("TEST3: Explicitly calling sched_setattr(SCHED_NORMAL) to clear demotion state");
+	attr.sched_policy = SCHED_NORMAL;
+	attr.sched_nice = 5;
+	if (sys_sched_setattr(0, &attr, 0) < 0) {
+		perror("  FAIL: sys_sched_setattr to NORMAL");
+		return -1;
+	}
+
+	/* Wait past the period - should NOT be promoted */
+	printf("  Waiting past period (250ms)...\n");
+	trace_write("TEST3: Waiting past period - should NOT be promoted");
+	usleep(250 * 1000);
+	trace_write("TEST3: Wait complete, verifying still NORMAL");
+
+	policy = get_current_policy();
+	if (policy == SCHED_NORMAL) {
+		printf("  PASS: Remained SCHED_NORMAL (demotion state cleared)\n");
+		trace_write("TEST3: PASS - Remained SCHED_NORMAL");
+		trace_write("TEST3: END");
+		return 0;
+	} else {
+		printf("  FAIL: Unexpected promotion to policy %d\n", policy);
+		trace_write("TEST3: FAIL - Unexpected promotion to policy %d", policy);
+		attr.sched_policy = SCHED_NORMAL;
+		sys_sched_setattr(0, &attr, 0);
+		trace_write("TEST3: END");
+		return -1;
+	}
+}
+
+/*
+ * Test 4: Demotion disabled without flag
+ *
+ * Create DEADLINE task without demotion flag, exhaust runtime,
+ * verify task stays SCHED_DEADLINE (throttled but not demoted).
+ */
+static int test_no_demotion_without_flag(void)
+{
+	struct sched_attr attr = {0};
+	int policy;
+
+	printf("\nTest 4: No demotion without SCHED_FLAG_DL_DEMOTION\n");
+	trace_write("TEST4: START - No demotion without flag");
+
+	/* Reset to SCHED_NORMAL before starting */
+	attr.size = sizeof(attr);
+	attr.sched_policy = SCHED_NORMAL;
+	sys_sched_setattr(0, &attr, 0);
+
+	attr.sched_policy = SCHED_DEADLINE;
+	attr.sched_runtime = 10 * 1000 * 1000;   /* 10ms */
+	attr.sched_deadline = 100 * 1000 * 1000; /* 100ms */
+	attr.sched_period = 100 * 1000 * 1000;   /* 100ms */
+	attr.sched_flags = 0;  /* No demotion flag */
+	attr.sched_nice = 0;
+
+	if (sys_sched_setattr(0, &attr, 0) < 0) {
+		perror("  FAIL: sys_sched_setattr");
+		return -1;
+	}
+
+	/* Burn CPU to exhaust runtime */
+	printf("  Exhausting runtime...\n");
+	trace_write("TEST4: Exhausting runtime (no demotion flag set)");
+	burn_cpu(15 * 1000 * 1000);
+	trace_write("TEST4: CPU burn complete, checking policy");
+
+	/* Should still be SCHED_DEADLINE (throttled, not demoted) */
+	policy = get_current_policy();
+	if (policy == SCHED_DEADLINE) {
+		printf("  PASS: Remained SCHED_DEADLINE (throttled, not demoted)\n");
+		trace_write("TEST4: PASS - Remained SCHED_DEADLINE");
+		attr.sched_policy = SCHED_NORMAL;
+		sys_sched_setattr(0, &attr, 0);
+		trace_write("TEST4: END");
+		return 0;
+	} else {
+		printf("  FAIL: Changed to policy %d without demotion flag\n", policy);
+		trace_write("TEST4: FAIL - Changed to policy %d", policy);
+		attr.sched_policy = SCHED_NORMAL;
+		sys_sched_setattr(0, &attr, 0);
+		trace_write("TEST4: END");
+		return -1;
+	}
+}
+
+int main(void)
+{
+	int failures = 0;
+
+	printf("SCHED_DEADLINE Demotion Tests\n");
+	printf("==============================\n\n");
+
+	/* Initialize ftrace marker (silently fails if not available) */
+	trace_marker_init();
+	trace_write("=== SCHED_DEADLINE Demotion Test Suite START ===");
+
+	/* Run tests with pauses between them for clearer trace separation */
+	if (test_basic_demotion() < 0)
+		failures++;
+
+	/* Pause between tests (300ms - longer than any test period) */
+	printf("\n--- Pausing 300ms between tests ---\n");
+	trace_write("=== PAUSE between tests (300ms) ===");
+	usleep(300 * 1000);
+
+	if (test_promotion_on_timer() < 0)
+		failures++;
+
+	printf("\n--- Pausing 300ms between tests ---\n");
+	trace_write("=== PAUSE between tests (300ms) ===");
+	usleep(300 * 1000);
+
+	if (test_param_change_while_demoted() < 0)
+		failures++;
+
+	printf("\n--- Pausing 300ms between tests ---\n");
+	trace_write("=== PAUSE between tests (300ms) ===");
+	usleep(300 * 1000);
+
+	if (test_no_demotion_without_flag() < 0)
+		failures++;
+
+	/* Summary */
+	printf("\n==============================\n");
+	if (failures == 0) {
+		printf("All tests PASSED\n");
+		trace_write("=== Test Suite PASSED ===");
+		trace_marker_close();
+		return 0;
+	} else {
+		printf("%d test(s) FAILED\n", failures);
+		trace_write("=== Test Suite FAILED (%d failures) ===", failures);
+		trace_marker_close();
+		return 1;
+	}
+}
diff --git a/tools/testing/selftests/sched/run_dl_demotion_with_trace.sh b/tools/testing/selftests/sched/run_dl_demotion_with_trace.sh
new file mode 100755
index 0000000000000..4b37864d45975
--- /dev/null
+++ b/tools/testing/selftests/sched/run_dl_demotion_with_trace.sh
@@ -0,0 +1,71 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Run SCHED_DEADLINE demotion tests with ftrace enabled to see
+# state machine transitions
+
+TRACE_DIR="/sys/kernel/debug/tracing"
+TEST_BIN="./dl_demotion_test"
+
+if [ ! -d "$TRACE_DIR" ]; then
+	echo "ERROR: ftrace not available at $TRACE_DIR"
+	echo "Make sure debugfs is mounted and CONFIG_FTRACE is enabled"
+	exit 1
+fi
+
+if [ $EUID -ne 0 ]; then
+	echo "ERROR: This script must be run as root"
+	exit 1
+fi
+
+if [ ! -x "$TEST_BIN" ]; then
+	echo "ERROR: Test binary not found: $TEST_BIN"
+	echo "Build with: make"
+	exit 1
+fi
+
+echo "Setting up ftrace..."
+
+# Clear previous trace
+echo 0 > "$TRACE_DIR/tracing_on"
+echo > "$TRACE_DIR/trace"
+
+# Enable trace_printk
+echo 1 > "$TRACE_DIR/options/trace_printk" 2>/dev/null || true
+
+# Enable sched events
+echo 1 > "$TRACE_DIR/events/sched/enable" 2>/dev/null || true
+
+# Start tracing
+echo 1 > "$TRACE_DIR/tracing_on"
+
+echo "Running deadline demotion tests..."
+echo "===================================="
+echo ""
+
+# Run the test
+"$TEST_BIN"
+
+echo ""
+echo "===================================="
+echo ""
+
+# Stop tracing
+echo 0 > "$TRACE_DIR/tracing_on"
+
+# Show relevant trace entries
+echo "Trace output (demotion/promotion events):"
+echo "=========================================="
+grep -E "dl_demote|dl_promote|dl_timer|switched_from_dl|switched_to_dl|setscheduler" \
+	"$TRACE_DIR/trace" | tail -100
+
+echo ""
+echo "Full trace saved to: /tmp/dl_demotion_trace.txt"
+cat "$TRACE_DIR/trace" > /tmp/dl_demotion_trace.txt
+
+# Reset tracing
+echo 0 > "$TRACE_DIR/events/sched/enable" 2>/dev/null || true
+echo > "$TRACE_DIR/trace"
+
+echo ""
+echo "Done!"

-- 
2.53.0



* [PATCH RFC 4/4] DEBUG selftests/sched: Add simple demonstration of SCHED_DEADLINE demotion
  2026-02-19 13:37 [PATCH RFC 0/4] sched/deadline: Add soft/reclaim mode via SCHED_OTHER demotion Juri Lelli
                   ` (2 preceding siblings ...)
  2026-02-19 13:37 ` [PATCH RFC 3/4] DEBUG selftests/sched: Add tests for " Juri Lelli
@ 2026-02-19 13:37 ` Juri Lelli
  3 siblings, 0 replies; 7+ messages in thread
From: Juri Lelli @ 2026-02-19 13:37 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	Jonathan Corbet, Shuah Khan, Qais Yousef, Clark Williams,
	Gabriele Monaco, Tommaso Cucinotta, Luca Abeni
  Cc: linux-kernel, linux-doc, Juri Lelli

Add a simple demonstration program that clearly shows the difference
between normal DEADLINE throttling and the demotion feature.

The test runs two scenarios with identical parameters (30ms runtime per
100ms period) and workload (attempt to execute for 200ms):

Test 1: Normal DEADLINE without demotion flag - task is throttled after
exhausting its 30ms budget and stops executing until the next period, so
it obtains roughly 60ms of CPU time over the 200ms window.

Test 2: DEADLINE with SCHED_FLAG_DL_DEMOTION - task is demoted to
SCHED_NORMAL after exhausting its budget and continues executing,
completing the full 200ms workload.

Each test phase writes markers to the ftrace buffer to make it easy to
identify different phases when examining traces for debugging.

Assisted-by: Claude Code:Sonnet 4.5
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
---
 tools/testing/selftests/sched/.gitignore         |   1 +
 tools/testing/selftests/sched/Makefile           |   4 +-
 tools/testing/selftests/sched/dl_demotion_demo.c | 239 +++++++++++++++++++++++
 3 files changed, 242 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/sched/.gitignore b/tools/testing/selftests/sched/.gitignore
index c8139d0067df4..b6c66024f6a01 100644
--- a/tools/testing/selftests/sched/.gitignore
+++ b/tools/testing/selftests/sched/.gitignore
@@ -1,3 +1,4 @@
 cs_prctl_test
 dl_demotion_test
 dl_demotion_stress
+dl_demotion_demo
diff --git a/tools/testing/selftests/sched/Makefile b/tools/testing/selftests/sched/Makefile
index 0938acab18700..6207baf0de090 100644
--- a/tools/testing/selftests/sched/Makefile
+++ b/tools/testing/selftests/sched/Makefile
@@ -8,7 +8,7 @@ CFLAGS += -O2 -Wall -g -I./ $(KHDR_INCLUDES) -Wl,-rpath=./ \
 	  $(CLANG_FLAGS)
 LDLIBS += -lpthread
 
-TEST_GEN_FILES := cs_prctl_test dl_demotion_test dl_demotion_stress
-TEST_PROGS := cs_prctl_test dl_demotion_test
+TEST_GEN_FILES := cs_prctl_test dl_demotion_test dl_demotion_stress dl_demotion_demo
+TEST_PROGS := cs_prctl_test dl_demotion_test dl_demotion_demo
 
 include ../lib.mk
diff --git a/tools/testing/selftests/sched/dl_demotion_demo.c b/tools/testing/selftests/sched/dl_demotion_demo.c
new file mode 100644
index 0000000000000..ca4da88ac94a5
--- /dev/null
+++ b/tools/testing/selftests/sched/dl_demotion_demo.c
@@ -0,0 +1,239 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Simple demonstration of SCHED_DEADLINE demotion feature
+ *
+ * This test demonstrates the difference between:
+ * 1. Normal throttling: task stops executing when runtime exhausted
+ * 2. Demotion: task continues as SCHED_NORMAL when runtime exhausted
+ *
+ * Both tasks try to execute continuously, but only have 30ms budget per 100ms period.
+ */
+
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <sys/syscall.h>
+#include <linux/sched.h>
+#include <linux/sched/types.h>
+#include <time.h>
+#include <string.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <stdarg.h>
+
+#ifndef SCHED_FLAG_DL_DEMOTION
+#define SCHED_FLAG_DL_DEMOTION 0x80
+#endif
+
+/* Ftrace marker file */
+static int trace_marker_fd = -1;
+
+/* Use syscall directly to avoid glibc conflicts */
+static int sys_sched_setattr(pid_t pid, struct sched_attr *attr,
+			     unsigned int flags)
+{
+	return syscall(__NR_sched_setattr, pid, attr, flags);
+}
+
+static int sys_sched_getattr(pid_t pid, struct sched_attr *attr,
+			     unsigned int size, unsigned int flags)
+{
+	return syscall(__NR_sched_getattr, pid, attr, size, flags);
+}
+
+/* Open ftrace marker for writing trace events */
+static void open_trace_marker(void)
+{
+	trace_marker_fd = open("/sys/kernel/tracing/trace_marker", O_WRONLY);
+	if (trace_marker_fd < 0) {
+		/* Try debugfs location */
+		trace_marker_fd = open("/sys/kernel/debug/tracing/trace_marker", O_WRONLY);
+	}
+}
+
+/* Write a message to ftrace marker */
+static void trace_write(const char *fmt, ...)
+{
+	char buf[256];
+	va_list ap;
+	int len;
+
+	if (trace_marker_fd < 0)
+		return;
+
+	va_start(ap, fmt);
+	len = vsnprintf(buf, sizeof(buf), fmt, ap);
+	va_end(ap);
+
+	if (len > 0 && len < (int)sizeof(buf))
+		write(trace_marker_fd, buf, len);
+}
+
+/* Busy loop for specified milliseconds */
+static void burn_cpu_ms(int ms)
+{
+	struct timespec start, now;
+	long long elapsed_ns;
+	long long target_ns = ms * 1000000LL;
+
+	clock_gettime(CLOCK_MONOTONIC, &start);
+	do {
+		/* Just burn CPU */
+		for (volatile int i = 0; i < 10000; i++)
+			;
+		clock_gettime(CLOCK_MONOTONIC, &now);
+		elapsed_ns = (now.tv_sec - start.tv_sec) * 1000000000LL +
+			     (now.tv_nsec - start.tv_nsec);
+	} while (elapsed_ns < target_ns);
+}
+
+/* Count how much CPU time we actually got */
+static int measure_execution_time_ms(void (*workload)(void))
+{
+	struct timespec start, end;
+	long long elapsed_ns;
+
+	clock_gettime(CLOCK_THREAD_CPUTIME_ID, &start);
+	workload();
+	clock_gettime(CLOCK_THREAD_CPUTIME_ID, &end);
+
+	elapsed_ns = (end.tv_sec - start.tv_sec) * 1000000000LL +
+		     (end.tv_nsec - start.tv_nsec);
+
+	return elapsed_ns / 1000000; /* Convert to ms */
+}
+
+/* Try to execute for 200ms (should get throttled/demoted at 30ms) */
+static void try_burn_200ms(void)
+{
+	burn_cpu_ms(200);
+}
+
+static void run_throttled_test(void)
+{
+	struct sched_attr attr = {
+		.size = sizeof(attr),
+		.sched_policy = SCHED_DEADLINE,
+		.sched_runtime = 30 * 1000000,   /* 30ms */
+		.sched_deadline = 100 * 1000000, /* 100ms */
+		.sched_period = 100 * 1000000,   /* 100ms */
+		.sched_flags = 0,  /* NO demotion flag */
+		.sched_nice = 0,
+	};
+	int actual_ms;
+
+	printf("\n=== Test 1: Normal DEADLINE (no demotion) ===\n");
+	printf("Configuration:\n");
+	printf("  Runtime:  30ms per 100ms period\n");
+	printf("  Flags:    (none - normal throttling)\n");
+	printf("  Workload: Try to execute for 200ms\n\n");
+
+	trace_write("TEST1_START: Normal DEADLINE without demotion");
+
+	if (sys_sched_setattr(0, &attr, 0) < 0) {
+		perror("sys_sched_setattr");
+		exit(1);
+	}
+
+	trace_write("TEST1_EXEC_START: Starting workload");
+	printf("Starting execution...\n");
+	actual_ms = measure_execution_time_ms(try_burn_200ms);
+	trace_write("TEST1_EXEC_END: Workload completed");
+
+	printf("\nResult:\n");
+	printf("  CPU time obtained: %dms\n", actual_ms);
+	if (actual_ms >= 55 && actual_ms <= 65) {
+		printf("  ✓ Task was THROTTLED after exhausting 30ms budget\n");
+		printf("  ✓ Task stopped executing (did not continue past budget)\n");
+	} else {
+		printf("  ✗ Unexpected execution time (expected ~60ms: 30ms in each of two periods)\n");
+	}
+
+	/* Switch back to normal before next test */
+	attr.sched_policy = SCHED_NORMAL;
+	sys_sched_setattr(0, &attr, 0);
+	trace_write("TEST1_END: Switched back to SCHED_NORMAL");
+	usleep(100000); /* Let things settle */
+}
+
+static void run_demotion_test(void)
+{
+	struct sched_attr attr = {
+		.size = sizeof(attr),
+		.sched_policy = SCHED_DEADLINE,
+		.sched_runtime = 30 * 1000000,   /* 30ms */
+		.sched_deadline = 100 * 1000000, /* 100ms */
+		.sched_period = 100 * 1000000,   /* 100ms */
+		.sched_flags = SCHED_FLAG_DL_DEMOTION,
+		.sched_nice = 0,
+	};
+	int actual_ms;
+
+	printf("\n=== Test 2: DEADLINE with demotion ===\n");
+	printf("Configuration:\n");
+	printf("  Runtime:  30ms per 100ms period\n");
+	printf("  Flags:    SCHED_FLAG_DL_DEMOTION\n");
+	printf("  Workload: Try to execute for 200ms\n\n");
+
+	trace_write("TEST2_START: DEADLINE with demotion enabled");
+
+	if (sys_sched_setattr(0, &attr, 0) < 0) {
+		perror("sys_sched_setattr");
+		exit(1);
+	}
+
+	trace_write("TEST2_EXEC_START: Starting workload (will demote)");
+	printf("Starting execution...\n");
+	actual_ms = measure_execution_time_ms(try_burn_200ms);
+	trace_write("TEST2_EXEC_END: Workload completed");
+
+	printf("\nResult:\n");
+	printf("  CPU time obtained: %dms\n", actual_ms);
+	if (actual_ms >= 150) {
+		printf("  ✓ Task was DEMOTED after exhausting 30ms budget\n");
+		printf("  ✓ Task continued executing as SCHED_NORMAL\n");
+		printf("  ✓ Task completed the full workload\n");
+	} else if (actual_ms >= 55 && actual_ms <= 65) {
+		printf("  ✗ Task was throttled instead of demoted\n");
+		printf("  ✗ Demotion feature may not be working\n");
+	} else {
+		printf("  ✗ Unexpected execution time\n");
+	}
+
+	/* Switch back to normal */
+	attr.sched_policy = SCHED_NORMAL;
+	sys_sched_setattr(0, &attr, 0);
+	trace_write("TEST2_END: Switched back to SCHED_NORMAL");
+}
+
+int main(void)
+{
+	printf("SCHED_DEADLINE Demotion Feature Demonstration\n");
+	printf("==============================================\n");
+	printf("\nThis demonstrates the difference between normal throttling\n");
+	printf("and demotion when a DEADLINE task exhausts its runtime.\n");
+
+	/* Open ftrace marker for trace events */
+	open_trace_marker();
+	trace_write("DEMO_START: Beginning demonstration");
+
+	/* Run test without demotion (throttled) */
+	run_throttled_test();
+
+	/* Run test with demotion (continues as NORMAL) */
+	run_demotion_test();
+
+	printf("\n==============================================\n");
+	printf("Summary:\n");
+	printf("  Without demotion: Task stops when budget exhausted\n");
+	printf("  With demotion:    Task continues as SCHED_NORMAL\n");
+	printf("\nDemonstration complete!\n");
+
+	trace_write("DEMO_END: Demonstration completed");
+
+	if (trace_marker_fd >= 0)
+		close(trace_marker_fd);
+
+	return 0;
+}

-- 
2.53.0



* Re: [PATCH RFC 1/4] sched/deadline: Implement reclaim/soft mode through SCHED_OTHER demotion
  2026-02-19 13:37 ` [PATCH RFC 1/4] sched/deadline: Implement reclaim/soft mode through " Juri Lelli
@ 2026-02-20 19:47   ` Peter Zijlstra
  2026-02-23  7:12     ` Juri Lelli
  0 siblings, 1 reply; 7+ messages in thread
From: Peter Zijlstra @ 2026-02-20 19:47 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Ingo Molnar, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Jonathan Corbet,
	Shuah Khan, Qais Yousef, Clark Williams, Gabriele Monaco,
	Tommaso Cucinotta, Luca Abeni, linux-kernel, linux-doc

On Thu, Feb 19, 2026 at 02:37:34PM +0100, Juri Lelli wrote:
> Add support for demoting deadline tasks to SCHED_OTHER when they exhaust
> their runtime. This prevents starvation of lower priority tasks while still
> allowing deadline tasks to utilize available CPU bandwidth.
> 
> This feature resurrects and refines the bandwidth reclaiming concept
> from the original SCHED_DEADLINE development (circa 2010), focusing on a
> single demotion mode: SCHED_OTHER.

Yeah, that's good enough for most I suppose. Demotion to FIFO/RR is
'weird' anyway.

> @@ -1419,6 +1444,84 @@ s64 dl_scaled_delta_exec(struct rq *rq, struct sched_dl_entity *dl_se, s64 delta
>  	return scaled_delta_exec;
>  }
>  
> +/*
> + * Check if a deadline task can be demoted when it exhausts its runtime.
> + * dl-servers and boosted tasks cannot be demoted.
> + *
> + * Returns true if demotion should happen, false otherwise.
> + */
> +static inline bool dl_task_can_demote(struct sched_dl_entity *dl_se)
> +{
> +	if (dl_server(dl_se))
> +		return false;
> +
> +	if (is_dl_boosted(dl_se))
> +		return false;
> +
> +	return !!(dl_se->flags & SCHED_FLAG_DL_DEMOTION);

It is already implicitly cast to bool by virtue of the return value, no
need for that explicit !!.

> +}
> +
> +/*
> + * Promote a demoted task back to SCHED_DEADLINE.
> + * The task's runtime will be replenished by the caller.
> + */
> +static void dl_task_promote(struct rq *rq, struct task_struct *p)
> +{
> +	struct sched_dl_entity *dl_se = &p->dl;
> +	int queue_flags = DEQUEUE_MOVE | DEQUEUE_NOCLOCK | DEQUEUE_CLASS;
> +
> +	lockdep_assert_rq_held(rq);
> +
> +	if (dl_se->dl_demotion_state != DL_DEMOTED)
> +		return;
> +
> +	dl_se->dl_demotion_state = DL_PROMOTING;
> +
> +	scoped_guard (sched_change, p, queue_flags) {
> +		p->policy = SCHED_DEADLINE;
> +		p->sched_class = &dl_sched_class;
> +		p->prio = MAX_DL_PRIO - 1;
> +		p->normal_prio = p->prio;
> +	}
> +
> +	dl_se->dl_demotion_state = DL_NOT_DEMOTED;
> +
> +	__balance_callbacks(rq, NULL);
> +}
> +
> +/*
> + * Demote a deadline task to SCHED_OTHER when it exhausts its runtime.
> + * The task will be promoted back to SCHED_DEADLINE at replenish.
> + */
> +static void dl_task_demote(struct rq *rq, struct task_struct *p)
> +{
> +	struct sched_dl_entity *dl_se = &p->dl;
> +	int queue_flags = DEQUEUE_MOVE | DEQUEUE_NOCLOCK | DEQUEUE_CLASS;
> +
> +	lockdep_assert_rq_held(rq);
> +
> +	if (dl_se->dl_demotion_state != DL_NOT_DEMOTED || !dl_task_can_demote(dl_se))
> +		return;
> +
> +	dl_se->dl_demotion_state = DL_DEMOTING;
> +
> +	scoped_guard (sched_change, p, queue_flags) {
> +		/*
> +		 * The task's static_prio is already set from the sched_nice
> +		 * value in sched_attr.
> +		 */
> +		p->policy = SCHED_NORMAL;
> +		p->sched_class = &fair_sched_class;
> +		p->prio = p->static_prio;
> +		p->normal_prio = p->static_prio;
> +	}
> +
> +	dl_se->dl_demotion_state = DL_DEMOTED;
> +
> +	__balance_callbacks(rq, NULL);
> +	resched_curr(rq);

Doesn't sched_change already force resched on class degradation?

Anyway, I love how simple this has become ;-)

> +}

> +static void switched_from_dl(struct rq *rq, struct task_struct *p)
> +{
> +	/*
> +	 * If demoting, skip all bandwidth accounting. The bandwidth
> +	 * reservation stays in place while the task executes as SCHED_NORMAL.
> +	 */
> +	if (p->dl.dl_demotion_state == DL_DEMOTING)
> +		return;
> +
> +	__dl_cleanup_bandwidth(p, rq);
>  
>  	/*
>  	 * Since this might be the only -deadline task on the rq,
> @@ -3322,6 +3471,16 @@ static void switched_from_dl(struct rq *rq, struct task_struct *p)
>   */
>  static void switched_to_dl(struct rq *rq, struct task_struct *p)
>  {
> +	/*
> +	 * If promoting from demotion, skip bandwidth/cpuset accounting.
> +	 */
> +	if (p->dl.dl_demotion_state == DL_PROMOTING) {
> +		if (!task_on_rq_queued(p))
> +			return;
> +
> +		goto check_preempt;
> +	}
> +
>  	cancel_inactive_timer(&p->dl);
>  
>  	/*

Ah, I wondered where you'd need those DEMOTING/PROMOTING states.


> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c16b5fd71b2d5..59e5459a75492 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9415,6 +9415,14 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
>  	if (p->sched_task_hot)
>  		p->sched_task_hot = 0;
>  
> +	/*
> +	 * Demoted DEADLINE tasks cannot migrate. Their bandwidth reservation
> +	 * is tied to the demotion CPU and will be released when the task is
> +	 * promoted back to DEADLINE or explicitly switched to another policy.
> +	 */
> +	if (!dl_task_can_migrate(p))
> +		return 0;

I suppose this works, the alternative is doing migrate_disable() in
demote and migrate_enable() in promote. Not quite sure which is the
least horrible in this case :-)



* Re: [PATCH RFC 1/4] sched/deadline: Implement reclaim/soft mode through SCHED_OTHER demotion
  2026-02-20 19:47   ` Peter Zijlstra
@ 2026-02-23  7:12     ` Juri Lelli
  0 siblings, 0 replies; 7+ messages in thread
From: Juri Lelli @ 2026-02-23  7:12 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Jonathan Corbet,
	Shuah Khan, Qais Yousef, Clark Williams, Gabriele Monaco,
	Tommaso Cucinotta, Luca Abeni, linux-kernel, linux-doc

On 20/02/26 20:47, Peter Zijlstra wrote:
> On Thu, Feb 19, 2026 at 02:37:34PM +0100, Juri Lelli wrote:

...

> > @@ -1419,6 +1444,84 @@ s64 dl_scaled_delta_exec(struct rq *rq, struct sched_dl_entity *dl_se, s64 delta
> >  	return scaled_delta_exec;
> >  }
> >  
> > +/*
> > + * Check if a deadline task can be demoted when it exhausts its runtime.
> > + * dl-servers and boosted tasks cannot be demoted.
> > + *
> > + * Returns true if demotion should happen, false otherwise.
> > + */
> > +static inline bool dl_task_can_demote(struct sched_dl_entity *dl_se)
> > +{
> > +	if (dl_server(dl_se))
> > +		return false;
> > +
> > +	if (is_dl_boosted(dl_se))
> > +		return false;
> > +
> > +	return !!(dl_se->flags & SCHED_FLAG_DL_DEMOTION);
> 
> It is already implicitly cast to bool by virtue of the return value, no
> need for that explicit !!.

Indeed.

...

> > +/*
> > + * Demote a deadline task to SCHED_OTHER when it exhausts its runtime.
> > + * The task will be promoted back to SCHED_DEADLINE at replenish.
> > + */
> > +static void dl_task_demote(struct rq *rq, struct task_struct *p)
> > +{
> > +	struct sched_dl_entity *dl_se = &p->dl;
> > +	int queue_flags = DEQUEUE_MOVE | DEQUEUE_NOCLOCK | DEQUEUE_CLASS;
> > +
> > +	lockdep_assert_rq_held(rq);
> > +
> > +	if (dl_se->dl_demotion_state != DL_NOT_DEMOTED || !dl_task_can_demote(dl_se))
> > +		return;
> > +
> > +	dl_se->dl_demotion_state = DL_DEMOTING;
> > +
> > +	scoped_guard (sched_change, p, queue_flags) {
> > +		/*
> > +		 * The task's static_prio is already set from the sched_nice
> > +		 * value in sched_attr.
> > +		 */
> > +		p->policy = SCHED_NORMAL;
> > +		p->sched_class = &fair_sched_class;
> > +		p->prio = p->static_prio;
> > +		p->normal_prio = p->static_prio;
> > +	}
> > +
> > +	dl_se->dl_demotion_state = DL_DEMOTED;
> > +
> > +	__balance_callbacks(rq, NULL);
> > +	resched_curr(rq);
> 
> Doesn't sched_change already force resched on class degradation?

It does. Will remove.

> Anyway, I love how simple this has become ;-)

Yes! Quite handy. :)

...

> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index c16b5fd71b2d5..59e5459a75492 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -9415,6 +9415,14 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
> >  	if (p->sched_task_hot)
> >  		p->sched_task_hot = 0;
> >  
> > +	/*
> > +	 * Demoted DEADLINE tasks cannot migrate. Their bandwidth reservation
> > +	 * is tied to the demotion CPU and will be released when the task is
> > +	 * promoted back to DEADLINE or explicitly switched to another policy.
> > +	 */
> > +	if (!dl_task_can_migrate(p))
> > +		return 0;
> 
> I suppose this works, the alternative is doing migrate_disable() in
> demote and migrate_enable() in promote. Not quite sure which is the
> least horrible in this case :-)

Yeah, I was undecided between the two. Can switch to migrate_disable().

I actually spent some time trying to figure out how to allow migration
while demoted w/o breaking things, but the attempts so far ended up
with not really pretty locking and retries. So, I decided to leave this
'for later' and check first for interest in the feature.

Thanks for taking a look!

Juri



Thread overview: 7+ messages
2026-02-19 13:37 [PATCH RFC 0/4] sched/deadline: Add soft/reclaim mode via SCHED_OTHER demotion Juri Lelli
2026-02-19 13:37 ` [PATCH RFC 1/4] sched/deadline: Implement reclaim/soft mode through " Juri Lelli
2026-02-20 19:47   ` Peter Zijlstra
2026-02-23  7:12     ` Juri Lelli
2026-02-19 13:37 ` [PATCH RFC 2/4] sched/doc: Document SCHED_DEADLINE demotion feature Juri Lelli
2026-02-19 13:37 ` [PATCH RFC 3/4] DEBUG selftests/sched: Add tests for " Juri Lelli
2026-02-19 13:37 ` [PATCH RFC 4/4] DEBUG selftests/sched: Add simple demonstration of SCHED_DEADLINE demotion Juri Lelli
