public inbox for linux-kernel@vger.kernel.org
* [PATCH 0/2] sched: expose RT throttling info to facilitate priority reversal processing
@ 2025-12-02  5:51 wen.yang
  2025-12-02  5:51 ` [PATCH 1/2] sched/debug: add explicit TASK_RTLOCK_WAIT printing wen.yang
  2025-12-02  5:51 ` [PATCH 2/2] sched/rt: add RT throttle statistics wen.yang
  0 siblings, 2 replies; 3+ messages in thread
From: wen.yang @ 2025-12-02  5:51 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt
  Cc: Wen Yang, linux-kernel

From: Wen Yang <wen.yang@linux.dev>

This series helps resolve a priority inversion issue: a CFS task waits
on an rtmutex, the RT task holding that lock is stopped by RT
throttling, and a higher-priority RT task repeatedly retriggers RT
throttling through sustained CPU consumption.

In detail:

A priority inversion scenario can occur when a CFS task is starved
due to RT throttling. The scenario is as follows:

0. An rtmutex (e.g., softirq_ctrl.lock) is contended by both CFS
   tasks (e.g., ksoftirqd) and RT tasks (e.g., ktimer).
1. An RT task 'A' (e.g., ktimer) acquired the rtmutex.
2. A CFS task 'B' (e.g., ksoftirqd) attempts to acquire the same
   rtmutex and blocks.
3. A higher-priority RT task 'C' (e.g., stress-ng) runs for an
   extended period, preempting task 'A' and causing the RT runqueue
   to be throttled.
4. Once throttled, CFS task 'B' should run, but it remains blocked
   because the lock is still held by the non-running RT task 'A'. This
   can even lead to the CPU going idle.
5. When the throttle period ends, the high-priority RT task 'C'
   resumes execution, and the cycle repeats, leading to indefinite
   starvation of CFS task 'B'.

A typical stack trace for the blocked ksoftirqd shows it in the 'D'
state (TASK_RTLOCK_WAIT, currently reported as uninterruptible),
waiting on the lock:
     ksoftirqd/5-61      [005] d...211 58212.064160: sched_switch: prev_comm=ksoftirqd/5 prev_pid=61 prev_prio=120 prev_state=D ==> next_comm=swapper/5 next_pid=0 next_prio=120
     ksoftirqd/5-61      [005] d...211 58212.064161: <stack trace>
 => __schedule
 => schedule_rtlock
 => rtlock_slowlock_locked
 => rt_spin_lock
 => __local_bh_disable_ip
 => run_ksoftirqd
 => smpboot_thread_fn
 => kthread
 => ret_from_fork

These two patches expose the TASK_RTLOCK_WAIT state and add
throttle_count to rt_rq for monitoring in /proc/sched_debug.

User-space tools such as stalld can use this information to detect and
resolve the inversion, for example by boosting the lock holder or by
adjusting the priority of the blocked CFS task sitting in the
TASK_RTLOCK_WAIT state.

Wen Yang (2):
  sched/debug: add explicit TASK_RTLOCK_WAIT printing
  sched/rt: add RT throttle statistics

 fs/proc/array.c              |  3 ++-
 include/linux/sched.h        | 21 +++++++++------------
 include/trace/events/sched.h |  1 +
 kernel/sched/debug.c         |  1 +
 kernel/sched/rt.c            |  1 +
 kernel/sched/sched.h         |  1 +
 6 files changed, 15 insertions(+), 13 deletions(-)

Cc: linux-kernel@vger.kernel.org
-- 
2.25.1


^ permalink raw reply	[flat|nested] 3+ messages in thread

* [PATCH 1/2] sched/debug: add explicit TASK_RTLOCK_WAIT printing
  2025-12-02  5:51 [PATCH 0/2] sched: expose RT throttling info to facilitate priority reversal processing wen.yang
@ 2025-12-02  5:51 ` wen.yang
  2025-12-02  5:51 ` [PATCH 2/2] sched/rt: add RT throttle statistics wen.yang
  1 sibling, 0 replies; 3+ messages in thread
From: wen.yang @ 2025-12-02  5:51 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt
  Cc: Wen Yang, Vincent Guittot, Ben Segall, Mel Gorman,
	Valentin Schneider, linux-kernel

From: Wen Yang <wen.yang@linux.dev>

A priority inversion scenario can occur when a CFS task is starved
due to RT throttling. The scenario is as follows:

0. An rtmutex (e.g., softirq_ctrl.lock) is contended by both CFS
   tasks (e.g., ksoftirqd) and RT tasks (e.g., ktimer).
1. An RT task 'A' (e.g., ktimer) acquired the rtmutex.
2. A CFS task 'B' (e.g., ksoftirqd) attempts to acquire the same
   rtmutex and blocks.
3. A higher-priority RT task 'C' (e.g., stress-ng) runs for an
   extended period, preempting task 'A' and causing the RT runqueue
   to be throttled.
4. Once throttled, CFS task 'B' should run, but it remains blocked
   because the lock is still held by the non-running RT task 'A'. This
   can even lead to the CPU going idle.
5. When the throttle period ends, the high-priority RT task 'C'
   resumes execution, and the cycle repeats, leading to indefinite
   starvation of CFS task 'B'.

A typical stack trace for the blocked ksoftirqd shows it in the 'D'
state (TASK_RTLOCK_WAIT, currently reported as uninterruptible),
waiting on the lock:
     ksoftirqd/5-61      [005] d...211 58212.064160: sched_switch: prev_comm=ksoftirqd/5 prev_pid=61 prev_prio=120 prev_state=D ==> next_comm=swapper/5 next_pid=0 next_prio=120
     ksoftirqd/5-61      [005] d...211 58212.064161: <stack trace>
 => __schedule
 => schedule_rtlock
 => rtlock_slowlock_locked
 => rt_spin_lock
 => __local_bh_disable_ip
 => run_ksoftirqd
 => smpboot_thread_fn
 => kthread
 => ret_from_fork

Make TASK_RTLOCK_WAIT a distinct state, 'L', in task state reporting,
allowing user-space tools (e.g., stalld) to detect tasks blocked in this
state and potentially boost the lock holder or adjust the priority of
the blocked CFS task to resolve the inversion.

This requires shuffling the state bits so that TASK_RTLOCK_WAIT fits
into the TASK_REPORT mask.

Signed-off-by: Wen Yang <wen.yang@linux.dev>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: linux-kernel@vger.kernel.org
---
 fs/proc/array.c              |  3 ++-
 include/linux/sched.h        | 21 +++++++++------------
 include/trace/events/sched.h |  1 +
 3 files changed, 12 insertions(+), 13 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index cbd4bc4a58e4..a9b7e5a920c1 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -134,9 +134,10 @@ static const char * const task_state_array[] = {
 	"X (dead)",		/* 0x10 */
 	"Z (zombie)",		/* 0x20 */
 	"P (parked)",		/* 0x40 */
+	"L (rtlock wait)",	/* 0x80 */
 
 	/* states beyond TASK_REPORT: */
-	"I (idle)",		/* 0x80 */
+	"I (idle)",		/* 0x100 */
 };
 
 static inline const char *get_task_state(struct task_struct *tsk)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d395f2810fac..455e41aa073f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -113,12 +113,12 @@ struct user_event_mm;
 #define EXIT_TRACE			(EXIT_ZOMBIE | EXIT_DEAD)
 /* Used in tsk->__state again: */
 #define TASK_PARKED			0x00000040
-#define TASK_DEAD			0x00000080
-#define TASK_WAKEKILL			0x00000100
-#define TASK_WAKING			0x00000200
-#define TASK_NOLOAD			0x00000400
-#define TASK_NEW			0x00000800
-#define TASK_RTLOCK_WAIT		0x00001000
+#define TASK_RTLOCK_WAIT		0x00000080
+#define TASK_DEAD			0x00000100
+#define TASK_WAKEKILL			0x00000200
+#define TASK_WAKING			0x00000400
+#define TASK_NOLOAD			0x00000800
+#define TASK_NEW			0x00001000
 #define TASK_FREEZABLE			0x00002000
 #define __TASK_FREEZABLE_UNSAFE	       (0x00004000 * IS_ENABLED(CONFIG_LOCKDEP))
 #define TASK_FROZEN			0x00008000
@@ -145,7 +145,7 @@ struct user_event_mm;
 #define TASK_REPORT			(TASK_RUNNING | TASK_INTERRUPTIBLE | \
 					 TASK_UNINTERRUPTIBLE | __TASK_STOPPED | \
 					 __TASK_TRACED | EXIT_DEAD | EXIT_ZOMBIE | \
-					 TASK_PARKED)
+					 TASK_PARKED | TASK_RTLOCK_WAIT)
 
 #define task_is_running(task)		(READ_ONCE((task)->__state) == TASK_RUNNING)
 
@@ -1672,12 +1672,9 @@ static inline unsigned int __task_state_index(unsigned int tsk_state,
 		state = TASK_REPORT_IDLE;
 
 	/*
-	 * We're lying here, but rather than expose a completely new task state
-	 * to userspace, we can make this appear as if the task has gone through
-	 * a regular rt_mutex_lock() call.
 	 * Report frozen tasks as uninterruptible.
 	 */
-	if ((tsk_state & TASK_RTLOCK_WAIT) || (tsk_state & TASK_FROZEN))
+	if ((tsk_state & TASK_FROZEN))
 		state = TASK_UNINTERRUPTIBLE;
 
 	return fls(state);
@@ -1690,7 +1687,7 @@ static inline unsigned int task_state_index(struct task_struct *tsk)
 
 static inline char task_index_to_char(unsigned int state)
 {
-	static const char state_char[] = "RSDTtXZPI";
+	static const char state_char[] = "RSDTtXZPLI";
 
 	BUILD_BUG_ON(TASK_REPORT_MAX * 2 != 1 << (sizeof(state_char) - 1));
 
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 7b2645b50e78..2e22bb74900a 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -259,6 +259,7 @@ TRACE_EVENT(sched_switch,
 				{ EXIT_DEAD, "X" },
 				{ EXIT_ZOMBIE, "Z" },
 				{ TASK_PARKED, "P" },
+				{ TASK_RTLOCK_WAIT, "L" },
 				{ TASK_DEAD, "I" }) :
 		  "R",
 
-- 
2.25.1



* [PATCH 2/2] sched/rt: add RT throttle statistics
  2025-12-02  5:51 [PATCH 0/2] sched: expose RT throttling info to facilitate priority reversal processing wen.yang
  2025-12-02  5:51 ` [PATCH 1/2] sched/debug: add explicit TASK_RTLOCK_WAIT printing wen.yang
@ 2025-12-02  5:51 ` wen.yang
  1 sibling, 0 replies; 3+ messages in thread
From: wen.yang @ 2025-12-02  5:51 UTC (permalink / raw)
  To: Ingo Molnar, Peter Zijlstra, Juri Lelli, Dietmar Eggemann,
	Steven Rostedt
  Cc: Wen Yang, Vincent Guittot, Ben Segall, Mel Gorman,
	Valentin Schneider, linux-kernel

From: Wen Yang <wen.yang@linux.dev>

A priority inversion scenario can occur when a CFS task is starved
due to RT throttling. The scenario is as follows:

0. An rtmutex (e.g., softirq_ctrl.lock) is contended by both CFS
   tasks (e.g., ksoftirqd) and RT tasks (e.g., ktimer).
1. An RT task 'A' (e.g., ktimer) acquired the rtmutex.
2. A CFS task 'B' (e.g., ksoftirqd) attempts to acquire the same
   rtmutex and blocks.
3. A higher-priority RT task 'C' (e.g., stress-ng) runs for an
   extended period, preempting task 'A' and causing the RT runqueue
   to be throttled.
4. Once throttled, CFS task 'B' should run, but it remains blocked
   because the lock is still held by the non-running RT task 'A'. This
   can even lead to the CPU going idle.
5. When the throttle period ends, the high-priority RT task 'C'
   resumes execution, and the cycle repeats, leading to indefinite
   starvation of CFS task 'B'.

A typical stack trace for the blocked ksoftirqd shows it in the 'D'
state (TASK_RTLOCK_WAIT, currently reported as uninterruptible),
waiting on the lock:
     ksoftirqd/5-61      [005] d...211 58212.064160: sched_switch: prev_comm=ksoftirqd/5 prev_pid=61 prev_prio=120 prev_state=D ==> next_comm=swapper/5 next_pid=0 next_prio=120
     ksoftirqd/5-61      [005] d...211 58212.064161: <stack trace>
 => __schedule
 => schedule_rtlock
 => rtlock_slowlock_locked
 => rt_spin_lock
 => __local_bh_disable_ip
 => run_ksoftirqd
 => smpboot_thread_fn
 => kthread
 => ret_from_fork

Add a throttle_count field to rt_rq, incremented on each throttling
event and displayed by print_rt_rq() in /proc/sched_debug.

User-space tools (e.g., stalld) can then monitor throttle_count to
detect heavy CPU consumption by RT tasks, and look for tasks in the
TASK_RTLOCK_WAIT state to handle the priority inversion.

Signed-off-by: Wen Yang <wen.yang@linux.dev>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: linux-kernel@vger.kernel.org
---
 kernel/sched/debug.c | 1 +
 kernel/sched/rt.c    | 1 +
 kernel/sched/sched.h | 1 +
 3 files changed, 3 insertions(+)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 41caa22e0680..8ed33c74e5a5 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -894,6 +894,7 @@ void print_rt_rq(struct seq_file *m, int cpu, struct rt_rq *rt_rq)
 	P(rt_throttled);
 	PN(rt_time);
 	PN(rt_runtime);
+	PU(throttle_count);
 #endif
 
 #undef PN
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index f1867fe8e5c5..88c659285c70 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -884,6 +884,7 @@ static int sched_rt_runtime_exceeded(struct rt_rq *rt_rq)
 		 */
 		if (likely(rt_b->rt_runtime)) {
 			rt_rq->rt_throttled = 1;
+			rt_rq->throttle_count++;
 			printk_deferred_once("sched: RT throttling activated\n");
 		} else {
 			/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index bbf513b3e76c..88119540e4d4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -840,6 +840,7 @@ struct rt_rq {
 	int			rt_throttled;
 	u64			rt_time; /* consumed RT time, goes up in update_curr_rt */
 	u64			rt_runtime; /* allotted RT time, "slice" from rt_bandwidth, RT sharing/balancing */
+	u64			throttle_count;
 	/* Nests inside the rq lock: */
 	raw_spinlock_t		rt_runtime_lock;
 
-- 
2.25.1



end of thread, other threads:[~2025-12-02  5:52 UTC | newest]
