public inbox for rcu@vger.kernel.org
 help / color / mirror / Atom feed
* PATCH v3 0/3] softirq: Use a dedicated thread for timer wakeups with forced-threading.
@ 2024-11-06 14:51 Sebastian Andrzej Siewior
  2024-11-06 14:51 ` [PATCH v3 1/3] hrtimer: Use __raise_softirq_irqoff() to raise the softirq Sebastian Andrzej Siewior
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-11-06 14:51 UTC (permalink / raw)
  To: linux-kernel, rcu
  Cc: Paul E. McKenney, Anna-Maria Behnsen, Davidlohr Bueso,
	Frederic Weisbecker, Ingo Molnar, Josh Triplett, Thomas Gleixner

Hi,

the following was in the PREEMPT_RT queue since last softirq rework. The
result is that timer wake ups (hrtimer, timer_list) happens in hardirq
processing them requires to wake ksoftirqd. ksoftirqd runs at SCHED_OTHER so it
will compete for resources with all other tasks in the system, potentially
delayed the processing further.

The idea was to let the timers be processed by a dedicated thread
running at low SCHED_FIFO priority.
While looking at it again, it might make sense to have the
pending_softirq flag per-thread to avoid threads with higher priority
picking up softirqs from low-priority threads. This isn't yet a problem
because adding softirqs for processing happens only from threaded
interrupts. So the low-priority thread will wait until the high-priority
thread is done. And the high-priority thread will PI-boost the
low-priority thread until it is done. It would only make sense to make
the flags per-thread once the BH lock is gone.

The patch is limited to the forced threaded case.

v2…v3 https://lore.kernel.org/all/20241024150413.518862-1-bigeasy@linutronix.de/
 - Redo the comment in interrupt.h. Make it more verbose and add more
   details. Frederick asked for it.

 - After updating the comment I convinced myself that it also makes
   sense to use this in the forced threaded case.

 - Merged raise_hrtimer_softirq() and raise_timer_softirq() into a
   single function since both use __raise_softirq_irqoff() in the
   alternative case.
 
v1…v2 https://lore.kernel.org/all/20241004103842.131014-1-bigeasy@linutronix.de/
 Frederick's comments:
 - Use __raise_softirq_irqoff() to raise the softirq for !PREEMPT_RT. Also a
   lockdep test to ensure that this is always invoked from an IRQ.
 - Make raise_ktimers_thread() only OR the flag and nothing else to
   align with __raise_softirq_irqoff(). The wake happens on return from
   interrupt anyway.
 - A comment in timersd_setup() and interrupt.h
 - local_pending_timers() => local_timers_pending().

Sebastian


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH v3 1/3] hrtimer: Use __raise_softirq_irqoff() to raise the softirq.
  2024-11-06 14:51 PATCH v3 0/3] softirq: Use a dedicated thread for timer wakeups with forced-threading Sebastian Andrzej Siewior
@ 2024-11-06 14:51 ` Sebastian Andrzej Siewior
  2024-11-06 14:51 ` [PATCH v3 2/3] timers: " Sebastian Andrzej Siewior
  2024-11-06 14:51 ` [PATCH v3 3/3] softirq: Use a dedicated thread for timer wakeups on PREEMPT_RT Sebastian Andrzej Siewior
  2 siblings, 0 replies; 11+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-11-06 14:51 UTC (permalink / raw)
  To: linux-kernel, rcu
  Cc: Paul E. McKenney, Anna-Maria Behnsen, Davidlohr Bueso,
	Frederic Weisbecker, Ingo Molnar, Josh Triplett, Thomas Gleixner,
	Sebastian Andrzej Siewior

As an optimisation use __raise_softirq_irqoff() to raise the softirq.
This is always called from an interrupt handler so it can be reduced to
just or set softirq flag and let softirq be invoked on return from
interrupt.

Use __raise_softirq_irqoff() to raise the softirq.

Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/time/hrtimer.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index cddcd08ea827f..5402e0f242178 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1811,7 +1811,7 @@ void hrtimer_interrupt(struct clock_event_device *dev)
 	if (!ktime_before(now, cpu_base->softirq_expires_next)) {
 		cpu_base->softirq_expires_next = KTIME_MAX;
 		cpu_base->softirq_activated = 1;
-		raise_softirq_irqoff(HRTIMER_SOFTIRQ);
+		__raise_softirq_irqoff(HRTIMER_SOFTIRQ);
 	}
 
 	__hrtimer_run_queues(cpu_base, now, flags, HRTIMER_ACTIVE_HARD);
@@ -1906,7 +1906,7 @@ void hrtimer_run_queues(void)
 	if (!ktime_before(now, cpu_base->softirq_expires_next)) {
 		cpu_base->softirq_expires_next = KTIME_MAX;
 		cpu_base->softirq_activated = 1;
-		raise_softirq_irqoff(HRTIMER_SOFTIRQ);
+		__raise_softirq_irqoff(HRTIMER_SOFTIRQ);
 	}
 
 	__hrtimer_run_queues(cpu_base, now, flags, HRTIMER_ACTIVE_HARD);
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v3 2/3] timers: Use __raise_softirq_irqoff() to raise the softirq.
  2024-11-06 14:51 PATCH v3 0/3] softirq: Use a dedicated thread for timer wakeups with forced-threading Sebastian Andrzej Siewior
  2024-11-06 14:51 ` [PATCH v3 1/3] hrtimer: Use __raise_softirq_irqoff() to raise the softirq Sebastian Andrzej Siewior
@ 2024-11-06 14:51 ` Sebastian Andrzej Siewior
  2024-11-06 14:51 ` [PATCH v3 3/3] softirq: Use a dedicated thread for timer wakeups on PREEMPT_RT Sebastian Andrzej Siewior
  2 siblings, 0 replies; 11+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-11-06 14:51 UTC (permalink / raw)
  To: linux-kernel, rcu
  Cc: Paul E. McKenney, Anna-Maria Behnsen, Davidlohr Bueso,
	Frederic Weisbecker, Ingo Molnar, Josh Triplett, Thomas Gleixner,
	Sebastian Andrzej Siewior

As an optimisation use __raise_softirq_irqoff() to raise the softirq.
This is always called from an interrupt handler, interrupts are already
disabled so it can be reduced to just or set softirq flag and let
softirq be invoked on return from interrupt.

Use __raise_softirq_irqoff() to raise the softirq.

Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/time/timer.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 0fc9d066a7be4..1759de934284c 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -2499,7 +2499,7 @@ static void run_local_timers(void)
 		 */
 		if (time_after_eq(jiffies, READ_ONCE(base->next_expiry)) ||
 		    (i == BASE_DEF && tmigr_requires_handle_remote())) {
-			raise_softirq(TIMER_SOFTIRQ);
+			__raise_softirq_irqoff(TIMER_SOFTIRQ);
 			return;
 		}
 	}
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v3 3/3] softirq: Use a dedicated thread for timer wakeups on PREEMPT_RT.
  2024-11-06 14:51 PATCH v3 0/3] softirq: Use a dedicated thread for timer wakeups with forced-threading Sebastian Andrzej Siewior
  2024-11-06 14:51 ` [PATCH v3 1/3] hrtimer: Use __raise_softirq_irqoff() to raise the softirq Sebastian Andrzej Siewior
  2024-11-06 14:51 ` [PATCH v3 2/3] timers: " Sebastian Andrzej Siewior
@ 2024-11-06 14:51 ` Sebastian Andrzej Siewior
  2025-12-01 21:51   ` Jan Kiszka
  2 siblings, 1 reply; 11+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-11-06 14:51 UTC (permalink / raw)
  To: linux-kernel, rcu
  Cc: Paul E. McKenney, Anna-Maria Behnsen, Davidlohr Bueso,
	Frederic Weisbecker, Ingo Molnar, Josh Triplett, Thomas Gleixner,
	Sebastian Andrzej Siewior

A timer/ hrtimer softirq is raised in-IRQ context. With threaded
interrupts enabled or on PREEMPT_RT this leads to waking the ksoftirqd
for the processing of the softirq. ksoftirqd runs as SCHED_OTHER which
means it will compete with other tasks for CPU ressources.
This can introduce long delays for timer processing on heavy loaded
systems and is not desired.

Split the TIMER_SOFTIRQ and HRTIMER_SOFTIRQ processing into a dedicated
timers thread and let it run at the lowest SCHED_FIFO priority.
Wake-ups for RT tasks happen from hardirq context so only timer_list timers
and hrtimers for "regular" tasks are processed here. The higher priority
ensures that wakeups are performed before scheduling SCHED_OTHER tasks.

Using a dedicated variable to store the pending softirq bits values
ensure that the timer are not accidentally picked up by ksoftirqd and
other threaded interrupts.
It shouldn't be picked up by ksoftirqd since it runs at lower priority.
However if ksoftirqd is already running while a timer fires, then
ksoftird will be PI-boosted due to the BH-lock to ktimer's priority.
Ideally we try to avoid having ksoftirqd running.

The timer thread can pick up pending softirqs from ksoftirqd but only
if the softirq load is high. It is not be desired that the picked up
softirqs are processed at SCHED_FIFO priority under high softirq load
but this can already happen by a PI-boost by a force-threaded interrupt.

[ frederic@kernel.org: rcutorture.c fixes, storm fix by introduction of
  local_timers_pending() for tick_nohz_next_event() ]

[ junxiao.chang@intel.com: Ensure ktimersd gets woken up even if a
  softirq is currently served. ]

Reviewed-by: Paul E. McKenney <paulmck@kernel.org> [rcutorture]
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 include/linux/interrupt.h | 47 ++++++++++++++++++++++++++
 kernel/rcu/rcutorture.c   |  8 +++++
 kernel/softirq.c          | 69 ++++++++++++++++++++++++++++++++++++++-
 kernel/time/hrtimer.c     |  4 +--
 kernel/time/tick-sched.c  |  2 +-
 kernel/time/timer.c       |  2 +-
 6 files changed, 127 insertions(+), 5 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 457151f9f263d..8cd9327e4e78d 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -616,6 +616,53 @@ extern void __raise_softirq_irqoff(unsigned int nr);
 extern void raise_softirq_irqoff(unsigned int nr);
 extern void raise_softirq(unsigned int nr);
 
+/*
+ * With forced-threaded interrupts enabled a raised softirq is deferred to
+ * ksoftirqd unless it can be handled within the threaded interrupt. This
+ * affects timer_list timers and hrtimers which are explicitly marked with
+ * HRTIMER_MODE_SOFT.
+ * With PREEMPT_RT enabled more hrtimers are moved to softirq for processing
+ * which includes all timers which are not explicitly marked HRTIMER_MODE_HARD.
+ * Userspace controlled timers (like the clock_nanosleep() interface) is divided
+ * into two categories: Tasks with elevated scheduling policy including
+ * SCHED_{FIFO|RR|DL} and the remaining scheduling policy. The tasks with the
+ * elevated scheduling policy are woken up directly from the HARDIRQ while all
+ * other wake ups are delayed to softirq and so to ksoftirqd.
+ *
+ * The ksoftirqd runs at SCHED_OTHER policy at which it should remain since it
+ * handles the softirq in an overloaded situation (not handled everything
+ * within its last run).
+ * If the timers are handled at SCHED_OTHER priority then they competes with all
+ * other SCHED_OTHER tasks for CPU resources are possibly delayed.
+ * Moving timers softirqs to a low priority SCHED_FIFO thread instead ensures
+ * that timer are performed before scheduling any SCHED_OTHER thread.
+ */
+DECLARE_PER_CPU(struct task_struct *, ktimerd);
+DECLARE_PER_CPU(unsigned long, pending_timer_softirq);
+void raise_ktimers_thread(unsigned int nr);
+
+static inline unsigned int local_timers_pending_force_th(void)
+{
+	return __this_cpu_read(pending_timer_softirq);
+}
+
+static inline void raise_timer_softirq(unsigned int nr)
+{
+	lockdep_assert_in_irq();
+	if (force_irqthreads())
+		raise_ktimers_thread(nr);
+	else
+		__raise_softirq_irqoff(nr);
+}
+
+static inline unsigned int local_timers_pending(void)
+{
+	if (force_irqthreads())
+		return local_timers_pending_force_th();
+	else
+		return local_softirq_pending();
+}
+
 DECLARE_PER_CPU(struct task_struct *, ksoftirqd);
 
 static inline struct task_struct *this_cpu_ksoftirqd(void)
diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index bb75dbf5c800c..270c31a1e8570 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -2440,6 +2440,14 @@ static int rcutorture_booster_init(unsigned int cpu)
 		WARN_ON_ONCE(!t);
 		sp.sched_priority = 2;
 		sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
+#ifdef CONFIG_IRQ_FORCED_THREADING
+		if (force_irqthreads()) {
+			t = per_cpu(ktimerd, cpu);
+			WARN_ON_ONCE(!t);
+			sp.sched_priority = 2;
+			sched_setscheduler_nocheck(t, SCHED_FIFO, &sp);
+		}
+#endif
 	}
 
 	/* Don't allow time recalculation while creating a new task. */
diff --git a/kernel/softirq.c b/kernel/softirq.c
index d082e7840f880..7b525c9044626 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -624,6 +624,24 @@ static inline void tick_irq_exit(void)
 #endif
 }
 
+#ifdef CONFIG_IRQ_FORCED_THREADING
+DEFINE_PER_CPU(struct task_struct *, ktimerd);
+DEFINE_PER_CPU(unsigned long, pending_timer_softirq);
+
+static void wake_timersd(void)
+{
+	struct task_struct *tsk = __this_cpu_read(ktimerd);
+
+	if (tsk)
+		wake_up_process(tsk);
+}
+
+#else
+
+static inline void wake_timersd(void) { }
+
+#endif
+
 static inline void __irq_exit_rcu(void)
 {
 #ifndef __ARCH_IRQ_EXIT_IRQS_DISABLED
@@ -636,6 +654,10 @@ static inline void __irq_exit_rcu(void)
 	if (!in_interrupt() && local_softirq_pending())
 		invoke_softirq();
 
+	if (IS_ENABLED(CONFIG_IRQ_FORCED_THREADING) && force_irqthreads() &&
+	    local_timers_pending_force_th() && !(in_nmi() | in_hardirq()))
+		wake_timersd();
+
 	tick_irq_exit();
 }
 
@@ -971,12 +993,57 @@ static struct smp_hotplug_thread softirq_threads = {
 	.thread_comm		= "ksoftirqd/%u",
 };
 
+#ifdef CONFIG_IRQ_FORCED_THREADING
+static void ktimerd_setup(unsigned int cpu)
+{
+	/* Above SCHED_NORMAL to handle timers before regular tasks. */
+	sched_set_fifo_low(current);
+}
+
+static int ktimerd_should_run(unsigned int cpu)
+{
+	return local_timers_pending_force_th();
+}
+
+void raise_ktimers_thread(unsigned int nr)
+{
+	trace_softirq_raise(nr);
+	__this_cpu_or(pending_timer_softirq, BIT(nr));
+}
+
+static void run_ktimerd(unsigned int cpu)
+{
+	unsigned int timer_si;
+
+	ksoftirqd_run_begin();
+
+	timer_si = local_timers_pending_force_th();
+	__this_cpu_write(pending_timer_softirq, 0);
+	or_softirq_pending(timer_si);
+
+	__do_softirq();
+
+	ksoftirqd_run_end();
+}
+
+static struct smp_hotplug_thread timer_thread = {
+	.store			= &ktimerd,
+	.setup			= ktimerd_setup,
+	.thread_should_run	= ktimerd_should_run,
+	.thread_fn		= run_ktimerd,
+	.thread_comm		= "ktimers/%u",
+};
+#endif
+
 static __init int spawn_ksoftirqd(void)
 {
 	cpuhp_setup_state_nocalls(CPUHP_SOFTIRQ_DEAD, "softirq:dead", NULL,
 				  takeover_tasklets);
 	BUG_ON(smpboot_register_percpu_thread(&softirq_threads));
-
+#ifdef CONFIG_IRQ_FORCED_THREADING
+	if (force_irqthreads())
+		BUG_ON(smpboot_register_percpu_thread(&timer_thread));
+#endif
 	return 0;
 }
 early_initcall(spawn_ksoftirqd);
diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 5402e0f242178..d9911516e7431 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -1811,7 +1811,7 @@ void hrtimer_interrupt(struct clock_event_device *dev)
 	if (!ktime_before(now, cpu_base->softirq_expires_next)) {
 		cpu_base->softirq_expires_next = KTIME_MAX;
 		cpu_base->softirq_activated = 1;
-		__raise_softirq_irqoff(HRTIMER_SOFTIRQ);
+		raise_timer_softirq(HRTIMER_SOFTIRQ);
 	}
 
 	__hrtimer_run_queues(cpu_base, now, flags, HRTIMER_ACTIVE_HARD);
@@ -1906,7 +1906,7 @@ void hrtimer_run_queues(void)
 	if (!ktime_before(now, cpu_base->softirq_expires_next)) {
 		cpu_base->softirq_expires_next = KTIME_MAX;
 		cpu_base->softirq_activated = 1;
-		__raise_softirq_irqoff(HRTIMER_SOFTIRQ);
+		raise_timer_softirq(HRTIMER_SOFTIRQ);
 	}
 
 	__hrtimer_run_queues(cpu_base, now, flags, HRTIMER_ACTIVE_HARD);
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index f203f000da1ad..e0c47259e91a7 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -865,7 +865,7 @@ static void tick_nohz_restart(struct tick_sched *ts, ktime_t now)
 
 static inline bool local_timer_softirq_pending(void)
 {
-	return local_softirq_pending() & BIT(TIMER_SOFTIRQ);
+	return local_timers_pending() & BIT(TIMER_SOFTIRQ);
 }
 
 /*
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 1759de934284c..06f0bc1db6d9a 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -2499,7 +2499,7 @@ static void run_local_timers(void)
 		 */
 		if (time_after_eq(jiffies, READ_ONCE(base->next_expiry)) ||
 		    (i == BASE_DEF && tmigr_requires_handle_remote())) {
-			__raise_softirq_irqoff(TIMER_SOFTIRQ);
+			raise_timer_softirq(TIMER_SOFTIRQ);
 			return;
 		}
 	}
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH v3 3/3] softirq: Use a dedicated thread for timer wakeups on PREEMPT_RT.
  2024-11-06 14:51 ` [PATCH v3 3/3] softirq: Use a dedicated thread for timer wakeups on PREEMPT_RT Sebastian Andrzej Siewior
@ 2025-12-01 21:51   ` Jan Kiszka
  2025-12-01 23:49     ` Luis Claudio R. Goncalves
  2025-12-02  8:24     ` Sebastian Andrzej Siewior
  0 siblings, 2 replies; 11+ messages in thread
From: Jan Kiszka @ 2025-12-01 21:51 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior, linux-kernel, rcu, stable-rt
  Cc: Paul E. McKenney, Anna-Maria Behnsen, Davidlohr Bueso,
	Frederic Weisbecker, Ingo Molnar, Josh Triplett, Thomas Gleixner,
	Florian Bezdeka, Pavel Machek

On 06.11.24 15:51, Sebastian Andrzej Siewior wrote:
> A timer/ hrtimer softirq is raised in-IRQ context. With threaded
> interrupts enabled or on PREEMPT_RT this leads to waking the ksoftirqd
> for the processing of the softirq. ksoftirqd runs as SCHED_OTHER which
> means it will compete with other tasks for CPU ressources.
> This can introduce long delays for timer processing on heavy loaded
> systems and is not desired.
> 
> Split the TIMER_SOFTIRQ and HRTIMER_SOFTIRQ processing into a dedicated
> timers thread and let it run at the lowest SCHED_FIFO priority.
> Wake-ups for RT tasks happen from hardirq context so only timer_list timers
> and hrtimers for "regular" tasks are processed here. The higher priority
> ensures that wakeups are performed before scheduling SCHED_OTHER tasks.
> 
> Using a dedicated variable to store the pending softirq bits values
> ensure that the timer are not accidentally picked up by ksoftirqd and
> other threaded interrupts.
> It shouldn't be picked up by ksoftirqd since it runs at lower priority.
> However if ksoftirqd is already running while a timer fires, then
> ksoftird will be PI-boosted due to the BH-lock to ktimer's priority.
> Ideally we try to avoid having ksoftirqd running.
> 
> The timer thread can pick up pending softirqs from ksoftirqd but only
> if the softirq load is high. It is not be desired that the picked up
> softirqs are processed at SCHED_FIFO priority under high softirq load
> but this can already happen by a PI-boost by a force-threaded interrupt.
> 
> [ frederic@kernel.org: rcutorture.c fixes, storm fix by introduction of
>   local_timers_pending() for tick_nohz_next_event() ]
> 
> [ junxiao.chang@intel.com: Ensure ktimersd gets woken up even if a
>   softirq is currently served. ]
> 
> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> [rcutorture]
> Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

This went into 6.13 and was never backported to 6.12-lts. And that is
why you can easily stall the latter with a workload like this and
CONFIG_PREEMPT_RT enabled:

echo "+cpu" >> /sys/fs/cgroup/cgroup.subtree_control
echo "+cpuset" >> /sys/fs/cgroup/cgroup.subtree_control

mkdir /sys/fs/cgroup/stalltest.sub1
mkdir /sys/fs/cgroup/stalltest.sub2
sleep 10000000 &
pid=$!

systemd-run --slice "stalltest.slice" taskset -c 0 sh -c " \
    while true; do
        echo $pid > /sys/fs/cgroup/stalltest.sub1/cgroup.procs;
        echo $pid > /sys/fs/cgroup/stalltest.sub2/cgroup.procs;
    done"

echo "1000 20000" > /sys/fs/cgroup/stalltest.slice/cpu.max

This triggers a lock-up if a holder of cgroup_file_kn_lock with
SCHED_OTHER is scheduled out after using up its timeslice and then
cgroup_file_notify_timer fires over a SCHED_OTHER context as well,
trying to get this lock, failing and then never being able to reactivate
the lock holder again as well.

I've nicely reproduced this with upstream 6.12.58 while Debian's lastest
6.12-rt does not trigger because it additionally has the downstream -rt
patches on board.

How should we handle this? Consider 6.12 mainline with -rt and cgroups
as potentially broken, asking people to user 6.12-rt? Or port this back?

BTW, the original report of this issue came from an older
5.10.194-cip39-rt16 kernel (based on rt94 for 5.10). When was this
feature introduced to the -rt patches? Was it ever backported to 5.10-rt
or other -rt versions?

Jan

-- 
Siemens AG, Foundational Technologies
Linux Expert Center

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v3 3/3] softirq: Use a dedicated thread for timer wakeups on PREEMPT_RT.
  2025-12-01 21:51   ` Jan Kiszka
@ 2025-12-01 23:49     ` Luis Claudio R. Goncalves
  2025-12-02  6:57       ` Jan Kiszka
  2025-12-02  8:24     ` Sebastian Andrzej Siewior
  1 sibling, 1 reply; 11+ messages in thread
From: Luis Claudio R. Goncalves @ 2025-12-01 23:49 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: Sebastian Andrzej Siewior, linux-kernel, rcu, stable-rt,
	Paul E. McKenney, Anna-Maria Behnsen, Davidlohr Bueso,
	Frederic Weisbecker, Ingo Molnar, Josh Triplett, Thomas Gleixner,
	Florian Bezdeka, Pavel Machek

On Mon, Dec 01, 2025 at 10:51:50PM +0100, Jan Kiszka wrote:
> On 06.11.24 15:51, Sebastian Andrzej Siewior wrote:
> > A timer/ hrtimer softirq is raised in-IRQ context. With threaded
> > interrupts enabled or on PREEMPT_RT this leads to waking the ksoftirqd
> > for the processing of the softirq. ksoftirqd runs as SCHED_OTHER which
> > means it will compete with other tasks for CPU ressources.
> > This can introduce long delays for timer processing on heavy loaded
> > systems and is not desired.
> > 
> > Split the TIMER_SOFTIRQ and HRTIMER_SOFTIRQ processing into a dedicated
> > timers thread and let it run at the lowest SCHED_FIFO priority.
> > Wake-ups for RT tasks happen from hardirq context so only timer_list timers
> > and hrtimers for "regular" tasks are processed here. The higher priority
> > ensures that wakeups are performed before scheduling SCHED_OTHER tasks.
> > 
> > Using a dedicated variable to store the pending softirq bits values
> > ensure that the timer are not accidentally picked up by ksoftirqd and
> > other threaded interrupts.
> > It shouldn't be picked up by ksoftirqd since it runs at lower priority.
> > However if ksoftirqd is already running while a timer fires, then
> > ksoftird will be PI-boosted due to the BH-lock to ktimer's priority.
> > Ideally we try to avoid having ksoftirqd running.
> > 
> > The timer thread can pick up pending softirqs from ksoftirqd but only
> > if the softirq load is high. It is not be desired that the picked up
> > softirqs are processed at SCHED_FIFO priority under high softirq load
> > but this can already happen by a PI-boost by a force-threaded interrupt.
> > 
> > [ frederic@kernel.org: rcutorture.c fixes, storm fix by introduction of
> >   local_timers_pending() for tick_nohz_next_event() ]
> > 
> > [ junxiao.chang@intel.com: Ensure ktimersd gets woken up even if a
> >   softirq is currently served. ]
> > 
> > Reviewed-by: Paul E. McKenney <paulmck@kernel.org> [rcutorture]
> > Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
> > Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> 
> This went into 6.13 and was never backported to 6.12-lts. And that is
> why you can easily stall the latter with a workload like this and
> CONFIG_PREEMPT_RT enabled:
> 
> echo "+cpu" >> /sys/fs/cgroup/cgroup.subtree_control
> echo "+cpuset" >> /sys/fs/cgroup/cgroup.subtree_control
> 
> mkdir /sys/fs/cgroup/stalltest.sub1
> mkdir /sys/fs/cgroup/stalltest.sub2
> sleep 10000000 &
> pid=$!
> 
> systemd-run --slice "stalltest.slice" taskset -c 0 sh -c " \
>     while true; do
>         echo $pid > /sys/fs/cgroup/stalltest.sub1/cgroup.procs;
>         echo $pid > /sys/fs/cgroup/stalltest.sub2/cgroup.procs;
>     done"
> 
> echo "1000 20000" > /sys/fs/cgroup/stalltest.slice/cpu.max
> 
> This triggers a lock-up if a holder of cgroup_file_kn_lock with
> SCHED_OTHER is scheduled out after using up its timeslice and then
> cgroup_file_notify_timer fires over a SCHED_OTHER context as well,
> trying to get this lock, failing and then never being able to reactivate
> the lock holder again as well.
> 
> I've nicely reproduced this with upstream 6.12.58 while Debian's lastest
> 6.12-rt does not trigger because it additionally has the downstream -rt
> patches on board.
> 
> How should we handle this? Consider 6.12 mainline with -rt and cgroups
> as potentially broken, asking people to user 6.12-rt? Or port this back?
> 
> BTW, the original report of this issue came from an older
> 5.10.194-cip39-rt16 kernel (based on rt94 for 5.10). When was this
> feature introduced to the -rt patches? Was it ever backported to 5.10-rt
> or other -rt versions?

Hi Jan!

I failed to locate the original discussion (from v5.10-rt) as the V1 of this
patchset is a new thread. Anyway, you are correct, the commit below (and the
other two changes from the series) are not present in v5.10-rt.

AFAICT commit 49a17639508c ("softirq: Use a dedicated thread for timer wakeups
on PREEMPT_RT.") was merged initially to v6.13-rc1, it was never exclusive
to the RT tree.

Luis

> Jan
> 
> -- 
> Siemens AG, Foundational Technologies
> Linux Expert Center
> 
---end quoted text---


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v3 3/3] softirq: Use a dedicated thread for timer wakeups on PREEMPT_RT.
  2025-12-01 23:49     ` Luis Claudio R. Goncalves
@ 2025-12-02  6:57       ` Jan Kiszka
  2025-12-02  8:22         ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 11+ messages in thread
From: Jan Kiszka @ 2025-12-02  6:57 UTC (permalink / raw)
  To: Luis Claudio R. Goncalves
  Cc: Sebastian Andrzej Siewior, linux-kernel, rcu, stable-rt,
	Paul E. McKenney, Anna-Maria Behnsen, Davidlohr Bueso,
	Frederic Weisbecker, Ingo Molnar, Josh Triplett, Thomas Gleixner,
	Florian Bezdeka, Pavel Machek

On 02.12.25 00:49, Luis Claudio R. Goncalves wrote:
> On Mon, Dec 01, 2025 at 10:51:50PM +0100, Jan Kiszka wrote:
>> On 06.11.24 15:51, Sebastian Andrzej Siewior wrote:
>>> A timer/ hrtimer softirq is raised in-IRQ context. With threaded
>>> interrupts enabled or on PREEMPT_RT this leads to waking the ksoftirqd
>>> for the processing of the softirq. ksoftirqd runs as SCHED_OTHER which
>>> means it will compete with other tasks for CPU ressources.
>>> This can introduce long delays for timer processing on heavy loaded
>>> systems and is not desired.
>>>
>>> Split the TIMER_SOFTIRQ and HRTIMER_SOFTIRQ processing into a dedicated
>>> timers thread and let it run at the lowest SCHED_FIFO priority.
>>> Wake-ups for RT tasks happen from hardirq context so only timer_list timers
>>> and hrtimers for "regular" tasks are processed here. The higher priority
>>> ensures that wakeups are performed before scheduling SCHED_OTHER tasks.
>>>
>>> Using a dedicated variable to store the pending softirq bits values
>>> ensure that the timer are not accidentally picked up by ksoftirqd and
>>> other threaded interrupts.
>>> It shouldn't be picked up by ksoftirqd since it runs at lower priority.
>>> However if ksoftirqd is already running while a timer fires, then
>>> ksoftird will be PI-boosted due to the BH-lock to ktimer's priority.
>>> Ideally we try to avoid having ksoftirqd running.
>>>
>>> The timer thread can pick up pending softirqs from ksoftirqd but only
>>> if the softirq load is high. It is not be desired that the picked up
>>> softirqs are processed at SCHED_FIFO priority under high softirq load
>>> but this can already happen by a PI-boost by a force-threaded interrupt.
>>>
>>> [ frederic@kernel.org: rcutorture.c fixes, storm fix by introduction of
>>>   local_timers_pending() for tick_nohz_next_event() ]
>>>
>>> [ junxiao.chang@intel.com: Ensure ktimersd gets woken up even if a
>>>   softirq is currently served. ]
>>>
>>> Reviewed-by: Paul E. McKenney <paulmck@kernel.org> [rcutorture]
>>> Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
>>> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
>>
>> This went into 6.13 and was never backported to 6.12-lts. And that is
>> why you can easily stall the latter with a workload like this and
>> CONFIG_PREEMPT_RT enabled:
>>
>> echo "+cpu" >> /sys/fs/cgroup/cgroup.subtree_control
>> echo "+cpuset" >> /sys/fs/cgroup/cgroup.subtree_control
>>
>> mkdir /sys/fs/cgroup/stalltest.sub1
>> mkdir /sys/fs/cgroup/stalltest.sub2
>> sleep 10000000 &
>> pid=$!
>>
>> systemd-run --slice "stalltest.slice" taskset -c 0 sh -c " \
>>     while true; do
>>         echo $pid > /sys/fs/cgroup/stalltest.sub1/cgroup.procs;
>>         echo $pid > /sys/fs/cgroup/stalltest.sub2/cgroup.procs;
>>     done"
>>
>> echo "1000 20000" > /sys/fs/cgroup/stalltest.slice/cpu.max
>>
>> This triggers a lock-up if a holder of cgroup_file_kn_lock with
>> SCHED_OTHER is scheduled out after using up its timeslice and then
>> cgroup_file_notify_timer fires over a SCHED_OTHER context as well,
>> trying to get this lock, failing and then never being able to reactivate
>> the lock holder again as well.
>>
>> I've nicely reproduced this with upstream 6.12.58 while Debian's lastest
>> 6.12-rt does not trigger because it additionally has the downstream -rt
>> patches on board.
>>
>> How should we handle this? Consider 6.12 mainline with -rt and cgroups
>> as potentially broken, asking people to user 6.12-rt? Or port this back?
>>
>> BTW, the original report of this issue came from an older
>> 5.10.194-cip39-rt16 kernel (based on rt94 for 5.10). When was this
>> feature introduced to the -rt patches? Was it ever backported to 5.10-rt
>> or other -rt versions?
> 
> Hi Jan!
> 
> I failed to locate the original discussion (from v5.10-rt) as the V1 of this
> patchset is a new thread. Anyway, you are correct, the commit below (and the
> other two changes from the series) are not present in v5.10-rt.
> 
> AFAICT commit 49a17639508c ("softirq: Use a dedicated thread for timer wakeups
> on PREEMPT_RT.") was merged initially to v6.13-rc1, it was never exclusive
> to the RT tree.

So, we have this right now in the 6.12-rt and 6.1-rt trees. The patch
(b12e35832805) that enabled the lockup above was added to 4.18. In
4.19-rt (still under maintenance via 4.19-cip-rt), we had ktimersoftd -
was that addressing the issue as well? Or could timers have expired back
then also outside of that thread, thus potentially without SCHED_FIFO prio?

Thanks,
Jan

-- 
Siemens AG, Foundational Technologies
Linux Expert Center

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v3 3/3] softirq: Use a dedicated thread for timer wakeups on PREEMPT_RT.
  2025-12-02  6:57       ` Jan Kiszka
@ 2025-12-02  8:22         ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 11+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-12-02  8:22 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: Luis Claudio R. Goncalves, linux-kernel, rcu, stable-rt,
	Paul E. McKenney, Anna-Maria Behnsen, Davidlohr Bueso,
	Frederic Weisbecker, Ingo Molnar, Josh Triplett, Thomas Gleixner,
	Florian Bezdeka, Pavel Machek

On 2025-12-02 07:57:04 [+0100], Jan Kiszka wrote:
> So, we have this right now in the 6.12-rt and 6.1-rt trees. The patch
> (b12e35832805) that enabled the lockup above was added to 4.18. In
> 4.19-rt (still under maintenance via 4.19-cip-rt), we had ktimersoftd -
> was that addressing the issue as well? Or could timers have expired back
> then also outside of that thread, thus potentially without SCHED_FIFO prio?

Looking at v4.19.25-rt16, there is ktimersoftd/ via "softirq: split
timer softirqs out of ksoftirqd". This became later ktimers/.
So yes, the timer should be expired with a SCHED_FIFO priority.

You are most likely aware of "Defer throttle when task exits to user"
	https://lore.kernel.org/all/20250829081120.806-1-ziqianlu@bytedance.com/

> Thanks,
> Jan

Sebastian

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v3 3/3] softirq: Use a dedicated thread for timer wakeups on PREEMPT_RT.
  2025-12-01 21:51   ` Jan Kiszka
  2025-12-01 23:49     ` Luis Claudio R. Goncalves
@ 2025-12-02  8:24     ` Sebastian Andrzej Siewior
  2025-12-02 12:39       ` Jan Kiszka
  1 sibling, 1 reply; 11+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-12-02  8:24 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: linux-kernel, rcu, stable-rt, Paul E. McKenney,
	Anna-Maria Behnsen, Davidlohr Bueso, Frederic Weisbecker,
	Ingo Molnar, Josh Triplett, Thomas Gleixner, Florian Bezdeka,
	Pavel Machek

On 2025-12-01 22:51:50 [+0100], Jan Kiszka wrote:
> How should we handle this? Consider 6.12 mainline with -rt and cgroups
> as potentially broken, asking people to user 6.12-rt? Or port this back?

If you have everything in v6.12 for an useable RT system and this is the
only missing piece I could ask the stable nicely to backport this.

> Jan
> 
Sebastian

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v3 3/3] softirq: Use a dedicated thread for timer wakeups on PREEMPT_RT.
  2025-12-02  8:24     ` Sebastian Andrzej Siewior
@ 2025-12-02 12:39       ` Jan Kiszka
  2025-12-02 13:22         ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 11+ messages in thread
From: Jan Kiszka @ 2025-12-02 12:39 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-kernel, rcu, stable-rt, Paul E. McKenney,
	Anna-Maria Behnsen, Davidlohr Bueso, Frederic Weisbecker,
	Ingo Molnar, Josh Triplett, Thomas Gleixner, Florian Bezdeka,
	Pavel Machek

On 02.12.25 09:24, Sebastian Andrzej Siewior wrote:
> On 2025-12-01 22:51:50 [+0100], Jan Kiszka wrote:
>> How should we handle this? Consider 6.12 mainline with -rt and cgroups
>> as potentially broken, asking people to user 6.12-rt? Or port this back?
> 
> If you have everything in v6.12 for an useable RT system and this is the
> only missing piece I could ask the stable nicely to backport this.
> 

Given that this is a fix for potential lock-up... Does it have
dependencies? The other two patches in this series are optimizations
only, or should they better join the backport?

Jan

-- 
Siemens AG, Foundational Technologies
Linux Expert Center

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v3 3/3] softirq: Use a dedicated thread for timer wakeups on PREEMPT_RT.
  2025-12-02 12:39       ` Jan Kiszka
@ 2025-12-02 13:22         ` Sebastian Andrzej Siewior
  0 siblings, 0 replies; 11+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-12-02 13:22 UTC (permalink / raw)
  To: Jan Kiszka
  Cc: linux-kernel, rcu, stable-rt, Paul E. McKenney,
	Anna-Maria Behnsen, Davidlohr Bueso, Frederic Weisbecker,
	Ingo Molnar, Josh Triplett, Thomas Gleixner, Florian Bezdeka,
	Pavel Machek

On 2025-12-02 13:39:56 [+0100], Jan Kiszka wrote:
> On 02.12.25 09:24, Sebastian Andrzej Siewior wrote:
> > On 2025-12-01 22:51:50 [+0100], Jan Kiszka wrote:
> >> How should we handle this? Consider 6.12 mainline with -rt and cgroups
> >> as potentially broken, asking people to user 6.12-rt? Or port this back?
> > 
> > If you have everything in v6.12 for an useable RT system and this is the
> > only missing piece I could ask the stable nicely to backport this.
> > 
> 
> Given that this is a fix for potential lock-up... Does it have
> dependencies? The other two patches in this series are optimizations
> only, or should they better join the backport?

Just an optimisation if I am not mistaken but it will conflict if not
backported.

> Jan

Sebastian

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2025-12-02 13:22 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-11-06 14:51 PATCH v3 0/3] softirq: Use a dedicated thread for timer wakeups with forced-threading Sebastian Andrzej Siewior
2024-11-06 14:51 ` [PATCH v3 1/3] hrtimer: Use __raise_softirq_irqoff() to raise the softirq Sebastian Andrzej Siewior
2024-11-06 14:51 ` [PATCH v3 2/3] timers: " Sebastian Andrzej Siewior
2024-11-06 14:51 ` [PATCH v3 3/3] softirq: Use a dedicated thread for timer wakeups on PREEMPT_RT Sebastian Andrzej Siewior
2025-12-01 21:51   ` Jan Kiszka
2025-12-01 23:49     ` Luis Claudio R. Goncalves
2025-12-02  6:57       ` Jan Kiszka
2025-12-02  8:22         ` Sebastian Andrzej Siewior
2025-12-02  8:24     ` Sebastian Andrzej Siewior
2025-12-02 12:39       ` Jan Kiszka
2025-12-02 13:22         ` Sebastian Andrzej Siewior

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox