* [PATCH v2 0/3] Allow to drop the softirq-BKL lock on PREEMPT_RT
@ 2025-09-01 16:38 Sebastian Andrzej Siewior
  2025-09-01 16:38 ` [PATCH v2 1/3] workqueue: Provide a handshake for canceling BH workers Sebastian Andrzej Siewior
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-09-01 16:38 UTC (permalink / raw)
  To: linux-rt-devel, linux-kernel
  Cc: Clark Williams, Ingo Molnar, Lai Jiangshan, Peter Zijlstra,
	Steven Rostedt, Tejun Heo, Thomas Gleixner,
	Sebastian Andrzej Siewior

Users that rely on synchronisation within a BH-disabled section, such as
- access to per-CPU data
- per-CPU timers
- synchronisation against another user within a BH-disabled section

depend on the local_lock_t acquired in local_bh_disable() on PREEMPT_RT.
Almost all users have dropped this dependency. The remaining
(identified) user in networking (pipapo) is addressed in net-next.
What is left is the infrastructure itself, i.e. the tasklet and
workqueue (BH worker) code.
Both are covered by this series. The last patch in the series adds an
option to drop the lock.
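
To illustrate the first kind of dependency, such a user typically looked
roughly like this before the conversions (a made-up sketch, not one of
the actual call sites):

struct pcpu_budget {
	u64	used;
	u64	limit;
};
static DEFINE_PER_CPU(struct pcpu_budget, pcpu_budget);

static bool charge_budget(u64 amount)
{
	struct pcpu_budget *b;
	bool ok;

	/*
	 * The read-modify-write below is only serialised against softirq
	 * handlers and other BH-disabled sections on this CPU because, on
	 * PREEMPT_RT, local_bh_disable() takes the per-CPU softirq lock.
	 */
	local_bh_disable();
	b = this_cpu_ptr(&pcpu_budget);
	ok = b->used + amount <= b->limit;
	if (ok)
		b->used += amount;
	local_bh_enable();
	return ok;
}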

v1 tasklet https://lore.kernel.org/all/20250812143930.22RBn5BW@linutronix.de
v1 workqueue https://lore.kernel.org/all/20250820103657.vDuDuLx6@linutronix.de
v1 lock-drop https://lore.kernel.org/all/20250613105653.1860729-2-bigeasy@linutronix.de

Sebastian Andrzej Siewior (3):
  workqueue: Provide a handshake for canceling BH workers
  softirq: Provide a handshake for canceling tasklets via polling
  softirq: Allow to drop the softirq-BKL lock on PREEMPT_RT

 kernel/Kconfig.preempt |  13 ++++
 kernel/softirq.c       | 145 ++++++++++++++++++++++++++++++++++-------
 kernel/workqueue.c     |  51 ++++++++++++---
 3 files changed, 175 insertions(+), 34 deletions(-)

-- 
2.51.0



* [PATCH v2 1/3] workqueue: Provide a handshake for canceling BH workers
  2025-09-01 16:38 [PATCH v2 0/3] Allow to drop the softirq-BKL lock on PREEMPT_RT Sebastian Andrzej Siewior
@ 2025-09-01 16:38 ` Sebastian Andrzej Siewior
  2025-09-02 10:12   ` Lai Jiangshan
  2025-09-03  7:30   ` Lai Jiangshan
  2025-09-01 16:38 ` [PATCH v2 2/3] softirq: Provide a handshake for canceling tasklets via polling Sebastian Andrzej Siewior
  2025-09-01 16:38 ` [PATCH v2 3/3] softirq: Allow to drop the softirq-BKL lock on PREEMPT_RT Sebastian Andrzej Siewior
  2 siblings, 2 replies; 11+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-09-01 16:38 UTC (permalink / raw)
  To: linux-rt-devel, linux-kernel
  Cc: Clark Williams, Ingo Molnar, Lai Jiangshan, Peter Zijlstra,
	Steven Rostedt, Tejun Heo, Thomas Gleixner,
	Sebastian Andrzej Siewior

While a BH work item is canceled, the core code spins until it
determines that the item completed. On PREEMPT_RT the spinning relies on
a lock in local_bh_disable() to avoid a live lock if the canceling
thread has higher priority than the BH-worker and preempts it. This lock
ensures that the BH-worker makes progress by PI-boosting it.

This lock in local_bh_disable() is a central per-CPU BKL and about to be
removed.

To provide the required synchronisation add a per-pool lock. The lock is
acquired by the bh_worker at the beginning and held while the individual
callbacks are invoked. To enforce progress in case of interruption,
__flush_work() needs to acquire the lock.
This will flush all BH-work items assigned to that pool.
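
For context, the path affected is the one taken when a BH work item is
canceled synchronously, e.g. (hypothetical work item queued on the
system BH workqueue):

	static void my_bh_cb(struct work_struct *work);
	static DECLARE_WORK(my_bh_work, my_bh_cb);

	queue_work(system_bh_wq, &my_bh_work);
	...
	/* may end up in the polling loop reworked below */
	cancel_work_sync(&my_bh_work);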

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/workqueue.c | 51 ++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 42 insertions(+), 9 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index c6b79b3675c31..94e226f637992 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -222,7 +222,9 @@ struct worker_pool {
 	struct workqueue_attrs	*attrs;		/* I: worker attributes */
 	struct hlist_node	hash_node;	/* PL: unbound_pool_hash node */
 	int			refcnt;		/* PL: refcnt for unbound pools */
-
+#ifdef CONFIG_PREEMPT_RT
+	spinlock_t		cb_lock;	/* BH worker cancel lock */
+#endif
 	/*
 	 * Destruction of pool is RCU protected to allow dereferences
 	 * from get_work_pool().
@@ -3078,6 +3080,31 @@ __acquires(&pool->lock)
 		goto restart;
 }
 
+#ifdef CONFIG_PREEMPT_RT
+static void worker_lock_callback(struct worker_pool *pool)
+{
+	spin_lock(&pool->cb_lock);
+}
+
+static void worker_unlock_callback(struct worker_pool *pool)
+{
+	spin_unlock(&pool->cb_lock);
+}
+
+static void workqueue_callback_cancel_wait_running(struct worker_pool *pool)
+{
+	spin_lock(&pool->cb_lock);
+	spin_unlock(&pool->cb_lock);
+}
+
+#else
+
+static void worker_lock_callback(struct worker_pool *pool) { }
+static void worker_unlock_callback(struct worker_pool *pool) { }
+static void workqueue_callback_cancel_wait_running(struct worker_pool *pool) { }
+
+#endif
+
 /**
  * manage_workers - manage worker pool
  * @worker: self
@@ -3557,6 +3584,7 @@ static void bh_worker(struct worker *worker)
 	int nr_restarts = BH_WORKER_RESTARTS;
 	unsigned long end = jiffies + BH_WORKER_JIFFIES;
 
+	worker_lock_callback(pool);
 	raw_spin_lock_irq(&pool->lock);
 	worker_leave_idle(worker);
 
@@ -3585,6 +3613,7 @@ static void bh_worker(struct worker *worker)
 	worker_enter_idle(worker);
 	kick_pool(pool);
 	raw_spin_unlock_irq(&pool->lock);
+	worker_unlock_callback(pool);
 }
 
 /*
@@ -4222,17 +4251,18 @@ static bool __flush_work(struct work_struct *work, bool from_cancel)
 		    (data & WORK_OFFQ_BH)) {
 			/*
 			 * On RT, prevent a live lock when %current preempted
-			 * soft interrupt processing or prevents ksoftirqd from
-			 * running by keeping flipping BH. If the BH work item
-			 * runs on a different CPU then this has no effect other
-			 * than doing the BH disable/enable dance for nothing.
-			 * This is copied from
-			 * kernel/softirq.c::tasklet_unlock_spin_wait().
+			 * soft interrupt processing by blocking on lock which
+			 * is owned by the thread invoking the callback.
 			 */
 			while (!try_wait_for_completion(&barr.done)) {
 				if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
-					local_bh_disable();
-					local_bh_enable();
+					struct worker_pool *pool;
+
+					mutex_lock(&wq_pool_mutex);
+					pool = get_work_pool(work);
+					if (pool)
+						workqueue_callback_cancel_wait_running(pool);
+					mutex_unlock(&wq_pool_mutex);
 				} else {
 					cpu_relax();
 				}
@@ -4782,6 +4812,9 @@ static int init_worker_pool(struct worker_pool *pool)
 	ida_init(&pool->worker_ida);
 	INIT_HLIST_NODE(&pool->hash_node);
 	pool->refcnt = 1;
+#ifdef CONFIG_PREEMPT_RT
+	spin_lock_init(&pool->cb_lock);
+#endif
 
 	/* shouldn't fail above this point */
 	pool->attrs = alloc_workqueue_attrs();
-- 
2.51.0



* [PATCH v2 2/3] softirq: Provide a handshake for canceling tasklets via polling
  2025-09-01 16:38 [PATCH v2 0/3] Allow to drop the softirq-BKL lock on PREEMPT_RT Sebastian Andrzej Siewior
  2025-09-01 16:38 ` [PATCH v2 1/3] workqueue: Provide a handshake for canceling BH workers Sebastian Andrzej Siewior
@ 2025-09-01 16:38 ` Sebastian Andrzej Siewior
  2025-09-01 16:38 ` [PATCH v2 3/3] softirq: Allow to drop the softirq-BKL lock on PREEMPT_RT Sebastian Andrzej Siewior
  2 siblings, 0 replies; 11+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-09-01 16:38 UTC (permalink / raw)
  To: linux-rt-devel, linux-kernel
  Cc: Clark Williams, Ingo Molnar, Lai Jiangshan, Peter Zijlstra,
	Steven Rostedt, Tejun Heo, Thomas Gleixner,
	Sebastian Andrzej Siewior

tasklet_unlock_spin_wait(), used via tasklet_disable_in_atomic(), is
provided for a few legacy tasklet users. On non-PREEMPT_RT the interface
is used from atomic context (either softirq or preemption disabled) and
relies on spinning until the tasklet callback completes.
On PREEMPT_RT the context is never atomic but the busy-polling logic
remains. It is possible that the thread invoking
tasklet_unlock_spin_wait() has a higher priority than the tasklet. If
both run on the same CPU the tasklet makes no progress and the thread
trying to cancel the tasklet will live-lock the system.
To avoid the lockup tasklet_unlock_spin_wait() uses local_bh_disable()/
enable(), which utilizes the local_lock_t for synchronisation. This lock
is a central per-CPU BKL and about to be removed.

Acquire a lock in tasklet_action_common() which is held while the
tasklet's callback is invoked. This lock is acquired from
tasklet_unlock_spin_wait() via tasklet_callback_cancel_wait_running().
After the tasklet has completed, tasklet_callback_sync_wait_running()
drops the lock and acquires it again. To avoid dropping and re-acquiring
the lock when there is no cancel request, a cb_waiters counter is
incremented while a cancel request is pending.
Blocking on the lock will PI-boost the tasklet if needed, ensuring
progress is made.
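
For reference, the legacy usage that ends up in this path looks roughly
like this (hypothetical driver fields):

	/* from atomic context, e.g. with a raw spinlock held */
	tasklet_disable_in_atomic(&adapter->rx_tasklet);
	/* ... quiesce the source of the tasklet ... */
	tasklet_enable(&adapter->rx_tasklet);

On PREEMPT_RT the wait inside tasklet_disable_in_atomic() now blocks on
the per-CPU cb_lock while the tasklet is running and thereby PI-boosts
the thread invoking the tasklet instead of flipping BH on and off.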

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/softirq.c | 62 ++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 57 insertions(+), 5 deletions(-)

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 513b1945987cc..4e2c980e7712e 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -805,6 +805,58 @@ static bool tasklet_clear_sched(struct tasklet_struct *t)
 	return false;
 }
 
+#ifdef CONFIG_PREEMPT_RT
+struct tasklet_sync_callback {
+	spinlock_t	cb_lock;
+	atomic_t	cb_waiters;
+};
+
+static DEFINE_PER_CPU(struct tasklet_sync_callback, tasklet_sync_callback) = {
+	.cb_lock	= __SPIN_LOCK_UNLOCKED(tasklet_sync_callback.cb_lock),
+	.cb_waiters	= ATOMIC_INIT(0),
+};
+
+static void tasklet_lock_callback(void)
+{
+	spin_lock(this_cpu_ptr(&tasklet_sync_callback.cb_lock));
+}
+
+static void tasklet_unlock_callback(void)
+{
+	spin_unlock(this_cpu_ptr(&tasklet_sync_callback.cb_lock));
+}
+
+static void tasklet_callback_cancel_wait_running(void)
+{
+	struct tasklet_sync_callback *sync_cb = this_cpu_ptr(&tasklet_sync_callback);
+
+	atomic_inc(&sync_cb->cb_waiters);
+	spin_lock(&sync_cb->cb_lock);
+	atomic_dec(&sync_cb->cb_waiters);
+	spin_unlock(&sync_cb->cb_lock);
+}
+
+static void tasklet_callback_sync_wait_running(void)
+{
+	struct tasklet_sync_callback *sync_cb = this_cpu_ptr(&tasklet_sync_callback);
+
+	if (atomic_read(&sync_cb->cb_waiters)) {
+		spin_unlock(&sync_cb->cb_lock);
+		spin_lock(&sync_cb->cb_lock);
+	}
+}
+
+#else /* !CONFIG_PREEMPT_RT: */
+
+static void tasklet_lock_callback(void) { }
+static void tasklet_unlock_callback(void) { }
+static void tasklet_callback_sync_wait_running(void) { }
+
+#ifdef CONFIG_SMP
+static void tasklet_callback_cancel_wait_running(void) { }
+#endif
+#endif /* !CONFIG_PREEMPT_RT */
+
 static void tasklet_action_common(struct tasklet_head *tl_head,
 				  unsigned int softirq_nr)
 {
@@ -816,6 +868,7 @@ static void tasklet_action_common(struct tasklet_head *tl_head,
 	tl_head->tail = &tl_head->head;
 	local_irq_enable();
 
+	tasklet_lock_callback();
 	while (list) {
 		struct tasklet_struct *t = list;
 
@@ -835,6 +888,7 @@ static void tasklet_action_common(struct tasklet_head *tl_head,
 					}
 				}
 				tasklet_unlock(t);
+				tasklet_callback_sync_wait_running();
 				continue;
 			}
 			tasklet_unlock(t);
@@ -847,6 +901,7 @@ static void tasklet_action_common(struct tasklet_head *tl_head,
 		__raise_softirq_irqoff(softirq_nr);
 		local_irq_enable();
 	}
+	tasklet_unlock_callback();
 }
 
 static __latent_entropy void tasklet_action(void)
@@ -897,12 +952,9 @@ void tasklet_unlock_spin_wait(struct tasklet_struct *t)
 			/*
 			 * Prevent a live lock when current preempted soft
 			 * interrupt processing or prevents ksoftirqd from
-			 * running. If the tasklet runs on a different CPU
-			 * then this has no effect other than doing the BH
-			 * disable/enable dance for nothing.
+			 * running.
 			 */
-			local_bh_disable();
-			local_bh_enable();
+			tasklet_callback_cancel_wait_running();
 		} else {
 			cpu_relax();
 		}
-- 
2.51.0



* [PATCH v2 3/3] softirq: Allow to drop the softirq-BKL lock on PREEMPT_RT
  2025-09-01 16:38 [PATCH v2 0/3] Allow to drop the softirq-BKL lock on PREEMPT_RT Sebastian Andrzej Siewior
  2025-09-01 16:38 ` [PATCH v2 1/3] workqueue: Provide a handshake for canceling BH workers Sebastian Andrzej Siewior
  2025-09-01 16:38 ` [PATCH v2 2/3] softirq: Provide a handshake for canceling tasklets via polling Sebastian Andrzej Siewior
@ 2025-09-01 16:38 ` Sebastian Andrzej Siewior
  2 siblings, 0 replies; 11+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-09-01 16:38 UTC (permalink / raw)
  To: linux-rt-devel, linux-kernel
  Cc: Clark Williams, Ingo Molnar, Lai Jiangshan, Peter Zijlstra,
	Steven Rostedt, Tejun Heo, Thomas Gleixner,
	Sebastian Andrzej Siewior

Softirqs are preemptible on PREEMPT_RT. There is synchronisation between
individual sections which disable bottom halves. This in turn means that
a forced threaded interrupt cannot preempt another forced threaded
interrupt. Instead it will PI-boost the other handler and wait for its
completion.

This is required because code within a softirq section is assumed to be
non-preemptible and may expect exclusive access to per-CPU resources
such as variables or pinned timers.
Code with such expectation has been identified and updated to use
local_lock_nested_bh() for locking of the per-CPU resource. This means
the lock can be removed.
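
Such a conversion roughly follows this pattern (a made-up sketch with
hypothetical names, not one of the actual call sites):

struct pcpu_budget {
	local_lock_t	bh_lock;
	u64		used;
	u64		limit;
};
static DEFINE_PER_CPU(struct pcpu_budget, pcpu_budget) = {
	.bh_lock = INIT_LOCAL_LOCK(bh_lock),
};

static bool charge_budget(u64 amount)
{
	struct pcpu_budget *b;
	bool ok;

	local_bh_disable();
	/* explicit per-CPU scope instead of the softirq-wide lock */
	local_lock_nested_bh(&pcpu_budget.bh_lock);
	b = this_cpu_ptr(&pcpu_budget);
	ok = b->used + amount <= b->limit;
	if (ok)
		b->used += amount;
	local_unlock_nested_bh(&pcpu_budget.bh_lock);
	local_bh_enable();
	return ok;
}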

Add CONFIG_PREEMPT_RT_NEEDS_BH_LOCK which keeps the old behaviour if
selected. Otherwise the softirq synchronisation is lifted. The
softirq_ctrl.cnt accounting remains so that the NOHZ code can tell
whether softirqs are currently being handled.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 kernel/Kconfig.preempt | 13 +++++++
 kernel/softirq.c       | 83 ++++++++++++++++++++++++++++++++----------
 2 files changed, 76 insertions(+), 20 deletions(-)

diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
index 54ea59ff8fbeb..da326800c1c9b 100644
--- a/kernel/Kconfig.preempt
+++ b/kernel/Kconfig.preempt
@@ -103,6 +103,19 @@ config PREEMPT_RT
 	  Select this if you are building a kernel for systems which
 	  require real-time guarantees.
 
+config PREEMPT_RT_NEEDS_BH_LOCK
+	bool "Enforce softirq synchronisation on PREEMPT_RT"
+	depends on PREEMPT_RT
+	help
+	  Enforce synchronisation across the softirq context. On PREEMPT_RT
+	  softirqs are preemptible. This option enforces the same per-CPU
+	  BKL semantic that non-PREEMPT_RT builds have. It should not be
+	  needed because per-CPU locks were added to avoid the per-CPU BKL.
+
+	  This switch provides the old behaviour for testing purposes.
+	  Select it if you suspect an error with preemptible softirqs and
+	  want to test the old, synchronized behaviour.
+
 config PREEMPT_COUNT
        bool
 
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 4e2c980e7712e..77198911b8dd4 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -165,7 +165,11 @@ void __local_bh_disable_ip(unsigned long ip, unsigned int cnt)
 	/* First entry of a task into a BH disabled section? */
 	if (!current->softirq_disable_cnt) {
 		if (preemptible()) {
-			local_lock(&softirq_ctrl.lock);
+			if (IS_ENABLED(CONFIG_PREEMPT_RT_NEEDS_BH_LOCK))
+				local_lock(&softirq_ctrl.lock);
+			else
+				migrate_disable();
+
 			/* Required to meet the RCU bottomhalf requirements. */
 			rcu_read_lock();
 		} else {
@@ -177,17 +181,34 @@ void __local_bh_disable_ip(unsigned long ip, unsigned int cnt)
 	 * Track the per CPU softirq disabled state. On RT this is per CPU
 	 * state to allow preemption of bottom half disabled sections.
 	 */
-	newcnt = __this_cpu_add_return(softirq_ctrl.cnt, cnt);
-	/*
-	 * Reflect the result in the task state to prevent recursion on the
-	 * local lock and to make softirq_count() & al work.
-	 */
-	current->softirq_disable_cnt = newcnt;
+	if (IS_ENABLED(CONFIG_PREEMPT_RT_NEEDS_BH_LOCK)) {
+		newcnt = this_cpu_add_return(softirq_ctrl.cnt, cnt);
+		/*
+		 * Reflect the result in the task state to prevent recursion on the
+		 * local lock and to make softirq_count() & al work.
+		 */
+		current->softirq_disable_cnt = newcnt;
 
-	if (IS_ENABLED(CONFIG_TRACE_IRQFLAGS) && newcnt == cnt) {
-		raw_local_irq_save(flags);
-		lockdep_softirqs_off(ip);
-		raw_local_irq_restore(flags);
+		if (IS_ENABLED(CONFIG_TRACE_IRQFLAGS) && newcnt == cnt) {
+			raw_local_irq_save(flags);
+			lockdep_softirqs_off(ip);
+			raw_local_irq_restore(flags);
+		}
+	} else {
+		bool sirq_dis = false;
+
+		if (!current->softirq_disable_cnt)
+			sirq_dis = true;
+
+		this_cpu_add(softirq_ctrl.cnt, cnt);
+		current->softirq_disable_cnt += cnt;
+		WARN_ON_ONCE(current->softirq_disable_cnt < 0);
+
+		if (IS_ENABLED(CONFIG_TRACE_IRQFLAGS) && sirq_dis) {
+			raw_local_irq_save(flags);
+			lockdep_softirqs_off(ip);
+			raw_local_irq_restore(flags);
+		}
 	}
 }
 EXPORT_SYMBOL(__local_bh_disable_ip);
@@ -195,23 +216,42 @@ EXPORT_SYMBOL(__local_bh_disable_ip);
 static void __local_bh_enable(unsigned int cnt, bool unlock)
 {
 	unsigned long flags;
+	bool sirq_en = false;
 	int newcnt;
 
-	DEBUG_LOCKS_WARN_ON(current->softirq_disable_cnt !=
-			    this_cpu_read(softirq_ctrl.cnt));
+	if (IS_ENABLED(CONFIG_PREEMPT_RT_NEEDS_BH_LOCK)) {
+		DEBUG_LOCKS_WARN_ON(current->softirq_disable_cnt !=
+				    this_cpu_read(softirq_ctrl.cnt));
+		if (softirq_count() == cnt)
+			sirq_en = true;
+	} else {
+		if (current->softirq_disable_cnt == cnt)
+			sirq_en = true;
+	}
 
-	if (IS_ENABLED(CONFIG_TRACE_IRQFLAGS) && softirq_count() == cnt) {
+	if (IS_ENABLED(CONFIG_TRACE_IRQFLAGS) && sirq_en) {
 		raw_local_irq_save(flags);
 		lockdep_softirqs_on(_RET_IP_);
 		raw_local_irq_restore(flags);
 	}
 
-	newcnt = __this_cpu_sub_return(softirq_ctrl.cnt, cnt);
-	current->softirq_disable_cnt = newcnt;
+	if (IS_ENABLED(CONFIG_PREEMPT_RT_NEEDS_BH_LOCK)) {
+		newcnt = this_cpu_sub_return(softirq_ctrl.cnt, cnt);
+		current->softirq_disable_cnt = newcnt;
 
-	if (!newcnt && unlock) {
-		rcu_read_unlock();
-		local_unlock(&softirq_ctrl.lock);
+		if (!newcnt && unlock) {
+			rcu_read_unlock();
+			local_unlock(&softirq_ctrl.lock);
+		}
+	} else {
+		current->softirq_disable_cnt -= cnt;
+		this_cpu_sub(softirq_ctrl.cnt, cnt);
+		if (unlock && !current->softirq_disable_cnt) {
+			migrate_enable();
+			rcu_read_unlock();
+		} else {
+			WARN_ON_ONCE(current->softirq_disable_cnt < 0);
+		}
 	}
 }
 
@@ -228,7 +268,10 @@ void __local_bh_enable_ip(unsigned long ip, unsigned int cnt)
 	lock_map_release(&bh_lock_map);
 
 	local_irq_save(flags);
-	curcnt = __this_cpu_read(softirq_ctrl.cnt);
+	if (IS_ENABLED(CONFIG_PREEMPT_RT_NEEDS_BH_LOCK))
+		curcnt = this_cpu_read(softirq_ctrl.cnt);
+	else
+		curcnt = current->softirq_disable_cnt;
 
 	/*
 	 * If this is not reenabling soft interrupts, no point in trying to
-- 
2.51.0



* Re: [PATCH v2 1/3] workqueue: Provide a handshake for canceling BH workers
  2025-09-01 16:38 ` [PATCH v2 1/3] workqueue: Provide a handshake for canceling BH workers Sebastian Andrzej Siewior
@ 2025-09-02 10:12   ` Lai Jiangshan
  2025-09-02 11:17     ` Sebastian Andrzej Siewior
  2025-09-03  7:30   ` Lai Jiangshan
  1 sibling, 1 reply; 11+ messages in thread
From: Lai Jiangshan @ 2025-09-02 10:12 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-rt-devel, linux-kernel, Clark Williams, Ingo Molnar,
	Peter Zijlstra, Steven Rostedt, Tejun Heo, Thomas Gleixner

Hello

On Tue, Sep 2, 2025 at 12:38 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:
>
> While a BH work item is canceled, the core code spins until it
> determines that the item completed. On PREEMPT_RT the spinning relies on
> a lock in local_bh_disable() to avoid a live lock if the canceling
> thread has higher priority than the BH-worker and preempts it. This lock
> ensures that the BH-worker makes progress by PI-boosting it.
>
> This lock in local_bh_disable() is a central per-CPU BKL and about to be
> removed.
>
> To provide the required synchronisation add a per-pool lock. The lock is
> acquired by the bh_worker at the beginning and held while the individual
> callbacks are invoked. To enforce progress in case of interruption,
> __flush_work() needs to acquire the lock.
> This will flush all BH-work items assigned to that pool.
>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
>  kernel/workqueue.c | 51 ++++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 42 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index c6b79b3675c31..94e226f637992 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -222,7 +222,9 @@ struct worker_pool {
>         struct workqueue_attrs  *attrs;         /* I: worker attributes */
>         struct hlist_node       hash_node;      /* PL: unbound_pool_hash node */
>         int                     refcnt;         /* PL: refcnt for unbound pools */
> -
> +#ifdef CONFIG_PREEMPT_RT
> +       spinlock_t              cb_lock;        /* BH worker cancel lock */
> +#endif
>         /*

Is it possible to use rt_mutex_init_proxy_locked(), rt_mutex_proxy_unlock()
and rt_mutex_wait_proxy_lock()?

Or is it possible to add something like rt_spinlock_init_proxy_locked(),
rt_spinlock_proxy_unlock() and rt_spinlock_wait_proxy_lock() which work
the same as the rt_mutex's proxy lock primitives but for non-sleep context?

I think they will work as an rt variant of struct completion and
they can be used for __flush_work() for BH work on PREEMPT_RT in
the same way as wait_for_completion() is used for normal work.

Thanks
Lai


* Re: [PATCH v2 1/3] workqueue: Provide a handshake for canceling BH workers
  2025-09-02 10:12   ` Lai Jiangshan
@ 2025-09-02 11:17     ` Sebastian Andrzej Siewior
  2025-09-02 14:19       ` Lai Jiangshan
  0 siblings, 1 reply; 11+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-09-02 11:17 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-rt-devel, linux-kernel, Clark Williams, Ingo Molnar,
	Peter Zijlstra, Steven Rostedt, Tejun Heo, Thomas Gleixner

On 2025-09-02 18:12:10 [+0800], Lai Jiangshan wrote:
> Hello
Hi Lai,

> > --- a/kernel/workqueue.c
> > +++ b/kernel/workqueue.c
> > @@ -222,7 +222,9 @@ struct worker_pool {
> >         struct workqueue_attrs  *attrs;         /* I: worker attributes */
> >         struct hlist_node       hash_node;      /* PL: unbound_pool_hash node */
> >         int                     refcnt;         /* PL: refcnt for unbound pools */
> > -
> > +#ifdef CONFIG_PREEMPT_RT
> > +       spinlock_t              cb_lock;        /* BH worker cancel lock */
> > +#endif
> >         /*
> 
> Is it possible to use rt_mutex_init_proxy_locked(), rt_mutex_proxy_unlock()
> and rt_mutex_wait_proxy_lock()?
> 
> Or is it possible to add something like rt_spinlock_init_proxy_locked(),
> rt_spinlock_proxy_unlock() and rt_spinlock_wait_proxy_lock() which work
> the same as the rt_mutex's proxy lock primitives but for non-sleep context?

I don't think so. I think non-sleep context is the killer part. Those
are for PI and this works by assigning waiter's priority, going to sleep
until "it" is done. Now if you want non-sleep then you would have to
remain on the CPU and spin until the "work" is done. This spinning would
work if the other task is on a remote CPU. But if both are on the same
CPU then spinning is not working.

> I think they will work as an rt variant of struct completion and
> they can be used for __flush_work() for BH work on PREEMPT_RT in
> the same way as wait_for_completion() is used for normal work.

I completely dislike the idea of spinning until completion for the BH
work. Yes, we have three tasklet users which are doing this as of today.
We have no users which require this for workqueue, and you can schedule
a work item from interrupt context.
Why do we even want this? My idea was to submit this and then later hide
it behind an "atomic" API which will hopefully remain unused. As
explained in the previous thread: from RT's point of view it requires
acquiring and dropping the cb_lock each time and boosting all items on
that worker until the "barrier" work item arrives, while in general only
the "one" item is boosted if needed.

> Thanks
> Lai

Sebastian


* Re: [PATCH v2 1/3] workqueue: Provide a handshake for canceling BH workers
  2025-09-02 11:17     ` Sebastian Andrzej Siewior
@ 2025-09-02 14:19       ` Lai Jiangshan
  2025-09-02 15:56         ` Sebastian Andrzej Siewior
  0 siblings, 1 reply; 11+ messages in thread
From: Lai Jiangshan @ 2025-09-02 14:19 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-rt-devel, linux-kernel, Clark Williams, Ingo Molnar,
	Peter Zijlstra, Steven Rostedt, Tejun Heo, Thomas Gleixner

Hello, Sebastian

On Tue, Sep 2, 2025 at 7:17 PM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:

> >
> > Is it possible to use rt_mutex_init_proxy_locked(), rt_mutex_proxy_unlock()
> > and rt_mutex_wait_proxy_lock()?
> >
> > Or is it possible to add something like rt_spinlock_init_proxy_locked(),
> > rt_spinlock_proxy_unlock() and rt_spinlock_wait_proxy_lock() which work
> > the same as the rt_mutex's proxy lock primitives but for non-sleep context?
>
> I don't think so. I think non-sleep context is the killer part. Those
> are for PI and this works by assigning waiter's priority, going to sleep
> until "it" is done. Now if you want non-sleep then you would have to
> remain on the CPU and spin until the "work" is done. This spinning would
> work if the other task is on a remote CPU. But if both are on the same
> CPU then spinning is not working.
>

I meant to say that the supposed rt_spinlock_wait_proxy_lock() would
work similarly to the rt_mutex proxy lock, which would wait until the
boosted task (in this case, the kthread running the BH work) calls
rt_spinlock_proxy_unlock(). It would also behave like the PREEMPT_RT
version of spin_lock, where the task blocked  on a spin_lock has a
special style of blocked/sleep instead of spinning on the CPU and this
is what the prefix "rt_spinlock" means.


By the way, I’m not objecting to this patch — I just want to explore
whether there might be other options.

Thanks
Lai


* Re: [PATCH v2 1/3] workqueue: Provide a handshake for canceling BH workers
  2025-09-02 14:19       ` Lai Jiangshan
@ 2025-09-02 15:56         ` Sebastian Andrzej Siewior
  2025-09-03  7:51           ` Lai Jiangshan
  0 siblings, 1 reply; 11+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-09-02 15:56 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: linux-rt-devel, linux-kernel, Clark Williams, Ingo Molnar,
	Peter Zijlstra, Steven Rostedt, Tejun Heo, Thomas Gleixner

On 2025-09-02 22:19:26 [+0800], Lai Jiangshan wrote:
> Hello, Sebastian
Hi Lai,

> On Tue, Sep 2, 2025 at 7:17 PM Sebastian Andrzej Siewior
> <bigeasy@linutronix.de> wrote:
> > > Is it possible to use rt_mutex_init_proxy_locked(), rt_mutex_proxy_unlock()
> > > and rt_mutex_wait_proxy_lock()?
> > >
> > > Or is it possible to add something like rt_spinlock_init_proxy_locked(),
> > > rt_spinlock_proxy_unlock() and rt_spinlock_wait_proxy_lock() which work
> > > the same as the rt_mutex's proxy lock primitives but for non-sleep context?
> >
> > I don't think so. I think non-sleep context is the killer part. Those
> > are for PI and this works by assigning waiter's priority, going to sleep
> > until "it" is done. Now if you want non-sleep then you would have to
> > remain on the CPU and spin until the "work" is done. This spinning would
> > work if the other task is on a remote CPU. But if both are on the same
> > CPU then spinning is not working.
> >
> 
> I meant to say that the supposed rt_spinlock_wait_proxy_lock() would
> work similarly to the rt_mutex proxy lock, which would wait until the
> boosted task (in this case, the kthread running the BH work) calls
> rt_spinlock_proxy_unlock(). It would also behave like the PREEMPT_RT
> version of spin_lock, where the task blocked  on a spin_lock has a
> special style of blocked/sleep instead of spinning on the CPU and this
> is what the prefix "rt_spinlock" means.

That interface is actually implementing that boosting for users which
don't use it directly. By "it" I mean the proper rtmutex.

This is used by the PI/futex code where a rtmutex is created as a
substitute for the lock that is held by the user in userland. The lock
is acquired in userland without the kernel's involvement. So in case of
contention the user goes to the kernel and creates a rtmutex as a
representation of the userland lock in the kernel and assigns it to the
task that is holding the userland lock. Now you can block on the lock
and the userland task is forced to go to the kernel for unlocking.

For RCU, as far as I remember, a task within an RCU section can get
preempted and in that case it adds itself to a list during schedule() so
it can be identified later on. There can be more than one task preempted
within an RCU section and thus blocking a GP. The boost mutex is
assigned to the first task currently blocking the GP and then the next
one if needed.
A poor substitute would be something like taking a lock during
schedule() and having a list of all those locks in case boosting is
needed so it can be acquired one by one.

> By the way, I’m not objecting to this patch — I just want to explore
> whether there might be other options.

Right. So you would avoid taking the cb_lock in bh_worker(). Instead you
would need to assign the "acquired" lock to the bh_work() task in
__flush_work() and then block on that lock in __flush_work(). In order
to figure out which task it is, you need some bookkeeping for it. And a
lock to synchronise adding/ removing tasks on that list (bookkeeping) as
well as accessing the lock itself in case of "contention".
So given all this, that approach looks slightly more complicated. You
would avoid the need to acquire the lock in bh_worker() but you would
also substitute it with bookkeeping and its locking elsewhere. So I am
not sure it is worth it.

On !RT you can have only one running softirq at a time. On RT, with the
removal of the lock in local_bh_disable() (patch #3), there can be
multiple softirq instances in parallel on the same CPU. The primary goal
is to avoid a central bottleneck and make it possible to have one NIC
doing throughput and another NIC doing low-latency packets, allowing the
low-latency NIC to preempt the throughput NIC in the middle of sending/
receiving packets instead of waiting for NAPI to do a handover.

The lock I'm adding adds some synchronisation here. So you see how this
requirement for the three legacy users makes it slightly more
complicated, especially after the cleanup years ago…
However I now hope to come up with an atomic API as Tejun suggested and
push it behind Kconfig bars or so.

> Thanks
> Lai

Sebastian


* Re: [PATCH v2 1/3] workqueue: Provide a handshake for canceling BH workers
  2025-09-01 16:38 ` [PATCH v2 1/3] workqueue: Provide a handshake for canceling BH workers Sebastian Andrzej Siewior
  2025-09-02 10:12   ` Lai Jiangshan
@ 2025-09-03  7:30   ` Lai Jiangshan
  2025-09-03 18:38     ` Tejun Heo
  1 sibling, 1 reply; 11+ messages in thread
From: Lai Jiangshan @ 2025-09-03  7:30 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-rt-devel, linux-kernel, Clark Williams, Ingo Molnar,
	Peter Zijlstra, Steven Rostedt, Tejun Heo, Thomas Gleixner

On Tue, Sep 2, 2025 at 12:38 AM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:

> @@ -4222,17 +4251,18 @@ static bool __flush_work(struct work_struct *work, bool from_cancel)
>                     (data & WORK_OFFQ_BH)) {
>                         /*
>                          * On RT, prevent a live lock when %current preempted
> -                        * soft interrupt processing or prevents ksoftirqd from
> -                        * running by keeping flipping BH. If the BH work item
> -                        * runs on a different CPU then this has no effect other
> -                        * than doing the BH disable/enable dance for nothing.
> -                        * This is copied from
> -                        * kernel/softirq.c::tasklet_unlock_spin_wait().
> +                        * soft interrupt processing by blocking on lock which
> +                        * is owned by the thread invoking the callback.
>                          */
>                         while (!try_wait_for_completion(&barr.done)) {
>                                 if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
> -                                       local_bh_disable();
> -                                       local_bh_enable();
> +                                       struct worker_pool *pool;
> +
> +                                       mutex_lock(&wq_pool_mutex);
> +                                       pool = get_work_pool(work);
> +                                       if (pool)
> +                                               workqueue_callback_cancel_wait_running(pool);
> +                                       mutex_unlock(&wq_pool_mutex);

The goal is to avoid using a potentially sleeping function in
__flush_work() for BH work items on PREEMPT_RT, but
mutex_lock(&wq_pool_mutex) is not appropriate in this context.

To obtain the pool of a work item, the preferred approach is to use
rcu_read_lock() together with get_work_pool(work), as is done in
start_flush_work().
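
A minimal sketch of what that could look like in the polling loop
(untested; same logic as the patch, only the pool lookup differs):

	while (!try_wait_for_completion(&barr.done)) {
		if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
			struct worker_pool *pool;

			rcu_read_lock();
			pool = get_work_pool(work);
			if (pool)
				workqueue_callback_cancel_wait_running(pool);
			rcu_read_unlock();
		} else {
			cpu_relax();
		}
	}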


>                                 } else {
>                                         cpu_relax();
>                                 }


* Re: [PATCH v2 1/3] workqueue: Provide a handshake for canceling BH workers
  2025-09-02 15:56         ` Sebastian Andrzej Siewior
@ 2025-09-03  7:51           ` Lai Jiangshan
  0 siblings, 0 replies; 11+ messages in thread
From: Lai Jiangshan @ 2025-09-03  7:51 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linux-rt-devel, linux-kernel, Clark Williams, Ingo Molnar,
	Peter Zijlstra, Steven Rostedt, Tejun Heo, Thomas Gleixner

On Tue, Sep 2, 2025 at 11:56 PM Sebastian Andrzej Siewior
<bigeasy@linutronix.de> wrote:

>
> That interface is actually implementing that boosting for users which
> don't use it directly. By "it" I mean the proper rtmutex.
>
> This is used by the PI/futex code where a rtmutex is created as a
> substitute for the lock that is held by the user in userland. The lock
> is acquired in userland without the kernel's involvement. So in case of
> contention the user goes to the kernel and creates a rtmutex as a
> representation of the userland lock in the kernel and assigns it to the
> task that is holding the userland lock. Now you can block on the lock
> and the userland task is forced to go to the kernel for unlocking.
>
> For RCU, as far as I remember, a task within an RCU section can get
> preempted and in that case it adds itself to a list during schedule() so
> it can be identified later on. There can be more than one task preempted
> within an RCU section and thus blocking a GP. The boost mutex is
> assigned to the first task currently blocking the GP and then the next
> one if needed.
> A poor substitute would be something like taking a lock during
> schedule() and having a list of all those locks in case boosting is
> needed so it can be acquired one by one.
>


Well explained why a “proxy” lock is needed in these two cases — thanks a lot.


>
> Right. So you would avoid taking the cb_lock in bh_worker(). Instead you
> would need to assign the "acquired" lock to the bh_work() task in
> __flush_work() and then block on that lock in __flush_work(). In order
> to figure out which task it is, you need some bookkeeping for it. And a
> lock to synchronise adding/ removing tasks on that list (bookkeeping) as
> well as accessing the lock itself in case of "contention".
> So given all this, that approach looks slightly more complicated. You
> would avoid the need to acquire the lock in bh_worker() but you would
> also substitute it with bookkeeping and its locking elsewhere. So I am
> not sure it is worth it.
>
> On !RT you can have only one running softirq at a time. On RT, with the
> removal of the lock in local_bh_disable() (patch #3), there can be
> multiple softirq instances in parallel on the same CPU. The primary goal
> is to avoid a central bottleneck and make it possible to have one NIC
> doing throughput and another NIC doing low-latency packets, allowing the
> low-latency NIC to preempt the throughput NIC in the middle of sending/
> receiving packets instead of waiting for NAPI to do a handover.
>

The bookkeeping isn’t necessarily complicated. For bh_worker() on
PREEMPT_RT, it could be done in worker_leave_idle(worker) with
pool->lock already held. worker->task could be repurposed to record
the task running bh_worker(), which I think would also be useful for
debugging even on !RT.

That said, since a proxy lock isn’t appropriate here, let’s just leave it aside.


* Re: [PATCH v2 1/3] workqueue: Provide a handshake for canceling BH workers
  2025-09-03  7:30   ` Lai Jiangshan
@ 2025-09-03 18:38     ` Tejun Heo
  0 siblings, 0 replies; 11+ messages in thread
From: Tejun Heo @ 2025-09-03 18:38 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Sebastian Andrzej Siewior, linux-rt-devel, linux-kernel,
	Clark Williams, Ingo Molnar, Peter Zijlstra, Steven Rostedt,
	Thomas Gleixner

On Wed, Sep 03, 2025 at 03:30:08PM +0800, Lai Jiangshan wrote:
> On Tue, Sep 2, 2025 at 12:38 AM Sebastian Andrzej Siewior
> <bigeasy@linutronix.de> wrote:
> 
> > @@ -4222,17 +4251,18 @@ static bool __flush_work(struct work_struct *work, bool from_cancel)
> >                     (data & WORK_OFFQ_BH)) {
> >                         /*
> >                          * On RT, prevent a live lock when %current preempted
> > -                        * soft interrupt processing or prevents ksoftirqd from
> > -                        * running by keeping flipping BH. If the BH work item
> > -                        * runs on a different CPU then this has no effect other
> > -                        * than doing the BH disable/enable dance for nothing.
> > -                        * This is copied from
> > -                        * kernel/softirq.c::tasklet_unlock_spin_wait().
> > +                        * soft interrupt processing by blocking on lock which
> > +                        * is owned by the thread invoking the callback.
> >                          */
> >                         while (!try_wait_for_completion(&barr.done)) {
> >                                 if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
> > -                                       local_bh_disable();
> > -                                       local_bh_enable();
> > +                                       struct worker_pool *pool;
> > +
> > +                                       mutex_lock(&wq_pool_mutex);
> > +                                       pool = get_work_pool(work);
> > +                                       if (pool)
> > +                                               workqueue_callback_cancel_wait_running(pool);
> > +                                       mutex_unlock(&wq_pool_mutex);
> 
> The goal is to avoid using a potentially sleeping function in
> __flush_work() for BH work items on PREEMPT_RT, but
> mutex_lock(&wq_pool_mutex) is not appropriate in this context.
> 
> To obtain the pool of a work item, the preferred approach is to use
> rcu_read_lock() together with get_work_pool(work), as is done in
> start_flush_work().

Yeah, Sebastian, can you please switch it to rcu_read_lock()?

Thanks.

-- 
tejun

