* [PATCH] workqueue: flush all pending jobs in destroy_workqueue()
@ 2025-04-23 17:51 Alice Ryhl
2025-04-24 19:57 ` Tejun Heo
0 siblings, 1 reply; 7+ messages in thread
From: Alice Ryhl @ 2025-04-23 17:51 UTC (permalink / raw)
To: Tejun Heo
Cc: Lai Jiangshan, Philipp Stanner, Danilo Krummrich, linux-kernel,
Alice Ryhl
Normally, destroy_workqueue() will wait for queued work to finish, but if
destroy_workqueue() is called while any delayed or rcu work targeting the
workqueue is still pending (i.e. its timer or rcu callback has not fired
yet), then it fails to wait for that work. When such work items are
eventually queued by the timer/rcu callback, this can result in a UAF
because the callback tries to use a workqueue that has already been freed.
(In fact, they must be queued before we set __WQ_DESTROYING to avoid
triggering a WARN.)
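For illustration, a minimal hypothetical reproducer (all names made up)
could look like this:

  static struct workqueue_struct *frob_wq;

  static void frob_fn(struct work_struct *work) { }
  static DECLARE_DELAYED_WORK(frob_dwork, frob_fn);

  static void frob_teardown(void)
  {
      /* The timer for frob_dwork is still pending here. */
      queue_delayed_work(frob_wq, &frob_dwork, 10 * HZ);

      /* Before this patch: returns without waiting for frob_dwork. */
      destroy_workqueue(frob_wq);

      /* ~10s later, delayed_work_timer_fn() uses the freed workqueue. */
  }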
To fix this, we introduce a new list of delayed work items so that
destroy_workqueue() can flush all delayed work items before waiting for
queued jobs. We also call rcu_barrier() to wait for rcu jobs to queue
themselves.
Flush or cancel work?
=====================
This patch proposes that we flush delayed work items rather than
cancelling them. This has the consequence that delayed work items may be
executed before their delay has elapsed, which is somewhat unfortunate.
However, this is better than cancellation, which has issues with cleanup -
for example, self-freeing work items would not get freed if we cancel
them. See the linked discussion for more on this.
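As a hypothetical example (names made up) of such a self-freeing work
item:

  struct frob_request {
      struct delayed_work dwork;
      /* ... request payload ... */
  };

  static void frob_request_fn(struct work_struct *work)
  {
      struct frob_request *req =
          container_of(to_delayed_work(work), struct frob_request, dwork);

      /* ... process the request ... */
      kfree(req); /* the item frees itself when it runs */
  }

If destroy_workqueue() cancelled the pending item instead of flushing it,
frob_request_fn() would never run and req would be leaked.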
Another option could be to wait for the timers to elapse so that the
work items are not executed "too early". This would work, but is deemed
to be too excessive because the timers could be very long. Thus, if the
user doesn't want their delayed work items to run early, they have to
cancel them before calling destroy_workqueue(). (Which is no different
from what is required prior to this patch.)
The same strategy is not used for rcu work items because such work items
are very likely to do something incorrect if they run before an rcu grace
period has passed. But we still need to wait for them to be queued. Thus,
this patch introduces a call to rcu_barrier() in destroy_workqueue().
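For context, a typical (hypothetical, names made up) rcu_work user looks
roughly like this; running the work function before the grace period has
elapsed would be a use-after-free for concurrent RCU readers:

  struct frob_entry {
      struct list_head node; /* on an RCU-protected list */
      struct rcu_work rwork;
  };

  static void frob_free_fn(struct work_struct *work)
  {
      struct frob_entry *e =
          container_of(to_rcu_work(work), struct frob_entry, rwork);

      /* Safe only because a grace period has passed since list_del_rcu(). */
      kfree(e);
  }

  static void frob_remove(struct frob_entry *e)
  {
      list_del_rcu(&e->node);
      INIT_RCU_WORK(&e->rwork, frob_free_fn);
      queue_rcu_work(system_wq, &e->rwork);
  }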
Dual use of the list_head
=========================
This patch reuses the list_head inside work_struct for the delayed_list
list. This could be dangerous if there is any code out there that is
using the list_head for various other purposes.
To avoid such issues, we ensure that if a work item is queued in
delayed_list, then it *must* be the case that the pending bit is owned
by the timer or by a *currently running* function in kernel/workqueue.c
such as __queue_delayed_work(), delayed_work_timer_fn(),
try_to_grab_pending(), or flush_delayed_work(). Note that this implies
that such functions *must* remove the work item from delayed_list (or
schedule the timer) before they give up ownership of the pending bit.
The above ensures that no existing code assumes it can use the list_head
for its own purposes - if such code exists, it already has a bug today,
because the timer could fire and hand the work item to __queue_work() at
any time, which WARNs in such cases.
Link: https://lore.kernel.org/r/aAFnEBv50t11Rjt0@slm.duckdns.org
Signed-off-by: Alice Ryhl <aliceryhl@google.com>
---
Based on top of
https://lore.kernel.org/all/20250404101543.74262-2-phasta@kernel.org/
---
kernel/workqueue.c | 138 ++++++++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 125 insertions(+), 13 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 2cb8276a27a99c951883d09ed2dbd5f488bda8e2..5b795d8c7ca5a16006fea6c4996acc60230e3984 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -367,6 +367,8 @@ struct workqueue_struct {
struct lockdep_map __lockdep_map;
struct lockdep_map *lockdep_map;
#endif
+ raw_spinlock_t delayed_lock; /* protects delayed_list */
+ struct list_head delayed_list; /* list of pending delayed jobs */
char name[WQ_NAME_LEN]; /* I: workqueue name */
/*
@@ -2061,8 +2063,20 @@ static int try_to_grab_pending(struct work_struct *work, u32 cflags,
* guaranteed that the timer is not queued anywhere and not
* running on the local CPU.
*/
- if (likely(timer_delete(&dwork->timer)))
+ if (likely(timer_delete(&dwork->timer))) {
+ /*
+ * We took ownership of the pending bit from the timer,
+ * so dwork->wq must be a valid workqueue and
+ * work->entry must be in delayed_list of that wq. Note
+ * that dwork->wq can't be freed here because
+ * destroy_workqueue() spins until we delete the work
+ * item from delayed_list.
+ */
+ raw_spin_lock(&dwork->wq->delayed_lock);
+ list_del_init(&work->entry);
+ raw_spin_unlock(&dwork->wq->delayed_lock);
return 1;
+ }
}
/* try to claim PENDING the normal way */
@@ -2479,12 +2493,33 @@ bool queue_work_node(int node, struct workqueue_struct *wq,
}
EXPORT_SYMBOL_GPL(queue_work_node);
+/*
+ * Should be used instead of __queue_work() right after taking ownership of a
+ * struct delayed_work from its timer (either because the timer fired or via
+ * timer_delete()). Not needed if the timer was never armed in the first place.
+ */
+static void __queue_delayed_work_now(struct delayed_work *dwork)
+{
+ struct workqueue_struct *wq = dwork->wq;
+
+ /*
+ * The __queue_work() call must happen inside the lock; otherwise
+ * destroy_flush_all_delayed() might observe an empty list and let the
+ * workqueue be destroyed before we get to call __queue_work().
+ */
+
+ raw_spin_lock(&wq->delayed_lock);
+ list_del_init(&dwork->work.entry);
+ __queue_work(dwork->cpu, wq, &dwork->work);
+ raw_spin_unlock(&wq->delayed_lock);
+}
+
void delayed_work_timer_fn(struct timer_list *t)
{
struct delayed_work *dwork = from_timer(dwork, t, timer);
/* should have been called from irqsafe timer with irq already off */
- __queue_work(dwork->cpu, dwork->wq, &dwork->work);
+ __queue_delayed_work_now(dwork);
}
EXPORT_SYMBOL(delayed_work_timer_fn);
@@ -2515,6 +2550,10 @@ static void __queue_delayed_work(int cpu, struct workqueue_struct *wq,
dwork->cpu = cpu;
timer->expires = jiffies + delay;
+ raw_spin_lock(&wq->delayed_lock);
+ list_add_tail(&work->entry, &wq->delayed_list);
+ raw_spin_unlock(&wq->delayed_lock);
+
if (housekeeping_enabled(HK_TYPE_TIMER)) {
/* If the current cpu is a housekeeping cpu, use it. */
cpu = smp_processor_id();
@@ -4282,7 +4321,7 @@ bool flush_delayed_work(struct delayed_work *dwork)
{
local_irq_disable();
if (timer_delete_sync(&dwork->timer))
- __queue_work(dwork->cpu, dwork->wq, &dwork->work);
+ __queue_delayed_work_now(dwork);
local_irq_enable();
return flush_work(&dwork->work);
}
@@ -5720,6 +5759,9 @@ static struct workqueue_struct *__alloc_workqueue(const char *fmt,
INIT_LIST_HEAD(&wq->flusher_overflow);
INIT_LIST_HEAD(&wq->maydays);
+ INIT_LIST_HEAD(&wq->delayed_list);
+ raw_spin_lock_init(&wq->delayed_lock);
+
INIT_LIST_HEAD(&wq->list);
if (flags & WQ_UNBOUND) {
@@ -5832,22 +5874,86 @@ static bool pwq_busy(struct pool_workqueue *pwq)
return false;
}
+/*
+ * Helper function for destroy_workqueue() which ensures that all delayed work
+ * items are queued to the workqueue. This means that once the workqueue drains
+ * all jobs, it's guaranteed that no new jobs will be added.
+ *
+ * The user must not call any variant of queue_work_on() during the call to
+ * destroy_workqueue(). It's a user bug if that happens. However, the functions
+ * cancel_delayed_work() and flush_delayed_work() are okay to call concurrently.
+ */
+static void destroy_flush_all_delayed(struct workqueue_struct *wq)
+{
+ struct delayed_work *dwork;
+ /*
+ * List of pending delayed work items where the pending bit is not
+ * owned by the timer, but some other function. We need to wait for
+ * them to remove themselves from this list.
+ *
+ * Protected by wq->delayed_lock.
+ */
+ LIST_HEAD(running_timers);
+
+ raw_spin_lock_irq(&wq->delayed_lock);
+ while ((dwork = list_first_entry_or_null(&wq->delayed_list,
+ struct delayed_work, work.entry))) {
+ if (likely(timer_delete(&dwork->timer))) {
+ WARN_ON(wq != dwork->wq);
+ list_del_init(&dwork->work.entry);
+ __queue_work(dwork->cpu, wq, &dwork->work);
+ } else {
+ /*
+ * The work item is on delayed_list, but its timer could
+ * not be deleted. This means that the pending bit is
+ * currently owned by another running function. In this
+ * case we just need to wait for that other function to
+ * finish running.
+ */
+ list_move(&dwork->work.entry, &running_timers);
+ }
+
+ /* Give others a chance to use the lock. */
+ raw_spin_unlock_irq(&wq->delayed_lock);
+ raw_spin_lock_irq(&wq->delayed_lock);
+ }
+ raw_spin_unlock_irq(&wq->delayed_lock);
+
+ /*
+ * We also need to ensure that all rcu_work items are queued. The
+ * easiest way to do this is to wait for the rcu callbacks to finish,
+ * so we call rcu_barrier(). The call to rcu_barrier() could happen
+ * anywhere in this function, but right now we need to wait for timer
+ * callbacks to remove themselves from running_timers and we have no
+ * good way of waiting for that other than spinning, so this seems like
+ * a good time to call rcu_barrier().
+ */
+ rcu_barrier();
+
+ raw_spin_lock_irq(&wq->delayed_lock);
+ while (!list_empty(&running_timers)) {
+ raw_spin_unlock_irq(&wq->delayed_lock);
+ /*
+ * The timers are irqsafe, so waiting for them by spinning
+ * should be okay.
+ */
+ cpu_relax();
+ raw_spin_lock_irq(&wq->delayed_lock);
+ }
+ raw_spin_unlock_irq(&wq->delayed_lock);
+}
+
/**
* destroy_workqueue - safely terminate a workqueue
* @wq: target workqueue
*
* Safely destroy a workqueue. All work currently pending will be done first.
*
- * This function does NOT guarantee that non-pending work that has been
- * submitted with queue_delayed_work() and similar functions will be done
- * before destroying the workqueue. The fundamental problem is that, currently,
- * the workqueue has no way of accessing non-pending delayed_work. delayed_work
- * is only linked on the timer-side. All delayed_work must, therefore, be
- * canceled before calling this function.
- *
- * TODO: It would be better if the problem described above wouldn't exist and
- * destroy_workqueue() would cleanly cancel all pending and non-pending
- * delayed_work.
+ * Note that delayed work is executed *without* waiting for the delay, so
+ * delayed work may run sooner than expected. This doesn't apply to rcu work,
+ * which is still guaranteed to run only after a grace period has elapsed.
+ * Therefore, calling destroy_workqueue() may involve waiting for an
+ * rcu_barrier().
*/
void destroy_workqueue(struct workqueue_struct *wq)
{
@@ -5860,6 +5966,12 @@ void destroy_workqueue(struct workqueue_struct *wq)
*/
workqueue_sysfs_unregister(wq);
+ /*
+ * Ensure that all delayed work items are queued properly so that
+ * drain_workqueue() will not miss any pending jobs.
+ */
+ destroy_flush_all_delayed(wq);
+
/* mark the workqueue destruction is in progress */
mutex_lock(&wq->mutex);
wq->flags |= __WQ_DESTROYING;
---
base-commit: 2762750ac5c6395dfa69b2cf1e3208fe6ae45cd5
change-id: 20250423-destroy-workqueue-flush-2d25dede7565
Best regards,
--
Alice Ryhl <aliceryhl@google.com>
* Re: [PATCH] workqueue: flush all pending jobs in destroy_workqueue()
2025-04-23 17:51 [PATCH] workqueue: flush all pending jobs in destroy_workqueue() Alice Ryhl
@ 2025-04-24 19:57 ` Tejun Heo
2025-04-25 9:33 ` Alice Ryhl
0 siblings, 1 reply; 7+ messages in thread
From: Tejun Heo @ 2025-04-24 19:57 UTC (permalink / raw)
To: Alice Ryhl; +Cc: Lai Jiangshan, Philipp Stanner, Danilo Krummrich, linux-kernel
Hello, Alice.
On Wed, Apr 23, 2025 at 05:51:27PM +0000, Alice Ryhl wrote:
...
> @@ -367,6 +367,8 @@ struct workqueue_struct {
> struct lockdep_map __lockdep_map;
> struct lockdep_map *lockdep_map;
> #endif
> + raw_spinlock_t delayed_lock; /* protects delayed_list */
> + struct list_head delayed_list; /* list of pending delayed jobs */
I think we'll have to make this per-CPU or per-pwq. There can be a lot of
delayed work items being queued on, e.g., system_wq. Imagine that happening
on a multi-socket NUMA system. That cacheline is going to be bounced around
pretty hard.
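Something along these lines, maybe (completely untested sketch, all names
made up):

  struct wq_delayed_pcpu {
      raw_spinlock_t lock;
      struct list_head list; /* delayed work tracked via this CPU */
  };

  /* struct workqueue_struct would then carry:
   *     struct wq_delayed_pcpu __percpu *delayed;
   */

  static void wq_track_delayed(struct wq_delayed_pcpu __percpu *delayed,
                               struct delayed_work *dwork)
  {
      struct wq_delayed_pcpu *pd = raw_cpu_ptr(delayed);

      raw_spin_lock(&pd->lock);
      list_add_tail(&dwork->work.entry, &pd->list);
      raw_spin_unlock(&pd->lock);
  }

  static void wq_splice_all_delayed(struct wq_delayed_pcpu __percpu *delayed,
                                    struct list_head *head)
  {
      int cpu;

      for_each_possible_cpu(cpu) {
          struct wq_delayed_pcpu *pd = per_cpu_ptr(delayed, cpu);

          raw_spin_lock_irq(&pd->lock);
          list_splice_init(&pd->list, head);
          raw_spin_unlock_irq(&pd->lock);
      }
  }

The catch is that the removal paths (e.g. try_to_grab_pending()) then need
to know which CPU's list a given item sits on.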
Thanks.
--
tejun
* Re: [PATCH] workqueue: flush all pending jobs in destroy_workqueue()
2025-04-24 19:57 ` Tejun Heo
@ 2025-04-25 9:33 ` Alice Ryhl
2025-04-25 9:57 ` Philipp Stanner
2025-04-25 19:25 ` Tejun Heo
0 siblings, 2 replies; 7+ messages in thread
From: Alice Ryhl @ 2025-04-25 9:33 UTC (permalink / raw)
To: Tejun Heo; +Cc: Lai Jiangshan, Philipp Stanner, Danilo Krummrich, linux-kernel
On Thu, Apr 24, 2025 at 09:57:55AM -1000, Tejun Heo wrote:
> Hello, Alice.
>
> On Wed, Apr 23, 2025 at 05:51:27PM +0000, Alice Ryhl wrote:
> ...
> > @@ -367,6 +367,8 @@ struct workqueue_struct {
> > struct lockdep_map __lockdep_map;
> > struct lockdep_map *lockdep_map;
> > #endif
> > + raw_spinlock_t delayed_lock; /* protects delayed_list */
> > + struct list_head delayed_list; /* list of pending delayed jobs */
>
> I think we'll have to make this per-CPU or per-pwq. There can be a lot of
> delayed work items being queued on, e.g., system_wq. Imagine that happening
> on a multi-socket NUMA system. That cacheline is going to be bounced around
> pretty hard.
Hmm. I think we would need to add a new field to delayed_work to keep
track of which list it has been added to.
Another option could be to add a boolean that disables the list. After
all, we never call destroy_workqueue() on system_wq so we don't need the
list for that workqueue.
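i.e. roughly (untested, flag name made up), in __queue_delayed_work():

  if (wq->flags & __WQ_TRACK_DELAYED) {
      raw_spin_lock(&wq->delayed_lock);
      list_add_tail(&work->entry, &wq->delayed_list);
      raw_spin_unlock(&wq->delayed_lock);
  }

system_wq and friends would simply not set the flag, so their hot paths
never touch the shared cacheline.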
Thoughts?
Alice
* Re: [PATCH] workqueue: flush all pending jobs in destroy_workqueue()
2025-04-25 9:33 ` Alice Ryhl
@ 2025-04-25 9:57 ` Philipp Stanner
2025-04-25 19:25 ` Tejun Heo
1 sibling, 0 replies; 7+ messages in thread
From: Philipp Stanner @ 2025-04-25 9:57 UTC (permalink / raw)
To: Alice Ryhl, Tejun Heo; +Cc: Lai Jiangshan, Danilo Krummrich, linux-kernel
On Fri, 2025-04-25 at 09:33 +0000, Alice Ryhl wrote:
> On Thu, Apr 24, 2025 at 09:57:55AM -1000, Tejun Heo wrote:
> > Hello, Alice.
> >
> > On Wed, Apr 23, 2025 at 05:51:27PM +0000, Alice Ryhl wrote:
> > ...
> > > @@ -367,6 +367,8 @@ struct workqueue_struct {
> > > struct lockdep_map __lockdep_map;
> > > struct lockdep_map *lockdep_map;
> > > #endif
> > > + raw_spinlock_t delayed_lock; /* protects delayed_list */
> > > + struct list_head delayed_list; /* list of pending delayed jobs */
> >
> > I think we'll have to make this per-CPU or per-pwq. There can be a lot of
> > delayed work items being queued on, e.g., system_wq. Imagine that happening
> > on a multi-socket NUMA system. That cacheline is going to be bounced around
> > pretty hard.
>
> Hmm. I think we would need to add a new field to delayed_work to keep
> track of which list it has been added to.
>
> Another option could be to add a boolean that disables the list. After
> all, we never call destroy_workqueue() on system_wq so we don't need the
> list for that workqueue.
>
> Thoughts?
I for my part was astonished that I actually found this half-bug in the
WQ implementation, because WQs are a) very important and b) very
intensively used, so I had expected that the bug *must* be on my side.
The fact that it wasn't is a hint for me that there are not that many
parties in the kernel that tear down with non-canceled DW.
You also have to race a bit to run into the problem.
I'm not sure how relevant that is for the synchronization overhead
Tejun describes; but take it for what it's worth.
P.
>
> Alice
* Re: [PATCH] workqueue: flush all pending jobs in destroy_workqueue()
2025-04-25 9:33 ` Alice Ryhl
2025-04-25 9:57 ` Philipp Stanner
@ 2025-04-25 19:25 ` Tejun Heo
2025-04-28 9:32 ` Alice Ryhl
1 sibling, 1 reply; 7+ messages in thread
From: Tejun Heo @ 2025-04-25 19:25 UTC (permalink / raw)
To: Alice Ryhl; +Cc: Lai Jiangshan, Philipp Stanner, Danilo Krummrich, linux-kernel
Hello,
On Fri, Apr 25, 2025 at 09:33:54AM +0000, Alice Ryhl wrote:
...
> Hmm. I think we would need to add a new field to delayed_work to keep
> track of which list it has been added to.
Can't we use the same cpu that's already recorded in delayed_work->cpu?
> Another option could be to add a boolean that disables the list. After
> all, we never call destroy_workqueue() on system_wq so we don't need the
> list for that workqueue.
It's not just system_wq tho. Any busy workqueue can hit scalability problems,
and the result would usually be subtle performance penalties. If we can keep
it cheap enough, I'd prefer the behavior uniform across all workqueues.
Thanks.
--
tejun
* Re: [PATCH] workqueue: flush all pending jobs in destroy_workqueue()
2025-04-25 19:25 ` Tejun Heo
@ 2025-04-28 9:32 ` Alice Ryhl
2025-04-28 18:30 ` Tejun Heo
0 siblings, 1 reply; 7+ messages in thread
From: Alice Ryhl @ 2025-04-28 9:32 UTC (permalink / raw)
To: Tejun Heo; +Cc: Lai Jiangshan, Philipp Stanner, Danilo Krummrich, linux-kernel
On Fri, Apr 25, 2025 at 09:25:57AM -1000, Tejun Heo wrote:
> Hello,
>
> On Fri, Apr 25, 2025 at 09:33:54AM +0000, Alice Ryhl wrote:
> ...
> > Hmm. I think we would need to add a new field to delayed_work to keep
> > track of which list it has been added to.
>
> Can't we use the same cpu that's already recorded in delayed_work->cpu?
Isn't that just going to be equal to WORK_CPU_UNBOUND most of the time?
Though I guess we could use the values NR_CPUS .. 2*NR_CPUS-1 to
remember which list is used when it is unbound.
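Roughly (untested, helper names made up):

  /* Encode "unbound, but tracked on @cpu's delayed list" as NR_CPUS + cpu. */
  static void dwork_record_list_cpu(struct delayed_work *dwork, int cpu,
                                    bool unbound)
  {
      dwork->cpu = unbound ? NR_CPUS + cpu : cpu;
  }

  static int dwork_list_cpu(const struct delayed_work *dwork)
  {
      return dwork->cpu >= NR_CPUS ? dwork->cpu - NR_CPUS : dwork->cpu;
  }

__queue_work() would then have to map values >= NR_CPUS back to
WORK_CPU_UNBOUND.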
> > Another option could be to add a boolean that disables the list. After
> > all, we never call destroy_workqueue() on system_wq so we don't need the
> > list for that workqueue.
>
> It's not just system_wq tho. Any busy workqueue can hit scalability problems
> and the result would be usually subtle performance penalties. If we can keep
> it cheap enough, I'd prefer the behavior uniform across all workqueues.
Yeah ... that does make sense.
Alice
* Re: [PATCH] workqueue: flush all pending jobs in destroy_workqueue()
2025-04-28 9:32 ` Alice Ryhl
@ 2025-04-28 18:30 ` Tejun Heo
0 siblings, 0 replies; 7+ messages in thread
From: Tejun Heo @ 2025-04-28 18:30 UTC (permalink / raw)
To: Alice Ryhl; +Cc: Lai Jiangshan, Philipp Stanner, Danilo Krummrich, linux-kernel
Hello, Alice.
On Mon, Apr 28, 2025 at 09:32:46AM +0000, Alice Ryhl wrote:
> Isn't that just going to be equal to WORK_CPU_UNBOUND most of the time?
>
> Though I guess we could use the values NR_CPUS .. 2*NR_CPUS-1 to
> remember which list is used when it is unbound.
I think there are multiple ways to go about it. Maybe we can call
wq_select_unbound_cpu() earlier. Maybe we can encode the unboundedness in a
flag bit, or we can just add another 32bit field to delayed_work. Given the
layout, I don't think adding another 32bit field would change the actual
size of delayed_work on 64bit machines anyway.
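Roughly (assuming a 64-bit build where no debug config changes the
layout):

  struct delayed_work {
      struct work_struct work;
      struct timer_list timer;
      struct workqueue_struct *wq; /* 8 bytes, 8-byte aligned */
      int cpu;                     /* 4 bytes */
      /* 4-byte tail hole today; an extra u32 here would not grow sizeof() */
  };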
Thanks.
--
tejun