[PATCHSET v2] Add lockless MPSC FIFO queue for task work

Linux io-uring development
 help / color / mirror / Atom feed

* [PATCHSET v2] Add lockless MPSC FIFO queue for task work
@ 2026-06-12  2:48 Jens Axboe
  2026-06-12  2:48 ` [PATCH 1/6] io_uring: grab RCU read lock marking task run Jens Axboe
                   ` (5 more replies)
  0 siblings, 6 replies; 22+ messages in thread
From: Jens Axboe @ 2026-06-12  2:48 UTC (permalink / raw)
  To: io-uring; +Cc: dvyukov, csander, krisman

Hi,

For v1, see here:

https://lore.kernel.org/io-uring/20260611160553.1486640-1-axboe@kernel.dk/

v2 moves the regular task_work over to mpscq as well, and adds a few
optimizations on top of v1. See the changes section for more details.

Can also be found in a git tree here:

https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux.git/log/?h=io_uring-tw-mpscq

Changes since v1:
- Add patches moving !defer task_work to mpscq as well.
- Get rid of ->cq_wait_added and the associated atomics on that. Have
  waiter set ->cq_wait_nr and just count down from that. Removes another
  atomic in local work additions, and eliminates the need for the
  atomic_try_cmpxchg() for the faster lazy wakes.
- Get rid of RCU read lock for local work additions (Caleb)
- Cleanup up mpscq API a bit (Caleb)
- Correct mpscq comment (Caleb)

 include/linux/io_uring_types.h |  39 ++++-
 io_uring/cancel.c              |   2 -
 io_uring/io_uring.c            |   9 +-
 io_uring/mpscq.h               | 118 +++++++++++++
 io_uring/sqpoll.c              |  30 ++--
 io_uring/tctx.c                |   3 +-
 io_uring/tw.c                  | 307 +++++++++++++++------------------
 io_uring/tw.h                  |  11 +-
 io_uring/wait.c                |   2 +-
 io_uring/wait.h                |  10 +-
 10 files changed, 316 insertions(+), 215 deletions(-)

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 1/6] io_uring: grab RCU read lock marking task run
  2026-06-12  2:48 [PATCHSET v2] Add lockless MPSC FIFO queue for task work Jens Axboe
@ 2026-06-12  2:48 ` Jens Axboe
  2026-06-13  2:27   ` Caleb Sander Mateos
  2026-06-12  2:48 ` [PATCH 2/6] io_uring/mpscq: add lockless multi-producer, single-consumer FIFO queue Jens Axboe
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2026-06-12  2:48 UTC (permalink / raw)
  To: io-uring; +Cc: dvyukov, csander, krisman, Jens Axboe

Not required right now, as io_req_local_work_add() already calls this
helper with the RCU read lock held. But in preparation for that not
being the case, grab it locally.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 io_uring/tw.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/io_uring/tw.c b/io_uring/tw.c
index 023d5e6bc491..f4335c8d50d9 100644
--- a/io_uring/tw.c
+++ b/io_uring/tw.c
@@ -158,11 +158,11 @@ void tctx_task_work(struct callback_head *cb)
  */
 static void io_ctx_mark_taskrun(struct io_ring_ctx *ctx)
 {
-	lockdep_assert_in_rcu_read_lock();
-
 	if (ctx->flags & IORING_SETUP_TASKRUN_FLAG) {
-		struct io_rings *rings = rcu_dereference(ctx->rings_rcu);
+		struct io_rings *rings;
 
+		guard(rcu)();
+		rings = rcu_dereference(ctx->rings_rcu);
 		atomic_or(IORING_SQ_TASKRUN, &rings->sq_flags);
 	}
 }
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 2/6] io_uring/mpscq: add lockless multi-producer, single-consumer FIFO queue
  2026-06-12  2:48 [PATCHSET v2] Add lockless MPSC FIFO queue for task work Jens Axboe
  2026-06-12  2:48 ` [PATCH 1/6] io_uring: grab RCU read lock marking task run Jens Axboe
@ 2026-06-12  2:48 ` Jens Axboe
  2026-06-13  2:40   ` Caleb Sander Mateos
  2026-06-12  2:48 ` [PATCH 3/6] io_uring: switch local task_work to a mpscq Jens Axboe
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2026-06-12  2:48 UTC (permalink / raw)
  To: io-uring; +Cc: dvyukov, csander, krisman, Jens Axboe

Local task_work is currently using llists for managing the work,
but that's a LIFO type of list. This means that running this task_work
needs to reverse the list first, to ensure fairness in running the
queued items.

Add a lockless FIFO queued, based on Dmitry Vyukov's intrusive MPSC
node-based queue algorithm, modified with an externally held consumer
cursor and conditional stub reinsertion. See comments in the header.

Producers are wait-free: a push is a single xchg() on the queue tail,
which serializes concurrent producers and defines the FIFO order, plus
a store linking the node to its predecessor. There are no cmpxchg retry
loops, and pushing is safe from any context, including hardirq.

The cost of linked list FIFO ordering is that a push publishes the node
in two steps - the xchg() makes it visible as the new tail before the
subsequent store links it into the chain that is reachable from the
head. A consumer hitting that window gets a NULL from mpscq_pop() while
mpscq_empty() reports false, and must retry later rather than treat the
queue as empty. The window is two instructions wide, but a producer can
get preempted inside it, so the consumer must not busy wait on it.

The consumer side supports a single consumer at a time, with callers
providing their own serialization. A stub node, which also defines the
empty state (tail == stub), allows the consumer to detach the final
node without racing against producer link stores: that node is only
handed out once the stub has been cmpxchg'ed back in as the tail. This
also guarantees that the previous tail returned by mpscq_push() cannot
get freed before that push has linked it, making it always valid for
comparisons.

The consumer cursor is deliberately not part of the queue struct - the
caller owns it and passes it to mpscq_pop(). This is done to separate
the consumer and producers cacheline. The cursor is written for every
popped entry, and keeping it on the same cacheline as ->tail would have
the consumer invalidating the line that producers need for every push.
Keeping it external lets the caller place it with its own consumer side
data instead.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/io_uring_types.h |  12 ++++
 io_uring/mpscq.h               | 118 +++++++++++++++++++++++++++++++++
 2 files changed, 130 insertions(+)
 create mode 100644 io_uring/mpscq.h

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index aa4d5477f859..85e12b4884a5 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -55,6 +55,18 @@ struct io_wq_work_list {
 	struct io_wq_work_node *last;
 };
 
+/*
+ * Lockless multi-producer, single-consumer FIFO queue, see
+ * io_uring/mpscq.h for the implementation and rules. Defined here so
+ * that it can be embedded in io_ring_ctx. This is the producer side
+ * only - the consumer cursor is kept separately, on a cacheline that
+ * isn't dirtied by the producers.
+ */
+struct mpscq {
+	struct llist_node	*tail;		/* producers */
+	struct llist_node	stub;
+};
+
 struct io_wq_work {
 	struct io_wq_work_node list;
 	atomic_t flags;
diff --git a/io_uring/mpscq.h b/io_uring/mpscq.h
new file mode 100644
index 000000000000..bc482d10e0f3
--- /dev/null
+++ b/io_uring/mpscq.h
@@ -0,0 +1,118 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef IOU_MPSCQ_H
+#define IOU_MPSCQ_H
+
+/*
+ * mpscq - lockless multi-producer, single-consumer FIFO queue
+ *
+ * Unlike llist, which is LIFO ordered and hence needs an O(n)
+ * llist_reverse_order() pass before entries can be processed in queue order,
+ * this queue hands out nodes in the order they were pushed.
+ *
+ * The consumer cursor is held by the caller rather than in the queue struct
+ * (see below), and with the stub reinsertion done as a single cmpxchg attempt
+ * instead of an unconditional push, keeping tail == stub a reliable empty test
+ * while a producer is in the middle of a push.
+ *
+ * Producers may run in any context (task, softirq, hardirq) and are wait-free:
+ * a push is one xchg() plus one store, with no retry loops. FIFO order between
+ * producers is the order in which the xchg() on ->tail serializes them.
+ *
+ * The price for linked-list FIFO is that a push publishes the node in two
+ * steps: the xchg() makes it the new tail, and the subsequent store links it to
+ * its predecessor. In between, the tail end of the queue is not yet reachable
+ * from the head. mpscq_pop() detects this and returns NULL, while mpscq_empty()
+ * reports false. The consumer must not treat such a NULL as "queue empty" - it
+ * should retry later. The window is two instructions wide, but a producer can
+ * be preempted inside it, so the consumer must not spin on it while holding
+ * resources the producer might need to make progress.
+ *
+ * The consumer side only supports a single consumer at a time, callers must
+ * provide their own serialization for it. The stub node is what allows the
+ * consumer to detach the final node without racing with the link stores of
+ * producers. This scheme also guarantees that the previous tail observed by
+ * mpscq_push() cannot be freed by the consumer until the push has linked it,
+ * which is what makes the deferred link store safe.
+ *
+ * The queue struct only holds the producer side. The consumer keeps its cursor
+ * (the oldest not yet handed out node) externally and passes it to mpscq_pop(),
+ * so that it can be placed on a different cacheline: the cursor is written for
+ * every pop, and having it share a line with ->tail would have the consumer
+ * invalidating the line that producers need for every push.
+ */
+static inline void mpscq_init(struct mpscq *q, struct llist_node **headp)
+{
+	q->tail = *headp = &q->stub;
+	q->stub.next = NULL;
+}
+
+/*
+ * Returns true if the queue holds no entries that mpscq_pop() hasn't handed out
+ * yet. May be called from any context. Note that !empty doesn't guarantee that
+ * mpscq_pop() will return an entry yet, see the in-flight producer window
+ * above.
+ */
+static inline bool mpscq_empty(struct mpscq *q)
+{
+	return READ_ONCE(q->tail) == &q->stub;
+}
+
+/*
+ * Push a node onto the queue. Safe against concurrent pushes from any context,
+ * and against the (single) consumer. Returns true if the queue was empty
+ * before this push.
+ */
+static inline bool mpscq_push(struct mpscq *q, struct llist_node *node)
+{
+	struct llist_node *prev;
+
+	node->next = NULL;
+	/*
+	 * xchg() implies a full barrier, so the initialization of the
+	 * entry (including ->next above) is visible before the node can
+	 * be reached, either via ->tail or via ->next chasing from the
+	 * head once the store below has linked it.
+	 */
+	prev = xchg(&q->tail, node);
+	WRITE_ONCE(prev->next, node);
+	return prev == &q->stub;
+}
+
+/*
+ * Pop the oldest node off the queue, or return NULL if no node is available.
+ * NULL is returned both when the queue is empty and when a producer has
+ * published a node via ->tail but hasn't linked it yet; use mpscq_empty() to
+ * tell the two apart. Single consumer only, with headp being the consumer
+ * cursor that mpscq_init() set up.
+ */
+static inline struct llist_node *mpscq_pop(struct mpscq *q,
+					   struct llist_node **headp)
+{
+	struct llist_node *head = *headp, *next;
+
+	if (head == &q->stub) {
+		head = READ_ONCE(head->next);
+		if (!head)
+			return NULL;
+		*headp = head;
+	}
+	next = READ_ONCE(head->next);
+	if (next) {
+		*headp = next;
+		return head;
+	}
+	/*
+	 * 'head' is the last linked node, it can only be handed out once the
+	 * stub has taken its place as the tail. If the cmpxchg fails, a
+	 * producer has made a new node the tail but hasn't linked 'head' to
+	 * it yet - bail and let the caller retry.
+	 */
+	q->stub.next = NULL;
+	if (try_cmpxchg(&q->tail, &head, &q->stub)) {
+		*headp = &q->stub;
+		return head;
+	}
+	return NULL;
+}
+
+#endif /* IOU_MPSCQ_H */
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 3/6] io_uring: switch local task_work to a mpscq
  2026-06-12  2:48 [PATCHSET v2] Add lockless MPSC FIFO queue for task work Jens Axboe
  2026-06-12  2:48 ` [PATCH 1/6] io_uring: grab RCU read lock marking task run Jens Axboe
  2026-06-12  2:48 ` [PATCH 2/6] io_uring/mpscq: add lockless multi-producer, single-consumer FIFO queue Jens Axboe
@ 2026-06-12  2:48 ` Jens Axboe
  2026-06-12  3:20   ` Caleb Sander Mateos
  2026-06-12  2:48 ` [PATCH 4/6] io_uring: switch normal " Jens Axboe
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2026-06-12  2:48 UTC (permalink / raw)
  To: io-uring; +Cc: dvyukov, csander, krisman, Jens Axboe

The local (DEFER_TASKRUN) task_work list is an llist, which is LIFO
ordered, and hence __io_run_local_work() has to restore the right
running order with an O(n) llist_reverse_order() pass first. On top of
that, a batch that gets capped by max_events needs the leftover entries
parked on a separate ->retry_llist, as they can't be pushed back to the
shared list.

Switch it to the FIFO mpscq. Adds are wait-free instead of a cmpxchg
retry loop, entries are popped in queue order with no reversal pass,
capping a run simply leaves the remainder on the queue, and
->retry_llist goes away entirely. The consumer cursor, ->work_head,
lives with the rest of the ->uring_lock protected state rather than
next to the queue, so that popping entries doesn't dirty the producer
side cacheline.

For low amounts of task_work, this ends up being a bit more efficient
than the existing scheme. As an example of that, doing multishot
receives for 8 clients has the following task_work overhead:

     1.02%  sock-test  [kernel.kallsyms]  [k] io_req_local_work_add
     0.88%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work_loop
     0.60%  sock-test  [kernel.kallsyms]  [k] llist_reverse_order
     0.14%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work
     2.64% at ~46Gb/sec

and after this change:

     1.08%  sock-test  [kernel.kallsyms]  [k] io_req_local_work_add
     1.03%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work
     2.11% at ~53Gb/sec

which has less overhead even though that test run was faster. For a case
of having 1024 clients on a single ring:

     2.22%  sock-test  [kernel.kallsyms]  [k] llist_reverse_order
     0.84%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work_loop
     0.42%  sock-test  [kernel.kallsyms]  [k] io_req_local_work_add
     0.02%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work
     3.50% at ~24Gb/sec

we start to see the llist reversing taking a considerable amount of
time, and the total add+run task_work overhead is around 3.5%. After
the change:

     0.90%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work
     0.42%  sock-test  [kernel.kallsyms]  [k] io_req_local_work_add
     1.32% at ~26Gb/sec

most of that overhead is gone, and performance is better as well.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/io_uring_types.h |  13 +++-
 io_uring/io_uring.c            |   2 +-
 io_uring/tw.c                  | 135 ++++++++++++++-------------------
 io_uring/tw.h                  |   4 +-
 io_uring/wait.c                |   2 +-
 io_uring/wait.h                |  10 ++-
 6 files changed, 78 insertions(+), 88 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 85e12b4884a5..9df5584ec3b1 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -351,6 +351,14 @@ struct io_ring_ctx {
 		 */
 		atomic_t		cancel_seq;
 
+		/*
+		 * Consumer cursor for ->work_list, protected by ->uring_lock.
+		 * Deliberately kept away from the producer side of the queue,
+		 * as it's written for every popped entry, and the producer
+		 * cacheline is contended enough as it is.
+		 */
+		struct llist_node	*work_head;
+
 		/*
 		 * ->iopoll_list is protected by the ctx->uring_lock for
 		 * io_uring instances that don't use IORING_SETUP_SQPOLL.
@@ -417,8 +425,7 @@ struct io_ring_ctx {
 	 */
 	struct {
 		struct io_rings	__rcu	*rings_rcu;
-		struct llist_head	work_llist;
-		struct llist_head	retry_llist;
+		struct mpscq		work_list;
 		unsigned long		check_cq;
 		atomic_t		cq_wait_nr;
 		atomic_t		cq_timeouts;
@@ -742,8 +749,6 @@ struct io_kiocb {
 	 */
 	u16				buf_index;
 
-	unsigned			nr_tw;
-
 	/* REQ_F_* flags */
 	io_req_flags_t			flags;
 
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 753ac23401c5..16acd99ff083 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -280,7 +280,7 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	INIT_LIST_HEAD(&ctx->defer_list);
 	INIT_LIST_HEAD(&ctx->timeout_list);
 	INIT_LIST_HEAD(&ctx->ltimeout_list);
-	init_llist_head(&ctx->work_llist);
+	mpscq_init(&ctx->work_list, &ctx->work_head);
 	INIT_LIST_HEAD(&ctx->tctx_list);
 	mutex_init(&ctx->tctx_lock);
 	ctx->submit_state.free_list.next = NULL;
diff --git a/io_uring/tw.c b/io_uring/tw.c
index f4335c8d50d9..b8d6027aaeff 100644
--- a/io_uring/tw.c
+++ b/io_uring/tw.c
@@ -14,6 +14,7 @@
 #include "rw.h"
 #include "eventfd.h"
 #include "wait.h"
+#include "mpscq.h"
 
 void io_fallback_req_func(struct work_struct *work)
 {
@@ -170,11 +171,7 @@ static void io_ctx_mark_taskrun(struct io_ring_ctx *ctx)
 void io_req_local_work_add(struct io_kiocb *req, unsigned flags)
 {
 	struct io_ring_ctx *ctx = req->ctx;
-	unsigned nr_wait, nr_tw, nr_tw_prev;
-	struct llist_node *head;
-
-	/* See comment above IO_CQ_WAKE_INIT */
-	BUILD_BUG_ON(IO_CQ_WAKE_FORCE <= IORING_MAX_CQ_ENTRIES);
+	int nr_wait;
 
 	/*
 	 * We don't know how many requests there are in the link and whether
@@ -183,56 +180,37 @@ void io_req_local_work_add(struct io_kiocb *req, unsigned flags)
 	if (req->flags & IO_REQ_LINK_FLAGS)
 		flags &= ~IOU_F_TWQ_LAZY_WAKE;
 
-	guard(rcu)();
-
-	head = READ_ONCE(ctx->work_llist.first);
-	do {
-		nr_tw_prev = 0;
-		if (head) {
-			struct io_kiocb *first_req = container_of(head,
-							struct io_kiocb,
-							io_task_work.node);
-			/*
-			 * Might be executed at any moment, rely on
-			 * SLAB_TYPESAFE_BY_RCU to keep it alive.
-			 */
-			nr_tw_prev = READ_ONCE(first_req->nr_tw);
-		}
-
-		/*
-		 * Theoretically, it can overflow, but that's fine as one of
-		 * previous adds should've tried to wake the task.
-		 */
-		nr_tw = nr_tw_prev + 1;
-		if (!(flags & IOU_F_TWQ_LAZY_WAKE))
-			nr_tw = IO_CQ_WAKE_FORCE;
-
-		req->nr_tw = nr_tw;
-		req->io_task_work.node.next = head;
-	} while (!try_cmpxchg(&ctx->work_llist.first, &head,
-			      &req->io_task_work.node));
-
 	/*
-	 * cmpxchg implies a full barrier, which pairs with the barrier
-	 * in set_current_state() on the io_cqring_wait() side. It's used
-	 * to ensure that either we see updated ->cq_wait_nr, or waiters
-	 * going to sleep will observe the work added to the list, which
-	 * is similar to the wait/wawke task state sync.
+	 * The xchg() in mpscq_push() implies a full barrier, which pairs with
+	 * the barrier in set_current_state() on the io_cqring_wait() side. This
+	 * ensures that either we see the updated ->cq_wait_nr, or waiters going
+	 * to sleep will observe the work added to the list, which is similar to
+	 * the wait/wake task state sync.
 	 */
-
-	if (!head) {
+	if (mpscq_push(&ctx->work_list, &req->io_task_work.node)) {
 		io_ctx_mark_taskrun(ctx);
 		if (data_race(ctx->int_flags) & IO_RING_F_HAS_EVFD)
 			io_eventfd_signal(ctx, false);
 	}
 
+	/*
+	 * No one is waiting (IO_CQ_WAKE_INIT), or this cycle's wake up has
+	 * already been issued (zero or negative, see below).
+	 */
 	nr_wait = atomic_read(&ctx->cq_wait_nr);
-	/* not enough or no one is waiting */
-	if (nr_tw < nr_wait)
+	if (nr_wait <= 0)
 		return;
-	/* the previous add has already woken it up */
-	if (nr_tw_prev >= nr_wait)
+	if (flags & IOU_F_TWQ_LAZY_WAKE) {
+		/*
+		 * ->cq_wait_nr counts down the number of lazy adds, once it
+		 * hits zero we're good to wake the waiter.
+		 */
+		if (!atomic_dec_and_test(&ctx->cq_wait_nr))
+			return;
+	} else if (!atomic_try_cmpxchg(&ctx->cq_wait_nr, &nr_wait, IO_CQ_WAKE_INIT)) {
+		/* lost the race against another wake up, this one is covered */
 		return;
+	}
 	wake_up_state(ctx->submitter_task, TASK_INTERRUPTIBLE);
 }
 
@@ -273,21 +251,27 @@ void io_req_task_work_add_remote(struct io_kiocb *req, unsigned flags)
 
 void __cold io_move_task_work_from_local(struct io_ring_ctx *ctx)
 {
-	struct llist_node *node;
+	struct llist_node *node, *first = NULL, **tail = &first;
 
 	/*
-	 * Running the work items may utilize ->retry_llist as a means
-	 * for capping the number of task_work entries run at the same
-	 * time. But that list can potentially race with moving the work
-	 * from here, if the task is exiting. As any normal task_work
-	 * running holds ->uring_lock already, just guard this slow path
-	 * with ->uring_lock to avoid racing on ->retry_llist.
+	 * The work list consumer side is serialized by ->uring_lock, see
+	 * __io_run_local_work(). Grab it to guard against racing with normal
+	 * task_work running, as the task may be exiting.
 	 */
 	guard(mutex)(&ctx->uring_lock);
-	node = llist_del_all(&ctx->work_llist);
-	__io_fallback_tw(node, false);
-	node = llist_del_all(&ctx->retry_llist);
-	__io_fallback_tw(node, false);
+
+	while (!mpscq_empty(&ctx->work_list)) {
+		node = mpscq_pop(&ctx->work_list, &ctx->work_head);
+		if (!node) {
+			/* a producer is mid-push, wait for it to link */
+			cpu_relax();
+			continue;
+		}
+		*tail = node;
+		tail = &node->next;
+	}
+	*tail = NULL;
+	__io_fallback_tw(first, false);
 }
 
 static bool io_run_local_work_continue(struct io_ring_ctx *ctx, int events,
@@ -302,22 +286,23 @@ static bool io_run_local_work_continue(struct io_ring_ctx *ctx, int events,
 	return false;
 }
 
-static int __io_run_local_work_loop(struct llist_node **node,
+static int __io_run_local_work_loop(struct io_ring_ctx *ctx,
 				    io_tw_token_t tw,
 				    int events)
 {
 	int ret = 0;
 
-	while (*node) {
-		struct llist_node *next = (*node)->next;
-		struct io_kiocb *req = container_of(*node, struct io_kiocb,
-						    io_task_work.node);
+	while (ret < events) {
+		struct llist_node *node = mpscq_pop(&ctx->work_list, &ctx->work_head);
+		struct io_kiocb *req;
+
+		if (!node)
+			break;
+		req = container_of(node, struct io_kiocb, io_task_work.node);
 		INDIRECT_CALL_2(req->io_task_work.func,
 				io_poll_task_func, io_req_rw_complete,
 				(struct io_tw_req){req}, tw);
-		*node = next;
-		if (++ret >= events)
-			break;
+		ret++;
 	}
 
 	return ret;
@@ -326,7 +311,6 @@ static int __io_run_local_work_loop(struct llist_node **node,
 static int __io_run_local_work(struct io_ring_ctx *ctx, io_tw_token_t tw,
 			       int min_events, int max_events)
 {
-	struct llist_node *node;
 	unsigned int loops = 0;
 	int ret = 0;
 
@@ -335,24 +319,21 @@ static int __io_run_local_work(struct io_ring_ctx *ctx, io_tw_token_t tw,
 	if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
 		atomic_andnot(IORING_SQ_TASKRUN, &ctx->rings->sq_flags);
 again:
-	tw.cancel = io_should_terminate_tw(ctx);
-	min_events -= ret;
-	ret = __io_run_local_work_loop(&ctx->retry_llist.first, tw, max_events);
-	if (ctx->retry_llist.first)
-		goto retry_done;
-
 	/*
-	 * llists are in reverse order, flip it back the right way before
-	 * running the pending items.
+	 * If the last loop made no progress while work is still pending,
+	 * a producer has published a node but hasn't linked it into the
+	 * queue yet (see mpscq_pop()). Give it a chance to finish rather
+	 * than spinning on the queue.
 	 */
-	node = llist_reverse_order(llist_del_all(&ctx->work_llist));
-	ret += __io_run_local_work_loop(&node, tw, max_events - ret);
-	ctx->retry_llist.first = node;
+	if (unlikely(loops && !ret))
+		cond_resched();
+	tw.cancel = io_should_terminate_tw(ctx);
+	min_events -= ret;
+	ret = __io_run_local_work_loop(ctx, tw, max_events);
 	loops++;
 
 	if (io_run_local_work_continue(ctx, ret, min_events))
 		goto again;
-retry_done:
 	io_submit_flush_completions(ctx);
 	if (io_run_local_work_continue(ctx, ret, min_events))
 		goto again;
diff --git a/io_uring/tw.h b/io_uring/tw.h
index 415e330fabde..f42db5fdbded 100644
--- a/io_uring/tw.h
+++ b/io_uring/tw.h
@@ -6,6 +6,8 @@
 #include <linux/percpu-refcount.h>
 #include <linux/io_uring_types.h>
 
+#include "mpscq.h"
+
 #define IO_LOCAL_TW_DEFAULT_MAX		20
 
 /*
@@ -89,7 +91,7 @@ static inline int io_run_task_work(void)
 
 static inline bool io_local_work_pending(struct io_ring_ctx *ctx)
 {
-	return !llist_empty(&ctx->work_llist) || !llist_empty(&ctx->retry_llist);
+	return !mpscq_empty(&ctx->work_list);
 }
 
 static inline bool io_task_work_pending(struct io_ring_ctx *ctx)
diff --git a/io_uring/wait.c b/io_uring/wait.c
index ec01e78a216d..c5fc34d2ce97 100644
--- a/io_uring/wait.c
+++ b/io_uring/wait.c
@@ -98,7 +98,7 @@ static enum hrtimer_restart io_cqring_min_timer_wakeup(struct hrtimer *timer)
 	if (ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
 		atomic_set(&ctx->cq_wait_nr, 1);
 		smp_mb();
-		if (!llist_empty(&ctx->work_llist))
+		if (io_local_work_pending(ctx))
 			goto out_wake;
 	}
 
diff --git a/io_uring/wait.h b/io_uring/wait.h
index a4274b137f81..6d494297e1ce 100644
--- a/io_uring/wait.h
+++ b/io_uring/wait.h
@@ -5,12 +5,14 @@
 #include <linux/io_uring_types.h>
 
 /*
- * No waiters. It's larger than any valid value of the tw counter
- * so that tests against ->cq_wait_nr would fail and skip wake_up().
+ * ->cq_wait_nr is armed with the number of lazy task_work adds the waiter
+ * still needs, and counted down by the add side, with the add reaching zero
+ * issuing the (single) wake up for this wait cycle. Zero and below means no
+ * wake up is to be issued: IO_CQ_WAKE_INIT when no task is waiting (also
+ * what a forced wake up resets it to when claiming one), zero once the
+ * countdown has fired.
  */
 #define IO_CQ_WAKE_INIT		(-1U)
-/* Forced wake up if there is a waiter regardless of ->cq_wait_nr */
-#define IO_CQ_WAKE_FORCE	(IO_CQ_WAKE_INIT >> 1)
 
 struct ext_arg {
 	size_t argsz;
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 4/6] io_uring: switch normal task_work to a mpscq
  2026-06-12  2:48 [PATCHSET v2] Add lockless MPSC FIFO queue for task work Jens Axboe
                   ` (2 preceding siblings ...)
  2026-06-12  2:48 ` [PATCH 3/6] io_uring: switch local task_work to a mpscq Jens Axboe
@ 2026-06-12  2:48 ` Jens Axboe
  2026-06-12 18:59   ` Caleb Sander Mateos
  2026-06-12  2:48 ` [PATCH 5/6] io_uring: run the tctx task_work fallback directly Jens Axboe
  2026-06-12  2:48 ` [PATCH 6/6] io_uring: remove the per-ctx fallback task_work machinery Jens Axboe
  5 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2026-06-12  2:48 UTC (permalink / raw)
  To: io-uring; +Cc: dvyukov, csander, krisman, Jens Axboe

Like the local task_work list, the normal (tctx) task_work list is an
llist, and hence needs the O(n) llist_reverse_order() pass before
running entries in queue order. On top of that, capped runs - sqpoll
processing IORING_TW_CAP_ENTRIES_VALUE entries at a time - need the
claimed-but-unprocessed leftovers carried in a separate retry_list,
as they can't be pushed back to the shared list.

Switch tctx->task_list to a mpscq, like what was done for the
DEFER_TASKRUN paths as well.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/io_uring_types.h |  12 ++-
 io_uring/sqpoll.c              |  30 +++----
 io_uring/tctx.c                |   3 +-
 io_uring/tw.c                  | 146 ++++++++++++++++++++-------------
 io_uring/tw.h                  |   4 +-
 5 files changed, 113 insertions(+), 82 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 9df5584ec3b1..33de451127f9 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -131,6 +131,11 @@ struct io_uring_task {
 	const struct io_ring_ctx 	*last;
 	struct task_struct		*task;
 	struct io_wq			*io_wq;
+	/*
+	 * Consumer cursor for ->task_list. Only popped by the task itself,
+	 * or by ->fallback_work once the task can no longer run task_work.
+	 */
+	struct llist_node		*task_head;
 	struct file			*registered_rings[IO_RINGFD_REG_MAX];
 
 	struct xarray			xa;
@@ -139,8 +144,13 @@ struct io_uring_task {
 	atomic_t			inflight_tracked;
 	struct percpu_counter		inflight;
 
+	/* drains ->task_list once the task can no longer run task_work */
+	struct work_struct		fallback_work;
+
 	struct { /* task_work */
-		struct llist_head	task_list;
+		struct mpscq		task_list;
+		/* BIT(0) guards adding tw only once */
+		unsigned long		tw_pending;
 		struct callback_head	task_work;
 	} ____cacheline_aligned_in_smp;
 };
diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c
index 46c12afec73e..2460bd605266 100644
--- a/io_uring/sqpoll.c
+++ b/io_uring/sqpoll.c
@@ -260,39 +260,29 @@ static bool io_sqd_handle_event(struct io_sq_data *sqd)
 }
 
 /*
- * Run task_work, processing the retry_list first. The retry_list holds
- * entries that we passed on in the previous run, if we had more task_work
- * than we were asked to process. Newly queued task_work isn't run until the
- * retry list has been fully processed.
+ * Run task_work, processing no more than max_entries at a time. If more
+ * than that is pending, it simply stays on the queue for the next run.
  */
-static unsigned int io_sq_tw(struct llist_node **retry_list, int max_entries)
+static unsigned int io_sq_tw(int max_entries)
 {
 	struct io_uring_task *tctx = current->io_uring;
 	unsigned int count = 0;
 
-	if (*retry_list) {
-		*retry_list = io_handle_tw_list(*retry_list, &count, max_entries);
-		if (count >= max_entries)
-			goto out;
-		max_entries -= count;
-	}
-	*retry_list = tctx_task_work_run(tctx, max_entries, &count);
-out:
+	tctx_task_work_run(tctx, max_entries, &count);
 	if (task_work_pending(current))
 		task_work_run();
 	return count;
 }
 
-static bool io_sq_tw_pending(struct llist_node *retry_list)
+static bool io_sq_tw_pending(void)
 {
 	struct io_uring_task *tctx = current->io_uring;
 
-	return retry_list || !llist_empty(&tctx->task_list);
+	return !mpscq_empty(&tctx->task_list);
 }
 
 static int io_sq_thread(void *data)
 {
-	struct llist_node *retry_list = NULL;
 	struct io_sq_data *sqd = data;
 	struct io_ring_ctx *ctx;
 	unsigned long timeout = 0;
@@ -347,7 +337,7 @@ static int io_sq_thread(void *data)
 			if (!sqt_spin && (ret > 0 || !list_empty(&ctx->iopoll_list)))
 				sqt_spin = true;
 		}
-		if (io_sq_tw(&retry_list, IORING_TW_CAP_ENTRIES_VALUE))
+		if (io_sq_tw(IORING_TW_CAP_ENTRIES_VALUE))
 			sqt_spin = true;
 
 		list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) {
@@ -372,7 +362,7 @@ static int io_sq_thread(void *data)
 		}
 
 		prepare_to_wait(&sqd->wait, &wait, TASK_INTERRUPTIBLE);
-		if (!io_sqd_events_pending(sqd) && !io_sq_tw_pending(retry_list)) {
+		if (!io_sqd_events_pending(sqd) && !io_sq_tw_pending()) {
 			bool needs_sched = true;
 
 			list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) {
@@ -411,8 +401,8 @@ static int io_sq_thread(void *data)
 		timeout = jiffies + sqd->sq_thread_idle;
 	}
 
-	if (retry_list)
-		io_sq_tw(&retry_list, UINT_MAX);
+	if (io_sq_tw_pending())
+		io_sq_tw(UINT_MAX);
 
 	io_uring_cancel_generic(true, sqd);
 	rcu_assign_pointer(sqd->thread, NULL);
diff --git a/io_uring/tctx.c b/io_uring/tctx.c
index 42b219b34aa8..cc3bf2b3bdbc 100644
--- a/io_uring/tctx.c
+++ b/io_uring/tctx.c
@@ -103,7 +103,8 @@ __cold struct io_uring_task *io_uring_alloc_task_context(struct task_struct *tas
 	init_waitqueue_head(&tctx->wait);
 	atomic_set(&tctx->in_cancel, 0);
 	atomic_set(&tctx->inflight_tracked, 0);
-	init_llist_head(&tctx->task_list);
+	mpscq_init(&tctx->task_list, &tctx->task_head);
+	INIT_WORK(&tctx->fallback_work, io_tctx_fallback_work);
 	init_task_work(&tctx->task_work, tctx_task_work);
 	return tctx;
 }
diff --git a/io_uring/tw.c b/io_uring/tw.c
index b8d6027aaeff..ca29bb0b9768 100644
--- a/io_uring/tw.c
+++ b/io_uring/tw.c
@@ -46,46 +46,6 @@ static void ctx_flush_and_put(struct io_ring_ctx *ctx, io_tw_token_t tw)
 	percpu_ref_put(&ctx->refs);
 }
 
-/*
- * Run queued task_work, returning the number of entries processed in *count.
- * If more entries than max_entries are available, stop processing once this
- * is reached and return the rest of the list.
- */
-struct llist_node *io_handle_tw_list(struct llist_node *node,
-				     unsigned int *count,
-				     unsigned int max_entries)
-{
-	struct io_ring_ctx *ctx = NULL;
-	struct io_tw_state ts = { };
-
-	do {
-		struct llist_node *next = node->next;
-		struct io_kiocb *req = container_of(node, struct io_kiocb,
-						    io_task_work.node);
-
-		if (req->ctx != ctx) {
-			ctx_flush_and_put(ctx, ts);
-			ctx = req->ctx;
-			mutex_lock(&ctx->uring_lock);
-			percpu_ref_get(&ctx->refs);
-			ts.cancel = io_should_terminate_tw(ctx);
-		}
-		INDIRECT_CALL_2(req->io_task_work.func,
-				io_poll_task_func, io_req_rw_complete,
-				(struct io_tw_req){req}, ts);
-		node = next;
-		(*count)++;
-		if (unlikely(need_resched())) {
-			ctx_flush_and_put(ctx, ts);
-			ctx = NULL;
-			cond_resched();
-		}
-	} while (node && *count < max_entries);
-
-	ctx_flush_and_put(ctx, ts);
-	return node;
-}
-
 static __cold void __io_fallback_tw(struct llist_node *node, bool sync)
 {
 	struct io_ring_ctx *last_ctx = NULL;
@@ -114,43 +74,109 @@ static __cold void __io_fallback_tw(struct llist_node *node, bool sync)
 	}
 }
 
-static void io_fallback_tw(struct io_uring_task *tctx, bool sync)
+void io_tctx_fallback_work(struct work_struct *work)
 {
-	struct llist_node *node = llist_del_all(&tctx->task_list);
+	struct io_uring_task *tctx = container_of(work, struct io_uring_task,
+						  fallback_work);
+	struct llist_node *node, *first = NULL, **tail = &first;
+
+	/* see tctx_task_work() - a set bit must always have a run coming */
+	clear_bit(0, &tctx->tw_pending);
+	smp_mb__after_atomic();
+
+	while (!mpscq_empty(&tctx->task_list)) {
+		node = mpscq_pop(&tctx->task_list, &tctx->task_head);
+		if (!node) {
+			/* a producer is mid-push, wait for it to link */
+			cond_resched();
+			continue;
+		}
+		*tail = node;
+		tail = &node->next;
+	}
+	*tail = NULL;
+	__io_fallback_tw(first, false);
+	put_task_struct(tctx->task);
+}
 
-	__io_fallback_tw(node, sync);
+static void io_fallback_tw(struct io_uring_task *tctx)
+{
+	/*
+	 * The task ref both keeps ->task valid and, as __io_uring_free() is
+	 * only called when the task itself is freed, ensures the tctx (and
+	 * the queued work) stay around until the drain has run.
+	 */
+	get_task_struct(tctx->task);
+	if (!queue_work(system_unbound_wq, &tctx->fallback_work))
+		put_task_struct(tctx->task);
 }
 
-struct llist_node *tctx_task_work_run(struct io_uring_task *tctx,
-				      unsigned int max_entries,
-				      unsigned int *count)
+/*
+ * Run queued task_work, processing no more than max_entries, with the number
+ * of entries processed added to *count. If more entries than max_entries are
+ * available, the remainder simply stay on the queue for the next run.
+ */
+void tctx_task_work_run(struct io_uring_task *tctx, unsigned int max_entries,
+			unsigned int *count)
 {
-	struct llist_node *node;
+	struct io_ring_ctx *ctx = NULL;
+	struct io_tw_state ts = { };
 
-	node = llist_del_all(&tctx->task_list);
-	if (node) {
-		node = llist_reverse_order(node);
-		node = io_handle_tw_list(node, count, max_entries);
+	while (*count < max_entries) {
+		struct llist_node *node = mpscq_pop(&tctx->task_list,
+						    &tctx->task_head);
+		struct io_kiocb *req;
+
+		if (!node) {
+			if (mpscq_empty(&tctx->task_list))
+				break;
+			/*
+			 * A producer has published a node but hasn't
+			 * linked it into the queue yet (see mpscq_pop()).
+			 * Give it a chance to finish rather than spinning,
+			 * and don't sit on the ctx lock while doing so.
+			 */
+			ctx_flush_and_put(ctx, ts);
+			ctx = NULL;
+			cond_resched();
+			continue;
+		}
+		req = container_of(node, struct io_kiocb, io_task_work.node);
+		if (req->ctx != ctx) {
+			ctx_flush_and_put(ctx, ts);
+			ctx = req->ctx;
+			mutex_lock(&ctx->uring_lock);
+			percpu_ref_get(&ctx->refs);
+			ts.cancel = io_should_terminate_tw(ctx);
+		}
+		INDIRECT_CALL_2(req->io_task_work.func,
+				io_poll_task_func, io_req_rw_complete,
+				(struct io_tw_req){req}, ts);
+		(*count)++;
+		if (unlikely(need_resched())) {
+			ctx_flush_and_put(ctx, ts);
+			ctx = NULL;
+			cond_resched();
+		}
 	}
+	ctx_flush_and_put(ctx, ts);
 
 	/* relaxed read is enough as only the task itself sets ->in_cancel */
 	if (unlikely(atomic_read(&tctx->in_cancel)))
 		io_uring_drop_tctx_refs(current);
 
 	trace_io_uring_task_work_run(tctx, *count);
-	return node;
 }
 
 void tctx_task_work(struct callback_head *cb)
 {
 	struct io_uring_task *tctx;
-	struct llist_node *ret;
 	unsigned int count = 0;
 
 	tctx = container_of(cb, struct io_uring_task, task_work);
-	ret = tctx_task_work_run(tctx, UINT_MAX, &count);
-	/* can't happen */
-	WARN_ON_ONCE(ret);
+	clear_bit(0, &tctx->tw_pending);
+	smp_mb__after_atomic();
+	tctx_task_work_run(tctx, UINT_MAX, &count);
 }
 
 /*
@@ -220,7 +246,7 @@ void io_req_normal_work_add(struct io_kiocb *req)
 	struct io_ring_ctx *ctx = req->ctx;
 
 	/* task_work already pending, we're done */
-	if (!llist_add(&req->io_task_work.node, &tctx->task_list))
+	if (!mpscq_push(&tctx->task_list, &req->io_task_work.node))
 		return;
 
 	/*
@@ -236,10 +262,14 @@ void io_req_normal_work_add(struct io_kiocb *req)
 		return;
 	}
 
+	/* task_work must only be added once */
+	if (test_and_set_bit(0, &tctx->tw_pending))
+		return;
+
 	if (likely(!task_work_add(tctx->task, &tctx->task_work, ctx->notify_method)))
 		return;
 
-	io_fallback_tw(tctx, false);
+	io_fallback_tw(tctx);
 }
 
 void io_req_task_work_add_remote(struct io_kiocb *req, unsigned flags)
diff --git a/io_uring/tw.h b/io_uring/tw.h
index f42db5fdbded..387e52004da8 100644
--- a/io_uring/tw.h
+++ b/io_uring/tw.h
@@ -25,8 +25,8 @@ static inline bool io_should_terminate_tw(struct io_ring_ctx *ctx)
 }
 
 void io_req_task_work_add_remote(struct io_kiocb *req, unsigned flags);
-struct llist_node *io_handle_tw_list(struct llist_node *node, unsigned int *count, unsigned int max_entries);
 void tctx_task_work(struct callback_head *cb);
+void io_tctx_fallback_work(struct work_struct *work);
 int io_run_local_work(struct io_ring_ctx *ctx, int min_events, int max_events);
 int io_run_task_work_sig(struct io_ring_ctx *ctx);
 
@@ -36,7 +36,7 @@ int io_run_local_work_locked(struct io_ring_ctx *ctx, int min_events);
 
 void io_req_local_work_add(struct io_kiocb *req, unsigned flags);
 void io_req_normal_work_add(struct io_kiocb *req);
-struct llist_node *tctx_task_work_run(struct io_uring_task *tctx, unsigned int max_entries, unsigned int *count);
+void tctx_task_work_run(struct io_uring_task *tctx, unsigned int max_entries, unsigned int *count);
 
 static inline void __io_req_task_work_add(struct io_kiocb *req, unsigned flags)
 {
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 5/6] io_uring: run the tctx task_work fallback directly
  2026-06-12  2:48 [PATCHSET v2] Add lockless MPSC FIFO queue for task work Jens Axboe
                   ` (3 preceding siblings ...)
  2026-06-12  2:48 ` [PATCH 4/6] io_uring: switch normal " Jens Axboe
@ 2026-06-12  2:48 ` Jens Axboe
  2026-06-12  2:48 ` [PATCH 6/6] io_uring: remove the per-ctx fallback task_work machinery Jens Axboe
  5 siblings, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2026-06-12  2:48 UTC (permalink / raw)
  To: io-uring; +Cc: dvyukov, csander, krisman, Jens Axboe

The fallback work drains the tctx queue only to redistribute the entries
into the per-ctx fallback lists, bouncing them through a second
(per-ctx) work item before they finally run. That made sense when the
producer side did the draining and could be in any context, but the
fallback work is a regular process context kworker: it can just run the
entries itself. Reuse the normal run loop - if run from the fallback
kernel thread, ts.cancel will get set, and the work terminated.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 io_uring/tw.c | 29 ++++++++++++++---------------
 1 file changed, 14 insertions(+), 15 deletions(-)

diff --git a/io_uring/tw.c b/io_uring/tw.c
index ca29bb0b9768..0fa685aa3926 100644
--- a/io_uring/tw.c
+++ b/io_uring/tw.c
@@ -78,24 +78,18 @@ void io_tctx_fallback_work(struct work_struct *work)
 {
 	struct io_uring_task *tctx = container_of(work, struct io_uring_task,
 						  fallback_work);
-	struct llist_node *node, *first = NULL, **tail = &first;
+	unsigned int count = 0;
 
 	/* see tctx_task_work() - a set bit must always have a run coming */
 	clear_bit(0, &tctx->tw_pending);
 	smp_mb__after_atomic();
 
-	while (!mpscq_empty(&tctx->task_list)) {
-		node = mpscq_pop(&tctx->task_list, &tctx->task_head);
-		if (!node) {
-			/* a producer is mid-push, wait for it to link */
-			cond_resched();
-			continue;
-		}
-		*tail = node;
-		tail = &node->next;
-	}
-	*tail = NULL;
-	__io_fallback_tw(first, false);
+	/*
+	 * Run the entries directly. We're in PF_KTHRED context, hence
+	 * io_should_terminate_tw() is true and they will be marked as
+	 * canceled.
+	 */
+	tctx_task_work_run(tctx, UINT_MAX, &count);
 	put_task_struct(tctx->task);
 }
 
@@ -161,8 +155,13 @@ void tctx_task_work_run(struct io_uring_task *tctx, unsigned int max_entries,
 	}
 	ctx_flush_and_put(ctx, ts);
 
-	/* relaxed read is enough as only the task itself sets ->in_cancel */
-	if (unlikely(atomic_read(&tctx->in_cancel)))
+	/*
+	 * Relaxed read is enough as only the task itself sets ->in_cancel.
+	 * The tctx may also be drained by io_tctx_fallback_work(), in which
+	 * case current is a kworker that has no tctx refs to drop.
+	 */
+	if (unlikely(atomic_read(&tctx->in_cancel)) &&
+	    current->io_uring == tctx)
 		io_uring_drop_tctx_refs(current);
 
 	trace_io_uring_task_work_run(tctx, *count);
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 6/6] io_uring: remove the per-ctx fallback task_work machinery
  2026-06-12  2:48 [PATCHSET v2] Add lockless MPSC FIFO queue for task work Jens Axboe
                   ` (4 preceding siblings ...)
  2026-06-12  2:48 ` [PATCH 5/6] io_uring: run the tctx task_work fallback directly Jens Axboe
@ 2026-06-12  2:48 ` Jens Axboe
  5 siblings, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2026-06-12  2:48 UTC (permalink / raw)
  To: io-uring; +Cc: dvyukov, csander, krisman, Jens Axboe

With the tctx fallback running its entries directly, the per-ctx
fallback work has a single user left: moving local (DEFER_TASKRUN)
task_work entries out of a ring that is going away. Both of its call
sites are process context and don't hold ->uring_lock, the same
conditions the deferred fallback work itself ran under - so run the
entries in cancel mode right there instead, and rename the helper to
io_cancel_local_task_work() to match what it now does.

With that, ->fallback_llist, ->fallback_work, io_fallback_req_func()
and __io_fallback_tw() can all go away, along with the fallback work
flushing in the ring exit and cancel paths. Requests that get
orphaned by an exiting task now run via the tctx fallback work, which
the ring exit side implicitly waits on through the ctx refs those
requests hold.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/io_uring_types.h |  2 -
 io_uring/cancel.c              |  2 -
 io_uring/io_uring.c            |  7 +---
 io_uring/tw.c                  | 67 +++++++---------------------------
 io_uring/tw.h                  |  3 +-
 5 files changed, 16 insertions(+), 65 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 33de451127f9..a0de8dafd990 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -498,8 +498,6 @@ struct io_ring_ctx {
 	struct mutex			tctx_lock;
 
 	/* ctx exit and cancelation */
-	struct llist_head		fallback_llist;
-	struct delayed_work		fallback_work;
 	struct work_struct		exit_work;
 	struct completion		ref_comp;
 
diff --git a/io_uring/cancel.c b/io_uring/cancel.c
index 4aa3103ba9c3..8c6fa6f367e4 100644
--- a/io_uring/cancel.c
+++ b/io_uring/cancel.c
@@ -565,8 +565,6 @@ __cold bool io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
 	mutex_unlock(&ctx->uring_lock);
 	if (tctx)
 		ret |= io_run_task_work() > 0;
-	else
-		ret |= flush_delayed_work(&ctx->fallback_work);
 	return ret;
 }
 
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 16acd99ff083..33b4340d32a7 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -289,7 +289,6 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 #ifdef CONFIG_FUTEX
 	INIT_HLIST_HEAD(&ctx->futex_list);
 #endif
-	INIT_DELAYED_WORK(&ctx->fallback_work, io_fallback_req_func);
 	INIT_WQ_LIST(&ctx->submit_state.compl_reqs);
 	INIT_HLIST_HEAD(&ctx->cancelable_uring_cmd);
 	io_napi_init(ctx);
@@ -1204,7 +1203,7 @@ __cold void io_iopoll_try_reap_events(struct io_ring_ctx *ctx)
 	mutex_unlock(&ctx->uring_lock);
 
 	if (ctx->flags & IORING_SETUP_DEFER_TASKRUN)
-		io_move_task_work_from_local(ctx);
+		io_cancel_local_task_work(ctx);
 }
 
 static int io_iopoll_check(struct io_ring_ctx *ctx, unsigned int min_events)
@@ -2350,7 +2349,7 @@ static __cold void io_ring_exit_work(struct work_struct *work)
 		/* The SQPOLL thread never reaches this path */
 		do {
 			if (ctx->flags & IORING_SETUP_DEFER_TASKRUN)
-				io_move_task_work_from_local(ctx);
+				io_cancel_local_task_work(ctx);
 			cond_resched();
 		} while (io_uring_try_cancel_requests(ctx, NULL, true, false));
 
@@ -2436,8 +2435,6 @@ static __cold void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx)
 		io_unregister_personality(ctx, index);
 	mutex_unlock(&ctx->uring_lock);
 
-	flush_delayed_work(&ctx->fallback_work);
-
 	INIT_WORK(&ctx->exit_work, io_ring_exit_work);
 	/*
 	 * Use system_dfl_wq to avoid spawning tons of event kworkers
diff --git a/io_uring/tw.c b/io_uring/tw.c
index 0fa685aa3926..31f9feb42353 100644
--- a/io_uring/tw.c
+++ b/io_uring/tw.c
@@ -16,24 +16,6 @@
 #include "wait.h"
 #include "mpscq.h"
 
-void io_fallback_req_func(struct work_struct *work)
-{
-	struct io_ring_ctx *ctx = container_of(work, struct io_ring_ctx,
-						fallback_work.work);
-	struct llist_node *node = llist_del_all(&ctx->fallback_llist);
-	struct io_kiocb *req, *tmp;
-	struct io_tw_state ts = {};
-
-	percpu_ref_get(&ctx->refs);
-	mutex_lock(&ctx->uring_lock);
-	ts.cancel = io_should_terminate_tw(ctx);
-	llist_for_each_entry_safe(req, tmp, node, io_task_work.node)
-		req->io_task_work.func((struct io_tw_req){req}, ts);
-	io_submit_flush_completions(ctx);
-	mutex_unlock(&ctx->uring_lock);
-	percpu_ref_put(&ctx->refs);
-}
-
 static void ctx_flush_and_put(struct io_ring_ctx *ctx, io_tw_token_t tw)
 {
 	if (!ctx)
@@ -46,34 +28,6 @@ static void ctx_flush_and_put(struct io_ring_ctx *ctx, io_tw_token_t tw)
 	percpu_ref_put(&ctx->refs);
 }
 
-static __cold void __io_fallback_tw(struct llist_node *node, bool sync)
-{
-	struct io_ring_ctx *last_ctx = NULL;
-	struct io_kiocb *req;
-
-	while (node) {
-		req = container_of(node, struct io_kiocb, io_task_work.node);
-		node = node->next;
-		if (last_ctx != req->ctx) {
-			if (last_ctx) {
-				if (sync)
-					flush_delayed_work(&last_ctx->fallback_work);
-				percpu_ref_put(&last_ctx->refs);
-			}
-			last_ctx = req->ctx;
-			percpu_ref_get(&last_ctx->refs);
-		}
-		if (llist_add(&req->io_task_work.node, &last_ctx->fallback_llist))
-			schedule_delayed_work(&last_ctx->fallback_work, 1);
-	}
-
-	if (last_ctx) {
-		if (sync)
-			flush_delayed_work(&last_ctx->fallback_work);
-		percpu_ref_put(&last_ctx->refs);
-	}
-}
-
 void io_tctx_fallback_work(struct work_struct *work)
 {
 	struct io_uring_task *tctx = container_of(work, struct io_uring_task,
@@ -278,29 +232,34 @@ void io_req_task_work_add_remote(struct io_kiocb *req, unsigned flags)
 	__io_req_task_work_add(req, flags);
 }
 
-void __cold io_move_task_work_from_local(struct io_ring_ctx *ctx)
+void __cold io_cancel_local_task_work(struct io_ring_ctx *ctx)
 {
-	struct llist_node *node, *first = NULL, **tail = &first;
+	struct io_tw_state ts = { .cancel = true };
+	struct llist_node *node;
 
 	/*
 	 * The work list consumer side is serialized by ->uring_lock, see
 	 * __io_run_local_work(). Grab it to guard against racing with normal
-	 * task_work running, as the task may be exiting.
+	 * task_work running, as the task may be exiting. The ring is going
+	 * away, run the entries in cancel mode right here - the callers
+	 * provide the same process context the per-ctx fallback work that
+	 * they were previously punted to ran in.
 	 */
 	guard(mutex)(&ctx->uring_lock);
 
 	while (!mpscq_empty(&ctx->work_list)) {
+		struct io_kiocb *req;
+
 		node = mpscq_pop(&ctx->work_list, &ctx->work_head);
 		if (!node) {
 			/* a producer is mid-push, wait for it to link */
-			cpu_relax();
+			cond_resched();
 			continue;
 		}
-		*tail = node;
-		tail = &node->next;
+		req = container_of(node, struct io_kiocb, io_task_work.node);
+		req->io_task_work.func((struct io_tw_req){req}, ts);
 	}
-	*tail = NULL;
-	__io_fallback_tw(first, false);
+	io_submit_flush_completions(ctx);
 }
 
 static bool io_run_local_work_continue(struct io_ring_ctx *ctx, int events,
diff --git a/io_uring/tw.h b/io_uring/tw.h
index 387e52004da8..3ade5ad577fd 100644
--- a/io_uring/tw.h
+++ b/io_uring/tw.h
@@ -30,8 +30,7 @@ void io_tctx_fallback_work(struct work_struct *work);
 int io_run_local_work(struct io_ring_ctx *ctx, int min_events, int max_events);
 int io_run_task_work_sig(struct io_ring_ctx *ctx);
 
-__cold void io_fallback_req_func(struct work_struct *work);
-__cold void io_move_task_work_from_local(struct io_ring_ctx *ctx);
+__cold void io_cancel_local_task_work(struct io_ring_ctx *ctx);
 int io_run_local_work_locked(struct io_ring_ctx *ctx, int min_events);
 
 void io_req_local_work_add(struct io_kiocb *req, unsigned flags);
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/6] io_uring: switch local task_work to a mpscq
  2026-06-12  2:48 ` [PATCH 3/6] io_uring: switch local task_work to a mpscq Jens Axboe
@ 2026-06-12  3:20   ` Caleb Sander Mateos
  2026-06-12 12:23     ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Caleb Sander Mateos @ 2026-06-12  3:20 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, dvyukov, krisman

On Thu, Jun 11, 2026 at 7:51 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> The local (DEFER_TASKRUN) task_work list is an llist, which is LIFO
> ordered, and hence __io_run_local_work() has to restore the right
> running order with an O(n) llist_reverse_order() pass first. On top of
> that, a batch that gets capped by max_events needs the leftover entries
> parked on a separate ->retry_llist, as they can't be pushed back to the
> shared list.
>
> Switch it to the FIFO mpscq. Adds are wait-free instead of a cmpxchg
> retry loop, entries are popped in queue order with no reversal pass,
> capping a run simply leaves the remainder on the queue, and
> ->retry_llist goes away entirely. The consumer cursor, ->work_head,
> lives with the rest of the ->uring_lock protected state rather than
> next to the queue, so that popping entries doesn't dirty the producer
> side cacheline.
>
> For low amounts of task_work, this ends up being a bit more efficient
> than the existing scheme. As an example of that, doing multishot
> receives for 8 clients has the following task_work overhead:
>
>      1.02%  sock-test  [kernel.kallsyms]  [k] io_req_local_work_add
>      0.88%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work_loop
>      0.60%  sock-test  [kernel.kallsyms]  [k] llist_reverse_order
>      0.14%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work
>      2.64% at ~46Gb/sec
>
> and after this change:
>
>      1.08%  sock-test  [kernel.kallsyms]  [k] io_req_local_work_add
>      1.03%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work
>      2.11% at ~53Gb/sec
>
> which has less overhead even though that test run was faster. For a case
> of having 1024 clients on a single ring:
>
>      2.22%  sock-test  [kernel.kallsyms]  [k] llist_reverse_order
>      0.84%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work_loop
>      0.42%  sock-test  [kernel.kallsyms]  [k] io_req_local_work_add
>      0.02%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work
>      3.50% at ~24Gb/sec
>
> we start to see the llist reversing taking a considerable amount of
> time, and the total add+run task_work overhead is around 3.5%. After
> the change:
>
>      0.90%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work
>      0.42%  sock-test  [kernel.kallsyms]  [k] io_req_local_work_add
>      1.32% at ~26Gb/sec
>
> most of that overhead is gone, and performance is better as well.
>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  include/linux/io_uring_types.h |  13 +++-
>  io_uring/io_uring.c            |   2 +-
>  io_uring/tw.c                  | 135 ++++++++++++++-------------------
>  io_uring/tw.h                  |   4 +-
>  io_uring/wait.c                |   2 +-
>  io_uring/wait.h                |  10 ++-
>  6 files changed, 78 insertions(+), 88 deletions(-)
>
> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> index 85e12b4884a5..9df5584ec3b1 100644
> --- a/include/linux/io_uring_types.h
> +++ b/include/linux/io_uring_types.h
> @@ -351,6 +351,14 @@ struct io_ring_ctx {
>                  */
>                 atomic_t                cancel_seq;
>
> +               /*
> +                * Consumer cursor for ->work_list, protected by ->uring_lock.
> +                * Deliberately kept away from the producer side of the queue,
> +                * as it's written for every popped entry, and the producer
> +                * cacheline is contended enough as it is.
> +                */
> +               struct llist_node       *work_head;

Looks like this field has padding both before (next to atomic_t) and
after (next to bool). Probably doesn't matter currently, as the outer
struct is cache-aligned and has 16 bytes of padding at the end, but
could save 8 bytes of padding by reordering next to an existing
8-byte-aligned field.

> +
>                 /*
>                  * ->iopoll_list is protected by the ctx->uring_lock for
>                  * io_uring instances that don't use IORING_SETUP_SQPOLL.
> @@ -417,8 +425,7 @@ struct io_ring_ctx {
>          */
>         struct {
>                 struct io_rings __rcu   *rings_rcu;
> -               struct llist_head       work_llist;
> -               struct llist_head       retry_llist;
> +               struct mpscq            work_list;
>                 unsigned long           check_cq;
>                 atomic_t                cq_wait_nr;
>                 atomic_t                cq_timeouts;
> @@ -742,8 +749,6 @@ struct io_kiocb {
>          */
>         u16                             buf_index;
>
> -       unsigned                        nr_tw;
> -
>         /* REQ_F_* flags */
>         io_req_flags_t                  flags;
>
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index 753ac23401c5..16acd99ff083 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -280,7 +280,7 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
>         INIT_LIST_HEAD(&ctx->defer_list);
>         INIT_LIST_HEAD(&ctx->timeout_list);
>         INIT_LIST_HEAD(&ctx->ltimeout_list);
> -       init_llist_head(&ctx->work_llist);
> +       mpscq_init(&ctx->work_list, &ctx->work_head);
>         INIT_LIST_HEAD(&ctx->tctx_list);
>         mutex_init(&ctx->tctx_lock);
>         ctx->submit_state.free_list.next = NULL;
> diff --git a/io_uring/tw.c b/io_uring/tw.c
> index f4335c8d50d9..b8d6027aaeff 100644
> --- a/io_uring/tw.c
> +++ b/io_uring/tw.c
> @@ -14,6 +14,7 @@
>  #include "rw.h"
>  #include "eventfd.h"
>  #include "wait.h"
> +#include "mpscq.h"
>
>  void io_fallback_req_func(struct work_struct *work)
>  {
> @@ -170,11 +171,7 @@ static void io_ctx_mark_taskrun(struct io_ring_ctx *ctx)
>  void io_req_local_work_add(struct io_kiocb *req, unsigned flags)
>  {
>         struct io_ring_ctx *ctx = req->ctx;
> -       unsigned nr_wait, nr_tw, nr_tw_prev;
> -       struct llist_node *head;
> -
> -       /* See comment above IO_CQ_WAKE_INIT */
> -       BUILD_BUG_ON(IO_CQ_WAKE_FORCE <= IORING_MAX_CQ_ENTRIES);
> +       int nr_wait;
>
>         /*
>          * We don't know how many requests there are in the link and whether
> @@ -183,56 +180,37 @@ void io_req_local_work_add(struct io_kiocb *req, unsigned flags)
>         if (req->flags & IO_REQ_LINK_FLAGS)
>                 flags &= ~IOU_F_TWQ_LAZY_WAKE;
>
> -       guard(rcu)();
> -
> -       head = READ_ONCE(ctx->work_llist.first);
> -       do {
> -               nr_tw_prev = 0;
> -               if (head) {
> -                       struct io_kiocb *first_req = container_of(head,
> -                                                       struct io_kiocb,
> -                                                       io_task_work.node);
> -                       /*
> -                        * Might be executed at any moment, rely on
> -                        * SLAB_TYPESAFE_BY_RCU to keep it alive.
> -                        */
> -                       nr_tw_prev = READ_ONCE(first_req->nr_tw);
> -               }
> -
> -               /*
> -                * Theoretically, it can overflow, but that's fine as one of
> -                * previous adds should've tried to wake the task.
> -                */
> -               nr_tw = nr_tw_prev + 1;
> -               if (!(flags & IOU_F_TWQ_LAZY_WAKE))
> -                       nr_tw = IO_CQ_WAKE_FORCE;
> -
> -               req->nr_tw = nr_tw;
> -               req->io_task_work.node.next = head;
> -       } while (!try_cmpxchg(&ctx->work_llist.first, &head,
> -                             &req->io_task_work.node));
> -
>         /*
> -        * cmpxchg implies a full barrier, which pairs with the barrier
> -        * in set_current_state() on the io_cqring_wait() side. It's used
> -        * to ensure that either we see updated ->cq_wait_nr, or waiters
> -        * going to sleep will observe the work added to the list, which
> -        * is similar to the wait/wawke task state sync.
> +        * The xchg() in mpscq_push() implies a full barrier, which pairs with
> +        * the barrier in set_current_state() on the io_cqring_wait() side. This
> +        * ensures that either we see the updated ->cq_wait_nr, or waiters going
> +        * to sleep will observe the work added to the list, which is similar to
> +        * the wait/wake task state sync.
>          */
> -
> -       if (!head) {
> +       if (mpscq_push(&ctx->work_list, &req->io_task_work.node)) {
>                 io_ctx_mark_taskrun(ctx);
>                 if (data_race(ctx->int_flags) & IO_RING_F_HAS_EVFD)
>                         io_eventfd_signal(ctx, false);
>         }
>
> +       /*
> +        * No one is waiting (IO_CQ_WAKE_INIT), or this cycle's wake up has
> +        * already been issued (zero or negative, see below).
> +        */
>         nr_wait = atomic_read(&ctx->cq_wait_nr);
> -       /* not enough or no one is waiting */
> -       if (nr_tw < nr_wait)
> +       if (nr_wait <= 0)
>                 return;
> -       /* the previous add has already woken it up */
> -       if (nr_tw_prev >= nr_wait)
> +       if (flags & IOU_F_TWQ_LAZY_WAKE) {
> +               /*
> +                * ->cq_wait_nr counts down the number of lazy adds, once it
> +                * hits zero we're good to wake the waiter.
> +                */
> +               if (!atomic_dec_and_test(&ctx->cq_wait_nr))
> +                       return;

It's possible that another task work wakes up the task before this one
reaches the atomic_dec_and_test(), right? If the submitter task begins
a new wait in between, this could decrement cq_wait_nr even though the
queued task work has already been processed after the previous wakeup.
I guess that's okay; in the worse case, the waiter will be woken
prematurely.

> +       } else if (!atomic_try_cmpxchg(&ctx->cq_wait_nr, &nr_wait, IO_CQ_WAKE_INIT)) {
> +               /* lost the race against another wake up, this one is covered */
>                 return;
> +       }
>         wake_up_state(ctx->submitter_task, TASK_INTERRUPTIBLE);
>  }
>
> @@ -273,21 +251,27 @@ void io_req_task_work_add_remote(struct io_kiocb *req, unsigned flags)
>
>  void __cold io_move_task_work_from_local(struct io_ring_ctx *ctx)
>  {
> -       struct llist_node *node;
> +       struct llist_node *node, *first = NULL, **tail = &first;
>
>         /*
> -        * Running the work items may utilize ->retry_llist as a means
> -        * for capping the number of task_work entries run at the same
> -        * time. But that list can potentially race with moving the work
> -        * from here, if the task is exiting. As any normal task_work
> -        * running holds ->uring_lock already, just guard this slow path
> -        * with ->uring_lock to avoid racing on ->retry_llist.
> +        * The work list consumer side is serialized by ->uring_lock, see
> +        * __io_run_local_work(). Grab it to guard against racing with normal
> +        * task_work running, as the task may be exiting.
>          */
>         guard(mutex)(&ctx->uring_lock);
> -       node = llist_del_all(&ctx->work_llist);
> -       __io_fallback_tw(node, false);
> -       node = llist_del_all(&ctx->retry_llist);
> -       __io_fallback_tw(node, false);
> +
> +       while (!mpscq_empty(&ctx->work_list)) {
> +               node = mpscq_pop(&ctx->work_list, &ctx->work_head);
> +               if (!node) {
> +                       /* a producer is mid-push, wait for it to link */
> +                       cpu_relax();
> +                       continue;
> +               }
> +               *tail = node;
> +               tail = &node->next;
> +       }
> +       *tail = NULL;
> +       __io_fallback_tw(first, false);
>  }
>
>  static bool io_run_local_work_continue(struct io_ring_ctx *ctx, int events,
> @@ -302,22 +286,23 @@ static bool io_run_local_work_continue(struct io_ring_ctx *ctx, int events,
>         return false;
>  }
>
> -static int __io_run_local_work_loop(struct llist_node **node,
> +static int __io_run_local_work_loop(struct io_ring_ctx *ctx,
>                                     io_tw_token_t tw,
>                                     int events)
>  {
>         int ret = 0;
>
> -       while (*node) {
> -               struct llist_node *next = (*node)->next;
> -               struct io_kiocb *req = container_of(*node, struct io_kiocb,
> -                                                   io_task_work.node);
> +       while (ret < events) {
> +               struct llist_node *node = mpscq_pop(&ctx->work_list, &ctx->work_head);
> +               struct io_kiocb *req;
> +
> +               if (!node)
> +                       break;
> +               req = container_of(node, struct io_kiocb, io_task_work.node);
>                 INDIRECT_CALL_2(req->io_task_work.func,
>                                 io_poll_task_func, io_req_rw_complete,
>                                 (struct io_tw_req){req}, tw);
> -               *node = next;
> -               if (++ret >= events)
> -                       break;
> +               ret++;
>         }
>
>         return ret;
> @@ -326,7 +311,6 @@ static int __io_run_local_work_loop(struct llist_node **node,
>  static int __io_run_local_work(struct io_ring_ctx *ctx, io_tw_token_t tw,
>                                int min_events, int max_events)
>  {
> -       struct llist_node *node;
>         unsigned int loops = 0;
>         int ret = 0;
>
> @@ -335,24 +319,21 @@ static int __io_run_local_work(struct io_ring_ctx *ctx, io_tw_token_t tw,
>         if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
>                 atomic_andnot(IORING_SQ_TASKRUN, &ctx->rings->sq_flags);
>  again:
> -       tw.cancel = io_should_terminate_tw(ctx);
> -       min_events -= ret;
> -       ret = __io_run_local_work_loop(&ctx->retry_llist.first, tw, max_events);
> -       if (ctx->retry_llist.first)
> -               goto retry_done;
> -
>         /*
> -        * llists are in reverse order, flip it back the right way before
> -        * running the pending items.
> +        * If the last loop made no progress while work is still pending,
> +        * a producer has published a node but hasn't linked it into the
> +        * queue yet (see mpscq_pop()). Give it a chance to finish rather
> +        * than spinning on the queue.
>          */
> -       node = llist_reverse_order(llist_del_all(&ctx->work_llist));
> -       ret += __io_run_local_work_loop(&node, tw, max_events - ret);
> -       ctx->retry_llist.first = node;
> +       if (unlikely(loops && !ret))
> +               cond_resched();
> +       tw.cancel = io_should_terminate_tw(ctx);
> +       min_events -= ret;
> +       ret = __io_run_local_work_loop(ctx, tw, max_events);
>         loops++;
>
>         if (io_run_local_work_continue(ctx, ret, min_events))
>                 goto again;
> -retry_done:
>         io_submit_flush_completions(ctx);
>         if (io_run_local_work_continue(ctx, ret, min_events))
>                 goto again;
> diff --git a/io_uring/tw.h b/io_uring/tw.h
> index 415e330fabde..f42db5fdbded 100644
> --- a/io_uring/tw.h
> +++ b/io_uring/tw.h
> @@ -6,6 +6,8 @@
>  #include <linux/percpu-refcount.h>
>  #include <linux/io_uring_types.h>
>
> +#include "mpscq.h"
> +
>  #define IO_LOCAL_TW_DEFAULT_MAX                20
>
>  /*
> @@ -89,7 +91,7 @@ static inline int io_run_task_work(void)
>
>  static inline bool io_local_work_pending(struct io_ring_ctx *ctx)
>  {
> -       return !llist_empty(&ctx->work_llist) || !llist_empty(&ctx->retry_llist);
> +       return !mpscq_empty(&ctx->work_list);
>  }
>
>  static inline bool io_task_work_pending(struct io_ring_ctx *ctx)
> diff --git a/io_uring/wait.c b/io_uring/wait.c
> index ec01e78a216d..c5fc34d2ce97 100644
> --- a/io_uring/wait.c
> +++ b/io_uring/wait.c
> @@ -98,7 +98,7 @@ static enum hrtimer_restart io_cqring_min_timer_wakeup(struct hrtimer *timer)
>         if (ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
>                 atomic_set(&ctx->cq_wait_nr, 1);
>                 smp_mb();
> -               if (!llist_empty(&ctx->work_llist))
> +               if (io_local_work_pending(ctx))
>                         goto out_wake;
>         }
>
> diff --git a/io_uring/wait.h b/io_uring/wait.h
> index a4274b137f81..6d494297e1ce 100644
> --- a/io_uring/wait.h
> +++ b/io_uring/wait.h
> @@ -5,12 +5,14 @@
>  #include <linux/io_uring_types.h>
>
>  /*
> - * No waiters. It's larger than any valid value of the tw counter
> - * so that tests against ->cq_wait_nr would fail and skip wake_up().
> + * ->cq_wait_nr is armed with the number of lazy task_work adds the waiter
> + * still needs, and counted down by the add side, with the add reaching zero
> + * issuing the (single) wake up for this wait cycle. Zero and below means no
> + * wake up is to be issued: IO_CQ_WAKE_INIT when no task is waiting (also
> + * what a forced wake up resets it to when claiming one), zero once the
> + * countdown has fired.
>   */
>  #define IO_CQ_WAKE_INIT                (-1U)

Since cq_wait_nr is now used as a signed value, would it make sense to
drop the U here?

Best,
Caleb

> -/* Forced wake up if there is a waiter regardless of ->cq_wait_nr */
> -#define IO_CQ_WAKE_FORCE       (IO_CQ_WAKE_INIT >> 1)
>
>  struct ext_arg {
>         size_t argsz;
> --
> 2.53.0
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 3/6] io_uring: switch local task_work to a mpscq
  2026-06-12  3:20   ` Caleb Sander Mateos
@ 2026-06-12 12:23     ` Jens Axboe
  0 siblings, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2026-06-12 12:23 UTC (permalink / raw)
  To: Caleb Sander Mateos; +Cc: io-uring, dvyukov, krisman

On 6/11/26 9:20 PM, Caleb Sander Mateos wrote:
>> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
>> index 85e12b4884a5..9df5584ec3b1 100644
>> --- a/include/linux/io_uring_types.h
>> +++ b/include/linux/io_uring_types.h
>> @@ -351,6 +351,14 @@ struct io_ring_ctx {
>>                  */
>>                 atomic_t                cancel_seq;
>>
>> +               /*
>> +                * Consumer cursor for ->work_list, protected by ->uring_lock.
>> +                * Deliberately kept away from the producer side of the queue,
>> +                * as it's written for every popped entry, and the producer
>> +                * cacheline is contended enough as it is.
>> +                */
>> +               struct llist_node       *work_head;
> 
> Looks like this field has padding both before (next to atomic_t) and
> after (next to bool). Probably doesn't matter currently, as the outer
> struct is cache-aligned and has 16 bytes of padding at the end, but
> could save 8 bytes of padding by reordering next to an existing
> 8-byte-aligned field.

Indeed - I'll still move it, then we don't have to hunt holes later.

>> +       /*
>> +        * No one is waiting (IO_CQ_WAKE_INIT), or this cycle's wake up has
>> +        * already been issued (zero or negative, see below).
>> +        */
>>         nr_wait = atomic_read(&ctx->cq_wait_nr);
>> -       /* not enough or no one is waiting */
>> -       if (nr_tw < nr_wait)
>> +       if (nr_wait <= 0)
>>                 return;
>> -       /* the previous add has already woken it up */
>> -       if (nr_tw_prev >= nr_wait)
>> +       if (flags & IOU_F_TWQ_LAZY_WAKE) {
>> +               /*
>> +                * ->cq_wait_nr counts down the number of lazy adds, once it
>> +                * hits zero we're good to wake the waiter.
>> +                */
>> +               if (!atomic_dec_and_test(&ctx->cq_wait_nr))
>> +                       return;
> 
> It's possible that another task work wakes up the task before this one
> reaches the atomic_dec_and_test(), right? If the submitter task begins
> a new wait in between, this could decrement cq_wait_nr even though the
> queued task work has already been processed after the previous wakeup.
> I guess that's okay; in the worse case, the waiter will be woken
> prematurely.

That's correct, if the race is particularly unlucky, it could wake
early. I think that's fine, that's worth living with, and should be rare
enough to not really matter. It's not a lost wake, which would have been
a real problem.

I'll add a comment.

>> diff --git a/io_uring/wait.h b/io_uring/wait.h
>> index a4274b137f81..6d494297e1ce 100644
>> --- a/io_uring/wait.h
>> +++ b/io_uring/wait.h
>> @@ -5,12 +5,14 @@
>>  #include <linux/io_uring_types.h>
>>
>>  /*
>> - * No waiters. It's larger than any valid value of the tw counter
>> - * so that tests against ->cq_wait_nr would fail and skip wake_up().
>> + * ->cq_wait_nr is armed with the number of lazy task_work adds the waiter
>> + * still needs, and counted down by the add side, with the add reaching zero
>> + * issuing the (single) wake up for this wait cycle. Zero and below means no
>> + * wake up is to be issued: IO_CQ_WAKE_INIT when no task is waiting (also
>> + * what a forced wake up resets it to when claiming one), zero once the
>> + * countdown has fired.
>>   */
>>  #define IO_CQ_WAKE_INIT                (-1U)
> 
> Since cq_wait_nr is now used as a signed value, would it make sense to
> drop the U here?

Indeed, I'll fix that too.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/6] io_uring: switch normal task_work to a mpscq
  2026-06-12  2:48 ` [PATCH 4/6] io_uring: switch normal " Jens Axboe
@ 2026-06-12 18:59   ` Caleb Sander Mateos
  2026-06-12 19:37     ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Caleb Sander Mateos @ 2026-06-12 18:59 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, dvyukov, krisman

On Thu, Jun 11, 2026 at 7:51 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> Like the local task_work list, the normal (tctx) task_work list is an
> llist, and hence needs the O(n) llist_reverse_order() pass before
> running entries in queue order. On top of that, capped runs - sqpoll
> processing IORING_TW_CAP_ENTRIES_VALUE entries at a time - need the
> claimed-but-unprocessed leftovers carried in a separate retry_list,
> as they can't be pushed back to the shared list.
>
> Switch tctx->task_list to a mpscq, like what was done for the
> DEFER_TASKRUN paths as well.
>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  include/linux/io_uring_types.h |  12 ++-
>  io_uring/sqpoll.c              |  30 +++----
>  io_uring/tctx.c                |   3 +-
>  io_uring/tw.c                  | 146 ++++++++++++++++++++-------------
>  io_uring/tw.h                  |   4 +-
>  5 files changed, 113 insertions(+), 82 deletions(-)
>
> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> index 9df5584ec3b1..33de451127f9 100644
> --- a/include/linux/io_uring_types.h
> +++ b/include/linux/io_uring_types.h
> @@ -131,6 +131,11 @@ struct io_uring_task {
>         const struct io_ring_ctx        *last;
>         struct task_struct              *task;
>         struct io_wq                    *io_wq;
> +       /*
> +        * Consumer cursor for ->task_list. Only popped by the task itself,
> +        * or by ->fallback_work once the task can no longer run task_work.
> +        */
> +       struct llist_node               *task_head;
>         struct file                     *registered_rings[IO_RINGFD_REG_MAX];
>
>         struct xarray                   xa;
> @@ -139,8 +144,13 @@ struct io_uring_task {
>         atomic_t                        inflight_tracked;
>         struct percpu_counter           inflight;
>
> +       /* drains ->task_list once the task can no longer run task_work */
> +       struct work_struct              fallback_work;
> +
>         struct { /* task_work */
> -               struct llist_head       task_list;
> +               struct mpscq            task_list;
> +               /* BIT(0) guards adding tw only once */
> +               unsigned long           tw_pending;
>                 struct callback_head    task_work;
>         } ____cacheline_aligned_in_smp;
>  };
> diff --git a/io_uring/sqpoll.c b/io_uring/sqpoll.c
> index 46c12afec73e..2460bd605266 100644
> --- a/io_uring/sqpoll.c
> +++ b/io_uring/sqpoll.c
> @@ -260,39 +260,29 @@ static bool io_sqd_handle_event(struct io_sq_data *sqd)
>  }
>
>  /*
> - * Run task_work, processing the retry_list first. The retry_list holds
> - * entries that we passed on in the previous run, if we had more task_work
> - * than we were asked to process. Newly queued task_work isn't run until the
> - * retry list has been fully processed.
> + * Run task_work, processing no more than max_entries at a time. If more
> + * than that is pending, it simply stays on the queue for the next run.
>   */
> -static unsigned int io_sq_tw(struct llist_node **retry_list, int max_entries)
> +static unsigned int io_sq_tw(int max_entries)
>  {
>         struct io_uring_task *tctx = current->io_uring;
>         unsigned int count = 0;
>
> -       if (*retry_list) {
> -               *retry_list = io_handle_tw_list(*retry_list, &count, max_entries);
> -               if (count >= max_entries)
> -                       goto out;
> -               max_entries -= count;
> -       }
> -       *retry_list = tctx_task_work_run(tctx, max_entries, &count);
> -out:
> +       tctx_task_work_run(tctx, max_entries, &count);
>         if (task_work_pending(current))
>                 task_work_run();
>         return count;
>  }
>
> -static bool io_sq_tw_pending(struct llist_node *retry_list)
> +static bool io_sq_tw_pending(void)
>  {
>         struct io_uring_task *tctx = current->io_uring;
>
> -       return retry_list || !llist_empty(&tctx->task_list);
> +       return !mpscq_empty(&tctx->task_list);
>  }
>
>  static int io_sq_thread(void *data)
>  {
> -       struct llist_node *retry_list = NULL;
>         struct io_sq_data *sqd = data;
>         struct io_ring_ctx *ctx;
>         unsigned long timeout = 0;
> @@ -347,7 +337,7 @@ static int io_sq_thread(void *data)
>                         if (!sqt_spin && (ret > 0 || !list_empty(&ctx->iopoll_list)))
>                                 sqt_spin = true;
>                 }
> -               if (io_sq_tw(&retry_list, IORING_TW_CAP_ENTRIES_VALUE))
> +               if (io_sq_tw(IORING_TW_CAP_ENTRIES_VALUE))
>                         sqt_spin = true;
>
>                 list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) {
> @@ -372,7 +362,7 @@ static int io_sq_thread(void *data)
>                 }
>
>                 prepare_to_wait(&sqd->wait, &wait, TASK_INTERRUPTIBLE);
> -               if (!io_sqd_events_pending(sqd) && !io_sq_tw_pending(retry_list)) {
> +               if (!io_sqd_events_pending(sqd) && !io_sq_tw_pending()) {
>                         bool needs_sched = true;
>
>                         list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) {
> @@ -411,8 +401,8 @@ static int io_sq_thread(void *data)
>                 timeout = jiffies + sqd->sq_thread_idle;
>         }
>
> -       if (retry_list)
> -               io_sq_tw(&retry_list, UINT_MAX);
> +       if (io_sq_tw_pending())
> +               io_sq_tw(UINT_MAX);
>
>         io_uring_cancel_generic(true, sqd);
>         rcu_assign_pointer(sqd->thread, NULL);
> diff --git a/io_uring/tctx.c b/io_uring/tctx.c
> index 42b219b34aa8..cc3bf2b3bdbc 100644
> --- a/io_uring/tctx.c
> +++ b/io_uring/tctx.c
> @@ -103,7 +103,8 @@ __cold struct io_uring_task *io_uring_alloc_task_context(struct task_struct *tas
>         init_waitqueue_head(&tctx->wait);
>         atomic_set(&tctx->in_cancel, 0);
>         atomic_set(&tctx->inflight_tracked, 0);
> -       init_llist_head(&tctx->task_list);
> +       mpscq_init(&tctx->task_list, &tctx->task_head);
> +       INIT_WORK(&tctx->fallback_work, io_tctx_fallback_work);
>         init_task_work(&tctx->task_work, tctx_task_work);
>         return tctx;
>  }
> diff --git a/io_uring/tw.c b/io_uring/tw.c
> index b8d6027aaeff..ca29bb0b9768 100644
> --- a/io_uring/tw.c
> +++ b/io_uring/tw.c
> @@ -46,46 +46,6 @@ static void ctx_flush_and_put(struct io_ring_ctx *ctx, io_tw_token_t tw)
>         percpu_ref_put(&ctx->refs);
>  }
>
> -/*
> - * Run queued task_work, returning the number of entries processed in *count.
> - * If more entries than max_entries are available, stop processing once this
> - * is reached and return the rest of the list.
> - */
> -struct llist_node *io_handle_tw_list(struct llist_node *node,
> -                                    unsigned int *count,
> -                                    unsigned int max_entries)
> -{
> -       struct io_ring_ctx *ctx = NULL;
> -       struct io_tw_state ts = { };
> -
> -       do {
> -               struct llist_node *next = node->next;
> -               struct io_kiocb *req = container_of(node, struct io_kiocb,
> -                                                   io_task_work.node);
> -
> -               if (req->ctx != ctx) {
> -                       ctx_flush_and_put(ctx, ts);
> -                       ctx = req->ctx;
> -                       mutex_lock(&ctx->uring_lock);
> -                       percpu_ref_get(&ctx->refs);
> -                       ts.cancel = io_should_terminate_tw(ctx);
> -               }
> -               INDIRECT_CALL_2(req->io_task_work.func,
> -                               io_poll_task_func, io_req_rw_complete,
> -                               (struct io_tw_req){req}, ts);
> -               node = next;
> -               (*count)++;
> -               if (unlikely(need_resched())) {
> -                       ctx_flush_and_put(ctx, ts);
> -                       ctx = NULL;
> -                       cond_resched();
> -               }
> -       } while (node && *count < max_entries);
> -
> -       ctx_flush_and_put(ctx, ts);
> -       return node;
> -}
> -
>  static __cold void __io_fallback_tw(struct llist_node *node, bool sync)
>  {
>         struct io_ring_ctx *last_ctx = NULL;
> @@ -114,43 +74,109 @@ static __cold void __io_fallback_tw(struct llist_node *node, bool sync)
>         }
>  }
>
> -static void io_fallback_tw(struct io_uring_task *tctx, bool sync)
> +void io_tctx_fallback_work(struct work_struct *work)
>  {
> -       struct llist_node *node = llist_del_all(&tctx->task_list);
> +       struct io_uring_task *tctx = container_of(work, struct io_uring_task,
> +                                                 fallback_work);
> +       struct llist_node *node, *first = NULL, **tail = &first;
> +
> +       /* see tctx_task_work() - a set bit must always have a run coming */
> +       clear_bit(0, &tctx->tw_pending);
> +       smp_mb__after_atomic();
> +
> +       while (!mpscq_empty(&tctx->task_list)) {
> +               node = mpscq_pop(&tctx->task_list, &tctx->task_head);
> +               if (!node) {
> +                       /* a producer is mid-push, wait for it to link */
> +                       cond_resched();
> +                       continue;
> +               }
> +               *tail = node;
> +               tail = &node->next;
> +       }
> +       *tail = NULL;
> +       __io_fallback_tw(first, false);
> +       put_task_struct(tctx->task);
> +}
>
> -       __io_fallback_tw(node, sync);
> +static void io_fallback_tw(struct io_uring_task *tctx)
> +{
> +       /*
> +        * The task ref both keeps ->task valid and, as __io_uring_free() is
> +        * only called when the task itself is freed, ensures the tctx (and
> +        * the queued work) stay around until the drain has run.
> +        */
> +       get_task_struct(tctx->task);
> +       if (!queue_work(system_unbound_wq, &tctx->fallback_work))
> +               put_task_struct(tctx->task);
>  }
>
> -struct llist_node *tctx_task_work_run(struct io_uring_task *tctx,
> -                                     unsigned int max_entries,
> -                                     unsigned int *count)
> +/*
> + * Run queued task_work, processing no more than max_entries, with the number
> + * of entries processed added to *count. If more entries than max_entries are
> + * available, the remainder simply stay on the queue for the next run.
> + */
> +void tctx_task_work_run(struct io_uring_task *tctx, unsigned int max_entries,
> +                       unsigned int *count)
>  {
> -       struct llist_node *node;
> +       struct io_ring_ctx *ctx = NULL;
> +       struct io_tw_state ts = { };
>
> -       node = llist_del_all(&tctx->task_list);
> -       if (node) {
> -               node = llist_reverse_order(node);
> -               node = io_handle_tw_list(node, count, max_entries);
> +       while (*count < max_entries) {
> +               struct llist_node *node = mpscq_pop(&tctx->task_list,
> +                                                   &tctx->task_head);
> +               struct io_kiocb *req;
> +
> +               if (!node) {
> +                       if (mpscq_empty(&tctx->task_list))
> +                               break;
> +                       /*
> +                        * A producer has published a node but hasn't
> +                        * linked it into the queue yet (see mpscq_pop()).
> +                        * Give it a chance to finish rather than spinning,
> +                        * and don't sit on the ctx lock while doing so.
> +                        */
> +                       ctx_flush_and_put(ctx, ts);
> +                       ctx = NULL;
> +                       cond_resched();
> +                       continue;
> +               }
> +               req = container_of(node, struct io_kiocb, io_task_work.node);
> +               if (req->ctx != ctx) {
> +                       ctx_flush_and_put(ctx, ts);
> +                       ctx = req->ctx;
> +                       mutex_lock(&ctx->uring_lock);
> +                       percpu_ref_get(&ctx->refs);
> +                       ts.cancel = io_should_terminate_tw(ctx);
> +               }
> +               INDIRECT_CALL_2(req->io_task_work.func,
> +                               io_poll_task_func, io_req_rw_complete,
> +                               (struct io_tw_req){req}, ts);
> +               (*count)++;
> +               if (unlikely(need_resched())) {
> +                       ctx_flush_and_put(ctx, ts);
> +                       ctx = NULL;
> +                       cond_resched();
> +               }
>         }
> +       ctx_flush_and_put(ctx, ts);
>
>         /* relaxed read is enough as only the task itself sets ->in_cancel */
>         if (unlikely(atomic_read(&tctx->in_cancel)))
>                 io_uring_drop_tctx_refs(current);
>
>         trace_io_uring_task_work_run(tctx, *count);
> -       return node;
>  }
>
>  void tctx_task_work(struct callback_head *cb)
>  {
>         struct io_uring_task *tctx;
> -       struct llist_node *ret;
>         unsigned int count = 0;
>
>         tctx = container_of(cb, struct io_uring_task, task_work);
> -       ret = tctx_task_work_run(tctx, UINT_MAX, &count);
> -       /* can't happen */
> -       WARN_ON_ONCE(ret);
> +       clear_bit(0, &tctx->tw_pending);
> +       smp_mb__after_atomic();
> +       tctx_task_work_run(tctx, UINT_MAX, &count);
>  }
>
>  /*
> @@ -220,7 +246,7 @@ void io_req_normal_work_add(struct io_kiocb *req)
>         struct io_ring_ctx *ctx = req->ctx;
>
>         /* task_work already pending, we're done */
> -       if (!llist_add(&req->io_task_work.node, &tctx->task_list))
> +       if (!mpscq_push(&tctx->task_list, &req->io_task_work.node))
>                 return;
>
>         /*
> @@ -236,10 +262,14 @@ void io_req_normal_work_add(struct io_kiocb *req)
>                 return;
>         }
>
> +       /* task_work must only be added once */
> +       if (test_and_set_bit(0, &tctx->tw_pending))
> +               return;

Is tw_pending necessary? How come the task_work_add() exclusivity
isn't already provided by the mpscq_push() check above?

Best,
Caleb

> +
>         if (likely(!task_work_add(tctx->task, &tctx->task_work, ctx->notify_method)))
>                 return;
>
> -       io_fallback_tw(tctx, false);
> +       io_fallback_tw(tctx);
>  }
>
>  void io_req_task_work_add_remote(struct io_kiocb *req, unsigned flags)
> diff --git a/io_uring/tw.h b/io_uring/tw.h
> index f42db5fdbded..387e52004da8 100644
> --- a/io_uring/tw.h
> +++ b/io_uring/tw.h
> @@ -25,8 +25,8 @@ static inline bool io_should_terminate_tw(struct io_ring_ctx *ctx)
>  }
>
>  void io_req_task_work_add_remote(struct io_kiocb *req, unsigned flags);
> -struct llist_node *io_handle_tw_list(struct llist_node *node, unsigned int *count, unsigned int max_entries);
>  void tctx_task_work(struct callback_head *cb);
> +void io_tctx_fallback_work(struct work_struct *work);
>  int io_run_local_work(struct io_ring_ctx *ctx, int min_events, int max_events);
>  int io_run_task_work_sig(struct io_ring_ctx *ctx);
>
> @@ -36,7 +36,7 @@ int io_run_local_work_locked(struct io_ring_ctx *ctx, int min_events);
>
>  void io_req_local_work_add(struct io_kiocb *req, unsigned flags);
>  void io_req_normal_work_add(struct io_kiocb *req);
> -struct llist_node *tctx_task_work_run(struct io_uring_task *tctx, unsigned int max_entries, unsigned int *count);
> +void tctx_task_work_run(struct io_uring_task *tctx, unsigned int max_entries, unsigned int *count);
>
>  static inline void __io_req_task_work_add(struct io_kiocb *req, unsigned flags)
>  {
> --
> 2.53.0
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/6] io_uring: switch normal task_work to a mpscq
  2026-06-12 18:59   ` Caleb Sander Mateos
@ 2026-06-12 19:37     ` Jens Axboe
  2026-06-13  2:26       ` Caleb Sander Mateos
  0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2026-06-12 19:37 UTC (permalink / raw)
  To: Caleb Sander Mateos; +Cc: io-uring, dvyukov, krisman

On 6/12/26 12:59 PM, Caleb Sander Mateos wrote:
>> @@ -236,10 +262,14 @@ void io_req_normal_work_add(struct io_kiocb *req)
>>                 return;
>>         }
>>
>> +       /* task_work must only be added once */
>> +       if (test_and_set_bit(0, &tctx->tw_pending))
>> +               return;
> 
> Is tw_pending necessary? How come the task_work_add() exclusivity
> isn't already provided by the mpscq_push() check above?

It is, because the transition from empty -> not-empty no longer works
for that, as the mpscq emtpies one-by-one rather than with a delete-all
kind of primitive.

I missed that originally and things blew up spectacularly very quickly
:-)

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/6] io_uring: switch normal task_work to a mpscq
  2026-06-12 19:37     ` Jens Axboe
@ 2026-06-13  2:26       ` Caleb Sander Mateos
  2026-06-13 12:08         ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Caleb Sander Mateos @ 2026-06-13  2:26 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, dvyukov, krisman

On Fri, Jun 12, 2026 at 12:37 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 6/12/26 12:59 PM, Caleb Sander Mateos wrote:
> >> @@ -236,10 +262,14 @@ void io_req_normal_work_add(struct io_kiocb *req)
> >>                 return;
> >>         }
> >>
> >> +       /* task_work must only be added once */
> >> +       if (test_and_set_bit(0, &tctx->tw_pending))
> >> +               return;
> >
> > Is tw_pending necessary? How come the task_work_add() exclusivity
> > isn't already provided by the mpscq_push() check above?
>
> It is, because the transition from empty -> not-empty no longer works
> for that, as the mpscq emtpies one-by-one rather than with a delete-all
> kind of primitive.

Sorry, I'm still not following why the empty check doesn't suffice.
It's true that mpscq elements can be removed from the head one at a
time, but mpscq_push() will continue to return false until the
consumer pops all the elements and successfully sets tail back to
&stub. mpscq_push() will return true once when tail transitions away
from &stub, and then not again until the task work runs and sets tail
back to &stub.

Thanks,
Caleb

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/6] io_uring: grab RCU read lock marking task run
  2026-06-12  2:48 ` [PATCH 1/6] io_uring: grab RCU read lock marking task run Jens Axboe
@ 2026-06-13  2:27   ` Caleb Sander Mateos
  0 siblings, 0 replies; 22+ messages in thread
From: Caleb Sander Mateos @ 2026-06-13  2:27 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, dvyukov, krisman

On Thu, Jun 11, 2026 at 7:51 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> Not required right now, as io_req_local_work_add() already calls this
> helper with the RCU read lock held. But in preparation for that not
> being the case, grab it locally.
>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>

> ---
>  io_uring/tw.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/io_uring/tw.c b/io_uring/tw.c
> index 023d5e6bc491..f4335c8d50d9 100644
> --- a/io_uring/tw.c
> +++ b/io_uring/tw.c
> @@ -158,11 +158,11 @@ void tctx_task_work(struct callback_head *cb)
>   */
>  static void io_ctx_mark_taskrun(struct io_ring_ctx *ctx)
>  {
> -       lockdep_assert_in_rcu_read_lock();
> -
>         if (ctx->flags & IORING_SETUP_TASKRUN_FLAG) {
> -               struct io_rings *rings = rcu_dereference(ctx->rings_rcu);
> +               struct io_rings *rings;
>
> +               guard(rcu)();
> +               rings = rcu_dereference(ctx->rings_rcu);
>                 atomic_or(IORING_SQ_TASKRUN, &rings->sq_flags);
>         }
>  }
> --
> 2.53.0
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/6] io_uring/mpscq: add lockless multi-producer, single-consumer FIFO queue
  2026-06-12  2:48 ` [PATCH 2/6] io_uring/mpscq: add lockless multi-producer, single-consumer FIFO queue Jens Axboe
@ 2026-06-13  2:40   ` Caleb Sander Mateos
  2026-06-13 12:22     ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Caleb Sander Mateos @ 2026-06-13  2:40 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, dvyukov, krisman

On Thu, Jun 11, 2026 at 7:51 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> Local task_work is currently using llists for managing the work,
> but that's a LIFO type of list. This means that running this task_work
> needs to reverse the list first, to ensure fairness in running the
> queued items.
>
> Add a lockless FIFO queued, based on Dmitry Vyukov's intrusive MPSC
> node-based queue algorithm, modified with an externally held consumer
> cursor and conditional stub reinsertion. See comments in the header.
>
> Producers are wait-free: a push is a single xchg() on the queue tail,
> which serializes concurrent producers and defines the FIFO order, plus
> a store linking the node to its predecessor. There are no cmpxchg retry
> loops, and pushing is safe from any context, including hardirq.
>
> The cost of linked list FIFO ordering is that a push publishes the node
> in two steps - the xchg() makes it visible as the new tail before the
> subsequent store links it into the chain that is reachable from the
> head. A consumer hitting that window gets a NULL from mpscq_pop() while
> mpscq_empty() reports false, and must retry later rather than treat the
> queue as empty. The window is two instructions wide, but a producer can
> get preempted inside it, so the consumer must not busy wait on it.
>
> The consumer side supports a single consumer at a time, with callers
> providing their own serialization. A stub node, which also defines the
> empty state (tail == stub), allows the consumer to detach the final
> node without racing against producer link stores: that node is only
> handed out once the stub has been cmpxchg'ed back in as the tail. This
> also guarantees that the previous tail returned by mpscq_push() cannot
> get freed before that push has linked it, making it always valid for
> comparisons.
>
> The consumer cursor is deliberately not part of the queue struct - the
> caller owns it and passes it to mpscq_pop(). This is done to separate
> the consumer and producers cacheline. The cursor is written for every
> popped entry, and keeping it on the same cacheline as ->tail would have
> the consumer invalidating the line that producers need for every push.
> Keeping it external lets the caller place it with its own consumer side
> data instead.
>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  include/linux/io_uring_types.h |  12 ++++
>  io_uring/mpscq.h               | 118 +++++++++++++++++++++++++++++++++
>  2 files changed, 130 insertions(+)
>  create mode 100644 io_uring/mpscq.h
>
> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> index aa4d5477f859..85e12b4884a5 100644
> --- a/include/linux/io_uring_types.h
> +++ b/include/linux/io_uring_types.h
> @@ -55,6 +55,18 @@ struct io_wq_work_list {
>         struct io_wq_work_node *last;
>  };
>
> +/*
> + * Lockless multi-producer, single-consumer FIFO queue, see
> + * io_uring/mpscq.h for the implementation and rules. Defined here so
> + * that it can be embedded in io_ring_ctx. This is the producer side
> + * only - the consumer cursor is kept separately, on a cacheline that
> + * isn't dirtied by the producers.
> + */
> +struct mpscq {
> +       struct llist_node       *tail;          /* producers */
> +       struct llist_node       stub;
> +};
> +
>  struct io_wq_work {
>         struct io_wq_work_node list;
>         atomic_t flags;
> diff --git a/io_uring/mpscq.h b/io_uring/mpscq.h
> new file mode 100644
> index 000000000000..bc482d10e0f3
> --- /dev/null
> +++ b/io_uring/mpscq.h
> @@ -0,0 +1,118 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef IOU_MPSCQ_H
> +#define IOU_MPSCQ_H

#include <linux/io_uring_types.h> so this header can compile on its own?

> +
> +/*
> + * mpscq - lockless multi-producer, single-consumer FIFO queue
> + *
> + * Unlike llist, which is LIFO ordered and hence needs an O(n)
> + * llist_reverse_order() pass before entries can be processed in queue order,
> + * this queue hands out nodes in the order they were pushed.
> + *
> + * The consumer cursor is held by the caller rather than in the queue struct
> + * (see below), and with the stub reinsertion done as a single cmpxchg attempt
> + * instead of an unconditional push, keeping tail == stub a reliable empty test
> + * while a producer is in the middle of a push.
> + *
> + * Producers may run in any context (task, softirq, hardirq) and are wait-free:
> + * a push is one xchg() plus one store, with no retry loops. FIFO order between
> + * producers is the order in which the xchg() on ->tail serializes them.
> + *
> + * The price for linked-list FIFO is that a push publishes the node in two
> + * steps: the xchg() makes it the new tail, and the subsequent store links it to
> + * its predecessor. In between, the tail end of the queue is not yet reachable
> + * from the head. mpscq_pop() detects this and returns NULL, while mpscq_empty()
> + * reports false. The consumer must not treat such a NULL as "queue empty" - it
> + * should retry later. The window is two instructions wide, but a producer can
> + * be preempted inside it, so the consumer must not spin on it while holding
> + * resources the producer might need to make progress.
> + *
> + * The consumer side only supports a single consumer at a time, callers must
> + * provide their own serialization for it. The stub node is what allows the
> + * consumer to detach the final node without racing with the link stores of
> + * producers. This scheme also guarantees that the previous tail observed by
> + * mpscq_push() cannot be freed by the consumer until the push has linked it,
> + * which is what makes the deferred link store safe.
> + *
> + * The queue struct only holds the producer side. The consumer keeps its cursor
> + * (the oldest not yet handed out node) externally and passes it to mpscq_pop(),
> + * so that it can be placed on a different cacheline: the cursor is written for
> + * every pop, and having it share a line with ->tail would have the consumer
> + * invalidating the line that producers need for every push.
> + */
> +static inline void mpscq_init(struct mpscq *q, struct llist_node **headp)
> +{
> +       q->tail = *headp = &q->stub;
> +       q->stub.next = NULL;
> +}
> +
> +/*
> + * Returns true if the queue holds no entries that mpscq_pop() hasn't handed out
> + * yet. May be called from any context. Note that !empty doesn't guarantee that
> + * mpscq_pop() will return an entry yet, see the in-flight producer window
> + * above.
> + */
> +static inline bool mpscq_empty(struct mpscq *q)
> +{
> +       return READ_ONCE(q->tail) == &q->stub;
> +}
> +
> +/*
> + * Push a node onto the queue. Safe against concurrent pushes from any context,
> + * and against the (single) consumer. Returns true if the queue was empty
> + * before this push.
> + */
> +static inline bool mpscq_push(struct mpscq *q, struct llist_node *node)
> +{
> +       struct llist_node *prev;
> +
> +       node->next = NULL;
> +       /*
> +        * xchg() implies a full barrier, so the initialization of the
> +        * entry (including ->next above) is visible before the node can
> +        * be reached, either via ->tail or via ->next chasing from the
> +        * head once the store below has linked it.
> +        */
> +       prev = xchg(&q->tail, node);
> +       WRITE_ONCE(prev->next, node);
> +       return prev == &q->stub;
> +}
> +
> +/*
> + * Pop the oldest node off the queue, or return NULL if no node is available.
> + * NULL is returned both when the queue is empty and when a producer has
> + * published a node via ->tail but hasn't linked it yet; use mpscq_empty() to
> + * tell the two apart. Single consumer only, with headp being the consumer
> + * cursor that mpscq_init() set up.
> + */
> +static inline struct llist_node *mpscq_pop(struct mpscq *q,
> +                                          struct llist_node **headp)
> +{
> +       struct llist_node *head = *headp, *next;
> +
> +       if (head == &q->stub) {
> +               head = READ_ONCE(head->next);
> +               if (!head)
> +                       return NULL;
> +               *headp = head;
> +       }
> +       next = READ_ONCE(head->next);
> +       if (next) {
> +               *headp = next;
> +               return head;
> +       }
> +       /*
> +        * 'head' is the last linked node, it can only be handed out once the
> +        * stub has taken its place as the tail. If the cmpxchg fails, a
> +        * producer has made a new node the tail but hasn't linked 'head' to
> +        * it yet - bail and let the caller retry.
> +        */
> +       q->stub.next = NULL;

I think this could be moved before *headp = head. That way it only
runs once each time the queue becomes nonempty rather than on every
attempt to switch tail back to &stub. And it would keep next =
READ_ONCE(head->next) and try_cmpxchg(&q->tail, &head, &q->stub))
closer together, reducing the window where the consumer could lose the
race to pop the last element.

Other that that,
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>

> +       if (try_cmpxchg(&q->tail, &head, &q->stub)) {
> +               *headp = &q->stub;
> +               return head;
> +       }
> +       return NULL;
> +}
> +
> +#endif /* IOU_MPSCQ_H */
> --
> 2.53.0
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/6] io_uring: switch normal task_work to a mpscq
  2026-06-13  2:26       ` Caleb Sander Mateos
@ 2026-06-13 12:08         ` Jens Axboe
  2026-06-15 18:33           ` Caleb Sander Mateos
  0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2026-06-13 12:08 UTC (permalink / raw)
  To: Caleb Sander Mateos; +Cc: io-uring, dvyukov, krisman

On 6/12/26 8:26 PM, Caleb Sander Mateos wrote:
> On Fri, Jun 12, 2026 at 12:37?PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 6/12/26 12:59 PM, Caleb Sander Mateos wrote:
>>>> @@ -236,10 +262,14 @@ void io_req_normal_work_add(struct io_kiocb *req)
>>>>                 return;
>>>>         }
>>>>
>>>> +       /* task_work must only be added once */
>>>> +       if (test_and_set_bit(0, &tctx->tw_pending))
>>>> +               return;
>>>
>>> Is tw_pending necessary? How come the task_work_add() exclusivity
>>> isn't already provided by the mpscq_push() check above?
>>
>> It is, because the transition from empty -> not-empty no longer works
>> for that, as the mpscq emtpies one-by-one rather than with a delete-all
>> kind of primitive.
> 
> Sorry, I'm still not following why the empty check doesn't suffice.
> It's true that mpscq elements can be removed from the head one at a
> time, but mpscq_push() will continue to return false until the
> consumer pops all the elements and successfully sets tail back to
> &stub. mpscq_push() will return true once when tail transitions away
> from &stub, and then not again until the task work runs and sets tail
> back to &stub.

Let's say the task_work is currently running, a producer is adding more.
It finds queue empty, re-adds the task_work. That part is fine, we can
add the task_work while it's running as it has been detached already.
The task_work keeps running and also prunes this new item. Producer adds
another one, finds the queue empty, re-adds task_work. This one is not
OK, the task_work was already re-added when it previously found it
empty. Boom.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 2/6] io_uring/mpscq: add lockless multi-producer, single-consumer FIFO queue
  2026-06-13  2:40   ` Caleb Sander Mateos
@ 2026-06-13 12:22     ` Jens Axboe
  0 siblings, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2026-06-13 12:22 UTC (permalink / raw)
  To: Caleb Sander Mateos; +Cc: io-uring, dvyukov, krisman

On 6/12/26 8:40 PM, Caleb Sander Mateos wrote:
>> diff --git a/io_uring/mpscq.h b/io_uring/mpscq.h
>> new file mode 100644
>> index 000000000000..bc482d10e0f3
>> --- /dev/null
>> +++ b/io_uring/mpscq.h
>> @@ -0,0 +1,118 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#ifndef IOU_MPSCQ_H
>> +#define IOU_MPSCQ_H
> 
> #include <linux/io_uring_types.h> so this header can compile on its own?

Sure, we can do that.

>> +static inline struct llist_node *mpscq_pop(struct mpscq *q,
>> +                                          struct llist_node **headp)
>> +{
>> +       struct llist_node *head = *headp, *next;
>> +
>> +       if (head == &q->stub) {
>> +               head = READ_ONCE(head->next);
>> +               if (!head)
>> +                       return NULL;
>> +               *headp = head;
>> +       }
>> +       next = READ_ONCE(head->next);
>> +       if (next) {
>> +               *headp = next;
>> +               return head;
>> +       }
>> +       /*
>> +        * 'head' is the last linked node, it can only be handed out once the
>> +        * stub has taken its place as the tail. If the cmpxchg fails, a
>> +        * producer has made a new node the tail but hasn't linked 'head' to
>> +        * it yet - bail and let the caller retry.
>> +        */
>> +       q->stub.next = NULL;
> 
> I think this could be moved before *headp = head. That way it only
> runs once each time the queue becomes nonempty rather than on every
> attempt to switch tail back to &stub. And it would keep next =
> READ_ONCE(head->next) and try_cmpxchg(&q->tail, &head, &q->stub))
> closer together, reducing the window where the consumer could lose the
> race to pop the last element.

That's a nice observation! Yes, that looks correct to me, I'll fold it
in.

> Other that that,
> Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>

Thanks!

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/6] io_uring: switch normal task_work to a mpscq
  2026-06-13 12:08         ` Jens Axboe
@ 2026-06-15 18:33           ` Caleb Sander Mateos
  2026-06-15 18:47             ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Caleb Sander Mateos @ 2026-06-15 18:33 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, dvyukov, krisman

On Sat, Jun 13, 2026 at 5:08 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 6/12/26 8:26 PM, Caleb Sander Mateos wrote:
> > On Fri, Jun 12, 2026 at 12:37?PM Jens Axboe <axboe@kernel.dk> wrote:
> >>
> >> On 6/12/26 12:59 PM, Caleb Sander Mateos wrote:
> >>>> @@ -236,10 +262,14 @@ void io_req_normal_work_add(struct io_kiocb *req)
> >>>>                 return;
> >>>>         }
> >>>>
> >>>> +       /* task_work must only be added once */
> >>>> +       if (test_and_set_bit(0, &tctx->tw_pending))
> >>>> +               return;
> >>>
> >>> Is tw_pending necessary? How come the task_work_add() exclusivity
> >>> isn't already provided by the mpscq_push() check above?
> >>
> >> It is, because the transition from empty -> not-empty no longer works
> >> for that, as the mpscq emtpies one-by-one rather than with a delete-all
> >> kind of primitive.
> >
> > Sorry, I'm still not following why the empty check doesn't suffice.
> > It's true that mpscq elements can be removed from the head one at a
> > time, but mpscq_push() will continue to return false until the
> > consumer pops all the elements and successfully sets tail back to
> > &stub. mpscq_push() will return true once when tail transitions away
> > from &stub, and then not again until the task work runs and sets tail
> > back to &stub.
>
> Let's say the task_work is currently running, a producer is adding more.
> It finds queue empty, re-adds the task_work. That part is fine, we can
> add the task_work while it's running as it has been detached already.
> The task_work keeps running and also prunes this new item. Producer adds
> another one, finds the queue empty, re-adds task_work. This one is not
> OK, the task_work was already re-added when it previously found it
> empty. Boom.

Ah right, I forgot that mpscq_pop() can both return a popped node and
set the tail back to &stub. Maybe it would make sense for it to return
whether the queue has been marked empty and break out of
tctx_task_work_run() in that case instead of relying on a separate
call to mpscq_empty()? The atomic RMW for tw_pending every time the
queue transitions between empty and non-empty seems like it could be
quite expensive.

Best,
Caleb

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/6] io_uring: switch normal task_work to a mpscq
  2026-06-15 18:33           ` Caleb Sander Mateos
@ 2026-06-15 18:47             ` Jens Axboe
  2026-06-15 20:04               ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2026-06-15 18:47 UTC (permalink / raw)
  To: Caleb Sander Mateos; +Cc: io-uring, dvyukov, krisman

On 6/15/26 12:33 PM, Caleb Sander Mateos wrote:
> On Sat, Jun 13, 2026 at 5:08?AM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 6/12/26 8:26 PM, Caleb Sander Mateos wrote:
>>> On Fri, Jun 12, 2026 at 12:37?PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>
>>>> On 6/12/26 12:59 PM, Caleb Sander Mateos wrote:
>>>>>> @@ -236,10 +262,14 @@ void io_req_normal_work_add(struct io_kiocb *req)
>>>>>>                 return;
>>>>>>         }
>>>>>>
>>>>>> +       /* task_work must only be added once */
>>>>>> +       if (test_and_set_bit(0, &tctx->tw_pending))
>>>>>> +               return;
>>>>>
>>>>> Is tw_pending necessary? How come the task_work_add() exclusivity
>>>>> isn't already provided by the mpscq_push() check above?
>>>>
>>>> It is, because the transition from empty -> not-empty no longer works
>>>> for that, as the mpscq emtpies one-by-one rather than with a delete-all
>>>> kind of primitive.
>>>
>>> Sorry, I'm still not following why the empty check doesn't suffice.
>>> It's true that mpscq elements can be removed from the head one at a
>>> time, but mpscq_push() will continue to return false until the
>>> consumer pops all the elements and successfully sets tail back to
>>> &stub. mpscq_push() will return true once when tail transitions away
>>> from &stub, and then not again until the task work runs and sets tail
>>> back to &stub.
>>
>> Let's say the task_work is currently running, a producer is adding more.
>> It finds queue empty, re-adds the task_work. That part is fine, we can
>> add the task_work while it's running as it has been detached already.
>> The task_work keeps running and also prunes this new item. Producer adds
>> another one, finds the queue empty, re-adds task_work. This one is not
>> OK, the task_work was already re-added when it previously found it
>> empty. Boom.
> 
> Ah right, I forgot that mpscq_pop() can both return a popped node and
> set the tail back to &stub. Maybe it would make sense for it to return
> whether the queue has been marked empty and break out of
> tctx_task_work_run() in that case instead of relying on a separate
> call to mpscq_empty()? The atomic RMW for tw_pending every time the
> queue transitions between empty and non-empty seems like it could be
> quite expensive.

We could tweak it like that. I didn't look too closely as this is the
!DEFER case and hence a lot less interesting, but if you want to send a
patch my way I'd be happy to stage it on top.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/6] io_uring: switch normal task_work to a mpscq
  2026-06-15 18:47             ` Jens Axboe
@ 2026-06-15 20:04               ` Jens Axboe
  2026-06-15 20:40                 ` Caleb Sander Mateos
  0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2026-06-15 20:04 UTC (permalink / raw)
  To: Caleb Sander Mateos; +Cc: io-uring, dvyukov, krisman

On 6/15/26 12:47 PM, Jens Axboe wrote:
> On 6/15/26 12:33 PM, Caleb Sander Mateos wrote:
>> On Sat, Jun 13, 2026 at 5:08?AM Jens Axboe <axboe@kernel.dk> wrote:
>>>
>>> On 6/12/26 8:26 PM, Caleb Sander Mateos wrote:
>>>> On Fri, Jun 12, 2026 at 12:37?PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>
>>>>> On 6/12/26 12:59 PM, Caleb Sander Mateos wrote:
>>>>>>> @@ -236,10 +262,14 @@ void io_req_normal_work_add(struct io_kiocb *req)
>>>>>>>                 return;
>>>>>>>         }
>>>>>>>
>>>>>>> +       /* task_work must only be added once */
>>>>>>> +       if (test_and_set_bit(0, &tctx->tw_pending))
>>>>>>> +               return;
>>>>>>
>>>>>> Is tw_pending necessary? How come the task_work_add() exclusivity
>>>>>> isn't already provided by the mpscq_push() check above?
>>>>>
>>>>> It is, because the transition from empty -> not-empty no longer works
>>>>> for that, as the mpscq emtpies one-by-one rather than with a delete-all
>>>>> kind of primitive.
>>>>
>>>> Sorry, I'm still not following why the empty check doesn't suffice.
>>>> It's true that mpscq elements can be removed from the head one at a
>>>> time, but mpscq_push() will continue to return false until the
>>>> consumer pops all the elements and successfully sets tail back to
>>>> &stub. mpscq_push() will return true once when tail transitions away
>>>> from &stub, and then not again until the task work runs and sets tail
>>>> back to &stub.
>>>
>>> Let's say the task_work is currently running, a producer is adding more.
>>> It finds queue empty, re-adds the task_work. That part is fine, we can
>>> add the task_work while it's running as it has been detached already.
>>> The task_work keeps running and also prunes this new item. Producer adds
>>> another one, finds the queue empty, re-adds task_work. This one is not
>>> OK, the task_work was already re-added when it previously found it
>>> empty. Boom.
>>
>> Ah right, I forgot that mpscq_pop() can both return a popped node and
>> set the tail back to &stub. Maybe it would make sense for it to return
>> whether the queue has been marked empty and break out of
>> tctx_task_work_run() in that case instead of relying on a separate
>> call to mpscq_empty()? The atomic RMW for tw_pending every time the
>> queue transitions between empty and non-empty seems like it could be
>> quite expensive.
> 
> We could tweak it like that. I didn't look too closely as this is the
> !DEFER case and hence a lot less interesting, but if you want to send a
> patch my way I'd be happy to stage it on top.

I took a look, and yes I think it actually comes out nicer this way.
Good suggestion! It also helps cap the number of task_work items run,
which is a nice side effect. What do you think?

Needs a commit message obviously.

commit 572a1fb6d0f25b706ff044fcf141827f49db2ec0
Author: Jens Axboe <axboe@kernel.dk>
Date:   Mon Jun 15 13:43:16 2026 -0600

    io_uring: get rid of tw_pending for !DEFER task work
    
    Suggested-by: Caleb Sander Mateos <csander@purestorage.com>
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 6415a3353ee0..87151a5b62c1 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -149,8 +149,6 @@ struct io_uring_task {
 
 	struct { /* task_work */
 		struct mpscq		task_list;
-		/* BIT(0) guards adding tw only once */
-		unsigned long		tw_pending;
 		struct callback_head	task_work;
 	} ____cacheline_aligned_in_smp;
 };
diff --git a/io_uring/mpscq.h b/io_uring/mpscq.h
index c801384c6a0a..f910526766fd 100644
--- a/io_uring/mpscq.h
+++ b/io_uring/mpscq.h
@@ -122,4 +122,13 @@ static inline struct llist_node *mpscq_pop(struct mpscq *q,
 	return NULL;
 }
 
+/*
+ * Returns true if the most recent mpscq_pop() that returned a node also
+ * emptied the queue. Consumer must be serialized.
+ */
+static inline bool mpscq_pop_emptied(struct mpscq *q, struct llist_node *head)
+{
+	return head == &q->stub;
+}
+
 #endif /* IOU_MPSCQ_H */
diff --git a/io_uring/tw.c b/io_uring/tw.c
index e74372233f40..f2ce806b01a1 100644
--- a/io_uring/tw.c
+++ b/io_uring/tw.c
@@ -34,10 +34,6 @@ void io_tctx_fallback_work(struct work_struct *work)
 						  fallback_work);
 	unsigned int count = 0;
 
-	/* see tctx_task_work() - a set bit must always have a run coming */
-	clear_bit(0, &tctx->tw_pending);
-	smp_mb__after_atomic();
-
 	/*
 	 * Run the entries directly. We're in PF_KTHRED context, hence
 	 * io_should_terminate_tw() is true and they will be marked as
@@ -101,6 +97,13 @@ void tctx_task_work_run(struct io_uring_task *tctx, unsigned int max_entries,
 				io_poll_task_func, io_req_rw_complete,
 				(struct io_tw_req){req}, ts);
 		(*count)++;
+		/*
+		 * Break if most recent pop emptied the queue. This helps
+		 * bound task_work run, and also protects the regular
+		 * task_work addition.
+		 */
+		if (mpscq_pop_emptied(&tctx->task_list, tctx->task_head))
+			break;
 		if (unlikely(need_resched())) {
 			ctx_flush_and_put(ctx, ts);
 			ctx = NULL;
@@ -127,8 +130,6 @@ void tctx_task_work(struct callback_head *cb)
 	unsigned int count = 0;
 
 	tctx = container_of(cb, struct io_uring_task, task_work);
-	clear_bit(0, &tctx->tw_pending);
-	smp_mb__after_atomic();
 	tctx_task_work_run(tctx, UINT_MAX, &count);
 }
 
@@ -206,7 +207,7 @@ void io_req_normal_work_add(struct io_kiocb *req)
 	struct io_uring_task *tctx = req->tctx;
 	struct io_ring_ctx *ctx = req->ctx;
 
-	/* task_work already pending, we're done */
+	/* tw run already pending, nothing else to do */
 	if (!mpscq_push(&tctx->task_list, &req->io_task_work.node))
 		return;
 
@@ -223,10 +224,6 @@ void io_req_normal_work_add(struct io_kiocb *req)
 		return;
 	}
 
-	/* task_work must only be added once */
-	if (test_and_set_bit(0, &tctx->tw_pending))
-		return;
-
 	if (likely(!task_work_add(tctx->task, &tctx->task_work, ctx->notify_method)))
 		return;
 

-- 
Jens Axboe

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/6] io_uring: switch normal task_work to a mpscq
  2026-06-15 20:04               ` Jens Axboe
@ 2026-06-15 20:40                 ` Caleb Sander Mateos
  2026-06-15 21:51                   ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Caleb Sander Mateos @ 2026-06-15 20:40 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, dvyukov, krisman

On Mon, Jun 15, 2026 at 1:04 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 6/15/26 12:47 PM, Jens Axboe wrote:
> > On 6/15/26 12:33 PM, Caleb Sander Mateos wrote:
> >> On Sat, Jun 13, 2026 at 5:08?AM Jens Axboe <axboe@kernel.dk> wrote:
> >>>
> >>> On 6/12/26 8:26 PM, Caleb Sander Mateos wrote:
> >>>> On Fri, Jun 12, 2026 at 12:37?PM Jens Axboe <axboe@kernel.dk> wrote:
> >>>>>
> >>>>> On 6/12/26 12:59 PM, Caleb Sander Mateos wrote:
> >>>>>>> @@ -236,10 +262,14 @@ void io_req_normal_work_add(struct io_kiocb *req)
> >>>>>>>                 return;
> >>>>>>>         }
> >>>>>>>
> >>>>>>> +       /* task_work must only be added once */
> >>>>>>> +       if (test_and_set_bit(0, &tctx->tw_pending))
> >>>>>>> +               return;
> >>>>>>
> >>>>>> Is tw_pending necessary? How come the task_work_add() exclusivity
> >>>>>> isn't already provided by the mpscq_push() check above?
> >>>>>
> >>>>> It is, because the transition from empty -> not-empty no longer works
> >>>>> for that, as the mpscq emtpies one-by-one rather than with a delete-all
> >>>>> kind of primitive.
> >>>>
> >>>> Sorry, I'm still not following why the empty check doesn't suffice.
> >>>> It's true that mpscq elements can be removed from the head one at a
> >>>> time, but mpscq_push() will continue to return false until the
> >>>> consumer pops all the elements and successfully sets tail back to
> >>>> &stub. mpscq_push() will return true once when tail transitions away
> >>>> from &stub, and then not again until the task work runs and sets tail
> >>>> back to &stub.
> >>>
> >>> Let's say the task_work is currently running, a producer is adding more.
> >>> It finds queue empty, re-adds the task_work. That part is fine, we can
> >>> add the task_work while it's running as it has been detached already.
> >>> The task_work keeps running and also prunes this new item. Producer adds
> >>> another one, finds the queue empty, re-adds task_work. This one is not
> >>> OK, the task_work was already re-added when it previously found it
> >>> empty. Boom.
> >>
> >> Ah right, I forgot that mpscq_pop() can both return a popped node and
> >> set the tail back to &stub. Maybe it would make sense for it to return
> >> whether the queue has been marked empty and break out of
> >> tctx_task_work_run() in that case instead of relying on a separate
> >> call to mpscq_empty()? The atomic RMW for tw_pending every time the
> >> queue transitions between empty and non-empty seems like it could be
> >> quite expensive.
> >
> > We could tweak it like that. I didn't look too closely as this is the
> > !DEFER case and hence a lot less interesting, but if you want to send a
> > patch my way I'd be happy to stage it on top.
>
> I took a look, and yes I think it actually comes out nicer this way.
> Good suggestion! It also helps cap the number of task_work items run,
> which is a nice side effect. What do you think?

Yeah, looks nice! Thanks for doing this.

>
> Needs a commit message obviously.
>
> commit 572a1fb6d0f25b706ff044fcf141827f49db2ec0
> Author: Jens Axboe <axboe@kernel.dk>
> Date:   Mon Jun 15 13:43:16 2026 -0600
>
>     io_uring: get rid of tw_pending for !DEFER task work
>
>     Suggested-by: Caleb Sander Mateos <csander@purestorage.com>
>     Signed-off-by: Jens Axboe <axboe@kernel.dk>
>
> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> index 6415a3353ee0..87151a5b62c1 100644
> --- a/include/linux/io_uring_types.h
> +++ b/include/linux/io_uring_types.h
> @@ -149,8 +149,6 @@ struct io_uring_task {
>
>         struct { /* task_work */
>                 struct mpscq            task_list;
> -               /* BIT(0) guards adding tw only once */
> -               unsigned long           tw_pending;
>                 struct callback_head    task_work;
>         } ____cacheline_aligned_in_smp;
>  };
> diff --git a/io_uring/mpscq.h b/io_uring/mpscq.h
> index c801384c6a0a..f910526766fd 100644
> --- a/io_uring/mpscq.h
> +++ b/io_uring/mpscq.h
> @@ -122,4 +122,13 @@ static inline struct llist_node *mpscq_pop(struct mpscq *q,
>         return NULL;
>  }
>
> +/*
> + * Returns true if the most recent mpscq_pop() that returned a node also
> + * emptied the queue. Consumer must be serialized.
> + */
> +static inline bool mpscq_pop_emptied(struct mpscq *q, struct llist_node *head)
> +{
> +       return head == &q->stub;
> +}
> +
>  #endif /* IOU_MPSCQ_H */
> diff --git a/io_uring/tw.c b/io_uring/tw.c
> index e74372233f40..f2ce806b01a1 100644
> --- a/io_uring/tw.c
> +++ b/io_uring/tw.c
> @@ -34,10 +34,6 @@ void io_tctx_fallback_work(struct work_struct *work)
>                                                   fallback_work);
>         unsigned int count = 0;
>
> -       /* see tctx_task_work() - a set bit must always have a run coming */
> -       clear_bit(0, &tctx->tw_pending);
> -       smp_mb__after_atomic();
> -
>         /*
>          * Run the entries directly. We're in PF_KTHRED context, hence
>          * io_should_terminate_tw() is true and they will be marked as
> @@ -101,6 +97,13 @@ void tctx_task_work_run(struct io_uring_task *tctx, unsigned int max_entries,
>                                 io_poll_task_func, io_req_rw_complete,
>                                 (struct io_tw_req){req}, ts);
>                 (*count)++;
> +               /*
> +                * Break if most recent pop emptied the queue. This helps
> +                * bound task_work run, and also protects the regular
> +                * task_work addition.
> +                */
> +               if (mpscq_pop_emptied(&tctx->task_list, tctx->task_head))
> +                       break;

I think we can now remove the "if (mpscq_empty(&tctx->task_list))
break;" above? The queue must be nonempty initially, otherwise the
task work wouldn't have been scheduled. And if the queue is empty
after an attempted pop, the previous iteration of this loop must have
successfully marked the queue as empty.

Best,
Caleb

>                 if (unlikely(need_resched())) {
>                         ctx_flush_and_put(ctx, ts);
>                         ctx = NULL;
> @@ -127,8 +130,6 @@ void tctx_task_work(struct callback_head *cb)
>         unsigned int count = 0;
>
>         tctx = container_of(cb, struct io_uring_task, task_work);
> -       clear_bit(0, &tctx->tw_pending);
> -       smp_mb__after_atomic();
>         tctx_task_work_run(tctx, UINT_MAX, &count);
>  }
>
> @@ -206,7 +207,7 @@ void io_req_normal_work_add(struct io_kiocb *req)
>         struct io_uring_task *tctx = req->tctx;
>         struct io_ring_ctx *ctx = req->ctx;
>
> -       /* task_work already pending, we're done */
> +       /* tw run already pending, nothing else to do */
>         if (!mpscq_push(&tctx->task_list, &req->io_task_work.node))
>                 return;
>
> @@ -223,10 +224,6 @@ void io_req_normal_work_add(struct io_kiocb *req)
>                 return;
>         }
>
> -       /* task_work must only be added once */
> -       if (test_and_set_bit(0, &tctx->tw_pending))
> -               return;
> -
>         if (likely(!task_work_add(tctx->task, &tctx->task_work, ctx->notify_method)))
>                 return;
>
>
> --
> Jens Axboe

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/6] io_uring: switch normal task_work to a mpscq
  2026-06-15 20:40                 ` Caleb Sander Mateos
@ 2026-06-15 21:51                   ` Jens Axboe
  2026-06-16  0:22                     ` Caleb Sander Mateos
  0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2026-06-15 21:51 UTC (permalink / raw)
  To: Caleb Sander Mateos; +Cc: io-uring, dvyukov, krisman

On 6/15/26 2:40 PM, Caleb Sander Mateos wrote:
>> @@ -34,10 +34,6 @@ void io_tctx_fallback_work(struct work_struct *work)
>>                                                   fallback_work);
>>         unsigned int count = 0;
>>
>> -       /* see tctx_task_work() - a set bit must always have a run coming */
>> -       clear_bit(0, &tctx->tw_pending);
>> -       smp_mb__after_atomic();
>> -
>>         /*
>>          * Run the entries directly. We're in PF_KTHRED context, hence
>>          * io_should_terminate_tw() is true and they will be marked as
>> @@ -101,6 +97,13 @@ void tctx_task_work_run(struct io_uring_task *tctx, unsigned int max_entries,
>>                                 io_poll_task_func, io_req_rw_complete,
>>                                 (struct io_tw_req){req}, ts);
>>                 (*count)++;
>> +               /*
>> +                * Break if most recent pop emptied the queue. This helps
>> +                * bound task_work run, and also protects the regular
>> +                * task_work addition.
>> +                */
>> +               if (mpscq_pop_emptied(&tctx->task_list, tctx->task_head))
>> +                       break;
> 
> I think we can now remove the "if (mpscq_empty(&tctx->task_list))
> break;" above? The queue must be nonempty initially, otherwise the
> task work wouldn't have been scheduled. And if the queue is empty
> after an attempted pop, the previous iteration of this loop must have
> successfully marked the queue as empty.

We could, but then we'd need to special case the SQPOLL side. I think
it's better if we just leave it somewhat defensive as-is, it's just a
single compare anyway, non-atomic.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 4/6] io_uring: switch normal task_work to a mpscq
  2026-06-15 21:51                   ` Jens Axboe
@ 2026-06-16  0:22                     ` Caleb Sander Mateos
  0 siblings, 0 replies; 22+ messages in thread
From: Caleb Sander Mateos @ 2026-06-16  0:22 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, dvyukov, krisman

On Mon, Jun 15, 2026 at 2:51 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 6/15/26 2:40 PM, Caleb Sander Mateos wrote:
> >> @@ -34,10 +34,6 @@ void io_tctx_fallback_work(struct work_struct *work)
> >>                                                   fallback_work);
> >>         unsigned int count = 0;
> >>
> >> -       /* see tctx_task_work() - a set bit must always have a run coming */
> >> -       clear_bit(0, &tctx->tw_pending);
> >> -       smp_mb__after_atomic();
> >> -
> >>         /*
> >>          * Run the entries directly. We're in PF_KTHRED context, hence
> >>          * io_should_terminate_tw() is true and they will be marked as
> >> @@ -101,6 +97,13 @@ void tctx_task_work_run(struct io_uring_task *tctx, unsigned int max_entries,
> >>                                 io_poll_task_func, io_req_rw_complete,
> >>                                 (struct io_tw_req){req}, ts);
> >>                 (*count)++;
> >> +               /*
> >> +                * Break if most recent pop emptied the queue. This helps
> >> +                * bound task_work run, and also protects the regular
> >> +                * task_work addition.
> >> +                */
> >> +               if (mpscq_pop_emptied(&tctx->task_list, tctx->task_head))
> >> +                       break;
> >
> > I think we can now remove the "if (mpscq_empty(&tctx->task_list))
> > break;" above? The queue must be nonempty initially, otherwise the
> > task work wouldn't have been scheduled. And if the queue is empty
> > after an attempted pop, the previous iteration of this loop must have
> > successfully marked the queue as empty.
>
> We could, but then we'd need to special case the SQPOLL side. I think
> it's better if we just leave it somewhat defensive as-is, it's just a
> single compare anyway, non-atomic.

Fine by me.

Best,
Caleb

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2026-06-16  0:22 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-12  2:48 [PATCHSET v2] Add lockless MPSC FIFO queue for task work Jens Axboe
2026-06-12  2:48 ` [PATCH 1/6] io_uring: grab RCU read lock marking task run Jens Axboe
2026-06-13  2:27   ` Caleb Sander Mateos
2026-06-12  2:48 ` [PATCH 2/6] io_uring/mpscq: add lockless multi-producer, single-consumer FIFO queue Jens Axboe
2026-06-13  2:40   ` Caleb Sander Mateos
2026-06-13 12:22     ` Jens Axboe
2026-06-12  2:48 ` [PATCH 3/6] io_uring: switch local task_work to a mpscq Jens Axboe
2026-06-12  3:20   ` Caleb Sander Mateos
2026-06-12 12:23     ` Jens Axboe
2026-06-12  2:48 ` [PATCH 4/6] io_uring: switch normal " Jens Axboe
2026-06-12 18:59   ` Caleb Sander Mateos
2026-06-12 19:37     ` Jens Axboe
2026-06-13  2:26       ` Caleb Sander Mateos
2026-06-13 12:08         ` Jens Axboe
2026-06-15 18:33           ` Caleb Sander Mateos
2026-06-15 18:47             ` Jens Axboe
2026-06-15 20:04               ` Jens Axboe
2026-06-15 20:40                 ` Caleb Sander Mateos
2026-06-15 21:51                   ` Jens Axboe
2026-06-16  0:22                     ` Caleb Sander Mateos
2026-06-12  2:48 ` [PATCH 5/6] io_uring: run the tctx task_work fallback directly Jens Axboe
2026-06-12  2:48 ` [PATCH 6/6] io_uring: remove the per-ctx fallback task_work machinery Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox