[PATCHSET 0/2] Add lockless MPSC FIFO queue for task work

Linux io-uring development

* [PATCHSET 0/2] Add lockless MPSC FIFO queue for task work
@ 2026-06-11 15:58 Jens Axboe
  2026-06-11 15:58 ` [PATCH 1/2] io_uring/mpscq: add lockless multi-producer, single-consumer FIFO queue Jens Axboe
  2026-06-11 15:58 ` [PATCH 2/2] io_uring: switch local task_work to a mpscq Jens Axboe
  0 siblings, 2 replies; 16+ messages in thread
From: Jens Axboe @ 2026-06-11 15:58 UTC
  To: io-uring; +Cc: dvyukov

Hi,

Details are in the commits, but this adds a variant of an MPSC FIFO
queue based on Dmitry's intrusive MPSC node-based queue algorithm.
Main motivation is better cache locality between the consumer and
producers, and avoiding the need to reverse the llist before running
it. Numbers in patch 2.

Patch 1 adds the basic queue implementation, patch 2 adopts it for
DEFER_TASKRUN variants of io_uring.

Results are really promising. It clearly scales better with more
task work running or producing, and it avoids the added overhead
of needing to reverse the llist when local task work is run. It passes
all the regression tests and the benchmarking I've done. I've had a user
harness version of this running on arm64 and x86-64 as well.

Can also be found in a git tree here:

https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux.git/log/?h=io_uring-tw-mpscq

 include/linux/io_uring_types.h |  26 +++++-
 io_uring/io_uring.c            |   2 +-
 io_uring/loop.c                |   2 +-
 io_uring/mpscq.h               | 121 +++++++++++++++++++++++++++
 io_uring/tw.c                  | 145 ++++++++++++++++-----------------
 io_uring/tw.h                  |   4 +-
 io_uring/wait.c                |   8 +-
 io_uring/wait.h                |  20 ++++-
 8 files changed, 239 insertions(+), 89 deletions(-)

-- 
Jens Axboe


* [PATCH 1/2] io_uring/mpscq: add lockless multi-producer, single-consumer FIFO queue
  2026-06-11 15:58 [PATCHSET 0/2] Add lockless MPSC FIFO queue for task work Jens Axboe
@ 2026-06-11 15:58 ` Jens Axboe
  2026-06-11 16:49   ` Gabriel Krisman Bertazi
  2026-06-12  1:13   ` Caleb Sander Mateos
  2026-06-11 15:58 ` [PATCH 2/2] io_uring: switch local task_work to a mpscq Jens Axboe
  1 sibling, 2 replies; 16+ messages in thread
From: Jens Axboe @ 2026-06-11 15:58 UTC
  To: io-uring; +Cc: dvyukov, Jens Axboe

Local task_work is currently using llists for managing the work,
but that's a LIFO type of list. This means that running this task_work
needs to reverse the list first, to ensure fairness in running the
queued items.
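
As a rough illustration of the reversal cost described above (not part of
the patch), here is a single-threaded userspace sketch. The names
lifo_push(), reverse_order() and lifo_demo() are hypothetical stand-ins and
use plain pointers rather than the kernel's llist API:

```c
#include <assert.h>
#include <stddef.h>

struct node {
	struct node *next;
	int id;
};

/* LIFO add: the newest node always becomes the head of the chain. */
static void lifo_push(struct node **first, struct node *n)
{
	n->next = *first;
	*first = n;
}

/* O(n) pass that flips the chain back into the order nodes were added. */
static struct node *reverse_order(struct node *head)
{
	struct node *new_head = NULL;

	while (head) {
		struct node *next = head->next;

		head->next = new_head;
		new_head = head;
		head = next;
	}
	return new_head;
}

/* Adds ids 0..2, returns 1 if walking the reversed chain yields 0, 1, 2. */
static int lifo_demo(void)
{
	struct node n[3] = { { NULL, 0 }, { NULL, 1 }, { NULL, 2 } };
	struct node *first = NULL, *pos;
	int i, expect = 0;

	for (i = 0; i < 3; i++)
		lifo_push(&first, &n[i]);
	/* Without the reversal, the newest entry would run first. */
	if (first->id != 2)
		return 0;
	for (pos = reverse_order(first); pos; pos = pos->next)
		if (pos->id != expect++)
			return 0;
	return expect == 3;
}
```

Every run has to touch each node once just to restore the order, before any
of the work is executed.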

Add a lockless FIFO queue, based on Dmitry Vyukov's intrusive MPSC
node-based queue algorithm, modified with an externally held consumer
cursor and conditional stub reinsertion. See comments in the header.

Producers are wait-free: a push is a single xchg() on the queue tail,
which serializes concurrent producers and defines the FIFO order, plus
a store linking the node to its predecessor. There are no cmpxchg retry
loops, and pushing is safe from any context, including hardirq.
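
The push described above can be sketched in userspace C11 as follows. This
is only a model under stated assumptions: atomic_exchange() stands in for the
kernel's fully ordered xchg(), a release store stands in for WRITE_ONCE(),
and the names struct queue, queue_push() and push_demo() are hypothetical:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct qnode {
	_Atomic(struct qnode *) next;
};

struct queue {
	_Atomic(struct qnode *) tail;	/* producers */
	struct qnode stub;
};

static void queue_init(struct queue *q)
{
	atomic_init(&q->stub.next, NULL);
	atomic_init(&q->tail, &q->stub);
}

/* One exchange plus one store, no retry loop. Returns the previous tail. */
static struct qnode *queue_push(struct queue *q, struct qnode *node)
{
	struct qnode *prev;

	/* The node is still private here, plain initialization suffices. */
	atomic_init(&node->next, NULL);
	/* Step one: become the new tail, this defines the FIFO position. */
	prev = atomic_exchange(&q->tail, node);
	/* Step two: link the node to its predecessor. */
	atomic_store_explicit(&prev->next, node, memory_order_release);
	return prev;
}

/* Returns 1 if two pushes leave the chain stub -> a -> b with tail == b. */
static int push_demo(void)
{
	struct queue q;
	struct qnode a, b;

	queue_init(&q);
	/* The previous tail is the stub only for a push to an empty queue. */
	if (queue_push(&q, &a) != &q.stub)
		return 0;
	if (queue_push(&q, &b) != &a)
		return 0;
	return atomic_load(&q.stub.next) == &a &&
	       atomic_load(&a.next) == &b &&
	       atomic_load(&q.tail) == &b;
}
```

The demo is single-threaded, so it only shows the resulting chain and the
meaning of the returned previous tail, not the concurrent behavior.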

The cost of linked list FIFO ordering is that a push publishes the node
in two steps - the xchg() makes it visible as the new tail before the
subsequent store links it into the chain that is reachable from the
head. A consumer hitting that window gets a NULL from mpscq_pop() while
mpscq_empty() reports false, and must retry later rather than treat the
queue as empty. The window is two instructions wide, but a producer can
get preempted inside it, so the consumer must not busy wait on it.

The consumer side supports a single consumer at a time, with callers
providing their own serialization. A stub node, which also defines the
empty state (tail == stub), allows the consumer to detach the final
node without racing against producer link stores: that node is only
handed out once the stub has been cmpxchg'ed back in as the tail. This
also guarantees that the previous tail returned by mpscq_push() cannot
get freed before that push has linked it, making it always valid for
comparisons.
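
To make the in-flight window and the stub reinsertion concrete, here is a
userspace C11 model of the pop side. It is a sketch, not the patch itself:
the push is split into two hypothetical helpers, push_publish() and
push_link(), so that a single thread can park a producer inside the window,
and C11 atomics stand in for xchg(), try_cmpxchg() and READ_ONCE():

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct qnode {
	_Atomic(struct qnode *) next;
};

struct queue {
	_Atomic(struct qnode *) tail;	/* producers */
	struct qnode stub;
};

static void queue_init(struct queue *q, struct qnode **headp)
{
	atomic_init(&q->stub.next, NULL);
	atomic_init(&q->tail, &q->stub);
	*headp = &q->stub;
}

static bool queue_empty(struct queue *q)
{
	return atomic_load(&q->tail) == &q->stub;
}

/* Step one of a push: become the new tail, return the old one. */
static struct qnode *push_publish(struct queue *q, struct qnode *node)
{
	atomic_init(&node->next, NULL);
	return atomic_exchange(&q->tail, node);
}

/* Step two of a push: make the node reachable from its predecessor. */
static void push_link(struct qnode *prev, struct qnode *node)
{
	atomic_store_explicit(&prev->next, node, memory_order_release);
}

static struct qnode *queue_pop(struct queue *q, struct qnode **headp)
{
	struct qnode *head = *headp, *expected;
	struct qnode *next = atomic_load_explicit(&head->next, memory_order_acquire);

	if (head == &q->stub) {
		if (!next)
			return NULL;	/* empty, or first node not linked yet */
		*headp = head = next;
		next = atomic_load_explicit(&head->next, memory_order_acquire);
	}
	if (next) {
		*headp = next;
		return head;
	}
	/*
	 * 'head' is the last linked node. Hand it out only if the stub can
	 * take over as the tail, meaning no producer got in behind it.
	 */
	atomic_store_explicit(&q->stub.next, NULL, memory_order_relaxed);
	expected = head;
	if (atomic_compare_exchange_strong(&q->tail, &expected, &q->stub)) {
		*headp = &q->stub;
		return head;
	}
	return NULL;
}

/* Returns 1 if every step behaves as described, 0 otherwise. */
static int window_demo(void)
{
	struct queue q;
	struct qnode a, b;
	struct qnode *head, *prev;

	queue_init(&q, &head);
	prev = push_publish(&q, &a);
	push_link(prev, &a);		/* 'a' is fully pushed */
	prev = push_publish(&q, &b);	/* 'b' is published but not linked */

	/* 'a' is the last linked node but not the tail: pop must bail. */
	if (queue_pop(&q, &head) != NULL || queue_empty(&q))
		return 0;
	push_link(prev, &b);		/* the producer finishes its push */
	if (queue_pop(&q, &head) != &a || queue_pop(&q, &head) != &b)
		return 0;
	return queue_empty(&q) && queue_pop(&q, &head) == NULL;
}
```

While 'b' is unlinked, the pop returns NULL and the empty check reports
false, which is the "retry later" case. Once the link store lands, both
nodes come out in push order and the queue returns to tail == stub.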

The consumer cursor is deliberately not part of the queue struct - the
caller owns it and passes it to mpscq_pop(). This is done to separate
the consumer and producer cachelines. The cursor is written for

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/io_uring_types.h |  12 ++++
 io_uring/mpscq.h               | 121 +++++++++++++++++++++++++++++++++
 2 files changed, 133 insertions(+)
 create mode 100644 io_uring/mpscq.h

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index aa4d5477f859..85e12b4884a5 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -55,6 +55,18 @@ struct io_wq_work_list {
 	struct io_wq_work_node *last;
 };
 
+/*
+ * Lockless multi-producer, single-consumer FIFO queue, see
+ * io_uring/mpscq.h for the implementation and rules. Defined here so
+ * that it can be embedded in io_ring_ctx. This is the producer side
+ * only - the consumer cursor is kept separately, on a cacheline that
+ * isn't dirtied by the producers.
+ */
+struct mpscq {
+	struct llist_node	*tail;		/* producers */
+	struct llist_node	stub;
+};
+
 struct io_wq_work {
 	struct io_wq_work_node list;
 	atomic_t flags;
diff --git a/io_uring/mpscq.h b/io_uring/mpscq.h
new file mode 100644
index 000000000000..12172cef8394
--- /dev/null
+++ b/io_uring/mpscq.h
@@ -0,0 +1,121 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef IOU_MPSCQ_H
+#define IOU_MPSCQ_H
+
+/*
+ * mpscq - lockless multi-producer, single-consumer FIFO queue
+ *
+ * Unlike llist, which is LIFO ordered and hence needs an O(n)
+ * llist_reverse_order() pass before entries can be processed in queue order,
+ * this queue hands out nodes in the order they were pushed.
+ *
+ * The consumer cursor is held by the caller rather than in the queue struct
+ * (see below), and with the stub reinsertion done as a single cmpxchg attempt
+ * instead of an unconditional push, keeping tail == stub a reliable empty test
+ * while a producer is in the middle of a push.
+ *
+ * Producers may run in any context (task, softirq, hardirq) and are wait-free:
+ * a push is one xchg() plus one store, with no retry loops. FIFO order between
+ * producers is the order in which the xchg() on ->tail serializes them.
+ *
+ * The price for linked-list FIFO is that a push publishes the node in two
+ * steps: the xchg() makes it the new tail, and the subsequent store links it to
+ * its predecessor. In between, the tail end of the queue is not yet reachable
+ * from the head. mpscq_pop() detects this and returns NULL, while mpscq_empty()
+ * reports false. The consumer must not treat such a NULL as "queue empty" - it
+ * should retry later. The window is two instructions wide, but a producer can
+ * be preempted inside it, so the consumer must not spin on it while holding
+ * resources the producer might need to make progress.
+ *
+ * The consumer side only supports a single consumer at a time, callers must
+ * provide their own serialization for it. The stub node is what allows the
+ * consumer to detach the final node without racing with the link stores of
+ * producers. This scheme also guarantees that the previous tail returned by
+ * mpscq_push() cannot be freed by the consumer until the push that returned it
+ * has linked it, hence it's always safe to compare against (but not
+ * dereference, unless the caller otherwise guarantees its lifetime).
+ *
+ * The queue struct only holds the producer side. The consumer keeps its cursor
+ * (the oldest not yet handed out node) externally and passes it to mpscq_pop(),
+ * so that it can be placed on a different cacheline: the cursor is written for
+ * every pop, and having it share a line with ->tail would have the consumer
+ * invalidating the line that producers need for every push.
+ */
+static inline void mpscq_init(struct mpscq *q, struct llist_node **headp)
+{
+	q->tail = *headp = &q->stub;
+	q->stub.next = NULL;
+}
+
+/*
+ * Returns true if the queue holds no entries that mpscq_pop() hasn't handed out
+ * yet. May be called from any context. Note that !empty doesn't guarantee that
+ * mpscq_pop() will return an entry yet, see the in-flight producer window
+ * above.
+ */
+static inline bool mpscq_empty(struct mpscq *q)
+{
+	return READ_ONCE(q->tail) == &q->stub;
+}
+
+/*
+ * Push a node onto the queue. Safe against concurrent pushes from any context,
+ * and against the (single) consumer. Returns the previous tail node, which is
+ * &q->stub if and only if the queue was empty before this push.
+ */
+static inline struct llist_node *mpscq_push(struct mpscq *q,
+					    struct llist_node *node)
+{
+	struct llist_node *prev;
+
+	node->next = NULL;
+	/*
+	 * xchg() implies a full barrier, so the initialization of the
+	 * entry (including ->next above) is visible before the node can
+	 * be reached, either via ->tail or via ->next chasing from the
+	 * head once the store below has linked it.
+	 */
+	prev = xchg(&q->tail, node);
+	WRITE_ONCE(prev->next, node);
+	return prev;
+}
+
+/*
+ * Pop the oldest node off the queue, or return NULL if no node is available.
+ * NULL is returned both when the queue is empty and when a producer has
+ * published a node via ->tail but hasn't linked it yet; use mpscq_empty() to
+ * tell the two apart. Single consumer only, with headp being the consumer
+ * cursor that mpscq_init() set up.
+ */
+static inline struct llist_node *mpscq_pop(struct mpscq *q,
+					   struct llist_node **headp)
+{
+	struct llist_node *head = *headp;
+	struct llist_node *next = READ_ONCE(head->next);
+
+	if (head == &q->stub) {
+		if (!next)
+			return NULL;
+		*headp = next;
+		head = next;
+		next = READ_ONCE(head->next);
+	}
+	if (next) {
+		*headp = next;
+		return head;
+	}
+	/*
+	 * 'head' is the last linked node, it can only be handed out once the
+	 * stub has taken its place as the tail. If the cmpxchg fails, a
+	 * producer has made a new node the tail but hasn't linked it to 'head'
+	 * yet - bail and let the caller retry.
+	 */
+	q->stub.next = NULL;
+	if (try_cmpxchg(&q->tail, &head, &q->stub)) {
+		*headp = &q->stub;
+		return head;
+	}
+	return NULL;
+}
+
+#endif /* IOU_MPSCQ_H */
-- 
2.53.0



* [PATCH 2/2] io_uring: switch local task_work to a mpscq
  2026-06-11 15:58 [PATCHSET 0/2] Add lockless MPSC FIFO queue for task work Jens Axboe
  2026-06-11 15:58 ` [PATCH 1/2] io_uring/mpscq: add lockless multi-producer, single-consumer FIFO queue Jens Axboe
@ 2026-06-11 15:58 ` Jens Axboe
  2026-06-12  1:14   ` Caleb Sander Mateos
  1 sibling, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2026-06-11 15:58 UTC
  To: io-uring; +Cc: dvyukov, Jens Axboe

The local (DEFER_TASKRUN) task_work list is an llist, which is LIFO
ordered, and hence __io_run_local_work() has to restore the right
running order with an O(n) llist_reverse_order() pass first. On top of
that, a batch that gets capped by max_events needs the leftover entries
parked on a separate ->retry_llist, as they can't be pushed back to the
shared list.

Switch it to the FIFO mpscq. Adds are wait-free instead of a cmpxchg
retry loop, entries are popped in queue order with no reversal pass,
capping a run simply leaves the remainder on the queue, and
->retry_llist goes away entirely. The consumer cursor, ->work_head,
lives with the rest of the ->uring_lock protected state rather than
next to the queue, so that popping entries doesn't dirty the producer
side cacheline.
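
The cacheline split can be pictured with a small userspace layout sketch.
Everything here is an assumption made for illustration: the 64-byte line
size, the struct ctx_model name, and the use of alignas() where the kernel
would use ____cacheline_aligned_in_smp:

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHELINE	64	/* assumption: 64-byte cachelines */

struct node {
	struct node *next;
};

struct ctx_model {
	/* Consumer side: the cursor, written for every popped entry. */
	struct node *work_head;

	/* Producer side: starts on a cacheline of its own. */
	alignas(CACHELINE) struct {
		struct node *tail;
		struct node stub;
	} work_list;
};

/* Returns 1 if the cursor and the producer side sit on different lines. */
static int on_separate_lines(void)
{
	return offsetof(struct ctx_model, work_head) / CACHELINE !=
	       offsetof(struct ctx_model, work_list) / CACHELINE;
}
```

With this layout, a pop that updates work_head does not invalidate the line
holding tail, which is the line every producer has to write.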

For low amounts of task_work, this ends up being a bit more efficient
than the existing scheme. As an example of that, doing multishot
receives for 8 clients has the following task_work overhead:

     1.02%  sock-test  [kernel.kallsyms]  [k] io_req_local_work_add
     0.88%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work_loop
     0.60%  sock-test  [kernel.kallsyms]  [k] llist_reverse_order
     0.14%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work
     2.64% at ~46Gb/sec

and after this change:

     1.08%  sock-test  [kernel.kallsyms]  [k] io_req_local_work_add
     1.03%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work
     2.11% at ~53Gb/sec

which has less overhead even though that test run was faster. For a case
of having 1024 clients on a single ring:

     2.22%  sock-test  [kernel.kallsyms]  [k] llist_reverse_order
     0.84%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work_loop
     0.42%  sock-test  [kernel.kallsyms]  [k] io_req_local_work_add
     0.02%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work
     3.50% at ~24Gb/sec

we start to see the llist reversing taking a considerable amount of
time, and the total add+run task_work overhead is around 3.5%. After
the change:

     0.90%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work
     0.42%  sock-test  [kernel.kallsyms]  [k] io_req_local_work_add
     1.32% at ~26Gb/sec

most of that overhead is gone, and performance is better as well.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/io_uring_types.h |  14 +++-
 io_uring/io_uring.c            |   2 +-
 io_uring/loop.c                |   2 +-
 io_uring/tw.c                  | 145 ++++++++++++++++-----------------
 io_uring/tw.h                  |   4 +-
 io_uring/wait.c                |   8 +-
 io_uring/wait.h                |  20 ++++-
 7 files changed, 106 insertions(+), 89 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 85e12b4884a5..e918301da5fc 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -351,6 +351,14 @@ struct io_ring_ctx {
 		 */
 		atomic_t		cancel_seq;
 
+		/*
+		 * Consumer cursor for ->work_list, protected by ->uring_lock.
+		 * Deliberately kept away from the producer side of the queue,
+		 * as it's written for every popped entry, and the producer
+		 * cacheline is contended enough as it is.
+		 */
+		struct llist_node	*work_head;
+
 		/*
 		 * ->iopoll_list is protected by the ctx->uring_lock for
 		 * io_uring instances that don't use IORING_SETUP_SQPOLL.
@@ -417,10 +425,10 @@ struct io_ring_ctx {
 	 */
 	struct {
 		struct io_rings	__rcu	*rings_rcu;
-		struct llist_head	work_llist;
-		struct llist_head	retry_llist;
+		struct mpscq		work_list;
 		unsigned long		check_cq;
 		atomic_t		cq_wait_nr;
+		atomic_t		cq_wait_added;
 		atomic_t		cq_timeouts;
 		struct wait_queue_head	cq_wait;
 	} ____cacheline_aligned_in_smp;
@@ -742,8 +750,6 @@ struct io_kiocb {
 	 */
 	u16				buf_index;
 
-	unsigned			nr_tw;
-
 	/* REQ_F_* flags */
 	io_req_flags_t			flags;
 
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 753ac23401c5..16acd99ff083 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -280,7 +280,7 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
 	INIT_LIST_HEAD(&ctx->defer_list);
 	INIT_LIST_HEAD(&ctx->timeout_list);
 	INIT_LIST_HEAD(&ctx->ltimeout_list);
-	init_llist_head(&ctx->work_llist);
+	mpscq_init(&ctx->work_list, &ctx->work_head);
 	INIT_LIST_HEAD(&ctx->tctx_list);
 	mutex_init(&ctx->tctx_lock);
 	ctx->submit_state.free_list.next = NULL;
diff --git a/io_uring/loop.c b/io_uring/loop.c
index bbbb6ef14e6a..2ecc1cf49f84 100644
--- a/io_uring/loop.c
+++ b/io_uring/loop.c
@@ -11,7 +11,7 @@ static inline int io_loop_nr_cqes(const struct io_ring_ctx *ctx,
 
 static inline void io_loop_wait_start(struct io_ring_ctx *ctx, unsigned nr_wait)
 {
-	atomic_set(&ctx->cq_wait_nr, nr_wait);
+	io_cq_wait_arm(ctx, nr_wait);
 	set_current_state(TASK_INTERRUPTIBLE);
 }
 
diff --git a/io_uring/tw.c b/io_uring/tw.c
index 023d5e6bc491..4cf350cffb6c 100644
--- a/io_uring/tw.c
+++ b/io_uring/tw.c
@@ -14,6 +14,7 @@
 #include "rw.h"
 #include "eventfd.h"
 #include "wait.h"
+#include "mpscq.h"
 
 void io_fallback_req_func(struct work_struct *work)
 {
@@ -170,11 +171,8 @@ static void io_ctx_mark_taskrun(struct io_ring_ctx *ctx)
 void io_req_local_work_add(struct io_kiocb *req, unsigned flags)
 {
 	struct io_ring_ctx *ctx = req->ctx;
-	unsigned nr_wait, nr_tw, nr_tw_prev;
-	struct llist_node *head;
-
-	/* See comment above IO_CQ_WAKE_INIT */
-	BUILD_BUG_ON(IO_CQ_WAKE_FORCE <= IORING_MAX_CQ_ENTRIES);
+	struct llist_node *prev;
+	unsigned nr_wait;
 
 	/*
 	 * We don't know how many requests there are in the link and whether
@@ -185,55 +183,47 @@ void io_req_local_work_add(struct io_kiocb *req, unsigned flags)
 
 	guard(rcu)();
 
-	head = READ_ONCE(ctx->work_llist.first);
-	do {
-		nr_tw_prev = 0;
-		if (head) {
-			struct io_kiocb *first_req = container_of(head,
-							struct io_kiocb,
-							io_task_work.node);
-			/*
-			 * Might be executed at any moment, rely on
-			 * SLAB_TYPESAFE_BY_RCU to keep it alive.
-			 */
-			nr_tw_prev = READ_ONCE(first_req->nr_tw);
-		}
-
-		/*
-		 * Theoretically, it can overflow, but that's fine as one of
-		 * previous adds should've tried to wake the task.
-		 */
-		nr_tw = nr_tw_prev + 1;
-		if (!(flags & IOU_F_TWQ_LAZY_WAKE))
-			nr_tw = IO_CQ_WAKE_FORCE;
-
-		req->nr_tw = nr_tw;
-		req->io_task_work.node.next = head;
-	} while (!try_cmpxchg(&ctx->work_llist.first, &head,
-			      &req->io_task_work.node));
-
 	/*
-	 * cmpxchg implies a full barrier, which pairs with the barrier
-	 * in set_current_state() on the io_cqring_wait() side. It's used
-	 * to ensure that either we see updated ->cq_wait_nr, or waiters
-	 * going to sleep will observe the work added to the list, which
-	 * is similar to the wait/wawke task state sync.
+	 * The xchg() in mpscq_push() implies a full barrier, which pairs with
+	 * the barrier in set_current_state() on the io_cqring_wait() side. This
+	 * ensures that either we see the updated ->cq_wait_nr, or waiters going
+	 * to sleep will observe the work added to the list, which is similar to
+	 * the wait/wake task state sync.
 	 */
+	prev = mpscq_push(&ctx->work_list, &req->io_task_work.node);
 
-	if (!head) {
+	if (prev == &ctx->work_list.stub) {
 		io_ctx_mark_taskrun(ctx);
 		if (data_race(ctx->int_flags) & IO_RING_F_HAS_EVFD)
 			io_eventfd_signal(ctx, false);
 	}
 
-	nr_wait = atomic_read(&ctx->cq_wait_nr);
-	/* not enough or no one is waiting */
-	if (nr_tw < nr_wait)
+	/* acquire pairs with the release in io_cq_wait_arm() */
+	nr_wait = atomic_read_acquire(&ctx->cq_wait_nr);
+	/* no one is waiting */
+	if (nr_wait == IO_CQ_WAKE_INIT)
 		return;
-	/* the previous add has already woken it up */
-	if (nr_tw_prev >= nr_wait)
+	/*
+	 * For a lazy wake, defer waking the task until enough work is pending
+	 * to satisfy the number of events it's waiting for. As a waiter only
+	 * sleeps on an empty queue, the lazy adds counted since it armed
+	 * ->cq_wait_nr are the full pending count, see io_cq_wait_arm(). If we
+	 * instead saw a stale, unarmed (or previous cycle) ->cq_wait_nr, then
+	 * per the barrier pairing above, the waiter's check after arming will
+	 * see our work and abort the sleep - no wakeup is needed from here in
+	 * that case.
+	 */
+	if ((flags & IOU_F_TWQ_LAZY_WAKE) &&
+	    atomic_inc_return(&ctx->cq_wait_added) < nr_wait)
 		return;
-	wake_up_state(ctx->submitter_task, TASK_INTERRUPTIBLE);
+	/*
+	 * Only one wake up is needed per arming of the wait. Claim it by
+	 * resetting ->cq_wait_nr - the waiter re-arms it for every wait cycle
+	 * and checks for pending work after arming, so a wakeup cannot get
+	 * lost.
+	 */
+	if (atomic_try_cmpxchg(&ctx->cq_wait_nr, &nr_wait, IO_CQ_WAKE_INIT))
+		wake_up_state(ctx->submitter_task, TASK_INTERRUPTIBLE);
 }
 
 void io_req_normal_work_add(struct io_kiocb *req)
@@ -273,21 +263,27 @@ void io_req_task_work_add_remote(struct io_kiocb *req, unsigned flags)
 
 void __cold io_move_task_work_from_local(struct io_ring_ctx *ctx)
 {
-	struct llist_node *node;
+	struct llist_node *node, *first = NULL, **tail = &first;
 
 	/*
-	 * Running the work items may utilize ->retry_llist as a means
-	 * for capping the number of task_work entries run at the same
-	 * time. But that list can potentially race with moving the work
-	 * from here, if the task is exiting. As any normal task_work
-	 * running holds ->uring_lock already, just guard this slow path
-	 * with ->uring_lock to avoid racing on ->retry_llist.
+	 * The work list consumer side is serialized by ->uring_lock, see
+	 * __io_run_local_work(). Grab it to guard against racing with normal
+	 * task_work running, as the task may be exiting.
 	 */
 	guard(mutex)(&ctx->uring_lock);
-	node = llist_del_all(&ctx->work_llist);
-	__io_fallback_tw(node, false);
-	node = llist_del_all(&ctx->retry_llist);
-	__io_fallback_tw(node, false);
+
+	while (!mpscq_empty(&ctx->work_list)) {
+		node = mpscq_pop(&ctx->work_list, &ctx->work_head);
+		if (!node) {
+			/* a producer is mid-push, wait for it to link */
+			cpu_relax();
+			continue;
+		}
+		*tail = node;
+		tail = &node->next;
+	}
+	*tail = NULL;
+	__io_fallback_tw(first, false);
 }
 
 static bool io_run_local_work_continue(struct io_ring_ctx *ctx, int events,
@@ -302,22 +298,23 @@ static bool io_run_local_work_continue(struct io_ring_ctx *ctx, int events,
 	return false;
 }
 
-static int __io_run_local_work_loop(struct llist_node **node,
+static int __io_run_local_work_loop(struct io_ring_ctx *ctx,
 				    io_tw_token_t tw,
 				    int events)
 {
 	int ret = 0;
 
-	while (*node) {
-		struct llist_node *next = (*node)->next;
-		struct io_kiocb *req = container_of(*node, struct io_kiocb,
-						    io_task_work.node);
+	while (ret < events) {
+		struct llist_node *node = mpscq_pop(&ctx->work_list, &ctx->work_head);
+		struct io_kiocb *req;
+
+		if (!node)
+			break;
+		req = container_of(node, struct io_kiocb, io_task_work.node);
 		INDIRECT_CALL_2(req->io_task_work.func,
 				io_poll_task_func, io_req_rw_complete,
 				(struct io_tw_req){req}, tw);
-		*node = next;
-		if (++ret >= events)
-			break;
+		ret++;
 	}
 
 	return ret;
@@ -326,7 +323,6 @@ static int __io_run_local_work_loop(struct llist_node **node,
 static int __io_run_local_work(struct io_ring_ctx *ctx, io_tw_token_t tw,
 			       int min_events, int max_events)
 {
-	struct llist_node *node;
 	unsigned int loops = 0;
 	int ret = 0;
 
@@ -335,24 +331,21 @@ static int __io_run_local_work(struct io_ring_ctx *ctx, io_tw_token_t tw,
 	if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
 		atomic_andnot(IORING_SQ_TASKRUN, &ctx->rings->sq_flags);
 again:
-	tw.cancel = io_should_terminate_tw(ctx);
-	min_events -= ret;
-	ret = __io_run_local_work_loop(&ctx->retry_llist.first, tw, max_events);
-	if (ctx->retry_llist.first)
-		goto retry_done;
-
 	/*
-	 * llists are in reverse order, flip it back the right way before
-	 * running the pending items.
+	 * If the last loop made no progress while work is still pending,
+	 * a producer has published a node but hasn't linked it into the
+	 * queue yet (see mpscq_pop()). Give it a chance to finish rather
+	 * than spinning on the queue.
 	 */
-	node = llist_reverse_order(llist_del_all(&ctx->work_llist));
-	ret += __io_run_local_work_loop(&node, tw, max_events - ret);
-	ctx->retry_llist.first = node;
+	if (unlikely(loops && !ret))
+		cond_resched();
+	tw.cancel = io_should_terminate_tw(ctx);
+	min_events -= ret;
+	ret = __io_run_local_work_loop(ctx, tw, max_events);
 	loops++;
 
 	if (io_run_local_work_continue(ctx, ret, min_events))
 		goto again;
-retry_done:
 	io_submit_flush_completions(ctx);
 	if (io_run_local_work_continue(ctx, ret, min_events))
 		goto again;
diff --git a/io_uring/tw.h b/io_uring/tw.h
index 415e330fabde..f42db5fdbded 100644
--- a/io_uring/tw.h
+++ b/io_uring/tw.h
@@ -6,6 +6,8 @@
 #include <linux/percpu-refcount.h>
 #include <linux/io_uring_types.h>
 
+#include "mpscq.h"
+
 #define IO_LOCAL_TW_DEFAULT_MAX		20
 
 /*
@@ -89,7 +91,7 @@ static inline int io_run_task_work(void)
 
 static inline bool io_local_work_pending(struct io_ring_ctx *ctx)
 {
-	return !llist_empty(&ctx->work_llist) || !llist_empty(&ctx->retry_llist);
+	return !mpscq_empty(&ctx->work_list);
 }
 
 static inline bool io_task_work_pending(struct io_ring_ctx *ctx)
diff --git a/io_uring/wait.c b/io_uring/wait.c
index ec01e78a216d..05ac779635e8 100644
--- a/io_uring/wait.c
+++ b/io_uring/wait.c
@@ -96,9 +96,13 @@ static enum hrtimer_restart io_cqring_min_timer_wakeup(struct hrtimer *timer)
 	 * the task and return.
 	 */
 	if (ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
+		/*
+		 * No need to zero ->cq_wait_added when arming with 1, any
+		 * counted add will satisfy it.
+		 */
 		atomic_set(&ctx->cq_wait_nr, 1);
 		smp_mb();
-		if (!llist_empty(&ctx->work_llist))
+		if (io_local_work_pending(ctx))
 			goto out_wake;
 	}
 
@@ -257,7 +261,7 @@ int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, u32 flags,
 		unsigned long check_cq;
 
 		if (ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
-			atomic_set(&ctx->cq_wait_nr, nr_wait);
+			io_cq_wait_arm(ctx, nr_wait);
 			set_current_state(TASK_INTERRUPTIBLE);
 		} else {
 			prepare_to_wait_exclusive(&ctx->cq_wait, &iowq.wq,
diff --git a/io_uring/wait.h b/io_uring/wait.h
index a4274b137f81..2ecea3e2a63f 100644
--- a/io_uring/wait.h
+++ b/io_uring/wait.h
@@ -5,12 +5,24 @@
 #include <linux/io_uring_types.h>
 
 /*
- * No waiters. It's larger than any valid value of the tw counter
- * so that tests against ->cq_wait_nr would fail and skip wake_up().
+ * No waiters. ->cq_wait_nr holds this when no task is waiting, and is
+ * reset back to it by the task work add side when it claims a wake up,
+ * so that only one wake up is issued per arming of the wait.
  */
 #define IO_CQ_WAKE_INIT		(-1U)
-/* Forced wake up if there is a waiter regardless of ->cq_wait_nr */
-#define IO_CQ_WAKE_FORCE	(IO_CQ_WAKE_INIT >> 1)
+
+/*
+ * A waiter only sleeps on an empty work list (it checks for pending work after
+ * arming), hence the number of lazy adds since arming is the full pending
+ * count. The release pairs with the acquire in io_req_local_work_add(), hence
+ * a producer observing the armed ->cq_wait_nr also observes the zeroed
+ * ->cq_wait_added.
+ */
+static inline void io_cq_wait_arm(struct io_ring_ctx *ctx, int nr_wait)
+{
+	atomic_set(&ctx->cq_wait_added, 0);
+	atomic_set_release(&ctx->cq_wait_nr, nr_wait);
+}
 
 struct ext_arg {
 	size_t argsz;
-- 
2.53.0



* Re: [PATCH 1/2] io_uring/mpscq: add lockless multi-producer, single-consumer FIFO queue
  2026-06-11 15:58 ` [PATCH 1/2] io_uring/mpscq: add lockless multi-producer, single-consumer FIFO queue Jens Axboe
@ 2026-06-11 16:49   ` Gabriel Krisman Bertazi
  2026-06-11 16:58     ` Jens Axboe
  2026-06-12  1:13   ` Caleb Sander Mateos
  1 sibling, 1 reply; 16+ messages in thread
From: Gabriel Krisman Bertazi @ 2026-06-11 16:49 UTC
  To: Jens Axboe; +Cc: io-uring, dvyukov

Jens Axboe <axboe@kernel.dk> writes:

> Local task_work is currently using llists for managing the work,
> but that's a LIFO type of list. This means that running this task_work
> needs to reverse the list first, to ensure fairness in running the
> queued items.
>
> Add a lockless FIFO queue, based on Dmitry Vyukov's intrusive MPSC
> node-based queue algorithm, modified with an externally held consumer
> cursor and conditional stub reinsertion. See comments in the header.
>
> Producers are wait-free: a push is a single xchg() on the queue tail,
> which serializes concurrent producers and defines the FIFO order, plus
> a store linking the node to its predecessor. There are no cmpxchg retry
> loops, and pushing is safe from any context, including hardirq.
>
> The cost of linked list FIFO ordering is that a push publishes the node
> in two steps - the xchg() makes it visible as the new tail before the
> subsequent store links it into the chain that is reachable from the
> head. A consumer hitting that window gets a NULL from mpscq_pop() while
> mpscq_empty() reports false, and must retry later rather than treat the
> queue as empty. The window is two instructions wide, but a producer can
> get preempted inside it, so the consumer must not busy wait on it.
>
> The consumer side supports a single consumer at a time, with callers
> providing their own serialization. A stub node, which also defines the
> empty state (tail == stub), allows the consumer to detach the final
> node without racing against producer link stores: that node is only
> handed out once the stub has been cmpxchg'ed back in as the tail. This
> also guarantees that the previous tail returned by mpscq_push() cannot
> get freed before that push has linked it, making it always valid for
> comparisons.
>
> The consumer cursor is deliberately not part of the queue struct - the
> caller owns it and passes it to mpscq_pop(). This is done to separate
> the consumer and producer cachelines. The cursor is written for

Interesting stuff!  The commit message is truncated here, though.

>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  include/linux/io_uring_types.h |  12 ++++
>  io_uring/mpscq.h               | 121 +++++++++++++++++++++++++++++++++

There's nothing io_uring specific here.  Perhaps put it in lib/ directly
so a wider audience can review and use it?

>  2 files changed, 133 insertions(+)
>  create mode 100644 io_uring/mpscq.h
>
> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> index aa4d5477f859..85e12b4884a5 100644
> --- a/include/linux/io_uring_types.h
> +++ b/include/linux/io_uring_types.h
> @@ -55,6 +55,18 @@ struct io_wq_work_list {
>  	struct io_wq_work_node *last;
>  };
>  
> +/*
> + * Lockless multi-producer, single-consumer FIFO queue, see
> + * io_uring/mpscq.h for the implementation and rules. Defined here so
> + * that it can be embedded in io_ring_ctx. This is the producer side
> + * only - the consumer cursor is kept separately, on a cacheline that
> + * isn't dirtied by the producers.
> + */
> +struct mpscq {
> +	struct llist_node	*tail;		/* producers */
> +	struct llist_node	stub;
> +};
> +
>  struct io_wq_work {
>  	struct io_wq_work_node list;
>  	atomic_t flags;
> diff --git a/io_uring/mpscq.h b/io_uring/mpscq.h
> new file mode 100644
> index 000000000000..12172cef8394
> --- /dev/null
> +++ b/io_uring/mpscq.h
> @@ -0,0 +1,121 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef IOU_MPSCQ_H
> +#define IOU_MPSCQ_H
> +
> +/*
> + * mpscq - lockless multi-producer, single-consumer FIFO queue
> + *
> + * Unlike llist, which is LIFO ordered and hence needs an O(n)
> + * llist_reverse_order() pass before entries can be processed in queue order,
> + * this queue hands out nodes in the order they were pushed.
> + *
> + * The consumer cursor is held by the caller rather than in the queue struct
> + * (see below), and with the stub reinsertion done as a single cmpxchg attempt
> + * instead of an unconditional push, keeping tail == stub a reliable empty test
> + * while a producer is in the middle of a push.
> + *
> + * Producers may run in any context (task, softirq, hardirq) and are wait-free:
> + * a push is one xchg() plus one store, with no retry loops. FIFO order between
> + * producers is the order in which the xchg() on ->tail serializes them.
> + *
> + * The price for linked-list FIFO is that a push publishes the node in two
> + * steps: the xchg() makes it the new tail, and the subsequent store links it to
> + * its predecessor. In between, the tail end of the queue is not yet reachable
> + * from the head. mpscq_pop() detects this and returns NULL, while mpscq_empty()
> + * reports false. The consumer must not treat such a NULL as "queue empty" - it
> + * should retry later. The window is two instructions wide, but a producer can
> + * be preempted inside it, so the consumer must not spin on it while holding
> + * resources the producer might need to make progress.
> + *
> + * The consumer side only supports a single consumer at a time, callers must
> + * provide their own serialization for it. The stub node is what allows the
> + * consumer to detach the final node without racing with the link stores of
> + * producers. This scheme also guarantees that the previous tail returned by
> + * mpscq_push() cannot be freed by the consumer until the push that returned it
> + * has linked it, hence it's always safe to compare against (but not
> + * dereference, unless the caller otherwise guarantees its lifetime).
> + *
> + * The queue struct only holds the producer side. The consumer keeps its cursor
> + * (the oldest not yet handed out node) externally and passes it to mpscq_pop(),
> + * so that it can be placed on a different cacheline: the cursor is written for
> + * every pop, and having it share a line with ->tail would have the consumer
> + * invalidating the line that producers need for every push.
> + */
> +static inline void mpscq_init(struct mpscq *q, struct llist_node **headp)
> +{
> +	q->tail = *headp = &q->stub;
> +	q->stub.next = NULL;
> +}
> +
> +/*
> + * Returns true if the queue holds no entries that mpscq_pop() hasn't handed out
> + * yet. May be called from any context. Note that !empty doesn't guarantee that
> + * mpscq_pop() will return an entry yet, see the in-flight producer window
> + * above.
> + */
> +static inline bool mpscq_empty(struct mpscq *q)
> +{
> +	return READ_ONCE(q->tail) == &q->stub;
> +}
> +
> +/*
> + * Push a node onto the queue. Safe against concurrent pushes from any context,
> + * and against the (single) consumer. Returns the previous tail node, which is
> + * &q->stub if and only if the queue was empty before this push.
> + */
> +static inline struct llist_node *mpscq_push(struct mpscq *q,
> +					    struct llist_node *node)
> +{
> +	struct llist_node *prev;
> +
> +	node->next = NULL;
> +	/*
> +	 * xchg() implies a full barrier, so the initialization of the
> +	 * entry (including ->next above) is visible before the node can
> +	 * be reached, either via ->tail or via ->next chasing from the
> +	 * head once the store below has linked it.
> +	 */
> +	prev = xchg(&q->tail, node);
> +	WRITE_ONCE(prev->next, node);
> +	return prev;
> +}
> +
> +/*
> + * Pop the oldest node off the queue, or return NULL if no node is available.
> + * NULL is returned both when the queue is empty and when a producer has
> + * published a node via ->tail but hasn't linked it yet; use mpscq_empty() to
> + * tell the two apart. Single consumer only, with headp being the consumer
> + * cursor that mpscq_init() set up.
> + */
> +static inline struct llist_node *mpscq_pop(struct mpscq *q,
> +					   struct llist_node **headp)
> +{
> +	struct llist_node *head = *headp;
> +	struct llist_node *next = READ_ONCE(head->next);
> +
> +	if (head == &q->stub) {
> +		if (!next)
> +			return NULL;
> +		*headp = next;
> +		head = next;
> +		next = READ_ONCE(head->next);
> +	}
> +	if (next) {
> +		*headp = next;
> +		return head;
> +	}
> +	/*
> +	 * 'head' is the last linked node, it can only be handed out once the
> +	 * stub has taken its place as the tail. If the cmpxchg fails, a
> +	 * producer has made a new node the tail but hasn't linked it to 'head'
> +	 * yet - bail and let the caller retry.
> +	 */
> +	q->stub.next = NULL;
> +	if (try_cmpxchg(&q->tail, &head, &q->stub)) {
> +		*headp = &q->stub;
> +		return head;
> +	}
> +	return NULL;
> +}
> +
> +#endif /* IOU_MPSCQ_H */

-- 
Gabriel Krisman Bertazi


* Re: [PATCH 1/2] io_uring/mpscq: add lockless multi-producer, single-consumer FIFO queue
  2026-06-11 16:49   ` Gabriel Krisman Bertazi
@ 2026-06-11 16:58     ` Jens Axboe
  0 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2026-06-11 16:58 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi; +Cc: io-uring, dvyukov

On 6/11/26 10:49 AM, Gabriel Krisman Bertazi wrote:
> Jens Axboe <axboe@kernel.dk> writes:
> 
>> Local task_work is currently using llists for managing the work,
>> but that's a LIFO type of list. This means that running this task_work
>> needs to reverse the list first, to ensure fairness in running the
>> queued items.
>>
>> Add a lockless FIFO queue, based on Dmitry Vyukov's intrusive MPSC
>> node-based queue algorithm, modified with an externally held consumer
>> cursor and conditional stub reinsertion. See comments in the header.
>>
>> Producers are wait-free: a push is a single xchg() on the queue tail,
>> which serializes concurrent producers and defines the FIFO order, plus
>> a store linking the node to its predecessor. There are no cmpxchg retry
>> loops, and pushing is safe from any context, including hardirq.
>>
>> The cost of linked list FIFO ordering is that a push publishes the node
>> in two steps - the xchg() makes it visible as the new tail before the
>> subsequent store links it into the chain that is reachable from the
>> head. A consumer hitting that window gets a NULL from mpscq_pop() while
>> mpscq_empty() reports false, and must retry later rather than treat the
>> queue as empty. The window is two instructions wide, but a producer can
>> get preempted inside it, so the consumer must not busy wait on it.
>>
>> The consumer side supports a single consumer at a time, with callers
>> providing their own serialization. A stub node, which also defines the
>> empty state (tail == stub), allows the consumer to detach the final
>> node without racing against producer link stores: that node is only
>> handed out once the stub has been cmpxchg'ed back in as the tail. This
>> also guarantees that the previous tail returned by mpscq_push() cannot
>> get freed before that push has linked it, making it always valid for
>> comparisons.
>>
>> The consumer cursor is deliberately not part of the queue struct - the
>> caller owns it and passes it to mpscq_pop(). This is done to separate
>> the consumer and producers cacheline.The cursor is written for
> 
> Interesting stuff!  The commit message is truncated here, though.

Huh yes indeed, wonder how that happened. Was doing some shuffling and
editing, probably messed it up.

>> Signed-off-by: Jens Axboe <axboe@kernel.dk>
>> ---
>>  include/linux/io_uring_types.h |  12 ++++
>>  io_uring/mpscq.h               | 121 +++++++++++++++++++++++++++++++++
> 
> There's nothing io_uring specific here.  Perhaps put it in lib/ directly
> so a wider audience can review and use it?

I think keeping it local is fine, if someone else wants to use it, then
it should just get migrated to include/linux/ instead as it's all in
that header. Code is small enough that it doesn't warrant .c and
non-inlines for it.

For now, as there's one consumer of this, better to keep it local.

-- 
Jens Axboe


* Re: [PATCH 1/2] io_uring/mpscq: add lockless multi-producer, single-consumer FIFO queue
  2026-06-11 15:58 ` [PATCH 1/2] io_uring/mpscq: add lockless multi-producer, single-consumer FIFO queue Jens Axboe
  2026-06-11 16:49   ` Gabriel Krisman Bertazi
@ 2026-06-12  1:13   ` Caleb Sander Mateos
  2026-06-12  2:21     ` Jens Axboe
  1 sibling, 1 reply; 16+ messages in thread
From: Caleb Sander Mateos @ 2026-06-12  1:13 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, dvyukov

On Thu, Jun 11, 2026 at 9:12 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> Local task_work is currently using llists for managing the work,
> but that's a LIFO type of list. This means that running this task_work
> needs to reverse the list first, to ensure fairness in running the
> queued items.
>
> Add a lockless FIFO queue, based on Dmitry Vyukov's intrusive MPSC
> node-based queue algorithm, modified with an externally held consumer
> cursor and conditional stub reinsertion. See comments in the header.
>
> Producers are wait-free: a push is a single xchg() on the queue tail,
> which serializes concurrent producers and defines the FIFO order, plus
> a store linking the node to its predecessor. There are no cmpxchg retry
> loops, and pushing is safe from any context, including hardirq.
>
> The cost of linked list FIFO ordering is that a push publishes the node
> in two steps - the xchg() makes it visible as the new tail before the
> subsequent store links it into the chain that is reachable from the
> head. A consumer hitting that window gets a NULL from mpscq_pop() while
> mpscq_empty() reports false, and must retry later rather than treat the
> queue as empty. The window is two instructions wide, but a producer can
> get preempted inside it, so the consumer must not busy wait on it.
>
> The consumer side supports a single consumer at a time, with callers
> providing their own serialization. A stub node, which also defines the
> empty state (tail == stub), allows the consumer to detach the final
> node without racing against producer link stores: that node is only
> handed out once the stub has been cmpxchg'ed back in as the tail. This
> also guarantees that the previous tail returned by mpscq_push() cannot
> get freed before that push has linked it, making it always valid for
> comparisons.
>
> The consumer cursor is deliberately not part of the queue struct - the
> caller owns it and passes it to mpscq_pop(). This is done to separate
> the consumer and producers cacheline.The cursor is written for
>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  include/linux/io_uring_types.h |  12 ++++
>  io_uring/mpscq.h               | 121 +++++++++++++++++++++++++++++++++
>  2 files changed, 133 insertions(+)
>  create mode 100644 io_uring/mpscq.h
>
> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> index aa4d5477f859..85e12b4884a5 100644
> --- a/include/linux/io_uring_types.h
> +++ b/include/linux/io_uring_types.h
> @@ -55,6 +55,18 @@ struct io_wq_work_list {
>         struct io_wq_work_node *last;
>  };
>
> +/*
> + * Lockless multi-producer, single-consumer FIFO queue, see
> + * io_uring/mpscq.h for the implementation and rules. Defined here so
> + * that it can be embedded in io_ring_ctx. This is the producer side
> + * only - the consumer cursor is kept separately, on a cacheline that
> + * isn't dirtied by the producers.
> + */
> +struct mpscq {
> +       struct llist_node       *tail;          /* producers */
> +       struct llist_node       stub;
> +};
> +
>  struct io_wq_work {
>         struct io_wq_work_node list;
>         atomic_t flags;
> diff --git a/io_uring/mpscq.h b/io_uring/mpscq.h
> new file mode 100644
> index 000000000000..12172cef8394
> --- /dev/null
> +++ b/io_uring/mpscq.h
> @@ -0,0 +1,121 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef IOU_MPSCQ_H
> +#define IOU_MPSCQ_H
> +
> +/*
> + * mpscq - lockless multi-producer, single-consumer FIFO queue
> + *
> + * Unlike llist, which is LIFO ordered and hence needs an O(n)
> + * llist_reverse_order() pass before entries can be processed in queue order,
> + * this queue hands out nodes in the order they were pushed.
> + *
> + * Based on Dmitry Vyukov's intrusive MPSC queue, with two changes: the consumer
> + * cursor is held by the caller, not in the queue struct (see below), and the
> + * stub is reinserted by a single cmpxchg attempt, not an unconditional push,
> + * which keeps tail == stub a reliable empty test while a producer is mid-push.
> + *
> + * Producers may run in any context (task, softirq, hardirq) and are wait-free:
> + * a push is one xchg() plus one store, with no retry loops. FIFO order between
> + * producers is the order in which the xchg() on ->tail serializes them.
> + *
> + * The price for linked-list FIFO is that a push publishes the node in two
> + * steps: the xchg() makes it the new tail, and the subsequent store links it to
> + * its predecessor. In between, the tail end of the queue is not yet reachable
> + * from the head. mpscq_pop() detects this and returns NULL, while mpscq_empty()
> + * reports false. The consumer must not treat such a NULL as "queue empty" - it
> + * should retry later. The window is two instructions wide, but a producer can
> + * be preempted inside it, so the consumer must not spin on it while holding
> + * resources the producer might need to make progress.
> + *
> + * The consumer side only supports a single consumer at a time, callers must
> + * provide their own serialization for it. The stub node is what allows the
> + * consumer to detach the final node without racing with the link stores of
> + * producers. This scheme also guarantees that the previous tail returned by
> + * mpscq_push() cannot be freed by the consumer until the push that returned it
> + * has linked it, hence it's always safe to compare against (but not
> + * dereference, unless the caller otherwise guarantees its lifetime).
> + *
> + * The queue struct only holds the producer side. The consumer keeps its cursor
> + * (the oldest not yet handed out node) externally and passes it to mpscq_pop(),
> + * so that it can be placed on a different cacheline: the cursor is written for
> + * every pop, and having it share a line with ->tail would have the consumer
> + * invalidating the line that producers need for every push.
> + */
> +static inline void mpscq_init(struct mpscq *q, struct llist_node **headp)
> +{
> +       q->tail = *headp = &q->stub;
> +       q->stub.next = NULL;
> +}
> +
> +/*
> + * Returns true if the queue holds no entries that mpscq_pop() hasn't handed out
> + * yet. May be called from any context. Note that !empty doesn't guarantee that
> + * mpscq_pop() will return an entry yet, see the in-flight producer window
> + * above.
> + */
> +static inline bool mpscq_empty(struct mpscq *q)
> +{
> +       return READ_ONCE(q->tail) == &q->stub;
> +}
> +
> +/*
> + * Push a node onto the queue. Safe against concurrent pushes from any context,
> + * and against the (single) consumer. Returns the previous tail node, which is
> + * &q->stub if and only if the queue was empty before this push.
> + */
> +static inline struct llist_node *mpscq_push(struct mpscq *q,
> +                                           struct llist_node *node)

It seems odd to return the previous tail node. The pointer can't be
dereferenced, as the node could be popped and freed at any point. The
return value is only compared against &stub to determine whether the
queue was empty. Seems like the interface would be simpler and avoid
leaking implementation details by just returning whether the queue was
empty before the push.
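
As a rough illustration of that interface, here is a user-space C11
sketch. It is not the patch's code: struct node stands in for
llist_node, the C11 atomics stand in for xchg()/WRITE_ONCE(), and
mpscq_push_was_empty() is a hypothetical name.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* User-space stand-in for llist_node (assumed for this sketch). */
struct node {
	_Atomic(struct node *) next;
};

struct mpscq {
	_Atomic(struct node *) tail;	/* producers */
	struct node stub;
};

static void mpscq_init(struct mpscq *q, struct node **headp)
{
	atomic_init(&q->stub.next, NULL);
	atomic_init(&q->tail, &q->stub);
	*headp = &q->stub;
}

/*
 * Hypothetical variant of mpscq_push(): reports whether the queue was
 * empty before this push instead of handing back the previous tail.
 */
static bool mpscq_push_was_empty(struct mpscq *q, struct node *node)
{
	struct node *prev;

	/* the stub belongs to the queue, callers never push it */
	assert(node != &q->stub);
	atomic_store_explicit(&node->next, NULL, memory_order_relaxed);
	/* seq_cst exchange stands in for the fully ordered kernel xchg() */
	prev = atomic_exchange(&q->tail, node);
	atomic_store_explicit(&prev->next, node, memory_order_release);
	return prev == &q->stub;
}
```

With something like this, the caller in patch 2 would test the boolean
rather than compare against &ctx->work_list.stub itself.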

> +{
> +       struct llist_node *prev;
> +
> +       node->next = NULL;
> +       /*
> +        * xchg() implies a full barrier, so the initialization of the
> +        * entry (including ->next above) is visible before the node can
> +        * be reached, either via ->tail or via ->next chasing from the
> +        * head once the store below has linked it.
> +        */
> +       prev = xchg(&q->tail, node);
> +       WRITE_ONCE(prev->next, node);

I think this needs to be a release-order store and the READ_ONCE()s in
mpscq_pop() need to be acquire-order loads. Since mpscq_pop() doesn't
necessarily load q->tail, there's no happens-before relationship
between pushing a node and popping it.
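
To make the pairing concrete, a minimal user-space C11 sketch of what
I mean follows. The payload field and mpscq_next() are made up for
illustration, and the kernel spellings would presumably be
smp_store_release() and smp_load_acquire(). This only shows the shape
of the pairing, it doesn't settle whether the kernel code needs it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* User-space stand-in for llist_node, with a hypothetical payload. */
struct node {
	_Atomic(struct node *) next;
	int payload;	/* initialized by the producer before the push */
};

struct mpscq {
	_Atomic(struct node *) tail;	/* producers */
	struct node stub;
};

static void mpscq_init(struct mpscq *q, struct node **headp)
{
	atomic_init(&q->stub.next, NULL);
	atomic_init(&q->tail, &q->stub);
	*headp = &q->stub;
}

static struct node *mpscq_push(struct mpscq *q, struct node *node)
{
	struct node *prev;

	assert(node != &q->stub);
	atomic_store_explicit(&node->next, NULL, memory_order_relaxed);
	prev = atomic_exchange(&q->tail, node);
	/* release: writes to *node above are ordered before the link */
	atomic_store_explicit(&prev->next, node, memory_order_release);
	return prev;
}

/*
 * Hypothetical consumer-side helper: the acquire load pairs with the
 * release store in mpscq_push(), so a non-NULL result can be
 * dereferenced and its payload is guaranteed to be visible.
 */
static struct node *mpscq_next(struct node *n)
{
	return atomic_load_explicit(&n->next, memory_order_acquire);
}
```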

> +       return prev;
> +}
> +
> +/*
> + * Pop the oldest node off the queue, or return NULL if no node is available.
> + * NULL is returned both when the queue is empty and when a producer has
> + * published a node via ->tail but hasn't linked it yet; use mpscq_empty() to
> + * tell the two apart. Single consumer only, with headp being the consumer
> + * cursor that mpscq_init() set up.
> + */
> +static inline struct llist_node *mpscq_pop(struct mpscq *q,
> +                                          struct llist_node **headp)
> +{
> +       struct llist_node *head = *headp;
> +       struct llist_node *next = READ_ONCE(head->next);
> +
> +       if (head == &q->stub) {
> +               if (!next)
> +                       return NULL;
> +               *headp = next;
> +               head = next;
> +               next = READ_ONCE(head->next);
> +       }

I would find it a bit clearer to avoid using "next" to refer to the
actual head in the stub case:

struct llist_node *head = *headp, *next;
if (head == &q->stub) {
        head = READ_ONCE(head->next);
        if (!head)
                return NULL;
        *headp = head;
}
next = READ_ONCE(head->next);

> +       if (next) {
> +               *headp = next;
> +               return head;
> +       }
> +       /*
> +        * 'head' is the last linked node, it can only be handed out once the
> +        * stub has taken its place as the tail. If the cmpxchg fails, a
> +        * producer has made a new node the tail but hasn't linked it to 'head'

nit: "but hasn't linked 'head' to it" since the pointer goes from head
to the new tail?

> +        * yet - bail and let the caller retry.
> +        */
> +       q->stub.next = NULL;
> +       if (try_cmpxchg(&q->tail, &head, &q->stub)) {
> +               *headp = &q->stub;
> +               return head;
> +       }
> +       return NULL;

An early return if the try_cmpxchg() fails would reduce indentation of
the successful path.
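
Putting the structural suggestions together, the pop side could look
roughly like the user-space C11 sketch below. struct node and the C11
atomics are stand-ins for llist_node, READ_ONCE() and try_cmpxchg(),
so treat it as an outline under those assumptions and not as a drop-in
replacement.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* User-space stand-in for llist_node (assumed for this sketch). */
struct node {
	_Atomic(struct node *) next;
};

struct mpscq {
	_Atomic(struct node *) tail;	/* producers */
	struct node stub;
};

static void mpscq_init(struct mpscq *q, struct node **headp)
{
	atomic_init(&q->stub.next, NULL);
	atomic_init(&q->tail, &q->stub);
	*headp = &q->stub;
}

static bool mpscq_empty(struct mpscq *q)
{
	return atomic_load_explicit(&q->tail, memory_order_relaxed) == &q->stub;
}

static struct node *mpscq_push(struct mpscq *q, struct node *node)
{
	struct node *prev;

	assert(node != &q->stub);
	atomic_store_explicit(&node->next, NULL, memory_order_relaxed);
	prev = atomic_exchange(&q->tail, node);
	atomic_store_explicit(&prev->next, node, memory_order_release);
	return prev;
}

static struct node *mpscq_pop(struct mpscq *q, struct node **headp)
{
	struct node *head = *headp, *next, *expected;

	if (head == &q->stub) {
		/* skip the stub: the real head, if any, is linked behind it */
		head = atomic_load_explicit(&head->next, memory_order_acquire);
		if (!head)
			return NULL;
		*headp = head;
	}
	next = atomic_load_explicit(&head->next, memory_order_acquire);
	if (next) {
		*headp = next;
		return head;
	}
	/* 'head' is the last linked node: try to put the stub back as tail */
	atomic_store_explicit(&q->stub.next, NULL, memory_order_relaxed);
	expected = head;
	if (!atomic_compare_exchange_strong(&q->tail, &expected, &q->stub))
		return NULL;	/* a producer is mid-push, caller retries */
	*headp = &q->stub;
	return head;
}
```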

Best,
Caleb

> +}
> +
> +#endif /* IOU_MPSCQ_H */
> --
> 2.53.0
>
>


* Re: [PATCH 2/2] io_uring: switch local task_work to a mpscq
  2026-06-11 15:58 ` [PATCH 2/2] io_uring: switch local task_work to a mpscq Jens Axboe
@ 2026-06-12  1:14   ` Caleb Sander Mateos
  2026-06-12  2:23     ` Jens Axboe
  0 siblings, 1 reply; 16+ messages in thread
From: Caleb Sander Mateos @ 2026-06-12  1:14 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, dvyukov

On Thu, Jun 11, 2026 at 9:12 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> The local (DEFER_TASKRUN) task_work list is an llist, which is LIFO
> ordered, and hence __io_run_local_work() has to restore the right
> running order with an O(n) llist_reverse_order() pass first. On top of
> that, a batch that gets capped by max_events needs the leftover entries
> parked on a separate ->retry_llist, as they can't be pushed back to the
> shared list.
>
> Switch it to the FIFO mpscq. Adds are wait-free instead of a cmpxchg
> retry loop, entries are popped in queue order with no reversal pass,
> capping a run simply leaves the remainder on the queue, and
> ->retry_llist goes away entirely. The consumer cursor, ->work_head,
> lives with the rest of the ->uring_lock protected state rather than
> next to the queue, so that popping entries doesn't dirty the producer
> side cacheline.
>
> For low amounts of task_work, this ends up being a bit more efficient
> than the existing scheme. As an example of that, doing multishot
> receives for 8 clients has the following task_work overhead:
>
>      1.02%  sock-test  [kernel.kallsyms]  [k] io_req_local_work_add
>      0.88%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work_loop
>      0.60%  sock-test  [kernel.kallsyms]  [k] llist_reverse_order
>      0.14%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work
>      2.64% at ~46Gb/sec
>
> and after this change:
>
>      1.08%  sock-test  [kernel.kallsyms]  [k] io_req_local_work_add
>      1.03%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work
>      2.11% at ~53Gb/sec
>
> which has less overhead even though that test run was faster. For a case
> of having 1024 clients on a single ring:
>
>      2.22%  sock-test  [kernel.kallsyms]  [k] llist_reverse_order
>      0.84%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work_loop
>      0.42%  sock-test  [kernel.kallsyms]  [k] io_req_local_work_add
>      0.02%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work
>      3.50% at ~24Gb/sec
>
> we start to see the llist reversing taking a considerable amount of
> time, and the total add+run task_work overhead is around 3.5%. After
> the change:
>
>      0.90%  sock-test  [kernel.kallsyms]  [k] __io_run_local_work
>      0.42%  sock-test  [kernel.kallsyms]  [k] io_req_local_work_add
>      1.32% at ~26Gb/sec
>
> most of that overhead is gone, and performance is better as well.

This is great stuff! I had also observed these hotspots on a ublk
workload. Since incoming ublk requests post task work to the ublk
server's io_urings and completed ublk requests post task work to the
client's io_urings, there is significant cross-CPU contention on the
task work queues.

>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  include/linux/io_uring_types.h |  14 +++-
>  io_uring/io_uring.c            |   2 +-
>  io_uring/loop.c                |   2 +-
>  io_uring/tw.c                  | 145 ++++++++++++++++-----------------
>  io_uring/tw.h                  |   4 +-
>  io_uring/wait.c                |   8 +-
>  io_uring/wait.h                |  20 ++++-
>  7 files changed, 106 insertions(+), 89 deletions(-)
>
> diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
> index 85e12b4884a5..e918301da5fc 100644
> --- a/include/linux/io_uring_types.h
> +++ b/include/linux/io_uring_types.h
> @@ -351,6 +351,14 @@ struct io_ring_ctx {
>                  */
>                 atomic_t                cancel_seq;
>
> +               /*
> +                * Consumer cursor for ->work_list, protected by ->uring_lock.
> +                * Deliberately kept away from the producer side of the queue,
> +                * as it's written for every popped entry, and the producer
> +                * cacheline is contended enough as it is.
> +                */
> +               struct llist_node       *work_head;
> +
>                 /*
>                  * ->iopoll_list is protected by the ctx->uring_lock for
>                  * io_uring instances that don't use IORING_SETUP_SQPOLL.
> @@ -417,10 +425,10 @@ struct io_ring_ctx {
>          */
>         struct {
>                 struct io_rings __rcu   *rings_rcu;
> -               struct llist_head       work_llist;
> -               struct llist_head       retry_llist;
> +               struct mpscq            work_list;
>                 unsigned long           check_cq;
>                 atomic_t                cq_wait_nr;
> +               atomic_t                cq_wait_added;
>                 atomic_t                cq_timeouts;
>                 struct wait_queue_head  cq_wait;
>         } ____cacheline_aligned_in_smp;
> @@ -742,8 +750,6 @@ struct io_kiocb {
>          */
>         u16                             buf_index;
>
> -       unsigned                        nr_tw;
> -
>         /* REQ_F_* flags */
>         io_req_flags_t                  flags;
>
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index 753ac23401c5..16acd99ff083 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -280,7 +280,7 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
>         INIT_LIST_HEAD(&ctx->defer_list);
>         INIT_LIST_HEAD(&ctx->timeout_list);
>         INIT_LIST_HEAD(&ctx->ltimeout_list);
> -       init_llist_head(&ctx->work_llist);
> +       mpscq_init(&ctx->work_list, &ctx->work_head);
>         INIT_LIST_HEAD(&ctx->tctx_list);
>         mutex_init(&ctx->tctx_lock);
>         ctx->submit_state.free_list.next = NULL;
> diff --git a/io_uring/loop.c b/io_uring/loop.c
> index bbbb6ef14e6a..2ecc1cf49f84 100644
> --- a/io_uring/loop.c
> +++ b/io_uring/loop.c
> @@ -11,7 +11,7 @@ static inline int io_loop_nr_cqes(const struct io_ring_ctx *ctx,
>
>  static inline void io_loop_wait_start(struct io_ring_ctx *ctx, unsigned nr_wait)
>  {
> -       atomic_set(&ctx->cq_wait_nr, nr_wait);
> +       io_cq_wait_arm(ctx, nr_wait);
>         set_current_state(TASK_INTERRUPTIBLE);
>  }
>
> diff --git a/io_uring/tw.c b/io_uring/tw.c
> index 023d5e6bc491..4cf350cffb6c 100644
> --- a/io_uring/tw.c
> +++ b/io_uring/tw.c
> @@ -14,6 +14,7 @@
>  #include "rw.h"
>  #include "eventfd.h"
>  #include "wait.h"
> +#include "mpscq.h"
>
>  void io_fallback_req_func(struct work_struct *work)
>  {
> @@ -170,11 +171,8 @@ static void io_ctx_mark_taskrun(struct io_ring_ctx *ctx)
>  void io_req_local_work_add(struct io_kiocb *req, unsigned flags)
>  {
>         struct io_ring_ctx *ctx = req->ctx;
> -       unsigned nr_wait, nr_tw, nr_tw_prev;
> -       struct llist_node *head;
> -
> -       /* See comment above IO_CQ_WAKE_INIT */
> -       BUILD_BUG_ON(IO_CQ_WAKE_FORCE <= IORING_MAX_CQ_ENTRIES);
> +       struct llist_node *prev;
> +       unsigned nr_wait;
>
>         /*
>          * We don't know how many requests there are in the link and whether
> @@ -185,55 +183,47 @@ void io_req_local_work_add(struct io_kiocb *req, unsigned flags)
>
>         guard(rcu)();

Is the RCU guard still required now that a work list element can't be
accessed after the consumer has popped it?

Best,
Caleb


>
> -       head = READ_ONCE(ctx->work_llist.first);
> -       do {
> -               nr_tw_prev = 0;
> -               if (head) {
> -                       struct io_kiocb *first_req = container_of(head,
> -                                                       struct io_kiocb,
> -                                                       io_task_work.node);
> -                       /*
> -                        * Might be executed at any moment, rely on
> -                        * SLAB_TYPESAFE_BY_RCU to keep it alive.
> -                        */
> -                       nr_tw_prev = READ_ONCE(first_req->nr_tw);
> -               }
> -
> -               /*
> -                * Theoretically, it can overflow, but that's fine as one of
> -                * previous adds should've tried to wake the task.
> -                */
> -               nr_tw = nr_tw_prev + 1;
> -               if (!(flags & IOU_F_TWQ_LAZY_WAKE))
> -                       nr_tw = IO_CQ_WAKE_FORCE;
> -
> -               req->nr_tw = nr_tw;
> -               req->io_task_work.node.next = head;
> -       } while (!try_cmpxchg(&ctx->work_llist.first, &head,
> -                             &req->io_task_work.node));
> -
>         /*
> -        * cmpxchg implies a full barrier, which pairs with the barrier
> -        * in set_current_state() on the io_cqring_wait() side. It's used
> -        * to ensure that either we see updated ->cq_wait_nr, or waiters
> -        * going to sleep will observe the work added to the list, which
> -        * is similar to the wait/wawke task state sync.
> +        * The xchg() in mpscq_push() implies a full barrier, which pairs with
> +        * the barrier in set_current_state() on the io_cqring_wait() side. This
> +        * ensures that either we see the updated ->cq_wait_nr, or waiters going
> +        * to sleep will observe the work added to the list, which is similar to
> +        * the wait/wake task state sync.
>          */
> +       prev = mpscq_push(&ctx->work_list, &req->io_task_work.node);
>
> -       if (!head) {
> +       if (prev == &ctx->work_list.stub) {
>                 io_ctx_mark_taskrun(ctx);
>                 if (data_race(ctx->int_flags) & IO_RING_F_HAS_EVFD)
>                         io_eventfd_signal(ctx, false);
>         }
>
> -       nr_wait = atomic_read(&ctx->cq_wait_nr);
> -       /* not enough or no one is waiting */
> -       if (nr_tw < nr_wait)
> +       /* acquire pairs with the release in io_cq_wait_arm() */
> +       nr_wait = atomic_read_acquire(&ctx->cq_wait_nr);
> +       /* no one is waiting */
> +       if (nr_wait == IO_CQ_WAKE_INIT)
>                 return;
> -       /* the previous add has already woken it up */
> -       if (nr_tw_prev >= nr_wait)
> +       /*
> +        * For a lazy wake, defer waking the task until enough work is pending
> +        * to satisfy the number of events it's waiting for. As a waiter only
> +        * sleeps on an empty queue, the lazy adds counted since it armed
> +        * ->cq_wait_nr are the full pending count, see io_cq_wait_arm(). If we
> +        * instead saw a stale, unarmed (or previous cycle) ->cq_wait_nr, then
> +        * per the barrier pairing above, the waiter's check after arming will
> +        * see our work and abort the sleep - no wakeup is needed from here in
> +        * that case.
> +        */
> +       if ((flags & IOU_F_TWQ_LAZY_WAKE) &&
> +           atomic_inc_return(&ctx->cq_wait_added) < nr_wait)
>                 return;
> -       wake_up_state(ctx->submitter_task, TASK_INTERRUPTIBLE);
> +       /*
> +        * Only one wake up is needed per arming of the wait. Claim it by
> +        * resetting ->cq_wait_nr - the waiter re-arms it for every wait cycle
> +        * and checks for pending work after arming, so a wakeup cannot get
> +        * lost.
> +        */
> +       if (atomic_try_cmpxchg(&ctx->cq_wait_nr, &nr_wait, IO_CQ_WAKE_INIT))
> +               wake_up_state(ctx->submitter_task, TASK_INTERRUPTIBLE);
>  }
>
>  void io_req_normal_work_add(struct io_kiocb *req)
> @@ -273,21 +263,27 @@ void io_req_task_work_add_remote(struct io_kiocb *req, unsigned flags)
>
>  void __cold io_move_task_work_from_local(struct io_ring_ctx *ctx)
>  {
> -       struct llist_node *node;
> +       struct llist_node *node, *first = NULL, **tail = &first;
>
>         /*
> -        * Running the work items may utilize ->retry_llist as a means
> -        * for capping the number of task_work entries run at the same
> -        * time. But that list can potentially race with moving the work
> -        * from here, if the task is exiting. As any normal task_work
> -        * running holds ->uring_lock already, just guard this slow path
> -        * with ->uring_lock to avoid racing on ->retry_llist.
> +        * The work list consumer side is serialized by ->uring_lock, see
> +        * __io_run_local_work(). Grab it to guard against racing with normal
> +        * task_work running, as the task may be exiting.
>          */
>         guard(mutex)(&ctx->uring_lock);
> -       node = llist_del_all(&ctx->work_llist);
> -       __io_fallback_tw(node, false);
> -       node = llist_del_all(&ctx->retry_llist);
> -       __io_fallback_tw(node, false);
> +
> +       while (!mpscq_empty(&ctx->work_list)) {
> +               node = mpscq_pop(&ctx->work_list, &ctx->work_head);
> +               if (!node) {
> +                       /* a producer is mid-push, wait for it to link */
> +                       cpu_relax();
> +                       continue;
> +               }
> +               *tail = node;
> +               tail = &node->next;
> +       }
> +       *tail = NULL;
> +       __io_fallback_tw(first, false);
>  }
>
>  static bool io_run_local_work_continue(struct io_ring_ctx *ctx, int events,
> @@ -302,22 +298,23 @@ static bool io_run_local_work_continue(struct io_ring_ctx *ctx, int events,
>         return false;
>  }
>
> -static int __io_run_local_work_loop(struct llist_node **node,
> +static int __io_run_local_work_loop(struct io_ring_ctx *ctx,
>                                     io_tw_token_t tw,
>                                     int events)
>  {
>         int ret = 0;
>
> -       while (*node) {
> -               struct llist_node *next = (*node)->next;
> -               struct io_kiocb *req = container_of(*node, struct io_kiocb,
> -                                                   io_task_work.node);
> +       while (ret < events) {
> +               struct llist_node *node = mpscq_pop(&ctx->work_list, &ctx->work_head);
> +               struct io_kiocb *req;
> +
> +               if (!node)
> +                       break;
> +               req = container_of(node, struct io_kiocb, io_task_work.node);
>                 INDIRECT_CALL_2(req->io_task_work.func,
>                                 io_poll_task_func, io_req_rw_complete,
>                                 (struct io_tw_req){req}, tw);
> -               *node = next;
> -               if (++ret >= events)
> -                       break;
> +               ret++;
>         }
>
>         return ret;
> @@ -326,7 +323,6 @@ static int __io_run_local_work_loop(struct llist_node **node,
>  static int __io_run_local_work(struct io_ring_ctx *ctx, io_tw_token_t tw,
>                                int min_events, int max_events)
>  {
> -       struct llist_node *node;
>         unsigned int loops = 0;
>         int ret = 0;
>
> @@ -335,24 +331,21 @@ static int __io_run_local_work(struct io_ring_ctx *ctx, io_tw_token_t tw,
>         if (ctx->flags & IORING_SETUP_TASKRUN_FLAG)
>                 atomic_andnot(IORING_SQ_TASKRUN, &ctx->rings->sq_flags);
>  again:
> -       tw.cancel = io_should_terminate_tw(ctx);
> -       min_events -= ret;
> -       ret = __io_run_local_work_loop(&ctx->retry_llist.first, tw, max_events);
> -       if (ctx->retry_llist.first)
> -               goto retry_done;
> -
>         /*
> -        * llists are in reverse order, flip it back the right way before
> -        * running the pending items.
> +        * If the last loop made no progress while work is still pending,
> +        * a producer has published a node but hasn't linked it into the
> +        * queue yet (see mpscq_pop()). Give it a chance to finish rather
> +        * than spinning on the queue.
>          */
> -       node = llist_reverse_order(llist_del_all(&ctx->work_llist));
> -       ret += __io_run_local_work_loop(&node, tw, max_events - ret);
> -       ctx->retry_llist.first = node;
> +       if (unlikely(loops && !ret))
> +               cond_resched();
> +       tw.cancel = io_should_terminate_tw(ctx);
> +       min_events -= ret;
> +       ret = __io_run_local_work_loop(ctx, tw, max_events);
>         loops++;
>
>         if (io_run_local_work_continue(ctx, ret, min_events))
>                 goto again;
> -retry_done:
>         io_submit_flush_completions(ctx);
>         if (io_run_local_work_continue(ctx, ret, min_events))
>                 goto again;
> diff --git a/io_uring/tw.h b/io_uring/tw.h
> index 415e330fabde..f42db5fdbded 100644
> --- a/io_uring/tw.h
> +++ b/io_uring/tw.h
> @@ -6,6 +6,8 @@
>  #include <linux/percpu-refcount.h>
>  #include <linux/io_uring_types.h>
>
> +#include "mpscq.h"
> +
>  #define IO_LOCAL_TW_DEFAULT_MAX                20
>
>  /*
> @@ -89,7 +91,7 @@ static inline int io_run_task_work(void)
>
>  static inline bool io_local_work_pending(struct io_ring_ctx *ctx)
>  {
> -       return !llist_empty(&ctx->work_llist) || !llist_empty(&ctx->retry_llist);
> +       return !mpscq_empty(&ctx->work_list);
>  }
>
>  static inline bool io_task_work_pending(struct io_ring_ctx *ctx)
> diff --git a/io_uring/wait.c b/io_uring/wait.c
> index ec01e78a216d..05ac779635e8 100644
> --- a/io_uring/wait.c
> +++ b/io_uring/wait.c
> @@ -96,9 +96,13 @@ static enum hrtimer_restart io_cqring_min_timer_wakeup(struct hrtimer *timer)
>          * the task and return.
>          */
>         if (ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
> +               /*
> +                * No need to zero ->cq_wait_added when arming with 1, any
> +                * counted add will satisfy it.
> +                */
>                 atomic_set(&ctx->cq_wait_nr, 1);
>                 smp_mb();
> -               if (!llist_empty(&ctx->work_llist))
> +               if (io_local_work_pending(ctx))
>                         goto out_wake;
>         }
>
> @@ -257,7 +261,7 @@ int io_cqring_wait(struct io_ring_ctx *ctx, int min_events, u32 flags,
>                 unsigned long check_cq;
>
>                 if (ctx->flags & IORING_SETUP_DEFER_TASKRUN) {
> -                       atomic_set(&ctx->cq_wait_nr, nr_wait);
> +                       io_cq_wait_arm(ctx, nr_wait);
>                         set_current_state(TASK_INTERRUPTIBLE);
>                 } else {
>                         prepare_to_wait_exclusive(&ctx->cq_wait, &iowq.wq,
> diff --git a/io_uring/wait.h b/io_uring/wait.h
> index a4274b137f81..2ecea3e2a63f 100644
> --- a/io_uring/wait.h
> +++ b/io_uring/wait.h
> @@ -5,12 +5,24 @@
>  #include <linux/io_uring_types.h>
>
>  /*
> - * No waiters. It's larger than any valid value of the tw counter
> - * so that tests against ->cq_wait_nr would fail and skip wake_up().
> + * No waiters. ->cq_wait_nr holds this when no task is waiting, and is
> + * reset back to it by the task work add side when it claims a wake up,
> + * so that only one wake up is issued per arming of the wait.
>   */
>  #define IO_CQ_WAKE_INIT                (-1U)
> -/* Forced wake up if there is a waiter regardless of ->cq_wait_nr */
> -#define IO_CQ_WAKE_FORCE       (IO_CQ_WAKE_INIT >> 1)
> +
> +/*
> + * A waiter only sleeps on an empty work list (it checks for pending work after
> + * arming), hence the number of lazy adds since arming is the full pending
> + * count. The release pairs with the acquire in io_req_local_work_add(), hence
> + * a producer observing the armed ->cq_wait_nr also observes the zeroed
> + * ->cq_wait_added.
> + */
> +static inline void io_cq_wait_arm(struct io_ring_ctx *ctx, int nr_wait)
> +{
> +       atomic_set(&ctx->cq_wait_added, 0);
> +       atomic_set_release(&ctx->cq_wait_nr, nr_wait);
> +}
>
>  struct ext_arg {
>         size_t argsz;
> --
> 2.53.0
>
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/2] io_uring/mpscq: add lockless multi-producer, single-consumer FIFO queue
  2026-06-12  1:13   ` Caleb Sander Mateos
@ 2026-06-12  2:21     ` Jens Axboe
  2026-06-12  2:41       ` Caleb Sander Mateos
  0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2026-06-12  2:21 UTC (permalink / raw)
  To: Caleb Sander Mateos; +Cc: io-uring, dvyukov

On 6/11/26 7:13 PM, Caleb Sander Mateos wrote:
>> + * Push a node onto the queue. Safe against concurrent pushes from any context,
>> + * and against the (single) consumer. Returns the previous tail node, which is
>> + * &q->stub if and only if the queue was empty before this push.
>> + */
>> +static inline struct llist_node *mpscq_push(struct mpscq *q,
>> +                                           struct llist_node *node)
> 
> It seems odd to return the previous tail node. The pointer can't be
> dereferenced, as the node could be popped and freed at any point. The
> return value is only compared against &stub  to determine whether the
> queue was empty. Seems like the interface would be simpler and avoid
> leaking implementation details by just returning whether the queue was
> empty before the push.

That's not a bad idea, I'll take a look at that. I have a v2 of the
series which converts the non-defer task_work as well, so need to send
that out. Will do so tomorrow.
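
As a rough illustration of the interface suggested above, a push that
reports only whether the queue was empty could look like the user-space
sketch below. It is written with C11 atomics rather than the kernel
primitives, and everything in it is an assumption for illustration:
struct node stands in for struct llist_node, and the release store is a
conservative C11 choice, not what the patch uses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct llist_node. */
struct node {
	_Atomic(struct node *) next;
};

struct mpscq {
	_Atomic(struct node *) tail;	/* swapped by producers */
	struct node stub;		/* placeholder while the queue is empty */
};

static void mpscq_init(struct mpscq *q)
{
	atomic_init(&q->stub.next, NULL);
	atomic_init(&q->tail, &q->stub);
}

/*
 * Push a node and return true if the queue was empty before the push.
 * The previous tail is never handed to the caller, so it cannot be
 * dereferenced by mistake after the consumer has popped and freed it.
 */
static bool mpscq_push(struct mpscq *q, struct node *node)
{
	struct node *prev;

	assert(node != &q->stub);
	atomic_store_explicit(&node->next, NULL, memory_order_relaxed);
	/* Publish the node as the new tail, then link it to the old tail. */
	prev = atomic_exchange(&q->tail, node);
	atomic_store_explicit(&prev->next, node, memory_order_release);
	return prev == &q->stub;
}
```

A caller would then test the boolean directly instead of comparing the
returned pointer against &q->stub.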

>> +{
>> +       struct llist_node *prev;
>> +
>> +       node->next = NULL;
>> +       /*
>> +        * xchg() implies a full barrier, so the initialization of the
>> +        * entry (including ->next above) is visible before the node can
>> +        * be reached, either via ->tail or via ->next chasing from the
>> +        * head once the store below has linked it.
>> +        */
>> +       prev = xchg(&q->tail, node);
>> +       WRITE_ONCE(prev->next, node);
> 
> I think this needs to be a release-order store and the READ_ONCE()s in
> mpscq_pop() need to be acquire-order loads. Since mpscq_pop() doesn't
> necessarily load q->tail, there's no happens-before relationship
> between pushing a node and popping it.

Don't think that's necessary. The xchg() is fully ordered and hence acts
as smp_mb() on both sides — so every init store propagates before the
link store. A release on the link store would only add ordering for
stores issued between the xchg and the link, but we have none of those.

For the consumer, every dereference of a node should be
address-dependent on the READ_ONCE() that observed it.
Address dependencies from marked loads are honored everywhere, for
example alpha even has a read barrier there.

>> +       return prev;
>> +}
>> +
>> +/*
>> + * Pop the oldest node off the queue, or return NULL if no node is available.
>> + * NULL is returned both when the queue is empty and when a producer has
>> + * published a node via ->tail but hasn't linked it yet; use mpscq_empty() to
>> + * tell the two apart. Single consumer only, with headp being the consumer
>> + * cursor that mpscq_init() set up.
>> + */
>> +static inline struct llist_node *mpscq_pop(struct mpscq *q,
>> +                                          struct llist_node **headp)
>> +{
>> +       struct llist_node *head = *headp;
>> +       struct llist_node *next = READ_ONCE(head->next);
>> +
>> +       if (head == &q->stub) {
>> +               if (!next)
>> +                       return NULL;
>> +               *headp = next;
>> +               head = next;
>> +               next = READ_ONCE(head->next);
>> +       }
> 
> I would find it a bit clearer to avoid using "next" to refer to the
> actual head in the stub case:
> 
> struct llist_node *head = *headp, *next;
> if (head == &q->stub) {
>         head = READ_ONCE(head->next);
>         if (!head)
>                 return NULL;
>        *headp = head;
> }
> next = READ_ONCE(head->next);

I'll see if I can make that part look neater, I agree with you here.

>> +       if (next) {
>> +               *headp = next;
>> +               return head;
>> +       }
>> +       /*
>> +        * 'head' is the last linked node, it can only be handed out once the
>> +        * stub has taken its place as the tail. If the cmpxchg fails, a
>> +        * producer has made a new node the tail but hasn't linked it to 'head'
> 
> nit: "but hasn't linked 'head' to it" since the pointer goes from head
> to the new tail?

Good catch, yes it should read from head to new tail.

>> +        * yet - bail and let the caller retry.
>> +        */
>> +       q->stub.next = NULL;
>> +       if (try_cmpxchg(&q->tail, &head, &q->stub)) {
>> +               *headp = &q->stub;
>> +               return head;
>> +       }
>> +       return NULL;
> 
> An early return if the try_cmpxchg() fails would reduce indentation of
> the successful path.

I deliberately wrote it that way, reads better to me...
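
For readers who want to experiment outside the kernel, the push/pop
logic quoted above can be approximated in user space with C11 atomics.
This is only a sketch under stated assumptions: struct node stands in
for struct llist_node, the body of mpscq_empty() is a guess (it is not
shown in this thread), and the stub handling follows the restructuring
suggested in the review. C11 has no usable equivalent of the kernel's
address-dependency ordering, so the link store is a release and the
->next loads are acquires here. That is the conservative choice for
C11, not a claim about what the kernel code needs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct llist_node. */
struct node {
	_Atomic(struct node *) next;
};

struct mpscq {
	_Atomic(struct node *) tail;
	struct node stub;
};

static void mpscq_init(struct mpscq *q, struct node **headp)
{
	atomic_init(&q->stub.next, NULL);
	atomic_init(&q->tail, &q->stub);
	*headp = &q->stub;	/* consumer cursor starts at the stub */
}

/* Assumed implementation: the queue is empty iff the stub is the tail. */
static bool mpscq_empty(struct mpscq *q)
{
	return atomic_load(&q->tail) == &q->stub;
}

/* Any number of producers. Returns the previous tail. */
static struct node *mpscq_push(struct mpscq *q, struct node *node)
{
	struct node *prev;

	assert(node != &q->stub);
	atomic_store_explicit(&node->next, NULL, memory_order_relaxed);
	prev = atomic_exchange(&q->tail, node);
	atomic_store_explicit(&prev->next, node, memory_order_release);
	return prev;
}

/* Single consumer. NULL means empty, or a producer has not linked yet. */
static struct node *mpscq_pop(struct mpscq *q, struct node **headp)
{
	struct node *head = *headp, *next, *expected;

	if (head == &q->stub) {
		/* Skip over the stub to the first real node, if any. */
		head = atomic_load_explicit(&head->next, memory_order_acquire);
		if (!head)
			return NULL;
		*headp = head;
	}
	next = atomic_load_explicit(&head->next, memory_order_acquire);
	if (next) {
		*headp = next;
		return head;
	}
	/*
	 * 'head' is the last linked node. It can only be handed out once
	 * the stub has replaced it as the tail. If the exchange fails, a
	 * producer has swapped in a new tail but has not yet linked
	 * 'head' to it, so bail and let the caller retry.
	 */
	atomic_store_explicit(&q->stub.next, NULL, memory_order_relaxed);
	expected = head;
	if (atomic_compare_exchange_strong(&q->tail, &expected, &q->stub)) {
		*headp = &q->stub;
		return head;
	}
	return NULL;
}
```

Nodes come back in the order they were pushed, and once the last node
has been popped the cursor returns to the stub and mpscq_empty() is
true again.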

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] io_uring: switch local task_work to a mpscq
  2026-06-12  1:14   ` Caleb Sander Mateos
@ 2026-06-12  2:23     ` Jens Axboe
  2026-06-12  5:24       ` Caleb Sander Mateos
  0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2026-06-12  2:23 UTC (permalink / raw)
  To: Caleb Sander Mateos; +Cc: io-uring, dvyukov

On 6/11/26 7:14 PM, Caleb Sander Mateos wrote:
> This is great stuff! I had also observed these hotspots on a ublk
> workload. Since incoming ublk requests post task work to the ublk
> server's io_urings and completed ublk requests post task work to the
> client's io_urings, there is significant cross-CPU contention on the
> task work queues.

Glad you like it! Once I post v2 tomorrow, perhaps you can try and run
some tests with and without and see how it does for you?

>> @@ -185,55 +183,47 @@ void io_req_local_work_add(struct io_kiocb *req, unsigned flags)
>>
>>         guard(rcu)();
> 
> Is the RCU guard still required now that a work list element can't be
> accessed after the consumer has popped it?

It's actually not. Might need the :

	if (prev == &ctx->work_list.stub) {
		io_ctx_mark_taskrun(ctx);

parts to just grab it in there, as lower down we'd still need it. But
the task_work part itself should not. I'll make that change.

Thanks for the reviews!

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/2] io_uring/mpscq: add lockless multi-producer, single-consumer FIFO queue
  2026-06-12  2:21     ` Jens Axboe
@ 2026-06-12  2:41       ` Caleb Sander Mateos
  0 siblings, 0 replies; 16+ messages in thread
From: Caleb Sander Mateos @ 2026-06-12  2:41 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, dvyukov

On Thu, Jun 11, 2026 at 7:21 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 6/11/26 7:13 PM, Caleb Sander Mateos wrote:
> >> + * Push a node onto the queue. Safe against concurrent pushes from any context,
> >> + * and against the (single) consumer. Returns the previous tail node, which is
> >> + * &q->stub if and only if the queue was empty before this push.
> >> + */
> >> +static inline struct llist_node *mpscq_push(struct mpscq *q,
> >> +                                           struct llist_node *node)
> >
> > It seems odd to return the previous tail node. The pointer can't be
> > dereferenced, as the node could be popped and freed at any point. The
> > return value is only compared against &stub  to determine whether the
> > queue was empty. Seems like the interface would be simpler and avoid
> > leaking implementation details by just returning whether the queue was
> > empty before the push.
>
> That's not a bad idea, I'll take a look at that. I have a v2 of the
> series which converts the non-defer task_work as well, so need to send
> that out. Will do so tomorrow.
>
> >> +{
> >> +       struct llist_node *prev;
> >> +
> >> +       node->next = NULL;
> >> +       /*
> >> +        * xchg() implies a full barrier, so the initialization of the
> >> +        * entry (including ->next above) is visible before the node can
> >> +        * be reached, either via ->tail or via ->next chasing from the
> >> +        * head once the store below has linked it.
> >> +        */
> >> +       prev = xchg(&q->tail, node);
> >> +       WRITE_ONCE(prev->next, node);
> >
> > I think this needs to be a release-order store and the READ_ONCE()s in
> > mpscq_pop() need to be acquire-order loads. Since mpscq_pop() doesn't
> > necessarily load q->tail, there's no happens-before relationship
> > between pushing a node and popping it.
>
> Don't think that's necessary. The xchg() is fully ordered and hence acts
> as smp_mb() on both sides — so every init store propagates before the
> link store. A release on the link store would only add ordering for
> stores issued between the xchg and the link, but we have none of those.
>
> For the consumer, every dereference of a node should be
> address-dependent on the READ_ONCE() that observed it.
> Address dependencies from marked loads are honored everywhere, for
> example alpha even has a read barrier there.

I'm not too familiar with the Linux kernel memory model, you're
probably right :)

Best,
Caleb

>
> >> +       return prev;
> >> +}
> >> +
> >> +/*
> >> + * Pop the oldest node off the queue, or return NULL if no node is available.
> >> + * NULL is returned both when the queue is empty and when a producer has
> >> + * published a node via ->tail but hasn't linked it yet; use mpscq_empty() to
> >> + * tell the two apart. Single consumer only, with headp being the consumer
> >> + * cursor that mpscq_init() set up.
> >> + */
> >> +static inline struct llist_node *mpscq_pop(struct mpscq *q,
> >> +                                          struct llist_node **headp)
> >> +{
> >> +       struct llist_node *head = *headp;
> >> +       struct llist_node *next = READ_ONCE(head->next);
> >> +
> >> +       if (head == &q->stub) {
> >> +               if (!next)
> >> +                       return NULL;
> >> +               *headp = next;
> >> +               head = next;
> >> +               next = READ_ONCE(head->next);
> >> +       }
> >
> > I would find it a bit clearer to avoid using "next" to refer to the
> > actual head in the stub case:
> >
> > struct llist_node *head = *headp, *next;
> > if (head == &q->stub) {
> >         head = READ_ONCE(head->next);
> >         if (!head)
> >                 return NULL;
> >        *headp = head;
> > }
> > next = READ_ONCE(head->next);
>
> I'll see if I can make that part look neater, I agree with you here.
>
> >> +       if (next) {
> >> +               *headp = next;
> >> +               return head;
> >> +       }
> >> +       /*
> >> +        * 'head' is the last linked node, it can only be handed out once the
> >> +        * stub has taken its place as the tail. If the cmpxchg fails, a
> >> +        * producer has made a new node the tail but hasn't linked it to 'head'
> >
> > nit: "but hasn't linked 'head' to it" since the pointer goes from head
> > to the new tail?
>
> Good catch, yes it should read from head to new tail.
>
> >> +        * yet - bail and let the caller retry.
> >> +        */
> >> +       q->stub.next = NULL;
> >> +       if (try_cmpxchg(&q->tail, &head, &q->stub)) {
> >> +               *headp = &q->stub;
> >> +               return head;
> >> +       }
> >> +       return NULL;
> >
> > An early return if the try_cmpxchg() fails would reduce indentation of
> > the successful path.
>
> I deliberately wrote it that way, reads better to me...
>
> --
> Jens Axboe

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] io_uring: switch local task_work to a mpscq
  2026-06-12  2:23     ` Jens Axboe
@ 2026-06-12  5:24       ` Caleb Sander Mateos
  2026-06-12 12:21         ` Jens Axboe
  0 siblings, 1 reply; 16+ messages in thread
From: Caleb Sander Mateos @ 2026-06-12  5:24 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, dvyukov

On Thu, Jun 11, 2026 at 7:23 PM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 6/11/26 7:14 PM, Caleb Sander Mateos wrote:
> > This is great stuff! I had also observed these hotspots on a ublk
> > workload. Since incoming ublk requests post task work to the ublk
> > server's io_urings and completed ublk requests post task work to the
> > client's io_urings, there is significant cross-CPU contention on the
> > task work queues.
>
> Glad you like it! Once I post v2 tomorrow, perhaps you can try and run
> some tests with and without and see how it does for you?

Haven't tested v2 yet, but v1 shows a 4% IOPS improvement on a ublk
4-KB read workload. The workload has 8 CPUs (unpaired hypertwins)
running fio with io_uring submitting I/O to the ublk devices and 32
ublk server CPUs (paired hypertwins) servicing the requests, achieving
around 4M IOPS. Both the client and server CPUs look completely busy.
I can see clear reductions in __io_req_task_work_add() and
llist_reverse_order() (now gone) on both sets of CPUs, though the
cache misses popping task work items are now attributed to
__io_run_local_work() instead.

Thanks,
Caleb

>
> >> @@ -185,55 +183,47 @@ void io_req_local_work_add(struct io_kiocb *req, unsigned flags)
> >>
> >>         guard(rcu)();
> >
> > Is the RCU guard still required now that a work list element can't be
> > accessed after the consumer has popped it?
>
> It's actually not. Might need the :
>
>         if (prev == &ctx->work_list.stub) {
>                 io_ctx_mark_taskrun(ctx);
>
> parts to just grab it in there, as lower down we'd still need it. But
> the task_work part itself should not. I'll make that change.
>
> Thanks for the reviews!
>
> --
> Jens Axboe

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] io_uring: switch local task_work to a mpscq
  2026-06-12  5:24       ` Caleb Sander Mateos
@ 2026-06-12 12:21         ` Jens Axboe
  2026-06-12 15:11           ` Jens Axboe
  0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2026-06-12 12:21 UTC (permalink / raw)
  To: Caleb Sander Mateos; +Cc: io-uring, dvyukov

On 6/11/26 11:24 PM, Caleb Sander Mateos wrote:
> On Thu, Jun 11, 2026 at 7:23 PM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 6/11/26 7:14 PM, Caleb Sander Mateos wrote:
>>> This is great stuff! I had also observed these hotspots on a ublk
>>> workload. Since incoming ublk requests post task work to the ublk
>>> server's io_urings and completed ublk requests post task work to the
>>> client's io_urings, there is significant cross-CPU contention on the
>>> task work queues.
>>
>> Glad you like it! Once I post v2 tomorrow, perhaps you can try and run
>> some tests with and without and see how it does for you?
> 
> Haven't tested v2 yet, but v1 shows a 4% IOPS improvement on a ublk
> 4-KB read workload. The workload has 8 CPUs (unpaired hypertwins)
> running fio with io_uring submitting I/O to the ublk devices and 32
> ublk server CPUs (paired hypertwins) servicing the requests, achieving
> around 4M IOPS. Both the client and server CPUs look completely busy.

That's a pretty nice improvement! Would be curious to hear what v2 looks
like.

> I can see clear reductions in __io_req_task_work_add() and
> llist_reverse_order() (now gone) on both sets of CPUs, though the
> cache misses popping task work items are now attributed to
> __io_run_local_work() instead.

Right, llist_reverse_order() previously could have had the useful side
effect of priming the cache. Sometimes that could be useful, if the
task_work itself was basically just posting a CQE. Other times, when the
task_work itself does actual work (eg socket recv), then it was just
harmful. For the former case, we could potentially prefetch() next when
popping. Not sure it's worth it though, though we could experiment with
something along those lines.
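
To make the prefetch idea above concrete, here is a minimal user-space
sketch. It uses a plain singly linked list as a stand-in for the queue
and the GCC/Clang __builtin_prefetch() builtin in place of the kernel's
prefetch(). The names pop_and_prefetch(), run_all() and struct work are
hypothetical, and whether this helps at all would need measuring.

```c
#include <assert.h>
#include <stddef.h>

struct node {
	struct node *next;
};

/* Hypothetical work item embedding the list node. */
struct work {
	struct node node;
	int payload;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/*
 * Pop the first node and start pulling the following one into the cache
 * so that it is (hopefully) warm by the time the caller gets to it.
 */
static struct node *pop_and_prefetch(struct node **headp)
{
	struct node *head = *headp;

	if (!head)
		return NULL;
	*headp = head->next;
	if (head->next)
		__builtin_prefetch(head->next, 0, 3);	/* read, keep in cache */
	return head;
}

/* Run every queued item; returns how many were run, sums the payloads. */
static int run_all(struct node **headp, int *sum)
{
	struct node *n;
	int ran = 0;

	while ((n = pop_and_prefetch(headp)) != NULL) {
		*sum += container_of(n, struct work, node)->payload;
		ran++;
	}
	return ran;
}
```

The prefetch only pays off when the handler is short enough that the
next node's cache miss would otherwise dominate, which matches the
"basically just posting a CQE" case described above.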

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] io_uring: switch local task_work to a mpscq
  2026-06-12 12:21         ` Jens Axboe
@ 2026-06-12 15:11           ` Jens Axboe
  2026-06-15 17:55             ` Caleb Sander Mateos
  0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2026-06-12 15:11 UTC (permalink / raw)
  To: Caleb Sander Mateos; +Cc: io-uring, dvyukov

On 6/12/26 6:21 AM, Jens Axboe wrote:
> On 6/11/26 11:24 PM, Caleb Sander Mateos wrote:
>> On Thu, Jun 11, 2026 at 7:23 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>
>>> On 6/11/26 7:14 PM, Caleb Sander Mateos wrote:
>>>> This is great stuff! I had also observed these hotspots on a ublk
>>>> workload. Since incoming ublk requests post task work to the ublk
>>>> server's io_urings and completed ublk requests post task work to the
>>>> client's io_urings, there is significant cross-CPU contention on the
>>>> task work queues.
>>>
>>> Glad you like it! Once I post v2 tomorrow, perhaps you can try and run
>>> some tests with and without and see how it does for you?
>>
>> Haven't tested v2 yet, but v1 shows a 4% IOPS improvement on a ublk
>> 4-KB read workload. The workload has 8 CPUs (unpaired hypertwins)
>> running fio with io_uring submitting I/O to the ublk devices and 32
>> ublk server CPUs (paired hypertwins) servicing the requests, achieving
>> around 4M IOPS. Both the client and server CPUs look completely busy.
> 
> That's a pretty nice improvement! Would be curious to hear what v2 looks
> like.

And here's some more stuff on top you might find interesting. For a
6 NVMe drive test, it drops my task work usage from top-of-profiles
to ~2%.

https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux.git/log/?h=io_uring-tw-mpscq-batch

The patches sit on top of the io_uring-tw-mpscq branch.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] io_uring: switch local task_work to a mpscq
  2026-06-12 15:11           ` Jens Axboe
@ 2026-06-15 17:55             ` Caleb Sander Mateos
  2026-06-15 18:00               ` Jens Axboe
  0 siblings, 1 reply; 16+ messages in thread
From: Caleb Sander Mateos @ 2026-06-15 17:55 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, dvyukov

On Fri, Jun 12, 2026 at 8:11 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 6/12/26 6:21 AM, Jens Axboe wrote:
> > On 6/11/26 11:24 PM, Caleb Sander Mateos wrote:
> >> On Thu, Jun 11, 2026 at 7:23 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>>
> >>> On 6/11/26 7:14 PM, Caleb Sander Mateos wrote:
> >>>> This is great stuff! I had also observed these hotspots on a ublk
> >>>> workload. Since incoming ublk requests post task work to the ublk
> >>>> server's io_urings and completed ublk requests post task work to the
> >>>> client's io_urings, there is significant cross-CPU contention on the
> >>>> task work queues.
> >>>
> >>> Glad you like it! Once I post v2 tomorrow, perhaps you can try and run
> >>> some tests with and without and see how it does for you?
> >>
> >> Haven't tested v2 yet, but v1 shows a 4% IOPS improvement on a ublk
> >> 4-KB read workload. The workload has 8 CPUs (unpaired hypertwins)
> >> running fio with io_uring submitting I/O to the ublk devices and 32
> >> ublk server CPUs (paired hypertwins) servicing the requests, achieving
> >> around 4M IOPS. Both the client and server CPUs look completely busy.
> >
> > That's a pretty nice improvement! Would be curious to hear what v2 looks
> > like.

Looks the same as v1, which makes sense as both the client and server
are using IORING_SETUP_DEFER_TASKRUN.

I did observe fio seem to get stuck forever on one out of the 85 or so
runs, though. I'm a little concerned there might be a missing wakeup.
It was using the default iodepth_batch_complete_min=1 (waiting for
io_uring completions) and IORING_SETUP_DEFER_TASKRUN.

>
> And here's some more stuff on top you might find interesting. For a
> 6 NVMe drive test, it drops my task work usage from top-of-profiles
> to ~2%.
>
> https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux.git/log/?h=io_uring-tw-mpscq-batch
>
> The patches sit on top of the io_uring-tw-mpscq branch.

Yeah there are some interesting ideas there.

The ublk server isn't using UBLK_F_BATCH_IO, so it unfortunately
wouldn't benefit from the task work batching for
UBLK_U_IO_COMMIT_IO_CMDS. The batching would probably need to be
scoped to the whole io_submit_sqes() in order to allow batching across
the multiple UBLK_U_IO_COMMIT_AND_FETCH_REQ commands. I'm also not
sure about the claim that __ublk_walk_cmd_buf() won't sleep;
ublk_batch_commit_io() calls io_buffer_unregister_bvec(), which could
sleep depending on the io_uring issue_flags.

The NVMe passthrough task work batching could definitely reduce
contention on the task work queue. I'll run a perf test.

Thanks,
Caleb

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] io_uring: switch local task_work to a mpscq
  2026-06-15 17:55             ` Caleb Sander Mateos
@ 2026-06-15 18:00               ` Jens Axboe
  2026-06-16 20:21                 ` Caleb Sander Mateos
  0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2026-06-15 18:00 UTC (permalink / raw)
  To: Caleb Sander Mateos; +Cc: io-uring, dvyukov

On 6/15/26 11:55 AM, Caleb Sander Mateos wrote:
> On Fri, Jun 12, 2026 at 8:11 AM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 6/12/26 6:21 AM, Jens Axboe wrote:
>>> On 6/11/26 11:24 PM, Caleb Sander Mateos wrote:
>>>> On Thu, Jun 11, 2026 at 7:23 PM Jens Axboe <axboe@kernel.dk> wrote:
>>>>>
>>>>> On 6/11/26 7:14 PM, Caleb Sander Mateos wrote:
>>>>>> This is great stuff! I had also observed these hotspots on a ublk
>>>>>> workload. Since incoming ublk requests post task work to the ublk
>>>>>> server's io_urings and completed ublk requests post task work to the
>>>>>> client's io_urings, there is significant cross-CPU contention on the
>>>>>> task work queues.
>>>>>
>>>>> Glad you like it! Once I post v2 tomorrow, perhaps you can try and run
>>>>> some tests with and without and see how it does for you?
>>>>
>>>> Haven't tested v2 yet, but v1 shows a 4% IOPS improvement on a ublk
>>>> 4-KB read workload. The workload has 8 CPUs (unpaired hypertwins)
>>>> running fio with io_uring submitting I/O to the ublk devices and 32
>>>> ublk server CPUs (paired hypertwins) servicing the requests, achieving
>>>> around 4M IOPS. Both the client and server CPUs look completely busy.
>>>
>>> That's a pretty nice improvement! Would be curious to hear what v2 looks
>>> like.
> 
> Looks the same as v1, which makes sense as both the client and server
> are using IORING_SETUP_DEFER_TASKRUN.

OK, sounds good.

> I did observe fio seem to get stuck forever on one out of the 85 or so
> runs, though. I'm a little concerned there might be a missing wakeup.
> It was using the default iodepth_batch_complete_min=1 (waiting for
> io_uring completions) and IORING_SETUP_DEFER_TASKRUN.

There's a bug in v2 where it can get missed, the in-tree code should
have that fixed. It was the atomic_dec_and_test() and
atomic_try_cmpxchg() in io_req_local_work_add() racing.

>> And here's some more stuff on top you might find interesting. For a
>> 6 NVMe drive test, it drops my task work usage from top-of-profiles
>> to ~2%.
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux.git/log/?h=io_uring-tw-mpscq-batch
>>
>> The patches sit on top of the io_uring-tw-mpscq branch.
> 
> Yeah there are some interesting ideas there.
> 
> The ublk server isn't using UBLK_F_BATCH_IO, so it unfortunately
> wouldn't benefit from the task work batching for
> UBLK_U_IO_COMMIT_IO_CMDS. The batching would probably need to be
> scoped to the whole io_submit_sqes() in order to allow batching across
> the multiple UBLK_U_IO_COMMIT_AND_FETCH_REQ commands. I'm also not
> sure about the claim that __ublk_walk_cmd_buf() won't sleep;
> ublk_batch_commit_io() calls io_buffer_unregister_bvec(), which could
> sleep depending on the io_uring issue_flags.

It's very much just a POC series of things... I suspect to get the
benefit of it, we'd need a bit of refactoring and reworking first. It
was more to get the idea out/across, not going anywhere right now.

> The NVMe passthrough task work batching could definitely reduce
> contention on the task work queue. I'll run a perf test.

Thanks!

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] io_uring: switch local task_work to a mpscq
  2026-06-15 18:00               ` Jens Axboe
@ 2026-06-16 20:21                 ` Caleb Sander Mateos
  0 siblings, 0 replies; 16+ messages in thread
From: Caleb Sander Mateos @ 2026-06-16 20:21 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, dvyukov

On Mon, Jun 15, 2026 at 11:00 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 6/15/26 11:55 AM, Caleb Sander Mateos wrote:
> > On Fri, Jun 12, 2026 at 8:11 AM Jens Axboe <axboe@kernel.dk> wrote:
> >>
> >> On 6/12/26 6:21 AM, Jens Axboe wrote:
> >>> On 6/11/26 11:24 PM, Caleb Sander Mateos wrote:
> >>>> On Thu, Jun 11, 2026 at 7:23 PM Jens Axboe <axboe@kernel.dk> wrote:
> >>>>>
> >>>>> On 6/11/26 7:14 PM, Caleb Sander Mateos wrote:
> >>>>>> This is great stuff! I had also observed these hotspots on a ublk
> >>>>>> workload. Since incoming ublk requests post task work to the ublk
> >>>>>> server's io_urings and completed ublk requests post task work to the
> >>>>>> client's io_urings, there is significant cross-CPU contention on the
> >>>>>> task work queues.
> >>>>>
> >>>>> Glad you like it! Once I post v2 tomorrow, perhaps you can try and run
> >>>>> some tests with and without and see how it does for you?
> >>>>
> >>>> Haven't tested v2 yet, but v1 shows a 4% IOPS improvement on a ublk
> >>>> 4-KB read workload. The workload has 8 CPUs (unpaired hypertwins)
> >>>> running fio with io_uring submitting I/O to the ublk devices and 32
> >>>> ublk server CPUs (paired hypertwins) servicing the requests, achieving
> >>>> around 4M IOPS. Both the client and server CPUs look completely busy.
> >>>
> >>> That's a pretty nice improvement! Would be curious to hear what v2 looks
> >>> like.
> >
> > Looks the same as v1, which makes sense as both the client and server
> > are using IORING_SETUP_DEFER_TASKRUN.
>
> OK, sounds good.
>
> > I did observe fio seem to get stuck forever on one out of the 85 or so
> > runs, though. I'm a little concerned there might be a missing wakeup.
> > It was using the default iodepth_batch_complete_min=1 (waiting for
> > io_uring completions) and IORING_SETUP_DEFER_TASKRUN.
>
> There's a bug in v2 where it can get missed, the in-tree code should
> have that fixed. It was the atomic_dec_and_test() and
> atomic_try_cmpxchg() in io_req_local_work_add() racing.

Great, glad it's already fixed.

>
> >> And here's some more stuff on top you might find interesting. For a
> >> 6 NVMe drive test, it drops my task work usage from top-of-profiles
> >> to ~2%.
> >>
> >> https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux.git/log/?h=io_uring-tw-mpscq-batch
> >>
> >> The patches sit on top of the io_uring-tw-mpscq branch.
> >
> > Yeah there are some interesting ideas there.
> >
> > The ublk server isn't using UBLK_F_BATCH_IO, so it unfortunately
> > wouldn't benefit from the task work batching for
> > UBLK_U_IO_COMMIT_IO_CMDS. The batching would probably need to be
> > scoped to the whole io_submit_sqes() in order to allow batching across
> > the multiple UBLK_U_IO_COMMIT_AND_FETCH_REQ commands. I'm also not
> > sure about the claim that __ublk_walk_cmd_buf() won't sleep;
> > ublk_batch_commit_io() calls io_buffer_unregister_bvec(), which could
> > sleep depending on the io_uring issue_flags.
>
> It's very much just a POC series of things... I suspect to get the
> benefit of it, we'd need a bit of refactoring and reworking first. It
> was more to get the idea out/across, not going anywhere right now.
>
> > The NVMe passthrough task work batching could definitely reduce
> > contention on the task work queue. I'll run a perf test.
>
> Thanks!

I tried it out and the 4K read throughput looks a little lower
actually (about a 1.7% improvement over the baseline vs. 2.6% with
just v2). Since the workload is loading 24 NVMe devices, I suspect
there just isn't much to be gained from batching completions within a
single NVMe queue.

I do see the time in __io_req_task_work_add() on the ublk server went
down from 1.04% to 0.43%, though there's now 0.67% in the newly added
io_local_work_flush_batch().

The time in update_io_ticks() (largely from blk_account_io_done() on
NVMe completions) increased from 0.18% to 1.11%, though I'm a little
surprised that would be caused by these patches.

Best,
Caleb
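
For readers following the batching discussion above, here is a minimal
user-space sketch of why publishing a pre-linked batch can reduce
contention on an intrusive MPSC queue: the producer pays one atomic
exchange per batch instead of one per item. This is not the code from
the io_uring-tw-mpscq or io_uring-tw-mpscq-batch branches. It uses C11
atomics rather than kernel primitives, and every name in it (struct
node, struct mpscq, mpscq_link, mpscq_push_chain, mpscq_pop) is
hypothetical. The layout follows the general shape of Dmitry Vyukov's
intrusive node-based queue with a stub node.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical intrusive node; 'value' stands in for a real payload. */
struct node {
	_Atomic(struct node *) next;
	int value;
};

/* 'head' is where producers append, 'tail' is private to the consumer. */
struct mpscq {
	_Atomic(struct node *) head;
	struct node *tail;
	struct node stub;
};

static void mpscq_init(struct mpscq *q)
{
	atomic_store(&q->stub.next, NULL);
	atomic_store(&q->head, &q->stub);
	q->tail = &q->stub;
}

/* Link b after a while the chain is still private to one producer. */
static void mpscq_link(struct node *a, struct node *b)
{
	atomic_store_explicit(&a->next, b, memory_order_relaxed);
}

/*
 * Publish a pre-linked chain first..last. The shared 'head' is touched
 * by exactly one atomic exchange, however long the chain is. A single
 * node is the special case first == last.
 */
static void mpscq_push_chain(struct mpscq *q, struct node *first,
			     struct node *last)
{
	struct node *prev;

	atomic_store_explicit(&last->next, NULL, memory_order_relaxed);
	prev = atomic_exchange_explicit(&q->head, last, memory_order_acq_rel);
	/* Until this store lands, the consumer sees the chain as absent. */
	atomic_store_explicit(&prev->next, first, memory_order_release);
}

/* Single consumer only. Returns NULL if empty or a push is in flight. */
static struct node *mpscq_pop(struct mpscq *q)
{
	struct node *tail = q->tail;
	struct node *next = atomic_load_explicit(&tail->next,
						 memory_order_acquire);

	if (tail == &q->stub) {
		if (!next)
			return NULL;
		q->tail = next;
		tail = next;
		next = atomic_load_explicit(&tail->next, memory_order_acquire);
	}
	if (next) {
		q->tail = next;
		return tail;
	}
	if (tail != atomic_load_explicit(&q->head, memory_order_acquire))
		return NULL;	/* a producer has not linked its chain yet */

	/* Last real node: re-insert the stub so 'tail' can be handed out. */
	mpscq_push_chain(q, &q->stub, &q->stub);
	next = atomic_load_explicit(&tail->next, memory_order_acquire);
	if (next) {
		q->tail = next;
		return tail;
	}
	return NULL;
}
```

A producer that has three completions ready would call mpscq_link()
twice and then mpscq_push_chain() once, where pushing them one by one
would hit the shared head three times. Whether that saving shows up in
a profile depends on how many items actually land in each batch, which
is consistent with the small gain reported above when the load is
spread over many NVMe queues.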


end of thread, other threads:[~2026-06-16 20:21 UTC | newest]

Thread overview: 16+ messages
2026-06-11 15:58 [PATCHSET 0/2] Add lockless MPSC FIFO queue for task work Jens Axboe
2026-06-11 15:58 ` [PATCH 1/2] io_uring/mpscq: add lockless multi-producer, single-consumer FIFO queue Jens Axboe
2026-06-11 16:49   ` Gabriel Krisman Bertazi
2026-06-11 16:58     ` Jens Axboe
2026-06-12  1:13   ` Caleb Sander Mateos
2026-06-12  2:21     ` Jens Axboe
2026-06-12  2:41       ` Caleb Sander Mateos
2026-06-11 15:58 ` [PATCH 2/2] io_uring: switch local task_work to a mpscq Jens Axboe
2026-06-12  1:14   ` Caleb Sander Mateos
2026-06-12  2:23     ` Jens Axboe
2026-06-12  5:24       ` Caleb Sander Mateos
2026-06-12 12:21         ` Jens Axboe
2026-06-12 15:11           ` Jens Axboe
2026-06-15 17:55             ` Caleb Sander Mateos
2026-06-15 18:00               ` Jens Axboe
2026-06-16 20:21                 ` Caleb Sander Mateos
