All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jens Axboe <axboe@kernel.dk>
To: io-uring@vger.kernel.org
Cc: dvyukov@google.com, Jens Axboe <axboe@kernel.dk>
Subject: [PATCH 1/2] io_uring/mpscq: add lockless multi-producer, single-consumer FIFO queue
Date: Thu, 11 Jun 2026 09:58:41 -0600	[thread overview]
Message-ID: <20260611160553.1486640-2-axboe@kernel.dk> (raw)
In-Reply-To: <20260611160553.1486640-1-axboe@kernel.dk>

Local task_work is currently using llists for managing the work,
but that's a LIFO type of list. This means that running this task_work
needs to reverse the list first, to ensure fairness in running the
queued items.

Add a lockless FIFO queued, based on Dmitry Vyukov's intrusive MPSC
node-based queue algorithm, modified with an externally held consumer
cursor and conditional stub reinsertion. See comments in the header.

Producers are wait-free: a push is a single xchg() on the queue tail,
which serializes concurrent producers and defines the FIFO order, plus
a store linking the node to its predecessor. There are no cmpxchg retry
loops, and pushing is safe from any context, including hardirq.

The cost of linked list FIFO ordering is that a push publishes the node
in two steps - the xchg() makes it visible as the new tail before the
subsequent store links it into the chain that is reachable from the
head. A consumer hitting that window gets a NULL from mpscq_pop() while
mpscq_empty() reports false, and must retry later rather than treat the
queue as empty. The window is two instructions wide, but a producer can
get preempted inside it, so the consumer must not busy wait on it.

The consumer side supports a single consumer at a time, with callers
providing their own serialization. A stub node, which also defines the
empty state (tail == stub), allows the consumer to detach the final
node without racing against producer link stores: that node is only
handed out once the stub has been cmpxchg'ed back in as the tail. This
also guarantees that the previous tail returned by mpscq_push() cannot
get freed before that push has linked it, making it always valid for
comparisons.

The consumer cursor is deliberately not part of the queue struct - the
caller owns it and passes it to mpscq_pop(). This is done to separate
the consumer and producers cacheline.The cursor is written for

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/io_uring_types.h |  12 ++++
 io_uring/mpscq.h               | 121 +++++++++++++++++++++++++++++++++
 2 files changed, 133 insertions(+)
 create mode 100644 io_uring/mpscq.h

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index aa4d5477f859..85e12b4884a5 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -55,6 +55,18 @@ struct io_wq_work_list {
 	struct io_wq_work_node *last;
 };
 
+/*
+ * Lockless multi-producer, single-consumer FIFO queue, see
+ * io_uring/mpscq.h for the implementation and rules. Defined here so
+ * that it can be embedded in io_ring_ctx. This is the producer side
+ * only - the consumer cursor is kept separately, on a cacheline that
+ * isn't dirtied by the producers.
+ */
+struct mpscq {
+	struct llist_node	*tail;		/* producers */
+	struct llist_node	stub;
+};
+
 struct io_wq_work {
 	struct io_wq_work_node list;
 	atomic_t flags;
diff --git a/io_uring/mpscq.h b/io_uring/mpscq.h
new file mode 100644
index 000000000000..12172cef8394
--- /dev/null
+++ b/io_uring/mpscq.h
@@ -0,0 +1,121 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef IOU_MPSCQ_H
+#define IOU_MPSCQ_H
+
+/*
+ * mpscq - lockless multi-producer, single-consumer FIFO queue
+ *
+ * Unlike llist, which is LIFO ordered and hence needs an O(n)
+ * llist_reverse_order() pass before entries can be processed in queue order,
+ * this queue hands out nodes in the order they were pushed.
+ *
+ * The consumer cursor is held by the caller rather than in the queue struct
+ * (see below), and with the stub reinsertion done as a single cmpxchg attempt
+ * instead of an unconditional push, keeping tail == stub a reliable empty test
+ * while a producer is in the middle of a push.
+ *
+ * Producers may run in any context (task, softirq, hardirq) and are wait-free:
+ * a push is one xchg() plus one store, with no retry loops. FIFO order between
+ * producers is the order in which the xchg() on ->tail serializes them.
+ *
+ * The price for linked-list FIFO is that a push publishes the node in two
+ * steps: the xchg() makes it the new tail, and the subsequent store links it to
+ * its predecessor. In between, the tail end of the queue is not yet reachable
+ * from the head. mpscq_pop() detects this and returns NULL, while mpscq_empty()
+ * reports false. The consumer must not treat such a NULL as "queue empty" - it
+ * should retry later. The window is two instructions wide, but a producer can
+ * be preempted inside it, so the consumer must not spin on it while holding
+ * resources the producer might need to make progress.
+ *
+ * The consumer side only supports a single consumer at a time, callers must
+ * provide their own serialization for it. The stub node is what allows the
+ * consumer to detach the final node without racing with the link stores of
+ * producers. This scheme also guarantees that the previous tail returned by
+ * mpscq_push() cannot be freed by the consumer until the push that returned it
+ * has linked it, hence it's always safe to compare against (but not
+ * dereference, unless the caller otherwise guarantees its lifetime).
+ *
+ * The queue struct only holds the producer side. The consumer keeps its cursor
+ * (the oldest not yet handed out node) externally and passes it to mpscq_pop(),
+ * so that it can be placed on a different cacheline: the cursor is written for
+ * every pop, and having it share a line with ->tail would have the consumer
+ * invalidating the line that producers need for every push.
+ */
+static inline void mpscq_init(struct mpscq *q, struct llist_node **headp)
+{
+	q->tail = *headp = &q->stub;
+	q->stub.next = NULL;
+}
+
+/*
+ * Returns true if the queue holds no entries that mpscq_pop() hasn't handed out
+ * yet. May be called from any context. Note that !empty doesn't guarantee that
+ * mpscq_pop() will return an entry yet, see the in-flight producer window
+ * above.
+ */
+static inline bool mpscq_empty(struct mpscq *q)
+{
+	return READ_ONCE(q->tail) == &q->stub;
+}
+
+/*
+ * Push a node onto the queue. Safe against concurrent pushes from any context,
+ * and against the (single) consumer. Returns the previous tail node, which is
+ * &q->stub if and only if the queue was empty before this push.
+ */
+static inline struct llist_node *mpscq_push(struct mpscq *q,
+					    struct llist_node *node)
+{
+	struct llist_node *prev;
+
+	node->next = NULL;
+	/*
+	 * xchg() implies a full barrier, so the initialization of the
+	 * entry (including ->next above) is visible before the node can
+	 * be reached, either via ->tail or via ->next chasing from the
+	 * head once the store below has linked it.
+	 */
+	prev = xchg(&q->tail, node);
+	WRITE_ONCE(prev->next, node);
+	return prev;
+}
+
+/*
+ * Pop the oldest node off the queue, or return NULL if no node is available.
+ * NULL is returned both when the queue is empty and when a producer has
+ * published a node via ->tail but hasn't linked it yet; use mpscq_empty() to
+ * tell the two apart. Single consumer only, with headp being the consumer
+ * cursor that mpscq_init() set up.
+ */
+static inline struct llist_node *mpscq_pop(struct mpscq *q,
+					   struct llist_node **headp)
+{
+	struct llist_node *head = *headp;
+	struct llist_node *next = READ_ONCE(head->next);
+
+	if (head == &q->stub) {
+		if (!next)
+			return NULL;
+		*headp = next;
+		head = next;
+		next = READ_ONCE(head->next);
+	}
+	if (next) {
+		*headp = next;
+		return head;
+	}
+	/*
+	 * 'head' is the last linked node, it can only be handed out once the
+	 * stub has taken its place as the tail. If the cmpxchg fails, a
+	 * producer has made a new node the tail but hasn't linked it to 'head'
+	 * yet - bail and let the caller retry.
+	 */
+	q->stub.next = NULL;
+	if (try_cmpxchg(&q->tail, &head, &q->stub)) {
+		*headp = &q->stub;
+		return head;
+	}
+	return NULL;
+}
+
+#endif /* IOU_MPSCQ_H */
-- 
2.53.0


  reply	other threads:[~2026-06-11 16:06 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-11 15:58 [PATCHSET 0/2] Add lockless MPSC FIFO queue for task work Jens Axboe
2026-06-11 15:58 ` Jens Axboe [this message]
2026-06-11 16:49   ` [PATCH 1/2] io_uring/mpscq: add lockless multi-producer, single-consumer FIFO queue Gabriel Krisman Bertazi
2026-06-11 16:58     ` Jens Axboe
2026-06-12  1:13   ` Caleb Sander Mateos
2026-06-12  2:21     ` Jens Axboe
2026-06-12  2:41       ` Caleb Sander Mateos
2026-06-11 15:58 ` [PATCH 2/2] io_uring: switch local task_work to a mpscq Jens Axboe
2026-06-12  1:14   ` Caleb Sander Mateos
2026-06-12  2:23     ` Jens Axboe
2026-06-12  5:24       ` Caleb Sander Mateos
2026-06-12 12:21         ` Jens Axboe
2026-06-12 15:11           ` Jens Axboe
2026-06-15 17:55             ` Caleb Sander Mateos
2026-06-15 18:00               ` Jens Axboe
2026-06-16 20:21                 ` Caleb Sander Mateos

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260611160553.1486640-2-axboe@kernel.dk \
    --to=axboe@kernel.dk \
    --cc=dvyukov@google.com \
    --cc=io-uring@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.