* [PATCHSET v2 0/8] Add support for mixed sized CQEs
@ 2025-08-21 14:18 Jens Axboe
  2025-08-21 14:18 ` [PATCH 1/8] io_uring: remove io_ctx_cqe32() helper Jens Axboe
                   ` (8 more replies)
  0 siblings, 9 replies; 16+ messages in thread
From: Jens Axboe @ 2025-08-21 14:18 UTC (permalink / raw)
  To: io-uring

Hi,

Currently io_uring supports two modes for CQEs:

1) The standard mode, where 16b CQEs are used
2) Setting IORING_SETUP_CQE32, which makes all CQEs posted 32b

Certain features need to pass more information back than just a single
32-bit res field, and hence mandate the use of CQE32 to be able to work.
Examples of that include passthrough or other uses of ->uring_cmd() like
socket option getting and setting, including timestamps.

This patchset adds support for IORING_SETUP_CQE_MIXED, which allows
posting both 16b and 32b CQEs on the same CQ ring. The idea here is that
we need not waste twice the space for CQ rings, or use twice the space
per CQE posted, if only some of the CQEs posted require the use of 32b
CQEs. On a ring setup in CQE mixed mode, 32b posted CQEs will have
IORING_CQE_F_32 set in cqe->flags to tell the application (or liburing)
about this fact.
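
For illustration, setting up such a ring with the raw syscall is just a
matter of passing the new flag. A minimal sketch, assuming a uapi header
that carries the definitions from this series:

#include <linux/io_uring.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Sketch only: create a ring that may post both 16b and 32b CQEs */
static int setup_mixed_ring(unsigned entries, struct io_uring_params *p)
{
	memset(p, 0, sizeof(*p));
	p->flags = IORING_SETUP_CQE_MIXED;
	/* in this mode the CQ ring must hold at least two 16b entries */
	return (int) syscall(__NR_io_uring_setup, entries, p);
}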

This is mostly trivial to support, with the corner case being attempting
to post a 32b CQE when the ring is a single 16b CQE away from wrapping.
As CQEs must be contiguous in memory, that's simply not possible. The
solution taken by this patchset is to add a special CQE type, which has
IORING_CQE_F_SKIP set. This is a pad/nop CQE, which should simply be
ignored, as it carries no information and serves no other purpose than
to re-align the posted CQEs for ring wrap.

If used with liburing, then both the 32b vs 16b postings and the skip
CQEs are transparent.
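
For anyone consuming the CQ ring directly rather than through liburing,
the reaping loop is a small extension of the usual head/tail walk. A
rough sketch, where handle_cqe() is just a stand-in for whatever the
application does per completion:

#include <linux/io_uring.h>

/*
 * Sketch only: drain a CQ ring set up with IORING_SETUP_CQE_MIXED.
 * khead/ktail point at the shared head/tail words, cqes at the CQE
 * array, and cq_mask is cq_entries - 1.
 */
static void reap_mixed_cqes(unsigned *khead, unsigned *ktail,
			    struct io_uring_cqe *cqes, unsigned cq_mask,
			    void (*handle_cqe)(struct io_uring_cqe *))
{
	unsigned head = *khead;
	unsigned tail = __atomic_load_n(ktail, __ATOMIC_ACQUIRE);

	while (head != tail) {
		struct io_uring_cqe *cqe = &cqes[head & cq_mask];

		if (cqe->flags & IORING_CQE_F_SKIP) {
			/* pad entry for ring wrap, carries no information */
			head++;
			continue;
		}
		handle_cqe(cqe);
		/* a 32b CQE occupies two 16b slots */
		head += (cqe->flags & IORING_CQE_F_32) ? 2 : 1;
	}
	__atomic_store_n(khead, head, __ATOMIC_RELEASE);
}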

liburing support and a few basic test cases can be found here:

https://git.kernel.org/pub/scm/linux/kernel/git/axboe/liburing.git/log/?h=cqe-mixed

including man page updates for the newly added setup and CQE flags, and
the patches posted here can also be found at:

https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/log/?h=io_uring-cqe-mix

Patch 1 is just a prep patch, and patch 2 adds the cqe flags so that the
core can be adapted before support is actually there. Patches 3 and 4
do exactly that, and patch 5 finally adds support for the mixed mode.
Patch 6 adds support for NOP testing of this, and patches 7/8 allow
IORING_SETUP_CQE_MIXED for uring_cmd/zcrx which previously required
IORING_SETUP_CQE32 to work.

 Documentation/networking/iou-zcrx.rst |  2 +-
 include/linux/io_uring_types.h        |  6 ---
 include/trace/events/io_uring.h       |  4 +-
 include/uapi/linux/io_uring.h         | 17 ++++++
 io_uring/cmd_net.c                    |  3 +-
 io_uring/fdinfo.c                     | 22 ++++----
 io_uring/io_uring.c                   | 78 +++++++++++++++++++++------
 io_uring/io_uring.h                   | 49 ++++++++++++-----
 io_uring/nop.c                        | 17 +++++-
 io_uring/register.c                   |  3 +-
 io_uring/uring_cmd.c                  |  2 +-
 io_uring/zcrx.c                       |  5 +-
 12 files changed, 152 insertions(+), 56 deletions(-)

Changes since v1:
- Various little cleanups
- Rebase on for-6.18/io_uring

-- 
Jens Axboe



* [PATCH 1/8] io_uring: remove io_ctx_cqe32() helper
  2025-08-21 14:18 [PATCHSET v2 0/8] Add support for mixed sized CQEs Jens Axboe
@ 2025-08-21 14:18 ` Jens Axboe
  2025-08-21 14:18 ` [PATCH 2/8] io_uring: add UAPI definitions for mixed CQE postings Jens Axboe
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2025-08-21 14:18 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe

It's pretty pointless and only used by the tracing code, so get rid
of it.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/linux/io_uring_types.h  | 6 ------
 include/trace/events/io_uring.h | 4 ++--
 2 files changed, 2 insertions(+), 8 deletions(-)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 9c6c548f43f5..d1e25f3fe0b3 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -740,10 +740,4 @@ struct io_overflow_cqe {
 	struct list_head list;
 	struct io_uring_cqe cqe;
 };
-
-static inline bool io_ctx_cqe32(struct io_ring_ctx *ctx)
-{
-	return ctx->flags & IORING_SETUP_CQE32;
-}
-
 #endif
diff --git a/include/trace/events/io_uring.h b/include/trace/events/io_uring.h
index 178ab6f611be..6a970625a3ea 100644
--- a/include/trace/events/io_uring.h
+++ b/include/trace/events/io_uring.h
@@ -340,8 +340,8 @@ TP_PROTO(struct io_ring_ctx *ctx, void *req, struct io_uring_cqe *cqe),
 		__entry->user_data	= cqe->user_data;
 		__entry->res		= cqe->res;
 		__entry->cflags		= cqe->flags;
-		__entry->extra1		= io_ctx_cqe32(ctx) ? cqe->big_cqe[0] : 0;
-		__entry->extra2		= io_ctx_cqe32(ctx) ? cqe->big_cqe[1] : 0;
+		__entry->extra1		= ctx->flags & IORING_SETUP_CQE32 ? cqe->big_cqe[0] : 0;
+		__entry->extra2		= ctx->flags & IORING_SETUP_CQE32 ? cqe->big_cqe[1] : 0;
 	),
 
 	TP_printk("ring %p, req %p, user_data 0x%llx, result %d, cflags 0x%x "
-- 
2.50.1



* [PATCH 2/8] io_uring: add UAPI definitions for mixed CQE postings
  2025-08-21 14:18 [PATCHSET v2 0/8] Add support for mixed sized CQEs Jens Axboe
  2025-08-21 14:18 ` [PATCH 1/8] io_uring: remove io_ctx_cqe32() helper Jens Axboe
@ 2025-08-21 14:18 ` Jens Axboe
  2025-08-21 14:18 ` [PATCH 3/8] io_uring/fdinfo: handle mixed sized CQEs Jens Axboe
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2025-08-21 14:18 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe

This adds the CQE flags related to supporting a mixed CQ ring mode, where
both normal (16b) and big (32b) CQEs may be posted.

No functional changes in this patch.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/uapi/linux/io_uring.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 1e935f8901c5..7af8d10b3aba 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -491,12 +491,22 @@ struct io_uring_cqe {
  *			other provided buffer type, all completions with a
  *			buffer passed back is automatically returned to the
  *			application.
+ * IORING_CQE_F_SKIP	If set, then the application/liburing must ignore this
+ *			CQE. Its only purpose is to fill a gap in the ring,
+ *			if a large CQE is being posted when the ring has
+ *			just a single small CQE's worth of space left before
+ *			wrapping.
+ * IORING_CQE_F_32	If set, this is a 32b/big-cqe posting. Use with rings
+ *			setup in a mixed CQE mode, where both 16b and 32b
+ *			CQEs may be posted to the CQ ring.
  */
 #define IORING_CQE_F_BUFFER		(1U << 0)
 #define IORING_CQE_F_MORE		(1U << 1)
 #define IORING_CQE_F_SOCK_NONEMPTY	(1U << 2)
 #define IORING_CQE_F_NOTIF		(1U << 3)
 #define IORING_CQE_F_BUF_MORE		(1U << 4)
+#define IORING_CQE_F_SKIP		(1U << 5)
+#define IORING_CQE_F_32			(1U << 15)
 
 #define IORING_CQE_BUFFER_SHIFT		16
 
-- 
2.50.1



* [PATCH 3/8] io_uring/fdinfo: handle mixed sized CQEs
  2025-08-21 14:18 [PATCHSET v2 0/8] Add support for mixed sized CQEs Jens Axboe
  2025-08-21 14:18 ` [PATCH 1/8] io_uring: remove io_ctx_cqe32() helper Jens Axboe
  2025-08-21 14:18 ` [PATCH 2/8] io_uring: add UAPI definitions for mixed CQE postings Jens Axboe
@ 2025-08-21 14:18 ` Jens Axboe
  2025-08-21 14:18 ` [PATCH 4/8] io_uring/trace: support completion tracing of mixed 32b CQEs Jens Axboe
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2025-08-21 14:18 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe

Ensure that the CQ ring iteration handles differently sized CQEs, not
just a fixed 16b or 32b size per ring. Mixed-size CQEs aren't possible
just yet, but prepare the fdinfo CQ ring dumping to handle them.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 io_uring/fdinfo.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/io_uring/fdinfo.c b/io_uring/fdinfo.c
index 9798d6fb4ec7..5c7339838769 100644
--- a/io_uring/fdinfo.c
+++ b/io_uring/fdinfo.c
@@ -65,15 +65,12 @@ static void __io_uring_show_fdinfo(struct io_ring_ctx *ctx, struct seq_file *m)
 	unsigned int sq_tail = READ_ONCE(r->sq.tail);
 	unsigned int cq_head = READ_ONCE(r->cq.head);
 	unsigned int cq_tail = READ_ONCE(r->cq.tail);
-	unsigned int cq_shift = 0;
 	unsigned int sq_shift = 0;
-	unsigned int sq_entries, cq_entries;
+	unsigned int sq_entries;
 	int sq_pid = -1, sq_cpu = -1;
 	u64 sq_total_time = 0, sq_work_time = 0;
 	unsigned int i;
 
-	if (ctx->flags & IORING_SETUP_CQE32)
-		cq_shift = 1;
 	if (ctx->flags & IORING_SETUP_SQE128)
 		sq_shift = 1;
 
@@ -125,18 +122,23 @@ static void __io_uring_show_fdinfo(struct io_ring_ctx *ctx, struct seq_file *m)
 		seq_printf(m, "\n");
 	}
 	seq_printf(m, "CQEs:\t%u\n", cq_tail - cq_head);
-	cq_entries = min(cq_tail - cq_head, ctx->cq_entries);
-	for (i = 0; i < cq_entries; i++) {
-		unsigned int entry = i + cq_head;
-		struct io_uring_cqe *cqe = &r->cqes[(entry & cq_mask) << cq_shift];
+	while (cq_head < cq_tail) {
+		struct io_uring_cqe *cqe;
+		bool cqe32 = false;
 
+		cqe = &r->cqes[(cq_head & cq_mask)];
+		if (cqe->flags & IORING_CQE_F_32 || ctx->flags & IORING_SETUP_CQE32)
+			cqe32 = true;
 		seq_printf(m, "%5u: user_data:%llu, res:%d, flag:%x",
-			   entry & cq_mask, cqe->user_data, cqe->res,
+			   cq_head & cq_mask, cqe->user_data, cqe->res,
 			   cqe->flags);
-		if (cq_shift)
+		if (cqe32)
 			seq_printf(m, ", extra1:%llu, extra2:%llu\n",
 					cqe->big_cqe[0], cqe->big_cqe[1]);
 		seq_printf(m, "\n");
+		cq_head++;
+		if (cqe32)
+			cq_head++;
 	}
 
 	if (ctx->flags & IORING_SETUP_SQPOLL) {
-- 
2.50.1



* [PATCH 4/8] io_uring/trace: support completion tracing of mixed 32b CQEs
  2025-08-21 14:18 [PATCHSET v2 0/8] Add support for mixed sized CQEs Jens Axboe
                   ` (2 preceding siblings ...)
  2025-08-21 14:18 ` [PATCH 3/8] io_uring/fdinfo: handle mixed sized CQEs Jens Axboe
@ 2025-08-21 14:18 ` Jens Axboe
  2025-08-21 14:18 ` [PATCH 5/8] io_uring: add support for IORING_SETUP_CQE_MIXED Jens Axboe
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2025-08-21 14:18 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe

Check for IORING_CQE_F_32 as well, not just whether the ring was set up
with IORING_SETUP_CQE32 to support only big CQEs.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/trace/events/io_uring.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/io_uring.h b/include/trace/events/io_uring.h
index 6a970625a3ea..45d15460b495 100644
--- a/include/trace/events/io_uring.h
+++ b/include/trace/events/io_uring.h
@@ -340,8 +340,8 @@ TP_PROTO(struct io_ring_ctx *ctx, void *req, struct io_uring_cqe *cqe),
 		__entry->user_data	= cqe->user_data;
 		__entry->res		= cqe->res;
 		__entry->cflags		= cqe->flags;
-		__entry->extra1		= ctx->flags & IORING_SETUP_CQE32 ? cqe->big_cqe[0] : 0;
-		__entry->extra2		= ctx->flags & IORING_SETUP_CQE32 ? cqe->big_cqe[1] : 0;
+		__entry->extra1		= ctx->flags & IORING_SETUP_CQE32 || cqe->flags & IORING_CQE_F_32 ? cqe->big_cqe[0] : 0;
+		__entry->extra2		= ctx->flags & IORING_SETUP_CQE32 || cqe->flags & IORING_CQE_F_32 ? cqe->big_cqe[1] : 0;
 	),
 
 	TP_printk("ring %p, req %p, user_data 0x%llx, result %d, cflags 0x%x "
-- 
2.50.1



* [PATCH 5/8] io_uring: add support for IORING_SETUP_CQE_MIXED
  2025-08-21 14:18 [PATCHSET v2 0/8] Add support for mixed sized CQEs Jens Axboe
                   ` (3 preceding siblings ...)
  2025-08-21 14:18 ` [PATCH 4/8] io_uring/trace: support completion tracing of mixed 32b CQEs Jens Axboe
@ 2025-08-21 14:18 ` Jens Axboe
  2025-08-21 14:18 ` [PATCH 6/8] io_uring/nop: " Jens Axboe
                   ` (3 subsequent siblings)
  8 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2025-08-21 14:18 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe

Normal rings support 16b CQEs for posting completions, while certain
features require the ring to be configured with IORING_SETUP_CQE32, as
they need to convey more information per completion. This, in turn,
makes ALL CQEs 32b in size. This is somewhat wasteful and
inefficient, particularly when only certain CQEs need to be of the
bigger variant.

This adds support for setting up a ring with mixed CQE sizes, using
IORING_SETUP_CQE_MIXED. When setup in this mode, CQEs posted to the ring
may be either 16b or 32b in size. If a CQE is 32b in size, then
IORING_CQE_F_32 is set in the CQE flags to indicate that this is the
case. If this flag isn't set, the CQE is the normal 16b variant.

CQEs on these types of mixed rings may also have IORING_CQE_F_SKIP set.
This can happen if the ring is one (small) CQE entry away from wrapping,
and an attempt is made to post a 32b CQE. As CQEs must be contiguous in
the CQ ring, a 32b CQE cannot wrap the ring. For this case, a single
dummy CQE is posted with the SKIP flag set. The application should
simply ignore those.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/uapi/linux/io_uring.h |  6 +++
 io_uring/io_uring.c           | 78 ++++++++++++++++++++++++++++-------
 io_uring/io_uring.h           | 49 +++++++++++++++-------
 io_uring/register.c           |  3 +-
 4 files changed, 105 insertions(+), 31 deletions(-)

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 7af8d10b3aba..5135e1be0390 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -225,6 +225,12 @@ enum io_uring_sqe_flags_bit {
 /* Use hybrid poll in iopoll process */
 #define IORING_SETUP_HYBRID_IOPOLL	(1U << 17)
 
+/*
+ * Allow both 16b and 32b CQEs. If a 32b CQE is posted, it will have
+ * IORING_CQE_F_32 set in cqe->flags.
+ */
+#define IORING_SETUP_CQE_MIXED		(1U << 18)
+
 enum io_uring_op {
 	IORING_OP_NOP,
 	IORING_OP_READV,
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 6247d582fb40..58f7c2403a15 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -620,27 +620,29 @@ static void io_cq_unlock_post(struct io_ring_ctx *ctx)
 
 static void __io_cqring_overflow_flush(struct io_ring_ctx *ctx, bool dying)
 {
-	size_t cqe_size = sizeof(struct io_uring_cqe);
-
 	lockdep_assert_held(&ctx->uring_lock);
 
 	/* don't abort if we're dying, entries must get freed */
 	if (!dying && __io_cqring_events(ctx) == ctx->cq_entries)
 		return;
 
-	if (ctx->flags & IORING_SETUP_CQE32)
-		cqe_size <<= 1;
-
 	io_cq_lock(ctx);
 	while (!list_empty(&ctx->cq_overflow_list)) {
+		size_t cqe_size = sizeof(struct io_uring_cqe);
 		struct io_uring_cqe *cqe;
 		struct io_overflow_cqe *ocqe;
+		bool is_cqe32 = false;
 
 		ocqe = list_first_entry(&ctx->cq_overflow_list,
 					struct io_overflow_cqe, list);
+		if (ocqe->cqe.flags & IORING_CQE_F_32 ||
+		    ctx->flags & IORING_SETUP_CQE32) {
+			is_cqe32 = true;
+			cqe_size <<= 1;
+		}
 
 		if (!dying) {
-			if (!io_get_cqe_overflow(ctx, &cqe, true))
+			if (!io_get_cqe_overflow(ctx, &cqe, true, is_cqe32))
 				break;
 			memcpy(cqe, &ocqe->cqe, cqe_size);
 		}
@@ -752,10 +754,12 @@ static struct io_overflow_cqe *io_alloc_ocqe(struct io_ring_ctx *ctx,
 {
 	struct io_overflow_cqe *ocqe;
 	size_t ocq_size = sizeof(struct io_overflow_cqe);
-	bool is_cqe32 = (ctx->flags & IORING_SETUP_CQE32);
+	bool is_cqe32 = false;
 
-	if (is_cqe32)
-		ocq_size += sizeof(struct io_uring_cqe);
+	if (cqe->flags & IORING_CQE_F_32 || ctx->flags & IORING_SETUP_CQE32) {
+		is_cqe32 = true;
+		ocq_size <<= 1;
+	}
 
 	ocqe = kzalloc(ocq_size, gfp | __GFP_ACCOUNT);
 	trace_io_uring_cqe_overflow(ctx, cqe->user_data, cqe->res, cqe->flags, ocqe);
@@ -773,12 +777,30 @@ static struct io_overflow_cqe *io_alloc_ocqe(struct io_ring_ctx *ctx,
 	return ocqe;
 }
 
+/*
+ * Fill an empty dummy CQE, in case alignment is off for posting a 32b CQE
+ * because the ring is a single 16b entry away from wrapping.
+ */
+static bool io_fill_nop_cqe(struct io_ring_ctx *ctx, unsigned int off)
+{
+	if (__io_cqring_events(ctx) < ctx->cq_entries) {
+		struct io_uring_cqe *cqe = &ctx->rings->cqes[off];
+
+		cqe->user_data = 0;
+		cqe->res = 0;
+		cqe->flags = IORING_CQE_F_SKIP;
+		ctx->cached_cq_tail++;
+		return true;
+	}
+	return false;
+}
+
 /*
  * writes to the cq entry need to come after reading head; the
  * control dependency is enough as we're using WRITE_ONCE to
  * fill the cq entry
  */
-bool io_cqe_cache_refill(struct io_ring_ctx *ctx, bool overflow)
+bool io_cqe_cache_refill(struct io_ring_ctx *ctx, bool overflow, bool cqe32)
 {
 	struct io_rings *rings = ctx->rings;
 	unsigned int off = ctx->cached_cq_tail & (ctx->cq_entries - 1);
@@ -792,12 +814,22 @@ bool io_cqe_cache_refill(struct io_ring_ctx *ctx, bool overflow)
 	if (!overflow && (ctx->check_cq & BIT(IO_CHECK_CQ_OVERFLOW_BIT)))
 		return false;
 
+	/*
+	 * Post dummy CQE if a 32b CQE is needed and there's only room for a
+	 * 16b CQE before the ring wraps.
+	 */
+	if (cqe32 && off + 1 == ctx->cq_entries) {
+		if (!io_fill_nop_cqe(ctx, off))
+			return false;
+		off = 0;
+	}
+
 	/* userspace may cheat modifying the tail, be safe and do min */
 	queued = min(__io_cqring_events(ctx), ctx->cq_entries);
 	free = ctx->cq_entries - queued;
 	/* we need a contiguous range, limit based on the current array offset */
 	len = min(free, ctx->cq_entries - off);
-	if (!len)
+	if (len < (cqe32 + 1))
 		return false;
 
 	if (ctx->flags & IORING_SETUP_CQE32) {
@@ -815,9 +847,9 @@ static bool io_fill_cqe_aux32(struct io_ring_ctx *ctx,
 {
 	struct io_uring_cqe *cqe;
 
-	if (WARN_ON_ONCE(!(ctx->flags & IORING_SETUP_CQE32)))
+	if (WARN_ON_ONCE(!(ctx->flags & (IORING_SETUP_CQE32|IORING_SETUP_CQE_MIXED))))
 		return false;
-	if (unlikely(!io_get_cqe(ctx, &cqe)))
+	if (unlikely(!io_get_cqe(ctx, &cqe, true)))
 		return false;
 
 	memcpy(cqe, src_cqe, 2 * sizeof(*cqe));
@@ -828,14 +860,15 @@ static bool io_fill_cqe_aux32(struct io_ring_ctx *ctx,
 static bool io_fill_cqe_aux(struct io_ring_ctx *ctx, u64 user_data, s32 res,
 			      u32 cflags)
 {
+	bool cqe32 = cflags & IORING_CQE_F_32;
 	struct io_uring_cqe *cqe;
 
-	if (likely(io_get_cqe(ctx, &cqe))) {
+	if (likely(io_get_cqe(ctx, &cqe, cqe32))) {
 		WRITE_ONCE(cqe->user_data, user_data);
 		WRITE_ONCE(cqe->res, res);
 		WRITE_ONCE(cqe->flags, cflags);
 
-		if (ctx->flags & IORING_SETUP_CQE32) {
+		if (cqe32) {
 			WRITE_ONCE(cqe->big_cqe[0], 0);
 			WRITE_ONCE(cqe->big_cqe[1], 0);
 		}
@@ -2755,6 +2788,10 @@ unsigned long rings_size(unsigned int flags, unsigned int sq_entries,
 		if (check_shl_overflow(off, 1, &off))
 			return SIZE_MAX;
 	}
+	if (flags & IORING_SETUP_CQE_MIXED) {
+		if (cq_entries < 2)
+			return SIZE_MAX;
+	}
 
 #ifdef CONFIG_SMP
 	off = ALIGN(off, SMP_CACHE_BYTES);
@@ -3679,6 +3716,14 @@ static int io_uring_sanitise_params(struct io_uring_params *p)
 	    !(flags & IORING_SETUP_SINGLE_ISSUER))
 		return -EINVAL;
 
+	/*
+	 * Nonsensical to ask for CQE32 and mixed CQE support, it's not
+	 * supported to post 16b CQEs on a ring setup with CQE32.
+	 */
+	if ((flags & (IORING_SETUP_CQE32|IORING_SETUP_CQE_MIXED)) ==
+	    (IORING_SETUP_CQE32|IORING_SETUP_CQE_MIXED))
+		return -EINVAL;
+
 	return 0;
 }
 
@@ -3905,7 +3950,8 @@ static long io_uring_setup(u32 entries, struct io_uring_params __user *params)
 			IORING_SETUP_SQE128 | IORING_SETUP_CQE32 |
 			IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN |
 			IORING_SETUP_NO_MMAP | IORING_SETUP_REGISTERED_FD_ONLY |
-			IORING_SETUP_NO_SQARRAY | IORING_SETUP_HYBRID_IOPOLL))
+			IORING_SETUP_NO_SQARRAY | IORING_SETUP_HYBRID_IOPOLL |
+			IORING_SETUP_CQE_MIXED))
 		return -EINVAL;
 
 	return io_uring_create(entries, &p, params);
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index abc6de227f74..2e4f7223a767 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -75,7 +75,7 @@ static inline bool io_should_wake(struct io_wait_queue *iowq)
 unsigned long rings_size(unsigned int flags, unsigned int sq_entries,
 			 unsigned int cq_entries, size_t *sq_offset);
 int io_uring_fill_params(unsigned entries, struct io_uring_params *p);
-bool io_cqe_cache_refill(struct io_ring_ctx *ctx, bool overflow);
+bool io_cqe_cache_refill(struct io_ring_ctx *ctx, bool overflow, bool cqe32);
 int io_run_task_work_sig(struct io_ring_ctx *ctx);
 void io_req_defer_failed(struct io_kiocb *req, s32 res);
 bool io_post_aux_cqe(struct io_ring_ctx *ctx, u64 user_data, s32 res, u32 cflags);
@@ -169,25 +169,31 @@ static inline void io_submit_flush_completions(struct io_ring_ctx *ctx)
 
 static inline bool io_get_cqe_overflow(struct io_ring_ctx *ctx,
 					struct io_uring_cqe **ret,
-					bool overflow)
+					bool overflow, bool cqe32)
 {
 	io_lockdep_assert_cq_locked(ctx);
 
-	if (unlikely(ctx->cqe_cached >= ctx->cqe_sentinel)) {
-		if (unlikely(!io_cqe_cache_refill(ctx, overflow)))
+	if (unlikely(ctx->cqe_sentinel - ctx->cqe_cached < (cqe32 + 1))) {
+		if (unlikely(!io_cqe_cache_refill(ctx, overflow, cqe32)))
 			return false;
 	}
 	*ret = ctx->cqe_cached;
 	ctx->cached_cq_tail++;
 	ctx->cqe_cached++;
-	if (ctx->flags & IORING_SETUP_CQE32)
+	if (ctx->flags & IORING_SETUP_CQE32) {
+		ctx->cqe_cached++;
+	} else if (cqe32 && ctx->flags & IORING_SETUP_CQE_MIXED) {
 		ctx->cqe_cached++;
+		ctx->cached_cq_tail++;
+	}
+	WARN_ON_ONCE(ctx->cqe_cached > ctx->cqe_sentinel);
 	return true;
 }
 
-static inline bool io_get_cqe(struct io_ring_ctx *ctx, struct io_uring_cqe **ret)
+static inline bool io_get_cqe(struct io_ring_ctx *ctx, struct io_uring_cqe **ret,
+				bool cqe32)
 {
-	return io_get_cqe_overflow(ctx, ret, false);
+	return io_get_cqe_overflow(ctx, ret, false, cqe32);
 }
 
 static inline bool io_defer_get_uncommited_cqe(struct io_ring_ctx *ctx,
@@ -196,25 +202,24 @@ static inline bool io_defer_get_uncommited_cqe(struct io_ring_ctx *ctx,
 	io_lockdep_assert_cq_locked(ctx);
 
 	ctx->submit_state.cq_flush = true;
-	return io_get_cqe(ctx, cqe_ret);
+	return io_get_cqe(ctx, cqe_ret, false);
 }
 
 static __always_inline bool io_fill_cqe_req(struct io_ring_ctx *ctx,
 					    struct io_kiocb *req)
 {
+	bool is_cqe32 = req->cqe.flags & IORING_CQE_F_32;
 	struct io_uring_cqe *cqe;
 
 	/*
-	 * If we can't get a cq entry, userspace overflowed the
-	 * submission (by quite a lot). Increment the overflow count in
-	 * the ring.
+	 * If we can't get a cq entry, userspace overflowed the submission
+	 * (by quite a lot).
 	 */
-	if (unlikely(!io_get_cqe(ctx, &cqe)))
+	if (unlikely(!io_get_cqe(ctx, &cqe, is_cqe32)))
 		return false;
 
-
 	memcpy(cqe, &req->cqe, sizeof(*cqe));
-	if (ctx->flags & IORING_SETUP_CQE32) {
+	if (is_cqe32) {
 		memcpy(cqe->big_cqe, &req->big_cqe, sizeof(*cqe));
 		memset(&req->big_cqe, 0, sizeof(req->big_cqe));
 	}
@@ -239,6 +244,22 @@ static inline void io_req_set_res(struct io_kiocb *req, s32 res, u32 cflags)
 	req->cqe.flags = cflags;
 }
 
+static inline u32 ctx_cqe32_flags(struct io_ring_ctx *ctx)
+{
+	if (ctx->flags & IORING_SETUP_CQE_MIXED)
+		return IORING_CQE_F_32;
+	return 0;
+}
+
+static inline void io_req_set_res32(struct io_kiocb *req, s32 res, u32 cflags,
+				    __u64 extra1, __u64 extra2)
+{
+	req->cqe.res = res;
+	req->cqe.flags = cflags | ctx_cqe32_flags(req->ctx);
+	req->big_cqe.extra1 = extra1;
+	req->big_cqe.extra2 = extra2;
+}
+
 static inline void *io_uring_alloc_async_data(struct io_alloc_cache *cache,
 					      struct io_kiocb *req)
 {
diff --git a/io_uring/register.c b/io_uring/register.c
index a59589249fce..a1a9b2884eae 100644
--- a/io_uring/register.c
+++ b/io_uring/register.c
@@ -396,7 +396,8 @@ static void io_register_free_rings(struct io_ring_ctx *ctx,
 
 #define RESIZE_FLAGS	(IORING_SETUP_CQSIZE | IORING_SETUP_CLAMP)
 #define COPY_FLAGS	(IORING_SETUP_NO_SQARRAY | IORING_SETUP_SQE128 | \
-			 IORING_SETUP_CQE32 | IORING_SETUP_NO_MMAP)
+			 IORING_SETUP_CQE32 | IORING_SETUP_NO_MMAP | \
+			 IORING_SETUP_CQE_MIXED)
 
 static int io_register_resize_rings(struct io_ring_ctx *ctx, void __user *arg)
 {
-- 
2.50.1



* [PATCH 6/8] io_uring/nop: add support for IORING_SETUP_CQE_MIXED
  2025-08-21 14:18 [PATCHSET v2 0/8] Add support for mixed sized CQEs Jens Axboe
                   ` (4 preceding siblings ...)
  2025-08-21 14:18 ` [PATCH 5/8] io_uring: add support for IORING_SETUP_CQE_MIXED Jens Axboe
@ 2025-08-21 14:18 ` Jens Axboe
  2025-08-21 14:18 ` [PATCH 7/8] io_uring/uring_cmd: " Jens Axboe
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2025-08-21 14:18 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe

This adds support for setting IORING_NOP_CQE32 as a flag for a NOP
command, in which case a 32b CQE will be posted rather than a regular
one. This is the default if the ring has been set up with
IORING_SETUP_CQE32. If the ring has been set up with
IORING_SETUP_CQE_MIXED, then 16b CQEs will be posted without this flag
set, and 32b CQEs if this flag is set. For the latter case, sqe->off is
what will be posted as cqe->big_cqe[0] and sqe->addr is what will be
posted as cqe->big_cqe[1].
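
As a rough usage sketch (illustrative only, and assuming the NOP flags
are carried in sqe->nop_flags as with the existing IORING_NOP_* flags):

#include <linux/io_uring.h>
#include <string.h>

/* Sketch: queue a NOP asking for a 32b completion on a CQE_MIXED ring */
static void prep_nop_cqe32(struct io_uring_sqe *sqe, __u64 extra1, __u64 extra2)
{
	memset(sqe, 0, sizeof(*sqe));
	sqe->opcode = IORING_OP_NOP;
	sqe->nop_flags = IORING_NOP_CQE32;
	sqe->off = extra1;	/* posted back as cqe->big_cqe[0] */
	sqe->addr = extra2;	/* posted back as cqe->big_cqe[1] */
}

On a mixed ring, the resulting CQE will then have IORING_CQE_F_32 set
and carry these two values in big_cqe[0] and big_cqe[1].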

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 include/uapi/linux/io_uring.h |  1 +
 io_uring/nop.c                | 17 +++++++++++++++--
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 5135e1be0390..04ebff33d0e6 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -464,6 +464,7 @@ enum io_uring_msg_ring_flags {
 #define IORING_NOP_FIXED_FILE		(1U << 2)
 #define IORING_NOP_FIXED_BUFFER		(1U << 3)
 #define IORING_NOP_TW			(1U << 4)
+#define IORING_NOP_CQE32		(1U << 5)
 
 /*
  * IO completion data structure (Completion Queue Entry)
diff --git a/io_uring/nop.c b/io_uring/nop.c
index 20ed0f85b1c2..3caf07878f8a 100644
--- a/io_uring/nop.c
+++ b/io_uring/nop.c
@@ -17,11 +17,13 @@ struct io_nop {
 	int             result;
 	int		fd;
 	unsigned int	flags;
+	__u64		extra1;
+	__u64		extra2;
 };
 
 #define NOP_FLAGS	(IORING_NOP_INJECT_RESULT | IORING_NOP_FIXED_FILE | \
 			 IORING_NOP_FIXED_BUFFER | IORING_NOP_FILE | \
-			 IORING_NOP_TW)
+			 IORING_NOP_TW | IORING_NOP_CQE32)
 
 int io_nop_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
@@ -41,6 +43,14 @@ int io_nop_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 		nop->fd = -1;
 	if (nop->flags & IORING_NOP_FIXED_BUFFER)
 		req->buf_index = READ_ONCE(sqe->buf_index);
+	if (nop->flags & IORING_NOP_CQE32) {
+		struct io_ring_ctx *ctx = req->ctx;
+
+		if (!(ctx->flags & (IORING_SETUP_CQE32|IORING_SETUP_CQE_MIXED)))
+			return -EINVAL;
+		nop->extra1 = READ_ONCE(sqe->off);
+		nop->extra2 = READ_ONCE(sqe->addr);
+	}
 	return 0;
 }
 
@@ -68,7 +78,10 @@ int io_nop(struct io_kiocb *req, unsigned int issue_flags)
 done:
 	if (ret < 0)
 		req_set_fail(req);
-	io_req_set_res(req, nop->result, 0);
+	if (nop->flags & IORING_NOP_CQE32)
+		io_req_set_res32(req, nop->result, 0, nop->extra1, nop->extra2);
+	else
+		io_req_set_res(req, nop->result, 0);
 	if (nop->flags & IORING_NOP_TW) {
 		req->io_task_work.func = io_req_task_complete;
 		io_req_task_work_add(req);
-- 
2.50.1



* [PATCH 7/8] io_uring/uring_cmd: add support for IORING_SETUP_CQE_MIXED
  2025-08-21 14:18 [PATCHSET v2 0/8] Add support for mixed sized CQEs Jens Axboe
                   ` (5 preceding siblings ...)
  2025-08-21 14:18 ` [PATCH 6/8] io_uring/nop: " Jens Axboe
@ 2025-08-21 14:18 ` Jens Axboe
  2025-08-21 14:18 ` [PATCH 8/8] io_uring/zcrx: " Jens Axboe
  2025-08-21 17:02 ` [PATCHSET v2 0/8] Add support for mixed sized CQEs Caleb Sander Mateos
  8 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2025-08-21 14:18 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe

Certain users of uring_cmd currently require fixed 32b CQE support,
which is propagated through IO_URING_F_CQE32. Allow
IORING_SETUP_CQE_MIXED to cover that case as well, so not all CQEs
posted need to be 32b in size.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 io_uring/cmd_net.c   | 3 ++-
 io_uring/uring_cmd.c | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/io_uring/cmd_net.c b/io_uring/cmd_net.c
index 3866fe6ff541..27a09aa4c9d0 100644
--- a/io_uring/cmd_net.c
+++ b/io_uring/cmd_net.c
@@ -4,6 +4,7 @@
 #include <net/sock.h>
 
 #include "uring_cmd.h"
+#include "io_uring.h"
 
 static inline int io_uring_cmd_getsockopt(struct socket *sock,
 					  struct io_uring_cmd *cmd,
@@ -73,7 +74,7 @@ static bool io_process_timestamp_skb(struct io_uring_cmd *cmd, struct sock *sk,
 
 	cqe->user_data = 0;
 	cqe->res = tskey;
-	cqe->flags = IORING_CQE_F_MORE;
+	cqe->flags = IORING_CQE_F_MORE | ctx_cqe32_flags(cmd_to_io_kiocb(cmd)->ctx);
 	cqe->flags |= tstype << IORING_TIMESTAMP_TYPE_SHIFT;
 	if (ret == SOF_TIMESTAMPING_TX_HARDWARE)
 		cqe->flags |= IORING_CQE_F_TSTAMP_HW;
diff --git a/io_uring/uring_cmd.c b/io_uring/uring_cmd.c
index 3cfb5d51b88a..90d3239df6bd 100644
--- a/io_uring/uring_cmd.c
+++ b/io_uring/uring_cmd.c
@@ -248,7 +248,7 @@ int io_uring_cmd(struct io_kiocb *req, unsigned int issue_flags)
 
 	if (ctx->flags & IORING_SETUP_SQE128)
 		issue_flags |= IO_URING_F_SQE128;
-	if (ctx->flags & IORING_SETUP_CQE32)
+	if (ctx->flags & (IORING_SETUP_CQE32 | IORING_SETUP_CQE_MIXED))
 		issue_flags |= IO_URING_F_CQE32;
 	if (io_is_compat(ctx))
 		issue_flags |= IO_URING_F_COMPAT;
-- 
2.50.1



* [PATCH 8/8] io_uring/zcrx: add support for IORING_SETUP_CQE_MIXED
  2025-08-21 14:18 [PATCHSET v2 0/8] Add support for mixed sized CQEs Jens Axboe
                   ` (6 preceding siblings ...)
  2025-08-21 14:18 ` [PATCH 7/8] io_uring/uring_cmd: " Jens Axboe
@ 2025-08-21 14:18 ` Jens Axboe
  2025-08-21 17:02 ` [PATCHSET v2 0/8] Add support for mixed sized CQEs Caleb Sander Mateos
  8 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2025-08-21 14:18 UTC (permalink / raw)
  To: io-uring; +Cc: Jens Axboe

zcrx currently requires the ring to be set up with fixed 32b CQEs.
Allow it to use IORING_SETUP_CQE_MIXED as well.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
 Documentation/networking/iou-zcrx.rst | 2 +-
 io_uring/zcrx.c                       | 5 +++--
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/Documentation/networking/iou-zcrx.rst b/Documentation/networking/iou-zcrx.rst
index 0127319b30bb..54a72e172bdc 100644
--- a/Documentation/networking/iou-zcrx.rst
+++ b/Documentation/networking/iou-zcrx.rst
@@ -75,7 +75,7 @@ Create an io_uring instance with the following required setup flags::
 
   IORING_SETUP_SINGLE_ISSUER
   IORING_SETUP_DEFER_TASKRUN
-  IORING_SETUP_CQE32
+  IORING_SETUP_CQE32 or IORING_SETUP_CQE_MIXED
 
 Create memory area
 ------------------
diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
index e5ff49f3425e..f1da852c496b 100644
--- a/io_uring/zcrx.c
+++ b/io_uring/zcrx.c
@@ -554,8 +554,9 @@ int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
 		return -EPERM;
 
 	/* mandatory io_uring features for zc rx */
-	if (!(ctx->flags & IORING_SETUP_DEFER_TASKRUN &&
-	      ctx->flags & IORING_SETUP_CQE32))
+	if (!(ctx->flags & IORING_SETUP_DEFER_TASKRUN))
+		return -EINVAL;
+	if (!(ctx->flags & (IORING_SETUP_CQE32|IORING_SETUP_CQE_MIXED)))
 		return -EINVAL;
 	if (copy_from_user(&reg, arg, sizeof(reg)))
 		return -EFAULT;
-- 
2.50.1



* Re: [PATCHSET v2 0/8] Add support for mixed sized CQEs
  2025-08-21 14:18 [PATCHSET v2 0/8] Add support for mixed sized CQEs Jens Axboe
                   ` (7 preceding siblings ...)
  2025-08-21 14:18 ` [PATCH 8/8] io_uring/zcrx: " Jens Axboe
@ 2025-08-21 17:02 ` Caleb Sander Mateos
  2025-08-21 17:12   ` Jens Axboe
  8 siblings, 1 reply; 16+ messages in thread
From: Caleb Sander Mateos @ 2025-08-21 17:02 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring

On Thu, Aug 21, 2025 at 7:28 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> Hi,
>
> Currently io_uring supports two modes for CQEs:
>
> 1) The standard mode, where 16b CQEs are used
> 2) Setting IORING_SETUP_CQE32, which makes all CQEs posted 32b
>
> Certain features need to pass more information back than just a single
> 32-bit res field, and hence mandate the use of CQE32 to be able to work.
> Examples of that include passthrough or other uses of ->uring_cmd() like
> socket option getting and setting, including timestamps.
>
> This patchset adds support for IORING_SETUP_CQE_MIXED, which allows
> posting both 16b and 32b CQEs on the same CQ ring. The idea here is that
> we need not waste twice the space for CQ rings, or use twice the space
> per CQE posted, if only some of the CQEs posted require the use of 32b
> CQEs. On a ring setup in CQE mixed mode, 32b posted CQEs will have
> IORING_CQE_F_32 set in cqe->flags to tell the application (or liburing)
> about this fact.

This makes a lot of sense. Have you considered something analogous for
SQEs? Requiring all SQEs to be 128 bytes when an io_uring is used for
a mix of 64-byte and 128-byte SQEs also wastes memory, probably even
more since SQEs are 4x larger than CQEs.

Best,
Caleb

>
> This is mostly trivial to support, with the corner case being attempting
> to post a 32b CQE when the ring is a single 16b CQE away from wrapping.
> As CQEs must be contigious in memory, that's simply not possible. The
> solution taken by this patchset is to add a special CQE type, which has
> IORING_CQE_F_SKIP set. This is a pad/nop CQE, which should simply be
> ignored, as it carries no information and serves no other purpose than
> to re-align the posted CQEs for ring wrap.
>
> If used with liburing, then both the 32b vs 16b postings and the skip
> are transparent.
>
> liburing support and a few basic test cases can be found here:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/axboe/liburing.git/log/?h=cqe-mixed
>
> including man page updates for the newly added setup and CQE flags, and
> the patches posted here can also be found at:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/log/?h=io_uring-cqe-mix
>
> Patch 1 is just a prep patch, and patch 2 adds the cqe flags so that the
> core can be adapted before support is actually there. Patches 3 and 4
> are exactly that, and patch 5 finally adds support for the mixed mode.
> Patch 6 adds support for NOP testing of this, and patches 7/8 allow
> IORING_SETUP_CQE_MIXED for uring_cmd/zcrx which previously required
> IORING_SETUP_CQE32 to work.
>
>  Documentation/networking/iou-zcrx.rst |  2 +-
>  include/linux/io_uring_types.h        |  6 ---
>  include/trace/events/io_uring.h       |  4 +-
>  include/uapi/linux/io_uring.h         | 17 ++++++
>  io_uring/cmd_net.c                    |  3 +-
>  io_uring/fdinfo.c                     | 22 ++++----
>  io_uring/io_uring.c                   | 78 +++++++++++++++++++++------
>  io_uring/io_uring.h                   | 49 ++++++++++++-----
>  io_uring/nop.c                        | 17 +++++-
>  io_uring/register.c                   |  3 +-
>  io_uring/uring_cmd.c                  |  2 +-
>  io_uring/zcrx.c                       |  5 +-
>  12 files changed, 152 insertions(+), 56 deletions(-)
>
> Changes since v1:
> - Various little cleanups
> - Rebase on for-6.18/io_uring
>
> --
> Jens Axboe
>
>


* Re: [PATCHSET v2 0/8] Add support for mixed sized CQEs
  2025-08-21 17:02 ` [PATCHSET v2 0/8] Add support for mixed sized CQEs Caleb Sander Mateos
@ 2025-08-21 17:12   ` Jens Axboe
  2025-08-21 17:40     ` Keith Busch
  2025-08-21 17:41     ` Caleb Sander Mateos
  0 siblings, 2 replies; 16+ messages in thread
From: Jens Axboe @ 2025-08-21 17:12 UTC (permalink / raw)
  To: Caleb Sander Mateos; +Cc: io-uring, Keith Busch

On 8/21/25 11:02 AM, Caleb Sander Mateos wrote:
> On Thu, Aug 21, 2025 at 7:28?AM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> Hi,
>>
>> Currently io_uring supports two modes for CQEs:
>>
>> 1) The standard mode, where 16b CQEs are used
>> 2) Setting IORING_SETUP_CQE32, which makes all CQEs posted 32b
>>
>> Certain features need to pass more information back than just a single
>> 32-bit res field, and hence mandate the use of CQE32 to be able to work.
>> Examples of that include passthrough or other uses of ->uring_cmd() like
>> socket option getting and setting, including timestamps.
>>
>> This patchset adds support for IORING_SETUP_CQE_MIXED, which allows
>> posting both 16b and 32b CQEs on the same CQ ring. The idea here is that
>> we need not waste twice the space for CQ rings, or use twice the space
>> per CQE posted, if only some of the CQEs posted require the use of 32b
>> CQEs. On a ring setup in CQE mixed mode, 32b posted CQEs will have
>> IORING_CQE_F_32 set in cqe->flags to tell the application (or liburing)
>> about this fact.
> 
> This makes a lot of sense. Have you considered something analogous for
> SQEs? Requiring all SQEs to be 128 bytes when an io_uring is used for
> a mix of 64-byte and 128-byte SQEs also wastes memory, probably even
> more since SQEs are 4x larger than CQEs.

Adding Keith, as he and I literally just talked about that. My answer
was that the case is a bit different in that 32b CQEs can be useful in
cases that are predominantly 16b in the first place. For example,
networking workload doing send/recv/etc and the occassional
get/setsockopt kind of thing. Or maybe a mix of normal recv and zero
copy rx.

For the SQE case, I think it's a bit different. At least the cases I
know of, it's mostly 100% 64b SQEs or 128b SQEs. I'm certainly willing
to be told otherwise! Because that is kind of the key question that
needs answering before even thinking about doing that kind of work.

But yes, it could be supported, and Keith (kind of) signed himself up to
do that. One oddity I see on that side is that while with CQE32 the
kernel can manage the potential wrap-around gap, for SQEs that's
obviously on the application to do. That could just be a NOP or
something like that, but you do need something to fill/skip that space.
I guess that could be as simple as having an opcode that is simply "skip
me", so on the kernel side it'd be easy as it'd just drop it on the
floor. You still need the app side to fill one, however, and then deal
with "oops SQ ring is now full" too.

Probably won't be too bad at all, however.

-- 
Jens Axboe


* Re: [PATCHSET v2 0/8] Add support for mixed sized CQEs
  2025-08-21 17:12   ` Jens Axboe
@ 2025-08-21 17:40     ` Keith Busch
  2025-08-21 17:47       ` Jens Axboe
  2025-08-21 17:41     ` Caleb Sander Mateos
  1 sibling, 1 reply; 16+ messages in thread
From: Keith Busch @ 2025-08-21 17:40 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Caleb Sander Mateos, io-uring

On Thu, Aug 21, 2025 at 11:12:28AM -0600, Jens Axboe wrote:
>
> For the SQE case, I think it's a bit different. At least the cases I
> know of, it's mostly 100% 64b SQEs or 128b SQEs. I'm certainly willing
> to be told otherwise! Because that is kind of the key question that
> needs answering before even thinking about doing that kind of work.

The main use case I can think of is if an application allocates one ring
for uring_cmd with the 128b SQEs, and then a separate ring for normal
file and network stuff. Mixed SQE's would allow that application to have
just one ring without being wasteful, but I'm just not sure if the
separate rings is undesirable enough to make the effort worth it.
 
> But yes, it could be supported, and Keith (kind of) signed himself up to
> do that. One oddity I see on that side is that while with CQE32 the
> kernel can manage the potential wrap-around gap, for SQEs that's
> obviously on the application to do. That could just be a NOP or
> something like that, but you do need something to fill/skip that space.
> I guess that could be as simple as having an opcode that is simply "skip
> me", so on the kernel side it'd be easy as it'd just drop it on the
> floor. You still need to app side to fill one, however, and then deal
> with "oops SQ ring is now full" too.

Yep, I think it's doable, and your implementation for mixed CQEs
provides a great reference. I trust we can get liburing using it
correctly, but would be afraid for anyone not using the library.


* Re: [PATCHSET v2 0/8] Add support for mixed sized CQEs
  2025-08-21 17:12   ` Jens Axboe
  2025-08-21 17:40     ` Keith Busch
@ 2025-08-21 17:41     ` Caleb Sander Mateos
  2025-08-21 17:46       ` Jens Axboe
  1 sibling, 1 reply; 16+ messages in thread
From: Caleb Sander Mateos @ 2025-08-21 17:41 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Keith Busch

On Thu, Aug 21, 2025 at 10:12 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 8/21/25 11:02 AM, Caleb Sander Mateos wrote:
> > On Thu, Aug 21, 2025 at 7:28?AM Jens Axboe <axboe@kernel.dk> wrote:
> >>
> >> Hi,
> >>
> >> Currently io_uring supports two modes for CQEs:
> >>
> >> 1) The standard mode, where 16b CQEs are used
> >> 2) Setting IORING_SETUP_CQE32, which makes all CQEs posted 32b
> >>
> >> Certain features need to pass more information back than just a single
> >> 32-bit res field, and hence mandate the use of CQE32 to be able to work.
> >> Examples of that include passthrough or other uses of ->uring_cmd() like
> >> socket option getting and setting, including timestamps.
> >>
> >> This patchset adds support for IORING_SETUP_CQE_MIXED, which allows
> >> posting both 16b and 32b CQEs on the same CQ ring. The idea here is that
> >> we need not waste twice the space for CQ rings, or use twice the space
> >> per CQE posted, if only some of the CQEs posted require the use of 32b
> >> CQEs. On a ring setup in CQE mixed mode, 32b posted CQEs will have
> >> IORING_CQE_F_32 set in cqe->flags to tell the application (or liburing)
> >> about this fact.
> >
> > This makes a lot of sense. Have you considered something analogous for
> > SQEs? Requiring all SQEs to be 128 bytes when an io_uring is used for
> > a mix of 64-byte and 128-byte SQEs also wastes memory, probably even
> > more since SQEs are 4x larger than CQEs.
>
> Adding Keith, as he and I literally just talked about that. My answer
> was that the case is a bit different in that 32b CQEs can be useful in
> cases that are predominately 16b in the first place. For example,
> networking workload doing send/recv/etc and the occassional
> get/setsockopt kind of thing. Or maybe a mix of normal recv and zero
> copy rx.
>
> For the SQE case, I think it's a bit different. At least the cases I
> know of, it's mostly 100% 64b SQEs or 128b SQEs. I'm certainly willing
> to be told otherwise! Because that is kind of the key question that
> needs answering before even thinking about doing that kind of work.

We certainly have a use case that mixes the two on the same io_uring:
ublk commit/buffer register/unregister commands (64 byte SQEs) and
NVMe passthru commands (128 byte SQEs). I could also imagine an
application issuing both normal read/write commands and NVMe passthru
commands. But you're probably right that this isn't a super common use
case.

>
> But yes, it could be supported, and Keith (kind of) signed himself up to
> do that. One oddity I see on that side is that while with CQE32 the
> kernel can manage the potential wrap-around gap, for SQEs that's
> obviously on the application to do. That could just be a NOP or
> something like that, but you do need something to fill/skip that space.
> I guess that could be as simple as having an opcode that is simply "skip
> me", so on the kernel side it'd be easy as it'd just drop it on the
> floor. You still need to app side to fill one, however, and then deal
> with "oops SQ ring is now full" too.

Sure, of course userspace would need to handle a misaligned big SQE at
the end of the SQ analogously to mixed CQE sizes. I assume liburing
should be able to do that mostly transparently; that logic could all
be encapsulated by io_uring_get_sqe().

Best,
Caleb

>
> Probably won't be too bad at all, however.
>
> --
> Jens Axboe


* Re: [PATCHSET v2 0/8] Add support for mixed sized CQEs
  2025-08-21 17:41     ` Caleb Sander Mateos
@ 2025-08-21 17:46       ` Jens Axboe
  2025-08-21 18:19         ` Caleb Sander Mateos
  0 siblings, 1 reply; 16+ messages in thread
From: Jens Axboe @ 2025-08-21 17:46 UTC (permalink / raw)
  To: Caleb Sander Mateos; +Cc: io-uring, Keith Busch

On 8/21/25 11:41 AM, Caleb Sander Mateos wrote:
> On Thu, Aug 21, 2025 at 10:12?AM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 8/21/25 11:02 AM, Caleb Sander Mateos wrote:
>>> On Thu, Aug 21, 2025 at 7:28?AM Jens Axboe <axboe@kernel.dk> wrote:
>>>>
>>>> Hi,
>>>>
>>>> Currently io_uring supports two modes for CQEs:
>>>>
>>>> 1) The standard mode, where 16b CQEs are used
>>>> 2) Setting IORING_SETUP_CQE32, which makes all CQEs posted 32b
>>>>
>>>> Certain features need to pass more information back than just a single
>>>> 32-bit res field, and hence mandate the use of CQE32 to be able to work.
>>>> Examples of that include passthrough or other uses of ->uring_cmd() like
>>>> socket option getting and setting, including timestamps.
>>>>
>>>> This patchset adds support for IORING_SETUP_CQE_MIXED, which allows
>>>> posting both 16b and 32b CQEs on the same CQ ring. The idea here is that
>>>> we need not waste twice the space for CQ rings, or use twice the space
>>>> per CQE posted, if only some of the CQEs posted require the use of 32b
>>>> CQEs. On a ring setup in CQE mixed mode, 32b posted CQEs will have
>>>> IORING_CQE_F_32 set in cqe->flags to tell the application (or liburing)
>>>> about this fact.
>>>
>>> This makes a lot of sense. Have you considered something analogous for
>>> SQEs? Requiring all SQEs to be 128 bytes when an io_uring is used for
>>> a mix of 64-byte and 128-byte SQEs also wastes memory, probably even
>>> more since SQEs are 4x larger than CQEs.
>>
>> Adding Keith, as he and I literally just talked about that. My answer
>> was that the case is a bit different in that 32b CQEs can be useful in
>> cases that are predominately 16b in the first place. For example,
>> networking workload doing send/recv/etc and the occassional
>> get/setsockopt kind of thing. Or maybe a mix of normal recv and zero
>> copy rx.
>>
>> For the SQE case, I think it's a bit different. At least the cases I
>> know of, it's mostly 100% 64b SQEs or 128b SQEs. I'm certainly willing
>> to be told otherwise! Because that is kind of the key question that
>> needs answering before even thinking about doing that kind of work.
> 
> We certainly have a use case that mixes the two on the same io_uring:
> ublk commit/buffer register/unregister commands (64 byte SQEs) and
> NVMe passthru commands (128 byte SQEs). I could also imagine an
> application issuing both normal read/write commands and NVMe passthru
> commands. But you're probably right that this isn't a super common use
> case.

Yes that's a good point, and that would roughly be 50/50 in terms of 64b
vs 128b SQEs?

And yes, I can imagine other use cases too, but I'm also having a hard
time justifying those as likely. On the other hand, people do the
weirdest things...

>> But yes, it could be supported, and Keith (kind of) signed himself up to
>> do that. One oddity I see on that side is that while with CQE32 the
>> kernel can manage the potential wrap-around gap, for SQEs that's
>> obviously on the application to do. That could just be a NOP or
>> something like that, but you do need something to fill/skip that space.
>> I guess that could be as simple as having an opcode that is simply "skip
>> me", so on the kernel side it'd be easy as it'd just drop it on the
>> floor. You still need to app side to fill one, however, and then deal
>> with "oops SQ ring is now full" too.
> 
> Sure, of course userspace would need to handle a misaligned big SQE at
> the end of the SQ analogously to mixed CQE sizes. I assume liburing
> should be able to do that mostly transparently, that logic could all
> be encapsulated by io_uring_get_sqe().

Yep I think so, we'd need a new helper to return the kind of SQE you
want. If asked for a 128b one while we're one 64b slot away from
wrapping, it'd just need to grab a 64b one first and mark it with the
SKIP opcode.

-- 
Jens Axboe


* Re: [PATCHSET v2 0/8] Add support for mixed sized CQEs
  2025-08-21 17:40     ` Keith Busch
@ 2025-08-21 17:47       ` Jens Axboe
  0 siblings, 0 replies; 16+ messages in thread
From: Jens Axboe @ 2025-08-21 17:47 UTC (permalink / raw)
  To: Keith Busch; +Cc: Caleb Sander Mateos, io-uring

On 8/21/25 11:40 AM, Keith Busch wrote:
> On Thu, Aug 21, 2025 at 11:12:28AM -0600, Jens Axboe wrote:
>>
>> For the SQE case, I think it's a bit different. At least the cases I
>> know of, it's mostly 100% 64b SQEs or 128b SQEs. I'm certainly willing
>> to be told otherwise! Because that is kind of the key question that
>> needs answering before even thinking about doing that kind of work.
> 
> The main use case I can think of is if an application allocates one ring
> for uring_cmd with the 128b SQEs, and then a separate ring for normal
> file and network stuff. Mixed SQE's would allow that application to have
> just one ring without being wasteful, but I'm just not sure if the
> separate rings is undesirable enough to make the effort worth it.

Indeed! And like Caleb mentioned, their use case already does this,
mixing passthrough with housekeeping buffer commands.

>> But yes, it could be supported, and Keith (kind of) signed himself up to
>> do that. One oddity I see on that side is that while with CQE32 the
>> kernel can manage the potential wrap-around gap, for SQEs that's
>> obviously on the application to do. That could just be a NOP or
>> something like that, but you do need something to fill/skip that space.
>> I guess that could be as simple as having an opcode that is simply "skip
>> me", so on the kernel side it'd be easy as it'd just drop it on the
>> floor. You still need to app side to fill one, however, and then deal
>> with "oops SQ ring is now full" too.
> 
> Yep, I think it's doable, and your implementation for mixed CQEs
> provides a great reference. I trust we can get liburing using it
> correctly, but would be afraid for anyone not using the library.

That's the same with the mixed CQEs though - as you can see from the
liburing changes, it's really not hard to support. If you're using the
raw interface, well then things are already a bit more complicated for
you. Not too worried about that use case.

-- 
Jens Axboe


* Re: [PATCHSET v2 0/8] Add support for mixed sized CQEs
  2025-08-21 17:46       ` Jens Axboe
@ 2025-08-21 18:19         ` Caleb Sander Mateos
  0 siblings, 0 replies; 16+ messages in thread
From: Caleb Sander Mateos @ 2025-08-21 18:19 UTC (permalink / raw)
  To: Jens Axboe; +Cc: io-uring, Keith Busch

On Thu, Aug 21, 2025 at 10:46 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 8/21/25 11:41 AM, Caleb Sander Mateos wrote:
> > On Thu, Aug 21, 2025 at 10:12?AM Jens Axboe <axboe@kernel.dk> wrote:
> >>
> >> On 8/21/25 11:02 AM, Caleb Sander Mateos wrote:
> >>> On Thu, Aug 21, 2025 at 7:28?AM Jens Axboe <axboe@kernel.dk> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> Currently io_uring supports two modes for CQEs:
> >>>>
> >>>> 1) The standard mode, where 16b CQEs are used
> >>>> 2) Setting IORING_SETUP_CQE32, which makes all CQEs posted 32b
> >>>>
> >>>> Certain features need to pass more information back than just a single
> >>>> 32-bit res field, and hence mandate the use of CQE32 to be able to work.
> >>>> Examples of that include passthrough or other uses of ->uring_cmd() like
> >>>> socket option getting and setting, including timestamps.
> >>>>
> >>>> This patchset adds support for IORING_SETUP_CQE_MIXED, which allows
> >>>> posting both 16b and 32b CQEs on the same CQ ring. The idea here is that
> >>>> we need not waste twice the space for CQ rings, or use twice the space
> >>>> per CQE posted, if only some of the CQEs posted require the use of 32b
> >>>> CQEs. On a ring setup in CQE mixed mode, 32b posted CQEs will have
> >>>> IORING_CQE_F_32 set in cqe->flags to tell the application (or liburing)
> >>>> about this fact.
> >>>
> >>> This makes a lot of sense. Have you considered something analogous for
> >>> SQEs? Requiring all SQEs to be 128 bytes when an io_uring is used for
> >>> a mix of 64-byte and 128-byte SQEs also wastes memory, probably even
> >>> more since SQEs are 4x larger than CQEs.
> >>
> >> Adding Keith, as he and I literally just talked about that. My answer
> >> was that the case is a bit different in that 32b CQEs can be useful in
> >> cases that are predominately 16b in the first place. For example,
> >> networking workload doing send/recv/etc and the occassional
> >> get/setsockopt kind of thing. Or maybe a mix of normal recv and zero
> >> copy rx.
> >>
> >> For the SQE case, I think it's a bit different. At least the cases I
> >> know of, it's mostly 100% 64b SQEs or 128b SQEs. I'm certainly willing
> >> to be told otherwise! Because that is kind of the key question that
> >> needs answering before even thinking about doing that kind of work.
> >
> > We certainly have a use case that mixes the two on the same io_uring:
> > ublk commit/buffer register/unregister commands (64 byte SQEs) and
> > NVMe passthru commands (128 byte SQEs). I could also imagine an
> > application issuing both normal read/write commands and NVMe passthru
> > commands. But you're probably right that this isn't a super common use
> > case.
>
> Yes that's a good point, and that would roughly be 50/50 in terms of 64b
> vs 128b SQEs?

For our application, the ratio between 64 and 128 bytes SQEs depends
on the ublk workload. Small ublk I/Os are translated 1-1 into NVMe
passthru I/Os, so there can be as many as 3 64-byte ublk SQEs
(register buffer, unregister buffer, and commit) for each 128-byte
NVMe passthru SQE. Larger I/Os are sharded into more NVMe passthru
commands, so there are relatively more 128-byte SQEs. And some
workloads can't use ublk zero-copy (since the data needs to go through
a RAID computation), in which case the only 64-byte SQE is the ublk
commit.

Best,
Caleb

>
> And yes, I can imagine other uses cases too, but I'm also finding a hard
> time justifying those as likely. On the other hand, people do the
> weirdest things...
>
> >> But yes, it could be supported, and Keith (kind of) signed himself up to
> >> do that. One oddity I see on that side is that while with CQE32 the
> >> kernel can manage the potential wrap-around gap, for SQEs that's
> >> obviously on the application to do. That could just be a NOP or
> >> something like that, but you do need something to fill/skip that space.
> >> I guess that could be as simple as having an opcode that is simply "skip
> >> me", so on the kernel side it'd be easy as it'd just drop it on the
> >> floor. You still need to app side to fill one, however, and then deal
> >> with "oops SQ ring is now full" too.
> >
> > Sure, of course userspace would need to handle a misaligned big SQE at
> > the end of the SQ analogously to mixed CQE sizes. I assume liburing
> > should be able to do that mostly transparently, that logic could all
> > be encapsulated by io_uring_get_sqe().
>
> Yep I think so, we'd need a new helper to return the kind of SQE you
> want, and it'd just need to get a 64b one and mark it with the SKIP
> opcode first if being asked for a 128b one and we're one off from
> wrapping around.
>
> --
> Jens Axboe
