Linux Documentation
* [PATCH v2 0/6] io_uring/zcrx: add CQE based notifications and stats reporting
@ 2026-05-18 15:35 Clément Léger
  2026-05-18 15:35 ` [PATCH v2 1/6] io_uring/zcrx: add ctx pointer to zcrx Clément Léger
                   ` (6 more replies)
  0 siblings, 7 replies; 10+ messages in thread
From: Clément Léger @ 2026-05-18 15:35 UTC (permalink / raw)
  To: io-uring, Pavel Begunkov, Jens Axboe
  Cc: Clément Léger, linux-doc, linux-kernel, linux-kselftest,
	netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Jonathan Corbet, Shuah Khan,
	Vishwanath Seshagiri

The zcrx path can encounter various conditions that lead to internal
fallbacks or errors. These can have a large impact on performance and
functionality, but they are not yet reported to the user, who is then
unable to take action.

This series addresses this problem by adding a new notification system
paired with a statistics structure. The notification system currently
reports out-of-buffer conditions and packets that fall back to copy.
The statistics structure reports the number and total size of packets
that were copied rather than received via the zero-copy path.

The out of buffer notification allows the user to actually adjust the
buffer sizing when registering zcrx support for the ifq. Some future
work could allow the user to add more memory on the fly to the pool so
the page allocator doesn't run out of memory.

This series can be tested using the included kselftest modification and
the liburing series that updates headers and tests/examples to use
notifications and statistics.

Changes in v2:
- Rebase on top of Pavel's branch that now uses a single CQE per notif
- Change notification mask to type (i.e. one CQE per event)
- Use a type rather than a mask for rearm as well
- Update tests to use a single type
- Update documentation to state that notif CQEs are sent for a single
  event
- Fix zero init of zcrx_query_notif __resv field
- Rename resv1 to __resv1
- Reduce __resv2 size to match io_uring_query_opcode size
- Verify that stats_offset is 0 if FLAG_STATS is zero
- Add zcrx notif query sequence to documentation
- Add _copy_fallback to test name

---

Clément Léger (4):
  io_uring/zcrx: notify user on frag copy fallback
  io_uring/zcrx: add shared-memory notification statistics
  Documentation: networking: document zcrx notifications and statistics
  selftests: iou-zcrx: add notification and stats test for zcrx

Pavel Begunkov (2):
  io_uring/zcrx: add ctx pointer to zcrx
  io_uring/zcrx: notify user when out of buffers

 Documentation/networking/iou-zcrx.rst         | 121 ++++++++++++
 include/uapi/linux/io_uring/query.h           |  12 ++
 include/uapi/linux/io_uring/zcrx.h            |  36 +++-
 io_uring/io_uring.c                           |   2 +-
 io_uring/io_uring.h                           |   1 +
 io_uring/query.c                              |  16 ++
 io_uring/zcrx.c                               | 180 +++++++++++++++++-
 io_uring/zcrx.h                               |  11 +-
 .../selftests/drivers/net/hw/iou-zcrx.c       | 114 ++++++++++-
 .../selftests/drivers/net/hw/iou-zcrx.py      |  49 ++++-
 10 files changed, 517 insertions(+), 25 deletions(-)

-- 
Clément Léger


* [PATCH v2 1/6] io_uring/zcrx: add ctx pointer to zcrx
  2026-05-18 15:35 [PATCH v2 0/6] io_uring/zcrx: add CQE based notifications and stats reporting Clément Léger
@ 2026-05-18 15:35 ` Clément Léger
  2026-05-19 15:19   ` Vishwanath Seshagiri
  2026-05-18 15:35 ` [PATCH v2 2/6] io_uring/zcrx: notify user when out of buffers Clément Léger
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 10+ messages in thread
From: Clément Léger @ 2026-05-18 15:35 UTC (permalink / raw)
  To: io-uring, Pavel Begunkov, Jens Axboe
  Cc: linux-doc, linux-kernel, linux-kselftest, netdev, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Jonathan Corbet, Shuah Khan, Vishwanath Seshagiri,
	Vishwanath Seshagiri

From: Pavel Begunkov <asml.silence@gmail.com>

zcrx will need to have a pointer to an owning ctx to communicate
different events. Reference the ctx while it's attached to zcrx, and
rely on zcrx termination to drop the ctx to avoid circular ref deps.
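
For readers following the ownership rule, here is a minimal userspace
sketch of it, not the kernel code. The toy_* names are hypothetical
stand-ins for io_ring_ctx and io_zcrx_ifq, locking is omitted, and a
plain integer replaces the percpu ref. It only illustrates that the
first ctx to attach takes one reference and that only that same ctx
drops it on unregistration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for io_ring_ctx and io_zcrx_ifq; not kernel types. */
struct toy_ctx {
	int refs;
};

struct toy_zcrx {
	struct toy_ctx *master_ctx;
};

/* Attach ctx as the owner unless one is already set; takes one reference. */
static bool toy_set_ring_ctx(struct toy_zcrx *z, struct toy_ctx *ctx)
{
	assert(ctx != NULL);
	if (z->master_ctx)
		return false;
	ctx->refs++;
	z->master_ctx = ctx;
	return true;
}

/* Drop the owner reference only when the caller is the owning ctx. */
static void toy_unregister_user(struct toy_zcrx *z, struct toy_ctx *ctx)
{
	if (ctx && z->master_ctx == ctx) {
		z->master_ctx = NULL;
		ctx->refs--;
	}
}
```

Passing NULL, as the export/release paths in the diff do, leaves the
owner untouched in this model.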

Co-developed-by: Vishwanath Seshagiri <vishs@meta.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 io_uring/zcrx.c | 39 +++++++++++++++++++++++++++++++--------
 io_uring/zcrx.h |  3 +++
 2 files changed, 34 insertions(+), 8 deletions(-)

diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
index 3f9632e7790a..34faf90423f4 100644
--- a/io_uring/zcrx.c
+++ b/io_uring/zcrx.c
@@ -44,6 +44,17 @@ static inline struct io_zcrx_area *io_zcrx_iov_to_area(const struct net_iov *nio
 	return container_of(owner, struct io_zcrx_area, nia);
 }
 
+static bool zcrx_set_ring_ctx(struct io_zcrx_ifq *zcrx,
+			      struct io_ring_ctx *ctx)
+{
+	guard(spinlock_bh)(&zcrx->ctx_lock);
+	if (zcrx->master_ctx)
+		return false;
+	percpu_ref_get(&ctx->refs);
+	zcrx->master_ctx = ctx;
+	return true;
+}
+
 static inline struct page *io_zcrx_iov_page(const struct net_iov *niov)
 {
 	struct io_zcrx_area *area = io_zcrx_iov_to_area(niov);
@@ -531,6 +542,7 @@ static struct io_zcrx_ifq *io_zcrx_ifq_alloc(struct io_ring_ctx *ctx)
 		return NULL;
 
 	ifq->if_rxq = -1;
+	spin_lock_init(&ifq->ctx_lock);
 	spin_lock_init(&ifq->rq.lock);
 	mutex_init(&ifq->pp_lock);
 	refcount_set(&ifq->refs, 1);
@@ -580,6 +592,8 @@ static void io_zcrx_ifq_free(struct io_zcrx_ifq *ifq)
 		return;
 	if (WARN_ON_ONCE(ifq->netdev != NULL))
 		return;
+	if (WARN_ON_ONCE(ifq->master_ctx))
+		return;
 
 	if (ifq->area)
 		io_zcrx_free_area(ifq, ifq->area);
@@ -656,17 +670,24 @@ static void io_zcrx_scrub(struct io_zcrx_ifq *ifq)
 	}
 }
 
-static void zcrx_unregister_user(struct io_zcrx_ifq *ifq)
+static void zcrx_unregister_user(struct io_zcrx_ifq *ifq, struct io_ring_ctx *ctx)
 {
+	scoped_guard(spinlock_bh, &ifq->ctx_lock) {
+		if (ctx && ifq->master_ctx == ctx) {
+			ifq->master_ctx = NULL;
+			percpu_ref_put(&ctx->refs);
+		}
+	}
+
 	if (refcount_dec_and_test(&ifq->user_refs)) {
 		io_close_queue(ifq);
 		io_zcrx_scrub(ifq);
 	}
 }
 
-static void zcrx_unregister(struct io_zcrx_ifq *ifq)
+static void zcrx_unregister(struct io_zcrx_ifq *ifq, struct io_ring_ctx *ctx)
 {
-	zcrx_unregister_user(ifq);
+	zcrx_unregister_user(ifq, ctx);
 	io_put_zcrx_ifq(ifq);
 }
 
@@ -686,7 +707,7 @@ static int zcrx_box_release(struct inode *inode, struct file *file)
 
 	if (WARN_ON_ONCE(!ifq))
 		return -EFAULT;
-	zcrx_unregister(ifq);
+	zcrx_unregister(ifq, NULL);
 	return 0;
 }
 
@@ -711,7 +732,7 @@ static int zcrx_export(struct io_ring_ctx *ctx, struct io_zcrx_ifq *ifq,
 	file = anon_inode_create_getfile("[zcrx]", &zcrx_box_fops,
 					 ifq, O_CLOEXEC, NULL);
 	if (IS_ERR(file)) {
-		zcrx_unregister(ifq);
+		zcrx_unregister(ifq, NULL);
 		return PTR_ERR(file);
 	}
 
@@ -787,7 +808,7 @@ static int import_zcrx(struct io_ring_ctx *ctx,
 	scoped_guard(mutex, &ctx->mmap_lock)
 		xa_erase(&ctx->zcrx_ctxs, id);
 err:
-	zcrx_unregister(ifq);
+	zcrx_unregister(ifq, ctx);
 	return ret;
 }
 
@@ -932,12 +953,14 @@ int io_register_zcrx(struct io_ring_ctx *ctx,
 		ret = -EFAULT;
 		goto err;
 	}
+
+	zcrx_set_ring_ctx(ifq, ctx);
 	return 0;
 err:
 	scoped_guard(mutex, &ctx->mmap_lock)
 		xa_erase(&ctx->zcrx_ctxs, id);
 ifq_free:
-	zcrx_unregister(ifq);
+	zcrx_unregister(ifq, ctx);
 	return ret;
 }
 
@@ -967,7 +990,7 @@ void io_terminate_zcrx(struct io_ring_ctx *ctx)
 			break;
 		set_zcrx_entry_mark(ctx, id);
 		id++;
-		zcrx_unregister_user(ifq);
+		zcrx_unregister_user(ifq, ctx);
 	}
 }
 
diff --git a/io_uring/zcrx.h b/io_uring/zcrx.h
index 9e1a6a1b11e8..6b565d0bf6da 100644
--- a/io_uring/zcrx.h
+++ b/io_uring/zcrx.h
@@ -73,6 +73,9 @@ struct io_zcrx_ifq {
 	 */
 	struct mutex			pp_lock;
 	struct io_mapped_region		rq_region;
+
+	spinlock_t			ctx_lock;
+	struct io_ring_ctx		*master_ctx;
 };
 
 #if defined(CONFIG_IO_URING_ZCRX)
-- 
2.53.0-Meta



* [PATCH v2 2/6] io_uring/zcrx: notify user when out of buffers
  2026-05-18 15:35 [PATCH v2 0/6] io_uring/zcrx: add CQE based notifications and stats reporting Clément Léger
  2026-05-18 15:35 ` [PATCH v2 1/6] io_uring/zcrx: add ctx pointer to zcrx Clément Léger
@ 2026-05-18 15:35 ` Clément Léger
  2026-05-19 15:21   ` Vishwanath Seshagiri
  2026-05-18 15:35 ` [PATCH v2 3/6] io_uring/zcrx: notify user on frag copy fallback Clément Léger
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 10+ messages in thread
From: Clément Léger @ 2026-05-18 15:35 UTC (permalink / raw)
  To: io-uring, Pavel Begunkov, Jens Axboe
  Cc: linux-doc, linux-kernel, linux-kselftest, netdev, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Jonathan Corbet, Shuah Khan, Vishwanath Seshagiri,
	Vishwanath Seshagiri

From: Pavel Begunkov <asml.silence@gmail.com>

There is currently no easy way for the user to know when zcrx is out of
buffers and the page pool fails to allocate. Add uapi for zcrx to
communicate it back.

It's implemented as a separate CQE, which for now is posted to the creator
ctx. To use it, on registration the user space needs to pass an instance
of struct zcrx_notification_desc, which tells the kernel the user_data
for resulting CQEs and which event types are expected / allowed.

When an allowed event happens, zcrx will post a CQE containing the
specified user_data, and cqe->res will be set to the event type. Before
the kernel can post another notification of the given type, the user
needs to acknowledge that it processed the previous one by issuing
IORING_REGISTER_ZCRX_CTRL with ZCRX_CTRL_ARM_NOTIFICATION.

The only notification type the patch implements is
ZCRX_NOTIF_NO_BUFFERS, but we'll need more of them in the future.
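
The fire-once / re-arm behaviour can be sketched as a small userspace
model. This is an illustration under assumptions, not the kernel
implementation: struct notif_state and the model_* helpers are
hypothetical names, and the owner-ctx check, locking and allocation
failure paths of zcrx_send_notif() are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirrors enum zcrx_notification_type as defined by this patch. */
enum {
	NOTIF_NO_BUFFERS,
	NOTIF_TYPE_LAST,
};

/* Hypothetical model of the per-ifq notification state. */
struct notif_state {
	uint32_t allowed_mask;	/* type_mask passed at registration */
	uint32_t fired;		/* types posted and not yet re-armed */
};

/* Returns true if a CQE with cqe->res == type would be posted. */
static bool model_send_notif(struct notif_state *s, unsigned int type)
{
	uint32_t bit;

	assert(type < NOTIF_TYPE_LAST);
	bit = 1U << type;
	if (!(bit & s->allowed_mask) || (bit & s->fired))
		return false;
	s->fired |= bit;
	return true;
}

/* Models ZCRX_CTRL_ARM_NOTIFICATION: only a fired type can be re-armed. */
static int model_arm_notif(struct notif_state *s, unsigned int type)
{
	uint32_t bit;

	if (type >= NOTIF_TYPE_LAST)
		return -EINVAL;
	bit = 1U << type;
	if (!(bit & s->fired))
		return -EINVAL;
	s->fired &= ~bit;
	return 0;
}
```

In this model a second event of the same type is dropped silently until
the application re-arms it, and arming a type that has not fired fails
with -EINVAL, matching zcrx_arm_notif() in the diff below.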

Co-developed-by: Vishwanath Seshagiri <vishs@meta.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 include/uapi/linux/io_uring/zcrx.h | 24 ++++++++-
 io_uring/io_uring.c                |  2 +-
 io_uring/io_uring.h                |  1 +
 io_uring/zcrx.c                    | 86 +++++++++++++++++++++++++++++-
 io_uring/zcrx.h                    |  7 ++-
 5 files changed, 115 insertions(+), 5 deletions(-)

diff --git a/include/uapi/linux/io_uring/zcrx.h b/include/uapi/linux/io_uring/zcrx.h
index 5ce02c7a6096..67185566ad3c 100644
--- a/include/uapi/linux/io_uring/zcrx.h
+++ b/include/uapi/linux/io_uring/zcrx.h
@@ -65,6 +65,20 @@ enum zcrx_features {
 	 * value in struct io_uring_zcrx_ifq_reg::rx_buf_len.
 	 */
 	ZCRX_FEATURE_RX_PAGE_SIZE	= 1 << 0,
+	ZCRX_FEATURE_NOTIFICATION	= 1 << 1,
+};
+
+enum zcrx_notification_type {
+	ZCRX_NOTIF_NO_BUFFERS,
+
+	__ZCRX_NOTIF_TYPE_LAST,
+};
+
+struct zcrx_notification_desc {
+	__u64	user_data;
+	__u32	type_mask;
+	__u32	__resv1;
+	__u64	__resv2[10];
 };
 
 /*
@@ -82,12 +96,14 @@ struct io_uring_zcrx_ifq_reg {
 	struct io_uring_zcrx_offsets offsets;
 	__u32	zcrx_id;
 	__u32	rx_buf_len;
-	__u64	__resv[3];
+	__u64	notif_desc; /* see struct zcrx_notification_desc */
+	__u64	__resv[2];
 };
 
 enum zcrx_ctrl_op {
 	ZCRX_CTRL_FLUSH_RQ,
 	ZCRX_CTRL_EXPORT,
+	ZCRX_CTRL_ARM_NOTIFICATION,
 
 	__ZCRX_CTRL_LAST,
 };
@@ -101,6 +117,11 @@ struct zcrx_ctrl_export {
 	__u32 		__resv1[11];
 };
 
+struct zcrx_ctrl_arm_notif {
+	__u32		notif_type;
+	__u32		__resv[11];
+};
+
 struct zcrx_ctrl {
 	__u32	zcrx_id;
 	__u32	op; /* see enum zcrx_ctrl_op */
@@ -109,6 +130,7 @@ struct zcrx_ctrl {
 	union {
 		struct zcrx_ctrl_export		zc_export;
 		struct zcrx_ctrl_flush_rq	zc_flush;
+		struct zcrx_ctrl_arm_notif	zc_arm_notif;
 	};
 };
 
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 2ebb0ba37c4f..c5972274cce1 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -160,7 +160,7 @@ static void io_poison_cached_req(struct io_kiocb *req)
 	req->apoll = IO_URING_PTR_POISON;
 }
 
-static void io_poison_req(struct io_kiocb *req)
+void io_poison_req(struct io_kiocb *req)
 {
 	io_poison_cached_req(req);
 	req->async_data = IO_URING_PTR_POISON;
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index e612a66ee80e..de0a3bed58d1 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -213,6 +213,7 @@ bool __io_alloc_req_refill(struct io_ring_ctx *ctx);
 
 void io_activate_pollwq(struct io_ring_ctx *ctx);
 void io_restriction_clone(struct io_restriction *dst, struct io_restriction *src);
+void io_poison_req(struct io_kiocb *req);
 
 static inline void io_lockdep_assert_cq_locked(struct io_ring_ctx *ctx)
 {
diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
index 34faf90423f4..463fbaead35b 100644
--- a/io_uring/zcrx.c
+++ b/io_uring/zcrx.c
@@ -768,6 +768,8 @@ static int import_zcrx(struct io_ring_ctx *ctx,
 		return -EINVAL;
 	if (reg->if_rxq || reg->rq_entries || reg->area_ptr || reg->region_ptr)
 		return -EINVAL;
+	if (reg->notif_desc)
+		return -EINVAL;
 	if (reg->flags & ~ZCRX_REG_IMPORT)
 		return -EINVAL;
 
@@ -856,6 +858,7 @@ static int zcrx_register_netdev(struct io_zcrx_ifq *ifq,
 int io_register_zcrx(struct io_ring_ctx *ctx,
 		     struct io_uring_zcrx_ifq_reg __user *arg)
 {
+	struct zcrx_notification_desc notif;
 	struct io_uring_zcrx_area_reg area;
 	struct io_uring_zcrx_ifq_reg reg;
 	struct io_uring_region_desc rd;
@@ -899,10 +902,22 @@ int io_register_zcrx(struct io_ring_ctx *ctx,
 	if (copy_from_user(&area, u64_to_user_ptr(reg.area_ptr), sizeof(area)))
 		return -EFAULT;
 
+	memset(&notif, 0, sizeof(notif));
+	if (reg.notif_desc && copy_from_user(&notif, u64_to_user_ptr(reg.notif_desc),
+					     sizeof(notif)))
+		return -EFAULT;
+	if (notif.type_mask & ~ZCRX_NOTIF_TYPE_MASK)
+		return -EINVAL;
+	if (notif.__resv1 || !mem_is_zero(&notif.__resv2, sizeof(notif.__resv2)))
+		return -EINVAL;
+
 	ifq = io_zcrx_ifq_alloc(ctx);
 	if (!ifq)
 		return -ENOMEM;
 
+	ifq->notif_data = notif.user_data;
+	ifq->allowed_notif_mask = notif.type_mask;
+
 	if (ctx->user) {
 		get_uid(ctx->user);
 		ifq->user = ctx->user;
@@ -954,7 +969,8 @@ int io_register_zcrx(struct io_ring_ctx *ctx,
 		goto err;
 	}
 
-	zcrx_set_ring_ctx(ifq, ctx);
+	if (notif.type_mask)
+		zcrx_set_ring_ctx(ifq, ctx);
 	return 0;
 err:
 	scoped_guard(mutex, &ctx->mmap_lock)
@@ -1127,6 +1143,48 @@ static unsigned io_zcrx_refill_slow(struct page_pool *pp, struct io_zcrx_ifq *if
 	return allocated;
 }
 
+static void zcrx_notif_tw(struct io_tw_req tw_req, io_tw_token_t tw)
+{
+	struct io_kiocb *req = tw_req.req;
+	struct io_ring_ctx *ctx = req->ctx;
+
+	io_post_aux_cqe(ctx, req->cqe.user_data, req->cqe.res, 0);
+	percpu_ref_put(&ctx->refs);
+	io_poison_req(req);
+	kmem_cache_free(req_cachep, req);
+}
+
+static void zcrx_send_notif(struct io_zcrx_ifq *ifq, unsigned type)
+{
+	gfp_t gfp = GFP_ATOMIC | __GFP_NOWARN | __GFP_ZERO;
+	u32 type_mask = 1 << type;
+	struct io_kiocb *req;
+
+	if (!(type_mask & ifq->allowed_notif_mask))
+		return;
+
+	guard(spinlock_bh)(&ifq->ctx_lock);
+	if (!ifq->master_ctx)
+		return;
+	if (type_mask & ifq->fired_notifs)
+		return;
+
+	req = kmem_cache_alloc(req_cachep, gfp);
+	if (unlikely(!req))
+		return;
+
+	ifq->fired_notifs |= type_mask;
+
+	req->opcode = IORING_OP_NOP;
+	req->cqe.user_data = ifq->notif_data;
+	req->cqe.res = type;
+	req->ctx = ifq->master_ctx;
+	percpu_ref_get(&req->ctx->refs);
+	req->tctx = NULL;
+	req->io_task_work.func = zcrx_notif_tw;
+	io_req_task_work_add(req);
+}
+
 static netmem_ref io_pp_zc_alloc_netmems(struct page_pool *pp, gfp_t gfp)
 {
 	struct io_zcrx_ifq *ifq = io_pp_to_ifq(pp);
@@ -1143,8 +1201,10 @@ static netmem_ref io_pp_zc_alloc_netmems(struct page_pool *pp, gfp_t gfp)
 		goto out_return;
 
 	allocated = io_zcrx_refill_slow(pp, ifq, netmems, to_alloc);
-	if (!allocated)
+	if (!allocated) {
+		zcrx_send_notif(ifq, ZCRX_NOTIF_NO_BUFFERS);
 		return 0;
+	}
 out_return:
 	zcrx_sync_for_device(pp, ifq, netmems, allocated);
 	allocated--;
@@ -1293,12 +1353,32 @@ static int zcrx_flush_rq(struct io_ring_ctx *ctx, struct io_zcrx_ifq *zcrx,
 	return 0;
 }
 
+static int zcrx_arm_notif(struct io_ring_ctx *ctx, struct io_zcrx_ifq *zcrx,
+			  struct zcrx_ctrl *ctrl)
+{
+	const struct zcrx_ctrl_arm_notif *an = &ctrl->zc_arm_notif;
+	unsigned type_mask;
+
+	if (an->notif_type >= __ZCRX_NOTIF_TYPE_LAST)
+		return -EINVAL;
+	if (!mem_is_zero(&an->__resv, sizeof(an->__resv)))
+		return -EINVAL;
+
+	guard(spinlock_bh)(&zcrx->ctx_lock);
+	type_mask = 1U << an->notif_type;
+	if (type_mask & ~zcrx->fired_notifs)
+		return -EINVAL;
+	zcrx->fired_notifs &= ~type_mask;
+	return 0;
+}
+
 int io_zcrx_ctrl(struct io_ring_ctx *ctx, void __user *arg, unsigned nr_args)
 {
 	struct zcrx_ctrl ctrl;
 	struct io_zcrx_ifq *zcrx;
 
 	BUILD_BUG_ON(sizeof(ctrl.zc_export) != sizeof(ctrl.zc_flush));
+	BUILD_BUG_ON(sizeof(ctrl.zc_export) != sizeof(ctrl.zc_arm_notif));
 
 	if (nr_args)
 		return -EINVAL;
@@ -1316,6 +1396,8 @@ int io_zcrx_ctrl(struct io_ring_ctx *ctx, void __user *arg, unsigned nr_args)
 		return zcrx_flush_rq(ctx, zcrx, &ctrl);
 	case ZCRX_CTRL_EXPORT:
 		return zcrx_export(ctx, zcrx, &ctrl, arg);
+	case ZCRX_CTRL_ARM_NOTIFICATION:
+		return zcrx_arm_notif(ctx, zcrx, &ctrl);
 	}
 
 	return -EOPNOTSUPP;
diff --git a/io_uring/zcrx.h b/io_uring/zcrx.h
index 6b565d0bf6da..cca10d0d02ac 100644
--- a/io_uring/zcrx.h
+++ b/io_uring/zcrx.h
@@ -9,7 +9,9 @@
 #include <net/net_trackers.h>
 
 #define ZCRX_SUPPORTED_REG_FLAGS	(ZCRX_REG_IMPORT | ZCRX_REG_NODEV)
-#define ZCRX_FEATURES			(ZCRX_FEATURE_RX_PAGE_SIZE)
+#define ZCRX_FEATURES			(ZCRX_FEATURE_RX_PAGE_SIZE |\
+					 ZCRX_FEATURE_NOTIFICATION)
+#define ZCRX_NOTIF_TYPE_MASK		(1U << ZCRX_NOTIF_NO_BUFFERS)
 
 struct io_zcrx_mem {
 	unsigned long			size;
@@ -76,6 +78,9 @@ struct io_zcrx_ifq {
 
 	spinlock_t			ctx_lock;
 	struct io_ring_ctx		*master_ctx;
+	u32				allowed_notif_mask;
+	u32				fired_notifs;
+	u64				notif_data;
 };
 
 #if defined(CONFIG_IO_URING_ZCRX)
-- 
2.53.0-Meta



* [PATCH v2 3/6] io_uring/zcrx: notify user on frag copy fallback
  2026-05-18 15:35 [PATCH v2 0/6] io_uring/zcrx: add CQE based notifications and stats reporting Clément Léger
  2026-05-18 15:35 ` [PATCH v2 1/6] io_uring/zcrx: add ctx pointer to zcrx Clément Léger
  2026-05-18 15:35 ` [PATCH v2 2/6] io_uring/zcrx: notify user when out of buffers Clément Léger
@ 2026-05-18 15:35 ` Clément Léger
  2026-05-18 15:35 ` [PATCH v2 4/6] io_uring/zcrx: add shared-memory notification statistics Clément Léger
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 10+ messages in thread
From: Clément Léger @ 2026-05-18 15:35 UTC (permalink / raw)
  To: io-uring, Pavel Begunkov, Jens Axboe
  Cc: Clément Léger, linux-doc, linux-kernel, linux-kselftest,
	netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Jonathan Corbet, Shuah Khan,
	Vishwanath Seshagiri

Add a ZCRX_NOTIF_COPY notification type to signal userspace when a
received fragment could not be delivered using zero-copy and was
instead copied into a buffer.

Signed-off-by: Clément Léger <cleger@meta.com>
---
 include/uapi/linux/io_uring/zcrx.h | 1 +
 io_uring/zcrx.c                    | 7 ++++++-
 io_uring/zcrx.h                    | 2 +-
 3 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/io_uring/zcrx.h b/include/uapi/linux/io_uring/zcrx.h
index 67185566ad3c..3f7b72b09878 100644
--- a/include/uapi/linux/io_uring/zcrx.h
+++ b/include/uapi/linux/io_uring/zcrx.h
@@ -70,6 +70,7 @@ enum zcrx_features {
 
 enum zcrx_notification_type {
 	ZCRX_NOTIF_NO_BUFFERS,
+	ZCRX_NOTIF_COPY,
 
 	__ZCRX_NOTIF_TYPE_LAST,
 };
diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
index 463fbaead35b..f31f2ca0f7ec 100644
--- a/io_uring/zcrx.c
+++ b/io_uring/zcrx.c
@@ -1534,8 +1534,13 @@ static int io_zcrx_copy_frag(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
 			     const skb_frag_t *frag, int off, int len)
 {
 	struct page *page = skb_frag_page(frag);
+	int ret;
+
+	ret = io_zcrx_copy_chunk(req, ifq, page, off + skb_frag_off(frag), len);
+	if (ret > 0)
+		zcrx_send_notif(ifq, ZCRX_NOTIF_COPY);
 
-	return io_zcrx_copy_chunk(req, ifq, page, off + skb_frag_off(frag), len);
+	return ret;
 }
 
 static int io_zcrx_recv_frag(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
diff --git a/io_uring/zcrx.h b/io_uring/zcrx.h
index cca10d0d02ac..203b3049e14b 100644
--- a/io_uring/zcrx.h
+++ b/io_uring/zcrx.h
@@ -11,7 +11,7 @@
 #define ZCRX_SUPPORTED_REG_FLAGS	(ZCRX_REG_IMPORT | ZCRX_REG_NODEV)
 #define ZCRX_FEATURES			(ZCRX_FEATURE_RX_PAGE_SIZE |\
 					 ZCRX_FEATURE_NOTIFICATION)
-#define ZCRX_NOTIF_TYPE_MASK		(1U << ZCRX_NOTIF_NO_BUFFERS)
+#define ZCRX_NOTIF_TYPE_MASK		((1U << ZCRX_NOTIF_NO_BUFFERS) | (1U << ZCRX_NOTIF_COPY))
 
 struct io_zcrx_mem {
 	unsigned long			size;
-- 
2.53.0-Meta



* [PATCH v2 4/6] io_uring/zcrx: add shared-memory notification statistics
  2026-05-18 15:35 [PATCH v2 0/6] io_uring/zcrx: add CQE based notifications and stats reporting Clément Léger
                   ` (2 preceding siblings ...)
  2026-05-18 15:35 ` [PATCH v2 3/6] io_uring/zcrx: notify user on frag copy fallback Clément Léger
@ 2026-05-18 15:35 ` Clément Léger
  2026-05-18 15:35 ` [PATCH v2 5/6] Documentation: networking: document zcrx notifications and statistics Clément Léger
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 10+ messages in thread
From: Clément Léger @ 2026-05-18 15:35 UTC (permalink / raw)
  To: io-uring, Pavel Begunkov, Jens Axboe
  Cc: Clément Léger, linux-doc, linux-kernel, linux-kselftest,
	netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Jonathan Corbet, Shuah Khan,
	Vishwanath Seshagiri

Add support for an optional stats struct embedded in the refill queue
region, allowing userspace to monitor copy-fallback in real-time.

Userspace queries the stats struct size and alignment via
IO_URING_QUERY_ZCRX_NOTIF (notif_stats_size /
notif_stats_off_alignment), then provides a stats_offset in
zcrx_notification_desc pointing to a location within the refill queue
region.

The kernel updates the stats counters in-place on every copy-fallback
event.
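
One way userspace might pick a valid stats_offset is to place the stats
struct right after the RQE array, rounded up to the queried alignment.
The sketch below is illustrative only: the helper names are
hypothetical, the sizes are passed in as parameters (they would come
from the registration offsets and from IO_URING_QUERY_ZCRX_NOTIF), and
check_stats_offset() merely mirrors the checks performed by
zcrx_validate_notif_stats() in this patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Round v up to a multiple of align, which must be a power of two. */
static size_t align_up(size_t v, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (v + align - 1) & ~(align - 1);
}

/* First aligned offset after the RQE array in the refill region. */
static size_t pick_stats_offset(size_t rqes_off, size_t rqe_size,
				size_t rq_entries, size_t align)
{
	return align_up(rqes_off + rqe_size * rq_entries, align);
}

/* Mirrors the alignment, overlap and bounds checks done by the kernel. */
static int check_stats_offset(size_t off, size_t used, size_t stats_size,
			      size_t align, size_t region_size)
{
	if (off & (align - 1))
		return -EINVAL;
	if (off < used)
		return -ERANGE;
	if (off > SIZE_MAX - stats_size)
		return -ERANGE;
	if (off + stats_size > region_size)
		return -ERANGE;
	return 0;
}
```

The region passed at registration has to be sized so that the chosen
offset plus notif_stats_size still fits; otherwise registration is
expected to fail with -ERANGE.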

Signed-off-by: Clément Léger <cleger@meta.com>
---
 include/uapi/linux/io_uring/query.h | 12 +++++++
 include/uapi/linux/io_uring/zcrx.h  | 15 ++++++--
 io_uring/query.c                    | 16 +++++++++
 io_uring/zcrx.c                     | 54 +++++++++++++++++++++++++++--
 io_uring/zcrx.h                     |  1 +
 5 files changed, 94 insertions(+), 4 deletions(-)

diff --git a/include/uapi/linux/io_uring/query.h b/include/uapi/linux/io_uring/query.h
index 95500759cc13..1a68eca7c6b4 100644
--- a/include/uapi/linux/io_uring/query.h
+++ b/include/uapi/linux/io_uring/query.h
@@ -23,6 +23,7 @@ enum {
 	IO_URING_QUERY_OPCODES			= 0,
 	IO_URING_QUERY_ZCRX			= 1,
 	IO_URING_QUERY_SCQ			= 2,
+	IO_URING_QUERY_ZCRX_NOTIF		= 3,
 
 	__IO_URING_QUERY_MAX,
 };
@@ -62,6 +63,17 @@ struct io_uring_query_zcrx {
 	__u64 __resv2;
 };
 
+struct io_uring_query_zcrx_notif {
+	/* Bitmask of supported ZCRX_NOTIF_* flags */
+	__u32 notif_flags;
+	/* Size of io_uring_zcrx_notif_stats */
+	__u32 notif_stats_size;
+	/* Required alignment for the stats struct within the region (ie stats_offset) */
+	__u32 notif_stats_off_alignment;
+	__u32 __resv1;
+	__u64 __resv2[4];
+};
+
 struct io_uring_query_scq {
 	/* The SQ/CQ rings header size */
 	__u64 hdr_size;
diff --git a/include/uapi/linux/io_uring/zcrx.h b/include/uapi/linux/io_uring/zcrx.h
index 3f7b72b09878..384e185a180c 100644
--- a/include/uapi/linux/io_uring/zcrx.h
+++ b/include/uapi/linux/io_uring/zcrx.h
@@ -75,11 +75,22 @@ enum zcrx_notification_type {
 	__ZCRX_NOTIF_TYPE_LAST,
 };
 
+enum zcrx_notification_desc_flags {
+	/* If set, stats_offset holds a valid offset to a notif_stats struct */
+	ZCRX_NOTIF_DESC_FLAG_STATS = 1 << 0,
+};
+
+struct io_uring_zcrx_notif_stats {
+	__u64	copy_count;	/* cumulative copy-fallback CQEs */
+	__u64	copy_bytes;	/* cumulative bytes copied */
+};
+
 struct zcrx_notification_desc {
 	__u64	user_data;
 	__u32	type_mask;
-	__u32	__resv1;
-	__u64	__resv2[10];
+	__u32	flags; /* see enum zcrx_notification_desc_flags */
+	__u64	stats_offset; /* offset from the beginning of refill ring region for stats */
+	__u64	__resv2[9];
 };
 
 /*
diff --git a/io_uring/query.c b/io_uring/query.c
index c1704d088374..d17a83645bcd 100644
--- a/io_uring/query.c
+++ b/io_uring/query.c
@@ -9,6 +9,7 @@
 union io_query_data {
 	struct io_uring_query_opcode opcodes;
 	struct io_uring_query_zcrx zcrx;
+	struct io_uring_query_zcrx_notif zcrx_notif;
 	struct io_uring_query_scq scq;
 };
 
@@ -44,6 +45,18 @@ static ssize_t io_query_zcrx(union io_query_data *data)
 	return sizeof(*e);
 }
 
+static ssize_t io_query_zcrx_notif(union io_query_data *data)
+{
+	struct io_uring_query_zcrx_notif *e = &data->zcrx_notif;
+
+	e->notif_flags = ZCRX_NOTIF_TYPE_MASK;
+	e->notif_stats_size = sizeof(struct io_uring_zcrx_notif_stats);
+	e->notif_stats_off_alignment = __alignof__(struct io_uring_zcrx_notif_stats);
+	e->__resv1 = 0;
+	memset(&e->__resv2, 0, sizeof(e->__resv2));
+	return sizeof(*e);
+}
+
 static ssize_t io_query_scq(union io_query_data *data)
 {
 	struct io_uring_query_scq *e = &data->scq;
@@ -83,6 +96,9 @@ static int io_handle_query_entry(union io_query_data *data, void __user *uhdr,
 	case IO_URING_QUERY_ZCRX:
 		ret = io_query_zcrx(data);
 		break;
+	case IO_URING_QUERY_ZCRX_NOTIF:
+		ret = io_query_zcrx_notif(data);
+		break;
 	case IO_URING_QUERY_SCQ:
 		ret = io_query_scq(data);
 		break;
diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
index f31f2ca0f7ec..2881ad76bacc 100644
--- a/io_uring/zcrx.c
+++ b/io_uring/zcrx.c
@@ -415,6 +415,7 @@ static void io_free_rbuf_ring(struct io_zcrx_ifq *ifq)
 	io_free_region(ifq->user, &ifq->rq_region);
 	ifq->rq.ring = IO_URING_PTR_POISON;
 	ifq->rq.rqes = IO_URING_PTR_POISON;
+	ifq->notif_stats = IO_URING_PTR_POISON;
 }
 
 static void io_zcrx_free_area(struct io_zcrx_ifq *ifq,
@@ -855,6 +856,33 @@ static int zcrx_register_netdev(struct io_zcrx_ifq *ifq,
 	return ret;
 }
 
+static int zcrx_validate_notif_stats(struct io_zcrx_ifq *ifq,
+				     const struct io_uring_zcrx_ifq_reg *reg,
+				     const struct zcrx_notification_desc *notif)
+{
+	size_t stats_off = notif->stats_offset;
+	size_t used, end;
+
+	used = reg->offsets.rqes +
+	       sizeof(struct io_uring_zcrx_rqe) * reg->rq_entries;
+
+	if (!IS_ALIGNED(stats_off, __alignof__(struct io_uring_zcrx_notif_stats)))
+		return -EINVAL;
+	if (stats_off < used)
+		return -ERANGE;
+	if (check_add_overflow(stats_off,
+			       sizeof(struct io_uring_zcrx_notif_stats),
+			       &end))
+		return -ERANGE;
+	if (end > io_region_size(&ifq->rq_region))
+		return -ERANGE;
+
+	ifq->notif_stats = io_region_get_ptr(&ifq->rq_region) + stats_off;
+	memset(ifq->notif_stats, 0, sizeof(*ifq->notif_stats));
+
+	return 0;
+}
+
 int io_register_zcrx(struct io_ring_ctx *ctx,
 		     struct io_uring_zcrx_ifq_reg __user *arg)
 {
@@ -908,7 +936,13 @@ int io_register_zcrx(struct io_ring_ctx *ctx,
 		return -EFAULT;
 	if (notif.type_mask & ~ZCRX_NOTIF_TYPE_MASK)
 		return -EINVAL;
-	if (notif.__resv1 || !mem_is_zero(&notif.__resv2, sizeof(notif.__resv2)))
+	if (notif.flags & ~ZCRX_NOTIF_DESC_FLAG_STATS)
+		return -EINVAL;
+	if (!(notif.flags & ZCRX_NOTIF_DESC_FLAG_STATS)) {
+		if (notif.stats_offset)
+			return -EINVAL;
+	}
+	if (!mem_is_zero(&notif.__resv2, sizeof(notif.__resv2)))
 		return -EINVAL;
 
 	ifq = io_zcrx_ifq_alloc(ctx);
@@ -939,6 +973,12 @@ int io_register_zcrx(struct io_ring_ctx *ctx,
 	if (ret)
 		goto err;
 
+	if (notif.flags & ZCRX_NOTIF_DESC_FLAG_STATS) {
+		ret = zcrx_validate_notif_stats(ifq, &reg, &notif);
+		if (ret)
+			goto err;
+	}
+
 	ifq->kern_readable = !(area.flags & IORING_ZCRX_AREA_DMABUF);
 
 	if (!(reg.flags & ZCRX_REG_NODEV)) {
@@ -1154,6 +1194,11 @@ static void zcrx_notif_tw(struct io_tw_req tw_req, io_tw_token_t tw)
 	kmem_cache_free(req_cachep, req);
 }
 
+static void zcrx_stat_add(__u64 *p, s64 v)
+{
+	WRITE_ONCE(*p, READ_ONCE(*p) + v);
+}
+
 static void zcrx_send_notif(struct io_zcrx_ifq *ifq, unsigned type)
 {
 	gfp_t gfp = GFP_ATOMIC | __GFP_NOWARN | __GFP_ZERO;
@@ -1537,8 +1582,13 @@ static int io_zcrx_copy_frag(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
 	int ret;
 
 	ret = io_zcrx_copy_chunk(req, ifq, page, off + skb_frag_off(frag), len);
-	if (ret > 0)
+	if (ret > 0) {
+		if (ifq->notif_stats) {
+			zcrx_stat_add(&ifq->notif_stats->copy_count, 1);
+			zcrx_stat_add(&ifq->notif_stats->copy_bytes, ret);
+		}
 		zcrx_send_notif(ifq, ZCRX_NOTIF_COPY);
+	}
 
 	return ret;
 }
diff --git a/io_uring/zcrx.h b/io_uring/zcrx.h
index 203b3049e14b..e1aab76c310d 100644
--- a/io_uring/zcrx.h
+++ b/io_uring/zcrx.h
@@ -81,6 +81,7 @@ struct io_zcrx_ifq {
 	u32				allowed_notif_mask;
 	u32				fired_notifs;
 	u64				notif_data;
+	struct io_uring_zcrx_notif_stats *notif_stats;
 };
 
 #if defined(CONFIG_IO_URING_ZCRX)
-- 
2.53.0-Meta



* [PATCH v2 5/6] Documentation: networking: document zcrx notifications and statistics
  2026-05-18 15:35 [PATCH v2 0/6] io_uring/zcrx: add CQE based notifications and stats reporting Clément Léger
                   ` (3 preceding siblings ...)
  2026-05-18 15:35 ` [PATCH v2 4/6] io_uring/zcrx: add shared-memory notification statistics Clément Léger
@ 2026-05-18 15:35 ` Clément Léger
  2026-05-18 15:35 ` [PATCH v2 6/6] selftests: iou-zcrx: add notification and stats test for zcrx Clément Léger
  2026-05-19 11:43 ` [PATCH v2 0/6] io_uring/zcrx: add CQE based notifications and stats reporting Pavel Begunkov
  6 siblings, 0 replies; 10+ messages in thread
From: Clément Léger @ 2026-05-18 15:35 UTC (permalink / raw)
  To: io-uring, Pavel Begunkov, Jens Axboe
  Cc: Clément Léger, linux-doc, linux-kernel, linux-kselftest,
	netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Jonathan Corbet, Shuah Khan,
	Vishwanath Seshagiri

Document the zcrx notification system and shared-memory statistics
that were introduced to let userspace monitor zero-copy receive health.
The notification section covers the two notification types
(ZCRX_NOTIF_NO_BUFFERS, ZCRX_NOTIF_COPY), registration via
zcrx_notification_desc, and the fire-once / re-arm mechanism via
ZCRX_CTRL_ARM_NOTIFICATION. The statistics section covers the optional
shared-memory io_uring_zcrx_notif_stats structure placed in the refill
ring region, including how to query its layout via
IO_URING_QUERY_ZCRX_NOTIF.

Signed-off-by: Clément Léger <cleger@meta.com>
---
 Documentation/networking/iou-zcrx.rst | 121 ++++++++++++++++++++++++++
 1 file changed, 121 insertions(+)

diff --git a/Documentation/networking/iou-zcrx.rst b/Documentation/networking/iou-zcrx.rst
index 7f3f4b2e6cf2..442760a1ca03 100644
--- a/Documentation/networking/iou-zcrx.rst
+++ b/Documentation/networking/iou-zcrx.rst
@@ -196,6 +196,127 @@ Return buffers back to the kernel to be used again::
   rqe->len = cqe->res;
   IO_URING_WRITE_ONCE(*refill_ring.ktail, ++refill_ring.rq_tail);
 
+Notifications
+-------------
+
+When zero-copy receive encounters conditions that impact performance or
+functionality, the kernel can notify userspace via dedicated CQE notifications.
+The application must register a notification descriptor during
+``IORING_REGISTER_ZCRX_IFQ`` to receive them. Notifications are sent
+individually and are not batched with other CQEs. Each notification CQE reports
+a single notification in ``cqe->res``.
+
+Supported features can be detected by checking for ``ZCRX_FEATURE_NOTIFICATION``
+in the features bitmask returned by ``IO_URING_QUERY_ZCRX``.
+
+**Notification types**
+
+``ZCRX_NOTIF_NO_BUFFERS``
+  Fired when the page pool fails to allocate because the zcrx buffer area is
+  exhausted.
+
+``ZCRX_NOTIF_COPY``
+  Fired when a received fragment could not be delivered zero-copy and was
+  instead copied into a buffer.
+
+**Registering notifications**
+
+Allocate and fill a ``struct zcrx_notification_desc``::
+
+  struct zcrx_notification_desc notif = {
+    .user_data = MY_NOTIF_USER_DATA,
+    .type_mask = (1 << ZCRX_NOTIF_NO_BUFFERS) | (1 << ZCRX_NOTIF_COPY),
+  };
+
+  reg.notif_desc = (__u64)(unsigned long)&notif;
+
+``user_data`` is the value that will appear in the notification CQE's
+``user_data`` field. ``type_mask`` is a bitmask of the notification types the
+application wants to receive, with type ``t`` selected by bit ``1 << t``.
+
+When a registered event occurs, the kernel posts a CQE with the specified
+``user_data`` and ``cqe->res`` set to the notification type that fired, one
+CQE per event.
+
+**Rate limiting**
+
+Each notification type fires once until the application explicitly re-arms it.
+To re-arm, issue ``IORING_REGISTER_ZCRX_CTRL`` with
+``ZCRX_CTRL_ARM_NOTIFICATION``::
+
+  struct zcrx_ctrl ctrl = {
+    .zcrx_id = zcrx_id,
+    .op = ZCRX_CTRL_ARM_NOTIFICATION,
+    .zc_arm_notif = {
+      .notif_type = ZCRX_NOTIF_NO_BUFFERS,
+    },
+  };
+
+  io_uring_register(ring_fd, IORING_REGISTER_ZCRX_CTRL, &ctrl, 0);
+
+Only notification types that have previously fired can be re-armed.
+
+Notification statistics
+-----------------------
+
+In addition to CQE-based notifications, the kernel can maintain a shared-memory
+statistics structure that is updated on every relevant event. All stats are
+updated regardless of which notification types were registered.
+
+The statistics structure layout and alignment requirements can be queried via
+``IO_URING_QUERY_ZCRX_NOTIF``. The application must query the structure size
+and alignment requirements so that it allocates enough memory for the region
+to fit both the refill ring and the stats structure::
+
+  struct io_uring_query_zcrx_notif notif_query = {};
+  struct io_uring_query_hdr hdr = {
+    .query_op = IO_URING_QUERY_ZCRX_NOTIF,
+    .size = sizeof(notif_query),
+    .query_data = (__u64)(unsigned long)&notif_query,
+  };
+
+  io_uring_register(ring_fd, IORING_REGISTER_QUERY, &hdr, 1);
+
+  __u32 notif_stats_size = notif_query.notif_stats_size;
+  __u32 notif_stats_off_alignment = notif_query.notif_stats_off_alignment;
+
+To enable statistics, place the stats structure after the refill ring entries
+within the same mapped region, and set the ``ZCRX_NOTIF_DESC_FLAG_STATS`` flag
+in the notification descriptor::
+
+  /* Compute offset for the stats struct (after refill ring entries) */
+  size_t stats_offset = ALIGN_UP(ring_size, notif_stats_off_alignment);
+  ring_size = stats_offset + notif_stats_size;
+  ring_size = ALIGN_UP(ring_size, PAGE_SIZE);
+
+  /* Map the region with the extra space */
+  ring_ptr = mmap(NULL, ring_size, PROT_READ | PROT_WRITE,
+                  MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
+
+  struct zcrx_notification_desc notif = {
+    .user_data = MY_NOTIF_USER_DATA,
+    .type_mask = 1 << ZCRX_NOTIF_COPY,
+    .flags = ZCRX_NOTIF_DESC_FLAG_STATS,
+    .stats_offset = stats_offset,
+  };
+
+The ``stats_offset`` must satisfy the alignment reported by
+``notif_stats_off_alignment`` and must point to a location within the mapped
+region that does not overlap with the refill ring header or entries.
+
+The application can read the stat counters at any time::
+
+  volatile struct io_uring_zcrx_notif_stats *stats =
+    (void *)((char *)ring_ptr + stats_offset);
+
+  printf("copy fallbacks: %llu (%llu bytes)\n",
+         IO_URING_READ_ONCE(stats->copy_count),
+         IO_URING_READ_ONCE(stats->copy_bytes));
+
+``copy_count`` is incremented each time a fragment is copied instead of being
+delivered via zero-copy. ``copy_bytes`` accumulates the total number of bytes
+copied.
+
 Area chunking
 -------------
 
-- 
2.53.0-Meta



* [PATCH v2 6/6] selftests: iou-zcrx: add notification and stats test for zcrx
  2026-05-18 15:35 [PATCH v2 0/6] io_uring/zcrx: add CQE based notifications and stats reporting Clément Léger
                   ` (4 preceding siblings ...)
  2026-05-18 15:35 ` [PATCH v2 5/6] Documentation: networking: document zcrx notifications and statistics Clément Léger
@ 2026-05-18 15:35 ` Clément Léger
  2026-05-19 11:43 ` [PATCH v2 0/6] io_uring/zcrx: add CQE based notifications and stats reporting Pavel Begunkov
  6 siblings, 0 replies; 10+ messages in thread
From: Clément Léger @ 2026-05-18 15:35 UTC (permalink / raw)
  To: io-uring, Pavel Begunkov, Jens Axboe
  Cc: Clément Léger, linux-doc, linux-kernel, linux-kselftest,
	netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Jonathan Corbet, Shuah Khan,
	Vishwanath Seshagiri

Add a selftest to verify that ZCRX notifications are properly delivered
to userspace when zero-copy RX falls back to copying or runs out of
buffers, and that the shared-memory notification stats (copy_count,
copy_bytes) are correctly incremented on copy fallback.

The test registers a notification descriptor during
IORING_REGISTER_ZCRX_IFQ with a stats region placed after the refill
queue entries. A new -n flag verifies that the copy fallback is
triggered, and the -b and -a flags allow checking for the out-of-buffer
notification.

To reliably trigger copy fallback, the Python test uses a new
single_no_flow() setup variant that configures tcp-data-split and RSS
but without an ethtool flow rule. Without flow steering, traffic arrives
on non-zcrx queues as regular pages, forcing the kernel copy-fallback
path in io_zcrx_copy_frag().

Out-of-buffer notification is verified by using a smaller receive area
and by avoiding recycling the buffers so that the kernel runs out of
buffers quickly.

Signed-off-by: Clément Léger <cleger@meta.com>
---
 .../selftests/drivers/net/hw/iou-zcrx.c       | 114 ++++++++++++++++--
 .../selftests/drivers/net/hw/iou-zcrx.py      |  49 +++++++-
 2 files changed, 151 insertions(+), 12 deletions(-)
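
The selftest places the stats structure directly at the end of the refill
ring. For callers that instead take the size and offset alignment from
IO_URING_QUERY_ZCRX_NOTIF, the region layout could be computed roughly as
below. This is a sketch under assumptions: stats_layout, align_up() and
place_stats() are hypothetical helpers, and both the reported alignment and
the page size are assumed to be powers of two.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type: where the stats live and how much to mmap. */
struct stats_layout {
	size_t stats_offset;	/* offset of the stats struct in the region */
	size_t region_size;	/* total region size, page aligned */
};

/* Round v up to the next multiple of a; a must be a power of two. */
static size_t align_up(size_t v, size_t a)
{
	assert(a && (a & (a - 1)) == 0);
	return (v + a - 1) & ~(a - 1);
}

/*
 * Place the stats struct after the refill ring (header plus entries),
 * honouring the offset alignment, then grow the region to a whole
 * number of pages.
 */
static struct stats_layout place_stats(size_t ring_size, size_t stats_size,
				       size_t stats_align, size_t page_size)
{
	struct stats_layout l;

	l.stats_offset = align_up(ring_size, stats_align);
	l.region_size = align_up(l.stats_offset + stats_size, page_size);
	return l;
}
```

With a 4096-byte ring, a 16-byte stats struct, 8-byte alignment and 4096-byte
pages, this gives stats_offset 4096 and a region of 8192 bytes, so the stats
never overlap the refill ring header or entries.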

diff --git a/tools/testing/selftests/drivers/net/hw/iou-zcrx.c b/tools/testing/selftests/drivers/net/hw/iou-zcrx.c
index 240d13dbc54e..78a43ede77ed 100644
--- a/tools/testing/selftests/drivers/net/hw/iou-zcrx.c
+++ b/tools/testing/selftests/drivers/net/hw/iou-zcrx.c
@@ -52,7 +52,27 @@ struct t_io_uring_zcrx_ifq_reg {
 	struct io_uring_zcrx_offsets offsets;
 	__u32	zcrx_id;
 	__u32	rx_buf_len;
-	__u64	__resv[3];
+	__u64	notif_desc;
+	__u64	__resv[2];
+};
+
+#define ZCRX_NOTIF_NO_BUFFERS		0
+#define ZCRX_NOTIF_COPY			1
+#define ZCRX_NOTIF_DESC_FLAG_STATS	(1 << 0)
+
+#define NOTIF_USER_DATA			3
+
+struct t_zcrx_notification_desc {
+	__u64	user_data;
+	__u32	type_mask;
+	__u32	flags;
+	__u64	stats_offset;
+	__u64	__resv2[9];
+};
+
+struct t_io_uring_zcrx_notif_stats {
+	__u64	copy_count;
+	__u64	copy_bytes;
 };
 
 static long page_size;
@@ -84,7 +104,10 @@ static int cfg_oneshot_recvs;
 static int cfg_send_size = SEND_SIZE;
 static struct sockaddr_in6 cfg_addr;
 static unsigned int cfg_rx_buf_len;
+static size_t cfg_area_size;
 static bool cfg_dry_run;
+static bool cfg_copy_fallback;
+static bool cfg_no_buffers;
 
 static char *payload;
 static void *area_ptr;
@@ -95,6 +118,9 @@ static unsigned long area_token;
 static int connfd;
 static bool stop;
 static size_t received;
+static unsigned int received_notif_type;
+static bool received_notif;
+static size_t notif_stats_offset;
 
 static unsigned long gettimeofday_ms(void)
 {
@@ -142,6 +168,7 @@ static void setup_zcrx(struct io_uring *ring)
 {
 	unsigned int ifindex;
 	unsigned int rq_entries = 4096;
+	size_t area_size = cfg_area_size ? cfg_area_size : AREA_SIZE;
 	int ret;
 
 	ifindex = if_nametoindex(cfg_ifname);
@@ -150,7 +177,7 @@ static void setup_zcrx(struct io_uring *ring)
 
 	if (cfg_rx_buf_len && cfg_rx_buf_len != page_size) {
 		area_ptr = mmap(NULL,
-				AREA_SIZE,
+				area_size,
 				PROT_READ | PROT_WRITE,
 				MAP_ANONYMOUS | MAP_PRIVATE |
 				MAP_HUGETLB | MAP_HUGE_2MB,
@@ -162,7 +189,7 @@ static void setup_zcrx(struct io_uring *ring)
 		}
 	} else {
 		area_ptr = mmap(NULL,
-				AREA_SIZE,
+				area_size,
 				PROT_READ | PROT_WRITE,
 				MAP_ANONYMOUS | MAP_PRIVATE,
 				0,
@@ -172,6 +199,12 @@ static void setup_zcrx(struct io_uring *ring)
 	}
 
 	ring_size = get_refill_ring_size(rq_entries);
+
+	if (cfg_copy_fallback) {
+		notif_stats_offset = ring_size;
+		ring_size += ALIGN_UP(sizeof(struct t_io_uring_zcrx_notif_stats), page_size);
+	}
+
 	ring_ptr = mmap(NULL,
 			ring_size,
 			PROT_READ | PROT_WRITE,
@@ -187,10 +220,11 @@ static void setup_zcrx(struct io_uring *ring)
 
 	struct io_uring_zcrx_area_reg area_reg = {
 		.addr = (__u64)(unsigned long)area_ptr,
-		.len = AREA_SIZE,
+		.len = area_size,
 		.flags = 0,
 	};
 
+	struct t_zcrx_notification_desc notif_desc;
 	struct t_io_uring_zcrx_ifq_reg reg = {
 		.if_idx = ifindex,
 		.if_rxq = cfg_queue_id,
@@ -200,11 +234,32 @@ static void setup_zcrx(struct io_uring *ring)
 		.rx_buf_len = cfg_rx_buf_len,
 	};
 
+	if (cfg_copy_fallback || cfg_no_buffers) {
+		__u32 type_mask = 0;
+
+		if (cfg_copy_fallback)
+			type_mask = 1 << ZCRX_NOTIF_COPY;
+		if (cfg_no_buffers)
+			type_mask = 1 << ZCRX_NOTIF_NO_BUFFERS;
+
+		memset(&notif_desc, 0, sizeof(notif_desc));
+		notif_desc.user_data = NOTIF_USER_DATA;
+		notif_desc.type_mask = type_mask;
+		if (cfg_copy_fallback) {
+			notif_desc.flags = ZCRX_NOTIF_DESC_FLAG_STATS;
+			notif_desc.stats_offset = notif_stats_offset;
+		}
+		reg.notif_desc = (__u64)(unsigned long)&notif_desc;
+	}
+
 	ret = io_uring_register_ifq(ring, (void *)&reg);
 	if (cfg_rx_buf_len && (ret == -EINVAL || ret == -EOPNOTSUPP ||
 			       ret == -ERANGE)) {
 		printf("Large chunks are not supported %i\n", ret);
 		exit(SKIP_CODE);
+	} else if ((cfg_copy_fallback || cfg_no_buffers) && ret == -EINVAL) {
+		printf("Notifications not supported %i\n", ret);
+		exit(SKIP_CODE);
 	} else if (ret) {
 		error(1, 0, "io_uring_register_ifq(): %d", ret);
 	}
@@ -304,10 +359,13 @@ static void process_recvzc(struct io_uring *ring, struct io_uring_cqe *cqe)
 	}
 	received += n;
 
-	rqe = &rq_ring.rqes[(rq_ring.rq_tail & rq_mask)];
-	rqe->off = (rcqe->off & ~IORING_ZCRX_AREA_MASK) | area_token;
-	rqe->len = cqe->res;
-	io_uring_smp_store_release(rq_ring.ktail, ++rq_ring.rq_tail);
+	/* Skip ring refill so that we run out of buffers quickly */
+	if (!cfg_no_buffers) {
+		rqe = &rq_ring.rqes[(rq_ring.rq_tail & rq_mask)];
+		rqe->off = (rcqe->off & ~IORING_ZCRX_AREA_MASK) | area_token;
+		rqe->len = cqe->res;
+		io_uring_smp_store_release(rq_ring.ktail, ++rq_ring.rq_tail);
+	}
 }
 
 static void server_loop(struct io_uring *ring)
@@ -324,8 +382,16 @@ static void server_loop(struct io_uring *ring)
 			process_accept(ring, cqe);
 		else if (cqe->user_data == 2)
 			process_recvzc(ring, cqe);
-		else
+		else if ((cfg_copy_fallback || cfg_no_buffers) &&
+			 cqe->user_data == NOTIF_USER_DATA) {
+			received_notif_type |= cqe->res;
+			received_notif = true;
+			if (cfg_no_buffers &&
+			    (cqe->res == ZCRX_NOTIF_NO_BUFFERS))
+				stop = true;
+		} else {
 			error(1, 0, "unknown cqe");
+		}
 		count++;
 	}
 	io_uring_cq_advance(ring, count);
@@ -374,6 +440,23 @@ static void run_server(void)
 
 	if (!stop)
 		error(1, 0, "test failed\n");
+
+	if (cfg_copy_fallback) {
+		struct t_io_uring_zcrx_notif_stats *stats =
+			(void *)((char *)ring_ptr + notif_stats_offset);
+
+		if (!received_notif || received_notif_type != ZCRX_NOTIF_COPY)
+			error(1, 0, "expected copy fallback notification");
+		if (!IO_URING_READ_ONCE(stats->copy_count))
+			error(1, 0, "expected copy_count > 0");
+		if (!IO_URING_READ_ONCE(stats->copy_bytes))
+			error(1, 0, "expected copy_bytes > 0");
+	}
+
+	if (cfg_no_buffers) {
+		if (!received_notif || received_notif_type != ZCRX_NOTIF_NO_BUFFERS)
+			error(1, 0, "expected no-buffers notification");
+	}
 }
 
 static void run_client(void)
@@ -425,7 +508,7 @@ static void parse_opts(int argc, char **argv)
 		usage(argv[0]);
 	cfg_payload_len = max_payload_len;
 
-	while ((c = getopt(argc, argv, "sch:p:l:i:q:o:z:x:d")) != -1) {
+	while ((c = getopt(argc, argv, "sch:p:l:i:q:o:z:x:a:dnb")) != -1) {
 		switch (c) {
 		case 's':
 			if (cfg_client)
@@ -466,8 +549,19 @@ static void parse_opts(int argc, char **argv)
 		case 'd':
 			cfg_dry_run = true;
 			break;
+		case 'n':
+			cfg_copy_fallback = true;
+			break;
+		case 'b':
+			cfg_no_buffers = true;
+			break;
+		case 'a':
+			cfg_area_size = strtoul(optarg, NULL, 0) * page_size;
+			break;
 		}
 	}
+	if (cfg_copy_fallback && cfg_no_buffers)
+		error(1, 0, "Pass one of -n or -b");
 
 	if (cfg_server && addr)
 		error(1, 0, "Receiver cannot have -h specified");
diff --git a/tools/testing/selftests/drivers/net/hw/iou-zcrx.py b/tools/testing/selftests/drivers/net/hw/iou-zcrx.py
index e81724cb5542..82b4f4777182 100755
--- a/tools/testing/selftests/drivers/net/hw/iou-zcrx.py
+++ b/tools/testing/selftests/drivers/net/hw/iou-zcrx.py
@@ -41,7 +41,9 @@ def set_flow_rule_rss(cfg, rss_ctx_id):
     return int(values)
 
 
-def single(cfg):
+def single_no_flow(cfg):
+    """Like single() but without a flow rule."""
+
     channels = cfg.ethnl.channels_get({'header': {'dev-index': cfg.ifindex}})
     channels = channels['combined-count']
     if channels < 2:
@@ -65,6 +67,9 @@ def single(cfg):
     ethtool(f"-X {cfg.ifname} equal {cfg.target}")
     defer(ethtool, f"-X {cfg.ifname} default")
 
+def single(cfg):
+    single_no_flow(cfg)
+
     flow_rule_id = set_flow_rule(cfg)
     defer(ethtool, f"-N {cfg.ifname} delete {flow_rule_id}")
 
@@ -130,6 +135,26 @@ def test_zcrx_oneshot(cfg, setup) -> None:
         cmd(tx_cmd, host=cfg.remote)
 
 
+@ksft_variants([
+    KsftNamedVariant("single", single_no_flow),
+])
+def test_zcrx_notif_copy_fallback(cfg, setup) -> None:
+    """Test zcrx copy fallback notification.
+
+    Omits the flow rule so traffic arrives on non-zcrx queues as regular
+    pages, forcing the kernel copy-fallback path. Asserts that the
+    ZCRX_NOTIF_COPY notification CQE is delivered."""
+
+    cfg.require_ipver('6')
+
+    setup(cfg)
+    rx_cmd = f"{cfg.bin_local} -s -p {cfg.port} -i {cfg.ifname} -q {cfg.target} -n"
+    tx_cmd = f"{cfg.bin_remote} -c -h {cfg.addr_v['6']} -p {cfg.port} -l 12840"
+    with bkg(rx_cmd, exit_wait=True):
+        wait_port_listen(cfg.port, proto="tcp")
+        cmd(tx_cmd, host=cfg.remote)
+
+
 def test_zcrx_large_chunks(cfg) -> None:
     """Test zcrx with large buffer chunks."""
 
@@ -157,6 +182,25 @@ def test_zcrx_large_chunks(cfg) -> None:
         cmd(tx_cmd, host=cfg.remote)
 
 
+@ksft_variants([
+    KsftNamedVariant("single", single),
+])
+def test_zcrx_notif_no_buffers(cfg, setup) -> None:
+    """Test zcrx out-of-buffer notification.
+
+    Skips buffer refill so the pool is quickly exhausted, triggering
+    a ZCRX_NOTIF_NO_BUFFERS notification CQE."""
+
+    cfg.require_ipver('6')
+
+    setup(cfg)
+    rx_cmd = f"{cfg.bin_local} -s -p {cfg.port} -i {cfg.ifname} -q {cfg.target} -b -a 64"
+    tx_cmd = f"{cfg.bin_remote} -c -h {cfg.addr_v['6']} -p {cfg.port} -l 12840"
+    with bkg(rx_cmd, exit_wait=True):
+        wait_port_listen(cfg.port, proto="tcp")
+        cmd(tx_cmd, host=cfg.remote, fail=False)
+
+
 def main() -> None:
     with NetDrvEpEnv(__file__) as cfg:
         cfg.bin_local = path.abspath(path.dirname(__file__) + "/../../../drivers/net/hw/iou-zcrx")
@@ -166,7 +210,8 @@ def main() -> None:
         cfg.netnl = NetdevFamily()
         cfg.port = rand_port()
         ksft_run(globs=globals(), cases=[test_zcrx, test_zcrx_oneshot,
-                                        test_zcrx_large_chunks], args=(cfg, ))
+                                        test_zcrx_large_chunks, test_zcrx_notif_copy_fallback,
+                                        test_zcrx_notif_no_buffers], args=(cfg, ))
     ksft_exit()
 
 
-- 
2.53.0-Meta



* Re: [PATCH v2 0/6] io_uring/zcrx: add CQE based notifications and stats reporting
  2026-05-18 15:35 [PATCH v2 0/6] io_uring/zcrx: add CQE based notifications and stats reporting Clément Léger
                   ` (5 preceding siblings ...)
  2026-05-18 15:35 ` [PATCH v2 6/6] selftests: iou-zcrx: add notification and stats test for zcrx Clément Léger
@ 2026-05-19 11:43 ` Pavel Begunkov
  6 siblings, 0 replies; 10+ messages in thread
From: Pavel Begunkov @ 2026-05-19 11:43 UTC (permalink / raw)
  To: Clément Léger, io-uring, Jens Axboe
  Cc: linux-doc, linux-kernel, linux-kselftest, netdev, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Jonathan Corbet, Shuah Khan, Vishwanath Seshagiri

On 5/18/26 16:35, Clément Léger wrote:
> The zcrx path can encounter various conditions that lead to internal
> fallbacks or errors. These errors can have a large impact on performance
> and functionality but are not yet not being reported to the user which
> is then unable to take action.
>
> This series addresses this problem by adding a new notification system
> paired with a statistics structure. The notification system currently
> report out of buffer and packets that fallback to copy. The statistics
> structure report the number and total size of packets that were copied
> rather than received via the zero-copy path.
> 
> The out of buffer notification allows the user to actually adjust the
> buffer sizing when registering zcrx support for the ifq. Some future
> work could allow the user to add more memory on the fly to the pool so
> the page allocator doesn't run out of memory.

Looks good, I'm going to take the first 4 and send out with other
zcrx patches.

-- 
Pavel Begunkov



* Re: [PATCH v2 1/6] io_uring/zcrx: add ctx pointer to zcrx
  2026-05-18 15:35 ` [PATCH v2 1/6] io_uring/zcrx: add ctx pointer to zcrx Clément Léger
@ 2026-05-19 15:19   ` Vishwanath Seshagiri
  0 siblings, 0 replies; 10+ messages in thread
From: Vishwanath Seshagiri @ 2026-05-19 15:19 UTC (permalink / raw)
  To: Clément Léger
  Cc: io-uring, Pavel Begunkov, Jens Axboe, linux-doc, linux-kernel,
	linux-kselftest, netdev, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Jonathan Corbet,
	Shuah Khan

On Mon, May 18, 2026 at 8:36 AM Clément Léger <cleger@meta.com> wrote:
>
> From: Pavel Begunkov <asml.silence@gmail.com>
>
> zcrx will need to have a pointer to an owning ctx to communicate
> different events. Reference the ctx while it's attached to zcrx, and
> rely on zcrx termination to drop the ctx to avoid circular ref deps.
>
> Co-developed-by: Vishwanath Seshagiri <vishs@meta.com>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Vishwanath Seshagiri <vishs@meta.com>
> ---
>  io_uring/zcrx.c | 39 +++++++++++++++++++++++++++++++--------
>  io_uring/zcrx.h |  3 +++
>  2 files changed, 34 insertions(+), 8 deletions(-)
>
> diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
> index 3f9632e7790a..34faf90423f4 100644
> --- a/io_uring/zcrx.c
> +++ b/io_uring/zcrx.c
> @@ -44,6 +44,17 @@ static inline struct io_zcrx_area *io_zcrx_iov_to_area(const struct net_iov *nio
>         return container_of(owner, struct io_zcrx_area, nia);
>  }
>
> +static bool zcrx_set_ring_ctx(struct io_zcrx_ifq *zcrx,
> +                             struct io_ring_ctx *ctx)
> +{
> +       guard(spinlock_bh)(&zcrx->ctx_lock);
> +       if (zcrx->master_ctx)
> +               return false;
> +       percpu_ref_get(&ctx->refs);
> +       zcrx->master_ctx = ctx;
> +       return true;
> +}
> +
>  static inline struct page *io_zcrx_iov_page(const struct net_iov *niov)
>  {
>         struct io_zcrx_area *area = io_zcrx_iov_to_area(niov);
> @@ -531,6 +542,7 @@ static struct io_zcrx_ifq *io_zcrx_ifq_alloc(struct io_ring_ctx *ctx)
>                 return NULL;
>
>         ifq->if_rxq = -1;
> +       spin_lock_init(&ifq->ctx_lock);
>         spin_lock_init(&ifq->rq.lock);
>         mutex_init(&ifq->pp_lock);
>         refcount_set(&ifq->refs, 1);
> @@ -580,6 +592,8 @@ static void io_zcrx_ifq_free(struct io_zcrx_ifq *ifq)
>                 return;
>         if (WARN_ON_ONCE(ifq->netdev != NULL))
>                 return;
> +       if (WARN_ON_ONCE(ifq->master_ctx))
> +               return;
>
>         if (ifq->area)
>                 io_zcrx_free_area(ifq, ifq->area);
> @@ -656,17 +670,24 @@ static void io_zcrx_scrub(struct io_zcrx_ifq *ifq)
>         }
>  }
>
> -static void zcrx_unregister_user(struct io_zcrx_ifq *ifq)
> +static void zcrx_unregister_user(struct io_zcrx_ifq *ifq, struct io_ring_ctx *ctx)
>  {
> +       scoped_guard(spinlock_bh, &ifq->ctx_lock) {
> +               if (ctx && ifq->master_ctx == ctx) {
> +                       ifq->master_ctx = NULL;
> +                       percpu_ref_put(&ctx->refs);
> +               }
> +       }
> +
>         if (refcount_dec_and_test(&ifq->user_refs)) {
>                 io_close_queue(ifq);
>                 io_zcrx_scrub(ifq);
>         }
>  }
>
> -static void zcrx_unregister(struct io_zcrx_ifq *ifq)
> +static void zcrx_unregister(struct io_zcrx_ifq *ifq, struct io_ring_ctx *ctx)
>  {
> -       zcrx_unregister_user(ifq);
> +       zcrx_unregister_user(ifq, ctx);
>         io_put_zcrx_ifq(ifq);
>  }
>
> @@ -686,7 +707,7 @@ static int zcrx_box_release(struct inode *inode, struct file *file)
>
>         if (WARN_ON_ONCE(!ifq))
>                 return -EFAULT;
> -       zcrx_unregister(ifq);
> +       zcrx_unregister(ifq, NULL);
>         return 0;
>  }
>
> @@ -711,7 +732,7 @@ static int zcrx_export(struct io_ring_ctx *ctx, struct io_zcrx_ifq *ifq,
>         file = anon_inode_create_getfile("[zcrx]", &zcrx_box_fops,
>                                          ifq, O_CLOEXEC, NULL);
>         if (IS_ERR(file)) {
> -               zcrx_unregister(ifq);
> +               zcrx_unregister(ifq, NULL);
>                 return PTR_ERR(file);
>         }
>
> @@ -787,7 +808,7 @@ static int import_zcrx(struct io_ring_ctx *ctx,
>         scoped_guard(mutex, &ctx->mmap_lock)
>                 xa_erase(&ctx->zcrx_ctxs, id);
>  err:
> -       zcrx_unregister(ifq);
> +       zcrx_unregister(ifq, ctx);
>         return ret;
>  }
>
> @@ -932,12 +953,14 @@ int io_register_zcrx(struct io_ring_ctx *ctx,
>                 ret = -EFAULT;
>                 goto err;
>         }
> +
> +       zcrx_set_ring_ctx(ifq, ctx);
>         return 0;
>  err:
>         scoped_guard(mutex, &ctx->mmap_lock)
>                 xa_erase(&ctx->zcrx_ctxs, id);
>  ifq_free:
> -       zcrx_unregister(ifq);
> +       zcrx_unregister(ifq, ctx);
>         return ret;
>  }
>
> @@ -967,7 +990,7 @@ void io_terminate_zcrx(struct io_ring_ctx *ctx)
>                         break;
>                 set_zcrx_entry_mark(ctx, id);
>                 id++;
> -               zcrx_unregister_user(ifq);
> +               zcrx_unregister_user(ifq, ctx);
>         }
>  }
>
> diff --git a/io_uring/zcrx.h b/io_uring/zcrx.h
> index 9e1a6a1b11e8..6b565d0bf6da 100644
> --- a/io_uring/zcrx.h
> +++ b/io_uring/zcrx.h
> @@ -73,6 +73,9 @@ struct io_zcrx_ifq {
>          */
>         struct mutex                    pp_lock;
>         struct io_mapped_region         rq_region;
> +
> +       spinlock_t                      ctx_lock;
> +       struct io_ring_ctx              *master_ctx;
>  };
>
>  #if defined(CONFIG_IO_URING_ZCRX)
> --
> 2.53.0-Meta
>


* Re: [PATCH v2 2/6] io_uring/zcrx: notify user when out of buffers
  2026-05-18 15:35 ` [PATCH v2 2/6] io_uring/zcrx: notify user when out of buffers Clément Léger
@ 2026-05-19 15:21   ` Vishwanath Seshagiri
  0 siblings, 0 replies; 10+ messages in thread
From: Vishwanath Seshagiri @ 2026-05-19 15:21 UTC (permalink / raw)
  To: Clément Léger
  Cc: io-uring, Pavel Begunkov, Jens Axboe, linux-doc, linux-kernel,
	linux-kselftest, netdev, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Jonathan Corbet,
	Shuah Khan, Vishwanath Seshagiri

On Mon, May 18, 2026 at 8:36 AM Clément Léger <cleger@meta.com> wrote:
>
> From: Pavel Begunkov <asml.silence@gmail.com>
>
> There are currently no easy ways for the user to know if zcrx is out of
> buffers and page pool fails to allocate. Add uapi for zcrx to communicate
> it back.
>
> It's implemented as a separate CQE, which for now is posted to the creator
> ctx. To use it, on registration the user space needs to pass an instance
> of struct zcrx_notification_desc, which tells the kernel the user_data
> for resulting CQEs and which event types are expected / allowed.
>
> When an allowed event happens, zcrx will post a CQE containing the
> specified user_data, and lower bits of cqe->res will be set to the event
> mask. Before the kernel could post another notification of the given
> type, the user needs to acknowledge that it processed the previous one
> by issuing IORING_REGISTER_ZCRX_CTRL with ZCRX_CTRL_ARM_NOTIFICATION.
>
> The only notification type the patch implements is
> ZCRX_NOTIF_NO_BUFFERS, but we'll need more of them in the future.
>
> Co-developed-by: Vishwanath Seshagiri <vishs@meta.com>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Vishwanath Seshagiri <vishs@meta.com>
> ---
>  include/uapi/linux/io_uring/zcrx.h | 24 ++++++++-
>  io_uring/io_uring.c                |  2 +-
>  io_uring/io_uring.h                |  1 +
>  io_uring/zcrx.c                    | 86 +++++++++++++++++++++++++++++-
>  io_uring/zcrx.h                    |  7 ++-
>  5 files changed, 115 insertions(+), 5 deletions(-)
>
> diff --git a/include/uapi/linux/io_uring/zcrx.h b/include/uapi/linux/io_uring/zcrx.h
> index 5ce02c7a6096..67185566ad3c 100644
> --- a/include/uapi/linux/io_uring/zcrx.h
> +++ b/include/uapi/linux/io_uring/zcrx.h
> @@ -65,6 +65,20 @@ enum zcrx_features {
>          * value in struct io_uring_zcrx_ifq_reg::rx_buf_len.
>          */
>         ZCRX_FEATURE_RX_PAGE_SIZE       = 1 << 0,
> +       ZCRX_FEATURE_NOTIFICATION       = 1 << 1,
> +};
> +
> +enum zcrx_notification_type {
> +       ZCRX_NOTIF_NO_BUFFERS,
> +
> +       __ZCRX_NOTIF_TYPE_LAST,
> +};
> +
> +struct zcrx_notification_desc {
> +       __u64   user_data;
> +       __u32   type_mask;
> +       __u32   __resv1;
> +       __u64   __resv2[10];
>  };
>
>  /*
> @@ -82,12 +96,14 @@ struct io_uring_zcrx_ifq_reg {
>         struct io_uring_zcrx_offsets offsets;
>         __u32   zcrx_id;
>         __u32   rx_buf_len;
> -       __u64   __resv[3];
> +       __u64   notif_desc; /* see struct zcrx_notification_desc */
> +       __u64   __resv[2];
>  };
>
>  enum zcrx_ctrl_op {
>         ZCRX_CTRL_FLUSH_RQ,
>         ZCRX_CTRL_EXPORT,
> +       ZCRX_CTRL_ARM_NOTIFICATION,
>
>         __ZCRX_CTRL_LAST,
>  };
> @@ -101,6 +117,11 @@ struct zcrx_ctrl_export {
>         __u32           __resv1[11];
>  };
>
> +struct zcrx_ctrl_arm_notif {
> +       __u32           notif_type;
> +       __u32           __resv[11];
> +};
> +
>  struct zcrx_ctrl {
>         __u32   zcrx_id;
>         __u32   op; /* see enum zcrx_ctrl_op */
> @@ -109,6 +130,7 @@ struct zcrx_ctrl {
>         union {
>                 struct zcrx_ctrl_export         zc_export;
>                 struct zcrx_ctrl_flush_rq       zc_flush;
> +               struct zcrx_ctrl_arm_notif      zc_arm_notif;
>         };
>  };
>
> diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
> index 2ebb0ba37c4f..c5972274cce1 100644
> --- a/io_uring/io_uring.c
> +++ b/io_uring/io_uring.c
> @@ -160,7 +160,7 @@ static void io_poison_cached_req(struct io_kiocb *req)
>         req->apoll = IO_URING_PTR_POISON;
>  }
>
> -static void io_poison_req(struct io_kiocb *req)
> +void io_poison_req(struct io_kiocb *req)
>  {
>         io_poison_cached_req(req);
>         req->async_data = IO_URING_PTR_POISON;
> diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
> index e612a66ee80e..de0a3bed58d1 100644
> --- a/io_uring/io_uring.h
> +++ b/io_uring/io_uring.h
> @@ -213,6 +213,7 @@ bool __io_alloc_req_refill(struct io_ring_ctx *ctx);
>
>  void io_activate_pollwq(struct io_ring_ctx *ctx);
>  void io_restriction_clone(struct io_restriction *dst, struct io_restriction *src);
> +void io_poison_req(struct io_kiocb *req);
>
>  static inline void io_lockdep_assert_cq_locked(struct io_ring_ctx *ctx)
>  {
> diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
> index 34faf90423f4..463fbaead35b 100644
> --- a/io_uring/zcrx.c
> +++ b/io_uring/zcrx.c
> @@ -768,6 +768,8 @@ static int import_zcrx(struct io_ring_ctx *ctx,
>                 return -EINVAL;
>         if (reg->if_rxq || reg->rq_entries || reg->area_ptr || reg->region_ptr)
>                 return -EINVAL;
> +       if (reg->notif_desc)
> +               return -EINVAL;
>         if (reg->flags & ~ZCRX_REG_IMPORT)
>                 return -EINVAL;
>
> @@ -856,6 +858,7 @@ static int zcrx_register_netdev(struct io_zcrx_ifq *ifq,
>  int io_register_zcrx(struct io_ring_ctx *ctx,
>                      struct io_uring_zcrx_ifq_reg __user *arg)
>  {
> +       struct zcrx_notification_desc notif;
>         struct io_uring_zcrx_area_reg area;
>         struct io_uring_zcrx_ifq_reg reg;
>         struct io_uring_region_desc rd;
> @@ -899,10 +902,22 @@ int io_register_zcrx(struct io_ring_ctx *ctx,
>         if (copy_from_user(&area, u64_to_user_ptr(reg.area_ptr), sizeof(area)))
>                 return -EFAULT;
>
> +       memset(&notif, 0, sizeof(notif));
> +       if (reg.notif_desc && copy_from_user(&notif, u64_to_user_ptr(reg.notif_desc),
> +                                            sizeof(notif)))
> +               return -EFAULT;
> +       if (notif.type_mask & ~ZCRX_NOTIF_TYPE_MASK)
> +               return -EINVAL;
> +       if (notif.__resv1 || !mem_is_zero(&notif.__resv2, sizeof(notif.__resv2)))
> +               return -EINVAL;
> +
>         ifq = io_zcrx_ifq_alloc(ctx);
>         if (!ifq)
>                 return -ENOMEM;
>
> +       ifq->notif_data = notif.user_data;
> +       ifq->allowed_notif_mask = notif.type_mask;
> +
>         if (ctx->user) {
>                 get_uid(ctx->user);
>                 ifq->user = ctx->user;
> @@ -954,7 +969,8 @@ int io_register_zcrx(struct io_ring_ctx *ctx,
>                 goto err;
>         }
>
> -       zcrx_set_ring_ctx(ifq, ctx);
> +       if (notif.type_mask)
> +               zcrx_set_ring_ctx(ifq, ctx);
>         return 0;
>  err:
>         scoped_guard(mutex, &ctx->mmap_lock)
> @@ -1127,6 +1143,48 @@ static unsigned io_zcrx_refill_slow(struct page_pool *pp, struct io_zcrx_ifq *if
>         return allocated;
>  }
>
> +static void zcrx_notif_tw(struct io_tw_req tw_req, io_tw_token_t tw)
> +{
> +       struct io_kiocb *req = tw_req.req;
> +       struct io_ring_ctx *ctx = req->ctx;
> +
> +       io_post_aux_cqe(ctx, req->cqe.user_data, req->cqe.res, 0);
> +       percpu_ref_put(&ctx->refs);
> +       io_poison_req(req);
> +       kmem_cache_free(req_cachep, req);
> +}
> +
> +static void zcrx_send_notif(struct io_zcrx_ifq *ifq, unsigned type)
> +{
> +       gfp_t gfp = GFP_ATOMIC | __GFP_NOWARN | __GFP_ZERO;
> +       u32 type_mask = 1 << type;
> +       struct io_kiocb *req;
> +
> +       if (!(type_mask & ifq->allowed_notif_mask))
> +               return;
> +
> +       guard(spinlock_bh)(&ifq->ctx_lock);
> +       if (!ifq->master_ctx)
> +               return;
> +       if (type_mask & ifq->fired_notifs)
> +               return;
> +
> +       req = kmem_cache_alloc(req_cachep, gfp);
> +       if (unlikely(!req))
> +               return;
> +
> +       ifq->fired_notifs |= type_mask;
> +
> +       req->opcode = IORING_OP_NOP;
> +       req->cqe.user_data = ifq->notif_data;
> +       req->cqe.res = type;
> +       req->ctx = ifq->master_ctx;
> +       percpu_ref_get(&req->ctx->refs);
> +       req->tctx = NULL;
> +       req->io_task_work.func = zcrx_notif_tw;
> +       io_req_task_work_add(req);
> +}
> +
>  static netmem_ref io_pp_zc_alloc_netmems(struct page_pool *pp, gfp_t gfp)
>  {
>         struct io_zcrx_ifq *ifq = io_pp_to_ifq(pp);
> @@ -1143,8 +1201,10 @@ static netmem_ref io_pp_zc_alloc_netmems(struct page_pool *pp, gfp_t gfp)
>                 goto out_return;
>
>         allocated = io_zcrx_refill_slow(pp, ifq, netmems, to_alloc);
> -       if (!allocated)
> +       if (!allocated) {
> +               zcrx_send_notif(ifq, ZCRX_NOTIF_NO_BUFFERS);
>                 return 0;
> +       }
>  out_return:
>         zcrx_sync_for_device(pp, ifq, netmems, allocated);
>         allocated--;
> @@ -1293,12 +1353,32 @@ static int zcrx_flush_rq(struct io_ring_ctx *ctx, struct io_zcrx_ifq *zcrx,
>         return 0;
>  }
>
> +static int zcrx_arm_notif(struct io_ring_ctx *ctx, struct io_zcrx_ifq *zcrx,
> +                         struct zcrx_ctrl *ctrl)
> +{
> +       const struct zcrx_ctrl_arm_notif *an = &ctrl->zc_arm_notif;
> +       unsigned type_mask;
> +
> +       if (an->notif_type >= __ZCRX_NOTIF_TYPE_LAST)
> +               return -EINVAL;
> +       if (!mem_is_zero(&an->__resv, sizeof(an->__resv)))
> +               return -EINVAL;
> +
> +       guard(spinlock_bh)(&zcrx->ctx_lock);
> +       type_mask = 1U << an->notif_type;
> +       if (type_mask & ~zcrx->fired_notifs)
> +               return -EINVAL;
> +       zcrx->fired_notifs &= ~type_mask;
> +       return 0;
> +}
> +
>  int io_zcrx_ctrl(struct io_ring_ctx *ctx, void __user *arg, unsigned nr_args)
>  {
>         struct zcrx_ctrl ctrl;
>         struct io_zcrx_ifq *zcrx;
>
>         BUILD_BUG_ON(sizeof(ctrl.zc_export) != sizeof(ctrl.zc_flush));
> +       BUILD_BUG_ON(sizeof(ctrl.zc_export) != sizeof(ctrl.zc_arm_notif));
>
>         if (nr_args)
>                 return -EINVAL;
> @@ -1316,6 +1396,8 @@ int io_zcrx_ctrl(struct io_ring_ctx *ctx, void __user *arg, unsigned nr_args)
>                 return zcrx_flush_rq(ctx, zcrx, &ctrl);
>         case ZCRX_CTRL_EXPORT:
>                 return zcrx_export(ctx, zcrx, &ctrl, arg);
> +       case ZCRX_CTRL_ARM_NOTIFICATION:
> +               return zcrx_arm_notif(ctx, zcrx, &ctrl);
>         }
>
>         return -EOPNOTSUPP;
> diff --git a/io_uring/zcrx.h b/io_uring/zcrx.h
> index 6b565d0bf6da..cca10d0d02ac 100644
> --- a/io_uring/zcrx.h
> +++ b/io_uring/zcrx.h
> @@ -9,7 +9,9 @@
>  #include <net/net_trackers.h>
>
>  #define ZCRX_SUPPORTED_REG_FLAGS       (ZCRX_REG_IMPORT | ZCRX_REG_NODEV)
> -#define ZCRX_FEATURES                  (ZCRX_FEATURE_RX_PAGE_SIZE)
> +#define ZCRX_FEATURES                  (ZCRX_FEATURE_RX_PAGE_SIZE |\
> +                                        ZCRX_FEATURE_NOTIFICATION)
> +#define ZCRX_NOTIF_TYPE_MASK           (1U << ZCRX_NOTIF_NO_BUFFERS)
>
>  struct io_zcrx_mem {
>         unsigned long                   size;
> @@ -76,6 +78,9 @@ struct io_zcrx_ifq {
>
>         spinlock_t                      ctx_lock;
>         struct io_ring_ctx              *master_ctx;
> +       u32                             allowed_notif_mask;
> +       u32                             fired_notifs;
> +       u64                             notif_data;
>  };
>
>  #if defined(CONFIG_IO_URING_ZCRX)
> --
> 2.53.0-Meta
>
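
For readers following the fired_notifs handling quoted above, here is a
minimal userspace sketch of the one-shot fire/rearm bookkeeping as I read it
from zcrx_send_notif() and zcrx_arm_notif(). It is not the kernel code: the
names notif_state, notif_fire() and notif_rearm() are hypothetical, and the
spinlock, the request allocation and the task work that actually posts the
CQE are left out. It only models the bitmask: a type fires at most once
until it is rearmed, and rearming a type that has not fired is rejected.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the notification types used in the patch. */
enum notif_type {
	NOTIF_NO_BUFFERS,	/* plays the role of ZCRX_NOTIF_NO_BUFFERS */
	NOTIF_TYPE_LAST,	/* plays the role of __ZCRX_NOTIF_TYPE_LAST */
};

/* Models only allowed_notif_mask and fired_notifs from struct io_zcrx_ifq. */
struct notif_state {
	uint32_t allowed_mask;
	uint32_t fired;
};

/*
 * Returns true when a notification would be posted. A type that is not
 * allowed, or that already fired and was not rearmed, is silently dropped.
 */
static bool notif_fire(struct notif_state *s, unsigned int type)
{
	uint32_t bit = 1U << type;

	if (!(bit & s->allowed_mask))
		return false;
	if (bit & s->fired)
		return false;

	s->fired |= bit;
	return true;
}

/*
 * Clears the fired bit so the type can fire again. Returns -EINVAL for an
 * unknown type or for a type that has not fired yet, 0 otherwise.
 */
static int notif_rearm(struct notif_state *s, unsigned int type)
{
	uint32_t bit;

	if (type >= NOTIF_TYPE_LAST)
		return -EINVAL;

	bit = 1U << type;
	if (!(bit & s->fired))
		return -EINVAL;

	s->fired &= ~bit;
	return 0;
}
```

In the quoted patch the fired bit is set only after the request allocation
succeeds, so a failed allocation leaves the type free to fire on a later
event. The sketch does not model that failure path.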



Thread overview: 10+ messages
2026-05-18 15:35 [PATCH v2 0/6] io_uring/zcrx: add CQE based notifications and stats reporting Clément Léger
2026-05-18 15:35 ` [PATCH v2 1/6] io_uring/zcrx: add ctx pointer to zcrx Clément Léger
2026-05-19 15:19   ` Vishwanath Seshagiri
2026-05-18 15:35 ` [PATCH v2 2/6] io_uring/zcrx: notify user when out of buffers Clément Léger
2026-05-19 15:21   ` Vishwanath Seshagiri
2026-05-18 15:35 ` [PATCH v2 3/6] io_uring/zcrx: notify user on frag copy fallback Clément Léger
2026-05-18 15:35 ` [PATCH v2 4/6] io_uring/zcrx: add shared-memory notification statistics Clément Léger
2026-05-18 15:35 ` [PATCH v2 5/6] Documentation: networking: document zcrx notifications and statistics Clément Léger
2026-05-18 15:35 ` [PATCH v2 6/6] selftests: iou-zcrx: add notification and stats test for zcrx Clément Léger
2026-05-19 11:43 ` [PATCH v2 0/6] io_uring/zcrx: add CQE based notifications and stats reporting Pavel Begunkov
