io-uring.vger.kernel.org archive mirror
* [RFC v1 0/3] introduce io_uring querying
@ 2025-08-27 13:21 Pavel Begunkov
  2025-08-27 13:21 ` [RFC v1 1/3] io_uring: add helper for *REGISTER_SEND_MSG_RING Pavel Begunkov
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Pavel Begunkov @ 2025-08-27 13:21 UTC (permalink / raw)
  To: io-uring; +Cc: asml.silence

Introduce a versatile interface to query auxiliary io_uring parameters.
It will be used to close a couple of API gaps, but in this series can
only tell what request and register opcodes, features and setup flags
are available. It'll replace IORING_REGISTER_PROBE but with a much
more convenient interface. See patch 3 for the API description.

Can be tested with:

https://github.com/isilence/liburing.git io_uring/query-v1

Note: RFC as I've got a last minute uapi adjustment I want to try.

Pavel Begunkov (3):
  io_uring: add helper for *REGISTER_SEND_MSG_RING
  io_uring: add macro for features and valid setup flags
  io_uring: introduce io_uring querying

 include/uapi/linux/io_uring.h       |  3 ++
 include/uapi/linux/io_uring/query.h | 40 ++++++++++++++
 io_uring/Makefile                   |  2 +-
 io_uring/io_uring.c                 | 21 +-------
 io_uring/io_uring.h                 | 20 +++++++
 io_uring/query.c                    | 84 +++++++++++++++++++++++++++++
 io_uring/query.h                    |  9 ++++
 io_uring/register.c                 | 39 +++++++++-----
 8 files changed, 184 insertions(+), 34 deletions(-)
 create mode 100644 include/uapi/linux/io_uring/query.h
 create mode 100644 io_uring/query.c
 create mode 100644 io_uring/query.h

-- 
2.49.0



* [RFC v1 1/3] io_uring: add helper for *REGISTER_SEND_MSG_RING
  2025-08-27 13:21 [RFC v1 0/3] introduce io_uring querying Pavel Begunkov
@ 2025-08-27 13:21 ` Pavel Begunkov
  2025-08-27 13:21 ` [RFC v1 2/3] io_uring: add macro for features and valid setup flags Pavel Begunkov
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 8+ messages in thread
From: Pavel Begunkov @ 2025-08-27 13:21 UTC (permalink / raw)
  To: io-uring; +Cc: asml.silence

Move handling of IORING_REGISTER_SEND_MSG_RING into a separate function
in preparation for growing io_uring_register_blind().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 io_uring/register.c | 33 +++++++++++++++++++--------------
 1 file changed, 19 insertions(+), 14 deletions(-)

diff --git a/io_uring/register.c b/io_uring/register.c
index a59589249fce..046dcb7ba4d1 100644
--- a/io_uring/register.c
+++ b/io_uring/register.c
@@ -877,6 +877,23 @@ struct file *io_uring_register_get_file(unsigned int fd, bool registered)
 	return ERR_PTR(-EOPNOTSUPP);
 }
 
+static int io_uring_register_send_msg_ring(void __user *arg, unsigned int nr_args)
+{
+	struct io_uring_sqe sqe;
+
+	if (!arg || nr_args != 1)
+		return -EINVAL;
+	if (copy_from_user(&sqe, arg, sizeof(sqe)))
+		return -EFAULT;
+	/* no flags supported */
+	if (sqe.flags)
+		return -EINVAL;
+	if (sqe.opcode != IORING_OP_MSG_RING)
+		return -EINVAL;
+
+	return io_uring_sync_msg_ring(&sqe);
+}
+
 /*
  * "blind" registration opcodes are ones where there's no ring given, and
  * hence the source fd must be -1.
@@ -885,21 +902,9 @@ static int io_uring_register_blind(unsigned int opcode, void __user *arg,
 				   unsigned int nr_args)
 {
 	switch (opcode) {
-	case IORING_REGISTER_SEND_MSG_RING: {
-		struct io_uring_sqe sqe;
-
-		if (!arg || nr_args != 1)
-			return -EINVAL;
-		if (copy_from_user(&sqe, arg, sizeof(sqe)))
-			return -EFAULT;
-		/* no flags supported */
-		if (sqe.flags)
-			return -EINVAL;
-		if (sqe.opcode == IORING_OP_MSG_RING)
-			return io_uring_sync_msg_ring(&sqe);
-		}
+	case IORING_REGISTER_SEND_MSG_RING:
+		return io_uring_register_send_msg_ring(arg, nr_args);
 	}
-
 	return -EINVAL;
 }
 
-- 
2.49.0



* [RFC v1 2/3] io_uring: add macro for features and valid setup flags
  2025-08-27 13:21 [RFC v1 0/3] introduce io_uring querying Pavel Begunkov
  2025-08-27 13:21 ` [RFC v1 1/3] io_uring: add helper for *REGISTER_SEND_MSG_RING Pavel Begunkov
@ 2025-08-27 13:21 ` Pavel Begunkov
  2025-08-27 13:21 ` [RFC v1 3/3] io_uring: introduce io_uring querying Pavel Begunkov
  2025-08-27 15:35 ` [RFC v1 0/3] " Jens Axboe
  3 siblings, 0 replies; 8+ messages in thread
From: Pavel Begunkov @ 2025-08-27 13:21 UTC (permalink / raw)
  To: io-uring; +Cc: asml.silence

The next patch will need the masks for available features and setup
flags. Add macro constants for them to io_uring.h.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 io_uring/io_uring.c | 21 ++-------------------
 io_uring/io_uring.h | 20 ++++++++++++++++++++
 2 files changed, 22 insertions(+), 19 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 4ef69dd58734..8aac044cd53d 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3808,15 +3808,7 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
 	if (ret)
 		goto err;
 
-	p->features = IORING_FEAT_SINGLE_MMAP | IORING_FEAT_NODROP |
-			IORING_FEAT_SUBMIT_STABLE | IORING_FEAT_RW_CUR_POS |
-			IORING_FEAT_CUR_PERSONALITY | IORING_FEAT_FAST_POLL |
-			IORING_FEAT_POLL_32BITS | IORING_FEAT_SQPOLL_NONFIXED |
-			IORING_FEAT_EXT_ARG | IORING_FEAT_NATIVE_WORKERS |
-			IORING_FEAT_RSRC_TAGS | IORING_FEAT_CQE_SKIP |
-			IORING_FEAT_LINKED_FILE | IORING_FEAT_REG_REG_RING |
-			IORING_FEAT_RECVSEND_BUNDLE | IORING_FEAT_MIN_TIMEOUT |
-			IORING_FEAT_RW_ATTR | IORING_FEAT_NO_IOWAIT;
+	p->features = IORING_FEATURES;
 
 	if (copy_to_user(params, p, sizeof(*p))) {
 		ret = -EFAULT;
@@ -3876,17 +3868,8 @@ static long io_uring_setup(u32 entries, struct io_uring_params __user *params)
 			return -EINVAL;
 	}
 
-	if (p.flags & ~(IORING_SETUP_IOPOLL | IORING_SETUP_SQPOLL |
-			IORING_SETUP_SQ_AFF | IORING_SETUP_CQSIZE |
-			IORING_SETUP_CLAMP | IORING_SETUP_ATTACH_WQ |
-			IORING_SETUP_R_DISABLED | IORING_SETUP_SUBMIT_ALL |
-			IORING_SETUP_COOP_TASKRUN | IORING_SETUP_TASKRUN_FLAG |
-			IORING_SETUP_SQE128 | IORING_SETUP_CQE32 |
-			IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN |
-			IORING_SETUP_NO_MMAP | IORING_SETUP_REGISTERED_FD_ONLY |
-			IORING_SETUP_NO_SQARRAY | IORING_SETUP_HYBRID_IOPOLL))
+	if (p.flags & ~IORING_VALID_SETUP_FLAGS)
 		return -EINVAL;
-
 	return io_uring_create(entries, &p, params);
 }
 
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index abc6de227f74..37216d6eb102 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -18,6 +18,26 @@
 #include <trace/events/io_uring.h>
 #endif
 
+#define IORING_FEATURES (IORING_FEAT_SINGLE_MMAP | IORING_FEAT_NODROP |\
+			IORING_FEAT_SUBMIT_STABLE | IORING_FEAT_RW_CUR_POS |\
+			IORING_FEAT_CUR_PERSONALITY | IORING_FEAT_FAST_POLL |\
+			IORING_FEAT_POLL_32BITS | IORING_FEAT_SQPOLL_NONFIXED |\
+			IORING_FEAT_EXT_ARG | IORING_FEAT_NATIVE_WORKERS |\
+			IORING_FEAT_RSRC_TAGS | IORING_FEAT_CQE_SKIP |\
+			IORING_FEAT_LINKED_FILE | IORING_FEAT_REG_REG_RING |\
+			IORING_FEAT_RECVSEND_BUNDLE | IORING_FEAT_MIN_TIMEOUT |\
+			IORING_FEAT_RW_ATTR | IORING_FEAT_NO_IOWAIT)
+
+#define IORING_VALID_SETUP_FLAGS (IORING_SETUP_IOPOLL | IORING_SETUP_SQPOLL |\
+			IORING_SETUP_SQ_AFF | IORING_SETUP_CQSIZE |\
+			IORING_SETUP_CLAMP | IORING_SETUP_ATTACH_WQ |\
+			IORING_SETUP_R_DISABLED | IORING_SETUP_SUBMIT_ALL |\
+			IORING_SETUP_COOP_TASKRUN | IORING_SETUP_TASKRUN_FLAG |\
+			IORING_SETUP_SQE128 | IORING_SETUP_CQE32 |\
+			IORING_SETUP_SINGLE_ISSUER | IORING_SETUP_DEFER_TASKRUN |\
+			IORING_SETUP_NO_MMAP | IORING_SETUP_REGISTERED_FD_ONLY |\
+			IORING_SETUP_NO_SQARRAY | IORING_SETUP_HYBRID_IOPOLL)
+
 enum {
 	IOU_COMPLETE		= 0,
 
-- 
2.49.0
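
For readers following along, the point of a single mask macro is that the
same constant can both reject unknown bits and be reported back to
userspace. The following is a minimal userspace sketch of that pattern, not
code from the patch: the DEMO_* names and the demo_* functions are
hypothetical stand-ins, and the bit values are illustrative only.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-ins for a few IORING_SETUP_* bits (illustrative values). */
#define DEMO_SETUP_IOPOLL	(1U << 0)
#define DEMO_SETUP_SQPOLL	(1U << 1)
#define DEMO_SETUP_CQSIZE	(1U << 3)

/* One mask shared by validation and by reporting, like IORING_VALID_SETUP_FLAGS. */
#define DEMO_VALID_SETUP_FLAGS	(DEMO_SETUP_IOPOLL | DEMO_SETUP_SQPOLL | \
				 DEMO_SETUP_CQSIZE)

/* Reject any bit outside the mask, mirroring the check in io_uring_setup(). */
int demo_check_setup_flags(uint32_t flags)
{
	if (flags & ~DEMO_VALID_SETUP_FLAGS)
		return -EINVAL;
	return 0;
}

/* Test a single flag against a mask that was reported to the caller. */
int demo_flag_supported(uint64_t mask, uint64_t flag)
{
	/* The caller is expected to pass exactly one bit. */
	assert(flag && !(flag & (flag - 1)));
	return (mask & flag) != 0;
}
```

With the mask in one place, a later query interface can hand the same value
to userspace, which can then test individual bits before attempting setup.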



* [RFC v1 3/3] io_uring: introduce io_uring querying
  2025-08-27 13:21 [RFC v1 0/3] introduce io_uring querying Pavel Begunkov
  2025-08-27 13:21 ` [RFC v1 1/3] io_uring: add helper for *REGISTER_SEND_MSG_RING Pavel Begunkov
  2025-08-27 13:21 ` [RFC v1 2/3] io_uring: add macro for features and valid setup flags Pavel Begunkov
@ 2025-08-27 13:21 ` Pavel Begunkov
  2025-08-27 18:04   ` Gabriel Krisman Bertazi
  2025-08-27 15:35 ` [RFC v1 0/3] " Jens Axboe
  3 siblings, 1 reply; 8+ messages in thread
From: Pavel Begunkov @ 2025-08-27 13:21 UTC (permalink / raw)
  To: io-uring; +Cc: asml.silence

There are many characteristics of a ring or the io_uring subsystem the
user wants to query. Sometimes it's needed to be done before there is a
created ring, and sometimes it's needed at runtime in a slow path.
Introduce a querying interface to achieve that.

It was written with several requirements in mind:
- Can be used with or without an io_uring instance.
- Can query multiple attributes in one syscall.
- Backward and forward compatible.
- Should be reasonably easy to use.
- Reduce the kernel code size for introducing new query types.

API: it's implemented as a new registration op IORING_REGISTER_QUERY.
The user passes one or more query structures, each represented by
struct io_uring_query_hdr. The header stores common control fields for
query processing and is expected to be wrapped into a larger structure
that has opcode specific fields.

The header contains
- The query opcode
- The result field, which on return contains the error code for the query
- The size of the query structure. The kernel will only populate up to
  the size, which helps with backward compatibility. The kernel can also
  reduce the size, so if the current kernel is older than the interface
  the user tries to use, it'll get only the supported bits.
- next_entry field is used to chain multiple queries.

The patch adds a single query type for now, i.e. IO_URING_QUERY_OPCODES,
which tells what register / request / etc. opcodes are supported, but
there are particular plans to extend it.

Note: there is a request probing interface via IORING_REGISTER_PROBE,
but it's a mess. It requires the user to create a ring first, it only
works for requests, and requires dynamic allocations.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 include/uapi/linux/io_uring.h       |  3 ++
 include/uapi/linux/io_uring/query.h | 40 ++++++++++++++
 io_uring/Makefile                   |  2 +-
 io_uring/query.c                    | 84 +++++++++++++++++++++++++++++
 io_uring/query.h                    |  9 ++++
 io_uring/register.c                 |  6 +++
 6 files changed, 143 insertions(+), 1 deletion(-)
 create mode 100644 include/uapi/linux/io_uring/query.h
 create mode 100644 io_uring/query.c
 create mode 100644 io_uring/query.h

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 6957dc539d83..7a06da49e2cd 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -665,6 +665,9 @@ enum io_uring_register_op {
 
 	IORING_REGISTER_MEM_REGION		= 34,
 
+	/* query various aspects of io_uring, see linux/io_uring/query.h */
+	IORING_REGISTER_QUERY			= 35,
+
 	/* this goes last */
 	IORING_REGISTER_LAST,
 
diff --git a/include/uapi/linux/io_uring/query.h b/include/uapi/linux/io_uring/query.h
new file mode 100644
index 000000000000..ca58e88095ed
--- /dev/null
+++ b/include/uapi/linux/io_uring/query.h
@@ -0,0 +1,40 @@
+/* SPDX-License-Identifier: (GPL-2.0 WITH Linux-syscall-note) OR MIT */
+/*
+ * Header file for the io_uring query interface.
+ *
+ * Copyright (C) 2025 Pavel Begunkov
+ */
+#ifndef LINUX_IO_URING_QUERY_H
+#define LINUX_IO_URING_QUERY_H
+
+#include <linux/types.h>
+
+struct io_uring_query_hdr {
+	__u64 next_entry;
+	__u32 query_op;
+	__u32 size;
+	__s32 result;
+	__u32 __resv[3];
+};
+
+enum {
+	IO_URING_QUERY_OPCODES			= 0,
+
+	__IO_URING_QUERY_MAX,
+};
+
+/* Doesn't require a ring */
+struct io_uring_query_opcode {
+	struct io_uring_query_hdr hdr;
+
+	/* The number of supported IORING_OP_* opcodes */
+	__u32	nr_request_opcodes;
+	/* The number of supported IORING_[UN]REGISTER_* opcodes */
+	__u32	nr_register_opcodes;
+	/* Bitmask of all supported IORING_FEAT_* flags */
+	__u64	features;
+	/* Bitmask of all supported IORING_SETUP_* flags */
+	__u64	ring_flags;
+};
+
+#endif
diff --git a/io_uring/Makefile b/io_uring/Makefile
index b3f1bd492804..bc4e4a3fa0a5 100644
--- a/io_uring/Makefile
+++ b/io_uring/Makefile
@@ -13,7 +13,7 @@ obj-$(CONFIG_IO_URING)		+= io_uring.o opdef.o kbuf.o rsrc.o notif.o \
 					sync.o msg_ring.o advise.o openclose.o \
 					statx.o timeout.o cancel.o \
 					waitid.o register.o truncate.o \
-					memmap.o alloc_cache.o
+					memmap.o alloc_cache.o query.o
 obj-$(CONFIG_IO_URING_ZCRX)	+= zcrx.o
 obj-$(CONFIG_IO_WQ)		+= io-wq.o
 obj-$(CONFIG_FUTEX)		+= futex.o
diff --git a/io_uring/query.c b/io_uring/query.c
new file mode 100644
index 000000000000..0ae9192f5a57
--- /dev/null
+++ b/io_uring/query.c
@@ -0,0 +1,84 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "linux/io_uring/query.h"
+
+#include "query.h"
+#include "io_uring.h"
+
+#define IO_MAX_QUERY_SIZE		512
+
+static int io_query_ops(void *buffer)
+{
+	struct io_uring_query_opcode *e = buffer;
+
+	BUILD_BUG_ON(sizeof(struct io_uring_query_opcode) > IO_MAX_QUERY_SIZE);
+
+	e->hdr.size = min(e->hdr.size, sizeof(*e));
+	e->nr_request_opcodes = IORING_OP_LAST;
+	e->nr_register_opcodes = IORING_REGISTER_LAST;
+	e->features = IORING_FEATURES;
+	e->ring_flags = IORING_VALID_SETUP_FLAGS;
+	return 0;
+}
+
+static int io_handle_query_entry(struct io_ring_ctx *ctx,
+				 void *buffer,
+				 void __user *uentry, u64 *next_entry)
+{
+	struct io_uring_query_hdr *hdr = buffer;
+	size_t entry_size = sizeof(*hdr);
+	int ret = -EINVAL;
+
+	if (copy_from_user(hdr, uentry, sizeof(*hdr)) ||
+	    hdr->size <= sizeof(*hdr))
+		return -EFAULT;
+
+	if (hdr->query_op >= __IO_URING_QUERY_MAX) {
+		ret = -EOPNOTSUPP;
+		goto out;
+	}
+	if (!mem_is_zero(hdr->__resv, sizeof(hdr->__resv)) || hdr->result)
+		goto out;
+
+	hdr->size = min(hdr->size, IO_MAX_QUERY_SIZE);
+	if (copy_from_user(buffer + sizeof(*hdr), uentry + sizeof(*hdr),
+			   hdr->size - sizeof(*hdr)))
+		return -EFAULT;
+
+	switch (hdr->query_op) {
+	case IO_URING_QUERY_OPCODES:
+		ret = io_query_ops(buffer);
+		break;
+	}
+	if (!ret)
+		entry_size = hdr->size;
+out:
+	hdr->result = ret;
+	hdr->size = entry_size;
+	if (copy_to_user(uentry, buffer, entry_size))
+		return -EFAULT;
+	*next_entry = hdr->next_entry;
+	return 0;
+}
+
+int io_query(struct io_ring_ctx *ctx, void __user *arg, unsigned nr_args)
+{
+	char entry_buffer[IO_MAX_QUERY_SIZE];
+	void __user *uentry = arg;
+	int ret;
+
+	memset(entry_buffer, 0, sizeof(entry_buffer));
+
+	if (nr_args)
+		return -EINVAL;
+
+	while (uentry) {
+		u64 next;
+
+		ret = io_handle_query_entry(ctx, entry_buffer, uentry, &next);
+		if (ret)
+			return ret;
+		uentry = u64_to_user_ptr(next);
+	}
+	return 0;
+}
diff --git a/io_uring/query.h b/io_uring/query.h
new file mode 100644
index 000000000000..171d47ccaaba
--- /dev/null
+++ b/io_uring/query.h
@@ -0,0 +1,9 @@
+// SPDX-License-Identifier: GPL-2.0
+#ifndef IORING_QUERY_H
+#define IORING_QUERY_H
+
+#include <linux/io_uring_types.h>
+
+int io_query(struct io_ring_ctx *ctx, void __user *arg, unsigned nr_args);
+
+#endif
diff --git a/io_uring/register.c b/io_uring/register.c
index 046dcb7ba4d1..6777bfe616ea 100644
--- a/io_uring/register.c
+++ b/io_uring/register.c
@@ -31,6 +31,7 @@
 #include "msg_ring.h"
 #include "memmap.h"
 #include "zcrx.h"
+#include "query.h"
 
 #define IORING_MAX_RESTRICTIONS	(IORING_RESTRICTION_LAST + \
 				 IORING_REGISTER_LAST + IORING_OP_LAST)
@@ -835,6 +836,9 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
 			break;
 		ret = io_register_mem_region(ctx, arg);
 		break;
+	case IORING_REGISTER_QUERY:
+		ret = io_query(ctx, arg, nr_args);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
@@ -904,6 +908,8 @@ static int io_uring_register_blind(unsigned int opcode, void __user *arg,
 	switch (opcode) {
 	case IORING_REGISTER_SEND_MSG_RING:
 		return io_uring_register_send_msg_ring(arg, nr_args);
+	case IORING_REGISTER_QUERY:
+		return io_query(NULL, arg, nr_args);
 	}
 	return -EINVAL;
 }
-- 
2.49.0
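
To make the chaining and size negotiation described in the commit message
concrete, here is a minimal userspace sketch. It is an illustration under
stated assumptions, not the liburing API: the demo_* structs are local
mirrors of the layouts proposed in this patch, and demo_fake_kernel_walk()
is a hypothetical stand-in for the kernel side that only clamps sizes. On a
kernel carrying this patch, the head of the chain would instead be passed as
the argument of IORING_REGISTER_QUERY with nr_args set to 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirrors of the structs proposed in this patch (not the uapi header). */
struct demo_query_hdr {
	uint64_t next_entry;
	uint32_t query_op;
	uint32_t size;
	int32_t  result;
	uint32_t resv[3];
};

struct demo_query_opcode {
	struct demo_query_hdr hdr;
	uint32_t nr_request_opcodes;
	uint32_t nr_register_opcodes;
	uint64_t features;
	uint64_t ring_flags;
};

#define DEMO_QUERY_OPCODES	0

/* Offset one past the end of a member, for comparing against hdr.size. */
#define DEMO_FIELD_END(type, member) \
	(offsetof(type, member) + sizeof(((type *)0)->member))

/* Zero the entries and link them; the last one ends the chain with 0. */
void demo_chain_queries(struct demo_query_opcode *entries, size_t n)
{
	assert(entries != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		memset(&entries[i], 0, sizeof(entries[i]));
		entries[i].hdr.query_op = DEMO_QUERY_OPCODES;
		entries[i].hdr.size = sizeof(entries[i]);
		entries[i].hdr.next_entry = (i + 1 < n) ?
			(uint64_t)(uintptr_t)&entries[i + 1] : 0;
	}
}

/*
 * Hypothetical stand-in for the kernel: walk the chain, clamp each size to
 * what this "kernel" knows about, and report success. Returns the number of
 * entries visited.
 */
int demo_fake_kernel_walk(void *first, uint32_t kernel_size)
{
	struct demo_query_hdr *hdr = first;
	int handled = 0;

	while (hdr) {
		if (hdr->size > kernel_size)
			hdr->size = kernel_size;
		hdr->result = 0;
		handled++;
		hdr = (struct demo_query_hdr *)(uintptr_t)hdr->next_entry;
	}
	return handled;
}

/* A field is usable only if the query succeeded and the size covers it. */
int demo_field_valid(const struct demo_query_hdr *hdr, size_t field_end)
{
	return hdr->result == 0 && hdr->size >= field_end;
}
```

The design choice this illustrates is that userspace never needs to know the
kernel version: it checks the returned size per field, so a newer
application on an older kernel simply sees fewer valid fields.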



* Re: [RFC v1 0/3] introduce io_uring querying
  2025-08-27 13:21 [RFC v1 0/3] introduce io_uring querying Pavel Begunkov
                   ` (2 preceding siblings ...)
  2025-08-27 13:21 ` [RFC v1 3/3] io_uring: introduce io_uring querying Pavel Begunkov
@ 2025-08-27 15:35 ` Jens Axboe
  2025-08-27 16:51   ` Pavel Begunkov
  3 siblings, 1 reply; 8+ messages in thread
From: Jens Axboe @ 2025-08-27 15:35 UTC (permalink / raw)
  To: Pavel Begunkov, io-uring

On 8/27/25 7:21 AM, Pavel Begunkov wrote:
> Introduce a versatile interface to query auxiliary io_uring parameters.
> It will be used to close a couple of API gaps, but in this series can
> only tell what request and register opcodes, features and setup flags
> are available. It'll replace IORING_REGISTER_PROBE but with a much
> more convenient interface. See patch 3 for the API description.
> 
> Can be tested with:
> 
> https://github.com/isilence/liburing.git io_uring/query-v1
> 
> Note: RFC as I've got a last minute uapi adjustment I want to try.

Nice, was actually just dabbling in this yesterday, there are some
half assed patches here:

https://git.kernel.dk/cgit/linux/log/?h=io_uring-features

Your patch I had the identical one of, but moved it to a separate header
instead for adding more limits.

But I like your query style better than Yet Another
struct-with-resv-fields. I think that direction is good for sure.

-- 
Jens Axboe


* Re: [RFC v1 0/3] introduce io_uring querying
  2025-08-27 15:35 ` [RFC v1 0/3] " Jens Axboe
@ 2025-08-27 16:51   ` Pavel Begunkov
  0 siblings, 0 replies; 8+ messages in thread
From: Pavel Begunkov @ 2025-08-27 16:51 UTC (permalink / raw)
  To: Jens Axboe, io-uring

On 8/27/25 16:35, Jens Axboe wrote:
> On 8/27/25 7:21 AM, Pavel Begunkov wrote:
>> Introduce a versatile interface to query auxiliary io_uring parameters.
>> It will be used to close a couple of API gaps, but in this series can
>> only tell what request and register opcodes, features and setup flags
>> are available. It'll replace IORING_REGISTER_PROBE but with a much
>> more convenient interface. See patch 3 for the API description.
>>
>> Can be tested with:
>>
>> https://github.com/isilence/liburing.git io_uring/query-v1
>>
>> Note: RFC as I've got a last minute uapi adjustment I want to try.
> 
> Nice, was actually just dabbling in this yesterday, there are some
> half assed patches here:
> 
> https://git.kernel.dk/cgit/linux/log/?h=io_uring-features

I can add some of it if you need them, e.g. enter_flags sounds
like a good idea, sqe flags is probably as well. Buffers/files
can go into a separate type.

> Your patch I had the identical one of, but moved it to a separate header
> instead for adding more limits.
> 
> But I like your query style better than Yet Another
> struct-with-resv-fields. I think that direction is good for sure.

A single struct with reserved fields won't help with my main use
case either, i.e. rings/SQ/CQ size calculation. That requires a
good bunch of parameters to be passed.

-- 
Pavel Begunkov



* Re: [RFC v1 3/3] io_uring: introduce io_uring querying
  2025-08-27 13:21 ` [RFC v1 3/3] io_uring: introduce io_uring querying Pavel Begunkov
@ 2025-08-27 18:04   ` Gabriel Krisman Bertazi
  2025-08-27 19:45     ` Pavel Begunkov
  0 siblings, 1 reply; 8+ messages in thread
From: Gabriel Krisman Bertazi @ 2025-08-27 18:04 UTC (permalink / raw)
  To: Pavel Begunkov; +Cc: io-uring

Pavel Begunkov <asml.silence@gmail.com> writes:

> There are many characteristics of a ring or the io_uring subsystem the
> user wants to query. Sometimes it's needed to be done before there is a
> created ring, and sometimes it's needed at runtime in a slow path.
> Introduce a querying interface to achieve that.
>
> It was written with several requirements in mind:
> - Can be used with or without an io_uring instance.
> - Can query multiple attributes in one syscall.
> - Backward and forward compatible.
> - Should be reasonably easy to use.
> - Reduce the kernel code size for introducing new query types.

Hello, Pavel.

Correct me if I'm wrong, or if I completely missed the point, but this
is mostly about returning static information about what the kernel
supports, which can all be calculated at compile-time.

It seems it should be laid out as a procfs/sysfs /sys/kernel/io_uring
subtree instead, making it quickly parseable with the usual coreutils
command line tools, and then abstracted by some liburing APIs.  I don't
see the advantage of creating a custom way for fetching kernel features
information that only works for io_uring.

Sure, parsing sysfs is slow, but it doesn't need to be fast.  It is
annoying, but it can be abstracted in userspace by liburing.  It is more
consistent with the rest of the kernel and, for me, when tracking
customer issues, I can trust the newly introduced files will show up in
their supportconfig/sosreport without any extra change to these
applications.

Then there is the part about probing a specific ring for something, and
we have fdinfo. What information do we want to probe of a particular
ring that is missing?  Perhaps this feature should be split from the
general "is this feature supported" part.

Thanks!

> API: it's implemented as a new registration op IORING_REGISTER_QUERY.
> The user passes one or more query structures, each represented by
> struct io_uring_query_hdr. The header stores common control fields for
> query processing and is expected to be wrapped into a larger structure
> that has opcode specific fields.
>
> The header contains
> - The query opcode
> - The result field, which on return contains the error code for the query
> - The size of the query structure. The kernel will only populate up to
>   the size, which helps with backward compatibility. The kernel can also
>   reduce the size, so if the current kernel is older than the interface
>   the user tries to use, it'll get only the supported bits.
> - next_entry field is used to chain multiple queries.
>
> The patch adds a single query type for now, i.e. IO_URING_QUERY_OPCODES,
> which tells what register / request / etc. opcodes are supported, but
> there are particular plans to extend it.


>
> Note: there is a request probing interface via IORING_REGISTER_PROBE,
> but it's a mess. It requires the user to create a ring first, it only
> works for requests, and requires dynamic allocations.
>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
>  include/uapi/linux/io_uring.h       |  3 ++
>  include/uapi/linux/io_uring/query.h | 40 ++++++++++++++
>  io_uring/Makefile                   |  2 +-
>  io_uring/query.c                    | 84 +++++++++++++++++++++++++++++
>  io_uring/query.h                    |  9 ++++
>  io_uring/register.c                 |  6 +++
>  6 files changed, 143 insertions(+), 1 deletion(-)
>  create mode 100644 include/uapi/linux/io_uring/query.h
>  create mode 100644 io_uring/query.c
>  create mode 100644 io_uring/query.h
>
> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> index 6957dc539d83..7a06da49e2cd 100644
> --- a/include/uapi/linux/io_uring.h
> +++ b/include/uapi/linux/io_uring.h
> @@ -665,6 +665,9 @@ enum io_uring_register_op {
>  
>  	IORING_REGISTER_MEM_REGION		= 34,
>  
> +	/* query various aspects of io_uring, see linux/io_uring/query.h */
> +	IORING_REGISTER_QUERY			= 35,
> +
>  	/* this goes last */
>  	IORING_REGISTER_LAST,
>  
> diff --git a/include/uapi/linux/io_uring/query.h b/include/uapi/linux/io_uring/query.h
> new file mode 100644
> index 000000000000..ca58e88095ed
> --- /dev/null
> +++ b/include/uapi/linux/io_uring/query.h
> @@ -0,0 +1,40 @@
> +/* SPDX-License-Identifier: (GPL-2.0 WITH Linux-syscall-note) OR MIT */
> +/*
> + * Header file for the io_uring query interface.
> + *
> + * Copyright (C) 2025 Pavel Begunkov
> + */
> +#ifndef LINUX_IO_URING_QUERY_H
> +#define LINUX_IO_URING_QUERY_H
> +
> +#include <linux/types.h>
> +
> +struct io_uring_query_hdr {
> +	__u64 next_entry;
> +	__u32 query_op;
> +	__u32 size;
> +	__s32 result;
> +	__u32 __resv[3];
> +};
> +
> +enum {
> +	IO_URING_QUERY_OPCODES			= 0,
> +
> +	__IO_URING_QUERY_MAX,
> +};
> +
> +/* Doesn't require a ring */
> +struct io_uring_query_opcode {
> +	struct io_uring_query_hdr hdr;
> +
> +	/* The number of supported IORING_OP_* opcodes */
> +	__u32	nr_request_opcodes;
> +	/* The number of supported IORING_[UN]REGISTER_* opcodes */
> +	__u32	nr_register_opcodes;
> +	/* Bitmask of all supported IORING_FEAT_* flags */
> +	__u64	features;
> +	/* Bitmask of all supported IORING_SETUP_* flags */
> +	__u64	ring_flags;
> +};
> +
> +#endif
> diff --git a/io_uring/Makefile b/io_uring/Makefile
> index b3f1bd492804..bc4e4a3fa0a5 100644
> --- a/io_uring/Makefile
> +++ b/io_uring/Makefile
> @@ -13,7 +13,7 @@ obj-$(CONFIG_IO_URING)		+= io_uring.o opdef.o kbuf.o rsrc.o notif.o \
>  					sync.o msg_ring.o advise.o openclose.o \
>  					statx.o timeout.o cancel.o \
>  					waitid.o register.o truncate.o \
> -					memmap.o alloc_cache.o
> +					memmap.o alloc_cache.o query.o
>  obj-$(CONFIG_IO_URING_ZCRX)	+= zcrx.o
>  obj-$(CONFIG_IO_WQ)		+= io-wq.o
>  obj-$(CONFIG_FUTEX)		+= futex.o
> diff --git a/io_uring/query.c b/io_uring/query.c
> new file mode 100644
> index 000000000000..0ae9192f5a57
> --- /dev/null
> +++ b/io_uring/query.c
> @@ -0,0 +1,84 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include "linux/io_uring/query.h"
> +
> +#include "query.h"
> +#include "io_uring.h"
> +
> +#define IO_MAX_QUERY_SIZE		512
> +
> +static int io_query_ops(void *buffer)
> +{
> +	struct io_uring_query_opcode *e = buffer;
> +
> +	BUILD_BUG_ON(sizeof(struct io_uring_query_opcode) > IO_MAX_QUERY_SIZE);
> +
> +	e->hdr.size = min(e->hdr.size, sizeof(*e));
> +	e->nr_request_opcodes = IORING_OP_LAST;
> +	e->nr_register_opcodes = IORING_REGISTER_LAST;
> +	e->features = IORING_FEATURES;
> +	e->ring_flags = IORING_VALID_SETUP_FLAGS;
> +	return 0;
> +}
> +
> +static int io_handle_query_entry(struct io_ring_ctx *ctx,
> +				 void *buffer,
> +				 void __user *uentry, u64 *next_entry)
> +{
> +	struct io_uring_query_hdr *hdr = buffer;
> +	size_t entry_size = sizeof(*hdr);
> +	int ret = -EINVAL;
> +
> +	if (copy_from_user(hdr, uentry, sizeof(*hdr)) ||
> +	    hdr->size <= sizeof(*hdr))
> +		return -EFAULT;
> +
> +	if (hdr->query_op >= __IO_URING_QUERY_MAX) {
> +		ret = -EOPNOTSUPP;
> +		goto out;
> +	}
> +	if (!mem_is_zero(hdr->__resv, sizeof(hdr->__resv)) || hdr->result)
> +		goto out;
> +
> +	hdr->size = min(hdr->size, IO_MAX_QUERY_SIZE);
> +	if (copy_from_user(buffer + sizeof(*hdr), uentry + sizeof(*hdr),
> +			   hdr->size - sizeof(*hdr)))
> +		return -EFAULT;
> +
> +	switch (hdr->query_op) {
> +	case IO_URING_QUERY_OPCODES:
> +		ret = io_query_ops(buffer);
> +		break;
> +	}
> +	if (!ret)
> +		entry_size = hdr->size;
> +out:
> +	hdr->result = ret;
> +	hdr->size = entry_size;
> +	if (copy_to_user(uentry, buffer, entry_size))
> +		return -EFAULT;
> +	*next_entry = hdr->next_entry;
> +	return 0;
> +}
> +
> +int io_query(struct io_ring_ctx *ctx, void __user *arg, unsigned nr_args)
> +{
> +	char entry_buffer[IO_MAX_QUERY_SIZE];
> +	void __user *uentry = arg;
> +	int ret;
> +
> +	memset(entry_buffer, 0, sizeof(entry_buffer));
> +
> +	if (nr_args)
> +		return -EINVAL;
> +
> +	while (uentry) {
> +		u64 next;
> +
> +		ret = io_handle_query_entry(ctx, entry_buffer, uentry, &next);
> +		if (ret)
> +			return ret;
> +		uentry = u64_to_user_ptr(next);
> +	}
> +	return 0;
> +}
> diff --git a/io_uring/query.h b/io_uring/query.h
> new file mode 100644
> index 000000000000..171d47ccaaba
> --- /dev/null
> +++ b/io_uring/query.h
> @@ -0,0 +1,9 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#ifndef IORING_QUERY_H
> +#define IORING_QUERY_H
> +
> +#include <linux/io_uring_types.h>
> +
> +int io_query(struct io_ring_ctx *ctx, void __user *arg, unsigned nr_args);
> +
> +#endif
> diff --git a/io_uring/register.c b/io_uring/register.c
> index 046dcb7ba4d1..6777bfe616ea 100644
> --- a/io_uring/register.c
> +++ b/io_uring/register.c
> @@ -31,6 +31,7 @@
>  #include "msg_ring.h"
>  #include "memmap.h"
>  #include "zcrx.h"
> +#include "query.h"
>  
>  #define IORING_MAX_RESTRICTIONS	(IORING_RESTRICTION_LAST + \
>  				 IORING_REGISTER_LAST + IORING_OP_LAST)
> @@ -835,6 +836,9 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
>  			break;
>  		ret = io_register_mem_region(ctx, arg);
>  		break;
> +	case IORING_REGISTER_QUERY:
> +		ret = io_query(ctx, arg, nr_args);
> +		break;
>  	default:
>  		ret = -EINVAL;
>  		break;
> @@ -904,6 +908,8 @@ static int io_uring_register_blind(unsigned int opcode, void __user *arg,
>  	switch (opcode) {
>  	case IORING_REGISTER_SEND_MSG_RING:
>  		return io_uring_register_send_msg_ring(arg, nr_args);
> +	case IORING_REGISTER_QUERY:
> +		return io_query(NULL, arg, nr_args);
>  	}
>  	return -EINVAL;
>  }

-- 
Gabriel Krisman Bertazi


* Re: [RFC v1 3/3] io_uring: introduce io_uring querying
  2025-08-27 18:04   ` Gabriel Krisman Bertazi
@ 2025-08-27 19:45     ` Pavel Begunkov
  0 siblings, 0 replies; 8+ messages in thread
From: Pavel Begunkov @ 2025-08-27 19:45 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi; +Cc: io-uring

On 8/27/25 19:04, Gabriel Krisman Bertazi wrote:
> Pavel Begunkov <asml.silence@gmail.com> writes:
> 
>> There are many characteristics of a ring or the io_uring subsystem the
>> user wants to query. Sometimes it's needed to be done before there is a
>> created ring, and sometimes it's needed at runtime in a slow path.
>> Introduce a querying interface to achieve that.
>>
>> It was written with several requirements in mind:
>> - Can be used with or without an io_uring instance.
>> - Can query multiple attributes in one syscall.
>> - Backward and forward compatible.
>> - Should be reasonably easy to use.
>> - Reduce the kernel code size for introducing new query types.
> 
> Hello, Pavel.
> 
> Correct me if I'm wrong, or if I completely missed the point, but this
> is mostly about returning static information about what the kernel

It's primarily about configuring rings from within the application
at different steps, and it's always a huge mess when that requires
reading and parsing a text file. It's not all static either, my
main agenda here (not included) involves calculations.

> supports, which can all be calculated at compile-time.

I assume you mean kernel compilation. I can't rely on app
recompilation every time the kernel changes.

> It seems it should be laid out as a procfs/sysfs /sys/kernel/io_uring
> subtree instead, making it quickly parseable with the usual coreutils
> command line tools, and then abstracted by some liburing APIs.  I don't
> see the advantage of creating a custom way for fetching kernel features
> information that only works for io_uring.
> 
> Sure, parsing sysfs is slow, but it doesn't need to be fast.  It is
> annoying, but it can be abstracted in userspace by liburing.  It is more

FWIW, I see usefulness in it not being painstakingly slow as it
might easily become with files. E.g. it can be reused to return
stats the app needs at runtime.

Funnily, it would create a dependency where you can't create
a ring without having another downgraded ring or using read/write
etc. syscalls.

And that also won't work with fd-less ring, aka
IORING_SETUP_REGISTERED_FD_ONLY. Not sure why those are a thing,
but it wouldn't hurt to support it.

> consistent with the rest of the kernel and, for me, when tracking
> customer issues, I can trust the newly introduced files will show up in
> their supportconfig/sosreport without any extra change to these
> applications.
> 
> Then there is the part about probing a specific ring for something, and
> we have fdinfo. What information do we want to probe of a particular
> ring that is missing?  Perhaps this feature should be split from the
> general "is this feature supported" part.

The existing io_uring fdinfo is a huge mess. The format is not
great and it prints too much garbage, parsing it will be a misery.
It's not better implementation-wise either, as evidenced by the number
of bugs it has had and the reversed lock nesting with trylocks.

And some queries are going to be parameterised, and there is no
good way to pass that to a file read.

-- 
Pavel Begunkov


