* [PATCHSET RFC 0/3] Add support for ring resizing
From: Jens Axboe @ 2024-10-22 2:08 UTC (permalink / raw)
To: io-uring
Hi,
Something that's come up several times over the years is how to deal
with ring sizing. The SQ ring sizing is usually trivial - it just
controls the batch submit size, and it's not that difficult to simply
submit whenever the app fails to get a free SQE.
For the CQ ring, it's a different story. For networked workloads, it
can be hard to appropriately size the CQ ring without knowing exactly
how busy a given ring will be. This leads to applications grossly
over-sizing the ring, just in case, which is wasteful.
Here's a stab at supporting ring resizing. It supports resizing of
both rings, SQ and CQ, as it's really no different than just doing
the CQ ring itself. liburing has a 'resize-rings' branch with a bit
of support code, and a test case:
https://git.kernel.dk/cgit/liburing/log/?h=resize-rings
and these patches can also be found here:
https://git.kernel.dk/cgit/linux/log/?h=io_uring-ring-resize
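For reference, here's a rough sketch of how an application might drive the
new register opcode via the raw io_uring_register(2) syscall. It's
illustrative only - the liburing 'resize-rings' branch wraps this up, and
the helper there may look different:

	#include <string.h>
	#include <sys/syscall.h>
	#include <unistd.h>
	#include <linux/io_uring.h>

	/*
	 * Illustrative only: resize the rings of an existing io_uring
	 * instance. IORING_REGISTER_RESIZE_RINGS (33) is added by patch 3
	 * of this series, so it needs an updated uapi header.
	 */
	static int resize_rings(int ring_fd, unsigned sq_entries,
				unsigned cq_entries)
	{
		struct io_uring_params p;

		memset(&p, 0, sizeof(p));
		p.sq_entries = sq_entries;
		p.cq_entries = cq_entries;
		/* only the CQ sizing/clamping setup flags are accepted */
		p.flags = IORING_SETUP_CQSIZE | IORING_SETUP_CLAMP;

		return syscall(__NR_io_uring_register, ring_fd,
			       IORING_REGISTER_RESIZE_RINGS, &p, 1);
	}

On success the kernel writes the updated ring sizes and offsets back into
the params struct; the application (or liburing) is then expected to map
the new rings and unmap the old range, which is what the support code in
the branch above handles.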
include/uapi/linux/io_uring.h | 3 +
io_uring/io_uring.c | 84 ++++++++++--------
io_uring/io_uring.h | 6 ++
io_uring/register.c | 161 ++++++++++++++++++++++++++++++++++
4 files changed, 216 insertions(+), 38 deletions(-)
--
Jens Axboe
* [PATCH 1/3] io_uring: move max entry definition and ring sizing into header
From: Jens Axboe @ 2024-10-22 2:08 UTC (permalink / raw)
To: io-uring; +Cc: Jens Axboe
In preparation for needing this somewhere else, move the definitions
for the maximum CQ and SQ ring size into io_uring.h. Make the
rings_size() helper available as well, and have it take just the setup
flags argument rather than the full io_ring_ctx pointer - the flags are
all it needs.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
io_uring/io_uring.c | 14 ++++++--------
io_uring/io_uring.h | 5 +++++
2 files changed, 11 insertions(+), 8 deletions(-)
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 58b401900b41..6dea5242d666 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -105,9 +105,6 @@
#include "alloc_cache.h"
#include "eventfd.h"
-#define IORING_MAX_ENTRIES 32768
-#define IORING_MAX_CQ_ENTRIES (2 * IORING_MAX_ENTRIES)
-
#define SQE_COMMON_FLAGS (IOSQE_FIXED_FILE | IOSQE_IO_LINK | \
IOSQE_IO_HARDLINK | IOSQE_ASYNC)
@@ -2667,8 +2664,8 @@ static void io_rings_free(struct io_ring_ctx *ctx)
ctx->sq_sqes = NULL;
}
-static unsigned long rings_size(struct io_ring_ctx *ctx, unsigned int sq_entries,
- unsigned int cq_entries, size_t *sq_offset)
+unsigned long rings_size(unsigned int flags, unsigned int sq_entries,
+ unsigned int cq_entries, size_t *sq_offset)
{
struct io_rings *rings;
size_t off, sq_array_size;
@@ -2676,7 +2673,7 @@ static unsigned long rings_size(struct io_ring_ctx *ctx, unsigned int sq_entries
off = struct_size(rings, cqes, cq_entries);
if (off == SIZE_MAX)
return SIZE_MAX;
- if (ctx->flags & IORING_SETUP_CQE32) {
+ if (flags & IORING_SETUP_CQE32) {
if (check_shl_overflow(off, 1, &off))
return SIZE_MAX;
}
@@ -2687,7 +2684,7 @@ static unsigned long rings_size(struct io_ring_ctx *ctx, unsigned int sq_entries
return SIZE_MAX;
#endif
- if (ctx->flags & IORING_SETUP_NO_SQARRAY) {
+ if (flags & IORING_SETUP_NO_SQARRAY) {
*sq_offset = SIZE_MAX;
return off;
}
@@ -3434,7 +3431,8 @@ static __cold int io_allocate_scq_urings(struct io_ring_ctx *ctx,
ctx->sq_entries = p->sq_entries;
ctx->cq_entries = p->cq_entries;
- size = rings_size(ctx, p->sq_entries, p->cq_entries, &sq_array_offset);
+ size = rings_size(ctx->flags, p->sq_entries, p->cq_entries,
+ &sq_array_offset);
if (size == SIZE_MAX)
return -EOVERFLOW;
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 9cd9a127e9ed..4a471a810f02 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -65,6 +65,11 @@ static inline bool io_should_wake(struct io_wait_queue *iowq)
return dist >= 0 || atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts;
}
+#define IORING_MAX_ENTRIES 32768
+#define IORING_MAX_CQ_ENTRIES (2 * IORING_MAX_ENTRIES)
+
+unsigned long rings_size(unsigned int flags, unsigned int sq_entries,
+ unsigned int cq_entries, size_t *sq_offset);
bool io_cqe_cache_refill(struct io_ring_ctx *ctx, bool overflow);
int io_run_task_work_sig(struct io_ring_ctx *ctx);
void io_req_defer_failed(struct io_kiocb *req, s32 res);
--
2.45.2
* [PATCH 2/3] io_uring: abstract out a bit of the ring filling logic
From: Jens Axboe @ 2024-10-22 2:08 UTC (permalink / raw)
To: io-uring; +Cc: Jens Axboe
Abstract out an io_uring_fill_params() helper, which fills out the
necessary bits of struct io_uring_params. Add it to io_uring.h as well,
in preparation for having another internal user of it.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
io_uring/io_uring.c | 70 ++++++++++++++++++++++++++-------------------
io_uring/io_uring.h | 1 +
2 files changed, 41 insertions(+), 30 deletions(-)
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 6dea5242d666..b5974bdad48b 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3498,14 +3498,8 @@ static struct file *io_uring_get_file(struct io_ring_ctx *ctx)
O_RDWR | O_CLOEXEC, NULL);
}
-static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
- struct io_uring_params __user *params)
+int io_uring_fill_params(unsigned entries, struct io_uring_params *p)
{
- struct io_ring_ctx *ctx;
- struct io_uring_task *tctx;
- struct file *file;
- int ret;
-
if (!entries)
return -EINVAL;
if (entries > IORING_MAX_ENTRIES) {
@@ -3547,6 +3541,42 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
p->cq_entries = 2 * p->sq_entries;
}
+ p->sq_off.head = offsetof(struct io_rings, sq.head);
+ p->sq_off.tail = offsetof(struct io_rings, sq.tail);
+ p->sq_off.ring_mask = offsetof(struct io_rings, sq_ring_mask);
+ p->sq_off.ring_entries = offsetof(struct io_rings, sq_ring_entries);
+ p->sq_off.flags = offsetof(struct io_rings, sq_flags);
+ p->sq_off.dropped = offsetof(struct io_rings, sq_dropped);
+ p->sq_off.resv1 = 0;
+ if (!(p->flags & IORING_SETUP_NO_MMAP))
+ p->sq_off.user_addr = 0;
+
+ p->cq_off.head = offsetof(struct io_rings, cq.head);
+ p->cq_off.tail = offsetof(struct io_rings, cq.tail);
+ p->cq_off.ring_mask = offsetof(struct io_rings, cq_ring_mask);
+ p->cq_off.ring_entries = offsetof(struct io_rings, cq_ring_entries);
+ p->cq_off.overflow = offsetof(struct io_rings, cq_overflow);
+ p->cq_off.cqes = offsetof(struct io_rings, cqes);
+ p->cq_off.flags = offsetof(struct io_rings, cq_flags);
+ p->cq_off.resv1 = 0;
+ if (!(p->flags & IORING_SETUP_NO_MMAP))
+ p->cq_off.user_addr = 0;
+
+ return 0;
+}
+
+static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
+ struct io_uring_params __user *params)
+{
+ struct io_ring_ctx *ctx;
+ struct io_uring_task *tctx;
+ struct file *file;
+ int ret;
+
+ ret = io_uring_fill_params(entries, p);
+ if (unlikely(ret))
+ return ret;
+
ctx = io_ring_ctx_alloc(p);
if (!ctx)
return -ENOMEM;
@@ -3630,6 +3660,9 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
if (ret)
goto err;
+ if (!(p->flags & IORING_SETUP_NO_SQARRAY))
+ p->sq_off.array = (char *)ctx->sq_array - (char *)ctx->rings;
+
ret = io_sq_offload_create(ctx, p);
if (ret)
goto err;
@@ -3638,29 +3671,6 @@ static __cold int io_uring_create(unsigned entries, struct io_uring_params *p,
if (ret)
goto err;
- p->sq_off.head = offsetof(struct io_rings, sq.head);
- p->sq_off.tail = offsetof(struct io_rings, sq.tail);
- p->sq_off.ring_mask = offsetof(struct io_rings, sq_ring_mask);
- p->sq_off.ring_entries = offsetof(struct io_rings, sq_ring_entries);
- p->sq_off.flags = offsetof(struct io_rings, sq_flags);
- p->sq_off.dropped = offsetof(struct io_rings, sq_dropped);
- if (!(ctx->flags & IORING_SETUP_NO_SQARRAY))
- p->sq_off.array = (char *)ctx->sq_array - (char *)ctx->rings;
- p->sq_off.resv1 = 0;
- if (!(ctx->flags & IORING_SETUP_NO_MMAP))
- p->sq_off.user_addr = 0;
-
- p->cq_off.head = offsetof(struct io_rings, cq.head);
- p->cq_off.tail = offsetof(struct io_rings, cq.tail);
- p->cq_off.ring_mask = offsetof(struct io_rings, cq_ring_mask);
- p->cq_off.ring_entries = offsetof(struct io_rings, cq_ring_entries);
- p->cq_off.overflow = offsetof(struct io_rings, cq_overflow);
- p->cq_off.cqes = offsetof(struct io_rings, cqes);
- p->cq_off.flags = offsetof(struct io_rings, cq_flags);
- p->cq_off.resv1 = 0;
- if (!(ctx->flags & IORING_SETUP_NO_MMAP))
- p->cq_off.user_addr = 0;
-
p->features = IORING_FEAT_SINGLE_MMAP | IORING_FEAT_NODROP |
IORING_FEAT_SUBMIT_STABLE | IORING_FEAT_RW_CUR_POS |
IORING_FEAT_CUR_PERSONALITY | IORING_FEAT_FAST_POLL |
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index 4a471a810f02..e3e6cb14de5d 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -70,6 +70,7 @@ static inline bool io_should_wake(struct io_wait_queue *iowq)
unsigned long rings_size(unsigned int flags, unsigned int sq_entries,
unsigned int cq_entries, size_t *sq_offset);
+int io_uring_fill_params(unsigned entries, struct io_uring_params *p);
bool io_cqe_cache_refill(struct io_ring_ctx *ctx, bool overflow);
int io_run_task_work_sig(struct io_ring_ctx *ctx);
void io_req_defer_failed(struct io_kiocb *req, s32 res);
--
2.45.2
* [PATCH 3/3] io_uring/register: add IORING_REGISTER_RESIZE_RINGS
From: Jens Axboe @ 2024-10-22 2:08 UTC (permalink / raw)
To: io-uring; +Cc: Jens Axboe
Once a ring has been created, the sizes of the CQ and SQ rings are fixed.
Usually this isn't a problem on the SQ ring side, as it merely controls
the available number of requests that can be submitted in a single
system call, and there's rarely a need to change that.
For the CQ ring, it's a different story. For most efficient use of
io_uring, it's important that the CQ ring never overflows. This means
that applications must size it for the worst case scenario, which can
be wasteful.
Add IORING_REGISTER_RESIZE_RINGS, which allows an application to resize
the existing rings. It takes a struct io_uring_params argument, the same
one used to set up the ring initially, and resizes the rings according to
the sizes given.
Certain properties are always inherited from the original ring setup,
like SQE128/CQE32 and other setup options. The only setup flags the caller
may pass are the ones associated with how the CQ ring is sized and clamped
(IORING_SETUP_CQSIZE and IORING_SETUP_CLAMP).
Existing unconsumed SQE and CQE entries are copied as part of the
process. Any register op holds ->uring_lock, which prevents new
submissions, and the resize additionally holds the completion lock across
moving the CQ ring state.
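The copy keeps the absolute head and tail indices from the old ring and
relies on both the old and new sizes being powers of two, so each
unconsumed entry simply lands at 'index & new_mask'. A standalone
illustration of that re-slotting (not kernel code, types from the uapi
header):

	#include <linux/io_uring.h>

	/*
	 * Illustrative only: move unconsumed CQEs into a differently sized
	 * ring while keeping the absolute head/tail indices unchanged.
	 * Both sizes must be powers of two, as io_uring ring sizes are.
	 */
	static void reslot_cqes(const struct io_uring_cqe *old_cqes,
				unsigned old_entries,
				struct io_uring_cqe *new_cqes,
				unsigned new_entries,
				unsigned head, unsigned tail)
	{
		unsigned i;

		for (i = head; i != tail; i++)
			new_cqes[i & (new_entries - 1)] =
				old_cqes[i & (old_entries - 1)];
		/* head and tail themselves carry over as-is */
	}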
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
include/uapi/linux/io_uring.h | 3 +
io_uring/register.c | 161 ++++++++++++++++++++++++++++++++++
2 files changed, 164 insertions(+)
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 86cb385fe0b5..c4737892c7cd 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -615,6 +615,9 @@ enum io_uring_register_op {
/* send MSG_RING without having a ring */
IORING_REGISTER_SEND_MSG_RING = 31,
+ /* resize CQ ring */
+ IORING_REGISTER_RESIZE_RINGS = 33,
+
/* this goes last */
IORING_REGISTER_LAST,
diff --git a/io_uring/register.c b/io_uring/register.c
index 52b2f9b74af8..8dfe46a1cfe4 100644
--- a/io_uring/register.c
+++ b/io_uring/register.c
@@ -29,6 +29,7 @@
#include "napi.h"
#include "eventfd.h"
#include "msg_ring.h"
+#include "memmap.h"
#define IORING_MAX_RESTRICTIONS (IORING_RESTRICTION_LAST + \
IORING_REGISTER_LAST + IORING_OP_LAST)
@@ -361,6 +362,160 @@ static int io_register_clock(struct io_ring_ctx *ctx,
return 0;
}
+/*
+ * State to maintain until we can swap. Both new and old state, used for
+ * either mapping or freeing.
+ */
+struct io_ring_ctx_rings {
+ unsigned short n_ring_pages;
+ unsigned short n_sqe_pages;
+ struct page **ring_pages;
+ struct page **sqe_pages;
+ struct io_uring_sqe *sq_sqes;
+ struct io_rings *rings;
+};
+
+static void io_register_free_rings(struct io_uring_params *p,
+ struct io_ring_ctx_rings *r)
+{
+ if (!(p->flags & IORING_SETUP_NO_MMAP)) {
+ io_pages_unmap(r->rings, &r->ring_pages, &r->n_ring_pages,
+ true);
+ io_pages_unmap(r->sq_sqes, &r->sqe_pages, &r->n_sqe_pages,
+ true);
+ } else {
+ io_pages_free(&r->ring_pages, r->n_ring_pages);
+ io_pages_free(&r->sqe_pages, r->n_sqe_pages);
+ vunmap(r->rings);
+ vunmap(r->sq_sqes);
+ }
+}
+
+#define swap_old(ctx, o, n, field) \
+ do { \
+ (o).field = (ctx)->field; \
+ (ctx)->field = (n).field; \
+ } while (0)
+
+#define RESIZE_FLAGS (IORING_SETUP_CQSIZE | IORING_SETUP_CLAMP)
+#define COPY_FLAGS (IORING_SETUP_NO_SQARRAY | IORING_SETUP_SQE128 | \
+ IORING_SETUP_CQE32 | IORING_SETUP_NO_MMAP)
+
+static int io_register_resize_rings(struct io_ring_ctx *ctx, void __user *arg)
+{
+ struct io_ring_ctx_rings o = { }, n = { };
+ size_t size, sq_array_offset;
+ struct io_uring_params p;
+ unsigned i, tail;
+ void *ptr;
+ int ret;
+
+ if (copy_from_user(&p, arg, sizeof(p)))
+ return -EFAULT;
+ if (p.flags & ~RESIZE_FLAGS)
+ return -EINVAL;
+ /* nothing to do */
+ if (p.sq_entries == ctx->sq_entries && p.cq_entries == ctx->cq_entries)
+ return 0;
+ /* properties that are always inherited */
+ p.flags |= (ctx->flags & COPY_FLAGS);
+
+ ret = io_uring_fill_params(p.sq_entries, &p);
+ if (unlikely(ret))
+ return ret;
+
+ size = rings_size(p.flags, p.sq_entries, p.cq_entries,
+ &sq_array_offset);
+ if (size == SIZE_MAX)
+ return -EOVERFLOW;
+
+ if (!(p.flags & IORING_SETUP_NO_MMAP))
+ n.rings = io_pages_map(&n.ring_pages, &n.n_ring_pages, size);
+ else
+ n.rings = __io_uaddr_map(&n.ring_pages, &n.n_ring_pages,
+ p.cq_off.user_addr, size);
+ if (IS_ERR(n.rings))
+ return PTR_ERR(n.rings);
+
+ n.rings->sq_ring_mask = p.sq_entries - 1;
+ n.rings->cq_ring_mask = p.cq_entries - 1;
+ n.rings->sq_ring_entries = p.sq_entries;
+ n.rings->cq_ring_entries = p.cq_entries;
+
+ if (copy_to_user(arg, &p, sizeof(p))) {
+ io_register_free_rings(&p, &n);
+ return -EFAULT;
+ }
+
+ if (p.flags & IORING_SETUP_SQE128)
+ size = array_size(2 * sizeof(struct io_uring_sqe), p.sq_entries);
+ else
+ size = array_size(sizeof(struct io_uring_sqe), p.sq_entries);
+ if (size == SIZE_MAX) {
+ io_register_free_rings(&p, &n);
+ return -EOVERFLOW;
+ }
+
+ if (!(p.flags & IORING_SETUP_NO_MMAP))
+ ptr = io_pages_map(&n.sqe_pages, &n.n_sqe_pages, size);
+ else
+ ptr = __io_uaddr_map(&n.sqe_pages, &n.n_sqe_pages,
+ p.sq_off.user_addr,
+ size);
+ if (IS_ERR(ptr)) {
+ io_register_free_rings(&p, &n);
+ return PTR_ERR(ptr);
+ }
+
+ /* now copy entries, if any */
+ n.sq_sqes = ptr;
+ tail = ctx->rings->sq.tail;
+ for (i = ctx->rings->sq.head; i < tail; i++) {
+ unsigned src_head = i & (ctx->sq_entries - 1);
+ unsigned dst_head = i & n.rings->sq_ring_mask;
+
+ n.sq_sqes[dst_head] = ctx->sq_sqes[src_head];
+ }
+ n.rings->sq.head = ctx->rings->sq.head;
+ n.rings->sq.tail = ctx->rings->sq.tail;
+
+ spin_lock(&ctx->completion_lock);
+ tail = ctx->rings->cq.tail;
+ for (i = ctx->rings->cq.head; i < tail; i++) {
+ unsigned src_head = i & (ctx->cq_entries - 1);
+ unsigned dst_head = i & n.rings->cq_ring_mask;
+
+ n.rings->cqes[dst_head] = ctx->rings->cqes[src_head];
+ }
+ n.rings->cq.head = ctx->rings->cq.head;
+ n.rings->cq.tail = ctx->rings->cq.tail;
+ /* invalidate cached cqe refill */
+ ctx->cqe_cached = ctx->cqe_sentinel = NULL;
+
+ n.rings->sq_dropped = ctx->rings->sq_dropped;
+ n.rings->sq_flags = ctx->rings->sq_flags;
+ n.rings->cq_flags = ctx->rings->cq_flags;
+ n.rings->cq_overflow = ctx->rings->cq_overflow;
+
+ /* all done, store old pointers and assign new ones */
+ if (!(ctx->flags & IORING_SETUP_NO_SQARRAY))
+ ctx->sq_array = (u32 *)((char *)n.rings + sq_array_offset);
+
+ ctx->sq_entries = p.sq_entries;
+ ctx->cq_entries = p.cq_entries;
+
+ swap_old(ctx, o, n, rings);
+ swap_old(ctx, o, n, n_ring_pages);
+ swap_old(ctx, o, n, n_sqe_pages);
+ swap_old(ctx, o, n, ring_pages);
+ swap_old(ctx, o, n, sqe_pages);
+ swap_old(ctx, o, n, sq_sqes);
+ spin_unlock(&ctx->completion_lock);
+
+ io_register_free_rings(&p, &o);
+ return 0;
+}
+
static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
void __user *arg, unsigned nr_args)
__releases(ctx->uring_lock)
@@ -549,6 +704,12 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
break;
ret = io_register_clone_buffers(ctx, arg);
break;
+ case IORING_REGISTER_RESIZE_RINGS:
+ ret = -EINVAL;
+ if (!arg || nr_args != 1)
+ break;
+ ret = io_register_resize_rings(ctx, arg);
+ break;
default:
ret = -EINVAL;
break;
--
2.45.2
* [PATCH 3/3] io_uring/register: add IORING_REGISTER_RESIZE_RINGS
2024-10-23 15:59 [PATCHSET v2 " Jens Axboe
From: Jens Axboe @ 2024-10-23 15:59 UTC (permalink / raw)
To: io-uring; +Cc: Jens Axboe
Once a ring has been created, the sizes of the CQ and SQ rings are fixed.
Usually this isn't a problem on the SQ ring side, as it merely controls
the available number of requests that can be submitted in a single
system call, and there's rarely a need to change that.
For the CQ ring, it's a different story. For most efficient use of
io_uring, it's important that the CQ ring never overflows. This means
that applications must size it for the worst case scenario, which can
be wasteful.
Add IORING_REGISTER_RESIZE_RINGS, which allows an application to resize
the existing rings. It takes a struct io_uring_params argument, the same
one used to set up the ring initially, and resizes the rings according to
the sizes given.
Certain properties are always inherited from the original ring setup,
like SQE128/CQE32 and other setup options. The only setup flags the caller
may pass are the ones associated with how the CQ ring is sized and clamped
(IORING_SETUP_CQSIZE and IORING_SETUP_CLAMP).
Existing unconsumed SQE and CQE entries are copied as part of the
process. If either resized destination ring cannot hold the entries
already present in the corresponding source ring, the operation fails
with -EOVERFLOW. Any register op holds ->uring_lock, which prevents new
submissions, and the resize additionally holds the completion lock across
moving the CQ ring state.
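From the application's point of view this means a shrink request can now
fail; a minimal, illustrative way to handle that, reusing the hypothetical
resize_rings() wrapper sketched after the cover letter above:

	#include <errno.h>

	/*
	 * Illustrative only: opportunistically shrink the rings, keeping
	 * the current size if the unconsumed entries no longer fit.
	 * resize_rings() is the hypothetical wrapper around
	 * IORING_REGISTER_RESIZE_RINGS from the cover letter sketch.
	 */
	static int try_shrink(int ring_fd, unsigned sq_entries,
			      unsigned cq_entries)
	{
		if (resize_rings(ring_fd, sq_entries, cq_entries) < 0) {
			if (errno == EOVERFLOW)
				return 0; /* too much pending work, keep current size */
			return -errno;
		}
		return 0;
	}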
Signed-off-by: Jens Axboe <axboe@kernel.dk>
---
include/uapi/linux/io_uring.h | 3 +
io_uring/register.c | 177 ++++++++++++++++++++++++++++++++++
2 files changed, 180 insertions(+)
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 86cb385fe0b5..c4737892c7cd 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -615,6 +615,9 @@ enum io_uring_register_op {
/* send MSG_RING without having a ring */
IORING_REGISTER_SEND_MSG_RING = 31,
+ /* resize CQ ring */
+ IORING_REGISTER_RESIZE_RINGS = 33,
+
/* this goes last */
IORING_REGISTER_LAST,
diff --git a/io_uring/register.c b/io_uring/register.c
index 52b2f9b74af8..e38d83c8bbf1 100644
--- a/io_uring/register.c
+++ b/io_uring/register.c
@@ -29,6 +29,7 @@
#include "napi.h"
#include "eventfd.h"
#include "msg_ring.h"
+#include "memmap.h"
#define IORING_MAX_RESTRICTIONS (IORING_RESTRICTION_LAST + \
IORING_REGISTER_LAST + IORING_OP_LAST)
@@ -361,6 +362,176 @@ static int io_register_clock(struct io_ring_ctx *ctx,
return 0;
}
+/*
+ * State to maintain until we can swap. Both new and old state, used for
+ * either mapping or freeing.
+ */
+struct io_ring_ctx_rings {
+ unsigned short n_ring_pages;
+ unsigned short n_sqe_pages;
+ struct page **ring_pages;
+ struct page **sqe_pages;
+ struct io_uring_sqe *sq_sqes;
+ struct io_rings *rings;
+};
+
+static void io_register_free_rings(struct io_uring_params *p,
+ struct io_ring_ctx_rings *r)
+{
+ if (!(p->flags & IORING_SETUP_NO_MMAP)) {
+ io_pages_unmap(r->rings, &r->ring_pages, &r->n_ring_pages,
+ true);
+ io_pages_unmap(r->sq_sqes, &r->sqe_pages, &r->n_sqe_pages,
+ true);
+ } else {
+ io_pages_free(&r->ring_pages, r->n_ring_pages);
+ io_pages_free(&r->sqe_pages, r->n_sqe_pages);
+ vunmap(r->rings);
+ vunmap(r->sq_sqes);
+ }
+}
+
+#define swap_old(ctx, o, n, field) \
+ do { \
+ (o).field = (ctx)->field; \
+ (ctx)->field = (n).field; \
+ } while (0)
+
+#define RESIZE_FLAGS (IORING_SETUP_CQSIZE | IORING_SETUP_CLAMP)
+#define COPY_FLAGS (IORING_SETUP_NO_SQARRAY | IORING_SETUP_SQE128 | \
+ IORING_SETUP_CQE32 | IORING_SETUP_NO_MMAP)
+
+static int io_register_resize_rings(struct io_ring_ctx *ctx, void __user *arg)
+{
+ struct io_ring_ctx_rings o = { }, n = { };
+ size_t size, sq_array_offset;
+ struct io_uring_params p;
+ unsigned i, tail;
+ void *ptr;
+ int ret;
+
+ /* for single issuer, must be owner resizing */
+ if (ctx->flags & IORING_SETUP_SINGLE_ISSUER &&
+ current != ctx->submitter_task)
+ return -EEXIST;
+ if (copy_from_user(&p, arg, sizeof(p)))
+ return -EFAULT;
+ if (p.flags & ~RESIZE_FLAGS)
+ return -EINVAL;
+ /* nothing to do */
+ if (p.sq_entries == ctx->sq_entries && p.cq_entries == ctx->cq_entries)
+ return 0;
+ /* properties that are always inherited */
+ p.flags |= (ctx->flags & COPY_FLAGS);
+
+ ret = io_uring_fill_params(p.sq_entries, &p);
+ if (unlikely(ret))
+ return ret;
+
+ size = rings_size(p.flags, p.sq_entries, p.cq_entries,
+ &sq_array_offset);
+ if (size == SIZE_MAX)
+ return -EOVERFLOW;
+
+ if (!(p.flags & IORING_SETUP_NO_MMAP))
+ n.rings = io_pages_map(&n.ring_pages, &n.n_ring_pages, size);
+ else
+ n.rings = __io_uaddr_map(&n.ring_pages, &n.n_ring_pages,
+ p.cq_off.user_addr, size);
+ if (IS_ERR(n.rings))
+ return PTR_ERR(n.rings);
+
+ n.rings->sq_ring_mask = p.sq_entries - 1;
+ n.rings->cq_ring_mask = p.cq_entries - 1;
+ n.rings->sq_ring_entries = p.sq_entries;
+ n.rings->cq_ring_entries = p.cq_entries;
+
+ if (copy_to_user(arg, &p, sizeof(p))) {
+ io_register_free_rings(&p, &n);
+ return -EFAULT;
+ }
+
+ if (p.flags & IORING_SETUP_SQE128)
+ size = array_size(2 * sizeof(struct io_uring_sqe), p.sq_entries);
+ else
+ size = array_size(sizeof(struct io_uring_sqe), p.sq_entries);
+ if (size == SIZE_MAX) {
+ io_register_free_rings(&p, &n);
+ return -EOVERFLOW;
+ }
+
+ if (!(p.flags & IORING_SETUP_NO_MMAP))
+ ptr = io_pages_map(&n.sqe_pages, &n.n_sqe_pages, size);
+ else
+ ptr = __io_uaddr_map(&n.sqe_pages, &n.n_sqe_pages,
+ p.sq_off.user_addr,
+ size);
+ if (IS_ERR(ptr)) {
+ io_register_free_rings(&p, &n);
+ return PTR_ERR(ptr);
+ }
+
+ /*
+ * Now copy SQ and CQ entries, if any. If either of the destination
+ * rings can't hold what is already there, then fail the operation.
+ */
+ n.sq_sqes = ptr;
+ tail = ctx->rings->sq.tail;
+ if (tail - ctx->rings->sq.head > p.sq_entries) {
+ io_register_free_rings(&p, &n);
+ return -EOVERFLOW;
+ }
+ for (i = ctx->rings->sq.head; i < tail; i++) {
+ unsigned src_head = i & (ctx->sq_entries - 1);
+ unsigned dst_head = i & n.rings->sq_ring_mask;
+
+ n.sq_sqes[dst_head] = ctx->sq_sqes[src_head];
+ }
+ n.rings->sq.head = ctx->rings->sq.head;
+ n.rings->sq.tail = ctx->rings->sq.tail;
+
+ spin_lock(&ctx->completion_lock);
+ tail = ctx->rings->cq.tail;
+ if (tail - ctx->rings->cq.head > p.cq_entries) {
+ spin_unlock(&ctx->completion_lock);
+ io_register_free_rings(&p, &n);
+ return -EOVERFLOW;
+ }
+ for (i = ctx->rings->cq.head; i < tail; i++) {
+ unsigned src_head = i & (ctx->cq_entries - 1);
+ unsigned dst_head = i & n.rings->cq_ring_mask;
+
+ n.rings->cqes[dst_head] = ctx->rings->cqes[src_head];
+ }
+ n.rings->cq.head = ctx->rings->cq.head;
+ n.rings->cq.tail = ctx->rings->cq.tail;
+ /* invalidate cached cqe refill */
+ ctx->cqe_cached = ctx->cqe_sentinel = NULL;
+
+ n.rings->sq_dropped = ctx->rings->sq_dropped;
+ n.rings->sq_flags = ctx->rings->sq_flags;
+ n.rings->cq_flags = ctx->rings->cq_flags;
+ n.rings->cq_overflow = ctx->rings->cq_overflow;
+
+ /* all done, store old pointers and assign new ones */
+ if (!(ctx->flags & IORING_SETUP_NO_SQARRAY))
+ ctx->sq_array = (u32 *)((char *)n.rings + sq_array_offset);
+
+ ctx->sq_entries = p.sq_entries;
+ ctx->cq_entries = p.cq_entries;
+
+ swap_old(ctx, o, n, rings);
+ swap_old(ctx, o, n, n_ring_pages);
+ swap_old(ctx, o, n, n_sqe_pages);
+ swap_old(ctx, o, n, ring_pages);
+ swap_old(ctx, o, n, sqe_pages);
+ swap_old(ctx, o, n, sq_sqes);
+ spin_unlock(&ctx->completion_lock);
+
+ io_register_free_rings(&p, &o);
+ return 0;
+}
+
static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
void __user *arg, unsigned nr_args)
__releases(ctx->uring_lock)
@@ -549,6 +720,12 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
break;
ret = io_register_clone_buffers(ctx, arg);
break;
+ case IORING_REGISTER_RESIZE_RINGS:
+ ret = -EINVAL;
+ if (!arg || nr_args != 1)
+ break;
+ ret = io_register_resize_rings(ctx, arg);
+ break;
default:
ret = -EINVAL;
break;
--
2.45.2
* Re: [PATCHSET RFC 0/3] Add support for ring resizing
From: Jann Horn @ 2024-10-24 15:47 UTC (permalink / raw)
To: Jens Axboe; +Cc: io-uring
On Tue, Oct 22, 2024 at 4:08 AM Jens Axboe <axboe@kernel.dk> wrote:
> Here's a stab at supporting ring resizing. It supports resizing of
> both rings, SQ and CQ, as it's really no different than just doing
> the CQ ring itself. liburing has a 'resize-rings' branch with a bit
> of support code, and a test case:
>
> https://git.kernel.dk/cgit/liburing/log/?h=resize-rings
>
> and these patches can also be found here:
>
> https://git.kernel.dk/cgit/linux/log/?h=io_uring-ring-resize
You'd need to properly synchronize that path with io_uring_mmap(),
right? Take a lock that prevents concurrent mmap() from accessing
ctx->ring_pages while the resize is concurrently freeing that array,
so that you don't get UAF?
And I guess ideally you'd also zap the already-mapped pages from
corresponding VMAs with something like unmap_mapping_range(), though
that won't make a difference security-wise since the pages are
refcounted by the userspace mapping anyway.
* Re: [PATCHSET RFC 0/3] Add support for ring resizing
From: Jens Axboe @ 2024-10-24 16:05 UTC (permalink / raw)
To: Jann Horn; +Cc: io-uring
On 10/24/24 9:47 AM, Jann Horn wrote:
> On Tue, Oct 22, 2024 at 4:08 AM Jens Axboe <axboe@kernel.dk> wrote:
>> Here's a stab at supporting ring resizing. It supports resizing of
>> both rings, SQ and CQ, as it's really no different than just doing
>> the CQ ring itself. liburing has a 'resize-rings' branch with a bit
>> of support code, and a test case:
>>
>> https://git.kernel.dk/cgit/liburing/log/?h=resize-rings
>>
>> and these patches can also be found here:
>>
>> https://git.kernel.dk/cgit/linux/log/?h=io_uring-ring-resize
>
> You'd need to properly synchronize that path with io_uring_mmap(),
> right? Take a lock that prevents concurrent mmap() from accessing
> ctx->ring_pages while the resize is concurrently freeing that array,
> so that you don't get UAF?
Yep indeed! It's missing the mmap_lock, I'll add that.
> And I guess ideally you'd also zap the already-mapped pages from
> corresponding VMAs with something like unmap_mapping_range(), though
> that won't make a difference security-wise since the pages are
> refcounted by the userspace mapping anyway.
Yes, I don't think we need to do anything there - just have userspace
unmap the old range upon return.
--
Jens Axboe