* [PATCHv7 0/6] ublk zero copy support
@ 2025-02-26 18:20 Keith Busch
2025-02-26 18:20 ` [PATCHv7 1/6] io_uring/rw: move fixed buffer import to issue path Keith Busch
` (5 more replies)
0 siblings, 6 replies; 23+ messages in thread
From: Keith Busch @ 2025-02-26 18:20 UTC (permalink / raw)
To: ming.lei, asml.silence, axboe, linux-block, io-uring
Cc: bernd, csander, linux-nvme, Keith Busch
From: Keith Busch <kbusch@kernel.org>
Nothing's changed from intended v6. This just fixes up the directory for
the mailed patch set.
Changes from v5:
Merged up to latest block for-next tree
Fixed up the io_uring read/write fixed prep to not set do_import, and
actually use the issue_flags when importing the buffer node (Pavel,
Caleb).
Used unambigious names for the read/write permissions of registered
kernel vectors, defined them using their symbolic names instead of
literals, and added a BUILD_BUG_ON to ensure the flags fits in the
type (Ming, Pavel).
Limit the io cache size to 64 elements (Pavel).
Enforce unpriveledged ublk dev can't use zero copy (Ming).
Various cleanups.
Keith Busch (5):
io_uring/rw: move fixed buffer import to issue path
io_uring: add support for kernel registered bvecs
ublk: zc register/unregister bvec
io_uring: add abstraction for buf_table rsrc data
io_uring: cache nodes and mapped buffers
Xinyu Zhang (1):
nvme: map uring_cmd data even if address is 0
drivers/block/ublk_drv.c | 119 +++++++++-----
drivers/nvme/host/ioctl.c | 2 +-
include/linux/io_uring/cmd.h | 7 +
include/linux/io_uring_types.h | 24 +--
include/uapi/linux/ublk_cmd.h | 4 +
io_uring/fdinfo.c | 8 +-
io_uring/filetable.c | 2 +-
io_uring/io_uring.c | 3 +
io_uring/nop.c | 2 +-
io_uring/opdef.c | 4 +-
io_uring/register.c | 2 +-
io_uring/rsrc.c | 280 ++++++++++++++++++++++++++-------
io_uring/rsrc.h | 10 +-
io_uring/rw.c | 39 +++--
io_uring/rw.h | 2 +
15 files changed, 389 insertions(+), 119 deletions(-)
--
2.43.5
^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCHv7 1/6] io_uring/rw: move fixed buffer import to issue path
2025-02-26 18:20 [PATCHv7 0/6] ublk zero copy support Keith Busch
@ 2025-02-26 18:20 ` Keith Busch
2025-02-26 19:04 ` Jens Axboe
2025-02-26 18:20 ` [PATCHv7 2/6] io_uring: add support for kernel registered bvecs Keith Busch
` (4 subsequent siblings)
5 siblings, 1 reply; 23+ messages in thread
From: Keith Busch @ 2025-02-26 18:20 UTC (permalink / raw)
To: ming.lei, asml.silence, axboe, linux-block, io-uring
Cc: bernd, csander, linux-nvme, Keith Busch
From: Keith Busch <kbusch@kernel.org>
Registered buffers may depend on a linked command, which makes the prep
path too early to import. Move to the issue path when the node is
actually needed like all the other users of fixed buffers.
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
io_uring/opdef.c | 4 ++--
io_uring/rw.c | 39 ++++++++++++++++++++++++++++++---------
io_uring/rw.h | 2 ++
3 files changed, 34 insertions(+), 11 deletions(-)
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index 9344534780a02..db77df513d55b 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -105,7 +105,7 @@ const struct io_issue_def io_issue_defs[] = {
.iopoll_queue = 1,
.async_size = sizeof(struct io_async_rw),
.prep = io_prep_read_fixed,
- .issue = io_read,
+ .issue = io_read_fixed,
},
[IORING_OP_WRITE_FIXED] = {
.needs_file = 1,
@@ -119,7 +119,7 @@ const struct io_issue_def io_issue_defs[] = {
.iopoll_queue = 1,
.async_size = sizeof(struct io_async_rw),
.prep = io_prep_write_fixed,
- .issue = io_write,
+ .issue = io_write_fixed,
},
[IORING_OP_POLL_ADD] = {
.needs_file = 1,
diff --git a/io_uring/rw.c b/io_uring/rw.c
index eb369142d64ad..728d695d2552a 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -346,31 +346,30 @@ int io_prep_writev(struct io_kiocb *req, const struct io_uring_sqe *sqe)
return io_prep_rwv(req, sqe, ITER_SOURCE);
}
-static int io_prep_rw_fixed(struct io_kiocb *req, const struct io_uring_sqe *sqe,
+static int io_init_rw_fixed(struct io_kiocb *req, unsigned int issue_flags,
int ddir)
{
struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
- struct io_async_rw *io;
+ struct io_async_rw *io = req->async_data;
int ret;
- ret = io_prep_rw(req, sqe, ddir, false);
- if (unlikely(ret))
- return ret;
+ if (io->bytes_done)
+ return 0;
- io = req->async_data;
- ret = io_import_reg_buf(req, &io->iter, rw->addr, rw->len, ddir, 0);
+ ret = io_import_reg_buf(req, &io->iter, rw->addr, rw->len, ddir,
+ issue_flags);
iov_iter_save_state(&io->iter, &io->iter_state);
return ret;
}
int io_prep_read_fixed(struct io_kiocb *req, const struct io_uring_sqe *sqe)
{
- return io_prep_rw_fixed(req, sqe, ITER_DEST);
+ return io_prep_rw(req, sqe, ITER_DEST, false);
}
int io_prep_write_fixed(struct io_kiocb *req, const struct io_uring_sqe *sqe)
{
- return io_prep_rw_fixed(req, sqe, ITER_SOURCE);
+ return io_prep_rw(req, sqe, ITER_SOURCE, false);
}
/*
@@ -1136,6 +1135,28 @@ int io_write(struct io_kiocb *req, unsigned int issue_flags)
}
}
+int io_read_fixed(struct io_kiocb *req, unsigned int issue_flags)
+{
+ int ret;
+
+ ret = io_init_rw_fixed(req, issue_flags, ITER_DEST);
+ if (unlikely(ret))
+ return ret;
+
+ return io_read(req, issue_flags);
+}
+
+int io_write_fixed(struct io_kiocb *req, unsigned int issue_flags)
+{
+ int ret;
+
+ ret = io_init_rw_fixed(req, issue_flags, ITER_SOURCE);
+ if (unlikely(ret))
+ return ret;
+
+ return io_write(req, issue_flags);
+}
+
void io_rw_fail(struct io_kiocb *req)
{
int res;
diff --git a/io_uring/rw.h b/io_uring/rw.h
index a45e0c71b59d6..bf121b81ebe84 100644
--- a/io_uring/rw.h
+++ b/io_uring/rw.h
@@ -38,6 +38,8 @@ int io_prep_read(struct io_kiocb *req, const struct io_uring_sqe *sqe);
int io_prep_write(struct io_kiocb *req, const struct io_uring_sqe *sqe);
int io_read(struct io_kiocb *req, unsigned int issue_flags);
int io_write(struct io_kiocb *req, unsigned int issue_flags);
+int io_read_fixed(struct io_kiocb *req, unsigned int issue_flags);
+int io_write_fixed(struct io_kiocb *req, unsigned int issue_flags);
void io_readv_writev_cleanup(struct io_kiocb *req);
void io_rw_fail(struct io_kiocb *req);
void io_req_rw_complete(struct io_kiocb *req, io_tw_token_t tw);
--
2.43.5
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCHv7 2/6] io_uring: add support for kernel registered bvecs
2025-02-26 18:20 [PATCHv7 0/6] ublk zero copy support Keith Busch
2025-02-26 18:20 ` [PATCHv7 1/6] io_uring/rw: move fixed buffer import to issue path Keith Busch
@ 2025-02-26 18:20 ` Keith Busch
2025-02-26 20:36 ` Jens Axboe
2025-02-27 15:54 ` Pavel Begunkov
2025-02-26 18:20 ` [PATCHv7 3/6] nvme: map uring_cmd data even if address is 0 Keith Busch
` (3 subsequent siblings)
5 siblings, 2 replies; 23+ messages in thread
From: Keith Busch @ 2025-02-26 18:20 UTC (permalink / raw)
To: ming.lei, asml.silence, axboe, linux-block, io-uring
Cc: bernd, csander, linux-nvme, Keith Busch
From: Keith Busch <kbusch@kernel.org>
Provide an interface for the kernel to leverage the existing
pre-registered buffers that io_uring provides. User space can reference
these later to achieve zero-copy IO.
User space must register an empty fixed buffer table with io_uring in
order for the kernel to make use of it.
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
include/linux/io_uring/cmd.h | 7 ++
io_uring/io_uring.c | 3 +
io_uring/rsrc.c | 122 +++++++++++++++++++++++++++++++++--
io_uring/rsrc.h | 8 +++
4 files changed, 133 insertions(+), 7 deletions(-)
diff --git a/include/linux/io_uring/cmd.h b/include/linux/io_uring/cmd.h
index 87150dc0a07cf..cf8d80d847344 100644
--- a/include/linux/io_uring/cmd.h
+++ b/include/linux/io_uring/cmd.h
@@ -4,6 +4,7 @@
#include <uapi/linux/io_uring.h>
#include <linux/io_uring_types.h>
+#include <linux/blk-mq.h>
/* only top 8 bits of sqe->uring_cmd_flags for kernel internal use */
#define IORING_URING_CMD_CANCELABLE (1U << 30)
@@ -125,4 +126,10 @@ static inline struct io_uring_cmd_data *io_uring_cmd_get_async_data(struct io_ur
return cmd_to_io_kiocb(cmd)->async_data;
}
+int io_buffer_register_bvec(struct io_uring_cmd *cmd, struct request *rq,
+ void (*release)(void *), unsigned int index,
+ unsigned int issue_flags);
+void io_buffer_unregister_bvec(struct io_uring_cmd *cmd, unsigned int index,
+ unsigned int issue_flags);
+
#endif /* _LINUX_IO_URING_CMD_H */
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index db1c0792def63..31e936d468051 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -3947,6 +3947,9 @@ static int __init io_uring_init(void)
io_uring_optable_init();
+ /* imu->perm is u8 */
+ BUILD_BUG_ON((IO_IMU_DEST | IO_IMU_SOURCE) > U8_MAX);
+
/*
* Allow user copy in the per-command field, which starts after the
* file in io_kiocb and until the opcode field. The openat2 handling
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index f814526982c36..5b234e84dcba6 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -9,6 +9,7 @@
#include <linux/hugetlb.h>
#include <linux/compat.h>
#include <linux/io_uring.h>
+#include <linux/io_uring/cmd.h>
#include <uapi/linux/io_uring.h>
@@ -104,14 +105,21 @@ int io_buffer_validate(struct iovec *iov)
static void io_buffer_unmap(struct io_ring_ctx *ctx, struct io_rsrc_node *node)
{
struct io_mapped_ubuf *imu = node->buf;
- unsigned int i;
if (!refcount_dec_and_test(&imu->refs))
return;
- for (i = 0; i < imu->nr_bvecs; i++)
- unpin_user_page(imu->bvec[i].bv_page);
- if (imu->acct_pages)
- io_unaccount_mem(ctx, imu->acct_pages);
+
+ if (imu->release) {
+ imu->release(imu->priv);
+ } else {
+ unsigned int i;
+
+ for (i = 0; i < imu->nr_bvecs; i++)
+ unpin_user_page(imu->bvec[i].bv_page);
+ if (imu->acct_pages)
+ io_unaccount_mem(ctx, imu->acct_pages);
+ }
+
kvfree(imu);
}
@@ -761,6 +769,9 @@ static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
imu->len = iov->iov_len;
imu->nr_bvecs = nr_pages;
imu->folio_shift = PAGE_SHIFT;
+ imu->release = NULL;
+ imu->priv = NULL;
+ imu->perm = IO_IMU_DEST | IO_IMU_SOURCE;
if (coalesced)
imu->folio_shift = data.folio_shift;
refcount_set(&imu->refs, 1);
@@ -857,6 +868,94 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
return ret;
}
+int io_buffer_register_bvec(struct io_uring_cmd *cmd, struct request *rq,
+ void (*release)(void *), unsigned int index,
+ unsigned int issue_flags)
+{
+ struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
+ struct io_rsrc_data *data = &ctx->buf_table;
+ struct req_iterator rq_iter;
+ struct io_mapped_ubuf *imu;
+ struct io_rsrc_node *node;
+ struct bio_vec bv, *bvec;
+ u16 nr_bvecs;
+ int ret = 0;
+
+ io_ring_submit_lock(ctx, issue_flags);
+ if (index >= data->nr) {
+ ret = -EINVAL;
+ goto unlock;
+ }
+ index = array_index_nospec(index, data->nr);
+
+ if (data->nodes[index]) {
+ ret = -EBUSY;
+ goto unlock;
+ }
+
+ node = io_rsrc_node_alloc(IORING_RSRC_BUFFER);
+ if (!node) {
+ ret = -ENOMEM;
+ goto unlock;
+ }
+
+ nr_bvecs = blk_rq_nr_phys_segments(rq);
+ imu = kvmalloc(struct_size(imu, bvec, nr_bvecs), GFP_KERNEL);
+ if (!imu) {
+ kfree(node);
+ ret = -ENOMEM;
+ goto unlock;
+ }
+
+ imu->ubuf = 0;
+ imu->len = blk_rq_bytes(rq);
+ imu->acct_pages = 0;
+ imu->folio_shift = PAGE_SHIFT;
+ imu->nr_bvecs = nr_bvecs;
+ refcount_set(&imu->refs, 1);
+ imu->release = release;
+ imu->priv = rq;
+
+ if (op_is_write(req_op(rq)))
+ imu->perm = IO_IMU_SOURCE;
+ else
+ imu->perm = IO_IMU_DEST;
+
+ bvec = imu->bvec;
+ rq_for_each_bvec(bv, rq, rq_iter)
+ *bvec++ = bv;
+
+ node->buf = imu;
+ data->nodes[index] = node;
+unlock:
+ io_ring_submit_unlock(ctx, issue_flags);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(io_buffer_register_bvec);
+
+void io_buffer_unregister_bvec(struct io_uring_cmd *cmd, unsigned int index,
+ unsigned int issue_flags)
+{
+ struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
+ struct io_rsrc_data *data = &ctx->buf_table;
+ struct io_rsrc_node *node;
+
+ io_ring_submit_lock(ctx, issue_flags);
+ if (index >= data->nr)
+ goto unlock;
+ index = array_index_nospec(index, data->nr);
+
+ node = data->nodes[index];
+ if (!node || !node->buf->release)
+ goto unlock;
+
+ io_put_rsrc_node(ctx, node);
+ data->nodes[index] = NULL;
+unlock:
+ io_ring_submit_unlock(ctx, issue_flags);
+}
+EXPORT_SYMBOL_GPL(io_buffer_unregister_bvec);
+
static int io_import_fixed(int ddir, struct iov_iter *iter,
struct io_mapped_ubuf *imu,
u64 buf_addr, size_t len)
@@ -871,6 +970,8 @@ static int io_import_fixed(int ddir, struct iov_iter *iter,
/* not inside the mapped region */
if (unlikely(buf_addr < imu->ubuf || buf_end > (imu->ubuf + imu->len)))
return -EFAULT;
+ if (!(imu->perm & (1 << ddir)))
+ return -EFAULT;
/*
* Might not be a start of buffer, set size appropriately
@@ -883,8 +984,8 @@ static int io_import_fixed(int ddir, struct iov_iter *iter,
/*
* Don't use iov_iter_advance() here, as it's really slow for
* using the latter parts of a big fixed buffer - it iterates
- * over each segment manually. We can cheat a bit here, because
- * we know that:
+ * over each segment manually. We can cheat a bit here for user
+ * registered nodes, because we know that:
*
* 1) it's a BVEC iter, we set it up
* 2) all bvecs are the same in size, except potentially the
@@ -898,8 +999,15 @@ static int io_import_fixed(int ddir, struct iov_iter *iter,
*/
const struct bio_vec *bvec = imu->bvec;
+ /*
+ * Kernel buffer bvecs, on the other hand, don't necessarily
+ * have the size property of user registered ones, so we have
+ * to use the slow iter advance.
+ */
if (offset < bvec->bv_len) {
iter->iov_offset = offset;
+ } else if (imu->release) {
+ iov_iter_advance(iter, offset);
} else {
unsigned long seg_skip;
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index f0e9080599646..9668804afddc4 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -20,6 +20,11 @@ struct io_rsrc_node {
};
};
+enum {
+ IO_IMU_DEST = 1 << ITER_DEST,
+ IO_IMU_SOURCE = 1 << ITER_SOURCE,
+};
+
struct io_mapped_ubuf {
u64 ubuf;
unsigned int len;
@@ -27,6 +32,9 @@ struct io_mapped_ubuf {
unsigned int folio_shift;
refcount_t refs;
unsigned long acct_pages;
+ void (*release)(void *);
+ void *priv;
+ u8 perm;
struct bio_vec bvec[] __counted_by(nr_bvecs);
};
--
2.43.5
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCHv7 3/6] nvme: map uring_cmd data even if address is 0
2025-02-26 18:20 [PATCHv7 0/6] ublk zero copy support Keith Busch
2025-02-26 18:20 ` [PATCHv7 1/6] io_uring/rw: move fixed buffer import to issue path Keith Busch
2025-02-26 18:20 ` [PATCHv7 2/6] io_uring: add support for kernel registered bvecs Keith Busch
@ 2025-02-26 18:20 ` Keith Busch
2025-02-26 18:20 ` [PATCHv7 4/6] ublk: zc register/unregister bvec Keith Busch
` (2 subsequent siblings)
5 siblings, 0 replies; 23+ messages in thread
From: Keith Busch @ 2025-02-26 18:20 UTC (permalink / raw)
To: ming.lei, asml.silence, axboe, linux-block, io-uring
Cc: bernd, csander, linux-nvme, Xinyu Zhang, Keith Busch
From: Xinyu Zhang <xizhang@purestorage.com>
When using kernel registered bvec fixed buffers, the "address" is
actually the offset into the bvec rather than userspace address.
Therefore it can be 0.
We can skip checking whether the address is NULL before mapping
uring_cmd data. Bad userspace address will be handled properly later when
the user buffer is imported.
With this patch, we will be able to use the kernel registered bvec fixed
buffers in io_uring NVMe passthru with ublk zero-copy support.
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Xinyu Zhang <xizhang@purestorage.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
drivers/nvme/host/ioctl.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
index 5078d9aaf88fc..ecf136489044f 100644
--- a/drivers/nvme/host/ioctl.c
+++ b/drivers/nvme/host/ioctl.c
@@ -516,7 +516,7 @@ static int nvme_uring_cmd_io(struct nvme_ctrl *ctrl, struct nvme_ns *ns,
return PTR_ERR(req);
req->timeout = d.timeout_ms ? msecs_to_jiffies(d.timeout_ms) : 0;
- if (d.addr && d.data_len) {
+ if (d.data_len) {
ret = nvme_map_user_request(req, d.addr,
d.data_len, nvme_to_user_ptr(d.metadata),
d.metadata_len, ioucmd, vec, issue_flags);
--
2.43.5
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCHv7 4/6] ublk: zc register/unregister bvec
2025-02-26 18:20 [PATCHv7 0/6] ublk zero copy support Keith Busch
` (2 preceding siblings ...)
2025-02-26 18:20 ` [PATCHv7 3/6] nvme: map uring_cmd data even if address is 0 Keith Busch
@ 2025-02-26 18:20 ` Keith Busch
2025-02-26 18:40 ` Jens Axboe
` (2 more replies)
2025-02-26 18:21 ` [PATCHv7 5/6] io_uring: add abstraction for buf_table rsrc data Keith Busch
2025-02-26 18:21 ` [PATCHv7 6/6] io_uring: cache nodes and mapped buffers Keith Busch
5 siblings, 3 replies; 23+ messages in thread
From: Keith Busch @ 2025-02-26 18:20 UTC (permalink / raw)
To: ming.lei, asml.silence, axboe, linux-block, io-uring
Cc: bernd, csander, linux-nvme, Keith Busch
From: Keith Busch <kbusch@kernel.org>
Provide new operations for the user to request mapping an active request
to an io uring instance's buf_table. The user has to provide the index
it wants to install the buffer.
A reference count is taken on the request to ensure it can't be
completed while it is active in a ring's buf_table.
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
drivers/block/ublk_drv.c | 119 +++++++++++++++++++++++-----------
include/uapi/linux/ublk_cmd.h | 4 ++
2 files changed, 86 insertions(+), 37 deletions(-)
diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 529085181f355..dc9ff869aa560 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -51,6 +51,9 @@
/* private ioctl command mirror */
#define UBLK_CMD_DEL_DEV_ASYNC _IOC_NR(UBLK_U_CMD_DEL_DEV_ASYNC)
+#define UBLK_IO_REGISTER_IO_BUF _IOC_NR(UBLK_U_IO_REGISTER_IO_BUF)
+#define UBLK_IO_UNREGISTER_IO_BUF _IOC_NR(UBLK_U_IO_UNREGISTER_IO_BUF)
+
/* All UBLK_F_* have to be included into UBLK_F_ALL */
#define UBLK_F_ALL (UBLK_F_SUPPORT_ZERO_COPY \
| UBLK_F_URING_CMD_COMP_IN_TASK \
@@ -201,7 +204,7 @@ static inline struct ublksrv_io_desc *ublk_get_iod(struct ublk_queue *ubq,
int tag);
static inline bool ublk_dev_is_user_copy(const struct ublk_device *ub)
{
- return ub->dev_info.flags & UBLK_F_USER_COPY;
+ return ub->dev_info.flags & (UBLK_F_USER_COPY | UBLK_F_SUPPORT_ZERO_COPY);
}
static inline bool ublk_dev_is_zoned(const struct ublk_device *ub)
@@ -581,7 +584,7 @@ static void ublk_apply_params(struct ublk_device *ub)
static inline bool ublk_support_user_copy(const struct ublk_queue *ubq)
{
- return ubq->flags & UBLK_F_USER_COPY;
+ return ubq->flags & (UBLK_F_USER_COPY | UBLK_F_SUPPORT_ZERO_COPY);
}
static inline bool ublk_need_req_ref(const struct ublk_queue *ubq)
@@ -1747,6 +1750,77 @@ static inline void ublk_prep_cancel(struct io_uring_cmd *cmd,
io_uring_cmd_mark_cancelable(cmd, issue_flags);
}
+static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub,
+ struct ublk_queue *ubq, int tag, size_t offset)
+{
+ struct request *req;
+
+ if (!ublk_need_req_ref(ubq))
+ return NULL;
+
+ req = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], tag);
+ if (!req)
+ return NULL;
+
+ if (!ublk_get_req_ref(ubq, req))
+ return NULL;
+
+ if (unlikely(!blk_mq_request_started(req) || req->tag != tag))
+ goto fail_put;
+
+ if (!ublk_rq_has_data(req))
+ goto fail_put;
+
+ if (offset > blk_rq_bytes(req))
+ goto fail_put;
+
+ return req;
+fail_put:
+ ublk_put_req_ref(ubq, req);
+ return NULL;
+}
+
+static void ublk_io_release(void *priv)
+{
+ struct request *rq = priv;
+ struct ublk_queue *ubq = rq->mq_hctx->driver_data;
+
+ ublk_put_req_ref(ubq, rq);
+}
+
+static int ublk_register_io_buf(struct io_uring_cmd *cmd,
+ struct ublk_queue *ubq, unsigned int tag,
+ const struct ublksrv_io_cmd *ub_cmd,
+ unsigned int issue_flags)
+{
+ struct ublk_device *ub = cmd->file->private_data;
+ int index = (int)ub_cmd->addr, ret;
+ struct request *req;
+
+ req = __ublk_check_and_get_req(ub, ubq, tag, 0);
+ if (!req)
+ return -EINVAL;
+
+ ret = io_buffer_register_bvec(cmd, req, ublk_io_release, index,
+ issue_flags);
+ if (ret) {
+ ublk_put_req_ref(ubq, req);
+ return ret;
+ }
+
+ return 0;
+}
+
+static int ublk_unregister_io_buf(struct io_uring_cmd *cmd,
+ const struct ublksrv_io_cmd *ub_cmd,
+ unsigned int issue_flags)
+{
+ int index = (int)ub_cmd->addr;
+
+ io_buffer_unregister_bvec(cmd, index, issue_flags);
+ return 0;
+}
+
static int __ublk_ch_uring_cmd(struct io_uring_cmd *cmd,
unsigned int issue_flags,
const struct ublksrv_io_cmd *ub_cmd)
@@ -1798,6 +1872,10 @@ static int __ublk_ch_uring_cmd(struct io_uring_cmd *cmd,
ret = -EINVAL;
switch (_IOC_NR(cmd_op)) {
+ case UBLK_IO_REGISTER_IO_BUF:
+ return ublk_register_io_buf(cmd, ubq, tag, ub_cmd, issue_flags);
+ case UBLK_IO_UNREGISTER_IO_BUF:
+ return ublk_unregister_io_buf(cmd, ub_cmd, issue_flags);
case UBLK_IO_FETCH_REQ:
/* UBLK_IO_FETCH_REQ is only allowed before queue is setup */
if (ublk_queue_ready(ubq)) {
@@ -1872,36 +1950,6 @@ static int __ublk_ch_uring_cmd(struct io_uring_cmd *cmd,
return -EIOCBQUEUED;
}
-static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub,
- struct ublk_queue *ubq, int tag, size_t offset)
-{
- struct request *req;
-
- if (!ublk_need_req_ref(ubq))
- return NULL;
-
- req = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], tag);
- if (!req)
- return NULL;
-
- if (!ublk_get_req_ref(ubq, req))
- return NULL;
-
- if (unlikely(!blk_mq_request_started(req) || req->tag != tag))
- goto fail_put;
-
- if (!ublk_rq_has_data(req))
- goto fail_put;
-
- if (offset > blk_rq_bytes(req))
- goto fail_put;
-
- return req;
-fail_put:
- ublk_put_req_ref(ubq, req);
- return NULL;
-}
-
static inline int ublk_ch_uring_cmd_local(struct io_uring_cmd *cmd,
unsigned int issue_flags)
{
@@ -2459,7 +2507,7 @@ static int ublk_ctrl_add_dev(struct io_uring_cmd *cmd)
* buffer by pwrite() to ublk char device, which can't be
* used for unprivileged device
*/
- if (info.flags & UBLK_F_USER_COPY)
+ if (info.flags & UBLK_F_USER_COPY | UBLK_F_SUPPORT_ZERO_COPY)
return -EINVAL;
}
@@ -2527,9 +2575,6 @@ static int ublk_ctrl_add_dev(struct io_uring_cmd *cmd)
goto out_free_dev_number;
}
- /* We are not ready to support zero copy */
- ub->dev_info.flags &= ~UBLK_F_SUPPORT_ZERO_COPY;
-
ub->dev_info.nr_hw_queues = min_t(unsigned int,
ub->dev_info.nr_hw_queues, nr_cpu_ids);
ublk_align_max_io_size(ub);
@@ -2860,7 +2905,7 @@ static int ublk_ctrl_get_features(struct io_uring_cmd *cmd)
{
const struct ublksrv_ctrl_cmd *header = io_uring_sqe_cmd(cmd->sqe);
void __user *argp = (void __user *)(unsigned long)header->addr;
- u64 features = UBLK_F_ALL & ~UBLK_F_SUPPORT_ZERO_COPY;
+ u64 features = UBLK_F_ALL;
if (header->len != UBLK_FEATURES_LEN || !header->addr)
return -EINVAL;
diff --git a/include/uapi/linux/ublk_cmd.h b/include/uapi/linux/ublk_cmd.h
index a8bc98bb69fce..74246c926b55f 100644
--- a/include/uapi/linux/ublk_cmd.h
+++ b/include/uapi/linux/ublk_cmd.h
@@ -94,6 +94,10 @@
_IOWR('u', UBLK_IO_COMMIT_AND_FETCH_REQ, struct ublksrv_io_cmd)
#define UBLK_U_IO_NEED_GET_DATA \
_IOWR('u', UBLK_IO_NEED_GET_DATA, struct ublksrv_io_cmd)
+#define UBLK_U_IO_REGISTER_IO_BUF \
+ _IOWR('u', 0x23, struct ublksrv_io_cmd)
+#define UBLK_U_IO_UNREGISTER_IO_BUF \
+ _IOWR('u', 0x24, struct ublksrv_io_cmd)
/* only ABORT means that no re-fetch */
#define UBLK_IO_RES_OK 0
--
2.43.5
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCHv7 5/6] io_uring: add abstraction for buf_table rsrc data
2025-02-26 18:20 [PATCHv7 0/6] ublk zero copy support Keith Busch
` (3 preceding siblings ...)
2025-02-26 18:20 ` [PATCHv7 4/6] ublk: zc register/unregister bvec Keith Busch
@ 2025-02-26 18:21 ` Keith Busch
2025-02-26 18:21 ` [PATCHv7 6/6] io_uring: cache nodes and mapped buffers Keith Busch
5 siblings, 0 replies; 23+ messages in thread
From: Keith Busch @ 2025-02-26 18:21 UTC (permalink / raw)
To: ming.lei, asml.silence, axboe, linux-block, io-uring
Cc: bernd, csander, linux-nvme, Keith Busch
From: Keith Busch <kbusch@kernel.org>
We'll need to add more fields specific to the registered buffers, so
make a layer for it now. No functional change in this patch.
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
include/linux/io_uring_types.h | 6 +++-
io_uring/fdinfo.c | 8 +++---
io_uring/nop.c | 2 +-
io_uring/register.c | 2 +-
io_uring/rsrc.c | 51 +++++++++++++++++-----------------
5 files changed, 36 insertions(+), 33 deletions(-)
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index c0fe8a00fe53a..a05ae4cb98a4c 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -69,6 +69,10 @@ struct io_file_table {
unsigned int alloc_hint;
};
+struct io_buf_table {
+ struct io_rsrc_data data;
+};
+
struct io_hash_bucket {
struct hlist_head list;
} ____cacheline_aligned_in_smp;
@@ -293,7 +297,7 @@ struct io_ring_ctx {
struct io_wq_work_list iopoll_list;
struct io_file_table file_table;
- struct io_rsrc_data buf_table;
+ struct io_buf_table buf_table;
struct io_submit_state submit_state;
diff --git a/io_uring/fdinfo.c b/io_uring/fdinfo.c
index f60d0a9d505e2..d389c06cbce10 100644
--- a/io_uring/fdinfo.c
+++ b/io_uring/fdinfo.c
@@ -217,12 +217,12 @@ __cold void io_uring_show_fdinfo(struct seq_file *m, struct file *file)
seq_puts(m, "\n");
}
}
- seq_printf(m, "UserBufs:\t%u\n", ctx->buf_table.nr);
- for (i = 0; has_lock && i < ctx->buf_table.nr; i++) {
+ seq_printf(m, "UserBufs:\t%u\n", ctx->buf_table.data.nr);
+ for (i = 0; has_lock && i < ctx->buf_table.data.nr; i++) {
struct io_mapped_ubuf *buf = NULL;
- if (ctx->buf_table.nodes[i])
- buf = ctx->buf_table.nodes[i]->buf;
+ if (ctx->buf_table.data.nodes[i])
+ buf = ctx->buf_table.data.nodes[i]->buf;
if (buf)
seq_printf(m, "%5u: 0x%llx/%u\n", i, buf->ubuf, buf->len);
else
diff --git a/io_uring/nop.c b/io_uring/nop.c
index ea539531cb5f6..da8870e00eee7 100644
--- a/io_uring/nop.c
+++ b/io_uring/nop.c
@@ -66,7 +66,7 @@ int io_nop(struct io_kiocb *req, unsigned int issue_flags)
ret = -EFAULT;
io_ring_submit_lock(ctx, issue_flags);
- node = io_rsrc_node_lookup(&ctx->buf_table, req->buf_index);
+ node = io_rsrc_node_lookup(&ctx->buf_table.data, req->buf_index);
if (node) {
io_req_assign_buf_node(req, node);
ret = 0;
diff --git a/io_uring/register.c b/io_uring/register.c
index cc23a4c205cd4..f15a8d52ad30f 100644
--- a/io_uring/register.c
+++ b/io_uring/register.c
@@ -926,7 +926,7 @@ SYSCALL_DEFINE4(io_uring_register, unsigned int, fd, unsigned int, opcode,
ret = __io_uring_register(ctx, opcode, arg, nr_args);
trace_io_uring_register(ctx, opcode, ctx->file_table.data.nr,
- ctx->buf_table.nr, ret);
+ ctx->buf_table.data.nr, ret);
mutex_unlock(&ctx->uring_lock);
fput(file);
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index 5b234e84dcba6..c30a5cda08f3e 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -236,9 +236,9 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
__u32 done;
int i, err;
- if (!ctx->buf_table.nr)
+ if (!ctx->buf_table.data.nr)
return -ENXIO;
- if (up->offset + nr_args > ctx->buf_table.nr)
+ if (up->offset + nr_args > ctx->buf_table.data.nr)
return -EINVAL;
for (done = 0; done < nr_args; done++) {
@@ -270,9 +270,9 @@ static int __io_sqe_buffers_update(struct io_ring_ctx *ctx,
}
node->tag = tag;
}
- i = array_index_nospec(up->offset + done, ctx->buf_table.nr);
- io_reset_rsrc_node(ctx, &ctx->buf_table, i);
- ctx->buf_table.nodes[i] = node;
+ i = array_index_nospec(up->offset + done, ctx->buf_table.data.nr);
+ io_reset_rsrc_node(ctx, &ctx->buf_table.data, i);
+ ctx->buf_table.data.nodes[i] = node;
if (ctx->compat)
user_data += sizeof(struct compat_iovec);
else
@@ -550,9 +550,9 @@ int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
int io_sqe_buffers_unregister(struct io_ring_ctx *ctx)
{
- if (!ctx->buf_table.nr)
+ if (!ctx->buf_table.data.nr)
return -ENXIO;
- io_rsrc_data_free(ctx, &ctx->buf_table);
+ io_rsrc_data_free(ctx, &ctx->buf_table.data);
return 0;
}
@@ -579,8 +579,8 @@ static bool headpage_already_acct(struct io_ring_ctx *ctx, struct page **pages,
}
/* check previously registered pages */
- for (i = 0; i < ctx->buf_table.nr; i++) {
- struct io_rsrc_node *node = ctx->buf_table.nodes[i];
+ for (i = 0; i < ctx->buf_table.data.nr; i++) {
+ struct io_rsrc_node *node = ctx->buf_table.data.nodes[i];
struct io_mapped_ubuf *imu;
if (!node)
@@ -809,7 +809,7 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
BUILD_BUG_ON(IORING_MAX_REG_BUFFERS >= (1u << 16));
- if (ctx->buf_table.nr)
+ if (ctx->buf_table.data.nr)
return -EBUSY;
if (!nr_args || nr_args > IORING_MAX_REG_BUFFERS)
return -EINVAL;
@@ -862,7 +862,7 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
data.nodes[i] = node;
}
- ctx->buf_table = data;
+ ctx->buf_table.data = data;
if (ret)
io_sqe_buffers_unregister(ctx);
return ret;
@@ -873,7 +873,7 @@ int io_buffer_register_bvec(struct io_uring_cmd *cmd, struct request *rq,
unsigned int issue_flags)
{
struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
- struct io_rsrc_data *data = &ctx->buf_table;
+ struct io_rsrc_data *data = &ctx->buf_table.data;
struct req_iterator rq_iter;
struct io_mapped_ubuf *imu;
struct io_rsrc_node *node;
@@ -937,7 +937,7 @@ void io_buffer_unregister_bvec(struct io_uring_cmd *cmd, unsigned int index,
unsigned int issue_flags)
{
struct io_ring_ctx *ctx = cmd_to_io_kiocb(cmd)->ctx;
- struct io_rsrc_data *data = &ctx->buf_table;
+ struct io_rsrc_data *data = &ctx->buf_table.data;
struct io_rsrc_node *node;
io_ring_submit_lock(ctx, issue_flags);
@@ -1034,7 +1034,7 @@ static inline struct io_rsrc_node *io_find_buf_node(struct io_kiocb *req,
return req->buf_node;
io_ring_submit_lock(ctx, issue_flags);
- node = io_rsrc_node_lookup(&ctx->buf_table, req->buf_index);
+ node = io_rsrc_node_lookup(&ctx->buf_table.data, req->buf_index);
if (node)
io_req_assign_buf_node(req, node);
io_ring_submit_unlock(ctx, issue_flags);
@@ -1084,10 +1084,10 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
if (!arg->nr && (arg->dst_off || arg->src_off))
return -EINVAL;
/* not allowed unless REPLACE is set */
- if (ctx->buf_table.nr && !(arg->flags & IORING_REGISTER_DST_REPLACE))
+ if (ctx->buf_table.data.nr && !(arg->flags & IORING_REGISTER_DST_REPLACE))
return -EBUSY;
- nbufs = src_ctx->buf_table.nr;
+ nbufs = src_ctx->buf_table.data.nr;
if (!arg->nr)
arg->nr = nbufs;
else if (arg->nr > nbufs)
@@ -1097,13 +1097,13 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
if (check_add_overflow(arg->nr, arg->dst_off, &nbufs))
return -EOVERFLOW;
- ret = io_rsrc_data_alloc(&data, max(nbufs, ctx->buf_table.nr));
+ ret = io_rsrc_data_alloc(&data, max(nbufs, ctx->buf_table.data.nr));
if (ret)
return ret;
/* Fill entries in data from dst that won't overlap with src */
- for (i = 0; i < min(arg->dst_off, ctx->buf_table.nr); i++) {
- struct io_rsrc_node *src_node = ctx->buf_table.nodes[i];
+ for (i = 0; i < min(arg->dst_off, ctx->buf_table.data.nr); i++) {
+ struct io_rsrc_node *src_node = ctx->buf_table.data.nodes[i];
if (src_node) {
data.nodes[i] = src_node;
@@ -1112,7 +1112,7 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
}
ret = -ENXIO;
- nbufs = src_ctx->buf_table.nr;
+ nbufs = src_ctx->buf_table.data.nr;
if (!nbufs)
goto out_free;
ret = -EINVAL;
@@ -1132,7 +1132,7 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
while (nr--) {
struct io_rsrc_node *dst_node, *src_node;
- src_node = io_rsrc_node_lookup(&src_ctx->buf_table, i);
+ src_node = io_rsrc_node_lookup(&src_ctx->buf_table.data, i);
if (!src_node) {
dst_node = NULL;
} else {
@@ -1154,7 +1154,7 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
* old and new nodes at this point.
*/
if (arg->flags & IORING_REGISTER_DST_REPLACE)
- io_rsrc_data_free(ctx, &ctx->buf_table);
+ io_sqe_buffers_unregister(ctx);
/*
* ctx->buf_table must be empty now - either the contents are being
@@ -1162,10 +1162,9 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
* copied to a ring that does not have buffers yet (checked at function
* entry).
*/
- WARN_ON_ONCE(ctx->buf_table.nr);
- ctx->buf_table = data;
+ WARN_ON_ONCE(ctx->buf_table.data.nr);
+ ctx->buf_table.data = data;
return 0;
-
out_free:
io_rsrc_data_free(ctx, &data);
return ret;
@@ -1190,7 +1189,7 @@ int io_register_clone_buffers(struct io_ring_ctx *ctx, void __user *arg)
return -EFAULT;
if (buf.flags & ~(IORING_REGISTER_SRC_REGISTERED|IORING_REGISTER_DST_REPLACE))
return -EINVAL;
- if (!(buf.flags & IORING_REGISTER_DST_REPLACE) && ctx->buf_table.nr)
+ if (!(buf.flags & IORING_REGISTER_DST_REPLACE) && ctx->buf_table.data.nr)
return -EBUSY;
if (memchr_inv(buf.pad, 0, sizeof(buf.pad)))
return -EINVAL;
--
2.43.5
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCHv7 6/6] io_uring: cache nodes and mapped buffers
2025-02-26 18:20 [PATCHv7 0/6] ublk zero copy support Keith Busch
` (4 preceding siblings ...)
2025-02-26 18:21 ` [PATCHv7 5/6] io_uring: add abstraction for buf_table rsrc data Keith Busch
@ 2025-02-26 18:21 ` Keith Busch
2025-02-26 18:48 ` Jens Axboe
2025-02-26 19:36 ` Jens Axboe
5 siblings, 2 replies; 23+ messages in thread
From: Keith Busch @ 2025-02-26 18:21 UTC (permalink / raw)
To: ming.lei, asml.silence, axboe, linux-block, io-uring
Cc: bernd, csander, linux-nvme, Keith Busch
From: Keith Busch <kbusch@kernel.org>
Frequent alloc/free cycles on these is pretty costly. Use an io cache to
more efficiently reuse these buffers.
Signed-off-by: Keith Busch <kbusch@kernel.org>
---
include/linux/io_uring_types.h | 18 ++---
io_uring/filetable.c | 2 +-
io_uring/rsrc.c | 123 +++++++++++++++++++++++++--------
io_uring/rsrc.h | 2 +-
4 files changed, 107 insertions(+), 38 deletions(-)
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index a05ae4cb98a4c..fda3221de2174 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -69,8 +69,18 @@ struct io_file_table {
unsigned int alloc_hint;
};
+struct io_alloc_cache {
+ void **entries;
+ unsigned int nr_cached;
+ unsigned int max_cached;
+ unsigned int elem_size;
+ unsigned int init_clear;
+};
+
struct io_buf_table {
struct io_rsrc_data data;
+ struct io_alloc_cache node_cache;
+ struct io_alloc_cache imu_cache;
};
struct io_hash_bucket {
@@ -224,14 +234,6 @@ struct io_submit_state {
struct blk_plug plug;
};
-struct io_alloc_cache {
- void **entries;
- unsigned int nr_cached;
- unsigned int max_cached;
- unsigned int elem_size;
- unsigned int init_clear;
-};
-
struct io_ring_ctx {
/* const or read-mostly hot data */
struct {
diff --git a/io_uring/filetable.c b/io_uring/filetable.c
index dd8eeec97acf6..a21660e3145ab 100644
--- a/io_uring/filetable.c
+++ b/io_uring/filetable.c
@@ -68,7 +68,7 @@ static int io_install_fixed_file(struct io_ring_ctx *ctx, struct file *file,
if (slot_index >= ctx->file_table.data.nr)
return -EINVAL;
- node = io_rsrc_node_alloc(IORING_RSRC_FILE);
+ node = io_rsrc_node_alloc(ctx, IORING_RSRC_FILE);
if (!node)
return -ENOMEM;
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index c30a5cda08f3e..8823f15d8fe2e 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -33,6 +33,9 @@ static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
#define IORING_MAX_FIXED_FILES (1U << 20)
#define IORING_MAX_REG_BUFFERS (1U << 14)
+#define IO_CACHED_BVECS_SEGS 32
+#define IO_CACHED_ELEMS 64
+
int __io_account_mem(struct user_struct *user, unsigned long nr_pages)
{
unsigned long page_limit, cur_pages, new_pages;
@@ -102,6 +105,22 @@ int io_buffer_validate(struct iovec *iov)
return 0;
}
+static struct io_mapped_ubuf *io_alloc_imu(struct io_ring_ctx *ctx,
+ int nr_bvecs)
+{
+ if (nr_bvecs <= IO_CACHED_BVECS_SEGS)
+ return io_cache_alloc(&ctx->buf_table.imu_cache, GFP_KERNEL);
+ return kvmalloc(struct_size_t(struct io_mapped_ubuf, bvec, nr_bvecs),
+ GFP_KERNEL);
+}
+
+static void io_free_imu(struct io_ring_ctx *ctx, struct io_mapped_ubuf *imu)
+{
+ if (imu->nr_bvecs > IO_CACHED_BVECS_SEGS ||
+ !io_alloc_cache_put(&ctx->buf_table.imu_cache, imu))
+ kvfree(imu);
+}
+
static void io_buffer_unmap(struct io_ring_ctx *ctx, struct io_rsrc_node *node)
{
struct io_mapped_ubuf *imu = node->buf;
@@ -120,22 +139,35 @@ static void io_buffer_unmap(struct io_ring_ctx *ctx, struct io_rsrc_node *node)
io_unaccount_mem(ctx, imu->acct_pages);
}
- kvfree(imu);
+ io_free_imu(ctx, imu);
}
-struct io_rsrc_node *io_rsrc_node_alloc(int type)
+struct io_rsrc_node *io_rsrc_node_alloc(struct io_ring_ctx *ctx, int type)
{
struct io_rsrc_node *node;
- node = kzalloc(sizeof(*node), GFP_KERNEL);
+ if (type == IORING_RSRC_FILE)
+ node = kmalloc(sizeof(*node), GFP_KERNEL);
+ else
+ node = io_cache_alloc(&ctx->buf_table.node_cache, GFP_KERNEL);
if (node) {
node->type = type;
node->refs = 1;
+ node->tag = 0;
+ node->file_ptr = 0;
}
return node;
}
-__cold void io_rsrc_data_free(struct io_ring_ctx *ctx, struct io_rsrc_data *data)
+static __cold void __io_rsrc_data_free(struct io_rsrc_data *data)
+{
+ kvfree(data->nodes);
+ data->nodes = NULL;
+ data->nr = 0;
+}
+
+__cold void io_rsrc_data_free(struct io_ring_ctx *ctx,
+ struct io_rsrc_data *data)
{
if (!data->nr)
return;
@@ -143,9 +175,7 @@ __cold void io_rsrc_data_free(struct io_ring_ctx *ctx, struct io_rsrc_data *data
if (data->nodes[data->nr])
io_put_rsrc_node(ctx, data->nodes[data->nr]);
}
- kvfree(data->nodes);
- data->nodes = NULL;
- data->nr = 0;
+ __io_rsrc_data_free(data);
}
__cold int io_rsrc_data_alloc(struct io_rsrc_data *data, unsigned nr)
@@ -159,6 +189,33 @@ __cold int io_rsrc_data_alloc(struct io_rsrc_data *data, unsigned nr)
return -ENOMEM;
}
+static __cold int io_rsrc_buffer_alloc(struct io_buf_table *table, unsigned nr)
+{
+ const int imu_cache_size = struct_size_t(struct io_mapped_ubuf, bvec,
+ IO_CACHED_BVECS_SEGS);
+ const int node_size = sizeof(struct io_rsrc_node);
+ int ret;
+
+ ret = io_rsrc_data_alloc(&table->data, nr);
+ if (ret)
+ return ret;
+
+ if (io_alloc_cache_init(&table->node_cache, IO_CACHED_ELEMS,
+ node_size, 0))
+ goto free_data;
+
+ if (io_alloc_cache_init(&table->imu_cache, IO_CACHED_ELEMS,
+ imu_cache_size, 0))
+ goto free_cache;
+
+ return 0;
+free_cache:
+ io_alloc_cache_free(&table->node_cache, kfree);
+free_data:
+ __io_rsrc_data_free(&table->data);
+ return -ENOMEM;
+}
+
static int __io_sqe_files_update(struct io_ring_ctx *ctx,
struct io_uring_rsrc_update2 *up,
unsigned nr_args)
@@ -208,7 +265,7 @@ static int __io_sqe_files_update(struct io_ring_ctx *ctx,
err = -EBADF;
break;
}
- node = io_rsrc_node_alloc(IORING_RSRC_FILE);
+ node = io_rsrc_node_alloc(ctx, IORING_RSRC_FILE);
if (!node) {
err = -ENOMEM;
fput(file);
@@ -460,6 +517,8 @@ void io_free_rsrc_node(struct io_ring_ctx *ctx, struct io_rsrc_node *node)
case IORING_RSRC_BUFFER:
if (node->buf)
io_buffer_unmap(ctx, node);
+ if (io_alloc_cache_put(&ctx->buf_table.node_cache, node))
+ return;
break;
default:
WARN_ON_ONCE(1);
@@ -528,7 +587,7 @@ int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
goto fail;
}
ret = -ENOMEM;
- node = io_rsrc_node_alloc(IORING_RSRC_FILE);
+ node = io_rsrc_node_alloc(ctx, IORING_RSRC_FILE);
if (!node) {
fput(file);
goto fail;
@@ -548,11 +607,19 @@ int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
return ret;
}
+static void io_rsrc_buffer_free(struct io_ring_ctx *ctx,
+ struct io_buf_table *table)
+{
+ io_rsrc_data_free(ctx, &table->data);
+ io_alloc_cache_free(&table->node_cache, kfree);
+ io_alloc_cache_free(&table->imu_cache, kfree);
+}
+
int io_sqe_buffers_unregister(struct io_ring_ctx *ctx)
{
if (!ctx->buf_table.data.nr)
return -ENXIO;
- io_rsrc_data_free(ctx, &ctx->buf_table.data);
+ io_rsrc_buffer_free(ctx, &ctx->buf_table);
return 0;
}
@@ -733,7 +800,7 @@ static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
if (!iov->iov_base)
return NULL;
- node = io_rsrc_node_alloc(IORING_RSRC_BUFFER);
+ node = io_rsrc_node_alloc(ctx, IORING_RSRC_BUFFER);
if (!node)
return ERR_PTR(-ENOMEM);
node->buf = NULL;
@@ -753,7 +820,7 @@ static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
coalesced = io_coalesce_buffer(&pages, &nr_pages, &data);
}
- imu = kvmalloc(struct_size(imu, bvec, nr_pages), GFP_KERNEL);
+ imu = io_alloc_imu(ctx, nr_pages);
if (!imu)
goto done;
@@ -789,7 +856,7 @@ static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
}
done:
if (ret) {
- kvfree(imu);
+ io_free_imu(ctx, imu);
if (node)
io_put_rsrc_node(ctx, node);
node = ERR_PTR(ret);
@@ -802,9 +869,9 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
unsigned int nr_args, u64 __user *tags)
{
struct page *last_hpage = NULL;
- struct io_rsrc_data data;
struct iovec fast_iov, *iov = &fast_iov;
const struct iovec __user *uvec;
+ struct io_buf_table table;
int i, ret;
BUILD_BUG_ON(IORING_MAX_REG_BUFFERS >= (1u << 16));
@@ -813,13 +880,14 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
return -EBUSY;
if (!nr_args || nr_args > IORING_MAX_REG_BUFFERS)
return -EINVAL;
- ret = io_rsrc_data_alloc(&data, nr_args);
+ ret = io_rsrc_buffer_alloc(&table, nr_args);
if (ret)
return ret;
if (!arg)
memset(iov, 0, sizeof(*iov));
+ ctx->buf_table = table;
for (i = 0; i < nr_args; i++) {
struct io_rsrc_node *node;
u64 tag = 0;
@@ -859,10 +927,8 @@ int io_sqe_buffers_register(struct io_ring_ctx *ctx, void __user *arg,
}
node->tag = tag;
}
- data.nodes[i] = node;
+ table.data.nodes[i] = node;
}
-
- ctx->buf_table.data = data;
if (ret)
io_sqe_buffers_unregister(ctx);
return ret;
@@ -893,14 +959,15 @@ int io_buffer_register_bvec(struct io_uring_cmd *cmd, struct request *rq,
goto unlock;
}
- node = io_rsrc_node_alloc(IORING_RSRC_BUFFER);
+ node = io_rsrc_node_alloc(ctx, IORING_RSRC_BUFFER);
if (!node) {
ret = -ENOMEM;
goto unlock;
}
nr_bvecs = blk_rq_nr_phys_segments(rq);
- imu = kvmalloc(struct_size(imu, bvec, nr_bvecs), GFP_KERNEL);
+
+ imu = io_alloc_imu(ctx, nr_bvecs);
if (!imu) {
kfree(node);
ret = -ENOMEM;
@@ -1066,7 +1133,7 @@ static void lock_two_rings(struct io_ring_ctx *ctx1, struct io_ring_ctx *ctx2)
static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx,
struct io_uring_clone_buffers *arg)
{
- struct io_rsrc_data data;
+ struct io_buf_table table;
int i, ret, off, nr;
unsigned int nbufs;
@@ -1097,7 +1164,7 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
if (check_add_overflow(arg->nr, arg->dst_off, &nbufs))
return -EOVERFLOW;
- ret = io_rsrc_data_alloc(&data, max(nbufs, ctx->buf_table.data.nr));
+ ret = io_rsrc_buffer_alloc(&table, max(nbufs, ctx->buf_table.data.nr));
if (ret)
return ret;
@@ -1106,7 +1173,7 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
struct io_rsrc_node *src_node = ctx->buf_table.data.nodes[i];
if (src_node) {
- data.nodes[i] = src_node;
+ table.data.nodes[i] = src_node;
src_node->refs++;
}
}
@@ -1136,7 +1203,7 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
if (!src_node) {
dst_node = NULL;
} else {
- dst_node = io_rsrc_node_alloc(IORING_RSRC_BUFFER);
+ dst_node = io_rsrc_node_alloc(ctx, IORING_RSRC_BUFFER);
if (!dst_node) {
ret = -ENOMEM;
goto out_free;
@@ -1145,12 +1212,12 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
refcount_inc(&src_node->buf->refs);
dst_node->buf = src_node->buf;
}
- data.nodes[off++] = dst_node;
+ table.data.nodes[off++] = dst_node;
i++;
}
/*
- * If asked for replace, put the old table. data->nodes[] holds both
+ * If asked for replace, put the old table. table.data->nodes[] holds both
* old and new nodes at this point.
*/
if (arg->flags & IORING_REGISTER_DST_REPLACE)
@@ -1163,10 +1230,10 @@ static int io_clone_buffers(struct io_ring_ctx *ctx, struct io_ring_ctx *src_ctx
* entry).
*/
WARN_ON_ONCE(ctx->buf_table.data.nr);
- ctx->buf_table.data = data;
+ ctx->buf_table = table;
return 0;
out_free:
- io_rsrc_data_free(ctx, &data);
+ io_rsrc_buffer_free(ctx, &table);
return ret;
}
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index 9668804afddc4..4b39d8104df19 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -47,7 +47,7 @@ struct io_imu_folio_data {
unsigned int nr_folios;
};
-struct io_rsrc_node *io_rsrc_node_alloc(int type);
+struct io_rsrc_node *io_rsrc_node_alloc(struct io_ring_ctx *ctx, int type);
void io_free_rsrc_node(struct io_ring_ctx *ctx, struct io_rsrc_node *node);
void io_rsrc_data_free(struct io_ring_ctx *ctx, struct io_rsrc_data *data);
int io_rsrc_data_alloc(struct io_rsrc_data *data, unsigned nr);
--
2.43.5
^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCHv7 4/6] ublk: zc register/unregister bvec
2025-02-26 18:20 ` [PATCHv7 4/6] ublk: zc register/unregister bvec Keith Busch
@ 2025-02-26 18:40 ` Jens Axboe
2025-02-27 15:08 ` kernel test robot
2025-02-27 15:42 ` kernel test robot
2 siblings, 0 replies; 23+ messages in thread
From: Jens Axboe @ 2025-02-26 18:40 UTC (permalink / raw)
To: Keith Busch, ming.lei, asml.silence, linux-block, io-uring
Cc: bernd, csander, linux-nvme, Keith Busch
On 2/26/25 11:20 AM, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
>
> Provide new operations for the user to request mapping an active request
> to an io uring instance's buf_table. The user has to provide the index
> it wants to install the buffer.
>
> A reference count is taken on the request to ensure it can't be
> completed while it is active in a ring's buf_table.
Looks pretty sane to me, just a few minor nits below where only one of
them actually is required to change.
> +static int ublk_unregister_io_buf(struct io_uring_cmd *cmd,
> + const struct ublksrv_io_cmd *ub_cmd,
> + unsigned int issue_flags)
> +{
> + int index = (int)ub_cmd->addr;
> +
> + io_buffer_unregister_bvec(cmd, index, issue_flags);
> + return 0;
> +}
> +
Minor nit here too, I'd drop 'index' and just cast it in the argument.
> -static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub,
> - struct ublk_queue *ubq, int tag, size_t offset)
> -{
> - struct request *req;
> -
> - if (!ublk_need_req_ref(ubq))
> - return NULL;
> -
> - req = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], tag);
> - if (!req)
> - return NULL;
> -
> - if (!ublk_get_req_ref(ubq, req))
> - return NULL;
> -
> - if (unlikely(!blk_mq_request_started(req) || req->tag != tag))
> - goto fail_put;
> -
> - if (!ublk_rq_has_data(req))
> - goto fail_put;
> -
> - if (offset > blk_rq_bytes(req))
> - goto fail_put;
> -
> - return req;
> -fail_put:
> - ublk_put_req_ref(ubq, req);
> - return NULL;
> -}
> -
This could be a prep patch to cut down on unrelated changes, but not
really important.
> @@ -2459,7 +2507,7 @@ static int ublk_ctrl_add_dev(struct io_uring_cmd *cmd)
> * buffer by pwrite() to ublk char device, which can't be
> * used for unprivileged device
> */
> - if (info.flags & UBLK_F_USER_COPY)
> + if (info.flags & UBLK_F_USER_COPY | UBLK_F_SUPPORT_ZERO_COPY)
> return -EINVAL;
> }
Missing parens here around mask.
--
Jens Axboe
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCHv7 6/6] io_uring: cache nodes and mapped buffers
2025-02-26 18:21 ` [PATCHv7 6/6] io_uring: cache nodes and mapped buffers Keith Busch
@ 2025-02-26 18:48 ` Jens Axboe
2025-02-26 19:36 ` Jens Axboe
1 sibling, 0 replies; 23+ messages in thread
From: Jens Axboe @ 2025-02-26 18:48 UTC (permalink / raw)
To: Keith Busch, ming.lei, asml.silence, linux-block, io-uring
Cc: bernd, csander, linux-nvme, Keith Busch
On 2/26/25 11:21 AM, Keith Busch wrote:
> @@ -789,7 +856,7 @@ static struct io_rsrc_node *io_sqe_buffer_register(struct io_ring_ctx *ctx,
> }
> done:
> if (ret) {
> - kvfree(imu);
> + io_free_imu(ctx, imu);
> if (node)
> io_put_rsrc_node(ctx, node);
> node = ERR_PTR(ret);
'imu' can be NULL here in the error path, which this is. That was fine
before with kvfree(), but it'll bomb now. Best fix is probably just to
check for non-NULL before calling io_free_imu(), as other paths should
be fine.
--
Jens Axboe
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCHv7 1/6] io_uring/rw: move fixed buffer import to issue path
2025-02-26 18:20 ` [PATCHv7 1/6] io_uring/rw: move fixed buffer import to issue path Keith Busch
@ 2025-02-26 19:04 ` Jens Axboe
2025-02-26 20:20 ` Jens Axboe
0 siblings, 1 reply; 23+ messages in thread
From: Jens Axboe @ 2025-02-26 19:04 UTC (permalink / raw)
To: Keith Busch, ming.lei, asml.silence, linux-block, io-uring
Cc: bernd, csander, linux-nvme, Keith Busch
On 2/26/25 11:20 AM, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
>
> Registered buffers may depend on a linked command, which makes the prep
> path too early to import. Move to the issue path when the node is
> actually needed like all the other users of fixed buffers.
Conceptually I think this patch is fine, but it does bother me with
random bool arguments. We could fold in something like the (totally
tested) below diff to get rid of that. What do you think?
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 728d695d2552..a8a46a32f20d 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -248,8 +248,8 @@ static int io_prep_rw_pi(struct io_kiocb *req, struct io_rw *rw, int ddir,
return ret;
}
-static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
- int ddir, bool do_import)
+static int __io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
+ int ddir)
{
struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
unsigned ioprio;
@@ -285,14 +285,6 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
rw->len = READ_ONCE(sqe->len);
rw->flags = READ_ONCE(sqe->rw_flags);
- if (do_import && !io_do_buffer_select(req)) {
- struct io_async_rw *io = req->async_data;
-
- ret = io_import_rw_buffer(ddir, req, io, 0);
- if (unlikely(ret))
- return ret;
- }
-
attr_type_mask = READ_ONCE(sqe->attr_type_mask);
if (attr_type_mask) {
u64 attr_ptr;
@@ -307,27 +299,52 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
return ret;
}
+static int io_rw_do_import(struct io_kiocb *req, int ddir)
+{
+ if (!io_do_buffer_select(req)) {
+ struct io_async_rw *io = req->async_data;
+ int ret;
+
+ ret = io_import_rw_buffer(ddir, req, io, 0);
+ if (unlikely(ret))
+ return ret;
+ }
+
+ return 0;
+}
+
+static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
+ int ddir)
+{
+ int ret;
+
+ ret = __io_prep_rw(req, sqe, ddir);
+ if (unlikely(ret))
+ return ret;
+
+ return io_rw_do_import(req, ITER_DEST);
+}
+
int io_prep_read(struct io_kiocb *req, const struct io_uring_sqe *sqe)
{
- return io_prep_rw(req, sqe, ITER_DEST, true);
+ return io_prep_rw(req, sqe, ITER_DEST);
}
int io_prep_write(struct io_kiocb *req, const struct io_uring_sqe *sqe)
{
- return io_prep_rw(req, sqe, ITER_SOURCE, true);
+ return io_prep_rw(req, sqe, ITER_SOURCE);
}
static int io_prep_rwv(struct io_kiocb *req, const struct io_uring_sqe *sqe,
int ddir)
{
- const bool do_import = !(req->flags & REQ_F_BUFFER_SELECT);
int ret;
- ret = io_prep_rw(req, sqe, ddir, do_import);
+ ret = io_prep_rw(req, sqe, ddir);
if (unlikely(ret))
return ret;
- if (do_import)
- return 0;
+ if (!(req->flags & REQ_F_BUFFER_SELECT))
+ return io_rw_do_import(req, ddir);
/*
* Have to do this validation here, as this is in io_read() rw->len
@@ -364,12 +381,12 @@ static int io_init_rw_fixed(struct io_kiocb *req, unsigned int issue_flags,
int io_prep_read_fixed(struct io_kiocb *req, const struct io_uring_sqe *sqe)
{
- return io_prep_rw(req, sqe, ITER_DEST, false);
+ return io_prep_rw(req, sqe, ITER_DEST);
}
int io_prep_write_fixed(struct io_kiocb *req, const struct io_uring_sqe *sqe)
{
- return io_prep_rw(req, sqe, ITER_SOURCE, false);
+ return io_prep_rw(req, sqe, ITER_SOURCE);
}
/*
@@ -385,7 +402,7 @@ int io_read_mshot_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
if (!(req->flags & REQ_F_BUFFER_SELECT))
return -EINVAL;
- ret = io_prep_rw(req, sqe, ITER_DEST, false);
+ ret = io_prep_rw(req, sqe, ITER_DEST);
if (unlikely(ret))
return ret;
--
Jens Axboe
^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCHv7 6/6] io_uring: cache nodes and mapped buffers
2025-02-26 18:21 ` [PATCHv7 6/6] io_uring: cache nodes and mapped buffers Keith Busch
2025-02-26 18:48 ` Jens Axboe
@ 2025-02-26 19:36 ` Jens Axboe
2025-02-26 19:43 ` Jens Axboe
1 sibling, 1 reply; 23+ messages in thread
From: Jens Axboe @ 2025-02-26 19:36 UTC (permalink / raw)
To: Keith Busch, ming.lei, asml.silence, linux-block, io-uring
Cc: bernd, csander, linux-nvme, Keith Busch
On 2/26/25 11:21 AM, Keith Busch wrote:
> + return 0;
> +free_cache:
> + io_alloc_cache_free(&table->node_cache, kfree);
kvfree?
> @@ -548,11 +607,19 @@ int io_sqe_files_register(struct io_ring_ctx *ctx, void __user *arg,
> return ret;
> }
>
> +static void io_rsrc_buffer_free(struct io_ring_ctx *ctx,
> + struct io_buf_table *table)
> +{
> + io_rsrc_data_free(ctx, &table->data);
> + io_alloc_cache_free(&table->node_cache, kfree);
> + io_alloc_cache_free(&table->imu_cache, kfree);
> +}
Ditto
--
Jens Axboe
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCHv7 6/6] io_uring: cache nodes and mapped buffers
2025-02-26 19:36 ` Jens Axboe
@ 2025-02-26 19:43 ` Jens Axboe
2025-02-26 19:58 ` Jens Axboe
0 siblings, 1 reply; 23+ messages in thread
From: Jens Axboe @ 2025-02-26 19:43 UTC (permalink / raw)
To: Keith Busch, ming.lei, asml.silence, linux-block, io-uring
Cc: bernd, csander, linux-nvme, Keith Busch
Ignore this one, that was bogus. Trying to understand a crash in here.
--
Jens Axboe
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCHv7 6/6] io_uring: cache nodes and mapped buffers
2025-02-26 19:43 ` Jens Axboe
@ 2025-02-26 19:58 ` Jens Axboe
2025-02-26 20:00 ` Keith Busch
0 siblings, 1 reply; 23+ messages in thread
From: Jens Axboe @ 2025-02-26 19:58 UTC (permalink / raw)
To: Keith Busch, ming.lei, asml.silence, linux-block, io-uring
Cc: bernd, csander, linux-nvme, Keith Busch
On 2/26/25 12:43 PM, Jens Axboe wrote:
> Ignore this one, that was bogus. Trying to understand a crash in here.
Just to wrap this up, it's the combining of table and cache that's the
issue. The table can get torn down and freed while nodes are still
alive - which is fine, but not with this change, as there's then no
way to sanely put the cached entry.
--
Jens Axboe
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCHv7 6/6] io_uring: cache nodes and mapped buffers
2025-02-26 19:58 ` Jens Axboe
@ 2025-02-26 20:00 ` Keith Busch
0 siblings, 0 replies; 23+ messages in thread
From: Keith Busch @ 2025-02-26 20:00 UTC (permalink / raw)
To: Jens Axboe
Cc: Keith Busch, ming.lei, asml.silence, linux-block, io-uring, bernd,
csander, linux-nvme
On Wed, Feb 26, 2025 at 12:58:21PM -0700, Jens Axboe wrote:
> On 2/26/25 12:43 PM, Jens Axboe wrote:
> > Ignore this one, that was bogus. Trying to understand a crash in here.
>
> Just to wrap this up, it's the combining of table and cache that's the
> issue. The table can get torn down and freed while nodes are still
> alive - which is fine, but not with this change, as there's then no
> way to sanely put the cached entry.
Yes, the imu can outlive the table it came from, which means Pavel was
right to suggest I not tied the cache to the table. Thanks for pointing
this out.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCHv7 1/6] io_uring/rw: move fixed buffer import to issue path
2025-02-26 19:04 ` Jens Axboe
@ 2025-02-26 20:20 ` Jens Axboe
2025-02-27 21:46 ` Keith Busch
0 siblings, 1 reply; 23+ messages in thread
From: Jens Axboe @ 2025-02-26 20:20 UTC (permalink / raw)
To: Keith Busch, ming.lei, asml.silence, linux-block, io-uring
Cc: bernd, csander, linux-nvme, Keith Busch
On 2/26/25 12:04 PM, Jens Axboe wrote:
> On 2/26/25 11:20 AM, Keith Busch wrote:
>> From: Keith Busch <kbusch@kernel.org>
>>
>> Registered buffers may depend on a linked command, which makes the prep
>> path too early to import. Move to the issue path when the node is
>> actually needed like all the other users of fixed buffers.
>
> Conceptually I think this patch is fine, but it does bother me with
> random bool arguments. We could fold in something like the (totally
> tested) below diff to get rid of that. What do you think?
>
> +static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
> + int ddir)
> +{
> + int ret;
> +
> + ret = __io_prep_rw(req, sqe, ddir);
> + if (unlikely(ret))
> + return ret;
> +
> + return io_rw_do_import(req, ITER_DEST);
Oops, should be 'ddir' here too of course. Updated below, does pass my
testing fwiw.
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 728d695d2552..4ac2d004b352 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -248,8 +248,8 @@ static int io_prep_rw_pi(struct io_kiocb *req, struct io_rw *rw, int ddir,
return ret;
}
-static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
- int ddir, bool do_import)
+static int __io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
+ int ddir)
{
struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
unsigned ioprio;
@@ -285,14 +285,6 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
rw->len = READ_ONCE(sqe->len);
rw->flags = READ_ONCE(sqe->rw_flags);
- if (do_import && !io_do_buffer_select(req)) {
- struct io_async_rw *io = req->async_data;
-
- ret = io_import_rw_buffer(ddir, req, io, 0);
- if (unlikely(ret))
- return ret;
- }
-
attr_type_mask = READ_ONCE(sqe->attr_type_mask);
if (attr_type_mask) {
u64 attr_ptr;
@@ -307,27 +299,52 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
return ret;
}
+static int io_rw_do_import(struct io_kiocb *req, int ddir)
+{
+ if (!io_do_buffer_select(req)) {
+ struct io_async_rw *io = req->async_data;
+ int ret;
+
+ ret = io_import_rw_buffer(ddir, req, io, 0);
+ if (unlikely(ret))
+ return ret;
+ }
+
+ return 0;
+}
+
+static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
+ int ddir)
+{
+ int ret;
+
+ ret = __io_prep_rw(req, sqe, ddir);
+ if (unlikely(ret))
+ return ret;
+
+ return io_rw_do_import(req, ddir);
+}
+
int io_prep_read(struct io_kiocb *req, const struct io_uring_sqe *sqe)
{
- return io_prep_rw(req, sqe, ITER_DEST, true);
+ return io_prep_rw(req, sqe, ITER_DEST);
}
int io_prep_write(struct io_kiocb *req, const struct io_uring_sqe *sqe)
{
- return io_prep_rw(req, sqe, ITER_SOURCE, true);
+ return io_prep_rw(req, sqe, ITER_SOURCE);
}
static int io_prep_rwv(struct io_kiocb *req, const struct io_uring_sqe *sqe,
int ddir)
{
- const bool do_import = !(req->flags & REQ_F_BUFFER_SELECT);
int ret;
- ret = io_prep_rw(req, sqe, ddir, do_import);
+ ret = io_prep_rw(req, sqe, ddir);
if (unlikely(ret))
return ret;
- if (do_import)
- return 0;
+ if (!(req->flags & REQ_F_BUFFER_SELECT))
+ return io_rw_do_import(req, ddir);
/*
* Have to do this validation here, as this is in io_read() rw->len
@@ -364,12 +381,12 @@ static int io_init_rw_fixed(struct io_kiocb *req, unsigned int issue_flags,
int io_prep_read_fixed(struct io_kiocb *req, const struct io_uring_sqe *sqe)
{
- return io_prep_rw(req, sqe, ITER_DEST, false);
+ return io_prep_rw(req, sqe, ITER_DEST);
}
int io_prep_write_fixed(struct io_kiocb *req, const struct io_uring_sqe *sqe)
{
- return io_prep_rw(req, sqe, ITER_SOURCE, false);
+ return io_prep_rw(req, sqe, ITER_SOURCE);
}
/*
@@ -385,7 +402,7 @@ int io_read_mshot_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
if (!(req->flags & REQ_F_BUFFER_SELECT))
return -EINVAL;
- ret = io_prep_rw(req, sqe, ITER_DEST, false);
+ ret = io_prep_rw(req, sqe, ITER_DEST);
if (unlikely(ret))
return ret;
--
Jens Axboe
^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCHv7 2/6] io_uring: add support for kernel registered bvecs
2025-02-26 18:20 ` [PATCHv7 2/6] io_uring: add support for kernel registered bvecs Keith Busch
@ 2025-02-26 20:36 ` Jens Axboe
2025-02-27 15:54 ` Pavel Begunkov
1 sibling, 0 replies; 23+ messages in thread
From: Jens Axboe @ 2025-02-26 20:36 UTC (permalink / raw)
To: Keith Busch, ming.lei, asml.silence, linux-block, io-uring
Cc: bernd, csander, linux-nvme, Keith Busch
On 2/26/25 11:20 AM, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
>
> Provide an interface for the kernel to leverage the existing
> pre-registered buffers that io_uring provides. User space can reference
> these later to achieve zero-copy IO.
>
> User space must register an empty fixed buffer table with io_uring in
> order for the kernel to make use of it.
Just a suggestion, but might make sense to not use ->release() as
a gating whether this is a kernel buffer or not, there's room in the
struct anyway with the 'u8 perm' having holes anyway. And if we did
that, then we could just have a default release that does the unpin
and put rather than needing to check and have branches for the two
types of release. Yes the indirect function call isn't free either,
like the branches aren't, but I don't think it matters on the release
side. At least not enough to care, and it'd help streamline the code
a bit and not overload ->release() with meaning "oh this is a kernel
buffer".
--
Jens Axboe
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCHv7 4/6] ublk: zc register/unregister bvec
2025-02-26 18:20 ` [PATCHv7 4/6] ublk: zc register/unregister bvec Keith Busch
2025-02-26 18:40 ` Jens Axboe
@ 2025-02-27 15:08 ` kernel test robot
2025-02-27 15:42 ` kernel test robot
2 siblings, 0 replies; 23+ messages in thread
From: kernel test robot @ 2025-02-27 15:08 UTC (permalink / raw)
To: Keith Busch; +Cc: llvm, oe-kbuild-all
Hi Keith,
kernel test robot noticed the following build warnings:
[auto build test WARNING on next-20250226]
[cannot apply to axboe-block/for-next linux-nvme/for-next linus/master v6.14-rc4 v6.14-rc3 v6.14-rc2 v6.14-rc4]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Keith-Busch/io_uring-rw-move-fixed-buffer-import-to-issue-path/20250227-022853
base: next-20250226
patch link: https://lore.kernel.org/r/20250226182102.2631321-5-kbusch%40meta.com
patch subject: [PATCHv7 4/6] ublk: zc register/unregister bvec
config: hexagon-allyesconfig (https://download.01.org/0day-ci/archive/20250227/202502272211.0KJ6ijWp-lkp@intel.com/config)
compiler: clang version 18.1.8 (https://github.com/llvm/llvm-project 3b5b5c1ec4a3095ab096dd780e84d7ab81f3d7ff)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250227/202502272211.0KJ6ijWp-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202502272211.0KJ6ijWp-lkp@intel.com/
All warnings (new ones prefixed by >>):
>> drivers/block/ublk_drv.c:2510:18: warning: '&' within '|' [-Wbitwise-op-parentheses]
2510 | if (info.flags & UBLK_F_USER_COPY | UBLK_F_SUPPORT_ZERO_COPY)
| ~~~~~~~~~~~^~~~~~~~~~~~~~~~~~ ~
drivers/block/ublk_drv.c:2510:18: note: place parentheses around the '&' expression to silence this warning
2510 | if (info.flags & UBLK_F_USER_COPY | UBLK_F_SUPPORT_ZERO_COPY)
| ~~~~~~~~~~~^~~~~~~~~~~~~~~~~~
1 warning generated.
vim +2510 drivers/block/ublk_drv.c
2455
2456 static int ublk_ctrl_add_dev(struct io_uring_cmd *cmd)
2457 {
2458 const struct ublksrv_ctrl_cmd *header = io_uring_sqe_cmd(cmd->sqe);
2459 void __user *argp = (void __user *)(unsigned long)header->addr;
2460 struct ublksrv_ctrl_dev_info info;
2461 struct ublk_device *ub;
2462 int ret = -EINVAL;
2463
2464 if (header->len < sizeof(info) || !header->addr)
2465 return -EINVAL;
2466 if (header->queue_id != (u16)-1) {
2467 pr_warn("%s: queue_id is wrong %x\n",
2468 __func__, header->queue_id);
2469 return -EINVAL;
2470 }
2471
2472 if (copy_from_user(&info, argp, sizeof(info)))
2473 return -EFAULT;
2474
2475 if (capable(CAP_SYS_ADMIN))
2476 info.flags &= ~UBLK_F_UNPRIVILEGED_DEV;
2477 else if (!(info.flags & UBLK_F_UNPRIVILEGED_DEV))
2478 return -EPERM;
2479
2480 /* forbid nonsense combinations of recovery flags */
2481 switch (info.flags & UBLK_F_ALL_RECOVERY_FLAGS) {
2482 case 0:
2483 case UBLK_F_USER_RECOVERY:
2484 case (UBLK_F_USER_RECOVERY | UBLK_F_USER_RECOVERY_REISSUE):
2485 case (UBLK_F_USER_RECOVERY | UBLK_F_USER_RECOVERY_FAIL_IO):
2486 break;
2487 default:
2488 pr_warn("%s: invalid recovery flags %llx\n", __func__,
2489 info.flags & UBLK_F_ALL_RECOVERY_FLAGS);
2490 return -EINVAL;
2491 }
2492
2493 /*
2494 * unprivileged device can't be trusted, but RECOVERY and
2495 * RECOVERY_REISSUE still may hang error handling, so can't
2496 * support recovery features for unprivileged ublk now
2497 *
2498 * TODO: provide forward progress for RECOVERY handler, so that
2499 * unprivileged device can benefit from it
2500 */
2501 if (info.flags & UBLK_F_UNPRIVILEGED_DEV) {
2502 info.flags &= ~(UBLK_F_USER_RECOVERY_REISSUE |
2503 UBLK_F_USER_RECOVERY);
2504
2505 /*
2506 * For USER_COPY, we depends on userspace to fill request
2507 * buffer by pwrite() to ublk char device, which can't be
2508 * used for unprivileged device
2509 */
> 2510 if (info.flags & UBLK_F_USER_COPY | UBLK_F_SUPPORT_ZERO_COPY)
2511 return -EINVAL;
2512 }
2513
2514 /* the created device is always owned by current user */
2515 ublk_store_owner_uid_gid(&info.owner_uid, &info.owner_gid);
2516
2517 if (header->dev_id != info.dev_id) {
2518 pr_warn("%s: dev id not match %u %u\n",
2519 __func__, header->dev_id, info.dev_id);
2520 return -EINVAL;
2521 }
2522
2523 if (header->dev_id != U32_MAX && header->dev_id >= UBLK_MAX_UBLKS) {
2524 pr_warn("%s: dev id is too large. Max supported is %d\n",
2525 __func__, UBLK_MAX_UBLKS - 1);
2526 return -EINVAL;
2527 }
2528
2529 ublk_dump_dev_info(&info);
2530
2531 ret = mutex_lock_killable(&ublk_ctl_mutex);
2532 if (ret)
2533 return ret;
2534
2535 ret = -EACCES;
2536 if (ublks_added >= ublks_max)
2537 goto out_unlock;
2538
2539 ret = -ENOMEM;
2540 ub = kzalloc(sizeof(*ub), GFP_KERNEL);
2541 if (!ub)
2542 goto out_unlock;
2543 mutex_init(&ub->mutex);
2544 spin_lock_init(&ub->lock);
2545 INIT_WORK(&ub->nosrv_work, ublk_nosrv_work);
2546
2547 ret = ublk_alloc_dev_number(ub, header->dev_id);
2548 if (ret < 0)
2549 goto out_free_ub;
2550
2551 memcpy(&ub->dev_info, &info, sizeof(info));
2552
2553 /* update device id */
2554 ub->dev_info.dev_id = ub->ub_number;
2555
2556 /*
2557 * 64bit flags will be copied back to userspace as feature
2558 * negotiation result, so have to clear flags which driver
2559 * doesn't support yet, then userspace can get correct flags
2560 * (features) to handle.
2561 */
2562 ub->dev_info.flags &= UBLK_F_ALL;
2563
2564 ub->dev_info.flags |= UBLK_F_CMD_IOCTL_ENCODE |
2565 UBLK_F_URING_CMD_COMP_IN_TASK;
2566
2567 /* GET_DATA isn't needed any more with USER_COPY */
2568 if (ublk_dev_is_user_copy(ub))
2569 ub->dev_info.flags &= ~UBLK_F_NEED_GET_DATA;
2570
2571 /* Zoned storage support requires user copy feature */
2572 if (ublk_dev_is_zoned(ub) &&
2573 (!IS_ENABLED(CONFIG_BLK_DEV_ZONED) || !ublk_dev_is_user_copy(ub))) {
2574 ret = -EINVAL;
2575 goto out_free_dev_number;
2576 }
2577
2578 ub->dev_info.nr_hw_queues = min_t(unsigned int,
2579 ub->dev_info.nr_hw_queues, nr_cpu_ids);
2580 ublk_align_max_io_size(ub);
2581
2582 ret = ublk_init_queues(ub);
2583 if (ret)
2584 goto out_free_dev_number;
2585
2586 ret = ublk_add_tag_set(ub);
2587 if (ret)
2588 goto out_deinit_queues;
2589
2590 ret = -EFAULT;
2591 if (copy_to_user(argp, &ub->dev_info, sizeof(info)))
2592 goto out_free_tag_set;
2593
2594 /*
2595 * Add the char dev so that ublksrv daemon can be setup.
2596 * ublk_add_chdev() will cleanup everything if it fails.
2597 */
2598 ret = ublk_add_chdev(ub);
2599 goto out_unlock;
2600
2601 out_free_tag_set:
2602 blk_mq_free_tag_set(&ub->tag_set);
2603 out_deinit_queues:
2604 ublk_deinit_queues(ub);
2605 out_free_dev_number:
2606 ublk_free_dev_number(ub);
2607 out_free_ub:
2608 mutex_destroy(&ub->mutex);
2609 kfree(ub);
2610 out_unlock:
2611 mutex_unlock(&ublk_ctl_mutex);
2612 return ret;
2613 }
2614
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCHv7 4/6] ublk: zc register/unregister bvec
2025-02-26 18:20 ` [PATCHv7 4/6] ublk: zc register/unregister bvec Keith Busch
2025-02-26 18:40 ` Jens Axboe
2025-02-27 15:08 ` kernel test robot
@ 2025-02-27 15:42 ` kernel test robot
2 siblings, 0 replies; 23+ messages in thread
From: kernel test robot @ 2025-02-27 15:42 UTC (permalink / raw)
To: Keith Busch; +Cc: oe-kbuild-all
Hi Keith,
kernel test robot noticed the following build warnings:
[auto build test WARNING on next-20250226]
[cannot apply to axboe-block/for-next linux-nvme/for-next linus/master v6.14-rc4 v6.14-rc3 v6.14-rc2 v6.14-rc4]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Keith-Busch/io_uring-rw-move-fixed-buffer-import-to-issue-path/20250227-022853
base: next-20250226
patch link: https://lore.kernel.org/r/20250226182102.2631321-5-kbusch%40meta.com
patch subject: [PATCHv7 4/6] ublk: zc register/unregister bvec
config: i386-randconfig-014-20250227 (https://download.01.org/0day-ci/archive/20250227/202502272337.CubkA5JN-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250227/202502272337.CubkA5JN-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202502272337.CubkA5JN-lkp@intel.com/
All warnings (new ones prefixed by >>):
drivers/block/ublk_drv.c: In function 'ublk_ctrl_add_dev':
>> drivers/block/ublk_drv.c:2510:32: warning: suggest parentheses around arithmetic in operand of '|' [-Wparentheses]
2510 | if (info.flags & UBLK_F_USER_COPY | UBLK_F_SUPPORT_ZERO_COPY)
Kconfig warnings: (for reference only)
WARNING: unmet direct dependencies detected for FB_IOMEM_HELPERS
Depends on [n]: HAS_IOMEM [=y] && FB_CORE [=n]
Selected by [m]:
- DRM_XE_DISPLAY [=y] && HAS_IOMEM [=y] && DRM [=y] && DRM_XE [=m] && DRM_XE [=m]=m [=m] && HAS_IOPORT [=y]
vim +2510 drivers/block/ublk_drv.c
2455
2456 static int ublk_ctrl_add_dev(struct io_uring_cmd *cmd)
2457 {
2458 const struct ublksrv_ctrl_cmd *header = io_uring_sqe_cmd(cmd->sqe);
2459 void __user *argp = (void __user *)(unsigned long)header->addr;
2460 struct ublksrv_ctrl_dev_info info;
2461 struct ublk_device *ub;
2462 int ret = -EINVAL;
2463
2464 if (header->len < sizeof(info) || !header->addr)
2465 return -EINVAL;
2466 if (header->queue_id != (u16)-1) {
2467 pr_warn("%s: queue_id is wrong %x\n",
2468 __func__, header->queue_id);
2469 return -EINVAL;
2470 }
2471
2472 if (copy_from_user(&info, argp, sizeof(info)))
2473 return -EFAULT;
2474
2475 if (capable(CAP_SYS_ADMIN))
2476 info.flags &= ~UBLK_F_UNPRIVILEGED_DEV;
2477 else if (!(info.flags & UBLK_F_UNPRIVILEGED_DEV))
2478 return -EPERM;
2479
2480 /* forbid nonsense combinations of recovery flags */
2481 switch (info.flags & UBLK_F_ALL_RECOVERY_FLAGS) {
2482 case 0:
2483 case UBLK_F_USER_RECOVERY:
2484 case (UBLK_F_USER_RECOVERY | UBLK_F_USER_RECOVERY_REISSUE):
2485 case (UBLK_F_USER_RECOVERY | UBLK_F_USER_RECOVERY_FAIL_IO):
2486 break;
2487 default:
2488 pr_warn("%s: invalid recovery flags %llx\n", __func__,
2489 info.flags & UBLK_F_ALL_RECOVERY_FLAGS);
2490 return -EINVAL;
2491 }
2492
2493 /*
2494 * unprivileged device can't be trusted, but RECOVERY and
2495 * RECOVERY_REISSUE still may hang error handling, so can't
2496 * support recovery features for unprivileged ublk now
2497 *
2498 * TODO: provide forward progress for RECOVERY handler, so that
2499 * unprivileged device can benefit from it
2500 */
2501 if (info.flags & UBLK_F_UNPRIVILEGED_DEV) {
2502 info.flags &= ~(UBLK_F_USER_RECOVERY_REISSUE |
2503 UBLK_F_USER_RECOVERY);
2504
2505 /*
2506 * For USER_COPY, we depends on userspace to fill request
2507 * buffer by pwrite() to ublk char device, which can't be
2508 * used for unprivileged device
2509 */
> 2510 if (info.flags & UBLK_F_USER_COPY | UBLK_F_SUPPORT_ZERO_COPY)
2511 return -EINVAL;
2512 }
2513
2514 /* the created device is always owned by current user */
2515 ublk_store_owner_uid_gid(&info.owner_uid, &info.owner_gid);
2516
2517 if (header->dev_id != info.dev_id) {
2518 pr_warn("%s: dev id not match %u %u\n",
2519 __func__, header->dev_id, info.dev_id);
2520 return -EINVAL;
2521 }
2522
2523 if (header->dev_id != U32_MAX && header->dev_id >= UBLK_MAX_UBLKS) {
2524 pr_warn("%s: dev id is too large. Max supported is %d\n",
2525 __func__, UBLK_MAX_UBLKS - 1);
2526 return -EINVAL;
2527 }
2528
2529 ublk_dump_dev_info(&info);
2530
2531 ret = mutex_lock_killable(&ublk_ctl_mutex);
2532 if (ret)
2533 return ret;
2534
2535 ret = -EACCES;
2536 if (ublks_added >= ublks_max)
2537 goto out_unlock;
2538
2539 ret = -ENOMEM;
2540 ub = kzalloc(sizeof(*ub), GFP_KERNEL);
2541 if (!ub)
2542 goto out_unlock;
2543 mutex_init(&ub->mutex);
2544 spin_lock_init(&ub->lock);
2545 INIT_WORK(&ub->nosrv_work, ublk_nosrv_work);
2546
2547 ret = ublk_alloc_dev_number(ub, header->dev_id);
2548 if (ret < 0)
2549 goto out_free_ub;
2550
2551 memcpy(&ub->dev_info, &info, sizeof(info));
2552
2553 /* update device id */
2554 ub->dev_info.dev_id = ub->ub_number;
2555
2556 /*
2557 * 64bit flags will be copied back to userspace as feature
2558 * negotiation result, so have to clear flags which driver
2559 * doesn't support yet, then userspace can get correct flags
2560 * (features) to handle.
2561 */
2562 ub->dev_info.flags &= UBLK_F_ALL;
2563
2564 ub->dev_info.flags |= UBLK_F_CMD_IOCTL_ENCODE |
2565 UBLK_F_URING_CMD_COMP_IN_TASK;
2566
2567 /* GET_DATA isn't needed any more with USER_COPY */
2568 if (ublk_dev_is_user_copy(ub))
2569 ub->dev_info.flags &= ~UBLK_F_NEED_GET_DATA;
2570
2571 /* Zoned storage support requires user copy feature */
2572 if (ublk_dev_is_zoned(ub) &&
2573 (!IS_ENABLED(CONFIG_BLK_DEV_ZONED) || !ublk_dev_is_user_copy(ub))) {
2574 ret = -EINVAL;
2575 goto out_free_dev_number;
2576 }
2577
2578 ub->dev_info.nr_hw_queues = min_t(unsigned int,
2579 ub->dev_info.nr_hw_queues, nr_cpu_ids);
2580 ublk_align_max_io_size(ub);
2581
2582 ret = ublk_init_queues(ub);
2583 if (ret)
2584 goto out_free_dev_number;
2585
2586 ret = ublk_add_tag_set(ub);
2587 if (ret)
2588 goto out_deinit_queues;
2589
2590 ret = -EFAULT;
2591 if (copy_to_user(argp, &ub->dev_info, sizeof(info)))
2592 goto out_free_tag_set;
2593
2594 /*
2595 * Add the char dev so that ublksrv daemon can be setup.
2596 * ublk_add_chdev() will cleanup everything if it fails.
2597 */
2598 ret = ublk_add_chdev(ub);
2599 goto out_unlock;
2600
2601 out_free_tag_set:
2602 blk_mq_free_tag_set(&ub->tag_set);
2603 out_deinit_queues:
2604 ublk_deinit_queues(ub);
2605 out_free_dev_number:
2606 ublk_free_dev_number(ub);
2607 out_free_ub:
2608 mutex_destroy(&ub->mutex);
2609 kfree(ub);
2610 out_unlock:
2611 mutex_unlock(&ublk_ctl_mutex);
2612 return ret;
2613 }
2614
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCHv7 2/6] io_uring: add support for kernel registered bvecs
2025-02-26 18:20 ` [PATCHv7 2/6] io_uring: add support for kernel registered bvecs Keith Busch
2025-02-26 20:36 ` Jens Axboe
@ 2025-02-27 15:54 ` Pavel Begunkov
2025-02-27 16:04 ` Jens Axboe
2025-02-27 22:45 ` Keith Busch
1 sibling, 2 replies; 23+ messages in thread
From: Pavel Begunkov @ 2025-02-27 15:54 UTC (permalink / raw)
To: Keith Busch, ming.lei, axboe, linux-block, io-uring
Cc: bernd, csander, linux-nvme, Keith Busch
On 2/26/25 18:20, Keith Busch wrote:
> From: Keith Busch <kbusch@kernel.org>
>
> Provide an interface for the kernel to leverage the existing
> pre-registered buffers that io_uring provides. User space can reference
> these later to achieve zero-copy IO.
>
> User space must register an empty fixed buffer table with io_uring in
> order for the kernel to make use of it.
Can you also fail rw.c:loop_rw_iter()? Something like:
loop_rw_iter() {
if ((req->flags & REQ_F_BUF_NODE) &&
req->buf_node->buf->release)
return -EFAULT;
}
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCHv7 2/6] io_uring: add support for kernel registered bvecs
2025-02-27 15:54 ` Pavel Begunkov
@ 2025-02-27 16:04 ` Jens Axboe
2025-02-27 22:45 ` Keith Busch
1 sibling, 0 replies; 23+ messages in thread
From: Jens Axboe @ 2025-02-27 16:04 UTC (permalink / raw)
To: Pavel Begunkov, Keith Busch, ming.lei, linux-block, io-uring
Cc: bernd, csander, linux-nvme, Keith Busch
On 2/27/25 8:54 AM, Pavel Begunkov wrote:
> On 2/26/25 18:20, Keith Busch wrote:
>> From: Keith Busch <kbusch@kernel.org>
>>
>> Provide an interface for the kernel to leverage the existing
>> pre-registered buffers that io_uring provides. User space can reference
>> these later to achieve zero-copy IO.
>>
>> User space must register an empty fixed buffer table with io_uring in
>> order for the kernel to make use of it.
>
> Can you also fail rw.c:loop_rw_iter()? Something like:
>
> loop_rw_iter() {
> if ((req->flags & REQ_F_BUF_NODE) &&
> req->buf_node->buf->release)
> return -EFAULT;
> }
Good point...
--
Jens Axboe
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCHv7 1/6] io_uring/rw: move fixed buffer import to issue path
2025-02-26 20:20 ` Jens Axboe
@ 2025-02-27 21:46 ` Keith Busch
0 siblings, 0 replies; 23+ messages in thread
From: Keith Busch @ 2025-02-27 21:46 UTC (permalink / raw)
To: Jens Axboe
Cc: Keith Busch, ming.lei, asml.silence, linux-block, io-uring, bernd,
csander, linux-nvme
On Wed, Feb 26, 2025 at 01:20:24PM -0700, Jens Axboe wrote:
> +static int io_rw_do_import(struct io_kiocb *req, int ddir)
> +{
> + if (!io_do_buffer_select(req)) {
> + struct io_async_rw *io = req->async_data;
> + int ret;
> +
> + ret = io_import_rw_buffer(ddir, req, io, 0);
> + if (unlikely(ret))
> + return ret;
> + }
> +
> + return 0;
> +}
> +
> +static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
> + int ddir)
> +{
> + int ret;
> +
> + ret = __io_prep_rw(req, sqe, ddir);
> + if (unlikely(ret))
> + return ret;
> +
> + return io_rw_do_import(req, ddir);
> +}
...
> static int io_prep_rwv(struct io_kiocb *req, const struct io_uring_sqe *sqe,
> int ddir)
> {
> - const bool do_import = !(req->flags & REQ_F_BUFFER_SELECT);
> int ret;
>
> - ret = io_prep_rw(req, sqe, ddir, do_import);
> + ret = io_prep_rw(req, sqe, ddir);
> if (unlikely(ret))
> return ret;
> - if (do_import)
> - return 0;
> + if (!(req->flags & REQ_F_BUFFER_SELECT))
> + return io_rw_do_import(req, ddir);
Not sure if I'm missing something subtle here, but wanted to point out
I've changed the above in the version I'm about to send: io_prep_rw()
already calls io_rw_do_import(), so we can just return 0 here.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCHv7 2/6] io_uring: add support for kernel registered bvecs
2025-02-27 15:54 ` Pavel Begunkov
2025-02-27 16:04 ` Jens Axboe
@ 2025-02-27 22:45 ` Keith Busch
2025-02-28 12:19 ` Pavel Begunkov
1 sibling, 1 reply; 23+ messages in thread
From: Keith Busch @ 2025-02-27 22:45 UTC (permalink / raw)
To: Pavel Begunkov
Cc: Keith Busch, ming.lei, axboe, linux-block, io-uring, bernd,
csander, linux-nvme
On Thu, Feb 27, 2025 at 03:54:31PM +0000, Pavel Begunkov wrote:
> On 2/26/25 18:20, Keith Busch wrote:
> > From: Keith Busch <kbusch@kernel.org>
> >
> > Provide an interface for the kernel to leverage the existing
> > pre-registered buffers that io_uring provides. User space can reference
> > these later to achieve zero-copy IO.
> >
> > User space must register an empty fixed buffer table with io_uring in
> > order for the kernel to make use of it.
>
> Can you also fail rw.c:loop_rw_iter()? Something like:
>
> loop_rw_iter() {
> if ((req->flags & REQ_F_BUF_NODE) &&
> req->buf_node->buf->release)
> return -EFAULT;
> }
For posterity: the suggestion is because this function uses the
file_operations' .read/.write callbacks, which expect __user pointers.
Playing devil's advocate here, I don't see how user space might know
ahead of time if the file they opened implements the supported _iter
versions. I think only esoteric and legacy interfaces still use it, so
maybe we don't care.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCHv7 2/6] io_uring: add support for kernel registered bvecs
2025-02-27 22:45 ` Keith Busch
@ 2025-02-28 12:19 ` Pavel Begunkov
0 siblings, 0 replies; 23+ messages in thread
From: Pavel Begunkov @ 2025-02-28 12:19 UTC (permalink / raw)
To: Keith Busch
Cc: Keith Busch, ming.lei, axboe, linux-block, io-uring, bernd,
csander, linux-nvme
On 2/27/25 22:45, Keith Busch wrote:
> On Thu, Feb 27, 2025 at 03:54:31PM +0000, Pavel Begunkov wrote:
>> On 2/26/25 18:20, Keith Busch wrote:
>>> From: Keith Busch <kbusch@kernel.org>
>>>
>>> Provide an interface for the kernel to leverage the existing
>>> pre-registered buffers that io_uring provides. User space can reference
>>> these later to achieve zero-copy IO.
>>>
>>> User space must register an empty fixed buffer table with io_uring in
>>> order for the kernel to make use of it.
>>
>> Can you also fail rw.c:loop_rw_iter()? Something like:
>>
>> loop_rw_iter() {
>> if ((req->flags & REQ_F_BUF_NODE) &&
>> req->buf_node->buf->release)
>> return -EFAULT;
>> }
>
> For posterity: the suggestion is because this function uses the
> file_operations' .read/.write callbacks, which expect __user pointers.
>
> Playing devil's advocate here, I don't see how user space might know
> ahead of time if the file they opened implements the supported _iter
> versions. I think only esoteric and legacy interfaces still use it, so
> maybe we don't care.
Sure, but it's not like we can do anything about it anyway, we're
not going temporarily mmap the buffer into userspace for that. Normal
registered buffers use user ptrs we stashed during registration,
but I don't think even that is the right thing to do.
Reminds me that Jens was trying to completely get rid of the ->read/write
callbacks in favour of *iter variants, maybe that will happen at some
point.
--
Pavel Begunkov
^ permalink raw reply [flat|nested] 23+ messages in thread
end of thread, other threads:[~2025-02-28 12:18 UTC | newest]
Thread overview: 23+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-02-26 18:20 [PATCHv7 0/6] ublk zero copy support Keith Busch
2025-02-26 18:20 ` [PATCHv7 1/6] io_uring/rw: move fixed buffer import to issue path Keith Busch
2025-02-26 19:04 ` Jens Axboe
2025-02-26 20:20 ` Jens Axboe
2025-02-27 21:46 ` Keith Busch
2025-02-26 18:20 ` [PATCHv7 2/6] io_uring: add support for kernel registered bvecs Keith Busch
2025-02-26 20:36 ` Jens Axboe
2025-02-27 15:54 ` Pavel Begunkov
2025-02-27 16:04 ` Jens Axboe
2025-02-27 22:45 ` Keith Busch
2025-02-28 12:19 ` Pavel Begunkov
2025-02-26 18:20 ` [PATCHv7 3/6] nvme: map uring_cmd data even if address is 0 Keith Busch
2025-02-26 18:20 ` [PATCHv7 4/6] ublk: zc register/unregister bvec Keith Busch
2025-02-26 18:40 ` Jens Axboe
2025-02-27 15:08 ` kernel test robot
2025-02-27 15:42 ` kernel test robot
2025-02-26 18:21 ` [PATCHv7 5/6] io_uring: add abstraction for buf_table rsrc data Keith Busch
2025-02-26 18:21 ` [PATCHv7 6/6] io_uring: cache nodes and mapped buffers Keith Busch
2025-02-26 18:48 ` Jens Axboe
2025-02-26 19:36 ` Jens Axboe
2025-02-26 19:43 ` Jens Axboe
2025-02-26 19:58 ` Jens Axboe
2025-02-26 20:00 ` Keith Busch
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.