* [PATCH V2 0/2] ublk: avoid ublk_io_release() called after ublk char dev is closed
@ 2025-08-25 12:48 Ming Lei
2025-08-25 12:48 ` [PATCH V2 1/2] " Ming Lei
2025-08-25 12:48 ` [PATCH V2 2/2] ublk selftests: add --no_ublk_fixed_fd for not using registered ublk char device Ming Lei
0 siblings, 2 replies; 9+ messages in thread
From: Ming Lei @ 2025-08-25 12:48 UTC (permalink / raw)
To: Jens Axboe, linux-block
Cc: Uday Shankar, Keith Busch, Caleb Sander Mateos, Ming Lei
Hi Jens,
The 1st patch fixes ublk_io_release(), and avoids the warning triggered by
the ublk selftest test_stress_04.sh.
The 2nd patch adds a test case for this issue.
V2:
- release the ublk char device asynchronously, to avoid a dependency on
io_uring_release(), where sqe buffers may finally be unregistered
Ming Lei (2):
ublk: avoid ublk_io_release() called after ublk char dev is closed
ublk selftests: add --no_ublk_fixed_fd for not using registered ublk
char device
drivers/block/ublk_drv.c | 94 ++++++++++++++++++-
tools/testing/selftests/ublk/file_backed.c | 10 +-
tools/testing/selftests/ublk/kublk.c | 38 ++++++--
tools/testing/selftests/ublk/kublk.h | 46 ++++++---
tools/testing/selftests/ublk/null.c | 4 +-
tools/testing/selftests/ublk/stripe.c | 4 +-
.../testing/selftests/ublk/test_stress_04.sh | 6 +-
7 files changed, 167 insertions(+), 35 deletions(-)
--
2.47.0
* [PATCH V2 1/2] ublk: avoid ublk_io_release() called after ublk char dev is closed
2025-08-25 12:48 [PATCH V2 0/2] ublk: avoid ublk_io_release() called after ublk char dev is closed Ming Lei
@ 2025-08-25 12:48 ` Ming Lei
2025-08-26 5:49 ` Caleb Sander Mateos
2025-08-25 12:48 ` [PATCH V2 2/2] ublk selftests: add --no_ublk_fixed_fd for not using registered ublk char device Ming Lei
1 sibling, 1 reply; 9+ messages in thread
From: Ming Lei @ 2025-08-25 12:48 UTC (permalink / raw)
To: Jens Axboe, linux-block
Cc: Uday Shankar, Keith Busch, Caleb Sander Mateos, Ming Lei
When running test_stress_04.sh, the following warning is triggered:
WARNING: CPU: 1 PID: 135 at drivers/block/ublk_drv.c:1933 ublk_ch_release+0x423/0x4b0 [ublk_drv]
This happens when the daemon is abruptly killed:
- some I/O references may still be held, because registering an IO buffer
doesn't grab a ublk char device reference
OR
- io->task_registered_buffers won't be cleared, because the io buffer is
released from a non-daemon context
For zero-copy and auto buffer register modes, the I/O reference crosses
syscalls, so it may not be dropped naturally when the ublk server is killed
abruptly. However, when the io_uring context is released, the reference is
guaranteed to be dropped eventually; see io_sqe_buffers_unregister() called
from io_ring_ctx_free().
Fix this by deferring the char device release to a work function
(ublk_ch_release_work_fn()) that:
- waits, by rescheduling itself, until the active I/O references have been
dropped, which avoids a release dependency between the ublk char device
and the io_uring file
- reinitializes io->ref and io->task_registered_buffers to a clean state
This ensures the reference count state is clean when ublk_queue_reinit()
is called, preventing the warning and a potential use-after-free.
Fixes: 1f6540e2aabb ("ublk: zc register/unregister bvec")
Fixes: 1ceeedb59749 ("ublk: optimize UBLK_IO_UNREGISTER_IO_BUF on daemon task")
Fixes: 8a8fe42d765b ("ublk: optimize UBLK_IO_REGISTER_IO_BUF on daemon task")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
drivers/block/ublk_drv.c | 94 +++++++++++++++++++++++++++++++++++++++-
1 file changed, 92 insertions(+), 2 deletions(-)
diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 99abd67b708b..f608c7efdc05 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -239,6 +239,7 @@ struct ublk_device {
struct mutex cancel_mutex;
bool canceling;
pid_t ublksrv_tgid;
+ struct delayed_work exit_work;
};
/* header of ublk_params */
@@ -1595,12 +1596,84 @@ static void ublk_set_canceling(struct ublk_device *ub, bool canceling)
ublk_get_queue(ub, i)->canceling = canceling;
}
-static int ublk_ch_release(struct inode *inode, struct file *filp)
+static void ublk_reset_io_ref(struct ublk_device *ub)
{
- struct ublk_device *ub = filp->private_data;
+ int i, j;
+
+ if (!(ub->dev_info.flags & (UBLK_F_SUPPORT_ZERO_COPY |
+ UBLK_F_AUTO_BUF_REG)))
+ return;
+
+ for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
+ struct ublk_queue *ubq = ublk_get_queue(ub, i);
+
+ for (j = 0; j < ubq->q_depth; j++) {
+ struct ublk_io *io = &ubq->ios[j];
+ /*
+ * Reinitialize reference counting fields after
+ * draining. This ensures clean state for queue
+ * reinitialization.
+ */
+ refcount_set(&io->ref, 0);
+ io->task_registered_buffers = 0;
+ }
+ }
+}
+
+static bool ublk_has_io_with_active_ref(struct ublk_device *ub)
+{
+ int i, j;
+
+ if (!(ub->dev_info.flags & (UBLK_F_SUPPORT_ZERO_COPY |
+ UBLK_F_AUTO_BUF_REG)))
+ return false;
+
+ for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
+ struct ublk_queue *ubq = ublk_get_queue(ub, i);
+
+ for (j = 0; j < ubq->q_depth; j++) {
+ struct ublk_io *io = &ubq->ios[j];
+ unsigned int refs = refcount_read(&io->ref) +
+ io->task_registered_buffers;
+
+ /*
+ * UBLK_REFCOUNT_INIT or zero means no active
+ * reference
+ */
+ if (refs != UBLK_REFCOUNT_INIT && refs != 0)
+ return true;
+ }
+ }
+ return false;
+}
+
+static void ublk_ch_release_work_fn(struct work_struct *work)
+{
+ struct ublk_device *ub =
+ container_of(work, struct ublk_device, exit_work.work);
struct gendisk *disk;
int i;
+ /*
+ * For zero-copy and auto buffer register modes, I/O references
+ * might not be dropped naturally when the daemon is killed, but
+ * io_uring guarantees that registered bvec kernel buffers are
+ * finally unregistered when the io_uring context is freed, and then
+ * the active references are dropped.
+ *
+ * Wait until the active references are dropped, to avoid use-after-free.
+ *
+ * A registered buffer may be unregistered in io_uring's release handler,
+ * so we have to wait by scheduling a work function, to avoid a release
+ * dependency between the two files.
+ */
+ if (ublk_has_io_with_active_ref(ub)) {
+ schedule_delayed_work(&ub->exit_work, 1);
+ return;
+ }
+
+ ublk_reset_io_ref(ub);
+
/*
* disk isn't attached yet, either device isn't live, or it has
* been removed already, so we needn't to do anything
@@ -1673,6 +1746,23 @@ static int ublk_ch_release(struct inode *inode, struct file *filp)
ublk_reset_ch_dev(ub);
out:
clear_bit(UB_STATE_OPEN, &ub->state);
+
+ /* put the reference grabbed in ublk_ch_release() */
+ ublk_put_device(ub);
+}
+
+static int ublk_ch_release(struct inode *inode, struct file *filp)
+{
+ struct ublk_device *ub = filp->private_data;
+
+ /*
+ * Grab a ublk device reference, so it won't go away until the
+ * release work function has actually run.
+ */
+ ub = ublk_get_device(ub);
+
+ INIT_DELAYED_WORK(&ub->exit_work, ublk_ch_release_work_fn);
+ schedule_delayed_work(&ub->exit_work, 0);
return 0;
}
--
2.47.0
* [PATCH V2 2/2] ublk selftests: add --no_ublk_fixed_fd for not using registered ublk char device
2025-08-25 12:48 [PATCH V2 0/2] ublk: avoid ublk_io_release() called after ublk char dev is closed Ming Lei
2025-08-25 12:48 ` [PATCH V2 1/2] " Ming Lei
@ 2025-08-25 12:48 ` Ming Lei
2025-08-26 4:53 ` Caleb Sander Mateos
1 sibling, 1 reply; 9+ messages in thread
From: Ming Lei @ 2025-08-25 12:48 UTC (permalink / raw)
To: Jens Axboe, linux-block
Cc: Uday Shankar, Keith Busch, Caleb Sander Mateos, Ming Lei
Add a new command line option --no_ublk_fixed_fd that excludes the ublk
char device (/dev/ublkcN) from io_uring's registered files array.
When this option is used, only the backing files are registered, starting
from index 1, while the ublk char device is accessed via its raw file
descriptor.
Add a ublk_get_registered_fd() helper that returns the appropriate file
descriptor (or fixed file index) for use with io_uring operations; it takes
a ublk_queue * parameter instead of ublk_thread * for better performance.
Key optimizations implemented:
- Cache UBLKS_Q_NO_UBLK_FIXED_FD flag in ublk_queue.flags to avoid
reading dev->no_ublk_fixed_fd in fast path
- Cache ublk char device fd in ublk_queue.ublk_fd for fast access
- Update ublk_get_registered_fd() to use ublk_queue * parameter
- Update io_uring_prep_buf_register/unregister() to use ublk_queue *
- Replace ublk_device * access with ublk_queue * access in fast paths
This improves performance by:
- Eliminating device structure traversal in hot paths
- Better cache locality with queue-local data access
- Reduced indirection for flag and fd lookups
The helper handles both modes:
- Normal mode: Returns registered file indices directly
- --no_ublk_fixed_fd mode: Returns raw FD for ublk device (index 0)
and adjusted indices for backing files (index 1 becomes 0, etc.)
Also pass --no_ublk_fixed_fd in test_stress_04.sh to cover the plain
(unregistered) ublk char device mode.
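For example, assuming the selftest binary is built as kublk, a null device
using the raw char device fd can be created with:
  kublk add -t null -q 4 -z --no_ublk_fixed_fd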
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
tools/testing/selftests/ublk/file_backed.c | 10 ++--
tools/testing/selftests/ublk/kublk.c | 38 ++++++++++++---
tools/testing/selftests/ublk/kublk.h | 46 +++++++++++++------
tools/testing/selftests/ublk/null.c | 4 +-
tools/testing/selftests/ublk/stripe.c | 4 +-
.../testing/selftests/ublk/test_stress_04.sh | 6 +--
6 files changed, 75 insertions(+), 33 deletions(-)
diff --git a/tools/testing/selftests/ublk/file_backed.c b/tools/testing/selftests/ublk/file_backed.c
index 2d93ac860bd5..cd9fe69ecce2 100644
--- a/tools/testing/selftests/ublk/file_backed.c
+++ b/tools/testing/selftests/ublk/file_backed.c
@@ -20,7 +20,7 @@ static int loop_queue_flush_io(struct ublk_thread *t, struct ublk_queue *q,
struct io_uring_sqe *sqe[1];
ublk_io_alloc_sqes(t, sqe, 1);
- io_uring_prep_fsync(sqe[0], 1 /*fds[1]*/, IORING_FSYNC_DATASYNC);
+ io_uring_prep_fsync(sqe[0], ublk_get_registered_fd(q, 1) /*fds[1]*/, IORING_FSYNC_DATASYNC);
io_uring_sqe_set_flags(sqe[0], IOSQE_FIXED_FILE);
/* bit63 marks us as tgt io */
sqe[0]->user_data = build_user_data(tag, ublk_op, 0, q->q_id, 1);
@@ -42,7 +42,7 @@ static int loop_queue_tgt_rw_io(struct ublk_thread *t, struct ublk_queue *q,
if (!sqe[0])
return -ENOMEM;
- io_uring_prep_rw(op, sqe[0], 1 /*fds[1]*/,
+ io_uring_prep_rw(op, sqe[0], ublk_get_registered_fd(q, 1) /*fds[1]*/,
addr,
iod->nr_sectors << 9,
iod->start_sector << 9);
@@ -56,19 +56,19 @@ static int loop_queue_tgt_rw_io(struct ublk_thread *t, struct ublk_queue *q,
ublk_io_alloc_sqes(t, sqe, 3);
- io_uring_prep_buf_register(sqe[0], 0, tag, q->q_id, ublk_get_io(q, tag)->buf_index);
+ io_uring_prep_buf_register(sqe[0], q, tag, q->q_id, ublk_get_io(q, tag)->buf_index);
sqe[0]->flags |= IOSQE_CQE_SKIP_SUCCESS | IOSQE_IO_HARDLINK;
sqe[0]->user_data = build_user_data(tag,
ublk_cmd_op_nr(sqe[0]->cmd_op), 0, q->q_id, 1);
- io_uring_prep_rw(op, sqe[1], 1 /*fds[1]*/, 0,
+ io_uring_prep_rw(op, sqe[1], ublk_get_registered_fd(q, 1) /*fds[1]*/, 0,
iod->nr_sectors << 9,
iod->start_sector << 9);
sqe[1]->buf_index = tag;
sqe[1]->flags |= IOSQE_FIXED_FILE | IOSQE_IO_HARDLINK;
sqe[1]->user_data = build_user_data(tag, ublk_op, 0, q->q_id, 1);
- io_uring_prep_buf_unregister(sqe[2], 0, tag, q->q_id, ublk_get_io(q, tag)->buf_index);
+ io_uring_prep_buf_unregister(sqe[2], q, tag, q->q_id, ublk_get_io(q, tag)->buf_index);
sqe[2]->user_data = build_user_data(tag, ublk_cmd_op_nr(sqe[2]->cmd_op), 0, q->q_id, 1);
return 2;
diff --git a/tools/testing/selftests/ublk/kublk.c b/tools/testing/selftests/ublk/kublk.c
index 95188065b2e9..ede1725f32bb 100644
--- a/tools/testing/selftests/ublk/kublk.c
+++ b/tools/testing/selftests/ublk/kublk.c
@@ -432,7 +432,7 @@ static void ublk_thread_deinit(struct ublk_thread *t)
}
}
-static int ublk_queue_init(struct ublk_queue *q, unsigned extra_flags)
+static int ublk_queue_init(struct ublk_queue *q, unsigned long long extra_flags)
{
struct ublk_dev *dev = q->dev;
int depth = dev->dev_info.queue_depth;
@@ -446,6 +446,9 @@ static int ublk_queue_init(struct ublk_queue *q, unsigned extra_flags)
q->flags = dev->dev_info.flags;
q->flags |= extra_flags;
+ /* Cache fd in queue for fast path access */
+ q->ublk_fd = dev->fds[0];
+
cmd_buf_size = ublk_queue_cmd_buf_sz(q);
off = UBLKSRV_CMD_BUF_OFFSET + q->q_id * ublk_queue_max_cmd_buf_sz();
q->io_cmd_buf = mmap(0, cmd_buf_size, PROT_READ,
@@ -481,9 +484,10 @@ static int ublk_queue_init(struct ublk_queue *q, unsigned extra_flags)
return -ENOMEM;
}
-static int ublk_thread_init(struct ublk_thread *t)
+static int ublk_thread_init(struct ublk_thread *t, unsigned long long extra_flags)
{
struct ublk_dev *dev = t->dev;
+ unsigned long long flags = dev->dev_info.flags | extra_flags;
int ring_depth = dev->tgt.sq_depth, cq_depth = dev->tgt.cq_depth;
int ret;
@@ -512,7 +516,17 @@ static int ublk_thread_init(struct ublk_thread *t)
io_uring_register_ring_fd(&t->ring);
- ret = io_uring_register_files(&t->ring, dev->fds, dev->nr_fds);
+ if (flags & UBLKS_Q_NO_UBLK_FIXED_FD) {
+ /* Register only backing files starting from index 1, exclude ublk control device */
+ if (dev->nr_fds > 1) {
+ ret = io_uring_register_files(&t->ring, &dev->fds[1], dev->nr_fds - 1);
+ } else {
+ /* No backing files to register, skip file registration */
+ ret = 0;
+ }
+ } else {
+ ret = io_uring_register_files(&t->ring, dev->fds, dev->nr_fds);
+ }
if (ret) {
ublk_err("ublk dev %d thread %d register files failed %d\n",
t->dev->dev_info.dev_id, t->idx, ret);
@@ -626,9 +640,12 @@ int ublk_queue_io_cmd(struct ublk_thread *t, struct ublk_io *io)
/* These fields should be written once, never change */
ublk_set_sqe_cmd_op(sqe[0], cmd_op);
- sqe[0]->fd = 0; /* dev->fds[0] */
+ sqe[0]->fd = ublk_get_registered_fd(q, 0); /* dev->fds[0] */
sqe[0]->opcode = IORING_OP_URING_CMD;
- sqe[0]->flags = IOSQE_FIXED_FILE;
+ if (q->flags & UBLKS_Q_NO_UBLK_FIXED_FD)
+ sqe[0]->flags = 0; /* Use raw FD, not fixed file */
+ else
+ sqe[0]->flags = IOSQE_FIXED_FILE;
sqe[0]->rw_flags = 0;
cmd->tag = io->tag;
cmd->q_id = q->q_id;
@@ -832,6 +849,7 @@ struct ublk_thread_info {
unsigned idx;
sem_t *ready;
cpu_set_t *affinity;
+ unsigned long long extra_flags;
};
static void *ublk_io_handler_fn(void *data)
@@ -844,7 +862,7 @@ static void *ublk_io_handler_fn(void *data)
t->dev = info->dev;
t->idx = info->idx;
- ret = ublk_thread_init(t);
+ ret = ublk_thread_init(t, info->extra_flags);
if (ret) {
ublk_err("ublk dev %d thread %u init failed\n",
dev_id, t->idx);
@@ -934,6 +952,8 @@ static int ublk_start_daemon(const struct dev_ctx *ctx, struct ublk_dev *dev)
if (ctx->auto_zc_fallback)
extra_flags = UBLKS_Q_AUTO_BUF_REG_FALLBACK;
+ if (ctx->no_ublk_fixed_fd)
+ extra_flags |= UBLKS_Q_NO_UBLK_FIXED_FD;
for (i = 0; i < dinfo->nr_hw_queues; i++) {
dev->q[i].dev = dev;
@@ -951,6 +971,7 @@ static int ublk_start_daemon(const struct dev_ctx *ctx, struct ublk_dev *dev)
tinfo[i].dev = dev;
tinfo[i].idx = i;
tinfo[i].ready = &ready;
+ tinfo[i].extra_flags = extra_flags;
/*
* If threads are not tied 1:1 to queues, setting thread
@@ -1471,7 +1492,7 @@ static void __cmd_create_help(char *exe, bool recovery)
printf("%s %s -t [null|loop|stripe|fault_inject] [-q nr_queues] [-d depth] [-n dev_id]\n",
exe, recovery ? "recover" : "add");
printf("\t[--foreground] [--quiet] [-z] [--auto_zc] [--auto_zc_fallback] [--debug_mask mask] [-r 0|1 ] [-g]\n");
- printf("\t[-e 0|1 ] [-i 0|1]\n");
+ printf("\t[-e 0|1 ] [-i 0|1] [--no_ublk_fixed_fd]\n");
printf("\t[--nthreads threads] [--per_io_tasks]\n");
printf("\t[target options] [backfile1] [backfile2] ...\n");
printf("\tdefault: nr_queues=2(max 32), depth=128(max 1024), dev_id=-1(auto allocation)\n");
@@ -1534,6 +1555,7 @@ int main(int argc, char *argv[])
{ "size", 1, NULL, 's'},
{ "nthreads", 1, NULL, 0 },
{ "per_io_tasks", 0, NULL, 0 },
+ { "no_ublk_fixed_fd", 0, NULL, 0 },
{ 0, 0, 0, 0 }
};
const struct ublk_tgt_ops *ops = NULL;
@@ -1613,6 +1635,8 @@ int main(int argc, char *argv[])
ctx.nthreads = strtol(optarg, NULL, 10);
if (!strcmp(longopts[option_idx].name, "per_io_tasks"))
ctx.per_io_tasks = 1;
+ if (!strcmp(longopts[option_idx].name, "no_ublk_fixed_fd"))
+ ctx.no_ublk_fixed_fd = 1;
break;
case '?':
/*
diff --git a/tools/testing/selftests/ublk/kublk.h b/tools/testing/selftests/ublk/kublk.h
index 219233f8a053..9d5bbf559659 100644
--- a/tools/testing/selftests/ublk/kublk.h
+++ b/tools/testing/selftests/ublk/kublk.h
@@ -77,6 +77,7 @@ struct dev_ctx {
unsigned int recovery:1;
unsigned int auto_zc_fallback:1;
unsigned int per_io_tasks:1;
+ unsigned int no_ublk_fixed_fd:1;
int _evtfd;
int _shmid;
@@ -166,7 +167,9 @@ struct ublk_queue {
/* borrow one bit of ublk uapi flags, which may never be used */
#define UBLKS_Q_AUTO_BUF_REG_FALLBACK (1ULL << 63)
+#define UBLKS_Q_NO_UBLK_FIXED_FD (1ULL << 62)
__u64 flags;
+ int ublk_fd; /* cached ublk char device fd */
struct ublk_io ios[UBLK_QUEUE_DEPTH];
};
@@ -273,34 +276,48 @@ static inline int ublk_io_alloc_sqes(struct ublk_thread *t,
return nr_sqes;
}
-static inline void io_uring_prep_buf_register(struct io_uring_sqe *sqe,
- int dev_fd, int tag, int q_id, __u64 index)
+static inline int ublk_get_registered_fd(struct ublk_queue *q, int fd_index)
+{
+ if (q->flags & UBLKS_Q_NO_UBLK_FIXED_FD) {
+ if (fd_index == 0)
+ /* Return the raw ublk FD for index 0 */
+ return q->ublk_fd;
+ /* Adjust index for backing files (index 1 becomes 0, etc.) */
+ return fd_index - 1;
+ }
+ return fd_index;
+}
+
+static inline void __io_uring_prep_buf_reg_unreg(struct io_uring_sqe *sqe,
+ struct ublk_queue *q, int tag, int q_id, __u64 index)
{
struct ublksrv_io_cmd *cmd = (struct ublksrv_io_cmd *)sqe->cmd;
+ int dev_fd = ublk_get_registered_fd(q, 0);
io_uring_prep_read(sqe, dev_fd, 0, 0, 0);
sqe->opcode = IORING_OP_URING_CMD;
- sqe->flags |= IOSQE_FIXED_FILE;
- sqe->cmd_op = UBLK_U_IO_REGISTER_IO_BUF;
+ if (q->flags & UBLKS_Q_NO_UBLK_FIXED_FD)
+ sqe->flags &= ~IOSQE_FIXED_FILE;
+ else
+ sqe->flags |= IOSQE_FIXED_FILE;
cmd->tag = tag;
cmd->addr = index;
cmd->q_id = q_id;
}
-static inline void io_uring_prep_buf_unregister(struct io_uring_sqe *sqe,
- int dev_fd, int tag, int q_id, __u64 index)
+static inline void io_uring_prep_buf_register(struct io_uring_sqe *sqe,
+ struct ublk_queue *q, int tag, int q_id, __u64 index)
{
- struct ublksrv_io_cmd *cmd = (struct ublksrv_io_cmd *)sqe->cmd;
+ __io_uring_prep_buf_reg_unreg(sqe, q, tag, q_id, index);
+ sqe->cmd_op = UBLK_U_IO_REGISTER_IO_BUF;
+}
- io_uring_prep_read(sqe, dev_fd, 0, 0, 0);
- sqe->opcode = IORING_OP_URING_CMD;
- sqe->flags |= IOSQE_FIXED_FILE;
+static inline void io_uring_prep_buf_unregister(struct io_uring_sqe *sqe,
+ struct ublk_queue *q, int tag, int q_id, __u64 index)
+{
+ __io_uring_prep_buf_reg_unreg(sqe, q, tag, q_id, index);
sqe->cmd_op = UBLK_U_IO_UNREGISTER_IO_BUF;
-
- cmd->tag = tag;
- cmd->addr = index;
- cmd->q_id = q_id;
}
static inline void *ublk_get_sqe_cmd(const struct io_uring_sqe *sqe)
@@ -396,6 +413,7 @@ static inline int ublk_queue_no_buf(const struct ublk_queue *q)
return ublk_queue_use_zc(q) || ublk_queue_use_auto_zc(q);
}
+
extern const struct ublk_tgt_ops null_tgt_ops;
extern const struct ublk_tgt_ops loop_tgt_ops;
extern const struct ublk_tgt_ops stripe_tgt_ops;
diff --git a/tools/testing/selftests/ublk/null.c b/tools/testing/selftests/ublk/null.c
index f0e0003a4860..280043f6b689 100644
--- a/tools/testing/selftests/ublk/null.c
+++ b/tools/testing/selftests/ublk/null.c
@@ -63,7 +63,7 @@ static int null_queue_zc_io(struct ublk_thread *t, struct ublk_queue *q,
ublk_io_alloc_sqes(t, sqe, 3);
- io_uring_prep_buf_register(sqe[0], 0, tag, q->q_id, ublk_get_io(q, tag)->buf_index);
+ io_uring_prep_buf_register(sqe[0], q, tag, q->q_id, ublk_get_io(q, tag)->buf_index);
sqe[0]->user_data = build_user_data(tag,
ublk_cmd_op_nr(sqe[0]->cmd_op), 0, q->q_id, 1);
sqe[0]->flags |= IOSQE_CQE_SKIP_SUCCESS | IOSQE_IO_HARDLINK;
@@ -71,7 +71,7 @@ static int null_queue_zc_io(struct ublk_thread *t, struct ublk_queue *q,
__setup_nop_io(tag, iod, sqe[1], q->q_id);
sqe[1]->flags |= IOSQE_IO_HARDLINK;
- io_uring_prep_buf_unregister(sqe[2], 0, tag, q->q_id, ublk_get_io(q, tag)->buf_index);
+ io_uring_prep_buf_unregister(sqe[2], q, tag, q->q_id, ublk_get_io(q, tag)->buf_index);
sqe[2]->user_data = build_user_data(tag, ublk_cmd_op_nr(sqe[2]->cmd_op), 0, q->q_id, 1);
// buf register is marked as IOSQE_CQE_SKIP_SUCCESS
diff --git a/tools/testing/selftests/ublk/stripe.c b/tools/testing/selftests/ublk/stripe.c
index 1fb9b7cc281b..791fa8dc1651 100644
--- a/tools/testing/selftests/ublk/stripe.c
+++ b/tools/testing/selftests/ublk/stripe.c
@@ -142,7 +142,7 @@ static int stripe_queue_tgt_rw_io(struct ublk_thread *t, struct ublk_queue *q,
ublk_io_alloc_sqes(t, sqe, s->nr + extra);
if (zc) {
- io_uring_prep_buf_register(sqe[0], 0, tag, q->q_id, io->buf_index);
+ io_uring_prep_buf_register(sqe[0], q, tag, q->q_id, io->buf_index);
sqe[0]->flags |= IOSQE_CQE_SKIP_SUCCESS | IOSQE_IO_HARDLINK;
sqe[0]->user_data = build_user_data(tag,
ublk_cmd_op_nr(sqe[0]->cmd_op), 0, q->q_id, 1);
@@ -168,7 +168,7 @@ static int stripe_queue_tgt_rw_io(struct ublk_thread *t, struct ublk_queue *q,
if (zc) {
struct io_uring_sqe *unreg = sqe[s->nr + 1];
- io_uring_prep_buf_unregister(unreg, 0, tag, q->q_id, io->buf_index);
+ io_uring_prep_buf_unregister(unreg, q, tag, q->q_id, io->buf_index);
unreg->user_data = build_user_data(
tag, ublk_cmd_op_nr(unreg->cmd_op), 0, q->q_id, 1);
}
diff --git a/tools/testing/selftests/ublk/test_stress_04.sh b/tools/testing/selftests/ublk/test_stress_04.sh
index 40d1437ca298..3f901db4d09d 100755
--- a/tools/testing/selftests/ublk/test_stress_04.sh
+++ b/tools/testing/selftests/ublk/test_stress_04.sh
@@ -28,14 +28,14 @@ _create_backfile 0 256M
_create_backfile 1 128M
_create_backfile 2 128M
-ublk_io_and_kill_daemon 8G -t null -q 4 -z &
-ublk_io_and_kill_daemon 256M -t loop -q 4 -z "${UBLK_BACKFILES[0]}" &
+ublk_io_and_kill_daemon 8G -t null -q 4 -z --no_ublk_fixed_fd &
+ublk_io_and_kill_daemon 256M -t loop -q 4 -z --no_ublk_fixed_fd "${UBLK_BACKFILES[0]}" &
ublk_io_and_kill_daemon 256M -t stripe -q 4 -z "${UBLK_BACKFILES[1]}" "${UBLK_BACKFILES[2]}" &
if _have_feature "AUTO_BUF_REG"; then
ublk_io_and_kill_daemon 8G -t null -q 4 --auto_zc &
ublk_io_and_kill_daemon 256M -t loop -q 4 --auto_zc "${UBLK_BACKFILES[0]}" &
- ublk_io_and_kill_daemon 256M -t stripe -q 4 --auto_zc "${UBLK_BACKFILES[1]}" "${UBLK_BACKFILES[2]}" &
+ ublk_io_and_kill_daemon 256M -t stripe -q 4 --auto_zc --no_ublk_fixed_fd "${UBLK_BACKFILES[1]}" "${UBLK_BACKFILES[2]}" &
ublk_io_and_kill_daemon 8G -t null -q 4 -z --auto_zc --auto_zc_fallback &
fi
--
2.47.0
* Re: [PATCH V2 2/2] ublk selftests: add --no_ublk_fixed_fd for not using registered ublk char device
2025-08-25 12:48 ` [PATCH V2 2/2] ublk selftests: add --no_ublk_fixed_fd for not using registered ublk char device Ming Lei
@ 2025-08-26 4:53 ` Caleb Sander Mateos
2025-08-26 8:36 ` Ming Lei
0 siblings, 1 reply; 9+ messages in thread
From: Caleb Sander Mateos @ 2025-08-26 4:53 UTC (permalink / raw)
To: Ming Lei; +Cc: Jens Axboe, linux-block, Uday Shankar, Keith Busch
On Mon, Aug 25, 2025 at 5:49 AM Ming Lei <ming.lei@redhat.com> wrote:
>
> Add a new command line option --no_ublk_fixed_fd that excludes the ublk
> control device (/dev/ublkcN) from io_uring's registered files array.
> When this option is used, only backing files are registered starting
> from index 1, while the ublk control device is accessed using its raw
> file descriptor.
>
> Add ublk_get_registered_fd() helper function that returns the appropriate
> file descriptor for use with io_uring operations, taking ublk_queue *
> parameter instead of ublk_thread * for better performance.
>
> Key optimizations implemented:
> - Cache UBLKS_Q_NO_UBLK_FIXED_FD flag in ublk_queue.flags to avoid
> reading dev->no_ublk_fixed_fd in fast path
> - Cache ublk char device fd in ublk_queue.ublk_fd for fast access
> - Update ublk_get_registered_fd() to use ublk_queue * parameter
> - Update io_uring_prep_buf_register/unregister() to use ublk_queue *
> - Replace ublk_device * access with ublk_queue * access in fast paths
>
> This improves performance by:
> - Eliminating device structure traversal in hot paths
> - Better cache locality with queue-local data access
> - Reduced indirection for flag and fd lookups
Are you saying that performance is better when using the raw
/dev/ublkcN fd instead of an io_uring registered file? That would be
really surprising to me, since the whole point of io_uring file
registration is to avoid the file reference counting in the I/O path.
Best,
Caleb
>
> The helper handles both modes:
> - Normal mode: Returns registered file indices directly
> - --no_ublk_fixed_fd mode: Returns raw FD for ublk device (index 0)
> and adjusted indices for backing files (index 1 becomes 0, etc.)
>
> Also pass --no_ublk_fixed_fd to test_stress_04.sh for covering
> plain ublk char device mode.
>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
> tools/testing/selftests/ublk/file_backed.c | 10 ++--
> tools/testing/selftests/ublk/kublk.c | 38 ++++++++++++---
> tools/testing/selftests/ublk/kublk.h | 46 +++++++++++++------
> tools/testing/selftests/ublk/null.c | 4 +-
> tools/testing/selftests/ublk/stripe.c | 4 +-
> .../testing/selftests/ublk/test_stress_04.sh | 6 +--
> 6 files changed, 75 insertions(+), 33 deletions(-)
>
> diff --git a/tools/testing/selftests/ublk/file_backed.c b/tools/testing/selftests/ublk/file_backed.c
> index 2d93ac860bd5..cd9fe69ecce2 100644
> --- a/tools/testing/selftests/ublk/file_backed.c
> +++ b/tools/testing/selftests/ublk/file_backed.c
> @@ -20,7 +20,7 @@ static int loop_queue_flush_io(struct ublk_thread *t, struct ublk_queue *q,
> struct io_uring_sqe *sqe[1];
>
> ublk_io_alloc_sqes(t, sqe, 1);
> - io_uring_prep_fsync(sqe[0], 1 /*fds[1]*/, IORING_FSYNC_DATASYNC);
> + io_uring_prep_fsync(sqe[0], ublk_get_registered_fd(q, 1) /*fds[1]*/, IORING_FSYNC_DATASYNC);
> io_uring_sqe_set_flags(sqe[0], IOSQE_FIXED_FILE);
> /* bit63 marks us as tgt io */
> sqe[0]->user_data = build_user_data(tag, ublk_op, 0, q->q_id, 1);
> @@ -42,7 +42,7 @@ static int loop_queue_tgt_rw_io(struct ublk_thread *t, struct ublk_queue *q,
> if (!sqe[0])
> return -ENOMEM;
>
> - io_uring_prep_rw(op, sqe[0], 1 /*fds[1]*/,
> + io_uring_prep_rw(op, sqe[0], ublk_get_registered_fd(q, 1) /*fds[1]*/,
> addr,
> iod->nr_sectors << 9,
> iod->start_sector << 9);
> @@ -56,19 +56,19 @@ static int loop_queue_tgt_rw_io(struct ublk_thread *t, struct ublk_queue *q,
>
> ublk_io_alloc_sqes(t, sqe, 3);
>
> - io_uring_prep_buf_register(sqe[0], 0, tag, q->q_id, ublk_get_io(q, tag)->buf_index);
> + io_uring_prep_buf_register(sqe[0], q, tag, q->q_id, ublk_get_io(q, tag)->buf_index);
> sqe[0]->flags |= IOSQE_CQE_SKIP_SUCCESS | IOSQE_IO_HARDLINK;
> sqe[0]->user_data = build_user_data(tag,
> ublk_cmd_op_nr(sqe[0]->cmd_op), 0, q->q_id, 1);
>
> - io_uring_prep_rw(op, sqe[1], 1 /*fds[1]*/, 0,
> + io_uring_prep_rw(op, sqe[1], ublk_get_registered_fd(q, 1) /*fds[1]*/, 0,
> iod->nr_sectors << 9,
> iod->start_sector << 9);
> sqe[1]->buf_index = tag;
> sqe[1]->flags |= IOSQE_FIXED_FILE | IOSQE_IO_HARDLINK;
> sqe[1]->user_data = build_user_data(tag, ublk_op, 0, q->q_id, 1);
>
> - io_uring_prep_buf_unregister(sqe[2], 0, tag, q->q_id, ublk_get_io(q, tag)->buf_index);
> + io_uring_prep_buf_unregister(sqe[2], q, tag, q->q_id, ublk_get_io(q, tag)->buf_index);
> sqe[2]->user_data = build_user_data(tag, ublk_cmd_op_nr(sqe[2]->cmd_op), 0, q->q_id, 1);
>
> return 2;
> diff --git a/tools/testing/selftests/ublk/kublk.c b/tools/testing/selftests/ublk/kublk.c
> index 95188065b2e9..ede1725f32bb 100644
> --- a/tools/testing/selftests/ublk/kublk.c
> +++ b/tools/testing/selftests/ublk/kublk.c
> @@ -432,7 +432,7 @@ static void ublk_thread_deinit(struct ublk_thread *t)
> }
> }
>
> -static int ublk_queue_init(struct ublk_queue *q, unsigned extra_flags)
> +static int ublk_queue_init(struct ublk_queue *q, unsigned long long extra_flags)
> {
> struct ublk_dev *dev = q->dev;
> int depth = dev->dev_info.queue_depth;
> @@ -446,6 +446,9 @@ static int ublk_queue_init(struct ublk_queue *q, unsigned extra_flags)
> q->flags = dev->dev_info.flags;
> q->flags |= extra_flags;
>
> + /* Cache fd in queue for fast path access */
> + q->ublk_fd = dev->fds[0];
> +
> cmd_buf_size = ublk_queue_cmd_buf_sz(q);
> off = UBLKSRV_CMD_BUF_OFFSET + q->q_id * ublk_queue_max_cmd_buf_sz();
> q->io_cmd_buf = mmap(0, cmd_buf_size, PROT_READ,
> @@ -481,9 +484,10 @@ static int ublk_queue_init(struct ublk_queue *q, unsigned extra_flags)
> return -ENOMEM;
> }
>
> -static int ublk_thread_init(struct ublk_thread *t)
> +static int ublk_thread_init(struct ublk_thread *t, unsigned long long extra_flags)
> {
> struct ublk_dev *dev = t->dev;
> + unsigned long long flags = dev->dev_info.flags | extra_flags;
> int ring_depth = dev->tgt.sq_depth, cq_depth = dev->tgt.cq_depth;
> int ret;
>
> @@ -512,7 +516,17 @@ static int ublk_thread_init(struct ublk_thread *t)
>
> io_uring_register_ring_fd(&t->ring);
>
> - ret = io_uring_register_files(&t->ring, dev->fds, dev->nr_fds);
> + if (flags & UBLKS_Q_NO_UBLK_FIXED_FD) {
> + /* Register only backing files starting from index 1, exclude ublk control device */
> + if (dev->nr_fds > 1) {
> + ret = io_uring_register_files(&t->ring, &dev->fds[1], dev->nr_fds - 1);
> + } else {
> + /* No backing files to register, skip file registration */
> + ret = 0;
> + }
> + } else {
> + ret = io_uring_register_files(&t->ring, dev->fds, dev->nr_fds);
> + }
> if (ret) {
> ublk_err("ublk dev %d thread %d register files failed %d\n",
> t->dev->dev_info.dev_id, t->idx, ret);
> @@ -626,9 +640,12 @@ int ublk_queue_io_cmd(struct ublk_thread *t, struct ublk_io *io)
>
> /* These fields should be written once, never change */
> ublk_set_sqe_cmd_op(sqe[0], cmd_op);
> - sqe[0]->fd = 0; /* dev->fds[0] */
> + sqe[0]->fd = ublk_get_registered_fd(q, 0); /* dev->fds[0] */
> sqe[0]->opcode = IORING_OP_URING_CMD;
> - sqe[0]->flags = IOSQE_FIXED_FILE;
> + if (q->flags & UBLKS_Q_NO_UBLK_FIXED_FD)
> + sqe[0]->flags = 0; /* Use raw FD, not fixed file */
> + else
> + sqe[0]->flags = IOSQE_FIXED_FILE;
> sqe[0]->rw_flags = 0;
> cmd->tag = io->tag;
> cmd->q_id = q->q_id;
> @@ -832,6 +849,7 @@ struct ublk_thread_info {
> unsigned idx;
> sem_t *ready;
> cpu_set_t *affinity;
> + unsigned long long extra_flags;
> };
>
> static void *ublk_io_handler_fn(void *data)
> @@ -844,7 +862,7 @@ static void *ublk_io_handler_fn(void *data)
> t->dev = info->dev;
> t->idx = info->idx;
>
> - ret = ublk_thread_init(t);
> + ret = ublk_thread_init(t, info->extra_flags);
> if (ret) {
> ublk_err("ublk dev %d thread %u init failed\n",
> dev_id, t->idx);
> @@ -934,6 +952,8 @@ static int ublk_start_daemon(const struct dev_ctx *ctx, struct ublk_dev *dev)
>
> if (ctx->auto_zc_fallback)
> extra_flags = UBLKS_Q_AUTO_BUF_REG_FALLBACK;
> + if (ctx->no_ublk_fixed_fd)
> + extra_flags |= UBLKS_Q_NO_UBLK_FIXED_FD;
>
> for (i = 0; i < dinfo->nr_hw_queues; i++) {
> dev->q[i].dev = dev;
> @@ -951,6 +971,7 @@ static int ublk_start_daemon(const struct dev_ctx *ctx, struct ublk_dev *dev)
> tinfo[i].dev = dev;
> tinfo[i].idx = i;
> tinfo[i].ready = &ready;
> + tinfo[i].extra_flags = extra_flags;
>
> /*
> * If threads are not tied 1:1 to queues, setting thread
> @@ -1471,7 +1492,7 @@ static void __cmd_create_help(char *exe, bool recovery)
> printf("%s %s -t [null|loop|stripe|fault_inject] [-q nr_queues] [-d depth] [-n dev_id]\n",
> exe, recovery ? "recover" : "add");
> printf("\t[--foreground] [--quiet] [-z] [--auto_zc] [--auto_zc_fallback] [--debug_mask mask] [-r 0|1 ] [-g]\n");
> - printf("\t[-e 0|1 ] [-i 0|1]\n");
> + printf("\t[-e 0|1 ] [-i 0|1] [--no_ublk_fixed_fd]\n");
> printf("\t[--nthreads threads] [--per_io_tasks]\n");
> printf("\t[target options] [backfile1] [backfile2] ...\n");
> printf("\tdefault: nr_queues=2(max 32), depth=128(max 1024), dev_id=-1(auto allocation)\n");
> @@ -1534,6 +1555,7 @@ int main(int argc, char *argv[])
> { "size", 1, NULL, 's'},
> { "nthreads", 1, NULL, 0 },
> { "per_io_tasks", 0, NULL, 0 },
> + { "no_ublk_fixed_fd", 0, NULL, 0 },
> { 0, 0, 0, 0 }
> };
> const struct ublk_tgt_ops *ops = NULL;
> @@ -1613,6 +1635,8 @@ int main(int argc, char *argv[])
> ctx.nthreads = strtol(optarg, NULL, 10);
> if (!strcmp(longopts[option_idx].name, "per_io_tasks"))
> ctx.per_io_tasks = 1;
> + if (!strcmp(longopts[option_idx].name, "no_ublk_fixed_fd"))
> + ctx.no_ublk_fixed_fd = 1;
> break;
> case '?':
> /*
> diff --git a/tools/testing/selftests/ublk/kublk.h b/tools/testing/selftests/ublk/kublk.h
> index 219233f8a053..9d5bbf559659 100644
> --- a/tools/testing/selftests/ublk/kublk.h
> +++ b/tools/testing/selftests/ublk/kublk.h
> @@ -77,6 +77,7 @@ struct dev_ctx {
> unsigned int recovery:1;
> unsigned int auto_zc_fallback:1;
> unsigned int per_io_tasks:1;
> + unsigned int no_ublk_fixed_fd:1;
>
> int _evtfd;
> int _shmid;
> @@ -166,7 +167,9 @@ struct ublk_queue {
>
> /* borrow one bit of ublk uapi flags, which may never be used */
> #define UBLKS_Q_AUTO_BUF_REG_FALLBACK (1ULL << 63)
> +#define UBLKS_Q_NO_UBLK_FIXED_FD (1ULL << 62)
> __u64 flags;
> + int ublk_fd; /* cached ublk char device fd */
> struct ublk_io ios[UBLK_QUEUE_DEPTH];
> };
>
> @@ -273,34 +276,48 @@ static inline int ublk_io_alloc_sqes(struct ublk_thread *t,
> return nr_sqes;
> }
>
> -static inline void io_uring_prep_buf_register(struct io_uring_sqe *sqe,
> - int dev_fd, int tag, int q_id, __u64 index)
> +static inline int ublk_get_registered_fd(struct ublk_queue *q, int fd_index)
> +{
> + if (q->flags & UBLKS_Q_NO_UBLK_FIXED_FD) {
> + if (fd_index == 0)
> + /* Return the raw ublk FD for index 0 */
> + return q->ublk_fd;
> + /* Adjust index for backing files (index 1 becomes 0, etc.) */
> + return fd_index - 1;
> + }
> + return fd_index;
> +}
> +
> +static inline void __io_uring_prep_buf_reg_unreg(struct io_uring_sqe *sqe,
> + struct ublk_queue *q, int tag, int q_id, __u64 index)
> {
> struct ublksrv_io_cmd *cmd = (struct ublksrv_io_cmd *)sqe->cmd;
> + int dev_fd = ublk_get_registered_fd(q, 0);
>
> io_uring_prep_read(sqe, dev_fd, 0, 0, 0);
> sqe->opcode = IORING_OP_URING_CMD;
> - sqe->flags |= IOSQE_FIXED_FILE;
> - sqe->cmd_op = UBLK_U_IO_REGISTER_IO_BUF;
> + if (q->flags & UBLKS_Q_NO_UBLK_FIXED_FD)
> + sqe->flags &= ~IOSQE_FIXED_FILE;
> + else
> + sqe->flags |= IOSQE_FIXED_FILE;
>
> cmd->tag = tag;
> cmd->addr = index;
> cmd->q_id = q_id;
> }
>
> -static inline void io_uring_prep_buf_unregister(struct io_uring_sqe *sqe,
> - int dev_fd, int tag, int q_id, __u64 index)
> +static inline void io_uring_prep_buf_register(struct io_uring_sqe *sqe,
> + struct ublk_queue *q, int tag, int q_id, __u64 index)
> {
> - struct ublksrv_io_cmd *cmd = (struct ublksrv_io_cmd *)sqe->cmd;
> + __io_uring_prep_buf_reg_unreg(sqe, q, tag, q_id, index);
> + sqe->cmd_op = UBLK_U_IO_REGISTER_IO_BUF;
> +}
>
> - io_uring_prep_read(sqe, dev_fd, 0, 0, 0);
> - sqe->opcode = IORING_OP_URING_CMD;
> - sqe->flags |= IOSQE_FIXED_FILE;
> +static inline void io_uring_prep_buf_unregister(struct io_uring_sqe *sqe,
> + struct ublk_queue *q, int tag, int q_id, __u64 index)
> +{
> + __io_uring_prep_buf_reg_unreg(sqe, q, tag, q_id, index);
> sqe->cmd_op = UBLK_U_IO_UNREGISTER_IO_BUF;
> -
> - cmd->tag = tag;
> - cmd->addr = index;
> - cmd->q_id = q_id;
> }
>
> static inline void *ublk_get_sqe_cmd(const struct io_uring_sqe *sqe)
> @@ -396,6 +413,7 @@ static inline int ublk_queue_no_buf(const struct ublk_queue *q)
> return ublk_queue_use_zc(q) || ublk_queue_use_auto_zc(q);
> }
>
> +
> extern const struct ublk_tgt_ops null_tgt_ops;
> extern const struct ublk_tgt_ops loop_tgt_ops;
> extern const struct ublk_tgt_ops stripe_tgt_ops;
> diff --git a/tools/testing/selftests/ublk/null.c b/tools/testing/selftests/ublk/null.c
> index f0e0003a4860..280043f6b689 100644
> --- a/tools/testing/selftests/ublk/null.c
> +++ b/tools/testing/selftests/ublk/null.c
> @@ -63,7 +63,7 @@ static int null_queue_zc_io(struct ublk_thread *t, struct ublk_queue *q,
>
> ublk_io_alloc_sqes(t, sqe, 3);
>
> - io_uring_prep_buf_register(sqe[0], 0, tag, q->q_id, ublk_get_io(q, tag)->buf_index);
> + io_uring_prep_buf_register(sqe[0], q, tag, q->q_id, ublk_get_io(q, tag)->buf_index);
> sqe[0]->user_data = build_user_data(tag,
> ublk_cmd_op_nr(sqe[0]->cmd_op), 0, q->q_id, 1);
> sqe[0]->flags |= IOSQE_CQE_SKIP_SUCCESS | IOSQE_IO_HARDLINK;
> @@ -71,7 +71,7 @@ static int null_queue_zc_io(struct ublk_thread *t, struct ublk_queue *q,
> __setup_nop_io(tag, iod, sqe[1], q->q_id);
> sqe[1]->flags |= IOSQE_IO_HARDLINK;
>
> - io_uring_prep_buf_unregister(sqe[2], 0, tag, q->q_id, ublk_get_io(q, tag)->buf_index);
> + io_uring_prep_buf_unregister(sqe[2], q, tag, q->q_id, ublk_get_io(q, tag)->buf_index);
> sqe[2]->user_data = build_user_data(tag, ublk_cmd_op_nr(sqe[2]->cmd_op), 0, q->q_id, 1);
>
> // buf register is marked as IOSQE_CQE_SKIP_SUCCESS
> diff --git a/tools/testing/selftests/ublk/stripe.c b/tools/testing/selftests/ublk/stripe.c
> index 1fb9b7cc281b..791fa8dc1651 100644
> --- a/tools/testing/selftests/ublk/stripe.c
> +++ b/tools/testing/selftests/ublk/stripe.c
> @@ -142,7 +142,7 @@ static int stripe_queue_tgt_rw_io(struct ublk_thread *t, struct ublk_queue *q,
> ublk_io_alloc_sqes(t, sqe, s->nr + extra);
>
> if (zc) {
> - io_uring_prep_buf_register(sqe[0], 0, tag, q->q_id, io->buf_index);
> + io_uring_prep_buf_register(sqe[0], q, tag, q->q_id, io->buf_index);
> sqe[0]->flags |= IOSQE_CQE_SKIP_SUCCESS | IOSQE_IO_HARDLINK;
> sqe[0]->user_data = build_user_data(tag,
> ublk_cmd_op_nr(sqe[0]->cmd_op), 0, q->q_id, 1);
> @@ -168,7 +168,7 @@ static int stripe_queue_tgt_rw_io(struct ublk_thread *t, struct ublk_queue *q,
> if (zc) {
> struct io_uring_sqe *unreg = sqe[s->nr + 1];
>
> - io_uring_prep_buf_unregister(unreg, 0, tag, q->q_id, io->buf_index);
> + io_uring_prep_buf_unregister(unreg, q, tag, q->q_id, io->buf_index);
> unreg->user_data = build_user_data(
> tag, ublk_cmd_op_nr(unreg->cmd_op), 0, q->q_id, 1);
> }
> diff --git a/tools/testing/selftests/ublk/test_stress_04.sh b/tools/testing/selftests/ublk/test_stress_04.sh
> index 40d1437ca298..3f901db4d09d 100755
> --- a/tools/testing/selftests/ublk/test_stress_04.sh
> +++ b/tools/testing/selftests/ublk/test_stress_04.sh
> @@ -28,14 +28,14 @@ _create_backfile 0 256M
> _create_backfile 1 128M
> _create_backfile 2 128M
>
> -ublk_io_and_kill_daemon 8G -t null -q 4 -z &
> -ublk_io_and_kill_daemon 256M -t loop -q 4 -z "${UBLK_BACKFILES[0]}" &
> +ublk_io_and_kill_daemon 8G -t null -q 4 -z --no_ublk_fixed_fd &
> +ublk_io_and_kill_daemon 256M -t loop -q 4 -z --no_ublk_fixed_fd "${UBLK_BACKFILES[0]}" &
> ublk_io_and_kill_daemon 256M -t stripe -q 4 -z "${UBLK_BACKFILES[1]}" "${UBLK_BACKFILES[2]}" &
>
> if _have_feature "AUTO_BUF_REG"; then
> ublk_io_and_kill_daemon 8G -t null -q 4 --auto_zc &
> ublk_io_and_kill_daemon 256M -t loop -q 4 --auto_zc "${UBLK_BACKFILES[0]}" &
> - ublk_io_and_kill_daemon 256M -t stripe -q 4 --auto_zc "${UBLK_BACKFILES[1]}" "${UBLK_BACKFILES[2]}" &
> + ublk_io_and_kill_daemon 256M -t stripe -q 4 --auto_zc --no_ublk_fixed_fd "${UBLK_BACKFILES[1]}" "${UBLK_BACKFILES[2]}" &
> ublk_io_and_kill_daemon 8G -t null -q 4 -z --auto_zc --auto_zc_fallback &
> fi
>
> --
> 2.47.0
>
* Re: [PATCH V2 1/2] ublk: avoid ublk_io_release() called after ublk char dev is closed
2025-08-25 12:48 ` [PATCH V2 1/2] " Ming Lei
@ 2025-08-26 5:49 ` Caleb Sander Mateos
2025-08-26 8:32 ` Ming Lei
0 siblings, 1 reply; 9+ messages in thread
From: Caleb Sander Mateos @ 2025-08-26 5:49 UTC (permalink / raw)
To: Ming Lei; +Cc: Jens Axboe, linux-block, Uday Shankar, Keith Busch
On Mon, Aug 25, 2025 at 5:49 AM Ming Lei <ming.lei@redhat.com> wrote:
>
> When running test_stress_04.sh, the following warning is triggered:
>
> WARNING: CPU: 1 PID: 135 at drivers/block/ublk_drv.c:1933 ublk_ch_release+0x423/0x4b0 [ublk_drv]
>
> This happens when the daemon is abruptly killed:
>
> - some references may still be held, because registering IO buffer
> doesn't grab ublk char device reference
Ah, good observation. That's definitely a problem.
>
> OR
>
> - io->task_registered_buffers won't be cleared because io buffer is
> released from non-daemon context
I don't think the task_registered_buffers optimization is really
involved here; that's just a different way of tracking the reference
count. Regardless of what task the buffer is registered or
unregistered on, the buffer still counts as 1 reference on the
ublk_io. Summing up the reference counts and making sure they are both
reset to 0 seems like a good approach to me.
>
> For zero-copy and auto buffer register modes, I/O reference crosses
> syscalls, so IO reference may not be dropped naturally when ublk server is
> killed abruptly. However, when releasing io_uring context, it is guaranteed
> that the reference is dropped finally, see io_sqe_buffers_unregister() from
> io_ring_ctx_free().
>
> Fix this by adding ublk_drain_io_references() that:
> - Waits for active I/O references dropped in async way by scheduling
> work function, for avoiding ublk dev and io_uring file's release
> dependency
> - Reinitializes io->ref and io->task_registered_buffers to clean state
>
> This ensures the reference count state is clean when ublk_queue_reinit()
> is called, preventing the warning and potential use-after-free.
One scenario I worry about is if the ublk server has already issued
UBLK_IO_COMMIT_AND_FETCH_REQ for an I/O but is killed while it still
has registered buffer(s). It's possible the ublk server hasn't
finished performing I/O to/from the registered buffers and so the I/O
isn't really complete yet. But when io_uring automatically releases
the registered buffers, the reference count will hit 0 and the ublk
I/O will be completed successfully. There seems to be some data
corruption risk in this scenario. But maybe it doesn't make sense for
a ublk server to issue UBLK_IO_COMMIT_AND_FETCH_REQ with a result code
before knowing whether the zero-copy I/Os succeeded?
>
> Fixes: 1f6540e2aabb ("ublk: zc register/unregister bvec")
> Fixes: 1ceeedb59749 ("ublk: optimize UBLK_IO_UNREGISTER_IO_BUF on daemon task")
> Fixes: 8a8fe42d765b ("ublk: optimize UBLK_IO_REGISTER_IO_BUF on daemon task")
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
> drivers/block/ublk_drv.c | 94 +++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 92 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index 99abd67b708b..f608c7efdc05 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -239,6 +239,7 @@ struct ublk_device {
> struct mutex cancel_mutex;
> bool canceling;
> pid_t ublksrv_tgid;
> + struct delayed_work exit_work;
> };
>
> /* header of ublk_params */
> @@ -1595,12 +1596,84 @@ static void ublk_set_canceling(struct ublk_device *ub, bool canceling)
> ublk_get_queue(ub, i)->canceling = canceling;
> }
>
> -static int ublk_ch_release(struct inode *inode, struct file *filp)
> +static void ublk_reset_io_ref(struct ublk_device *ub)
> {
> - struct ublk_device *ub = filp->private_data;
> + int i, j;
> +
> + if (!(ub->dev_info.flags & (UBLK_F_SUPPORT_ZERO_COPY |
> + UBLK_F_AUTO_BUF_REG)))
> + return;
> +
> + for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
> + struct ublk_queue *ubq = ublk_get_queue(ub, i);
> +
> + for (j = 0; j < ubq->q_depth; j++) {
> + struct ublk_io *io = &ubq->ios[j];
> + /*
> + * Reinitialize reference counting fields after
> + * draining. This ensures clean state for queue
> + * reinitialization.
> + */
> + refcount_set(&io->ref, 0);
> + io->task_registered_buffers = 0;
> + }
> + }
> +}
> +
> +static bool ublk_has_io_with_active_ref(struct ublk_device *ub)
> +{
> + int i, j;
> +
> + if (!(ub->dev_info.flags & (UBLK_F_SUPPORT_ZERO_COPY |
> + UBLK_F_AUTO_BUF_REG)))
> + return false;
> +
> + for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
> + struct ublk_queue *ubq = ublk_get_queue(ub, i);
> +
> + for (j = 0; j < ubq->q_depth; j++) {
> + struct ublk_io *io = &ubq->ios[j];
> + unsigned int refs = refcount_read(&io->ref) +
> + io->task_registered_buffers;
> +
> + /*
> + * UBLK_REFCOUNT_INIT or zero means no active
> + * reference
> + */
> + if (refs != UBLK_REFCOUNT_INIT && refs != 0)
> + return true;
It's technically possible to hit refs == UBLK_REFCOUNT_INIT by having
UBLK_REFCOUNT_INIT active references. It would be safer to check
UBLK_IO_FLAG_OWNED_BY_SRV: if the flag is set, the reference count
needs to reach UBLK_REFCOUNT_INIT; if the flag is unset, the reference
count needs to reach 0.
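Something along these lines (untested, just to sketch the idea, relying on
the existing io->flags / UBLK_IO_FLAG_OWNED_BY_SRV):

			unsigned int refs = refcount_read(&io->ref) +
					    io->task_registered_buffers;
			unsigned int idle = (io->flags & UBLK_IO_FLAG_OWNED_BY_SRV) ?
					    UBLK_REFCOUNT_INIT : 0;

			/* any other value means an outstanding reference */
			if (refs != idle)
				return true;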
> + }
> + }
> + return false;
> +}
> +
> +static void ublk_ch_release_work_fn(struct work_struct *work)
> +{
> + struct ublk_device *ub =
> + container_of(work, struct ublk_device, exit_work.work);
> struct gendisk *disk;
> int i;
>
> + /*
> + * For zero-copy and auto buffer register modes, I/O references
> + * might not be dropped naturally when the daemon is killed, but
> + * io_uring guarantees that registered bvec kernel buffers are
> + * unregistered finally when freeing io_uring context, then the
> + * active references are dropped.
> + *
> + * Wait until active references are dropped for avoiding use-after-free
> + *
> + * registered buffer may be unregistered in io_ring's release hander,
> + * so have to wait by scheduling work function for avoiding the two
> + * file release dependency.
> + */
> + if (ublk_has_io_with_active_ref(ub)) {
> + schedule_delayed_work(&ub->exit_work, 1);
> + return;
> + }
> +
> + ublk_reset_io_ref(ub);
Why the 2 separate loops over nr_hw_queues and q_depth? Could they be
combined into a single nested loop that waits for each ublk_io's
references to be released and then resets its reference counts to 0?
Looks like the ub->dev_info.flags checks could also be consolidated.
> +
> /*
> * disk isn't attached yet, either device isn't live, or it has
> * been removed already, so we needn't to do anything
> @@ -1673,6 +1746,23 @@ static int ublk_ch_release(struct inode *inode, struct file *filp)
> ublk_reset_ch_dev(ub);
> out:
> clear_bit(UB_STATE_OPEN, &ub->state);
> +
> + /* put the reference grabbed in ublk_ch_release() */
> + ublk_put_device(ub);
> +}
> +
> +static int ublk_ch_release(struct inode *inode, struct file *filp)
> +{
> + struct ublk_device *ub = filp->private_data;
> +
> + /*
> + * Grab ublk device reference, so it won't be gone until we are
> + * really released from work function.
> + */
> + ub = ublk_get_device(ub);
Can taking a reference fail here? If so, the NULL return value would
need to be handled. If not, the "ub =" could be dropped.
Best,
Caleb
> +
> + INIT_DELAYED_WORK(&ub->exit_work, ublk_ch_release_work_fn);
> + schedule_delayed_work(&ub->exit_work, 0);
> return 0;
> }
>
> --
> 2.47.0
>
* Re: [PATCH V2 1/2] ublk: avoid ublk_io_release() called after ublk char dev is closed
2025-08-26 5:49 ` Caleb Sander Mateos
@ 2025-08-26 8:32 ` Ming Lei
2025-08-29 15:37 ` Caleb Sander Mateos
0 siblings, 1 reply; 9+ messages in thread
From: Ming Lei @ 2025-08-26 8:32 UTC (permalink / raw)
To: Caleb Sander Mateos; +Cc: Jens Axboe, linux-block, Uday Shankar, Keith Busch
On Mon, Aug 25, 2025 at 10:49:49PM -0700, Caleb Sander Mateos wrote:
> On Mon, Aug 25, 2025 at 5:49 AM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > When running test_stress_04.sh, the following warning is triggered:
> >
> > WARNING: CPU: 1 PID: 135 at drivers/block/ublk_drv.c:1933 ublk_ch_release+0x423/0x4b0 [ublk_drv]
> >
> > This happens when the daemon is abruptly killed:
> >
> > - some references may still be held, because registering IO buffer
> > doesn't grab ublk char device reference
>
> Ah, good observation. That's definitely a problem.
>
> >
> > OR
> >
> > - io->task_registered_buffers won't be cleared because io buffer is
> > released from non-daemon context
>
> I don't think the task_registered_buffers optimization is really
> involved here; that's just a different way of tracking the reference
> count. Regardless of what task the buffer is registered or
> unregistered on, the buffer still counts as 1 reference on the
> ublk_io. Summing up the reference counts and making sure they are both
> reset to 0 seems like a good approach to me.
The warning in ublk_queue_reinit() is triggered because
- the reference is not dropped
OR
- the io buf unregister is done from another task context (kernel wq), so
both io->ref and io->task_registered_buffers end up non-zero, which stems
from the task_registered_buffers optimization
>
> >
> > For zero-copy and auto buffer register modes, I/O reference crosses
> > syscalls, so IO reference may not be dropped naturally when ublk server is
> > killed abruptly. However, when releasing io_uring context, it is guaranteed
> > that the reference is dropped finally, see io_sqe_buffers_unregister() from
> > io_ring_ctx_free().
> >
> > Fix this by adding ublk_drain_io_references() that:
> > - Waits for active I/O references dropped in async way by scheduling
> > work function, for avoiding ublk dev and io_uring file's release
> > dependency
> > - Reinitializes io->ref and io->task_registered_buffers to clean state
> >
> > This ensures the reference count state is clean when ublk_queue_reinit()
> > is called, preventing the warning and potential use-after-free.
>
> One scenario I worry about is if the ublk server has already issued
> UBLK_IO_COMMIT_AND_FETCH_REQ for an I/O but is killed while it still
> has registered buffer(s). It's possible the ublk server hasn't
> finished performing I/O to/from the registered buffers and so the I/O
> isn't really complete yet. But when io_uring automatically releases
> the registered buffers, the reference count will hit 0 and the ublk
> I/O will be completed successfully. There seems to be some data
> corruption risk in this scenario.
The final io buffer unregister happens when the io_uring context is freed
from io_uring_release(); every user of this uring context is done by then,
so it isn't possible for the ublk server to still be performing I/O with
the unregistered buffer.
> But maybe it doesn't make sense for
> a ublk server to issue UBLK_IO_COMMIT_AND_FETCH_REQ with a result code
> before knowing whether the zero-copy I/Os succeeded?
>
> >
> > Fixes: 1f6540e2aabb ("ublk: zc register/unregister bvec")
> > Fixes: 1ceeedb59749 ("ublk: optimize UBLK_IO_UNREGISTER_IO_BUF on daemon task")
> > Fixes: 8a8fe42d765b ("ublk: optimize UBLK_IO_REGISTER_IO_BUF on daemon task")
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> > drivers/block/ublk_drv.c | 94 +++++++++++++++++++++++++++++++++++++++-
> > 1 file changed, 92 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > index 99abd67b708b..f608c7efdc05 100644
> > --- a/drivers/block/ublk_drv.c
> > +++ b/drivers/block/ublk_drv.c
> > @@ -239,6 +239,7 @@ struct ublk_device {
> > struct mutex cancel_mutex;
> > bool canceling;
> > pid_t ublksrv_tgid;
> > + struct delayed_work exit_work;
> > };
> >
> > /* header of ublk_params */
> > @@ -1595,12 +1596,84 @@ static void ublk_set_canceling(struct ublk_device *ub, bool canceling)
> > ublk_get_queue(ub, i)->canceling = canceling;
> > }
> >
> > -static int ublk_ch_release(struct inode *inode, struct file *filp)
> > +static void ublk_reset_io_ref(struct ublk_device *ub)
> > {
> > - struct ublk_device *ub = filp->private_data;
> > + int i, j;
> > +
> > + if (!(ub->dev_info.flags & (UBLK_F_SUPPORT_ZERO_COPY |
> > + UBLK_F_AUTO_BUF_REG)))
> > + return;
> > +
> > + for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
> > + struct ublk_queue *ubq = ublk_get_queue(ub, i);
> > +
> > + for (j = 0; j < ubq->q_depth; j++) {
> > + struct ublk_io *io = &ubq->ios[j];
> > + /*
> > + * Reinitialize reference counting fields after
> > + * draining. This ensures clean state for queue
> > + * reinitialization.
> > + */
> > + refcount_set(&io->ref, 0);
> > + io->task_registered_buffers = 0;
> > + }
> > + }
> > +}
> > +
> > +static bool ublk_has_io_with_active_ref(struct ublk_device *ub)
> > +{
> > + int i, j;
> > +
> > + if (!(ub->dev_info.flags & (UBLK_F_SUPPORT_ZERO_COPY |
> > + UBLK_F_AUTO_BUF_REG)))
> > + return false;
> > +
> > + for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
> > + struct ublk_queue *ubq = ublk_get_queue(ub, i);
> > +
> > + for (j = 0; j < ubq->q_depth; j++) {
> > + struct ublk_io *io = &ubq->ios[j];
> > + unsigned int refs = refcount_read(&io->ref) +
> > + io->task_registered_buffers;
> > +
> > + /*
> > + * UBLK_REFCOUNT_INIT or zero means no active
> > + * reference
> > + */
> > + if (refs != UBLK_REFCOUNT_INIT && refs != 0)
> > + return true;
>
> It's technically possible to hit refs == UBLK_REFCOUNT_INIT by having
> UBLK_REFCOUNT_INIT active references. It would be safer to check
> UBLK_IO_FLAG_OWNED_BY_SRV: if the flag is set, the reference count
> needs to reach UBLK_REFCOUNT_INIT; if the flag is unset, the reference
> count needs to reach 0.
It is actually an invariant that the sum of the two is either zero or
UBLK_REFCOUNT_INIT at any time when no io buffer is registered, so this
check is sufficient and simpler.
>
> > + }
> > + }
> > + return false;
> > +}
> > +
> > +static void ublk_ch_release_work_fn(struct work_struct *work)
> > +{
> > + struct ublk_device *ub =
> > + container_of(work, struct ublk_device, exit_work.work);
> > struct gendisk *disk;
> > int i;
> >
> > + /*
> > + * For zero-copy and auto buffer register modes, I/O references
> > + * might not be dropped naturally when the daemon is killed, but
> > + * io_uring guarantees that registered bvec kernel buffers are
> > + * unregistered finally when freeing io_uring context, then the
> > + * active references are dropped.
> > + *
> > + * Wait until active references are dropped for avoiding use-after-free
> > + *
> > + * registered buffer may be unregistered in io_ring's release hander,
> > + * so have to wait by scheduling work function for avoiding the two
> > + * file release dependency.
> > + */
> > + if (ublk_has_io_with_active_ref(ub)) {
> > + schedule_delayed_work(&ub->exit_work, 1);
> > + return;
> > + }
> > +
> > + ublk_reset_io_ref(ub);
>
> Why the 2 separate loops over nr_hw_queues and q_depth? Could they be
> combined into a single nested loop that waits for each ublk_io's
> references to be released and then resets its reference counts to 0?
> Looks like the ub->dev_info.flags checks could also be consolidated.
This way looks more readable to me, otherwise ublk_has_io_with_active_ref()
would have to both check and reset. Not a problem, I can move it into one
single helper.
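Something like the following (untested) single helper, which checks and
resets in one pass:

static bool ublk_drain_io_refs(struct ublk_device *ub)
{
	bool busy = false;
	int i, j;

	if (!(ub->dev_info.flags & (UBLK_F_SUPPORT_ZERO_COPY |
					UBLK_F_AUTO_BUF_REG)))
		return false;

	for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
		struct ublk_queue *ubq = ublk_get_queue(ub, i);

		for (j = 0; j < ubq->q_depth; j++) {
			struct ublk_io *io = &ubq->ios[j];
			unsigned int refs = refcount_read(&io->ref) +
				io->task_registered_buffers;

			/* still an active reference, check again later */
			if (refs != UBLK_REFCOUNT_INIT && refs != 0) {
				busy = true;
				continue;
			}
			/* settled: reset to clean state for queue reinit */
			refcount_set(&io->ref, 0);
			io->task_registered_buffers = 0;
		}
	}
	return busy;
}

and the work function would just reschedule itself while it returns true.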
>
> > +
> > /*
> > * disk isn't attached yet, either device isn't live, or it has
> > * been removed already, so we needn't to do anything
> > @@ -1673,6 +1746,23 @@ static int ublk_ch_release(struct inode *inode, struct file *filp)
> > ublk_reset_ch_dev(ub);
> > out:
> > clear_bit(UB_STATE_OPEN, &ub->state);
> > +
> > + /* put the reference grabbed in ublk_ch_release() */
> > + ublk_put_device(ub);
> > +}
> > +
> > +static int ublk_ch_release(struct inode *inode, struct file *filp)
> > +{
> > + struct ublk_device *ub = filp->private_data;
> > +
> > + /*
> > + * Grab ublk device reference, so it won't be gone until we are
> > + * really released from work function.
> > + */
> > + ub = ublk_get_device(ub);
>
> Can taking a reference fail here? If so, the NULL return value would
> need to be handled. If not, the "ub =" could be dropped.
Of course it has to get a reference, otherwise it isn't safe to use `ub`
in the release handler. I will drop the "ub =" assignment.
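i.e. something like:

static int ublk_ch_release(struct inode *inode, struct file *filp)
{
	struct ublk_device *ub = filp->private_data;

	/* the reference is dropped in ublk_ch_release_work_fn() */
	ublk_get_device(ub);

	INIT_DELAYED_WORK(&ub->exit_work, ublk_ch_release_work_fn);
	schedule_delayed_work(&ub->exit_work, 0);
	return 0;
}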
Thanks,
Ming
* Re: [PATCH V2 2/2] ublk selftests: add --no_ublk_fixed_fd for not using registered ublk char device
2025-08-26 4:53 ` Caleb Sander Mateos
@ 2025-08-26 8:36 ` Ming Lei
0 siblings, 0 replies; 9+ messages in thread
From: Ming Lei @ 2025-08-26 8:36 UTC (permalink / raw)
To: Caleb Sander Mateos; +Cc: Jens Axboe, linux-block, Uday Shankar, Keith Busch
On Mon, Aug 25, 2025 at 09:53:27PM -0700, Caleb Sander Mateos wrote:
> On Mon, Aug 25, 2025 at 5:49 AM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > Add a new command line option --no_ublk_fixed_fd that excludes the ublk
> > control device (/dev/ublkcN) from io_uring's registered files array.
> > When this option is used, only backing files are registered starting
> > from index 1, while the ublk control device is accessed using its raw
> > file descriptor.
> >
> > Add ublk_get_registered_fd() helper function that returns the appropriate
> > file descriptor for use with io_uring operations, taking ublk_queue *
> > parameter instead of ublk_thread * for better performance.
> >
> > Key optimizations implemented:
> > - Cache UBLKS_Q_NO_UBLK_FIXED_FD flag in ublk_queue.flags to avoid
> > reading dev->no_ublk_fixed_fd in fast path
> > - Cache ublk char device fd in ublk_queue.ublk_fd for fast access
> > - Update ublk_get_registered_fd() to use ublk_queue * parameter
> > - Update io_uring_prep_buf_register/unregister() to use ublk_queue *
> > - Replace ublk_device * access with ublk_queue * access in fast paths
> >
> > This improves performance by:
> > - Eliminating device structure traversal in hot paths
> > - Better cache locality with queue-local data access
> > - Reduced indirection for flag and fd lookups
>
> Are you saying that performance is better when using the raw
> /dev/ublkcN fd instead of an io_uring registered file? That would be
> really surprising to me, since the whole point of io_uring file
> registration is to avoid the file reference counting in the I/O path.
No, here it just describes one implementation detail: the per-device flag
(no_ublk_fixed_fd) is cached in the queue's flags.
The test can trigger a hang with patch V1 by passing --no_ublk_fixed_fd,
because the releases of /dev/ublkcN and the io_uring context can depend on
each other; both are run from task work context in the current task.
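Just for clarity, the fast-path helper ends up looking roughly like this
(a sketch only; the field and flag names are the ones from the commit log,
and it assumes the char device sits at fixed-file index 0 when it is
registered, with backing files starting from index 1):

static inline int ublk_get_registered_fd(const struct ublk_queue *q)
{
	if (q->flags & UBLKS_Q_NO_UBLK_FIXED_FD)
		return q->ublk_fd;	/* raw /dev/ublkcN fd */
	return 0;			/* fixed-file index of /dev/ublkcN */
}

The caller sets IOSQE_FIXED_FILE on the SQE only in the registered case.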
Thanks,
Ming
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH V2 1/2] ublk: avoid ublk_io_release() called after ublk char dev is closed
2025-08-26 8:32 ` Ming Lei
@ 2025-08-29 15:37 ` Caleb Sander Mateos
2025-09-01 8:44 ` Ming Lei
0 siblings, 1 reply; 9+ messages in thread
From: Caleb Sander Mateos @ 2025-08-29 15:37 UTC (permalink / raw)
To: Ming Lei; +Cc: Jens Axboe, linux-block, Uday Shankar, Keith Busch
On Tue, Aug 26, 2025 at 1:33 AM Ming Lei <ming.lei@redhat.com> wrote:
>
> On Mon, Aug 25, 2025 at 10:49:49PM -0700, Caleb Sander Mateos wrote:
> > On Mon, Aug 25, 2025 at 5:49 AM Ming Lei <ming.lei@redhat.com> wrote:
> > >
> > > When running test_stress_04.sh, the following warning is triggered:
> > >
> > > WARNING: CPU: 1 PID: 135 at drivers/block/ublk_drv.c:1933 ublk_ch_release+0x423/0x4b0 [ublk_drv]
> > >
> > > This happens when the daemon is abruptly killed:
> > >
> > > - some references may still be held, because registering IO buffer
> > > doesn't grab ublk char device reference
> >
> > Ah, good observation. That's definitely a problem.
> >
> > >
> > > OR
> > >
> > > - io->task_registered_buffers won't be cleared because io buffer is
> > > released from non-daemon context
> >
> > I don't think the task_registered_buffers optimization is really
> > involved here; that's just a different way of tracking the reference
> > count. Regardless of what task the buffer is registered or
> > unregistered on, the buffer still counts as 1 reference on the
> > ublk_io. Summing up the reference counts and making sure they are both
> > reset to 0 seems like a good approach to me.
>
> The warning in ublk_queue_reinit() is triggered because
>
> - the reference is not dropped
>
> OR
>
> - the io buf unregister is done from another task context (kernel wq), so
> both io->ref and io->task_registered_buffers are non-zero, which stems
> from the task_registered_buffers optimization
Right, but I would consider that to be an outstanding reference. I
think we're on the same page.
>
> >
> > >
> > > For zero-copy and auto buffer register modes, I/O reference crosses
> > > syscalls, so IO reference may not be dropped naturally when ublk server is
> > > killed abruptly. However, when releasing io_uring context, it is guaranteed
> > > that the reference is dropped finally, see io_sqe_buffers_unregister() from
> > > io_ring_ctx_free().
> > >
> > > Fix this by adding ublk_drain_io_references() that:
> > > - Waits for active I/O references dropped in async way by scheduling
> > > work function, for avoiding ublk dev and io_uring file's release
> > > dependency
> > > - Reinitializes io->ref and io->task_registered_buffers to clean state
> > >
> > > This ensures the reference count state is clean when ublk_queue_reinit()
> > > is called, preventing the warning and potential use-after-free.
> >
> > One scenario I worry about is if the ublk server has already issued
> > UBLK_IO_COMMIT_AND_FETCH_REQ for an I/O but is killed while it still
> > has registered buffer(s). It's possible the ublk server hasn't
> > finished performing I/O to/from the registered buffers and so the I/O
> > isn't really complete yet. But when io_uring automatically releases
> > the registered buffers, the reference count will hit 0 and the ublk
> > I/O will be completed successfully. There seems to be some data
> > corruption risk in this scenario.
>
> The final io buffer unregister happens when the io_uring context is freed
> in io_uring_release(); any user of this uring context is done by then, so
> it isn't possible that the ublk server is still performing I/O with the
> unregistered buffer.
Well, the ublk server could have been killed before it could submit an
I/O using the registered buffer. But it's an existing concern,
certainly not introduced by your patch. I think you can make a
reasonable argument that a ublk server shouldn't submit a
UBLK_IO_COMMIT_AND_FETCH_REQ until it knows the I/O completed
successfully or with an error, otherwise it won't be able to change
the result code if the I/O using the registered buffer unexpectedly
fails.
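In other words, the safe ordering on the server side would be roughly
(pseudocode, not the actual selftest helpers):

	/* register the request's buffer: UBLK_IO_REGISTER_IO_BUF */
	register_io_buf(tag, buf_index);
	/* perform the backend I/O against the registered buffer */
	res = do_backend_io(buf_index);
	/* unregister the buffer: UBLK_IO_UNREGISTER_IO_BUF */
	unregister_io_buf(tag, buf_index);
	/* only now commit the result: UBLK_IO_COMMIT_AND_FETCH_REQ */
	commit_and_fetch(tag, res);

so the committed result always reflects what actually happened to the
buffer.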
>
> > But maybe it doesn't make sense for
> > a ublk server to issue UBLK_IO_COMMIT_AND_FETCH_REQ with a result code
> > before knowing whether the zero-copy I/Os succeeded?
> >
> > >
> > > Fixes: 1f6540e2aabb ("ublk: zc register/unregister bvec")
> > > Fixes: 1ceeedb59749 ("ublk: optimize UBLK_IO_UNREGISTER_IO_BUF on daemon task")
> > > Fixes: 8a8fe42d765b ("ublk: optimize UBLK_IO_REGISTER_IO_BUF on daemon task")
> > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > ---
> > > drivers/block/ublk_drv.c | 94 +++++++++++++++++++++++++++++++++++++++-
> > > 1 file changed, 92 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > > index 99abd67b708b..f608c7efdc05 100644
> > > --- a/drivers/block/ublk_drv.c
> > > +++ b/drivers/block/ublk_drv.c
> > > @@ -239,6 +239,7 @@ struct ublk_device {
> > > struct mutex cancel_mutex;
> > > bool canceling;
> > > pid_t ublksrv_tgid;
> > > + struct delayed_work exit_work;
> > > };
> > >
> > > /* header of ublk_params */
> > > @@ -1595,12 +1596,84 @@ static void ublk_set_canceling(struct ublk_device *ub, bool canceling)
> > > ublk_get_queue(ub, i)->canceling = canceling;
> > > }
> > >
> > > -static int ublk_ch_release(struct inode *inode, struct file *filp)
> > > +static void ublk_reset_io_ref(struct ublk_device *ub)
> > > {
> > > - struct ublk_device *ub = filp->private_data;
> > > + int i, j;
> > > +
> > > + if (!(ub->dev_info.flags & (UBLK_F_SUPPORT_ZERO_COPY |
> > > + UBLK_F_AUTO_BUF_REG)))
> > > + return;
> > > +
> > > + for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
> > > + struct ublk_queue *ubq = ublk_get_queue(ub, i);
> > > +
> > > + for (j = 0; j < ubq->q_depth; j++) {
> > > + struct ublk_io *io = &ubq->ios[j];
> > > + /*
> > > + * Reinitialize reference counting fields after
> > > + * draining. This ensures clean state for queue
> > > + * reinitialization.
> > > + */
> > > + refcount_set(&io->ref, 0);
> > > + io->task_registered_buffers = 0;
> > > + }
> > > + }
> > > +}
> > > +
> > > +static bool ublk_has_io_with_active_ref(struct ublk_device *ub)
> > > +{
> > > + int i, j;
> > > +
> > > + if (!(ub->dev_info.flags & (UBLK_F_SUPPORT_ZERO_COPY |
> > > + UBLK_F_AUTO_BUF_REG)))
> > > + return false;
> > > +
> > > + for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
> > > + struct ublk_queue *ubq = ublk_get_queue(ub, i);
> > > +
> > > + for (j = 0; j < ubq->q_depth; j++) {
> > > + struct ublk_io *io = &ubq->ios[j];
> > > + unsigned int refs = refcount_read(&io->ref) +
> > > + io->task_registered_buffers;
> > > +
> > > + /*
> > > + * UBLK_REFCOUNT_INIT or zero means no active
> > > + * reference
> > > + */
> > > + if (refs != UBLK_REFCOUNT_INIT && refs != 0)
> > > + return true;
> >
> > It's technically possible to hit refs == UBLK_REFCOUNT_INIT by having
> > UBLK_REFCOUNT_INIT active references. It would be safer to check
> > UBLK_IO_FLAG_OWNED_BY_SRV: if the flag is set, the reference count
> > needs to reach UBLK_REFCOUNT_INIT; if the flag is unset, the reference
> > count needs to reach 0.
>
> It is actually an invariant that the sum of the two is either zero or
> UBLK_REFCOUNT_INIT whenever the io buffer isn't registered, so this check
> is sufficient and simpler.
I think it's theoretically possible to make the reference count
arbitrarily high with ublk registered buffers, which can cause it to
actually have UBLK_REFCOUNT_INIT references (not because all
references are released) or the reference count to overflow back to 0.
One way to do this would be to register the same I/O's buffer in many
slots of many io_urings at the same time. Another possibility would be
to repeatedly register the buffer, use it in an I/O that never
completes (e.g. recv from an empty socket), and then unregister the
buffer. But this reference count overflow issue is an existing issue,
not introduced by the patch.
Best,
Caleb
>
> >
> > > + }
> > > + }
> > > + return false;
> > > +}
> > > +
> > > +static void ublk_ch_release_work_fn(struct work_struct *work)
> > > +{
> > > + struct ublk_device *ub =
> > > + container_of(work, struct ublk_device, exit_work.work);
> > > struct gendisk *disk;
> > > int i;
> > >
> > > + /*
> > > + * For zero-copy and auto buffer register modes, I/O references
> > > + * might not be dropped naturally when the daemon is killed, but
> > > + * io_uring guarantees that registered bvec kernel buffers are
> > > + * unregistered finally when freeing io_uring context, then the
> > > + * active references are dropped.
> > > + *
> > > + * Wait until active references are dropped to avoid use-after-free.
> > > + *
> > > + * A registered buffer may be unregistered in io_uring's release handler,
> > > + * so wait by scheduling a work function to avoid a release dependency
> > > + * between the two files.
> > > + */
> > > + if (ublk_has_io_with_active_ref(ub)) {
> > > + schedule_delayed_work(&ub->exit_work, 1);
> > > + return;
> > > + }
> > > +
> > > + ublk_reset_io_ref(ub);
> >
> > Why the 2 separate loops over nr_hw_queues and q_depth? Could they be
> > combined into a single nested loop that waits for each ublk_io's
> > references to be released and then resets its reference counts to 0?
> > Looks like the ub->dev_info.flags checks could also be consolidated.
>
> This way looks more readable, otherwise ublk_has_io_with_active_ref()
> would have to both check and reset. Not a problem, I can move it into a
> single helper.
>
> >
> > > +
> > > /*
> > > * disk isn't attached yet, either device isn't live, or it has
> > > * been removed already, so we needn't to do anything
> > > @@ -1673,6 +1746,23 @@ static int ublk_ch_release(struct inode *inode, struct file *filp)
> > > ublk_reset_ch_dev(ub);
> > > out:
> > > clear_bit(UB_STATE_OPEN, &ub->state);
> > > +
> > > + /* put the reference grabbed in ublk_ch_release() */
> > > + ublk_put_device(ub);
> > > +}
> > > +
> > > +static int ublk_ch_release(struct inode *inode, struct file *filp)
> > > +{
> > > + struct ublk_device *ub = filp->private_data;
> > > +
> > > + /*
> > > + * Grab ublk device reference, so it won't be gone until we are
> > > + * really released from work function.
> > > + */
> > > + ub = ublk_get_device(ub);
> >
> > Can taking a reference fail here? If so, the NULL return value would
> > need to be handled. If not, the "ub =" could be dropped.
>
> Of course it has to get a reference, otherwise it isn't safe to use `ub`
> in the release handler. I will drop the "ub =" assignment.
>
>
> Thanks,
> Ming
>
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH V2 1/2] ublk: avoid ublk_io_release() called after ublk char dev is closed
2025-08-29 15:37 ` Caleb Sander Mateos
@ 2025-09-01 8:44 ` Ming Lei
0 siblings, 0 replies; 9+ messages in thread
From: Ming Lei @ 2025-09-01 8:44 UTC (permalink / raw)
To: Caleb Sander Mateos; +Cc: Jens Axboe, linux-block, Uday Shankar, Keith Busch
On Fri, Aug 29, 2025 at 08:37:28AM -0700, Caleb Sander Mateos wrote:
> On Tue, Aug 26, 2025 at 1:33 AM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > On Mon, Aug 25, 2025 at 10:49:49PM -0700, Caleb Sander Mateos wrote:
> > > On Mon, Aug 25, 2025 at 5:49 AM Ming Lei <ming.lei@redhat.com> wrote:
> > > >
> > > > When running test_stress_04.sh, the following warning is triggered:
> > > >
> > > > WARNING: CPU: 1 PID: 135 at drivers/block/ublk_drv.c:1933 ublk_ch_release+0x423/0x4b0 [ublk_drv]
> > > >
> > > > This happens when the daemon is abruptly killed:
> > > >
> > > > - some references may still be held, because registering IO buffer
> > > > doesn't grab ublk char device reference
> > >
> > > Ah, good observation. That's definitely a problem.
> > >
> > > >
> > > > OR
> > > >
> > > > - io->task_registered_buffers won't be cleared because io buffer is
> > > > released from non-daemon context
> > >
> > > I don't think the task_registered_buffers optimization is really
> > > involved here; that's just a different way of tracking the reference
> > > count. Regardless of what task the buffer is registered or
> > > unregistered on, the buffer still counts as 1 reference on the
> > > ublk_io. Summing up the reference counts and making sure they are both
> > > reset to 0 seems like a good approach to me.
> >
> > The warning in ublk_queue_reinit() is triggered because
> >
> > - the reference is not dropped
> >
> > OR
> >
> > - the io buf unregister is done from another task context (kernel wq), so
> > both io->ref and io->task_registered_buffers are non-zero, which stems
> > from the task_registered_buffers optimization
>
> Right, but I would consider that to be an outstanding reference. I
> think we're on the same page.
>
> >
> > >
> > > >
> > > > For zero-copy and auto buffer register modes, I/O reference crosses
> > > > syscalls, so IO reference may not be dropped naturally when ublk server is
> > > > killed abruptly. However, when releasing io_uring context, it is guaranteed
> > > > that the reference is dropped finally, see io_sqe_buffers_unregister() from
> > > > io_ring_ctx_free().
> > > >
> > > > Fix this by adding ublk_drain_io_references() that:
> > > > - Waits for active I/O references dropped in async way by scheduling
> > > > work function, for avoiding ublk dev and io_uring file's release
> > > > dependency
> > > > - Reinitializes io->ref and io->task_registered_buffers to clean state
> > > >
> > > > This ensures the reference count state is clean when ublk_queue_reinit()
> > > > is called, preventing the warning and potential use-after-free.
> > >
> > > One scenario I worry about is if the ublk server has already issued
> > > UBLK_IO_COMMIT_AND_FETCH_REQ for an I/O but is killed while it still
> > > has registered buffer(s). It's possible the ublk server hasn't
> > > finished performing I/O to/from the registered buffers and so the I/O
> > > isn't really complete yet. But when io_uring automatically releases
> > > the registered buffers, the reference count will hit 0 and the ublk
> > > I/O will be completed successfully. There seems to be some data
> > > corruption risk in this scenario.
> >
> > The final io buffer unregister happens when the io_uring context is freed
> > in io_uring_release(); any user of this uring context is done by then, so
> > it isn't possible that the ublk server is still performing I/O with the
> > unregistered buffer.
>
> Well, the ublk server could have been killed before it could submit an
> I/O using the registered buffer. But it's an existing concern,
> certainly not introduced by your patch. I think you can make a
> reasonable argument that a ublk server shouldn't submit a
> UBLK_IO_COMMIT_AND_FETCH_REQ until it knows the I/O completed
> successfully or with an error, otherwise it won't be able to change
> the result code if the I/O using the registered buffer unexpectedly
> fails.
Yes, that has been a problem, because the ublk server can do whatever it
wants.
However, since your task_registered_buffers optimization, the ublk request
is no longer completed from freeing the io_uring context; it is always
aborted from ublk_abort_queue(), so it fails when the ublk server is
killed.
The situation can be improved by failing the request explicitly, setting
io->res to -EIO so that -EIO takes precedence when the ublk server is
killed, roughly as sketched below.
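Something like the following in the abort path (a sketch only, assuming
the flag lives in io->flags and that this runs before the remaining
buffer references are dropped):

	/*
	 * e.g. in ublk_abort_queue(): the server was killed, so override
	 * any result it may already have committed; the request then
	 * fails with -EIO once the last buffer reference goes away.
	 */
	if (io->flags & UBLK_IO_FLAG_OWNED_BY_SRV)
		io->res = -EIO;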
>
> >
> > > But maybe it doesn't make sense for
> > > a ublk server to issue UBLK_IO_COMMIT_AND_FETCH_REQ with a result code
> > > before knowing whether the zero-copy I/Os succeeded?
> > >
> > > >
> > > > Fixes: 1f6540e2aabb ("ublk: zc register/unregister bvec")
> > > > Fixes: 1ceeedb59749 ("ublk: optimize UBLK_IO_UNREGISTER_IO_BUF on daemon task")
> > > > Fixes: 8a8fe42d765b ("ublk: optimize UBLK_IO_REGISTER_IO_BUF on daemon task")
> > > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > > ---
> > > > drivers/block/ublk_drv.c | 94 +++++++++++++++++++++++++++++++++++++++-
> > > > 1 file changed, 92 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > > > index 99abd67b708b..f608c7efdc05 100644
> > > > --- a/drivers/block/ublk_drv.c
> > > > +++ b/drivers/block/ublk_drv.c
> > > > @@ -239,6 +239,7 @@ struct ublk_device {
> > > > struct mutex cancel_mutex;
> > > > bool canceling;
> > > > pid_t ublksrv_tgid;
> > > > + struct delayed_work exit_work;
> > > > };
> > > >
> > > > /* header of ublk_params */
> > > > @@ -1595,12 +1596,84 @@ static void ublk_set_canceling(struct ublk_device *ub, bool canceling)
> > > > ublk_get_queue(ub, i)->canceling = canceling;
> > > > }
> > > >
> > > > -static int ublk_ch_release(struct inode *inode, struct file *filp)
> > > > +static void ublk_reset_io_ref(struct ublk_device *ub)
> > > > {
> > > > - struct ublk_device *ub = filp->private_data;
> > > > + int i, j;
> > > > +
> > > > + if (!(ub->dev_info.flags & (UBLK_F_SUPPORT_ZERO_COPY |
> > > > + UBLK_F_AUTO_BUF_REG)))
> > > > + return;
> > > > +
> > > > + for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
> > > > + struct ublk_queue *ubq = ublk_get_queue(ub, i);
> > > > +
> > > > + for (j = 0; j < ubq->q_depth; j++) {
> > > > + struct ublk_io *io = &ubq->ios[j];
> > > > + /*
> > > > + * Reinitialize reference counting fields after
> > > > + * draining. This ensures clean state for queue
> > > > + * reinitialization.
> > > > + */
> > > > + refcount_set(&io->ref, 0);
> > > > + io->task_registered_buffers = 0;
> > > > + }
> > > > + }
> > > > +}
> > > > +
> > > > +static bool ublk_has_io_with_active_ref(struct ublk_device *ub)
> > > > +{
> > > > + int i, j;
> > > > +
> > > > + if (!(ub->dev_info.flags & (UBLK_F_SUPPORT_ZERO_COPY |
> > > > + UBLK_F_AUTO_BUF_REG)))
> > > > + return false;
> > > > +
> > > > + for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
> > > > + struct ublk_queue *ubq = ublk_get_queue(ub, i);
> > > > +
> > > > + for (j = 0; j < ubq->q_depth; j++) {
> > > > + struct ublk_io *io = &ubq->ios[j];
> > > > + unsigned int refs = refcount_read(&io->ref) +
> > > > + io->task_registered_buffers;
> > > > +
> > > > + /*
> > > > + * UBLK_REFCOUNT_INIT or zero means no active
> > > > + * reference
> > > > + */
> > > > + if (refs != UBLK_REFCOUNT_INIT && refs != 0)
> > > > + return true;
> > >
> > > It's technically possible to hit refs == UBLK_REFCOUNT_INIT by having
> > > UBLK_REFCOUNT_INIT active references. It would be safer to check
> > > UBLK_IO_FLAG_OWNED_BY_SRV: if the flag is set, the reference count
> > > needs to reach UBLK_REFCOUNT_INIT; if the flag is unset, the reference
> > > count needs to reach 0.
> >
> > It is actually an invariant that the sum of the two is either zero or
> > UBLK_REFCOUNT_INIT whenever the io buffer isn't registered, so this check
> > is sufficient and simpler.
>
> I think it's theoretically possible to make the reference count
> arbitrarily high with ublk registered buffers, which can cause it to
> actually have UBLK_REFCOUNT_INIT references (not because all
> references are released) or the reference count to overflow back to 0.
> One way to do this would be to register the same I/O's buffer in many
> slots of many io_urings at the same time.
The reference count is initialized to UBLK_REFCOUNT_INIT, so the only way
to trigger this is to make the counter overflow first.
> Another possibility would be
> to repeatedly register the buffer, use it in an I/O that never
> completes (e.g. recv from an empty socket), and then unregister the
> buffer. But this reference count overflow issue is an existing issue,
> not introduced by the patch.
Yes, overflow can be detected and warned about by the refcount checking,
and this is only allowed for a privileged user, so this approach looks
fine.
Thanks,
Ming
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread
Thread overview: 9+ messages
2025-08-25 12:48 [PATCH V2 0/2] ublk: avoid ublk_io_release() called after ublk char dev is closed Ming Lei
2025-08-25 12:48 ` [PATCH V2 1/2] " Ming Lei
2025-08-26 5:49 ` Caleb Sander Mateos
2025-08-26 8:32 ` Ming Lei
2025-08-29 15:37 ` Caleb Sander Mateos
2025-09-01 8:44 ` Ming Lei
2025-08-25 12:48 ` [PATCH V2 2/2] ublk selftests: add --no_ublk_fixed_fd for not using registered ublk char device Ming Lei
2025-08-26 4:53 ` Caleb Sander Mateos
2025-08-26 8:36 ` Ming Lei