linux-block.vger.kernel.org archive mirror
* [PATCH 0/9] ublk: simplify & improve IO canceling
@ 2025-04-14 11:25 Ming Lei
  2025-04-14 11:25 ` [PATCH 1/9] ublk: don't try to stop disk if ->ub_disk is NULL Ming Lei
                   ` (8 more replies)
  0 siblings, 9 replies; 32+ messages in thread
From: Ming Lei @ 2025-04-14 11:25 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: Caleb Sander Mateos, Uday Shankar, Ming Lei

Hello,

The 1st patch is a bug fix.

Patches 2~8 simplify & improve IO canceling when the ublk server daemon
is exiting, by taking a two-stage canceling approach:

- canceling active uring_cmds from the uring_cmd cancel function

- moving the aborting of inflight requests into the ublk char device
  release handler

This way, the implementation is simplified a lot; meanwhile ub->mutex is
not required before the queue becomes quiesced, so forward progress is
guaranteed.
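
As a rough sketch (condensed from the patches below; locking and error
handling omitted), the two stages end up looking like:

  /* stage 1: uring_cmd cancel function, run while the daemon task exits */
  ublk_uring_cmd_cancel_fn()
      -> ublk_start_cancel()        /* quiesce queue, set ubq->canceling */
      -> ublk_cancel_cmd()          /* complete the still-active uring_cmd */

  /* stage 2: ublk char device release, after all uring_cmds are done */
  ublk_ch_release()
      -> ublk_abort_queue()         /* fail or requeue inflight requests */
      -> mutex_lock(&ub->mutex)     /* safe now, forward progress is guaranteed */
      -> stop / quiesce / fail-io   /* pick the nosrv state per recovery flags */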

This approach & the main implementation are from Uday's patch
"improve detection and handling of ublk server exit".

The last patch adds selftest code showing the improvement on ublk server
exit: the 30sec timeout is avoided. It depends on the patchset
"[PATCH V2 00/13] selftests: ublk: test cleanup & add more tests"[1].

[1] https://lore.kernel.org/linux-block/20250412023035.2649275-1-ming.lei@redhat.com/T/#medbf7024e57beaf1c53e4cef6e076421463839d0

Both the ublk kernel selftests and ublksrv 'make test T=generic' pass.

Thanks,


Ming Lei (6):
  ublk: don't try to stop disk if ->ub_disk is NULL
  ublk: add ublk_force_abort_dev()
  ublk: rely on ->canceling for dealing with
    ublk_nosrv_dev_should_queue_io
  ublk: move device reset into ublk_ch_release()
  ublk: remove __ublk_quiesce_dev()
  ublk: simplify aborting ublk request

Uday Shankar (3):
  ublk: properly serialize all FETCH_REQs
  ublk: improve detection and handling of ublk server exit
  selftests: ublk: add generic_06 for covering fault inject

 drivers/block/ublk_drv.c                      | 484 +++++++++---------
 tools/testing/selftests/ublk/Makefile         |   4 +-
 tools/testing/selftests/ublk/fault_inject.c   |  98 ++++
 tools/testing/selftests/ublk/kublk.c          |   3 +-
 tools/testing/selftests/ublk/kublk.h          |  12 +-
 .../testing/selftests/ublk/test_generic_06.sh |  41 ++
 6 files changed, 397 insertions(+), 245 deletions(-)
 create mode 100644 tools/testing/selftests/ublk/fault_inject.c
 create mode 100755 tools/testing/selftests/ublk/test_generic_06.sh

-- 
2.47.0



* [PATCH 1/9] ublk: don't try to stop disk if ->ub_disk is NULL
  2025-04-14 11:25 [PATCH 0/9] ublk: simplify & improve IO canceling Ming Lei
@ 2025-04-14 11:25 ` Ming Lei
  2025-04-14 19:44   ` Uday Shankar
  2025-04-14 11:25 ` [PATCH 2/9] ublk: properly serialize all FETCH_REQs Ming Lei
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 32+ messages in thread
From: Ming Lei @ 2025-04-14 11:25 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: Caleb Sander Mateos, Uday Shankar, Ming Lei

In ublk_stop_dev(), if the ublk device state is UBLK_S_DEV_DEAD, we
return immediately. This is correct, but not enough, because the ublk
device may transition to another state, such as UBLK_S_DEV_QUIESCED,
even though it may have been stopped already. Then a kernel panic is
triggered.

Fix it by checking ->ub_disk directly; this is simpler and effective
since ub->mutex covers ->ub_disk changes.

Fixes: bbae8d1f526b ("ublk_drv: consider recovery feature in aborting mechanism")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/block/ublk_drv.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index cdb1543fa4a9..15de4881f25b 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -1784,7 +1784,7 @@ static void ublk_stop_dev(struct ublk_device *ub)
 	struct gendisk *disk;
 
 	mutex_lock(&ub->mutex);
-	if (ub->dev_info.state == UBLK_S_DEV_DEAD)
+	if (!ub->ub_disk)
 		goto unlock;
 	if (ublk_nosrv_dev_should_queue_io(ub)) {
 		if (ub->dev_info.state == UBLK_S_DEV_LIVE)
-- 
2.47.0



* [PATCH 2/9] ublk: properly serialize all FETCH_REQs
  2025-04-14 11:25 [PATCH 0/9] ublk: simplify & improve IO canceling Ming Lei
  2025-04-14 11:25 ` [PATCH 1/9] ublk: don't try to stop disk if ->ub_disk is NULL Ming Lei
@ 2025-04-14 11:25 ` Ming Lei
  2025-04-14 19:58   ` Uday Shankar
  2025-04-14 11:25 ` [PATCH 3/9] ublk: add ublk_force_abort_dev() Ming Lei
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 32+ messages in thread
From: Ming Lei @ 2025-04-14 11:25 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: Caleb Sander Mateos, Uday Shankar, Ming Lei

From: Uday Shankar <ushankar@purestorage.com>

Most uring_cmds issued against ublk character devices are serialized
because each command affects only one queue, and there is an early check
which only allows a single task (the queue's ubq_daemon) to issue
uring_cmds against that queue. However, this mechanism does not work for
FETCH_REQs, since they are expected before ubq_daemon is set. Since
FETCH_REQs are only used at initialization and not in the fast path,
serialize them using the per-ublk-device mutex. This fixes a number of
data races that were previously possible if a badly behaved ublk server
decided to issue multiple FETCH_REQs against the same qid/tag
concurrently.
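
For example, before this patch the following interleaving was possible
(hypothetical trace; both submissions target the same qid/tag):

  task A: FETCH_REQ                    task B: FETCH_REQ
    ublk_queue_ready() -> false          ublk_queue_ready() -> false
    io->flags checks pass                io->flags checks pass
    ublk_fill_io_cmd(io, cmd_a, ...)     ublk_fill_io_cmd(io, cmd_b, ...)
    ublk_mark_io_ready()                 ublk_mark_io_ready()

Only ublk_mark_io_ready() took ub->mutex, so the checks and
ublk_fill_io_cmd() could race on the same struct ublk_io; with this patch
the whole FETCH_REQ path runs under the mutex.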

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reported-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Uday Shankar <ushankar@purestorage.com>
---
 drivers/block/ublk_drv.c | 24 +++++++++++++++++-------
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 15de4881f25b..79f42ed7339f 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -1832,8 +1832,8 @@ static void ublk_nosrv_work(struct work_struct *work)
 
 /* device can only be started after all IOs are ready */
 static void ublk_mark_io_ready(struct ublk_device *ub, struct ublk_queue *ubq)
+	__must_hold(&ub->mutex)
 {
-	mutex_lock(&ub->mutex);
 	ubq->nr_io_ready++;
 	if (ublk_queue_ready(ubq)) {
 		ubq->ubq_daemon = current;
@@ -1845,7 +1845,6 @@ static void ublk_mark_io_ready(struct ublk_device *ub, struct ublk_queue *ubq)
 	}
 	if (ub->nr_queues_ready == ub->dev_info.nr_hw_queues)
 		complete_all(&ub->completion);
-	mutex_unlock(&ub->mutex);
 }
 
 static void ublk_handle_need_get_data(struct ublk_device *ub, int q_id,
@@ -1985,17 +1984,25 @@ static int __ublk_ch_uring_cmd(struct io_uring_cmd *cmd,
 	case UBLK_IO_UNREGISTER_IO_BUF:
 		return ublk_unregister_io_buf(cmd, ub_cmd->addr, issue_flags);
 	case UBLK_IO_FETCH_REQ:
+		mutex_lock(&ub->mutex);
 		/* UBLK_IO_FETCH_REQ is only allowed before queue is setup */
 		if (ublk_queue_ready(ubq)) {
 			ret = -EBUSY;
-			goto out;
+			goto out_unlock;
 		}
 		/*
 		 * The io is being handled by server, so COMMIT_RQ is expected
 		 * instead of FETCH_REQ
 		 */
 		if (io->flags & UBLK_IO_FLAG_OWNED_BY_SRV)
-			goto out;
+			goto out_unlock;
+
+		/*
+		 * Check again (with mutex held) that the I/O is not
+		 * active - if so, someone may have already fetched it
+		 */
+		if (io->flags & UBLK_IO_FLAG_ACTIVE)
+			goto out_unlock;
 
 		if (ublk_need_map_io(ubq)) {
 			/*
@@ -2003,15 +2010,16 @@ static int __ublk_ch_uring_cmd(struct io_uring_cmd *cmd,
 			 * DATA is not enabled
 			 */
 			if (!ub_cmd->addr && !ublk_need_get_data(ubq))
-				goto out;
+				goto out_unlock;
 		} else if (ub_cmd->addr) {
 			/* User copy requires addr to be unset */
 			ret = -EINVAL;
-			goto out;
+			goto out_unlock;
 		}
 
 		ublk_fill_io_cmd(io, cmd, ub_cmd->addr);
 		ublk_mark_io_ready(ub, ubq);
+		mutex_unlock(&ub->mutex);
 		break;
 	case UBLK_IO_COMMIT_AND_FETCH_REQ:
 		req = blk_mq_tag_to_rq(ub->tag_set.tags[ub_cmd->q_id], tag);
@@ -2051,7 +2059,9 @@ static int __ublk_ch_uring_cmd(struct io_uring_cmd *cmd,
 	ublk_prep_cancel(cmd, issue_flags, ubq, tag);
 	return -EIOCBQUEUED;
 
- out:
+out_unlock:
+	mutex_unlock(&ub->mutex);
+out:
 	pr_devel("%s: complete: cmd op %d, tag %d ret %x io_flags %x\n",
 			__func__, cmd_op, tag, ret, io->flags);
 	return ret;
-- 
2.47.0



* [PATCH 3/9] ublk: add ublk_force_abort_dev()
  2025-04-14 11:25 [PATCH 0/9] ublk: simplify & improve IO canceling Ming Lei
  2025-04-14 11:25 ` [PATCH 1/9] ublk: don't try to stop disk if ->ub_disk is NULL Ming Lei
  2025-04-14 11:25 ` [PATCH 2/9] ublk: properly serialize all FETCH_REQs Ming Lei
@ 2025-04-14 11:25 ` Ming Lei
  2025-04-14 20:06   ` Uday Shankar
  2025-04-14 11:25 ` [PATCH 4/9] ublk: rely on ->canceling for dealing with ublk_nosrv_dev_should_queue_io Ming Lei
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 32+ messages in thread
From: Ming Lei @ 2025-04-14 11:25 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: Caleb Sander Mateos, Uday Shankar, Ming Lei

Add ublk_force_abort_dev() for handling ublk_nosrv_dev_should_queue_io()
in ublk_stop_dev(). Then queue quiesce and unquiesce can be paired in a
single function.

Meanwhile, don't change the device state to QUIESCED any more, since the
disk is going to be removed soon.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/block/ublk_drv.c | 21 ++++++++-------------
 1 file changed, 8 insertions(+), 13 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 79f42ed7339f..7e2c4084c243 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -1743,22 +1743,20 @@ static void __ublk_quiesce_dev(struct ublk_device *ub)
 	ub->dev_info.state = UBLK_S_DEV_QUIESCED;
 }
 
-static void ublk_unquiesce_dev(struct ublk_device *ub)
+static void ublk_force_abort_dev(struct ublk_device *ub)
 {
 	int i;
 
-	pr_devel("%s: unquiesce ub: dev_id %d state %s\n",
+	pr_devel("%s: force abort ub: dev_id %d state %s\n",
 			__func__, ub->dev_info.dev_id,
 			ub->dev_info.state == UBLK_S_DEV_LIVE ?
 			"LIVE" : "QUIESCED");
-	/* quiesce_work has run. We let requeued rqs be aborted
-	 * before running fallback_wq. "force_abort" must be seen
-	 * after request queue is unqiuesced. Then del_gendisk()
-	 * can move on.
-	 */
+	blk_mq_quiesce_queue(ub->ub_disk->queue);
+	if (ub->dev_info.state == UBLK_S_DEV_LIVE)
+		ublk_wait_tagset_rqs_idle(ub);
+
 	for (i = 0; i < ub->dev_info.nr_hw_queues; i++)
 		ublk_get_queue(ub, i)->force_abort = true;
-
 	blk_mq_unquiesce_queue(ub->ub_disk->queue);
 	/* We may have requeued some rqs in ublk_quiesce_queue() */
 	blk_mq_kick_requeue_list(ub->ub_disk->queue);
@@ -1786,11 +1784,8 @@ static void ublk_stop_dev(struct ublk_device *ub)
 	mutex_lock(&ub->mutex);
 	if (!ub->ub_disk)
 		goto unlock;
-	if (ublk_nosrv_dev_should_queue_io(ub)) {
-		if (ub->dev_info.state == UBLK_S_DEV_LIVE)
-			__ublk_quiesce_dev(ub);
-		ublk_unquiesce_dev(ub);
-	}
+	if (ublk_nosrv_dev_should_queue_io(ub))
+		ublk_force_abort_dev(ub);
 	del_gendisk(ub->ub_disk);
 	disk = ublk_detach_disk(ub);
 	put_disk(disk);
-- 
2.47.0



* [PATCH 4/9] ublk: rely on ->canceling for dealing with ublk_nosrv_dev_should_queue_io
  2025-04-14 11:25 [PATCH 0/9] ublk: simplify & improve IO canceling Ming Lei
                   ` (2 preceding siblings ...)
  2025-04-14 11:25 ` [PATCH 3/9] ublk: add ublk_force_abort_dev() Ming Lei
@ 2025-04-14 11:25 ` Ming Lei
  2025-04-14 20:15   ` Uday Shankar
  2025-04-14 11:25 ` [PATCH 5/9] ublk: move device reset into ublk_ch_release() Ming Lei
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 32+ messages in thread
From: Ming Lei @ 2025-04-14 11:25 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: Caleb Sander Mateos, Uday Shankar, Ming Lei

Currently ublk deals with ublk_nosrv_dev_should_queue_io() by keeping the
request queue quiesced. This is fragile because the queue quiesce crosses
syscalls and process contexts.

Switch to relying on ubq->canceling for dealing with
ublk_nosrv_dev_should_queue_io(), because ->canceling is already used for
this purpose while the io_uring context is exiting, and it can be reused
before recovery too.

Meanwhile the reset of ubq->canceling has to be moved from
ublk_queue_reinit() to ublk_ctrl_end_recovery(), where IO handling can be
recovered completely.

Then blk_mq_quiesce_queue() and blk_mq_unquiesce_queue() are always used
in the same context.
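
For reference, the dispatch side already honors ->canceling, roughly as
below (existing driver behavior, simplified; assuming __ublk_abort_rq()
requeues or fails the request depending on the recovery flags):

  /* in ublk_queue_rq(), before forwarding the request to the ublk server */
  if (unlikely(ubq->canceling)) {
          __ublk_abort_rq(ubq, rq);    /* requeue or fail per recovery flags */
          return BLK_STS_OK;
  }

So marking every queue as canceling covers new IO the same way a quiesced
queue did, without keeping the quiesce across syscalls.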

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/block/ublk_drv.c | 31 +++++++++++++++++--------------
 1 file changed, 17 insertions(+), 14 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 7e2c4084c243..e0213222e3cf 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -1734,13 +1734,19 @@ static void ublk_wait_tagset_rqs_idle(struct ublk_device *ub)
 
 static void __ublk_quiesce_dev(struct ublk_device *ub)
 {
+	int i;
+
 	pr_devel("%s: quiesce ub: dev_id %d state %s\n",
 			__func__, ub->dev_info.dev_id,
 			ub->dev_info.state == UBLK_S_DEV_LIVE ?
 			"LIVE" : "QUIESCED");
 	blk_mq_quiesce_queue(ub->ub_disk->queue);
+	/* mark every queue as canceling */
+	for (i = 0; i < ub->dev_info.nr_hw_queues; i++)
+		ublk_get_queue(ub, i)->canceling = true;
 	ublk_wait_tagset_rqs_idle(ub);
 	ub->dev_info.state = UBLK_S_DEV_QUIESCED;
+	blk_mq_unquiesce_queue(ub->ub_disk->queue);
 }
 
 static void ublk_force_abort_dev(struct ublk_device *ub)
@@ -2950,7 +2956,6 @@ static void ublk_queue_reinit(struct ublk_device *ub, struct ublk_queue *ubq)
 	/* We have to reset it to NULL, otherwise ub won't accept new FETCH_REQ */
 	ubq->ubq_daemon = NULL;
 	ubq->timeout = false;
-	ubq->canceling = false;
 
 	for (i = 0; i < ubq->q_depth; i++) {
 		struct ublk_io *io = &ubq->ios[i];
@@ -3037,20 +3042,18 @@ static int ublk_ctrl_end_recovery(struct ublk_device *ub,
 	pr_devel("%s: new ublksrv_pid %d, dev id %d\n",
 			__func__, ublksrv_pid, header->dev_id);
 
-	if (ublk_nosrv_dev_should_queue_io(ub)) {
-		ub->dev_info.state = UBLK_S_DEV_LIVE;
-		blk_mq_unquiesce_queue(ub->ub_disk->queue);
-		pr_devel("%s: queue unquiesced, dev id %d.\n",
-				__func__, header->dev_id);
-		blk_mq_kick_requeue_list(ub->ub_disk->queue);
-	} else {
-		blk_mq_quiesce_queue(ub->ub_disk->queue);
-		ub->dev_info.state = UBLK_S_DEV_LIVE;
-		for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
-			ublk_get_queue(ub, i)->fail_io = false;
-		}
-		blk_mq_unquiesce_queue(ub->ub_disk->queue);
+	blk_mq_quiesce_queue(ub->ub_disk->queue);
+	ub->dev_info.state = UBLK_S_DEV_LIVE;
+	for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
+		struct ublk_queue *ubq = ublk_get_queue(ub, i);
+
+		ubq->canceling = false;
+		ubq->fail_io = false;
 	}
+	blk_mq_unquiesce_queue(ub->ub_disk->queue);
+	pr_devel("%s: queue unquiesced, dev id %d.\n",
+			__func__, header->dev_id);
+	blk_mq_kick_requeue_list(ub->ub_disk->queue);
 
 	ret = 0;
  out_unlock:
-- 
2.47.0



* [PATCH 5/9] ublk: move device reset into ublk_ch_release()
  2025-04-14 11:25 [PATCH 0/9] ublk: simplify & improve IO canceling Ming Lei
                   ` (3 preceding siblings ...)
  2025-04-14 11:25 ` [PATCH 4/9] ublk: rely on ->canceling for dealing with ublk_nosrv_dev_should_queue_io Ming Lei
@ 2025-04-14 11:25 ` Ming Lei
  2025-04-14 20:29   ` Uday Shankar
  2025-04-14 11:25 ` [PATCH 6/9] ublk: improve detection and handling of ublk server exit Ming Lei
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 32+ messages in thread
From: Ming Lei @ 2025-04-14 11:25 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: Caleb Sander Mateos, Uday Shankar, Ming Lei

ublk_ch_release() is called after the ublk char device is closed, when
all uring_cmds are done, so it is perfectly fine to move the ublk device
reset from ublk_ctrl_start_recovery() to ublk_ch_release().

This avoids holding the exiting daemon's task_struct for too long.

However, resetting the following ublk IO flags has to be delayed until
the ublk io_uring queues are ready:

- ubq->canceling

For requeuing IO in case of ublk_nosrv_dev_should_queue_io() before the
device is recovered

- ubq->fail_io

For failing IO in case of UBLK_F_USER_RECOVERY_FAIL_IO before the device
is recovered

- ublk_io->flags

For preventing use of io->cmd

This way, recovery is simplified a lot.
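
A rough timeline of the recovery flow after this patch (sketch only,
using the function names from the diff below):

  ublk server exits, all uring_cmds complete
      -> ublk_ch_release(): ublk_reset_ch_dev() re-inits every queue
         (ubq_daemon, io->cmd, ub->mm, ready counters)
  ublk_ctrl_start_recovery()
      -> now only re-inits ub->completion
  new daemon mmaps and issues FETCH_REQ for every tag
      -> ublk_mark_io_ready(): once all queues are ready,
         ublk_reset_io_flags() clears UBLK_IO_FLAG_CANCELED,
         ubq->canceling and ubq->fail_io, then completes ub->completion
  ublk_ctrl_end_recovery()
      -> sets UBLK_S_DEV_LIVE and kicks the requeue list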

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/block/ublk_drv.c | 121 +++++++++++++++++++++++----------------
 1 file changed, 72 insertions(+), 49 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index e0213222e3cf..b68bd4172fa8 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -1074,7 +1074,7 @@ static inline struct ublk_uring_cmd_pdu *ublk_get_uring_cmd_pdu(
 
 static inline bool ubq_daemon_is_dying(struct ublk_queue *ubq)
 {
-	return ubq->ubq_daemon->flags & PF_EXITING;
+	return !ubq->ubq_daemon || ubq->ubq_daemon->flags & PF_EXITING;
 }
 
 /* todo: handle partial completion */
@@ -1470,6 +1470,37 @@ static const struct blk_mq_ops ublk_mq_ops = {
 	.timeout	= ublk_timeout,
 };
 
+static void ublk_queue_reinit(struct ublk_device *ub, struct ublk_queue *ubq)
+{
+	int i;
+
+	/* All old ioucmds have to be completed */
+	ubq->nr_io_ready = 0;
+
+	/*
+	 * old daemon is PF_EXITING, put it now
+	 *
+	 * It could be NULL in case of closing one quisced device.
+	 */
+	if (ubq->ubq_daemon)
+		put_task_struct(ubq->ubq_daemon);
+	/* We have to reset it to NULL, otherwise ub won't accept new FETCH_REQ */
+	ubq->ubq_daemon = NULL;
+	ubq->timeout = false;
+
+	for (i = 0; i < ubq->q_depth; i++) {
+		struct ublk_io *io = &ubq->ios[i];
+
+		/*
+		 * UBLK_IO_FLAG_CANCELED is kept for avoiding to touch
+		 * io->cmd
+		 */
+		io->flags &= UBLK_IO_FLAG_CANCELED;
+		io->cmd = NULL;
+		io->addr = 0;
+	}
+}
+
 static int ublk_ch_open(struct inode *inode, struct file *filp)
 {
 	struct ublk_device *ub = container_of(inode->i_cdev,
@@ -1481,10 +1512,26 @@ static int ublk_ch_open(struct inode *inode, struct file *filp)
 	return 0;
 }
 
+static void ublk_reset_ch_dev(struct ublk_device *ub)
+{
+	int i;
+
+	for (i = 0; i < ub->dev_info.nr_hw_queues; i++)
+		ublk_queue_reinit(ub, ublk_get_queue(ub, i));
+
+	/* set to NULL, otherwise new ubq_daemon cannot mmap the io_cmd_buf */
+	ub->mm = NULL;
+	ub->nr_queues_ready = 0;
+	ub->nr_privileged_daemon = 0;
+}
+
 static int ublk_ch_release(struct inode *inode, struct file *filp)
 {
 	struct ublk_device *ub = filp->private_data;
 
+	/* all uring_cmd has been done now, reset device & ubq */
+	ublk_reset_ch_dev(ub);
+
 	clear_bit(UB_STATE_OPEN, &ub->state);
 	return 0;
 }
@@ -1831,6 +1878,24 @@ static void ublk_nosrv_work(struct work_struct *work)
 	ublk_cancel_dev(ub);
 }
 
+/* reset ublk io_uring queue & io flags */
+static void ublk_reset_io_flags(struct ublk_device *ub)
+{
+	int i, j;
+
+	for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
+		struct ublk_queue *ubq = ublk_get_queue(ub, i);
+
+		/* UBLK_IO_FLAG_CANCELED can be cleared now */
+		spin_lock(&ubq->cancel_lock);
+		for (j = 0; j < ubq->q_depth; j++)
+			ubq->ios[j].flags &= ~UBLK_IO_FLAG_CANCELED;
+		spin_unlock(&ubq->cancel_lock);
+		ubq->canceling = false;
+		ubq->fail_io = false;
+	}
+}
+
 /* device can only be started after all IOs are ready */
 static void ublk_mark_io_ready(struct ublk_device *ub, struct ublk_queue *ubq)
 	__must_hold(&ub->mutex)
@@ -1844,8 +1909,12 @@ static void ublk_mark_io_ready(struct ublk_device *ub, struct ublk_queue *ubq)
 		if (capable(CAP_SYS_ADMIN))
 			ub->nr_privileged_daemon++;
 	}
-	if (ub->nr_queues_ready == ub->dev_info.nr_hw_queues)
+
+	if (ub->nr_queues_ready == ub->dev_info.nr_hw_queues) {
+		/* now we are ready for handling ublk io request */
+		ublk_reset_io_flags(ub);
 		complete_all(&ub->completion);
+	}
 }
 
 static void ublk_handle_need_get_data(struct ublk_device *ub, int q_id,
@@ -2943,41 +3012,14 @@ static int ublk_ctrl_set_params(struct ublk_device *ub,
 	return ret;
 }
 
-static void ublk_queue_reinit(struct ublk_device *ub, struct ublk_queue *ubq)
-{
-	int i;
-
-	WARN_ON_ONCE(!(ubq->ubq_daemon && ubq_daemon_is_dying(ubq)));
-
-	/* All old ioucmds have to be completed */
-	ubq->nr_io_ready = 0;
-	/* old daemon is PF_EXITING, put it now */
-	put_task_struct(ubq->ubq_daemon);
-	/* We have to reset it to NULL, otherwise ub won't accept new FETCH_REQ */
-	ubq->ubq_daemon = NULL;
-	ubq->timeout = false;
-
-	for (i = 0; i < ubq->q_depth; i++) {
-		struct ublk_io *io = &ubq->ios[i];
-
-		/* forget everything now and be ready for new FETCH_REQ */
-		io->flags = 0;
-		io->cmd = NULL;
-		io->addr = 0;
-	}
-}
-
 static int ublk_ctrl_start_recovery(struct ublk_device *ub,
 		const struct ublksrv_ctrl_cmd *header)
 {
 	int ret = -EINVAL;
-	int i;
 
 	mutex_lock(&ub->mutex);
 	if (ublk_nosrv_should_stop_dev(ub))
 		goto out_unlock;
-	if (!ub->nr_queues_ready)
-		goto out_unlock;
 	/*
 	 * START_RECOVERY is only allowd after:
 	 *
@@ -3001,12 +3043,6 @@ static int ublk_ctrl_start_recovery(struct ublk_device *ub,
 		goto out_unlock;
 	}
 	pr_devel("%s: start recovery for dev id %d.\n", __func__, header->dev_id);
-	for (i = 0; i < ub->dev_info.nr_hw_queues; i++)
-		ublk_queue_reinit(ub, ublk_get_queue(ub, i));
-	/* set to NULL, otherwise new ubq_daemon cannot mmap the io_cmd_buf */
-	ub->mm = NULL;
-	ub->nr_queues_ready = 0;
-	ub->nr_privileged_daemon = 0;
 	init_completion(&ub->completion);
 	ret = 0;
  out_unlock:
@@ -3019,7 +3055,6 @@ static int ublk_ctrl_end_recovery(struct ublk_device *ub,
 {
 	int ublksrv_pid = (int)header->data[0];
 	int ret = -EINVAL;
-	int i;
 
 	pr_devel("%s: Waiting for new ubq_daemons(nr: %d) are ready, dev id %d...\n",
 			__func__, ub->dev_info.nr_hw_queues, header->dev_id);
@@ -3039,22 +3074,10 @@ static int ublk_ctrl_end_recovery(struct ublk_device *ub,
 		goto out_unlock;
 	}
 	ub->dev_info.ublksrv_pid = ublksrv_pid;
+	ub->dev_info.state = UBLK_S_DEV_LIVE;
 	pr_devel("%s: new ublksrv_pid %d, dev id %d\n",
 			__func__, ublksrv_pid, header->dev_id);
-
-	blk_mq_quiesce_queue(ub->ub_disk->queue);
-	ub->dev_info.state = UBLK_S_DEV_LIVE;
-	for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
-		struct ublk_queue *ubq = ublk_get_queue(ub, i);
-
-		ubq->canceling = false;
-		ubq->fail_io = false;
-	}
-	blk_mq_unquiesce_queue(ub->ub_disk->queue);
-	pr_devel("%s: queue unquiesced, dev id %d.\n",
-			__func__, header->dev_id);
 	blk_mq_kick_requeue_list(ub->ub_disk->queue);
-
 	ret = 0;
  out_unlock:
 	mutex_unlock(&ub->mutex);
-- 
2.47.0



* [PATCH 6/9] ublk: improve detection and handling of ublk server exit
  2025-04-14 11:25 [PATCH 0/9] ublk: simplify & improve IO canceling Ming Lei
                   ` (4 preceding siblings ...)
  2025-04-14 11:25 ` [PATCH 5/9] ublk: move device reset into ublk_ch_release() Ming Lei
@ 2025-04-14 11:25 ` Ming Lei
  2025-04-14 20:36   ` Uday Shankar
  2025-04-14 11:25 ` [PATCH 7/9] ublk: remove __ublk_quiesce_dev() Ming Lei
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 32+ messages in thread
From: Ming Lei @ 2025-04-14 11:25 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: Caleb Sander Mateos, Uday Shankar, Ming Lei

From: Uday Shankar <ushankar@purestorage.com>

There are currently two ways in which ublk server exit is detected by
ublk_drv:

1. uring_cmd cancellation. If there are any outstanding uring_cmds which
   have not been completed to the ublk server when it exits, io_uring
   calls the uring_cmd callback with a special cancellation flag as the
   issuing task is exiting.
2. I/O timeout. This is needed in addition to the above to handle the
   "saturated queue" case, when all I/Os for a given queue are in the
   ublk server, and therefore there are no outstanding uring_cmds to
   cancel when the ublk server exits.

There are a couple of issues with this approach:

- It is complex and inelegant to have two methods to detect the same
  condition
- The second method detects ublk server exit only after a long delay
  (~30s, the default timeout assigned by the block layer). This delays
  the nosrv behavior from kicking in and potential subsequent recovery
  of the device.

The second issue is brought to light with the new test_generic_06, which
will be added in a following patch. It fails before this fix:

selftests: ublk: test_generic_06.sh
dev id is 0
dd: error writing '/dev/ublkb0': Input/output error
1+0 records in
0+0 records out
0 bytes copied, 30.0611 s, 0.0 kB/s
DEAD
dd took 31 seconds to exit (>= 5s tolerance)!
generic_06 : [FAIL]

Fix this by instead detecting and handling ublk server exit in the
character file release callback. This has several advantages:

- This one place can handle both saturated and unsaturated queues. Thus,
  it replaces both preexisting methods of detecting ublk server exit.
- It runs quickly on ublk server exit - there is no 30s delay.
- It starts the process of removing task references in ublk_drv. This is
  needed if we want to relax restrictions in the driver like letting
  only one thread serve each queue
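
Condensed, the nosrv handling added to the release callback looks like
the following (sketch only; see the full diff below):

  ublk_ch_release()
      for each queue: set ->canceling, ublk_abort_queue()
      mutex_lock(&ub->mutex)
      if (ublk_nosrv_should_stop_dev(ub))
              force_abort every queue, then ublk_stop_dev_unlocked()
      else if (ublk_nosrv_dev_should_queue_io(ub))
              __ublk_quiesce_dev()    /* keep IO queued until recovery */
      else
              set UBLK_S_DEV_FAIL_IO and ->fail_io on every queue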

There is also the disadvantage that the character file release callback
can also be triggered by intentional close of the file, which is a
significant behavior change. Preexisting ublk servers (libublksrv) are
dependent on the ability to open/close the file multiple times. To
address this, only transition to a nosrv state if the file is released
while the ublk device is live. This allows for programs to open/close
the file multiple times during setup. It is still a behavior change if a
ublk server decides to close/reopen the file while the device is LIVE
(i.e. while it is responsible for serving I/O), but that would be highly
unusual. This behavior is in line with what is done by FUSE, which is
very similar to ublk in that a userspace daemon is providing services
traditionally provided by the kernel.

With this change in, the new test (and all other selftests, and all
ublksrv tests) pass:

selftests: ublk: test_generic_06.sh
dev id is 0
dd: error writing '/dev/ublkb0': Input/output error
1+0 records in
0+0 records out
0 bytes copied, 0.0376731 s, 0.0 kB/s
DEAD
generic_04 : [PASS]

Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/block/ublk_drv.c | 231 ++++++++++++++++++++++-----------------
 1 file changed, 131 insertions(+), 100 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index b68bd4172fa8..e02f205f6da4 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -199,8 +199,6 @@ struct ublk_device {
 	struct completion	completion;
 	unsigned int		nr_queues_ready;
 	unsigned int		nr_privileged_daemon;
-
-	struct work_struct	nosrv_work;
 };
 
 /* header of ublk_params */
@@ -209,8 +207,9 @@ struct ublk_params_header {
 	__u32	types;
 };
 
-static bool ublk_abort_requests(struct ublk_device *ub, struct ublk_queue *ubq);
-
+static void ublk_stop_dev_unlocked(struct ublk_device *ub);
+static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq);
+static void __ublk_quiesce_dev(struct ublk_device *ub);
 static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub,
 		struct ublk_queue *ubq, int tag, size_t offset);
 static inline unsigned int ublk_req_build_flags(struct request *req);
@@ -1336,8 +1335,6 @@ static void ublk_queue_cmd_list(struct ublk_queue *ubq, struct rq_list *l)
 static enum blk_eh_timer_return ublk_timeout(struct request *rq)
 {
 	struct ublk_queue *ubq = rq->mq_hctx->driver_data;
-	unsigned int nr_inflight = 0;
-	int i;
 
 	if (ubq->flags & UBLK_F_UNPRIVILEGED_DEV) {
 		if (!ubq->timeout) {
@@ -1348,26 +1345,6 @@ static enum blk_eh_timer_return ublk_timeout(struct request *rq)
 		return BLK_EH_DONE;
 	}
 
-	if (!ubq_daemon_is_dying(ubq))
-		return BLK_EH_RESET_TIMER;
-
-	for (i = 0; i < ubq->q_depth; i++) {
-		struct ublk_io *io = &ubq->ios[i];
-
-		if (!(io->flags & UBLK_IO_FLAG_ACTIVE))
-			nr_inflight++;
-	}
-
-	/* cancelable uring_cmd can't help us if all commands are in-flight */
-	if (nr_inflight == ubq->q_depth) {
-		struct ublk_device *ub = ubq->dev;
-
-		if (ublk_abort_requests(ub, ubq)) {
-			schedule_work(&ub->nosrv_work);
-		}
-		return BLK_EH_DONE;
-	}
-
 	return BLK_EH_RESET_TIMER;
 }
 
@@ -1525,13 +1502,112 @@ static void ublk_reset_ch_dev(struct ublk_device *ub)
 	ub->nr_privileged_daemon = 0;
 }
 
+static struct gendisk *ublk_get_disk(struct ublk_device *ub)
+{
+	struct gendisk *disk;
+
+	spin_lock(&ub->lock);
+	disk = ub->ub_disk;
+	if (disk)
+		get_device(disk_to_dev(disk));
+	spin_unlock(&ub->lock);
+
+	return disk;
+}
+
+static void ublk_put_disk(struct gendisk *disk)
+{
+	if (disk)
+		put_device(disk_to_dev(disk));
+}
+
 static int ublk_ch_release(struct inode *inode, struct file *filp)
 {
 	struct ublk_device *ub = filp->private_data;
+	struct gendisk *disk;
+	int i;
+
+	/*
+	 * disk isn't attached yet, either device isn't live, or it has
+	 * been removed already, so we needn't to do anything
+	 */
+	disk = ublk_get_disk(ub);
+	if (!disk)
+		goto out;
+
+	/*
+	 * All uring_cmd are done now, so abort any request outstanding to
+	 * the ublk server
+	 *
+	 * This can be done in lockless way because ublk server has been
+	 * gone
+	 *
+	 * More importantly, we have to provide forward progress guarantee
+	 * without holding ub->mutex, otherwise control task grabbing
+	 * ub->mutex triggers deadlock
+	 *
+	 * All requests may be inflight, so ->canceling may not be set, set
+	 * it now.
+	 */
+	for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
+		struct ublk_queue *ubq = ublk_get_queue(ub, i);
+
+		ubq->canceling = true;
+		ublk_abort_queue(ub, ubq);
+	}
+	blk_mq_kick_requeue_list(disk->queue);
+
+	/*
+	 * All infligh requests have been completed or requeued and any new
+	 * request will be failed or requeued via `->canceling` now, so it is
+	 * fine to grab ub->mutex now.
+	 */
+	mutex_lock(&ub->mutex);
+
+	/* double check after grabbing lock */
+	if (!ub->ub_disk)
+		goto unlock;
+
+	/*
+	 * Transition the device to the nosrv state. What exactly this
+	 * means depends on the recovery flags
+	 */
+	blk_mq_quiesce_queue(disk->queue);
+	if (ublk_nosrv_should_stop_dev(ub)) {
+		/*
+		 * Allow any pending/future I/O to pass through quickly
+		 * with an error. This is needed because del_gendisk
+		 * waits for all pending I/O to complete
+		 */
+		for (i = 0; i < ub->dev_info.nr_hw_queues; i++)
+			ublk_get_queue(ub, i)->force_abort = true;
+		blk_mq_unquiesce_queue(disk->queue);
+
+		ublk_stop_dev_unlocked(ub);
+	} else {
+		if (ublk_nosrv_dev_should_queue_io(ub)) {
+			/*
+			 * keep request queue as quiesced for queuing new IO
+			 * during QUIESCED state
+			 *
+			 * request queue will be unquiesced in ending
+			 * recovery, see ublk_ctrl_end_recovery
+			 */
+			__ublk_quiesce_dev(ub);
+		} else {
+			ub->dev_info.state = UBLK_S_DEV_FAIL_IO;
+			for (i = 0; i < ub->dev_info.nr_hw_queues; i++)
+				ublk_get_queue(ub, i)->fail_io = true;
+		}
+		blk_mq_unquiesce_queue(disk->queue);
+	}
+unlock:
+	mutex_unlock(&ub->mutex);
+	ublk_put_disk(disk);
 
 	/* all uring_cmd has been done now, reset device & ubq */
 	ublk_reset_ch_dev(ub);
-
+out:
 	clear_bit(UB_STATE_OPEN, &ub->state);
 	return 0;
 }
@@ -1627,37 +1703,22 @@ static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq)
 }
 
 /* Must be called when queue is frozen */
-static bool ublk_mark_queue_canceling(struct ublk_queue *ubq)
+static void ublk_mark_queue_canceling(struct ublk_queue *ubq)
 {
-	bool canceled;
-
 	spin_lock(&ubq->cancel_lock);
-	canceled = ubq->canceling;
-	if (!canceled)
+	if (!ubq->canceling)
 		ubq->canceling = true;
 	spin_unlock(&ubq->cancel_lock);
-
-	return canceled;
 }
 
-static bool ublk_abort_requests(struct ublk_device *ub, struct ublk_queue *ubq)
+static void ublk_start_cancel(struct ublk_queue *ubq)
 {
-	bool was_canceled = ubq->canceling;
-	struct gendisk *disk;
-
-	if (was_canceled)
-		return false;
-
-	spin_lock(&ub->lock);
-	disk = ub->ub_disk;
-	if (disk)
-		get_device(disk_to_dev(disk));
-	spin_unlock(&ub->lock);
+	struct ublk_device *ub = ubq->dev;
+	struct gendisk *disk = ublk_get_disk(ub);
 
 	/* Our disk has been dead */
 	if (!disk)
-		return false;
-
+		return;
 	/*
 	 * Now we are serialized with ublk_queue_rq()
 	 *
@@ -1666,15 +1727,9 @@ static bool ublk_abort_requests(struct ublk_device *ub, struct ublk_queue *ubq)
 	 * touch completed uring_cmd
 	 */
 	blk_mq_quiesce_queue(disk->queue);
-	was_canceled = ublk_mark_queue_canceling(ubq);
-	if (!was_canceled) {
-		/* abort queue is for making forward progress */
-		ublk_abort_queue(ub, ubq);
-	}
+	ublk_mark_queue_canceling(ubq);
 	blk_mq_unquiesce_queue(disk->queue);
-	put_device(disk_to_dev(disk));
-
-	return !was_canceled;
+	ublk_put_disk(disk);
 }
 
 static void ublk_cancel_cmd(struct ublk_queue *ubq, struct ublk_io *io,
@@ -1698,6 +1753,17 @@ static void ublk_cancel_cmd(struct ublk_queue *ubq, struct ublk_io *io,
 /*
  * The ublk char device won't be closed when calling cancel fn, so both
  * ublk device and queue are guaranteed to be live
+ *
+ * Two-stage cancel:
+ *
+ * - make every active uring_cmd done in ->cancel_fn()
+ *
+ * - aborting inflight ublk IO requests in ublk char device release handler,
+ *   which depends on 1st stage because device can only be closed iff all
+ *   uring_cmd are done
+ *
+ * Do _not_ try to acquire ub->mutex before all inflight requests are
+ * aborted, otherwise deadlock may be caused.
  */
 static void ublk_uring_cmd_cancel_fn(struct io_uring_cmd *cmd,
 		unsigned int issue_flags)
@@ -1705,8 +1771,6 @@ static void ublk_uring_cmd_cancel_fn(struct io_uring_cmd *cmd,
 	struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
 	struct ublk_queue *ubq = pdu->ubq;
 	struct task_struct *task;
-	struct ublk_device *ub;
-	bool need_schedule;
 	struct ublk_io *io;
 
 	if (WARN_ON_ONCE(!ubq))
@@ -1719,16 +1783,12 @@ static void ublk_uring_cmd_cancel_fn(struct io_uring_cmd *cmd,
 	if (WARN_ON_ONCE(task && task != ubq->ubq_daemon))
 		return;
 
-	ub = ubq->dev;
-	need_schedule = ublk_abort_requests(ub, ubq);
+	if (!ubq->canceling)
+		ublk_start_cancel(ubq);
 
 	io = &ubq->ios[pdu->tag];
 	WARN_ON_ONCE(io->cmd != cmd);
 	ublk_cancel_cmd(ubq, io, issue_flags);
-
-	if (need_schedule) {
-		schedule_work(&ub->nosrv_work);
-	}
 }
 
 static inline bool ublk_queue_ready(struct ublk_queue *ubq)
@@ -1779,6 +1839,7 @@ static void ublk_wait_tagset_rqs_idle(struct ublk_device *ub)
 	}
 }
 
+/* Now all inflight requests have been aborted */
 static void __ublk_quiesce_dev(struct ublk_device *ub)
 {
 	int i;
@@ -1787,13 +1848,11 @@ static void __ublk_quiesce_dev(struct ublk_device *ub)
 			__func__, ub->dev_info.dev_id,
 			ub->dev_info.state == UBLK_S_DEV_LIVE ?
 			"LIVE" : "QUIESCED");
-	blk_mq_quiesce_queue(ub->ub_disk->queue);
 	/* mark every queue as canceling */
 	for (i = 0; i < ub->dev_info.nr_hw_queues; i++)
 		ublk_get_queue(ub, i)->canceling = true;
 	ublk_wait_tagset_rqs_idle(ub);
 	ub->dev_info.state = UBLK_S_DEV_QUIESCED;
-	blk_mq_unquiesce_queue(ub->ub_disk->queue);
 }
 
 static void ublk_force_abort_dev(struct ublk_device *ub)
@@ -1830,50 +1889,25 @@ static struct gendisk *ublk_detach_disk(struct ublk_device *ub)
 	return disk;
 }
 
-static void ublk_stop_dev(struct ublk_device *ub)
+static void ublk_stop_dev_unlocked(struct ublk_device *ub)
+	__must_hold(&ub->mutex)
 {
 	struct gendisk *disk;
 
-	mutex_lock(&ub->mutex);
 	if (!ub->ub_disk)
-		goto unlock;
+		return;
+
 	if (ublk_nosrv_dev_should_queue_io(ub))
 		ublk_force_abort_dev(ub);
 	del_gendisk(ub->ub_disk);
 	disk = ublk_detach_disk(ub);
 	put_disk(disk);
- unlock:
-	mutex_unlock(&ub->mutex);
-	ublk_cancel_dev(ub);
 }
 
-static void ublk_nosrv_work(struct work_struct *work)
+static void ublk_stop_dev(struct ublk_device *ub)
 {
-	struct ublk_device *ub =
-		container_of(work, struct ublk_device, nosrv_work);
-	int i;
-
-	if (ublk_nosrv_should_stop_dev(ub)) {
-		ublk_stop_dev(ub);
-		return;
-	}
-
 	mutex_lock(&ub->mutex);
-	if (ub->dev_info.state != UBLK_S_DEV_LIVE)
-		goto unlock;
-
-	if (ublk_nosrv_dev_should_queue_io(ub)) {
-		__ublk_quiesce_dev(ub);
-	} else {
-		blk_mq_quiesce_queue(ub->ub_disk->queue);
-		ub->dev_info.state = UBLK_S_DEV_FAIL_IO;
-		for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
-			ublk_get_queue(ub, i)->fail_io = true;
-		}
-		blk_mq_unquiesce_queue(ub->ub_disk->queue);
-	}
-
- unlock:
+	ublk_stop_dev_unlocked(ub);
 	mutex_unlock(&ub->mutex);
 	ublk_cancel_dev(ub);
 }
@@ -2491,7 +2525,6 @@ static void ublk_remove(struct ublk_device *ub)
 	bool unprivileged;
 
 	ublk_stop_dev(ub);
-	cancel_work_sync(&ub->nosrv_work);
 	cdev_device_del(&ub->cdev, &ub->cdev_dev);
 	unprivileged = ub->dev_info.flags & UBLK_F_UNPRIVILEGED_DEV;
 	ublk_put_device(ub);
@@ -2776,7 +2809,6 @@ static int ublk_ctrl_add_dev(const struct ublksrv_ctrl_cmd *header)
 		goto out_unlock;
 	mutex_init(&ub->mutex);
 	spin_lock_init(&ub->lock);
-	INIT_WORK(&ub->nosrv_work, ublk_nosrv_work);
 
 	ret = ublk_alloc_dev_number(ub, header->dev_id);
 	if (ret < 0)
@@ -2908,7 +2940,6 @@ static inline void ublk_ctrl_cmd_dump(struct io_uring_cmd *cmd)
 static int ublk_ctrl_stop_dev(struct ublk_device *ub)
 {
 	ublk_stop_dev(ub);
-	cancel_work_sync(&ub->nosrv_work);
 	return 0;
 }
 
-- 
2.47.0



* [PATCH 7/9] ublk: remove __ublk_quiesce_dev()
  2025-04-14 11:25 [PATCH 0/9] ublk: simplify & improve IO canceling Ming Lei
                   ` (5 preceding siblings ...)
  2025-04-14 11:25 ` [PATCH 6/9] ublk: improve detection and handling of ublk server exit Ming Lei
@ 2025-04-14 11:25 ` Ming Lei
  2025-04-14 20:37   ` Uday Shankar
  2025-04-14 11:25 ` [PATCH 8/9] ublk: simplify aborting ublk request Ming Lei
  2025-04-14 11:25 ` [PATCH 9/9] selftests: ublk: add generic_06 for covering fault inject Ming Lei
  8 siblings, 1 reply; 32+ messages in thread
From: Ming Lei @ 2025-04-14 11:25 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: Caleb Sander Mateos, Uday Shankar, Ming Lei

Remove __ublk_quiesce_dev() and open code the update of the device state
to QUIESCED.

We don't need to drain inflight requests in __ublk_quiesce_dev() any
more, because all inflight requests are aborted in the ublk char device
release handler.

Also we don't need to set ->canceling in __ublk_quiesce_dev() any more
because it is now done unconditionally in ublk_ch_release().

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/block/ublk_drv.c | 22 +---------------------
 1 file changed, 1 insertion(+), 21 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index e02f205f6da4..f827c2ef00a9 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -209,7 +209,6 @@ struct ublk_params_header {
 
 static void ublk_stop_dev_unlocked(struct ublk_device *ub);
 static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq);
-static void __ublk_quiesce_dev(struct ublk_device *ub);
 static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub,
 		struct ublk_queue *ubq, int tag, size_t offset);
 static inline unsigned int ublk_req_build_flags(struct request *req);
@@ -1589,11 +1588,8 @@ static int ublk_ch_release(struct inode *inode, struct file *filp)
 			/*
 			 * keep request queue as quiesced for queuing new IO
 			 * during QUIESCED state
-			 *
-			 * request queue will be unquiesced in ending
-			 * recovery, see ublk_ctrl_end_recovery
 			 */
-			__ublk_quiesce_dev(ub);
+			ub->dev_info.state = UBLK_S_DEV_QUIESCED;
 		} else {
 			ub->dev_info.state = UBLK_S_DEV_FAIL_IO;
 			for (i = 0; i < ub->dev_info.nr_hw_queues; i++)
@@ -1839,22 +1835,6 @@ static void ublk_wait_tagset_rqs_idle(struct ublk_device *ub)
 	}
 }
 
-/* Now all inflight requests have been aborted */
-static void __ublk_quiesce_dev(struct ublk_device *ub)
-{
-	int i;
-
-	pr_devel("%s: quiesce ub: dev_id %d state %s\n",
-			__func__, ub->dev_info.dev_id,
-			ub->dev_info.state == UBLK_S_DEV_LIVE ?
-			"LIVE" : "QUIESCED");
-	/* mark every queue as canceling */
-	for (i = 0; i < ub->dev_info.nr_hw_queues; i++)
-		ublk_get_queue(ub, i)->canceling = true;
-	ublk_wait_tagset_rqs_idle(ub);
-	ub->dev_info.state = UBLK_S_DEV_QUIESCED;
-}
-
 static void ublk_force_abort_dev(struct ublk_device *ub)
 {
 	int i;
-- 
2.47.0



* [PATCH 8/9] ublk: simplify aborting ublk request
  2025-04-14 11:25 [PATCH 0/9] ublk: simplify & improve IO canceling Ming Lei
                   ` (6 preceding siblings ...)
  2025-04-14 11:25 ` [PATCH 7/9] ublk: remove __ublk_quiesce_dev() Ming Lei
@ 2025-04-14 11:25 ` Ming Lei
  2025-04-14 20:42   ` Uday Shankar
  2025-04-14 11:25 ` [PATCH 9/9] selftests: ublk: add generic_06 for covering fault inject Ming Lei
  8 siblings, 1 reply; 32+ messages in thread
From: Ming Lei @ 2025-04-14 11:25 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: Caleb Sander Mateos, Uday Shankar, Ming Lei

Now that ublk_abort_queue() is moved to the ublk char device release
handler, and the request queue is effectively "quiesced" because either
->canceling was set from the uring_cmd cancel function or all IOs are
inflight and can't be completed by the ublk server, things become much
easier:

- all uring_cmds are done, so we needn't mark the io as
UBLK_IO_FLAG_ABORTED for handling completion from uring_cmd

- the ublk char device is closed, so no one can hold an IO request
reference any more, and we can simply complete the request, or requeue it
in the ublk_nosrv_should_reissue_outstanding() case.
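
With that, the abort path reduces to roughly the following (sketch of the
resulting code):

  ublk_abort_queue()
      for each tag whose request is inflight (UBLK_IO_FLAG_ACTIVE not set):
          __ublk_fail_req():
              if (ublk_nosrv_should_reissue_outstanding(ubq->dev))
                      blk_mq_requeue_request(req, false);
              else
                      io->res = -EIO; __ublk_complete_rq(req);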

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/block/ublk_drv.c | 82 ++++++++++------------------------------
 1 file changed, 20 insertions(+), 62 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index f827c2ef00a9..37a0cb8011c1 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -122,15 +122,6 @@ struct ublk_uring_cmd_pdu {
  */
 #define UBLK_IO_FLAG_OWNED_BY_SRV 0x02
 
-/*
- * IO command is aborted, so this flag is set in case of
- * !UBLK_IO_FLAG_ACTIVE.
- *
- * After this flag is observed, any pending or new incoming request
- * associated with this io command will be failed immediately
- */
-#define UBLK_IO_FLAG_ABORTED 0x04
-
 /*
  * UBLK_IO_FLAG_NEED_GET_DATA is set because IO command requires
  * get data buffer address from ublksrv.
@@ -1083,12 +1074,6 @@ static inline void __ublk_complete_rq(struct request *req)
 	unsigned int unmapped_bytes;
 	blk_status_t res = BLK_STS_OK;
 
-	/* called from ublk_abort_queue() code path */
-	if (io->flags & UBLK_IO_FLAG_ABORTED) {
-		res = BLK_STS_IOERR;
-		goto exit;
-	}
-
 	/* failed read IO if nothing is read */
 	if (!io->res && req_op(req) == REQ_OP_READ)
 		io->res = -EIO;
@@ -1138,47 +1123,6 @@ static void ublk_complete_rq(struct kref *ref)
 	__ublk_complete_rq(req);
 }
 
-static void ublk_do_fail_rq(struct request *req)
-{
-	struct ublk_queue *ubq = req->mq_hctx->driver_data;
-
-	if (ublk_nosrv_should_reissue_outstanding(ubq->dev))
-		blk_mq_requeue_request(req, false);
-	else
-		__ublk_complete_rq(req);
-}
-
-static void ublk_fail_rq_fn(struct kref *ref)
-{
-	struct ublk_rq_data *data = container_of(ref, struct ublk_rq_data,
-			ref);
-	struct request *req = blk_mq_rq_from_pdu(data);
-
-	ublk_do_fail_rq(req);
-}
-
-/*
- * Since ublk_rq_task_work_cb always fails requests immediately during
- * exiting, __ublk_fail_req() is only called from abort context during
- * exiting. So lock is unnecessary.
- *
- * Also aborting may not be started yet, keep in mind that one failed
- * request may be issued by block layer again.
- */
-static void __ublk_fail_req(struct ublk_queue *ubq, struct ublk_io *io,
-		struct request *req)
-{
-	WARN_ON_ONCE(io->flags & UBLK_IO_FLAG_ACTIVE);
-
-	if (ublk_need_req_ref(ubq)) {
-		struct ublk_rq_data *data = blk_mq_rq_to_pdu(req);
-
-		kref_put(&data->ref, ublk_fail_rq_fn);
-	} else {
-		ublk_do_fail_rq(req);
-	}
-}
-
 static void ubq_complete_io_cmd(struct ublk_io *io, int res,
 				unsigned issue_flags)
 {
@@ -1670,10 +1614,26 @@ static void ublk_commit_completion(struct ublk_device *ub,
 		ublk_put_req_ref(ubq, req);
 }
 
+static void __ublk_fail_req(struct ublk_queue *ubq, struct ublk_io *io,
+		struct request *req)
+{
+	WARN_ON_ONCE(io->flags & UBLK_IO_FLAG_ACTIVE);
+
+	if (ublk_nosrv_should_reissue_outstanding(ubq->dev))
+		blk_mq_requeue_request(req, false);
+	else {
+		io->res = -EIO;
+		__ublk_complete_rq(req);
+	}
+}
+
 /*
- * Called from ubq_daemon context via cancel fn, meantime quiesce ublk
- * blk-mq queue, so we are called exclusively with blk-mq and ubq_daemon
- * context, so everything is serialized.
+ * Called from ublk char device release handler, when any uring_cmd is
+ * done, meantime request queue is "quiesced" since all inflight requests
+ * can't be completed because ublk server is dead.
+ *
+ * So no one can hold our request IO reference any more, simply ignore the
+ * reference, and complete the request immediately
  */
 static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq)
 {
@@ -1690,10 +1650,8 @@ static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq)
 			 * will do it
 			 */
 			rq = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], i);
-			if (rq && blk_mq_request_started(rq)) {
-				io->flags |= UBLK_IO_FLAG_ABORTED;
+			if (rq && blk_mq_request_started(rq))
 				__ublk_fail_req(ubq, io, rq);
-			}
 		}
 	}
 }
-- 
2.47.0



* [PATCH 9/9] selftests: ublk: add generic_06 for covering fault inject
  2025-04-14 11:25 [PATCH 0/9] ublk: simplify & improve IO canceling Ming Lei
                   ` (7 preceding siblings ...)
  2025-04-14 11:25 ` [PATCH 8/9] ublk: simplify aborting ublk request Ming Lei
@ 2025-04-14 11:25 ` Ming Lei
  2025-04-14 20:44   ` Uday Shankar
  8 siblings, 1 reply; 32+ messages in thread
From: Ming Lei @ 2025-04-14 11:25 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: Caleb Sander Mateos, Uday Shankar, Ming Lei

From: Uday Shankar <ushankar@purestorage.com>

Add a simple fault-inject target, and verify that outstanding I/O is
failed within the expected period after the ublk server is killed.

Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 tools/testing/selftests/ublk/Makefile         |  4 +-
 tools/testing/selftests/ublk/fault_inject.c   | 98 +++++++++++++++++++
 tools/testing/selftests/ublk/kublk.c          |  3 +-
 tools/testing/selftests/ublk/kublk.h          | 12 ++-
 .../testing/selftests/ublk/test_generic_06.sh | 41 ++++++++
 5 files changed, 155 insertions(+), 3 deletions(-)
 create mode 100644 tools/testing/selftests/ublk/fault_inject.c
 create mode 100755 tools/testing/selftests/ublk/test_generic_06.sh

diff --git a/tools/testing/selftests/ublk/Makefile b/tools/testing/selftests/ublk/Makefile
index dddc64036aa1..ec4624a283bc 100644
--- a/tools/testing/selftests/ublk/Makefile
+++ b/tools/testing/selftests/ublk/Makefile
@@ -8,6 +8,7 @@ TEST_PROGS += test_generic_02.sh
 TEST_PROGS += test_generic_03.sh
 TEST_PROGS += test_generic_04.sh
 TEST_PROGS += test_generic_05.sh
+TEST_PROGS += test_generic_06.sh
 
 TEST_PROGS += test_null_01.sh
 TEST_PROGS += test_null_02.sh
@@ -31,7 +32,8 @@ TEST_GEN_PROGS_EXTENDED = kublk
 
 include ../lib.mk
 
-$(TEST_GEN_PROGS_EXTENDED): kublk.c null.c file_backed.c common.c stripe.c
+$(TEST_GEN_PROGS_EXTENDED): kublk.c null.c file_backed.c common.c stripe.c \
+	fault_inject.c
 
 check:
 	shellcheck -x -f gcc *.sh
diff --git a/tools/testing/selftests/ublk/fault_inject.c b/tools/testing/selftests/ublk/fault_inject.c
new file mode 100644
index 000000000000..94a8e729ba4c
--- /dev/null
+++ b/tools/testing/selftests/ublk/fault_inject.c
@@ -0,0 +1,98 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Fault injection ublk target. Hack this up however you like for
+ * testing specific behaviors of ublk_drv. Currently is a null target
+ * with a configurable delay before completing each I/O. This delay can
+ * be used to test ublk_drv's handling of I/O outstanding to the ublk
+ * server when it dies.
+ */
+
+#include "kublk.h"
+
+static int ublk_fault_inject_tgt_init(const struct dev_ctx *ctx,
+				      struct ublk_dev *dev)
+{
+	const struct ublksrv_ctrl_dev_info *info = &dev->dev_info;
+	unsigned long dev_size = 250UL << 30;
+
+	dev->tgt.dev_size = dev_size;
+	dev->tgt.params = (struct ublk_params) {
+		.types = UBLK_PARAM_TYPE_BASIC,
+		.basic = {
+			.logical_bs_shift	= 9,
+			.physical_bs_shift	= 12,
+			.io_opt_shift		= 12,
+			.io_min_shift		= 9,
+			.max_sectors		= info->max_io_buf_bytes >> 9,
+			.dev_sectors		= dev_size >> 9,
+		},
+	};
+
+	dev->private_data = (void *)(unsigned long)(ctx->fault_inject.delay_us * 1000);
+	return 0;
+}
+
+static int ublk_fault_inject_queue_io(struct ublk_queue *q, int tag)
+{
+	const struct ublksrv_io_desc *iod = ublk_get_iod(q, tag);
+	struct io_uring_sqe *sqe;
+	struct __kernel_timespec ts = {
+		.tv_nsec = (long long)q->dev->private_data,
+	};
+
+	ublk_queue_alloc_sqes(q, &sqe, 1);
+	io_uring_prep_timeout(sqe, &ts, 1, 0);
+	sqe->user_data = build_user_data(tag, ublksrv_get_op(iod), 0, 1);
+
+	ublk_queued_tgt_io(q, tag, 1);
+
+	return 0;
+}
+
+static void ublk_fault_inject_tgt_io_done(struct ublk_queue *q, int tag,
+					  const struct io_uring_cqe *cqe)
+{
+	const struct ublksrv_io_desc *iod = ublk_get_iod(q, tag);
+
+	if (cqe->res != -ETIME)
+		ublk_err("%s: unexpected cqe res %d\n", __func__, cqe->res);
+
+	if (ublk_completed_tgt_io(q, tag))
+		ublk_complete_io(q, tag, iod->nr_sectors << 9);
+	else
+		ublk_err("%s: io not complete after 1 cqe\n", __func__);
+}
+
+static void ublk_fault_inject_cmd_line(struct dev_ctx *ctx, int argc, char *argv[])
+{
+	static const struct option longopts[] = {
+		{ "delay_us", 	1,	NULL,  0  },
+		{ 0, 0, 0, 0 }
+	};
+	int option_idx, opt;
+
+	ctx->fault_inject.delay_us = 0;
+	while ((opt = getopt_long(argc, argv, "",
+				  longopts, &option_idx)) != -1) {
+		switch (opt) {
+		case 0:
+			if (!strcmp(longopts[option_idx].name, "delay_us"))
+				ctx->fault_inject.delay_us = strtoll(optarg, NULL, 10);
+		}
+	}
+}
+
+static void ublk_fault_inject_usage(const struct ublk_tgt_ops *ops)
+{
+	printf("\tfault_inject: [--delay_us us (default 0)]\n");
+}
+
+const struct ublk_tgt_ops fault_inject_tgt_ops = {
+	.name = "fault_inject",
+	.init_tgt = ublk_fault_inject_tgt_init,
+	.queue_io = ublk_fault_inject_queue_io,
+	.tgt_io_done = ublk_fault_inject_tgt_io_done,
+	.parse_cmd_line = ublk_fault_inject_cmd_line,
+	.usage = ublk_fault_inject_usage,
+};
diff --git a/tools/testing/selftests/ublk/kublk.c b/tools/testing/selftests/ublk/kublk.c
index 0cd6dce3f303..759f06637146 100644
--- a/tools/testing/selftests/ublk/kublk.c
+++ b/tools/testing/selftests/ublk/kublk.c
@@ -12,6 +12,7 @@ static const struct ublk_tgt_ops *tgt_ops_list[] = {
 	&null_tgt_ops,
 	&loop_tgt_ops,
 	&stripe_tgt_ops,
+	&fault_inject_tgt_ops,
 };
 
 static const struct ublk_tgt_ops *ublk_find_tgt(const char *name)
@@ -1234,7 +1235,7 @@ static void __cmd_create_help(char *exe, bool recovery)
 {
 	int i;
 
-	printf("%s %s -t [null|loop|stripe] [-q nr_queues] [-d depth] [-n dev_id]\n",
+	printf("%s %s -t [null|loop|stripe|fault_inject] [-q nr_queues] [-d depth] [-n dev_id]\n",
 			exe, recovery ? "recover" : "add");
 	printf("\t[--foreground] [--quiet] [-z] [--debug_mask mask] [-r 0|1 ] [-g 0|1]\n");
 	printf("\t[-e 0|1 ] [-i 0|1]\n");
diff --git a/tools/testing/selftests/ublk/kublk.h b/tools/testing/selftests/ublk/kublk.h
index 3d2b9f14491c..29571eb296f1 100644
--- a/tools/testing/selftests/ublk/kublk.h
+++ b/tools/testing/selftests/ublk/kublk.h
@@ -68,6 +68,11 @@ struct stripe_ctx {
 	unsigned int    chunk_size;
 };
 
+struct fault_inject_ctx {
+	/* fault_inject */
+	unsigned long   delay_us;
+};
+
 struct dev_ctx {
 	char tgt_type[16];
 	unsigned long flags;
@@ -81,6 +86,9 @@ struct dev_ctx {
 	unsigned int	fg:1;
 	unsigned int	recovery:1;
 
+	/* fault_inject */
+	long long	delay_us;
+
 	int _evtfd;
 	int _shmid;
 
@@ -88,7 +96,8 @@ struct dev_ctx {
 	struct ublk_dev *shadow_dev;
 
 	union {
-		struct stripe_ctx  stripe;
+		struct stripe_ctx 	stripe;
+		struct fault_inject_ctx fault_inject;
 	};
 };
 
@@ -384,6 +393,7 @@ static inline int ublk_queue_use_zc(const struct ublk_queue *q)
 extern const struct ublk_tgt_ops null_tgt_ops;
 extern const struct ublk_tgt_ops loop_tgt_ops;
 extern const struct ublk_tgt_ops stripe_tgt_ops;
+extern const struct ublk_tgt_ops fault_inject_tgt_ops;
 
 void backing_file_tgt_deinit(struct ublk_dev *dev);
 int backing_file_tgt_init(struct ublk_dev *dev);
diff --git a/tools/testing/selftests/ublk/test_generic_06.sh b/tools/testing/selftests/ublk/test_generic_06.sh
new file mode 100755
index 000000000000..b67230c42c84
--- /dev/null
+++ b/tools/testing/selftests/ublk/test_generic_06.sh
@@ -0,0 +1,41 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+. "$(cd "$(dirname "$0")" && pwd)"/test_common.sh
+
+TID="generic_06"
+ERR_CODE=0
+
+_prep_test "fault_inject" "fast cleanup when all I/Os of one hctx are in server"
+
+# configure ublk server to sleep 2s before completing each I/O
+dev_id=$(_add_ublk_dev -t fault_inject -q 2 -d 1 --delay_us 2000000)
+_check_add_dev $TID $?
+
+STARTTIME=${SECONDS}
+
+dd if=/dev/urandom of=/dev/ublkb${dev_id} oflag=direct bs=4k count=1 status=none > /dev/null 2>&1 &
+dd_pid=$!
+
+__ublk_kill_daemon ${dev_id} "DEAD"
+
+wait $dd_pid
+dd_exitcode=$?
+
+ENDTIME=${SECONDS}
+ELAPSED=$(($ENDTIME - $STARTTIME))
+
+# assert that dd sees an error and exits quickly after ublk server is
+# killed. previously this relied on seeing an I/O timeout and so would
+# take ~30s
+if [ $dd_exitcode -eq 0 ]; then
+        echo "dd unexpectedly exited successfully!"
+        ERR_CODE=255
+fi
+if [ $ELAPSED -ge 5 ]; then
+        echo "dd took $ELAPSED seconds to exit (>= 5s tolerance)!"
+        ERR_CODE=255
+fi
+
+_cleanup_test "fault_inject"
+_show_result $TID $ERR_CODE
-- 
2.47.0



* Re: [PATCH 1/9] ublk: don't try to stop disk if ->ub_disk is NULL
  2025-04-14 11:25 ` [PATCH 1/9] ublk: don't try to stop disk if ->ub_disk is NULL Ming Lei
@ 2025-04-14 19:44   ` Uday Shankar
  2025-04-15  1:32     ` Ming Lei
  0 siblings, 1 reply; 32+ messages in thread
From: Uday Shankar @ 2025-04-14 19:44 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block, Caleb Sander Mateos

On Mon, Apr 14, 2025 at 07:25:42PM +0800, Ming Lei wrote:
> In ublk_stop_dev(), if the ublk device state is UBLK_S_DEV_DEAD, we
> return immediately. This is correct, but not enough, because the ublk
> device may transition to another state, such as UBLK_S_DEV_QUIESCED,
> even though it may have been stopped already. Then a kernel panic is
> triggered.

How can this happen? If a device is stopped, it is in the
UBLK_S_DEV_DEAD state. Won't that make us fall out of this check in
ublk_nosrv_work, so we won't transition to UBLK_S_DEV_QUIESCED or other
nosrv states?

	mutex_lock(&ub->mutex);
	if (ub->dev_info.state != UBLK_S_DEV_LIVE)
		goto unlock;



* Re: [PATCH 2/9] ublk: properly serialize all FETCH_REQs
  2025-04-14 11:25 ` [PATCH 2/9] ublk: properly serialize all FETCH_REQs Ming Lei
@ 2025-04-14 19:58   ` Uday Shankar
  2025-04-14 20:39     ` Jens Axboe
  2025-04-16  1:04     ` Uday Shankar
  0 siblings, 2 replies; 32+ messages in thread
From: Uday Shankar @ 2025-04-14 19:58 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block, Caleb Sander Mateos

On Mon, Apr 14, 2025 at 07:25:43PM +0800, Ming Lei wrote:
> From: Uday Shankar <ushankar@purestorage.com>
> 
> Most uring_cmds issued against ublk character devices are serialized
> because each command affects only one queue, and there is an early check
> which only allows a single task (the queue's ubq_daemon) to issue
> uring_cmds against that queue. However, this mechanism does not work for
> FETCH_REQs, since they are expected before ubq_daemon is set. Since
> FETCH_REQs are only used at initialization and not in the fast path,
> serialize them using the per-ublk-device mutex. This fixes a number of
> data races that were previously possible if a badly behaved ublk server
> decided to issue multiple FETCH_REQs against the same qid/tag
> concurrently.
> 
> Reviewed-by: Ming Lei <ming.lei@redhat.com>
> Reported-by: Caleb Sander Mateos <csander@purestorage.com>
> Signed-off-by: Uday Shankar <ushankar@purestorage.com>

Thanks for picking this up. Can you use the following patch instead? It
has two changes compared to [1]:

- Factor FETCH command into its own function
- Return -EAGAIN for non-blocking dispatch because we are taking a
  mutex.

[1] https://lore.kernel.org/linux-block/20250410-ublk_task_per_io-v3-1-b811e8f4554a@purestorage.com/

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 2fd05c1bd30b..8efb7668ab2c 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -1809,8 +1809,8 @@ static void ublk_nosrv_work(struct work_struct *work)
 
 /* device can only be started after all IOs are ready */
 static void ublk_mark_io_ready(struct ublk_device *ub, struct ublk_queue *ubq)
+	__must_hold(&ub->mutex)
 {
-	mutex_lock(&ub->mutex);
 	ubq->nr_io_ready++;
 	if (ublk_queue_ready(ubq)) {
 		ubq->ubq_daemon = current;
@@ -1822,7 +1822,6 @@ static void ublk_mark_io_ready(struct ublk_device *ub, struct ublk_queue *ubq)
 	}
 	if (ub->nr_queues_ready == ub->dev_info.nr_hw_queues)
 		complete_all(&ub->completion);
-	mutex_unlock(&ub->mutex);
 }
 
 static void ublk_handle_need_get_data(struct ublk_device *ub, int q_id,
@@ -1906,6 +1905,52 @@ static int ublk_unregister_io_buf(struct io_uring_cmd *cmd,
 	return io_buffer_unregister_bvec(cmd, index, issue_flags);
 }
 
+static int ublk_fetch(struct io_uring_cmd *cmd, struct ublk_device *ub,
+		      struct ublk_queue *ubq, struct ublk_io *io,
+		      const struct ublksrv_io_cmd *ub_cmd,
+		      unsigned int issue_flags)
+{
+	int ret = 0;
+
+	if (issue_flags & IO_URING_F_NONBLOCK)
+		return -EAGAIN;
+
+	mutex_lock(&ub->mutex);
+	/* UBLK_IO_FETCH_REQ is only allowed before queue is setup */
+	if (ublk_queue_ready(ubq)) {
+		ret = -EBUSY;
+		goto out;
+	}
+
+	/* allow each command to be FETCHed at most once */
+	if (io->flags & UBLK_IO_FLAG_ACTIVE) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	WARN_ON_ONCE(io->flags & UBLK_IO_FLAG_OWNED_BY_SRV);
+
+	if (ublk_need_map_io(ubq)) {
+		/*
+		 * FETCH_RQ has to provide IO buffer if NEED GET
+		 * DATA is not enabled
+		 */
+		if (!ub_cmd->addr && !ublk_need_get_data(ubq))
+			goto out;
+	} else if (ub_cmd->addr) {
+		/* User copy requires addr to be unset */
+		ret = -EINVAL;
+		goto out;
+	}
+
+	ublk_fill_io_cmd(io, cmd, ub_cmd->addr);
+	ublk_mark_io_ready(ub, ubq);
+
+out:
+	mutex_unlock(&ub->mutex);
+	return ret;
+}
+
 static int __ublk_ch_uring_cmd(struct io_uring_cmd *cmd,
 			       unsigned int issue_flags,
 			       const struct ublksrv_io_cmd *ub_cmd)
@@ -1962,34 +2007,7 @@ static int __ublk_ch_uring_cmd(struct io_uring_cmd *cmd,
 	case UBLK_IO_UNREGISTER_IO_BUF:
 		return ublk_unregister_io_buf(cmd, ub_cmd->addr, issue_flags);
 	case UBLK_IO_FETCH_REQ:
-		/* UBLK_IO_FETCH_REQ is only allowed before queue is setup */
-		if (ublk_queue_ready(ubq)) {
-			ret = -EBUSY;
-			goto out;
-		}
-		/*
-		 * The io is being handled by server, so COMMIT_RQ is expected
-		 * instead of FETCH_REQ
-		 */
-		if (io->flags & UBLK_IO_FLAG_OWNED_BY_SRV)
-			goto out;
-
-		if (ublk_need_map_io(ubq)) {
-			/*
-			 * FETCH_RQ has to provide IO buffer if NEED GET
-			 * DATA is not enabled
-			 */
-			if (!ub_cmd->addr && !ublk_need_get_data(ubq))
-				goto out;
-		} else if (ub_cmd->addr) {
-			/* User copy requires addr to be unset */
-			ret = -EINVAL;
-			goto out;
-		}
-
-		ublk_fill_io_cmd(io, cmd, ub_cmd->addr);
-		ublk_mark_io_ready(ub, ubq);
-		break;
+		return ublk_fetch(cmd, ub, ubq, io, ub_cmd, issue_flags);
 	case UBLK_IO_COMMIT_AND_FETCH_REQ:
 		req = blk_mq_tag_to_rq(ub->tag_set.tags[ub_cmd->q_id], tag);
 


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/9] ublk: add ublk_force_abort_dev()
  2025-04-14 11:25 ` [PATCH 3/9] ublk: add ublk_force_abort_dev() Ming Lei
@ 2025-04-14 20:06   ` Uday Shankar
  0 siblings, 0 replies; 32+ messages in thread
From: Uday Shankar @ 2025-04-14 20:06 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block, Caleb Sander Mateos

On Mon, Apr 14, 2025 at 07:25:44PM +0800, Ming Lei wrote:
> Add ublk_force_abort_dev() for handling ublk_nosrv_dev_should_queue_io()
> in ublk_stop_dev(). Then queue quiesce and unquiesce can be paired in a
> single function.
> 
> Meantime, don't change the device state to QUIESCED any more, since the
> disk is going to be removed soon.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>

Reviewed-by: Uday Shankar <ushankar@purestorage.com>

> ---
>  drivers/block/ublk_drv.c | 21 ++++++++-------------
>  1 file changed, 8 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index 79f42ed7339f..7e2c4084c243 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -1743,22 +1743,20 @@ static void __ublk_quiesce_dev(struct ublk_device *ub)
>  	ub->dev_info.state = UBLK_S_DEV_QUIESCED;
>  }
>  
> -static void ublk_unquiesce_dev(struct ublk_device *ub)
> +static void ublk_force_abort_dev(struct ublk_device *ub)
>  {
>  	int i;
>  
> -	pr_devel("%s: unquiesce ub: dev_id %d state %s\n",
> +	pr_devel("%s: force abort ub: dev_id %d state %s\n",
>  			__func__, ub->dev_info.dev_id,
>  			ub->dev_info.state == UBLK_S_DEV_LIVE ?
>  			"LIVE" : "QUIESCED");
> -	/* quiesce_work has run. We let requeued rqs be aborted
> -	 * before running fallback_wq. "force_abort" must be seen
> -	 * after request queue is unqiuesced. Then del_gendisk()
> -	 * can move on.
> -	 */
> +	blk_mq_quiesce_queue(ub->ub_disk->queue);
> +	if (ub->dev_info.state == UBLK_S_DEV_LIVE)
> +		ublk_wait_tagset_rqs_idle(ub);
> +
>  	for (i = 0; i < ub->dev_info.nr_hw_queues; i++)
>  		ublk_get_queue(ub, i)->force_abort = true;
> -
>  	blk_mq_unquiesce_queue(ub->ub_disk->queue);
>  	/* We may have requeued some rqs in ublk_quiesce_queue() */
>  	blk_mq_kick_requeue_list(ub->ub_disk->queue);
> @@ -1786,11 +1784,8 @@ static void ublk_stop_dev(struct ublk_device *ub)
>  	mutex_lock(&ub->mutex);
>  	if (!ub->ub_disk)
>  		goto unlock;
> -	if (ublk_nosrv_dev_should_queue_io(ub)) {
> -		if (ub->dev_info.state == UBLK_S_DEV_LIVE)
> -			__ublk_quiesce_dev(ub);
> -		ublk_unquiesce_dev(ub);
> -	}
> +	if (ublk_nosrv_dev_should_queue_io(ub))
> +		ublk_force_abort_dev(ub);
>  	del_gendisk(ub->ub_disk);
>  	disk = ublk_detach_disk(ub);
>  	put_disk(disk);
> -- 
> 2.47.0
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 4/9] ublk: rely on ->canceling for dealing with ublk_nosrv_dev_should_queue_io
  2025-04-14 11:25 ` [PATCH 4/9] ublk: rely on ->canceling for dealing with ublk_nosrv_dev_should_queue_io Ming Lei
@ 2025-04-14 20:15   ` Uday Shankar
  2025-04-15  1:48     ` Ming Lei
  0 siblings, 1 reply; 32+ messages in thread
From: Uday Shankar @ 2025-04-14 20:15 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block, Caleb Sander Mateos

On Mon, Apr 14, 2025 at 07:25:45PM +0800, Ming Lei wrote:
> Now ublk deals with ublk_nosrv_dev_should_queue_io() by keeping request
> queue as quiesced. This way is fragile because queue quiesce crosses syscalls
> or process contexts.
> 
> Switch to rely on ubq->canceling for dealing with ublk_nosrv_dev_should_queue_io(),
> because it has been used for this purpose during io_uring context exiting, and it
> can be reused before recovering too.
> 
> Meantime we have to move reset of ubq->canceling from ublk_ctrl_end_recovery() to
> ublk_ctrl_end_recovery(), when IO handling can be recovered completely.

First one here should be ublk_ctrl_start_recovery or ublk_queue_reinit

> 
> Then blk_mq_quiesce_queue() and blk_mq_unquiesce_queue() are always used
> in same context.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/block/ublk_drv.c | 31 +++++++++++++++++--------------
>  1 file changed, 17 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index 7e2c4084c243..e0213222e3cf 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -1734,13 +1734,19 @@ static void ublk_wait_tagset_rqs_idle(struct ublk_device *ub)
>  
>  static void __ublk_quiesce_dev(struct ublk_device *ub)
>  {
> +	int i;
> +
>  	pr_devel("%s: quiesce ub: dev_id %d state %s\n",
>  			__func__, ub->dev_info.dev_id,
>  			ub->dev_info.state == UBLK_S_DEV_LIVE ?
>  			"LIVE" : "QUIESCED");
>  	blk_mq_quiesce_queue(ub->ub_disk->queue);
> +	/* mark every queue as canceling */
> +	for (i = 0; i < ub->dev_info.nr_hw_queues; i++)
> +		ublk_get_queue(ub, i)->canceling = true;
>  	ublk_wait_tagset_rqs_idle(ub);
>  	ub->dev_info.state = UBLK_S_DEV_QUIESCED;
> +	blk_mq_unquiesce_queue(ub->ub_disk->queue);

So the queue is not actually quiesced when we are in UBLK_S_DEV_QUIESCED
anymore, and we rely on __ublk_abort_rq to requeue I/O submitted in this
QUIESCED state. This requeued I/O will immediately resubmit, right?
Isn't this a waste of CPU?
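
For concreteness, the loop I'm worried about looks roughly like the sketch
below (a paraphrase, not the exact ublk_queue_rq() code):

	/*
	 * Sketch: with the queue no longer quiesced, a request hitting a
	 * ->canceling queue gets requeued instead of being parked, and
	 * blk-mq may dispatch it again right away until recovery finishes.
	 */
	static blk_status_t queue_rq_sketch(struct ublk_queue *ubq,
					    struct request *rq)
	{
		if (ubq->canceling) {
			if (ublk_nosrv_dev_should_queue_io(ubq->dev))
				blk_mq_requeue_request(rq, false);
			else
				blk_mq_end_request(rq, BLK_STS_IOERR);
			return BLK_STS_OK;
		}
		/* normal path: forward the request to the ublk server */
		return BLK_STS_OK;
	}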

>  }
>  
>  static void ublk_force_abort_dev(struct ublk_device *ub)
> @@ -2950,7 +2956,6 @@ static void ublk_queue_reinit(struct ublk_device *ub, struct ublk_queue *ubq)
>  	/* We have to reset it to NULL, otherwise ub won't accept new FETCH_REQ */
>  	ubq->ubq_daemon = NULL;
>  	ubq->timeout = false;
> -	ubq->canceling = false;
>  
>  	for (i = 0; i < ubq->q_depth; i++) {
>  		struct ublk_io *io = &ubq->ios[i];
> @@ -3037,20 +3042,18 @@ static int ublk_ctrl_end_recovery(struct ublk_device *ub,
>  	pr_devel("%s: new ublksrv_pid %d, dev id %d\n",
>  			__func__, ublksrv_pid, header->dev_id);
>  
> -	if (ublk_nosrv_dev_should_queue_io(ub)) {
> -		ub->dev_info.state = UBLK_S_DEV_LIVE;
> -		blk_mq_unquiesce_queue(ub->ub_disk->queue);
> -		pr_devel("%s: queue unquiesced, dev id %d.\n",
> -				__func__, header->dev_id);
> -		blk_mq_kick_requeue_list(ub->ub_disk->queue);
> -	} else {
> -		blk_mq_quiesce_queue(ub->ub_disk->queue);
> -		ub->dev_info.state = UBLK_S_DEV_LIVE;
> -		for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
> -			ublk_get_queue(ub, i)->fail_io = false;
> -		}
> -		blk_mq_unquiesce_queue(ub->ub_disk->queue);
> +	blk_mq_quiesce_queue(ub->ub_disk->queue);
> +	ub->dev_info.state = UBLK_S_DEV_LIVE;
> +	for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
> +		struct ublk_queue *ubq = ublk_get_queue(ub, i);
> +
> +		ubq->canceling = false;
> +		ubq->fail_io = false;
>  	}
> +	blk_mq_unquiesce_queue(ub->ub_disk->queue);
> +	pr_devel("%s: queue unquiesced, dev id %d.\n",
> +			__func__, header->dev_id);
> +	blk_mq_kick_requeue_list(ub->ub_disk->queue);
>  
>  	ret = 0;
>   out_unlock:
> -- 
> 2.47.0
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/9] ublk: move device reset into ublk_ch_release()
  2025-04-14 11:25 ` [PATCH 5/9] ublk: move device reset into ublk_ch_release() Ming Lei
@ 2025-04-14 20:29   ` Uday Shankar
  2025-04-15  1:50     ` Ming Lei
  0 siblings, 1 reply; 32+ messages in thread
From: Uday Shankar @ 2025-04-14 20:29 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block, Caleb Sander Mateos

On Mon, Apr 14, 2025 at 07:25:46PM +0800, Ming Lei wrote:
> ublk_ch_release() is called after the ublk char device is closed, when all
> uring_cmd are done, so it is perfectly fine to move the ublk device reset
> to ublk_ch_release() from ublk_ctrl_start_recovery().
> 
> This way avoids grabbing the exiting daemon's task_struct for too long.

Nice, I had noticed this leak too, where we keep the task struct ref
until the new daemon comes around. Thanks for the fix!

> 
> However, resetting the following ublk IO flags has to be deferred until
> the ublk io_uring queues are ready:
> 
> - ubq->canceling
> 
> For requeuing IO in case of ublk_nosrv_dev_should_queue_io() before device
> is recovered
> 
> - ubq->fail_io
> 
> For failing IO in case of UBLK_F_USER_RECOVERY_FAIL_IO before device is
> recovered
> 
> - ublk_io->flags
> 
> For preventing use of io->cmd
> 
> With this approach, recovery is simplified a lot.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/block/ublk_drv.c | 121 +++++++++++++++++++++++----------------
>  1 file changed, 72 insertions(+), 49 deletions(-)
> 
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index e0213222e3cf..b68bd4172fa8 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -1074,7 +1074,7 @@ static inline struct ublk_uring_cmd_pdu *ublk_get_uring_cmd_pdu(
>  
>  static inline bool ubq_daemon_is_dying(struct ublk_queue *ubq)
>  {
> -	return ubq->ubq_daemon->flags & PF_EXITING;
> +	return !ubq->ubq_daemon || ubq->ubq_daemon->flags & PF_EXITING;
>  }
>  
>  /* todo: handle partial completion */
> @@ -1470,6 +1470,37 @@ static const struct blk_mq_ops ublk_mq_ops = {
>  	.timeout	= ublk_timeout,
>  };
>  
> +static void ublk_queue_reinit(struct ublk_device *ub, struct ublk_queue *ubq)
> +{
> +	int i;
> +
> +	/* All old ioucmds have to be completed */
> +	ubq->nr_io_ready = 0;
> +
> +	/*
> +	 * old daemon is PF_EXITING, put it now
> +	 *
> +	 * It could be NULL in case of closing one quisced device.
> +	 */
> +	if (ubq->ubq_daemon)
> +		put_task_struct(ubq->ubq_daemon);
> +	/* We have to reset it to NULL, otherwise ub won't accept new FETCH_REQ */
> +	ubq->ubq_daemon = NULL;
> +	ubq->timeout = false;
> +
> +	for (i = 0; i < ubq->q_depth; i++) {
> +		struct ublk_io *io = &ubq->ios[i];
> +
> +		/*
> +		 * UBLK_IO_FLAG_CANCELED is kept for avoiding to touch
> +		 * io->cmd
> +		 */
> +		io->flags &= UBLK_IO_FLAG_CANCELED;
> +		io->cmd = NULL;
> +		io->addr = 0;
> +	}
> +}
> +
>  static int ublk_ch_open(struct inode *inode, struct file *filp)
>  {
>  	struct ublk_device *ub = container_of(inode->i_cdev,
> @@ -1481,10 +1512,26 @@ static int ublk_ch_open(struct inode *inode, struct file *filp)
>  	return 0;
>  }
>  
> +static void ublk_reset_ch_dev(struct ublk_device *ub)
> +{
> +	int i;
> +
> +	for (i = 0; i < ub->dev_info.nr_hw_queues; i++)
> +		ublk_queue_reinit(ub, ublk_get_queue(ub, i));
> +
> +	/* set to NULL, otherwise new ubq_daemon cannot mmap the io_cmd_buf */
> +	ub->mm = NULL;
> +	ub->nr_queues_ready = 0;
> +	ub->nr_privileged_daemon = 0;
> +}
> +
>  static int ublk_ch_release(struct inode *inode, struct file *filp)
>  {
>  	struct ublk_device *ub = filp->private_data;
>  
> +	/* all uring_cmd has been done now, reset device & ubq */
> +	ublk_reset_ch_dev(ub);
> +
>  	clear_bit(UB_STATE_OPEN, &ub->state);
>  	return 0;
>  }
> @@ -1831,6 +1878,24 @@ static void ublk_nosrv_work(struct work_struct *work)
>  	ublk_cancel_dev(ub);
>  }
>  
> +/* reset ublk io_uring queue & io flags */
> +static void ublk_reset_io_flags(struct ublk_device *ub)
> +{
> +	int i, j;
> +
> +	for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
> +		struct ublk_queue *ubq = ublk_get_queue(ub, i);
> +
> +		/* UBLK_IO_FLAG_CANCELED can be cleared now */
> +		spin_lock(&ubq->cancel_lock);

Do we need this? I think at this point there shouldn't be any concurrent
activity we need to protect against.

> +		for (j = 0; j < ubq->q_depth; j++)
> +			ubq->ios[j].flags &= ~UBLK_IO_FLAG_CANCELED;
> +		spin_unlock(&ubq->cancel_lock);
> +		ubq->canceling = false;
> +		ubq->fail_io = false;
> +	}
> +}
> +
>  /* device can only be started after all IOs are ready */
>  static void ublk_mark_io_ready(struct ublk_device *ub, struct ublk_queue *ubq)
>  	__must_hold(&ub->mutex)
> @@ -1844,8 +1909,12 @@ static void ublk_mark_io_ready(struct ublk_device *ub, struct ublk_queue *ubq)
>  		if (capable(CAP_SYS_ADMIN))
>  			ub->nr_privileged_daemon++;
>  	}
> -	if (ub->nr_queues_ready == ub->dev_info.nr_hw_queues)
> +
> +	if (ub->nr_queues_ready == ub->dev_info.nr_hw_queues) {
> +		/* now we are ready for handling ublk io request */
> +		ublk_reset_io_flags(ub);
>  		complete_all(&ub->completion);
> +	}
>  }
>  
>  static void ublk_handle_need_get_data(struct ublk_device *ub, int q_id,
> @@ -2943,41 +3012,14 @@ static int ublk_ctrl_set_params(struct ublk_device *ub,
>  	return ret;
>  }
>  
> -static void ublk_queue_reinit(struct ublk_device *ub, struct ublk_queue *ubq)
> -{
> -	int i;
> -
> -	WARN_ON_ONCE(!(ubq->ubq_daemon && ubq_daemon_is_dying(ubq)));
> -
> -	/* All old ioucmds have to be completed */
> -	ubq->nr_io_ready = 0;
> -	/* old daemon is PF_EXITING, put it now */
> -	put_task_struct(ubq->ubq_daemon);
> -	/* We have to reset it to NULL, otherwise ub won't accept new FETCH_REQ */
> -	ubq->ubq_daemon = NULL;
> -	ubq->timeout = false;
> -
> -	for (i = 0; i < ubq->q_depth; i++) {
> -		struct ublk_io *io = &ubq->ios[i];
> -
> -		/* forget everything now and be ready for new FETCH_REQ */
> -		io->flags = 0;
> -		io->cmd = NULL;
> -		io->addr = 0;
> -	}
> -}
> -
>  static int ublk_ctrl_start_recovery(struct ublk_device *ub,
>  		const struct ublksrv_ctrl_cmd *header)
>  {
>  	int ret = -EINVAL;
> -	int i;
>  
>  	mutex_lock(&ub->mutex);
>  	if (ublk_nosrv_should_stop_dev(ub))
>  		goto out_unlock;
> -	if (!ub->nr_queues_ready)
> -		goto out_unlock;
>  	/*
>  	 * START_RECOVERY is only allowd after:
>  	 *
> @@ -3001,12 +3043,6 @@ static int ublk_ctrl_start_recovery(struct ublk_device *ub,
>  		goto out_unlock;
>  	}
>  	pr_devel("%s: start recovery for dev id %d.\n", __func__, header->dev_id);
> -	for (i = 0; i < ub->dev_info.nr_hw_queues; i++)
> -		ublk_queue_reinit(ub, ublk_get_queue(ub, i));
> -	/* set to NULL, otherwise new ubq_daemon cannot mmap the io_cmd_buf */
> -	ub->mm = NULL;
> -	ub->nr_queues_ready = 0;
> -	ub->nr_privileged_daemon = 0;
>  	init_completion(&ub->completion);
>  	ret = 0;
>   out_unlock:
> @@ -3019,7 +3055,6 @@ static int ublk_ctrl_end_recovery(struct ublk_device *ub,
>  {
>  	int ublksrv_pid = (int)header->data[0];
>  	int ret = -EINVAL;
> -	int i;
>  
>  	pr_devel("%s: Waiting for new ubq_daemons(nr: %d) are ready, dev id %d...\n",
>  			__func__, ub->dev_info.nr_hw_queues, header->dev_id);
> @@ -3039,22 +3074,10 @@ static int ublk_ctrl_end_recovery(struct ublk_device *ub,
>  		goto out_unlock;
>  	}
>  	ub->dev_info.ublksrv_pid = ublksrv_pid;
> +	ub->dev_info.state = UBLK_S_DEV_LIVE;
>  	pr_devel("%s: new ublksrv_pid %d, dev id %d\n",
>  			__func__, ublksrv_pid, header->dev_id);
> -
> -	blk_mq_quiesce_queue(ub->ub_disk->queue);
> -	ub->dev_info.state = UBLK_S_DEV_LIVE;
> -	for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
> -		struct ublk_queue *ubq = ublk_get_queue(ub, i);
> -
> -		ubq->canceling = false;
> -		ubq->fail_io = false;
> -	}
> -	blk_mq_unquiesce_queue(ub->ub_disk->queue);
> -	pr_devel("%s: queue unquiesced, dev id %d.\n",
> -			__func__, header->dev_id);
>  	blk_mq_kick_requeue_list(ub->ub_disk->queue);
> -
>  	ret = 0;
>   out_unlock:
>  	mutex_unlock(&ub->mutex);
> -- 
> 2.47.0
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 6/9] ublk: improve detection and handling of ublk server exit
  2025-04-14 11:25 ` [PATCH 6/9] ublk: improve detection and handling of ublk server exit Ming Lei
@ 2025-04-14 20:36   ` Uday Shankar
  2025-04-15  1:54     ` Ming Lei
  0 siblings, 1 reply; 32+ messages in thread
From: Uday Shankar @ 2025-04-14 20:36 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block, Caleb Sander Mateos

On Mon, Apr 14, 2025 at 07:25:47PM +0800, Ming Lei wrote:
> From: Uday Shankar <ushankar@purestorage.com>
> 
> There are currently two ways in which ublk server exit is detected by
> ublk_drv:
> 
> 1. uring_cmd cancellation. If there are any outstanding uring_cmds which
>    have not been completed to the ublk server when it exits, io_uring
>    calls the uring_cmd callback with a special cancellation flag as the
>    issuing task is exiting.
> 2. I/O timeout. This is needed in addition to the above to handle the
>    "saturated queue" case, when all I/Os for a given queue are in the
>    ublk server, and therefore there are no outstanding uring_cmds to
>    cancel when the ublk server exits.
> 
> There are a couple of issues with this approach:
> 
> - It is complex and inelegant to have two methods to detect the same
>   condition
> - The second method detects ublk server exit only after a long delay
>   (~30s, the default timeout assigned by the block layer). This delays
>   the nosrv behavior from kicking in and potential subsequent recovery
>   of the device.
> 
> The second issue is brought to light with the new test_generic_06 which
> will be added in a following patch. It fails before this fix:
> 
> selftests: ublk: test_generic_06.sh
> dev id is 0
> dd: error writing '/dev/ublkb0': Input/output error
> 1+0 records in
> 0+0 records out
> 0 bytes copied, 30.0611 s, 0.0 kB/s
> DEAD
> dd took 31 seconds to exit (>= 5s tolerance)!
> generic_06 : [FAIL]
> 
> Fix this by instead detecting and handling ublk server exit in the
> character file release callback. This has several advantages:
> 
> - This one place can handle both saturated and unsaturated queues. Thus,
>   it replaces both preexisting methods of detecting ublk server exit.
> - It runs quickly on ublk server exit - there is no 30s delay.
> - It starts the process of removing task references in ublk_drv. This is
>   needed if we want to relax restrictions in the driver like letting
>   only one thread serve each queue
> 
> There is also the disadvantage that the character file release callback
> can also be triggered by intentional close of the file, which is a
> significant behavior change. Preexisting ublk servers (libublksrv) are
> dependent on the ability to open/close the file multiple times. To
> address this, only transition to a nosrv state if the file is released
> while the ublk device is live. This allows for programs to open/close
> the file multiple times during setup. It is still a behavior change if a
> ublk server decides to close/reopen the file while the device is LIVE
> (i.e. while it is responsible for serving I/O), but that would be highly
> unusual. This behavior is in line with what is done by FUSE, which is
> very similar to ublk in that a userspace daemon is providing services
> traditionally provided by the kernel.
> 
> With this change in, the new test (and all other selftests, and all
> ublksrv tests) pass:
> 
> selftests: ublk: test_generic_06.sh
> dev id is 0
> dd: error writing '/dev/ublkb0': Input/output error
> 1+0 records in
> 0+0 records out
> 0 bytes copied, 0.0376731 s, 0.0 kB/s
> DEAD
> generic_04 : [PASS]
> 
> Signed-off-by: Uday Shankar <ushankar@purestorage.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/block/ublk_drv.c | 231 ++++++++++++++++++++++-----------------
>  1 file changed, 131 insertions(+), 100 deletions(-)
> 
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index b68bd4172fa8..e02f205f6da4 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -199,8 +199,6 @@ struct ublk_device {
>  	struct completion	completion;
>  	unsigned int		nr_queues_ready;
>  	unsigned int		nr_privileged_daemon;
> -
> -	struct work_struct	nosrv_work;
>  };
>  
>  /* header of ublk_params */
> @@ -209,8 +207,9 @@ struct ublk_params_header {
>  	__u32	types;
>  };
>  
> -static bool ublk_abort_requests(struct ublk_device *ub, struct ublk_queue *ubq);
> -
> +static void ublk_stop_dev_unlocked(struct ublk_device *ub);
> +static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq);
> +static void __ublk_quiesce_dev(struct ublk_device *ub);
>  static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub,
>  		struct ublk_queue *ubq, int tag, size_t offset);
>  static inline unsigned int ublk_req_build_flags(struct request *req);
> @@ -1336,8 +1335,6 @@ static void ublk_queue_cmd_list(struct ublk_queue *ubq, struct rq_list *l)
>  static enum blk_eh_timer_return ublk_timeout(struct request *rq)
>  {
>  	struct ublk_queue *ubq = rq->mq_hctx->driver_data;
> -	unsigned int nr_inflight = 0;
> -	int i;
>  
>  	if (ubq->flags & UBLK_F_UNPRIVILEGED_DEV) {
>  		if (!ubq->timeout) {
> @@ -1348,26 +1345,6 @@ static enum blk_eh_timer_return ublk_timeout(struct request *rq)
>  		return BLK_EH_DONE;
>  	}
>  
> -	if (!ubq_daemon_is_dying(ubq))
> -		return BLK_EH_RESET_TIMER;
> -
> -	for (i = 0; i < ubq->q_depth; i++) {
> -		struct ublk_io *io = &ubq->ios[i];
> -
> -		if (!(io->flags & UBLK_IO_FLAG_ACTIVE))
> -			nr_inflight++;
> -	}
> -
> -	/* cancelable uring_cmd can't help us if all commands are in-flight */
> -	if (nr_inflight == ubq->q_depth) {
> -		struct ublk_device *ub = ubq->dev;
> -
> -		if (ublk_abort_requests(ub, ubq)) {
> -			schedule_work(&ub->nosrv_work);
> -		}
> -		return BLK_EH_DONE;
> -	}
> -
>  	return BLK_EH_RESET_TIMER;
>  }
>  
> @@ -1525,13 +1502,112 @@ static void ublk_reset_ch_dev(struct ublk_device *ub)
>  	ub->nr_privileged_daemon = 0;
>  }
>  
> +static struct gendisk *ublk_get_disk(struct ublk_device *ub)
> +{
> +	struct gendisk *disk;
> +
> +	spin_lock(&ub->lock);
> +	disk = ub->ub_disk;
> +	if (disk)
> +		get_device(disk_to_dev(disk));
> +	spin_unlock(&ub->lock);
> +
> +	return disk;
> +}
> +
> +static void ublk_put_disk(struct gendisk *disk)
> +{
> +	if (disk)
> +		put_device(disk_to_dev(disk));
> +}
> +
>  static int ublk_ch_release(struct inode *inode, struct file *filp)
>  {
>  	struct ublk_device *ub = filp->private_data;
> +	struct gendisk *disk;
> +	int i;
> +
> +	/*
> +	 * disk isn't attached yet, either device isn't live, or it has
> +	 * been removed already, so we needn't to do anything
> +	 */
> +	disk = ublk_get_disk(ub);
> +	if (!disk)
> +		goto out;
> +
> +	/*
> +	 * All uring_cmd are done now, so abort any request outstanding to
> +	 * the ublk server
> +	 *
> +	 * This can be done in lockless way because ublk server has been
> +	 * gone
> +	 *
> +	 * More importantly, we have to provide forward progress guarantee
> +	 * without holding ub->mutex, otherwise control task grabbing
> +	 * ub->mutex triggers deadlock
> +	 *
> +	 * All requests may be inflight, so ->canceling may not be set, set
> +	 * it now.
> +	 */
> +	for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
> +		struct ublk_queue *ubq = ublk_get_queue(ub, i);
> +
> +		ubq->canceling = true;
> +		ublk_abort_queue(ub, ubq);
> +	}
> +	blk_mq_kick_requeue_list(disk->queue);
> +
> +	/*
> +	 * All infligh requests have been completed or requeued and any new
> +	 * request will be failed or requeued via `->canceling` now, so it is
> +	 * fine to grab ub->mutex now.
> +	 */
> +	mutex_lock(&ub->mutex);
> +
> +	/* double check after grabbing lock */
> +	if (!ub->ub_disk)
> +		goto unlock;
> +
> +	/*
> +	 * Transition the device to the nosrv state. What exactly this
> +	 * means depends on the recovery flags
> +	 */
> +	blk_mq_quiesce_queue(disk->queue);
> +	if (ublk_nosrv_should_stop_dev(ub)) {
> +		/*
> +		 * Allow any pending/future I/O to pass through quickly
> +		 * with an error. This is needed because del_gendisk
> +		 * waits for all pending I/O to complete
> +		 */
> +		for (i = 0; i < ub->dev_info.nr_hw_queues; i++)
> +			ublk_get_queue(ub, i)->force_abort = true;
> +		blk_mq_unquiesce_queue(disk->queue);
> +
> +		ublk_stop_dev_unlocked(ub);
> +	} else {
> +		if (ublk_nosrv_dev_should_queue_io(ub)) {
> +			/*
> +			 * keep request queue as quiesced for queuing new IO
> +			 * during QUIESCED state
> +			 *
> +			 * request queue will be unquiesced in ending
> +			 * recovery, see ublk_ctrl_end_recovery
> +			 */

Does this comment need an update after

[PATCH 4/9] ublk: rely on ->canceling for dealing with ublk_nosrv_dev_should_queue_io

If I read that one right, actually the request queue is not quiesced
anymore during the UBLK_S_DEV_QUIESCED state

> +			__ublk_quiesce_dev(ub);
> +		} else {
> +			ub->dev_info.state = UBLK_S_DEV_FAIL_IO;
> +			for (i = 0; i < ub->dev_info.nr_hw_queues; i++)
> +				ublk_get_queue(ub, i)->fail_io = true;
> +		}
> +		blk_mq_unquiesce_queue(disk->queue);
> +	}
> +unlock:
> +	mutex_unlock(&ub->mutex);
> +	ublk_put_disk(disk);
>  
>  	/* all uring_cmd has been done now, reset device & ubq */
>  	ublk_reset_ch_dev(ub);
> -
> +out:
>  	clear_bit(UB_STATE_OPEN, &ub->state);
>  	return 0;
>  }
> @@ -1627,37 +1703,22 @@ static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq)
>  }
>  
>  /* Must be called when queue is frozen */
> -static bool ublk_mark_queue_canceling(struct ublk_queue *ubq)
> +static void ublk_mark_queue_canceling(struct ublk_queue *ubq)
>  {
> -	bool canceled;
> -
>  	spin_lock(&ubq->cancel_lock);
> -	canceled = ubq->canceling;
> -	if (!canceled)
> +	if (!ubq->canceling)
>  		ubq->canceling = true;
>  	spin_unlock(&ubq->cancel_lock);
> -
> -	return canceled;
>  }
>  
> -static bool ublk_abort_requests(struct ublk_device *ub, struct ublk_queue *ubq)
> +static void ublk_start_cancel(struct ublk_queue *ubq)
>  {
> -	bool was_canceled = ubq->canceling;
> -	struct gendisk *disk;
> -
> -	if (was_canceled)
> -		return false;
> -
> -	spin_lock(&ub->lock);
> -	disk = ub->ub_disk;
> -	if (disk)
> -		get_device(disk_to_dev(disk));
> -	spin_unlock(&ub->lock);
> +	struct ublk_device *ub = ubq->dev;
> +	struct gendisk *disk = ublk_get_disk(ub);
>  
>  	/* Our disk has been dead */
>  	if (!disk)
> -		return false;
> -
> +		return;
>  	/*
>  	 * Now we are serialized with ublk_queue_rq()
>  	 *
> @@ -1666,15 +1727,9 @@ static bool ublk_abort_requests(struct ublk_device *ub, struct ublk_queue *ubq)
>  	 * touch completed uring_cmd
>  	 */
>  	blk_mq_quiesce_queue(disk->queue);
> -	was_canceled = ublk_mark_queue_canceling(ubq);
> -	if (!was_canceled) {
> -		/* abort queue is for making forward progress */
> -		ublk_abort_queue(ub, ubq);
> -	}
> +	ublk_mark_queue_canceling(ubq);
>  	blk_mq_unquiesce_queue(disk->queue);
> -	put_device(disk_to_dev(disk));
> -
> -	return !was_canceled;
> +	ublk_put_disk(disk);
>  }
>  
>  static void ublk_cancel_cmd(struct ublk_queue *ubq, struct ublk_io *io,
> @@ -1698,6 +1753,17 @@ static void ublk_cancel_cmd(struct ublk_queue *ubq, struct ublk_io *io,
>  /*
>   * The ublk char device won't be closed when calling cancel fn, so both
>   * ublk device and queue are guaranteed to be live
> + *
> + * Two-stage cancel:
> + *
> + * - make every active uring_cmd done in ->cancel_fn()
> + *
> + * - aborting inflight ublk IO requests in ublk char device release handler,
> + *   which depends on 1st stage because device can only be closed iff all
> + *   uring_cmd are done
> + *
> + * Do _not_ try to acquire ub->mutex before all inflight requests are
> + * aborted, otherwise deadlock may be caused.
>   */
>  static void ublk_uring_cmd_cancel_fn(struct io_uring_cmd *cmd,
>  		unsigned int issue_flags)
> @@ -1705,8 +1771,6 @@ static void ublk_uring_cmd_cancel_fn(struct io_uring_cmd *cmd,
>  	struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
>  	struct ublk_queue *ubq = pdu->ubq;
>  	struct task_struct *task;
> -	struct ublk_device *ub;
> -	bool need_schedule;
>  	struct ublk_io *io;
>  
>  	if (WARN_ON_ONCE(!ubq))
> @@ -1719,16 +1783,12 @@ static void ublk_uring_cmd_cancel_fn(struct io_uring_cmd *cmd,
>  	if (WARN_ON_ONCE(task && task != ubq->ubq_daemon))
>  		return;
>  
> -	ub = ubq->dev;
> -	need_schedule = ublk_abort_requests(ub, ubq);
> +	if (!ubq->canceling)
> +		ublk_start_cancel(ubq);
>  
>  	io = &ubq->ios[pdu->tag];
>  	WARN_ON_ONCE(io->cmd != cmd);
>  	ublk_cancel_cmd(ubq, io, issue_flags);
> -
> -	if (need_schedule) {
> -		schedule_work(&ub->nosrv_work);
> -	}
>  }
>  
>  static inline bool ublk_queue_ready(struct ublk_queue *ubq)
> @@ -1779,6 +1839,7 @@ static void ublk_wait_tagset_rqs_idle(struct ublk_device *ub)
>  	}
>  }
>  
> +/* Now all inflight requests have been aborted */
>  static void __ublk_quiesce_dev(struct ublk_device *ub)
>  {
>  	int i;
> @@ -1787,13 +1848,11 @@ static void __ublk_quiesce_dev(struct ublk_device *ub)
>  			__func__, ub->dev_info.dev_id,
>  			ub->dev_info.state == UBLK_S_DEV_LIVE ?
>  			"LIVE" : "QUIESCED");
> -	blk_mq_quiesce_queue(ub->ub_disk->queue);
>  	/* mark every queue as canceling */
>  	for (i = 0; i < ub->dev_info.nr_hw_queues; i++)
>  		ublk_get_queue(ub, i)->canceling = true;
>  	ublk_wait_tagset_rqs_idle(ub);
>  	ub->dev_info.state = UBLK_S_DEV_QUIESCED;
> -	blk_mq_unquiesce_queue(ub->ub_disk->queue);
>  }
>  
>  static void ublk_force_abort_dev(struct ublk_device *ub)
> @@ -1830,50 +1889,25 @@ static struct gendisk *ublk_detach_disk(struct ublk_device *ub)
>  	return disk;
>  }
>  
> -static void ublk_stop_dev(struct ublk_device *ub)
> +static void ublk_stop_dev_unlocked(struct ublk_device *ub)
> +	__must_hold(&ub->mutex)
>  {
>  	struct gendisk *disk;
>  
> -	mutex_lock(&ub->mutex);
>  	if (!ub->ub_disk)
> -		goto unlock;
> +		return;
> +
>  	if (ublk_nosrv_dev_should_queue_io(ub))
>  		ublk_force_abort_dev(ub);
>  	del_gendisk(ub->ub_disk);
>  	disk = ublk_detach_disk(ub);
>  	put_disk(disk);
> - unlock:
> -	mutex_unlock(&ub->mutex);
> -	ublk_cancel_dev(ub);
>  }
>  
> -static void ublk_nosrv_work(struct work_struct *work)
> +static void ublk_stop_dev(struct ublk_device *ub)
>  {
> -	struct ublk_device *ub =
> -		container_of(work, struct ublk_device, nosrv_work);
> -	int i;
> -
> -	if (ublk_nosrv_should_stop_dev(ub)) {
> -		ublk_stop_dev(ub);
> -		return;
> -	}
> -
>  	mutex_lock(&ub->mutex);
> -	if (ub->dev_info.state != UBLK_S_DEV_LIVE)
> -		goto unlock;
> -
> -	if (ublk_nosrv_dev_should_queue_io(ub)) {
> -		__ublk_quiesce_dev(ub);
> -	} else {
> -		blk_mq_quiesce_queue(ub->ub_disk->queue);
> -		ub->dev_info.state = UBLK_S_DEV_FAIL_IO;
> -		for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
> -			ublk_get_queue(ub, i)->fail_io = true;
> -		}
> -		blk_mq_unquiesce_queue(ub->ub_disk->queue);
> -	}
> -
> - unlock:
> +	ublk_stop_dev_unlocked(ub);
>  	mutex_unlock(&ub->mutex);
>  	ublk_cancel_dev(ub);
>  }
> @@ -2491,7 +2525,6 @@ static void ublk_remove(struct ublk_device *ub)
>  	bool unprivileged;
>  
>  	ublk_stop_dev(ub);
> -	cancel_work_sync(&ub->nosrv_work);
>  	cdev_device_del(&ub->cdev, &ub->cdev_dev);
>  	unprivileged = ub->dev_info.flags & UBLK_F_UNPRIVILEGED_DEV;
>  	ublk_put_device(ub);
> @@ -2776,7 +2809,6 @@ static int ublk_ctrl_add_dev(const struct ublksrv_ctrl_cmd *header)
>  		goto out_unlock;
>  	mutex_init(&ub->mutex);
>  	spin_lock_init(&ub->lock);
> -	INIT_WORK(&ub->nosrv_work, ublk_nosrv_work);
>  
>  	ret = ublk_alloc_dev_number(ub, header->dev_id);
>  	if (ret < 0)
> @@ -2908,7 +2940,6 @@ static inline void ublk_ctrl_cmd_dump(struct io_uring_cmd *cmd)
>  static int ublk_ctrl_stop_dev(struct ublk_device *ub)
>  {
>  	ublk_stop_dev(ub);
> -	cancel_work_sync(&ub->nosrv_work);
>  	return 0;
>  }
>  
> -- 
> 2.47.0
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 7/9] ublk: remove __ublk_quiesce_dev()
  2025-04-14 11:25 ` [PATCH 7/9] ublk: remove __ublk_quiesce_dev() Ming Lei
@ 2025-04-14 20:37   ` Uday Shankar
  0 siblings, 0 replies; 32+ messages in thread
From: Uday Shankar @ 2025-04-14 20:37 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block, Caleb Sander Mateos

On Mon, Apr 14, 2025 at 07:25:48PM +0800, Ming Lei wrote:
> Remove __ublk_quiesce_dev() and open-code updating the device state to
> QUIESCED.
> 
> We needn't drain inflight requests in __ublk_quiesce_dev() any more,
> because all inflight requests are aborted in the ublk char device release
> handler.
> 
> Also we needn't set ->canceling in __ublk_quiesce_dev() any more because
> it is now done unconditionally in ublk_ch_release().
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>

Reviewed-by: Uday Shankar <ushankar@purestorage.com>

> ---
>  drivers/block/ublk_drv.c | 22 +---------------------
>  1 file changed, 1 insertion(+), 21 deletions(-)
> 
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index e02f205f6da4..f827c2ef00a9 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -209,7 +209,6 @@ struct ublk_params_header {
>  
>  static void ublk_stop_dev_unlocked(struct ublk_device *ub);
>  static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq);
> -static void __ublk_quiesce_dev(struct ublk_device *ub);
>  static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub,
>  		struct ublk_queue *ubq, int tag, size_t offset);
>  static inline unsigned int ublk_req_build_flags(struct request *req);
> @@ -1589,11 +1588,8 @@ static int ublk_ch_release(struct inode *inode, struct file *filp)
>  			/*
>  			 * keep request queue as quiesced for queuing new IO
>  			 * during QUIESCED state
> -			 *
> -			 * request queue will be unquiesced in ending
> -			 * recovery, see ublk_ctrl_end_recovery
>  			 */
> -			__ublk_quiesce_dev(ub);
> +			ub->dev_info.state = UBLK_S_DEV_QUIESCED;
>  		} else {
>  			ub->dev_info.state = UBLK_S_DEV_FAIL_IO;
>  			for (i = 0; i < ub->dev_info.nr_hw_queues; i++)
> @@ -1839,22 +1835,6 @@ static void ublk_wait_tagset_rqs_idle(struct ublk_device *ub)
>  	}
>  }
>  
> -/* Now all inflight requests have been aborted */
> -static void __ublk_quiesce_dev(struct ublk_device *ub)
> -{
> -	int i;
> -
> -	pr_devel("%s: quiesce ub: dev_id %d state %s\n",
> -			__func__, ub->dev_info.dev_id,
> -			ub->dev_info.state == UBLK_S_DEV_LIVE ?
> -			"LIVE" : "QUIESCED");
> -	/* mark every queue as canceling */
> -	for (i = 0; i < ub->dev_info.nr_hw_queues; i++)
> -		ublk_get_queue(ub, i)->canceling = true;
> -	ublk_wait_tagset_rqs_idle(ub);
> -	ub->dev_info.state = UBLK_S_DEV_QUIESCED;
> -}
> -
>  static void ublk_force_abort_dev(struct ublk_device *ub)
>  {
>  	int i;
> -- 
> 2.47.0
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/9] ublk: properly serialize all FETCH_REQs
  2025-04-14 19:58   ` Uday Shankar
@ 2025-04-14 20:39     ` Jens Axboe
  2025-04-14 20:52       ` Uday Shankar
  2025-04-16  1:13       ` Ming Lei
  2025-04-16  1:04     ` Uday Shankar
  1 sibling, 2 replies; 32+ messages in thread
From: Jens Axboe @ 2025-04-14 20:39 UTC (permalink / raw)
  To: Uday Shankar, Ming Lei; +Cc: linux-block, Caleb Sander Mateos

On 4/14/25 1:58 PM, Uday Shankar wrote:
> +static int ublk_fetch(struct io_uring_cmd *cmd, struct ublk_device *ub,
> +		      struct ublk_queue *ubq, struct ublk_io *io,
> +		      const struct ublksrv_io_cmd *ub_cmd,
> +		      unsigned int issue_flags)
> +{
> +	int ret = 0;
> +
> +	if (issue_flags & IO_URING_F_NONBLOCK)
> +		return -EAGAIN;
> +
> +	mutex_lock(&ub->mutex);

This looks like overkill; if we can trylock the mutex, that should surely
be fine? And I would imagine it'd succeed most of the time, hence making the
inline/fastpath fine with F_NONBLOCK?
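
Something along these lines, roughly (untested sketch):

	if (issue_flags & IO_URING_F_NONBLOCK) {
		/* fast path: only punt if the mutex is actually contended */
		if (!mutex_trylock(&ub->mutex))
			return -EAGAIN;
	} else {
		mutex_lock(&ub->mutex);
	}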

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 8/9] ublk: simplify aborting ublk request
  2025-04-14 11:25 ` [PATCH 8/9] ublk: simplify aborting ublk request Ming Lei
@ 2025-04-14 20:42   ` Uday Shankar
  0 siblings, 0 replies; 32+ messages in thread
From: Uday Shankar @ 2025-04-14 20:42 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block, Caleb Sander Mateos

On Mon, Apr 14, 2025 at 07:25:49PM +0800, Ming Lei wrote:
> Now that ublk_abort_queue() is moved to the ublk char device release
> handler, and meantime the request queue is "quiesced" because either
> ->canceling was set from the uring_cmd cancel function or all IOs are
> inflight and can't be completed by the ublk server, things become much
> easier:
> 
> - all uring_cmd are done, so we needn't mark the io as UBLK_IO_FLAG_ABORTED
> for handling completion from uring_cmd
> 
> - the ublk char device is closed, so no one can hold an IO request reference
> any more, and we can simply complete the request or requeue it for
> ublk_nosrv_should_reissue_outstanding.
> 
> Signed-off-by: Ming Lei <ming.lei@redhat.com>

Reviewed-by: Uday Shankar <ushankar@purestorage.com>

> ---
>  drivers/block/ublk_drv.c | 82 ++++++++++------------------------------
>  1 file changed, 20 insertions(+), 62 deletions(-)
> 
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index f827c2ef00a9..37a0cb8011c1 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -122,15 +122,6 @@ struct ublk_uring_cmd_pdu {
>   */
>  #define UBLK_IO_FLAG_OWNED_BY_SRV 0x02
>  
> -/*
> - * IO command is aborted, so this flag is set in case of
> - * !UBLK_IO_FLAG_ACTIVE.
> - *
> - * After this flag is observed, any pending or new incoming request
> - * associated with this io command will be failed immediately
> - */
> -#define UBLK_IO_FLAG_ABORTED 0x04
> -
>  /*
>   * UBLK_IO_FLAG_NEED_GET_DATA is set because IO command requires
>   * get data buffer address from ublksrv.
> @@ -1083,12 +1074,6 @@ static inline void __ublk_complete_rq(struct request *req)
>  	unsigned int unmapped_bytes;
>  	blk_status_t res = BLK_STS_OK;
>  
> -	/* called from ublk_abort_queue() code path */
> -	if (io->flags & UBLK_IO_FLAG_ABORTED) {
> -		res = BLK_STS_IOERR;
> -		goto exit;
> -	}
> -
>  	/* failed read IO if nothing is read */
>  	if (!io->res && req_op(req) == REQ_OP_READ)
>  		io->res = -EIO;
> @@ -1138,47 +1123,6 @@ static void ublk_complete_rq(struct kref *ref)
>  	__ublk_complete_rq(req);
>  }
>  
> -static void ublk_do_fail_rq(struct request *req)
> -{
> -	struct ublk_queue *ubq = req->mq_hctx->driver_data;
> -
> -	if (ublk_nosrv_should_reissue_outstanding(ubq->dev))
> -		blk_mq_requeue_request(req, false);
> -	else
> -		__ublk_complete_rq(req);
> -}
> -
> -static void ublk_fail_rq_fn(struct kref *ref)
> -{
> -	struct ublk_rq_data *data = container_of(ref, struct ublk_rq_data,
> -			ref);
> -	struct request *req = blk_mq_rq_from_pdu(data);
> -
> -	ublk_do_fail_rq(req);
> -}
> -
> -/*
> - * Since ublk_rq_task_work_cb always fails requests immediately during
> - * exiting, __ublk_fail_req() is only called from abort context during
> - * exiting. So lock is unnecessary.
> - *
> - * Also aborting may not be started yet, keep in mind that one failed
> - * request may be issued by block layer again.
> - */
> -static void __ublk_fail_req(struct ublk_queue *ubq, struct ublk_io *io,
> -		struct request *req)
> -{
> -	WARN_ON_ONCE(io->flags & UBLK_IO_FLAG_ACTIVE);
> -
> -	if (ublk_need_req_ref(ubq)) {
> -		struct ublk_rq_data *data = blk_mq_rq_to_pdu(req);
> -
> -		kref_put(&data->ref, ublk_fail_rq_fn);
> -	} else {
> -		ublk_do_fail_rq(req);
> -	}
> -}
> -
>  static void ubq_complete_io_cmd(struct ublk_io *io, int res,
>  				unsigned issue_flags)
>  {
> @@ -1670,10 +1614,26 @@ static void ublk_commit_completion(struct ublk_device *ub,
>  		ublk_put_req_ref(ubq, req);
>  }
>  
> +static void __ublk_fail_req(struct ublk_queue *ubq, struct ublk_io *io,
> +		struct request *req)
> +{
> +	WARN_ON_ONCE(io->flags & UBLK_IO_FLAG_ACTIVE);
> +
> +	if (ublk_nosrv_should_reissue_outstanding(ubq->dev))
> +		blk_mq_requeue_request(req, false);
> +	else {
> +		io->res = -EIO;
> +		__ublk_complete_rq(req);
> +	}
> +}
> +
>  /*
> - * Called from ubq_daemon context via cancel fn, meantime quiesce ublk
> - * blk-mq queue, so we are called exclusively with blk-mq and ubq_daemon
> - * context, so everything is serialized.
> + * Called from ublk char device release handler, when any uring_cmd is
> + * done, meantime request queue is "quiesced" since all inflight requests
> + * can't be completed because ublk server is dead.
> + *
> + * So no one can hold our request IO reference any more, simply ignore the
> + * reference, and complete the request immediately
>   */
>  static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq)
>  {
> @@ -1690,10 +1650,8 @@ static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq)
>  			 * will do it
>  			 */
>  			rq = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], i);
> -			if (rq && blk_mq_request_started(rq)) {
> -				io->flags |= UBLK_IO_FLAG_ABORTED;
> +			if (rq && blk_mq_request_started(rq))
>  				__ublk_fail_req(ubq, io, rq);
> -			}
>  		}
>  	}
>  }
> -- 
> 2.47.0
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 9/9] selftests: ublk: add generic_06 for covering fault inject
  2025-04-14 11:25 ` [PATCH 9/9] selftests: ublk: add generic_06 for covering fault inject Ming Lei
@ 2025-04-14 20:44   ` Uday Shankar
  2025-04-15  1:57     ` Ming Lei
  0 siblings, 1 reply; 32+ messages in thread
From: Uday Shankar @ 2025-04-14 20:44 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block, Caleb Sander Mateos

On Mon, Apr 14, 2025 at 07:25:50PM +0800, Ming Lei wrote:
> From: Uday Shankar <ushankar@purestorage.com>
> 
> Add one simple fault inject target, and verify if it can be exited in
> expected period.

What "it" refers to is unclear. Maybe say "verify if an application
using ublk device sees an I/O error quickly after the ublk server dies?"

> 
> Signed-off-by: Uday Shankar <ushankar@purestorage.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  tools/testing/selftests/ublk/Makefile         |  4 +-
>  tools/testing/selftests/ublk/fault_inject.c   | 98 +++++++++++++++++++
>  tools/testing/selftests/ublk/kublk.c          |  3 +-
>  tools/testing/selftests/ublk/kublk.h          | 12 ++-
>  .../testing/selftests/ublk/test_generic_06.sh | 41 ++++++++
>  5 files changed, 155 insertions(+), 3 deletions(-)
>  create mode 100644 tools/testing/selftests/ublk/fault_inject.c
>  create mode 100755 tools/testing/selftests/ublk/test_generic_06.sh
> 
> diff --git a/tools/testing/selftests/ublk/Makefile b/tools/testing/selftests/ublk/Makefile
> index dddc64036aa1..ec4624a283bc 100644
> --- a/tools/testing/selftests/ublk/Makefile
> +++ b/tools/testing/selftests/ublk/Makefile
> @@ -8,6 +8,7 @@ TEST_PROGS += test_generic_02.sh
>  TEST_PROGS += test_generic_03.sh
>  TEST_PROGS += test_generic_04.sh
>  TEST_PROGS += test_generic_05.sh
> +TEST_PROGS += test_generic_06.sh
>  
>  TEST_PROGS += test_null_01.sh
>  TEST_PROGS += test_null_02.sh
> @@ -31,7 +32,8 @@ TEST_GEN_PROGS_EXTENDED = kublk
>  
>  include ../lib.mk
>  
> -$(TEST_GEN_PROGS_EXTENDED): kublk.c null.c file_backed.c common.c stripe.c
> +$(TEST_GEN_PROGS_EXTENDED): kublk.c null.c file_backed.c common.c stripe.c \
> +	fault_inject.c
>  
>  check:
>  	shellcheck -x -f gcc *.sh
> diff --git a/tools/testing/selftests/ublk/fault_inject.c b/tools/testing/selftests/ublk/fault_inject.c
> new file mode 100644
> index 000000000000..94a8e729ba4c
> --- /dev/null
> +++ b/tools/testing/selftests/ublk/fault_inject.c
> @@ -0,0 +1,98 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/*
> + * Fault injection ublk target. Hack this up however you like for
> + * testing specific behaviors of ublk_drv. Currently is a null target
> + * with a configurable delay before completing each I/O. This delay can
> + * be used to test ublk_drv's handling of I/O outstanding to the ublk
> + * server when it dies.
> + */
> +
> +#include "kublk.h"
> +
> +static int ublk_fault_inject_tgt_init(const struct dev_ctx *ctx,
> +				      struct ublk_dev *dev)
> +{
> +	const struct ublksrv_ctrl_dev_info *info = &dev->dev_info;
> +	unsigned long dev_size = 250UL << 30;
> +
> +	dev->tgt.dev_size = dev_size;
> +	dev->tgt.params = (struct ublk_params) {
> +		.types = UBLK_PARAM_TYPE_BASIC,
> +		.basic = {
> +			.logical_bs_shift	= 9,
> +			.physical_bs_shift	= 12,
> +			.io_opt_shift		= 12,
> +			.io_min_shift		= 9,
> +			.max_sectors		= info->max_io_buf_bytes >> 9,
> +			.dev_sectors		= dev_size >> 9,
> +		},
> +	};
> +
> +	dev->private_data = (void *)(unsigned long)(ctx->fault_inject.delay_us * 1000);
> +	return 0;
> +}
> +
> +static int ublk_fault_inject_queue_io(struct ublk_queue *q, int tag)
> +{
> +	const struct ublksrv_io_desc *iod = ublk_get_iod(q, tag);
> +	struct io_uring_sqe *sqe;
> +	struct __kernel_timespec ts = {
> +		.tv_nsec = (long long)q->dev->private_data,
> +	};
> +
> +	ublk_queue_alloc_sqes(q, &sqe, 1);
> +	io_uring_prep_timeout(sqe, &ts, 1, 0);
> +	sqe->user_data = build_user_data(tag, ublksrv_get_op(iod), 0, 1);
> +
> +	ublk_queued_tgt_io(q, tag, 1);
> +
> +	return 0;
> +}
> +
> +static void ublk_fault_inject_tgt_io_done(struct ublk_queue *q, int tag,
> +					  const struct io_uring_cqe *cqe)
> +{
> +	const struct ublksrv_io_desc *iod = ublk_get_iod(q, tag);
> +
> +	if (cqe->res != -ETIME)
> +		ublk_err("%s: unexpected cqe res %d\n", __func__, cqe->res);
> +
> +	if (ublk_completed_tgt_io(q, tag))
> +		ublk_complete_io(q, tag, iod->nr_sectors << 9);
> +	else
> +		ublk_err("%s: io not complete after 1 cqe\n", __func__);
> +}
> +
> +static void ublk_fault_inject_cmd_line(struct dev_ctx *ctx, int argc, char *argv[])
> +{
> +	static const struct option longopts[] = {
> +		{ "delay_us", 	1,	NULL,  0  },
> +		{ 0, 0, 0, 0 }
> +	};
> +	int option_idx, opt;
> +
> +	ctx->fault_inject.delay_us = 0;
> +	while ((opt = getopt_long(argc, argv, "",
> +				  longopts, &option_idx)) != -1) {
> +		switch (opt) {
> +		case 0:
> +			if (!strcmp(longopts[option_idx].name, "delay_us"))
> +				ctx->fault_inject.delay_us = strtoll(optarg, NULL, 10);
> +		}
> +	}
> +}
> +
> +static void ublk_fault_inject_usage(const struct ublk_tgt_ops *ops)
> +{
> +	printf("\tfault_inject: [--delay_us us (default 0)]\n");
> +}
> +
> +const struct ublk_tgt_ops fault_inject_tgt_ops = {
> +	.name = "fault_inject",
> +	.init_tgt = ublk_fault_inject_tgt_init,
> +	.queue_io = ublk_fault_inject_queue_io,
> +	.tgt_io_done = ublk_fault_inject_tgt_io_done,
> +	.parse_cmd_line = ublk_fault_inject_cmd_line,
> +	.usage = ublk_fault_inject_usage,
> +};
> diff --git a/tools/testing/selftests/ublk/kublk.c b/tools/testing/selftests/ublk/kublk.c
> index 0cd6dce3f303..759f06637146 100644
> --- a/tools/testing/selftests/ublk/kublk.c
> +++ b/tools/testing/selftests/ublk/kublk.c
> @@ -12,6 +12,7 @@ static const struct ublk_tgt_ops *tgt_ops_list[] = {
>  	&null_tgt_ops,
>  	&loop_tgt_ops,
>  	&stripe_tgt_ops,
> +	&fault_inject_tgt_ops,
>  };
>  
>  static const struct ublk_tgt_ops *ublk_find_tgt(const char *name)
> @@ -1234,7 +1235,7 @@ static void __cmd_create_help(char *exe, bool recovery)
>  {
>  	int i;
>  
> -	printf("%s %s -t [null|loop|stripe] [-q nr_queues] [-d depth] [-n dev_id]\n",
> +	printf("%s %s -t [null|loop|stripe|fault_inject] [-q nr_queues] [-d depth] [-n dev_id]\n",
>  			exe, recovery ? "recover" : "add");
>  	printf("\t[--foreground] [--quiet] [-z] [--debug_mask mask] [-r 0|1 ] [-g 0|1]\n");
>  	printf("\t[-e 0|1 ] [-i 0|1]\n");
> diff --git a/tools/testing/selftests/ublk/kublk.h b/tools/testing/selftests/ublk/kublk.h
> index 3d2b9f14491c..29571eb296f1 100644
> --- a/tools/testing/selftests/ublk/kublk.h
> +++ b/tools/testing/selftests/ublk/kublk.h
> @@ -68,6 +68,11 @@ struct stripe_ctx {
>  	unsigned int    chunk_size;
>  };
>  
> +struct fault_inject_ctx {
> +	/* fault_inject */
> +	unsigned long   delay_us;
> +};
> +
>  struct dev_ctx {
>  	char tgt_type[16];
>  	unsigned long flags;
> @@ -81,6 +86,9 @@ struct dev_ctx {
>  	unsigned int	fg:1;
>  	unsigned int	recovery:1;
>  
> +	/* fault_inject */
> +	long long	delay_us;
> +
>  	int _evtfd;
>  	int _shmid;
>  
> @@ -88,7 +96,8 @@ struct dev_ctx {
>  	struct ublk_dev *shadow_dev;
>  
>  	union {
> -		struct stripe_ctx  stripe;
> +		struct stripe_ctx 	stripe;
> +		struct fault_inject_ctx fault_inject;
>  	};
>  };
>  
> @@ -384,6 +393,7 @@ static inline int ublk_queue_use_zc(const struct ublk_queue *q)
>  extern const struct ublk_tgt_ops null_tgt_ops;
>  extern const struct ublk_tgt_ops loop_tgt_ops;
>  extern const struct ublk_tgt_ops stripe_tgt_ops;
> +extern const struct ublk_tgt_ops fault_inject_tgt_ops;
>  
>  void backing_file_tgt_deinit(struct ublk_dev *dev);
>  int backing_file_tgt_init(struct ublk_dev *dev);
> diff --git a/tools/testing/selftests/ublk/test_generic_06.sh b/tools/testing/selftests/ublk/test_generic_06.sh
> new file mode 100755
> index 000000000000..b67230c42c84
> --- /dev/null
> +++ b/tools/testing/selftests/ublk/test_generic_06.sh
> @@ -0,0 +1,41 @@
> +#!/bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +
> +. "$(cd "$(dirname "$0")" && pwd)"/test_common.sh
> +
> +TID="generic_06"
> +ERR_CODE=0
> +
> +_prep_test "fault_inject" "fast cleanup when all I/Os of one hctx are in server"
> +
> +# configure ublk server to sleep 2s before completing each I/O
> +dev_id=$(_add_ublk_dev -t fault_inject -q 2 -d 1 --delay_us 2000000)
> +_check_add_dev $TID $?
> +
> +STARTTIME=${SECONDS}
> +
> +dd if=/dev/urandom of=/dev/ublkb${dev_id} oflag=direct bs=4k count=1 status=none > /dev/null 2>&1 &
> +dd_pid=$!
> +
> +__ublk_kill_daemon ${dev_id} "DEAD"
> +
> +wait $dd_pid
> +dd_exitcode=$?
> +
> +ENDTIME=${SECONDS}
> +ELAPSED=$(($ENDTIME - $STARTTIME))
> +
> +# assert that dd sees an error and exits quickly after ublk server is
> +# killed. previously this relied on seeing an I/O timeout and so would
> +# take ~30s
> +if [ $dd_exitcode -eq 0 ]; then
> +        echo "dd unexpectedly exited successfully!"
> +        ERR_CODE=255
> +fi
> +if [ $ELAPSED -ge 5 ]; then
> +        echo "dd took $ELAPSED seconds to exit (>= 5s tolerance)!"
> +        ERR_CODE=255
> +fi
> +
> +_cleanup_test "fault_inject"
> +_show_result $TID $ERR_CODE
> -- 
> 2.47.0
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/9] ublk: properly serialize all FETCH_REQs
  2025-04-14 20:39     ` Jens Axboe
@ 2025-04-14 20:52       ` Uday Shankar
  2025-04-14 21:00         ` Jens Axboe
  2025-04-15  1:40         ` Ming Lei
  2025-04-16  1:13       ` Ming Lei
  1 sibling, 2 replies; 32+ messages in thread
From: Uday Shankar @ 2025-04-14 20:52 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Ming Lei, linux-block, Caleb Sander Mateos

On Mon, Apr 14, 2025 at 02:39:33PM -0600, Jens Axboe wrote:
> On 4/14/25 1:58 PM, Uday Shankar wrote:
> > +static int ublk_fetch(struct io_uring_cmd *cmd, struct ublk_device *ub,
> > +		      struct ublk_queue *ubq, struct ublk_io *io,
> > +		      const struct ublksrv_io_cmd *ub_cmd,
> > +		      unsigned int issue_flags)
> > +{
> > +	int ret = 0;
> > +
> > +	if (issue_flags & IO_URING_F_NONBLOCK)
> > +		return -EAGAIN;
> > +
> > +	mutex_lock(&ub->mutex);
> 
> This looks like overkill; if we can trylock the mutex, that should surely
> be fine? And I would imagine it'd succeed most of the time, hence making the
> inline/fastpath fine with F_NONBLOCK?

Yeah, makes sense. How about this?

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index cdb1543fa4a9..bf4a88cb1413 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -1832,8 +1832,8 @@ static void ublk_nosrv_work(struct work_struct *work)
 
 /* device can only be started after all IOs are ready */
 static void ublk_mark_io_ready(struct ublk_device *ub, struct ublk_queue *ubq)
+	__must_hold(&ub->mutex)
 {
-	mutex_lock(&ub->mutex);
 	ubq->nr_io_ready++;
 	if (ublk_queue_ready(ubq)) {
 		ubq->ubq_daemon = current;
@@ -1845,7 +1845,6 @@ static void ublk_mark_io_ready(struct ublk_device *ub, struct ublk_queue *ubq)
 	}
 	if (ub->nr_queues_ready == ub->dev_info.nr_hw_queues)
 		complete_all(&ub->completion);
-	mutex_unlock(&ub->mutex);
 }
 
 static void ublk_handle_need_get_data(struct ublk_device *ub, int q_id,
@@ -1929,6 +1928,55 @@ static int ublk_unregister_io_buf(struct io_uring_cmd *cmd,
 	return io_buffer_unregister_bvec(cmd, index, issue_flags);
 }
 
+static int ublk_fetch(struct io_uring_cmd *cmd, struct ublk_device *ub,
+		      struct ublk_queue *ubq, struct ublk_io *io,
+		      const struct ublksrv_io_cmd *ub_cmd,
+		      unsigned int issue_flags)
+{
+	int ret = 0;
+
+	if (!mutex_trylock(&ub->mutex)) {
+		if (issue_flags & IO_URING_F_NONBLOCK)
+			return -EAGAIN;
+		else
+			mutex_lock(&ub->mutex);
+	}
+
+	/* UBLK_IO_FETCH_REQ is only allowed before queue is setup */
+	if (ublk_queue_ready(ubq)) {
+		ret = -EBUSY;
+		goto out;
+	}
+
+	/* allow each command to be FETCHed at most once */
+	if (io->flags & UBLK_IO_FLAG_ACTIVE) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	WARN_ON_ONCE(io->flags & UBLK_IO_FLAG_OWNED_BY_SRV);
+
+	if (ublk_need_map_io(ubq)) {
+		/*
+		 * FETCH_RQ has to provide IO buffer if NEED GET
+		 * DATA is not enabled
+		 */
+		if (!ub_cmd->addr && !ublk_need_get_data(ubq))
+			goto out;
+	} else if (ub_cmd->addr) {
+		/* User copy requires addr to be unset */
+		ret = -EINVAL;
+		goto out;
+	}
+
+	ublk_fill_io_cmd(io, cmd, ub_cmd->addr);
+	ublk_mark_io_ready(ub, ubq);
+
+out:
+	mutex_unlock(&ub->mutex);
+	return ret;
+}
+
 static int __ublk_ch_uring_cmd(struct io_uring_cmd *cmd,
 			       unsigned int issue_flags,
 			       const struct ublksrv_io_cmd *ub_cmd)
@@ -1985,34 +2033,7 @@ static int __ublk_ch_uring_cmd(struct io_uring_cmd *cmd,
 	case UBLK_IO_UNREGISTER_IO_BUF:
 		return ublk_unregister_io_buf(cmd, ub_cmd->addr, issue_flags);
 	case UBLK_IO_FETCH_REQ:
-		/* UBLK_IO_FETCH_REQ is only allowed before queue is setup */
-		if (ublk_queue_ready(ubq)) {
-			ret = -EBUSY;
-			goto out;
-		}
-		/*
-		 * The io is being handled by server, so COMMIT_RQ is expected
-		 * instead of FETCH_REQ
-		 */
-		if (io->flags & UBLK_IO_FLAG_OWNED_BY_SRV)
-			goto out;
-
-		if (ublk_need_map_io(ubq)) {
-			/*
-			 * FETCH_RQ has to provide IO buffer if NEED GET
-			 * DATA is not enabled
-			 */
-			if (!ub_cmd->addr && !ublk_need_get_data(ubq))
-				goto out;
-		} else if (ub_cmd->addr) {
-			/* User copy requires addr to be unset */
-			ret = -EINVAL;
-			goto out;
-		}
-
-		ublk_fill_io_cmd(io, cmd, ub_cmd->addr);
-		ublk_mark_io_ready(ub, ubq);
-		break;
+		return ublk_fetch(cmd, ub, ubq, io, ub_cmd, issue_flags);
 	case UBLK_IO_COMMIT_AND_FETCH_REQ:
 		req = blk_mq_tag_to_rq(ub->tag_set.tags[ub_cmd->q_id], tag);
 


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/9] ublk: properly serialize all FETCH_REQs
  2025-04-14 20:52       ` Uday Shankar
@ 2025-04-14 21:00         ` Jens Axboe
  2025-04-15  1:40         ` Ming Lei
  1 sibling, 0 replies; 32+ messages in thread
From: Jens Axboe @ 2025-04-14 21:00 UTC (permalink / raw)
  To: Uday Shankar; +Cc: Ming Lei, linux-block, Caleb Sander Mateos

On 4/14/25 2:52 PM, Uday Shankar wrote:
> On Mon, Apr 14, 2025 at 02:39:33PM -0600, Jens Axboe wrote:
>> On 4/14/25 1:58 PM, Uday Shankar wrote:
>>> +static int ublk_fetch(struct io_uring_cmd *cmd, struct ublk_device *ub,
>>> +		      struct ublk_queue *ubq, struct ublk_io *io,
>>> +		      const struct ublksrv_io_cmd *ub_cmd,
>>> +		      unsigned int issue_flags)
>>> +{
>>> +	int ret = 0;
>>> +
>>> +	if (issue_flags & IO_URING_F_NONBLOCK)
>>> +		return -EAGAIN;
>>> +
>>> +	mutex_lock(&ub->mutex);
>>
>> This looks like overkill, if we can trylock the mutex that should surely
>> be fine? And I would imagine succeed most of the time, hence making the
>> inline/fastpath fine with F_NONBLOCK?
> 
> Yeah, makes sense. How about this?
> 
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index cdb1543fa4a9..bf4a88cb1413 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -1832,8 +1832,8 @@ static void ublk_nosrv_work(struct work_struct *work)
>  
>  /* device can only be started after all IOs are ready */
>  static void ublk_mark_io_ready(struct ublk_device *ub, struct ublk_queue *ubq)
> +	__must_hold(&ub->mutex)
>  {
> -	mutex_lock(&ub->mutex);
>  	ubq->nr_io_ready++;
>  	if (ublk_queue_ready(ubq)) {
>  		ubq->ubq_daemon = current;
> @@ -1845,7 +1845,6 @@ static void ublk_mark_io_ready(struct ublk_device *ub, struct ublk_queue *ubq)
>  	}
>  	if (ub->nr_queues_ready == ub->dev_info.nr_hw_queues)
>  		complete_all(&ub->completion);
> -	mutex_unlock(&ub->mutex);
>  }
>  
>  static void ublk_handle_need_get_data(struct ublk_device *ub, int q_id,
> @@ -1929,6 +1928,55 @@ static int ublk_unregister_io_buf(struct io_uring_cmd *cmd,
>  	return io_buffer_unregister_bvec(cmd, index, issue_flags);
>  }
>  
> +static int ublk_fetch(struct io_uring_cmd *cmd, struct ublk_device *ub,
> +		      struct ublk_queue *ubq, struct ublk_io *io,
> +		      const struct ublksrv_io_cmd *ub_cmd,
> +		      unsigned int issue_flags)
> +{
> +	int ret = 0;
> +
> +	if (!mutex_trylock(&ub->mutex)) {
> +		if (issue_flags & IO_URING_F_NONBLOCK)
> +			return -EAGAIN;
> +		else
> +			mutex_lock(&ub->mutex);
> +	}

That looks better, though I'd just do:

	if (!mutex_trylock(&ub->mutex)) {
		if (issue_flags & IO_URING_F_NONBLOCK)
			return -EAGAIN;
		mutex_lock(&ub->mutex);
	}

which gets rid of a redundant else and reads simpler to me.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/9] ublk: don't try to stop disk if ->ub_disk is NULL
  2025-04-14 19:44   ` Uday Shankar
@ 2025-04-15  1:32     ` Ming Lei
  0 siblings, 0 replies; 32+ messages in thread
From: Ming Lei @ 2025-04-15  1:32 UTC (permalink / raw)
  To: Uday Shankar; +Cc: Jens Axboe, linux-block, Caleb Sander Mateos

On Mon, Apr 14, 2025 at 01:44:47PM -0600, Uday Shankar wrote:
> On Mon, Apr 14, 2025 at 07:25:42PM +0800, Ming Lei wrote:
> > In ublk_stop_dev(), if ublk device state becomes UBLK_S_DEV_DEAD, we
> > will return immediately. This way is correct, but not enough, because
> > ublk device may transition to other state, such UBLK_S_DEV_QUIECED,
> > when it may have been stopped already. Then kernel panic is triggered.
> 
> How can this happen? If a device is stopped, it is in the
> UBLK_S_DEV_DEAD state. Won't that make us fall out of this check in
> ublk_nosrv_work, so we wont transition to UBLK_S_DEV_QUIESCED or other
> nosrv states?
> 
> 	mutex_lock(&ub->mutex);
> 	if (ub->dev_info.state != UBLK_S_DEV_LIVE)
> 		goto unlock;

You are right.

I just verified that all tests can pass after reverting this patch.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/9] ublk: properly serialize all FETCH_REQs
  2025-04-14 20:52       ` Uday Shankar
  2025-04-14 21:00         ` Jens Axboe
@ 2025-04-15  1:40         ` Ming Lei
  1 sibling, 0 replies; 32+ messages in thread
From: Ming Lei @ 2025-04-15  1:40 UTC (permalink / raw)
  To: Uday Shankar; +Cc: Jens Axboe, linux-block, Caleb Sander Mateos

On Mon, Apr 14, 2025 at 02:52:59PM -0600, Uday Shankar wrote:
> On Mon, Apr 14, 2025 at 02:39:33PM -0600, Jens Axboe wrote:
> > On 4/14/25 1:58 PM, Uday Shankar wrote:
> > > +static int ublk_fetch(struct io_uring_cmd *cmd, struct ublk_device *ub,
> > > +		      struct ublk_queue *ubq, struct ublk_io *io,
> > > +		      const struct ublksrv_io_cmd *ub_cmd,
> > > +		      unsigned int issue_flags)
> > > +{
> > > +	int ret = 0;
> > > +
> > > +	if (issue_flags & IO_URING_F_NONBLOCK)
> > > +		return -EAGAIN;
> > > +
> > > +	mutex_lock(&ub->mutex);
> > 
> > This looks like overkill, if we can trylock the mutex that should surely
> > be fine? And I would imagine succeed most of the time, hence making the
> > inline/fastpath fine with F_NONBLOCK?
> 
> Yeah, makes sense. How about this?
> 
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index cdb1543fa4a9..bf4a88cb1413 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -1832,8 +1832,8 @@ static void ublk_nosrv_work(struct work_struct *work)
>  
>  /* device can only be started after all IOs are ready */
>  static void ublk_mark_io_ready(struct ublk_device *ub, struct ublk_queue *ubq)
> +	__must_hold(&ub->mutex)
>  {
> -	mutex_lock(&ub->mutex);
>  	ubq->nr_io_ready++;
>  	if (ublk_queue_ready(ubq)) {
>  		ubq->ubq_daemon = current;
> @@ -1845,7 +1845,6 @@ static void ublk_mark_io_ready(struct ublk_device *ub, struct ublk_queue *ubq)
>  	}
>  	if (ub->nr_queues_ready == ub->dev_info.nr_hw_queues)
>  		complete_all(&ub->completion);
> -	mutex_unlock(&ub->mutex);
>  }
>  
>  static void ublk_handle_need_get_data(struct ublk_device *ub, int q_id,
> @@ -1929,6 +1928,55 @@ static int ublk_unregister_io_buf(struct io_uring_cmd *cmd,
>  	return io_buffer_unregister_bvec(cmd, index, issue_flags);
>  }
>  
> +static int ublk_fetch(struct io_uring_cmd *cmd, struct ublk_device *ub,
> +		      struct ublk_queue *ubq, struct ublk_io *io,
> +		      const struct ublksrv_io_cmd *ub_cmd,
> +		      unsigned int issue_flags)
> +{
> +	int ret = 0;
> +
> +	if (!mutex_trylock(&ub->mutex)) {
> +		if (issue_flags & IO_URING_F_NONBLOCK)
> +			return -EAGAIN;
> +		else
> +			mutex_lock(&ub->mutex);

Thinking about it further, ub->mutex has become fat enough that here we
could use ub->lock (a spinlock) to serialize the setup instead, and then
the trylock & -EAGAIN handling can be avoided.

It is actually fine to replace the mutex in ublk_mark_io_ready() with a
spinlock.
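
Just to sketch the idea (hypothetical and untested, not a real patch - it
assumes nothing in the fetch path needs to sleep while the lock is held,
and keeps the validation shown in the diff above unchanged):

	static int ublk_fetch(struct io_uring_cmd *cmd, struct ublk_device *ub,
			      struct ublk_queue *ubq, struct ublk_io *io,
			      const struct ublksrv_io_cmd *ub_cmd)
	{
		int ret = 0;

		/* spinlock never sleeps, so no trylock/-EAGAIN dance is needed */
		spin_lock(&ub->lock);
		if (ublk_queue_ready(ubq)) {
			ret = -EBUSY;
			goto out;
		}
		/* ... same ACTIVE / addr validation as in the diff above ... */
		ublk_fill_io_cmd(io, cmd, ub_cmd->addr);
		ublk_mark_io_ready(ub, ubq);	/* called with ub->lock held instead of ub->mutex */
	out:
		spin_unlock(&ub->lock);
		return ret;
	}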



Thanks,
Ming


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 4/9] ublk: rely on ->canceling for dealing with ublk_nosrv_dev_should_queue_io
  2025-04-14 20:15   ` Uday Shankar
@ 2025-04-15  1:48     ` Ming Lei
  0 siblings, 0 replies; 32+ messages in thread
From: Ming Lei @ 2025-04-15  1:48 UTC (permalink / raw)
  To: Uday Shankar; +Cc: Jens Axboe, linux-block, Caleb Sander Mateos

On Mon, Apr 14, 2025 at 02:15:39PM -0600, Uday Shankar wrote:
> On Mon, Apr 14, 2025 at 07:25:45PM +0800, Ming Lei wrote:
> > Now ublk deals with ublk_nosrv_dev_should_queue_io() by keeping request
> > queue as quiesced. This way is fragile because queue quiesce crosses syscalls
> > or process contexts.
> > 
> > Switch to rely on ubq->canceling for dealing with ublk_nosrv_dev_should_queue_io(),
> > because it has been used for this purpose during io_uring context exiting, and it
> > can be reused before recovering too.
> > 
> > Meantime we have to move reset of ubq->canceling from ublk_ctrl_end_recovery() to
> > ublk_ctrl_end_recovery(), when IO handling can be recovered completely.
> 
> First one here should be ublk_ctrl_start_recovery or ublk_queue_reinit

Yeah.

> 
> > 
> > Then blk_mq_quiesce_queue() and blk_mq_unquiesce_queue() are always used
> > in same context.
> > 
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  drivers/block/ublk_drv.c | 31 +++++++++++++++++--------------
> >  1 file changed, 17 insertions(+), 14 deletions(-)
> > 
> > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > index 7e2c4084c243..e0213222e3cf 100644
> > --- a/drivers/block/ublk_drv.c
> > +++ b/drivers/block/ublk_drv.c
> > @@ -1734,13 +1734,19 @@ static void ublk_wait_tagset_rqs_idle(struct ublk_device *ub)
> >  
> >  static void __ublk_quiesce_dev(struct ublk_device *ub)
> >  {
> > +	int i;
> > +
> >  	pr_devel("%s: quiesce ub: dev_id %d state %s\n",
> >  			__func__, ub->dev_info.dev_id,
> >  			ub->dev_info.state == UBLK_S_DEV_LIVE ?
> >  			"LIVE" : "QUIESCED");
> >  	blk_mq_quiesce_queue(ub->ub_disk->queue);
> > +	/* mark every queue as canceling */
> > +	for (i = 0; i < ub->dev_info.nr_hw_queues; i++)
> > +		ublk_get_queue(ub, i)->canceling = true;
> >  	ublk_wait_tagset_rqs_idle(ub);
> >  	ub->dev_info.state = UBLK_S_DEV_QUIESCED;
> > +	blk_mq_unquiesce_queue(ub->ub_disk->queue);
> 
> So the queue is not actually quiesced when we are in UBLK_S_DEV_QUIESCED
> anymore, and we rely on __ublk_abort_rq to requeue I/O submitted in this
> QUIESCED state. This requeued I/O will immediately resubmit, right?
> Isn't this a waste of CPU?

__ublk_abort_rq() just adds the request to the requeue list and doesn't
actually re-dispatch it, so there is no wasted CPU.
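
To illustrate the distinction (a rough sketch of the requeue pattern, not a
quote of the driver code):

	/* abort path: only park the request on the requeue list */
	blk_mq_requeue_request(rq, false);	/* false: don't kick the list */

	/*
	 * Nothing is re-dispatched until blk_mq_kick_requeue_list() is run
	 * later, e.g. once recovery finishes, so the requeued requests do
	 * not spin through the submission path in the meantime.
	 */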

Thanks,
Ming


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/9] ublk: move device reset into ublk_ch_release()
  2025-04-14 20:29   ` Uday Shankar
@ 2025-04-15  1:50     ` Ming Lei
  0 siblings, 0 replies; 32+ messages in thread
From: Ming Lei @ 2025-04-15  1:50 UTC (permalink / raw)
  To: Uday Shankar; +Cc: Jens Axboe, linux-block, Caleb Sander Mateos

On Mon, Apr 14, 2025 at 02:29:48PM -0600, Uday Shankar wrote:
> On Mon, Apr 14, 2025 at 07:25:46PM +0800, Ming Lei wrote:
> > ublk_ch_release() is called after ublk char device is closed, when all
> > uring_cmd are done, so it is perfect fine to move ublk device reset to
> > ublk_ch_release() from ublk_ctrl_start_recovery().
> > 
> > This way can avoid to grab the exiting daemon task_struct too long.
> 
> Nice, I had noticed this leak too, where we keep the task struct ref
> until the new daemon comes around. Thanks for the fix!
> 
> > 
> > However, reset of the following ublk IO flags has to be moved until ublk
> > io_uring queues are ready:
> > 
> > - ubq->canceling
> > 
> > For requeuing IO in case of ublk_nosrv_dev_should_queue_io() before device
> > is recovered
> > 
> > - ubq->fail_io
> > 
> > For failing IO in case of UBLK_F_USER_RECOVERY_FAIL_IO before device is
> > recovered
> > 
> > - ublk_io->flags
> > 
> > For preventing using io->cmd
> > 
> > With this way, recovery is simplified a lot.
> > 
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  drivers/block/ublk_drv.c | 121 +++++++++++++++++++++++----------------
> >  1 file changed, 72 insertions(+), 49 deletions(-)
> > 
> > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > index e0213222e3cf..b68bd4172fa8 100644
> > --- a/drivers/block/ublk_drv.c
> > +++ b/drivers/block/ublk_drv.c
> > @@ -1074,7 +1074,7 @@ static inline struct ublk_uring_cmd_pdu *ublk_get_uring_cmd_pdu(
> >  
> >  static inline bool ubq_daemon_is_dying(struct ublk_queue *ubq)
> >  {
> > -	return ubq->ubq_daemon->flags & PF_EXITING;
> > +	return !ubq->ubq_daemon || ubq->ubq_daemon->flags & PF_EXITING;
> >  }
> >  
> >  /* todo: handle partial completion */
> > @@ -1470,6 +1470,37 @@ static const struct blk_mq_ops ublk_mq_ops = {
> >  	.timeout	= ublk_timeout,
> >  };
> >  
> > +static void ublk_queue_reinit(struct ublk_device *ub, struct ublk_queue *ubq)
> > +{
> > +	int i;
> > +
> > +	/* All old ioucmds have to be completed */
> > +	ubq->nr_io_ready = 0;
> > +
> > +	/*
> > +	 * old daemon is PF_EXITING, put it now
> > +	 *
> > +	 * It could be NULL in case of closing one quisced device.
> > +	 */
> > +	if (ubq->ubq_daemon)
> > +		put_task_struct(ubq->ubq_daemon);
> > +	/* We have to reset it to NULL, otherwise ub won't accept new FETCH_REQ */
> > +	ubq->ubq_daemon = NULL;
> > +	ubq->timeout = false;
> > +
> > +	for (i = 0; i < ubq->q_depth; i++) {
> > +		struct ublk_io *io = &ubq->ios[i];
> > +
> > +		/*
> > +		 * UBLK_IO_FLAG_CANCELED is kept for avoiding to touch
> > +		 * io->cmd
> > +		 */
> > +		io->flags &= UBLK_IO_FLAG_CANCELED;
> > +		io->cmd = NULL;
> > +		io->addr = 0;
> > +	}
> > +}
> > +
> >  static int ublk_ch_open(struct inode *inode, struct file *filp)
> >  {
> >  	struct ublk_device *ub = container_of(inode->i_cdev,
> > @@ -1481,10 +1512,26 @@ static int ublk_ch_open(struct inode *inode, struct file *filp)
> >  	return 0;
> >  }
> >  
> > +static void ublk_reset_ch_dev(struct ublk_device *ub)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < ub->dev_info.nr_hw_queues; i++)
> > +		ublk_queue_reinit(ub, ublk_get_queue(ub, i));
> > +
> > +	/* set to NULL, otherwise new ubq_daemon cannot mmap the io_cmd_buf */
> > +	ub->mm = NULL;
> > +	ub->nr_queues_ready = 0;
> > +	ub->nr_privileged_daemon = 0;
> > +}
> > +
> >  static int ublk_ch_release(struct inode *inode, struct file *filp)
> >  {
> >  	struct ublk_device *ub = filp->private_data;
> >  
> > +	/* all uring_cmd has been done now, reset device & ubq */
> > +	ublk_reset_ch_dev(ub);
> > +
> >  	clear_bit(UB_STATE_OPEN, &ub->state);
> >  	return 0;
> >  }
> > @@ -1831,6 +1878,24 @@ static void ublk_nosrv_work(struct work_struct *work)
> >  	ublk_cancel_dev(ub);
> >  }
> >  
> > +/* reset ublk io_uring queue & io flags */
> > +static void ublk_reset_io_flags(struct ublk_device *ub)
> > +{
> > +	int i, j;
> > +
> > +	for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
> > +		struct ublk_queue *ubq = ublk_get_queue(ub, i);
> > +
> > +		/* UBLK_IO_FLAG_CANCELED can be cleared now */
> > +		spin_lock(&ubq->cancel_lock);
> 
> Do we need this? I think at this point there shouldn't be any concurrent
> activity we need to protect against.

Yeah, the lock isn't strictly necessary, but taking it here does have a documentation benefit.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 6/9] ublk: improve detection and handling of ublk server exit
  2025-04-14 20:36   ` Uday Shankar
@ 2025-04-15  1:54     ` Ming Lei
  0 siblings, 0 replies; 32+ messages in thread
From: Ming Lei @ 2025-04-15  1:54 UTC (permalink / raw)
  To: Uday Shankar; +Cc: Jens Axboe, linux-block, Caleb Sander Mateos

On Mon, Apr 14, 2025 at 02:36:39PM -0600, Uday Shankar wrote:
> On Mon, Apr 14, 2025 at 07:25:47PM +0800, Ming Lei wrote:
> > From: Uday Shankar <ushankar@purestorage.com>
> > 
> > There are currently two ways in which ublk server exit is detected by
> > ublk_drv:
> > 
> > 1. uring_cmd cancellation. If there are any outstanding uring_cmds which
> >    have not been completed to the ublk server when it exits, io_uring
> >    calls the uring_cmd callback with a special cancellation flag as the
> >    issuing task is exiting.
> > 2. I/O timeout. This is needed in addition to the above to handle the
> >    "saturated queue" case, when all I/Os for a given queue are in the
> >    ublk server, and therefore there are no outstanding uring_cmds to
> >    cancel when the ublk server exits.
> > 
> > There are a couple of issues with this approach:
> > 
> > - It is complex and inelegant to have two methods to detect the same
> >   condition
> > - The second method detects ublk server exit only after a long delay
> >   (~30s, the default timeout assigned by the block layer). This delays
> >   the nosrv behavior from kicking in and potential subsequent recovery
> >   of the device.
> > 
> > The second issue is brought to light with the new test_generic_06 which
> > will be added in following patch. It fails before this fix:
> > 
> > selftests: ublk: test_generic_06.sh
> > dev id is 0
> > dd: error writing '/dev/ublkb0': Input/output error
> > 1+0 records in
> > 0+0 records out
> > 0 bytes copied, 30.0611 s, 0.0 kB/s
> > DEAD
> > dd took 31 seconds to exit (>= 5s tolerance)!
> > generic_06 : [FAIL]
> > 
> > Fix this by instead detecting and handling ublk server exit in the
> > character file release callback. This has several advantages:
> > 
> > - This one place can handle both saturated and unsaturated queues. Thus,
> >   it replaces both preexisting methods of detecting ublk server exit.
> > - It runs quickly on ublk server exit - there is no 30s delay.
> > - It starts the process of removing task references in ublk_drv. This is
> >   needed if we want to relax restrictions in the driver like letting
> >   only one thread serve each queue
> > 
> > There is also the disadvantage that the character file release callback
> > can also be triggered by intentional close of the file, which is a
> > significant behavior change. Preexisting ublk servers (libublksrv) are
> > dependent on the ability to open/close the file multiple times. To
> > address this, only transition to a nosrv state if the file is released
> > while the ublk device is live. This allows for programs to open/close
> > the file multiple times during setup. It is still a behavior change if a
> > ublk server decides to close/reopen the file while the device is LIVE
> > (i.e. while it is responsible for serving I/O), but that would be highly
> > unusual. This behavior is in line with what is done by FUSE, which is
> > very similar to ublk in that a userspace daemon is providing services
> > traditionally provided by the kernel.
> > 
> > With this change in, the new test (and all other selftests, and all
> > ublksrv tests) pass:
> > 
> > selftests: ublk: test_generic_06.sh
> > dev id is 0
> > dd: error writing '/dev/ublkb0': Input/output error
> > 1+0 records in
> > 0+0 records out
> > 0 bytes copied, 0.0376731 s, 0.0 kB/s
> > DEAD
> > generic_04 : [PASS]
> > 
> > Signed-off-by: Uday Shankar <ushankar@purestorage.com>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  drivers/block/ublk_drv.c | 231 ++++++++++++++++++++++-----------------
> >  1 file changed, 131 insertions(+), 100 deletions(-)
> > 
> > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > index b68bd4172fa8..e02f205f6da4 100644
> > --- a/drivers/block/ublk_drv.c
> > +++ b/drivers/block/ublk_drv.c
> > @@ -199,8 +199,6 @@ struct ublk_device {
> >  	struct completion	completion;
> >  	unsigned int		nr_queues_ready;
> >  	unsigned int		nr_privileged_daemon;
> > -
> > -	struct work_struct	nosrv_work;
> >  };
> >  
> >  /* header of ublk_params */
> > @@ -209,8 +207,9 @@ struct ublk_params_header {
> >  	__u32	types;
> >  };
> >  
> > -static bool ublk_abort_requests(struct ublk_device *ub, struct ublk_queue *ubq);
> > -
> > +static void ublk_stop_dev_unlocked(struct ublk_device *ub);
> > +static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq);
> > +static void __ublk_quiesce_dev(struct ublk_device *ub);
> >  static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub,
> >  		struct ublk_queue *ubq, int tag, size_t offset);
> >  static inline unsigned int ublk_req_build_flags(struct request *req);
> > @@ -1336,8 +1335,6 @@ static void ublk_queue_cmd_list(struct ublk_queue *ubq, struct rq_list *l)
> >  static enum blk_eh_timer_return ublk_timeout(struct request *rq)
> >  {
> >  	struct ublk_queue *ubq = rq->mq_hctx->driver_data;
> > -	unsigned int nr_inflight = 0;
> > -	int i;
> >  
> >  	if (ubq->flags & UBLK_F_UNPRIVILEGED_DEV) {
> >  		if (!ubq->timeout) {
> > @@ -1348,26 +1345,6 @@ static enum blk_eh_timer_return ublk_timeout(struct request *rq)
> >  		return BLK_EH_DONE;
> >  	}
> >  
> > -	if (!ubq_daemon_is_dying(ubq))
> > -		return BLK_EH_RESET_TIMER;
> > -
> > -	for (i = 0; i < ubq->q_depth; i++) {
> > -		struct ublk_io *io = &ubq->ios[i];
> > -
> > -		if (!(io->flags & UBLK_IO_FLAG_ACTIVE))
> > -			nr_inflight++;
> > -	}
> > -
> > -	/* cancelable uring_cmd can't help us if all commands are in-flight */
> > -	if (nr_inflight == ubq->q_depth) {
> > -		struct ublk_device *ub = ubq->dev;
> > -
> > -		if (ublk_abort_requests(ub, ubq)) {
> > -			schedule_work(&ub->nosrv_work);
> > -		}
> > -		return BLK_EH_DONE;
> > -	}
> > -
> >  	return BLK_EH_RESET_TIMER;
> >  }
> >  
> > @@ -1525,13 +1502,112 @@ static void ublk_reset_ch_dev(struct ublk_device *ub)
> >  	ub->nr_privileged_daemon = 0;
> >  }
> >  
> > +static struct gendisk *ublk_get_disk(struct ublk_device *ub)
> > +{
> > +	struct gendisk *disk;
> > +
> > +	spin_lock(&ub->lock);
> > +	disk = ub->ub_disk;
> > +	if (disk)
> > +		get_device(disk_to_dev(disk));
> > +	spin_unlock(&ub->lock);
> > +
> > +	return disk;
> > +}
> > +
> > +static void ublk_put_disk(struct gendisk *disk)
> > +{
> > +	if (disk)
> > +		put_device(disk_to_dev(disk));
> > +}
> > +
> >  static int ublk_ch_release(struct inode *inode, struct file *filp)
> >  {
> >  	struct ublk_device *ub = filp->private_data;
> > +	struct gendisk *disk;
> > +	int i;
> > +
> > +	/*
> > +	 * disk isn't attached yet, either device isn't live, or it has
> > +	 * been removed already, so we needn't to do anything
> > +	 */
> > +	disk = ublk_get_disk(ub);
> > +	if (!disk)
> > +		goto out;
> > +
> > +	/*
> > +	 * All uring_cmd are done now, so abort any request outstanding to
> > +	 * the ublk server
> > +	 *
> > +	 * This can be done in lockless way because ublk server has been
> > +	 * gone
> > +	 *
> > +	 * More importantly, we have to provide forward progress guarantee
> > +	 * without holding ub->mutex, otherwise control task grabbing
> > +	 * ub->mutex triggers deadlock
> > +	 *
> > +	 * All requests may be inflight, so ->canceling may not be set, set
> > +	 * it now.
> > +	 */
> > +	for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
> > +		struct ublk_queue *ubq = ublk_get_queue(ub, i);
> > +
> > +		ubq->canceling = true;
> > +		ublk_abort_queue(ub, ubq);
> > +	}
> > +	blk_mq_kick_requeue_list(disk->queue);
> > +
> > +	/*
> > +	 * All infligh requests have been completed or requeued and any new
> > +	 * request will be failed or requeued via `->canceling` now, so it is
> > +	 * fine to grab ub->mutex now.
> > +	 */
> > +	mutex_lock(&ub->mutex);
> > +
> > +	/* double check after grabbing lock */
> > +	if (!ub->ub_disk)
> > +		goto unlock;
> > +
> > +	/*
> > +	 * Transition the device to the nosrv state. What exactly this
> > +	 * means depends on the recovery flags
> > +	 */
> > +	blk_mq_quiesce_queue(disk->queue);
> > +	if (ublk_nosrv_should_stop_dev(ub)) {
> > +		/*
> > +		 * Allow any pending/future I/O to pass through quickly
> > +		 * with an error. This is needed because del_gendisk
> > +		 * waits for all pending I/O to complete
> > +		 */
> > +		for (i = 0; i < ub->dev_info.nr_hw_queues; i++)
> > +			ublk_get_queue(ub, i)->force_abort = true;
> > +		blk_mq_unquiesce_queue(disk->queue);
> > +
> > +		ublk_stop_dev_unlocked(ub);
> > +	} else {
> > +		if (ublk_nosrv_dev_should_queue_io(ub)) {
> > +			/*
> > +			 * keep request queue as quiesced for queuing new IO
> > +			 * during QUIESCED state
> > +			 *
> > +			 * request queue will be unquiesced in ending
> > +			 * recovery, see ublk_ctrl_end_recovery
> > +			 */
> 
> Does this comment need an update after
> 
> [PATCH 4/9] ublk: rely on ->canceling for dealing with ublk_nosrv_dev_should_queue_io
> 
> If I read that one right, actually the request queue is not quiesced
> anymore during the UBLK_S_DEV_QUIESCED state

The comment is actually removed in the following patch, but it shouldn't
have been added here in the first place; I will fix that in V2.


Thanks, 
Ming


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 9/9] selftests: ublk: add generic_06 for covering fault inject
  2025-04-14 20:44   ` Uday Shankar
@ 2025-04-15  1:57     ` Ming Lei
  0 siblings, 0 replies; 32+ messages in thread
From: Ming Lei @ 2025-04-15  1:57 UTC (permalink / raw)
  To: Uday Shankar; +Cc: Jens Axboe, linux-block, Caleb Sander Mateos

On Mon, Apr 14, 2025 at 02:44:37PM -0600, Uday Shankar wrote:
> On Mon, Apr 14, 2025 at 07:25:50PM +0800, Ming Lei wrote:
> > From: Uday Shankar <ushankar@purestorage.com>
> > 
> > Add one simple fault inject target, and verify if it can be exited in
> > expected period.
> 
> What "it" refers to is unclear. Maybe say "verify if an application
> using ublk device sees an I/O error quickly after the ublk server dies?"

Looks good, I will change the commit log to the above wording.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/9] ublk: properly serialize all FETCH_REQs
  2025-04-14 19:58   ` Uday Shankar
  2025-04-14 20:39     ` Jens Axboe
@ 2025-04-16  1:04     ` Uday Shankar
  1 sibling, 0 replies; 32+ messages in thread
From: Uday Shankar @ 2025-04-16  1:04 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block, Caleb Sander Mateos

On Mon, Apr 14, 2025 at 01:58:47PM -0600, Uday Shankar wrote:
> On Mon, Apr 14, 2025 at 07:25:43PM +0800, Ming Lei wrote:
> > From: Uday Shankar <ushankar@purestorage.com>
> > 
> > Most uring_cmds issued against ublk character devices are serialized
> > because each command affects only one queue, and there is an early check
> > which only allows a single task (the queue's ubq_daemon) to issue
> > uring_cmds against that queue. However, this mechanism does not work for
> > FETCH_REQs, since they are expected before ubq_daemon is set. Since
> > FETCH_REQs are only used at initialization and not in the fast path,
> > serialize them using the per-ublk-device mutex. This fixes a number of
> > data races that were previously possible if a badly behaved ublk server
> > decided to issue multiple FETCH_REQs against the same qid/tag
> > concurrently.
> > 
> > Reviewed-by: Ming Lei <ming.lei@redhat.com>
> > Reported-by: Caleb Sander Mateos <csander@purestorage.com>
> > Signed-off-by: Uday Shankar <ushankar@purestorage.com>
> 
> Thanks for picking this up. Can you use the following patch instead? It
> has two changes compared to [1]:
> 
> - Factor FETCH command into its own function
> - Return -EAGAIN for non-blocking dispatch because we are taking a
>   mutex.
> 
> [1] https://lore.kernel.org/linux-block/20250410-ublk_task_per_io-v3-1-b811e8f4554a@purestorage.com/
> 
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index 2fd05c1bd30b..8efb7668ab2c 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -1809,8 +1809,8 @@ static void ublk_nosrv_work(struct work_struct *work)
>  
>  /* device can only be started after all IOs are ready */
>  static void ublk_mark_io_ready(struct ublk_device *ub, struct ublk_queue *ubq)
> +	__must_hold(&ub->mutex)
>  {
> -	mutex_lock(&ub->mutex);
>  	ubq->nr_io_ready++;
>  	if (ublk_queue_ready(ubq)) {
>  		ubq->ubq_daemon = current;
> @@ -1822,7 +1822,6 @@ static void ublk_mark_io_ready(struct ublk_device *ub, struct ublk_queue *ubq)
>  	}
>  	if (ub->nr_queues_ready == ub->dev_info.nr_hw_queues)
>  		complete_all(&ub->completion);
> -	mutex_unlock(&ub->mutex);
>  }
>  
>  static void ublk_handle_need_get_data(struct ublk_device *ub, int q_id,
> @@ -1906,6 +1905,52 @@ static int ublk_unregister_io_buf(struct io_uring_cmd *cmd,
>  	return io_buffer_unregister_bvec(cmd, index, issue_flags);
>  }
>  
> +static int ublk_fetch(struct io_uring_cmd *cmd, struct ublk_device *ub,
> +		      struct ublk_queue *ubq, struct ublk_io *io,
> +		      const struct ublksrv_io_cmd *ub_cmd,
> +		      unsigned int issue_flags)
> +{
> +	int ret = 0;
> +
> +	if (issue_flags & IO_URING_F_NONBLOCK)
> +		return -EAGAIN;
> +
> +	mutex_lock(&ub->mutex);
> +	/* UBLK_IO_FETCH_REQ is only allowed before queue is setup */
> +	if (ublk_queue_ready(ubq)) {
> +		ret = -EBUSY;
> +		goto out;
> +	}
> +
> +	/* allow each command to be FETCHed at most once */
> +	if (io->flags & UBLK_IO_FLAG_ACTIVE) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	WARN_ON_ONCE(io->flags & UBLK_IO_FLAG_OWNED_BY_SRV);
> +
> +	if (ublk_need_map_io(ubq)) {
> +		/*
> +		 * FETCH_RQ has to provide IO buffer if NEED GET
> +		 * DATA is not enabled
> +		 */
> +		if (!ub_cmd->addr && !ublk_need_get_data(ubq))
> +			goto out;
> +	} else if (ub_cmd->addr) {
> +		/* User copy requires addr to be unset */
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	ublk_fill_io_cmd(io, cmd, ub_cmd->addr);
> +	ublk_mark_io_ready(ub, ubq);
> +
> +out:
> +	mutex_unlock(&ub->mutex);
> +	return ret;
> +}
> +
>  static int __ublk_ch_uring_cmd(struct io_uring_cmd *cmd,
>  			       unsigned int issue_flags,
>  			       const struct ublksrv_io_cmd *ub_cmd)
> @@ -1962,34 +2007,7 @@ static int __ublk_ch_uring_cmd(struct io_uring_cmd *cmd,
>  	case UBLK_IO_UNREGISTER_IO_BUF:
>  		return ublk_unregister_io_buf(cmd, ub_cmd->addr, issue_flags);
>  	case UBLK_IO_FETCH_REQ:
> -		/* UBLK_IO_FETCH_REQ is only allowed before queue is setup */
> -		if (ublk_queue_ready(ubq)) {
> -			ret = -EBUSY;
> -			goto out;
> -		}
> -		/*
> -		 * The io is being handled by server, so COMMIT_RQ is expected
> -		 * instead of FETCH_REQ
> -		 */
> -		if (io->flags & UBLK_IO_FLAG_OWNED_BY_SRV)
> -			goto out;
> -
> -		if (ublk_need_map_io(ubq)) {
> -			/*
> -			 * FETCH_RQ has to provide IO buffer if NEED GET
> -			 * DATA is not enabled
> -			 */
> -			if (!ub_cmd->addr && !ublk_need_get_data(ubq))
> -				goto out;
> -		} else if (ub_cmd->addr) {
> -			/* User copy requires addr to be unset */
> -			ret = -EINVAL;
> -			goto out;
> -		}
> -
> -		ublk_fill_io_cmd(io, cmd, ub_cmd->addr);
> -		ublk_mark_io_ready(ub, ubq);
> -		break;
> +		return ublk_fetch(cmd, ub, ubq, io, ub_cmd, issue_flags);

One more bug here, this skips the

ublk_prep_cancel(cmd, issue_flags, ubq, tag);
return -EIOCBQUEUED;

that is after the switch statement.
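
One way to keep that would be something like this (just a sketch of the
shape, untested):

	case UBLK_IO_FETCH_REQ:
		ret = ublk_fetch(cmd, ub, ubq, io, ub_cmd, issue_flags);
		if (ret)
			goto out;
		break;

so the FETCH case falls through to the common ublk_prep_cancel() +
return -EIOCBQUEUED after the switch instead of returning early.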

>  	case UBLK_IO_COMMIT_AND_FETCH_REQ:
>  		req = blk_mq_tag_to_rq(ub->tag_set.tags[ub_cmd->q_id], tag);
>  
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/9] ublk: properly serialize all FETCH_REQs
  2025-04-14 20:39     ` Jens Axboe
  2025-04-14 20:52       ` Uday Shankar
@ 2025-04-16  1:13       ` Ming Lei
  2025-04-16  1:17         ` Jens Axboe
  1 sibling, 1 reply; 32+ messages in thread
From: Ming Lei @ 2025-04-16  1:13 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Uday Shankar, linux-block, Caleb Sander Mateos

On Mon, Apr 14, 2025 at 02:39:33PM -0600, Jens Axboe wrote:
> On 4/14/25 1:58 PM, Uday Shankar wrote:
> > +static int ublk_fetch(struct io_uring_cmd *cmd, struct ublk_device *ub,
> > +		      struct ublk_queue *ubq, struct ublk_io *io,
> > +		      const struct ublksrv_io_cmd *ub_cmd,
> > +		      unsigned int issue_flags)
> > +{
> > +	int ret = 0;
> > +
> > +	if (issue_flags & IO_URING_F_NONBLOCK)
> > +		return -EAGAIN;
> > +
> > +	mutex_lock(&ub->mutex);
> 
> This looks like overkill, if we can trylock the mutex that should surely
> be fine? And I would imagine succeed most of the time, hence making the
> inline/fastpath fine with F_NONBLOCK?

The mutex is the innermost lock and it won't block when handling the FETCH
command, which is only issued during the queue setup stage, so I think the
trylock isn't necessary and only adds complexity.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/9] ublk: properly serialize all FETCH_REQs
  2025-04-16  1:13       ` Ming Lei
@ 2025-04-16  1:17         ` Jens Axboe
  2025-04-16  2:04           ` Ming Lei
  0 siblings, 1 reply; 32+ messages in thread
From: Jens Axboe @ 2025-04-16  1:17 UTC (permalink / raw)
  To: Ming Lei; +Cc: Uday Shankar, linux-block, Caleb Sander Mateos

On 4/15/25 7:13 PM, Ming Lei wrote:
> On Mon, Apr 14, 2025 at 02:39:33PM -0600, Jens Axboe wrote:
>> On 4/14/25 1:58 PM, Uday Shankar wrote:
>>> +static int ublk_fetch(struct io_uring_cmd *cmd, struct ublk_device *ub,
>>> +		      struct ublk_queue *ubq, struct ublk_io *io,
>>> +		      const struct ublksrv_io_cmd *ub_cmd,
>>> +		      unsigned int issue_flags)
>>> +{
>>> +	int ret = 0;
>>> +
>>> +	if (issue_flags & IO_URING_F_NONBLOCK)
>>> +		return -EAGAIN;
>>> +
>>> +	mutex_lock(&ub->mutex);
>>
>> This looks like overkill, if we can trylock the mutex that should surely
>> be fine? And I would imagine succeed most of the time, hence making the
>> inline/fastpath fine with F_NONBLOCK?
> 
> The mutex is the innermost lock and it won't block when handling the FETCH
> command, which is only issued during the queue setup stage, so I think the
> trylock isn't necessary and only adds complexity.

Then the NONBLOCK check can go away, and a comment added instead on why
it's fine. Or maybe even a WARN_ON_ONCE() if trylock fails or something.
Otherwise it's going to look like a code bug.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/9] ublk: properly serialize all FETCH_REQs
  2025-04-16  1:17         ` Jens Axboe
@ 2025-04-16  2:04           ` Ming Lei
  0 siblings, 0 replies; 32+ messages in thread
From: Ming Lei @ 2025-04-16  2:04 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Uday Shankar, linux-block, Caleb Sander Mateos

On Tue, Apr 15, 2025 at 07:17:09PM -0600, Jens Axboe wrote:
> On 4/15/25 7:13 PM, Ming Lei wrote:
> > On Mon, Apr 14, 2025 at 02:39:33PM -0600, Jens Axboe wrote:
> >> On 4/14/25 1:58 PM, Uday Shankar wrote:
> >>> +static int ublk_fetch(struct io_uring_cmd *cmd, struct ublk_device *ub,
> >>> +		      struct ublk_queue *ubq, struct ublk_io *io,
> >>> +		      const struct ublksrv_io_cmd *ub_cmd,
> >>> +		      unsigned int issue_flags)
> >>> +{
> >>> +	int ret = 0;
> >>> +
> >>> +	if (issue_flags & IO_URING_F_NONBLOCK)
> >>> +		return -EAGAIN;
> >>> +
> >>> +	mutex_lock(&ub->mutex);
> >>
> >> This looks like overkill, if we can trylock the mutex that should surely
> >> be fine? And I would imagine succeed most of the time, hence making the
> >> inline/fastpath fine with F_NONBLOCK?
> > 
> > The mutex is the innermost lock and it won't block when handling the FETCH
> > command, which is only issued during the queue setup stage, so I think the
> > trylock isn't necessary and only adds complexity.
> 
> Then the NONBLOCK check can go away, and a comment added instead on why
> it's fine. Or maybe even a WARN_ON_ONCE() if trylock fails or something.
> Otherwise it's going to look like a code bug.

Yes, the NONBLOCK check isn't needed.

ublk uring_cmd is always handled with !(issue_flags & IO_URING_F_UNLOCKED);
please see ublk_ch_uring_cmd() and ublk_ch_uring_cmd_local().
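
So for V2 the top of ublk_fetch() could just look something like this (a
sketch only, with the rationale captured in a comment):

	/*
	 * ublk uring_cmd is never dispatched with IO_URING_F_UNLOCKED (see
	 * ublk_ch_uring_cmd() / ublk_ch_uring_cmd_local()), and ub->mutex is
	 * only taken briefly during queue setup, so a plain blocking
	 * mutex_lock() is fine here and no NONBLOCK/-EAGAIN handling is
	 * needed.
	 */
	mutex_lock(&ub->mutex);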


thanks,
Ming


^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread

Thread overview: 32+ messages
2025-04-14 11:25 [PATCH 0/9] ublk: simplify & improve IO canceling Ming Lei
2025-04-14 11:25 ` [PATCH 1/9] ublk: don't try to stop disk if ->ub_disk is NULL Ming Lei
2025-04-14 19:44   ` Uday Shankar
2025-04-15  1:32     ` Ming Lei
2025-04-14 11:25 ` [PATCH 2/9] ublk: properly serialize all FETCH_REQs Ming Lei
2025-04-14 19:58   ` Uday Shankar
2025-04-14 20:39     ` Jens Axboe
2025-04-14 20:52       ` Uday Shankar
2025-04-14 21:00         ` Jens Axboe
2025-04-15  1:40         ` Ming Lei
2025-04-16  1:13       ` Ming Lei
2025-04-16  1:17         ` Jens Axboe
2025-04-16  2:04           ` Ming Lei
2025-04-16  1:04     ` Uday Shankar
2025-04-14 11:25 ` [PATCH 3/9] ublk: add ublk_force_abort_dev() Ming Lei
2025-04-14 20:06   ` Uday Shankar
2025-04-14 11:25 ` [PATCH 4/9] ublk: rely on ->canceling for dealing with ublk_nosrv_dev_should_queue_io Ming Lei
2025-04-14 20:15   ` Uday Shankar
2025-04-15  1:48     ` Ming Lei
2025-04-14 11:25 ` [PATCH 5/9] ublk: move device reset into ublk_ch_release() Ming Lei
2025-04-14 20:29   ` Uday Shankar
2025-04-15  1:50     ` Ming Lei
2025-04-14 11:25 ` [PATCH 6/9] ublk: improve detection and handling of ublk server exit Ming Lei
2025-04-14 20:36   ` Uday Shankar
2025-04-15  1:54     ` Ming Lei
2025-04-14 11:25 ` [PATCH 7/9] ublk: remove __ublk_quiesce_dev() Ming Lei
2025-04-14 20:37   ` Uday Shankar
2025-04-14 11:25 ` [PATCH 8/9] ublk: simplify aborting ublk request Ming Lei
2025-04-14 20:42   ` Uday Shankar
2025-04-14 11:25 ` [PATCH 9/9] selftests: ublk: add generic_06 for covering fault inject Ming Lei
2025-04-14 20:44   ` Uday Shankar
2025-04-15  1:57     ` Ming Lei
