public inbox for linux-block@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH] ublk: fix deadlock when reading partition table
@ 2025-12-11  8:38 Ming Lei
  2025-12-11 20:30 ` Caleb Sander Mateos
  0 siblings, 1 reply; 5+ messages in thread
From: Ming Lei @ 2025-12-11  8:38 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: Caleb Sander Mateos, Uday Shankar, Ming Lei

When one process(such as udev) opens ublk block device (e.g., to read
the partition table via bdev_open()), a deadlock[1] can occur:

1. bdev_open() grabs disk->open_mutex
2. The process issues read I/O to ublk backend to read partition table
3. In __ublk_complete_rq(), blk_update_request() or blk_mq_end_request()
   runs bio->bi_end_io() callbacks
4. If this triggers fput() on file descriptor of ublk block device, the
   work may be deferred to current task's task work (see fput() implementation)
5. This eventually calls blkdev_release() from the same context
6. blkdev_release() tries to grab disk->open_mutex again
7. Deadlock: same task waiting for a mutex it already holds

The fix is to run blk_update_request() and blk_mq_end_request() with bottom
halves disabled. This forces blkdev_release() to run in kernel work-queue
context instead of current task work context, and allows ublk server to make
forward progress, and avoids the deadlock.

Fixes: 71f28f3136af ("ublk_drv: add io_uring based userspace block driver")
Link: https://github.com/ublk-org/ublksrv/issues/170 [1]
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/block/ublk_drv.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 2c715df63f23..f69da449727f 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -1086,6 +1086,7 @@ static inline void __ublk_complete_rq(struct request *req, struct ublk_io *io,
 {
 	unsigned int unmapped_bytes;
 	blk_status_t res = BLK_STS_OK;
+	bool requeue;
 
 	/* failed read IO if nothing is read */
 	if (!io->res && req_op(req) == REQ_OP_READ)
@@ -1117,14 +1118,28 @@ static inline void __ublk_complete_rq(struct request *req, struct ublk_io *io,
 	if (unlikely(unmapped_bytes < io->res))
 		io->res = unmapped_bytes;
 
-	if (blk_update_request(req, BLK_STS_OK, io->res))
+	/*
+	 * Run bio->bi_end_io() from softirq context for preventing this
+	 * ublk's blkdev_release() from being called on current's task
+	 * work, see fput() implementation.
+	 *
+	 * Otherwise, ublk server may not provide forward progress in
+	 * case of reading partition table from bdev_open() with
+	 * disk->open_mutex grabbed, and causes dead lock.
+	 */
+	local_bh_disable();
+	requeue = blk_update_request(req, BLK_STS_OK, io->res);
+	local_bh_enable();
+	if (requeue)
 		blk_mq_requeue_request(req, true);
 	else if (likely(!blk_should_fake_timeout(req->q)))
 		__blk_mq_end_request(req, BLK_STS_OK);
 
 	return;
 exit:
+	local_bh_disable();
 	blk_mq_end_request(req, res);
+	local_bh_enable();
 }
 
 static struct io_uring_cmd *__ublk_prep_compl_io_cmd(struct ublk_io *io,
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] ublk: fix deadlock when reading partition table
  2025-12-11  8:38 [PATCH] ublk: fix deadlock when reading partition table Ming Lei
@ 2025-12-11 20:30 ` Caleb Sander Mateos
  2025-12-11 22:26   ` Ming Lei
  0 siblings, 1 reply; 5+ messages in thread
From: Caleb Sander Mateos @ 2025-12-11 20:30 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block, Uday Shankar

On Thu, Dec 11, 2025 at 12:38 AM Ming Lei <ming.lei@redhat.com> wrote:
>
> When one process(such as udev) opens ublk block device (e.g., to read
> the partition table via bdev_open()), a deadlock[1] can occur:
>
> 1. bdev_open() grabs disk->open_mutex
> 2. The process issues read I/O to ublk backend to read partition table

I'm not sure I understand how a process could be issuing read I/O to
the block device before bdev_open() has returned? Or do you mean that
bdev_open() is issuing read I/O for the partition table via
blkdev_get_whole() -> bdev_disk_changed() -> blk_add_partitions() ->
check_partition()?

> 3. In __ublk_complete_rq(), blk_update_request() or blk_mq_end_request()
>    runs bio->bi_end_io() callbacks
> 4. If this triggers fput() on file descriptor of ublk block device, the
>    work may be deferred to current task's task work (see fput() implementation)

What is the bi_end_io implementation that results in an fput() call?

> 5. This eventually calls blkdev_release() from the same context
> 6. blkdev_release() tries to grab disk->open_mutex again
> 7. Deadlock: same task waiting for a mutex it already holds
>
> The fix is to run blk_update_request() and blk_mq_end_request() with bottom
> halves disabled. This forces blkdev_release() to run in kernel work-queue
> context instead of current task work context, and allows ublk server to make
> forward progress, and avoids the deadlock.

The idea here seems reasonable, but I can't say I understand all the
pieces resulting in the deadlock.

Best,
Caleb

>
> Fixes: 71f28f3136af ("ublk_drv: add io_uring based userspace block driver")
> Link: https://github.com/ublk-org/ublksrv/issues/170 [1]
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/block/ublk_drv.c | 17 ++++++++++++++++-
>  1 file changed, 16 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index 2c715df63f23..f69da449727f 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -1086,6 +1086,7 @@ static inline void __ublk_complete_rq(struct request *req, struct ublk_io *io,
>  {
>         unsigned int unmapped_bytes;
>         blk_status_t res = BLK_STS_OK;
> +       bool requeue;
>
>         /* failed read IO if nothing is read */
>         if (!io->res && req_op(req) == REQ_OP_READ)
> @@ -1117,14 +1118,28 @@ static inline void __ublk_complete_rq(struct request *req, struct ublk_io *io,
>         if (unlikely(unmapped_bytes < io->res))
>                 io->res = unmapped_bytes;
>
> -       if (blk_update_request(req, BLK_STS_OK, io->res))
> +       /*
> +        * Run bio->bi_end_io() from softirq context for preventing this
> +        * ublk's blkdev_release() from being called on current's task
> +        * work, see fput() implementation.
> +        *
> +        * Otherwise, ublk server may not provide forward progress in
> +        * case of reading partition table from bdev_open() with
> +        * disk->open_mutex grabbed, and causes dead lock.
> +        */
> +       local_bh_disable();
> +       requeue = blk_update_request(req, BLK_STS_OK, io->res);
> +       local_bh_enable();
> +       if (requeue)
>                 blk_mq_requeue_request(req, true);
>         else if (likely(!blk_should_fake_timeout(req->q)))
>                 __blk_mq_end_request(req, BLK_STS_OK);
>
>         return;
>  exit:
> +       local_bh_disable();
>         blk_mq_end_request(req, res);
> +       local_bh_enable();
>  }
>
>  static struct io_uring_cmd *__ublk_prep_compl_io_cmd(struct ublk_io *io,
> --
> 2.47.1
>

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] ublk: fix deadlock when reading partition table
  2025-12-11 20:30 ` Caleb Sander Mateos
@ 2025-12-11 22:26   ` Ming Lei
  2025-12-12  4:56     ` Caleb Sander Mateos
  0 siblings, 1 reply; 5+ messages in thread
From: Ming Lei @ 2025-12-11 22:26 UTC (permalink / raw)
  To: Caleb Sander Mateos; +Cc: Jens Axboe, linux-block, Uday Shankar

On Thu, Dec 11, 2025 at 12:30:35PM -0800, Caleb Sander Mateos wrote:
> On Thu, Dec 11, 2025 at 12:38 AM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > When one process(such as udev) opens ublk block device (e.g., to read
> > the partition table via bdev_open()), a deadlock[1] can occur:
> >
> > 1. bdev_open() grabs disk->open_mutex
> > 2. The process issues read I/O to ublk backend to read partition table
> 
> I'm not sure I understand how a process could be issuing read I/O to
> the block device before bdev_open() has returned? Or do you mean that
> bdev_open() is issuing read I/O for the partition table via
> blkdev_get_whole() -> bdev_disk_changed() -> blk_add_partitions() ->
> check_partition()?

Yes, disk->open_mutex is grabbed and waiting for reading partition table.

> 
> > 3. In __ublk_complete_rq(), blk_update_request() or blk_mq_end_request()
> >    runs bio->bi_end_io() callbacks
> > 4. If this triggers fput() on file descriptor of ublk block device, the
> >    work may be deferred to current task's task work (see fput() implementation)
> 
> What is the bi_end_io implementation that results in an fput() call?

libaio calls fput() via ->ki_complete() via bio->bi_end_io(), io-uring may
call it from io_free_rsrc_node().

https://github.com/ublk-org/ublksrv/issues/170#issuecomment-3635162644

> 
> > 5. This eventually calls blkdev_release() from the same context
> > 6. blkdev_release() tries to grab disk->open_mutex again
> > 7. Deadlock: same task waiting for a mutex it already holds
> >
> > The fix is to run blk_update_request() and blk_mq_end_request() with bottom
> > halves disabled. This forces blkdev_release() to run in kernel work-queue
> > context instead of current task work context, and allows ublk server to make
> > forward progress, and avoids the deadlock.
> 
> The idea here seems reasonable, but I can't say I understand all the
> pieces resulting in the deadlock.

Please see the following scenarios:

1) task A: fio is running IO over /dev/ublkb0 for 5secs

2) task B: just when fio is exiting, another task is calling into ioctl(RRPART) on
/dev/ublkb0, waiting for reading partition with disk->open_mutex held.

3) in ublk server task, for some reason, fput() drops the `struct
file`'s last reference from task A, so bdev_release() is called from
task_work_run() in ublk server context. However, task B is holding
disk->open_mutex, so bdev_release() hangs forever, because this ublk server
can't handle IO for task B any more.

Jiri Pospisil has verified this patch and closes https://github.com/ublk-org/ublksrv/issues/170.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] ublk: fix deadlock when reading partition table
  2025-12-11 22:26   ` Ming Lei
@ 2025-12-12  4:56     ` Caleb Sander Mateos
  2025-12-12  8:22       ` Ming Lei
  0 siblings, 1 reply; 5+ messages in thread
From: Caleb Sander Mateos @ 2025-12-12  4:56 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block, Uday Shankar

On Thu, Dec 11, 2025 at 2:27 PM Ming Lei <ming.lei@redhat.com> wrote:
>
> On Thu, Dec 11, 2025 at 12:30:35PM -0800, Caleb Sander Mateos wrote:
> > On Thu, Dec 11, 2025 at 12:38 AM Ming Lei <ming.lei@redhat.com> wrote:
> > >
> > > When one process(such as udev) opens ublk block device (e.g., to read
> > > the partition table via bdev_open()), a deadlock[1] can occur:
> > >
> > > 1. bdev_open() grabs disk->open_mutex
> > > 2. The process issues read I/O to ublk backend to read partition table
> >
> > I'm not sure I understand how a process could be issuing read I/O to
> > the block device before bdev_open() has returned? Or do you mean that
> > bdev_open() is issuing read I/O for the partition table via
> > blkdev_get_whole() -> bdev_disk_changed() -> blk_add_partitions() ->
> > check_partition()?
>
> Yes, disk->open_mutex is grabbed and waiting for reading partition table.
>
> >
> > > 3. In __ublk_complete_rq(), blk_update_request() or blk_mq_end_request()
> > >    runs bio->bi_end_io() callbacks
> > > 4. If this triggers fput() on file descriptor of ublk block device, the
> > >    work may be deferred to current task's task work (see fput() implementation)
> >
> > What is the bi_end_io implementation that results in an fput() call?
>
> libaio calls fput() via ->ki_complete() via bio->bi_end_io(), io-uring may
> call it from io_free_rsrc_node().
>
> https://github.com/ublk-org/ublksrv/issues/170#issuecomment-3635162644
>
> >
> > > 5. This eventually calls blkdev_release() from the same context
> > > 6. blkdev_release() tries to grab disk->open_mutex again
> > > 7. Deadlock: same task waiting for a mutex it already holds
> > >
> > > The fix is to run blk_update_request() and blk_mq_end_request() with bottom
> > > halves disabled. This forces blkdev_release() to run in kernel work-queue
> > > context instead of current task work context, and allows ublk server to make
> > > forward progress, and avoids the deadlock.
> >
> > The idea here seems reasonable, but I can't say I understand all the
> > pieces resulting in the deadlock.
>
> Please see the following scenarios:
>
> 1) task A: fio is running IO over /dev/ublkb0 for 5secs
>
> 2) task B: just when fio is exiting, another task is calling into ioctl(RRPART) on
> /dev/ublkb0, waiting for reading partition with disk->open_mutex held.
>
> 3) in ublk server task, for some reason, fput() drops the `struct
> file`'s last reference from task A, so bdev_release() is called from
> task_work_run() in ublk server context. However, task B is holding
> disk->open_mutex, so bdev_release() hangs forever, because this ublk server
> can't handle IO for task B any more.

Ah okay this makes more sense, I thought the fput() was for the
completion of the partition table read. But I see now it can be for
any other process that happens to have opened the same ublk disk. The
"userspace thread and not in interrupt/softirq" logic in
__fput_deferred() to decide to close the file in task work seems
pretty suspect. The assumption seems to be that blocking a userspace
thread is fine since it won't be blocking any kernel threads, but
that's clearly wrong for a userspace block device. I wonder if any
similar deadlocks are possible with fuse? Pretending the ublk server
thread is in softirq context when completing the seems a bit hacky,
but I can see it may be the simplest fix.

Do the blk_mq_end_request() calls in__ublk_do_auto_buf_reg() and
__ublk_abort_rq() not also need this protection?

Best,
Caleb

>
> Jiri Pospisil has verified this patch and closes https://github.com/ublk-org/ublksrv/issues/170.
>
>
> Thanks,
> Ming
>

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] ublk: fix deadlock when reading partition table
  2025-12-12  4:56     ` Caleb Sander Mateos
@ 2025-12-12  8:22       ` Ming Lei
  0 siblings, 0 replies; 5+ messages in thread
From: Ming Lei @ 2025-12-12  8:22 UTC (permalink / raw)
  To: Caleb Sander Mateos; +Cc: Jens Axboe, linux-block, Uday Shankar

On Thu, Dec 11, 2025 at 08:56:46PM -0800, Caleb Sander Mateos wrote:
> On Thu, Dec 11, 2025 at 2:27 PM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > On Thu, Dec 11, 2025 at 12:30:35PM -0800, Caleb Sander Mateos wrote:
> > > On Thu, Dec 11, 2025 at 12:38 AM Ming Lei <ming.lei@redhat.com> wrote:
> > > >
> > > > When one process(such as udev) opens ublk block device (e.g., to read
> > > > the partition table via bdev_open()), a deadlock[1] can occur:
> > > >
> > > > 1. bdev_open() grabs disk->open_mutex
> > > > 2. The process issues read I/O to ublk backend to read partition table
> > >
> > > I'm not sure I understand how a process could be issuing read I/O to
> > > the block device before bdev_open() has returned? Or do you mean that
> > > bdev_open() is issuing read I/O for the partition table via
> > > blkdev_get_whole() -> bdev_disk_changed() -> blk_add_partitions() ->
> > > check_partition()?
> >
> > Yes, disk->open_mutex is grabbed and waiting for reading partition table.
> >
> > >
> > > > 3. In __ublk_complete_rq(), blk_update_request() or blk_mq_end_request()
> > > >    runs bio->bi_end_io() callbacks
> > > > 4. If this triggers fput() on file descriptor of ublk block device, the
> > > >    work may be deferred to current task's task work (see fput() implementation)
> > >
> > > What is the bi_end_io implementation that results in an fput() call?
> >
> > libaio calls fput() via ->ki_complete() via bio->bi_end_io(), io-uring may
> > call it from io_free_rsrc_node().
> >
> > https://github.com/ublk-org/ublksrv/issues/170#issuecomment-3635162644
> >
> > >
> > > > 5. This eventually calls blkdev_release() from the same context
> > > > 6. blkdev_release() tries to grab disk->open_mutex again
> > > > 7. Deadlock: same task waiting for a mutex it already holds
> > > >
> > > > The fix is to run blk_update_request() and blk_mq_end_request() with bottom
> > > > halves disabled. This forces blkdev_release() to run in kernel work-queue
> > > > context instead of current task work context, and allows ublk server to make
> > > > forward progress, and avoids the deadlock.
> > >
> > > The idea here seems reasonable, but I can't say I understand all the
> > > pieces resulting in the deadlock.
> >
> > Please see the following scenarios:
> >
> > 1) task A: fio is running IO over /dev/ublkb0 for 5secs
> >
> > 2) task B: just when fio is exiting, another task is calling into ioctl(RRPART) on
> > /dev/ublkb0, waiting for reading partition with disk->open_mutex held.
> >
> > 3) in ublk server task, for some reason, fput() drops the `struct
> > file`'s last reference from task A, so bdev_release() is called from
> > task_work_run() in ublk server context. However, task B is holding
> > disk->open_mutex, so bdev_release() hangs forever, because this ublk server
> > can't handle IO for task B any more.
> 
> Ah okay this makes more sense, I thought the fput() was for the
> completion of the partition table read. But I see now it can be for
> any other process that happens to have opened the same ublk disk. The
> "userspace thread and not in interrupt/softirq" logic in
> __fput_deferred() to decide to close the file in task work seems
> pretty suspect. The assumption seems to be that blocking a userspace
> thread is fine since it won't be blocking any kernel threads, but
> that's clearly wrong for a userspace block device. I wonder if any
> similar deadlocks are possible with fuse? Pretending the ublk server

It should be one block only issue, cause it is related disk->open_disk.

> thread is in softirq context when completing the seems a bit hacky,
> but I can see it may be the simplest fix.
> 
> Do the blk_mq_end_request() calls in__ublk_do_auto_buf_reg() and
> __ublk_abort_rq() not also need this protection?

I think all needs this protection, will do it in v2.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2025-12-12  8:23 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-12-11  8:38 [PATCH] ublk: fix deadlock when reading partition table Ming Lei
2025-12-11 20:30 ` Caleb Sander Mateos
2025-12-11 22:26   ` Ming Lei
2025-12-12  4:56     ` Caleb Sander Mateos
2025-12-12  8:22       ` Ming Lei

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox