[PATCH V2] ublk: fix deadlock when reading partition table

linux-block.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH V2] ublk: fix deadlock when reading partition table
@ 2025-12-12 14:34 Ming Lei
  2025-12-12 16:57 ` Caleb Sander Mateos
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Ming Lei @ 2025-12-12 14:34 UTC (permalink / raw)
  To: Jens Axboe, linux-block; +Cc: Caleb Sander Mateos, Uday Shankar, Ming Lei

When one process(such as udev) opens ublk block device (e.g., to read
the partition table via bdev_open()), a deadlock[1] can occur:

1. bdev_open() grabs disk->open_mutex
2. The process issues read I/O to ublk backend to read partition table
3. In __ublk_complete_rq(), blk_update_request() or blk_mq_end_request()
   runs bio->bi_end_io() callbacks
4. If this triggers fput() on file descriptor of ublk block device, the
   work may be deferred to current task's task work (see fput() implementation)
5. This eventually calls blkdev_release() from the same context
6. blkdev_release() tries to grab disk->open_mutex again
7. Deadlock: same task waiting for a mutex it already holds

The fix is to run blk_update_request() and blk_mq_end_request() with bottom
halves disabled. This forces blkdev_release() to run in kernel work-queue
context instead of current task work context, and allows ublk server to make
forward progress, and avoids the deadlock.

Fixes: 71f28f3136af ("ublk_drv: add io_uring based userspace block driver")
Link: https://github.com/ublk-org/ublksrv/issues/170 [1]
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
V2:
	- cover another two cases of ending request(Caleb Sander Mateos)

 drivers/block/ublk_drv.c | 28 ++++++++++++++++++++++++----
 1 file changed, 24 insertions(+), 4 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index df9831783a13..38f138f248e6 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -1080,12 +1080,20 @@ static inline struct ublk_uring_cmd_pdu *ublk_get_uring_cmd_pdu(
 	return io_uring_cmd_to_pdu(ioucmd, struct ublk_uring_cmd_pdu);
 }
 
+static void ublk_end_request(struct request *req, blk_status_t error)
+{
+	local_bh_disable();
+	blk_mq_end_request(req, error);
+	local_bh_enable();
+}
+
 /* todo: handle partial completion */
 static inline void __ublk_complete_rq(struct request *req, struct ublk_io *io,
 				      bool need_map)
 {
 	unsigned int unmapped_bytes;
 	blk_status_t res = BLK_STS_OK;
+	bool requeue;
 
 	/* failed read IO if nothing is read */
 	if (!io->res && req_op(req) == REQ_OP_READ)
@@ -1117,14 +1125,26 @@ static inline void __ublk_complete_rq(struct request *req, struct ublk_io *io,
 	if (unlikely(unmapped_bytes < io->res))
 		io->res = unmapped_bytes;
 
-	if (blk_update_request(req, BLK_STS_OK, io->res))
+	/*
+	 * Run bio->bi_end_io() from softirq context for preventing this
+	 * ublk's blkdev_release() from being called on current's task
+	 * work, see fput() implementation.
+	 *
+	 * Otherwise, ublk server may not provide forward progress in
+	 * case of reading partition table from bdev_open() with
+	 * disk->open_mutex grabbed, and causes dead lock.
+	 */
+	local_bh_disable();
+	requeue = blk_update_request(req, BLK_STS_OK, io->res);
+	local_bh_enable();
+	if (requeue)
 		blk_mq_requeue_request(req, true);
 	else if (likely(!blk_should_fake_timeout(req->q)))
 		__blk_mq_end_request(req, BLK_STS_OK);
 
 	return;
 exit:
-	blk_mq_end_request(req, res);
+	ublk_end_request(req, res);
 }
 
 static struct io_uring_cmd *__ublk_prep_compl_io_cmd(struct ublk_io *io,
@@ -1164,7 +1184,7 @@ static inline void __ublk_abort_rq(struct ublk_queue *ubq,
 	if (ublk_nosrv_dev_should_queue_io(ubq->dev))
 		blk_mq_requeue_request(rq, false);
 	else
-		blk_mq_end_request(rq, BLK_STS_IOERR);
+		ublk_end_request(rq, BLK_STS_IOERR);
 }
 
 static void
@@ -1209,7 +1229,7 @@ __ublk_do_auto_buf_reg(const struct ublk_queue *ubq, struct request *req,
 			ublk_auto_buf_reg_fallback(ubq, req->tag);
 			return AUTO_BUF_REG_FALLBACK;
 		}
-		blk_mq_end_request(req, BLK_STS_IOERR);
+		ublk_end_request(req, BLK_STS_IOERR);
 		return AUTO_BUF_REG_FAIL;
 	}
 
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH V2] ublk: fix deadlock when reading partition table
  2025-12-12 14:34 [PATCH V2] ublk: fix deadlock when reading partition table Ming Lei
@ 2025-12-12 16:57 ` Caleb Sander Mateos
  2025-12-12 19:49 ` Jens Axboe
  2025-12-18  2:41 ` Jens Axboe
  2 siblings, 0 replies; 13+ messages in thread
From: Caleb Sander Mateos @ 2025-12-12 16:57 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block, Uday Shankar

On Fri, Dec 12, 2025 at 6:34 AM Ming Lei <ming.lei@redhat.com> wrote:
>
> When one process(such as udev) opens ublk block device (e.g., to read
> the partition table via bdev_open()), a deadlock[1] can occur:
>
> 1. bdev_open() grabs disk->open_mutex
> 2. The process issues read I/O to ublk backend to read partition table
> 3. In __ublk_complete_rq(), blk_update_request() or blk_mq_end_request()
>    runs bio->bi_end_io() callbacks
> 4. If this triggers fput() on file descriptor of ublk block device, the
>    work may be deferred to current task's task work (see fput() implementation)
> 5. This eventually calls blkdev_release() from the same context
> 6. blkdev_release() tries to grab disk->open_mutex again
> 7. Deadlock: same task waiting for a mutex it already holds
>
> The fix is to run blk_update_request() and blk_mq_end_request() with bottom
> halves disabled. This forces blkdev_release() to run in kernel work-queue
> context instead of current task work context, and allows ublk server to make
> forward progress, and avoids the deadlock.
>
> Fixes: 71f28f3136af ("ublk_drv: add io_uring based userspace block driver")
> Link: https://github.com/ublk-org/ublksrv/issues/170 [1]
> Signed-off-by: Ming Lei <ming.lei@redhat.com>

Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>

> ---
> V2:
>         - cover another two cases of ending request(Caleb Sander Mateos)
>
>  drivers/block/ublk_drv.c | 28 ++++++++++++++++++++++++----
>  1 file changed, 24 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index df9831783a13..38f138f248e6 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -1080,12 +1080,20 @@ static inline struct ublk_uring_cmd_pdu *ublk_get_uring_cmd_pdu(
>         return io_uring_cmd_to_pdu(ioucmd, struct ublk_uring_cmd_pdu);
>  }
>
> +static void ublk_end_request(struct request *req, blk_status_t error)
> +{
> +       local_bh_disable();
> +       blk_mq_end_request(req, error);
> +       local_bh_enable();
> +}
> +
>  /* todo: handle partial completion */
>  static inline void __ublk_complete_rq(struct request *req, struct ublk_io *io,
>                                       bool need_map)
>  {
>         unsigned int unmapped_bytes;
>         blk_status_t res = BLK_STS_OK;
> +       bool requeue;
>
>         /* failed read IO if nothing is read */
>         if (!io->res && req_op(req) == REQ_OP_READ)
> @@ -1117,14 +1125,26 @@ static inline void __ublk_complete_rq(struct request *req, struct ublk_io *io,
>         if (unlikely(unmapped_bytes < io->res))
>                 io->res = unmapped_bytes;
>
> -       if (blk_update_request(req, BLK_STS_OK, io->res))
> +       /*
> +        * Run bio->bi_end_io() from softirq context for preventing this
> +        * ublk's blkdev_release() from being called on current's task
> +        * work, see fput() implementation.
> +        *
> +        * Otherwise, ublk server may not provide forward progress in
> +        * case of reading partition table from bdev_open() with
> +        * disk->open_mutex grabbed, and causes dead lock.
> +        */
> +       local_bh_disable();
> +       requeue = blk_update_request(req, BLK_STS_OK, io->res);
> +       local_bh_enable();
> +       if (requeue)
>                 blk_mq_requeue_request(req, true);
>         else if (likely(!blk_should_fake_timeout(req->q)))
>                 __blk_mq_end_request(req, BLK_STS_OK);
>
>         return;
>  exit:
> -       blk_mq_end_request(req, res);
> +       ublk_end_request(req, res);
>  }
>
>  static struct io_uring_cmd *__ublk_prep_compl_io_cmd(struct ublk_io *io,
> @@ -1164,7 +1184,7 @@ static inline void __ublk_abort_rq(struct ublk_queue *ubq,
>         if (ublk_nosrv_dev_should_queue_io(ubq->dev))
>                 blk_mq_requeue_request(rq, false);
>         else
> -               blk_mq_end_request(rq, BLK_STS_IOERR);
> +               ublk_end_request(rq, BLK_STS_IOERR);
>  }
>
>  static void
> @@ -1209,7 +1229,7 @@ __ublk_do_auto_buf_reg(const struct ublk_queue *ubq, struct request *req,
>                         ublk_auto_buf_reg_fallback(ubq, req->tag);
>                         return AUTO_BUF_REG_FALLBACK;
>                 }
> -               blk_mq_end_request(req, BLK_STS_IOERR);
> +               ublk_end_request(req, BLK_STS_IOERR);
>                 return AUTO_BUF_REG_FAIL;
>         }
>
> --
> 2.47.1
>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH V2] ublk: fix deadlock when reading partition table
  2025-12-12 14:34 [PATCH V2] ublk: fix deadlock when reading partition table Ming Lei
  2025-12-12 16:57 ` Caleb Sander Mateos
@ 2025-12-12 19:49 ` Jens Axboe
  2025-12-13  2:28   ` Ming Lei
  2025-12-18  2:41 ` Jens Axboe
  2 siblings, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2025-12-12 19:49 UTC (permalink / raw)
  To: Ming Lei, linux-block; +Cc: Caleb Sander Mateos, Uday Shankar

On 12/12/25 7:34 AM, Ming Lei wrote:
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index df9831783a13..38f138f248e6 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -1080,12 +1080,20 @@ static inline struct ublk_uring_cmd_pdu *ublk_get_uring_cmd_pdu(
>  	return io_uring_cmd_to_pdu(ioucmd, struct ublk_uring_cmd_pdu);
>  }
>  
> +static void ublk_end_request(struct request *req, blk_status_t error)
> +{
> +	local_bh_disable();
> +	blk_mq_end_request(req, error);
> +	local_bh_enable();
> +}

This is really almost too ugly to live, as a work-around for what just
happens to be in __fput_deferred()... Surely we can come up with
something better here? Heck even a PF_ flag would be better than this,
imho.

> @@ -1117,14 +1125,26 @@ static inline void __ublk_complete_rq(struct request *req, struct ublk_io *io,
>  	if (unlikely(unmapped_bytes < io->res))
>  		io->res = unmapped_bytes;
>  
> -	if (blk_update_request(req, BLK_STS_OK, io->res))
> +	/*
> +	 * Run bio->bi_end_io() from softirq context for preventing this
> +	 * ublk's blkdev_release() from being called on current's task
> +	 * work, see fput() implementation.

But that's not what it does, running it from softirq context. It simply
disables local bottomhalf interrupts to trick __fput_deferred(), it's
not scheduling ->bi_end_io() to run from softirq context itself.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH V2] ublk: fix deadlock when reading partition table
  2025-12-12 19:49 ` Jens Axboe
@ 2025-12-13  2:28   ` Ming Lei
  2025-12-14  6:41     ` Jens Axboe
  0 siblings, 1 reply; 13+ messages in thread
From: Ming Lei @ 2025-12-13  2:28 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Caleb Sander Mateos, Uday Shankar

On Fri, Dec 12, 2025 at 12:49:49PM -0700, Jens Axboe wrote:
> On 12/12/25 7:34 AM, Ming Lei wrote:
> > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > index df9831783a13..38f138f248e6 100644
> > --- a/drivers/block/ublk_drv.c
> > +++ b/drivers/block/ublk_drv.c
> > @@ -1080,12 +1080,20 @@ static inline struct ublk_uring_cmd_pdu *ublk_get_uring_cmd_pdu(
> >  	return io_uring_cmd_to_pdu(ioucmd, struct ublk_uring_cmd_pdu);
> >  }
> >  
> > +static void ublk_end_request(struct request *req, blk_status_t error)
> > +{
> > +	local_bh_disable();
> > +	blk_mq_end_request(req, error);
> > +	local_bh_enable();
> > +}
> 
> This is really almost too ugly to live, as a work-around for what just
> happens to be in __fput_deferred()... Surely we can come up with
> something better here? Heck even a PF_ flag would be better than this,
> imho.

task flag will switch to release all files opened by current from wq context,
and there may be chance to cause regression, especially this fix needs to
backport to stable.

So I'd suggest to take it for safe stable purpose.


> 
> > @@ -1117,14 +1125,26 @@ static inline void __ublk_complete_rq(struct request *req, struct ublk_io *io,
> >  	if (unlikely(unmapped_bytes < io->res))
> >  		io->res = unmapped_bytes;
> >  
> > -	if (blk_update_request(req, BLK_STS_OK, io->res))
> > +	/*
> > +	 * Run bio->bi_end_io() from softirq context for preventing this
> > +	 * ublk's blkdev_release() from being called on current's task
> > +	 * work, see fput() implementation.
> 
> But that's not what it does, running it from softirq context. It simply
> disables local bottomhalf interrupts to trick __fput_deferred(), it's
> not scheduling ->bi_end_io() to run from softirq context itself.

Yes, I can twink the words if you don't object this approach, but it isn't
different from core code viewpoint, in_interrupt() returns true.

Thanks,
Ming


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH V2] ublk: fix deadlock when reading partition table
  2025-12-13  2:28   ` Ming Lei
@ 2025-12-14  6:41     ` Jens Axboe
  2025-12-16  8:56       ` Ming Lei
  0 siblings, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2025-12-14  6:41 UTC (permalink / raw)
  To: Ming Lei; +Cc: linux-block, Caleb Sander Mateos, Uday Shankar

On 12/12/25 7:28 PM, Ming Lei wrote:
> On Fri, Dec 12, 2025 at 12:49:49PM -0700, Jens Axboe wrote:
>> On 12/12/25 7:34 AM, Ming Lei wrote:
>>> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
>>> index df9831783a13..38f138f248e6 100644
>>> --- a/drivers/block/ublk_drv.c
>>> +++ b/drivers/block/ublk_drv.c
>>> @@ -1080,12 +1080,20 @@ static inline struct ublk_uring_cmd_pdu *ublk_get_uring_cmd_pdu(
>>>  	return io_uring_cmd_to_pdu(ioucmd, struct ublk_uring_cmd_pdu);
>>>  }
>>>  
>>> +static void ublk_end_request(struct request *req, blk_status_t error)
>>> +{
>>> +	local_bh_disable();
>>> +	blk_mq_end_request(req, error);
>>> +	local_bh_enable();
>>> +}
>>
>> This is really almost too ugly to live, as a work-around for what just
>> happens to be in __fput_deferred()... Surely we can come up with
>> something better here? Heck even a PF_ flag would be better than this,
>> imho.
> 
> task flag will switch to release all files opened by current from wq context,
> and there may be chance to cause regression, especially this fix needs to
> backport to stable.

I don't mean in general for the task, just across the completion. It'd
cause the exact same punts to async puts as the current patch.

> So I'd suggest to take it for safe stable purpose.

I'm really having a hard time with it - and I have to defend it once I
send it further upstream. And I can tell you who's going to hate it, the
guy that pulls from me. We might get lucky that he doesn't look at it.
But the underlying issue here is that the patch is one nasty bandaid,
not that Linus would yell at it, with good reason imho.

At the same time, I don't really have other good suggestions. Let me
ponder it a bit, about to get on a flight anyway and -rc1 has been cut
at this point regardless.

Obviously this isn't a ublk induced issue, it's all down to the lock
grabbing that happens there. Maybe bdev_release() could do a deferred
put if a trylock of ->open_mutex fails?

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH V2] ublk: fix deadlock when reading partition table
  2025-12-14  6:41     ` Jens Axboe
@ 2025-12-16  8:56       ` Ming Lei
  2025-12-16 15:03         ` Jens Axboe
  0 siblings, 1 reply; 13+ messages in thread
From: Ming Lei @ 2025-12-16  8:56 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Caleb Sander Mateos, Uday Shankar

On Sat, Dec 13, 2025 at 11:41:52PM -0700, Jens Axboe wrote:
> On 12/12/25 7:28 PM, Ming Lei wrote:
> > On Fri, Dec 12, 2025 at 12:49:49PM -0700, Jens Axboe wrote:
> >> On 12/12/25 7:34 AM, Ming Lei wrote:
> >>> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> >>> index df9831783a13..38f138f248e6 100644
> >>> --- a/drivers/block/ublk_drv.c
> >>> +++ b/drivers/block/ublk_drv.c
> >>> @@ -1080,12 +1080,20 @@ static inline struct ublk_uring_cmd_pdu *ublk_get_uring_cmd_pdu(
> >>>  	return io_uring_cmd_to_pdu(ioucmd, struct ublk_uring_cmd_pdu);
> >>>  }
> >>>  
> >>> +static void ublk_end_request(struct request *req, blk_status_t error)
> >>> +{
> >>> +	local_bh_disable();
> >>> +	blk_mq_end_request(req, error);
> >>> +	local_bh_enable();
> >>> +}
> >>
> >> This is really almost too ugly to live, as a work-around for what just
> >> happens to be in __fput_deferred()... Surely we can come up with
> >> something better here? Heck even a PF_ flag would be better than this,
> >> imho.
> > 
> > task flag will switch to release all files opened by current from wq context,
> > and there may be chance to cause regression, especially this fix needs to
> > backport to stable.
> 
> I don't mean in general for the task, just across the completion. It'd
> cause the exact same punts to async puts as the current patch.

Technically it is very similar with the posted path, just task flag way touches
core code, especially there are only 4bits left.

> 
> > So I'd suggest to take it for safe stable purpose.
> 
> I'm really having a hard time with it - and I have to defend it once I
> send it further upstream. And I can tell you who's going to hate it, the
> guy that pulls from me. We might get lucky that he doesn't look at it.
> But the underlying issue here is that the patch is one nasty bandaid,
> not that Linus would yell at it, with good reason imho.

Understood.

IMHO, even this patch is a workaround and looks ugly, but it has enough
benefits too:

- driver only change and won't touch core code
- it is for handling abnormal userspace behavior(close(disk_fd) before
  completing block IO)
- correct & safe because it is always fine to complete IO request from irq
  context
- easy to backport

> 
> At the same time, I don't really have other good suggestions. Let me
> ponder it a bit, about to get on a flight anyway and -rc1 has been cut
> at this point regardless.
> 
> Obviously this isn't a ublk induced issue, it's all down to the lock
> grabbing that happens there. Maybe bdev_release() could do a deferred
> put if a trylock of ->open_mutex fails?

There is risk to break application since some cases need to drain
bdev_release before returning to userspace.

The issue for ublk is actually triggered by something abnormal: submit AIO
& close(ublk disk) in client application, then fput() is called when the
submitted AIO is done, it will cause deferred fput handler to wq for any block
IO completed from irq handler.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH V2] ublk: fix deadlock when reading partition table
  2025-12-16  8:56       ` Ming Lei
@ 2025-12-16 15:03         ` Jens Axboe
  2025-12-16 17:57           ` Jens Axboe
  0 siblings, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2025-12-16 15:03 UTC (permalink / raw)
  To: Ming Lei; +Cc: linux-block, Caleb Sander Mateos, Uday Shankar

On 12/16/25 1:56 AM, Ming Lei wrote:
> On Sat, Dec 13, 2025 at 11:41:52PM -0700, Jens Axboe wrote:
>> On 12/12/25 7:28 PM, Ming Lei wrote:
>>> On Fri, Dec 12, 2025 at 12:49:49PM -0700, Jens Axboe wrote:
>>>> On 12/12/25 7:34 AM, Ming Lei wrote:
>>>>> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
>>>>> index df9831783a13..38f138f248e6 100644
>>>>> --- a/drivers/block/ublk_drv.c
>>>>> +++ b/drivers/block/ublk_drv.c
>>>>> @@ -1080,12 +1080,20 @@ static inline struct ublk_uring_cmd_pdu *ublk_get_uring_cmd_pdu(
>>>>>  	return io_uring_cmd_to_pdu(ioucmd, struct ublk_uring_cmd_pdu);
>>>>>  }
>>>>>  
>>>>> +static void ublk_end_request(struct request *req, blk_status_t error)
>>>>> +{
>>>>> +	local_bh_disable();
>>>>> +	blk_mq_end_request(req, error);
>>>>> +	local_bh_enable();
>>>>> +}
>>>>
>>>> This is really almost too ugly to live, as a work-around for what just
>>>> happens to be in __fput_deferred()... Surely we can come up with
>>>> something better here? Heck even a PF_ flag would be better than this,
>>>> imho.
>>>
>>> task flag will switch to release all files opened by current from wq context,
>>> and there may be chance to cause regression, especially this fix needs to
>>> backport to stable.
>>
>> I don't mean in general for the task, just across the completion. It'd
>> cause the exact same punts to async puts as the current patch.
> 
> Technically it is very similar with the posted path, just task flag
> way touches core code, especially there are only 4bits left.

Doesn't matter if it touches core code or not, the cleaner fix is what
matters. But a task flag is not very clean other, don't disagree on
that. Just tossing out ideas to make this palatable.

>>> So I'd suggest to take it for safe stable purpose.
>>
>> I'm really having a hard time with it - and I have to defend it once I
>> send it further upstream. And I can tell you who's going to hate it, the
>> guy that pulls from me. We might get lucky that he doesn't look at it.
>> But the underlying issue here is that the patch is one nasty bandaid,
>> not that Linus would yell at it, with good reason imho.
> 
> Understood.
> 
> IMHO, even this patch is a workaround and looks ugly, but it has enough
> benefits too:
> 
> - driver only change and won't touch core code

You bring this up again, why does it matter?

> - it is for handling abnormal userspace behavior(close(disk_fd) before
>   completing block IO)
> - correct & safe because it is always fine to complete IO request from irq
>   context
> - easy to backport

The only benefit that is valid here is the last one, yeah it's an easier
backport. But it's not like that one is that important, the other
suggested fix is a trivial backport as well.

>> At the same time, I don't really have other good suggestions. Let me
>> ponder it a bit, about to get on a flight anyway and -rc1 has been cut
>> at this point regardless.
>>
>> Obviously this isn't a ublk induced issue, it's all down to the lock
>> grabbing that happens there. Maybe bdev_release() could do a deferred
>> put if a trylock of ->open_mutex fails?
> 
> There is risk to break application since some cases need to drain
> bdev_release before returning to userspace.

How so? There's no guarantee that the release is sync and inline, in
fact there are already cases where that isn't so, and yours very much
takes advantage of that.

> The issue for ublk is actually triggered by something abnormal: submit AIO
> & close(ublk disk) in client application, then fput() is called when the
> submitted AIO is done, it will cause deferred fput handler to wq for any block
> IO completed from irq handler.

My suggested logic is something ala this in bdev_release():

	if (current->flags & PF_KTHREAD) {
		mutex_lock(&disk->open_mutex);
	} else {
		if (!mutex_trylock(&disk->open_mutex)) {
			deferred_put(file);
			return;
		}
	}

and that's about it.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH V2] ublk: fix deadlock when reading partition table
  2025-12-16 15:03         ` Jens Axboe
@ 2025-12-16 17:57           ` Jens Axboe
  2025-12-17  3:09             ` Ming Lei
  0 siblings, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2025-12-16 17:57 UTC (permalink / raw)
  To: Ming Lei; +Cc: linux-block, Caleb Sander Mateos, Uday Shankar

On 12/16/25 8:03 AM, Jens Axboe wrote:
>> The issue for ublk is actually triggered by something abnormal: submit AIO
>> & close(ublk disk) in client application, then fput() is called when the
>> submitted AIO is done, it will cause deferred fput handler to wq for any block
>> IO completed from irq handler.
> 
> My suggested logic is something ala this in bdev_release():
> 
> 	if (current->flags & PF_KTHREAD) {
> 		mutex_lock(&disk->open_mutex);
> 	} else {
> 		if (!mutex_trylock(&disk->open_mutex)) {
> 			deferred_put(file);
> 			return;
> 		}
> 	}
> 
> and that's about it.

I took a look at the bug report, and now it makes more sense to me -
this is an aio only issue, as it does fput() from ->bi_end_io() context.
That's pretty nasty, as you don't really know what context that might
be, both in terms of irq/bh state, but also in terms of locks. The
former fput() does work around.

Why isn't the fix something as simple as the below, with your comment
added on top? I'm not aware of anyone else that would do fput off
->bi_end_io, so we migt as well treat the source of the issue rather
than work around it in ublk. THAT makes a lot more sense to me.

diff --git a/fs/aio.c b/fs/aio.c
index 0a23a8c0717f..d34ae02c61c1 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -1104,8 +1104,13 @@ static inline void iocb_destroy(struct aio_kiocb *iocb)
 {
 	if (iocb->ki_eventfd)
 		eventfd_ctx_put(iocb->ki_eventfd);
-	if (iocb->ki_filp)
+	if (iocb->ki_filp) {
+		unsigned long flags;
+
+		local_irq_save(flags);
 		fput(iocb->ki_filp);
+		local_irq_restore(flags);
+	}
 	percpu_ref_put(&iocb->ki_ctx->reqs);
 	kmem_cache_free(kiocb_cachep, iocb);
 }

-- 
Jens Axboe

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH V2] ublk: fix deadlock when reading partition table
  2025-12-16 17:57           ` Jens Axboe
@ 2025-12-17  3:09             ` Ming Lei
  2025-12-17  3:19               ` Jens Axboe
  0 siblings, 1 reply; 13+ messages in thread
From: Ming Lei @ 2025-12-17  3:09 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Caleb Sander Mateos, Uday Shankar

On Tue, Dec 16, 2025 at 10:57:25AM -0700, Jens Axboe wrote:
> On 12/16/25 8:03 AM, Jens Axboe wrote:
> >> The issue for ublk is actually triggered by something abnormal: submit AIO
> >> & close(ublk disk) in client application, then fput() is called when the
> >> submitted AIO is done, it will cause deferred fput handler to wq for any block
> >> IO completed from irq handler.
> > 
> > My suggested logic is something ala this in bdev_release():
> > 
> > 	if (current->flags & PF_KTHREAD) {
> > 		mutex_lock(&disk->open_mutex);
> > 	} else {
> > 		if (!mutex_trylock(&disk->open_mutex)) {
> > 			deferred_put(file);
> > 			return;
> > 		}
> > 	}
> > 
> > and that's about it.
> 
> I took a look at the bug report, and now it makes more sense to me -
> this is an aio only issue, as it does fput() from ->bi_end_io() context.
> That's pretty nasty, as you don't really know what context that might
> be, both in terms of irq/bh state, but also in terms of locks. The
> former fput() does work around.
> 
> Why isn't the fix something as simple as the below, with your comment
> added on top? I'm not aware of anyone else that would do fput off
> ->bi_end_io, so we migt as well treat the source of the issue rather
> than work around it in ublk. THAT makes a lot more sense to me.

It doesn't matter if fput is called from ->bi_end_io() directly, it can
be triggered on io-uring indirectly too, in which fput() is called from
__io_submit_flush_completions() in case of non-registerd file.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH V2] ublk: fix deadlock when reading partition table
  2025-12-17  3:09             ` Ming Lei
@ 2025-12-17  3:19               ` Jens Axboe
  2025-12-17  3:33                 ` Ming Lei
  0 siblings, 1 reply; 13+ messages in thread
From: Jens Axboe @ 2025-12-17  3:19 UTC (permalink / raw)
  To: Ming Lei; +Cc: linux-block, Caleb Sander Mateos, Uday Shankar

On 12/16/25 8:09 PM, Ming Lei wrote:
> On Tue, Dec 16, 2025 at 10:57:25AM -0700, Jens Axboe wrote:
>> On 12/16/25 8:03 AM, Jens Axboe wrote:
>>>> The issue for ublk is actually triggered by something abnormal: submit AIO
>>>> & close(ublk disk) in client application, then fput() is called when the
>>>> submitted AIO is done, it will cause deferred fput handler to wq for any block
>>>> IO completed from irq handler.
>>>
>>> My suggested logic is something ala this in bdev_release():
>>>
>>> 	if (current->flags & PF_KTHREAD) {
>>> 		mutex_lock(&disk->open_mutex);
>>> 	} else {
>>> 		if (!mutex_trylock(&disk->open_mutex)) {
>>> 			deferred_put(file);
>>> 			return;
>>> 		}
>>> 	}
>>>
>>> and that's about it.
>>
>> I took a look at the bug report, and now it makes more sense to me -
>> this is an aio only issue, as it does fput() from ->bi_end_io() context.
>> That's pretty nasty, as you don't really know what context that might
>> be, both in terms of irq/bh state, but also in terms of locks. The
>> former fput() does work around.
>>
>> Why isn't the fix something as simple as the below, with your comment
>> added on top? I'm not aware of anyone else that would do fput off
>> ->bi_end_io, so we migt as well treat the source of the issue rather
>> than work around it in ublk. THAT makes a lot more sense to me.
> 
> It doesn't matter if fput is called from ->bi_end_io() directly, it can
> be triggered on io-uring indirectly too, in which fput() is called from
> __io_submit_flush_completions() in case of non-registerd file.

Because of the work-around in io_req_post_cqe()? Or just because of
!DEFER_TASKRUN?

The real problem is holding ->open_mutex over IO, and then also
requiring it to put the file as well. bdev_release() should be able to
work-around that, rather than need anyone to paper around it.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH V2] ublk: fix deadlock when reading partition table
  2025-12-17  3:19               ` Jens Axboe
@ 2025-12-17  3:33                 ` Ming Lei
  2025-12-18  2:37                   ` Jens Axboe
  0 siblings, 1 reply; 13+ messages in thread
From: Ming Lei @ 2025-12-17  3:33 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Caleb Sander Mateos, Uday Shankar

On Tue, Dec 16, 2025 at 08:19:15PM -0700, Jens Axboe wrote:
> On 12/16/25 8:09 PM, Ming Lei wrote:
> > On Tue, Dec 16, 2025 at 10:57:25AM -0700, Jens Axboe wrote:
> >> On 12/16/25 8:03 AM, Jens Axboe wrote:
> >>>> The issue for ublk is actually triggered by something abnormal: submit AIO
> >>>> & close(ublk disk) in client application, then fput() is called when the
> >>>> submitted AIO is done, it will cause deferred fput handler to wq for any block
> >>>> IO completed from irq handler.
> >>>
> >>> My suggested logic is something ala this in bdev_release():
> >>>
> >>> 	if (current->flags & PF_KTHREAD) {
> >>> 		mutex_lock(&disk->open_mutex);
> >>> 	} else {
> >>> 		if (!mutex_trylock(&disk->open_mutex)) {
> >>> 			deferred_put(file);
> >>> 			return;
> >>> 		}
> >>> 	}
> >>>
> >>> and that's about it.
> >>
> >> I took a look at the bug report, and now it makes more sense to me -
> >> this is an aio only issue, as it does fput() from ->bi_end_io() context.
> >> That's pretty nasty, as you don't really know what context that might
> >> be, both in terms of irq/bh state, but also in terms of locks. The
> >> former fput() does work around.
> >>
> >> Why isn't the fix something as simple as the below, with your comment
> >> added on top? I'm not aware of anyone else that would do fput off
> >> ->bi_end_io, so we migt as well treat the source of the issue rather
> >> than work around it in ublk. THAT makes a lot more sense to me.
> > 
> > It doesn't matter if fput is called from ->bi_end_io() directly, it can
> > be triggered on io-uring indirectly too, in which fput() is called from
> > __io_submit_flush_completions() in case of non-registerd file.
> 
> Because of the work-around in io_req_post_cqe()? Or just because of
> !DEFER_TASKRUN?

When fput() is called from __io_submit_flush_completions(), its release
handler will be deferred to run task work, where the current task
is blocked because of ->open_mutex.

It is actually one ublk specific issue which relies on the current task
for handling IO and providing forward progress, so cause deadlock since
reading partition table(with ->open_mutex) requires the task for handling IO.

> 
> The real problem is holding ->open_mutex over IO, and then also
> requiring it to put the file as well. bdev_release() should be able to
> work-around that, rather than need anyone to paper around it.

deferred bdev_release is not safe, for example of suggestion:

        if (current->flags & PF_KTHREAD) {
                mutex_lock(&disk->open_mutex);
        } else {
                if (!mutex_trylock(&disk->open_mutex)) {
                        deferred_put(file);
                        return;
                }
        }

deferred_put(file) will cause disk released after returning to userspace.

Yes, __fput_deferred() allows that in case of in_interrupt(), which usually
means one abnormal application(close(disk) before completing/handling IO),
but it will cause normal application to release disk after returning to
userspace, it may cause -EBUSY for following syscall.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH V2] ublk: fix deadlock when reading partition table
  2025-12-17  3:33                 ` Ming Lei
@ 2025-12-18  2:37                   ` Jens Axboe
  0 siblings, 0 replies; 13+ messages in thread
From: Jens Axboe @ 2025-12-18  2:37 UTC (permalink / raw)
  To: Ming Lei; +Cc: linux-block, Caleb Sander Mateos, Uday Shankar

On 12/16/25 8:33 PM, Ming Lei wrote:
> On Tue, Dec 16, 2025 at 08:19:15PM -0700, Jens Axboe wrote:
>> On 12/16/25 8:09 PM, Ming Lei wrote:
>>> On Tue, Dec 16, 2025 at 10:57:25AM -0700, Jens Axboe wrote:
>>>> On 12/16/25 8:03 AM, Jens Axboe wrote:
>>>>>> The issue for ublk is actually triggered by something abnormal: submit AIO
>>>>>> & close(ublk disk) in client application, then fput() is called when the
>>>>>> submitted AIO is done, it will cause deferred fput handler to wq for any block
>>>>>> IO completed from irq handler.
>>>>>
>>>>> My suggested logic is something ala this in bdev_release():
>>>>>
>>>>> 	if (current->flags & PF_KTHREAD) {
>>>>> 		mutex_lock(&disk->open_mutex);
>>>>> 	} else {
>>>>> 		if (!mutex_trylock(&disk->open_mutex)) {
>>>>> 			deferred_put(file);
>>>>> 			return;
>>>>> 		}
>>>>> 	}
>>>>>
>>>>> and that's about it.
>>>>
>>>> I took a look at the bug report, and now it makes more sense to me -
>>>> this is an aio only issue, as it does fput() from ->bi_end_io() context.
>>>> That's pretty nasty, as you don't really know what context that might
>>>> be, both in terms of irq/bh state, but also in terms of locks. The
>>>> former fput() does work around.
>>>>
>>>> Why isn't the fix something as simple as the below, with your comment
>>>> added on top? I'm not aware of anyone else that would do fput off
>>>> ->bi_end_io, so we migt as well treat the source of the issue rather
>>>> than work around it in ublk. THAT makes a lot more sense to me.
>>>
>>> It doesn't matter if fput is called from ->bi_end_io() directly, it can
>>> be triggered on io-uring indirectly too, in which fput() is called from
>>> __io_submit_flush_completions() in case of non-registerd file.
>>
>> Because of the work-around in io_req_post_cqe()? Or just because of
>> !DEFER_TASKRUN?
> 
> When fput() is called from __io_submit_flush_completions(), its release
> handler will be deferred to run task work, where the current task
> is blocked because of ->open_mutex.
> 
> It is actually one ublk specific issue which relies on the current task
> for handling IO and providing forward progress, so cause deadlock since
> reading partition table(with ->open_mutex) requires the task for handling IO.
> 
>>
>> The real problem is holding ->open_mutex over IO, and then also
>> requiring it to put the file as well. bdev_release() should be able to
>> work-around that, rather than need anyone to paper around it.
> 
> deferred bdev_release is not safe, for example of suggestion:
> 
>         if (current->flags & PF_KTHREAD) {
>                 mutex_lock(&disk->open_mutex);
>         } else {
>                 if (!mutex_trylock(&disk->open_mutex)) {
>                         deferred_put(file);
>                         return;
>                 }
>         }
> 
> deferred_put(file) will cause disk released after returning to userspace.
> 
> Yes, __fput_deferred() allows that in case of in_interrupt(), which usually
> means one abnormal application(close(disk) before completing/handling IO),
> but it will cause normal application to release disk after returning to
> userspace, it may cause -EBUSY for following syscall.

OK, let's just go with the hacky work-around. At least for now. Still
hopeful we can improve it in the future...

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH V2] ublk: fix deadlock when reading partition table
  2025-12-12 14:34 [PATCH V2] ublk: fix deadlock when reading partition table Ming Lei
  2025-12-12 16:57 ` Caleb Sander Mateos
  2025-12-12 19:49 ` Jens Axboe
@ 2025-12-18  2:41 ` Jens Axboe
  2 siblings, 0 replies; 13+ messages in thread
From: Jens Axboe @ 2025-12-18  2:41 UTC (permalink / raw)
  To: linux-block, Ming Lei; +Cc: Caleb Sander Mateos, Uday Shankar


On Fri, 12 Dec 2025 22:34:15 +0800, Ming Lei wrote:
> When one process(such as udev) opens ublk block device (e.g., to read
> the partition table via bdev_open()), a deadlock[1] can occur:
> 
> 1. bdev_open() grabs disk->open_mutex
> 2. The process issues read I/O to ublk backend to read partition table
> 3. In __ublk_complete_rq(), blk_update_request() or blk_mq_end_request()
>    runs bio->bi_end_io() callbacks
> 4. If this triggers fput() on file descriptor of ublk block device, the
>    work may be deferred to current task's task work (see fput() implementation)
> 5. This eventually calls blkdev_release() from the same context
> 6. blkdev_release() tries to grab disk->open_mutex again
> 7. Deadlock: same task waiting for a mutex it already holds
> 
> [...]

Applied, thanks!

[1/1] ublk: fix deadlock when reading partition table
      commit: c258f5c4502c9667bccf5d76fa731ab9c96687c1

Best regards,
-- 
Jens Axboe




^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2025-12-18  2:41 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-12-12 14:34 [PATCH V2] ublk: fix deadlock when reading partition table Ming Lei
2025-12-12 16:57 ` Caleb Sander Mateos
2025-12-12 19:49 ` Jens Axboe
2025-12-13  2:28   ` Ming Lei
2025-12-14  6:41     ` Jens Axboe
2025-12-16  8:56       ` Ming Lei
2025-12-16 15:03         ` Jens Axboe
2025-12-16 17:57           ` Jens Axboe
2025-12-17  3:09             ` Ming Lei
2025-12-17  3:19               ` Jens Axboe
2025-12-17  3:33                 ` Ming Lei
2025-12-18  2:37                   ` Jens Axboe
2025-12-18  2:41 ` Jens Axboe

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).