* block: del_gendisk() vs blk_queue_enter() race condition
From: Sergey Senozhatsky @ 2024-10-03 8:56 UTC
To: Jens Axboe; +Cc: linux-block, Sergey Senozhatsky

Hello,
I'm looking at a report from the fleet (don't have a reproducer)
and wondering what you and block folks might think / suggest.
The problem is basically as follows
CPU0
do_syscall
sys_close
__fput
blkdev_release
blkdev_put grabs ->open_mutex
sr_block_release
scsi_set_medium_removal
ioctl_internal_command
scsi_execute_cmd
scsi_alloc_request
blk_mq_alloc_request
blk_queue_enter
schedule
at the same time:
CPU1
usb_disconnect
usb_disable_device
device_del
usb_unbind_interface
usb_stor_disconnect
scsi_remove_host
scsi_forget_host
__scsi_remove_device
device_del
bus_remove_device
device_release_driver_internal
sr_remove
del_gendisk
mutex_lock attempts to grab ->open_mutex
schedule
blk_queue_enter() sleeps forever, under ->open_mutex, there is no
way for it to be woken up and to detect blk_queue_dying(). del_gendisk()
sleeps forever because it attempts to grab ->open_mutex before it calls
__blk_mark_disk_dead(), which would mark the queue QUEUE_FLAG_DYING and
wake up ->mq_freeze_wq (which is blk_queue_enter() in this case).
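For reference, the wait that CPU0 ends up stuck in is the wait loop in
blk_queue_enter(); roughly (a simplified sketch of the 6.6 code, NOWAIT
handling and memory barriers trimmed):

        wait_event(q->mq_freeze_wq,
                   (!q->mq_freeze_depth &&
                    blk_pm_resume_queue(pm, q)) ||
                   blk_queue_dying(q));
        if (blk_queue_dying(q))
                return -ENODEV;

so it cannot make progress until someone wakes ->mq_freeze_wq with the
queue either unfrozen or marked dying.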
I wonder how to fix it. My current "patch" is to set QUEUE_FLAG_DYING
and "kick" ->mq_freeze_wq early on in del_gendisk(), before it attempts
to grab ->open_mutex for the first time.
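Roughly something like the below -- an untested sketch just to illustrate
the idea, the exact placement is the open question:

        /* in del_gendisk(), before the first mutex_lock(&disk->open_mutex) */
        blk_queue_flag_set(QUEUE_FLAG_DYING, disk->queue);
        /* let the sleeping blk_queue_enter() re-check blk_queue_dying() */
        wake_up_all(&disk->queue->mq_freeze_wq);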
Any suggestions?

* Re: block: del_gendisk() vs blk_queue_enter() race condition
From: Christoph Hellwig @ 2024-10-03 13:36 UTC
To: Sergey Senozhatsky; +Cc: Jens Axboe, linux-block

On Thu, Oct 03, 2024 at 05:56:10PM +0900, Sergey Senozhatsky wrote:
> blk_queue_enter() sleeps forever, under ->open_mutex, there is no
> way for it to be woken up and to detect blk_queue_dying(). del_gendisk()
> sleeps forever because it attempts to grab ->open_mutex before it calls
> __blk_mark_disk_dead(), which would mark the queue QUEUE_FLAG_DYING and
> wake up ->mq_freeze_wq (which is blk_queue_enter() in this case).
>
> I wonder how to fix it. My current "patch" is to set QUEUE_FLAG_DYING
> and "kick" ->mq_freeze_wq early on in del_gendisk(), before it attempts
> to grab ->open_mutex for the first time.

We split blk_queue_enter further to distinguish between file system
requests and passthrough ones.

The file system request should be using bio_queue_enter, which only
checks GD_DEAD, instead of QUEUE_FLAG_DYING. Passthrough requests like
the cdrom door lock are using blk_queue_enter that checks QUEUE_FLAG_DYING
which only gets set in blk_mq_destroy_queue.

So AFAICS your trace should not happen with the current kernel, but
probably could happen with older stable version unless I'm missing
something. What kernel version did you see this on?

* Re: block: del_gendisk() vs blk_queue_enter() race condition
From: Christoph Hellwig @ 2024-10-03 13:43 UTC
To: Sergey Senozhatsky; +Cc: Jens Axboe, linux-block

On Thu, Oct 03, 2024 at 06:36:20AM -0700, Christoph Hellwig wrote:
> On Thu, Oct 03, 2024 at 05:56:10PM +0900, Sergey Senozhatsky wrote:
> > blk_queue_enter() sleeps forever, under ->open_mutex, there is no
> > way for it to be woken up and to detect blk_queue_dying(). del_gendisk()
> > sleeps forever because it attempts to grab ->open_mutex before it calls
> > __blk_mark_disk_dead(), which would mark the queue QUEUE_FLAG_DYING and
> > wake up ->mq_freeze_wq (which is blk_queue_enter() in this case).
> >
> > I wonder how to fix it. My current "patch" is to set QUEUE_FLAG_DYING
> > and "kick" ->mq_freeze_wq early on in del_gendisk(), before it attempts
> > to grab ->open_mutex for the first time.
>
> We split blk_queue_enter further to distinguish between file system
> requests and passthrough ones.
>
> The file system request should be using bio_queue_enter, which only
> checks GD_DEAD, instead of QUEUE_FLAG_DYING. Passthrough requests like
> the cdrom door lock are using blk_queue_enter that checks QUEUE_FLAG_DYING
> which only gets set in blk_mq_destroy_queue.
>
> So AFAICS your trace should not happen with the current kernel, but
> probably could happen with older stable version unless I'm missing
> something. What kernel version did you see this on?

.. actually, we still clear QUEUE_FLAG_DYING early. Something like
the patch below (plus proper comments) should sort it out:

diff --git a/block/genhd.c b/block/genhd.c
index 1c05dd4c6980b5..9a1e18fbb136cf 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -589,9 +589,6 @@ static void __blk_mark_disk_dead(struct gendisk *disk)
 	if (test_and_set_bit(GD_DEAD, &disk->state))
 		return;
 
-	if (test_bit(GD_OWNS_QUEUE, &disk->state))
-		blk_queue_flag_set(QUEUE_FLAG_DYING, disk->queue);
-
 	/*
 	 * Stop buffered writers from dirtying pages that can't be written out.
 	 */
@@ -673,6 +670,9 @@ void del_gendisk(struct gendisk *disk)
 		drop_partition(part);
 	mutex_unlock(&disk->open_mutex);
 
+	if (test_bit(GD_OWNS_QUEUE, &disk->state))
+		blk_queue_flag_set(QUEUE_FLAG_DYING, disk->queue);
+
 	if (!(disk->flags & GENHD_FL_HIDDEN)) {
 		sysfs_remove_link(&disk_to_dev(disk)->kobj, "bdi");

* Re: block: del_gendisk() vs blk_queue_enter() race condition
From: Sergey Senozhatsky @ 2024-10-03 14:00 UTC
To: Christoph Hellwig; +Cc: Sergey Senozhatsky, Jens Axboe, linux-block

On (24/10/03 06:43), Christoph Hellwig wrote:
> .. actually, we still clear QUEUE_FLAG_DYING early. Something like
> the patch below (plus proper comments) should sort it out:
>
> diff --git a/block/genhd.c b/block/genhd.c
> index 1c05dd4c6980b5..9a1e18fbb136cf 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -589,9 +589,6 @@ static void __blk_mark_disk_dead(struct gendisk *disk)
>  	if (test_and_set_bit(GD_DEAD, &disk->state))
>  		return;
>  
> -	if (test_bit(GD_OWNS_QUEUE, &disk->state))
> -		blk_queue_flag_set(QUEUE_FLAG_DYING, disk->queue);
> -
>  	/*
>  	 * Stop buffered writers from dirtying pages that can't be written out.
>  	 */
> @@ -673,6 +670,9 @@ void del_gendisk(struct gendisk *disk)
>  		drop_partition(part);
>  	mutex_unlock(&disk->open_mutex);
>  
> +	if (test_bit(GD_OWNS_QUEUE, &disk->state))
> +		blk_queue_flag_set(QUEUE_FLAG_DYING, disk->queue);

So that mutex_lock(&disk->open_mutex) right before it can potentially
deadlock (I think it will).

My idea, thus far, was to

	if (test_bit(GD_OWNS_QUEUE, &disk->state)) {
		blk_queue_flag_set(QUEUE_FLAG_DYING, disk->queue);
		blk_kick_queue_enter(disk->queue); // this one simply wake_up_all(&q->mq_freeze_wq);
						   // if the queue has QUEUE_FLAG_DYING
	}

in del_gendisk() before the very first time del_gendisk() attempts to
mutex_lock(&disk->open_mutex), because that mutex is already locked
forever.

>  	if (!(disk->flags & GENHD_FL_HIDDEN)) {
>  		sysfs_remove_link(&disk_to_dev(disk)->kobj, "bdi");

* Re: block: del_gendisk() vs blk_queue_enter() race condition
From: Sergey Senozhatsky @ 2024-10-03 14:17 UTC
To: Sergey Senozhatsky; +Cc: Christoph Hellwig, Jens Axboe, linux-block

On (24/10/03 23:00), Sergey Senozhatsky wrote:
> On (24/10/03 06:43), Christoph Hellwig wrote:
[..]
> So that mutex_lock(&disk->open_mutex) right before it can potentially
> deadlock (I think it will).
>
> My idea, thus far, was to
>
> 	if (test_bit(GD_OWNS_QUEUE, &disk->state)) {
> 		blk_queue_flag_set(QUEUE_FLAG_DYING, disk->queue);
> 		blk_kick_queue_enter(disk->queue); // this one simply wake_up_all(&q->mq_freeze_wq);
> 						   // if the queue has QUEUE_FLAG_DYING
> 	}
>
> in del_gendisk() before the very first time del_gendisk() attempts to
> mutex_lock(&disk->open_mutex), because that mutex is already locked
> forever.

Well, just in case, here's the diff that I have (against 6.6):

---
diff --git a/block/blk-core.c b/block/blk-core.c
index 4f25d2c..470c910 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -304,6 +304,13 @@
 	wake_up_all(&q->mq_freeze_wq);
 }
 
+void blk_kick_queue_enter(struct request_queue *q)
+{
+	if (WARN_ON_ONCE(!blk_queue_dying(q)))
+		return;
+	wake_up_all(&q->mq_freeze_wq);
+}
+
 /**
  * blk_queue_enter() - try to increase q->q_usage_counter
  * @q: request queue pointer
diff --git a/block/genhd.c b/block/genhd.c
index 203c880..3ccc593 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -581,9 +581,6 @@
 	if (test_and_set_bit(GD_DEAD, &disk->state))
 		return;
 
-	if (test_bit(GD_OWNS_QUEUE, &disk->state))
-		blk_queue_flag_set(QUEUE_FLAG_DYING, disk->queue);
-
 	/*
 	 * Stop buffered writers from dirtying pages that can't be written out.
 	 */
@@ -641,6 +638,20 @@
 
 	disk_del_events(disk);
 
+	if (test_bit(GD_OWNS_QUEUE, &disk->state)) {
+		/*
+		 * Set QUEUE_FLAG_DYING before we grab ->open_mutex so that
+		 * blkdev_put() -> release -> blk_queue_enter() can detect
+		 * dead device
+		 */
+		blk_queue_flag_set(QUEUE_FLAG_DYING, disk->queue);
+		/*
+		 * Make sure that ->mq_freeze_wq see QUEUE_FLAG_DYING and
+		 * bail out, unlocking ->open_mutex
+		 */
+		blk_kick_queue_enter(disk->queue);
+	}
+
 	/*
 	 * Prevent new openers by unlinked the bdev inode.
 	 */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 6f67dbe..a49afe9 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -854,6 +854,7 @@
 extern int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags);
 extern void blk_queue_exit(struct request_queue *q);
 extern void blk_sync_queue(struct request_queue *q);
+void blk_kick_queue_enter(struct request_queue *q);
 
 /* Helper to convert REQ_OP_XXX to its string format XXX */
 extern const char *blk_op_str(enum req_op op);

* Re: block: del_gendisk() vs blk_queue_enter() race condition
From: Sergey Senozhatsky @ 2024-10-04 4:21 UTC
To: Christoph Hellwig, Jens Axboe; +Cc: linux-block, Sergey Senozhatsky

On (24/10/03 23:17), Sergey Senozhatsky wrote:
> On (24/10/03 23:00), Sergey Senozhatsky wrote:
> > On (24/10/03 06:43), Christoph Hellwig wrote:
> [..]
> > So that mutex_lock(&disk->open_mutex) right before it can potentially
> > deadlock (I think it will).
> >
> > My idea, thus far, was to
> >
> > 	if (test_bit(GD_OWNS_QUEUE, &disk->state)) {
> > 		blk_queue_flag_set(QUEUE_FLAG_DYING, disk->queue);
> > 		blk_kick_queue_enter(disk->queue); // this one simply wake_up_all(&q->mq_freeze_wq);
> > 						   // if the queue has QUEUE_FLAG_DYING
> > 	}
> >
> > in del_gendisk() before the very first time del_gendisk() attempts to
> > mutex_lock(&disk->open_mutex), because that mutex is already locked
> > forever.

Dunno. Is something like this completely silly?

---
diff --git a/block/genhd.c b/block/genhd.c
index 1c05dd4c6980..c968b04ccc7c 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -649,6 +649,13 @@ void del_gendisk(struct gendisk *disk)
 
 	disk_del_events(disk);
 
+	/*
+	 * Mark the queue QUEUE_FLAG_DYING and wakeup ->mq_freeze_wq so
+	 * that the waiters (e.g. blk_queue_enter()) can see blk_queue_dying()
+	 * and error out, unlocking the ->open_mutex.
+	 */
+	__blk_mark_disk_dead(disk);
+
 	/*
 	 * Prevent new openers by unlinked the bdev inode.
 	 */
@@ -668,7 +675,6 @@ void del_gendisk(struct gendisk *disk)
 	 * Drop all partitions now that the disk is marked dead.
 	 */
 	mutex_lock(&disk->open_mutex);
-	__blk_mark_disk_dead(disk);
 	xa_for_each_start(&disk->part_tbl, idx, part, 1)
 		drop_partition(part);
 	mutex_unlock(&disk->open_mutex);

* Re: block: del_gendisk() vs blk_queue_enter() race condition
From: Christoph Hellwig @ 2024-10-04 6:45 UTC
To: Sergey Senozhatsky; +Cc: Christoph Hellwig, Jens Axboe, linux-block, Yang Yang

On Fri, Oct 04, 2024 at 01:21:27PM +0900, Sergey Senozhatsky wrote:
> Dunno. Is something like this completely silly?

__blk_mark_disk_dead got moved into the lock by: 7e04da2dc701
("block: fix deadlock between sd_remove & sd_release"), which has a trace
that looks very similar to the one you reported.

And that commit also points out something I missed - we do not set
QUEUE_FLAG_DYING here because the gendisk does not own the queue for
SCSI. Because of that allocating the request in sd/sr will not fail, and
it will deadlock.

So I think the short term fix is to also fail passthrough requests here -
either by clearing and resurrecting QUEUE_FLAG_DYING or by also checking
q->disk for GD_DEAD if it exists. Both of these are a bit ugly because
they will fail passthrough through /dev/sg during the removal which is
unexpected (although probably not happening for usual workloads).

The proper fix would be to split the freezing mechanism for file system
vs passthrough I/O, but that's going to be a huge change.

* Re: block: del_gendisk() vs blk_queue_enter() race condition
From: Sergey Senozhatsky @ 2024-10-04 7:48 UTC
To: Christoph Hellwig; +Cc: Sergey Senozhatsky, Jens Axboe, linux-block, Yang Yang

On (24/10/03 23:45), Christoph Hellwig wrote:
> Date: Thu, 3 Oct 2024 23:45:10 -0700
> From: Christoph Hellwig <hch@infradead.org>
> To: Sergey Senozhatsky <senozhatsky@chromium.org>
> Cc: Christoph Hellwig <hch@infradead.org>, Jens Axboe <axboe@kernel.dk>,
> 	linux-block@vger.kernel.org, Yang Yang <yang.yang@vivo.com>
> Subject: Re: block: del_gendisk() vs blk_queue_enter() race condition
> Message-ID: <Zv-O9tldIzPfD8ju@infradead.org>
>
> On Fri, Oct 04, 2024 at 01:21:27PM +0900, Sergey Senozhatsky wrote:
> > Dunno. Is something like this completely silly?
>
> __blk_mark_disk_dead got moved into the lock by: 7e04da2dc701
> ("block: fix deadlock between sd_remove & sd_release"), which has a trace
> that looks very similar to the one you reported.

Hmm, okay, a deadlock one way or another.

> And that commit also points out something I missed - we do not set
> QUEUE_FLAG_DYING here because the gendisk does not own the queue for
> SCSI. Because of that allocating the request in sd/sr will not fail, and
> it will deadlock.

I see. Thanks for the pointers.

> So I think the short term fix is to also fail passthrough requests here -
> either by clearing and resurrecting QUEUE_FLAG_DYING or by also checking
> q->disk for GD_DEAD if it exists. Both of these are a bit ugly because
> they will fail passthrough through /dev/sg during the removal which is
> unexpected (although probably not happening for usual workloads).

You are way ahead of me. Does the below diff look like
"checking for GD_DEAD"?

---
diff --git a/block/blk-core.c b/block/blk-core.c
index bc5e8c5eaac9..ccd36cb5ada7 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -292,6 +292,16 @@ void blk_queue_start_drain(struct request_queue *q)
 	wake_up_all(&q->mq_freeze_wq);
 }
 
+void blk_queue_disk_dead(struct request_queue *q)
+{
+	struct gendisk *disk = q->disk;
+
+	if (WARN_ON_ONCE(!test_bit(GD_DEAD, &disk->state)))
+		return;
+	/* Make blk_queue_enter() reexamine the GD_DEAD flag. */
+	wake_up_all(&q->mq_freeze_wq);
+}
+
 /**
  * blk_queue_enter() - try to increase q->q_usage_counter
  * @q: request queue pointer
@@ -302,6 +312,8 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
 	const bool pm = flags & BLK_MQ_REQ_PM;
 
 	while (!blk_try_enter_queue(q, pm)) {
+		struct gendisk *disk = q->disk;
+
 		if (flags & BLK_MQ_REQ_NOWAIT)
 			return -EAGAIN;
 
@@ -316,8 +328,9 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
 		wait_event(q->mq_freeze_wq,
 			   (!q->mq_freeze_depth &&
 			    blk_pm_resume_queue(pm, q)) ||
-			   blk_queue_dying(q));
-		if (blk_queue_dying(q))
+			   blk_queue_dying(q) ||
+			   test_bit(GD_DEAD, &disk->state));
+		if (blk_queue_dying(q) || test_bit(GD_DEAD, &disk->state))
 			return -ENODEV;
 	}
 
diff --git a/block/genhd.c b/block/genhd.c
index 1c05dd4c6980..c213a0cf8268 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -583,12 +583,6 @@ static void blk_report_disk_dead(struct gendisk *disk, bool surprise)
 
 static void __blk_mark_disk_dead(struct gendisk *disk)
 {
-	/*
-	 * Fail any new I/O.
-	 */
-	if (test_and_set_bit(GD_DEAD, &disk->state))
-		return;
-
 	if (test_bit(GD_OWNS_QUEUE, &disk->state))
 		blk_queue_flag_set(QUEUE_FLAG_DYING, disk->queue);
 
@@ -649,6 +643,12 @@ void del_gendisk(struct gendisk *disk)
 
 	disk_del_events(disk);
 
+	/*
+	 * Fail any new I/O.
+	 */
+	test_bit(GD_DEAD, &disk->state);
+	blk_queue_disk_dead(disk->queue);
+
 	/*
 	 * Prevent new openers by unlinked the bdev inode.
 	 */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 50c3b959da28..aaaa6fa12328 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -862,6 +862,7 @@ extern int blk_lld_busy(struct request_queue *q);
 extern int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags);
 extern void blk_queue_exit(struct request_queue *q);
 extern void blk_sync_queue(struct request_queue *q);
+void blk_queue_disk_dead(struct request_queue *q);
 
 /* Helper to convert REQ_OP_XXX to its string format XXX */
 extern const char *blk_op_str(enum req_op op);
---

> The proper fix would be to split the freezing mechanism for file system
> vs passthrough I/O, but that's going to be a huge change.

My preference would be a simpler short-term fix (cherry-picks to older
kernels are much easier this way).

* Re: block: del_gendisk() vs blk_queue_enter() race condition
From: Sergey Senozhatsky @ 2024-10-04 7:49 UTC
To: Sergey Senozhatsky; +Cc: Christoph Hellwig, Jens Axboe, linux-block, Yang Yang

On (24/10/04 16:48), Sergey Senozhatsky wrote:
[..]
> +	/*
> +	 * Fail any new I/O.
> +	 */
> +	test_bit(GD_DEAD, &disk->state);

	^^^ set bit, of course.

> +	blk_queue_disk_dead(disk->queue);
> +

* Re: block: del_gendisk() vs blk_queue_enter() race condition
From: Christoph Hellwig @ 2024-10-04 12:20 UTC
To: Sergey Senozhatsky; +Cc: Christoph Hellwig, Jens Axboe, linux-block, Yang Yang

On Fri, Oct 04, 2024 at 04:48:18PM +0900, Sergey Senozhatsky wrote:
> diff --git a/block/blk-core.c b/block/blk-core.c
> index bc5e8c5eaac9..ccd36cb5ada7 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -292,6 +292,16 @@ void blk_queue_start_drain(struct request_queue *q)
>  	wake_up_all(&q->mq_freeze_wq);
>  }
>  
> +void blk_queue_disk_dead(struct request_queue *q)
> +{
> +	struct gendisk *disk = q->disk;
> +
> +	if (WARN_ON_ONCE(!test_bit(GD_DEAD, &disk->state)))
> +		return;
> +	/* Make blk_queue_enter() reexamine the GD_DEAD flag. */
> +	wake_up_all(&q->mq_freeze_wq);
> +}

Why is this a separate helper vs just doing the wake_up_all in the
only caller that sets (with the suggested fixup anyway) GD_DEAD?

> +			   blk_queue_dying(q) ||
> +			   test_bit(GD_DEAD, &disk->state));

This needs to check for a NULL disk. And now that I'm looking at the
code a bit more this makes me worried that checking for q->disk here
sounds like a good way to hit a race with clearing it. So I fear we
need the other hack variant that sets QUEUE_FLAG_DYING unconditionally
in __blk_mark_disk_dead and then clears it again (for GD_OWNS_QUEUE
only) toward the end of del_gendisk.

* Re: block: del_gendisk() vs blk_queue_enter() race condition
From: Sergey Senozhatsky @ 2024-10-04 14:32 UTC
To: Christoph Hellwig; +Cc: Sergey Senozhatsky, Jens Axboe, linux-block, Yang Yang

On (24/10/04 05:20), Christoph Hellwig wrote:
> On Fri, Oct 04, 2024 at 04:48:18PM +0900, Sergey Senozhatsky wrote:
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index bc5e8c5eaac9..ccd36cb5ada7 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -292,6 +292,16 @@ void blk_queue_start_drain(struct request_queue *q)
> >  	wake_up_all(&q->mq_freeze_wq);
> >  }
> >  
> > +void blk_queue_disk_dead(struct request_queue *q)
> > +{
> > +	struct gendisk *disk = q->disk;
> > +
> > +	if (WARN_ON_ONCE(!test_bit(GD_DEAD, &disk->state)))
> > +		return;
> > +	/* Make blk_queue_enter() reexamine the GD_DEAD flag. */
> > +	wake_up_all(&q->mq_freeze_wq);
> > +}
>
> Why is this a separate helper vs just doing the wake_up_all in the
> only caller that sets (with the suggested fixup anyway) GD_DEAD?

It looked to me like whatever happens to ->mq_freeze_wq stays in
Las^W blk-core or blk-mq, so I added a new helper to follow suit,
IOW to not spread ->mq_freeze_wq wakeup across multiple files.

> > +			   blk_queue_dying(q) ||
> > +			   test_bit(GD_DEAD, &disk->state));
>
> This needs to check for a NULL disk.

Ack.

> And now that I'm looking at the code a bit more this makes me worried
> that checking for q->disk here sounds like a good way to hit a race with
> clearing it. So I fear we need the other hack variant that sets
> QUEUE_FLAG_DYING unconditionally in __blk_mark_disk_dead and then clears
> it again (for GD_OWNS_QUEUE only) toward the end of del_gendisk.

Hmm, setting QUEUE_FLAG_DYING unconditionally in __blk_mark_disk_dead()
implies moving it up, to the very top of del_gendisk(), before the first
time we grab ->open_mutex, because that's the issue that we are having.
Does this sound like re-introducing the previous deadlock scenario (the
one you pointed at previously) because of that "don't acquire ->open_mutex
after freezing the queue" thing?

* Re: block: del_gendisk() vs blk_queue_enter() race condition
From: Christoph Hellwig @ 2024-10-07 6:10 UTC
To: Sergey Senozhatsky; +Cc: Christoph Hellwig, Jens Axboe, linux-block, Yang Yang

On Fri, Oct 04, 2024 at 11:32:34PM +0900, Sergey Senozhatsky wrote:
> Hmm, setting QUEUE_FLAG_DYING unconditionally in __blk_mark_disk_dead()
> implies moving it up, to the very top of del_gendisk(), before the first
> time we grab ->open_mutex, because that's the issue that we are having.
> Does this sound like re-introducing the previous deadlock scenario (the
> one you pointed at previously) because of that "don't acquire ->open_mutex
> after freezing the queue" thing?

So the trace of that one is literally the same as the one you reported,
and I'm still wondering how they are related (I hope Yang Yang can
chime in).

I suspect that if we mark both the disk and queue dead early that will
error out everything and should fix it. That would also avoid the issue
with your patch in the next reply that would skip marking the disk dead
when calling blk_mark_disk_dead.

(BTW, we really need to write a big fat comment explaining how we ended
up with whatever is the final fix here for the next person touching the
code)

* Re: block: del_gendisk() vs blk_queue_enter() race condition
From: Sergey Senozhatsky @ 2024-10-07 9:45 UTC
To: Christoph Hellwig; +Cc: Sergey Senozhatsky, Jens Axboe, linux-block, Yang Yang

On (24/10/06 23:10), Christoph Hellwig wrote:
> On Fri, Oct 04, 2024 at 11:32:34PM +0900, Sergey Senozhatsky wrote:
> > Hmm, setting QUEUE_FLAG_DYING unconditionally in __blk_mark_disk_dead()
> > implies moving it up, to the very top of del_gendisk(), before the first
> > time we grab ->open_mutex, because that's the issue that we are having.
> > Does this sound like re-introducing the previous deadlock scenario (the
> > one you pointed at previously) because of that "don't acquire ->open_mutex
> > after freezing the queue" thing?
>
> So the trace of that one is literally the same as the one you reported,
> and I'm still wondering how they are related (I hope Yang Yang can
> chime in)
>
> I suspect that if we mark both the disk and queue dead
> early that will error out everything and should fix it.

Does the diff below look like something that you are thinking about?

__blk_mark_disk_dead() cannot be moved alone, we need blk_report_disk_dead()
before it. And all of these should be done before the first time we take
->open_mutex. I keep __blk_mark_disk_dead() the way it is and just forcibly
set QUEUE_FLAG_DYING right before __blk_mark_disk_dead(), so that hopefully
bio_queue_enter() can detect DYING.

---
diff --git a/block/genhd.c b/block/genhd.c
index 1c05dd4c6980..3b2a7e0f2176 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -649,6 +649,16 @@ void del_gendisk(struct gendisk *disk)
 
 	disk_del_events(disk);
 
+	/*
+	 * Tell the file system to write back all dirty data and shut down if
+	 * it hasn't been notified earlier.
+	 */
+	if (!test_bit(GD_DEAD, &disk->state))
+		blk_report_disk_dead(disk, false);
+	/* TODO: big fat comment */
+	blk_queue_flag_set(QUEUE_FLAG_DYING, disk->queue);
+	__blk_mark_disk_dead(disk);
+
 	/*
 	 * Prevent new openers by unlinked the bdev inode.
 	 */
@@ -657,18 +667,10 @@ void del_gendisk(struct gendisk *disk)
 		bdev_unhash(part);
 	mutex_unlock(&disk->open_mutex);
 
-	/*
-	 * Tell the file system to write back all dirty data and shut down if
-	 * it hasn't been notified earlier.
-	 */
-	if (!test_bit(GD_DEAD, &disk->state))
-		blk_report_disk_dead(disk, false);
-
 	/*
 	 * Drop all partitions now that the disk is marked dead.
 	 */
 	mutex_lock(&disk->open_mutex);
-	__blk_mark_disk_dead(disk);
 	xa_for_each_start(&disk->part_tbl, idx, part, 1)
 		drop_partition(part);
 	mutex_unlock(&disk->open_mutex);
@@ -714,6 +716,10 @@ void del_gendisk(struct gendisk *disk)
 	rq_qos_exit(q);
 	blk_mq_unquiesce_queue(q);
 
+	/* TODO: big fat comment */
+	if (test_bit(GD_OWNS_QUEUE, &disk->state))
+		blk_queue_flag_clear(QUEUE_FLAG_DYING, disk->queue);
+
 	/*
 	 * If the disk does not own the queue, allow using passthrough requests
 	 * again. Else leave the queue frozen to fail all I/O.

* Re: block: del_gendisk() vs blk_queue_enter() race condition
From: Sergey Senozhatsky @ 2024-10-08 5:31 UTC
To: Sergey Senozhatsky; +Cc: Christoph Hellwig, Jens Axboe, linux-block, Yang Yang

On (24/10/07 18:45), Sergey Senozhatsky wrote:
> +	/*
> +	 * Tell the file system to write back all dirty data and shut down if
> +	 * it hasn't been notified earlier.
> +	 */
> +	if (!test_bit(GD_DEAD, &disk->state))
> +		blk_report_disk_dead(disk, false);
> +	/* TODO: big fat comment */
> +	blk_queue_flag_set(QUEUE_FLAG_DYING, disk->queue);
> +	__blk_mark_disk_dead(disk);
> +
> 	/*
> 	 * Prevent new openers by unlinked the bdev inode.
> 	 */
> @@ -657,18 +667,10 @@ void del_gendisk(struct gendisk *disk)
> 		bdev_unhash(part);
> 	mutex_unlock(&disk->open_mutex);
> 
> -	/*
> -	 * Tell the file system to write back all dirty data and shut down if
> -	 * it hasn't been notified earlier.
> -	 */
> -	if (!test_bit(GD_DEAD, &disk->state))
> -		blk_report_disk_dead(disk, false);
> -
> 	/*
> 	 * Drop all partitions now that the disk is marked dead.
> 	 */
> 	mutex_lock(&disk->open_mutex);
> -	__blk_mark_disk_dead(disk);
> 	xa_for_each_start(&disk->part_tbl, idx, part, 1)
> 		drop_partition(part);
> 	mutex_unlock(&disk->open_mutex);
> @@ -714,6 +716,10 @@ void del_gendisk(struct gendisk *disk)
> 	rq_qos_exit(q);
> 	blk_mq_unquiesce_queue(q);
> 
> +	/* TODO: big fat comment */
> +	if (test_bit(GD_OWNS_QUEUE, &disk->state))

	if (!test_bit()), muppet.

> +		blk_queue_flag_clear(QUEUE_FLAG_DYING, disk->queue);

* Re: block: del_gendisk() vs blk_queue_enter() race condition
From: Sergey Senozhatsky @ 2024-10-04 14:41 UTC
To: Christoph Hellwig; +Cc: Sergey Senozhatsky, Jens Axboe, linux-block, Yang Yang

On (24/10/04 05:20), Christoph Hellwig wrote:
> This needs to check for a NULL disk. And now that I'm looking at the
> code a bit more this makes me worried that checking for q->disk here
> sounds like a good way to hit a race with clearing it. So I fear we
> need the other hack variant that sets QUEUE_FLAG_DYING unconditionally
> in __blk_mark_disk_dead and then clears it again (for GD_OWNS_QUEUE
> only) toward the end of del_gendisk.

Something like this?

---
diff --git a/block/genhd.c b/block/genhd.c
index 1c05dd4c6980..aca43c8fa4ed 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -589,9 +589,6 @@ static void __blk_mark_disk_dead(struct gendisk *disk)
 	if (test_and_set_bit(GD_DEAD, &disk->state))
 		return;
 
-	if (test_bit(GD_OWNS_QUEUE, &disk->state))
-		blk_queue_flag_set(QUEUE_FLAG_DYING, disk->queue);
-
 	/*
 	 * Stop buffered writers from dirtying pages that can't be written out.
 	 */
@@ -649,6 +646,9 @@ void del_gendisk(struct gendisk *disk)
 
 	disk_del_events(disk);
 
+	blk_queue_flag_set(QUEUE_FLAG_DYING, disk->queue);
+	wake_up_all(&disk->queue->mq_freeze_wq);
+
 	/*
 	 * Prevent new openers by unlinked the bdev inode.
 	 */
@@ -725,6 +725,9 @@ void del_gendisk(struct gendisk *disk)
 		if (queue_is_mq(q))
 			blk_mq_exit_queue(q);
 	}
+
+	if (!test_bit(GD_OWNS_QUEUE, &disk->state))
+		blk_queue_flag_clear(QUEUE_FLAG_DYING, disk->queue);
 }
 EXPORT_SYMBOL(del_gendisk);

* Re: block: del_gendisk() vs blk_queue_enter() race condition
From: Sergey Senozhatsky @ 2024-10-03 13:55 UTC
To: Christoph Hellwig; +Cc: Sergey Senozhatsky, Jens Axboe, linux-block

On (24/10/03 06:36), Christoph Hellwig wrote:
> What kernel version did you see this on?

6.6

* Re: block: del_gendisk() vs blk_queue_enter() race condition
From: YangYang @ 2024-10-08 4:02 UTC
To: Sergey Senozhatsky; +Cc: linux-block, Jens Axboe

On 2024/10/3 16:56, Sergey Senozhatsky wrote:
> Hello,
>
> I'm looking at a report from the fleet (don't have a reproducer)
> and wondering what you and block folks might think / suggest.
>
> The problem is basically as follows
>
> CPU0
>
> do_syscall
> sys_close
> __fput
> blkdev_release
> blkdev_put grabs ->open_mutex
> sr_block_release
> scsi_set_medium_removal
> ioctl_internal_command
> scsi_execute_cmd
> scsi_alloc_request
> blk_mq_alloc_request
> blk_queue_enter
> schedule
>
> at the same time:
>
> CPU1
>
> usb_disconnect
> usb_disable_device
> device_del
> usb_unbind_interface
> usb_stor_disconnect
> scsi_remove_host
> scsi_forget_host
> __scsi_remove_device
> device_del
> bus_remove_device
> device_release_driver_internal
> sr_remove
> del_gendisk
> mutex_lock attempts to grab ->open_mutex
> schedule

I'm a little confused here. How is the queue getting frozen in this
scenario? Normally, the queue should be frozen by
__blk_mark_disk_dead()->blk_queue_start_drain()->blk_freeze_queue_start(),
and this cannot occur without grabbing ->open_mutex.

670	mutex_lock(&disk->open_mutex);
671	__blk_mark_disk_dead(disk);
672	xa_for_each_start(&disk->part_tbl, idx, part, 1)
673		drop_partition(part);
674	mutex_unlock(&disk->open_mutex);

* Re: block: del_gendisk() vs blk_queue_enter() race condition
From: Sergey Senozhatsky @ 2024-10-08 5:19 UTC
To: YangYang; +Cc: Sergey Senozhatsky, linux-block, Jens Axboe

On (24/10/08 12:02), YangYang wrote:
> On 2024/10/3 16:56, Sergey Senozhatsky wrote:
> > Hello,
> >
> > I'm looking at a report from the fleet (don't have a reproducer)
> > and wondering what you and block folks might think / suggest.
> >
> > The problem is basically as follows
> >
> > CPU0
> >
> > do_syscall
> > sys_close
> > __fput
> > blkdev_release
> > blkdev_put grabs ->open_mutex
> > sr_block_release
> > scsi_set_medium_removal
> > ioctl_internal_command
> > scsi_execute_cmd
> > scsi_alloc_request
> > blk_mq_alloc_request
> > blk_queue_enter
> > schedule
> >
> > at the same time:
> >
> > CPU1
> >
> > usb_disconnect
> > usb_disable_device
> > device_del
> > usb_unbind_interface
> > usb_stor_disconnect
> > scsi_remove_host
> > scsi_forget_host
> > __scsi_remove_device
> > device_del
> > bus_remove_device
> > device_release_driver_internal
> > sr_remove
> > del_gendisk
> > mutex_lock attempts to grab ->open_mutex
> > schedule
>
> I'm a little confused here. How is the queue getting frozen in this
> scenario?

I don't know. Could it be that it's PM, not a frozen queue, that
falsifies the wait_event() condition? (if that's what you are pointing
at). I have several reports (various devices, various use-cases) and
the ones that I looked at so far have the same pattern:
usb_disconnect() vs blk_queue_enter()

E.g. one of the reports:

...
sd 1:0:0:0: [sdb] Attached SCSI removable disk
usb 3-4: USB disconnect, device number 29
sd 1:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK cmd_age=15s
sd 1:0:0:0: [sdb] tag#0 CDB: Read(10) 28 00 07 47 af fd 00 00 01 00
I/O error, dev sdb, sector 122138621 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2
device offline error, dev sdb, sector 122138616 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
Buffer I/O error on dev sdb, logical block 15267327, async page read
...

schedule+0x4f4/0x1540
del_gendisk+0x136/0x370
sd_remove+0x30/0x60
device_release_driver_internal+0x1a2/0x2a0
bus_remove_device+0x154/0x180
device_del+0x207/0x370
__scsi_remove_device+0xc0/0x170
scsi_forget_host+0x45/0x60
scsi_remove_host+0x87/0x170
usb_stor_disconnect+0x63/0xb0
usb_unbind_interface+0xbe/0x250
device_release_driver_internal+0x1a2/0x2a0
bus_remove_device+0x154/0x180
device_del+0x207/0x370
? kobject_release+0x56/0xb0
usb_disable_device+0x72/0x170
usb_disconnect+0xeb/0x280

schedule+0x4f4/0x1540
blk_queue_enter+0x172/0x250
blk_mq_alloc_request+0x167/0x210
scsi_execute_cmd+0x65/0x240
ioctl_internal_command+0x6c/0x150
scsi_set_medium_removal+0x63/0xc0
sd_release+0x42/0x50
blkdev_put+0x13b/0x1f0
blkdev_release+0x2b/0x40
__fput_sync+0x9b/0x2c0
__se_sys_close+0x69/0xc0
do_syscall_64+0x60/0x90

Or another report:

sr 1:0:0:0: Power-on or device reset occurred
sr 1:0:0:0: [sr0] scsi3-mmc drive: 8x/24x writer dvd-ram cd/rw xa/form2 cdda tray
usb 1-1.3.1: USB disconnect, device number 27

schedule+0x554/0x1218
schedule_preempt_disabled+0x30/0x50
mutex_lock+0x3c/0x70
del_gendisk+0xe8/0x370
sr_remove+0x30/0x58 [sr_mod (HASH:d5f2 4)]
device_release_driver_internal+0x1a0/0x278
device_release_driver+0x24/0x38
bus_remove_device+0x150/0x170
device_del+0x1d0/0x348
__scsi_remove_device+0xb4/0x198
scsi_forget_host+0x5c/0x80
scsi_remove_host+0x98/0x1c8
usb_stor_disconnect+0x74/0x110
usb_unbind_interface+0xcc/0x250
device_release_driver_internal+0x1a0/0x278
device_release_driver+0x24/0x38
bus_remove_device+0x150/0x170
device_del+0x1d0/0x348
usb_disable_device+0x88/0x190
usb_disconnect+0xf8/0x318

schedule+0x554/0x1218
blk_queue_enter+0xd0/0x170
blk_mq_alloc_request+0x138/0x1e8
scsi_execute_cmd+0x88/0x258
scsi_test_unit_ready+0x88/0x118
sr_drive_status+0x5c/0x160 [sr_mod (HASH:d5f2 4)]
cdrom_ioctl+0x7d4/0x2730 [cdrom (HASH:37c3 5)]
sr_block_ioctl+0xa8/0x110 [sr_mod (HASH:d5f2 4)]
blkdev_ioctl+0x468/0xbf0
__arm64_sys_ioctl+0x254/0x6d0

* Re: block: del_gendisk() vs blk_queue_enter() race condition
From: Sergey Senozhatsky @ 2024-10-08 5:26 UTC
To: YangYang; +Cc: linux-block, Jens Axboe, Sergey Senozhatsky

On (24/10/08 14:19), Sergey Senozhatsky wrote:
[..]
> Or another report:
>
> sr 1:0:0:0: Power-on or device reset occurred
> sr 1:0:0:0: [sr0] scsi3-mmc drive: 8x/24x writer dvd-ram cd/rw xa/form2 cdda tray
> usb 1-1.3.1: USB disconnect, device number 27
>
> schedule+0x554/0x1218
> schedule_preempt_disabled+0x30/0x50
> mutex_lock+0x3c/0x70
> del_gendisk+0xe8/0x370
> sr_remove+0x30/0x58 [sr_mod (HASH:d5f2 4)]
> device_release_driver_internal+0x1a0/0x278
> device_release_driver+0x24/0x38
> bus_remove_device+0x150/0x170
> device_del+0x1d0/0x348
> __scsi_remove_device+0xb4/0x198
> scsi_forget_host+0x5c/0x80
> scsi_remove_host+0x98/0x1c8
> usb_stor_disconnect+0x74/0x110
> usb_unbind_interface+0xcc/0x250
> device_release_driver_internal+0x1a0/0x278
> device_release_driver+0x24/0x38
> bus_remove_device+0x150/0x170
> device_del+0x1d0/0x348
> usb_disable_device+0x88/0x190
> usb_disconnect+0xf8/0x318
>
> schedule+0x554/0x1218
> blk_queue_enter+0xd0/0x170
> blk_mq_alloc_request+0x138/0x1e8
> scsi_execute_cmd+0x88/0x258
> scsi_test_unit_ready+0x88/0x118
> sr_drive_status+0x5c/0x160 [sr_mod (HASH:d5f2 4)]
> cdrom_ioctl+0x7d4/0x2730 [cdrom (HASH:37c3 5)]
> sr_block_ioctl+0xa8/0x110 [sr_mod (HASH:d5f2 4)]
> blkdev_ioctl+0x468/0xbf0
> __arm64_sys_ioctl+0x254/0x6d0

Didn't copy one more backtrace here, there are two mutexes involved.

schedule+0x554/0x1218
schedule_preempt_disabled+0x30/0x50
mutex_lock+0x3c/0x70
sr_block_release+0x2c/0x60 [sr_mod (HASH:d5f2 4)]
blkdev_put+0x184/0x290
blkdev_release+0x34/0x50
__fput_sync+0xa8/0x2d8
__arm64_sys_close+0x6c/0xd8
invoke_syscall+0x78/0xf0

So process A holds cd->lock and sleeps in blk_queue_enter()
process B holds ->open_mutex and sleeps on cd->lock, which is owned by A
process C sleeps on ->open_mutex, which is owned by B.

* Re: block: del_gendisk() vs blk_queue_enter() race condition
From: Christoph Hellwig @ 2024-10-08 5:56 UTC
To: Sergey Senozhatsky; +Cc: YangYang, linux-block, Jens Axboe

On Tue, Oct 08, 2024 at 02:26:17PM +0900, Sergey Senozhatsky wrote:
> Didn't copy one more backtrace here, there are two mutexes involved.
>
> schedule+0x554/0x1218
> schedule_preempt_disabled+0x30/0x50
> mutex_lock+0x3c/0x70
> sr_block_release+0x2c/0x60 [sr_mod (HASH:d5f2 4)]
> blkdev_put+0x184/0x290
> blkdev_release+0x34/0x50
> __fput_sync+0xa8/0x2d8
> __arm64_sys_close+0x6c/0xd8
> invoke_syscall+0x78/0xf0
>
> So process A holds cd->lock and sleeps in blk_queue_enter()
> process B holds ->open_mutex and sleeps on cd->lock, which is owned by A
> process C sleeps on ->open_mutex, which is owned by B.

Oh, cd->mutex is a bit of a problem. And looking into the generic
CD layer code this can be relatively easily avoided while cleaning
a lot of the code up. Give me a little time to cook something up.

I also wonder if simulating a cdrom removal might be possible using
qemu to help reproducing some of this.

* Re: block: del_gendisk() vs blk_queue_enter() race condition
From: Christoph Hellwig @ 2024-10-08 6:04 UTC
To: Sergey Senozhatsky
Cc: YangYang, linux-block, Jens Axboe, martin.petersen, linux-scsi

On Mon, Oct 07, 2024 at 10:56:31PM -0700, Christoph Hellwig wrote:
> CD layer code this can be relatively easily avoided while cleaning
> a lot of the code up. Give me a little time to cook something up.

Actually.. It might be as simple as the patch below. In addition we
should probably not do the door locking for a hot remove, but the SCSI
handling of hot removals could use some work in general.

---
From 74cf726f2f02d219778a90c7a99db7a57fb252ad Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Tue, 8 Oct 2024 08:01:17 +0200
Subject: sr: remove cd->lock

cd->lock is taken in sr_block_open, sr_block_release and sr_block_ioctl.
->open and ->release are synchronized by the block layer open_mutex, and
->ioctl can only be called on live files and thus block devices. So there
is nothing that is actually protected by this lock, but on the other hand
it causes deadlocks when hot removing sr devices due to the door locking
called from cdrom_release.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/scsi/sr.c | 14 +-------------
 drivers/scsi/sr.h |  1 -
 2 files changed, 1 insertion(+), 14 deletions(-)

diff --git a/drivers/scsi/sr.c b/drivers/scsi/sr.c
index 198bec87bb8e7c..cb89f7afc284e9 100644
--- a/drivers/scsi/sr.c
+++ b/drivers/scsi/sr.c
@@ -500,9 +500,7 @@ static int sr_block_open(struct gendisk *disk, blk_mode_t mode)
 		goto out;
 	}
 
-	mutex_lock(&cd->lock);
 	ret = cdrom_open(&cd->cdi, mode);
-	mutex_unlock(&cd->lock);
 out:
 	scsi_autopm_put_device(sdev);
 	if (ret)
@@ -514,10 +512,7 @@ static void sr_block_release(struct gendisk *disk)
 {
 	struct scsi_cd *cd = scsi_cd(disk);
 
-	mutex_lock(&cd->lock);
 	cdrom_release(&cd->cdi);
-	mutex_unlock(&cd->lock);
-
 	scsi_device_put(cd->device);
 }
 
@@ -532,12 +527,10 @@ static int sr_block_ioctl(struct block_device *bdev, blk_mode_t mode,
 	if (bdev_is_partition(bdev) && !capable(CAP_SYS_RAWIO))
 		return -ENOIOCTLCMD;
 
-	mutex_lock(&cd->lock);
-
 	ret = scsi_ioctl_block_when_processing_errors(sdev, cmd,
 			(mode & BLK_OPEN_NDELAY));
 	if (ret)
-		goto out;
+		return ret;
 
 	scsi_autopm_get_device(sdev);
 
@@ -550,8 +543,6 @@ static int sr_block_ioctl(struct block_device *bdev, blk_mode_t mode,
 
 put:
 	scsi_autopm_put_device(sdev);
-out:
-	mutex_unlock(&cd->lock);
 	return ret;
 }
 
@@ -574,7 +565,6 @@ static void sr_free_disk(struct gendisk *disk)
 	spin_unlock(&sr_index_lock);
 
 	unregister_cdrom(&cd->cdi);
-	mutex_destroy(&cd->lock);
 	kfree(cd);
 }
 
@@ -629,7 +619,6 @@ static int sr_probe(struct device *dev)
 			       &sr_bio_compl_lkclass);
 	if (!disk)
 		goto fail_free;
-	mutex_init(&cd->lock);
 
 	spin_lock(&sr_index_lock);
 	minor = find_first_zero_bit(sr_index_bits, SR_DISKS);
@@ -710,7 +699,6 @@ static int sr_probe(struct device *dev)
 	spin_unlock(&sr_index_lock);
 fail_put:
 	put_disk(disk);
-	mutex_destroy(&cd->lock);
 fail_free:
 	kfree(cd);
 fail:
diff --git a/drivers/scsi/sr.h b/drivers/scsi/sr.h
index dc899277b3a441..98e881775aa591 100644
--- a/drivers/scsi/sr.h
+++ b/drivers/scsi/sr.h
@@ -49,7 +49,6 @@ typedef struct scsi_cd {
 	bool ignore_get_event:1;	/* GET_EVENT is unreliable, use TUR */
 
 	struct cdrom_device_info cdi;
-	struct mutex lock;
 	struct gendisk *disk;
 } Scsi_CD;
 
-- 
2.45.2

* Re: block: del_gendisk() vs blk_queue_enter() race condition
From: Sergey Senozhatsky @ 2024-10-08 6:10 UTC
To: Christoph Hellwig; +Cc: Sergey Senozhatsky, YangYang, linux-block, Jens Axboe

On (24/10/07 22:56), Christoph Hellwig wrote:
> On Tue, Oct 08, 2024 at 02:26:17PM +0900, Sergey Senozhatsky wrote:
> > Didn't copy one more backtrace here, there are two mutexes involved.
> >
> > schedule+0x554/0x1218
> > schedule_preempt_disabled+0x30/0x50
> > mutex_lock+0x3c/0x70
> > sr_block_release+0x2c/0x60 [sr_mod (HASH:d5f2 4)]
> > blkdev_put+0x184/0x290
> > blkdev_release+0x34/0x50
> > __fput_sync+0xa8/0x2d8
> > __arm64_sys_close+0x6c/0xd8
> > invoke_syscall+0x78/0xf0
> >
> > So process A holds cd->lock and sleeps in blk_queue_enter()
> > process B holds ->open_mutex and sleeps on cd->lock, which is owned by A
> > process C sleeps on ->open_mutex, which is owned by B.
>
> Oh, cd->mutex is a bit of a problem. And looking into the generic
> CD layer code this can be relatively easily avoided while cleaning
> a lot of the code up. Give me a little time to cook something up.

Sure, thanks. I can't test the patch, tho. At least not yet.

CD layer is in several reports, I also have reports with SD, and a
bunch of reports that I still have to look at. E.g.

schedule
blk_queue_enter
blk_mq_alloc_request
scsi_execute_cmd
ioctl_internal_command
scsi_set_medium_removal
sd_release
blkdev_put

cd->lock still falls a victim of
"blk_queue_enter() and blk_queue_start_drain() are both called under ->open_mutex"
thingy, which seems like a primary problem here. No matter why
blk_queue_enter() sleeps, draining under ->open_mutex, given that what we
want to drain can hold ->open_mutex, sometimes isn't going to drain.

> I also wonder if simulating a cdrom removal might be possible using
> qemu to help reproducing some of this.

Hmm, that's an interesting idea. I've only tried to "unsafely" remove
a USB stick out of my laptop so far, with no success.

* Re: block: del_gendisk() vs blk_queue_enter() race condition
From: Christoph Hellwig @ 2024-10-08 8:13 UTC
To: Sergey Senozhatsky; +Cc: Christoph Hellwig, YangYang, linux-block, Jens Axboe

On Tue, Oct 08, 2024 at 03:10:53PM +0900, Sergey Senozhatsky wrote:
> cd->lock still falls a victim of
> "blk_queue_enter() and blk_queue_start_drain() are both called under ->open_mutex"
> thingy, which seems like a primary problem here. No matter why
> blk_queue_enter() sleeps, draining under ->open_mutex, given that what we
> want to drain can hold ->open_mutex, sometimes isn't going to drain.

Yes. So I think we'll need to move __blk_mark_disk_dead out of
->open_mutex again, it also isn't protected when calling
blk_mark_disk_dead, but we'll also need to stop the SCSI LLDs from
submitting new commands from their ->release routines. Let me cook up
a little series..

* Re: block: del_gendisk() vs blk_queue_enter() race condition
From: Sergey Senozhatsky @ 2024-10-08 8:20 UTC
To: Christoph Hellwig; +Cc: Sergey Senozhatsky, YangYang, linux-block, Jens Axboe

On (24/10/08 01:13), Christoph Hellwig wrote:
> On Tue, Oct 08, 2024 at 03:10:53PM +0900, Sergey Senozhatsky wrote:
> > cd->lock still falls a victim of
> > "blk_queue_enter() and blk_queue_start_drain() are both called under ->open_mutex"
> > thingy, which seems like a primary problem here. No matter why
> > blk_queue_enter() sleeps, draining under ->open_mutex, given that what we
> > want to drain can hold ->open_mutex, sometimes isn't going to drain.
>
> Yes. So I think we'll need to move __blk_mark_disk_dead out
> of ->open_mutex again

Right, we also need to make sure that we drain before we try to lock
->open_mutex in del_gendisk() for the first time.

> it also isn't protected when calling blk_mark_disk_dead, but we'll
> also need to stop the SCSI LLDs from submitting new commands from
> their ->release routines. Let me cook up a little series..

Thank you!