* block: del_gendisk() vs blk_queue_enter() race condition
From: Sergey Senozhatsky @ 2024-10-03 8:56 UTC
To: Jens Axboe; +Cc: linux-block, Sergey Senozhatsky

Hello,
I'm looking at a report from the fleet (don't have a reproducer)
and wondering what you and block folks might think / suggest.
The problem is basically as follows
CPU0
do_syscall
sys_close
__fput
blkdev_release
blkdev_put grabs ->open_mutex
sr_block_release
scsi_set_medium_removal
ioctl_internal_command
scsi_execute_cmd
scsi_alloc_request
blk_mq_alloc_request
blk_queue_enter
schedule
at the same time:
CPU1
usb_disconnect
usb_disable_device
device_del
usb_unbind_interface
usb_stor_disconnect
scsi_remove_host
scsi_forget_host
__scsi_remove_device
device_del
bus_remove_device
device_release_driver_internal
sr_remove
del_gendisk
mutex_lock attempts to grab ->open_mutex
schedule
blk_queue_enter() sleeps forever, under ->open_mutex, there is no
way for it to be woken up and to detect blk_queue_dying(). del_gendisk()
sleeps forever because it attempts to grab ->open_mutex before it calls
__blk_mark_disk_dead(), which would mark the queue QUEUE_FLAG_DYING and
wake up ->mq_freeze_wq (which is blk_queue_enter() in this case).
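For reference, the wait that CPU0 ends up stuck in is the wait loop in
blk_queue_enter(); roughly (a simplified sketch of the 6.6 code, NOWAIT
handling and memory barriers trimmed):

        wait_event(q->mq_freeze_wq,
                   (!q->mq_freeze_depth &&
                    blk_pm_resume_queue(pm, q)) ||
                   blk_queue_dying(q));
        if (blk_queue_dying(q))
                return -ENODEV;

so it cannot make progress until someone wakes ->mq_freeze_wq with the
queue either unfrozen or marked dying.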
I wonder how to fix it. My current "patch" is to set QUEUE_FLAG_DYING
and "kick" ->mq_freeze_wq early on in del_gendisk(), before it attempts
to grab ->open_mutex for the first time.
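Roughly something like the below -- an untested sketch just to illustrate
the idea, the exact placement is the open question:

        /* in del_gendisk(), before the first mutex_lock(&disk->open_mutex) */
        blk_queue_flag_set(QUEUE_FLAG_DYING, disk->queue);
        /* let the sleeping blk_queue_enter() re-check blk_queue_dying() */
        wake_up_all(&disk->queue->mq_freeze_wq);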
Any suggestions?

* Re: block: del_gendisk() vs blk_queue_enter() race condition
From: Christoph Hellwig @ 2024-10-03 13:36 UTC
To: Sergey Senozhatsky; +Cc: Jens Axboe, linux-block

On Thu, Oct 03, 2024 at 05:56:10PM +0900, Sergey Senozhatsky wrote:
> blk_queue_enter() sleeps forever, under ->open_mutex, there is no
> way for it to be woken up and to detect blk_queue_dying(). del_gendisk()
> sleeps forever because it attempts to grab ->open_mutex before it calls
> __blk_mark_disk_dead(), which would mark the queue QUEUE_FLAG_DYING and
> wake up ->mq_freeze_wq (which is blk_queue_enter() in this case).
>
> I wonder how to fix it. My current "patch" is to set QUEUE_FLAG_DYING
> and "kick" ->mq_freeze_wq early on in del_gendisk(), before it attempts
> to grab ->open_mutex for the first time.

We split blk_queue_enter further to distinguish between file system
requests and passthrough ones.

The file system request should be using bio_queue_enter, which only
checks GD_DEAD, instead of QUEUE_FLAG_DYING. Passthrough requests like
the cdrom door lock are using blk_queue_enter that checks QUEUE_FLAG_DYING
which only gets set in blk_mq_destroy_queue.

So AFAICS your trace should not happen with the current kernel, but
probably could happen with older stable version unless I'm missing
something. What kernel version did you see this on?

* Re: block: del_gendisk() vs blk_queue_enter() race condition
From: Christoph Hellwig @ 2024-10-03 13:43 UTC
To: Sergey Senozhatsky; +Cc: Jens Axboe, linux-block

On Thu, Oct 03, 2024 at 06:36:20AM -0700, Christoph Hellwig wrote:
> On Thu, Oct 03, 2024 at 05:56:10PM +0900, Sergey Senozhatsky wrote:
> > blk_queue_enter() sleeps forever, under ->open_mutex, there is no
> > way for it to be woken up and to detect blk_queue_dying(). del_gendisk()
> > sleeps forever because it attempts to grab ->open_mutex before it calls
> > __blk_mark_disk_dead(), which would mark the queue QUEUE_FLAG_DYING and
> > wake up ->mq_freeze_wq (which is blk_queue_enter() in this case).
> >
> > I wonder how to fix it. My current "patch" is to set QUEUE_FLAG_DYING
> > and "kick" ->mq_freeze_wq early on in del_gendisk(), before it attempts
> > to grab ->open_mutex for the first time.
>
> We split blk_queue_enter further to distinguish between file system
> requests and passthrough ones.
>
> The file system request should be using bio_queue_enter, which only
> checks GD_DEAD, instead of QUEUE_FLAG_DYING. Passthrough requests like
> the cdrom door lock are using blk_queue_enter that checks QUEUE_FLAG_DYING
> which only gets set in blk_mq_destroy_queue.
>
> So AFAICS your trace should not happen with the current kernel, but
> probably could happen with older stable version unless I'm missing
> something. What kernel version did you see this on?

.. actually, we still clear QUEUE_FLAG_DYING early. Something like
the patch below (plus proper comments) should sort it out:

diff --git a/block/genhd.c b/block/genhd.c
index 1c05dd4c6980b5..9a1e18fbb136cf 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -589,9 +589,6 @@ static void __blk_mark_disk_dead(struct gendisk *disk)
 	if (test_and_set_bit(GD_DEAD, &disk->state))
 		return;
 
-	if (test_bit(GD_OWNS_QUEUE, &disk->state))
-		blk_queue_flag_set(QUEUE_FLAG_DYING, disk->queue);
-
 	/*
 	 * Stop buffered writers from dirtying pages that can't be written out.
 	 */
@@ -673,6 +670,9 @@ void del_gendisk(struct gendisk *disk)
 		drop_partition(part);
 	mutex_unlock(&disk->open_mutex);
 
+	if (test_bit(GD_OWNS_QUEUE, &disk->state))
+		blk_queue_flag_set(QUEUE_FLAG_DYING, disk->queue);
+
 	if (!(disk->flags & GENHD_FL_HIDDEN)) {
 		sysfs_remove_link(&disk_to_dev(disk)->kobj, "bdi");

* Re: block: del_gendisk() vs blk_queue_enter() race condition
From: Sergey Senozhatsky @ 2024-10-03 14:00 UTC
To: Christoph Hellwig; +Cc: Sergey Senozhatsky, Jens Axboe, linux-block

On (24/10/03 06:43), Christoph Hellwig wrote:
> .. actually, we still clear QUEUE_FLAG_DYING early. Something like
> the patch below (plus proper comments) should sort it out:
>
> diff --git a/block/genhd.c b/block/genhd.c
> index 1c05dd4c6980b5..9a1e18fbb136cf 100644
> --- a/block/genhd.c
> +++ b/block/genhd.c
> @@ -589,9 +589,6 @@ static void __blk_mark_disk_dead(struct gendisk *disk)
>  	if (test_and_set_bit(GD_DEAD, &disk->state))
>  		return;
>  
> -	if (test_bit(GD_OWNS_QUEUE, &disk->state))
> -		blk_queue_flag_set(QUEUE_FLAG_DYING, disk->queue);
> -
>  	/*
>  	 * Stop buffered writers from dirtying pages that can't be written out.
>  	 */
> @@ -673,6 +670,9 @@ void del_gendisk(struct gendisk *disk)
>  		drop_partition(part);
>  	mutex_unlock(&disk->open_mutex);
>  
> +	if (test_bit(GD_OWNS_QUEUE, &disk->state))
> +		blk_queue_flag_set(QUEUE_FLAG_DYING, disk->queue);

So that mutex_lock(&disk->open_mutex) right before it can potentially
deadlock (I think it will).

My idea, thus far, was to

	if (test_bit(GD_OWNS_QUEUE, &disk->state)) {
		blk_queue_flag_set(QUEUE_FLAG_DYING, disk->queue);
		blk_kick_queue_enter(disk->queue); // this one simply wake_up_all(&q->mq_freeze_wq);
						   // if the queue has QUEUE_FLAG_DYING
	}

in del_gendisk() before the very first time del_gendisk() attempts to
mutex_lock(&disk->open_mutex), because that mutex is already locked
forever.

>  	if (!(disk->flags & GENHD_FL_HIDDEN)) {
>  		sysfs_remove_link(&disk_to_dev(disk)->kobj, "bdi");

* Re: block: del_gendisk() vs blk_queue_enter() race condition
From: Sergey Senozhatsky @ 2024-10-03 14:17 UTC
To: Sergey Senozhatsky; +Cc: Christoph Hellwig, Jens Axboe, linux-block

On (24/10/03 23:00), Sergey Senozhatsky wrote:
> On (24/10/03 06:43), Christoph Hellwig wrote:
[..]
> So that mutex_lock(&disk->open_mutex) right before it can potentially
> deadlock (I think it will).
>
> My idea, thus far, was to
>
> 	if (test_bit(GD_OWNS_QUEUE, &disk->state)) {
> 		blk_queue_flag_set(QUEUE_FLAG_DYING, disk->queue);
> 		blk_kick_queue_enter(disk->queue); // this one simply wake_up_all(&q->mq_freeze_wq);
> 						   // if the queue has QUEUE_FLAG_DYING
> 	}
>
> in del_gendisk() before the very first time del_gendisk() attempts to
> mutex_lock(&disk->open_mutex), because that mutex is already locked
> forever.

Well, just in case, here's the diff that I have (against 6.6):

---
diff --git a/block/blk-core.c b/block/blk-core.c
index 4f25d2c..470c910 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -304,6 +304,13 @@
 	wake_up_all(&q->mq_freeze_wq);
 }
 
+void blk_kick_queue_enter(struct request_queue *q)
+{
+	if (WARN_ON_ONCE(!blk_queue_dying(q)))
+		return;
+	wake_up_all(&q->mq_freeze_wq);
+}
+
 /**
  * blk_queue_enter() - try to increase q->q_usage_counter
  * @q: request queue pointer
diff --git a/block/genhd.c b/block/genhd.c
index 203c880..3ccc593 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -581,9 +581,6 @@
 	if (test_and_set_bit(GD_DEAD, &disk->state))
 		return;
 
-	if (test_bit(GD_OWNS_QUEUE, &disk->state))
-		blk_queue_flag_set(QUEUE_FLAG_DYING, disk->queue);
-
 	/*
 	 * Stop buffered writers from dirtying pages that can't be written out.
 	 */
@@ -641,6 +638,20 @@
 
 	disk_del_events(disk);
 
+	if (test_bit(GD_OWNS_QUEUE, &disk->state)) {
+		/*
+		 * Set QUEUE_FLAG_DYING before we grab ->open_mutex so that
+		 * blkdev_put() -> release -> blk_queue_enter() can detect
+		 * dead device
+		 */
+		blk_queue_flag_set(QUEUE_FLAG_DYING, disk->queue);
+		/*
+		 * Make sure that ->mq_freeze_wq see QUEUE_FLAG_DYING and
+		 * bail out, unlocking ->open_mutex
+		 */
+		blk_kick_queue_enter(disk->queue);
+	}
+
 	/*
 	 * Prevent new openers by unlinked the bdev inode.
 	 */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 6f67dbe..a49afe9 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -854,6 +854,7 @@
 extern int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags);
 extern void blk_queue_exit(struct request_queue *q);
 extern void blk_sync_queue(struct request_queue *q);
+void blk_kick_queue_enter(struct request_queue *q);
 
 /* Helper to convert REQ_OP_XXX to its string format XXX */
 extern const char *blk_op_str(enum req_op op);

* Re: block: del_gendisk() vs blk_queue_enter() race condition
From: Sergey Senozhatsky @ 2024-10-04 4:21 UTC
To: Christoph Hellwig, Jens Axboe; +Cc: linux-block, Sergey Senozhatsky

On (24/10/03 23:17), Sergey Senozhatsky wrote:
> On (24/10/03 23:00), Sergey Senozhatsky wrote:
> > On (24/10/03 06:43), Christoph Hellwig wrote:
> [..]
> > So that mutex_lock(&disk->open_mutex) right before it can potentially
> > deadlock (I think it will).
> >
> > My idea, thus far, was to
> >
> > 	if (test_bit(GD_OWNS_QUEUE, &disk->state)) {
> > 		blk_queue_flag_set(QUEUE_FLAG_DYING, disk->queue);
> > 		blk_kick_queue_enter(disk->queue); // this one simply wake_up_all(&q->mq_freeze_wq);
> > 						   // if the queue has QUEUE_FLAG_DYING
> > 	}
> >
> > in del_gendisk() before the very first time del_gendisk() attempts to
> > mutex_lock(&disk->open_mutex), because that mutex is already locked
> > forever.

Dunno. Is something like this completely silly?

---
diff --git a/block/genhd.c b/block/genhd.c
index 1c05dd4c6980..c968b04ccc7c 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -649,6 +649,13 @@ void del_gendisk(struct gendisk *disk)
 
 	disk_del_events(disk);
 
+	/*
+	 * Mark the queue QUEUE_FLAG_DYING and wakeup ->mq_freeze_wq so
+	 * that the waiters (e.g. blk_queue_enter()) can see blk_queue_dying()
+	 * and error out, unlocking the ->open_mutex.
+	 */
+	__blk_mark_disk_dead(disk);
+
 	/*
 	 * Prevent new openers by unlinked the bdev inode.
 	 */
@@ -668,7 +675,6 @@ void del_gendisk(struct gendisk *disk)
 	 * Drop all partitions now that the disk is marked dead.
 	 */
 	mutex_lock(&disk->open_mutex);
-	__blk_mark_disk_dead(disk);
 	xa_for_each_start(&disk->part_tbl, idx, part, 1)
 		drop_partition(part);
 	mutex_unlock(&disk->open_mutex);

* Re: block: del_gendisk() vs blk_queue_enter() race condition
From: Christoph Hellwig @ 2024-10-04 6:45 UTC
To: Sergey Senozhatsky; +Cc: Christoph Hellwig, Jens Axboe, linux-block, Yang Yang

On Fri, Oct 04, 2024 at 01:21:27PM +0900, Sergey Senozhatsky wrote:
> Dunno. Is something like this completely silly?

__blk_mark_disk_dead got moved into the lock by: 7e04da2dc701
("block: fix deadlock between sd_remove & sd_release"), which has a trace
that looks very similar to the one you reported.

And that commit also points out something I missed - we do not set
QUEUE_FLAG_DYING here because the gendisk does not own the queue for
SCSI. Because of that allocating the request in sd/sr will not fail, and
it will deadlock.

So I think the short term fix is to also fail passthrough requests here -
either by clearing and resurrecting QUEUE_FLAG_DYING or by also checking
q->disk for GD_DEAD if it exists. Both of these are a bit ugly because
they will fail passthrough through /dev/sg during the removal which is
unexpected (although probably not happening for usual workloads).

The proper fix would be to split the freezing mechanism for file system
vs passthrough I/O, but that's going to be a huge change.

* Re: block: del_gendisk() vs blk_queue_enter() race condition
From: Sergey Senozhatsky @ 2024-10-04 7:48 UTC
To: Christoph Hellwig; +Cc: Sergey Senozhatsky, Jens Axboe, linux-block, Yang Yang

On (24/10/03 23:45), Christoph Hellwig wrote:
> Date: Thu, 3 Oct 2024 23:45:10 -0700
> From: Christoph Hellwig <hch@infradead.org>
> To: Sergey Senozhatsky <senozhatsky@chromium.org>
> Cc: Christoph Hellwig <hch@infradead.org>, Jens Axboe <axboe@kernel.dk>,
> 	linux-block@vger.kernel.org, Yang Yang <yang.yang@vivo.com>
> Subject: Re: block: del_gendisk() vs blk_queue_enter() race condition
> Message-ID: <Zv-O9tldIzPfD8ju@infradead.org>
>
> On Fri, Oct 04, 2024 at 01:21:27PM +0900, Sergey Senozhatsky wrote:
> > Dunno. Is something like this completely silly?
>
> __blk_mark_disk_dead got moved into the lock by: 7e04da2dc701
> ("block: fix deadlock between sd_remove & sd_release"), which has a trace
> that looks very similar to the one you reported.

Hmm, okay, a deadlock one way or another.

> And that commit also points out something I missed - we do not set
> QUEUE_FLAG_DYING here because the gendisk does not own the queue for
> SCSI. Because of that allocating the request in sd/sr will not fail, and
> it will deadlock.

I see. Thanks for the pointers.

> So I think the short term fix is to also fail passthrough requests here -
> either by clearing and resurrecting QUEUE_FLAG_DYING or by also checking
> q->disk for GD_DEAD if it exists. Both of these are a bit ugly because
> they will fail passthrough through /dev/sg during the removal which is
> unexpected (although probably not happening for usual workloads).

You are way ahead of me. Does the below diff look like
"checking for GD_DEAD"?

---
diff --git a/block/blk-core.c b/block/blk-core.c
index bc5e8c5eaac9..ccd36cb5ada7 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -292,6 +292,16 @@ void blk_queue_start_drain(struct request_queue *q)
 	wake_up_all(&q->mq_freeze_wq);
 }
 
+void blk_queue_disk_dead(struct request_queue *q)
+{
+	struct gendisk *disk = q->disk;
+
+	if (WARN_ON_ONCE(!test_bit(GD_DEAD, &disk->state)))
+		return;
+	/* Make blk_queue_enter() reexamine the GD_DEAD flag. */
+	wake_up_all(&q->mq_freeze_wq);
+}
+
 /**
  * blk_queue_enter() - try to increase q->q_usage_counter
  * @q: request queue pointer
@@ -302,6 +312,8 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
 	const bool pm = flags & BLK_MQ_REQ_PM;
 
 	while (!blk_try_enter_queue(q, pm)) {
+		struct gendisk *disk = q->disk;
+
 		if (flags & BLK_MQ_REQ_NOWAIT)
 			return -EAGAIN;
 
@@ -316,8 +328,9 @@ int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
 		wait_event(q->mq_freeze_wq,
 			   (!q->mq_freeze_depth &&
 			    blk_pm_resume_queue(pm, q)) ||
-			   blk_queue_dying(q));
-		if (blk_queue_dying(q))
+			   blk_queue_dying(q) ||
+			   test_bit(GD_DEAD, &disk->state));
+		if (blk_queue_dying(q) || test_bit(GD_DEAD, &disk->state))
 			return -ENODEV;
 	}
 
diff --git a/block/genhd.c b/block/genhd.c
index 1c05dd4c6980..c213a0cf8268 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -583,12 +583,6 @@ static void blk_report_disk_dead(struct gendisk *disk, bool surprise)
 
 static void __blk_mark_disk_dead(struct gendisk *disk)
 {
-	/*
-	 * Fail any new I/O.
-	 */
-	if (test_and_set_bit(GD_DEAD, &disk->state))
-		return;
-
 	if (test_bit(GD_OWNS_QUEUE, &disk->state))
 		blk_queue_flag_set(QUEUE_FLAG_DYING, disk->queue);
 
@@ -649,6 +643,12 @@ void del_gendisk(struct gendisk *disk)
 
 	disk_del_events(disk);
 
+	/*
+	 * Fail any new I/O.
+	 */
+	test_bit(GD_DEAD, &disk->state);
+	blk_queue_disk_dead(disk->queue);
+
 	/*
 	 * Prevent new openers by unlinked the bdev inode.
 	 */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 50c3b959da28..aaaa6fa12328 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -862,6 +862,7 @@ extern int blk_lld_busy(struct request_queue *q);
 extern int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags);
 extern void blk_queue_exit(struct request_queue *q);
 extern void blk_sync_queue(struct request_queue *q);
+void blk_queue_disk_dead(struct request_queue *q);
 
 /* Helper to convert REQ_OP_XXX to its string format XXX */
 extern const char *blk_op_str(enum req_op op);
---

> The proper fix would be to split the freezing mechanism for file system
> vs passthrough I/O, but that's going to be a huge change.

My preference would be a simpler short-term fix (cherry-picks to older
kernels are much easier this way).

* Re: block: del_gendisk() vs blk_queue_enter() race condition
From: Sergey Senozhatsky @ 2024-10-04 7:49 UTC
To: Sergey Senozhatsky; +Cc: Christoph Hellwig, Jens Axboe, linux-block, Yang Yang

On (24/10/04 16:48), Sergey Senozhatsky wrote:
[..]
> +	/*
> +	 * Fail any new I/O.
> +	 */
> +	test_bit(GD_DEAD, &disk->state);

	^^^ set bit, of course.

> +	blk_queue_disk_dead(disk->queue);
> +

* Re: block: del_gendisk() vs blk_queue_enter() race condition
From: Christoph Hellwig @ 2024-10-04 12:20 UTC
To: Sergey Senozhatsky; +Cc: Christoph Hellwig, Jens Axboe, linux-block, Yang Yang

On Fri, Oct 04, 2024 at 04:48:18PM +0900, Sergey Senozhatsky wrote:
> diff --git a/block/blk-core.c b/block/blk-core.c
> index bc5e8c5eaac9..ccd36cb5ada7 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -292,6 +292,16 @@ void blk_queue_start_drain(struct request_queue *q)
>  	wake_up_all(&q->mq_freeze_wq);
>  }
>  
> +void blk_queue_disk_dead(struct request_queue *q)
> +{
> +	struct gendisk *disk = q->disk;
> +
> +	if (WARN_ON_ONCE(!test_bit(GD_DEAD, &disk->state)))
> +		return;
> +	/* Make blk_queue_enter() reexamine the GD_DEAD flag. */
> +	wake_up_all(&q->mq_freeze_wq);
> +}

Why is this a separate helper vs just doing the wake_up_all in the
only caller that sets (with the suggested fixup anyway) GD_DEAD?

> +			   blk_queue_dying(q) ||
> +			   test_bit(GD_DEAD, &disk->state));

This needs to check for a NULL disk. And now that I'm looking at the
code a bit more this makes me worried that checking for q->disk here
sounds like a good way to hit a race with clearing it. So I fear we
need the other hack variant that sets QUEUE_FLAG_DYING unconditionally
in __blk_mark_disk_dead and then clears it again (for GD_OWNS_QUEUE
only) toward the end of del_gendisk.

* Re: block: del_gendisk() vs blk_queue_enter() race condition
From: Sergey Senozhatsky @ 2024-10-04 14:32 UTC
To: Christoph Hellwig; +Cc: Sergey Senozhatsky, Jens Axboe, linux-block, Yang Yang

On (24/10/04 05:20), Christoph Hellwig wrote:
> On Fri, Oct 04, 2024 at 04:48:18PM +0900, Sergey Senozhatsky wrote:
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index bc5e8c5eaac9..ccd36cb5ada7 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -292,6 +292,16 @@ void blk_queue_start_drain(struct request_queue *q)
> >  	wake_up_all(&q->mq_freeze_wq);
> >  }
> >  
> > +void blk_queue_disk_dead(struct request_queue *q)
> > +{
> > +	struct gendisk *disk = q->disk;
> > +
> > +	if (WARN_ON_ONCE(!test_bit(GD_DEAD, &disk->state)))
> > +		return;
> > +	/* Make blk_queue_enter() reexamine the GD_DEAD flag. */
> > +	wake_up_all(&q->mq_freeze_wq);
> > +}
>
> Why is this a separate helper vs just doing the wake_up_all in the
> only caller that sets (with the suggested fixup anyway) GD_DEAD?

It looked to me like whatever happens to ->mq_freeze_wq stays in
Las^W blk-core or blk-mq, so I added a new helper to follow suit,
IOW to not spread ->mq_freeze_wq wakeup across multiple files.

> > +			   blk_queue_dying(q) ||
> > +			   test_bit(GD_DEAD, &disk->state));
>
> This needs to check for a NULL disk.

Ack.

> And now that I'm looking at the code a bit more this makes me worried
> that checking for q->disk here sounds like a good way to hit a race with
> clearing it. So I fear we need the other hack variant that sets
> QUEUE_FLAG_DYING unconditionally in __blk_mark_disk_dead and then clears
> it again (for GD_OWNS_QUEUE only) toward the end of del_gendisk.

Hmm, setting QUEUE_FLAG_DYING unconditionally in __blk_mark_disk_dead()
implies moving it up, to the very top of del_gendisk(), before the first
time we grab ->open_mutex, because that's the issue that we are having.
Does this sound like re-introducing the previous deadlock scenario (the
one you pointed at previously) because of that "don't acquire ->open_mutex
after freezing the queue" thing?

* Re: block: del_gendisk() vs blk_queue_enter() race condition
From: Christoph Hellwig @ 2024-10-07 6:10 UTC
To: Sergey Senozhatsky; +Cc: Christoph Hellwig, Jens Axboe, linux-block, Yang Yang

On Fri, Oct 04, 2024 at 11:32:34PM +0900, Sergey Senozhatsky wrote:
> Hmm, setting QUEUE_FLAG_DYING unconditionally in __blk_mark_disk_dead()
> implies moving it up, to the very top of del_gendisk(), before the first
> time we grab ->open_mutex, because that's the issue that we are having.
> Does this sound like re-introducing the previous deadlock scenario (the
> one you pointed at previously) because of that "don't acquire ->open_mutex
> after freezing the queue" thing?

So the trace of that one is literally the same as the one you reported,
and I'm still wondering how they are related (I hope Yang Yang can
chime in).

I suspect that if we mark both the disk and queue dead early that will
error out everything and should fix it. That would also avoid the issue
with your patch in the next reply that would skip marking the disk dead
when calling blk_mark_disk_dead.

(BTW, we really need to write a big fat comment explaining how we ended
up with whatever is the final fix here for the next person touching the
code)

* Re: block: del_gendisk() vs blk_queue_enter() race condition
From: Sergey Senozhatsky @ 2024-10-07 9:45 UTC
To: Christoph Hellwig; +Cc: Sergey Senozhatsky, Jens Axboe, linux-block, Yang Yang

On (24/10/06 23:10), Christoph Hellwig wrote:
> On Fri, Oct 04, 2024 at 11:32:34PM +0900, Sergey Senozhatsky wrote:
> > Hmm, setting QUEUE_FLAG_DYING unconditionally in __blk_mark_disk_dead()
> > implies moving it up, to the very top of del_gendisk(), before the first
> > time we grab ->open_mutex, because that's the issue that we are having.
> > Does this sound like re-introducing the previous deadlock scenario (the
> > one you pointed at previously) because of that "don't acquire ->open_mutex
> > after freezing the queue" thing?
>
> So the trace of that one is literally the same as the one you reported,
> and I'm still wondering how they are related (I hope Yang Yang can
> chime in)
>
> I suspect that if we mark both the disk and queue dead
> early that will error out everything and should fix it.

Does the diff below look like something that you are thinking about?

__blk_mark_disk_dead() cannot be moved alone, we need blk_report_disk_dead()
before it. And all of these should be done before the first time we take
->open_mutex. I keep __blk_mark_disk_dead() the way it is and just forcibly
set QUEUE_FLAG_DYING right before __blk_mark_disk_dead(), so that hopefully
bio_queue_enter() can detect DYING.

---
diff --git a/block/genhd.c b/block/genhd.c
index 1c05dd4c6980..3b2a7e0f2176 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -649,6 +649,16 @@ void del_gendisk(struct gendisk *disk)
 
 	disk_del_events(disk);
 
+	/*
+	 * Tell the file system to write back all dirty data and shut down if
+	 * it hasn't been notified earlier.
+	 */
+	if (!test_bit(GD_DEAD, &disk->state))
+		blk_report_disk_dead(disk, false);
+	/* TODO: big fat comment */
+	blk_queue_flag_set(QUEUE_FLAG_DYING, disk->queue);
+	__blk_mark_disk_dead(disk);
+
 	/*
 	 * Prevent new openers by unlinked the bdev inode.
 	 */
@@ -657,18 +667,10 @@ void del_gendisk(struct gendisk *disk)
 		bdev_unhash(part);
 	mutex_unlock(&disk->open_mutex);
 
-	/*
-	 * Tell the file system to write back all dirty data and shut down if
-	 * it hasn't been notified earlier.
-	 */
-	if (!test_bit(GD_DEAD, &disk->state))
-		blk_report_disk_dead(disk, false);
-
 	/*
 	 * Drop all partitions now that the disk is marked dead.
 	 */
 	mutex_lock(&disk->open_mutex);
-	__blk_mark_disk_dead(disk);
 	xa_for_each_start(&disk->part_tbl, idx, part, 1)
 		drop_partition(part);
 	mutex_unlock(&disk->open_mutex);
@@ -714,6 +716,10 @@ void del_gendisk(struct gendisk *disk)
 	rq_qos_exit(q);
 	blk_mq_unquiesce_queue(q);
 
+	/* TODO: big fat comment */
+	if (test_bit(GD_OWNS_QUEUE, &disk->state))
+		blk_queue_flag_clear(QUEUE_FLAG_DYING, disk->queue);
+
 	/*
 	 * If the disk does not own the queue, allow using passthrough requests
 	 * again. Else leave the queue frozen to fail all I/O.

* Re: block: del_gendisk() vs blk_queue_enter() race condition
From: Sergey Senozhatsky @ 2024-10-08 5:31 UTC
To: Sergey Senozhatsky; +Cc: Christoph Hellwig, Jens Axboe, linux-block, Yang Yang

On (24/10/07 18:45), Sergey Senozhatsky wrote:
> +	/*
> +	 * Tell the file system to write back all dirty data and shut down if
> +	 * it hasn't been notified earlier.
> +	 */
> +	if (!test_bit(GD_DEAD, &disk->state))
> +		blk_report_disk_dead(disk, false);
> +	/* TODO: big fat comment */
> +	blk_queue_flag_set(QUEUE_FLAG_DYING, disk->queue);
> +	__blk_mark_disk_dead(disk);
> +
> 	/*
> 	 * Prevent new openers by unlinked the bdev inode.
> 	 */
> @@ -657,18 +667,10 @@ void del_gendisk(struct gendisk *disk)
> 		bdev_unhash(part);
> 	mutex_unlock(&disk->open_mutex);
> 
> -	/*
> -	 * Tell the file system to write back all dirty data and shut down if
> -	 * it hasn't been notified earlier.
> -	 */
> -	if (!test_bit(GD_DEAD, &disk->state))
> -		blk_report_disk_dead(disk, false);
> -
> 	/*
> 	 * Drop all partitions now that the disk is marked dead.
> 	 */
> 	mutex_lock(&disk->open_mutex);
> -	__blk_mark_disk_dead(disk);
> 	xa_for_each_start(&disk->part_tbl, idx, part, 1)
> 		drop_partition(part);
> 	mutex_unlock(&disk->open_mutex);
> @@ -714,6 +716,10 @@ void del_gendisk(struct gendisk *disk)
> 	rq_qos_exit(q);
> 	blk_mq_unquiesce_queue(q);
> 
> +	/* TODO: big fat comment */
> +	if (test_bit(GD_OWNS_QUEUE, &disk->state))

	if (!test_bit()), muppet.

> +		blk_queue_flag_clear(QUEUE_FLAG_DYING, disk->queue);

* Re: block: del_gendisk() vs blk_queue_enter() race condition
From: Sergey Senozhatsky @ 2024-10-04 14:41 UTC
To: Christoph Hellwig; +Cc: Sergey Senozhatsky, Jens Axboe, linux-block, Yang Yang

On (24/10/04 05:20), Christoph Hellwig wrote:
> This needs to check for a NULL disk. And now that I'm looking at the
> code a bit more this makes me worried that checking for q->disk here
> sounds like a good way to hit a race with clearing it. So I fear we
> need the other hack variant that sets QUEUE_FLAG_DYING unconditionally
> in __blk_mark_disk_dead and then clears it again (for GD_OWNS_QUEUE
> only) toward the end of del_gendisk.

Something like this?

---
diff --git a/block/genhd.c b/block/genhd.c
index 1c05dd4c6980..aca43c8fa4ed 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -589,9 +589,6 @@ static void __blk_mark_disk_dead(struct gendisk *disk)
 	if (test_and_set_bit(GD_DEAD, &disk->state))
 		return;
 
-	if (test_bit(GD_OWNS_QUEUE, &disk->state))
-		blk_queue_flag_set(QUEUE_FLAG_DYING, disk->queue);
-
 	/*
 	 * Stop buffered writers from dirtying pages that can't be written out.
 	 */
@@ -649,6 +646,9 @@ void del_gendisk(struct gendisk *disk)
 
 	disk_del_events(disk);
 
+	blk_queue_flag_set(QUEUE_FLAG_DYING, disk->queue);
+	wake_up_all(&disk->queue->mq_freeze_wq);
+
 	/*
 	 * Prevent new openers by unlinked the bdev inode.
 	 */
@@ -725,6 +725,9 @@ void del_gendisk(struct gendisk *disk)
 		if (queue_is_mq(q))
 			blk_mq_exit_queue(q);
 	}
+
+	if (!test_bit(GD_OWNS_QUEUE, &disk->state))
+		blk_queue_flag_clear(QUEUE_FLAG_DYING, disk->queue);
 }
 EXPORT_SYMBOL(del_gendisk);

* Re: block: del_gendisk() vs blk_queue_enter() race condition
From: Sergey Senozhatsky @ 2024-10-03 13:55 UTC
To: Christoph Hellwig; +Cc: Sergey Senozhatsky, Jens Axboe, linux-block

On (24/10/03 06:36), Christoph Hellwig wrote:
> What kernel version did you see this on?

6.6

* Re: block: del_gendisk() vs blk_queue_enter() race condition
From: YangYang @ 2024-10-08 4:02 UTC
To: Sergey Senozhatsky; +Cc: linux-block, Jens Axboe

On 2024/10/3 16:56, Sergey Senozhatsky wrote:
> Hello,
>
> I'm looking at a report from the fleet (don't have a reproducer)
> and wondering what you and block folks might think / suggest.
>
> The problem is basically as follows
>
> CPU0
>
> do_syscall
> sys_close
> __fput
> blkdev_release
> blkdev_put grabs ->open_mutex
> sr_block_release
> scsi_set_medium_removal
> ioctl_internal_command
> scsi_execute_cmd
> scsi_alloc_request
> blk_mq_alloc_request
> blk_queue_enter
> schedule
>
> at the same time:
>
> CPU1
>
> usb_disconnect
> usb_disable_device
> device_del
> usb_unbind_interface
> usb_stor_disconnect
> scsi_remove_host
> scsi_forget_host
> __scsi_remove_device
> device_del
> bus_remove_device
> device_release_driver_internal
> sr_remove
> del_gendisk
> mutex_lock attempts to grab ->open_mutex
> schedule

I'm a little confused here. How is the queue getting frozen in this
scenario? Normally, the queue should be frozen by
__blk_mark_disk_dead()->blk_queue_start_drain()->blk_freeze_queue_start(),
and this cannot occur without grabbing ->open_mutex.

670	mutex_lock(&disk->open_mutex);
671	__blk_mark_disk_dead(disk);
672	xa_for_each_start(&disk->part_tbl, idx, part, 1)
673		drop_partition(part);
674	mutex_unlock(&disk->open_mutex);

* Re: block: del_gendisk() vs blk_queue_enter() race condition
From: Sergey Senozhatsky @ 2024-10-08 5:19 UTC
To: YangYang; +Cc: Sergey Senozhatsky, linux-block, Jens Axboe

On (24/10/08 12:02), YangYang wrote:
> On 2024/10/3 16:56, Sergey Senozhatsky wrote:
> > Hello,
> >
> > I'm looking at a report from the fleet (don't have a reproducer)
> > and wondering what you and block folks might think / suggest.
> >
> > The problem is basically as follows
> >
> > CPU0
> >
> > do_syscall
> > sys_close
> > __fput
> > blkdev_release
> > blkdev_put grabs ->open_mutex
> > sr_block_release
> > scsi_set_medium_removal
> > ioctl_internal_command
> > scsi_execute_cmd
> > scsi_alloc_request
> > blk_mq_alloc_request
> > blk_queue_enter
> > schedule
> >
> > at the same time:
> >
> > CPU1
> >
> > usb_disconnect
> > usb_disable_device
> > device_del
> > usb_unbind_interface
> > usb_stor_disconnect
> > scsi_remove_host
> > scsi_forget_host
> > __scsi_remove_device
> > device_del
> > bus_remove_device
> > device_release_driver_internal
> > sr_remove
> > del_gendisk
> > mutex_lock attempts to grab ->open_mutex
> > schedule
>
> I'm a little confused here. How is the queue getting frozen in this
> scenario?

I don't know. Could it be that it's PM, not a frozen queue, that
falsifies the wait_event() condition? (if that's what you are pointing
at). I have several reports (various devices, various use-cases) and
the ones that I looked at so far have the same pattern:
usb_disconnect() vs blk_queue_enter()

E.g. one of the reports:

...
sd 1:0:0:0: [sdb] Attached SCSI removable disk
usb 3-4: USB disconnect, device number 29
sd 1:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK cmd_age=15s
sd 1:0:0:0: [sdb] tag#0 CDB: Read(10) 28 00 07 47 af fd 00 00 01 00
I/O error, dev sdb, sector 122138621 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2
device offline error, dev sdb, sector 122138616 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 2
Buffer I/O error on dev sdb, logical block 15267327, async page read
...

schedule+0x4f4/0x1540
del_gendisk+0x136/0x370
sd_remove+0x30/0x60
device_release_driver_internal+0x1a2/0x2a0
bus_remove_device+0x154/0x180
device_del+0x207/0x370
__scsi_remove_device+0xc0/0x170
scsi_forget_host+0x45/0x60
scsi_remove_host+0x87/0x170
usb_stor_disconnect+0x63/0xb0
usb_unbind_interface+0xbe/0x250
device_release_driver_internal+0x1a2/0x2a0
bus_remove_device+0x154/0x180
device_del+0x207/0x370
? kobject_release+0x56/0xb0
usb_disable_device+0x72/0x170
usb_disconnect+0xeb/0x280

schedule+0x4f4/0x1540
blk_queue_enter+0x172/0x250
blk_mq_alloc_request+0x167/0x210
scsi_execute_cmd+0x65/0x240
ioctl_internal_command+0x6c/0x150
scsi_set_medium_removal+0x63/0xc0
sd_release+0x42/0x50
blkdev_put+0x13b/0x1f0
blkdev_release+0x2b/0x40
__fput_sync+0x9b/0x2c0
__se_sys_close+0x69/0xc0
do_syscall_64+0x60/0x90

Or another report:

sr 1:0:0:0: Power-on or device reset occurred
sr 1:0:0:0: [sr0] scsi3-mmc drive: 8x/24x writer dvd-ram cd/rw xa/form2 cdda tray
usb 1-1.3.1: USB disconnect, device number 27

schedule+0x554/0x1218
schedule_preempt_disabled+0x30/0x50
mutex_lock+0x3c/0x70
del_gendisk+0xe8/0x370
sr_remove+0x30/0x58 [sr_mod (HASH:d5f2 4)]
device_release_driver_internal+0x1a0/0x278
device_release_driver+0x24/0x38
bus_remove_device+0x150/0x170
device_del+0x1d0/0x348
__scsi_remove_device+0xb4/0x198
scsi_forget_host+0x5c/0x80
scsi_remove_host+0x98/0x1c8
usb_stor_disconnect+0x74/0x110
usb_unbind_interface+0xcc/0x250
device_release_driver_internal+0x1a0/0x278
device_release_driver+0x24/0x38
bus_remove_device+0x150/0x170
device_del+0x1d0/0x348
usb_disable_device+0x88/0x190
usb_disconnect+0xf8/0x318

schedule+0x554/0x1218
blk_queue_enter+0xd0/0x170
blk_mq_alloc_request+0x138/0x1e8
scsi_execute_cmd+0x88/0x258
scsi_test_unit_ready+0x88/0x118
sr_drive_status+0x5c/0x160 [sr_mod (HASH:d5f2 4)]
cdrom_ioctl+0x7d4/0x2730 [cdrom (HASH:37c3 5)]
sr_block_ioctl+0xa8/0x110 [sr_mod (HASH:d5f2 4)]
blkdev_ioctl+0x468/0xbf0
__arm64_sys_ioctl+0x254/0x6d0

* Re: block: del_gendisk() vs blk_queue_enter() race condition
From: Sergey Senozhatsky @ 2024-10-08 5:26 UTC
To: YangYang; +Cc: linux-block, Jens Axboe, Sergey Senozhatsky

On (24/10/08 14:19), Sergey Senozhatsky wrote:
[..]
> Or another report:
>
> sr 1:0:0:0: Power-on or device reset occurred
> sr 1:0:0:0: [sr0] scsi3-mmc drive: 8x/24x writer dvd-ram cd/rw xa/form2 cdda tray
> usb 1-1.3.1: USB disconnect, device number 27
>
> schedule+0x554/0x1218
> schedule_preempt_disabled+0x30/0x50
> mutex_lock+0x3c/0x70
> del_gendisk+0xe8/0x370
> sr_remove+0x30/0x58 [sr_mod (HASH:d5f2 4)]
> device_release_driver_internal+0x1a0/0x278
> device_release_driver+0x24/0x38
> bus_remove_device+0x150/0x170
> device_del+0x1d0/0x348
> __scsi_remove_device+0xb4/0x198
> scsi_forget_host+0x5c/0x80
> scsi_remove_host+0x98/0x1c8
> usb_stor_disconnect+0x74/0x110
> usb_unbind_interface+0xcc/0x250
> device_release_driver_internal+0x1a0/0x278
> device_release_driver+0x24/0x38
> bus_remove_device+0x150/0x170
> device_del+0x1d0/0x348
> usb_disable_device+0x88/0x190
> usb_disconnect+0xf8/0x318
>
> schedule+0x554/0x1218
> blk_queue_enter+0xd0/0x170
> blk_mq_alloc_request+0x138/0x1e8
> scsi_execute_cmd+0x88/0x258
> scsi_test_unit_ready+0x88/0x118
> sr_drive_status+0x5c/0x160 [sr_mod (HASH:d5f2 4)]
> cdrom_ioctl+0x7d4/0x2730 [cdrom (HASH:37c3 5)]
> sr_block_ioctl+0xa8/0x110 [sr_mod (HASH:d5f2 4)]
> blkdev_ioctl+0x468/0xbf0
> __arm64_sys_ioctl+0x254/0x6d0

Didn't copy one more backtrace here, there are two mutexes involved.

schedule+0x554/0x1218
schedule_preempt_disabled+0x30/0x50
mutex_lock+0x3c/0x70
sr_block_release+0x2c/0x60 [sr_mod (HASH:d5f2 4)]
blkdev_put+0x184/0x290
blkdev_release+0x34/0x50
__fput_sync+0xa8/0x2d8
__arm64_sys_close+0x6c/0xd8
invoke_syscall+0x78/0xf0

So process A holds cd->lock and sleeps in blk_queue_enter()
process B holds ->open_mutex and sleeps on cd->lock, which is owned by A
process C sleeps on ->open_mutex, which is owned by B.

* Re: block: del_gendisk() vs blk_queue_enter() race condition
From: Christoph Hellwig @ 2024-10-08 5:56 UTC
To: Sergey Senozhatsky; +Cc: YangYang, linux-block, Jens Axboe

On Tue, Oct 08, 2024 at 02:26:17PM +0900, Sergey Senozhatsky wrote:
> Didn't copy one more backtrace here, there are two mutexes involved.
>
> schedule+0x554/0x1218
> schedule_preempt_disabled+0x30/0x50
> mutex_lock+0x3c/0x70
> sr_block_release+0x2c/0x60 [sr_mod (HASH:d5f2 4)]
> blkdev_put+0x184/0x290
> blkdev_release+0x34/0x50
> __fput_sync+0xa8/0x2d8
> __arm64_sys_close+0x6c/0xd8
> invoke_syscall+0x78/0xf0
>
> So process A holds cd->lock and sleeps in blk_queue_enter()
> process B holds ->open_mutex and sleeps on cd->lock, which is owned by A
> process C sleeps on ->open_mutex, which is owned by B.

Oh, cd->mutex is a bit of a problem. And looking into the generic
CD layer code this can be relatively easily avoided while cleaning
a lot of the code up. Give me a little time to cook something up.

I also wonder if simulating a cdrom removal might be possible using
qemu to help reproducing some of this.

* Re: block: del_gendisk() vs blk_queue_enter() race condition
From: Christoph Hellwig @ 2024-10-08 6:04 UTC
To: Sergey Senozhatsky
Cc: YangYang, linux-block, Jens Axboe, martin.petersen, linux-scsi

On Mon, Oct 07, 2024 at 10:56:31PM -0700, Christoph Hellwig wrote:
> CD layer code this can be relatively easily avoided while cleaning
> a lot of the code up. Give me a little time to cook something up.

Actually.. It might be as simple as the patch below. In addition we
should probably not do the door locking for a hot remove, but the SCSI
handling of hot removals could use some work in general.

---
From 74cf726f2f02d219778a90c7a99db7a57fb252ad Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Tue, 8 Oct 2024 08:01:17 +0200
Subject: sr: remove cd->lock

cd->lock is taken in sr_block_open, sr_block_release and sr_block_ioctl.
->open and ->release are synchronized by the block layer open_mutex, and
->ioctl can only be called on live files and thus block devices. So there
is nothing that is actually protected by this lock, but on the other hand
it causes deadlocks when hot removing sr devices due to the door locking
called from cdrom_release.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 drivers/scsi/sr.c | 14 +-------------
 drivers/scsi/sr.h |  1 -
 2 files changed, 1 insertion(+), 14 deletions(-)

diff --git a/drivers/scsi/sr.c b/drivers/scsi/sr.c
index 198bec87bb8e7c..cb89f7afc284e9 100644
--- a/drivers/scsi/sr.c
+++ b/drivers/scsi/sr.c
@@ -500,9 +500,7 @@ static int sr_block_open(struct gendisk *disk, blk_mode_t mode)
 		goto out;
 	}
 
-	mutex_lock(&cd->lock);
 	ret = cdrom_open(&cd->cdi, mode);
-	mutex_unlock(&cd->lock);
 out:
 	scsi_autopm_put_device(sdev);
 	if (ret)
@@ -514,10 +512,7 @@ static void sr_block_release(struct gendisk *disk)
 {
 	struct scsi_cd *cd = scsi_cd(disk);
 
-	mutex_lock(&cd->lock);
 	cdrom_release(&cd->cdi);
-	mutex_unlock(&cd->lock);
-
 	scsi_device_put(cd->device);
 }
 
@@ -532,12 +527,10 @@ static int sr_block_ioctl(struct block_device *bdev, blk_mode_t mode,
 	if (bdev_is_partition(bdev) && !capable(CAP_SYS_RAWIO))
 		return -ENOIOCTLCMD;
 
-	mutex_lock(&cd->lock);
-
 	ret = scsi_ioctl_block_when_processing_errors(sdev, cmd,
 			(mode & BLK_OPEN_NDELAY));
 	if (ret)
-		goto out;
+		return ret;
 
 	scsi_autopm_get_device(sdev);
 
@@ -550,8 +543,6 @@ static int sr_block_ioctl(struct block_device *bdev, blk_mode_t mode,
 
 put:
 	scsi_autopm_put_device(sdev);
-out:
-	mutex_unlock(&cd->lock);
 	return ret;
 }
 
@@ -574,7 +565,6 @@ static void sr_free_disk(struct gendisk *disk)
 	spin_unlock(&sr_index_lock);
 
 	unregister_cdrom(&cd->cdi);
-	mutex_destroy(&cd->lock);
 	kfree(cd);
 }
 
@@ -629,7 +619,6 @@ static int sr_probe(struct device *dev)
 			       &sr_bio_compl_lkclass);
 	if (!disk)
 		goto fail_free;
-	mutex_init(&cd->lock);
 
 	spin_lock(&sr_index_lock);
 	minor = find_first_zero_bit(sr_index_bits, SR_DISKS);
@@ -710,7 +699,6 @@ static int sr_probe(struct device *dev)
 	spin_unlock(&sr_index_lock);
 fail_put:
 	put_disk(disk);
-	mutex_destroy(&cd->lock);
 fail_free:
 	kfree(cd);
 fail:
diff --git a/drivers/scsi/sr.h b/drivers/scsi/sr.h
index dc899277b3a441..98e881775aa591 100644
--- a/drivers/scsi/sr.h
+++ b/drivers/scsi/sr.h
@@ -49,7 +49,6 @@ typedef struct scsi_cd {
 	bool ignore_get_event:1;	/* GET_EVENT is unreliable, use TUR */
 
 	struct cdrom_device_info cdi;
-	struct mutex lock;
 	struct gendisk *disk;
 } Scsi_CD;
 
-- 
2.45.2

* Re: block: del_gendisk() vs blk_queue_enter() race condition
From: Sergey Senozhatsky @ 2024-10-08 6:10 UTC
To: Christoph Hellwig; +Cc: Sergey Senozhatsky, YangYang, linux-block, Jens Axboe

On (24/10/07 22:56), Christoph Hellwig wrote:
> On Tue, Oct 08, 2024 at 02:26:17PM +0900, Sergey Senozhatsky wrote:
> > Didn't copy one more backtrace here, there are two mutexes involved.
> >
> > schedule+0x554/0x1218
> > schedule_preempt_disabled+0x30/0x50
> > mutex_lock+0x3c/0x70
> > sr_block_release+0x2c/0x60 [sr_mod (HASH:d5f2 4)]
> > blkdev_put+0x184/0x290
> > blkdev_release+0x34/0x50
> > __fput_sync+0xa8/0x2d8
> > __arm64_sys_close+0x6c/0xd8
> > invoke_syscall+0x78/0xf0
> >
> > So process A holds cd->lock and sleeps in blk_queue_enter()
> > process B holds ->open_mutex and sleeps on cd->lock, which is owned by A
> > process C sleeps on ->open_mutex, which is owned by B.
>
> Oh, cd->mutex is a bit of a problem. And looking into the generic
> CD layer code this can be relatively easily avoided while cleaning
> a lot of the code up. Give me a little time to cook something up.

Sure, thanks. I can't test the patch, tho. At least not yet.

CD layer is in several reports, I also have reports with SD, and a
bunch of reports that I still have to look at. E.g.

schedule
blk_queue_enter
blk_mq_alloc_request
scsi_execute_cmd
ioctl_internal_command
scsi_set_medium_removal
sd_release
blkdev_put

cd->lock still falls a victim of
"blk_queue_enter() and blk_queue_start_drain() are both called under ->open_mutex"
thingy, which seems like a primary problem here. No matter why
blk_queue_enter() sleeps, draining under ->open_mutex, given that what we
want to drain can hold ->open_mutex, sometimes isn't going to drain.

> I also wonder if simulating a cdrom removal might be possible using
> qemu to help reproducing some of this.

Hmm, that's an interesting idea. I've only tried to "unsafely" remove
a USB stick out of my laptop so far, with no success.

* Re: block: del_gendisk() vs blk_queue_enter() race condition
From: Christoph Hellwig @ 2024-10-08 8:13 UTC
To: Sergey Senozhatsky; +Cc: Christoph Hellwig, YangYang, linux-block, Jens Axboe

On Tue, Oct 08, 2024 at 03:10:53PM +0900, Sergey Senozhatsky wrote:
> cd->lock still falls a victim of
> "blk_queue_enter() and blk_queue_start_drain() are both called under ->open_mutex"
> thingy, which seems like a primary problem here. No matter why
> blk_queue_enter() sleeps, draining under ->open_mutex, given that what we
> want to drain can hold ->open_mutex, sometimes isn't going to drain.

Yes. So I think we'll need to move __blk_mark_disk_dead out of
->open_mutex again, it also isn't protected when calling
blk_mark_disk_dead, but we'll also need to stop the SCSI LLDs from
submitting new commands from their ->release routines. Let me cook up
a little series..

* Re: block: del_gendisk() vs blk_queue_enter() race condition
From: Sergey Senozhatsky @ 2024-10-08 8:20 UTC
To: Christoph Hellwig; +Cc: Sergey Senozhatsky, YangYang, linux-block, Jens Axboe

On (24/10/08 01:13), Christoph Hellwig wrote:
> On Tue, Oct 08, 2024 at 03:10:53PM +0900, Sergey Senozhatsky wrote:
> > cd->lock still falls a victim of
> > "blk_queue_enter() and blk_queue_start_drain() are both called under ->open_mutex"
> > thingy, which seems like a primary problem here. No matter why
> > blk_queue_enter() sleeps, draining under ->open_mutex, given that what we
> > want to drain can hold ->open_mutex, sometimes isn't going to drain.
>
> Yes. So I think we'll need to move __blk_mark_disk_dead out
> of ->open_mutex again

Right, we also need to make sure that we drain before we try to lock
->open_mutex in del_gendisk() for the first time.

> it also isn't protected when calling blk_mark_disk_dead, but we'll
> also need to stop the SCSI LLDs from submitting new commands from
> their ->release routines. Let me cook up a little series..

Thank you!