* [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order [not found] <20250514202937.2058598-1-bvanassche@acm.org> @ 2025-05-14 20:29 ` Bart Van Assche 2025-05-15 7:19 ` Niklas Cassel 2025-05-16 4:47 ` Christoph Hellwig 2025-05-14 20:29 ` [PATCH 2/2] block: Fix a deadlock related freezing zoned storage devices Bart Van Assche 1 sibling, 2 replies; 14+ messages in thread From: Bart Van Assche @ 2025-05-14 20:29 UTC (permalink / raw) To: Jens Axboe Cc: linux-block, Christoph Hellwig, Bart Van Assche, Damien Le Moal, Yu Kuai, Ming Lei, stable submit_bio() may be called recursively. To limit the stack depth, recursive calls result in bios being added to a list (current->bio_list). __submit_bio_noacct() sets up that list and maintains two lists with requests: * bio_list_on_stack[0] is the list with bios submitted by recursive submit_bio() calls from inside the latest __submit_bio() call. * bio_list_on_stack[1] is the list with bios submitted by recursive submit_bio() calls from inside previous __submit_bio() calls. Make sure that bios are submitted to lower devices in the order these have been submitted by submit_bio() by adding new bios at the end of the list instead of at the front. This patch fixes unaligned write errors that I encountered with F2FS submitting zoned writes to a dm driver stacked on top of a zoned UFS device. Cc: Christoph Hellwig <hch@lst.de> Cc: Damien Le Moal <dlemoal@kernel.org> Cc: Yu Kuai <yukuai1@huaweicloud.com> Cc: Ming Lei <ming.lei@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Bart Van Assche <bvanassche@acm.org> --- block/blk-core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/block/blk-core.c b/block/blk-core.c index b862c66018f2..4b728fa1c138 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -704,9 +704,9 @@ static void __submit_bio_noacct(struct bio *bio) /* * Now assemble so we handle the lowest level first. */ + bio_list_on_stack[0] = bio_list_on_stack[1]; bio_list_merge(&bio_list_on_stack[0], &lower); bio_list_merge(&bio_list_on_stack[0], &same); - bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]); } while ((bio = bio_list_pop(&bio_list_on_stack[0]))); current->bio_list = NULL; ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order 2025-05-14 20:29 ` [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order Bart Van Assche @ 2025-05-15 7:19 ` Niklas Cassel 2025-05-15 15:58 ` Bart Van Assche 2025-05-16 4:47 ` Christoph Hellwig 1 sibling, 1 reply; 14+ messages in thread From: Niklas Cassel @ 2025-05-15 7:19 UTC (permalink / raw) To: Bart Van Assche Cc: Jens Axboe, linux-block, Christoph Hellwig, Damien Le Moal, Yu Kuai, Ming Lei, stable Hello Bart, On Wed, May 14, 2025 at 01:29:36PM -0700, Bart Van Assche wrote: > submit_bio() may be called recursively. To limit the stack depth, recursive > calls result in bios being added to a list (current->bio_list). > __submit_bio_noacct() sets up that list and maintains two lists with > requests: > * bio_list_on_stack[0] is the list with bios submitted by recursive > submit_bio() calls from inside the latest __submit_bio() call. > * bio_list_on_stack[1] is the list with bios submitted by recursive > submit_bio() calls from inside previous __submit_bio() calls. > > Make sure that bios are submitted to lower devices in the order these > have been submitted by submit_bio() by adding new bios at the end of the > list instead of at the front. > > This patch fixes unaligned write errors that I encountered with F2FS > submitting zoned writes to a dm driver stacked on top of a zoned UFS > device. > > Cc: Christoph Hellwig <hch@lst.de> > Cc: Damien Le Moal <dlemoal@kernel.org> > Cc: Yu Kuai <yukuai1@huaweicloud.com> > Cc: Ming Lei <ming.lei@redhat.com> > Cc: stable@vger.kernel.org Here you add stable to Cc, but you don't specify either 1) a minimum version e.g. stable@vger.kernel.org # v6.8+ or 2) a Fixes tag. Kind regards, Niklas ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order 2025-05-15 7:19 ` Niklas Cassel @ 2025-05-15 15:58 ` Bart Van Assche 0 siblings, 0 replies; 14+ messages in thread From: Bart Van Assche @ 2025-05-15 15:58 UTC (permalink / raw) To: Niklas Cassel, NeilBrown Cc: Jens Axboe, linux-block, Christoph Hellwig, Damien Le Moal, Yu Kuai, Ming Lei, stable On 5/15/25 12:19 AM, Niklas Cassel wrote: > Hello Bart, > > On Wed, May 14, 2025 at 01:29:36PM -0700, Bart Van Assche wrote: >> submit_bio() may be called recursively. To limit the stack depth, recursive >> calls result in bios being added to a list (current->bio_list). >> __submit_bio_noacct() sets up that list and maintains two lists with >> requests: >> * bio_list_on_stack[0] is the list with bios submitted by recursive >> submit_bio() calls from inside the latest __submit_bio() call. >> * bio_list_on_stack[1] is the list with bios submitted by recursive >> submit_bio() calls from inside previous __submit_bio() calls. >> >> Make sure that bios are submitted to lower devices in the order these >> have been submitted by submit_bio() by adding new bios at the end of the >> list instead of at the front. >> >> This patch fixes unaligned write errors that I encountered with F2FS >> submitting zoned writes to a dm driver stacked on top of a zoned UFS >> device. >> >> Cc: Christoph Hellwig <hch@lst.de> >> Cc: Damien Le Moal <dlemoal@kernel.org> >> Cc: Yu Kuai <yukuai1@huaweicloud.com> >> Cc: Ming Lei <ming.lei@redhat.com> >> Cc: stable@vger.kernel.org > > Here you add stable to Cc, but you don't specify either > 1) a minimum version e.g. > stable@vger.kernel.org # v6.8+ > or > 2) a Fixes tag. Hi Niklas, Let's add the following to this patch: Fixes: 79bd99596b73 ("blk: improve order of bio handling in generic_make_request()") Neil, since that commit was authored by you: the commit message is elaborate but the names of the drivers that needed that commit have not been mentioned. Which drivers needed that change? Additionally, can you please help with reviewing this patch: https://lore.kernel.org/linux-block/20250514202937.2058598-2-bvanassche@acm.org/ Thanks, Bart. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order 2025-05-14 20:29 ` [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order Bart Van Assche 2025-05-15 7:19 ` Niklas Cassel @ 2025-05-16 4:47 ` Christoph Hellwig 2025-05-19 22:12 ` Bart Van Assche 1 sibling, 1 reply; 14+ messages in thread From: Christoph Hellwig @ 2025-05-16 4:47 UTC (permalink / raw) To: Bart Van Assche Cc: Jens Axboe, linux-block, Christoph Hellwig, Damien Le Moal, Yu Kuai, Ming Lei, stable On Wed, May 14, 2025 at 01:29:36PM -0700, Bart Van Assche wrote: > /* > * Now assemble so we handle the lowest level first. > */ > + bio_list_on_stack[0] = bio_list_on_stack[1]; > bio_list_merge(&bio_list_on_stack[0], &lower); > bio_list_merge(&bio_list_on_stack[0], &same); > - bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]); If I read this code correctly, this means that we no keep processing bios that already were on bio_list_on_stack[0] and the beginning of the loop in the next iteration(s) instead of finishing off the ones created by this iteration, which could lead to exhaustion of resources like mempool. Note that this is a big if - the code is really hard to read, it should really grow a data structure for the on-stack list that has named members for both lists instead of the array magic.. :( I'm still trying to understand your problem given that it wasn't described much. What I could think it is that bio_split_to_limits through bio_submit_split first re-submits the remainder bio using submit_bio_noacct, which the above should place on the same list and then later the stacking block drivers also submit the bio split off at the beginning, unlike blk-mq drivers that process it directly. But given that this resubmission should be on the lower list above I don't see how it causes problems. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order 2025-05-16 4:47 ` Christoph Hellwig @ 2025-05-19 22:12 ` Bart Van Assche 2025-05-20 13:56 ` Christoph Hellwig 0 siblings, 1 reply; 14+ messages in thread From: Bart Van Assche @ 2025-05-19 22:12 UTC (permalink / raw) To: Christoph Hellwig Cc: Jens Axboe, linux-block, Damien Le Moal, Yu Kuai, Ming Lei, stable On 5/15/25 9:47 PM, Christoph Hellwig wrote: > On Wed, May 14, 2025 at 01:29:36PM -0700, Bart Van Assche wrote: >> /* >> * Now assemble so we handle the lowest level first. >> */ >> + bio_list_on_stack[0] = bio_list_on_stack[1]; >> bio_list_merge(&bio_list_on_stack[0], &lower); >> bio_list_merge(&bio_list_on_stack[0], &same); >> - bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]); > > If I read this code correctly, this means that we no keep processing bios > that already were on bio_list_on_stack[0] and the beginning of the loop > in the next iteration(s) instead of finishing off the ones created by > this iteration, which could lead to exhaustion of resources like mempool. > > Note that this is a big if - the code is really hard to read, it should > really grow a data structure for the on-stack list that has named members > for both lists instead of the array magic.. :( > > I'm still trying to understand your problem given that it wasn't > described much. What I could think it is that bio_split_to_limits through > bio_submit_split first re-submits the remainder bio using > submit_bio_noacct, which the above should place on the same list and then > later the stacking block drivers also submit the bio split off at the > beginning, unlike blk-mq drivers that process it directly. But given > that this resubmission should be on the lower list above I don't > see how it causes problems. Agreed that this should be root-caused. To my own frustration I do not yet have a full root-cause analysis. What I have done to obtain more information is to make the kernel issue a warning the first time a bio is added out-of-order at the end of the bio list. The following output appeared (sde is the zoned block device at the bottom of the stack): [ 71.312492][ T1] bio_list_insert_sorted: inserting in the middle of a bio list [ 71.313483][ T1] print_bio_list(sde) sector 0x1b7520 size 0x10 [ 71.313034][ T1] bio_list_insert_sorted(sde) sector 0x1b7120 size 0x400 [ ... ] [ 71.368117][ T163] WARNING: CPU: 4 PID: 163 at block/blk-core.c:725 bio_list_insert_sorted+0x144/0x18c [ 71.386664][ T163] Workqueue: writeback wb_workfn (flush-253:49) [ 71.387110][ T163] pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--) [ 71.393772][ T163] Call trace: [ 71.393988][ T163] bio_list_insert_sorted+0x144/0x18c [ 71.394338][ T163] submit_bio_noacct_nocheck+0xd8/0x4f4 [ 71.394696][ T163] submit_bio_noacct+0x32c/0x50c [ 71.395017][ T163] bio_submit_split+0xf0/0x1f8 [ 71.395349][ T163] bio_split_rw+0xdc/0xf0 [ 71.395631][ T163] blk_mq_submit_bio+0x320/0x940 [ 71.395970][ T163] __submit_bio+0xa4/0x1c4 [ 71.396260][ T163] submit_bio_noacct_nocheck+0x1c0/0x4f4 [ 71.396623][ T163] submit_bio_noacct+0x32c/0x50c [ 71.396942][ T163] submit_bio+0x17c/0x198 [ 71.397222][ T163] f2fs_submit_write_bio+0x94/0x154 [ 71.397604][ T163] __submit_merged_bio+0x80/0x204 [ 71.397933][ T163] __submit_merged_write_cond+0xd0/0x1fc [ 71.398297][ T163] f2fs_submit_merged_write+0x24/0x30 [ 71.398646][ T163] f2fs_sync_node_pages+0x5ec/0x64c [ 71.398999][ T163] f2fs_write_node_pages+0xe8/0x1dc [ 71.399338][ T163] do_writepages+0xe4/0x2f8 [ 71.399673][ T163] __writeback_single_inode+0x84/0x6e4 [ 71.400036][ T163] writeback_sb_inodes+0x2cc/0x5c0 [ 71.400369][ T163] wb_writeback+0x134/0x550 [ 71.400662][ T163] wb_workfn+0x154/0x588 [ 71.400937][ T163] process_one_work+0x26c/0x65c [ 71.401271][ T163] worker_thread+0x33c/0x498 [ 71.401575][ T163] kthread+0x110/0x134 [ 71.401844][ T163] ret_from_fork+0x10/0x20 I think that the above call stack indicates the following: f2fs_submit_write_bio() submits a bio to a dm driver, that the dm driver submitted a bio for the lower driver (SCSI core), that the bio for the lower driver is split by bio_split_rw(), and that the second half of the split bio triggers the above out-of-order warning. This new patch should address the concerns brought up in your latest email: diff --git a/block/blk-core.c b/block/blk-core.c index 411f005e6b1f..aa270588272a 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -649,6 +649,26 @@ static void __submit_bio(struct bio *bio) blk_finish_plug(&plug); } +/* + * Insert a bio in LBA order. If no bio for the same bdev with a higher LBA is + * found, append at the end. + */ +static void bio_list_insert_sorted(struct bio_list *bl, struct bio *bio) +{ + struct block_device *bdev = bio->bi_bdev; + struct bio **pprev = &bl->head, *next; + sector_t sector = bio->bi_iter.bi_sector; + + for (next = *pprev; next; pprev = &next->bi_next, next = next->bi_next) + if (next->bi_bdev == bdev && sector < next->bi_iter.bi_sector) + break; + + bio->bi_next = next; + *pprev = bio; + if (!next) + bl->tail = bio; +} + /* * The loop in this function may be a bit non-obvious, and so deserves some * explanation: @@ -706,7 +726,8 @@ static void __submit_bio_noacct(struct bio *bio) */ bio_list_merge(&bio_list_on_stack[0], &lower); bio_list_merge(&bio_list_on_stack[0], &same); - bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]); + while ((bio = bio_list_pop(&bio_list_on_stack[1]))) + bio_list_insert_sorted(&bio_list_on_stack[0], bio); } while ((bio = bio_list_pop(&bio_list_on_stack[0]))); current->bio_list = NULL; @@ -746,7 +767,7 @@ void submit_bio_noacct_nocheck(struct bio *bio) * it is active, and then process them after it returned. */ if (current->bio_list) - bio_list_add(¤t->bio_list[0], bio); + bio_list_insert_sorted(¤t->bio_list[0], bio); else if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) __submit_bio_noacct_mq(bio); else ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order 2025-05-19 22:12 ` Bart Van Assche @ 2025-05-20 13:56 ` Christoph Hellwig 2025-05-20 18:09 ` Bart Van Assche 0 siblings, 1 reply; 14+ messages in thread From: Christoph Hellwig @ 2025-05-20 13:56 UTC (permalink / raw) To: Bart Van Assche Cc: Christoph Hellwig, Jens Axboe, linux-block, Damien Le Moal, Yu Kuai, Ming Lei, stable On Mon, May 19, 2025 at 03:12:11PM -0700, Bart Van Assche wrote: > This new patch should address the concerns brought up in your latest > email: No, we should never need to do a sort, as mentioned we need to fix how stackable drivers split the I/O. Or maybe even get them out of all the splits that aren't required. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order 2025-05-20 13:56 ` Christoph Hellwig @ 2025-05-20 18:09 ` Bart Van Assche 2025-05-21 5:53 ` Christoph Hellwig 0 siblings, 1 reply; 14+ messages in thread From: Bart Van Assche @ 2025-05-20 18:09 UTC (permalink / raw) To: Christoph Hellwig Cc: Jens Axboe, linux-block, Damien Le Moal, Yu Kuai, Ming Lei, stable On 5/20/25 6:56 AM, Christoph Hellwig wrote: > On Mon, May 19, 2025 at 03:12:11PM -0700, Bart Van Assche wrote: >> This new patch should address the concerns brought up in your latest >> email: > > No, we should never need to do a sort, as mentioned we need to fix > how stackable drivers split the I/O. Or maybe even get them out of > all the splits that aren't required. If the sequential write bios are split by the device mapper, sorting bios in the block layer is not necessary. Christoph and Damien, do you agree to replace the bio sorting code in my previous email with the patch below? Thanks, Bart. diff --git a/drivers/md/dm.c b/drivers/md/dm.c index 3d419fd2be57..c41ab294987e 100644 --- a/drivers/md/dm.c +++ b/drivers/md/dm.c @@ -1792,9 +1792,12 @@ static inline bool dm_zone_bio_needs_split(struct mapped_device *md, { /* * For mapped device that need zone append emulation, we must - * split any large BIO that straddles zone boundaries. + * split any large BIO that straddles zone boundaries. Additionally, + * split sequential writes to prevent that splitting lower in the stack + * causes reordering. */ - return dm_emulate_zone_append(md) && bio_straddles_zones(bio) && + return ((dm_emulate_zone_append(md) && bio_straddles_zones(bio)) || + bio_op(bio) == REQ_OP_WRITE) && !bio_flagged(bio, BIO_ZONE_WRITE_PLUGGING); } static inline bool dm_zone_plug_bio(struct mapped_device *md, struct bio *bio) ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order 2025-05-20 18:09 ` Bart Van Assche @ 2025-05-21 5:53 ` Christoph Hellwig 2025-05-21 21:18 ` Bart Van Assche 0 siblings, 1 reply; 14+ messages in thread From: Christoph Hellwig @ 2025-05-21 5:53 UTC (permalink / raw) To: Bart Van Assche Cc: Christoph Hellwig, Jens Axboe, linux-block, Damien Le Moal, Yu Kuai, Ming Lei, stable On Tue, May 20, 2025 at 11:09:15AM -0700, Bart Van Assche wrote: > If the sequential write bios are split by the device mapper, sorting > bios in the block layer is not necessary. Christoph and Damien, do you > agree to replace the bio sorting code in my previous email with the > patch below? No. First please create a reproducer for your issue using null_blk or scsi_debug, otherwise we have no way to understand what is going on here, and will regress in the future. Second should very much be able to fix the splitting in dm to place the bios in the right order. As mentioned before I have a theory of how to do it, but we really need a proper reproducer to test this and then to write it up to blktests first. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order 2025-05-21 5:53 ` Christoph Hellwig @ 2025-05-21 21:18 ` Bart Van Assche 2025-05-23 4:21 ` Christoph Hellwig 0 siblings, 1 reply; 14+ messages in thread From: Bart Van Assche @ 2025-05-21 21:18 UTC (permalink / raw) To: Christoph Hellwig Cc: Jens Axboe, linux-block, Damien Le Moal, Yu Kuai, Ming Lei, stable On 5/20/25 10:53 PM, Christoph Hellwig wrote: > On Tue, May 20, 2025 at 11:09:15AM -0700, Bart Van Assche wrote: >> If the sequential write bios are split by the device mapper, sorting >> bios in the block layer is not necessary. Christoph and Damien, do you >> agree to replace the bio sorting code in my previous email with the >> patch below? > > No. First please create a reproducer for your issue using null_blk > or scsi_debug, otherwise we have no way to understand what is going > on here, and will regress in the future. > > Second should very much be able to fix the splitting in dm to place > the bios in the right order. As mentioned before I have a theory > of how to do it, but we really need a proper reproducer to test this > and then to write it up to blktests first. Hi Christoph, The following pull request includes a test that triggers the deadlock fixed by patch 2/2 reliably: https://github.com/osandov/blktests/pull/171 I do not yet have a reproducer for the bio reordering but I'm still working on this. Bart. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order 2025-05-21 21:18 ` Bart Van Assche @ 2025-05-23 4:21 ` Christoph Hellwig 0 siblings, 0 replies; 14+ messages in thread From: Christoph Hellwig @ 2025-05-23 4:21 UTC (permalink / raw) To: Bart Van Assche Cc: Christoph Hellwig, Jens Axboe, linux-block, Damien Le Moal, Yu Kuai, Ming Lei, stable On Wed, May 21, 2025 at 02:18:12PM -0700, Bart Van Assche wrote: > The following pull request includes a test that triggers the deadlock > fixed by patch 2/2 reliably: > > https://github.com/osandov/blktests/pull/171 Can you post the pathc please? I've not found any obvious way to download a patch from the above website. ^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH 2/2] block: Fix a deadlock related freezing zoned storage devices [not found] <20250514202937.2058598-1-bvanassche@acm.org> 2025-05-14 20:29 ` [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order Bart Van Assche @ 2025-05-14 20:29 ` Bart Van Assche 2025-05-16 4:51 ` Christoph Hellwig 1 sibling, 1 reply; 14+ messages in thread From: Bart Van Assche @ 2025-05-14 20:29 UTC (permalink / raw) To: Jens Axboe Cc: linux-block, Christoph Hellwig, Bart Van Assche, Damien Le Moal, Yu Kuai, Ming Lei, stable blk_mq_freeze_queue() never terminates if one or more bios are on the plug list and if the block device driver defines a .submit_bio() method. This is the case for device mapper drivers. The deadlock happens because blk_mq_freeze_queue() waits for q_usage_counter to drop to zero, because a queue reference is held by bios on the plug list and because the __bio_queue_enter() call in __submit_bio() waits for the queue to be unfrozen. This patch fixes the following deadlock: Workqueue: dm-51_zwplugs blk_zone_wplug_bio_work Call trace: __schedule+0xb08/0x1160 schedule+0x48/0xc8 __bio_queue_enter+0xcc/0x1d0 __submit_bio+0x100/0x1b0 submit_bio_noacct_nocheck+0x230/0x49c blk_zone_wplug_bio_work+0x168/0x250 process_one_work+0x26c/0x65c worker_thread+0x33c/0x498 kthread+0x110/0x134 ret_from_fork+0x10/0x20 Call trace: __switch_to+0x230/0x410 __schedule+0xb08/0x1160 schedule+0x48/0xc8 blk_mq_freeze_queue_wait+0x78/0xb8 blk_mq_freeze_queue+0x90/0xa4 queue_attr_store+0x7c/0xf0 sysfs_kf_write+0x98/0xc8 kernfs_fop_write_iter+0x12c/0x1d4 vfs_write+0x340/0x3ac ksys_write+0x78/0xe8 Cc: Christoph Hellwig <hch@lst.de> Cc: Damien Le Moal <dlemoal@kernel.org> Cc: Yu Kuai <yukuai1@huaweicloud.com> Cc: Ming Lei <ming.lei@redhat.com> Cc: stable@vger.kernel.org Fixes: dd291d77cc90 ("block: Introduce zone write plugging") Signed-off-by: Bart Van Assche <bvanassche@acm.org> --- block/blk-core.c | 13 +++++++++++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index 4b728fa1c138..e961896a8717 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -621,6 +621,13 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q, return BLK_STS_OK; } +/* + * Do not call bio_queue_enter() if the BIO_ZONE_WRITE_PLUGGING flag has been + * set because this causes blk_mq_freeze_queue() to deadlock if + * blk_zone_wplug_bio_work() submits a bio. Calling bio_queue_enter() for bios + * on the plug list is not necessary since a q_usage_counter reference is held + * while a bio is on the plug list. + */ static void __submit_bio(struct bio *bio) { /* If plug is not used, add new plug here to cache nsecs time. */ @@ -633,7 +640,8 @@ static void __submit_bio(struct bio *bio) if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) { blk_mq_submit_bio(bio); - } else if (likely(bio_queue_enter(bio) == 0)) { + } else if (likely(bio_zone_write_plugging(bio) || + bio_queue_enter(bio) == 0)) { struct gendisk *disk = bio->bi_bdev->bd_disk; if ((bio->bi_opf & REQ_POLLED) && @@ -643,7 +651,8 @@ static void __submit_bio(struct bio *bio) } else { disk->fops->submit_bio(bio); } - blk_queue_exit(disk->queue); + if (!bio_zone_write_plugging(bio)) + blk_queue_exit(disk->queue); } blk_finish_plug(&plug); ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH 2/2] block: Fix a deadlock related freezing zoned storage devices 2025-05-14 20:29 ` [PATCH 2/2] block: Fix a deadlock related freezing zoned storage devices Bart Van Assche @ 2025-05-16 4:51 ` Christoph Hellwig 2025-05-19 22:22 ` Bart Van Assche 0 siblings, 1 reply; 14+ messages in thread From: Christoph Hellwig @ 2025-05-16 4:51 UTC (permalink / raw) To: Bart Van Assche Cc: Jens Axboe, linux-block, Christoph Hellwig, Damien Le Moal, Yu Kuai, Ming Lei, stable On Wed, May 14, 2025 at 01:29:37PM -0700, Bart Van Assche wrote: > +/* > + * Do not call bio_queue_enter() if the BIO_ZONE_WRITE_PLUGGING flag has been > + * set because this causes blk_mq_freeze_queue() to deadlock if > + * blk_zone_wplug_bio_work() submits a bio. Calling bio_queue_enter() for bios > + * on the plug list is not necessary since a q_usage_counter reference is held > + * while a bio is on the plug list. > + */ How do we end up with BIO_ZONE_WRITE_PLUGGING set here? If that flag was set in a stacking driver we need to clear it before resubmitting the bio I think. Can you provide a null_blk based reproducer for your testcase to see what happens here? Either way we can't just simply skip taking q_usage_count. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/2] block: Fix a deadlock related freezing zoned storage devices 2025-05-16 4:51 ` Christoph Hellwig @ 2025-05-19 22:22 ` Bart Van Assche 2025-05-20 13:57 ` Christoph Hellwig 0 siblings, 1 reply; 14+ messages in thread From: Bart Van Assche @ 2025-05-19 22:22 UTC (permalink / raw) To: Christoph Hellwig Cc: Jens Axboe, linux-block, Damien Le Moal, Yu Kuai, Ming Lei, stable On 5/15/25 9:51 PM, Christoph Hellwig wrote: > On Wed, May 14, 2025 at 01:29:37PM -0700, Bart Van Assche wrote: >> +/* >> + * Do not call bio_queue_enter() if the BIO_ZONE_WRITE_PLUGGING flag has been >> + * set because this causes blk_mq_freeze_queue() to deadlock if >> + * blk_zone_wplug_bio_work() submits a bio. Calling bio_queue_enter() for bios >> + * on the plug list is not necessary since a q_usage_counter reference is held >> + * while a bio is on the plug list. >> + */ > > How do we end up with BIO_ZONE_WRITE_PLUGGING set here? If that flag > was set in a stacking driver we need to clear it before resubmitting > the bio I think. Hmm ... my understanding is that BIO_ZONE_WRITE_PLUGGING, if set, must remain set until the bio has completed. Here is an example of code in the bio completion path that tests this flag: static inline void blk_zone_bio_endio(struct bio *bio) { /* * For write BIOs to zoned devices, signal the completion of the BIO so * that the next write BIO can be submitted by zone write plugging. */ if (bio_zone_write_plugging(bio)) blk_zone_write_plug_bio_endio(bio); } The bio_zone_write_plugging() definition is as follows: static inline bool bio_zone_write_plugging(struct bio *bio) { return bio_flagged(bio, BIO_ZONE_WRITE_PLUGGING); } > Can you provide a null_blk based reproducer for your testcase to see > what happens here? My attempts so far to build a reproduce for the blktests framework have been unsuccessful. This test script runs fine in the VM that I use for kernel testing: https://github.com/bvanassche/blktests/blob/master/tests/zbd/013 > Either way we can't just simply skip taking q_usage_count. Why not? If BIO_ZONE_WRITE_PLUGGING is set, that guarantees that a queue reference count is held and will remain held across the entire disk->fops->submit_bio() invocation, isn't it? From blk_zone_wplug_bio_work(), below the submit_bio_noacct_nocheck(bio) call: if (bdev_test_flag(bdev, BD_HAS_SUBMIT_BIO)) blk_queue_exit(bdev->bd_disk->queue); Thanks, Bart. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/2] block: Fix a deadlock related freezing zoned storage devices 2025-05-19 22:22 ` Bart Van Assche @ 2025-05-20 13:57 ` Christoph Hellwig 0 siblings, 0 replies; 14+ messages in thread From: Christoph Hellwig @ 2025-05-20 13:57 UTC (permalink / raw) To: Bart Van Assche Cc: Christoph Hellwig, Jens Axboe, linux-block, Damien Le Moal, Yu Kuai, Ming Lei, stable On Mon, May 19, 2025 at 03:22:42PM -0700, Bart Van Assche wrote: > Hmm ... my understanding is that BIO_ZONE_WRITE_PLUGGING, if set, must > remain set until the bio has completed. Here is an example of code in > the bio completion path that tests this flag: True. Well, we'll need some other flag that to tell lower levels to ignore the flag because it is owned by the higher level. ^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2025-05-23 4:21 UTC | newest]
Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <20250514202937.2058598-1-bvanassche@acm.org>
2025-05-14 20:29 ` [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order Bart Van Assche
2025-05-15 7:19 ` Niklas Cassel
2025-05-15 15:58 ` Bart Van Assche
2025-05-16 4:47 ` Christoph Hellwig
2025-05-19 22:12 ` Bart Van Assche
2025-05-20 13:56 ` Christoph Hellwig
2025-05-20 18:09 ` Bart Van Assche
2025-05-21 5:53 ` Christoph Hellwig
2025-05-21 21:18 ` Bart Van Assche
2025-05-23 4:21 ` Christoph Hellwig
2025-05-14 20:29 ` [PATCH 2/2] block: Fix a deadlock related freezing zoned storage devices Bart Van Assche
2025-05-16 4:51 ` Christoph Hellwig
2025-05-19 22:22 ` Bart Van Assche
2025-05-20 13:57 ` Christoph Hellwig
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox