* [PATCH 0/2] Two bug fixes for zoned block devices @ 2025-05-14 20:29 Bart Van Assche 2025-05-14 20:29 ` [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order Bart Van Assche 2025-05-14 20:29 ` [PATCH 2/2] block: Fix a deadlock related to freezing zoned storage devices Bart Van Assche 0 siblings, 2 replies; 44+ messages in thread From: Bart Van Assche @ 2025-05-14 20:29 UTC (permalink / raw) To: Jens Axboe; +Cc: linux-block, Christoph Hellwig, Bart Van Assche Hi Jens, The two patches in this series fix the issues that I ran into while stacking a dm driver on top of a zoned storage device. Please consider these two bug fixes for the next merge window. Thank you, Bart. Bart Van Assche (2): block: Make __submit_bio_noacct() preserve the bio submission order block: Fix a deadlock related to freezing zoned storage devices block/blk-core.c | 15 ++++++++++++--- 1 file changed, 12 insertions(+), 3 deletions(-) ^ permalink raw reply [flat|nested] 44+ messages in thread
* [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order 2025-05-14 20:29 [PATCH 0/2] Two bug fixes for zoned block devices Bart Van Assche @ 2025-05-14 20:29 ` Bart Van Assche 2025-05-15 7:19 ` Niklas Cassel 2025-05-16 4:47 ` Christoph Hellwig 2025-05-14 20:29 ` [PATCH 2/2] block: Fix a deadlock related to freezing zoned storage devices Bart Van Assche 1 sibling, 2 replies; 44+ messages in thread From: Bart Van Assche @ 2025-05-14 20:29 UTC (permalink / raw) To: Jens Axboe Cc: linux-block, Christoph Hellwig, Bart Van Assche, Damien Le Moal, Yu Kuai, Ming Lei, stable submit_bio() may be called recursively. To limit the stack depth, recursive calls result in bios being added to a list (current->bio_list). __submit_bio_noacct() sets up that list and maintains two lists with bios: * bio_list_on_stack[0] is the list with bios submitted by recursive submit_bio() calls from inside the latest __submit_bio() call. * bio_list_on_stack[1] is the list with bios submitted by recursive submit_bio() calls from inside previous __submit_bio() calls. Make sure that bios are submitted to lower devices in the order these have been submitted by submit_bio() by adding new bios at the end of the list instead of at the front. This patch fixes unaligned write errors that I encountered with F2FS submitting zoned writes to a dm driver stacked on top of a zoned UFS device. Cc: Christoph Hellwig <hch@lst.de> Cc: Damien Le Moal <dlemoal@kernel.org> Cc: Yu Kuai <yukuai1@huaweicloud.com> Cc: Ming Lei <ming.lei@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Bart Van Assche <bvanassche@acm.org> --- block/blk-core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/block/blk-core.c b/block/blk-core.c index b862c66018f2..4b728fa1c138 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -704,9 +704,9 @@ static void __submit_bio_noacct(struct bio *bio) /* * Now assemble so we handle the lowest level first.
*/ + bio_list_on_stack[0] = bio_list_on_stack[1]; bio_list_merge(&bio_list_on_stack[0], &lower); bio_list_merge(&bio_list_on_stack[0], &same); - bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]); } while ((bio = bio_list_pop(&bio_list_on_stack[0]))); current->bio_list = NULL; ^ permalink raw reply related [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order 2025-05-14 20:29 ` [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order Bart Van Assche @ 2025-05-15 7:19 ` Niklas Cassel 2025-05-15 15:58 ` Bart Van Assche 2025-05-16 4:47 ` Christoph Hellwig 1 sibling, 1 reply; 44+ messages in thread From: Niklas Cassel @ 2025-05-15 7:19 UTC (permalink / raw) To: Bart Van Assche Cc: Jens Axboe, linux-block, Christoph Hellwig, Damien Le Moal, Yu Kuai, Ming Lei, stable Hello Bart, On Wed, May 14, 2025 at 01:29:36PM -0700, Bart Van Assche wrote: > submit_bio() may be called recursively. To limit the stack depth, recursive > calls result in bios being added to a list (current->bio_list). > __submit_bio_noacct() sets up that list and maintains two lists with > requests: > * bio_list_on_stack[0] is the list with bios submitted by recursive > submit_bio() calls from inside the latest __submit_bio() call. > * bio_list_on_stack[1] is the list with bios submitted by recursive > submit_bio() calls from inside previous __submit_bio() calls. > > Make sure that bios are submitted to lower devices in the order these > have been submitted by submit_bio() by adding new bios at the end of the > list instead of at the front. > > This patch fixes unaligned write errors that I encountered with F2FS > submitting zoned writes to a dm driver stacked on top of a zoned UFS > device. > > Cc: Christoph Hellwig <hch@lst.de> > Cc: Damien Le Moal <dlemoal@kernel.org> > Cc: Yu Kuai <yukuai1@huaweicloud.com> > Cc: Ming Lei <ming.lei@redhat.com> > Cc: stable@vger.kernel.org Here you add stable to Cc, but you don't specify either 1) a minimum version e.g. stable@vger.kernel.org # v6.8+ or 2) a Fixes tag. Kind regards, Niklas ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order 2025-05-15 7:19 ` Niklas Cassel @ 2025-05-15 15:58 ` Bart Van Assche 0 siblings, 0 replies; 44+ messages in thread From: Bart Van Assche @ 2025-05-15 15:58 UTC (permalink / raw) To: Niklas Cassel, NeilBrown Cc: Jens Axboe, linux-block, Christoph Hellwig, Damien Le Moal, Yu Kuai, Ming Lei, stable On 5/15/25 12:19 AM, Niklas Cassel wrote: > Hello Bart, > > On Wed, May 14, 2025 at 01:29:36PM -0700, Bart Van Assche wrote: >> submit_bio() may be called recursively. To limit the stack depth, recursive >> calls result in bios being added to a list (current->bio_list). >> __submit_bio_noacct() sets up that list and maintains two lists with >> requests: >> * bio_list_on_stack[0] is the list with bios submitted by recursive >> submit_bio() calls from inside the latest __submit_bio() call. >> * bio_list_on_stack[1] is the list with bios submitted by recursive >> submit_bio() calls from inside previous __submit_bio() calls. >> >> Make sure that bios are submitted to lower devices in the order these >> have been submitted by submit_bio() by adding new bios at the end of the >> list instead of at the front. >> >> This patch fixes unaligned write errors that I encountered with F2FS >> submitting zoned writes to a dm driver stacked on top of a zoned UFS >> device. >> >> Cc: Christoph Hellwig <hch@lst.de> >> Cc: Damien Le Moal <dlemoal@kernel.org> >> Cc: Yu Kuai <yukuai1@huaweicloud.com> >> Cc: Ming Lei <ming.lei@redhat.com> >> Cc: stable@vger.kernel.org > > Here you add stable to Cc, but you don't specify either > 1) a minimum version e.g. > stable@vger.kernel.org # v6.8+ > or > 2) a Fixes tag. Hi Niklas, Let's add the following to this patch: Fixes: 79bd99596b73 ("blk: improve order of bio handling in generic_make_request()") Neil, since that commit was authored by you: the commit message is elaborate but the names of the drivers that needed that commit have not been mentioned. 
Which drivers needed that change? Additionally, can you please help with reviewing this patch: https://lore.kernel.org/linux-block/20250514202937.2058598-2-bvanassche@acm.org/ Thanks, Bart. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order 2025-05-14 20:29 ` [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order Bart Van Assche 2025-05-15 7:19 ` Niklas Cassel @ 2025-05-16 4:47 ` Christoph Hellwig 2025-05-19 22:12 ` Bart Van Assche 1 sibling, 1 reply; 44+ messages in thread From: Christoph Hellwig @ 2025-05-16 4:47 UTC (permalink / raw) To: Bart Van Assche Cc: Jens Axboe, linux-block, Christoph Hellwig, Damien Le Moal, Yu Kuai, Ming Lei, stable On Wed, May 14, 2025 at 01:29:36PM -0700, Bart Van Assche wrote: > /* > * Now assemble so we handle the lowest level first. > */ > + bio_list_on_stack[0] = bio_list_on_stack[1]; > bio_list_merge(&bio_list_on_stack[0], &lower); > bio_list_merge(&bio_list_on_stack[0], &same); > - bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]); If I read this code correctly, this means that we now keep processing bios that already were on bio_list_on_stack[0] at the beginning of the loop in the next iteration(s) instead of finishing off the ones created by this iteration, which could lead to exhaustion of resources like mempool. Note that this is a big if - the code is really hard to read, it should really grow a data structure for the on-stack list that has named members for both lists instead of the array magic.. :( I'm still trying to understand your problem given that it wasn't described much. What I could think of is that bio_split_to_limits through bio_submit_split first re-submits the remainder bio using submit_bio_noacct, which the above should place on the same list and then later the stacking block drivers also submit the bio split off at the beginning, unlike blk-mq drivers that process it directly. But given that this resubmission should be on the lower list above I don't see how it causes problems. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order 2025-05-16 4:47 ` Christoph Hellwig @ 2025-05-19 22:12 ` Bart Van Assche 2025-05-20 13:56 ` Christoph Hellwig 0 siblings, 1 reply; 44+ messages in thread From: Bart Van Assche @ 2025-05-19 22:12 UTC (permalink / raw) To: Christoph Hellwig Cc: Jens Axboe, linux-block, Damien Le Moal, Yu Kuai, Ming Lei, stable On 5/15/25 9:47 PM, Christoph Hellwig wrote: > On Wed, May 14, 2025 at 01:29:36PM -0700, Bart Van Assche wrote: >> /* >> * Now assemble so we handle the lowest level first. >> */ >> + bio_list_on_stack[0] = bio_list_on_stack[1]; >> bio_list_merge(&bio_list_on_stack[0], &lower); >> bio_list_merge(&bio_list_on_stack[0], &same); >> - bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]); > > If I read this code correctly, this means that we no keep processing bios > that already were on bio_list_on_stack[0] and the beginning of the loop > in the next iteration(s) instead of finishing off the ones created by > this iteration, which could lead to exhaustion of resources like mempool. > > Note that this is a big if - the code is really hard to read, it should > really grow a data structure for the on-stack list that has named members > for both lists instead of the array magic.. :( > > I'm still trying to understand your problem given that it wasn't > described much. What I could think it is that bio_split_to_limits through > bio_submit_split first re-submits the remainder bio using > submit_bio_noacct, which the above should place on the same list and then > later the stacking block drivers also submit the bio split off at the > beginning, unlike blk-mq drivers that process it directly. But given > that this resubmission should be on the lower list above I don't > see how it causes problems. Agreed that this should be root-caused. To my own frustration I do not yet have a full root-cause analysis. 
What I have done to obtain more information is to make the kernel issue a warning the first time a bio is added out-of-order at the end of the bio list. The following output appeared (sde is the zoned block device at the bottom of the stack): [ 71.312492][ T1] bio_list_insert_sorted: inserting in the middle of a bio list [ 71.313483][ T1] print_bio_list(sde) sector 0x1b7520 size 0x10 [ 71.313034][ T1] bio_list_insert_sorted(sde) sector 0x1b7120 size 0x400 [ ... ] [ 71.368117][ T163] WARNING: CPU: 4 PID: 163 at block/blk-core.c:725 bio_list_insert_sorted+0x144/0x18c [ 71.386664][ T163] Workqueue: writeback wb_workfn (flush-253:49) [ 71.387110][ T163] pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--) [ 71.393772][ T163] Call trace: [ 71.393988][ T163] bio_list_insert_sorted+0x144/0x18c [ 71.394338][ T163] submit_bio_noacct_nocheck+0xd8/0x4f4 [ 71.394696][ T163] submit_bio_noacct+0x32c/0x50c [ 71.395017][ T163] bio_submit_split+0xf0/0x1f8 [ 71.395349][ T163] bio_split_rw+0xdc/0xf0 [ 71.395631][ T163] blk_mq_submit_bio+0x320/0x940 [ 71.395970][ T163] __submit_bio+0xa4/0x1c4 [ 71.396260][ T163] submit_bio_noacct_nocheck+0x1c0/0x4f4 [ 71.396623][ T163] submit_bio_noacct+0x32c/0x50c [ 71.396942][ T163] submit_bio+0x17c/0x198 [ 71.397222][ T163] f2fs_submit_write_bio+0x94/0x154 [ 71.397604][ T163] __submit_merged_bio+0x80/0x204 [ 71.397933][ T163] __submit_merged_write_cond+0xd0/0x1fc [ 71.398297][ T163] f2fs_submit_merged_write+0x24/0x30 [ 71.398646][ T163] f2fs_sync_node_pages+0x5ec/0x64c [ 71.398999][ T163] f2fs_write_node_pages+0xe8/0x1dc [ 71.399338][ T163] do_writepages+0xe4/0x2f8 [ 71.399673][ T163] __writeback_single_inode+0x84/0x6e4 [ 71.400036][ T163] writeback_sb_inodes+0x2cc/0x5c0 [ 71.400369][ T163] wb_writeback+0x134/0x550 [ 71.400662][ T163] wb_workfn+0x154/0x588 [ 71.400937][ T163] process_one_work+0x26c/0x65c [ 71.401271][ T163] worker_thread+0x33c/0x498 [ 71.401575][ T163] kthread+0x110/0x134 [ 71.401844][ T163] ret_from_fork+0x10/0x20 I 
think that the above call stack indicates the following: f2fs_submit_write_bio() submits a bio to a dm driver, that the dm driver submitted a bio for the lower driver (SCSI core), that the bio for the lower driver is split by bio_split_rw(), and that the second half of the split bio triggers the above out-of-order warning. This new patch should address the concerns brought up in your latest email: diff --git a/block/blk-core.c b/block/blk-core.c index 411f005e6b1f..aa270588272a 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -649,6 +649,26 @@ static void __submit_bio(struct bio *bio) blk_finish_plug(&plug); } +/* + * Insert a bio in LBA order. If no bio for the same bdev with a higher LBA is + * found, append at the end. + */ +static void bio_list_insert_sorted(struct bio_list *bl, struct bio *bio) +{ + struct block_device *bdev = bio->bi_bdev; + struct bio **pprev = &bl->head, *next; + sector_t sector = bio->bi_iter.bi_sector; + + for (next = *pprev; next; pprev = &next->bi_next, next = next->bi_next) + if (next->bi_bdev == bdev && sector < next->bi_iter.bi_sector) + break; + + bio->bi_next = next; + *pprev = bio; + if (!next) + bl->tail = bio; +} + /* * The loop in this function may be a bit non-obvious, and so deserves some * explanation: @@ -706,7 +726,8 @@ static void __submit_bio_noacct(struct bio *bio) */ bio_list_merge(&bio_list_on_stack[0], &lower); bio_list_merge(&bio_list_on_stack[0], &same); - bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]); + while ((bio = bio_list_pop(&bio_list_on_stack[1]))) + bio_list_insert_sorted(&bio_list_on_stack[0], bio); } while ((bio = bio_list_pop(&bio_list_on_stack[0]))); current->bio_list = NULL; @@ -746,7 +767,7 @@ void submit_bio_noacct_nocheck(struct bio *bio) * it is active, and then process them after it returned. 
*/ if (current->bio_list) - bio_list_add(¤t->bio_list[0], bio); + bio_list_insert_sorted(¤t->bio_list[0], bio); else if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) __submit_bio_noacct_mq(bio); else ^ permalink raw reply related [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order 2025-05-19 22:12 ` Bart Van Assche @ 2025-05-20 13:56 ` Christoph Hellwig 2025-05-20 18:09 ` Bart Van Assche 0 siblings, 1 reply; 44+ messages in thread From: Christoph Hellwig @ 2025-05-20 13:56 UTC (permalink / raw) To: Bart Van Assche Cc: Christoph Hellwig, Jens Axboe, linux-block, Damien Le Moal, Yu Kuai, Ming Lei, stable On Mon, May 19, 2025 at 03:12:11PM -0700, Bart Van Assche wrote: > This new patch should address the concerns brought up in your latest > email: No, we should never need to do a sort, as mentioned we need to fix how stackable drivers split the I/O. Or maybe even get them out of all the splits that aren't required. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order 2025-05-20 13:56 ` Christoph Hellwig @ 2025-05-20 18:09 ` Bart Van Assche 2025-05-21 5:53 ` Christoph Hellwig 0 siblings, 1 reply; 44+ messages in thread From: Bart Van Assche @ 2025-05-20 18:09 UTC (permalink / raw) To: Christoph Hellwig Cc: Jens Axboe, linux-block, Damien Le Moal, Yu Kuai, Ming Lei, stable On 5/20/25 6:56 AM, Christoph Hellwig wrote: > On Mon, May 19, 2025 at 03:12:11PM -0700, Bart Van Assche wrote: >> This new patch should address the concerns brought up in your latest >> email: > > No, we should never need to do a sort, as mentioned we need to fix > how stackable drivers split the I/O. Or maybe even get them out of > all the splits that aren't required. If the sequential write bios are split by the device mapper, sorting bios in the block layer is not necessary. Christoph and Damien, do you agree to replace the bio sorting code in my previous email with the patch below? Thanks, Bart. diff --git a/drivers/md/dm.c b/drivers/md/dm.c index 3d419fd2be57..c41ab294987e 100644 --- a/drivers/md/dm.c +++ b/drivers/md/dm.c @@ -1792,9 +1792,12 @@ static inline bool dm_zone_bio_needs_split(struct mapped_device *md, { /* * For mapped device that need zone append emulation, we must - * split any large BIO that straddles zone boundaries. + * split any large BIO that straddles zone boundaries. Additionally, + * split sequential writes to prevent splitting lower in the stack + * from causing reordering. */ - return dm_emulate_zone_append(md) && bio_straddles_zones(bio) && + return ((dm_emulate_zone_append(md) && bio_straddles_zones(bio)) || + bio_op(bio) == REQ_OP_WRITE) && !bio_flagged(bio, BIO_ZONE_WRITE_PLUGGING); } static inline bool dm_zone_plug_bio(struct mapped_device *md, struct bio *bio) ^ permalink raw reply related [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order 2025-05-20 18:09 ` Bart Van Assche @ 2025-05-21 5:53 ` Christoph Hellwig 2025-05-21 21:18 ` Bart Van Assche 0 siblings, 1 reply; 44+ messages in thread From: Christoph Hellwig @ 2025-05-21 5:53 UTC (permalink / raw) To: Bart Van Assche Cc: Christoph Hellwig, Jens Axboe, linux-block, Damien Le Moal, Yu Kuai, Ming Lei, stable On Tue, May 20, 2025 at 11:09:15AM -0700, Bart Van Assche wrote: > If the sequential write bios are split by the device mapper, sorting > bios in the block layer is not necessary. Christoph and Damien, do you > agree to replace the bio sorting code in my previous email with the > patch below? No. First please create a reproducer for your issue using null_blk or scsi_debug, otherwise we have no way to understand what is going on here, and will regress in the future. Second, we should very much be able to fix the splitting in dm to place the bios in the right order. As mentioned before I have a theory of how to do it, but we really need a proper reproducer to test this and then to write it up to blktests first. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order 2025-05-21 5:53 ` Christoph Hellwig @ 2025-05-21 21:18 ` Bart Van Assche 2025-05-22 5:12 ` Damien Le Moal 2025-05-23 4:21 ` Christoph Hellwig 0 siblings, 2 replies; 44+ messages in thread From: Bart Van Assche @ 2025-05-21 21:18 UTC (permalink / raw) To: Christoph Hellwig Cc: Jens Axboe, linux-block, Damien Le Moal, Yu Kuai, Ming Lei, stable On 5/20/25 10:53 PM, Christoph Hellwig wrote: > On Tue, May 20, 2025 at 11:09:15AM -0700, Bart Van Assche wrote: >> If the sequential write bios are split by the device mapper, sorting >> bios in the block layer is not necessary. Christoph and Damien, do you >> agree to replace the bio sorting code in my previous email with the >> patch below? > > No. First please create a reproducer for your issue using null_blk > or scsi_debug, otherwise we have no way to understand what is going > on here, and will regress in the future. > > Second should very much be able to fix the splitting in dm to place > the bios in the right order. As mentioned before I have a theory > of how to do it, but we really need a proper reproducer to test this > and then to write it up to blktests first. Hi Christoph, The following pull request includes a test that triggers the deadlock fixed by patch 2/2 reliably: https://github.com/osandov/blktests/pull/171 I do not yet have a reproducer for the bio reordering but I'm still working on this. Bart. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order 2025-05-21 21:18 ` Bart Van Assche @ 2025-05-22 5:12 ` Damien Le Moal 2025-05-22 17:08 ` Bart Van Assche 2025-05-23 4:21 ` Christoph Hellwig 1 sibling, 1 reply; 44+ messages in thread From: Damien Le Moal @ 2025-05-22 5:12 UTC (permalink / raw) To: Bart Van Assche, Christoph Hellwig Cc: Jens Axboe, linux-block, Yu Kuai, Ming Lei, Shin'ichiro Kawasaki On 5/21/25 23:18, Bart Van Assche wrote: > On 5/20/25 10:53 PM, Christoph Hellwig wrote: >> On Tue, May 20, 2025 at 11:09:15AM -0700, Bart Van Assche wrote: >>> If the sequential write bios are split by the device mapper, sorting >>> bios in the block layer is not necessary. Christoph and Damien, do you >>> agree to replace the bio sorting code in my previous email with the >>> patch below? >> >> No. First please create a reproducer for your issue using null_blk >> or scsi_debug, otherwise we have no way to understand what is going >> on here, and will regress in the future. >> >> Second should very much be able to fix the splitting in dm to place >> the bios in the right order. As mentioned before I have a theory >> of how to do it, but we really need a proper reproducer to test this >> and then to write it up to blktests first. > > Hi Christoph, > > The following pull request includes a test that triggers the deadlock > fixed by patch 2/2 reliably: > > https://github.com/osandov/blktests/pull/171 +Shin'ichiro so that he is aware of the context. Please share the blktest patch on this list so that we can see how you recreate the issue. That makes it easier to see if a fix is appropriate. > I do not yet have a reproducer for the bio reordering but I'm still > working on this. I am still very confused about how this is possible assuming a well behaved user that actually submits write BIOs in sequence for a zone. That means with a lock around submit_bio() calls. 
Assuming such user, a large write BIO that is split would have its fragments all processed and added to the target zone plug in order. Another context (or the same context) submitting the next write for that zone would have the same happen, so BIO fragments should not be reordered... So to clarify: are we talking about splits of the BIO that the DM device receives ? Or is it about splits of cloned BIOs that are used to process the BIOs that the DM device received ? The clones are for the underlying device and should not have the zone plugging flag set until the DM target driver submits them, even if the original BIO is flagged with zone plugging. Looking at the bio clone code, the bio flags do not seem to be copied from the source BIO to the clone. So even if the source BIO (the BIO received by the DM device) is flagged with zone write plugging, a clone should not have this flag set until it is submitted. Could you clarify the sequence and BIO flags you see that leads to the issue ? -- Damien Le Moal Western Digital Research ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order 2025-05-22 5:12 ` Damien Le Moal @ 2025-05-22 17:08 ` Bart Van Assche 2025-05-23 6:02 ` Damien Le Moal 0 siblings, 1 reply; 44+ messages in thread From: Bart Van Assche @ 2025-05-22 17:08 UTC (permalink / raw) To: Damien Le Moal, Christoph Hellwig Cc: Jens Axboe, linux-block, Yu Kuai, Ming Lei, Shin'ichiro Kawasaki On 5/21/25 10:12 PM, Damien Le Moal wrote: > I am still very confused about how this is possible assuming a well behaved user > that actually submits write BIOs in sequence for a zone. That means with a lock > around submit_bio() calls. Assuming such user, a large write BIO that is split > would have its fragments all processed and added to the target zone plug in > order. Another context (or the same context) submitting the next write for that > zone would have the same happen, so BIO fragments should not be reordered... > > So to clarify: are we talking about splits of the BIO that the DM device > receives ? Or is it about splits of cloned BIOs that are used to process the > BIOs that the DM device received ? The clones are for the underlying device and > should not have the zone plugging flag set until the DM target driver submits > them, even if the original BIO is flagged with zone plugging. Looking at the bio > clone code, the bio flags do not seem to be copied from the source BIO to the > clone. So even if the source BIO (the BIO received by the DM device) is flagged > with zone write plugging, a clone should not have this flag set until it is > submitted. > > Could you clarify the sequence and BIO flags you see that leads to the issue ? Hi Damien, In the tests that I ran, F2FS submits bios to a dm driver and the dm driver submits these bios to the SCSI disk (sd) driver. F2FS submits bios at the write pointer. 
If that wouldn't be the case, the following code in block/blk-zoned.c would reject these bios: /* * Check for non-sequential writes early as we know that BIOs * with a start sector not unaligned to the zone write pointer * will fail. */ if (bio_offset_from_zone_start(bio) != zwplug->wp_offset) return false; If the bio is larger than 1 MiB, it gets split by the block layer after it passed through the dm driver and before it is submitted to the sd driver. The UFS driver sets max_sectors to 1 MiB. Although UFS host controllers support larger requests, this value has been chosen to minimize the impact of writes on read latency. Earlier emails in this thread show that the bio splitting below the dm driver can cause bio reordering. See also the call stack that is available here: https://lore.kernel.org/linux-block/47b24ea0-ef8f-441f-b405-a062b986ce93@acm.org/ Bart. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order 2025-05-22 17:08 ` Bart Van Assche @ 2025-05-23 6:02 ` Damien Le Moal 2025-05-23 16:30 ` Bart Van Assche 0 siblings, 1 reply; 44+ messages in thread From: Damien Le Moal @ 2025-05-23 6:02 UTC (permalink / raw) To: Bart Van Assche, Christoph Hellwig Cc: Jens Axboe, linux-block, Yu Kuai, Ming Lei, Shin'ichiro Kawasaki On 5/22/25 19:08, Bart Van Assche wrote: > On 5/21/25 10:12 PM, Damien Le Moal wrote: >> I am still very confused about how this is possible assuming a well behaved user >> that actually submits write BIOs in sequence for a zone. That means with a lock >> around submit_bio() calls. Assuming such user, a large write BIO that is split >> would have its fragments all processed and added to the target zone plug in >> order. Another context (or the same context) submitting the next write for that >> zone would have the same happen, so BIO fragments should not be reordered... >> >> So to clarify: are we talking about splits of the BIO that the DM device >> receives ? Or is it about splits of cloned BIOs that are used to process the >> BIOs that the DM device received ? The clones are for the underlying device and >> should not have the zone plugging flag set until the DM target driver submits >> them, even if the original BIO is flagged with zone plugging. Looking at the bio >> clone code, the bio flags do not seem to be copied from the source BIO to the >> clone. So even if the source BIO (the BIO received by the DM device) is flagged >> with zone write plugging, a clone should not have this flag set until it is >> submitted. >> >> Could you clarify the sequence and BIO flags you see that leads to the issue ? > > Hi Damien, > > In the tests that I ran, F2FS submits bios to a dm driver and the dm > driver submits these bios to the SCSI disk (sd) driver. F2FS submits Which DM driver is it ? Does that DM driver have some special work queue handling of BIO submissions ? 
Or does it simply remap the BIO and send it down to the underlying device in the initial submit_bio() context ? If it is the former case, then that DM driver must enable zone write plugging. If it is the latter, it should not need zone write plugging and ordering will be handled correctly throughout the submit_bio() context for the initial DM BIO, assuming that the submitter does indeed serialize write BIO submissions to a zone. I have not looked at f2fs code in ages. When I worked on it, there was a mutex to serialize write issuing to avoid reordering issues... > bios at the write pointer. If that wouldn't be the case, the following > code in block/blk-zoned.c would reject these bios: > > /* > * Check for non-sequential writes early as we know that BIOs > * with a start sector not unaligned to the zone write pointer > * will fail. > */ > if (bio_offset_from_zone_start(bio) != zwplug->wp_offset) > return false; > > If the bio is larger than 1 MiB, it gets split by the block layer after > it passed through the dm driver and before it is submitted to the sd > driver. The UFS driver sets max_sectors to 1 MiB. Although UFS host > controllers support larger requests, this value has been chosen to > minimize the impact of writes on read latency. As mentioned above, the splitting and adding to the zone write plug should all be serialized by the submitter, using whatever means are appropriate there. As long as submit_bio() is processing a large BIO and splitting it, if the submitter is correctly serializing writes, I do not see how splitting can result in reordering... > Earlier emails in this thread show that the bio splitting below the dm > driver can cause bio reordering. See also the call stack that is > available here: > > https://lore.kernel.org/linux-block/47b24ea0-ef8f-441f-b405-a062b986ce93@acm.org/ I asked for clarification in the first place because I still do not understand what is going on reading that lightly explained backtrace you show in that email.
A more detailed time flow explanation of what is happening and in which context would very likely clarify exactly what is going on. So far, the only thing I can think of is that maybe we need to split BIOs in DM core before submitting them to the DM driver. But I am reluctant to send such a patch because I cannot justify/explain its need based on your explanations. -- Damien Le Moal Western Digital Research ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order 2025-05-23 6:02 ` Damien Le Moal @ 2025-05-23 16:30 ` Bart Van Assche 2025-05-24 8:48 ` Damien Le Moal 2025-05-26 5:24 ` Christoph Hellwig 0 siblings, 2 replies; 44+ messages in thread From: Bart Van Assche @ 2025-05-23 16:30 UTC (permalink / raw) To: Damien Le Moal, Christoph Hellwig Cc: Jens Axboe, linux-block, Yu Kuai, Ming Lei, Shin'ichiro Kawasaki On 5/22/25 11:02 PM, Damien Le Moal wrote: > On 5/22/25 19:08, Bart Van Assche wrote: > [ ... ] > Which DM driver is it ? Does that DM driver have some special work queue > handling of BIO submissions ? Or does is simply remap the BIO and send it down > to the underlying device in the initial submit_bio() context ? If it is the > former case, then that DM driver must enable zone write plugging. If it is the > latter, it should not need zone write plugging and ordering will be handled > correctly throughout the submit_bio() context for the initial DM BIO, assuming > that the submitter does indeed serialize write BIO submissions to a zone. I have > not looked at f2fs code in ages. When I worked on it, there was a mutex to > serialize write issuing to avoid reordering issues... It is the dm-default-key driver, a driver about which everyone (including the authors of that driver) agree that it should disappear. Unfortunately the functionality provided by that driver has not yet been integrated in the upstream kernel (encrypt filesystem metadata). How that driver (dm-default-key) works is very similar to how dm-crypt works. I think that the most important difference is that dm-crypt requests encryption for all bios while dm-default-key only sets an encryption key for a subset of the bios it processes. 
The source code of that driver is available here: https://android.googlesource.com/kernel/common/+/refs/heads/android16-6.12/drivers/md/dm-default-key.c >> Earlier emails in this thread show that the bio splitting below the dm >> driver can cause bio reordering. See also the call stack that is >> available here: >> >> https://lore.kernel.org/linux-block/47b24ea0-ef8f-441f-b405-a062b986ce93@acm.org/ > > I asked for clarification in the first place because I still do not understand > what is going on reading that lightly explained backtrace you show in that > email. A more detailed time flow explanation of what is happening and in which > context would very likely clarify exactly what is going on. > > So far, the only thing I can think of is that maybe we need to split BIOs in DM > core before submitting them to the DM driver. But I am reluctant to send such a > patch because I cannot justify/explain its need based on your explanations. Backtraces do not fully explain what happens because a single function (__submit_bio_noacct()) processes bios for multiple levels in a stacked block driver hierarchy. That's why I added debug code that reports the disk name (sde) for the first bio that is inserted out-of-order. This output shows that the bio reordering is the result of bio splitting at the bottom of the stack. Thanks, Bart. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order 2025-05-23 16:30 ` Bart Van Assche @ 2025-05-24 8:48 ` Damien Le Moal 2025-05-24 14:05 ` Bart Van Assche 2025-05-26 5:24 ` Christoph Hellwig 1 sibling, 1 reply; 44+ messages in thread From: Damien Le Moal @ 2025-05-24 8:48 UTC (permalink / raw) To: Bart Van Assche, Christoph Hellwig Cc: Jens Axboe, linux-block, Yu Kuai, Ming Lei, Shin'ichiro Kawasaki On 5/23/25 18:30, Bart Van Assche wrote: > On 5/22/25 11:02 PM, Damien Le Moal wrote: >> On 5/22/25 19:08, Bart Van Assche wrote: > > [ ... ] >> Which DM driver is it ? Does that DM driver have some special work queue >> handling of BIO submissions ? Or does is simply remap the BIO and send it down >> to the underlying device in the initial submit_bio() context ? If it is the >> former case, then that DM driver must enable zone write plugging. If it is the >> latter, it should not need zone write plugging and ordering will be handled >> correctly throughout the submit_bio() context for the initial DM BIO, assuming >> that the submitter does indeed serialize write BIO submissions to a zone. I have >> not looked at f2fs code in ages. When I worked on it, there was a mutex to >> serialize write issuing to avoid reordering issues... > > It is the dm-default-key driver, a driver about which everyone > (including the authors of that driver) agree that it should disappear. > Unfortunately the functionality provided by that driver has not yet been > integrated in the upstream kernel (encrypt filesystem metadata). > > How that driver (dm-default-key) works is very similar to how dm-crypt > works. I think that the most important difference is that dm-crypt > requests encryption for all bios while dm-default-key only sets an > encryption key for a subset of the bios it processes. 
> > The source code of that driver is available here: > https://android.googlesource.com/kernel/common/+/refs/heads/android16-6.12/drivers/md/dm-default-key.c Well, this is an out-of-tree driver. Not touching this. Unless you can reproduce the issue with dm-crypt (or any other DM target that is upstream), I will not even try to debug this. Note that our internal test suite runs *lots* of different zoned devices (SMR HDD, ZNS SSDs, nullblk, tcmu-runner ZBC device, scsi_debug, qemu nvme zns device) against *lots* of configurations for file systems (xfs, btrfs, zonefs) and DM targets (dm-crypt, dm-linear) and we have not seen any reordering issue. We run this test suite weekly against RC kernels and for-next branch. -- Damien Le Moal Western Digital Research ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order 2025-05-24 8:48 ` Damien Le Moal @ 2025-05-24 14:05 ` Bart Van Assche 2025-05-24 15:36 ` Damien Le Moal 0 siblings, 1 reply; 44+ messages in thread From: Bart Van Assche @ 2025-05-24 14:05 UTC (permalink / raw) To: Damien Le Moal, Christoph Hellwig Cc: Jens Axboe, linux-block, Yu Kuai, Ming Lei, Shin'ichiro Kawasaki On 5/24/25 1:48 AM, Damien Le Moal wrote: > Note that our internal test suite runs *lots* of different zoned devices (SMR > HDD, ZNS SSDs, nullblk, tcmu-runner ZBC device, scsi_debug, qemu nvme zns > device) against *lots* of configurations for file systems (xfs, btrfs, zonefs) > and DM targets (dm-crypt, dm-linear) and we have not seen any reordering issue. > We run this test suite weekly against RC kernels and for-next branch. Hi Damien, Please consider adding this dm-crypt test to the weekly test run: https://lore.kernel.org/linux-block/20250523164956.883024-1-bvanassche@acm.org Thanks, Bart. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order 2025-05-24 14:05 ` Bart Van Assche @ 2025-05-24 15:36 ` Damien Le Moal 0 siblings, 0 replies; 44+ messages in thread From: Damien Le Moal @ 2025-05-24 15:36 UTC (permalink / raw) To: Bart Van Assche, Christoph Hellwig Cc: Jens Axboe, linux-block, Yu Kuai, Ming Lei, Shin'ichiro Kawasaki On 5/24/25 16:05, Bart Van Assche wrote: > On 5/24/25 1:48 AM, Damien Le Moal wrote: >> Note that our internal test suite runs *lots* of different zoned devices (SMR >> HDD, ZNS SSDs, nullblk, tcmu-runner ZBC device, scsi_debug, qemu nvme zns >> device) against *lots* of configurations for file systems (xfs, btrfs, zonefs) >> and DM targets (dm-crypt, dm-linear) and we have not seen any reordering issue. >> We run this test suite weekly against RC kernels and for-next branch. > Hi Damien, > > Please consider adding this dm-crypt test to the weekly test run: > https://lore.kernel.org/linux-block/20250523164956.883024-1-bvanassche@acm.org We run blktests too, of course. So when Shin'ichiro applies this, it will be part of the runs. -- Damien Le Moal Western Digital Research ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order 2025-05-23 16:30 ` Bart Van Assche 2025-05-24 8:48 ` Damien Le Moal @ 2025-05-26 5:24 ` Christoph Hellwig 2025-05-27 16:19 ` Bart Van Assche 2025-06-08 22:07 ` Bart Van Assche 1 sibling, 2 replies; 44+ messages in thread From: Christoph Hellwig @ 2025-05-26 5:24 UTC (permalink / raw) To: Bart Van Assche Cc: Damien Le Moal, Christoph Hellwig, Jens Axboe, linux-block, Yu Kuai, Ming Lei, Shin'ichiro Kawasaki On Fri, May 23, 2025 at 09:30:36AM -0700, Bart Van Assche wrote: > It is the dm-default-key driver, a driver about which everyone > (including the authors of that driver) agree that it should disappear. > Unfortunately the functionality provided by that driver has not yet been > integrated in the upstream kernel (encrypt filesystem metadata). > > How that driver (dm-default-key) works is very similar to how dm-crypt > works. I think that the most important difference is that dm-crypt > requests encryption for all bios while dm-default-key only sets an > encryption key for a subset of the bios it processes. Umm, Bart, I really expected better from you. You're ducking around providing a reproducer for over a week and wasting multiple people's time to tell us the only reproducer is your out-of-tree thingy rejected upstream before? That's not really how Linux development works. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order 2025-05-26 5:24 ` Christoph Hellwig @ 2025-05-27 16:19 ` Bart Van Assche 2025-05-31 0:25 ` Bart Van Assche 0 siblings, 1 reply; 44+ messages in thread From: Bart Van Assche @ 2025-05-27 16:19 UTC (permalink / raw) To: Christoph Hellwig Cc: Damien Le Moal, Jens Axboe, linux-block, Yu Kuai, Ming Lei, Shin'ichiro Kawasaki On 5/25/25 10:24 PM, Christoph Hellwig wrote: > On Fri, May 23, 2025 at 09:30:36AM -0700, Bart Van Assche wrote: >> It is the dm-default-key driver, a driver about which everyone >> (including the authors of that driver) agree that it should disappear. >> Unfortunately the functionality provided by that driver has not yet been >> integrated in the upstream kernel (encrypt filesystem metadata). >> >> How that driver (dm-default-key) works is very similar to how dm-crypt >> works. I think that the most important difference is that dm-crypt >> requests encryption for all bios while dm-default-key only sets an >> encryption key for a subset of the bios it processes. > > Umm, Bart, I really expected better from you. You're ducking around > providing a reproducer for over a week and wasting multiple people's > time to tell us the only reproducer is your out-of-tree thingy > rejected upstream before? That's not really how Linux development > works. I'm still working on a reproducer for the blktests framework. To my own frustration I have not yet found a small reproducer that is easy to run by others. I'm convinced that it should be possible to create a reproducer that is based on dm-crypt since dm-crypt and the dm-default-key driver behave identically with regard to bio splitting. Bart. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order 2025-05-27 16:19 ` Bart Van Assche @ 2025-05-31 0:25 ` Bart Van Assche 0 siblings, 0 replies; 44+ messages in thread From: Bart Van Assche @ 2025-05-31 0:25 UTC (permalink / raw) To: Christoph Hellwig Cc: Damien Le Moal, Jens Axboe, linux-block, Yu Kuai, Ming Lei, Shin'ichiro Kawasaki On 5/27/25 9:19 AM, Bart Van Assche wrote: > I'm still working on a reproducer for the blktests framework. To my own > frustration I have not yet found a small reproducer that is easy to run > by others. I'm convinced that it should be possible to create a > reproducer that is based on dm-crypt since dm-crypt and the dm-default- > key driver behave identically with regard to bio splitting. (replying to my own email) I have not yet found a small reproducer - not even against the kernel this issue was originally observed with (android16-6.12). So I switched my approach. I'm now working on obtaining more information about the queued bios when this issue occurs. It seems like a bio for a SCSI disk device that shouldn't be split is split anyway. A bio A that does not cross a zone boundary is split into bios A1 and A2 and bio A1 is split into bios A11 and A12. Bio A12 is added at the end of the list (past bio A2) and this causes the reordering. I'm currently analyzing why this happens because this shouldn't happen. Bart. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order 2025-05-26 5:24 ` Christoph Hellwig 2025-05-27 16:19 ` Bart Van Assche @ 2025-06-08 22:07 ` Bart Van Assche 2025-06-08 22:47 ` Damien Le Moal 2025-06-09 3:55 ` Christoph Hellwig 1 sibling, 2 replies; 44+ messages in thread From: Bart Van Assche @ 2025-06-08 22:07 UTC (permalink / raw) To: Christoph Hellwig Cc: Damien Le Moal, Jens Axboe, linux-block, Yu Kuai, Ming Lei, Shin'ichiro Kawasaki On 5/25/25 10:24 PM, Christoph Hellwig wrote: > Umm, Bart I really expected better from you. You're ducking around > providing a reproducer for over a week and waste multiple peoples > time to tell us the only reproducer is your out of tree thingy > reject upstream before? That's not really how Linux developement > works. A reproducer for the blktests framework is available below. It took me more time than I had expected to come up with a reproducer because I was initially looking in the wrong direction (device mapper). As one can see, the reproducer below does not involve any device mapper driver. Do you plan to work on a fix or do you perhaps expect me to do this? Thanks, Bart. $ cat tests/zbd/014 #!/bin/bash # SPDX-License-Identifier: GPL-3.0+ # Copyright (C) 2025 Google LLC # # Trigger multiple splitting of a bio. About the values of the size parameters # in this test: # - The zone size is 4 MiB and is larger than max_sectors_kb. # - The maximum bio size supported by the code in block/blk-crypto-fallback.c # is BIO_MAX_VECS * PAGE_SIZE or 1 MiB if the page size is 4 KiB. # - max_sectors_kb has been set to 512 KiB to cause further bio splitting. . tests/zbd/rc . common/null_blk DESCRIPTION="test multiple bio splitting" QUICK=1 requires() { _have_driver f2fs _have_driver null_blk _have_program fscrypt _have_program getconf _have_program mkfs.f2fs for o in BLK_INLINE_ENCRYPTION_FALLBACK FS_ENCRYPTION_INLINE_CRYPT; do if ! 
_check_kernel_option "$o"; then SKIP_REASONS+=("Kernel option $o has not been enabled") fi done } trace_block_io() { ( set -e cd /sys/kernel/tracing lsof -t trace_pipe | xargs -r kill echo 1024 > buffer_size_kb echo nop > current_tracer echo 0 > tracing_on echo > trace echo 0 > events/enable echo 1 > events/block/enable echo 0 > events/block/block_dirty_buffer/enable echo 0 > events/block/block_plug/enable echo 0 > events/block/block_touch_buffer/enable echo 0 > events/block/block_unplug/enable grep -lw 1 events/block/*/enable | while read -r event_path; do filter_path="${event_path%enable}filter" echo "$1" > "$filter_path" done echo 1 > tracing_on echo "Tracing has been enabled" >>"$FULL" cat trace_pipe > "${FULL%.full}-block-trace.txt" ) } stop_tracing() { kill "$1" ( set -e cd /sys/kernel/tracing echo 0 > tracing_on echo 0 > events/enable ) } report_stats() { local pfx=$1 before after read -r -a before <<<"$2" read -r -a after <<<"$3" local reads=$((after[0]-before[0])) local rmerge=$((after[1]-before[1])) local writes=$((after[4]-before[4])) local wmerge=$((after[5]-before[5])) echo "$pfx reads: $reads rmerge: $rmerge writes: $writes wmerge: $wmerge" } devno() { IFS=: read -r maj min <"/sys/class/block/$(basename "$1")/dev" # From <linux/kdev_t.h> MINORBITS = 20 echo $(((maj << 20) + min)) } test() { echo "Running ${TEST_NAME}" local page_size page_size=$(getconf PAGE_SIZE) || return $? local mount_dir="$TMPDIR/mnt" umount /dev/nullb1 >>"$FULL" 2>&1 _remove_null_blk_devices # A small conventional block device for the F2FS metadata. local null_blk_params=( blocksize=4096 discard=1 max_sectors=$(((1 << 32) - 1)) memory_backed=1 size=64 # MiB submit_queues=1 power=1 ) _configure_null_blk nullb1 "${null_blk_params[@]}" local cdev=/dev/nullb1 # A larger zoned block device for the data. 
local null_blk_params=( blocksize=4096 completion_nsec=10000000 # 10 ms irqmode=2 max_sectors=$(((1 << 32) - 1)) memory_backed=1 hw_queue_depth=1 size=1024 # MiB submit_queues=1 zone_size="$((page_size >> 10))" # 1024 times the page size. zoned=1 power=1 ) _configure_null_blk nullb2 "${null_blk_params[@]}" || return $? local zdev_basename=nullb2 local zdev=/dev/${zdev_basename} local zdev_devno zdev_devno=$(devno "$zdev") { local max_sectors_zdev=/sys/class/block/"${zdev_basename}"/queue/max_sectors_kb echo $((128 * page_size)) > "${max_sectors_zdev}" echo "${zdev_basename}: max_sectors_kb=$(<"${max_sectors_zdev}")" ls -ld "${cdev}" "${zdev}" echo "${zdev_basename} settings:" (cd "/sys/class/block/$zdev_basename/queue" && grep -vw 0 ./*) } >>"${FULL}" 2>&1 trace_block_io "dev == ${zdev_devno}" & local trace_pid=$! while ! grep -q 'Tracing has been enabled' "$FULL"; do sleep .1 done { mkfs.f2fs -q -O encrypt -m "${cdev}" -c "${zdev}" && mkdir -p "${mount_dir}" && mount -o inlinecrypt -t f2fs "${cdev}" "${mount_dir}" && local encrypted_dir="${mount_dir}/encrypted" && mkdir -p "${encrypted_dir}" fscrypt setup "${mount_dir}" </dev/null local keyfile=$TMPDIR/keyfile && dd if=/dev/zero of="$keyfile" bs=32 count=1 status=none && fscrypt encrypt "${encrypted_dir}" --source=raw_key \ --name=protector --key="$keyfile" && fscrypt status "${encrypted_dir}" && local before after && read -r -a before <"/sys/class/block/${zdev_basename}/stat" && echo "Starting IO" && local cmd="dd if=/dev/zero of=${encrypted_dir}/file bs=64M count=15 conv=fdatasync" && echo "$cmd" && eval "$cmd" && ls -ld "${mount_dir}/encrypted/file" && read -r -a after <"/sys/class/block/${zdev_basename}/stat" && report_stats "zdev:" "${before[*]}" "${after[*]}" } >>"$FULL" 2>&1 || fail=true umount "${mount_dir}" >>"${FULL}" 2>&1 [ -n "${trace_pid}" ] && kill "${trace_pid}" _exit_null_blk if [ -z "$fail" ]; then echo "Test complete" else echo "Test failed" return 1 fi } ^ permalink raw reply [flat|nested] 44+ 
messages in thread
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order 2025-06-08 22:07 ` Bart Van Assche @ 2025-06-08 22:47 ` Damien Le Moal 2025-06-09 3:58 ` Christoph Hellwig 2025-06-09 20:48 ` Bart Van Assche 1 sibling, 2 replies; 44+ messages in thread From: Damien Le Moal @ 2025-06-08 22:47 UTC (permalink / raw) To: Bart Van Assche, Christoph Hellwig Cc: Jens Axboe, linux-block, Yu Kuai, Ming Lei, Shin'ichiro Kawasaki On 6/9/25 07:07, Bart Van Assche wrote: > On 5/25/25 10:24 PM, Christoph Hellwig wrote: >> Umm, Bart, I really expected better from you. You're ducking around >> providing a reproducer for over a week and wasting multiple people's >> time to tell us the only reproducer is your out-of-tree thingy >> rejected upstream before? That's not really how Linux development >> works. > > A reproducer for the blktests framework is available below. It took me > more time than I had expected to come up with a reproducer because I was > initially looking in the wrong direction (device mapper). As one can > see, the reproducer below does not involve any device mapper driver. > > Do you plan to work on a fix or do you perhaps expect me to do this? To be honest, I have never tested the block layer inline crypto feature using zoned devices. All the use cases I have with crypto are using dm-crypt. A quick look tells me that the issue is with that particular feature, especially the fallback path, as it is using work items to asynchronously handle write data encryption. Since this happens above the zone write plug, without that being aware of the sequential write constraint, problems may indeed happen. So yes, we need a fix. Can you work on one ? One additional thing: why do you need f2fs with your reproducer ? Problems should happen also doing writes directly to the block device, no ? > > Thanks, > > Bart. 
> [ ... ] -- Damien Le Moal Western Digital Research ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order 2025-06-08 22:47 ` Damien Le Moal @ 2025-06-09 3:58 ` Christoph Hellwig 0 siblings, 0 replies; 44+ messages in thread From: Christoph Hellwig @ 2025-06-09 3:58 UTC (permalink / raw) To: Damien Le Moal Cc: Bart Van Assche, Christoph Hellwig, Jens Axboe, linux-block, Yu Kuai, Ming Lei, Shin'ichiro Kawasaki On Mon, Jun 09, 2025 at 07:47:38AM +0900, Damien Le Moal wrote: > To be honest, I have never tested the block layer inline crypto feature using > zoned devices. All the use cases I have with crypto are using dm-crypt. > A quick look tells me that the issue is with that particular feature, especially > the fallback path, as it is using work items to asynchronously handle write data > encryption. Nothing asynchronous in the fallback write path, just weird open-coded bio splitting like the one I just got rid of with the bounce buffering. > One additional thing: why do you need f2fs with your reproducer ? Problems > should happen also doing writes directly to the block device, no ? Because fscrypt is the only user of the block crypto code, and f2fs is the only user of that that works on zoned devices. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order 2025-06-08 22:47 ` Damien Le Moal 2025-06-09 3:58 ` Christoph Hellwig @ 2025-06-09 20:48 ` Bart Van Assche 2025-06-10 5:04 ` Christoph Hellwig 1 sibling, 1 reply; 44+ messages in thread From: Bart Van Assche @ 2025-06-09 20:48 UTC (permalink / raw) To: Damien Le Moal, Christoph Hellwig Cc: Jens Axboe, linux-block, Yu Kuai, Ming Lei, Shin'ichiro Kawasaki On 6/8/25 3:47 PM, Damien Le Moal wrote: > So yes, we need a fix. Can you work on one ? The patch below seems to be sufficient but I'm not sure whether this approach is acceptable: Subject: [PATCH] block: Preserve the LBA order when splitting a bio Preserve the bio order if bio_split() is called on the prefix returned by an earlier bio_split() call. This can happen with fscrypt and the inline encryption fallback code if max_sectors is less than the maximum bio size supported by the inline encryption fallback code (1 MiB for 4 KiB pages) or when using zoned storage and the distance from the start of the bio to the next zone boundary is less than 1 MiB. Fixes: 488f6682c832 ("block: blk-crypto-fallback for Inline Encryption") Signed-off-by: Bart Van Assche <bvanassche@acm.org> --- block/bio.c | 8 ++++++++ block/blk-core.c | 12 ++++++++---- include/linux/blk_types.h | 5 +++++ 3 files changed, 21 insertions(+), 4 deletions(-) diff --git a/block/bio.c b/block/bio.c index 3c0a558c90f5..440ed443545c 100644 --- a/block/bio.c +++ b/block/bio.c @@ -1689,6 +1689,14 @@ struct bio *bio_split(struct bio *bio, int sectors, bio_advance(bio, split->bi_iter.bi_size); + /* + * If bio_split() is called on a prefix from an earlier bio_split() + * call, adding it at the head of current->bio_list[0] preserves the + * LBA order. This is essential when writing data to a zoned block + * device and when using REQ_OP_WRITE instead of REQ_OP_ZONE_APPEND. 
+ */ + bio_set_flag(bio, BIO_ADD_AT_HEAD); + if (bio_flagged(bio, BIO_TRACE_COMPLETION)) bio_set_flag(split, BIO_TRACE_COMPLETION); diff --git a/block/blk-core.c b/block/blk-core.c index b862c66018f2..570a14a7bcd4 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -745,12 +745,16 @@ void submit_bio_noacct_nocheck(struct bio *bio) * to collect a list of requests submited by a ->submit_bio method while * it is active, and then process them after it returned. */ - if (current->bio_list) - bio_list_add(¤t->bio_list[0], bio); - else if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) + if (current->bio_list) { + if (bio_flagged(bio, BIO_ADD_AT_HEAD)) + bio_list_add_head(¤t->bio_list[0], bio); + else + bio_list_add(¤t->bio_list[0], bio); + } else if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) { __submit_bio_noacct_mq(bio); - else + } else { __submit_bio_noacct(bio); + } } static blk_status_t blk_validate_atomic_write_op_size(struct request_queue *q, diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h index 3d1577f07c1c..0e2d3fd8d40a 100644 --- a/include/linux/blk_types.h +++ b/include/linux/blk_types.h @@ -308,6 +308,11 @@ enum { BIO_REMAPPED, BIO_ZONE_WRITE_PLUGGING, /* bio handled through zone write plugging */ BIO_EMULATES_ZONE_APPEND, /* bio emulates a zone append operation */ + /* + * make submit_bio_noacct_nocheck() add this bio at the head of + * current->bio_list[0]. + */ + BIO_ADD_AT_HEAD, BIO_FLAG_LAST }; ^ permalink raw reply related [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order 2025-06-09 20:48 ` Bart Van Assche @ 2025-06-10 5:04 ` Christoph Hellwig 0 siblings, 0 replies; 44+ messages in thread From: Christoph Hellwig @ 2025-06-10 5:04 UTC (permalink / raw) To: Bart Van Assche Cc: Damien Le Moal, Christoph Hellwig, Jens Axboe, linux-block, Yu Kuai, Ming Lei, Shin'ichiro Kawasaki On Mon, Jun 09, 2025 at 01:48:06PM -0700, Bart Van Assche wrote: > On 6/8/25 3:47 PM, Damien Le Moal wrote: >> So yes, we need a fix. Can you work on one ? > > The patch below seems to be sufficient but I'm not sure whether this > approach is acceptable: It's still the wrong approach. Please add the splitting needed for the blk-crypto fallback to __bio_split_to_limits instead. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order 2025-06-08 22:07 ` Bart Van Assche 2025-06-08 22:47 ` Damien Le Moal @ 2025-06-09 3:55 ` Christoph Hellwig 2025-06-10 17:23 ` Bart Van Assche 1 sibling, 1 reply; 44+ messages in thread From: Christoph Hellwig @ 2025-06-09 3:55 UTC (permalink / raw) To: Bart Van Assche Cc: Christoph Hellwig, Damien Le Moal, Jens Axboe, linux-block, Yu Kuai, Ming Lei, Shin'ichiro Kawasaki On Sun, Jun 08, 2025 at 03:07:30PM -0700, Bart Van Assche wrote: > On 5/25/25 10:24 PM, Christoph Hellwig wrote: >> Umm, Bart I really expected better from you. You're ducking around >> providing a reproducer for over a week and waste multiple peoples >> time to tell us the only reproducer is your out of tree thingy >> reject upstream before? That's not really how Linux developement >> works. > > A reproducer for the blktests framework is available below. It took me > more time than I had expected to come up with a reproducer because I was > initially looking in the wrong direction (device mapper). As one can > see, the reproducer below does not involve any device mapper driver. > > Do you plan to work on a fix or do you perhaps expect me to do this? The problem here is that blk_crypto_fallback_split_bio_if_needed does sneaky splits behind the back of the main splitting code. The fix is to include the limit imposed by it in __bio_split_to_limits as well if the crypto fallback is used. If you have time to fix this that would be great. Otherwise I can give it a spin, but it's public holiday and travel season here, so my availability is a bit limited. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order 2025-06-09 3:55 ` Christoph Hellwig @ 2025-06-10 17:23 ` Bart Van Assche 2025-06-10 23:18 ` Keith Busch 0 siblings, 1 reply; 44+ messages in thread From: Bart Van Assche @ 2025-06-10 17:23 UTC (permalink / raw) To: Christoph Hellwig Cc: Damien Le Moal, Jens Axboe, linux-block, Yu Kuai, Ming Lei, Shin'ichiro Kawasaki, Eric Biggers On 6/8/25 8:55 PM, Christoph Hellwig wrote: > The problem here is that blk_crypto_fallback_split_bio_if_needed does > sneaky splits behind the back of the main splitting code. > > The fix is to include the limit imposed by it in __bio_split_to_limits > as well if the crypto fallback is used. (+Eric) Hmm ... my understanding is that when the inline encryption fallback code is used that the bio must be split before blk_crypto_fallback_encrypt_bio() encrypts the bio. Making __bio_split_to_limits() take the inline encryption limit into account would require encrypting the data much later. How to perform encryption later for bio-based drivers? Would moving the blk_crypto_bio_prep() call from submit_bio() to just before __bio_split_to_limits() perhaps require modifying all bio-based drivers that do not call __bio_split_to_limits()? > If you have time to fix this that would be great. Otherwise I can > give it a spin, but it's public holiday and travel season here, so > my availability is a bit limited. 
This is not the solution that you are looking for but this seems to work: diff --git a/block/blk-crypto-fallback.c b/block/blk-crypto-fallback.c index 7c33e9573e5e..f4fefecdcc5e 100644 --- a/block/blk-crypto-fallback.c +++ b/block/blk-crypto-fallback.c @@ -213,6 +213,7 @@ blk_crypto_fallback_alloc_cipher_req(struct blk_crypto_keyslot *slot, static bool blk_crypto_fallback_split_bio_if_needed(struct bio **bio_ptr) { struct bio *bio = *bio_ptr; + const struct queue_limits *lim = bdev_limits(bio->bi_bdev); unsigned int i = 0; unsigned int num_sectors = 0; struct bio_vec bv; @@ -223,6 +224,7 @@ static bool blk_crypto_fallback_split_bio_if_needed(struct bio **bio_ptr) if (++i == BIO_MAX_VECS) break; } + num_sectors = min(num_sectors, get_max_io_size(bio, lim)); if (num_sectors < bio_sectors(bio)) { struct bio *split_bio; diff --git a/block/blk-merge.c b/block/blk-merge.c index fb6253c07387..e308325a333c 100644 --- a/block/blk-merge.c +++ b/block/blk-merge.c @@ -192,8 +192,7 @@ static inline unsigned int blk_boundary_sectors(const struct queue_limits *lim, * requests that are submitted to a block device if the start of a bio is not * aligned to a physical block boundary. */ -static inline unsigned get_max_io_size(struct bio *bio, - const struct queue_limits *lim) +unsigned get_max_io_size(struct bio *bio, const struct queue_limits *lim) { unsigned pbs = lim->physical_block_size >> SECTOR_SHIFT; unsigned lbs = lim->logical_block_size >> SECTOR_SHIFT; diff --git a/block/blk.h b/block/blk.h index 37ec459fe656..5f97db919cdf 100644 --- a/block/blk.h +++ b/block/blk.h @@ -425,6 +425,8 @@ static inline unsigned get_max_segment_size(const struct queue_limits *lim, (unsigned long)lim->max_segment_size - 1) + 1); } +unsigned get_max_io_size(struct bio *bio, const struct queue_limits *lim); + int ll_back_merge_fn(struct request *req, struct bio *bio, unsigned int nr_segs); bool blk_attempt_req_merge(struct request_queue *q, struct request *rq, Thanks, Bart. 
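The effect of the clamp in the patch above can be sketched as a userspace model. This is an illustration only: BIO_MAX_VECS matches the kernel constant, but the uniform segment size and the plain max_io_sectors parameter are simplifying assumptions standing in for the real bio_for_each_segment() walk and for get_max_io_size().

```c
#include <assert.h>

#define BIO_MAX_VECS 256

/* Sectors the fallback can cover before exhausting bounce bvecs, for a
 * bio whose segments are all seg_sectors long (a simplification of the
 * real per-segment walk). */
static unsigned int fallback_sectors(unsigned int nsegs,
				     unsigned int seg_sectors)
{
	unsigned int segs = nsegs < BIO_MAX_VECS ? nsegs : BIO_MAX_VECS;

	return segs * seg_sectors;
}

/* The clamp: never pick a split point beyond what the queue's maximum
 * I/O size (modeled here as a plain parameter) allows. */
static unsigned int split_sectors(unsigned int nsegs,
				  unsigned int seg_sectors,
				  unsigned int max_io_sectors)
{
	unsigned int n = fallback_sectors(nsegs, seg_sectors);

	return n < max_io_sectors ? n : max_io_sectors;
}
```

With the clamp, the fallback's split point can only shrink down to the queue limit, never land past it, so the main splitting code no longer needs to split the same bio a second time.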
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order 2025-06-10 17:23 ` Bart Van Assche @ 2025-06-10 23:18 ` Keith Busch 2025-06-11 0:46 ` Damien Le Moal ` (2 more replies) 0 siblings, 3 replies; 44+ messages in thread From: Keith Busch @ 2025-06-10 23:18 UTC (permalink / raw) To: Bart Van Assche Cc: Christoph Hellwig, Damien Le Moal, Jens Axboe, linux-block, Yu Kuai, Ming Lei, Shin'ichiro Kawasaki, Eric Biggers On Tue, Jun 10, 2025 at 10:23:26AM -0700, Bart Van Assche wrote: > On 6/8/25 8:55 PM, Christoph Hellwig wrote: > > The problem here is that blk_crypto_fallback_split_bio_if_needed does > > sneaky splits behind the back of the main splitting code. > > > > The fix is to include the limit imposed by it in __bio_split_to_limits > > as well if the crypto fallback is used. > > (+Eric) > > Hmm ... my understanding is that when the inline encryption fallback > code is used that the bio must be split before > blk_crypto_fallback_encrypt_bio() encrypts the bio. Making > __bio_split_to_limits() take the inline encryption limit into account > would require to encrypt the data much later. How to perform encryption > later for bio-based drivers? Would moving the blk_crypto_bio_prep() call > from submit_bio() to just before __bio_split_to_limits() perhaps require > modifying all bio-based drivers that do not call > __bio_split_to_limits()? I think you could just prep the encryption at the point the bio is split to its queue's limits, and then all you need to do after that is ensure the limits don't exceed what the fallback requires (which appears to just be a simple segment limitation). It looks like most of the bio based drivers split to limits already. And if that's all okay, it simplifies quite a lot of things in this path because the crypto fallback doesn't need to concern itself with splitting anymore. 
Here's a quick shot a it, but compile tested only: --- diff --git a/block/blk-core.c b/block/blk-core.c index b862c66018f25..157f4515eea7d 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -39,7 +39,6 @@ #include <linux/bpf.h> #include <linux/part_stat.h> #include <linux/sched/sysctl.h> -#include <linux/blk-crypto.h> #define CREATE_TRACE_POINTS #include <trace/events/block.h> @@ -626,9 +625,6 @@ static void __submit_bio(struct bio *bio) /* If plug is not used, add new plug here to cache nsecs time. */ struct blk_plug plug; - if (unlikely(!blk_crypto_bio_prep(&bio))) - return; - blk_start_plug(&plug); if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) { diff --git a/block/blk-crypto-fallback.c b/block/blk-crypto-fallback.c index 005c9157ffb3d..794400382e85b 100644 --- a/block/blk-crypto-fallback.c +++ b/block/blk-crypto-fallback.c @@ -209,36 +209,6 @@ blk_crypto_fallback_alloc_cipher_req(struct blk_crypto_keyslot *slot, return true; } -static bool blk_crypto_fallback_split_bio_if_needed(struct bio **bio_ptr) -{ - struct bio *bio = *bio_ptr; - unsigned int i = 0; - unsigned int num_sectors = 0; - struct bio_vec bv; - struct bvec_iter iter; - - bio_for_each_segment(bv, bio, iter) { - num_sectors += bv.bv_len >> SECTOR_SHIFT; - if (++i == BIO_MAX_VECS) - break; - } - if (num_sectors < bio_sectors(bio)) { - struct bio *split_bio; - - split_bio = bio_split(bio, num_sectors, GFP_NOIO, - &crypto_bio_split); - if (IS_ERR(split_bio)) { - bio->bi_status = BLK_STS_RESOURCE; - return false; - } - bio_chain(split_bio, bio); - submit_bio_noacct(bio); - *bio_ptr = split_bio; - } - - return true; -} - union blk_crypto_iv { __le64 dun[BLK_CRYPTO_DUN_ARRAY_SIZE]; u8 bytes[BLK_CRYPTO_MAX_IV_SIZE]; @@ -275,10 +245,6 @@ static bool blk_crypto_fallback_encrypt_bio(struct bio **bio_ptr) bool ret = false; blk_status_t blk_st; - /* Split the bio if it's too big for single page bvec */ - if (!blk_crypto_fallback_split_bio_if_needed(bio_ptr)) - return false; - src_bio = *bio_ptr; bc 
= src_bio->bi_crypt_context; data_unit_size = bc->bc_key->crypto_cfg.data_unit_size; diff --git a/block/blk-crypto-profile.c b/block/blk-crypto-profile.c index 94a155912bf1c..2804843310cc4 100644 --- a/block/blk-crypto-profile.c +++ b/block/blk-crypto-profile.c @@ -459,6 +459,16 @@ bool blk_crypto_register(struct blk_crypto_profile *profile, pr_warn("Integrity and hardware inline encryption are not supported together. Disabling hardware inline encryption.\n"); return false; } + + if (queue_max_segments(q) > BIO_MAX_VECS) { + struct queue_limits lim; + + lim = queue_limits_start_update(q); + lim.max_segments = BIO_MAX_VECS; + if (queue_limits_commit_update(q, &lim)) + return false; + } + q->crypto_profile = profile; return true; } diff --git a/block/blk-merge.c b/block/blk-merge.c index 3af1d284add50..d6d6f5b214c82 100644 --- a/block/blk-merge.c +++ b/block/blk-merge.c @@ -124,9 +124,10 @@ static struct bio *bio_submit_split(struct bio *bio, int split_sectors) trace_block_split(split, bio->bi_iter.bi_sector); WARN_ON_ONCE(bio_zone_write_plugging(bio)); submit_bio_noacct(bio); - return split; + bio = split; } + blk_crypto_bio_prep(&bio); return bio; error: bio->bi_status = errno_to_blk_status(split_sectors); -- ^ permalink raw reply related [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order 2025-06-10 23:18 ` Keith Busch @ 2025-06-11 0:46 ` Damien Le Moal 2025-06-11 1:00 ` Keith Busch 2025-06-11 1:34 ` Keith Busch 2025-06-11 3:40 ` Christoph Hellwig 2 siblings, 1 reply; 44+ messages in thread From: Damien Le Moal @ 2025-06-11 0:46 UTC (permalink / raw) To: Keith Busch, Bart Van Assche Cc: Christoph Hellwig, Jens Axboe, linux-block, Yu Kuai, Ming Lei, Shin'ichiro Kawasaki, Eric Biggers On 6/11/25 8:18 AM, Keith Busch wrote: > On Tue, Jun 10, 2025 at 10:23:26AM -0700, Bart Van Assche wrote: >> On 6/8/25 8:55 PM, Christoph Hellwig wrote: >>> The problem here is that blk_crypto_fallback_split_bio_if_needed does >>> sneaky splits behind the back of the main splitting code. >>> >>> The fix is to include the limit imposed by it in __bio_split_to_limits >>> as well if the crypto fallback is used. >> >> (+Eric) >> >> Hmm ... my understanding is that when the inline encryption fallback >> code is used that the bio must be split before >> blk_crypto_fallback_encrypt_bio() encrypts the bio. Making >> __bio_split_to_limits() take the inline encryption limit into account >> would require to encrypt the data much later. How to perform encryption >> later for bio-based drivers? Would moving the blk_crypto_bio_prep() call >> from submit_bio() to just before __bio_split_to_limits() perhaps require >> modifying all bio-based drivers that do not call >> __bio_split_to_limits()? > > I think you could just prep the encryption at the point the bio is split > to its queue's limits, and then all you need to do after that is ensure > the limits don't exceed what the fallback requires (which appears to > just be a simple segment limitation). It looks like most of the bio > based drivers split to limits already. Nope, at least not DM, and by that, I mean not in the sense of splitting BIOs using bio_split_to_limits(). The only BIOs being split are "abnormal" ones, such as discard. 
That said, DM does split the BIOs through cloning and also has the "accept partial" thing, which is used in many places and allows DM target drivers to split BIOs as they are received, but not necessarily based on the device queue limits (e.g. dm-crypt splits reads and writes using an internal max_read|write_size limit). So inline crypto on top of a DM/bio-based device may need some explicit splitting added in dm.c (__max_io_len() maybe?). > And if that's all okay, it simplifies quite a lot of things in this path > because the crypto fallback doesn't need to concern itself with > splitting anymore. > Here's a quick shot at it, but compile tested only: > > --- > diff --git a/block/blk-core.c b/block/blk-core.c > index b862c66018f25..157f4515eea7d 100644 > --- a/block/blk-core.c > +++ b/block/blk-core.c > @@ -39,7 +39,6 @@ > #include <linux/bpf.h> > #include <linux/part_stat.h> > #include <linux/sched/sysctl.h> > -#include <linux/blk-crypto.h> > > #define CREATE_TRACE_POINTS > #include <trace/events/block.h> > @@ -626,9 +625,6 @@ static void __submit_bio(struct bio *bio) > /* If plug is not used, add new plug here to cache nsecs time. 
*/ > struct blk_plug plug; > > - if (unlikely(!blk_crypto_bio_prep(&bio))) > - return; > - > blk_start_plug(&plug); > > if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) { > diff --git a/block/blk-crypto-fallback.c b/block/blk-crypto-fallback.c > index 005c9157ffb3d..794400382e85b 100644 > --- a/block/blk-crypto-fallback.c > +++ b/block/blk-crypto-fallback.c > @@ -209,36 +209,6 @@ blk_crypto_fallback_alloc_cipher_req(struct blk_crypto_keyslot *slot, > return true; > } > > -static bool blk_crypto_fallback_split_bio_if_needed(struct bio **bio_ptr) > -{ > - struct bio *bio = *bio_ptr; > - unsigned int i = 0; > - unsigned int num_sectors = 0; > - struct bio_vec bv; > - struct bvec_iter iter; > - > - bio_for_each_segment(bv, bio, iter) { > - num_sectors += bv.bv_len >> SECTOR_SHIFT; > - if (++i == BIO_MAX_VECS) > - break; > - } > - if (num_sectors < bio_sectors(bio)) { > - struct bio *split_bio; > - > - split_bio = bio_split(bio, num_sectors, GFP_NOIO, > - &crypto_bio_split); > - if (IS_ERR(split_bio)) { > - bio->bi_status = BLK_STS_RESOURCE; > - return false; > - } > - bio_chain(split_bio, bio); > - submit_bio_noacct(bio); > - *bio_ptr = split_bio; > - } > - > - return true; > -} > - > union blk_crypto_iv { > __le64 dun[BLK_CRYPTO_DUN_ARRAY_SIZE]; > u8 bytes[BLK_CRYPTO_MAX_IV_SIZE]; > @@ -275,10 +245,6 @@ static bool blk_crypto_fallback_encrypt_bio(struct bio **bio_ptr) > bool ret = false; > blk_status_t blk_st; > > - /* Split the bio if it's too big for single page bvec */ > - if (!blk_crypto_fallback_split_bio_if_needed(bio_ptr)) > - return false; > - > src_bio = *bio_ptr; > bc = src_bio->bi_crypt_context; > data_unit_size = bc->bc_key->crypto_cfg.data_unit_size; > diff --git a/block/blk-crypto-profile.c b/block/blk-crypto-profile.c > index 94a155912bf1c..2804843310cc4 100644 > --- a/block/blk-crypto-profile.c > +++ b/block/blk-crypto-profile.c > @@ -459,6 +459,16 @@ bool blk_crypto_register(struct blk_crypto_profile *profile, > pr_warn("Integrity and hardware 
inline encryption are not supported together. Disabling hardware inline encryption.\n"); > return false; > } > + > + if (queue_max_segments(q) > BIO_MAX_VECS) { > + struct queue_limits lim; > + > + lim = queue_limits_start_update(q); > + lim.max_segments = BIO_MAX_VECS; > + if (queue_limits_commit_update(q, &lim)) > + return false; > + } > + > q->crypto_profile = profile; > return true; > } > diff --git a/block/blk-merge.c b/block/blk-merge.c > index 3af1d284add50..d6d6f5b214c82 100644 > --- a/block/blk-merge.c > +++ b/block/blk-merge.c > @@ -124,9 +124,10 @@ static struct bio *bio_submit_split(struct bio *bio, int split_sectors) > trace_block_split(split, bio->bi_iter.bi_sector); > WARN_ON_ONCE(bio_zone_write_plugging(bio)); > submit_bio_noacct(bio); > - return split; > + bio = split; > } > > + blk_crypto_bio_prep(&bio); > return bio; > error: > bio->bi_status = errno_to_blk_status(split_sectors); > -- -- Damien Le Moal Western Digital Research ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order 2025-06-11 0:46 ` Damien Le Moal @ 2025-06-11 1:00 ` Keith Busch 2025-06-11 1:02 ` Damien Le Moal 0 siblings, 1 reply; 44+ messages in thread From: Keith Busch @ 2025-06-11 1:00 UTC (permalink / raw) To: Damien Le Moal Cc: Bart Van Assche, Christoph Hellwig, Jens Axboe, linux-block, Yu Kuai, Ming Lei, Shin'ichiro Kawasaki, Eric Biggers On Wed, Jun 11, 2025 at 09:46:31AM +0900, Damien Le Moal wrote: > On 6/11/25 8:18 AM, Keith Busch wrote: > > > > I think you could just prep the encryption at the point the bio is split > > to its queue's limits, and then all you need to do after that is ensure > > the limits don't exceed what the fallback requires (which appears to > > just be a simple segment limitation). It looks like most of the bio > > based drivers split to limits already. > > Nope, at least not DM, and by that, I mean not in the sense of splitting BIOs > using bio_split_to_limits(). The only BIOs being split are "abnormal" ones, > such as discard. > > That said, DM does split the BIOs through cloning and also has the "accept > partial" thing, which is used in many places and allows DM target drivers to > split BIOs as they are received, but not necessarilly based on the device queue > limits (e.g. dm-crypt splits reads and writes using an internal > max_read|write_size limit). > > So inline crypto on top of a DM/bio-based device may need some explicit > splitting added in dm.c (__max_io_len() maybe ?). Ah, thanks for pointing that out! In that case, could we have dm consider the bio "abnormal" when bio_has_crypt_ctx(bio) is true? That way it does the split-to-limits for these, and dm appears to do that action very early before its own split/clone actions on the resulting bio. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order 2025-06-11 1:00 ` Keith Busch @ 2025-06-11 1:02 ` Damien Le Moal 2025-06-11 1:08 ` Keith Busch 0 siblings, 1 reply; 44+ messages in thread From: Damien Le Moal @ 2025-06-11 1:02 UTC (permalink / raw) To: Keith Busch Cc: Bart Van Assche, Christoph Hellwig, Jens Axboe, linux-block, Yu Kuai, Ming Lei, Shin'ichiro Kawasaki, Eric Biggers On 6/11/25 10:00 AM, Keith Busch wrote: > On Wed, Jun 11, 2025 at 09:46:31AM +0900, Damien Le Moal wrote: >> On 6/11/25 8:18 AM, Keith Busch wrote: >>> >>> I think you could just prep the encryption at the point the bio is split >>> to its queue's limits, and then all you need to do after that is ensure >>> the limits don't exceed what the fallback requires (which appears to >>> just be a simple segment limitation). It looks like most of the bio >>> based drivers split to limits already. >> >> Nope, at least not DM, and by that, I mean not in the sense of splitting BIOs >> using bio_split_to_limits(). The only BIOs being split are "abnormal" ones, >> such as discard. >> >> That said, DM does split the BIOs through cloning and also has the "accept >> partial" thing, which is used in many places and allows DM target drivers to >> split BIOs as they are received, but not necessarilly based on the device queue >> limits (e.g. dm-crypt splits reads and writes using an internal >> max_read|write_size limit). >> >> So inline crypto on top of a DM/bio-based device may need some explicit >> splitting added in dm.c (__max_io_len() maybe ?). > > Ah, thanks for pointing that out! In that case, could we have dm > consider the bio "abnormal" when bio_has_crypt_ctx(bio) is true? That > way it does the split-to-limits for these, and dm appears to do that > action very early before its own split/clone actions on the resulting > bio. That sounds reasonable to me. -- Damien Le Moal Western Digital Research ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order 2025-06-11 1:02 ` Damien Le Moal @ 2025-06-11 1:08 ` Keith Busch 0 siblings, 0 replies; 44+ messages in thread From: Keith Busch @ 2025-06-11 1:08 UTC (permalink / raw) To: Damien Le Moal Cc: Bart Van Assche, Christoph Hellwig, Jens Axboe, linux-block, Yu Kuai, Ming Lei, Shin'ichiro Kawasaki, Eric Biggers On Wed, Jun 11, 2025 at 10:02:34AM +0900, Damien Le Moal wrote: > On 6/11/25 10:00 AM, Keith Busch wrote: > > Ah, thanks for pointing that out! In that case, could we have dm > > consider the bio "abnormal" when bio_has_crypt_ctx(bio) is true? That > > way it does the split-to-limits for these, and dm appears to do that > > action very early before its own split/clone actions on the resulting > > bio. > > That sounds reasonable to me. Cool, that's just this atop the previous patch: --- diff --git a/drivers/md/dm.c b/drivers/md/dm.c index 5ab7574c0c76a..d8a43f37a9283 100644 --- a/drivers/md/dm.c +++ b/drivers/md/dm.c @@ -1632,7 +1632,7 @@ static bool is_abnormal_io(struct bio *bio) case REQ_OP_READ: case REQ_OP_WRITE: case REQ_OP_FLUSH: - return false; + return bio_has_crypt_ctx(bio); case REQ_OP_DISCARD: case REQ_OP_SECURE_ERASE: case REQ_OP_WRITE_ZEROES: -- ^ permalink raw reply related [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order 2025-06-10 23:18 ` Keith Busch 2025-06-11 0:46 ` Damien Le Moal @ 2025-06-11 1:34 ` Keith Busch 2025-06-11 3:40 ` Christoph Hellwig 2 siblings, 0 replies; 44+ messages in thread From: Keith Busch @ 2025-06-11 1:34 UTC (permalink / raw) To: Bart Van Assche Cc: Christoph Hellwig, Damien Le Moal, Jens Axboe, linux-block, Yu Kuai, Ming Lei, Shin'ichiro Kawasaki, Eric Biggers On Tue, Jun 10, 2025 at 05:18:03PM -0600, Keith Busch wrote: > @@ -124,9 +124,10 @@ static struct bio *bio_submit_split(struct bio *bio, int split_sectors) > trace_block_split(split, bio->bi_iter.bi_sector); > WARN_ON_ONCE(bio_zone_write_plugging(bio)); > submit_bio_noacct(bio); > - return split; > + bio = split; > } > > + blk_crypto_bio_prep(&bio); Eh, this part needs to handle the error condition: + if (!blk_crypto_bio_prep(&bio)) + return NULL; Happy to make a formal patch out of this if this approach is promising. I think it aligns with the spirit of what Christoph was suggesting, at least. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order 2025-06-10 23:18 ` Keith Busch 2025-06-11 0:46 ` Damien Le Moal 2025-06-11 1:34 ` Keith Busch @ 2025-06-11 3:40 ` Christoph Hellwig 2025-06-11 4:21 ` Eric Biggers 2 siblings, 1 reply; 44+ messages in thread From: Christoph Hellwig @ 2025-06-11 3:40 UTC (permalink / raw) To: Keith Busch Cc: Bart Van Assche, Christoph Hellwig, Damien Le Moal, Jens Axboe, linux-block, Yu Kuai, Ming Lei, Shin'ichiro Kawasaki, Eric Biggers On Tue, Jun 10, 2025 at 05:18:03PM -0600, Keith Busch wrote: > I think you could just prep the encryption at the point the bio is split > to its queue's limits, and then all you need to do after that is ensure > the limits don't exceed what the fallback requires (which appears to > just be a simple segment limitation). It looks like most of the bio > based drivers split to limits already. I'm still a bit confused about the interaction of block-crypto and stacking drivers. For native support you obviously always want to pass the crypt context down to the lowest level hardware driver, and dm has code to support this. But if your stacking driver is not dm, or the algorithm is not supported by the underlying hardware, it looks like we might still be able to run the fallback for a stacking driver. Or am I missing something here? Eric/Bart: any chance to get blktests (or xfstests if that's easier to wire up) to exercise all these corner cases? > And if that's all okay, it simplifies quite a lot of things in this path > because the crypto fallback doesn't need to concern itself with > splitting anymore. > Here's a quick shot at it, but compile tested only: Exactly. And in the next step we can significantly simplify and speed up the bio submission recursion protection if __bio_split_to_limits is the only place that can split/reinject bios for blk-mq drivers. 
> diff --git a/block/blk-crypto-profile.c b/block/blk-crypto-profile.c > index 94a155912bf1c..2804843310cc4 100644 > --- a/block/blk-crypto-profile.c > +++ b/block/blk-crypto-profile.c > @@ -459,6 +459,16 @@ bool blk_crypto_register(struct blk_crypto_profile *profile, > pr_warn("Integrity and hardware inline encryption are not supported together. Disabling hardware inline encryption.\n"); > return false; > } > + > + if (queue_max_segments(q) > BIO_MAX_VECS) { > + struct queue_limits lim; > + > + lim = queue_limits_start_update(q); > + lim.max_segments = BIO_MAX_VECS; > + if (queue_limits_commit_update(q, &lim)) > + return false; > + } I think this limit actually only exists when using the fallback, i.e. when no profile is registered. And even then only for encrypted bios. We'll also need to ensure the queue is frozen and deal with limit changes after the profile registration. I suspect we should probably integrate the queue profile with the queue limits, so things are done in one place, but that's out of scope for this patch. It's also not just a segment limit, but also an overall size limit, i.e. it must be no more than 256 * 4k. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order 2025-06-11 3:40 ` Christoph Hellwig @ 2025-06-11 4:21 ` Eric Biggers 2025-06-11 16:15 ` Bart Van Assche 0 siblings, 1 reply; 44+ messages in thread From: Eric Biggers @ 2025-06-11 4:21 UTC (permalink / raw) To: Christoph Hellwig Cc: Keith Busch, Bart Van Assche, Damien Le Moal, Jens Axboe, linux-block, Yu Kuai, Ming Lei, Shin'ichiro Kawasaki On Wed, Jun 11, 2025 at 05:40:31AM +0200, Christoph Hellwig wrote: > On Tue, Jun 10, 2025 at 05:18:03PM -0600, Keith Busch wrote: > > I think you could just prep the encryption at the point the bio is split > > to its queue's limits, and then all you need to do after that is ensure > > the limits don't exceed what the fallback requires (which appears to > > just be a simple segment limitation). It looks like most of the bio > > based drivers split to limits already. > > I'm still a bit confused about the interaction of block-crypto and > stacking drivers. For native support you obviously always want to pass > the crypt context down to the lowest level hardware driver, and dm > has code to support this. But if you stacking driver is not dm, or > the algorithm is not supported by the underlying hardware it looks > like we might still be able to run the fallback for a stacking > driver. Or am I missing something here? Eric/Bart: any chance to > get blktests (or xfstets if that's easier to wire up) to exercise all > these corner cases? blk-crypto-fallback runs at the top layer, so yes it's different from native inline encryption support where the encryption is done at the bottom. (But the results are the same from the filesystem's perspective, since native support only gets passed through and used when it would give the expected result.) If it helps, blk-crypto-fallback can be exercised by running the "encrypt" group of xfstests on ext4 or f2fs with "inlinecrypt" added to the mount options. 
These include tests that the data on-disk is actually encrypted correctly. But that probably doesn't help much here, where the issue is with bio splitting. It's been a while since I looked at blk-crypto-fallback. But the reason that it does splitting is because it has to replace the data with encrypted data, and it allocates encrypted bounce pages individually which each needs its own bio_vec. Therefore, if the bio is longer than BIO_MAX_VECS pages, it has to be split. If the splitting to stay within BIO_MAX_VECS pages could be done before blk-crypto-fallback is given the bio instead, that should work fine too. Just keep in mind that blk-crypto-fallback is meant to work on any block device (even ones that don't have a crypto profile, as the profile is just for the native support). So we do need to make sure it always gets invoked when needed. - Eric ^ permalink raw reply [flat|nested] 44+ messages in thread
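Eric's BIO_MAX_VECS constraint can be stated as a small userspace calculation. This is a sketch, not the kernel's code; it assumes one bounce page (and hence one bio_vec) per data page, as described above.

```c
#include <assert.h>

#define BIO_MAX_VECS 256

/* One bounce page per data page, one bio_vec per bounce page: a single
 * encrypted bio can carry at most BIO_MAX_VECS pages, so a larger bio
 * must be split into this many pieces. */
static unsigned int bounce_bios_needed(unsigned int data_pages)
{
	return (data_pages + BIO_MAX_VECS - 1) / BIO_MAX_VECS;
}
```

Doing this split before the fallback sees the bio, rather than inside it, is exactly the restructuring being discussed in this thread.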
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order 2025-06-11 4:21 ` Eric Biggers @ 2025-06-11 16:15 ` Bart Van Assche 2025-06-11 18:15 ` Eric Biggers 0 siblings, 1 reply; 44+ messages in thread From: Bart Van Assche @ 2025-06-11 16:15 UTC (permalink / raw) To: Eric Biggers, Christoph Hellwig Cc: Keith Busch, Damien Le Moal, Jens Axboe, linux-block, Yu Kuai, Ming Lei, Shin'ichiro Kawasaki On 6/10/25 9:21 PM, Eric Biggers wrote: > blk-crypto-fallback runs at the top layer, so yes it's different from native > inline encryption support where the encryption is done at the bottom. (But the > results are the same from the filesystem's perspective, since native support > only gets passed through and used when it would give the expected result.) Although I'm not sure Keith realizes this, his patch may move encryption from the top of the block driver stack (a device mapper driver) to the bottom (something else than a device mapper driver). This may happen because device mapper drivers do not split bios unless this is essential, e.g. because the LBA range of a bio spans more than one entry in the mapping table. Is my understanding correct that this is acceptable because the encryption IV is based on the file offset provided by the filesystem and not on the LBA where the data is written? > Just keep in mind that blk-crypto-fallback is meant to work on any block device > (even ones that don't have a crypto profile, as the profile is just for the > native support). So we do need to make sure it always gets invoked when needed. I propose that we require that bio-based drivers must call bio_split_to_limits() to support inline encryption. Otherwise the approach of Keith's patch can't work. Does this seem reasonable to you? As far as I can tell upstream bio-based drivers already call bio_split_to_limits(). Thanks, Bart. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order 2025-06-11 16:15 ` Bart Van Assche @ 2025-06-11 18:15 ` Eric Biggers 2025-06-11 19:43 ` Bart Van Assche 0 siblings, 1 reply; 44+ messages in thread From: Eric Biggers @ 2025-06-11 18:15 UTC (permalink / raw) To: Bart Van Assche Cc: Christoph Hellwig, Keith Busch, Damien Le Moal, Jens Axboe, linux-block, Yu Kuai, Ming Lei, Shin'ichiro Kawasaki On Wed, Jun 11, 2025 at 09:15:05AM -0700, Bart Van Assche wrote: > On 6/10/25 9:21 PM, Eric Biggers wrote: > > blk-crypto-fallback runs at the top layer, so yes it's different from native > > inline encryption support where the encryption is done at the bottom. (But the > > results are the same from the filesystem's perspective, since native support > > only gets passed through and used when it would give the expected result.) > > Although I'm not sure Keith realizes this, his patch may move encryption > from the top of the block driver stack (a device mapper driver) to the > bottom (something else than a device mapper driver). This may happen > because device mapper drivers do not split bios unless this is > essential, e.g. because the LBA range of a bio spans more than one entry > in the mapping table. > > Is my understanding correct that this is acceptable because the > encryption IV is based on the file offset provided by the filesystem and > not on the LBA where the data is written? The IV is provided in the bio_crypt_context. The encryption can be done at a lower layer, like how hardware inline encryption works, if the layers above are compatible with it. (E.g., they must not change the data.) Ultimately, the data that's actually written to disk needs to be identical to the data that would have been written if the user encrypted the data themselves and submitted a pre-encrypted bio. 
> > > Just keep in mind that blk-crypto-fallback is meant to work on any block device > > (even ones that don't have a crypto profile, as the profile is just for the > > native support). So we do need to make sure it always gets invoked when needed. > > I propose that we require that bio-based drivers must call > bio_split_to_limits() to support inline encryption. Otherwise the > approach of Keith's patch can't work. Does this seem reasonable to you? > > As far as I can tell upstream bio-based drivers already call > bio_split_to_limits(). Well, again it needs to work on any block device. If the encryption might just not be done and plaintext ends up on-disk, then blk-crypto-fallback would be unsafe to use. It would be preferable to have blk-crypto-fallback continue to be handled in the block layer so that drivers don't need to worry about it. - Eric ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order 2025-06-11 18:15 ` Eric Biggers @ 2025-06-11 19:43 ` Bart Van Assche 2025-06-18 22:27 ` Bart Van Assche 0 siblings, 1 reply; 44+ messages in thread From: Bart Van Assche @ 2025-06-11 19:43 UTC (permalink / raw) To: Eric Biggers Cc: Christoph Hellwig, Keith Busch, Damien Le Moal, Jens Axboe, linux-block, Yu Kuai, Ming Lei, Shin'ichiro Kawasaki On 6/11/25 11:15 AM, Eric Biggers wrote: > Well, again it needs to work on any block device. If the encryption might just > not be done and plaintext ends up on-disk, then blk-crypto-fallback would be > unsafe to use. > > It would be preferable to have blk-crypto-fallback continue to be handled in the > block layer so that drivers don't need to worry about it. This concern could be addressed by introducing a new flag in struct block_device_operations or struct queue_limits - a flag that indicates that bio_split_to_limits() will be called by the block driver. If that flag is set, blk_crypto_bio_prep() can be called from bio_submit_split(). If that flag is not set, blk_crypto_bio_prep() should be called from __submit_bio(). The latter behavior is the current behavior for the upstream kernel. Thanks, Bart. ^ permalink raw reply [flat|nested] 44+ messages in thread
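The proposed dispatch could be sketched as follows. This is a userspace model only: `BD_SPLITS_BIO_TO_LIMITS` is the flag name used by the later patch in this thread, and the bool parameter stands in for the real bdev flag test.

```c
#include <assert.h>
#include <stdbool.h>

enum prep_site {
	PREP_IN_SUBMIT_BIO,	/* current upstream placement */
	PREP_IN_SPLIT_PATH,	/* from bio_submit_split(), after splitting */
};

/* For a bio-based driver, pick where blk_crypto_bio_prep() should run
 * based on whether the driver advertises that it splits every bio to
 * the queue limits. */
static enum prep_site crypto_prep_site(bool bd_splits_bio_to_limits)
{
	return bd_splits_bio_to_limits ? PREP_IN_SPLIT_PATH
				       : PREP_IN_SUBMIT_BIO;
}
```

Drivers that do not advertise the flag keep the existing early prep, so the fallback still runs for every encrypted bio either way.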
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order 2025-06-11 19:43 ` Bart Van Assche @ 2025-06-18 22:27 ` Bart Van Assche 0 siblings, 0 replies; 44+ messages in thread From: Bart Van Assche @ 2025-06-18 22:27 UTC (permalink / raw) To: Eric Biggers Cc: Christoph Hellwig, Keith Busch, Damien Le Moal, Jens Axboe, linux-block, Yu Kuai, Ming Lei, Shin'ichiro Kawasaki On 6/11/25 12:43 PM, Bart Van Assche wrote: > On 6/11/25 11:15 AM, Eric Biggers wrote: >> Well, again it needs to work on any block device. If the encryption >> might just >> not be done and plaintext ends up on-disk, then blk-crypto-fallback >> would be >> unsafe to use. >> >> It would be preferable to have blk-crypto-fallback continue to be >> handled in the >> block layer so that drivers don't need to worry about it. > > This concern could be addressed by introducing a new flag in struct > block_device_operations or struct queue_limits - a flag that indicates > that bio_split_to_limits() will be called by the block driver. If that > flag is set, blk_crypto_bio_prep() can be called from > bio_submit_split(). If that flag is not set, blk_crypto_bio_prep() > should be called from __submit_bio(). The latter behavior is the > current behavior for the upstream kernel. (replying to my own email) The patch below seems to work but needs further review and testing: block: Rework splitting of encrypted bios Modify splitting of encrypted bios as follows: - Introduce a new block device flag BD_SPLITS_BIO_TO_LIMITS that allows bio-based drivers to report that they call bio_split_to_limits() for every bio. - For request-based block drivers and for bio-based block drivers that call bio_split_to_limits(), call blk_crypto_bio_prep() after bio splitting happens instead of before. - For bio-based block drivers for which it is not known whether they call bio_split_to_limits(), call blk_crypto_bio_prep() before .submit_bio() is called. 
- In blk_crypto_fallback_encrypt_bio(), prevent infinite recursion by only trying to split a bio if this function is not called from inside bio_split_to_limits(). - Since blk_crypto_fallback_encrypt_bio() may clear *bio_ptr, in its caller (__blk_crypto_bio_prep()), check whether or not this pointer is NULL. - In bio_split_rw(), restrict the bio size to the smaller size of what is supported to the block driver and the crypto fallback code. The advantages of these changes are as follows: - This patch fixes write errors on zoned storage caused by out-of-order submission of bios. This out-of-order submission happens if both the crypto fallback code and bio_split_to_limits() split a bio. - Less code duplication. The crypto fallback code now calls bio_split_to_limits() instead of open-coding it. diff --git a/block/bio.c b/block/bio.c index 3c0a558c90f5..d597cef6d228 100644 --- a/block/bio.c +++ b/block/bio.c @@ -1682,6 +1682,11 @@ struct bio *bio_split(struct bio *bio, int sectors, if (!split) return ERR_PTR(-ENOMEM); + /* + * Tell blk_crypto_fallback_encrypt_bio() not to split this bio further. + */ + bio_set_flag(split, BIO_BEING_SPLIT); + split->bi_iter.bi_size = sectors << 9; if (bio_integrity(split)) diff --git a/block/blk-core.c b/block/blk-core.c index fdac48aec5ef..78b555ceea77 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -626,16 +626,22 @@ static void __submit_bio(struct bio *bio) /* If plug is not used, add new plug here to cache nsecs time. */ struct blk_plug plug; - if (unlikely(!blk_crypto_bio_prep(&bio))) - return; - blk_start_plug(&plug); if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) { blk_mq_submit_bio(bio); } else if (likely(bio_queue_enter(bio) == 0)) { struct gendisk *disk = bio->bi_bdev->bd_disk; - + + /* + * Only call blk_crypto_bio_prep() before .submit_bio() if + * the block driver won't call bio_split_to_limits(). 
+ */ + if (unlikely(!bdev_test_flag(bio->bi_bdev, + BD_SPLITS_BIO_TO_LIMITS) && + !blk_crypto_bio_prep(&bio))) + goto exit_queue; + if ((bio->bi_opf & REQ_POLLED) && !(disk->queue->limits.features & BLK_FEAT_POLL)) { bio->bi_status = BLK_STS_NOTSUPP; @@ -643,6 +649,8 @@ static void __submit_bio(struct bio *bio) } else { disk->fops->submit_bio(bio); } + +exit_queue: blk_queue_exit(disk->queue); } diff --git a/block/blk-crypto-fallback.c b/block/blk-crypto-fallback.c index 005c9157ffb3..f2012368ac9e 100644 --- a/block/blk-crypto-fallback.c +++ b/block/blk-crypto-fallback.c @@ -209,9 +209,12 @@ blk_crypto_fallback_alloc_cipher_req(struct blk_crypto_keyslot *slot, return true; } -static bool blk_crypto_fallback_split_bio_if_needed(struct bio **bio_ptr) +/* + * The encryption fallback code allocates bounce pages individually. Hence this + * function that calculates an upper limit for the bio size. + */ +unsigned int blk_crypto_max_io_size(struct bio *bio) { - struct bio *bio = *bio_ptr; unsigned int i = 0; unsigned int num_sectors = 0; struct bio_vec bv; @@ -222,21 +225,8 @@ static bool blk_crypto_fallback_split_bio_if_needed(struct bio **bio_ptr) if (++i == BIO_MAX_VECS) break; } - if (num_sectors < bio_sectors(bio)) { - struct bio *split_bio; - - split_bio = bio_split(bio, num_sectors, GFP_NOIO, - &crypto_bio_split); - if (IS_ERR(split_bio)) { - bio->bi_status = BLK_STS_RESOURCE; - return false; - } - bio_chain(split_bio, bio); - submit_bio_noacct(bio); - *bio_ptr = split_bio; - } - return true; + return num_sectors; } union blk_crypto_iv { @@ -257,8 +247,10 @@ static void blk_crypto_dun_to_iv(const u64 dun[BLK_CRYPTO_DUN_ARRAY_SIZE], * The crypto API fallback's encryption routine. * Allocate a bounce bio for encryption, encrypt the input bio using crypto API, * and replace *bio_ptr with the bounce bio. May split input bio if it's too - * large. Returns true on success. Returns false and sets bio->bi_status on - * error. + * large. Returns %true on success. 
On error, %false is returned and one of + * these two actions is taken: + * - Either @bio_ptr->bi_status is set and *@bio_ptr is not modified. + * - Or bio_endio() is called and *@bio_ptr is changed into %NULL. */ static bool blk_crypto_fallback_encrypt_bio(struct bio **bio_ptr) { @@ -275,11 +267,17 @@ static bool blk_crypto_fallback_encrypt_bio(struct bio **bio_ptr) bool ret = false; blk_status_t blk_st; - /* Split the bio if it's too big for single page bvec */ - if (!blk_crypto_fallback_split_bio_if_needed(bio_ptr)) - return false; + if (!bio_flagged(*bio_ptr, BIO_BEING_SPLIT)) { + /* Split the bio if it's too big for single page bvec */ + src_bio = bio_split_to_limits(*bio_ptr); + if (!src_bio) { + *bio_ptr = NULL; + return false; + } + } else { + src_bio = *bio_ptr; + } - src_bio = *bio_ptr; bc = src_bio->bi_crypt_context; data_unit_size = bc->bc_key->crypto_cfg.data_unit_size; @@ -475,9 +473,8 @@ static void blk_crypto_fallback_decrypt_endio(struct bio *bio) * @bio_ptr: pointer to the bio to prepare * * If bio is doing a WRITE operation, this splits the bio into two parts if it's - * too big (see blk_crypto_fallback_split_bio_if_needed()). It then allocates a - * bounce bio for the first part, encrypts it, and updates bio_ptr to point to - * the bounce bio. + * too big. It then allocates a bounce bio for the first part, encrypts it, and + * updates bio_ptr to point to the bounce bio. * * For a READ operation, we mark the bio for decryption by using bi_private and * bi_end_io. @@ -495,6 +492,12 @@ bool blk_crypto_fallback_bio_prep(struct bio **bio_ptr) struct bio_crypt_ctx *bc = bio->bi_crypt_context; struct bio_fallback_crypt_ctx *f_ctx; + /* + * Check whether blk_crypto_fallback_bio_prep() has already been called. 
+ */ + if (bio->bi_end_io == blk_crypto_fallback_encrypt_endio) + return true; + if (WARN_ON_ONCE(!tfms_inited[bc->bc_key->crypto_cfg.crypto_mode])) { /* User didn't call blk_crypto_start_using_key() first */ bio->bi_status = BLK_STS_IOERR; diff --git a/block/blk-crypto-internal.h b/block/blk-crypto-internal.h index ccf6dff6ff6b..443ba1fd82e6 100644 --- a/block/blk-crypto-internal.h +++ b/block/blk-crypto-internal.h @@ -223,6 +223,8 @@ bool blk_crypto_fallback_bio_prep(struct bio **bio_ptr); int blk_crypto_fallback_evict_key(const struct blk_crypto_key *key); +unsigned int blk_crypto_max_io_size(struct bio *bio); + #else /* CONFIG_BLK_INLINE_ENCRYPTION_FALLBACK */ static inline int @@ -245,6 +247,11 @@ blk_crypto_fallback_evict_key(const struct blk_crypto_key *key) return 0; } +static inline unsigned int blk_crypto_max_io_size(struct bio *bio) +{ + return UINT_MAX; +} + #endif /* CONFIG_BLK_INLINE_ENCRYPTION_FALLBACK */ #endif /* __LINUX_BLK_CRYPTO_INTERNAL_H */ diff --git a/block/blk-crypto.c b/block/blk-crypto.c index 4b1ad84d1b5a..76278e23193d 100644 --- a/block/blk-crypto.c +++ b/block/blk-crypto.c @@ -306,7 +306,8 @@ bool __blk_crypto_bio_prep(struct bio **bio_ptr) if (blk_crypto_fallback_bio_prep(bio_ptr)) return true; fail: - bio_endio(*bio_ptr); + if (*bio_ptr) + bio_endio(*bio_ptr); return false; } diff --git a/block/blk-merge.c b/block/blk-merge.c index e55a8ec219c9..df65231be543 100644 --- a/block/blk-merge.c +++ b/block/blk-merge.c @@ -9,6 +9,7 @@ #include <linux/blk-integrity.h> #include <linux/part_stat.h> #include <linux/blk-cgroup.h> +#include <linux/blk-crypto.h> #include <trace/events/block.h> @@ -125,9 +126,14 @@ static struct bio *bio_submit_split(struct bio *bio, int split_sectors) bio->bi_iter.bi_sector); WARN_ON_ONCE(bio_zone_write_plugging(bio)); submit_bio_noacct(bio); - return split; + + bio = split; } + WARN_ON_ONCE(!bio); + if (unlikely(!blk_crypto_bio_prep(&bio))) + return NULL; + return bio; error: bio->bi_status = 
errno_to_blk_status(split_sectors); @@ -356,9 +362,12 @@ EXPORT_SYMBOL_GPL(bio_split_rw_at); struct bio *bio_split_rw(struct bio *bio, const struct queue_limits *lim, unsigned *nr_segs) { + u32 max_sectors = + min(get_max_io_size(bio, lim), blk_crypto_max_io_size(bio)); + return bio_submit_split(bio, bio_split_rw_at(bio, lim, nr_segs, - get_max_io_size(bio, lim) << SECTOR_SHIFT)); + (u64)max_sectors << SECTOR_SHIFT)); } /* diff --git a/block/blk.h b/block/blk.h index 37ec459fe656..8db804f32896 100644 --- a/block/blk.h +++ b/block/blk.h @@ -367,6 +367,11 @@ static inline bool bio_may_need_split(struct bio *bio, lim->min_segment_size; } +static void bio_clear_split_flag(struct bio **bio) +{ + bio_clear_flag(*bio, BIO_BEING_SPLIT); +} + /** * __bio_split_to_limits - split a bio to fit the queue limits * @bio: bio to be split @@ -383,6 +388,10 @@ static inline bool bio_may_need_split(struct bio *bio, static inline struct bio *__bio_split_to_limits(struct bio *bio, const struct queue_limits *lim, unsigned int *nr_segs) { + struct bio *clear_split_flag __cleanup(bio_clear_split_flag) = bio; + + bio_set_flag(bio, BIO_BEING_SPLIT); + switch (bio_op(bio)) { case REQ_OP_READ: case REQ_OP_WRITE: diff --git a/drivers/md/dm.c b/drivers/md/dm.c index 5facb06e5924..c79195de2669 100644 --- a/drivers/md/dm.c +++ b/drivers/md/dm.c @@ -2316,6 +2316,8 @@ static struct mapped_device *alloc_dev(int minor) md->disk->private_data = md; sprintf(md->disk->disk_name, "dm-%d", minor); + bdev_set_flag(md->disk->part0, BD_SPLITS_BIO_TO_LIMITS); + dax_dev = alloc_dax(md, &dm_dax_ops); if (IS_ERR(dax_dev)) { if (PTR_ERR(dax_dev) != -EOPNOTSUPP) diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h index 3d1577f07c1c..25b789830b75 100644 --- a/include/linux/blk_types.h +++ b/include/linux/blk_types.h @@ -54,6 +54,8 @@ struct block_device { #ifdef CONFIG_FAIL_MAKE_REQUEST #define BD_MAKE_IT_FAIL (1u<<12) #endif + /* Set this flag if the driver calls bio_split_to_limits(). 
*/ +#define BD_SPLITS_BIO_TO_LIMITS (1u<<13) dev_t bd_dev; struct address_space *bd_mapping; /* page cache */ @@ -308,6 +310,7 @@ enum { BIO_REMAPPED, BIO_ZONE_WRITE_PLUGGING, /* bio handled through zone write plugging */ BIO_EMULATES_ZONE_APPEND, /* bio emulates a zone append operation */ + BIO_BEING_SPLIT, BIO_FLAG_LAST }; ^ permalink raw reply related [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order 2025-05-21 21:18 ` Bart Van Assche 2025-05-22 5:12 ` Damien Le Moal @ 2025-05-23 4:21 ` Christoph Hellwig 1 sibling, 0 replies; 44+ messages in thread From: Christoph Hellwig @ 2025-05-23 4:21 UTC (permalink / raw) To: Bart Van Assche Cc: Christoph Hellwig, Jens Axboe, linux-block, Damien Le Moal, Yu Kuai, Ming Lei, stable On Wed, May 21, 2025 at 02:18:12PM -0700, Bart Van Assche wrote: > The following pull request includes a test that triggers the deadlock > fixed by patch 2/2 reliably: > > https://github.com/osandov/blktests/pull/171 Can you post the patch please? I've not found any obvious way to download a patch from the above website. ^ permalink raw reply [flat|nested] 44+ messages in thread
* [PATCH 2/2] block: Fix a deadlock related freezing zoned storage devices 2025-05-14 20:29 [PATCH 0/2] Two bug fixes for zoned block devices Bart Van Assche 2025-05-14 20:29 ` [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order Bart Van Assche @ 2025-05-14 20:29 ` Bart Van Assche 2025-05-16 4:51 ` Christoph Hellwig 1 sibling, 1 reply; 44+ messages in thread From: Bart Van Assche @ 2025-05-14 20:29 UTC (permalink / raw) To: Jens Axboe Cc: linux-block, Christoph Hellwig, Bart Van Assche, Damien Le Moal, Yu Kuai, Ming Lei, stable blk_mq_freeze_queue() never terminates if one or more bios are on the plug list and if the block device driver defines a .submit_bio() method. This is the case for device mapper drivers. The deadlock happens because blk_mq_freeze_queue() waits for q_usage_counter to drop to zero, because a queue reference is held by bios on the plug list and because the __bio_queue_enter() call in __submit_bio() waits for the queue to be unfrozen. This patch fixes the following deadlock: Workqueue: dm-51_zwplugs blk_zone_wplug_bio_work Call trace: __schedule+0xb08/0x1160 schedule+0x48/0xc8 __bio_queue_enter+0xcc/0x1d0 __submit_bio+0x100/0x1b0 submit_bio_noacct_nocheck+0x230/0x49c blk_zone_wplug_bio_work+0x168/0x250 process_one_work+0x26c/0x65c worker_thread+0x33c/0x498 kthread+0x110/0x134 ret_from_fork+0x10/0x20 Call trace: __switch_to+0x230/0x410 __schedule+0xb08/0x1160 schedule+0x48/0xc8 blk_mq_freeze_queue_wait+0x78/0xb8 blk_mq_freeze_queue+0x90/0xa4 queue_attr_store+0x7c/0xf0 sysfs_kf_write+0x98/0xc8 kernfs_fop_write_iter+0x12c/0x1d4 vfs_write+0x340/0x3ac ksys_write+0x78/0xe8 Cc: Christoph Hellwig <hch@lst.de> Cc: Damien Le Moal <dlemoal@kernel.org> Cc: Yu Kuai <yukuai1@huaweicloud.com> Cc: Ming Lei <ming.lei@redhat.com> Cc: stable@vger.kernel.org Fixes: dd291d77cc90 ("block: Introduce zone write plugging") Signed-off-by: Bart Van Assche <bvanassche@acm.org> --- block/blk-core.c | 13 +++++++++++-- 1 file changed, 11 
insertions(+), 2 deletions(-) diff --git a/block/blk-core.c b/block/blk-core.c index 4b728fa1c138..e961896a8717 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -621,6 +621,13 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q, return BLK_STS_OK; } +/* + * Do not call bio_queue_enter() if the BIO_ZONE_WRITE_PLUGGING flag has been + * set because this causes blk_mq_freeze_queue() to deadlock if + * blk_zone_wplug_bio_work() submits a bio. Calling bio_queue_enter() for bios + * on the plug list is not necessary since a q_usage_counter reference is held + * while a bio is on the plug list. + */ static void __submit_bio(struct bio *bio) { /* If plug is not used, add new plug here to cache nsecs time. */ @@ -633,7 +640,8 @@ static void __submit_bio(struct bio *bio) if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) { blk_mq_submit_bio(bio); - } else if (likely(bio_queue_enter(bio) == 0)) { + } else if (likely(bio_zone_write_plugging(bio) || + bio_queue_enter(bio) == 0)) { struct gendisk *disk = bio->bi_bdev->bd_disk; if ((bio->bi_opf & REQ_POLLED) && @@ -643,7 +651,8 @@ static void __submit_bio(struct bio *bio) } else { disk->fops->submit_bio(bio); } - blk_queue_exit(disk->queue); + if (!bio_zone_write_plugging(bio)) + blk_queue_exit(disk->queue); } blk_finish_plug(&plug); ^ permalink raw reply related [flat|nested] 44+ messages in thread
* Re: [PATCH 2/2] block: Fix a deadlock related freezing zoned storage devices 2025-05-14 20:29 ` [PATCH 2/2] block: Fix a deadlock related freezing zoned storage devices Bart Van Assche @ 2025-05-16 4:51 ` Christoph Hellwig 2025-05-19 22:22 ` Bart Van Assche 0 siblings, 1 reply; 44+ messages in thread From: Christoph Hellwig @ 2025-05-16 4:51 UTC (permalink / raw) To: Bart Van Assche Cc: Jens Axboe, linux-block, Christoph Hellwig, Damien Le Moal, Yu Kuai, Ming Lei, stable On Wed, May 14, 2025 at 01:29:37PM -0700, Bart Van Assche wrote: > +/* > + * Do not call bio_queue_enter() if the BIO_ZONE_WRITE_PLUGGING flag has been > + * set because this causes blk_mq_freeze_queue() to deadlock if > + * blk_zone_wplug_bio_work() submits a bio. Calling bio_queue_enter() for bios > + * on the plug list is not necessary since a q_usage_counter reference is held > + * while a bio is on the plug list. > + */ How do we end up with BIO_ZONE_WRITE_PLUGGING set here? If that flag was set in a stacking driver we need to clear it before resubmitting the bio I think. Can you provide a null_blk based reproducer for your testcase to see what happens here? Either way we can't just simply skip taking q_usage_count. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 2/2] block: Fix a deadlock related freezing zoned storage devices 2025-05-16 4:51 ` Christoph Hellwig @ 2025-05-19 22:22 ` Bart Van Assche 2025-05-20 13:57 ` Christoph Hellwig 0 siblings, 1 reply; 44+ messages in thread From: Bart Van Assche @ 2025-05-19 22:22 UTC (permalink / raw) To: Christoph Hellwig Cc: Jens Axboe, linux-block, Damien Le Moal, Yu Kuai, Ming Lei, stable On 5/15/25 9:51 PM, Christoph Hellwig wrote: > On Wed, May 14, 2025 at 01:29:37PM -0700, Bart Van Assche wrote: >> +/* >> + * Do not call bio_queue_enter() if the BIO_ZONE_WRITE_PLUGGING flag has been >> + * set because this causes blk_mq_freeze_queue() to deadlock if >> + * blk_zone_wplug_bio_work() submits a bio. Calling bio_queue_enter() for bios >> + * on the plug list is not necessary since a q_usage_counter reference is held >> + * while a bio is on the plug list. >> + */ > > How do we end up with BIO_ZONE_WRITE_PLUGGING set here? If that flag > was set in a stacking driver we need to clear it before resubmitting > the bio I think. Hmm ... my understanding is that BIO_ZONE_WRITE_PLUGGING, if set, must remain set until the bio has completed. Here is an example of code in the bio completion path that tests this flag: static inline void blk_zone_bio_endio(struct bio *bio) { /* * For write BIOs to zoned devices, signal the completion of the BIO so * that the next write BIO can be submitted by zone write plugging. */ if (bio_zone_write_plugging(bio)) blk_zone_write_plug_bio_endio(bio); } The bio_zone_write_plugging() definition is as follows: static inline bool bio_zone_write_plugging(struct bio *bio) { return bio_flagged(bio, BIO_ZONE_WRITE_PLUGGING); } > Can you provide a null_blk based reproducer for your testcase to see > what happens here? My attempts so far to build a reproducer for the blktests framework have been unsuccessful.
This test script runs fine in the VM that I use for kernel testing: https://github.com/bvanassche/blktests/blob/master/tests/zbd/013 > Either way we can't just simply skip taking q_usage_count. Why not? If BIO_ZONE_WRITE_PLUGGING is set, that guarantees that a queue reference count is held and will remain held across the entire disk->fops->submit_bio() invocation, isn't it? From blk_zone_wplug_bio_work(), below the submit_bio_noacct_nocheck(bio) call: if (bdev_test_flag(bdev, BD_HAS_SUBMIT_BIO)) blk_queue_exit(bdev->bd_disk->queue); Thanks, Bart. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 2/2] block: Fix a deadlock related freezing zoned storage devices 2025-05-19 22:22 ` Bart Van Assche @ 2025-05-20 13:57 ` Christoph Hellwig 0 siblings, 0 replies; 44+ messages in thread From: Christoph Hellwig @ 2025-05-20 13:57 UTC (permalink / raw) To: Bart Van Assche Cc: Christoph Hellwig, Jens Axboe, linux-block, Damien Le Moal, Yu Kuai, Ming Lei, stable On Mon, May 19, 2025 at 03:22:42PM -0700, Bart Van Assche wrote: > Hmm ... my understanding is that BIO_ZONE_WRITE_PLUGGING, if set, must > remain set until the bio has completed. Here is an example of code in > the bio completion path that tests this flag: True. Well, we'll need some other flag that to tell lower levels to ignore the flag because it is owned by the higher level. ^ permalink raw reply [flat|nested] 44+ messages in thread
end of thread, other threads:[~2025-06-18 22:28 UTC | newest] Thread overview: 44+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-05-14 20:29 [PATCH 0/2] Two bug fixes for zoned block devices Bart Van Assche 2025-05-14 20:29 ` [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order Bart Van Assche 2025-05-15 7:19 ` Niklas Cassel 2025-05-15 15:58 ` Bart Van Assche 2025-05-16 4:47 ` Christoph Hellwig 2025-05-19 22:12 ` Bart Van Assche 2025-05-20 13:56 ` Christoph Hellwig 2025-05-20 18:09 ` Bart Van Assche 2025-05-21 5:53 ` Christoph Hellwig 2025-05-21 21:18 ` Bart Van Assche 2025-05-22 5:12 ` Damien Le Moal 2025-05-22 17:08 ` Bart Van Assche 2025-05-23 6:02 ` Damien Le Moal 2025-05-23 16:30 ` Bart Van Assche 2025-05-24 8:48 ` Damien Le Moal 2025-05-24 14:05 ` Bart Van Assche 2025-05-24 15:36 ` Damien Le Moal 2025-05-26 5:24 ` Christoph Hellwig 2025-05-27 16:19 ` Bart Van Assche 2025-05-31 0:25 ` Bart Van Assche 2025-06-08 22:07 ` Bart Van Assche 2025-06-08 22:47 ` Damien Le Moal 2025-06-09 3:58 ` Christoph Hellwig 2025-06-09 20:48 ` Bart Van Assche 2025-06-10 5:04 ` Christoph Hellwig 2025-06-09 3:55 ` Christoph Hellwig 2025-06-10 17:23 ` Bart Van Assche 2025-06-10 23:18 ` Keith Busch 2025-06-11 0:46 ` Damien Le Moal 2025-06-11 1:00 ` Keith Busch 2025-06-11 1:02 ` Damien Le Moal 2025-06-11 1:08 ` Keith Busch 2025-06-11 1:34 ` Keith Busch 2025-06-11 3:40 ` Christoph Hellwig 2025-06-11 4:21 ` Eric Biggers 2025-06-11 16:15 ` Bart Van Assche 2025-06-11 18:15 ` Eric Biggers 2025-06-11 19:43 ` Bart Van Assche 2025-06-18 22:27 ` Bart Van Assche 2025-05-23 4:21 ` Christoph Hellwig 2025-05-14 20:29 ` [PATCH 2/2] block: Fix a deadlock related freezing zoned storage devices Bart Van Assche 2025-05-16 4:51 ` Christoph Hellwig 2025-05-19 22:22 ` Bart Van Assche 2025-05-20 13:57 ` Christoph Hellwig