* [PATCH 0/2] Two bug fixes for zoned block devices
@ 2025-05-14 20:29 Bart Van Assche
2025-05-14 20:29 ` [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order Bart Van Assche
2025-05-14 20:29 ` [PATCH 2/2] block: Fix a deadlock related freezing zoned storage devices Bart Van Assche
0 siblings, 2 replies; 44+ messages in thread
From: Bart Van Assche @ 2025-05-14 20:29 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-block, Christoph Hellwig, Bart Van Assche
Hi Jens,
The two patches in this series fix issues that I ran into when stacking a
dm driver on top of a zoned storage device. Please consider these two bug
fixes for the next merge window.
Thank you,
Bart.
Bart Van Assche (2):
block: Make __submit_bio_noacct() preserve the bio submission order
block: Fix a deadlock related freezing zoned storage devices
block/blk-core.c | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)
^ permalink raw reply [flat|nested] 44+ messages in thread
* [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order
2025-05-14 20:29 [PATCH 0/2] Two bug fixes for zoned block devices Bart Van Assche
@ 2025-05-14 20:29 ` Bart Van Assche
2025-05-15 7:19 ` Niklas Cassel
2025-05-16 4:47 ` Christoph Hellwig
2025-05-14 20:29 ` [PATCH 2/2] block: Fix a deadlock related freezing zoned storage devices Bart Van Assche
1 sibling, 2 replies; 44+ messages in thread
From: Bart Van Assche @ 2025-05-14 20:29 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Bart Van Assche, Damien Le Moal,
Yu Kuai, Ming Lei, stable
submit_bio() may be called recursively. To limit the stack depth, recursive
calls result in bios being added to a list (current->bio_list).
__submit_bio_noacct() sets up that list and maintains two bio lists:
* bio_list_on_stack[0] is the list with bios submitted by recursive
submit_bio() calls from inside the latest __submit_bio() call.
* bio_list_on_stack[1] is the list with bios submitted by recursive
submit_bio() calls from inside previous __submit_bio() calls.
Make sure that bios are submitted to lower devices in the order in which
they were submitted by submit_bio() by adding new bios at the end of the
list instead of at the front.
This patch fixes unaligned write errors that I encountered with F2FS
submitting zoned writes to a dm driver stacked on top of a zoned UFS
device.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Yu Kuai <yukuai1@huaweicloud.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-core.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index b862c66018f2..4b728fa1c138 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -704,9 +704,9 @@ static void __submit_bio_noacct(struct bio *bio)
/*
* Now assemble so we handle the lowest level first.
*/
+ bio_list_on_stack[0] = bio_list_on_stack[1];
bio_list_merge(&bio_list_on_stack[0], &lower);
bio_list_merge(&bio_list_on_stack[0], &same);
- bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]);
} while ((bio = bio_list_pop(&bio_list_on_stack[0])));
current->bio_list = NULL;
* [PATCH 2/2] block: Fix a deadlock related freezing zoned storage devices
2025-05-14 20:29 [PATCH 0/2] Two bug fixes for zoned block devices Bart Van Assche
2025-05-14 20:29 ` [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order Bart Van Assche
@ 2025-05-14 20:29 ` Bart Van Assche
2025-05-16 4:51 ` Christoph Hellwig
1 sibling, 1 reply; 44+ messages in thread
From: Bart Van Assche @ 2025-05-14 20:29 UTC (permalink / raw)
To: Jens Axboe
Cc: linux-block, Christoph Hellwig, Bart Van Assche, Damien Le Moal,
Yu Kuai, Ming Lei, stable
blk_mq_freeze_queue() never terminates if one or more bios are on the plug
list and if the block device driver defines a .submit_bio() method.
This is the case for device mapper drivers. The deadlock happens because
blk_mq_freeze_queue() waits for q_usage_counter to drop to zero, while
a queue reference is held by every bio on the plug list and the
__bio_queue_enter() call in __submit_bio() waits for the queue to be
unfrozen.
This patch fixes the following deadlock:
Workqueue: dm-51_zwplugs blk_zone_wplug_bio_work
Call trace:
__schedule+0xb08/0x1160
schedule+0x48/0xc8
__bio_queue_enter+0xcc/0x1d0
__submit_bio+0x100/0x1b0
submit_bio_noacct_nocheck+0x230/0x49c
blk_zone_wplug_bio_work+0x168/0x250
process_one_work+0x26c/0x65c
worker_thread+0x33c/0x498
kthread+0x110/0x134
ret_from_fork+0x10/0x20
Call trace:
__switch_to+0x230/0x410
__schedule+0xb08/0x1160
schedule+0x48/0xc8
blk_mq_freeze_queue_wait+0x78/0xb8
blk_mq_freeze_queue+0x90/0xa4
queue_attr_store+0x7c/0xf0
sysfs_kf_write+0x98/0xc8
kernfs_fop_write_iter+0x12c/0x1d4
vfs_write+0x340/0x3ac
ksys_write+0x78/0xe8
Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Yu Kuai <yukuai1@huaweicloud.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: stable@vger.kernel.org
Fixes: dd291d77cc90 ("block: Introduce zone write plugging")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/blk-core.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 4b728fa1c138..e961896a8717 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -621,6 +621,13 @@ static inline blk_status_t blk_check_zone_append(struct request_queue *q,
return BLK_STS_OK;
}
+/*
+ * Do not call bio_queue_enter() if the BIO_ZONE_WRITE_PLUGGING flag has been
+ * set because this causes blk_mq_freeze_queue() to deadlock if
+ * blk_zone_wplug_bio_work() submits a bio. Calling bio_queue_enter() for bios
+ * on the plug list is not necessary since a q_usage_counter reference is held
+ * while a bio is on the plug list.
+ */
static void __submit_bio(struct bio *bio)
{
/* If plug is not used, add new plug here to cache nsecs time. */
@@ -633,7 +640,8 @@ static void __submit_bio(struct bio *bio)
if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) {
blk_mq_submit_bio(bio);
- } else if (likely(bio_queue_enter(bio) == 0)) {
+ } else if (likely(bio_zone_write_plugging(bio) ||
+ bio_queue_enter(bio) == 0)) {
struct gendisk *disk = bio->bi_bdev->bd_disk;
if ((bio->bi_opf & REQ_POLLED) &&
@@ -643,7 +651,8 @@ static void __submit_bio(struct bio *bio)
} else {
disk->fops->submit_bio(bio);
}
- blk_queue_exit(disk->queue);
+ if (!bio_zone_write_plugging(bio))
+ blk_queue_exit(disk->queue);
}
blk_finish_plug(&plug);
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order
2025-05-14 20:29 ` [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order Bart Van Assche
@ 2025-05-15 7:19 ` Niklas Cassel
2025-05-15 15:58 ` Bart Van Assche
2025-05-16 4:47 ` Christoph Hellwig
1 sibling, 1 reply; 44+ messages in thread
From: Niklas Cassel @ 2025-05-15 7:19 UTC (permalink / raw)
To: Bart Van Assche
Cc: Jens Axboe, linux-block, Christoph Hellwig, Damien Le Moal,
Yu Kuai, Ming Lei, stable
Hello Bart,
On Wed, May 14, 2025 at 01:29:36PM -0700, Bart Van Assche wrote:
> submit_bio() may be called recursively. To limit the stack depth, recursive
> calls result in bios being added to a list (current->bio_list).
> __submit_bio_noacct() sets up that list and maintains two lists with
> requests:
> * bio_list_on_stack[0] is the list with bios submitted by recursive
> submit_bio() calls from inside the latest __submit_bio() call.
> * bio_list_on_stack[1] is the list with bios submitted by recursive
> submit_bio() calls from inside previous __submit_bio() calls.
>
> Make sure that bios are submitted to lower devices in the order these
> have been submitted by submit_bio() by adding new bios at the end of the
> list instead of at the front.
>
> This patch fixes unaligned write errors that I encountered with F2FS
> submitting zoned writes to a dm driver stacked on top of a zoned UFS
> device.
>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Damien Le Moal <dlemoal@kernel.org>
> Cc: Yu Kuai <yukuai1@huaweicloud.com>
> Cc: Ming Lei <ming.lei@redhat.com>
> Cc: stable@vger.kernel.org
Here you add stable to Cc, but you don't specify either
1) a minimum version e.g.
stable@vger.kernel.org # v6.8+
or
2) a Fixes tag.
Kind regards,
Niklas
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order
2025-05-15 7:19 ` Niklas Cassel
@ 2025-05-15 15:58 ` Bart Van Assche
0 siblings, 0 replies; 44+ messages in thread
From: Bart Van Assche @ 2025-05-15 15:58 UTC (permalink / raw)
To: Niklas Cassel, NeilBrown
Cc: Jens Axboe, linux-block, Christoph Hellwig, Damien Le Moal,
Yu Kuai, Ming Lei, stable
On 5/15/25 12:19 AM, Niklas Cassel wrote:
> Hello Bart,
>
> On Wed, May 14, 2025 at 01:29:36PM -0700, Bart Van Assche wrote:
>> submit_bio() may be called recursively. To limit the stack depth, recursive
>> calls result in bios being added to a list (current->bio_list).
>> __submit_bio_noacct() sets up that list and maintains two lists with
>> requests:
>> * bio_list_on_stack[0] is the list with bios submitted by recursive
>> submit_bio() calls from inside the latest __submit_bio() call.
>> * bio_list_on_stack[1] is the list with bios submitted by recursive
>> submit_bio() calls from inside previous __submit_bio() calls.
>>
>> Make sure that bios are submitted to lower devices in the order these
>> have been submitted by submit_bio() by adding new bios at the end of the
>> list instead of at the front.
>>
>> This patch fixes unaligned write errors that I encountered with F2FS
>> submitting zoned writes to a dm driver stacked on top of a zoned UFS
>> device.
>>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: Damien Le Moal <dlemoal@kernel.org>
>> Cc: Yu Kuai <yukuai1@huaweicloud.com>
>> Cc: Ming Lei <ming.lei@redhat.com>
>> Cc: stable@vger.kernel.org
>
> Here you add stable to Cc, but you don't specify either
> 1) a minimum version e.g.
> stable@vger.kernel.org # v6.8+
> or
> 2) a Fixes tag.
Hi Niklas,
Let's add the following to this patch:
Fixes: 79bd99596b73 ("blk: improve order of bio handling in generic_make_request()")
Neil, since that commit was authored by you: the commit message is
elaborate but the names of the drivers that needed that commit have
not been mentioned. Which drivers needed that change? Additionally,
can you please help with reviewing this patch:
https://lore.kernel.org/linux-block/20250514202937.2058598-2-bvanassche@acm.org/
Thanks,
Bart.
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order
2025-05-14 20:29 ` [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order Bart Van Assche
2025-05-15 7:19 ` Niklas Cassel
@ 2025-05-16 4:47 ` Christoph Hellwig
2025-05-19 22:12 ` Bart Van Assche
1 sibling, 1 reply; 44+ messages in thread
From: Christoph Hellwig @ 2025-05-16 4:47 UTC (permalink / raw)
To: Bart Van Assche
Cc: Jens Axboe, linux-block, Christoph Hellwig, Damien Le Moal,
Yu Kuai, Ming Lei, stable
On Wed, May 14, 2025 at 01:29:36PM -0700, Bart Van Assche wrote:
> /*
> * Now assemble so we handle the lowest level first.
> */
> + bio_list_on_stack[0] = bio_list_on_stack[1];
> bio_list_merge(&bio_list_on_stack[0], &lower);
> bio_list_merge(&bio_list_on_stack[0], &same);
> - bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]);
If I read this code correctly, this means that we now keep processing bios
that already were on bio_list_on_stack[0] at the beginning of the loop
in the next iteration(s) instead of finishing off the ones created by
this iteration, which could lead to exhaustion of resources like mempools.
Note that this is a big if - the code is really hard to read, it should
really grow a data structure for the on-stack list that has named members
for both lists instead of the array magic.. :(
I'm still trying to understand your problem given that it wasn't
described in much detail. What I can think of is that bio_split_to_limits(),
through bio_submit_split(), first re-submits the remainder bio using
submit_bio_noacct(), which the above should place on the same list, and then
later the stacking block drivers also submit the bio split off at the
beginning, unlike blk-mq drivers, which process it directly. But given
that this resubmission should be on the lower list above, I don't
see how it causes problems.
* Re: [PATCH 2/2] block: Fix a deadlock related freezing zoned storage devices
2025-05-14 20:29 ` [PATCH 2/2] block: Fix a deadlock related freezing zoned storage devices Bart Van Assche
@ 2025-05-16 4:51 ` Christoph Hellwig
2025-05-19 22:22 ` Bart Van Assche
0 siblings, 1 reply; 44+ messages in thread
From: Christoph Hellwig @ 2025-05-16 4:51 UTC (permalink / raw)
To: Bart Van Assche
Cc: Jens Axboe, linux-block, Christoph Hellwig, Damien Le Moal,
Yu Kuai, Ming Lei, stable
On Wed, May 14, 2025 at 01:29:37PM -0700, Bart Van Assche wrote:
> +/*
> + * Do not call bio_queue_enter() if the BIO_ZONE_WRITE_PLUGGING flag has been
> + * set because this causes blk_mq_freeze_queue() to deadlock if
> + * blk_zone_wplug_bio_work() submits a bio. Calling bio_queue_enter() for bios
> + * on the plug list is not necessary since a q_usage_counter reference is held
> + * while a bio is on the plug list.
> + */
How do we end up with BIO_ZONE_WRITE_PLUGGING set here? If that flag
was set in a stacking driver, we need to clear it before resubmitting
the bio, I think.
Can you provide a null_blk based reproducer for your testcase to see
what happens here?
Either way, we can't simply skip taking a q_usage_counter reference.
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order
2025-05-16 4:47 ` Christoph Hellwig
@ 2025-05-19 22:12 ` Bart Van Assche
2025-05-20 13:56 ` Christoph Hellwig
0 siblings, 1 reply; 44+ messages in thread
From: Bart Van Assche @ 2025-05-19 22:12 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, linux-block, Damien Le Moal, Yu Kuai, Ming Lei,
stable
On 5/15/25 9:47 PM, Christoph Hellwig wrote:
> On Wed, May 14, 2025 at 01:29:36PM -0700, Bart Van Assche wrote:
>> /*
>> * Now assemble so we handle the lowest level first.
>> */
>> + bio_list_on_stack[0] = bio_list_on_stack[1];
>> bio_list_merge(&bio_list_on_stack[0], &lower);
>> bio_list_merge(&bio_list_on_stack[0], &same);
>> - bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]);
>
> If I read this code correctly, this means that we now keep processing bios
> that already were on bio_list_on_stack[0] at the beginning of the loop
> in the next iteration(s) instead of finishing off the ones created by
> this iteration, which could lead to exhaustion of resources like mempools.
>
> Note that this is a big if - the code is really hard to read, it should
> really grow a data structure for the on-stack list that has named members
> for both lists instead of the array magic.. :(
>
> I'm still trying to understand your problem given that it wasn't
> described in much detail. What I can think of is that bio_split_to_limits(),
> through bio_submit_split(), first re-submits the remainder bio using
> submit_bio_noacct(), which the above should place on the same list, and then
> later the stacking block drivers also submit the bio split off at the
> beginning, unlike blk-mq drivers, which process it directly. But given
> that this resubmission should be on the lower list above, I don't
> see how it causes problems.
Agreed that this should be root-caused. To my own frustration I do not
yet have a full root-cause analysis. What I have done to obtain more
information is to make the kernel issue a warning the first time a bio
is added out-of-order at the end of the bio list. The following output
appeared (sde is the zoned block device at the bottom of the stack):
[ 71.312492][ T1] bio_list_insert_sorted: inserting in the middle of a bio list
[ 71.313483][ T1] print_bio_list(sde) sector 0x1b7520 size 0x10
[ 71.313034][ T1] bio_list_insert_sorted(sde) sector 0x1b7120 size 0x400
[ ... ]
[ 71.368117][ T163] WARNING: CPU: 4 PID: 163 at block/blk-core.c:725 bio_list_insert_sorted+0x144/0x18c
[ 71.386664][ T163] Workqueue: writeback wb_workfn (flush-253:49)
[ 71.387110][ T163] pstate: 61400005 (nZCv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
[ 71.393772][ T163] Call trace:
[ 71.393988][ T163] bio_list_insert_sorted+0x144/0x18c
[ 71.394338][ T163] submit_bio_noacct_nocheck+0xd8/0x4f4
[ 71.394696][ T163] submit_bio_noacct+0x32c/0x50c
[ 71.395017][ T163] bio_submit_split+0xf0/0x1f8
[ 71.395349][ T163] bio_split_rw+0xdc/0xf0
[ 71.395631][ T163] blk_mq_submit_bio+0x320/0x940
[ 71.395970][ T163] __submit_bio+0xa4/0x1c4
[ 71.396260][ T163] submit_bio_noacct_nocheck+0x1c0/0x4f4
[ 71.396623][ T163] submit_bio_noacct+0x32c/0x50c
[ 71.396942][ T163] submit_bio+0x17c/0x198
[ 71.397222][ T163] f2fs_submit_write_bio+0x94/0x154
[ 71.397604][ T163] __submit_merged_bio+0x80/0x204
[ 71.397933][ T163] __submit_merged_write_cond+0xd0/0x1fc
[ 71.398297][ T163] f2fs_submit_merged_write+0x24/0x30
[ 71.398646][ T163] f2fs_sync_node_pages+0x5ec/0x64c
[ 71.398999][ T163] f2fs_write_node_pages+0xe8/0x1dc
[ 71.399338][ T163] do_writepages+0xe4/0x2f8
[ 71.399673][ T163] __writeback_single_inode+0x84/0x6e4
[ 71.400036][ T163] writeback_sb_inodes+0x2cc/0x5c0
[ 71.400369][ T163] wb_writeback+0x134/0x550
[ 71.400662][ T163] wb_workfn+0x154/0x588
[ 71.400937][ T163] process_one_work+0x26c/0x65c
[ 71.401271][ T163] worker_thread+0x33c/0x498
[ 71.401575][ T163] kthread+0x110/0x134
[ 71.401844][ T163] ret_from_fork+0x10/0x20
I think that the above call stack indicates the following:
f2fs_submit_write_bio() submits a bio to a dm driver, the dm driver
submits a bio to the lower driver (SCSI core), the bio for the
lower driver is split by bio_split_rw(), and the second half of the
split bio triggers the above out-of-order warning.
This new patch should address the concerns brought up in your latest
email:
diff --git a/block/blk-core.c b/block/blk-core.c
index 411f005e6b1f..aa270588272a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -649,6 +649,26 @@ static void __submit_bio(struct bio *bio)
blk_finish_plug(&plug);
}
+/*
+ * Insert a bio in LBA order. If no bio for the same bdev with a higher LBA is
+ * found, append at the end.
+ */
+static void bio_list_insert_sorted(struct bio_list *bl, struct bio *bio)
+{
+ struct block_device *bdev = bio->bi_bdev;
+ struct bio **pprev = &bl->head, *next;
+ sector_t sector = bio->bi_iter.bi_sector;
+
+ for (next = *pprev; next; pprev = &next->bi_next, next = next->bi_next)
+ if (next->bi_bdev == bdev && sector < next->bi_iter.bi_sector)
+ break;
+
+ bio->bi_next = next;
+ *pprev = bio;
+ if (!next)
+ bl->tail = bio;
+}
+
/*
 * The loop in this function may be a bit non-obvious, and so deserves some
* explanation:
@@ -706,7 +726,8 @@ static void __submit_bio_noacct(struct bio *bio)
*/
bio_list_merge(&bio_list_on_stack[0], &lower);
bio_list_merge(&bio_list_on_stack[0], &same);
- bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]);
+ while ((bio = bio_list_pop(&bio_list_on_stack[1])))
+ bio_list_insert_sorted(&bio_list_on_stack[0], bio);
} while ((bio = bio_list_pop(&bio_list_on_stack[0])));
current->bio_list = NULL;
@@ -746,7 +767,7 @@ void submit_bio_noacct_nocheck(struct bio *bio)
* it is active, and then process them after it returned.
*/
if (current->bio_list)
- bio_list_add(&current->bio_list[0], bio);
+ bio_list_insert_sorted(&current->bio_list[0], bio);
else if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO))
__submit_bio_noacct_mq(bio);
else
* Re: [PATCH 2/2] block: Fix a deadlock related freezing zoned storage devices
2025-05-16 4:51 ` Christoph Hellwig
@ 2025-05-19 22:22 ` Bart Van Assche
2025-05-20 13:57 ` Christoph Hellwig
0 siblings, 1 reply; 44+ messages in thread
From: Bart Van Assche @ 2025-05-19 22:22 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, linux-block, Damien Le Moal, Yu Kuai, Ming Lei,
stable
On 5/15/25 9:51 PM, Christoph Hellwig wrote:
> On Wed, May 14, 2025 at 01:29:37PM -0700, Bart Van Assche wrote:
>> +/*
>> + * Do not call bio_queue_enter() if the BIO_ZONE_WRITE_PLUGGING flag has been
>> + * set because this causes blk_mq_freeze_queue() to deadlock if
>> + * blk_zone_wplug_bio_work() submits a bio. Calling bio_queue_enter() for bios
>> + * on the plug list is not necessary since a q_usage_counter reference is held
>> + * while a bio is on the plug list.
>> + */
>
> How do we end up with BIO_ZONE_WRITE_PLUGGING set here? If that flag
> was set in a stacking driver we need to clear it before resubmitting
> the bio I think.
Hmm ... my understanding is that BIO_ZONE_WRITE_PLUGGING, if set, must
remain set until the bio has completed. Here is an example of code in
the bio completion path that tests this flag:
static inline void blk_zone_bio_endio(struct bio *bio)
{
/*
* For write BIOs to zoned devices, signal the completion of the BIO so
* that the next write BIO can be submitted by zone write plugging.
*/
if (bio_zone_write_plugging(bio))
blk_zone_write_plug_bio_endio(bio);
}
The bio_zone_write_plugging() definition is as follows:
static inline bool bio_zone_write_plugging(struct bio *bio)
{
return bio_flagged(bio, BIO_ZONE_WRITE_PLUGGING);
}
> Can you provide a null_blk based reproducer for your testcase to see
> what happens here?
My attempts so far to build a reproducer for the blktests framework have
been unsuccessful. This test script runs fine in the VM that I use for
kernel testing:
https://github.com/bvanassche/blktests/blob/master/tests/zbd/013
> Either way, we can't simply skip taking a q_usage_counter reference.
Why not? If BIO_ZONE_WRITE_PLUGGING is set, that guarantees that a
queue reference is held and will remain held across the entire
disk->fops->submit_bio() invocation, doesn't it? From
blk_zone_wplug_bio_work(), below the submit_bio_noacct_nocheck(bio)
call:
if (bdev_test_flag(bdev, BD_HAS_SUBMIT_BIO))
blk_queue_exit(bdev->bd_disk->queue);
Thanks,
Bart.
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order
2025-05-19 22:12 ` Bart Van Assche
@ 2025-05-20 13:56 ` Christoph Hellwig
2025-05-20 18:09 ` Bart Van Assche
0 siblings, 1 reply; 44+ messages in thread
From: Christoph Hellwig @ 2025-05-20 13:56 UTC (permalink / raw)
To: Bart Van Assche
Cc: Christoph Hellwig, Jens Axboe, linux-block, Damien Le Moal,
Yu Kuai, Ming Lei, stable
On Mon, May 19, 2025 at 03:12:11PM -0700, Bart Van Assche wrote:
> This new patch should address the concerns brought up in your latest
> email:
No, we should never need to do a sort. As mentioned, we need to fix
how stackable drivers split the I/O, or maybe even get rid of
all the splits that aren't required.
* Re: [PATCH 2/2] block: Fix a deadlock related freezing zoned storage devices
2025-05-19 22:22 ` Bart Van Assche
@ 2025-05-20 13:57 ` Christoph Hellwig
0 siblings, 0 replies; 44+ messages in thread
From: Christoph Hellwig @ 2025-05-20 13:57 UTC (permalink / raw)
To: Bart Van Assche
Cc: Christoph Hellwig, Jens Axboe, linux-block, Damien Le Moal,
Yu Kuai, Ming Lei, stable
On Mon, May 19, 2025 at 03:22:42PM -0700, Bart Van Assche wrote:
> Hmm ... my understanding is that BIO_ZONE_WRITE_PLUGGING, if set, must
> remain set until the bio has completed. Here is an example of code in
> the bio completion path that tests this flag:
True. Well, we'll need some other flag to tell lower levels to
ignore BIO_ZONE_WRITE_PLUGGING because it is owned by the higher level.
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order
2025-05-20 13:56 ` Christoph Hellwig
@ 2025-05-20 18:09 ` Bart Van Assche
2025-05-21 5:53 ` Christoph Hellwig
0 siblings, 1 reply; 44+ messages in thread
From: Bart Van Assche @ 2025-05-20 18:09 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, linux-block, Damien Le Moal, Yu Kuai, Ming Lei,
stable
On 5/20/25 6:56 AM, Christoph Hellwig wrote:
> On Mon, May 19, 2025 at 03:12:11PM -0700, Bart Van Assche wrote:
>> This new patch should address the concerns brought up in your latest
>> email:
>
> No, we should never need to do a sort. As mentioned, we need to fix
> how stackable drivers split the I/O, or maybe even get rid of
> all the splits that aren't required.
If the sequential write bios are split by the device mapper, sorting
bios in the block layer is not necessary. Christoph and Damien, do you
agree to replace the bio sorting code in my previous email with the
patch below?
Thanks,
Bart.
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 3d419fd2be57..c41ab294987e 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1792,9 +1792,12 @@ static inline bool dm_zone_bio_needs_split(struct mapped_device *md,
{
/*
* For mapped device that need zone append emulation, we must
- * split any large BIO that straddles zone boundaries.
+ * split any large BIO that straddles zone boundaries. Additionally,
+ * split sequential writes to prevent reordering caused by bio
+ * splitting lower in the stack.
*/
- return dm_emulate_zone_append(md) && bio_straddles_zones(bio) &&
+ return ((dm_emulate_zone_append(md) && bio_straddles_zones(bio)) ||
+ bio_op(bio) == REQ_OP_WRITE) &&
!bio_flagged(bio, BIO_ZONE_WRITE_PLUGGING);
}
static inline bool dm_zone_plug_bio(struct mapped_device *md, struct bio *bio)
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order
2025-05-20 18:09 ` Bart Van Assche
@ 2025-05-21 5:53 ` Christoph Hellwig
2025-05-21 21:18 ` Bart Van Assche
0 siblings, 1 reply; 44+ messages in thread
From: Christoph Hellwig @ 2025-05-21 5:53 UTC (permalink / raw)
To: Bart Van Assche
Cc: Christoph Hellwig, Jens Axboe, linux-block, Damien Le Moal,
Yu Kuai, Ming Lei, stable
On Tue, May 20, 2025 at 11:09:15AM -0700, Bart Van Assche wrote:
> If the sequential write bios are split by the device mapper, sorting
> bios in the block layer is not necessary. Christoph and Damien, do you
> agree to replace the bio sorting code in my previous email with the
> patch below?
No. First please create a reproducer for your issue using null_blk
or scsi_debug, otherwise we have no way to understand what is going
on here, and will regress in the future.
Second, we should very much be able to fix the splitting in dm to place
the bios in the right order. As mentioned before, I have a theory
of how to do it, but we really need a proper reproducer to test this
and then to write it up to blktests first.
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order
2025-05-21 5:53 ` Christoph Hellwig
@ 2025-05-21 21:18 ` Bart Van Assche
2025-05-22 5:12 ` Damien Le Moal
2025-05-23 4:21 ` Christoph Hellwig
0 siblings, 2 replies; 44+ messages in thread
From: Bart Van Assche @ 2025-05-21 21:18 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jens Axboe, linux-block, Damien Le Moal, Yu Kuai, Ming Lei,
stable
On 5/20/25 10:53 PM, Christoph Hellwig wrote:
> On Tue, May 20, 2025 at 11:09:15AM -0700, Bart Van Assche wrote:
>> If the sequential write bios are split by the device mapper, sorting
>> bios in the block layer is not necessary. Christoph and Damien, do you
>> agree to replace the bio sorting code in my previous email with the
>> patch below?
>
> No. First please create a reproducer for your issue using null_blk
> or scsi_debug, otherwise we have no way to understand what is going
> on here, and will regress in the future.
>
> Second, we should very much be able to fix the splitting in dm to place
> the bios in the right order. As mentioned before I have a theory
> of how to do it, but we really need a proper reproducer to test this
> and then to write it up to blktests first.
Hi Christoph,
The following pull request includes a test that triggers the deadlock
fixed by patch 2/2 reliably:
https://github.com/osandov/blktests/pull/171
I do not yet have a reproducer for the bio reordering but I'm still
working on this.
Bart.
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order
2025-05-21 21:18 ` Bart Van Assche
@ 2025-05-22 5:12 ` Damien Le Moal
2025-05-22 17:08 ` Bart Van Assche
2025-05-23 4:21 ` Christoph Hellwig
1 sibling, 1 reply; 44+ messages in thread
From: Damien Le Moal @ 2025-05-22 5:12 UTC (permalink / raw)
To: Bart Van Assche, Christoph Hellwig
Cc: Jens Axboe, linux-block, Yu Kuai, Ming Lei,
Shin'ichiro Kawasaki
On 5/21/25 23:18, Bart Van Assche wrote:
> On 5/20/25 10:53 PM, Christoph Hellwig wrote:
>> On Tue, May 20, 2025 at 11:09:15AM -0700, Bart Van Assche wrote:
>>> If the sequential write bios are split by the device mapper, sorting
>>> bios in the block layer is not necessary. Christoph and Damien, do you
>>> agree to replace the bio sorting code in my previous email with the
>>> patch below?
>>
>> No. First please create a reproducer for your issue using null_blk
>> or scsi_debug, otherwise we have no way to understand what is going
>> on here, and will regress in the future.
>>
>> Second should very much be able to fix the splitting in dm to place
>> the bios in the right order. As mentioned before I have a theory
>> of how to do it, but we really need a proper reproducer to test this
>> and then to write it up to blktests first.
>
> Hi Christoph,
>
> The following pull request includes a test that triggers the deadlock
> fixed by patch 2/2 reliably:
>
> https://github.com/osandov/blktests/pull/171
+Shin'ichiro so that he is aware of the context.
Please share the blktest patch on this list so that we can see how you recreate
the issue. That makes it easier to see if a fix is appropriate.
> I do not yet have a reproducer for the bio reordering but I'm still
> working on this.
I am still very confused about how this is possible assuming a well behaved user
that actually submits write BIOs in sequence for a zone. That means with a lock
around submit_bio() calls. Assuming such user, a large write BIO that is split
would have its fragments all processed and added to the target zone plug in
order. Another context (or the same context) submitting the next write for that
zone would have the same happen, so BIO fragments should not be reordered...
So to clarify: are we talking about splits of the BIO that the DM device
receives? Or is it about splits of cloned BIOs that are used to process the
BIOs that the DM device received? The clones are for the underlying device and
should not have the zone plugging flag set until the DM target driver submits
them, even if the original BIO is flagged with zone plugging. Looking at the bio
clone code, the bio flags do not seem to be copied from the source BIO to the
clone. So even if the source BIO (the BIO received by the DM device) is flagged
with zone write plugging, a clone should not have this flag set until it is
submitted.
Could you clarify the sequence and the BIO flags you see that lead to the issue?
--
Damien Le Moal
Western Digital Research
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order
2025-05-22 5:12 ` Damien Le Moal
@ 2025-05-22 17:08 ` Bart Van Assche
2025-05-23 6:02 ` Damien Le Moal
0 siblings, 1 reply; 44+ messages in thread
From: Bart Van Assche @ 2025-05-22 17:08 UTC (permalink / raw)
To: Damien Le Moal, Christoph Hellwig
Cc: Jens Axboe, linux-block, Yu Kuai, Ming Lei,
Shin'ichiro Kawasaki
On 5/21/25 10:12 PM, Damien Le Moal wrote:
> I am still very confused about how this is possible assuming a well behaved user
> that actually submits write BIOs in sequence for a zone. That means with a lock
> around submit_bio() calls. Assuming such user, a large write BIO that is split
> would have its fragments all processed and added to the target zone plug in
> order. Another context (or the same context) submitting the next write for that
> zone would have the same happen, so BIO fragments should not be reordered...
>
> So to clarify: are we talking about splits of the BIO that the DM device
> receives ? Or is it about splits of cloned BIOs that are used to process the
> BIOs that the DM device received ? The clones are for the underlying device and
> should not have the zone plugging flag set until the DM target driver submits
> them, even if the original BIO is flagged with zone plugging. Looking at the bio
> clone code, the bio flags do not seem to be copied from the source BIO to the
> clone. So even if the source BIO (the BIO received by the DM device) is flagged
> with zone write plugging, a clone should not have this flag set until it is
> submitted.
>
> Could you clarify the sequence and BIO flags you see that leads to the issue ?
Hi Damien,
In the tests that I ran, F2FS submits bios to a dm driver and the dm
driver submits these bios to the SCSI disk (sd) driver. F2FS submits
bios at the write pointer. If that were not the case, the following
code in block/blk-zoned.c would reject these bios:
/*
* Check for non-sequential writes early as we know that BIOs
* with a start sector not aligned to the zone write pointer
* will fail.
*/
if (bio_offset_from_zone_start(bio) != zwplug->wp_offset)
return false;
If the bio is larger than 1 MiB, it gets split by the block layer after
it passed through the dm driver and before it is submitted to the sd
driver. The UFS driver sets max_sectors to 1 MiB. Although UFS host
controllers support larger requests, this value has been chosen to
minimize the impact of writes on read latency.
Earlier emails in this thread show that the bio splitting below the dm
driver can cause bio reordering. See also the call stack that is
available here:
https://lore.kernel.org/linux-block/47b24ea0-ef8f-441f-b405-a062b986ce93@acm.org/
Bart.
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order
2025-05-21 21:18 ` Bart Van Assche
2025-05-22 5:12 ` Damien Le Moal
@ 2025-05-23 4:21 ` Christoph Hellwig
1 sibling, 0 replies; 44+ messages in thread
From: Christoph Hellwig @ 2025-05-23 4:21 UTC (permalink / raw)
To: Bart Van Assche
Cc: Christoph Hellwig, Jens Axboe, linux-block, Damien Le Moal,
Yu Kuai, Ming Lei, stable
On Wed, May 21, 2025 at 02:18:12PM -0700, Bart Van Assche wrote:
> The following pull request includes a test that triggers the deadlock
> fixed by patch 2/2 reliably:
>
> https://github.com/osandov/blktests/pull/171
Can you post the patch please? I've not found any obvious way
to download a patch from the above website.
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order
2025-05-22 17:08 ` Bart Van Assche
@ 2025-05-23 6:02 ` Damien Le Moal
2025-05-23 16:30 ` Bart Van Assche
0 siblings, 1 reply; 44+ messages in thread
From: Damien Le Moal @ 2025-05-23 6:02 UTC (permalink / raw)
To: Bart Van Assche, Christoph Hellwig
Cc: Jens Axboe, linux-block, Yu Kuai, Ming Lei,
Shin'ichiro Kawasaki
On 5/22/25 19:08, Bart Van Assche wrote:
> On 5/21/25 10:12 PM, Damien Le Moal wrote:
>> I am still very confused about how this is possible assuming a well behaved user
>> that actually submits write BIOs in sequence for a zone. That means with a lock
>> around submit_bio() calls. Assuming such user, a large write BIO that is split
>> would have its fragments all processed and added to the target zone plug in
>> order. Another context (or the same context) submitting the next write for that
>> zone would have the same happen, so BIO fragments should not be reordered...
>>
>> So to clarify: are we talking about splits of the BIO that the DM device
>> receives ? Or is it about splits of cloned BIOs that are used to process the
>> BIOs that the DM device received ? The clones are for the underlying device and
>> should not have the zone plugging flag set until the DM target driver submits
>> them, even if the original BIO is flagged with zone plugging. Looking at the bio
>> clone code, the bio flags do not seem to be copied from the source BIO to the
>> clone. So even if the source BIO (the BIO received by the DM device) is flagged
>> with zone write plugging, a clone should not have this flag set until it is
>> submitted.
>>
>> Could you clarify the sequence and BIO flags you see that leads to the issue ?
>
> Hi Damien,
>
> In the tests that I ran, F2FS submits bios to a dm driver and the dm
> driver submits these bios to the SCSI disk (sd) driver. F2FS submits
Which DM driver is it ? Does that DM driver have some special work queue
handling of BIO submissions ? Or does it simply remap the BIO and send it down
to the underlying device in the initial submit_bio() context ? If it is the
former case, then that DM driver must enable zone write plugging. If it is the
latter, it should not need zone write plugging and ordering will be handled
correctly throughout the submit_bio() context for the initial DM BIO, assuming
that the submitter does indeed serialize write BIO submissions to a zone. I have
not looked at f2fs code in ages. When I worked on it, there was a mutex to
serialize write issuing to avoid reordering issues...
> bios at the write pointer. If that were not the case, the following
> code in block/blk-zoned.c would reject these bios:
>
> /*
> * Check for non-sequential writes early as we know that BIOs
> * with a start sector not aligned to the zone write pointer
> * will fail.
> */
> if (bio_offset_from_zone_start(bio) != zwplug->wp_offset)
> return false;
>
> If the bio is larger than 1 MiB, it gets split by the block layer after
> it passed through the dm driver and before it is submitted to the sd
> driver. The UFS driver sets max_sectors to 1 MiB. Although UFS host
> controllers support larger requests, this value has been chosen to
> minimize the impact of writes on read latency.
As mentioned above, the splitting and adding to the zone write plug should all
be serialized by the submitter, using whatever means are appropriate there. As
long as submit_bio() is busy processing a large BIO and splitting it, if the
submitter is correctly serializing writes, I do not see how splitting can result
in reordering...
> Earlier emails in this thread show that the bio splitting below the dm
> driver can cause bio reordering. See also the call stack that is
> available here:
>
> https://lore.kernel.org/linux-block/47b24ea0-ef8f-441f-b405-a062b986ce93@acm.org/
I asked for clarification in the first place because I still do not understand
what is going on from reading the lightly explained backtrace you show in that
email. A more detailed time-flow explanation of what is happening and in which
context would very likely clarify exactly what is going on.
So far, the only thing I can think of is that maybe we need to split BIOs in DM
core before submitting them to the DM driver. But I am reluctant to send such a
patch because I cannot justify/explain its need based on your explanations.
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order
2025-05-23 6:02 ` Damien Le Moal
@ 2025-05-23 16:30 ` Bart Van Assche
2025-05-24 8:48 ` Damien Le Moal
2025-05-26 5:24 ` Christoph Hellwig
0 siblings, 2 replies; 44+ messages in thread
From: Bart Van Assche @ 2025-05-23 16:30 UTC (permalink / raw)
To: Damien Le Moal, Christoph Hellwig
Cc: Jens Axboe, linux-block, Yu Kuai, Ming Lei,
Shin'ichiro Kawasaki
On 5/22/25 11:02 PM, Damien Le Moal wrote:
> On 5/22/25 19:08, Bart Van Assche wrote:
> [ ... ]
> Which DM driver is it ? Does that DM driver have some special work queue
> handling of BIO submissions ? Or does it simply remap the BIO and send it down
> to the underlying device in the initial submit_bio() context ? If it is the
> former case, then that DM driver must enable zone write plugging. If it is the
> latter, it should not need zone write plugging and ordering will be handled
> correctly throughout the submit_bio() context for the initial DM BIO, assuming
> that the submitter does indeed serialize write BIO submissions to a zone. I have
> not looked at f2fs code in ages. When I worked on it, there was a mutex to
> serialize write issuing to avoid reordering issues...
It is the dm-default-key driver, a driver about which everyone
(including its authors) agrees that it should disappear.
Unfortunately the functionality provided by that driver (encrypting
filesystem metadata) has not yet been integrated into the upstream kernel.
How that driver (dm-default-key) works is very similar to how dm-crypt
works. I think that the most important difference is that dm-crypt
requests encryption for all bios while dm-default-key only sets an
encryption key for a subset of the bios it processes.
The source code of that driver is available here:
https://android.googlesource.com/kernel/common/+/refs/heads/android16-6.12/drivers/md/dm-default-key.c
>> Earlier emails in this thread show that the bio splitting below the dm
>> driver can cause bio reordering. See also the call stack that is
>> available here:
>>
>> https://lore.kernel.org/linux-block/47b24ea0-ef8f-441f-b405-a062b986ce93@acm.org/
>
> I asked for clarification in the first place because I still do not understand
> what is going on reading that lightly explained backtrace you show in that
> email. A more detailed time flow explanation of what is happening and in which
> context would very likely clarify exactly what is gong on.
>
> So far, the only thing I can think of is that maybe we need to split BIOs in DM
> core before submitting them to the DM driver. But I am reluctant to send such
> patch because I cannot justify/expalin its need based on your explanations.
Backtraces do not fully explain what happens because a single function
(__submit_bio_noacct()) processes bios for multiple levels in a stacked
block driver hierarchy. That's why I added debug code that reports the
disk name (sde) for the first bio that is inserted out-of-order. This
output shows that the bio reordering is the result of bio splitting at
the bottom of the stack.
Thanks,
Bart.
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order
2025-05-23 16:30 ` Bart Van Assche
@ 2025-05-24 8:48 ` Damien Le Moal
2025-05-24 14:05 ` Bart Van Assche
2025-05-26 5:24 ` Christoph Hellwig
1 sibling, 1 reply; 44+ messages in thread
From: Damien Le Moal @ 2025-05-24 8:48 UTC (permalink / raw)
To: Bart Van Assche, Christoph Hellwig
Cc: Jens Axboe, linux-block, Yu Kuai, Ming Lei,
Shin'ichiro Kawasaki
On 5/23/25 18:30, Bart Van Assche wrote:
> On 5/22/25 11:02 PM, Damien Le Moal wrote:
>> On 5/22/25 19:08, Bart Van Assche wrote:
> > [ ... ]
>> Which DM driver is it ? Does that DM driver have some special work queue
>> handling of BIO submissions ? Or does it simply remap the BIO and send it down
>> to the underlying device in the initial submit_bio() context ? If it is the
>> former case, then that DM driver must enable zone write plugging. If it is the
>> latter, it should not need zone write plugging and ordering will be handled
>> correctly throughout the submit_bio() context for the initial DM BIO, assuming
>> that the submitter does indeed serialize write BIO submissions to a zone. I have
>> not looked at f2fs code in ages. When I worked on it, there was a mutex to
>> serialize write issuing to avoid reordering issues...
>
> It is the dm-default-key driver, a driver about which everyone
> (including its authors) agrees that it should disappear.
> Unfortunately the functionality provided by that driver (encrypting
> filesystem metadata) has not yet been integrated into the upstream kernel.
>
> How that driver (dm-default-key) works is very similar to how dm-crypt
> works. I think that the most important difference is that dm-crypt
> requests encryption for all bios while dm-default-key only sets an
> encryption key for a subset of the bios it processes.
>
> The source code of that driver is available here:
> https://android.googlesource.com/kernel/common/+/refs/heads/android16-6.12/drivers/md/dm-default-key.c
Well, this is an out of tree driver. Not touching this.
Unless you can reproduce the issue with dm-crypt (or any other DM target that is
upstream), I will not even try to debug this.
Note that our internal test suite runs *lots* of different zoned devices (SMR
HDD, ZNS SSDs, nullblk, tcmu-runner ZBC device, scsi_debug, qemu nvme zns
device) against *lots* of configurations for file systems (xfs, btrfs, zonefs)
and DM targets (dm-crypt, dm-linear), and we have not seen any reordering issues.
We run this test suite weekly against RC kernels and the for-next branch.
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order
2025-05-24 8:48 ` Damien Le Moal
@ 2025-05-24 14:05 ` Bart Van Assche
2025-05-24 15:36 ` Damien Le Moal
0 siblings, 1 reply; 44+ messages in thread
From: Bart Van Assche @ 2025-05-24 14:05 UTC (permalink / raw)
To: Damien Le Moal, Christoph Hellwig
Cc: Jens Axboe, linux-block, Yu Kuai, Ming Lei,
Shin'ichiro Kawasaki
On 5/24/25 1:48 AM, Damien Le Moal wrote:
> Note that our internal test suite runs *lots* of different zoned devices (SMR
> HDD, ZNS SSDs, nullblk, tcmu-runner ZBC device, scsi_debug, qemu nvme zns
> device) against *lots* of configurations for file systems (xfs, btrfs, zonefs)
> and DM targets (dm-crypt, dm-linear), and we have not seen any reordering issues.
> We run this test suite weekly against RC kernels and the for-next branch.
Hi Damien,
Please consider adding this dm-crypt test to the weekly test run:
https://lore.kernel.org/linux-block/20250523164956.883024-1-bvanassche@acm.org
Thanks,
Bart.
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order
2025-05-24 14:05 ` Bart Van Assche
@ 2025-05-24 15:36 ` Damien Le Moal
0 siblings, 0 replies; 44+ messages in thread
From: Damien Le Moal @ 2025-05-24 15:36 UTC (permalink / raw)
To: Bart Van Assche, Christoph Hellwig
Cc: Jens Axboe, linux-block, Yu Kuai, Ming Lei,
Shin'ichiro Kawasaki
On 5/24/25 16:05, Bart Van Assche wrote:
> On 5/24/25 1:48 AM, Damien Le Moal wrote:
>> Note that our internal test suite runs *lots* of different zoned devices (SMR
>> HDD, ZNS SSDs, nullblk, tcmu-runner ZBC device, scsi_debug, qemu nvme zns
>> device) against *lots* of configurations for file systems (xfs, btrfs, zonefs)
>> and DM targets (dm-crypt, dm-linear), and we have not seen any reordering issues.
>> We run this test suite weekly against RC kernels and the for-next branch.
>
> Hi Damien,
>
> Please consider adding this dm-crypt test to the weekly test run:
> https://lore.kernel.org/linux-block/20250523164956.883024-1-bvanassche@acm.org
We run blktests too, of course. So when Shin'ichiro applies this, it will be part
of the runs.
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order
2025-05-23 16:30 ` Bart Van Assche
2025-05-24 8:48 ` Damien Le Moal
@ 2025-05-26 5:24 ` Christoph Hellwig
2025-05-27 16:19 ` Bart Van Assche
2025-06-08 22:07 ` Bart Van Assche
1 sibling, 2 replies; 44+ messages in thread
From: Christoph Hellwig @ 2025-05-26 5:24 UTC (permalink / raw)
To: Bart Van Assche
Cc: Damien Le Moal, Christoph Hellwig, Jens Axboe, linux-block,
Yu Kuai, Ming Lei, Shin'ichiro Kawasaki
On Fri, May 23, 2025 at 09:30:36AM -0700, Bart Van Assche wrote:
> It is the dm-default-key driver, a driver about which everyone
> (including its authors) agrees that it should disappear.
> Unfortunately the functionality provided by that driver (encrypting
> filesystem metadata) has not yet been integrated into the upstream kernel.
>
> How that driver (dm-default-key) works is very similar to how dm-crypt
> works. I think that the most important difference is that dm-crypt
> requests encryption for all bios while dm-default-key only sets an
> encryption key for a subset of the bios it processes.
Umm, Bart, I really expected better from you. You've been ducking around
providing a reproducer for over a week and wasting multiple people's
time, only to tell us the only reproducer is your out-of-tree thingy
that was rejected upstream before? That's not really how Linux development
works.
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order
2025-05-26 5:24 ` Christoph Hellwig
@ 2025-05-27 16:19 ` Bart Van Assche
2025-05-31 0:25 ` Bart Van Assche
2025-06-08 22:07 ` Bart Van Assche
1 sibling, 1 reply; 44+ messages in thread
From: Bart Van Assche @ 2025-05-27 16:19 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Damien Le Moal, Jens Axboe, linux-block, Yu Kuai, Ming Lei,
Shin'ichiro Kawasaki
On 5/25/25 10:24 PM, Christoph Hellwig wrote:
> On Fri, May 23, 2025 at 09:30:36AM -0700, Bart Van Assche wrote:
>> It is the dm-default-key driver, a driver about which everyone
>> (including its authors) agrees that it should disappear.
>> Unfortunately the functionality provided by that driver (encrypting
>> filesystem metadata) has not yet been integrated into the upstream kernel.
>>
>> How that driver (dm-default-key) works is very similar to how dm-crypt
>> works. I think that the most important difference is that dm-crypt
>> requests encryption for all bios while dm-default-key only sets an
>> encryption key for a subset of the bios it processes.
>
> Umm, Bart, I really expected better from you. You've been ducking around
> providing a reproducer for over a week and wasting multiple people's
> time, only to tell us the only reproducer is your out-of-tree thingy
> that was rejected upstream before? That's not really how Linux development
> works.
I'm still working on a reproducer for the blktests framework. To my own
frustration I have not yet found a small reproducer that is easy for
others to run. I'm convinced that it should be possible to create a
reproducer based on dm-crypt, since dm-crypt and the dm-default-key
driver behave identically with regard to bio splitting.
Bart.
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order
2025-05-27 16:19 ` Bart Van Assche
@ 2025-05-31 0:25 ` Bart Van Assche
0 siblings, 0 replies; 44+ messages in thread
From: Bart Van Assche @ 2025-05-31 0:25 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Damien Le Moal, Jens Axboe, linux-block, Yu Kuai, Ming Lei,
Shin'ichiro Kawasaki
On 5/27/25 9:19 AM, Bart Van Assche wrote:
> I'm still working on a reproducer for the blktests framework. To my own
> frustration I have not yet found a small reproducer that is easy for
> others to run. I'm convinced that it should be possible to create a
> reproducer based on dm-crypt, since dm-crypt and the dm-default-key
> driver behave identically with regard to bio splitting.
(replying to my own email)
I have not yet found a small reproducer - not even against the kernel
this issue was originally observed with (android16-6.12). So I switched
my approach. I'm now working on obtaining more information about the
queued bios when this issue occurs. It seems like a bio for a SCSI disk
device that shouldn't be split is split anyway. A bio A that does not
cross a zone boundary is split into bios A1 and A2 and bio A1 is split
into bios A11 and A12. Bio A12 is added at the end of the list (past bio
A2) and this causes the reordering. I'm currently analyzing why this
happens because this shouldn't happen.
Bart.
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order
2025-05-26 5:24 ` Christoph Hellwig
2025-05-27 16:19 ` Bart Van Assche
@ 2025-06-08 22:07 ` Bart Van Assche
2025-06-08 22:47 ` Damien Le Moal
2025-06-09 3:55 ` Christoph Hellwig
1 sibling, 2 replies; 44+ messages in thread
From: Bart Van Assche @ 2025-06-08 22:07 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Damien Le Moal, Jens Axboe, linux-block, Yu Kuai, Ming Lei,
Shin'ichiro Kawasaki
On 5/25/25 10:24 PM, Christoph Hellwig wrote:
> Umm, Bart, I really expected better from you. You've been ducking around
> providing a reproducer for over a week and wasting multiple people's
> time, only to tell us the only reproducer is your out-of-tree thingy
> that was rejected upstream before? That's not really how Linux development
> works.
A reproducer for the blktests framework is available below. It took me
more time than I had expected to come up with a reproducer because I was
initially looking in the wrong direction (device mapper). As one can
see, the reproducer below does not involve any device mapper driver.
Do you plan to work on a fix or do you perhaps expect me to do this?
Thanks,
Bart.
$ cat tests/zbd/014
#!/bin/bash
# SPDX-License-Identifier: GPL-3.0+
# Copyright (C) 2025 Google LLC
#
# Trigger multiple splitting of a bio. About the values of the size parameters
# in this test:
# - The zone size is 4 MiB and is larger than max_sectors_kb.
# - The maximum bio size supported by the code in block/blk-crypto-fallback.c
#   is BIO_MAX_VECS * PAGE_SIZE, or 1 MiB if the page size is 4 KiB.
# - max_sectors_kb has been set to 512 KiB to cause further bio splitting.
. tests/zbd/rc
. common/null_blk
DESCRIPTION="test multiple bio splitting"
QUICK=1
requires() {
_have_driver f2fs
_have_driver null_blk
_have_program fscrypt
_have_program getconf
_have_program mkfs.f2fs
for o in BLK_INLINE_ENCRYPTION_FALLBACK FS_ENCRYPTION_INLINE_CRYPT; do
if ! _check_kernel_option "$o"; then
SKIP_REASONS+=("Kernel option $o has not been enabled")
fi
done
}
trace_block_io() {
(
set -e
cd /sys/kernel/tracing
lsof -t trace_pipe | xargs -r kill
echo 1024 > buffer_size_kb
echo nop > current_tracer
echo 0 > tracing_on
echo > trace
echo 0 > events/enable
echo 1 > events/block/enable
echo 0 > events/block/block_dirty_buffer/enable
echo 0 > events/block/block_plug/enable
echo 0 > events/block/block_touch_buffer/enable
echo 0 > events/block/block_unplug/enable
grep -lw 1 events/block/*/enable | while read -r event_path; do
filter_path="${event_path%enable}filter"
echo "$1" > "$filter_path"
done
echo 1 > tracing_on
echo "Tracing has been enabled" >>"$FULL"
cat trace_pipe > "${FULL%.full}-block-trace.txt"
)
}
stop_tracing() {
kill "$1"
(
set -e
cd /sys/kernel/tracing
echo 0 > tracing_on
echo 0 > events/enable
)
}
report_stats() {
local pfx=$1 before after
read -r -a before <<<"$2"
read -r -a after <<<"$3"
local reads=$((after[0]-before[0]))
local rmerge=$((after[1]-before[1]))
local writes=$((after[4]-before[4]))
local wmerge=$((after[5]-before[5]))
echo "$pfx reads: $reads rmerge: $rmerge writes: $writes wmerge: $wmerge"
}
devno() {
IFS=: read -r maj min <"/sys/class/block/$(basename "$1")/dev"
# From <linux/kdev_t.h> MINORBITS = 20
echo $(((maj << 20) + min))
}
test() {
echo "Running ${TEST_NAME}"
local page_size
page_size=$(getconf PAGE_SIZE) || return $?
local mount_dir="$TMPDIR/mnt"
umount /dev/nullb1 >>"$FULL" 2>&1
_remove_null_blk_devices
# A small conventional block device for the F2FS metadata.
local null_blk_params=(
blocksize=4096
discard=1
max_sectors=$(((1 << 32) - 1))
memory_backed=1
size=64 # MiB
submit_queues=1
power=1
)
_configure_null_blk nullb1 "${null_blk_params[@]}"
local cdev=/dev/nullb1
# A larger zoned block device for the data.
local null_blk_params=(
blocksize=4096
completion_nsec=10000000 # 10 ms
irqmode=2
max_sectors=$(((1 << 32) - 1))
memory_backed=1
hw_queue_depth=1
size=1024 # MiB
submit_queues=1
zone_size="$((page_size >> 10))" # 1024 times the page size.
zoned=1
power=1
)
_configure_null_blk nullb2 "${null_blk_params[@]}" || return $?
local zdev_basename=nullb2
local zdev=/dev/${zdev_basename}
local zdev_devno
zdev_devno=$(devno "$zdev")
{
local max_sectors_zdev=/sys/class/block/"${zdev_basename}"/queue/max_sectors_kb
echo $((128 * page_size)) > "${max_sectors_zdev}"
echo "${zdev_basename}: max_sectors_kb=$(<"${max_sectors_zdev}")"
ls -ld "${cdev}" "${zdev}"
echo "${zdev_basename} settings:"
(cd "/sys/class/block/$zdev_basename/queue" && grep -vw 0 ./*)
} >>"${FULL}" 2>&1
trace_block_io "dev == ${zdev_devno}" &
local trace_pid=$!
while ! grep -q 'Tracing has been enabled' "$FULL"; do
sleep .1
done
{
mkfs.f2fs -q -O encrypt -m "${cdev}" -c "${zdev}" &&
mkdir -p "${mount_dir}" &&
mount -o inlinecrypt -t f2fs "${cdev}" "${mount_dir}" &&
local encrypted_dir="${mount_dir}/encrypted" &&
mkdir -p "${encrypted_dir}"
fscrypt setup "${mount_dir}" </dev/null
local keyfile=$TMPDIR/keyfile &&
dd if=/dev/zero of="$keyfile" bs=32 count=1 status=none &&
fscrypt encrypt "${encrypted_dir}" --source=raw_key \
--name=protector --key="$keyfile" &&
fscrypt status "${encrypted_dir}" &&
local before after &&
read -r -a before <"/sys/class/block/${zdev_basename}/stat" &&
echo "Starting IO" &&
local cmd="dd if=/dev/zero of=${encrypted_dir}/file bs=64M count=15 conv=fdatasync" &&
echo "$cmd" &&
eval "$cmd" &&
ls -ld "${mount_dir}/encrypted/file" &&
read -r -a after <"/sys/class/block/${zdev_basename}/stat" &&
report_stats "zdev:" "${before[*]}" "${after[*]}"
} >>"$FULL" 2>&1 || fail=true
umount "${mount_dir}" >>"${FULL}" 2>&1
[ -n "${trace_pid}" ] && kill "${trace_pid}"
_exit_null_blk
if [ -z "$fail" ]; then
echo "Test complete"
else
echo "Test failed"
return 1
fi
}
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order
2025-06-08 22:07 ` Bart Van Assche
@ 2025-06-08 22:47 ` Damien Le Moal
2025-06-09 3:58 ` Christoph Hellwig
2025-06-09 20:48 ` Bart Van Assche
2025-06-09 3:55 ` Christoph Hellwig
1 sibling, 2 replies; 44+ messages in thread
From: Damien Le Moal @ 2025-06-08 22:47 UTC (permalink / raw)
To: Bart Van Assche, Christoph Hellwig
Cc: Jens Axboe, linux-block, Yu Kuai, Ming Lei,
Shin'ichiro Kawasaki
On 6/9/25 07:07, Bart Van Assche wrote:
> On 5/25/25 10:24 PM, Christoph Hellwig wrote:
>> Umm, Bart, I really expected better from you. You've been ducking around
>> providing a reproducer for over a week and wasting multiple people's
>> time, only to tell us the only reproducer is your out-of-tree thingy
>> that was rejected upstream before? That's not really how Linux development
>> works.
>
> A reproducer for the blktests framework is available below. It took me
> more time than I had expected to come up with a reproducer because I was
> initially looking in the wrong direction (device mapper). As one can
> see, the reproducer below does not involve any device mapper driver.
>
> Do you plan to work on a fix or do you perhaps expect me to do this?
To be honest, I have never tested the block layer inline crypto feature using
zoned devices. All the use cases I have with crypto are using dm-crypt.
A quick look tells me that the issue is with that particular feature, especially
the fallback path, as it uses work items to asynchronously handle write data
encryption. Since this happens above the zone write plug, without the fallback
being aware of the sequential write constraint, problems may indeed happen.
So yes, we need a fix. Can you work on one ?
One additional thing: why do you need f2fs with your reproducer ? Problems
should happen also doing writes directly to the block device, no ?
>
> Thanks,
>
> Bart.
>
> $ cat tests/zbd/014
> #!/bin/bash
> # SPDX-License-Identifier: GPL-3.0+
> # Copyright (C) 2025 Google LLC
> #
> # Trigger multiple splitting of a bio. About the values of the size
> parameters
> # in this test:
> # - The zone size is 4 MiB and is larger than max_sectors_kb.
> # - The maximum bio size supported by the code in
> block/blk-crypto-fallback.c
> # is BIO_MAX_VECS * PAGE_SIZE or 1 MiB if the page size is 4 KiB.
> # - max_sectors_kb has been set to 512 KiB to cause further bio splitting.
>
> . tests/zbd/rc
> . common/null_blk
>
> DESCRIPTION="test multiple bio splitting"
> QUICK=1
>
> requires() {
> _have_driver f2fs
> _have_driver null_blk
> _have_program fscrypt
> _have_program getconf
> _have_program mkfs.f2fs
> for o in BLK_INLINE_ENCRYPTION_FALLBACK FS_ENCRYPTION_INLINE_CRYPT; do
> if ! _check_kernel_option "$o"; then
> SKIP_REASONS+=("Kernel option $o has not been enabled")
> fi
> done
> }
>
> trace_block_io() {
> (
> set -e
> cd /sys/kernel/tracing
> lsof -t trace_pipe | xargs -r kill
> echo 1024 > buffer_size_kb
> echo nop > current_tracer
> echo 0 > tracing_on
> echo > trace
> echo 0 > events/enable
> echo 1 > events/block/enable
> echo 0 > events/block/block_dirty_buffer/enable
> echo 0 > events/block/block_plug/enable
> echo 0 > events/block/block_touch_buffer/enable
> echo 0 > events/block/block_unplug/enable
> grep -lw 1 events/block/*/enable | while read -r event_path; do
> filter_path="${event_path%enable}filter"
> echo "$1" > "$filter_path"
> done
> echo 1 > tracing_on
> echo "Tracing has been enabled" >>"$FULL"
> cat trace_pipe > "${FULL%.full}-block-trace.txt"
> )
> }
>
> stop_tracing() {
> kill "$1"
> (
> set -e
> cd /sys/kernel/tracing
> echo 0 > tracing_on
> echo 0 > events/enable
> )
> }
>
> report_stats() {
> local pfx=$1 before after
> read -r -a before <<<"$2"
> read -r -a after <<<"$3"
> local reads=$((after[0]-before[0]))
> local rmerge=$((after[1]-before[1]))
> local writes=$((after[4]-before[4]))
> local wmerge=$((after[5]-before[5]))
> echo "$pfx reads: $reads rmerge: $rmerge writes: $writes wmerge: $wmerge"
> }
>
> devno() {
> IFS=: read -r maj min <"/sys/class/block/$(basename "$1")/dev"
> # From <linux/kdev_t.h> MINORBITS = 20
> echo $(((maj << 20) + min))
> }
>
> test() {
> echo "Running ${TEST_NAME}"
>
> local page_size
> page_size=$(getconf PAGE_SIZE) || return $?
>
> local mount_dir="$TMPDIR/mnt"
>
> umount /dev/nullb1 >>"$FULL" 2>&1
> _remove_null_blk_devices
>
> # A small conventional block device for the F2FS metadata.
> local null_blk_params=(
> blocksize=4096
> discard=1
> max_sectors=$(((1 << 32) - 1))
> memory_backed=1
> size=64 # MiB
> submit_queues=1
> power=1
> )
> _configure_null_blk nullb1 "${null_blk_params[@]}"
> local cdev=/dev/nullb1
>
> # A larger zoned block device for the data.
> local null_blk_params=(
> blocksize=4096
> completion_nsec=10000000 # 10 ms
> irqmode=2
> max_sectors=$(((1 << 32) - 1))
> memory_backed=1
> hw_queue_depth=1
> size=1024 # MiB
> submit_queues=1
> zone_size="$((page_size >> 10))" # 1024 times the page size.
> zoned=1
> power=1
> )
> _configure_null_blk nullb2 "${null_blk_params[@]}" || return $?
> local zdev_basename=nullb2
> local zdev=/dev/${zdev_basename}
> local zdev_devno
> zdev_devno=$(devno "$zdev")
>
> {
> local
> max_sectors_zdev=/sys/class/block/"${zdev_basename}"/queue/max_sectors_kb
> echo $((128 * page_size)) > "${max_sectors_zdev}"
> echo "${zdev_basename}: max_sectors_kb=$(<"${max_sectors_zdev}")"
>
> ls -ld "${cdev}" "${zdev}"
> echo "${zdev_basename} settings:"
> (cd "/sys/class/block/$zdev_basename/queue" && grep -vw 0 ./*)
> } >>"${FULL}" 2>&1
>
> trace_block_io "dev == ${zdev_devno}" &
> local trace_pid=$!
> while ! grep -q 'Tracing has been enabled' "$FULL"; do
> sleep .1
> done
>
> {
> mkfs.f2fs -q -O encrypt -m "${cdev}" -c "${zdev}" &&
> mkdir -p "${mount_dir}" &&
> mount -o inlinecrypt -t f2fs "${cdev}" "${mount_dir}" &&
> local encrypted_dir="${mount_dir}/encrypted" &&
> mkdir -p "${encrypted_dir}"
> fscrypt setup "${mount_dir}" </dev/null
> local keyfile=$TMPDIR/keyfile &&
> dd if=/dev/zero of="$keyfile" bs=32 count=1 status=none &&
> fscrypt encrypt "${encrypted_dir}" --source=raw_key \
> --name=protector --key="$keyfile" &&
> fscrypt status "${encrypted_dir}" &&
>
> local before after &&
> read -r -a before <"/sys/class/block/${zdev_basename}/stat" &&
> echo "Starting IO" &&
> local cmd="dd if=/dev/zero of=${encrypted_dir}/file bs=64M count=15 conv=fdatasync" &&
> echo "$cmd" &&
> eval "$cmd" &&
> ls -ld "${mount_dir}/encrypted/file" &&
> read -r -a after <"/sys/class/block/${zdev_basename}/stat" &&
> report_stats "zdev:" "${before[*]}" "${after[*]}"
> } >>"$FULL" 2>&1 || fail=true
>
> umount "${mount_dir}" >>"${FULL}" 2>&1
> [ -n "${trace_pid}" ] && kill "${trace_pid}"
> _exit_null_blk
>
> if [ -z "$fail" ]; then
> echo "Test complete"
> else
> echo "Test failed"
> return 1
> fi
> }
>
>
--
Damien Le Moal
Western Digital Research
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order
2025-06-08 22:07 ` Bart Van Assche
2025-06-08 22:47 ` Damien Le Moal
@ 2025-06-09 3:55 ` Christoph Hellwig
2025-06-10 17:23 ` Bart Van Assche
1 sibling, 1 reply; 44+ messages in thread
From: Christoph Hellwig @ 2025-06-09 3:55 UTC (permalink / raw)
To: Bart Van Assche
Cc: Christoph Hellwig, Damien Le Moal, Jens Axboe, linux-block,
Yu Kuai, Ming Lei, Shin'ichiro Kawasaki
On Sun, Jun 08, 2025 at 03:07:30PM -0700, Bart Van Assche wrote:
> On 5/25/25 10:24 PM, Christoph Hellwig wrote:
>> Umm, Bart, I really expected better from you. You've been ducking
>> providing a reproducer for over a week and wasting multiple people's
>> time, only to tell us the only reproducer is your out-of-tree thingy
>> rejected upstream before? That's not really how Linux development
>> works.
>
> A reproducer for the blktests framework is available below. It took me
> more time than I had expected to come up with a reproducer because I was
> initially looking in the wrong direction (device mapper). As one can
> see, the reproducer below does not involve any device mapper driver.
>
> Do you plan to work on a fix or do you perhaps expect me to do this?
The problem here is that blk_crypto_fallback_split_bio_if_needed does
sneaky splits behind the back of the main splitting code.
The fix is to include the limit imposed by it in __bio_split_to_limits
as well if the crypto fallback is used.
If you have time to fix this that would be great. Otherwise I can
give it a spin, but it's public holiday and travel season here, so
my availability is a bit limited.
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order
2025-06-08 22:47 ` Damien Le Moal
@ 2025-06-09 3:58 ` Christoph Hellwig
2025-06-09 20:48 ` Bart Van Assche
1 sibling, 0 replies; 44+ messages in thread
From: Christoph Hellwig @ 2025-06-09 3:58 UTC (permalink / raw)
To: Damien Le Moal
Cc: Bart Van Assche, Christoph Hellwig, Jens Axboe, linux-block,
Yu Kuai, Ming Lei, Shin'ichiro Kawasaki
On Mon, Jun 09, 2025 at 07:47:38AM +0900, Damien Le Moal wrote:
> To be honest, I have never tested the block layer inline crypto feature using
> zoned devices. All the use cases I have with crypto are using dm-crypt.
> A quick look tells me that the issue is with that particular feature, especially
> the fallback path, as it is using work items to asynchronously handle write data
> encryption.
Nothing is asynchronous in the fallback write path, just weird open-coded
bio splitting like the one I just got rid of with the bounce buffering...
> One additional thing: why do you need f2fs with your reproducer ? Problems
> should happen also doing writes directly to the block device, no ?
Because fscrypt is the only user of the block crypto code, and f2fs
is the only fscrypt user that works on zoned devices.
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order
2025-06-08 22:47 ` Damien Le Moal
2025-06-09 3:58 ` Christoph Hellwig
@ 2025-06-09 20:48 ` Bart Van Assche
2025-06-10 5:04 ` Christoph Hellwig
1 sibling, 1 reply; 44+ messages in thread
From: Bart Van Assche @ 2025-06-09 20:48 UTC (permalink / raw)
To: Damien Le Moal, Christoph Hellwig
Cc: Jens Axboe, linux-block, Yu Kuai, Ming Lei,
Shin'ichiro Kawasaki
On 6/8/25 3:47 PM, Damien Le Moal wrote:
> So yes, we need a fix. Can you work on one ?
The patch below seems to be sufficient but I'm not sure whether this
approach is acceptable:
Subject: [PATCH] block: Preserve the LBA order when splitting a bio
Preserve the bio order if bio_split() is called on the prefix returned
by an earlier bio_split() call. This can happen with fscrypt and the
inline encryption fallback code if max_sectors is less than the maximum
bio size supported by the inline encryption fallback code (1 MiB for 4 KiB
pages) or when using zoned storage and the distance from the start of the
bio to the next zone boundary is less than 1 MiB.
Fixes: 488f6682c832 ("block: blk-crypto-fallback for Inline Encryption")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
block/bio.c | 8 ++++++++
block/blk-core.c | 12 ++++++++----
include/linux/blk_types.h | 5 +++++
3 files changed, 21 insertions(+), 4 deletions(-)
diff --git a/block/bio.c b/block/bio.c
index 3c0a558c90f5..440ed443545c 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1689,6 +1689,14 @@ struct bio *bio_split(struct bio *bio, int sectors,
bio_advance(bio, split->bi_iter.bi_size);
+ /*
+ * If bio_split() is called on a prefix from an earlier bio_split()
+ * call, adding it at the head of current->bio_list[0] preserves the
+ * LBA order. This is essential when writing data to a zoned block
+ * device and when using REQ_OP_WRITE instead of REQ_OP_ZONE_APPEND.
+ */
+ bio_set_flag(bio, BIO_ADD_AT_HEAD);
+
if (bio_flagged(bio, BIO_TRACE_COMPLETION))
bio_set_flag(split, BIO_TRACE_COMPLETION);
diff --git a/block/blk-core.c b/block/blk-core.c
index b862c66018f2..570a14a7bcd4 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -745,12 +745,16 @@ void submit_bio_noacct_nocheck(struct bio *bio)
* to collect a list of requests submited by a ->submit_bio method while
* it is active, and then process them after it returned.
*/
- if (current->bio_list)
- bio_list_add(&current->bio_list[0], bio);
- else if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO))
+ if (current->bio_list) {
+ if (bio_flagged(bio, BIO_ADD_AT_HEAD))
+ bio_list_add_head(&current->bio_list[0], bio);
+ else
+ bio_list_add(&current->bio_list[0], bio);
+ } else if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) {
__submit_bio_noacct_mq(bio);
- else
+ } else {
__submit_bio_noacct(bio);
+ }
}
static blk_status_t blk_validate_atomic_write_op_size(struct request_queue *q,
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 3d1577f07c1c..0e2d3fd8d40a 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -308,6 +308,11 @@ enum {
BIO_REMAPPED,
BIO_ZONE_WRITE_PLUGGING, /* bio handled through zone write plugging */
BIO_EMULATES_ZONE_APPEND, /* bio emulates a zone append operation */
+ /*
+ * make submit_bio_noacct_nocheck() add this bio at the head of
+ * current->bio_list[0].
+ */
+ BIO_ADD_AT_HEAD,
BIO_FLAG_LAST
};
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order
2025-06-09 20:48 ` Bart Van Assche
@ 2025-06-10 5:04 ` Christoph Hellwig
0 siblings, 0 replies; 44+ messages in thread
From: Christoph Hellwig @ 2025-06-10 5:04 UTC (permalink / raw)
To: Bart Van Assche
Cc: Damien Le Moal, Christoph Hellwig, Jens Axboe, linux-block,
Yu Kuai, Ming Lei, Shin'ichiro Kawasaki
On Mon, Jun 09, 2025 at 01:48:06PM -0700, Bart Van Assche wrote:
> On 6/8/25 3:47 PM, Damien Le Moal wrote:
>> So yes, we need a fix. Can you work on one ?
>
> The patch below seems to be sufficient but I'm not sure whether this
> approach is acceptable:
It's still the wrong approach. Please add the splitting needed for
the blk-crypt fallback to __bio_split_to_limits instead.
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order
2025-06-09 3:55 ` Christoph Hellwig
@ 2025-06-10 17:23 ` Bart Van Assche
2025-06-10 23:18 ` Keith Busch
0 siblings, 1 reply; 44+ messages in thread
From: Bart Van Assche @ 2025-06-10 17:23 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Damien Le Moal, Jens Axboe, linux-block, Yu Kuai, Ming Lei,
Shin'ichiro Kawasaki, Eric Biggers
On 6/8/25 8:55 PM, Christoph Hellwig wrote:
> The problem here is that blk_crypto_fallback_split_bio_if_needed does
> sneaky splits behind the back of the main splitting code.
>
> The fix is to include the limit imposed by it in __bio_split_to_limits
> as well if the crypto fallback is used.
(+Eric)
Hmm ... my understanding is that when the inline encryption fallback
code is used, the bio must be split before
blk_crypto_fallback_encrypt_bio() encrypts the bio. Making
__bio_split_to_limits() take the inline encryption limit into account
would require to encrypt the data much later. How to perform encryption
later for bio-based drivers? Would moving the blk_crypto_bio_prep() call
from submit_bio() to just before __bio_split_to_limits() perhaps require
modifying all bio-based drivers that do not call
__bio_split_to_limits()?
> If you have time to fix this that would be great. Otherwise I can
> give it a spin, but it's public holiday and travel season here, so
> my availability is a bit limited.
This is not the solution that you are looking for but this seems to
work:
diff --git a/block/blk-crypto-fallback.c b/block/blk-crypto-fallback.c
index 7c33e9573e5e..f4fefecdcc5e 100644
--- a/block/blk-crypto-fallback.c
+++ b/block/blk-crypto-fallback.c
@@ -213,6 +213,7 @@ blk_crypto_fallback_alloc_cipher_req(struct blk_crypto_keyslot *slot,
static bool blk_crypto_fallback_split_bio_if_needed(struct bio **bio_ptr)
{
struct bio *bio = *bio_ptr;
+ const struct queue_limits *lim = bdev_limits(bio->bi_bdev);
unsigned int i = 0;
unsigned int num_sectors = 0;
struct bio_vec bv;
@@ -223,6 +224,7 @@ static bool blk_crypto_fallback_split_bio_if_needed(struct bio **bio_ptr)
if (++i == BIO_MAX_VECS)
break;
}
+ num_sectors = min(num_sectors, get_max_io_size(bio, lim));
if (num_sectors < bio_sectors(bio)) {
struct bio *split_bio;
diff --git a/block/blk-merge.c b/block/blk-merge.c
index fb6253c07387..e308325a333c 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -192,8 +192,7 @@ static inline unsigned int blk_boundary_sectors(const struct queue_limits *lim,
* requests that are submitted to a block device if the start of a bio is not
* aligned to a physical block boundary.
*/
-static inline unsigned get_max_io_size(struct bio *bio,
- const struct queue_limits *lim)
+unsigned get_max_io_size(struct bio *bio, const struct queue_limits *lim)
{
unsigned pbs = lim->physical_block_size >> SECTOR_SHIFT;
unsigned lbs = lim->logical_block_size >> SECTOR_SHIFT;
diff --git a/block/blk.h b/block/blk.h
index 37ec459fe656..5f97db919cdf 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -425,6 +425,8 @@ static inline unsigned get_max_segment_size(const struct queue_limits *lim,
(unsigned long)lim->max_segment_size - 1) + 1);
}
+unsigned get_max_io_size(struct bio *bio, const struct queue_limits *lim);
+
int ll_back_merge_fn(struct request *req, struct bio *bio,
unsigned int nr_segs);
bool blk_attempt_req_merge(struct request_queue *q, struct request *rq,
Thanks,
Bart.
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order
2025-06-10 17:23 ` Bart Van Assche
@ 2025-06-10 23:18 ` Keith Busch
2025-06-11 0:46 ` Damien Le Moal
` (2 more replies)
0 siblings, 3 replies; 44+ messages in thread
From: Keith Busch @ 2025-06-10 23:18 UTC (permalink / raw)
To: Bart Van Assche
Cc: Christoph Hellwig, Damien Le Moal, Jens Axboe, linux-block,
Yu Kuai, Ming Lei, Shin'ichiro Kawasaki, Eric Biggers
On Tue, Jun 10, 2025 at 10:23:26AM -0700, Bart Van Assche wrote:
> On 6/8/25 8:55 PM, Christoph Hellwig wrote:
> > The problem here is that blk_crypto_fallback_split_bio_if_needed does
> > sneaky splits behind the back of the main splitting code.
> >
> > The fix is to include the limit imposed by it in __bio_split_to_limits
> > as well if the crypto fallback is used.
>
> (+Eric)
>
> Hmm ... my understanding is that when the inline encryption fallback
> code is used, the bio must be split before
> blk_crypto_fallback_encrypt_bio() encrypts the bio. Making
> __bio_split_to_limits() take the inline encryption limit into account
> would require to encrypt the data much later. How to perform encryption
> later for bio-based drivers? Would moving the blk_crypto_bio_prep() call
> from submit_bio() to just before __bio_split_to_limits() perhaps require
> modifying all bio-based drivers that do not call
> __bio_split_to_limits()?
I think you could just prep the encryption at the point the bio is split
to its queue's limits, and then all you need to do after that is ensure
the limits don't exceed what the fallback requires (which appears to
just be a simple segment limitation). It looks like most of the bio
based drivers split to limits already.
And if that's all okay, it simplifies quite a lot of things in this path
because the crypto fallback doesn't need to concern itself with
splitting anymore. Here's a quick shot at it, but compile-tested only:
---
diff --git a/block/blk-core.c b/block/blk-core.c
index b862c66018f25..157f4515eea7d 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -39,7 +39,6 @@
#include <linux/bpf.h>
#include <linux/part_stat.h>
#include <linux/sched/sysctl.h>
-#include <linux/blk-crypto.h>
#define CREATE_TRACE_POINTS
#include <trace/events/block.h>
@@ -626,9 +625,6 @@ static void __submit_bio(struct bio *bio)
/* If plug is not used, add new plug here to cache nsecs time. */
struct blk_plug plug;
- if (unlikely(!blk_crypto_bio_prep(&bio)))
- return;
-
blk_start_plug(&plug);
if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) {
diff --git a/block/blk-crypto-fallback.c b/block/blk-crypto-fallback.c
index 005c9157ffb3d..794400382e85b 100644
--- a/block/blk-crypto-fallback.c
+++ b/block/blk-crypto-fallback.c
@@ -209,36 +209,6 @@ blk_crypto_fallback_alloc_cipher_req(struct blk_crypto_keyslot *slot,
return true;
}
-static bool blk_crypto_fallback_split_bio_if_needed(struct bio **bio_ptr)
-{
- struct bio *bio = *bio_ptr;
- unsigned int i = 0;
- unsigned int num_sectors = 0;
- struct bio_vec bv;
- struct bvec_iter iter;
-
- bio_for_each_segment(bv, bio, iter) {
- num_sectors += bv.bv_len >> SECTOR_SHIFT;
- if (++i == BIO_MAX_VECS)
- break;
- }
- if (num_sectors < bio_sectors(bio)) {
- struct bio *split_bio;
-
- split_bio = bio_split(bio, num_sectors, GFP_NOIO,
- &crypto_bio_split);
- if (IS_ERR(split_bio)) {
- bio->bi_status = BLK_STS_RESOURCE;
- return false;
- }
- bio_chain(split_bio, bio);
- submit_bio_noacct(bio);
- *bio_ptr = split_bio;
- }
-
- return true;
-}
-
union blk_crypto_iv {
__le64 dun[BLK_CRYPTO_DUN_ARRAY_SIZE];
u8 bytes[BLK_CRYPTO_MAX_IV_SIZE];
@@ -275,10 +245,6 @@ static bool blk_crypto_fallback_encrypt_bio(struct bio **bio_ptr)
bool ret = false;
blk_status_t blk_st;
- /* Split the bio if it's too big for single page bvec */
- if (!blk_crypto_fallback_split_bio_if_needed(bio_ptr))
- return false;
-
src_bio = *bio_ptr;
bc = src_bio->bi_crypt_context;
data_unit_size = bc->bc_key->crypto_cfg.data_unit_size;
diff --git a/block/blk-crypto-profile.c b/block/blk-crypto-profile.c
index 94a155912bf1c..2804843310cc4 100644
--- a/block/blk-crypto-profile.c
+++ b/block/blk-crypto-profile.c
@@ -459,6 +459,16 @@ bool blk_crypto_register(struct blk_crypto_profile *profile,
pr_warn("Integrity and hardware inline encryption are not supported together. Disabling hardware inline encryption.\n");
return false;
}
+
+ if (queue_max_segments(q) > BIO_MAX_VECS) {
+ struct queue_limits lim;
+
+ lim = queue_limits_start_update(q);
+ lim.max_segments = BIO_MAX_VECS;
+ if (queue_limits_commit_update(q, &lim))
+ return false;
+ }
+
q->crypto_profile = profile;
return true;
}
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 3af1d284add50..d6d6f5b214c82 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -124,9 +124,10 @@ static struct bio *bio_submit_split(struct bio *bio, int split_sectors)
trace_block_split(split, bio->bi_iter.bi_sector);
WARN_ON_ONCE(bio_zone_write_plugging(bio));
submit_bio_noacct(bio);
- return split;
+ bio = split;
}
+ blk_crypto_bio_prep(&bio);
return bio;
error:
bio->bi_status = errno_to_blk_status(split_sectors);
--
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order
2025-06-10 23:18 ` Keith Busch
@ 2025-06-11 0:46 ` Damien Le Moal
2025-06-11 1:00 ` Keith Busch
2025-06-11 1:34 ` Keith Busch
2025-06-11 3:40 ` Christoph Hellwig
2 siblings, 1 reply; 44+ messages in thread
From: Damien Le Moal @ 2025-06-11 0:46 UTC (permalink / raw)
To: Keith Busch, Bart Van Assche
Cc: Christoph Hellwig, Jens Axboe, linux-block, Yu Kuai, Ming Lei,
Shin'ichiro Kawasaki, Eric Biggers
On 6/11/25 8:18 AM, Keith Busch wrote:
> On Tue, Jun 10, 2025 at 10:23:26AM -0700, Bart Van Assche wrote:
>> On 6/8/25 8:55 PM, Christoph Hellwig wrote:
>>> The problem here is that blk_crypto_fallback_split_bio_if_needed does
>>> sneaky splits behind the back of the main splitting code.
>>>
>>> The fix is to include the limit imposed by it in __bio_split_to_limits
>>> as well if the crypto fallback is used.
>>
>> (+Eric)
>>
>> Hmm ... my understanding is that when the inline encryption fallback
>> code is used, the bio must be split before
>> blk_crypto_fallback_encrypt_bio() encrypts the bio. Making
>> __bio_split_to_limits() take the inline encryption limit into account
>> would require to encrypt the data much later. How to perform encryption
>> later for bio-based drivers? Would moving the blk_crypto_bio_prep() call
>> from submit_bio() to just before __bio_split_to_limits() perhaps require
>> modifying all bio-based drivers that do not call
>> __bio_split_to_limits()?
>
> I think you could just prep the encryption at the point the bio is split
> to its queue's limits, and then all you need to do after that is ensure
> the limits don't exceed what the fallback requires (which appears to
> just be a simple segment limitation). It looks like most of the bio
> based drivers split to limits already.
Nope, at least not DM, and by that, I mean not in the sense of splitting BIOs
using bio_split_to_limits(). The only BIOs being split are "abnormal" ones,
such as discard.
That said, DM does split the BIOs through cloning and also has the "accept
partial" thing, which is used in many places and allows DM target drivers to
split BIOs as they are received, but not necessarily based on the device queue
limits (e.g. dm-crypt splits reads and writes using an internal
max_read|write_size limit).
So inline crypto on top of a DM/bio-based device may need some explicit
splitting added in dm.c (__max_io_len() maybe ?).
--
Damien Le Moal
Western Digital Research
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order
2025-06-11 0:46 ` Damien Le Moal
@ 2025-06-11 1:00 ` Keith Busch
2025-06-11 1:02 ` Damien Le Moal
0 siblings, 1 reply; 44+ messages in thread
From: Keith Busch @ 2025-06-11 1:00 UTC (permalink / raw)
To: Damien Le Moal
Cc: Bart Van Assche, Christoph Hellwig, Jens Axboe, linux-block,
Yu Kuai, Ming Lei, Shin'ichiro Kawasaki, Eric Biggers
On Wed, Jun 11, 2025 at 09:46:31AM +0900, Damien Le Moal wrote:
> On 6/11/25 8:18 AM, Keith Busch wrote:
> >
> > I think you could just prep the encryption at the point the bio is split
> > to its queue's limits, and then all you need to do after that is ensure
> > the limits don't exceed what the fallback requires (which appears to
> > just be a simple segment limitation). It looks like most of the bio
> > based drivers split to limits already.
>
> Nope, at least not DM, and by that, I mean not in the sense of splitting BIOs
> using bio_split_to_limits(). The only BIOs being split are "abnormal" ones,
> such as discard.
>
> That said, DM does split the BIOs through cloning and also has the "accept
> partial" thing, which is used in many places and allows DM target drivers to
> split BIOs as they are received, but not necessarily based on the device queue
> limits (e.g. dm-crypt splits reads and writes using an internal
> max_read|write_size limit).
>
> So inline crypto on top of a DM/bio-based device may need some explicit
> splitting added in dm.c (__max_io_len() maybe ?).
Ah, thanks for pointing that out! In that case, could we have dm
consider the bio "abnormal" when bio_has_crypt_ctx(bio) is true? That
way it does the split-to-limits for these, and dm appears to do that
action very early before its own split/clone actions on the resulting
bio.
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order
2025-06-11 1:00 ` Keith Busch
@ 2025-06-11 1:02 ` Damien Le Moal
2025-06-11 1:08 ` Keith Busch
0 siblings, 1 reply; 44+ messages in thread
From: Damien Le Moal @ 2025-06-11 1:02 UTC (permalink / raw)
To: Keith Busch
Cc: Bart Van Assche, Christoph Hellwig, Jens Axboe, linux-block,
Yu Kuai, Ming Lei, Shin'ichiro Kawasaki, Eric Biggers
On 6/11/25 10:00 AM, Keith Busch wrote:
> On Wed, Jun 11, 2025 at 09:46:31AM +0900, Damien Le Moal wrote:
>> On 6/11/25 8:18 AM, Keith Busch wrote:
>>>
>>> I think you could just prep the encryption at the point the bio is split
>>> to its queue's limits, and then all you need to do after that is ensure
>>> the limits don't exceed what the fallback requires (which appears to
>>> just be a simple segment limitation). It looks like most of the bio
>>> based drivers split to limits already.
>>
>> Nope, at least not DM, and by that, I mean not in the sense of splitting BIOs
>> using bio_split_to_limits(). The only BIOs being split are "abnormal" ones,
>> such as discard.
>>
>> That said, DM does split the BIOs through cloning and also has the "accept
>> partial" thing, which is used in many places and allows DM target drivers to
>> split BIOs as they are received, but not necessarily based on the device queue
>> limits (e.g. dm-crypt splits reads and writes using an internal
>> max_read|write_size limit).
>>
>> So inline crypto on top of a DM/bio-based device may need some explicit
>> splitting added in dm.c (__max_io_len() maybe ?).
>
> Ah, thanks for pointing that out! In that case, could we have dm
> consider the bio "abnormal" when bio_has_crypt_ctx(bio) is true? That
> way it does the split-to-limits for these, and dm appears to do that
> action very early before its own split/clone actions on the resulting
> bio.
That sounds reasonable to me.
--
Damien Le Moal
Western Digital Research
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order
2025-06-11 1:02 ` Damien Le Moal
@ 2025-06-11 1:08 ` Keith Busch
0 siblings, 0 replies; 44+ messages in thread
From: Keith Busch @ 2025-06-11 1:08 UTC (permalink / raw)
To: Damien Le Moal
Cc: Bart Van Assche, Christoph Hellwig, Jens Axboe, linux-block,
Yu Kuai, Ming Lei, Shin'ichiro Kawasaki, Eric Biggers
On Wed, Jun 11, 2025 at 10:02:34AM +0900, Damien Le Moal wrote:
> On 6/11/25 10:00 AM, Keith Busch wrote:
> > Ah, thanks for pointing that out! In that case, could we have dm
> > consider the bio "abnormal" when bio_has_crypt_ctx(bio) is true? That
> > way it does the split-to-limits for these, and dm appears to do that
> > action very early before its own split/clone actions on the resulting
> > bio.
>
> That sounds reasonable to me.
Cool, that's just this atop the previous patch:
---
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 5ab7574c0c76a..d8a43f37a9283 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1632,7 +1632,7 @@ static bool is_abnormal_io(struct bio *bio)
case REQ_OP_READ:
case REQ_OP_WRITE:
case REQ_OP_FLUSH:
- return false;
+ return bio_has_crypt_ctx(bio);
case REQ_OP_DISCARD:
case REQ_OP_SECURE_ERASE:
case REQ_OP_WRITE_ZEROES:
--
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order
2025-06-10 23:18 ` Keith Busch
2025-06-11 0:46 ` Damien Le Moal
@ 2025-06-11 1:34 ` Keith Busch
2025-06-11 3:40 ` Christoph Hellwig
2 siblings, 0 replies; 44+ messages in thread
From: Keith Busch @ 2025-06-11 1:34 UTC (permalink / raw)
To: Bart Van Assche
Cc: Christoph Hellwig, Damien Le Moal, Jens Axboe, linux-block,
Yu Kuai, Ming Lei, Shin'ichiro Kawasaki, Eric Biggers
On Tue, Jun 10, 2025 at 05:18:03PM -0600, Keith Busch wrote:
> @@ -124,9 +124,10 @@ static struct bio *bio_submit_split(struct bio *bio, int split_sectors)
> trace_block_split(split, bio->bi_iter.bi_sector);
> WARN_ON_ONCE(bio_zone_write_plugging(bio));
> submit_bio_noacct(bio);
> - return split;
> + bio = split;
> }
>
> + blk_crypto_bio_prep(&bio);
Eh, this part needs to handle the error condition:
+ if (!blk_crypto_bio_prep(&bio))
+ return NULL;
Happy to make a formal patch out of this if this approach is promising.
I think it aligns with the spirit of what Christoph was suggesting, at
least.
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order
2025-06-10 23:18 ` Keith Busch
2025-06-11 0:46 ` Damien Le Moal
2025-06-11 1:34 ` Keith Busch
@ 2025-06-11 3:40 ` Christoph Hellwig
2025-06-11 4:21 ` Eric Biggers
2 siblings, 1 reply; 44+ messages in thread
From: Christoph Hellwig @ 2025-06-11 3:40 UTC (permalink / raw)
To: Keith Busch
Cc: Bart Van Assche, Christoph Hellwig, Damien Le Moal, Jens Axboe,
linux-block, Yu Kuai, Ming Lei, Shin'ichiro Kawasaki,
Eric Biggers
On Tue, Jun 10, 2025 at 05:18:03PM -0600, Keith Busch wrote:
> I think you could just prep the encryption at the point the bio is split
> to its queue's limits, and then all you need to do after that is ensure
> the limits don't exceed what the fallback requires (which appears to
> just be a simple segment limitation). It looks like most of the bio
> based drivers split to limits already.
I'm still a bit confused about the interaction of block-crypto and
stacking drivers. For native support you obviously always want to pass
the crypt context down to the lowest level hardware driver, and dm
has code to support this. But if your stacking driver is not dm, or
the algorithm is not supported by the underlying hardware, it looks
like we might still be able to run the fallback for a stacking
driver. Or am I missing something here? Eric/Bart: any chance to
get blktests (or xfstests if that's easier to wire up) to exercise all
these corner cases?
> And if that's all okay, it simplifies quite a lot of things in this path
> because the crypto fallback doesn't need to concern itself with
> splitting anymore. Here's a quick shot at it, but compile-tested only:
Exactly. And in the next step we can significantly simplify and speed
up the bio submission recursion protection if __bio_split_to_limits
is the only place that can split/reinject bios for blk-mq drivers.
> diff --git a/block/blk-crypto-profile.c b/block/blk-crypto-profile.c
> index 94a155912bf1c..2804843310cc4 100644
> --- a/block/blk-crypto-profile.c
> +++ b/block/blk-crypto-profile.c
> @@ -459,6 +459,16 @@ bool blk_crypto_register(struct blk_crypto_profile *profile,
> pr_warn("Integrity and hardware inline encryption are not supported together. Disabling hardware inline encryption.\n");
> return false;
> }
> +
> + if (queue_max_segments(q) > BIO_MAX_VECS) {
> + struct queue_limits lim;
> +
> + lim = queue_limits_start_update(q);
> + lim.max_segments = BIO_MAX_VECS;
> + if (queue_limits_commit_update(q, &lim))
> + return false;
> + }
I think this limit actually only exists when using the fallback, i.e.
when no profile is registered. And even then only for encrypted bios.
We'll also need to ensure the queue is frozen and deal with limit
changes after the profile registration. I suspect we should probably
integrate the queue profile with the queue limits, so things are done
in one place, but that's out of scope for this patch.
It's also not just a segment limit, but also an overall size limit,
i.e. it must be no more than 256 * 4k.
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order
2025-06-11 3:40 ` Christoph Hellwig
@ 2025-06-11 4:21 ` Eric Biggers
2025-06-11 16:15 ` Bart Van Assche
0 siblings, 1 reply; 44+ messages in thread
From: Eric Biggers @ 2025-06-11 4:21 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Keith Busch, Bart Van Assche, Damien Le Moal, Jens Axboe,
linux-block, Yu Kuai, Ming Lei, Shin'ichiro Kawasaki
On Wed, Jun 11, 2025 at 05:40:31AM +0200, Christoph Hellwig wrote:
> On Tue, Jun 10, 2025 at 05:18:03PM -0600, Keith Busch wrote:
> > I think you could just prep the encryption at the point the bio is split
> > to its queue's limits, and then all you need to do after that is ensure
> > the limits don't exceed what the fallback requires (which appears to
> > just be a simple segment limitation). It looks like most of the bio
> > based drivers split to limits already.
>
> I'm still a bit confused about the interaction of block-crypto and
> stacking drivers. For native support you obviously always want to pass
> the crypt context down to the lowest level hardware driver, and dm
> has code to support this. But if your stacking driver is not dm, or
> the algorithm is not supported by the underlying hardware it looks
> like we might still be able to run the fallback for a stacking
> driver. Or am I missing something here? Eric/Bart: any chance to
> get blktests (or xfstests if that's easier to wire up) to exercise all
> these corner cases?
blk-crypto-fallback runs at the top layer, so yes it's different from native
inline encryption support where the encryption is done at the bottom. (But the
results are the same from the filesystem's perspective, since native support
only gets passed through and used when it would give the expected result.)
If it helps, blk-crypto-fallback can be exercised by running the "encrypt" group
of xfstests on ext4 or f2fs with "inlinecrypt" added to the mount options.
These include tests that the data on-disk is actually encrypted correctly.
But that probably doesn't help much here, where the issue is with bio splitting.
It's been a while since I looked at blk-crypto-fallback. But the reason it
does splitting is that it has to replace the data with encrypted data, and it
allocates encrypted bounce pages individually, each of which needs its own
bio_vec. Therefore, if the bio is longer than BIO_MAX_VECS pages, it has to be split.
If the splitting to stay within BIO_MAX_VECS pages could be done before
blk-crypto-fallback is given the bio instead, that should work fine too.
Just keep in mind that blk-crypto-fallback is meant to work on any block device
(even ones that don't have a crypto profile, as the profile is just for the
native support). So we do need to make sure it always gets invoked when needed.
- Eric
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order
2025-06-11 4:21 ` Eric Biggers
@ 2025-06-11 16:15 ` Bart Van Assche
2025-06-11 18:15 ` Eric Biggers
0 siblings, 1 reply; 44+ messages in thread
From: Bart Van Assche @ 2025-06-11 16:15 UTC (permalink / raw)
To: Eric Biggers, Christoph Hellwig
Cc: Keith Busch, Damien Le Moal, Jens Axboe, linux-block, Yu Kuai,
Ming Lei, Shin'ichiro Kawasaki
On 6/10/25 9:21 PM, Eric Biggers wrote:
> blk-crypto-fallback runs at the top layer, so yes it's different from native
> inline encryption support where the encryption is done at the bottom. (But the
> results are the same from the filesystem's perspective, since native support
> only gets passed through and used when it would give the expected result.)
Although I'm not sure Keith realizes this, his patch may move encryption
from the top of the block driver stack (a device mapper driver) to the
bottom (something other than a device mapper driver). This may happen
because device mapper drivers do not split bios unless this is
essential, e.g. because the LBA range of a bio spans more than one entry
in the mapping table.
Is my understanding correct that this is acceptable because the
encryption IV is based on the file offset provided by the filesystem and
not on the LBA where the data is written?
> Just keep in mind that blk-crypto-fallback is meant to work on any block device
> (even ones that don't have a crypto profile, as the profile is just for the
> native support). So we do need to make sure it always gets invoked when needed.
I propose that we require that bio-based drivers must call
bio_split_to_limits() to support inline encryption. Otherwise the
approach of Keith's patch can't work. Does this seem reasonable to you?
As far as I can tell upstream bio-based drivers already call
bio_split_to_limits().
Thanks,
Bart.
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order
2025-06-11 16:15 ` Bart Van Assche
@ 2025-06-11 18:15 ` Eric Biggers
2025-06-11 19:43 ` Bart Van Assche
0 siblings, 1 reply; 44+ messages in thread
From: Eric Biggers @ 2025-06-11 18:15 UTC (permalink / raw)
To: Bart Van Assche
Cc: Christoph Hellwig, Keith Busch, Damien Le Moal, Jens Axboe,
linux-block, Yu Kuai, Ming Lei, Shin'ichiro Kawasaki
On Wed, Jun 11, 2025 at 09:15:05AM -0700, Bart Van Assche wrote:
> On 6/10/25 9:21 PM, Eric Biggers wrote:
> > blk-crypto-fallback runs at the top layer, so yes it's different from native
> > inline encryption support where the encryption is done at the bottom. (But the
> > results are the same from the filesystem's perspective, since native support
> > only gets passed through and used when it would give the expected result.)
>
> Although I'm not sure Keith realizes this, his patch may move encryption
> from the top of the block driver stack (a device mapper driver) to the
> bottom (something other than a device mapper driver). This may happen
> because device mapper drivers do not split bios unless this is
> essential, e.g. because the LBA range of a bio spans more than one entry
> in the mapping table.
>
> Is my understanding correct that this is acceptable because the
> encryption IV is based on the file offset provided by the filesystem and
> not on the LBA where the data is written?
The IV is provided in the bio_crypt_context. The encryption can be done at a
lower layer, like how hardware inline encryption works, if the layers above are
compatible with it. (E.g., they must not change the data.)
Ultimately, the data that's actually written to disk needs to be identical to
the data that would have been written if the user encrypted the data themselves
and submitted a pre-encrypted bio.
>
> > Just keep in mind that blk-crypto-fallback is meant to work on any block device
> > (even ones that don't have a crypto profile, as the profile is just for the
> > native support). So we do need to make sure it always gets invoked when needed.
>
> I propose that we require that bio-based drivers must call
> bio_split_to_limits() to support inline encryption. Otherwise the
> approach of Keith's patch can't work. Does this seem reasonable to you?
>
> As far as I can tell upstream bio-based drivers already call
> bio_split_to_limits().
Well, again it needs to work on any block device. If the encryption might just
not be done and plaintext ends up on-disk, then blk-crypto-fallback would be
unsafe to use.
It would be preferable to have blk-crypto-fallback continue to be handled in the
block layer so that drivers don't need to worry about it.
- Eric
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order
2025-06-11 18:15 ` Eric Biggers
@ 2025-06-11 19:43 ` Bart Van Assche
2025-06-18 22:27 ` Bart Van Assche
0 siblings, 1 reply; 44+ messages in thread
From: Bart Van Assche @ 2025-06-11 19:43 UTC (permalink / raw)
To: Eric Biggers
Cc: Christoph Hellwig, Keith Busch, Damien Le Moal, Jens Axboe,
linux-block, Yu Kuai, Ming Lei, Shin'ichiro Kawasaki
On 6/11/25 11:15 AM, Eric Biggers wrote:
> Well, again it needs to work on any block device. If the encryption might just
> not be done and plaintext ends up on-disk, then blk-crypto-fallback would be
> unsafe to use.
>
> It would be preferable to have blk-crypto-fallback continue to be handled in the
> block layer so that drivers don't need to worry about it.
This concern could be addressed by introducing a new flag in struct
block_device_operations or struct queue_limits - a flag that indicates
that bio_split_to_limits() will be called by the block driver. If that
flag is set, blk_crypto_bio_prep() can be called from
bio_submit_split(). If that flag is not set, blk_crypto_bio_prep()
should be called from __submit_bio(). The latter behavior is the
current behavior for the upstream kernel.
Thanks,
Bart.
* Re: [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order
2025-06-11 19:43 ` Bart Van Assche
@ 2025-06-18 22:27 ` Bart Van Assche
0 siblings, 0 replies; 44+ messages in thread
From: Bart Van Assche @ 2025-06-18 22:27 UTC (permalink / raw)
To: Eric Biggers
Cc: Christoph Hellwig, Keith Busch, Damien Le Moal, Jens Axboe,
linux-block, Yu Kuai, Ming Lei, Shin'ichiro Kawasaki
On 6/11/25 12:43 PM, Bart Van Assche wrote:
> On 6/11/25 11:15 AM, Eric Biggers wrote:
>> Well, again it needs to work on any block device. If the encryption
>> might just
>> not be done and plaintext ends up on-disk, then blk-crypto-fallback
>> would be
>> unsafe to use.
>>
>> It would be preferable to have blk-crypto-fallback continue to be
>> handled in the
>> block layer so that drivers don't need to worry about it.
>
> This concern could be addressed by introducing a new flag in struct
> block_device_operations or struct queue_limits - a flag that indicates
> that bio_split_to_limits() will be called by the block driver. If that
> flag is set, blk_crypto_bio_prep() can be called from
> bio_submit_split(). If that flag is not set, blk_crypto_bio_prep()
> should be called from __submit_bio(). The latter behavior is the
> current behavior for the upstream kernel.
(replying to my own email)
The patch below seems to work but needs further review and testing:
block: Rework splitting of encrypted bios
Modify splitting of encrypted bios as follows:
- Introduce a new block device flag BD_SPLITS_BIO_TO_LIMITS that allows
  bio-based drivers to report that they call bio_split_to_limits() for
  every bio.
- For request-based block drivers and for bio-based block drivers that
  call bio_split_to_limits(), call blk_crypto_bio_prep() after bio
  splitting instead of before it.
- For bio-based block drivers of which it is not known whether they call
  bio_split_to_limits(), call blk_crypto_bio_prep() before .submit_bio()
  is called.
- In blk_crypto_fallback_encrypt_bio(), prevent infinite recursion by
  only trying to split a bio if this function is not called from inside
  bio_split_to_limits().
- Since blk_crypto_fallback_encrypt_bio() may clear *bio_ptr, check in
  its caller (__blk_crypto_bio_prep()) whether or not this pointer is
  NULL.
- In bio_split_rw(), restrict the bio size to the smaller of what is
  supported by the block driver and by the crypto fallback code.
The advantages of these changes are as follows:
- This patch fixes write errors on zoned storage caused by out-of-order
  submission of bios. This out-of-order submission happens if both the
  crypto fallback code and bio_split_to_limits() split a bio.
- Less code duplication: the crypto fallback code now calls
  bio_split_to_limits() instead of open-coding it.
diff --git a/block/bio.c b/block/bio.c
index 3c0a558c90f5..d597cef6d228 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1682,6 +1682,11 @@ struct bio *bio_split(struct bio *bio, int sectors,
if (!split)
return ERR_PTR(-ENOMEM);
+ /*
+ * Tell blk_crypto_fallback_encrypt_bio() not to split this bio further.
+ */
+ bio_set_flag(split, BIO_BEING_SPLIT);
+
split->bi_iter.bi_size = sectors << 9;
if (bio_integrity(split))
diff --git a/block/blk-core.c b/block/blk-core.c
index fdac48aec5ef..78b555ceea77 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -626,16 +626,22 @@ static void __submit_bio(struct bio *bio)
/* If plug is not used, add new plug here to cache nsecs time. */
struct blk_plug plug;
- if (unlikely(!blk_crypto_bio_prep(&bio)))
- return;
-
blk_start_plug(&plug);
if (!bdev_test_flag(bio->bi_bdev, BD_HAS_SUBMIT_BIO)) {
blk_mq_submit_bio(bio);
} else if (likely(bio_queue_enter(bio) == 0)) {
struct gendisk *disk = bio->bi_bdev->bd_disk;
-
+
+ /*
+ * Only call blk_crypto_bio_prep() before .submit_bio() if
+ * the block driver won't call bio_split_to_limits().
+ */
+ if (unlikely(!bdev_test_flag(bio->bi_bdev,
+ BD_SPLITS_BIO_TO_LIMITS) &&
+ !blk_crypto_bio_prep(&bio)))
+ goto exit_queue;
+
if ((bio->bi_opf & REQ_POLLED) &&
!(disk->queue->limits.features & BLK_FEAT_POLL)) {
bio->bi_status = BLK_STS_NOTSUPP;
@@ -643,6 +649,8 @@ static void __submit_bio(struct bio *bio)
} else {
disk->fops->submit_bio(bio);
}
+
+exit_queue:
blk_queue_exit(disk->queue);
}
diff --git a/block/blk-crypto-fallback.c b/block/blk-crypto-fallback.c
index 005c9157ffb3..f2012368ac9e 100644
--- a/block/blk-crypto-fallback.c
+++ b/block/blk-crypto-fallback.c
@@ -209,9 +209,12 @@ blk_crypto_fallback_alloc_cipher_req(struct blk_crypto_keyslot *slot,
return true;
}
-static bool blk_crypto_fallback_split_bio_if_needed(struct bio **bio_ptr)
+/*
+ * The encryption fallback code allocates bounce pages individually. Hence this
+ * function that calculates an upper limit for the bio size.
+ */
+unsigned int blk_crypto_max_io_size(struct bio *bio)
{
- struct bio *bio = *bio_ptr;
unsigned int i = 0;
unsigned int num_sectors = 0;
struct bio_vec bv;
@@ -222,21 +225,8 @@ static bool blk_crypto_fallback_split_bio_if_needed(struct bio **bio_ptr)
if (++i == BIO_MAX_VECS)
break;
}
- if (num_sectors < bio_sectors(bio)) {
- struct bio *split_bio;
-
- split_bio = bio_split(bio, num_sectors, GFP_NOIO,
- &crypto_bio_split);
- if (IS_ERR(split_bio)) {
- bio->bi_status = BLK_STS_RESOURCE;
- return false;
- }
- bio_chain(split_bio, bio);
- submit_bio_noacct(bio);
- *bio_ptr = split_bio;
- }
- return true;
+ return num_sectors;
}
union blk_crypto_iv {
@@ -257,8 +247,10 @@ static void blk_crypto_dun_to_iv(const u64 dun[BLK_CRYPTO_DUN_ARRAY_SIZE],
* The crypto API fallback's encryption routine.
 * Allocate a bounce bio for encryption, encrypt the input bio using crypto API,
 * and replace *bio_ptr with the bounce bio. May split input bio if it's too
- * large. Returns true on success. Returns false and sets bio->bi_status on
- * error.
+ * large. Returns %true on success. On error, %false is returned and one of
+ * these two actions is taken:
+ * - Either @bio_ptr->bi_status is set and *@bio_ptr is not modified.
+ * - Or bio_endio() is called and *@bio_ptr is changed into %NULL.
*/
static bool blk_crypto_fallback_encrypt_bio(struct bio **bio_ptr)
{
@@ -275,11 +267,17 @@ static bool blk_crypto_fallback_encrypt_bio(struct bio **bio_ptr)
bool ret = false;
blk_status_t blk_st;
- /* Split the bio if it's too big for single page bvec */
- if (!blk_crypto_fallback_split_bio_if_needed(bio_ptr))
- return false;
+ if (!bio_flagged(*bio_ptr, BIO_BEING_SPLIT)) {
+ /* Split the bio if it's too big for single page bvec */
+ src_bio = bio_split_to_limits(*bio_ptr);
+ if (!src_bio) {
+ *bio_ptr = NULL;
+ return false;
+ }
+ } else {
+ src_bio = *bio_ptr;
+ }
- src_bio = *bio_ptr;
bc = src_bio->bi_crypt_context;
data_unit_size = bc->bc_key->crypto_cfg.data_unit_size;
@@ -475,9 +473,8 @@ static void blk_crypto_fallback_decrypt_endio(struct bio *bio)
* @bio_ptr: pointer to the bio to prepare
*
 * If bio is doing a WRITE operation, this splits the bio into two parts if it's
- * too big (see blk_crypto_fallback_split_bio_if_needed()). It then allocates a
- * bounce bio for the first part, encrypts it, and updates bio_ptr to point to
- * the bounce bio.
+ * too big. It then allocates a bounce bio for the first part, encrypts it, and
+ * updates bio_ptr to point to the bounce bio.
*
 * For a READ operation, we mark the bio for decryption by using bi_private and
 * bi_end_io.
@@ -495,6 +492,12 @@ bool blk_crypto_fallback_bio_prep(struct bio **bio_ptr)
struct bio_crypt_ctx *bc = bio->bi_crypt_context;
struct bio_fallback_crypt_ctx *f_ctx;
+ /*
+ * Check whether blk_crypto_fallback_bio_prep() has already been called.
+ */
+ if (bio->bi_end_io == blk_crypto_fallback_encrypt_endio)
+ return true;
+
if (WARN_ON_ONCE(!tfms_inited[bc->bc_key->crypto_cfg.crypto_mode])) {
/* User didn't call blk_crypto_start_using_key() first */
bio->bi_status = BLK_STS_IOERR;
diff --git a/block/blk-crypto-internal.h b/block/blk-crypto-internal.h
index ccf6dff6ff6b..443ba1fd82e6 100644
--- a/block/blk-crypto-internal.h
+++ b/block/blk-crypto-internal.h
@@ -223,6 +223,8 @@ bool blk_crypto_fallback_bio_prep(struct bio **bio_ptr);
int blk_crypto_fallback_evict_key(const struct blk_crypto_key *key);
+unsigned int blk_crypto_max_io_size(struct bio *bio);
+
#else /* CONFIG_BLK_INLINE_ENCRYPTION_FALLBACK */
static inline int
@@ -245,6 +247,11 @@ blk_crypto_fallback_evict_key(const struct blk_crypto_key *key)
return 0;
}
+static inline unsigned int blk_crypto_max_io_size(struct bio *bio)
+{
+ return UINT_MAX;
+}
+
#endif /* CONFIG_BLK_INLINE_ENCRYPTION_FALLBACK */
#endif /* __LINUX_BLK_CRYPTO_INTERNAL_H */
diff --git a/block/blk-crypto.c b/block/blk-crypto.c
index 4b1ad84d1b5a..76278e23193d 100644
--- a/block/blk-crypto.c
+++ b/block/blk-crypto.c
@@ -306,7 +306,8 @@ bool __blk_crypto_bio_prep(struct bio **bio_ptr)
if (blk_crypto_fallback_bio_prep(bio_ptr))
return true;
fail:
- bio_endio(*bio_ptr);
+ if (*bio_ptr)
+ bio_endio(*bio_ptr);
return false;
}
diff --git a/block/blk-merge.c b/block/blk-merge.c
index e55a8ec219c9..df65231be543 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -9,6 +9,7 @@
#include <linux/blk-integrity.h>
#include <linux/part_stat.h>
#include <linux/blk-cgroup.h>
+#include <linux/blk-crypto.h>
#include <trace/events/block.h>
@@ -125,9 +126,14 @@ static struct bio *bio_submit_split(struct bio *bio, int split_sectors)
bio->bi_iter.bi_sector);
WARN_ON_ONCE(bio_zone_write_plugging(bio));
submit_bio_noacct(bio);
- return split;
+
+ bio = split;
}
+ WARN_ON_ONCE(!bio);
+ if (unlikely(!blk_crypto_bio_prep(&bio)))
+ return NULL;
+
return bio;
error:
bio->bi_status = errno_to_blk_status(split_sectors);
@@ -356,9 +362,12 @@ EXPORT_SYMBOL_GPL(bio_split_rw_at);
struct bio *bio_split_rw(struct bio *bio, const struct queue_limits *lim,
unsigned *nr_segs)
{
+ u32 max_sectors =
+ min(get_max_io_size(bio, lim), blk_crypto_max_io_size(bio));
+
return bio_submit_split(bio,
bio_split_rw_at(bio, lim, nr_segs,
- get_max_io_size(bio, lim) << SECTOR_SHIFT));
+ (u64)max_sectors << SECTOR_SHIFT));
}
/*
diff --git a/block/blk.h b/block/blk.h
index 37ec459fe656..8db804f32896 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -367,6 +367,11 @@ static inline bool bio_may_need_split(struct bio *bio,
lim->min_segment_size;
}
+static void bio_clear_split_flag(struct bio **bio)
+{
+ bio_clear_flag(*bio, BIO_BEING_SPLIT);
+}
+
/**
* __bio_split_to_limits - split a bio to fit the queue limits
* @bio: bio to be split
@@ -383,6 +388,10 @@ static inline bool bio_may_need_split(struct bio *bio,
static inline struct bio *__bio_split_to_limits(struct bio *bio,
const struct queue_limits *lim, unsigned int *nr_segs)
{
+ struct bio *clear_split_flag __cleanup(bio_clear_split_flag) = bio;
+
+ bio_set_flag(bio, BIO_BEING_SPLIT);
+
switch (bio_op(bio)) {
case REQ_OP_READ:
case REQ_OP_WRITE:
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 5facb06e5924..c79195de2669 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2316,6 +2316,8 @@ static struct mapped_device *alloc_dev(int minor)
md->disk->private_data = md;
sprintf(md->disk->disk_name, "dm-%d", minor);
+ bdev_set_flag(md->disk->part0, BD_SPLITS_BIO_TO_LIMITS);
+
dax_dev = alloc_dax(md, &dm_dax_ops);
if (IS_ERR(dax_dev)) {
if (PTR_ERR(dax_dev) != -EOPNOTSUPP)
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 3d1577f07c1c..25b789830b75 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -54,6 +54,8 @@ struct block_device {
#ifdef CONFIG_FAIL_MAKE_REQUEST
#define BD_MAKE_IT_FAIL (1u<<12)
#endif
+ /* Set this flag if the driver calls bio_split_to_limits(). */
+#define BD_SPLITS_BIO_TO_LIMITS (1u<<13)
dev_t bd_dev;
struct address_space *bd_mapping; /* page cache */
@@ -308,6 +310,7 @@ enum {
BIO_REMAPPED,
BIO_ZONE_WRITE_PLUGGING, /* bio handled through zone write plugging */
BIO_EMULATES_ZONE_APPEND, /* bio emulates a zone append operation */
+ BIO_BEING_SPLIT,
BIO_FLAG_LAST
};
end of thread, other threads:[~2025-06-18 22:28 UTC | newest]
Thread overview: 44+ messages
2025-05-14 20:29 [PATCH 0/2] Two bug fixes for zoned block devices Bart Van Assche
2025-05-14 20:29 ` [PATCH 1/2] block: Make __submit_bio_noacct() preserve the bio submission order Bart Van Assche
2025-05-15 7:19 ` Niklas Cassel
2025-05-15 15:58 ` Bart Van Assche
2025-05-16 4:47 ` Christoph Hellwig
2025-05-19 22:12 ` Bart Van Assche
2025-05-20 13:56 ` Christoph Hellwig
2025-05-20 18:09 ` Bart Van Assche
2025-05-21 5:53 ` Christoph Hellwig
2025-05-21 21:18 ` Bart Van Assche
2025-05-22 5:12 ` Damien Le Moal
2025-05-22 17:08 ` Bart Van Assche
2025-05-23 6:02 ` Damien Le Moal
2025-05-23 16:30 ` Bart Van Assche
2025-05-24 8:48 ` Damien Le Moal
2025-05-24 14:05 ` Bart Van Assche
2025-05-24 15:36 ` Damien Le Moal
2025-05-26 5:24 ` Christoph Hellwig
2025-05-27 16:19 ` Bart Van Assche
2025-05-31 0:25 ` Bart Van Assche
2025-06-08 22:07 ` Bart Van Assche
2025-06-08 22:47 ` Damien Le Moal
2025-06-09 3:58 ` Christoph Hellwig
2025-06-09 20:48 ` Bart Van Assche
2025-06-10 5:04 ` Christoph Hellwig
2025-06-09 3:55 ` Christoph Hellwig
2025-06-10 17:23 ` Bart Van Assche
2025-06-10 23:18 ` Keith Busch
2025-06-11 0:46 ` Damien Le Moal
2025-06-11 1:00 ` Keith Busch
2025-06-11 1:02 ` Damien Le Moal
2025-06-11 1:08 ` Keith Busch
2025-06-11 1:34 ` Keith Busch
2025-06-11 3:40 ` Christoph Hellwig
2025-06-11 4:21 ` Eric Biggers
2025-06-11 16:15 ` Bart Van Assche
2025-06-11 18:15 ` Eric Biggers
2025-06-11 19:43 ` Bart Van Assche
2025-06-18 22:27 ` Bart Van Assche
2025-05-23 4:21 ` Christoph Hellwig
2025-05-14 20:29 ` [PATCH 2/2] block: Fix a deadlock related freezing zoned storage devices Bart Van Assche
2025-05-16 4:51 ` Christoph Hellwig
2025-05-19 22:22 ` Bart Van Assche
2025-05-20 13:57 ` Christoph Hellwig