* [PATCH v3 1/5] block: Make REQ_OP_ZONE_FINISH a write operation
2025-06-25 9:33 [PATCH v3 0/5] Fix write operation handling for zoned DM devices Damien Le Moal
@ 2025-06-25 9:33 ` Damien Le Moal
2025-06-25 11:28 ` Christoph Hellwig
` (3 more replies)
2025-06-25 9:33 ` [PATCH v3 2/5] block: Introduce bio_needs_zone_write_plugging() Damien Le Moal
` (4 subsequent siblings)
5 siblings, 4 replies; 25+ messages in thread
From: Damien Le Moal @ 2025-06-25 9:33 UTC (permalink / raw)
To: linux-block, Jens Axboe, dm-devel, Mike Snitzer, Mikulas Patocka
Cc: Bart Van Assche
REQ_OP_ZONE_FINISH is defined as "12", which makes
op_is_write(REQ_OP_ZONE_FINISH) return false, despite the fact that a
zone finish operation modifies a zone (transitioning it to the full
condition) and so should be considered a write operation (albeit one
that does not transfer any data to the device).
Fix this by redefining REQ_OP_ZONE_FINISH to be an odd number (13), and
redefining REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL using the next
sequential odd numbers (15 and 17).
Fixes: 6c1b1da58f8c ("block: add zone open, close and finish operations")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
include/linux/blk_types.h | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
index 3d1577f07c1c..930daff207df 100644
--- a/include/linux/blk_types.h
+++ b/include/linux/blk_types.h
@@ -350,11 +350,11 @@ enum req_op {
/* Close a zone */
REQ_OP_ZONE_CLOSE = (__force blk_opf_t)11,
/* Transition a zone to full */
- REQ_OP_ZONE_FINISH = (__force blk_opf_t)12,
+ REQ_OP_ZONE_FINISH = (__force blk_opf_t)13,
/* reset a zone write pointer */
- REQ_OP_ZONE_RESET = (__force blk_opf_t)13,
+ REQ_OP_ZONE_RESET = (__force blk_opf_t)15,
/* reset all the zone present on the device */
- REQ_OP_ZONE_RESET_ALL = (__force blk_opf_t)15,
+ REQ_OP_ZONE_RESET_ALL = (__force blk_opf_t)17,
/* Driver private requests */
REQ_OP_DRV_IN = (__force blk_opf_t)34,
--
2.49.0
^ permalink raw reply related [flat|nested] 25+ messages in thread
* Re: [PATCH v3 1/5] block: Make REQ_OP_ZONE_FINISH a write operation
2025-06-25 9:33 ` [PATCH v3 1/5] block: Make REQ_OP_ZONE_FINISH a write operation Damien Le Moal
@ 2025-06-25 11:28 ` Christoph Hellwig
2025-06-25 11:42 ` Johannes Thumshirn
` (2 subsequent siblings)
3 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2025-06-25 11:28 UTC (permalink / raw)
To: Damien Le Moal
Cc: linux-block, Jens Axboe, dm-devel, Mike Snitzer, Mikulas Patocka,
Bart Van Assche
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCH v3 1/5] block: Make REQ_OP_ZONE_FINISH a write operation
2025-06-25 9:33 ` [PATCH v3 1/5] block: Make REQ_OP_ZONE_FINISH a write operation Damien Le Moal
2025-06-25 11:28 ` Christoph Hellwig
@ 2025-06-25 11:42 ` Johannes Thumshirn
2025-06-25 15:44 ` Bart Van Assche
2025-06-25 16:29 ` Bart Van Assche
3 siblings, 0 replies; 25+ messages in thread
From: Johannes Thumshirn @ 2025-06-25 11:42 UTC (permalink / raw)
To: Damien Le Moal, linux-block@vger.kernel.org, Jens Axboe,
dm-devel@lists.linux.dev, Mike Snitzer, Mikulas Patocka
Cc: Bart Van Assche
Looks good,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
* Re: [PATCH v3 1/5] block: Make REQ_OP_ZONE_FINISH a write operation
2025-06-25 9:33 ` [PATCH v3 1/5] block: Make REQ_OP_ZONE_FINISH a write operation Damien Le Moal
2025-06-25 11:28 ` Christoph Hellwig
2025-06-25 11:42 ` Johannes Thumshirn
@ 2025-06-25 15:44 ` Bart Van Assche
2025-06-25 16:29 ` Bart Van Assche
3 siblings, 0 replies; 25+ messages in thread
From: Bart Van Assche @ 2025-06-25 15:44 UTC (permalink / raw)
To: Damien Le Moal, linux-block, Jens Axboe, dm-devel, Mike Snitzer,
Mikulas Patocka
On 6/25/25 2:33 AM, Damien Le Moal wrote:
> REQ_OP_ZONE_FINISH is defined as "12", which makes
> op_is_write(REQ_OP_ZONE_FINISH) return false, despite the fact that a
> zone finish operation modifies a zone (transitioning it to the full
> condition) and so should be considered a write operation (albeit one
> that does not transfer any data to the device).
>
> Fix this by redefining REQ_OP_ZONE_FINISH to be an odd number (13), and
> redefining REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL using the next
> sequential odd numbers (15 and 17).
>
> Fixes: 6c1b1da58f8c ("block: add zone open, close and finish operations")
> Cc: stable@vger.kernel.org
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
> include/linux/blk_types.h | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index 3d1577f07c1c..930daff207df 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -350,11 +350,11 @@ enum req_op {
> /* Close a zone */
> REQ_OP_ZONE_CLOSE = (__force blk_opf_t)11,
> /* Transition a zone to full */
> - REQ_OP_ZONE_FINISH = (__force blk_opf_t)12,
> + REQ_OP_ZONE_FINISH = (__force blk_opf_t)13,
> /* reset a zone write pointer */
> - REQ_OP_ZONE_RESET = (__force blk_opf_t)13,
> + REQ_OP_ZONE_RESET = (__force blk_opf_t)15,
> /* reset all the zone present on the device */
> - REQ_OP_ZONE_RESET_ALL = (__force blk_opf_t)15,
> + REQ_OP_ZONE_RESET_ALL = (__force blk_opf_t)17,
>
> /* Driver private requests */
> REQ_OP_DRV_IN = (__force blk_opf_t)34,
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
* Re: [PATCH v3 1/5] block: Make REQ_OP_ZONE_FINISH a write operation
2025-06-25 9:33 ` [PATCH v3 1/5] block: Make REQ_OP_ZONE_FINISH a write operation Damien Le Moal
` (2 preceding siblings ...)
2025-06-25 15:44 ` Bart Van Assche
@ 2025-06-25 16:29 ` Bart Van Assche
2025-06-25 23:36 ` Damien Le Moal
3 siblings, 1 reply; 25+ messages in thread
From: Bart Van Assche @ 2025-06-25 16:29 UTC (permalink / raw)
To: Damien Le Moal, linux-block, Jens Axboe, dm-devel, Mike Snitzer,
Mikulas Patocka
On 6/25/25 2:33 AM, Damien Le Moal wrote:
> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
> index 3d1577f07c1c..930daff207df 100644
> --- a/include/linux/blk_types.h
> +++ b/include/linux/blk_types.h
> @@ -350,11 +350,11 @@ enum req_op {
> /* Close a zone */
> REQ_OP_ZONE_CLOSE = (__force blk_opf_t)11,
> /* Transition a zone to full */
> - REQ_OP_ZONE_FINISH = (__force blk_opf_t)12,
> + REQ_OP_ZONE_FINISH = (__force blk_opf_t)13,
> /* reset a zone write pointer */
> - REQ_OP_ZONE_RESET = (__force blk_opf_t)13,
> + REQ_OP_ZONE_RESET = (__force blk_opf_t)15,
> /* reset all the zone present on the device */
> - REQ_OP_ZONE_RESET_ALL = (__force blk_opf_t)15,
> + REQ_OP_ZONE_RESET_ALL = (__force blk_opf_t)17,
>
> /* Driver private requests */
> REQ_OP_DRV_IN = (__force blk_opf_t)34,
Since we are renumbering operation types, how about also renumbering
REQ_OP_ZONE_OPEN and/or REQ_OP_ZONE_CLOSE? Neither operation modifies
data on the storage medium nor any write pointers, so these operations
shouldn't be considered write operations, should they?
Thanks,
Bart.
* Re: [PATCH v3 1/5] block: Make REQ_OP_ZONE_FINISH a write operation
2025-06-25 16:29 ` Bart Van Assche
@ 2025-06-25 23:36 ` Damien Le Moal
2025-06-26 20:29 ` Bart Van Assche
0 siblings, 1 reply; 25+ messages in thread
From: Damien Le Moal @ 2025-06-25 23:36 UTC (permalink / raw)
To: Bart Van Assche, linux-block, Jens Axboe, dm-devel, Mike Snitzer,
Mikulas Patocka
On 6/26/25 01:29, Bart Van Assche wrote:
> On 6/25/25 2:33 AM, Damien Le Moal wrote:
>> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
>> index 3d1577f07c1c..930daff207df 100644
>> --- a/include/linux/blk_types.h
>> +++ b/include/linux/blk_types.h
>> @@ -350,11 +350,11 @@ enum req_op {
>> /* Close a zone */
>> REQ_OP_ZONE_CLOSE = (__force blk_opf_t)11,
>> /* Transition a zone to full */
>> - REQ_OP_ZONE_FINISH = (__force blk_opf_t)12,
>> + REQ_OP_ZONE_FINISH = (__force blk_opf_t)13,
>> /* reset a zone write pointer */
>> - REQ_OP_ZONE_RESET = (__force blk_opf_t)13,
>> + REQ_OP_ZONE_RESET = (__force blk_opf_t)15,
>> /* reset all the zone present on the device */
>> - REQ_OP_ZONE_RESET_ALL = (__force blk_opf_t)15,
>> + REQ_OP_ZONE_RESET_ALL = (__force blk_opf_t)17,
>>
>> /* Driver private requests */
>> REQ_OP_DRV_IN = (__force blk_opf_t)34,
>
> Since we are renumbering operation types, how about also renumbering
> REQ_OP_ZONE_OPEN and/or REQ_OP_ZONE_CLOSE? Neither operation modifies
> data on the storage medium nor any write pointers, so these operations
> shouldn't be considered write operations, should they?
Open and close change the zone condition and act on the drive's count of
explicitly open zone resources, which impacts the ability to write to
zones. So I would rather consider these write operations as well, given
the changes they imply.
--
Damien Le Moal
Western Digital Research
* Re: [PATCH v3 1/5] block: Make REQ_OP_ZONE_FINISH a write operation
2025-06-25 23:36 ` Damien Le Moal
@ 2025-06-26 20:29 ` Bart Van Assche
0 siblings, 0 replies; 25+ messages in thread
From: Bart Van Assche @ 2025-06-26 20:29 UTC (permalink / raw)
To: Damien Le Moal, linux-block, Jens Axboe, dm-devel, Mike Snitzer,
Mikulas Patocka
On 6/25/25 4:36 PM, Damien Le Moal wrote:
> On 6/26/25 01:29, Bart Van Assche wrote:
>> On 6/25/25 2:33 AM, Damien Le Moal wrote:
>>> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
>>> index 3d1577f07c1c..930daff207df 100644
>>> --- a/include/linux/blk_types.h
>>> +++ b/include/linux/blk_types.h
>>> @@ -350,11 +350,11 @@ enum req_op {
>>> /* Close a zone */
>>> REQ_OP_ZONE_CLOSE = (__force blk_opf_t)11,
>>> /* Transition a zone to full */
>>> - REQ_OP_ZONE_FINISH = (__force blk_opf_t)12,
>>> + REQ_OP_ZONE_FINISH = (__force blk_opf_t)13,
>>> /* reset a zone write pointer */
>>> - REQ_OP_ZONE_RESET = (__force blk_opf_t)13,
>>> + REQ_OP_ZONE_RESET = (__force blk_opf_t)15,
>>> /* reset all the zone present on the device */
>>> - REQ_OP_ZONE_RESET_ALL = (__force blk_opf_t)15,
>>> + REQ_OP_ZONE_RESET_ALL = (__force blk_opf_t)17,
>>>
>>> /* Driver private requests */
>>> REQ_OP_DRV_IN = (__force blk_opf_t)34,
>>
>> Since we are renumbering operation types, how about also renumbering
>> REQ_OP_ZONE_OPEN and/or REQ_OP_ZONE_CLOSE? Neither operation modifies
>> data on the storage medium nor any write pointers, so these operations
>> shouldn't be considered write operations, should they?
>
> Open and close change the zone condition and act on the drive's count of
> explicitly open zone resources, which impacts the ability to write to
> zones. So I would rather consider these write operations as well, given
> the changes they imply.
It's probably a good idea to document above the op_is_write() function
that this function is used for multiple purposes:
- Whether it is allowed to submit an operation to a read-only device.
bio_check_ro() and the loop driver use op_is_write() for this
purpose.
- Whether or not lim->zone_write_granularity applies. See also
bio_split_rw_at().
- Whether or not to throttle the queue depth. See also various I/O
schedulers.
- To determine the DMA data transfer direction.
- By the block layer I/O statistics code.
- By the blkio cgroup controller statistics code.
Thanks,
Bart.
* [PATCH v3 2/5] block: Introduce bio_needs_zone_write_plugging()
2025-06-25 9:33 [PATCH v3 0/5] Fix write operation handling for zoned DM devices Damien Le Moal
2025-06-25 9:33 ` [PATCH v3 1/5] block: Make REQ_OP_ZONE_FINISH a write operation Damien Le Moal
@ 2025-06-25 9:33 ` Damien Le Moal
2025-06-25 11:28 ` Christoph Hellwig
` (2 more replies)
2025-06-25 9:33 ` [PATCH v3 3/5] dm: Always split write BIOs to zoned device limits Damien Le Moal
` (3 subsequent siblings)
5 siblings, 3 replies; 25+ messages in thread
From: Damien Le Moal @ 2025-06-25 9:33 UTC (permalink / raw)
To: linux-block, Jens Axboe, dm-devel, Mike Snitzer, Mikulas Patocka
Cc: Bart Van Assche
In preparation for fixing device mapper zone write handling, introduce
the inline helper function bio_needs_zone_write_plugging() to test if a
BIO requires handling through zone write plugging using the function
blk_zone_plug_bio(). This function returns true for any write
(op_is_write(bio_op(bio)) == true) operation directed at a zoned block
device using zone write plugging, that is, a block device with a disk
that has a zone write plug hash table.
This helper simplifies the check on entry to blk_zone_plug_bio() and is
used to protect calls to it for blk-mq and DM devices.
Fixes: f211268ed1f9 ("dm: Use the block layer zone append emulation")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
block/blk-mq.c | 6 +++--
block/blk-zoned.c | 20 +--------------
drivers/md/dm.c | 4 ++-
include/linux/blkdev.h | 55 ++++++++++++++++++++++++++++++++++++++++++
4 files changed, 63 insertions(+), 22 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4806b867e37d..0c61492724d2 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3169,8 +3169,10 @@ void blk_mq_submit_bio(struct bio *bio)
if (blk_mq_attempt_bio_merge(q, bio, nr_segs))
goto queue_exit;
- if (blk_queue_is_zoned(q) && blk_zone_plug_bio(bio, nr_segs))
- goto queue_exit;
+ if (bio_needs_zone_write_plugging(bio)) {
+ if (blk_zone_plug_bio(bio, nr_segs))
+ goto queue_exit;
+ }
new_request:
if (rq) {
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 55e64ca869d7..16b08bf8f82a 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -1145,25 +1145,7 @@ bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs)
{
struct block_device *bdev = bio->bi_bdev;
- if (!bdev->bd_disk->zone_wplugs_hash)
- return false;
-
- /*
- * If the BIO already has the plugging flag set, then it was already
- * handled through this path and this is a submission from the zone
- * plug bio submit work.
- */
- if (bio_flagged(bio, BIO_ZONE_WRITE_PLUGGING))
- return false;
-
- /*
- * We do not need to do anything special for empty flush BIOs, e.g
- * BIOs such as issued by blkdev_issue_flush(). The is because it is
- * the responsibility of the user to first wait for the completion of
- * write operations for flush to have any effect on the persistence of
- * the written data.
- */
- if (op_is_flush(bio->bi_opf) && !bio_sectors(bio))
+ if (WARN_ON_ONCE(!bdev->bd_disk->zone_wplugs_hash))
return false;
/*
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 55579adbeb3f..e477765cdd27 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1781,7 +1781,9 @@ static inline bool dm_zone_bio_needs_split(struct mapped_device *md,
}
static inline bool dm_zone_plug_bio(struct mapped_device *md, struct bio *bio)
{
- return dm_emulate_zone_append(md) && blk_zone_plug_bio(bio, 0);
+ if (!bio_needs_zone_write_plugging(bio))
+ return false;
+ return blk_zone_plug_bio(bio, 0);
}
static blk_status_t __send_zone_reset_all_emulated(struct clone_info *ci,
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index c2b3ddea8b6d..a7d09ef4f448 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -848,6 +848,55 @@ static inline unsigned int disk_nr_zones(struct gendisk *disk)
{
return disk->nr_zones;
}
+
+/**
+ * bio_needs_zone_write_plugging - Check if a BIO needs to be handled with zone
+ * write plugging
+ * @bio: The BIO being submitted
+ *
+ * Return true whenever @bio execution needs to be handled through zone
+ * write plugging (using blk_zone_plug_bio()). Return false otherwise.
+ */
+static inline bool bio_needs_zone_write_plugging(struct bio *bio)
+{
+ enum req_op op = bio_op(bio);
+
+ /*
+ * Only zoned block devices have a zone write plug hash table. But not
+ * all of them have one (e.g. DM devices may not need one).
+ */
+ if (!bio->bi_bdev->bd_disk->zone_wplugs_hash)
+ return false;
+
+ /* Only write operations need zone write plugging. */
+ if (!op_is_write(op))
+ return false;
+
+ /* Ignore empty flush */
+ if (op_is_flush(bio->bi_opf) && !bio_sectors(bio))
+ return false;
+
+ /* Ignore BIOs that already have been handled by zone write plugging. */
+ if (bio_flagged(bio, BIO_ZONE_WRITE_PLUGGING))
+ return false;
+
+ /*
+ * All zone write operations must be handled through zone write plugging
+ * using blk_zone_plug_bio().
+ */
+ switch (op) {
+ case REQ_OP_ZONE_APPEND:
+ case REQ_OP_WRITE:
+ case REQ_OP_WRITE_ZEROES:
+ case REQ_OP_ZONE_FINISH:
+ case REQ_OP_ZONE_RESET:
+ case REQ_OP_ZONE_RESET_ALL:
+ return true;
+ default:
+ return false;
+ }
+}
+
bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs);
/**
@@ -877,6 +926,12 @@ static inline unsigned int disk_nr_zones(struct gendisk *disk)
{
return 0;
}
+
+static inline bool bio_needs_zone_write_plugging(struct bio *bio)
+{
+ return false;
+}
+
static inline bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs)
{
return false;
--
2.49.0
* Re: [PATCH v3 2/5] block: Introduce bio_needs_zone_write_plugging()
2025-06-25 9:33 ` [PATCH v3 2/5] block: Introduce bio_needs_zone_write_plugging() Damien Le Moal
@ 2025-06-25 11:28 ` Christoph Hellwig
2025-06-25 11:47 ` Johannes Thumshirn
2025-06-25 15:48 ` Bart Van Assche
2 siblings, 0 replies; 25+ messages in thread
From: Christoph Hellwig @ 2025-06-25 11:28 UTC (permalink / raw)
To: Damien Le Moal
Cc: linux-block, Jens Axboe, dm-devel, Mike Snitzer, Mikulas Patocka,
Bart Van Assche
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCH v3 2/5] block: Introduce bio_needs_zone_write_plugging()
2025-06-25 9:33 ` [PATCH v3 2/5] block: Introduce bio_needs_zone_write_plugging() Damien Le Moal
2025-06-25 11:28 ` Christoph Hellwig
@ 2025-06-25 11:47 ` Johannes Thumshirn
2025-06-25 15:48 ` Bart Van Assche
2 siblings, 0 replies; 25+ messages in thread
From: Johannes Thumshirn @ 2025-06-25 11:47 UTC (permalink / raw)
To: Damien Le Moal, linux-block@vger.kernel.org, Jens Axboe,
dm-devel@lists.linux.dev, Mike Snitzer, Mikulas Patocka
Cc: Bart Van Assche
On 25.06.25 11:44, Damien Le Moal wrote:
> static inline bool bio_needs_zone_write_plugging(struct bio *bio)
> +{
> + enum req_op op = bio_op(bio);
> +
> + /*
> + * Only zoned block devices have a zone write plug hash table. But not
> + * all of them have one (e.g. DM devices may not need one).
> + */
> + if (!bio->bi_bdev->bd_disk->zone_wplugs_hash)
> + return false;
> +
> + /* Only write operations need zone write plugging. */
> + if (!op_is_write(op))
> + return false;
> +
> + /* Ignore empty flush */
> + if (op_is_flush(bio->bi_opf) && !bio_sectors(bio))
> + return false;
> +
> + /* Ignore BIOs that already have been handled by zone write plugging. */
> + if (bio_flagged(bio, BIO_ZONE_WRITE_PLUGGING))
> + return false;
> +
> + /*
> + * All zone write operations must be handled through zone write plugging
> + * using blk_zone_plug_bio().
> + */
> + switch (op) {
> + case REQ_OP_ZONE_APPEND:
> + case REQ_OP_WRITE:
> + case REQ_OP_WRITE_ZEROES:
> + case REQ_OP_ZONE_FINISH:
> + case REQ_OP_ZONE_RESET:
> + case REQ_OP_ZONE_RESET_ALL:
> + return true;
> + default:
> + return false;
> + }
Maybe this is bikeshedding territory, but I'd move the long pointer
chase 'bio->bi_bdev->bd_disk->zone_wplugs_hash' later into the function,
at least after the op_is_write() check, so that reads don't pay for the
pointer chase.
* Re: [PATCH v3 2/5] block: Introduce bio_needs_zone_write_plugging()
2025-06-25 9:33 ` [PATCH v3 2/5] block: Introduce bio_needs_zone_write_plugging() Damien Le Moal
2025-06-25 11:28 ` Christoph Hellwig
2025-06-25 11:47 ` Johannes Thumshirn
@ 2025-06-25 15:48 ` Bart Van Assche
2025-06-25 23:38 ` Damien Le Moal
2 siblings, 1 reply; 25+ messages in thread
From: Bart Van Assche @ 2025-06-25 15:48 UTC (permalink / raw)
To: Damien Le Moal, linux-block, Jens Axboe, dm-devel, Mike Snitzer,
Mikulas Patocka
On 6/25/25 2:33 AM, Damien Le Moal wrote:
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 4806b867e37d..0c61492724d2 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -3169,8 +3169,10 @@ void blk_mq_submit_bio(struct bio *bio)
> if (blk_mq_attempt_bio_merge(q, bio, nr_segs))
> goto queue_exit;
>
> - if (blk_queue_is_zoned(q) && blk_zone_plug_bio(bio, nr_segs))
> - goto queue_exit;
> + if (bio_needs_zone_write_plugging(bio)) {
> + if (blk_zone_plug_bio(bio, nr_segs))
> + goto queue_exit;
> + }
Why nested if-statements instead of keeping "&&"? I prefer "&&".
Thanks,
Bart.
* Re: [PATCH v3 2/5] block: Introduce bio_needs_zone_write_plugging()
2025-06-25 15:48 ` Bart Van Assche
@ 2025-06-25 23:38 ` Damien Le Moal
2025-06-26 16:37 ` Bart Van Assche
0 siblings, 1 reply; 25+ messages in thread
From: Damien Le Moal @ 2025-06-25 23:38 UTC (permalink / raw)
To: Bart Van Assche, linux-block, Jens Axboe, dm-devel, Mike Snitzer,
Mikulas Patocka
On 6/26/25 00:48, Bart Van Assche wrote:
> On 6/25/25 2:33 AM, Damien Le Moal wrote:
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index 4806b867e37d..0c61492724d2 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -3169,8 +3169,10 @@ void blk_mq_submit_bio(struct bio *bio)
>> if (blk_mq_attempt_bio_merge(q, bio, nr_segs))
>> goto queue_exit;
>>
>> - if (blk_queue_is_zoned(q) && blk_zone_plug_bio(bio, nr_segs))
>> - goto queue_exit;
>> + if (bio_needs_zone_write_plugging(bio)) {
>> + if (blk_zone_plug_bio(bio, nr_segs))
>> + goto queue_exit;
>> + }
>
> Why nested if-statements instead of keeping "&&"? I prefer "&&".
I did this because bio_needs_zone_write_plugging() is inline and
blk_zone_plug_bio() is not, so this ensures that we do not make the
function call for nothing. Though I may be overthinking this since
normally the generated assembly will not evaluate the second operand of
a && condition if the first operand is already false.
--
Damien Le Moal
Western Digital Research
* Re: [PATCH v3 2/5] block: Introduce bio_needs_zone_write_plugging()
2025-06-25 23:38 ` Damien Le Moal
@ 2025-06-26 16:37 ` Bart Van Assche
0 siblings, 0 replies; 25+ messages in thread
From: Bart Van Assche @ 2025-06-26 16:37 UTC (permalink / raw)
To: Damien Le Moal, linux-block, Jens Axboe, dm-devel, Mike Snitzer,
Mikulas Patocka
On 6/25/25 4:38 PM, Damien Le Moal wrote:
> On 6/26/25 00:48, Bart Van Assche wrote:
>> On 6/25/25 2:33 AM, Damien Le Moal wrote:
>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>> index 4806b867e37d..0c61492724d2 100644
>>> --- a/block/blk-mq.c
>>> +++ b/block/blk-mq.c
>>> @@ -3169,8 +3169,10 @@ void blk_mq_submit_bio(struct bio *bio)
>>> if (blk_mq_attempt_bio_merge(q, bio, nr_segs))
>>> goto queue_exit;
>>>
>>> - if (blk_queue_is_zoned(q) && blk_zone_plug_bio(bio, nr_segs))
>>> - goto queue_exit;
>>> + if (bio_needs_zone_write_plugging(bio)) {
>>> + if (blk_zone_plug_bio(bio, nr_segs))
>>> + goto queue_exit;
>>> + }
>>
>> Why nested if-statements instead of keeping "&&"? I prefer "&&".
>
> I did this because bio_needs_zone_write_plugging() is inline and
> blk_zone_plug_bio() is not, so this ensures that we do not make the
> function call for nothing. Though I may be overthinking this since
> normally the generated assembly will not evaluate the second operand of
> a && condition if the first operand is already false.
From a (draft version of) the C standard, section "6.5.14 Logical AND
operator": "If the first operand compares equal to 0, the second operand
is not evaluated." I think that text was already present in the
K&R C book.
See also https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3220.pdf.
Thanks,
Bart.
* [PATCH v3 3/5] dm: Always split write BIOs to zoned device limits
2025-06-25 9:33 [PATCH v3 0/5] Fix write operation handling for zoned DM devices Damien Le Moal
2025-06-25 9:33 ` [PATCH v3 1/5] block: Make REQ_OP_ZONE_FINISH a write operation Damien Le Moal
2025-06-25 9:33 ` [PATCH v3 2/5] block: Introduce bio_needs_zone_write_plugging() Damien Le Moal
@ 2025-06-25 9:33 ` Damien Le Moal
2025-06-25 11:49 ` Johannes Thumshirn
` (2 more replies)
2025-06-25 9:33 ` [PATCH v3 4/5] dm: dm-crypt: Do not partially accept write BIOs with zoned targets Damien Le Moal
` (2 subsequent siblings)
5 siblings, 3 replies; 25+ messages in thread
From: Damien Le Moal @ 2025-06-25 9:33 UTC (permalink / raw)
To: linux-block, Jens Axboe, dm-devel, Mike Snitzer, Mikulas Patocka
Cc: Bart Van Assche
Any zoned DM target that requires zone append emulation will use the
block layer zone write plugging. In that case, DM target drivers must
not split BIOs using dm_accept_partial_bio() as doing so can potentially
lead to deadlocks with queue freeze operations. Regular write operations
used to emulate zone append operations also cannot be split by the
target driver as that would result in an invalid written sector value
being returned through the BIO sector.
In order for zoned DM target drivers to avoid such incorrect BIO
splitting, we must ensure that large BIOs are split before being passed
to the map() function of the target, thus guaranteeing that the
limits for the mapped device are not exceeded.
dm-crypt and dm-flakey are the only target drivers supporting zoned
devices and using dm_accept_partial_bio().
In the case of dm-crypt, this function is used to split BIOs to the
internal max_write_size limit (which will be removed in a different
patch). However, since crypt_alloc_buffer() uses a bioset allowing only
up to BIO_MAX_VECS (256) vectors in a BIO, the dm-crypt device
max_segments limit, which is not set and so defaults to BLK_MAX_SEGMENTS
(128), must be respected and write BIOs split accordingly.
In the case of dm-flakey, since zone append emulation is not required,
the block layer zone write plugging is not used and no splitting of
BIOs is required.
Modify the function dm_zone_bio_needs_split() to use the block layer
helper function bio_needs_zone_write_plugging() to force a call to
bio_split_to_limits() in dm_split_and_process_bio(). This allows DM
target drivers to avoid using dm_accept_partial_bio() for write
operations on zoned DM devices.
Fixes: f211268ed1f9 ("dm: Use the block layer zone append emulation")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
drivers/md/dm.c | 29 ++++++++++++++++++++++-------
1 file changed, 22 insertions(+), 7 deletions(-)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index e477765cdd27..f1e63c1808b4 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1773,12 +1773,29 @@ static inline bool dm_zone_bio_needs_split(struct mapped_device *md,
struct bio *bio)
{
/*
- * For mapped device that need zone append emulation, we must
- * split any large BIO that straddles zone boundaries.
+ * Special case the zone operations that cannot or should not be split.
*/
- return dm_emulate_zone_append(md) && bio_straddles_zones(bio) &&
- !bio_flagged(bio, BIO_ZONE_WRITE_PLUGGING);
+ switch (bio_op(bio)) {
+ case REQ_OP_ZONE_APPEND:
+ case REQ_OP_ZONE_FINISH:
+ case REQ_OP_ZONE_RESET:
+ case REQ_OP_ZONE_RESET_ALL:
+ return false;
+ default:
+ break;
+ }
+
+ /*
+ * Mapped devices that require zone append emulation will use the block
+ * layer zone write plugging. In such case, we must split any large BIO
+ * to the mapped device limits to avoid potential deadlocks with queue
+ * freeze operations.
+ */
+ if (!dm_emulate_zone_append(md))
+ return false;
+ return bio_needs_zone_write_plugging(bio) || bio_straddles_zones(bio);
}
+
static inline bool dm_zone_plug_bio(struct mapped_device *md, struct bio *bio)
{
if (!bio_needs_zone_write_plugging(bio))
@@ -1927,9 +1944,7 @@ static void dm_split_and_process_bio(struct mapped_device *md,
is_abnormal = is_abnormal_io(bio);
if (static_branch_unlikely(&zoned_enabled)) {
- /* Special case REQ_OP_ZONE_RESET_ALL as it cannot be split. */
- need_split = (bio_op(bio) != REQ_OP_ZONE_RESET_ALL) &&
- (is_abnormal || dm_zone_bio_needs_split(md, bio));
+ need_split = is_abnormal || dm_zone_bio_needs_split(md, bio);
} else {
need_split = is_abnormal;
}
--
2.49.0
* Re: [PATCH v3 3/5] dm: Always split write BIOs to zoned device limits
2025-06-25 9:33 ` [PATCH v3 3/5] dm: Always split write BIOs to zoned device limits Damien Le Moal
@ 2025-06-25 11:49 ` Johannes Thumshirn
2025-06-25 16:30 ` Mikulas Patocka
2025-06-25 16:34 ` Bart Van Assche
2 siblings, 0 replies; 25+ messages in thread
From: Johannes Thumshirn @ 2025-06-25 11:49 UTC (permalink / raw)
To: Damien Le Moal, linux-block@vger.kernel.org, Jens Axboe,
dm-devel@lists.linux.dev, Mike Snitzer, Mikulas Patocka
Cc: Bart Van Assche
Looks good to me,
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
* Re: [PATCH v3 3/5] dm: Always split write BIOs to zoned device limits
2025-06-25 9:33 ` [PATCH v3 3/5] dm: Always split write BIOs to zoned device limits Damien Le Moal
2025-06-25 11:49 ` Johannes Thumshirn
@ 2025-06-25 16:30 ` Mikulas Patocka
2025-06-25 16:34 ` Bart Van Assche
2 siblings, 0 replies; 25+ messages in thread
From: Mikulas Patocka @ 2025-06-25 16:30 UTC (permalink / raw)
To: Damien Le Moal
Cc: linux-block, Jens Axboe, dm-devel, Mike Snitzer, Bart Van Assche
On Wed, 25 Jun 2025, Damien Le Moal wrote:
> Any zoned DM target that requires zone append emulation will use the
> block layer zone write plugging. In that case, DM target drivers must
> not split BIOs using dm_accept_partial_bio() as doing so can potentially
> lead to deadlocks with queue freeze operations. Regular write operations
> used to emulate zone append operations also cannot be split by the
> target driver as that would result in an invalid written sector value
> being returned through the BIO sector.
>
> In order for zoned DM target drivers to avoid such incorrect BIO
> splitting, we must ensure that large BIOs are split before being passed
> to the map() function of the target, thus guaranteeing that the
> limits for the mapped device are not exceeded.
>
> dm-crypt and dm-flakey are the only target drivers supporting zoned
> devices and using dm_accept_partial_bio().
>
> In the case of dm-crypt, this function is used to split BIOs to the
> internal max_write_size limit (which will be removed in a different
> patch). However, since crypt_alloc_buffer() uses a bioset allowing only
> up to BIO_MAX_VECS (256) vectors in a BIO, the dm-crypt device
> max_segments limit, which is not set and so defaults to BLK_MAX_SEGMENTS
> (128), must be respected and write BIOs split accordingly.
>
> In the case of dm-flakey, since zone append emulation is not required,
> the block layer zone write plugging is not used and no splitting of
> BIOs is required.
>
> Modify the function dm_zone_bio_needs_split() to use the block layer
> helper function bio_needs_zone_write_plugging() to force a call to
> bio_split_to_limits() in dm_split_and_process_bio(). This allows DM
> target drivers to avoid using dm_accept_partial_bio() for write
> operations on zoned DM devices.
>
> Fixes: f211268ed1f9 ("dm: Use the block layer zone append emulation")
> Cc: stable@vger.kernel.org
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
> ---
> drivers/md/dm.c | 29 ++++++++++++++++++++++-------
> 1 file changed, 22 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index e477765cdd27..f1e63c1808b4 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1773,12 +1773,29 @@ static inline bool dm_zone_bio_needs_split(struct mapped_device *md,
> struct bio *bio)
> {
> /*
> - * For mapped device that need zone append emulation, we must
> - * split any large BIO that straddles zone boundaries.
> + * Special case the zone operations that cannot or should not be split.
> */
> - return dm_emulate_zone_append(md) && bio_straddles_zones(bio) &&
> - !bio_flagged(bio, BIO_ZONE_WRITE_PLUGGING);
> + switch (bio_op(bio)) {
> + case REQ_OP_ZONE_APPEND:
> + case REQ_OP_ZONE_FINISH:
> + case REQ_OP_ZONE_RESET:
> + case REQ_OP_ZONE_RESET_ALL:
> + return false;
> + default:
> + break;
> + }
> +
> + /*
> + * Mapped devices that require zone append emulation will use the block
> + * layer zone write plugging. In such case, we must split any large BIO
> + * to the mapped device limits to avoid potential deadlocks with queue
> + * freeze operations.
> + */
> + if (!dm_emulate_zone_append(md))
> + return false;
> + return bio_needs_zone_write_plugging(bio) || bio_straddles_zones(bio);
> }
> +
> static inline bool dm_zone_plug_bio(struct mapped_device *md, struct bio *bio)
> {
> if (!bio_needs_zone_write_plugging(bio))
> @@ -1927,9 +1944,7 @@ static void dm_split_and_process_bio(struct mapped_device *md,
>
> is_abnormal = is_abnormal_io(bio);
> if (static_branch_unlikely(&zoned_enabled)) {
> - /* Special case REQ_OP_ZONE_RESET_ALL as it cannot be split. */
> - need_split = (bio_op(bio) != REQ_OP_ZONE_RESET_ALL) &&
> - (is_abnormal || dm_zone_bio_needs_split(md, bio));
> + need_split = is_abnormal || dm_zone_bio_needs_split(md, bio);
> } else {
> need_split = is_abnormal;
> }
> --
> 2.49.0
>
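The split decision implemented by the hunk above can be modeled in user space. This is a minimal sketch under assumed inputs: the enum values are illustrative and the three boolean parameters stand in for dm_emulate_zone_append(), bio_needs_zone_write_plugging() and bio_straddles_zones(), which in the kernel are derived from the device and the BIO:

```c
#include <stdbool.h>

/* Illustrative op values (zone ops redefined as odd numbers by patch 1
 * of this series so that op_is_write(), i.e. op & 1, returns true). */
enum req_op {
	REQ_OP_WRITE		= 1,
	REQ_OP_ZONE_APPEND	= 7,
	REQ_OP_ZONE_FINISH	= 13,
	REQ_OP_ZONE_RESET	= 15,
	REQ_OP_ZONE_RESET_ALL	= 17,
};

static bool zone_bio_needs_split(enum req_op op, bool emulate_zone_append,
				 bool needs_zone_write_plugging,
				 bool straddles_zones)
{
	/* Zone operations that cannot or should not be split. */
	switch (op) {
	case REQ_OP_ZONE_APPEND:
	case REQ_OP_ZONE_FINISH:
	case REQ_OP_ZONE_RESET:
	case REQ_OP_ZONE_RESET_ALL:
		return false;
	default:
		break;
	}

	/* Only devices emulating zone append use zone write plugging,
	 * so only they need the early split. */
	if (!emulate_zone_append)
		return false;
	return needs_zone_write_plugging || straddles_zones;
}
```

Note how REQ_OP_ZONE_RESET_ALL now falls out of the switch, which is what allows the later hunk to drop its special casing from dm_split_and_process_bio().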
^ permalink raw reply [flat|nested] 25+ messages in thread

* Re: [PATCH v3 3/5] dm: Always split write BIOs to zoned device limits
2025-06-25 9:33 ` [PATCH v3 3/5] dm: Always split write BIOs to zoned device limits Damien Le Moal
2025-06-25 11:49 ` Johannes Thumshirn
2025-06-25 16:30 ` Mikulas Patocka
@ 2025-06-25 16:34 ` Bart Van Assche
2025-06-25 23:41 ` Damien Le Moal
2 siblings, 1 reply; 25+ messages in thread
From: Bart Van Assche @ 2025-06-25 16:34 UTC (permalink / raw)
To: Damien Le Moal, linux-block, Jens Axboe, dm-devel, Mike Snitzer,
Mikulas Patocka
On 6/25/25 2:33 AM, Damien Le Moal wrote:
> + /*
> + * Mapped devices that require zone append emulation will use the block
> + * layer zone write plugging. In such case, we must split any large BIO
> + * to the mapped device limits to avoid potential deadlocks with queue
> + * freeze operations.
> + */
> + if (!dm_emulate_zone_append(md))
> + return false;
> + return bio_needs_zone_write_plugging(bio) || bio_straddles_zones(bio);
Changing the dm_emulate_zone_append() call into a
disk_need_zone_resources() call or a disk->zone_wplugs_hash test
probably would make the intention of the if-statement more clear.
Additionally, bio_needs_zone_write_plugging() already tests
disk->zone_wplugs_hash. Isn't the dm_emulate_zone_append() call
superfluous because of this?
Thanks,
Bart.
* Re: [PATCH v3 3/5] dm: Always split write BIOs to zoned device limits
2025-06-25 16:34 ` Bart Van Assche
@ 2025-06-25 23:41 ` Damien Le Moal
0 siblings, 0 replies; 25+ messages in thread
From: Damien Le Moal @ 2025-06-25 23:41 UTC (permalink / raw)
To: Bart Van Assche, linux-block, Jens Axboe, dm-devel, Mike Snitzer,
Mikulas Patocka
On 6/26/25 01:34, Bart Van Assche wrote:
> On 6/25/25 2:33 AM, Damien Le Moal wrote:
>> + /*
>> + * Mapped devices that require zone append emulation will use the block
>> + * layer zone write plugging. In such case, we must split any large BIO
>> + * to the mapped device limits to avoid potential deadlocks with queue
>> + * freeze operations.
>> + */
>> + if (!dm_emulate_zone_append(md))
>> + return false;
>> + return bio_needs_zone_write_plugging(bio) || bio_straddles_zones(bio);
>
> Changing the dm_emulate_zone_append() call into a
> disk_need_zone_resources() call or a disk->zone_wplugs_hash test
> probably would make the intention of the if-statement more clear.
Is the comment not clear enough?
> Additionally, bio_needs_zone_write_plugging() already tests
> disk->zone_wplugs_hash. Isn't the dm_emulate_zone_append() call
> superfluous because of this?
Yes, I think it is. But I did try to minimize changes given that this is a fix
which is not small already...
I will do more testing and send follow-up cleanup patches.
--
Damien Le Moal
Western Digital Research
* [PATCH v3 4/5] dm: dm-crypt: Do not partially accept write BIOs with zoned targets
2025-06-25 9:33 [PATCH v3 0/5] Fix write operation handling for zoned DM devices Damien Le Moal
` (2 preceding siblings ...)
2025-06-25 9:33 ` [PATCH v3 3/5] dm: Always split write BIOs to zoned device limits Damien Le Moal
@ 2025-06-25 9:33 ` Damien Le Moal
2025-06-25 16:31 ` Mikulas Patocka
2025-06-25 16:42 ` Bart Van Assche
2025-06-25 9:33 ` [PATCH v3 5/5] dm: Check for forbidden splitting of zone write operations Damien Le Moal
2025-06-25 16:37 ` [PATCH v3 0/5] Fix write operation handling for zoned DM devices Jens Axboe
5 siblings, 2 replies; 25+ messages in thread
From: Damien Le Moal @ 2025-06-25 9:33 UTC (permalink / raw)
To: linux-block, Jens Axboe, dm-devel, Mike Snitzer, Mikulas Patocka
Cc: Bart Van Assche
Read and write operations issued to a dm-crypt target may be split
according to the dm-crypt internal limits defined by the max_read_size
and max_write_size module parameters (default is 128 KB). The intent is
to improve processing time of large BIOs by splitting them into smaller
operations that can be parallelized on different CPUs.
For zoned dm-crypt targets, this BIO splitting is still done but without
the parallel execution to ensure that the issuing order of write
operations to the underlying devices remains sequential. However, the
splitting itself causes other problems:
> 1) Since dm-crypt relies on the block layer zone write plugging to
> handle zone append emulation using regular write operations, the
> remainder of a split write BIO will always be plugged into the target
> zone write plug. Once the on-going write BIO finishes, this
> remainder BIO is unplugged and issued from the zone write plug work.
> If this remainder BIO itself needs to be split, the remainder will be
> re-issued and plugged again, but that causes a call to
> blk_queue_enter(), which may block if a queue freeze operation was
> initiated. This results in a deadlock as DM submission still holds
> BIOs that the queue freeze side is waiting for.
> 2) dm-crypt relies on the emulation done by the block layer using
> regular write operations for processing zone append operations. This
> still requires properly returning the written sector as the BIO
> sector of the original BIO. However, this can be done correctly if
> and only if there is a single clone BIO used for processing the
> original zone append operation issued by the user. If the size of a
> zone append operation is larger than dm-crypt max_write_size, then
> the original BIO will be split and processed as a chain of regular
> write operations. Such chaining results in an incorrect written
> sector being returned to the zone append issuer using the original
> BIO sector. This in turn results in file system data corruption with
> xfs or btrfs.
Fix this by modifying get_max_request_size() to always return the size
of the BIO to avoid it being split with dm_accpet_partial_bio() in
crypt_map(). get_max_request_size() is renamed to
get_max_request_sectors() to clarify the unit of the value returned
and its interface is changed to take a struct dm_target pointer and a
pointer to the struct bio being processed. In addition to this change,
to ensure that crypt_alloc_buffer() works correctly, set the dm-crypt
device max_hw_sectors limit to be at most
BIO_MAX_VECS << PAGE_SECTORS_SHIFT (1 MB with a 4KB page architecture).
> This forces DM core to split write BIOs before passing them to
> crypt_map(), thus guaranteeing that dm-crypt can always accept an
> entire write BIO without needing to split it.
This change does not have any effect on the read path of dm-crypt. Read
operations can still be split and the BIO fragments processed in
parallel. There is also no impact on the performance of the write path
given that all zone write BIOs were already processed inline instead of
in parallel.
This change also does not affect in any way regular dm-crypt block
devices.
Fixes: f211268ed1f9 ("dm: Use the block layer zone append emulation")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
drivers/md/dm-crypt.c | 49 ++++++++++++++++++++++++++++++++++---------
1 file changed, 39 insertions(+), 10 deletions(-)
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 17157c4216a5..4e80784d1734 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -253,17 +253,35 @@ MODULE_PARM_DESC(max_read_size, "Maximum size of a read request");
static unsigned int max_write_size = 0;
module_param(max_write_size, uint, 0644);
MODULE_PARM_DESC(max_write_size, "Maximum size of a write request");
-static unsigned get_max_request_size(struct crypt_config *cc, bool wrt)
+
+static unsigned get_max_request_sectors(struct dm_target *ti, struct bio *bio)
{
+ struct crypt_config *cc = ti->private;
unsigned val, sector_align;
- val = !wrt ? READ_ONCE(max_read_size) : READ_ONCE(max_write_size);
- if (likely(!val))
- val = !wrt ? DM_CRYPT_DEFAULT_MAX_READ_SIZE : DM_CRYPT_DEFAULT_MAX_WRITE_SIZE;
- if (wrt || cc->used_tag_size) {
- if (unlikely(val > BIO_MAX_VECS << PAGE_SHIFT))
- val = BIO_MAX_VECS << PAGE_SHIFT;
- }
- sector_align = max(bdev_logical_block_size(cc->dev->bdev), (unsigned)cc->sector_size);
+ bool wrt = op_is_write(bio_op(bio));
+
+ if (wrt) {
+ /*
+ * For zoned devices, splitting write operations creates the
+ * risk of deadlocking queue freeze operations with zone write
+ * plugging BIO work when the reminder of a split BIO is
+ * issued. So always allow the entire BIO to proceed.
+ */
+ if (ti->emulate_zone_append)
+ return bio_sectors(bio);
+
+ val = min_not_zero(READ_ONCE(max_write_size),
+ DM_CRYPT_DEFAULT_MAX_WRITE_SIZE);
+ } else {
+ val = min_not_zero(READ_ONCE(max_read_size),
+ DM_CRYPT_DEFAULT_MAX_READ_SIZE);
+ }
+
+ if (wrt || cc->used_tag_size)
+ val = min(val, BIO_MAX_VECS << PAGE_SHIFT);
+
+ sector_align = max(bdev_logical_block_size(cc->dev->bdev),
+ (unsigned)cc->sector_size);
val = round_down(val, sector_align);
if (unlikely(!val))
val = sector_align;
@@ -3496,7 +3514,7 @@ static int crypt_map(struct dm_target *ti, struct bio *bio)
/*
* Check if bio is too large, split as needed.
*/
- max_sectors = get_max_request_size(cc, bio_data_dir(bio) == WRITE);
+ max_sectors = get_max_request_sectors(ti, bio);
if (unlikely(bio_sectors(bio) > max_sectors))
dm_accept_partial_bio(bio, max_sectors);
@@ -3733,6 +3751,17 @@ static void crypt_io_hints(struct dm_target *ti, struct queue_limits *limits)
max_t(unsigned int, limits->physical_block_size, cc->sector_size);
limits->io_min = max_t(unsigned int, limits->io_min, cc->sector_size);
limits->dma_alignment = limits->logical_block_size - 1;
+
+ /*
+ * For zoned dm-crypt targets, there will be no internal splitting of
+ * write BIOs to avoid exceeding BIO_MAX_VECS vectors per BIO. But
+ * without respecting this limit, crypt_alloc_buffer() will trigger a
+ * BUG(). Avoid this by forcing DM core to split write BIOs to this
+ * limit.
+ */
+ if (ti->emulate_zone_append)
+ limits->max_hw_sectors = min(limits->max_hw_sectors,
+ BIO_MAX_VECS << PAGE_SECTORS_SHIFT);
}
static struct target_type crypt_target = {
--
2.49.0
* Re: [PATCH v3 4/5] dm: dm-crypt: Do not partially accept write BIOs with zoned targets
2025-06-25 9:33 ` [PATCH v3 4/5] dm: dm-crypt: Do not partially accept write BIOs with zoned targets Damien Le Moal
@ 2025-06-25 16:31 ` Mikulas Patocka
2025-06-25 16:42 ` Bart Van Assche
1 sibling, 0 replies; 25+ messages in thread
From: Mikulas Patocka @ 2025-06-25 16:31 UTC (permalink / raw)
To: Damien Le Moal
Cc: linux-block, Jens Axboe, dm-devel, Mike Snitzer, Bart Van Assche
On Wed, 25 Jun 2025, Damien Le Moal wrote:
> Read and write operations issued to a dm-crypt target may be split
> according to the dm-crypt internal limits defined by the max_read_size
> and max_write_size module parameters (default is 128 KB). The intent is
> to improve processing time of large BIOs by splitting them into smaller
> operations that can be parallelized on different CPUs.
>
> For zoned dm-crypt targets, this BIO splitting is still done but without
> the parallel execution to ensure that the issuing order of write
> operations to the underlying devices remains sequential. However, the
> splitting itself causes other problems:
>
> 1) Since dm-crypt relies on the block layer zone write plugging to
> handle zone append emulation using regular write operations, the
> remainder of a split write BIO will always be plugged into the target
> zone write plug. Once the on-going write BIO finishes, this
> remainder BIO is unplugged and issued from the zone write plug work.
> If this remainder BIO itself needs to be split, the remainder will be
> re-issued and plugged again, but that causes a call to
> blk_queue_enter(), which may block if a queue freeze operation was
> initiated. This results in a deadlock as DM submission still holds
> BIOs that the queue freeze side is waiting for.
>
> 2) dm-crypt relies on the emulation done by the block layer using
> regular write operations for processing zone append operations. This
> still requires properly returning the written sector as the BIO
> sector of the original BIO. However, this can be done correctly if
> and only if there is a single clone BIO used for processing the
> original zone append operation issued by the user. If the size of a
> zone append operation is larger than dm-crypt max_write_size, then
> the original BIO will be split and processed as a chain of regular
> write operations. Such chaining results in an incorrect written
> sector being returned to the zone append issuer using the original
> BIO sector. This in turn results in file system data corruption with
> xfs or btrfs.
>
> Fix this by modifying get_max_request_size() to always return the size
> of the BIO to avoid it being split with dm_accpet_partial_bio() in
> crypt_map(). get_max_request_size() is renamed to
> get_max_request_sectors() to clarify the unit of the value returned
> and its interface is changed to take a struct dm_target pointer and a
> pointer to the struct bio being processed. In addition to this change,
> to ensure that crypt_alloc_buffer() works correctly, set the dm-crypt
> device max_hw_sectors limit to be at most
> BIO_MAX_VECS << PAGE_SECTORS_SHIFT (1 MB with a 4KB page architecture).
> This forces DM core to split write BIOs before passing them to
> crypt_map(), thus guaranteeing that dm-crypt can always accept an
> entire write BIO without needing to split it.
>
> This change does not have any effect on the read path of dm-crypt. Read
> operations can still be split and the BIO fragments processed in
> parallel. There is also no impact on the performance of the write path
> given that all zone write BIOs were already processed inline instead of
> in parallel.
>
> This change also does not affect in any way regular dm-crypt block
> devices.
>
> Fixes: f211268ed1f9 ("dm: Use the block layer zone append emulation")
> Cc: stable@vger.kernel.org
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
> ---
> drivers/md/dm-crypt.c | 49 ++++++++++++++++++++++++++++++++++---------
> 1 file changed, 39 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
> index 17157c4216a5..4e80784d1734 100644
> --- a/drivers/md/dm-crypt.c
> +++ b/drivers/md/dm-crypt.c
> @@ -253,17 +253,35 @@ MODULE_PARM_DESC(max_read_size, "Maximum size of a read request");
> static unsigned int max_write_size = 0;
> module_param(max_write_size, uint, 0644);
> MODULE_PARM_DESC(max_write_size, "Maximum size of a write request");
> -static unsigned get_max_request_size(struct crypt_config *cc, bool wrt)
> +
> +static unsigned get_max_request_sectors(struct dm_target *ti, struct bio *bio)
> {
> + struct crypt_config *cc = ti->private;
> unsigned val, sector_align;
> - val = !wrt ? READ_ONCE(max_read_size) : READ_ONCE(max_write_size);
> - if (likely(!val))
> - val = !wrt ? DM_CRYPT_DEFAULT_MAX_READ_SIZE : DM_CRYPT_DEFAULT_MAX_WRITE_SIZE;
> - if (wrt || cc->used_tag_size) {
> - if (unlikely(val > BIO_MAX_VECS << PAGE_SHIFT))
> - val = BIO_MAX_VECS << PAGE_SHIFT;
> - }
> - sector_align = max(bdev_logical_block_size(cc->dev->bdev), (unsigned)cc->sector_size);
> + bool wrt = op_is_write(bio_op(bio));
> +
> + if (wrt) {
> + /*
> + * For zoned devices, splitting write operations creates the
> + * risk of deadlocking queue freeze operations with zone write
> + * plugging BIO work when the reminder of a split BIO is
> + * issued. So always allow the entire BIO to proceed.
> + */
> + if (ti->emulate_zone_append)
> + return bio_sectors(bio);
> +
> + val = min_not_zero(READ_ONCE(max_write_size),
> + DM_CRYPT_DEFAULT_MAX_WRITE_SIZE);
> + } else {
> + val = min_not_zero(READ_ONCE(max_read_size),
> + DM_CRYPT_DEFAULT_MAX_READ_SIZE);
> + }
> +
> + if (wrt || cc->used_tag_size)
> + val = min(val, BIO_MAX_VECS << PAGE_SHIFT);
> +
> + sector_align = max(bdev_logical_block_size(cc->dev->bdev),
> + (unsigned)cc->sector_size);
> val = round_down(val, sector_align);
> if (unlikely(!val))
> val = sector_align;
> @@ -3496,7 +3514,7 @@ static int crypt_map(struct dm_target *ti, struct bio *bio)
> /*
> * Check if bio is too large, split as needed.
> */
> - max_sectors = get_max_request_size(cc, bio_data_dir(bio) == WRITE);
> + max_sectors = get_max_request_sectors(ti, bio);
> if (unlikely(bio_sectors(bio) > max_sectors))
> dm_accept_partial_bio(bio, max_sectors);
>
> @@ -3733,6 +3751,17 @@ static void crypt_io_hints(struct dm_target *ti, struct queue_limits *limits)
> max_t(unsigned int, limits->physical_block_size, cc->sector_size);
> limits->io_min = max_t(unsigned int, limits->io_min, cc->sector_size);
> limits->dma_alignment = limits->logical_block_size - 1;
> +
> + /*
> + * For zoned dm-crypt targets, there will be no internal splitting of
> + * write BIOs to avoid exceeding BIO_MAX_VECS vectors per BIO. But
> + * without respecting this limit, crypt_alloc_buffer() will trigger a
> + * BUG(). Avoid this by forcing DM core to split write BIOs to this
> + * limit.
> + */
> + if (ti->emulate_zone_append)
> + limits->max_hw_sectors = min(limits->max_hw_sectors,
> + BIO_MAX_VECS << PAGE_SECTORS_SHIFT);
> }
>
> static struct target_type crypt_target = {
> --
> 2.49.0
>
* Re: [PATCH v3 4/5] dm: dm-crypt: Do not partially accept write BIOs with zoned targets
2025-06-25 9:33 ` [PATCH v3 4/5] dm: dm-crypt: Do not partially accept write BIOs with zoned targets Damien Le Moal
2025-06-25 16:31 ` Mikulas Patocka
@ 2025-06-25 16:42 ` Bart Van Assche
1 sibling, 0 replies; 25+ messages in thread
From: Bart Van Assche @ 2025-06-25 16:42 UTC (permalink / raw)
To: Damien Le Moal, linux-block, Jens Axboe, dm-devel, Mike Snitzer,
Mikulas Patocka
On 6/25/25 2:33 AM, Damien Le Moal wrote:
> Fix this by modifying get_max_request_size() to always return the size
> of the BIO to avoid it being split with dm_accpet_partial_bio() in
^^^^^^^^^^^^^^^^^^^^^^^
Two letters got swapped in this function name ...
Otherwise this patch looks good to me.
Thanks,
Bart.
* [PATCH v3 5/5] dm: Check for forbidden splitting of zone write operations
2025-06-25 9:33 [PATCH v3 0/5] Fix write operation handling for zoned DM devices Damien Le Moal
` (3 preceding siblings ...)
2025-06-25 9:33 ` [PATCH v3 4/5] dm: dm-crypt: Do not partially accept write BIOs with zoned targets Damien Le Moal
@ 2025-06-25 9:33 ` Damien Le Moal
2025-06-25 16:31 ` Mikulas Patocka
2025-06-25 16:37 ` [PATCH v3 0/5] Fix write operation handling for zoned DM devices Jens Axboe
5 siblings, 1 reply; 25+ messages in thread
From: Damien Le Moal @ 2025-06-25 9:33 UTC (permalink / raw)
To: linux-block, Jens Axboe, dm-devel, Mike Snitzer, Mikulas Patocka
Cc: Bart Van Assche
DM targets must not split zone append and write operations using
dm_accept_partial_bio() as doing so is forbidden for zone append BIOs,
breaks zone append emulation using regular write BIOs and potentially
creates deadlock situations with queue freeze operations.
Modify dm_accept_partial_bio() to add the missing BUG_ON() checks for
all these cases, that is, also check that the BIO is not a write or
write zeroes operation when the target device is zoned. This change
packs all the zone-related checks together under a
static_branch_unlikely(&zoned_enabled) and performs them only if the
target device is a zoned device.
Fixes: f211268ed1f9 ("dm: Use the block layer zone append emulation")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
drivers/md/dm.c | 17 +++++++++++++----
1 file changed, 13 insertions(+), 4 deletions(-)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index f1e63c1808b4..f82457e7eed1 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1286,8 +1286,9 @@ static size_t dm_dax_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff,
/*
* A target may call dm_accept_partial_bio only from the map routine. It is
* allowed for all bio types except REQ_PREFLUSH, REQ_OP_ZONE_* zone management
- * operations, REQ_OP_ZONE_APPEND (zone append writes) and any bio serviced by
- * __send_duplicate_bios().
+ * operations, zone append writes (native with REQ_OP_ZONE_APPEND or emulated
+ * with write BIOs flagged with BIO_EMULATES_ZONE_APPEND) and any bio serviced
+ * by __send_duplicate_bios().
*
* dm_accept_partial_bio informs the dm that the target only wants to process
* additional n_sectors sectors of the bio and the rest of the data should be
@@ -1320,11 +1321,19 @@ void dm_accept_partial_bio(struct bio *bio, unsigned int n_sectors)
unsigned int bio_sectors = bio_sectors(bio);
BUG_ON(dm_tio_flagged(tio, DM_TIO_IS_DUPLICATE_BIO));
- BUG_ON(op_is_zone_mgmt(bio_op(bio)));
- BUG_ON(bio_op(bio) == REQ_OP_ZONE_APPEND);
BUG_ON(bio_sectors > *tio->len_ptr);
BUG_ON(n_sectors > bio_sectors);
+ if (static_branch_unlikely(&zoned_enabled) &&
+ unlikely(bdev_is_zoned(bio->bi_bdev))) {
+ enum req_op op = bio_op(bio);
+
+ BUG_ON(op_is_zone_mgmt(op));
+ BUG_ON(op == REQ_OP_WRITE);
+ BUG_ON(op == REQ_OP_WRITE_ZEROES);
+ BUG_ON(op == REQ_OP_ZONE_APPEND);
+ }
+
*tio->len_ptr -= bio_sectors - n_sectors;
bio->bi_iter.bi_size = n_sectors << SECTOR_SHIFT;
--
2.49.0
* Re: [PATCH v3 5/5] dm: Check for forbidden splitting of zone write operations
2025-06-25 9:33 ` [PATCH v3 5/5] dm: Check for forbidden splitting of zone write operations Damien Le Moal
@ 2025-06-25 16:31 ` Mikulas Patocka
0 siblings, 0 replies; 25+ messages in thread
From: Mikulas Patocka @ 2025-06-25 16:31 UTC (permalink / raw)
To: Damien Le Moal
Cc: linux-block, Jens Axboe, dm-devel, Mike Snitzer, Bart Van Assche
On Wed, 25 Jun 2025, Damien Le Moal wrote:
> DM targets must not split zone append and write operations using
> dm_accept_partial_bio() as doing so is forbidden for zone append BIOs,
> breaks zone append emulation using regular write BIOs and potentially
> creates deadlock situations with queue freeze operations.
>
> Modify dm_accept_partial_bio() to add the missing BUG_ON() checks for
> all these cases, that is, also check that the BIO is not a write or
> write zeroes operation when the target device is zoned. This change
> packs all the zone-related checks together under a
> static_branch_unlikely(&zoned_enabled) and performs them only if the
> target device is a zoned device.
>
> Fixes: f211268ed1f9 ("dm: Use the block layer zone append emulation")
> Cc: stable@vger.kernel.org
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
> ---
> drivers/md/dm.c | 17 +++++++++++++----
> 1 file changed, 13 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index f1e63c1808b4..f82457e7eed1 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1286,8 +1286,9 @@ static size_t dm_dax_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff,
> /*
> * A target may call dm_accept_partial_bio only from the map routine. It is
> * allowed for all bio types except REQ_PREFLUSH, REQ_OP_ZONE_* zone management
> - * operations, REQ_OP_ZONE_APPEND (zone append writes) and any bio serviced by
> - * __send_duplicate_bios().
> + * operations, zone append writes (native with REQ_OP_ZONE_APPEND or emulated
> + * with write BIOs flagged with BIO_EMULATES_ZONE_APPEND) and any bio serviced
> + * by __send_duplicate_bios().
> *
> * dm_accept_partial_bio informs the dm that the target only wants to process
> * additional n_sectors sectors of the bio and the rest of the data should be
> @@ -1320,11 +1321,19 @@ void dm_accept_partial_bio(struct bio *bio, unsigned int n_sectors)
> unsigned int bio_sectors = bio_sectors(bio);
>
> BUG_ON(dm_tio_flagged(tio, DM_TIO_IS_DUPLICATE_BIO));
> - BUG_ON(op_is_zone_mgmt(bio_op(bio)));
> - BUG_ON(bio_op(bio) == REQ_OP_ZONE_APPEND);
> BUG_ON(bio_sectors > *tio->len_ptr);
> BUG_ON(n_sectors > bio_sectors);
>
> + if (static_branch_unlikely(&zoned_enabled) &&
> + unlikely(bdev_is_zoned(bio->bi_bdev))) {
> + enum req_op op = bio_op(bio);
> +
> + BUG_ON(op_is_zone_mgmt(op));
> + BUG_ON(op == REQ_OP_WRITE);
> + BUG_ON(op == REQ_OP_WRITE_ZEROES);
> + BUG_ON(op == REQ_OP_ZONE_APPEND);
> + }
> +
> *tio->len_ptr -= bio_sectors - n_sectors;
> bio->bi_iter.bi_size = n_sectors << SECTOR_SHIFT;
>
> --
> 2.49.0
>
* Re: [PATCH v3 0/5] Fix write operation handling for zoned DM devices
2025-06-25 9:33 [PATCH v3 0/5] Fix write operation handling for zoned DM devices Damien Le Moal
` (4 preceding siblings ...)
2025-06-25 9:33 ` [PATCH v3 5/5] dm: Check for forbidden splitting of zone write operations Damien Le Moal
@ 2025-06-25 16:37 ` Jens Axboe
5 siblings, 0 replies; 25+ messages in thread
From: Jens Axboe @ 2025-06-25 16:37 UTC (permalink / raw)
To: linux-block, dm-devel, Mike Snitzer, Mikulas Patocka,
Damien Le Moal
Cc: Bart Van Assche
On Wed, 25 Jun 2025 18:33:22 +0900, Damien Le Moal wrote:
> Jens, Mike, Mikulas,
>
> Any zoned DM device using target drivers that internally split BIOs
> using dm_accept_partial_bio() can cause deadlocks with concurrent queue
> freeze operations. Furthermore, a target splitting the write operations
> used to emulate zone append requests breaks the emulation. This patch series
> addresses both issues by forcing DM to split BIOs to the DM device
> limits before passing the BIOs to the target map() function, and by
> avoiding calls to dm_accept_partial_bio() for zoned DM targets that use
> zone append emulation.
>
> [...]
Applied, thanks!
[1/5] block: Make REQ_OP_ZONE_FINISH a write operation
commit: 24cb7af081023bbb7a25ae514e6e2b398d4ab25c
[2/5] block: Introduce bio_needs_zone_write_plugging()
commit: bf7a8b5cbbb2d531f3336be2186af0c5590d157c
[3/5] dm: Always split write BIOs to zoned device limits
commit: 74e93dbcd320a8e9d5f7b3f238eedc7da6d6ca7e
[4/5] dm: dm-crypt: Do not partially accept write BIOs with zoned targets
commit: 0277a0d91194b79b4a3db0c53ba205a933554695
[5/5] dm: Check for forbidden splitting of zone write operations
commit: e04a33a18fdb259d7ad3673ddfce6112f5ce30fd
Best regards,
--
Jens Axboe