* [PATCH v2 1/4] block: Introduce bio_needs_zone_write_plugging()
2025-06-25 5:59 [PATCH v2 0/4] Fix write operation handling for zoned DM devices Damien Le Moal
@ 2025-06-25 5:59 ` Damien Le Moal
2025-06-25 6:12 ` Christoph Hellwig
2025-06-25 5:59 ` [PATCH v2 2/4] dm: Always split write BIOs to zoned device limits Damien Le Moal
` (2 subsequent siblings)
3 siblings, 1 reply; 17+ messages in thread
From: Damien Le Moal @ 2025-06-25 5:59 UTC (permalink / raw)
To: linux-block, Jens Axboe, dm-devel, Mike Snitzer, Mikulas Patocka
Cc: Bart Van Assche
In preparation for fixing device mapper zone write handling, introduce
the helper function bio_needs_zone_write_plugging() to test if a BIO
requires handling through zone write plugging. This function returns
true for any write operation to a zoned block device. For zone append
operations, true is returned only if the device requires zone append
emulation with regular writes.
Fixes: f211268ed1f9 ("dm: Use the block layer zone append emulation")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
block/blk-zoned.c | 40 ++++++++++++++++++++++++++++++++++++++++
include/linux/blkdev.h | 9 +++++++++
2 files changed, 49 insertions(+)
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 55e64ca869d7..9d38b94cad0d 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -1129,6 +1129,46 @@ static void blk_zone_wplug_handle_native_zone_append(struct bio *bio)
disk_put_zone_wplug(zwplug);
}
+/**
+ * bio_needs_zone_write_plugging - Check if a BIO needs to be handled with zone
+ * write plugging
+ * @bio: The BIO being submitted
+ *
+ * Return true whenever @bio execution needs to be handled through zone
+ * write plugging. Return false otherwise.
+ */
+bool bio_needs_zone_write_plugging(struct bio *bio)
+{
+ enum req_op op = bio_op(bio);
+
+ if (!bdev_is_zoned(bio->bi_bdev) || !op_is_write(op))
+ return false;
+
+ /* Already handled ? */
+ if (bio_flagged(bio, BIO_ZONE_WRITE_PLUGGING))
+ return false;
+
+ /* Ignore empty flush */
+ if (op_is_flush(bio->bi_opf) && !bio_sectors(bio))
+ return false;
+
+ /*
+ * Regular writes and write zeroes need to be handled through zone
+ * write plugging. Zone append operations only need zone write plugging
+ * if they are emulated.
+ */
+ switch (op) {
+ case REQ_OP_ZONE_APPEND:
+ return bdev_emulates_zone_append(bio->bi_bdev);
+ case REQ_OP_WRITE:
+ case REQ_OP_WRITE_ZEROES:
+ return true;
+ default:
+ return false;
+ }
+}
+EXPORT_SYMBOL_GPL(bio_needs_zone_write_plugging);
+
/**
* blk_zone_plug_bio - Handle a zone write BIO with zone write plugging
* @bio: The BIO being submitted
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index c2b3ddea8b6d..d455057bfaa3 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -848,6 +848,9 @@ static inline unsigned int disk_nr_zones(struct gendisk *disk)
{
return disk->nr_zones;
}
+
+bool bio_needs_zone_write_plugging(struct bio *bio);
+
bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs);
/**
@@ -877,6 +880,12 @@ static inline unsigned int disk_nr_zones(struct gendisk *disk)
{
return 0;
}
+
+static inline bool bio_needs_zone_write_plugging(struct bio *bio)
+{
+ return false;
+}
+
static inline bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs)
{
return false;
--
2.49.0
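[Editor's note: the decision table implemented by bio_needs_zone_write_plugging() above can be sanity-checked with a small self-contained userspace model. The enum, struct, and helper names below are simplified stand-ins for the kernel definitions, not the real types:]

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for the kernel request operations. */
enum op { OP_READ, OP_WRITE, OP_WRITE_ZEROES, OP_ZONE_APPEND };

/* Model of the BIO state that the helper inspects. */
struct model_bio {
	enum op op;
	bool bdev_is_zoned;		/* bdev_is_zoned(bio->bi_bdev) */
	bool already_plugging;		/* BIO_ZONE_WRITE_PLUGGING flag set */
	bool empty_flush;		/* op_is_flush() && !bio_sectors() */
	bool bdev_emulates_zone_append;	/* bdev_emulates_zone_append() */
};

/* In the kernel, op_is_write() tests a bit of the op; modeled here. */
static bool op_is_write_model(enum op op)
{
	return op != OP_READ;
}

/* Userspace model of bio_needs_zone_write_plugging(). */
static bool needs_zone_write_plugging(const struct model_bio *bio)
{
	if (!bio->bdev_is_zoned || !op_is_write_model(bio->op))
		return false;

	/* Already handled? */
	if (bio->already_plugging)
		return false;

	/* Ignore empty flush */
	if (bio->empty_flush)
		return false;

	/*
	 * Regular writes and write zeroes need zone write plugging.
	 * Zone append only needs it when emulated.
	 */
	switch (bio->op) {
	case OP_ZONE_APPEND:
		return bio->bdev_emulates_zone_append;
	case OP_WRITE:
	case OP_WRITE_ZEROES:
		return true;
	default:
		return false;
	}
}
```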
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH v2 1/4] block: Introduce bio_needs_zone_write_plugging()
2025-06-25 5:59 ` [PATCH v2 1/4] block: Introduce bio_needs_zone_write_plugging() Damien Le Moal
@ 2025-06-25 6:12 ` Christoph Hellwig
2025-06-25 6:14 ` Damien Le Moal
0 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2025-06-25 6:12 UTC (permalink / raw)
To: Damien Le Moal
Cc: linux-block, Jens Axboe, dm-devel, Mike Snitzer, Mikulas Patocka,
Bart Van Assche
On Wed, Jun 25, 2025 at 02:59:05PM +0900, Damien Le Moal wrote:
> +bool bio_needs_zone_write_plugging(struct bio *bio)
Can you use this in blk_zone_plug_bio instead of duplicating the logic?
I also wonder if we should only it it, as despite looking quite complex
it should compile down to just a few instructions and is used in the
I/O fast path.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 1/4] block: Introduce bio_needs_zone_write_plugging()
2025-06-25 6:12 ` Christoph Hellwig
@ 2025-06-25 6:14 ` Damien Le Moal
2025-06-25 6:18 ` Christoph Hellwig
0 siblings, 1 reply; 17+ messages in thread
From: Damien Le Moal @ 2025-06-25 6:14 UTC (permalink / raw)
To: Christoph Hellwig
Cc: linux-block, Jens Axboe, dm-devel, Mike Snitzer, Mikulas Patocka,
Bart Van Assche
On 6/25/25 3:12 PM, Christoph Hellwig wrote:
> On Wed, Jun 25, 2025 at 02:59:05PM +0900, Damien Le Moal wrote:
>> +bool bio_needs_zone_write_plugging(struct bio *bio)
>
> Can you use this in blk_zone_plug_bio instead of duplicating the logic?
I thought about doing that, but we would still need to again do the switch/case
on the bio op. But the checks before that could go into a common static helper.
> I also wonder if we should only it it, as despite looking quite complex
This does not parse...
> it should compile down to just a few instructions and is used in the
> I/O fast path.
>
>
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 1/4] block: Introduce bio_needs_zone_write_plugging()
2025-06-25 6:14 ` Damien Le Moal
@ 2025-06-25 6:18 ` Christoph Hellwig
2025-06-25 6:17 ` Damien Le Moal
0 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2025-06-25 6:18 UTC (permalink / raw)
To: Damien Le Moal
Cc: Christoph Hellwig, linux-block, Jens Axboe, dm-devel,
Mike Snitzer, Mikulas Patocka, Bart Van Assche
On Wed, Jun 25, 2025 at 03:14:51PM +0900, Damien Le Moal wrote:
> On 6/25/25 3:12 PM, Christoph Hellwig wrote:
> > On Wed, Jun 25, 2025 at 02:59:05PM +0900, Damien Le Moal wrote:
> >> +bool bio_needs_zone_write_plugging(struct bio *bio)
> >
> > Can you use this in blk_zone_plug_bio instead of duplicating the logic?
>
> I thought about doing that, but we would still need to again do the switch/case
> on the bio op. But the checks before that could go into a common static helper.
>
> > I also wonder if we should only it it, as despite looking quite complex
>
> This does not parse...
The "only it" above should be "inline", -ENOTENOUGHCOFFEE
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 1/4] block: Introduce bio_needs_zone_write_plugging()
2025-06-25 6:18 ` Christoph Hellwig
@ 2025-06-25 6:17 ` Damien Le Moal
0 siblings, 0 replies; 17+ messages in thread
From: Damien Le Moal @ 2025-06-25 6:17 UTC (permalink / raw)
To: Christoph Hellwig
Cc: linux-block, Jens Axboe, dm-devel, Mike Snitzer, Mikulas Patocka,
Bart Van Assche
On 6/25/25 3:18 PM, Christoph Hellwig wrote:
> On Wed, Jun 25, 2025 at 03:14:51PM +0900, Damien Le Moal wrote:
>> On 6/25/25 3:12 PM, Christoph Hellwig wrote:
>>> On Wed, Jun 25, 2025 at 02:59:05PM +0900, Damien Le Moal wrote:
>>>> +bool bio_needs_zone_write_plugging(struct bio *bio)
>>>
>>> Can you use this in blk_zone_plug_bio instead of duplicating the logic?
>>
>> I thought about doing that, but we would still need to again do the switch/case
>> on the bio op. But the checks before that could go into a common static helper.
>>
>>> I also wonder if we should only it it, as despite looking quite complex
>>
>> This does not parse...
>
> The "only it" above should be "inline", -ENOTENOUGHCOFFEE
OK. Let me see if I can do that.
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH v2 2/4] dm: Always split write BIOs to zoned device limits
2025-06-25 5:59 [PATCH v2 0/4] Fix write operation handling for zoned DM devices Damien Le Moal
2025-06-25 5:59 ` [PATCH v2 1/4] block: Introduce bio_needs_zone_write_plugging() Damien Le Moal
@ 2025-06-25 5:59 ` Damien Le Moal
2025-06-25 6:15 ` Christoph Hellwig
2025-06-25 5:59 ` [PATCH v2 3/4] dm: dm-crypt: Do not split write operations with zoned targets Damien Le Moal
2025-06-25 5:59 ` [PATCH v2 4/4] dm: Check for forbidden splitting of zone write operations Damien Le Moal
3 siblings, 1 reply; 17+ messages in thread
From: Damien Le Moal @ 2025-06-25 5:59 UTC (permalink / raw)
To: linux-block, Jens Axboe, dm-devel, Mike Snitzer, Mikulas Patocka
Cc: Bart Van Assche
Any zoned DM target that requires zone append emulation will use the
block layer zone write plugging. In such case, DM target drivers must
not split BIOs using dm_accept_partial_bio() as doing so can potentially
lead to deadlocks with queue freeze operations. Regular write operations
used to emulate zone append operations also cannot be split by the
target driver as that would result in an invalid written sector value
being returned through the BIO sector.
In order for zoned DM target drivers to avoid such incorrect BIO
splitting, we must ensure that large BIOs are split before being passed
to the map() function of the target, thus guaranteeing that the
limits for the mapped device are not exceeded.
dm-crypt and dm-flakey are the only target drivers supporting zoned
devices and using dm_accept_partial_bio().
In the case of dm-crypt, this function is used to split BIOs to the
internal max_write_size limit (which will be disabled in a following
patch). However, crypt_alloc_buffer() uses a bioset that allows only up
to BIO_MAX_VECS (256) vectors in a BIO. The dm-crypt device max_segments
limit, which is not set and so defaults to BLK_MAX_SEGMENTS (128), must
thus be respected and write BIOs split accordingly.
In the case of dm-flakey, since zone append emulation is not required,
the block layer zone write plugging is not used and no splitting of BIOs
is required.
Modify the function dm_zone_bio_needs_split() to use the block layer
helper function bio_needs_zone_write_plugging() to force a call to
bio_split_to_limits() in dm_split_and_process_bio(). This allows DM
target drivers to avoid using dm_accept_partial_bio() for write
operations on zoned DM devices.
Fixes: f211268ed1f9 ("dm: Use the block layer zone append emulation")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
drivers/md/dm.c | 28 +++++++++++++++++++++-------
1 file changed, 21 insertions(+), 7 deletions(-)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 55579adbeb3f..e01ed89b2e45 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1773,12 +1773,28 @@ static inline bool dm_zone_bio_needs_split(struct mapped_device *md,
struct bio *bio)
{
/*
- * For mapped device that need zone append emulation, we must
- * split any large BIO that straddles zone boundaries.
+ * Special case REQ_OP_ZONE_RESET_ALL and REQ_OP_ZONE_APPEND as these
+ * operations cannot be split.
*/
- return dm_emulate_zone_append(md) && bio_straddles_zones(bio) &&
- !bio_flagged(bio, BIO_ZONE_WRITE_PLUGGING);
+ switch (bio_op(bio)) {
+ case REQ_OP_ZONE_RESET_ALL:
+ case REQ_OP_ZONE_APPEND:
+ return false;
+ default:
+ break;
+ }
+
+ /*
+ * Mapped devices that require zone append emulation will use the block
+ * layer zone write plugging. In that case, we must split any large BIO
+ * to the mapped device limits to avoid potential deadlocks with queue
+ * freeze operations.
+ */
+ if (!dm_emulate_zone_append(md))
+ return false;
+ return bio_needs_zone_write_plugging(bio) || bio_straddles_zones(bio);
}
+
static inline bool dm_zone_plug_bio(struct mapped_device *md, struct bio *bio)
{
return dm_emulate_zone_append(md) && blk_zone_plug_bio(bio, 0);
@@ -1925,9 +1941,7 @@ static void dm_split_and_process_bio(struct mapped_device *md,
is_abnormal = is_abnormal_io(bio);
if (static_branch_unlikely(&zoned_enabled)) {
- /* Special case REQ_OP_ZONE_RESET_ALL as it cannot be split. */
- need_split = (bio_op(bio) != REQ_OP_ZONE_RESET_ALL) &&
- (is_abnormal || dm_zone_bio_needs_split(md, bio));
+ need_split = is_abnormal || dm_zone_bio_needs_split(md, bio);
} else {
need_split = is_abnormal;
}
--
2.49.0
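[Editor's note: the split decision implemented by the modified dm_zone_bio_needs_split() can be sketched as a standalone model. The booleans below are hypothetical stand-ins for dm_emulate_zone_append(md), bio_needs_zone_write_plugging(bio), and bio_straddles_zones(bio):]

```c
#include <assert.h>
#include <stdbool.h>

enum op { OP_READ, OP_WRITE, OP_ZONE_APPEND, OP_ZONE_RESET_ALL };

/*
 * Userspace model of dm_zone_bio_needs_split() as modified by the patch
 * above. Kernel predicates are reduced to boolean parameters.
 */
static bool zone_bio_needs_split(enum op op, bool md_emulates_zone_append,
				 bool needs_zone_write_plugging,
				 bool straddles_zones)
{
	/*
	 * Special case REQ_OP_ZONE_RESET_ALL and REQ_OP_ZONE_APPEND as
	 * these operations cannot be split.
	 */
	switch (op) {
	case OP_ZONE_RESET_ALL:
	case OP_ZONE_APPEND:
		return false;
	default:
		break;
	}

	/* Only devices using zone append emulation need the forced split. */
	if (!md_emulates_zone_append)
		return false;

	return needs_zone_write_plugging || straddles_zones;
}
```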
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH v2 2/4] dm: Always split write BIOs to zoned device limits
2025-06-25 5:59 ` [PATCH v2 2/4] dm: Always split write BIOs to zoned device limits Damien Le Moal
@ 2025-06-25 6:15 ` Christoph Hellwig
2025-06-25 6:15 ` Damien Le Moal
0 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2025-06-25 6:15 UTC (permalink / raw)
To: Damien Le Moal
Cc: linux-block, Jens Axboe, dm-devel, Mike Snitzer, Mikulas Patocka,
Bart Van Assche
On Wed, Jun 25, 2025 at 02:59:06PM +0900, Damien Le Moal wrote:
> Any zoned DM target that requires zone append emulation will use the
> block layer zone write plugging. In such case, DM target drivers must
> not split BIOs using dm_accept_partial_bio() as doing so can potentially
> lead to deadlocks with queue freeze operations. Regular write operations
> used to emulate zone append operations also cannot be split by the
> target driver as that would result in an invalid writen sector value
> return using the BIO sector.
>
> In order for zoned DM target drivers to avoid such incorrect BIO
> splitting, we must ensure that large BIOs are split before being passed
> to the map() function of the target, thus guaranteeing that the
> limits for the mapped device are not exceeded.
>
> dm-crypt and dm-flakey are the only target drivers supporting zoned
> devices and using dm_accept_partial_bio().
Is there any good way to catch usage dm_accept_partial_bio on zone
devices so that issues like this don't get reintroduced later?
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 2/4] dm: Always split write BIOs to zoned device limits
2025-06-25 6:15 ` Christoph Hellwig
@ 2025-06-25 6:15 ` Damien Le Moal
0 siblings, 0 replies; 17+ messages in thread
From: Damien Le Moal @ 2025-06-25 6:15 UTC (permalink / raw)
To: Christoph Hellwig
Cc: linux-block, Jens Axboe, dm-devel, Mike Snitzer, Mikulas Patocka,
Bart Van Assche
On 6/25/25 3:15 PM, Christoph Hellwig wrote:
> On Wed, Jun 25, 2025 at 02:59:06PM +0900, Damien Le Moal wrote:
>> Any zoned DM target that requires zone append emulation will use the
>> block layer zone write plugging. In such case, DM target drivers must
>> not split BIOs using dm_accept_partial_bio() as doing so can potentially
>> lead to deadlocks with queue freeze operations. Regular write operations
>> used to emulate zone append operations also cannot be split by the
>> target driver as that would result in an invalid writen sector value
>> return using the BIO sector.
>>
>> In order for zoned DM target drivers to avoid such incorrect BIO
>> splitting, we must ensure that large BIOs are split before being passed
>> to the map() function of the target, thus guaranteeing that the
>> limits for the mapped device are not exceeded.
>>
>> dm-crypt and dm-flakey are the only target drivers supporting zoned
>> devices and using dm_accept_partial_bio().
>
> Is there any good way to catch usage dm_accept_partial_bio on zone
> devices so that issues like this don't get reintroduced later?
patch 4 does that. Though with the heavy BUG_ON() hammer. That can be cleaned
up later though.
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH v2 3/4] dm: dm-crypt: Do not split write operations with zoned targets
2025-06-25 5:59 [PATCH v2 0/4] Fix write operation handling for zoned DM devices Damien Le Moal
2025-06-25 5:59 ` [PATCH v2 1/4] block: Introduce bio_needs_zone_write_plugging() Damien Le Moal
2025-06-25 5:59 ` [PATCH v2 2/4] dm: Always split write BIOs to zoned device limits Damien Le Moal
@ 2025-06-25 5:59 ` Damien Le Moal
2025-06-25 10:19 ` Mikulas Patocka
2025-06-25 5:59 ` [PATCH v2 4/4] dm: Check for forbidden splitting of zone write operations Damien Le Moal
3 siblings, 1 reply; 17+ messages in thread
From: Damien Le Moal @ 2025-06-25 5:59 UTC (permalink / raw)
To: linux-block, Jens Axboe, dm-devel, Mike Snitzer, Mikulas Patocka
Cc: Bart Van Assche
Read and write operations issued to a dm-crypt target may be split
according to the dm-crypt internal limits defined by the max_read_size
and max_write_size module parameters (default is 128 KB). The intent is
to improve processing time of large BIOs by splitting them into smaller
operations that can be parallelized on different CPUs.
For zoned dm-crypt targets, this BIO splitting is still done but without
the parallel execution to ensure that the issuing order of write
operations to the underlying devices remains sequential. However, the
splitting itself causes other problems:
1) Since dm-crypt relies on the block layer zone write plugging to
handle zone append emulation using regular write operations, the
remainder of a split write BIO will always be plugged into the target
zone write plug. Once the on-going write BIO finishes, this
remainder BIO is unplugged and issued from the zone write plug work.
If this remainder BIO itself needs to be split, the remainder will be
re-issued and plugged again, but that causes a call to
blk_queue_enter(), which may block if a queue freeze operation was
initiated. This results in a deadlock as DM submission still holds
BIOs that the queue freeze side is waiting for.
2) dm-crypt relies on the emulation done by the block layer using
regular write operations for processing zone append operations. This
still requires properly returning the written sector as the BIO
sector of the original BIO. However, this can be done correctly if
and only if there is a single clone BIO used for processing the
original zone append operation issued by the user. If the size of a
zone append operation is larger than dm-crypt max_write_size, then
the original BIO will be split and processed as a chain of regular
write operations. Such chaining results in an incorrect written
sector being returned to the zone append issuer through the original
BIO sector. This in turn results in file system data corruption with
xfs or btrfs.
Fix this by modifying get_max_request_size() to always return the size
of the BIO to avoid it being split with dm_accept_partial_bio() in
crypt_map(). get_max_request_size() is renamed to
get_max_request_sectors() to clarify the unit of the value returned
and its interface is changed to take a struct dm_target pointer and a
pointer to the struct bio being processed. In addition to this change,
to ensure that crypt_alloc_buffer() works correctly, set the dm-crypt
device max_hw_sectors limit to be at most
BIO_MAX_VECS << PAGE_SECTORS_SHIFT (1 MB with a 4KB page architecture).
This forces DM core to split write BIOs before passing them to
crypt_map(), thus guaranteeing that dm-crypt can always accept an
entire write BIO without needing to split it.
This change does not have any effect on the read path of dm-crypt. Read
operations can still be split and the BIO fragments processed in
parallel. There is also no impact on the performance of the write path
given that all zone write BIOs were already processed inline instead of
in parallel.
This change also does not affect in any way regular dm-crypt block
devices.
Fixes: f211268ed1f9 ("dm: Use the block layer zone append emulation")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
drivers/md/dm-crypt.c | 49 ++++++++++++++++++++++++++++++++++---------
1 file changed, 39 insertions(+), 10 deletions(-)
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 17157c4216a5..4e80784d1734 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -253,17 +253,35 @@ MODULE_PARM_DESC(max_read_size, "Maximum size of a read request");
static unsigned int max_write_size = 0;
module_param(max_write_size, uint, 0644);
MODULE_PARM_DESC(max_write_size, "Maximum size of a write request");
-static unsigned get_max_request_size(struct crypt_config *cc, bool wrt)
+
+static unsigned get_max_request_sectors(struct dm_target *ti, struct bio *bio)
{
+ struct crypt_config *cc = ti->private;
unsigned val, sector_align;
- val = !wrt ? READ_ONCE(max_read_size) : READ_ONCE(max_write_size);
- if (likely(!val))
- val = !wrt ? DM_CRYPT_DEFAULT_MAX_READ_SIZE : DM_CRYPT_DEFAULT_MAX_WRITE_SIZE;
- if (wrt || cc->used_tag_size) {
- if (unlikely(val > BIO_MAX_VECS << PAGE_SHIFT))
- val = BIO_MAX_VECS << PAGE_SHIFT;
- }
- sector_align = max(bdev_logical_block_size(cc->dev->bdev), (unsigned)cc->sector_size);
+ bool wrt = op_is_write(bio_op(bio));
+
+ if (wrt) {
+ /*
+ * For zoned devices, splitting write operations creates the
+ * risk of deadlocking queue freeze operations with zone write
+ * plugging BIO work when the remainder of a split BIO is
+ * issued. So always allow the entire BIO to proceed.
+ */
+ if (ti->emulate_zone_append)
+ return bio_sectors(bio);
+
+ val = min_not_zero(READ_ONCE(max_write_size),
+ DM_CRYPT_DEFAULT_MAX_WRITE_SIZE);
+ } else {
+ val = min_not_zero(READ_ONCE(max_read_size),
+ DM_CRYPT_DEFAULT_MAX_READ_SIZE);
+ }
+
+ if (wrt || cc->used_tag_size)
+ val = min(val, BIO_MAX_VECS << PAGE_SHIFT);
+
+ sector_align = max(bdev_logical_block_size(cc->dev->bdev),
+ (unsigned)cc->sector_size);
val = round_down(val, sector_align);
if (unlikely(!val))
val = sector_align;
@@ -3496,7 +3514,7 @@ static int crypt_map(struct dm_target *ti, struct bio *bio)
/*
* Check if bio is too large, split as needed.
*/
- max_sectors = get_max_request_size(cc, bio_data_dir(bio) == WRITE);
+ max_sectors = get_max_request_sectors(ti, bio);
if (unlikely(bio_sectors(bio) > max_sectors))
dm_accept_partial_bio(bio, max_sectors);
@@ -3733,6 +3751,17 @@ static void crypt_io_hints(struct dm_target *ti, struct queue_limits *limits)
max_t(unsigned int, limits->physical_block_size, cc->sector_size);
limits->io_min = max_t(unsigned int, limits->io_min, cc->sector_size);
limits->dma_alignment = limits->logical_block_size - 1;
+
+ /*
+ * For zoned dm-crypt targets, there will be no internal splitting of
+ * write BIOs to avoid exceeding BIO_MAX_VECS vectors per BIO. But
+ * without respecting this limit, crypt_alloc_buffer() will trigger a
+ * BUG(). Avoid this by forcing DM core to split write BIOs to this
+ * limit.
+ */
+ if (ti->emulate_zone_append)
+ limits->max_hw_sectors = min(limits->max_hw_sectors,
+ BIO_MAX_VECS << PAGE_SECTORS_SHIFT);
}
static struct target_type crypt_target = {
--
2.49.0
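[Editor's note: the sizing arithmetic touched by this patch can be checked with a small standalone model. min_not_zero() and round_down() are reimplemented here for illustration, and the constants assume a 4 KB page and 512-byte sectors, matching the "1 MB with a 4KB page architecture" figure in the commit message:]

```c
#include <assert.h>

#define BIO_MAX_VECS		256u
#define PAGE_SHIFT		12	/* assumes 4 KB pages */
#define SECTOR_SHIFT		9	/* 512-byte sectors */
#define PAGE_SECTORS_SHIFT	(PAGE_SHIFT - SECTOR_SHIFT)
#define DEFAULT_MAX_WRITE_SIZE	(128u * 1024)	/* 128 KB default */

/* Treat 0 as "unset", as the kernel's min_not_zero() does. */
static unsigned min_not_zero_model(unsigned a, unsigned b)
{
	if (!a)
		return b;
	if (!b)
		return a;
	return a < b ? a : b;
}

static unsigned round_down_to(unsigned val, unsigned align)
{
	return val - (val % align);
}

/*
 * Model of the write-path sizing in get_max_request_sectors() for a
 * non-zoned target (zoned targets return bio_sectors(bio) directly).
 * Returns the limit in bytes.
 */
static unsigned max_write_bytes(unsigned max_write_size_param,
				unsigned sector_align)
{
	unsigned val = min_not_zero_model(max_write_size_param,
					  DEFAULT_MAX_WRITE_SIZE);

	/* Writes are capped so crypt_alloc_buffer() never overflows a BIO. */
	if (val > BIO_MAX_VECS << PAGE_SHIFT)
		val = BIO_MAX_VECS << PAGE_SHIFT;
	val = round_down_to(val, sector_align);
	if (!val)
		val = sector_align;
	return val;
}
```

The max_hw_sectors cap added in crypt_io_hints() is BIO_MAX_VECS << PAGE_SECTORS_SHIFT sectors, i.e. 256 << 3 = 2048 sectors = 1 MB with these constants.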
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH v2 3/4] dm: dm-crypt: Do not split write operations with zoned targets
2025-06-25 5:59 ` [PATCH v2 3/4] dm: dm-crypt: Do not split write operations with zoned targets Damien Le Moal
@ 2025-06-25 10:19 ` Mikulas Patocka
2025-06-25 12:54 ` Damien Le Moal
0 siblings, 1 reply; 17+ messages in thread
From: Mikulas Patocka @ 2025-06-25 10:19 UTC (permalink / raw)
To: Damien Le Moal
Cc: linux-block, Jens Axboe, dm-devel, Mike Snitzer, Bart Van Assche
On Wed, 25 Jun 2025, Damien Le Moal wrote:
> Read and write operations issued to a dm-crypt target may be split
> according to the dm-crypt internal limits defined by the max_read_size
> and max_write_size module parameters (default is 128 KB). The intent is
> to improve processing time of large BIOs by splitting them into smaller
> operations that can be parallelized on different CPUs.
>
> For zoned dm-crypt targets, this BIO splitting is still done but without
> the parallel execution to ensure that the issuing order of write
> operations to the underlying devices remains sequential. However, the
> splitting itself causes other problems:
>
> 1) Since dm-crypt relies on the block layer zone write plugging to
> handle zone append emulation using regular write operations, the
> reminder of a split write BIO will always be plugged into the target
> zone write plugged. Once the on-going write BIO finishes, this
> reminder BIO is unplugged and issued from the zone write plug work.
> If this reminder BIO itself needs to be split, the reminder will be
> re-issued and plugged again, but that causes a call to a
> blk_queue_enter(), which may block if a queue freeze operation was
> initiated. This results in a deadlock as DM submission still holds
> BIOs that the queue freeze side is waiting for.
>
> 2) dm-crypt relies on the emulation done by the block layer using
> regular write operations for processing zone append operations. This
> still requires to properly return the written sector as the BIO
> sector of the original BIO. However, this can be done correctly only
> and only if there is a single clone BIO used for processing the
> original zone append operation issued by the user. If the size of a
> zone append operation is larger than dm-crypt max_write_size, then
> the orginal BIO will be split and processed as a chain of regular
> write operations. Such chaining result in an incorrect written sector
> being returned to the zone append issuer using the original BIO
> sector. This in turn results in file system data corruptions using
> xfs or btrfs.
>
> Fix this by modifying get_max_request_size() to always return the size
> of the BIO to avoid it being split with dm_accpet_partial_bio() in
> crypt_map(). get_max_request_size() is renamed to
> get_max_request_sectors() to clarify the unit of the value returned
> and its interface is changed to take a struct dm_target pointer and a
> pointer to the struct bio being processed. In addition to this change,
> to ensure that crypt_alloc_buffer() works correctly, set the dm-crypt
> device max_hw_sectors limit to be at most
> BIO_MAX_VECS << PAGE_SECTORS_SHIFT (1 MB with a 4KB page architecture).
> This forces DM core to split write BIOs before passing them to
> crypt_map(), and thus guaranteeing that dm-crypt can always accept an
> entire write BIO without needing to split it.
>
> This change does not have any effect on the read path of dm-crypt. Read
> operations can still be split and the BIO fragments processed in
> parallel. There is also no impact on the performance of the write path
> given that all zone write BIOs were already processed inline instead of
> in parallel.
>
> This change also does not affect in any way regular dm-crypt block
> devices.
>
> Fixes: f211268ed1f9 ("dm: Use the block layer zone append emulation")
> Cc: stable@vger.kernel.org
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
> drivers/md/dm-crypt.c | 49 ++++++++++++++++++++++++++++++++++---------
> 1 file changed, 39 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
> index 17157c4216a5..4e80784d1734 100644
> --- a/drivers/md/dm-crypt.c
> +++ b/drivers/md/dm-crypt.c
> @@ -253,17 +253,35 @@ MODULE_PARM_DESC(max_read_size, "Maximum size of a read request");
> static unsigned int max_write_size = 0;
> module_param(max_write_size, uint, 0644);
> MODULE_PARM_DESC(max_write_size, "Maximum size of a write request");
> -static unsigned get_max_request_size(struct crypt_config *cc, bool wrt)
> +
> +static unsigned get_max_request_sectors(struct dm_target *ti, struct bio *bio)
> {
> + struct crypt_config *cc = ti->private;
> unsigned val, sector_align;
> - val = !wrt ? READ_ONCE(max_read_size) : READ_ONCE(max_write_size);
> - if (likely(!val))
> - val = !wrt ? DM_CRYPT_DEFAULT_MAX_READ_SIZE : DM_CRYPT_DEFAULT_MAX_WRITE_SIZE;
> - if (wrt || cc->used_tag_size) {
> - if (unlikely(val > BIO_MAX_VECS << PAGE_SHIFT))
> - val = BIO_MAX_VECS << PAGE_SHIFT;
> - }
> - sector_align = max(bdev_logical_block_size(cc->dev->bdev), (unsigned)cc->sector_size);
> + bool wrt = op_is_write(bio_op(bio));
> +
> + if (wrt) {
> + /*
> + * For zoned devices, splitting write operations creates the
> + * risk of deadlocking queue freeze operations with zone write
> + * plugging BIO work when the reminder of a split BIO is
> + * issued. So always allow the entire BIO to proceed.
> + */
> + if (ti->emulate_zone_append)
> + return bio_sectors(bio);
The overrun may still happen (if the user changes the dm table while some
bio is in progress) and if it happens, you should terminate the bio with
DM_MAPIO_KILL (like it was in my original patch).
Mikulas
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v2 3/4] dm: dm-crypt: Do not split write operations with zoned targets
2025-06-25 10:19 ` Mikulas Patocka
@ 2025-06-25 12:54 ` Damien Le Moal
2025-06-25 14:03 ` Mikulas Patocka
0 siblings, 1 reply; 17+ messages in thread
From: Damien Le Moal @ 2025-06-25 12:54 UTC (permalink / raw)
To: Mikulas Patocka
Cc: linux-block, Jens Axboe, dm-devel, Mike Snitzer, Bart Van Assche
On 6/25/25 19:19, Mikulas Patocka wrote:
>
>
> On Wed, 25 Jun 2025, Damien Le Moal wrote:
>
>> Read and write operations issued to a dm-crypt target may be split
>> according to the dm-crypt internal limits defined by the max_read_size
>> and max_write_size module parameters (default is 128 KB). The intent is
>> to improve processing time of large BIOs by splitting them into smaller
>> operations that can be parallelized on different CPUs.
>>
>> For zoned dm-crypt targets, this BIO splitting is still done but without
>> the parallel execution to ensure that the issuing order of write
>> operations to the underlying devices remains sequential. However, the
>> splitting itself causes other problems:
>>
>> 1) Since dm-crypt relies on the block layer zone write plugging to
>> handle zone append emulation using regular write operations, the
>> reminder of a split write BIO will always be plugged into the target
>> zone write plugged. Once the on-going write BIO finishes, this
>> reminder BIO is unplugged and issued from the zone write plug work.
>> If this reminder BIO itself needs to be split, the reminder will be
>> re-issued and plugged again, but that causes a call to a
>> blk_queue_enter(), which may block if a queue freeze operation was
>> initiated. This results in a deadlock as DM submission still holds
>> BIOs that the queue freeze side is waiting for.
>>
>> 2) dm-crypt relies on the emulation done by the block layer using
>> regular write operations for processing zone append operations. This
>> still requires to properly return the written sector as the BIO
>> sector of the original BIO. However, this can be done correctly only
>> and only if there is a single clone BIO used for processing the
>> original zone append operation issued by the user. If the size of a
>> zone append operation is larger than dm-crypt max_write_size, then
>> the orginal BIO will be split and processed as a chain of regular
>> write operations. Such chaining result in an incorrect written sector
>> being returned to the zone append issuer using the original BIO
>> sector. This in turn results in file system data corruptions using
>> xfs or btrfs.
>>
>> Fix this by modifying get_max_request_size() to always return the size
>> of the BIO to avoid it being split with dm_accpet_partial_bio() in
>> crypt_map(). get_max_request_size() is renamed to
>> get_max_request_sectors() to clarify the unit of the value returned
>> and its interface is changed to take a struct dm_target pointer and a
>> pointer to the struct bio being processed. In addition to this change,
>> to ensure that crypt_alloc_buffer() works correctly, set the dm-crypt
>> device max_hw_sectors limit to be at most
>> BIO_MAX_VECS << PAGE_SECTORS_SHIFT (1 MB with a 4KB page architecture).
>> This forces DM core to split write BIOs before passing them to
>> crypt_map(), and thus guaranteeing that dm-crypt can always accept an
>> entire write BIO without needing to split it.
>>
>> This change does not have any effect on the read path of dm-crypt. Read
>> operations can still be split and the BIO fragments processed in
>> parallel. There is also no impact on the performance of the write path
>> given that all zone write BIOs were already processed inline instead of
>> in parallel.
>>
>> This change also does not affect in any way regular dm-crypt block
>> devices.
>>
>> Fixes: f211268ed1f9 ("dm: Use the block layer zone append emulation")
>> Cc: stable@vger.kernel.org
>> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
>> ---
>> drivers/md/dm-crypt.c | 49 ++++++++++++++++++++++++++++++++++---------
>> 1 file changed, 39 insertions(+), 10 deletions(-)
>>
>> diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
>> index 17157c4216a5..4e80784d1734 100644
>> --- a/drivers/md/dm-crypt.c
>> +++ b/drivers/md/dm-crypt.c
>> @@ -253,17 +253,35 @@ MODULE_PARM_DESC(max_read_size, "Maximum size of a read request");
>> static unsigned int max_write_size = 0;
>> module_param(max_write_size, uint, 0644);
>> MODULE_PARM_DESC(max_write_size, "Maximum size of a write request");
>> -static unsigned get_max_request_size(struct crypt_config *cc, bool wrt)
>> +
>> +static unsigned get_max_request_sectors(struct dm_target *ti, struct bio *bio)
>> {
>> + struct crypt_config *cc = ti->private;
>> unsigned val, sector_align;
>> - val = !wrt ? READ_ONCE(max_read_size) : READ_ONCE(max_write_size);
>> - if (likely(!val))
>> - val = !wrt ? DM_CRYPT_DEFAULT_MAX_READ_SIZE : DM_CRYPT_DEFAULT_MAX_WRITE_SIZE;
>> - if (wrt || cc->used_tag_size) {
>> - if (unlikely(val > BIO_MAX_VECS << PAGE_SHIFT))
>> - val = BIO_MAX_VECS << PAGE_SHIFT;
>> - }
>> - sector_align = max(bdev_logical_block_size(cc->dev->bdev), (unsigned)cc->sector_size);
>> + bool wrt = op_is_write(bio_op(bio));
>> +
>> + if (wrt) {
>> + /*
>> + * For zoned devices, splitting write operations creates the
>> + * risk of deadlocking queue freeze operations with zone write
>> + * plugging BIO work when the remainder of a split BIO is
>> + * issued. So always allow the entire BIO to proceed.
>> + */
>> + if (ti->emulate_zone_append)
>> + return bio_sectors(bio);
>
> The overrun may still happen (if the user changes the dm table while some
> bio is in progress) and if it happens, you should terminate the bio with
> DM_MAPIO_KILL (like it was in my original patch).
I am confused... Overrun against what? We are now completely ignoring the
max_write_size limit, so even if the user changes it, that will not affect the
BIO processing. If you are referring to an overrun against the zoned device
max_hw_sectors limit, it is not possible since changing limits is done with the
DM device queue frozen, so we are guaranteed that there will be no BIO in flight.
I am not sure about what kind of table change you are thinking of, but at the
very least, dm_table_supports_size_change() ensures that there cannot be any
device size change for a zoned DM device. And given the above point about limit
changes, I do not see how a table change can affect the BIO execution.
Do you have a specific example in mind ?
Or is it maybe the if condition that is confusing?

    if (ti->emulate_zone_append)

applies to the target, so *all* write operations (emulated zone append writes
and regular writes) will be handled by this and bio_sectors(bio) returned, thus
avoiding a split for all write operations. Maybe using:

    if (bdev_is_zoned(bio->bi_bdev))

would be clearer?
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 17+ messages in thread

* Re: [PATCH v2 3/4] dm: dm-crypt: Do not split write operations with zoned targets
2025-06-25 12:54 ` Damien Le Moal
@ 2025-06-25 14:03 ` Mikulas Patocka
2025-06-25 14:43 ` Damien Le Moal
0 siblings, 1 reply; 17+ messages in thread
From: Mikulas Patocka @ 2025-06-25 14:03 UTC (permalink / raw)
To: Damien Le Moal
Cc: linux-block, Jens Axboe, dm-devel, Mike Snitzer, Bart Van Assche
On Wed, 25 Jun 2025, Damien Le Moal wrote:
> On 6/25/25 19:19, Mikulas Patocka wrote:
> >
> >
> > On Wed, 25 Jun 2025, Damien Le Moal wrote:
> >
> >> + bool wrt = op_is_write(bio_op(bio));
> >> +
> >> + if (wrt) {
> >> + /*
> >> + * For zoned devices, splitting write operations creates the
> >> + * risk of deadlocking queue freeze operations with zone write
> >> + * plugging BIO work when the remainder of a split BIO is
> >> + * issued. So always allow the entire BIO to proceed.
> >> + */
> >> + if (ti->emulate_zone_append)
> >> + return bio_sectors(bio);
> >
> > The overrun may still happen (if the user changes the dm table while some
> > bio is in progress) and if it happens, you should terminate the bio with
> > DM_MAPIO_KILL (like it was in my original patch).
>
> I am confused... Overrun against what ? We are now completely ignoring the
> max_write_size limit so even if the user changes it, that will not affect the
> BIO processing. If you are referring to an overrun against the zoned device
> max_hw_sectors limit, it is not possible since changing limits is done with the
> DM device queue frozen, so we are guaranteed that there will be no BIO in-flight.
>
> I am not sure about what kind of table change you are thinking of, but at the
> very least, dm_table_supports_size_change() ensures that there cannot be any
> device size change for a zoned DM device. And given the above point about limits
> changes, I do not see how a table change can affect the BIO execution.
>
> Do you have a specific example in mind ?
What happens if a bio that is larger than "BIO_MAX_VECS << PAGE_SHIFT"
enters dm_split_and_process_bio? Where will the bio be split? I don't see
it, but maybe I'm missing something.
Mikulas
* Re: [PATCH v2 3/4] dm: dm-crypt: Do not split write operations with zoned targets
2025-06-25 14:03 ` Mikulas Patocka
@ 2025-06-25 14:43 ` Damien Le Moal
2025-06-25 16:33 ` Mikulas Patocka
0 siblings, 1 reply; 17+ messages in thread
From: Damien Le Moal @ 2025-06-25 14:43 UTC (permalink / raw)
To: Mikulas Patocka
Cc: linux-block, Jens Axboe, dm-devel, Mike Snitzer, Bart Van Assche
On 6/25/25 23:03, Mikulas Patocka wrote:
>
>
> On Wed, 25 Jun 2025, Damien Le Moal wrote:
>
>> On 6/25/25 19:19, Mikulas Patocka wrote:
>>>
>>>
>>> On Wed, 25 Jun 2025, Damien Le Moal wrote:
>>>
>>>> + bool wrt = op_is_write(bio_op(bio));
>>>> +
>>>> + if (wrt) {
>>>> + /*
>>>> + * For zoned devices, splitting write operations creates the
>>>> + * risk of deadlocking queue freeze operations with zone write
> >>>> + * plugging BIO work when the remainder of a split BIO is
>>>> + * issued. So always allow the entire BIO to proceed.
>>>> + */
>>>> + if (ti->emulate_zone_append)
>>>> + return bio_sectors(bio);
>>>
>>> The overrun may still happen (if the user changes the dm table while some
>>> bio is in progress) and if it happens, you should terminate the bio with
>>> DM_MAPIO_KILL (like it was in my original patch).
>>
>> I am confused... Overrun against what ? We are now completely ignoring the
>> max_write_size limit so even if the user changes it, that will not affect the
>> BIO processing. If you are referring to an overrun against the zoned device
>> max_hw_sectors limit, it is not possible since changing limits is done with the
>> DM device queue frozen, so we are guaranteed that there will be no BIO in-flight.
>>
>> I am not sure about what kind of table change you are thinking of, but at the
>> very least, dm_table_supports_size_change() ensures that there cannot be any
>> device size change for a zoned DM device. And given the above point about limits
>> changes, I do not see how a table change can affect the BIO execution.
>>
>> Do you have a specific example in mind ?
>
> What happens if a bio that is larger than "BIO_MAX_VECS << PAGE_SHIFT"
> enters dm_split_and_process_bio? Where will the bio be split? I don't see
> it, but maybe I'm missing something.
See patch 3 of the v3 I sent: dm_zone_bio_needs_split() and
dm_split_and_process_bio() have been modified to always end up with need_split ==
true for zone write BIOs, and that causes a call to bio_split_to_limits(). So
dm-crypt will always see BIOs that are smaller than limits->max_hw_sectors,
which is set to BIO_MAX_VECS << PAGE_SECTORS_SHIFT in dm-crypt io_hint. So
dm-crypt can never see a write BIO that is larger than BIO_MAX_VECS << PAGE_SHIFT.
>
> Mikulas
>
--
Damien Le Moal
Western Digital Research
* Re: [PATCH v2 3/4] dm: dm-crypt: Do not split write operations with zoned targets
2025-06-25 14:43 ` Damien Le Moal
@ 2025-06-25 16:33 ` Mikulas Patocka
0 siblings, 0 replies; 17+ messages in thread
From: Mikulas Patocka @ 2025-06-25 16:33 UTC (permalink / raw)
To: Damien Le Moal
Cc: linux-block, Jens Axboe, dm-devel, Mike Snitzer, Bart Van Assche
On Wed, 25 Jun 2025, Damien Le Moal wrote:
> >> Do you have a specific example in mind ?
> >
> > What happens if a bio that is larger than "BIO_MAX_VECS << PAGE_SHIFT"
> > enters dm_split_and_process_bio? Where will the bio be split? I don't see
> > it, but maybe I'm missing something.
>
> See patch 3 of the v3 I sent: dm_zone_bio_needs_split() and
> dm_split_and_process_bio() have been modified to always end up with need_split ==
> true for zone write BIOs, and that causes a call to bio_split_to_limits(). So
> dm-crypt will always see BIOs that are smaller than limits->max_hw_sectors,
> which is set to BIO_MAX_VECS << PAGE_SECTORS_SHIFT in dm-crypt io_hint. So
> dm-crypt can never see a write BIO that is larger than BIO_MAX_VECS << PAGE_SHIFT.
OK.
I acked the patches and I suppose that they will be sent through the block
layer tree.
Or - should I send them through the device mapper tree?
Mikulas
* [PATCH v2 4/4] dm: Check for forbidden splitting of zone write operations
2025-06-25 5:59 [PATCH v2 0/4] Fix write operation handling for zoned DM devices Damien Le Moal
` (2 preceding siblings ...)
2025-06-25 5:59 ` [PATCH v2 3/4] dm: dm-crypt: Do not split write operations with zoned targets Damien Le Moal
@ 2025-06-25 5:59 ` Damien Le Moal
2025-06-25 6:16 ` Christoph Hellwig
3 siblings, 1 reply; 17+ messages in thread
From: Damien Le Moal @ 2025-06-25 5:59 UTC (permalink / raw)
To: linux-block, Jens Axboe, dm-devel, Mike Snitzer, Mikulas Patocka
Cc: Bart Van Assche
DM targets must not split zone append and write operations using
dm_accept_partial_bio() as doing so is forbidden for zone append BIOs,
breaks zone append emulation using regular write BIOs and potentially
creates deadlock situations with queue freeze operations.
Modify dm_accept_partial_bio() to add the missing BUG_ON() checks for all
these cases, that is, to also check for regular write and write zeroes
operations when the target device is zoned. This change packs all the
zone related checks together under a static_branch_unlikely(&zoned_enabled)
branch and executes them only if the target device is zoned.
Fixes: f211268ed1f9 ("dm: Use the block layer zone append emulation")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
drivers/md/dm.c | 17 +++++++++++++----
1 file changed, 13 insertions(+), 4 deletions(-)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index e01ed89b2e45..86647dcaf981 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1286,8 +1286,9 @@ static size_t dm_dax_recovery_write(struct dax_device *dax_dev, pgoff_t pgoff,
/*
* A target may call dm_accept_partial_bio only from the map routine. It is
* allowed for all bio types except REQ_PREFLUSH, REQ_OP_ZONE_* zone management
- * operations, REQ_OP_ZONE_APPEND (zone append writes) and any bio serviced by
- * __send_duplicate_bios().
+ * operations, zone append writes (native with REQ_OP_ZONE_APPEND or emulated
+ * with write BIOs flagged with BIO_EMULATES_ZONE_APPEND) and any bio serviced
+ * by __send_duplicate_bios().
*
* dm_accept_partial_bio informs the dm that the target only wants to process
* additional n_sectors sectors of the bio and the rest of the data should be
@@ -1320,11 +1321,19 @@ void dm_accept_partial_bio(struct bio *bio, unsigned int n_sectors)
unsigned int bio_sectors = bio_sectors(bio);
BUG_ON(dm_tio_flagged(tio, DM_TIO_IS_DUPLICATE_BIO));
- BUG_ON(op_is_zone_mgmt(bio_op(bio)));
- BUG_ON(bio_op(bio) == REQ_OP_ZONE_APPEND);
BUG_ON(bio_sectors > *tio->len_ptr);
BUG_ON(n_sectors > bio_sectors);
+ if (static_branch_unlikely(&zoned_enabled) &&
+ unlikely(bdev_is_zoned(bio->bi_bdev))) {
+ enum req_op op = bio_op(bio);
+
+ BUG_ON(op_is_zone_mgmt(op));
+ BUG_ON(op == REQ_OP_WRITE);
+ BUG_ON(op == REQ_OP_WRITE_ZEROES);
+ BUG_ON(op == REQ_OP_ZONE_APPEND);
+ }
+
*tio->len_ptr -= bio_sectors - n_sectors;
bio->bi_iter.bi_size = n_sectors << SECTOR_SHIFT;
--
2.49.0