* [PATCH v2 0/6] dm: fix issues with swapping dm tables
@ 2025-03-17 4:45 Benjamin Marzinski
2025-03-17 4:45 ` [PATCH v2 1/6] dm: don't change md if dm_table_set_restrictions() fails Benjamin Marzinski
` (6 more replies)
0 siblings, 7 replies; 14+ messages in thread
From: Benjamin Marzinski @ 2025-03-17 4:45 UTC (permalink / raw)
To: Mikulas Patocka, Mike Snitzer, Jens Axboe
Cc: dm-devel, linux-block, Damien Le Moal, Christoph Hellwig
There were multiple places in dm's __bind() function where it could fail
and not completely roll back, leaving the device using the old
table, but with device limits and resources from the new table.
Additionally, unused mempools for request-based devices were not always
freed immediately.
Finally, there were a number of issues with switching zoned tables that
emulate zone append (in other words, dm-crypt on top of zoned devices).
dm_blk_report_zones() could be called while the device was suspended and
modifying zoned resources or could possibly fail to end a srcu read
section. More importantly, blk_revalidate_disk_zones() would never get
called when updating a zoned table. This could cause the dm device to
see the wrong zone write offsets, not reserve enough zwplugs in its
mempool, or read invalid memory when checking the
conventional zones bitmap.
This patchset fixes these issues. It deals with the problems around
blk_revalidate_disk_zones() by only calling it for devices that have no
zone write plug resources. This will always correctly update the zoned
settings. If a device has zone write plug resources, calling
blk_revalidate_disk_zones() will not correctly update them in many
cases, so DM simply doesn't call it for devices with zone write plug
resources. Instead of allowing people to load tables that can break the
device, as currently happens, DM disallows any table reloads that
change the zoned setting for devices that already have zone write plug
resources.
Specifically, if a device already has zone plug resources allocated, it
can only switch to another zoned table that also emulates zone append.
Also, it cannot change the device size or the zone size. There are some
tweaks to make sure that a device can always switch to an error target.
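The constraint described above can be sketched as a small standalone check. This is an illustrative userspace model with made-up names, not the actual kernel code; the real checks live in dm_table_set_restrictions() and dm_set_zones_restrictions():

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified view of the constraint: once a dm device has
 * zone write plug resources, a reloaded table must keep emulating zone
 * append and must not change the device size or zone size. An all-error
 * ("wildcard") table is always allowed so a broken device can be fenced.
 */
struct swap_req {
	bool has_zone_wplugs;	/* device already has zone write plug resources */
	bool new_table_emulates_zone_append;
	bool new_table_is_error_target;
	unsigned long long old_size, new_size;		/* in sectors */
	unsigned int old_zone_sectors, new_zone_sectors;
};

static bool table_swap_allowed(const struct swap_req *r)
{
	if (!r->has_zone_wplugs)
		return true;	/* no plug resources: revalidation handles it */
	if (r->new_table_is_error_target)
		return true;	/* switching to an error target is always permitted */
	if (!r->new_table_emulates_zone_append)
		return false;
	if (r->old_size != r->new_size)
		return false;
	if (r->old_zone_sectors != r->new_zone_sectors)
		return false;
	return true;
}
```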
Changes in V2:
- Made queue_limits_set() optionally return the old limits (grabbed
while holding the limits_lock), and used this in
dm_table_set_restrictions()
- dropped changes to disk_free_zone_resources() and the
blk_revalidate_disk_zones() code path (removed patches 0005 & 0006)
- Instead of always calling blk_revalidate_disk_zones(), disallow
changes that would change zone settings if the device has
zone write plug resources (final patch).
Benjamin Marzinski (6):
dm: don't change md if dm_table_set_restrictions() fails
dm: free table mempools if not used in __bind
block: make queue_limits_set() optionally return old limits
dm: handle failures in dm_table_set_restrictions
dm: fix dm_blk_report_zones
dm: limit swapping tables for devices with zone write plugs
block/blk-settings.c | 9 ++++-
drivers/md/dm-core.h | 1 +
drivers/md/dm-table.c | 66 ++++++++++++++++++++++++++-------
drivers/md/dm-zone.c | 84 +++++++++++++++++++++++++++++-------------
drivers/md/dm.c | 36 +++++++++++-------
drivers/md/dm.h | 6 +++
drivers/md/md-linear.c | 2 +-
drivers/md/raid0.c | 2 +-
drivers/md/raid1.c | 2 +-
drivers/md/raid10.c | 2 +-
drivers/md/raid5.c | 2 +-
include/linux/blkdev.h | 3 +-
12 files changed, 154 insertions(+), 61 deletions(-)
--
2.48.1
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH v2 1/6] dm: don't change md if dm_table_set_restrictions() fails
2025-03-17 4:45 [PATCH v2 0/6] dm: fix issues with swapping dm tables Benjamin Marzinski
@ 2025-03-17 4:45 ` Benjamin Marzinski
2025-03-17 4:45 ` [PATCH v2 2/6] dm: free table mempools if not used in __bind Benjamin Marzinski
` (5 subsequent siblings)
6 siblings, 0 replies; 14+ messages in thread
From: Benjamin Marzinski @ 2025-03-17 4:45 UTC (permalink / raw)
To: Mikulas Patocka, Mike Snitzer, Jens Axboe
Cc: dm-devel, linux-block, Damien Le Moal, Christoph Hellwig
__bind was changing the disk capacity, geometry and mempools of the
mapped device before calling dm_table_set_restrictions() which could
fail, forcing dm to drop the new table. Failing here would leave the
device using the old table but with the wrong capacity and mempools.
Move dm_table_set_restrictions() earlier in __bind(). Since it needs the
capacity to be set, save the old version and restore it on failure.
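The ordering fix follows a plain save/restore pattern. A minimal userspace sketch (hypothetical names; the real step that can fail is dm_table_set_restrictions()):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative sketch of the __bind() rollback pattern: record the old
 * capacity, apply the new one, and restore the old value if the fallible
 * step fails, so the device is left in its previous state.
 */
struct fake_disk { unsigned long long capacity; };

static int apply_restrictions(bool should_fail)
{
	return should_fail ? -1 : 0;	/* stand-in for dm_table_set_restrictions() */
}

static int bind_capacity(struct fake_disk *disk, unsigned long long new_size,
			 bool restrictions_fail)
{
	unsigned long long old_size = disk->capacity;	/* save before changing */
	int ret;

	disk->capacity = new_size;			/* set_capacity() */
	ret = apply_restrictions(restrictions_fail);
	if (ret) {
		disk->capacity = old_size;		/* roll back on failure */
		return ret;
	}
	return 0;
}
```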
Fixes: bb37d77239af2 ("dm: introduce zone append emulation")
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
---
drivers/md/dm.c | 22 ++++++++++++----------
1 file changed, 12 insertions(+), 10 deletions(-)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 5ab7574c0c76..f5c5ccb6f8d2 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2421,21 +2421,29 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,
struct queue_limits *limits)
{
struct dm_table *old_map;
- sector_t size;
+ sector_t size, old_size;
int ret;
lockdep_assert_held(&md->suspend_lock);
size = dm_table_get_size(t);
+ old_size = dm_get_size(md);
+ set_capacity(md->disk, size);
+
+ ret = dm_table_set_restrictions(t, md->queue, limits);
+ if (ret) {
+ set_capacity(md->disk, old_size);
+ old_map = ERR_PTR(ret);
+ goto out;
+ }
+
/*
* Wipe any geometry if the size of the table changed.
*/
- if (size != dm_get_size(md))
+ if (size != old_size)
memset(&md->geometry, 0, sizeof(md->geometry));
- set_capacity(md->disk, size);
-
dm_table_event_callback(t, event_callback, md);
if (dm_table_request_based(t)) {
@@ -2468,12 +2476,6 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,
t->mempools = NULL;
}
- ret = dm_table_set_restrictions(t, md->queue, limits);
- if (ret) {
- old_map = ERR_PTR(ret);
- goto out;
- }
-
old_map = rcu_dereference_protected(md->map, lockdep_is_held(&md->suspend_lock));
rcu_assign_pointer(md->map, (void *)t);
md->immutable_target_type = dm_table_get_immutable_target_type(t);
--
2.48.1
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH v2 2/6] dm: free table mempools if not used in __bind
2025-03-17 4:45 [PATCH v2 0/6] dm: fix issues with swapping dm tables Benjamin Marzinski
2025-03-17 4:45 ` [PATCH v2 1/6] dm: don't change md if dm_table_set_restrictions() fails Benjamin Marzinski
@ 2025-03-17 4:45 ` Benjamin Marzinski
2025-03-17 4:45 ` [PATCH v2 3/6] block: make queue_limits_set() optionally return old limits Benjamin Marzinski
` (4 subsequent siblings)
6 siblings, 0 replies; 14+ messages in thread
From: Benjamin Marzinski @ 2025-03-17 4:45 UTC (permalink / raw)
To: Mikulas Patocka, Mike Snitzer, Jens Axboe
Cc: dm-devel, linux-block, Damien Le Moal, Christoph Hellwig
With request-based dm, the mempools don't need reloading when switching
tables, but the unused table mempools are not freed until the active
table is finally freed. Free them immediately if they are not needed.
Fixes: 29dec90a0f1d9 ("dm: fix bio_set allocation")
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
---
drivers/md/dm.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index f5c5ccb6f8d2..292414da871d 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2461,10 +2461,10 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,
* requests in the queue may refer to bio from the old bioset,
* so you must walk through the queue to unprep.
*/
- if (!md->mempools) {
+ if (!md->mempools)
md->mempools = t->mempools;
- t->mempools = NULL;
- }
+ else
+ dm_free_md_mempools(t->mempools);
} else {
/*
* The md may already have mempools that need changing.
@@ -2473,8 +2473,8 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,
*/
dm_free_md_mempools(md->mempools);
md->mempools = t->mempools;
- t->mempools = NULL;
}
+ t->mempools = NULL;
old_map = rcu_dereference_protected(md->map, lockdep_is_held(&md->suspend_lock));
rcu_assign_pointer(md->map, (void *)t);
--
2.48.1
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH v2 3/6] block: make queue_limits_set() optionally return old limits
2025-03-17 4:45 [PATCH v2 0/6] dm: fix issues with swapping dm tables Benjamin Marzinski
2025-03-17 4:45 ` [PATCH v2 1/6] dm: don't change md if dm_table_set_restrictions() fails Benjamin Marzinski
2025-03-17 4:45 ` [PATCH v2 2/6] dm: free table mempools if not used in __bind Benjamin Marzinski
@ 2025-03-17 4:45 ` Benjamin Marzinski
2025-03-18 6:56 ` Christoph Hellwig
2025-03-17 4:45 ` [PATCH v2 4/6] dm: handle failures in dm_table_set_restrictions Benjamin Marzinski
` (3 subsequent siblings)
6 siblings, 1 reply; 14+ messages in thread
From: Benjamin Marzinski @ 2025-03-17 4:45 UTC (permalink / raw)
To: Mikulas Patocka, Mike Snitzer, Jens Axboe
Cc: dm-devel, linux-block, Damien Le Moal, Christoph Hellwig
A future device-mapper patch will make use of this new argument. No
functional changes intended in this patch.
Suggested-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
---
block/blk-settings.c | 9 +++++++--
drivers/md/dm-table.c | 2 +-
drivers/md/md-linear.c | 2 +-
drivers/md/raid0.c | 2 +-
drivers/md/raid1.c | 2 +-
drivers/md/raid10.c | 2 +-
drivers/md/raid5.c | 2 +-
include/linux/blkdev.h | 3 ++-
8 files changed, 15 insertions(+), 9 deletions(-)
diff --git a/block/blk-settings.c b/block/blk-settings.c
index c44dadc35e1e..bad690ef8fec 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -476,16 +476,21 @@ EXPORT_SYMBOL_GPL(queue_limits_commit_update_frozen);
* queue_limits_set - apply queue limits to queue
* @q: queue to update
* @lim: limits to apply
+ * @old_lim: store previous limits if non-null.
*
- * Apply the limits in @lim that were freshly initialized to @q.
+ * Apply the limits in @lim that were freshly initialized to @q, and
+ * optionally return the previous limits in @old_lim.
* To update existing limits use queue_limits_start_update() and
* queue_limits_commit_update() instead.
*
* Returns 0 if successful, else a negative error code.
*/
-int queue_limits_set(struct request_queue *q, struct queue_limits *lim)
+int queue_limits_set(struct request_queue *q, struct queue_limits *lim,
+ struct queue_limits *old_lim)
{
mutex_lock(&q->limits_lock);
+ if (old_lim)
+ *old_lim = q->limits;
return queue_limits_commit_update(q, lim);
}
EXPORT_SYMBOL_GPL(queue_limits_set);
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 0ef5203387b2..77d8459b2f2a 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1883,7 +1883,7 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
if (dm_table_supports_atomic_writes(t))
limits->features |= BLK_FEAT_ATOMIC_WRITES;
- r = queue_limits_set(q, limits);
+ r = queue_limits_set(q, limits, NULL);
if (r)
return r;
diff --git a/drivers/md/md-linear.c b/drivers/md/md-linear.c
index a382929ce7ba..65ceec5b078f 100644
--- a/drivers/md/md-linear.c
+++ b/drivers/md/md-linear.c
@@ -81,7 +81,7 @@ static int linear_set_limits(struct mddev *mddev)
return err;
}
- return queue_limits_set(mddev->gendisk->queue, &lim);
+ return queue_limits_set(mddev->gendisk->queue, &lim, NULL);
}
static struct linear_conf *linear_conf(struct mddev *mddev, int raid_disks)
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index 8fc9339b00c7..2eb42b2c103b 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -390,7 +390,7 @@ static int raid0_set_limits(struct mddev *mddev)
queue_limits_cancel_update(mddev->gendisk->queue);
return err;
}
- return queue_limits_set(mddev->gendisk->queue, &lim);
+ return queue_limits_set(mddev->gendisk->queue, &lim, NULL);
}
static int raid0_run(struct mddev *mddev)
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 9d57a88dbd26..e8103dbc549c 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -3223,7 +3223,7 @@ static int raid1_set_limits(struct mddev *mddev)
queue_limits_cancel_update(mddev->gendisk->queue);
return err;
}
- return queue_limits_set(mddev->gendisk->queue, &lim);
+ return queue_limits_set(mddev->gendisk->queue, &lim, NULL);
}
static int raid1_run(struct mddev *mddev)
diff --git a/drivers/md/raid10.c b/drivers/md/raid10.c
index efe93b979167..dbc2ee70d0d3 100644
--- a/drivers/md/raid10.c
+++ b/drivers/md/raid10.c
@@ -4024,7 +4024,7 @@ static int raid10_set_queue_limits(struct mddev *mddev)
queue_limits_cancel_update(mddev->gendisk->queue);
return err;
}
- return queue_limits_set(mddev->gendisk->queue, &lim);
+ return queue_limits_set(mddev->gendisk->queue, &lim, NULL);
}
static int raid10_run(struct mddev *mddev)
diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
index 5c79429acc64..a7cc2ec86793 100644
--- a/drivers/md/raid5.c
+++ b/drivers/md/raid5.c
@@ -7774,7 +7774,7 @@ static int raid5_set_limits(struct mddev *mddev)
/* No restrictions on the number of segments in the request */
lim.max_segments = USHRT_MAX;
- return queue_limits_set(mddev->gendisk->queue, &lim);
+ return queue_limits_set(mddev->gendisk->queue, &lim, NULL);
}
static int raid5_run(struct mddev *mddev)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 248416ecd01c..bb264fd7b2f2 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -953,7 +953,8 @@ int queue_limits_commit_update_frozen(struct request_queue *q,
struct queue_limits *lim);
int queue_limits_commit_update(struct request_queue *q,
struct queue_limits *lim);
-int queue_limits_set(struct request_queue *q, struct queue_limits *lim);
+int queue_limits_set(struct request_queue *q, struct queue_limits *lim,
+ struct queue_limits *old_lim);
int blk_validate_limits(struct queue_limits *lim);
/**
--
2.48.1
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH v2 4/6] dm: handle failures in dm_table_set_restrictions
2025-03-17 4:45 [PATCH v2 0/6] dm: fix issues with swapping dm tables Benjamin Marzinski
` (2 preceding siblings ...)
2025-03-17 4:45 ` [PATCH v2 3/6] block: make queue_limits_set() optionally return old limits Benjamin Marzinski
@ 2025-03-17 4:45 ` Benjamin Marzinski
2025-03-17 4:45 ` [PATCH v2 5/6] dm: fix dm_blk_report_zones Benjamin Marzinski
` (2 subsequent siblings)
6 siblings, 0 replies; 14+ messages in thread
From: Benjamin Marzinski @ 2025-03-17 4:45 UTC (permalink / raw)
To: Mikulas Patocka, Mike Snitzer, Jens Axboe
Cc: dm-devel, linux-block, Damien Le Moal, Christoph Hellwig
If dm_table_set_restrictions() fails while swapping tables,
device-mapper will continue using the previous table. It must be sure to
leave the mapped_device in its previous state on failure. Otherwise
device-mapper could end up using the old table with settings from the
unused table.
Do not update the mapped device in dm_set_zones_restrictions(). Wait
until after dm_table_set_restrictions() is sure to succeed to update the
md zoned settings. Do the same with the dax settings, and if
dm_revalidate_zones() fails, restore the original queue limits.
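The reordering amounts to a commit-late discipline: run every fallible step first against revocable state, and apply the one-way side effects only after success. A sketch with made-up names (in the real code the revocable state is the queue limits, and the deferred side effects are the dax and zone settings):

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace sketch of the dm_table_set_restrictions() reordering:
 * apply the restorable part first (limits, saved via the old-limits
 * copy), and commit the irreversible side effects only on success.
 */
struct dev_state {
	int limits;			/* stand-in for struct queue_limits */
	bool side_effects_applied;
};

static int revalidate(bool fail) { return fail ? -1 : 0; }

static int set_restrictions(struct dev_state *dev, int new_limits,
			    bool revalidate_fails)
{
	int old_limits = dev->limits;	/* saved old limits */

	dev->limits = new_limits;
	if (revalidate(revalidate_fails)) {
		dev->limits = old_limits;	/* restore the saved limits */
		return -1;
	}
	dev->side_effects_applied = true;	/* committed only on success */
	return 0;
}
```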
Fixes: 7f91ccd8a608d ("dm: Call dm_revalidate_zones() after setting the queue limits")
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
---
drivers/md/dm-table.c | 25 ++++++++++++++++---------
drivers/md/dm-zone.c | 26 ++++++++++++++++++--------
drivers/md/dm.h | 1 +
3 files changed, 35 insertions(+), 17 deletions(-)
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 77d8459b2f2a..66ebe76f8c9c 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1836,6 +1836,7 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
struct queue_limits *limits)
{
int r;
+ struct queue_limits old_limits;
if (!dm_table_supports_nowait(t))
limits->features &= ~BLK_FEAT_NOWAIT;
@@ -1862,16 +1863,11 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
if (dm_table_supports_flush(t))
limits->features |= BLK_FEAT_WRITE_CACHE | BLK_FEAT_FUA;
- if (dm_table_supports_dax(t, device_not_dax_capable)) {
+ if (dm_table_supports_dax(t, device_not_dax_capable))
limits->features |= BLK_FEAT_DAX;
- if (dm_table_supports_dax(t, device_not_dax_synchronous_capable))
- set_dax_synchronous(t->md->dax_dev);
- } else
+ else
limits->features &= ~BLK_FEAT_DAX;
- if (dm_table_any_dev_attr(t, device_dax_write_cache_enabled, NULL))
- dax_write_cache(t->md->dax_dev, true);
-
/* For a zoned table, setup the zone related queue attributes. */
if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) &&
(limits->features & BLK_FEAT_ZONED)) {
@@ -1883,7 +1879,7 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
if (dm_table_supports_atomic_writes(t))
limits->features |= BLK_FEAT_ATOMIC_WRITES;
- r = queue_limits_set(q, limits, NULL);
+ r = queue_limits_set(q, limits, &old_limits);
if (r)
return r;
@@ -1894,10 +1890,21 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) &&
(limits->features & BLK_FEAT_ZONED)) {
r = dm_revalidate_zones(t, q);
- if (r)
+ if (r) {
+ queue_limits_set(q, &old_limits, NULL);
return r;
+ }
}
+ if (IS_ENABLED(CONFIG_BLK_DEV_ZONED))
+ dm_finalize_zone_settings(t, limits);
+
+ if (dm_table_supports_dax(t, device_not_dax_synchronous_capable))
+ set_dax_synchronous(t->md->dax_dev);
+
+ if (dm_table_any_dev_attr(t, device_dax_write_cache_enabled, NULL))
+ dax_write_cache(t->md->dax_dev, true);
+
dm_update_crypto_profile(q, t);
return 0;
}
diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
index 20edd3fabbab..681058feb63b 100644
--- a/drivers/md/dm-zone.c
+++ b/drivers/md/dm-zone.c
@@ -340,12 +340,8 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
* mapped device queue as needing zone append emulation.
*/
WARN_ON_ONCE(queue_is_mq(q));
- if (dm_table_supports_zone_append(t)) {
- clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
- } else {
- set_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
+ if (!dm_table_supports_zone_append(t))
lim->max_hw_zone_append_sectors = 0;
- }
/*
* Determine the max open and max active zone limits for the mapped
@@ -383,9 +379,6 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
lim->zone_write_granularity = 0;
lim->chunk_sectors = 0;
lim->features &= ~BLK_FEAT_ZONED;
- clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
- md->nr_zones = 0;
- disk->nr_zones = 0;
return 0;
}
@@ -408,6 +401,23 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
return 0;
}
+void dm_finalize_zone_settings(struct dm_table *t, struct queue_limits *lim)
+{
+ struct mapped_device *md = t->md;
+
+ if (lim->features & BLK_FEAT_ZONED) {
+ if (dm_table_supports_zone_append(t))
+ clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
+ else
+ set_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
+ } else {
+ clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
+ md->nr_zones = 0;
+ md->disk->nr_zones = 0;
+ }
+}
+
+
/*
* IO completion callback called from clone_endio().
*/
diff --git a/drivers/md/dm.h b/drivers/md/dm.h
index a0a8ff119815..e5d3a9f46a91 100644
--- a/drivers/md/dm.h
+++ b/drivers/md/dm.h
@@ -102,6 +102,7 @@ int dm_setup_md_queue(struct mapped_device *md, struct dm_table *t);
int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
struct queue_limits *lim);
int dm_revalidate_zones(struct dm_table *t, struct request_queue *q);
+void dm_finalize_zone_settings(struct dm_table *t, struct queue_limits *lim);
void dm_zone_endio(struct dm_io *io, struct bio *clone);
#ifdef CONFIG_BLK_DEV_ZONED
int dm_blk_report_zones(struct gendisk *disk, sector_t sector,
--
2.48.1
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH v2 5/6] dm: fix dm_blk_report_zones
2025-03-17 4:45 [PATCH v2 0/6] dm: fix issues with swapping dm tables Benjamin Marzinski
` (3 preceding siblings ...)
2025-03-17 4:45 ` [PATCH v2 4/6] dm: handle failures in dm_table_set_restrictions Benjamin Marzinski
@ 2025-03-17 4:45 ` Benjamin Marzinski
2025-03-18 6:57 ` Christoph Hellwig
2025-03-17 4:45 ` [PATCH v2 6/6] dm: limit swapping tables for devices with zone write plugs Benjamin Marzinski
2025-03-19 19:07 ` [PATCH v2 0/6] dm: fix issues with swapping dm tables Benjamin Marzinski
6 siblings, 1 reply; 14+ messages in thread
From: Benjamin Marzinski @ 2025-03-17 4:45 UTC (permalink / raw)
To: Mikulas Patocka, Mike Snitzer, Jens Axboe
Cc: dm-devel, linux-block, Damien Le Moal, Christoph Hellwig
If dm_get_live_table() returned NULL, dm_put_live_table() was never
called. Also, if md->zone_revalidate_map is set, check that
dm_blk_report_zones() is being called from the process that set it in
__bind(). Otherwise the zone resources could change while accessing
them. Finally, it is possible that md->zone_revalidate_map will change
while calling this function. Only read it once, so that we are always
using the same value. Otherwise we might miss a call to
dm_put_live_table().
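The "read it once" fix is the classic pattern of snapshotting a shared pointer into a local, so the put decision on exit matches the get decision on entry. A minimal sketch with hypothetical names:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of why md->zone_revalidate_map must be read exactly once: if
 * the field is re-read and has changed in between, the code can take the
 * live-table reference on entry but skip the put on exit (or vice
 * versa). Snapshotting into a local keeps get and put paired.
 */
static int get_count, put_count;

static void *get_live_table(void) { get_count++; return (void *)0x1; }
static void put_live_table(void)  { put_count++; }

static void report_zones(void **revalidate_map_field)
{
	void *revalidate_map = *revalidate_map_field;	/* single read */
	void *map;

	if (!revalidate_map)
		map = get_live_table();
	else
		map = revalidate_map;

	(void)map;		/* ... do the actual zone report ... */

	/* Decide the put from the same snapshot used for the get. */
	if (!revalidate_map)
		put_live_table();
}
```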
Fixes: f211268ed1f9b ("dm: Use the block layer zone append emulation")
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
---
drivers/md/dm-core.h | 1 +
drivers/md/dm-zone.c | 23 +++++++++++++++--------
2 files changed, 16 insertions(+), 8 deletions(-)
diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index 3637761f3585..f3a3f2ef6322 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -141,6 +141,7 @@ struct mapped_device {
#ifdef CONFIG_BLK_DEV_ZONED
unsigned int nr_zones;
void *zone_revalidate_map;
+ struct task_struct *revalidate_map_task;
#endif
#ifdef CONFIG_IMA
diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
index 681058feb63b..11e19281bb64 100644
--- a/drivers/md/dm-zone.c
+++ b/drivers/md/dm-zone.c
@@ -56,24 +56,29 @@ int dm_blk_report_zones(struct gendisk *disk, sector_t sector,
{
struct mapped_device *md = disk->private_data;
struct dm_table *map;
- int srcu_idx, ret;
+ struct dm_table *zone_revalidate_map = md->zone_revalidate_map;
+ int srcu_idx, ret = -EIO;
- if (!md->zone_revalidate_map) {
- /* Regular user context */
+ if (!zone_revalidate_map || md->revalidate_map_task != current) {
+ /*
+ * Regular user context or
+ * Zone revalidation during __bind() is in progress, but this
+ * call is from a different process
+ */
if (dm_suspended_md(md))
return -EAGAIN;
map = dm_get_live_table(md, &srcu_idx);
- if (!map)
- return -EIO;
} else {
/* Zone revalidation during __bind() */
- map = md->zone_revalidate_map;
+ map = zone_revalidate_map;
}
- ret = dm_blk_do_report_zones(md, map, sector, nr_zones, cb, data);
+ if (map)
+ ret = dm_blk_do_report_zones(md, map, sector, nr_zones, cb,
+ data);
- if (!md->zone_revalidate_map)
+ if (!zone_revalidate_map)
dm_put_live_table(md, srcu_idx);
return ret;
@@ -175,7 +180,9 @@ int dm_revalidate_zones(struct dm_table *t, struct request_queue *q)
* our table for dm_blk_report_zones() to use directly.
*/
md->zone_revalidate_map = t;
+ md->revalidate_map_task = current;
ret = blk_revalidate_disk_zones(disk);
+ md->revalidate_map_task = NULL;
md->zone_revalidate_map = NULL;
if (ret) {
--
2.48.1
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH v2 6/6] dm: limit swapping tables for devices with zone write plugs
2025-03-17 4:45 [PATCH v2 0/6] dm: fix issues with swapping dm tables Benjamin Marzinski
` (4 preceding siblings ...)
2025-03-17 4:45 ` [PATCH v2 5/6] dm: fix dm_blk_report_zones Benjamin Marzinski
@ 2025-03-17 4:45 ` Benjamin Marzinski
2025-03-19 19:07 ` [PATCH v2 0/6] dm: fix issues with swapping dm tables Benjamin Marzinski
6 siblings, 0 replies; 14+ messages in thread
From: Benjamin Marzinski @ 2025-03-17 4:45 UTC (permalink / raw)
To: Mikulas Patocka, Mike Snitzer, Jens Axboe
Cc: dm-devel, linux-block, Damien Le Moal, Christoph Hellwig
dm_revalidate_zones() only allowed new or previously unzoned devices to
call blk_revalidate_disk_zones(). If the device was already zoned,
disk->nr_zones would always equal md->nr_zones, so dm_revalidate_zones()
returned without doing any work. This would make the zoned settings for
the device not match the new table. If the device had zone write plug
resources, it could run into errors like bdev_zone_is_seq() reading
invalid memory because disk->conv_zones_bitmap was the wrong size.
If the device doesn't have any zone write plug resources, calling
blk_revalidate_disk_zones() will always correctly update the device. If
blk_revalidate_disk_zones() fails, it can still overwrite or clear the
current disk->nr_zones value. In this case, DM must restore the previous
value of disk->nr_zones, so that the zoned settings continue to match
the previous table that it failed back to.
If the device already has zone write plug resources,
blk_revalidate_disk_zones() will not correctly update them if it is
called for arbitrary zoned device changes. Since there is not much need
for this ability, the easiest solution is to disallow any table reloads
that change the zoned settings, for devices that already have zone plug
resources. Specifically, if a device already has zone plug resources
allocated, it can only switch to another zoned table that also emulates
zone append. Also, it cannot change the device size or the zone size. A
device can switch to an error target.
Suggested-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
---
drivers/md/dm-table.c | 41 ++++++++++++++++++++++++++++++++++++-----
drivers/md/dm-zone.c | 35 ++++++++++++++++++++++++++---------
drivers/md/dm.c | 6 ++++++
drivers/md/dm.h | 5 +++++
4 files changed, 73 insertions(+), 14 deletions(-)
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 66ebe76f8c9c..5200263b2635 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1492,6 +1492,18 @@ bool dm_table_has_no_data_devices(struct dm_table *t)
return true;
}
+bool dm_table_is_wildcard(struct dm_table *t)
+{
+ for (unsigned int i = 0; i < t->num_targets; i++) {
+ struct dm_target *ti = dm_table_get_target(t, i);
+
+ if (!dm_target_is_wildcard(ti->type))
+ return false;
+ }
+
+ return true;
+}
+
static int device_not_zoned(struct dm_target *ti, struct dm_dev *dev,
sector_t start, sector_t len, void *data)
{
@@ -1832,6 +1844,19 @@ static bool dm_table_supports_atomic_writes(struct dm_table *t)
return true;
}
+bool dm_table_supports_size_change(struct dm_table *t, sector_t old_size,
+ sector_t new_size)
+{
+ if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) && dm_has_zone_plugs(t->md) &&
+ old_size != new_size) {
+ DMWARN("%s: device has zone write plug resources. "
+ "Cannot change size",
+ dm_device_name(t->md));
+ return false;
+ }
+ return true;
+}
+
int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
struct queue_limits *limits)
{
@@ -1869,11 +1894,17 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
limits->features &= ~BLK_FEAT_DAX;
/* For a zoned table, setup the zone related queue attributes. */
- if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) &&
- (limits->features & BLK_FEAT_ZONED)) {
- r = dm_set_zones_restrictions(t, q, limits);
- if (r)
- return r;
+ if (IS_ENABLED(CONFIG_BLK_DEV_ZONED)) {
+ if (limits->features & BLK_FEAT_ZONED) {
+ r = dm_set_zones_restrictions(t, q, limits);
+ if (r)
+ return r;
+ } else if (dm_has_zone_plugs(t->md)) {
+ DMWARN("%s: device has zone write plug resources. "
+ "Cannot switch to non-zoned table.",
+ dm_device_name(t->md));
+ return -EINVAL;
+ }
}
if (dm_table_supports_atomic_writes(t))
diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
index 11e19281bb64..1d419734fefc 100644
--- a/drivers/md/dm-zone.c
+++ b/drivers/md/dm-zone.c
@@ -158,22 +158,22 @@ int dm_revalidate_zones(struct dm_table *t, struct request_queue *q)
{
struct mapped_device *md = t->md;
struct gendisk *disk = md->disk;
+ unsigned int nr_zones = disk->nr_zones;
int ret;
if (!get_capacity(disk))
return 0;
- /* Revalidate only if something changed. */
- if (!disk->nr_zones || disk->nr_zones != md->nr_zones) {
- DMINFO("%s using %s zone append",
- disk->disk_name,
- queue_emulates_zone_append(q) ? "emulated" : "native");
- md->nr_zones = 0;
- }
-
- if (md->nr_zones)
+ /*
+ * Do not revalidate if zone append emulation resources have already
+ * been allocated.
+ */
+ if (dm_has_zone_plugs(md))
return 0;
+ DMINFO("%s using %s zone append", disk->disk_name,
+ queue_emulates_zone_append(q) ? "emulated" : "native");
+
/*
* Our table is not live yet. So the call to dm_get_live_table()
* in dm_blk_report_zones() will fail. Set a temporary pointer to
@@ -187,6 +187,7 @@ int dm_revalidate_zones(struct dm_table *t, struct request_queue *q)
if (ret) {
DMERR("Revalidate zones failed %d", ret);
+ disk->nr_zones = nr_zones;
return ret;
}
@@ -383,12 +384,28 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
lim->max_open_zones = 0;
lim->max_active_zones = 0;
lim->max_hw_zone_append_sectors = 0;
+ lim->max_zone_append_sectors = 0;
lim->zone_write_granularity = 0;
lim->chunk_sectors = 0;
lim->features &= ~BLK_FEAT_ZONED;
return 0;
}
+ if (get_capacity(disk) && dm_has_zone_plugs(t->md)) {
+ if (q->limits.chunk_sectors != lim->chunk_sectors) {
+ DMWARN("%s: device has zone write plug resources. "
+ "Cannot change zone size",
+ disk->disk_name);
+ return -EINVAL;
+ }
+ if (lim->max_hw_zone_append_sectors != 0 &&
+ !dm_table_is_wildcard(t)) {
+ DMWARN("%s: device has zone write plug resources. "
+ "New table must emulate zone append",
+ disk->disk_name);
+ return -EINVAL;
+ }
+ }
/*
* Warn once (when the capacity is not yet set) if the mapped device is
* partially using zone resources of the target devices as that leads to
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 292414da871d..240f6dab8dda 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2429,6 +2429,12 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,
size = dm_table_get_size(t);
old_size = dm_get_size(md);
+
+ if (!dm_table_supports_size_change(t, old_size, size)) {
+ old_map = ERR_PTR(-EINVAL);
+ goto out;
+ }
+
set_capacity(md->disk, size);
ret = dm_table_set_restrictions(t, md->queue, limits);
diff --git a/drivers/md/dm.h b/drivers/md/dm.h
index e5d3a9f46a91..245f52b59215 100644
--- a/drivers/md/dm.h
+++ b/drivers/md/dm.h
@@ -58,6 +58,7 @@ void dm_table_event_callback(struct dm_table *t,
void (*fn)(void *), void *context);
struct dm_target *dm_table_find_target(struct dm_table *t, sector_t sector);
bool dm_table_has_no_data_devices(struct dm_table *table);
+bool dm_table_is_wildcard(struct dm_table *t);
int dm_calculate_queue_limits(struct dm_table *table,
struct queue_limits *limits);
int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
@@ -72,6 +73,8 @@ struct target_type *dm_table_get_immutable_target_type(struct dm_table *t);
struct dm_target *dm_table_get_immutable_target(struct dm_table *t);
struct dm_target *dm_table_get_wildcard_target(struct dm_table *t);
bool dm_table_request_based(struct dm_table *t);
+bool dm_table_supports_size_change(struct dm_table *t, sector_t old_size,
+ sector_t new_size);
void dm_lock_md_type(struct mapped_device *md);
void dm_unlock_md_type(struct mapped_device *md);
@@ -111,12 +114,14 @@ bool dm_is_zone_write(struct mapped_device *md, struct bio *bio);
int dm_zone_get_reset_bitmap(struct mapped_device *md, struct dm_table *t,
sector_t sector, unsigned int nr_zones,
unsigned long *need_reset);
+#define dm_has_zone_plugs(md) ((md)->disk->zone_wplugs_hash != NULL)
#else
#define dm_blk_report_zones NULL
static inline bool dm_is_zone_write(struct mapped_device *md, struct bio *bio)
{
return false;
}
+#define dm_has_zone_plugs(md) false
#endif
/*
--
2.48.1
^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH v2 3/6] block: make queue_limits_set() optionally return old limits
2025-03-17 4:45 ` [PATCH v2 3/6] block: make queue_limits_set() optionally return old limits Benjamin Marzinski
@ 2025-03-18 6:56 ` Christoph Hellwig
2025-03-18 19:23 ` Benjamin Marzinski
0 siblings, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2025-03-18 6:56 UTC (permalink / raw)
To: Benjamin Marzinski
Cc: Mikulas Patocka, Mike Snitzer, Jens Axboe, dm-devel, linux-block,
Damien Le Moal, Christoph Hellwig
On Mon, Mar 17, 2025 at 12:45:07AM -0400, Benjamin Marzinski wrote:
> A future device-mapper patch will make use of this new argument. No
> functional changes intended in this patch.
If you care about the old limits just use a queue_limits_start_update +
queue_limits_commit_update{,frozen} pair.
* Re: [PATCH v2 5/6] dm: fix dm_blk_report_zones
2025-03-17 4:45 ` [PATCH v2 5/6] dm: fix dm_blk_report_zones Benjamin Marzinski
@ 2025-03-18 6:57 ` Christoph Hellwig
2025-03-18 22:10 ` Benjamin Marzinski
0 siblings, 1 reply; 14+ messages in thread
From: Christoph Hellwig @ 2025-03-18 6:57 UTC (permalink / raw)
To: Benjamin Marzinski
Cc: Mikulas Patocka, Mike Snitzer, Jens Axboe, dm-devel, linux-block,
Damien Le Moal, Christoph Hellwig
On Mon, Mar 17, 2025 at 12:45:09AM -0400, Benjamin Marzinski wrote:
> __bind(). Otherwise the zone resources could change while accessing
> them. Finally, it is possible that md->zone_revalidate_map will change
> while calling this function. Only read it once, so that we are always
> using the same value. Otherwise we might miss a call to
> dm_put_live_table().
This checking for calling context is pretty ugly. Can you just use
a separate report zones helper or at least a clearly documented flag
for it?
* Re: [PATCH v2 3/6] block: make queue_limits_set() optionally return old limits
2025-03-18 6:56 ` Christoph Hellwig
@ 2025-03-18 19:23 ` Benjamin Marzinski
0 siblings, 0 replies; 14+ messages in thread
From: Benjamin Marzinski @ 2025-03-18 19:23 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Mikulas Patocka, Mike Snitzer, Jens Axboe, dm-devel, linux-block,
Damien Le Moal
On Tue, Mar 18, 2025 at 07:56:03AM +0100, Christoph Hellwig wrote:
> On Mon, Mar 17, 2025 at 12:45:07AM -0400, Benjamin Marzinski wrote:
> > A future device-mapper patch will make use of this new argument. No
> > functional changes intended in this patch.
>
> If you care about the old limits just use a queue_limits_start_update +
> queue_limits_commit_update{,frozen} pair.
Yep. I'll change that.
-Ben
* Re: [PATCH v2 5/6] dm: fix dm_blk_report_zones
2025-03-18 6:57 ` Christoph Hellwig
@ 2025-03-18 22:10 ` Benjamin Marzinski
2025-03-19 1:40 ` Damien Le Moal
0 siblings, 1 reply; 14+ messages in thread
From: Benjamin Marzinski @ 2025-03-18 22:10 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Mikulas Patocka, Mike Snitzer, Jens Axboe, dm-devel, linux-block,
Damien Le Moal
On Tue, Mar 18, 2025 at 07:57:21AM +0100, Christoph Hellwig wrote:
> On Mon, Mar 17, 2025 at 12:45:09AM -0400, Benjamin Marzinski wrote:
> > __bind(). Otherwise the zone resources could change while accessing
> > them. Finally, it is possible that md->zone_revalidate_map will change
> > while calling this function. Only read it once, so that we are always
> > using the same value. Otherwise we might miss a call to
> > dm_put_live_table().
>
> This checking for calling context is pretty ugly. Can you just use
> a separate report zones helper or at least a clearly documented flag for it?
Not sure how that would work. The goal is to keep another process,
calling something like blkdev_report_zones_ioctl(), from being able to
call its report_zones_cb, while DM is in blk_revalidate_disk_zones()
which needs to use that same disk->fops->report_zones() interface in
order to call blk_revalidate_zone_cb(). We need some way to distinguish
between the callers. We could export blk_revalidate_zone_cb() and have
dm_blk_report_zones() check if it was called with that report_zones_cb.
That seems just as ugly to me. But if you like that better, fine.
Otherwise I don't see how we can distinguish between the call from
blk_revalidate_disk_zones() and a call from something like
blkdev_report_zones_ioctl(). Am I not understanding your suggestion?
Allowing the blkdev_report_zones_ioctl() call to go ahead using
md->zone_revalidate_map runs into three problems.
1. It's running while the device is switching tables, and there's no
guarantee that DM won't encounter an error and have to fail back. So it
could report information for this unused table. We could just say that
this is what you get from trying to grab the zone information of a
device while it's switching tables. Fine.
2. Without this patch, it's possible that while
blkdev_report_zones_ioctl() is still running its report_zones_cb, DM
fails switching tables and frees the new table that's being used by
blkdev_report_zones_ioctl(), causing use-after-free errors. However,
this is solvable by calling srcu_read_lock(&md->io_barrier) to make sure
that we're in a SRCU read section while using md->zone_revalidate_map.
Making that change should make DM as safe as any other zoned device,
which brings me to...
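That SRCU change might look roughly like the following sketch (my reading only, not the actual patch; arguments and error handling elided):

```c
/* Sketch: keep a concurrent BLKREPORTZONE caller inside an SRCU read
 * section while it dereferences md->zone_revalidate_map, so a failed
 * table switch cannot free the table underneath it. */
srcu_idx = srcu_read_lock(&md->io_barrier);
/* Read the revalidate map exactly once. */
map = READ_ONCE(md->zone_revalidate_map);
if (!map)
	map = srcu_dereference(md->map, &md->io_barrier);
/* ... run the report against map ... */
srcu_read_unlock(&md->io_barrier, srcu_idx);
```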
3. On any zoned device, not just DM, what's stopping
one process from syncing the zone write plug offsets:
blkdev_report_zones_ioctl() -> blkdev_report_zones() ->
various.report_zones() -> disk_report_zones_cb() ->
disk_zone_wplug_sync_wp_offset()
While another process is running into problems while dealing with the
gendisk's zone configuration changing:
blk_revalidate_disk_zones() -> disk_free_zone_resources()
I don't see any synchronizing between these two call paths. Am I missing
something that stops this? Can this only happen for DM devices, for some
reason? If this can't happen, I'm fine with just adding a
srcu_read_lock() to dm_blk_report_zones() and calling it good. If it
can happen, then we should fix that.
-Ben
* Re: [PATCH v2 5/6] dm: fix dm_blk_report_zones
2025-03-18 22:10 ` Benjamin Marzinski
@ 2025-03-19 1:40 ` Damien Le Moal
2025-03-19 16:50 ` Benjamin Marzinski
0 siblings, 1 reply; 14+ messages in thread
From: Damien Le Moal @ 2025-03-19 1:40 UTC (permalink / raw)
To: Benjamin Marzinski, Christoph Hellwig
Cc: Mikulas Patocka, Mike Snitzer, Jens Axboe, dm-devel, linux-block
On 3/19/25 7:10 AM, Benjamin Marzinski wrote:
> On Tue, Mar 18, 2025 at 07:57:21AM +0100, Christoph Hellwig wrote:
>> On Mon, Mar 17, 2025 at 12:45:09AM -0400, Benjamin Marzinski wrote:
>>> __bind(). Otherwise the zone resources could change while accessing
>>> them. Finally, it is possible that md->zone_revalidate_map will change
>>> while calling this function. Only read it once, so that we are always
>>> using the same value. Otherwise we might miss a call to
>>> dm_put_live_table().
>>
>> This checking for calling context is pretty ugly. Can you just use
>> a separate report zones helper or at least a clearly documented flag for it?
>
> Not sure how that would work. The goal is to keep another process,
> calling something like blkdev_report_zones_ioctl(), from being able to
> call its report_zones_cb, while DM is in blk_revalidate_disk_zones()
> which needs to use that same disk->fops->report_zones() interface in
> order to call blk_revalidate_zone_cb(). We need some way to distinguish
> between the callers. We could export blk_revalidate_zone_cb() and have
> dm_blk_report_zones() check if it was called with that report_zones_cb.
> That seems just as ugly to me. But if you like that better, fine.
> Otherwise I don't see how we can distinguish between the call from
> blk_revalidate_disk_zones() and a call from something like
> blkdev_report_zones_ioctl(). Am I not understanding your suggestion?
>
> Allowing the blkdev_report_zones_ioctl() call to go ahead using
> md->zone_revalidate_map runs into three problems.
>
> 1. It's running while the device is switching tables, and there's no
> guarantee that DM won't encounter an error and have to fail back. So it
> could report information for this unused table. We could just say that
> this is what you get from trying to grab the zone information of a
> device while it's switching tables. Fine.
>
> 2. Without this patch, it's possible that while
> blkdev_report_zones_ioctl() is still running its report_zones_cb, DM
> fails switching tables and frees the new table that's being used by
> blkdev_report_zones_ioctl(), causing use-after-free errors. However,
> this is solvable by calling srcu_read_lock(&md->io_barrier) to make sure
> that we're in a SRCU read section while using md->zone_revalidate_map.
> Making that change should make DM as safe as any other zoned device,
> which brings me to...
>
> 3. On any zoned device, not just DM, what's stopping
> one process from syncing the zone write plug offsets:
> blkdev_report_zones_ioctl() -> blkdev_report_zones() ->
> various.report_zones() -> disk_report_zones_cb() ->
> disk_zone_wplug_sync_wp_offset()
disk_zone_wplug_sync_wp_offset() is a nop for any zone write plug that does not
have BLK_ZONE_WPLUG_NEED_WP_UPDATE set, and that flag is set only for zones
that had a write error. Given that recovering from a write error on a zoned
device requires one to:
1) stop writing to the zone,
2) do a report zones (which will sync the wp offset),
3) restart writing,
there are no write IOs going on for the zone being reported, for a well-behaved
user. If the user is not well behaved, it will get errors/an out-of-sync wp
until the user behaves :) So I would not worry too much about this.
As we discussed, the table switching should be allowed only for the wildcard
target (dm-error) and that's it. We should not allow table switching otherwise.
And given that inserting dm-error does not cause any change to the zone
configuration, I do not see why we need to revalidate zones.
>
> While another process is running into problems while dealing with the
> gendisk's zone configuration changing:
>
> blk_revalidate_disk_zones() -> disk_free_zone_resources()
We should not allow the zone configuration to change. That is impossible to
deal with at the user level.
> I don't see any synchronizing between these two call paths. Am I missing
> something that stops this? Can this only happen for DM devices, for some
> reason? If this can't happen, I'm fine with just adding a
> srcu_read_lock() to dm_blk_report_zones() and calling it good. If it
> can happen, then we should fix that.
I think it can happen today but no-one ever had this issue because no-one has
tried to switch a dm-table on a zoned drive. And I really think we should
prevent that from being done, except for dm-error since that's used for fstests.
>
> -Ben
>
--
Damien Le Moal
Western Digital Research
* Re: [PATCH v2 5/6] dm: fix dm_blk_report_zones
2025-03-19 1:40 ` Damien Le Moal
@ 2025-03-19 16:50 ` Benjamin Marzinski
0 siblings, 0 replies; 14+ messages in thread
From: Benjamin Marzinski @ 2025-03-19 16:50 UTC (permalink / raw)
To: Damien Le Moal
Cc: Christoph Hellwig, Mikulas Patocka, Mike Snitzer, Jens Axboe,
dm-devel, linux-block
On Wed, Mar 19, 2025 at 10:40:38AM +0900, Damien Le Moal wrote:
> On 3/19/25 7:10 AM, Benjamin Marzinski wrote:
> > On Tue, Mar 18, 2025 at 07:57:21AM +0100, Christoph Hellwig wrote:
> >> On Mon, Mar 17, 2025 at 12:45:09AM -0400, Benjamin Marzinski wrote:
> >>> __bind(). Otherwise the zone resources could change while accessing
> >>> them. Finally, it is possible that md->zone_revalidate_map will change
> >>> while calling this function. Only read it once, so that we are always
> >>> using the same value. Otherwise we might miss a call to
> >>> dm_put_live_table().
> >>
> >> This checking for calling context is pretty ugly. Can you just use
> >> a separate report zones helper or at least a clearly documented flag for it?
> >
> > Not sure how that would work. The goal is to keep another process,
> > calling something like blkdev_report_zones_ioctl(), from being able to
> > call its report_zones_cb, while DM is in blk_revalidate_disk_zones()
> > which needs to use that same disk->fops->report_zones() interface in
> > order to call blk_revalidate_zone_cb(). We need some way to distinguish
> > between the callers. We could export blk_revalidate_zone_cb() and have
> > dm_blk_report_zones() check if it was called with that report_zones_cb.
> > That seems just as ugly to me. But if you like that better, fine.
> > Otherwise I don't see how we can distinguish between the call from
> > blk_revalidate_disk_zones() and a call from something like
> > blkdev_report_zones_ioctl(). Am I not understanding your suggestion?
> >
> > Allowing the blkdev_report_zones_ioctl() call to go ahead using
> > md->zone_revalidate_map runs into three problems.
> >
> > 1. It's running while the device is switching tables, and there's no
> > guarantee that DM won't encounter an error and have to fail back. So it
> > could report information for this unused table. We could just say that
> > this is what you get from trying to grab the zone information of a
> > device while it's switching tables. Fine.
> >
> > 2. Without this patch, it's possible that while
> > blkdev_report_zones_ioctl() is still running its report_zones_cb, DM
> > fails switching tables and frees the new table that's being used by
> > blkdev_report_zones_ioctl(), causing use-after-free errors. However,
> > this is solvable by calling srcu_read_lock(&md->io_barrier) to make sure
> > that we're in a SRCU read section while using md->zone_revalidate_map.
> > Making that change should make DM as safe as any other zoned device,
> > which brings me to...
> >
> > 3. On any zoned device, not just DM, what's stopping
> > one process from syncing the zone write plug offsets:
> > blkdev_report_zones_ioctl() -> blkdev_report_zones() ->
> > various.report_zones() -> disk_report_zones_cb() ->
> > disk_zone_wplug_sync_wp_offset()
>
> disk_zone_wplug_sync_wp_offset() is a nop for any zone write plug that does not
> have BLK_ZONE_WPLUG_NEED_WP_UPDATE set, and that flag is set only for zones
> that had a write error. Given that recovering from a write error on a zoned
> device requires one to:
> 1) stop writing to the zone,
> 2) do a report zones (which will sync the wp offset),
> 3) restart writing,
> there are no write IOs going on for the zone being reported, for a well-behaved
> user. If the user is not well behaved, it will get errors/an out-of-sync wp
> until the user behaves :) So I would not worry too much about this.
The issue isn't that there could be IO errors, it's that we are reading
from and writing to these zwplugs without any way to keep them from
being freed while we're doing so.
> As we discussed, the table switching should be allowed only for the wildcard
> target (dm-error) and that's it. We should not allow table switching otherwise.
> And given that inserting dm-error does not cause any change to the zone
> configuration, I do not see why we need to revalidate zones.
With this patchset we will never call blk_revalidate_disk_zones() when
we have disk->zone_wplugs_hash, which means that
disk_zone_wplug_sync_wp_offset() can never be called once we are using a
zoned table, so that's not a problem.
The problem we do have happens when we are initially switching to that
zoned table. blk_revalidate_disk_zones() can allocate
disk->zone_wplugs_hash, and then later fail and free it. In this case, if
some userspace process does a BLKREPORTZONE ioctl() while
md->zone_revalidate_map is set and disk->zone_wplugs_hash exists, it
could still be running when disk->zone_wplugs_hash gets freed. We
obviously can't ban loading zoned dm-crypt tables entirely.
I admit, this is a very unlikely occurrence, but very unlikely memory
corruptions are still memory corruptions, and much harder to track down
when they do occur. My original patch, which only allows the task that
set md->zone_revalidate_map to work during suspends, solves this issue.
So does my suggestion that dm_blk_report_zones() only allow calls with
blk_revalidate_zone_cb() as the callback to use md->zone_revalidate_map.
If Christoph has a better solution that I'm misunderstanding, I'm fine
with that too. But since we want people to be able to set up dm-crypt on
top of zoned devices, we have to guard against userspace processes
running at the same time as we're doing that, in case we fail and free
the resources.
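For concreteness, the callback-identity alternative mentioned above might look like this sketch (it assumes blk_revalidate_zone_cb() were exported; not an actual patch):

```c
/* Sketch: only the revalidation path, identified by its callback, may
 * use the not-yet-live table; everyone else reads the live table. */
if (md->zone_revalidate_map && cb == blk_revalidate_zone_cb)
	map = md->zone_revalidate_map;
else
	map = dm_get_live_table(md, &srcu_idx);
```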
> >
> > While another process is running into problems while dealing with the
> > gendisk's zone configuration changing:
> >
> > blk_revalidate_disk_zones() -> disk_free_zone_resources()
>
> We should not allow the zone configuration to change. That is impossible to
> deal with at the user level.
The comments above blk_revalidate_disk_zones() sure make it sound like
it could get called after a device has already been set up:
/*
* blk_revalidate_disk_zones - (re)allocate and initialize zone write plugs
* @disk: Target disk
*
* Helper function for low-level device drivers to check, (re) allocate and
* initialize resources used for managing zoned disks. This function should
* normally be called by blk-mq based drivers when a zoned gendisk is probed
* and when the zone configuration of the gendisk changes (e.g. after a format).
* Before calling this function, the device driver must already have set the
* device zone size (chunk_sector limit) and the max zone append limit.
* BIO based drivers can also use this function as long as the device queue
* can be safely frozen.
*/
and a brief look does make it appear that nvme and scsi devices can call
this when the device is rescanned. Perhaps there's no risk of them
failing in this case?
> > I don't see any synchronizing between these two call paths. Am I missing
> > something that stops this? Can this only happen for DM devices, for some
> > reason? If this can't happen, I'm fine with just adding a
> > srcu_read_lock() to dm_blk_report_zones() and calling it good. If it
> > can happen, then we should fix that.
>
> I think it can happen today but no-one ever had this issue because no-one has
> tried to switch a dm-table on a zoned drive. And I really think we should
> prevent that from being done, except for dm-error since that's used for fstests.
This patchset allows zoned devices that don't allocate zone write plug
resources to be arbitrarily changed. This never caused problems, so we
have no idea how often people were doing it. I don't see anything unsafe
there, and people typically do all sorts of things with linear dm devices.
For devices that have allocated zone write plug resources, they can only
reload tables that do not change the zone resources (only dm-crypt and
dm-error targets will work here), and blk_revalidate_disk_zones() is
never called. Not allowing any dm-crypt reloads would keep people from
reloading a dm-crypt device with the exact same table or simply with an
optional parameter changed. Doing that is safe, and again we have no
idea how often this has been happening, since it wouldn't cause any
problems.
-Ben
> >
> > -Ben
> >
>
>
> --
> Damien Le Moal
> Western Digital Research
* Re: [PATCH v2 0/6] dm: fix issues with swapping dm tables
2025-03-17 4:45 [PATCH v2 0/6] dm: fix issues with swapping dm tables Benjamin Marzinski
` (5 preceding siblings ...)
2025-03-17 4:45 ` [PATCH v2 6/6] dm: limit swapping tables for devices with zone write plugs Benjamin Marzinski
@ 2025-03-19 19:07 ` Benjamin Marzinski
6 siblings, 0 replies; 14+ messages in thread
From: Benjamin Marzinski @ 2025-03-19 19:07 UTC (permalink / raw)
To: Mikulas Patocka, Mike Snitzer, Jens Axboe
Cc: dm-devel, linux-block, Damien Le Moal, Christoph Hellwig
I noticed another issue while playing around with this. If you have a
linear device stacked on top of a zoned dm-crypt device, it will
allocate zone write plug resources, although it won't use them.
Actually, this is also true if you create a linear device on top of a
zoned scsi_debug device.
The reason is that both dm-crypt and scsi_debug set
lim->max_hw_zone_append_sectors to 0. This makes sense: both of them are
simply emulating zone append. DM's check for devices that need zone
append emulation, dm_table_supports_zone_append(), correctly identifies
these linear devices as not needing emulation, so the
DMF_EMULATE_ZONE_APPEND isn't set and dm doesn't do zone write plugging.
But since these devices have lim->max_hw_zone_append_sectors = 0,
blk_revalidate_disk_zones() thinks they do need zone append emulation,
and allocates resources for it.
My question is "How do we solve this?" The easy DM answer is to change
dm_set_zones_restrictions() to do something like:
if (!dm_table_supports_zone_append(t))
lim->max_hw_zone_append_sectors = 0;
else if (lim->max_hw_zone_append_sectors == 0)
lim->max_hw_zone_append_sectors = lim->max_zone_append_sectors;
But I wonder if the correct fix is to change blk_stack_limits()
to do something like:
t->max_hw_zone_append_sectors = min(t->max_hw_zone_append_sectors,
b->max_zone_append_sectors);
Even if the bottom device is emulating zone appends, from the point
of view of the top device, that is the max zone append sectors that
the underlying *hardware* supports. It's just that more drivers
than just DM use this, and perhaps some really do need to know what
the actual hardware supports, and not just what the device below them
can support or emulate.
Thoughts?
-Ben
On Mon, Mar 17, 2025 at 12:45:04AM -0400, Benjamin Marzinski wrote:
> There were multiple places in dm's __bind() function where it could fail
> and not completely roll back, leaving the device using the old
> table, but with device limits and resources from the new table.
> Additionally, unused mempools for request-based devices were not always
> freed immediately.
>
> Finally, there were a number of issues with switching zoned tables that
> emulate zone append (in other words, dm-crypt on top of zoned devices).
> dm_blk_report_zones() could be called while the device was suspended and
> modifying zoned resources or could possibly fail to end a srcu read
> section. More importantly, blk_revalidate_disk_zones() would never get
> called when updating a zoned table. This could cause the dm device to
> see the wrong zone write offsets, not have a large enough zwplugs
> reserved in its mempool, or read invalid memory when checking the
> conventional zones bitmap.
>
> This patchset fixes these issues. It deals with the problems around
> blk_revalidate_disk_zones() by only calling it for devices that have no
> zone write plug resources. This will always correctly update the zoned
> settings. If a device has zone write plug resources, calling
> blk_revalidate_disk_zones() will not correctly update them in many
> cases, so DM simply doesn't call it for devices with zone write plug
> resources. Instead of allowing people to load tables that can break the
> device, like currently happens, DM disallows any table reloads that
> change the zoned setting for devices that already have zone write plug
> resources.
>
> Specifically, if a device already has zone plug resources allocated, it
> can only switch to another zoned table that also emulates zone append.
> Also, it cannot change the device size or the zone size. There are some
> tweaks to make sure that a device can always switch to an error target.
>
> Changes in V2:
> - Made queue_limits_set() optionally return the old limits (grabbed
> while holding the limits_lock), and used this in
> dm_table_set_restrictions()
> - dropped changes to disk_free_zone_resources() and the
> blk_revalidate_disk_zones() code path (removed patches 0005 & 0006)
> - Instead of always calling blk_revalidate_disk_zones(), disallow
> changes that would change zone settings if the device has
> zone write plug resources (final patch).
>
> Benjamin Marzinski (6):
> dm: don't change md if dm_table_set_restrictions() fails
> dm: free table mempools if not used in __bind
> block: make queue_limits_set() optionally return old limits
> dm: handle failures in dm_table_set_restrictions
> dm: fix dm_blk_report_zones
> dm: limit swapping tables for devices with zone write plugs
>
> block/blk-settings.c | 9 ++++-
> drivers/md/dm-core.h | 1 +
> drivers/md/dm-table.c | 66 ++++++++++++++++++++++++++-------
> drivers/md/dm-zone.c | 84 +++++++++++++++++++++++++++++-------------
> drivers/md/dm.c | 36 +++++++++++-------
> drivers/md/dm.h | 6 +++
> drivers/md/md-linear.c | 2 +-
> drivers/md/raid0.c | 2 +-
> drivers/md/raid1.c | 2 +-
> drivers/md/raid10.c | 2 +-
> drivers/md/raid5.c | 2 +-
> include/linux/blkdev.h | 3 +-
> 12 files changed, 154 insertions(+), 61 deletions(-)
>
> --
> 2.48.1
>
2025-03-17 4:45 [PATCH v2 0/6] dm: fix issues with swapping dm tables Benjamin Marzinski
2025-03-17 4:45 ` [PATCH v2 1/6] dm: don't change md if dm_table_set_restrictions() fails Benjamin Marzinski
2025-03-17 4:45 ` [PATCH v2 2/6] dm: free table mempools if not used in __bind Benjamin Marzinski
2025-03-17 4:45 ` [PATCH v2 3/6] block: make queue_limits_set() optionally return old limits Benjamin Marzinski
2025-03-18 6:56 ` Christoph Hellwig
2025-03-18 19:23 ` Benjamin Marzinski
2025-03-17 4:45 ` [PATCH v2 4/6] dm: handle failures in dm_table_set_restrictions Benjamin Marzinski
2025-03-17 4:45 ` [PATCH v2 5/6] dm: fix dm_blk_report_zones Benjamin Marzinski
2025-03-18 6:57 ` Christoph Hellwig
2025-03-18 22:10 ` Benjamin Marzinski
2025-03-19 1:40 ` Damien Le Moal
2025-03-19 16:50 ` Benjamin Marzinski
2025-03-17 4:45 ` [PATCH v2 6/6] dm: limit swapping tables for devices with zone write plugs Benjamin Marzinski
2025-03-19 19:07 ` [PATCH v2 0/6] dm: fix issues with swapping dm tables Benjamin Marzinski