* [PATCH v3 1/6] dm: don't change md if dm_table_set_restrictions() fails
2025-04-08 23:51 [PATCH v3 0/6] [PATCH v3 0/6] dm: fix issues with swapping dm tables Benjamin Marzinski
@ 2025-04-08 23:51 ` Benjamin Marzinski
2025-04-08 23:51 ` [PATCH v3 2/6] dm: free table mempools if not used in __bind Benjamin Marzinski
` (5 subsequent siblings)
6 siblings, 0 replies; 15+ messages in thread
From: Benjamin Marzinski @ 2025-04-08 23:51 UTC (permalink / raw)
To: Mikulas Patocka, Mike Snitzer; +Cc: dm-devel, Damien Le Moal, Christoph Hellwig
__bind was changing the disk capacity, geometry and mempools of the
mapped device before calling dm_table_set_restrictions() which could
fail, forcing dm to drop the new table. Failing here would leave the
device using the old table but with the wrong capacity and mempools.
Move dm_table_set_restrictions() earlier in __bind(). Since it needs the
capacity to be set, save the old version and restore it on failure.
Fixes: bb37d77239af2 ("dm: introduce zone append emulation")
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
---
drivers/md/dm.c | 22 ++++++++++++----------
1 file changed, 12 insertions(+), 10 deletions(-)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 5ab7574c0c76..f5c5ccb6f8d2 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2421,21 +2421,29 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,
struct queue_limits *limits)
{
struct dm_table *old_map;
- sector_t size;
+ sector_t size, old_size;
int ret;
lockdep_assert_held(&md->suspend_lock);
size = dm_table_get_size(t);
+ old_size = dm_get_size(md);
+ set_capacity(md->disk, size);
+
+ ret = dm_table_set_restrictions(t, md->queue, limits);
+ if (ret) {
+ set_capacity(md->disk, old_size);
+ old_map = ERR_PTR(ret);
+ goto out;
+ }
+
/*
* Wipe any geometry if the size of the table changed.
*/
- if (size != dm_get_size(md))
+ if (size != old_size)
memset(&md->geometry, 0, sizeof(md->geometry));
- set_capacity(md->disk, size);
-
dm_table_event_callback(t, event_callback, md);
if (dm_table_request_based(t)) {
@@ -2468,12 +2476,6 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,
t->mempools = NULL;
}
- ret = dm_table_set_restrictions(t, md->queue, limits);
- if (ret) {
- old_map = ERR_PTR(ret);
- goto out;
- }
-
old_map = rcu_dereference_protected(md->map, lockdep_is_held(&md->suspend_lock));
rcu_assign_pointer(md->map, (void *)t);
md->immutable_target_type = dm_table_get_immutable_target_type(t);
--
2.48.1
^ permalink raw reply related [flat|nested] 15+ messages in thread* [PATCH v3 2/6] dm: free table mempools if not used in __bind
2025-04-08 23:51 [PATCH v3 0/6] [PATCH v3 0/6] dm: fix issues with swapping dm tables Benjamin Marzinski
2025-04-08 23:51 ` [PATCH v3 1/6] dm: don't change md if dm_table_set_restrictions() fails Benjamin Marzinski
@ 2025-04-08 23:51 ` Benjamin Marzinski
2025-04-08 23:51 ` [PATCH v3 3/6] dm: handle failures in dm_table_set_restrictions Benjamin Marzinski
` (4 subsequent siblings)
6 siblings, 0 replies; 15+ messages in thread
From: Benjamin Marzinski @ 2025-04-08 23:51 UTC (permalink / raw)
To: Mikulas Patocka, Mike Snitzer; +Cc: dm-devel, Damien Le Moal, Christoph Hellwig
With request-based dm, the mempools don't need reloading when switching
tables, but the unused table mempools are not freed until the active
table is finally freed. Free them immediately if they are not needed.
Fixes: 29dec90a0f1d9 ("dm: fix bio_set allocation")
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
---
drivers/md/dm.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index f5c5ccb6f8d2..292414da871d 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2461,10 +2461,10 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,
* requests in the queue may refer to bio from the old bioset,
* so you must walk through the queue to unprep.
*/
- if (!md->mempools) {
+ if (!md->mempools)
md->mempools = t->mempools;
- t->mempools = NULL;
- }
+ else
+ dm_free_md_mempools(t->mempools);
} else {
/*
* The md may already have mempools that need changing.
@@ -2473,8 +2473,8 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,
*/
dm_free_md_mempools(md->mempools);
md->mempools = t->mempools;
- t->mempools = NULL;
}
+ t->mempools = NULL;
old_map = rcu_dereference_protected(md->map, lockdep_is_held(&md->suspend_lock));
rcu_assign_pointer(md->map, (void *)t);
--
2.48.1
^ permalink raw reply related [flat|nested] 15+ messages in thread* [PATCH v3 3/6] dm: handle failures in dm_table_set_restrictions
2025-04-08 23:51 [PATCH v3 0/6] [PATCH v3 0/6] dm: fix issues with swapping dm tables Benjamin Marzinski
2025-04-08 23:51 ` [PATCH v3 1/6] dm: don't change md if dm_table_set_restrictions() fails Benjamin Marzinski
2025-04-08 23:51 ` [PATCH v3 2/6] dm: free table mempools if not used in __bind Benjamin Marzinski
@ 2025-04-08 23:51 ` Benjamin Marzinski
2025-04-09 5:57 ` Damien Le Moal
2025-04-08 23:51 ` [PATCH v3 4/6] dm: fix dm_blk_report_zones Benjamin Marzinski
` (3 subsequent siblings)
6 siblings, 1 reply; 15+ messages in thread
From: Benjamin Marzinski @ 2025-04-08 23:51 UTC (permalink / raw)
To: Mikulas Patocka, Mike Snitzer; +Cc: dm-devel, Damien Le Moal, Christoph Hellwig
If dm_table_set_restrictions() fails while swapping tables,
device-mapper will continue using the previous table. It must be sure to
leave the mapped_device in it's previous state on failure. Otherwise
device-mapper could end up using the old table with settings from the
unused table.
Do not update the mapped device in dm_set_zones_restrictions(). Wait
till after dm_table_set_restrictions() is sure to succeed to update the
md zoned settings. Do the same with the dax settings, and if
dm_revalidate_zones() fails, restore the original queue limits.
Fixes: 7f91ccd8a608d ("dm: Call dm_revalidate_zones() after setting the queue limits")
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
---
drivers/md/dm-table.c | 26 +++++++++++++++++---------
drivers/md/dm-zone.c | 26 ++++++++++++++++++--------
drivers/md/dm.h | 1 +
3 files changed, 36 insertions(+), 17 deletions(-)
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 04b3e9758e53..b8191ce691f1 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1840,6 +1840,7 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
struct queue_limits *limits)
{
int r;
+ struct queue_limits old_limits;
if (!dm_table_supports_nowait(t))
limits->features &= ~BLK_FEAT_NOWAIT;
@@ -1866,16 +1867,11 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
if (dm_table_supports_flush(t))
limits->features |= BLK_FEAT_WRITE_CACHE | BLK_FEAT_FUA;
- if (dm_table_supports_dax(t, device_not_dax_capable)) {
+ if (dm_table_supports_dax(t, device_not_dax_capable))
limits->features |= BLK_FEAT_DAX;
- if (dm_table_supports_dax(t, device_not_dax_synchronous_capable))
- set_dax_synchronous(t->md->dax_dev);
- } else
+ else
limits->features &= ~BLK_FEAT_DAX;
- if (dm_table_any_dev_attr(t, device_dax_write_cache_enabled, NULL))
- dax_write_cache(t->md->dax_dev, true);
-
/* For a zoned table, setup the zone related queue attributes. */
if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) &&
(limits->features & BLK_FEAT_ZONED)) {
@@ -1887,7 +1883,8 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
if (dm_table_supports_atomic_writes(t))
limits->features |= BLK_FEAT_ATOMIC_WRITES;
- r = queue_limits_set(q, limits);
+ old_limits = queue_limits_start_update(q);
+ r = queue_limits_commit_update(q, limits);
if (r)
return r;
@@ -1898,10 +1895,21 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) &&
(limits->features & BLK_FEAT_ZONED)) {
r = dm_revalidate_zones(t, q);
- if (r)
+ if (r) {
+ queue_limits_set(q, &old_limits);
return r;
+ }
}
+ if (IS_ENABLED(CONFIG_BLK_DEV_ZONED))
+ dm_finalize_zone_settings(t, limits);
+
+ if (dm_table_supports_dax(t, device_not_dax_synchronous_capable))
+ set_dax_synchronous(t->md->dax_dev);
+
+ if (dm_table_any_dev_attr(t, device_dax_write_cache_enabled, NULL))
+ dax_write_cache(t->md->dax_dev, true);
+
dm_update_crypto_profile(q, t);
return 0;
}
diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
index 20edd3fabbab..681058feb63b 100644
--- a/drivers/md/dm-zone.c
+++ b/drivers/md/dm-zone.c
@@ -340,12 +340,8 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
* mapped device queue as needing zone append emulation.
*/
WARN_ON_ONCE(queue_is_mq(q));
- if (dm_table_supports_zone_append(t)) {
- clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
- } else {
- set_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
+ if (!dm_table_supports_zone_append(t))
lim->max_hw_zone_append_sectors = 0;
- }
/*
* Determine the max open and max active zone limits for the mapped
@@ -383,9 +379,6 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
lim->zone_write_granularity = 0;
lim->chunk_sectors = 0;
lim->features &= ~BLK_FEAT_ZONED;
- clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
- md->nr_zones = 0;
- disk->nr_zones = 0;
return 0;
}
@@ -408,6 +401,23 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
return 0;
}
+void dm_finalize_zone_settings(struct dm_table *t, struct queue_limits *lim)
+{
+ struct mapped_device *md = t->md;
+
+ if (lim->features & BLK_FEAT_ZONED) {
+ if (dm_table_supports_zone_append(t))
+ clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
+ else
+ set_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
+ } else {
+ clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
+ md->nr_zones = 0;
+ md->disk->nr_zones = 0;
+ }
+}
+
+
/*
* IO completion callback called from clone_endio().
*/
diff --git a/drivers/md/dm.h b/drivers/md/dm.h
index a0a8ff119815..e5d3a9f46a91 100644
--- a/drivers/md/dm.h
+++ b/drivers/md/dm.h
@@ -102,6 +102,7 @@ int dm_setup_md_queue(struct mapped_device *md, struct dm_table *t);
int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
struct queue_limits *lim);
int dm_revalidate_zones(struct dm_table *t, struct request_queue *q);
+void dm_finalize_zone_settings(struct dm_table *t, struct queue_limits *lim);
void dm_zone_endio(struct dm_io *io, struct bio *clone);
#ifdef CONFIG_BLK_DEV_ZONED
int dm_blk_report_zones(struct gendisk *disk, sector_t sector,
--
2.48.1
^ permalink raw reply related [flat|nested] 15+ messages in thread* Re: [PATCH v3 3/6] dm: handle failures in dm_table_set_restrictions
2025-04-08 23:51 ` [PATCH v3 3/6] dm: handle failures in dm_table_set_restrictions Benjamin Marzinski
@ 2025-04-09 5:57 ` Damien Le Moal
0 siblings, 0 replies; 15+ messages in thread
From: Damien Le Moal @ 2025-04-09 5:57 UTC (permalink / raw)
To: Benjamin Marzinski, Mikulas Patocka, Mike Snitzer
Cc: dm-devel, Christoph Hellwig
On 4/9/25 8:51 AM, Benjamin Marzinski wrote:
> If dm_table_set_restrictions() fails while swapping tables,
> device-mapper will continue using the previous table. It must be sure to
> leave the mapped_device in it's previous state on failure. Otherwise
> device-mapper could end up using the old table with settings from the
> unused table.
>
> Do not update the mapped device in dm_set_zones_restrictions(). Wait
> till after dm_table_set_restrictions() is sure to succeed to update the
> md zoned settings. Do the same with the dax settings, and if
> dm_revalidate_zones() fails, restore the original queue limits.
>
> Fixes: 7f91ccd8a608d ("dm: Call dm_revalidate_zones() after setting the queue limits")
> Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Looks all good to me.
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH v3 4/6] dm: fix dm_blk_report_zones
2025-04-08 23:51 [PATCH v3 0/6] [PATCH v3 0/6] dm: fix issues with swapping dm tables Benjamin Marzinski
` (2 preceding siblings ...)
2025-04-08 23:51 ` [PATCH v3 3/6] dm: handle failures in dm_table_set_restrictions Benjamin Marzinski
@ 2025-04-08 23:51 ` Benjamin Marzinski
2025-04-09 6:02 ` Damien Le Moal
2025-04-08 23:51 ` [PATCH v3 5/6] dm: limit swapping tables for devices with zone write plugs Benjamin Marzinski
` (2 subsequent siblings)
6 siblings, 1 reply; 15+ messages in thread
From: Benjamin Marzinski @ 2025-04-08 23:51 UTC (permalink / raw)
To: Mikulas Patocka, Mike Snitzer; +Cc: dm-devel, Damien Le Moal, Christoph Hellwig
If dm_get_live_table() returned NULL, dm_put_live_table() was never
called. Also, it is possible that md->zone_revalidate_map will change
while calling this function. Only read it once, so that we are always
using the same value. Otherwise we might miss a call to
dm_put_live_table().
Finally, while md->zone_revalidate_map is set and a process is calling
blk_revalidate_disk_zones() to set up the zone append emulation
resources, it is possible that another process, perhaps triggered by
blkdev_report_zones_ioctl(), will call dm_blk_report_zones(). If
blk_revalidate_disk_zones() fails, these resources can be freed while
the other process is still using them, causing a use-after-free error.
blk_revalidate_disk_zones() will only ever be called when initially
setting up the zone append emulation resources, such as when setting up
a zoned dm-crypt table for the first time. Further table swaps will not
set md->zone_revalidate_map or call blk_revalidate_disk_zones().
However it must be called using the new table (referenced by
md->zone_revalidate_map) and the new queue limits while the DM device is
suspended. dm_blk_report_zones() needs some way to distinguish between a
call from blk_revalidate_disk_zones(), which must be allowed to use
md->zone_revalidate_map to access this not yet activated table, and all
other calls to dm_blk_report_zones(), which should not be allowed while
the device is suspended and cannot use md->zone_revalidate_map, since
the zone resources might be freed by the process currently calling
blk_revalidate_disk_zones().
Solve this by tracking the process that set md->zone_revalidate_map
dm_revalidate_zones() and only allowing that process to make use of it
in dm_blk_report_zones().
Fixes: f211268ed1f9b ("dm: Use the block layer zone append emulation")
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
---
drivers/md/dm-core.h | 1 +
drivers/md/dm-zone.c | 23 +++++++++++++++--------
2 files changed, 16 insertions(+), 8 deletions(-)
diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index 3637761f3585..f3a3f2ef6322 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -141,6 +141,7 @@ struct mapped_device {
#ifdef CONFIG_BLK_DEV_ZONED
unsigned int nr_zones;
void *zone_revalidate_map;
+ struct task_struct *revalidate_map_task;
#endif
#ifdef CONFIG_IMA
diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
index 681058feb63b..11e19281bb64 100644
--- a/drivers/md/dm-zone.c
+++ b/drivers/md/dm-zone.c
@@ -56,24 +56,29 @@ int dm_blk_report_zones(struct gendisk *disk, sector_t sector,
{
struct mapped_device *md = disk->private_data;
struct dm_table *map;
- int srcu_idx, ret;
+ struct dm_table *zone_revalidate_map = md->zone_revalidate_map;
+ int srcu_idx, ret = -EIO;
- if (!md->zone_revalidate_map) {
- /* Regular user context */
+ if (!zone_revalidate_map || md->revalidate_map_task != current) {
+ /*
+ * Regular user context or
+ * Zone revalidation during __bind() is in progress, but this
+ * call is from a different process
+ */
if (dm_suspended_md(md))
return -EAGAIN;
map = dm_get_live_table(md, &srcu_idx);
- if (!map)
- return -EIO;
} else {
/* Zone revalidation during __bind() */
- map = md->zone_revalidate_map;
+ map = zone_revalidate_map;
}
- ret = dm_blk_do_report_zones(md, map, sector, nr_zones, cb, data);
+ if (map)
+ ret = dm_blk_do_report_zones(md, map, sector, nr_zones, cb,
+ data);
- if (!md->zone_revalidate_map)
+ if (!zone_revalidate_map)
dm_put_live_table(md, srcu_idx);
return ret;
@@ -175,7 +180,9 @@ int dm_revalidate_zones(struct dm_table *t, struct request_queue *q)
* our table for dm_blk_report_zones() to use directly.
*/
md->zone_revalidate_map = t;
+ md->revalidate_map_task = current;
ret = blk_revalidate_disk_zones(disk);
+ md->revalidate_map_task = NULL;
md->zone_revalidate_map = NULL;
if (ret) {
--
2.48.1
^ permalink raw reply related [flat|nested] 15+ messages in thread* Re: [PATCH v3 4/6] dm: fix dm_blk_report_zones
2025-04-08 23:51 ` [PATCH v3 4/6] dm: fix dm_blk_report_zones Benjamin Marzinski
@ 2025-04-09 6:02 ` Damien Le Moal
2025-04-09 14:55 ` Benjamin Marzinski
0 siblings, 1 reply; 15+ messages in thread
From: Damien Le Moal @ 2025-04-09 6:02 UTC (permalink / raw)
To: Benjamin Marzinski, Mikulas Patocka, Mike Snitzer
Cc: dm-devel, Christoph Hellwig
On 4/9/25 8:51 AM, Benjamin Marzinski wrote:
> If dm_get_live_table() returned NULL, dm_put_live_table() was never
> called. Also, it is possible that md->zone_revalidate_map will change
> while calling this function. Only read it once, so that we are always
> using the same value. Otherwise we might miss a call to
> dm_put_live_table().
>
> Finally, while md->zone_revalidate_map is set and a process is calling
> blk_revalidate_disk_zones() to set up the zone append emulation
> resources, it is possible that another process, perhaps triggered by
> blkdev_report_zones_ioctl(), will call dm_blk_report_zones(). If
> blk_revalidate_disk_zones() fails, these resources can be freed while
> the other process is still using them, causing a use-after-free error.
>
> blk_revalidate_disk_zones() will only ever be called when initially
> setting up the zone append emulation resources, such as when setting up
> a zoned dm-crypt table for the first time. Further table swaps will not
> set md->zone_revalidate_map or call blk_revalidate_disk_zones().
> However it must be called using the new table (referenced by
> md->zone_revalidate_map) and the new queue limits while the DM device is
> suspended. dm_blk_report_zones() needs some way to distinguish between a
> call from blk_revalidate_disk_zones(), which must be allowed to use
> md->zone_revalidate_map to access this not yet activated table, and all
> other calls to dm_blk_report_zones(), which should not be allowed while
> the device is suspended and cannot use md->zone_revalidate_map, since
> the zone resources might be freed by the process currently calling
> blk_revalidate_disk_zones().
>
> Solve this by tracking the process that set md->zone_revalidate_map
> dm_revalidate_zones() and only allowing that process to make use of it
> in dm_blk_report_zones().
..that sets md->zone_revalidate_map in dm_revalidate_zones() and... ?
>
> Fixes: f211268ed1f9b ("dm: Use the block layer zone append emulation")
> Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
I do not recall reviewing this version, but looks OK to me with the sentence
above fixed.
> Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 15+ messages in thread* Re: [PATCH v3 4/6] dm: fix dm_blk_report_zones
2025-04-09 6:02 ` Damien Le Moal
@ 2025-04-09 14:55 ` Benjamin Marzinski
0 siblings, 0 replies; 15+ messages in thread
From: Benjamin Marzinski @ 2025-04-09 14:55 UTC (permalink / raw)
To: Damien Le Moal; +Cc: Mikulas Patocka, Mike Snitzer, dm-devel, Christoph Hellwig
On Wed, Apr 09, 2025 at 03:02:58PM +0900, Damien Le Moal wrote:
> On 4/9/25 8:51 AM, Benjamin Marzinski wrote:
> > If dm_get_live_table() returned NULL, dm_put_live_table() was never
> > called. Also, it is possible that md->zone_revalidate_map will change
> > while calling this function. Only read it once, so that we are always
> > using the same value. Otherwise we might miss a call to
> > dm_put_live_table().
> >
> > Finally, while md->zone_revalidate_map is set and a process is calling
> > blk_revalidate_disk_zones() to set up the zone append emulation
> > resources, it is possible that another process, perhaps triggered by
> > blkdev_report_zones_ioctl(), will call dm_blk_report_zones(). If
> > blk_revalidate_disk_zones() fails, these resources can be freed while
> > the other process is still using them, causing a use-after-free error.
> >
> > blk_revalidate_disk_zones() will only ever be called when initially
> > setting up the zone append emulation resources, such as when setting up
> > a zoned dm-crypt table for the first time. Further table swaps will not
> > set md->zone_revalidate_map or call blk_revalidate_disk_zones().
> > However it must be called using the new table (referenced by
> > md->zone_revalidate_map) and the new queue limits while the DM device is
> > suspended. dm_blk_report_zones() needs some way to distinguish between a
> > call from blk_revalidate_disk_zones(), which must be allowed to use
> > md->zone_revalidate_map to access this not yet activated table, and all
> > other calls to dm_blk_report_zones(), which should not be allowed while
> > the device is suspended and cannot use md->zone_revalidate_map, since
> > the zone resources might be freed by the process currently calling
> > blk_revalidate_disk_zones().
> >
> > Solve this by tracking the process that set md->zone_revalidate_map
> > dm_revalidate_zones() and only allowing that process to make use of it
> > in dm_blk_report_zones().
>
> ..that sets md->zone_revalidate_map in dm_revalidate_zones() and... ?
Yep. I'll fix that.
> >
> > Fixes: f211268ed1f9b ("dm: Use the block layer zone append emulation")
> > Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
>
> I do not recall reviewing this version, but looks OK to me with the sentence
> above fixed.
The code is the same as the original version that you reviewed, but I
forgot to remove your Reviewed-by when I rewrote the commit message.
Sorry.
-Ben
>
> > Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
>
>
> --
> Damien Le Moal
> Western Digital Research
^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH v3 5/6] dm: limit swapping tables for devices with zone write plugs
2025-04-08 23:51 [PATCH v3 0/6] [PATCH v3 0/6] dm: fix issues with swapping dm tables Benjamin Marzinski
` (3 preceding siblings ...)
2025-04-08 23:51 ` [PATCH v3 4/6] dm: fix dm_blk_report_zones Benjamin Marzinski
@ 2025-04-08 23:51 ` Benjamin Marzinski
2025-04-09 6:20 ` Damien Le Moal
2025-04-08 23:51 ` [PATCH v3 6/6] dm: fix native zone append devices on top of emulated ones Benjamin Marzinski
2025-04-09 6:24 ` [PATCH v3 0/6] [PATCH v3 0/6] dm: fix issues with swapping dm tables Damien Le Moal
6 siblings, 1 reply; 15+ messages in thread
From: Benjamin Marzinski @ 2025-04-08 23:51 UTC (permalink / raw)
To: Mikulas Patocka, Mike Snitzer; +Cc: dm-devel, Damien Le Moal, Christoph Hellwig
dm_revalidate_zones() only allowed new or previously unzoned devices to
call blk_revalidate_disk_zones(). If the device was already zoned,
disk->nr_zones would always equal md->nr_zones, so dm_revalidate_zones()
returned without doing any work. This would make the zoned settings for
the device not match the new table. If the device had zone write plug
resources, it could run into errors like bdev_zone_is_seq() reading
invalid memory because disk->conv_zones_bitmap was the wrong size.
If the device doesn't have any zone write plug resources, calling
blk_revalidate_disk_zones() will always correctly update device. If
blk_revalidate_disk_zones() fails, it can still overwrite or clear the
current disk->nr_zones value. In this case, DM must restore the previous
value of disk->nr_zones, so that the zoned settings will continue to
match the previous that it failed back to.
If the device already has zone write plug resources,
blk_revalidate_disk_zones() will not correctly update them, if it is
called for arbitrary zoned device changes. Since there is not much need
for this ability, the easiest solution is to disallow any table reloads
that change the zoned settings, for devices that already have zone plug
resources. Specifically, if a device already has zone plug resources
allocated, it can only switch to another zoned table that also emulates
zone append. Also, it cannot change the device size or the zone size. A
device can switch to an error target.
Fixes: bb37d77239af2 ("dm: introduce zone append emulation")
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
---
drivers/md/dm-table.c | 41 ++++++++++++++++++++++++++++++++++++-----
drivers/md/dm-zone.c | 35 ++++++++++++++++++++++++++---------
drivers/md/dm.c | 6 ++++++
drivers/md/dm.h | 5 +++++
4 files changed, 73 insertions(+), 14 deletions(-)
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index b8191ce691f1..04680c0950b7 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1496,6 +1496,18 @@ bool dm_table_has_no_data_devices(struct dm_table *t)
return true;
}
+bool dm_table_is_wildcard(struct dm_table *t)
+{
+ for (unsigned int i = 0; i < t->num_targets; i++) {
+ struct dm_target *ti = dm_table_get_target(t, i);
+
+ if (!dm_target_is_wildcard(ti->type))
+ return false;
+ }
+
+ return true;
+}
+
static int device_not_zoned(struct dm_target *ti, struct dm_dev *dev,
sector_t start, sector_t len, void *data)
{
@@ -1836,6 +1848,19 @@ static bool dm_table_supports_atomic_writes(struct dm_table *t)
return true;
}
+bool dm_table_supports_size_change(struct dm_table *t, sector_t old_size,
+ sector_t new_size)
+{
+ if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) && dm_has_zone_plugs(t->md) &&
+ old_size != new_size) {
+ DMWARN("%s: device has zone write plug resources. "
+ "Cannot change size",
+ dm_device_name(t->md));
+ return false;
+ }
+ return true;
+}
+
int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
struct queue_limits *limits)
{
@@ -1873,11 +1898,17 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
limits->features &= ~BLK_FEAT_DAX;
/* For a zoned table, setup the zone related queue attributes. */
- if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) &&
- (limits->features & BLK_FEAT_ZONED)) {
- r = dm_set_zones_restrictions(t, q, limits);
- if (r)
- return r;
+ if (IS_ENABLED(CONFIG_BLK_DEV_ZONED)) {
+ if (limits->features & BLK_FEAT_ZONED) {
+ r = dm_set_zones_restrictions(t, q, limits);
+ if (r)
+ return r;
+ } else if (dm_has_zone_plugs(t->md)) {
+ DMWARN("%s: device has zone write plug resources. "
+ "Cannot switch to non-zoned table.",
+ dm_device_name(t->md));
+ return -EINVAL;
+ }
}
if (dm_table_supports_atomic_writes(t))
diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
index 11e19281bb64..1d419734fefc 100644
--- a/drivers/md/dm-zone.c
+++ b/drivers/md/dm-zone.c
@@ -158,22 +158,22 @@ int dm_revalidate_zones(struct dm_table *t, struct request_queue *q)
{
struct mapped_device *md = t->md;
struct gendisk *disk = md->disk;
+ unsigned int nr_zones = disk->nr_zones;
int ret;
if (!get_capacity(disk))
return 0;
- /* Revalidate only if something changed. */
- if (!disk->nr_zones || disk->nr_zones != md->nr_zones) {
- DMINFO("%s using %s zone append",
- disk->disk_name,
- queue_emulates_zone_append(q) ? "emulated" : "native");
- md->nr_zones = 0;
- }
-
- if (md->nr_zones)
+ /*
+ * Do not revalidate if zone append emulation resources have already
+ * been allocated.
+ */
+ if (dm_has_zone_plugs(md))
return 0;
+ DMINFO("%s using %s zone append", disk->disk_name,
+ queue_emulates_zone_append(q) ? "emulated" : "native");
+
/*
* Our table is not live yet. So the call to dm_get_live_table()
* in dm_blk_report_zones() will fail. Set a temporary pointer to
@@ -187,6 +187,7 @@ int dm_revalidate_zones(struct dm_table *t, struct request_queue *q)
if (ret) {
DMERR("Revalidate zones failed %d", ret);
+ disk->nr_zones = nr_zones;
return ret;
}
@@ -383,12 +384,28 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
lim->max_open_zones = 0;
lim->max_active_zones = 0;
lim->max_hw_zone_append_sectors = 0;
+ lim->max_zone_append_sectors = 0;
lim->zone_write_granularity = 0;
lim->chunk_sectors = 0;
lim->features &= ~BLK_FEAT_ZONED;
return 0;
}
+ if (get_capacity(disk) && dm_has_zone_plugs(t->md)) {
+ if (q->limits.chunk_sectors != lim->chunk_sectors) {
+ DMWARN("%s: device has zone write plug resources. "
+ "Cannot change zone size",
+ disk->disk_name);
+ return -EINVAL;
+ }
+ if (lim->max_hw_zone_append_sectors != 0 &&
+ !dm_table_is_wildcard(t)) {
+ DMWARN("%s: device has zone write plug resources. "
+ "New table must emulate zone append",
+ disk->disk_name);
+ return -EINVAL;
+ }
+ }
/*
* Warn once (when the capacity is not yet set) if the mapped device is
* partially using zone resources of the target devices as that leads to
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 292414da871d..240f6dab8dda 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2429,6 +2429,12 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,
size = dm_table_get_size(t);
old_size = dm_get_size(md);
+
+ if (!dm_table_supports_size_change(t, old_size, size)) {
+ old_map = ERR_PTR(-EINVAL);
+ goto out;
+ }
+
set_capacity(md->disk, size);
ret = dm_table_set_restrictions(t, md->queue, limits);
diff --git a/drivers/md/dm.h b/drivers/md/dm.h
index e5d3a9f46a91..245f52b59215 100644
--- a/drivers/md/dm.h
+++ b/drivers/md/dm.h
@@ -58,6 +58,7 @@ void dm_table_event_callback(struct dm_table *t,
void (*fn)(void *), void *context);
struct dm_target *dm_table_find_target(struct dm_table *t, sector_t sector);
bool dm_table_has_no_data_devices(struct dm_table *table);
+bool dm_table_is_wildcard(struct dm_table *t);
int dm_calculate_queue_limits(struct dm_table *table,
struct queue_limits *limits);
int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
@@ -72,6 +73,8 @@ struct target_type *dm_table_get_immutable_target_type(struct dm_table *t);
struct dm_target *dm_table_get_immutable_target(struct dm_table *t);
struct dm_target *dm_table_get_wildcard_target(struct dm_table *t);
bool dm_table_request_based(struct dm_table *t);
+bool dm_table_supports_size_change(struct dm_table *t, sector_t old_size,
+ sector_t new_size);
void dm_lock_md_type(struct mapped_device *md);
void dm_unlock_md_type(struct mapped_device *md);
@@ -111,12 +114,14 @@ bool dm_is_zone_write(struct mapped_device *md, struct bio *bio);
int dm_zone_get_reset_bitmap(struct mapped_device *md, struct dm_table *t,
sector_t sector, unsigned int nr_zones,
unsigned long *need_reset);
+#define dm_has_zone_plugs(md) ((md)->disk->zone_wplugs_hash != NULL)
#else
#define dm_blk_report_zones NULL
static inline bool dm_is_zone_write(struct mapped_device *md, struct bio *bio)
{
return false;
}
+#define dm_has_zone_plugs(md) false
#endif
/*
--
2.48.1
^ permalink raw reply related [flat|nested] 15+ messages in thread* Re: [PATCH v3 5/6] dm: limit swapping tables for devices with zone write plugs
2025-04-08 23:51 ` [PATCH v3 5/6] dm: limit swapping tables for devices with zone write plugs Benjamin Marzinski
@ 2025-04-09 6:20 ` Damien Le Moal
2025-04-09 20:15 ` Benjamin Marzinski
0 siblings, 1 reply; 15+ messages in thread
From: Damien Le Moal @ 2025-04-09 6:20 UTC (permalink / raw)
To: Benjamin Marzinski, Mikulas Patocka, Mike Snitzer
Cc: dm-devel, Christoph Hellwig
On 4/9/25 8:51 AM, Benjamin Marzinski wrote:
> dm_revalidate_zones() only allowed new or previously unzoned devices to
> call blk_revalidate_disk_zones(). If the device was already zoned,
> disk->nr_zones would always equal md->nr_zones, so dm_revalidate_zones()
> returned without doing any work. This would make the zoned settings for
> the device not match the new table. If the device had zone write plug
> resources, it could run into errors like bdev_zone_is_seq() reading
> invalid memory because disk->conv_zones_bitmap was the wrong size.
>
> If the device doesn't have any zone write plug resources, calling
> blk_revalidate_disk_zones() will always correctly update device. If
> blk_revalidate_disk_zones() fails, it can still overwrite or clear the
> current disk->nr_zones value. In this case, DM must restore the previous
> value of disk->nr_zones, so that the zoned settings will continue to
> match the previous that it failed back to.
s/previous/previous value ?
s/failed/fell ?
>
> If the device already has zone write plug resources,
> blk_revalidate_disk_zones() will not correctly update them, if it is
> called for arbitrary zoned device changes. Since there is not much need
> for this ability, the easiest solution is to disallow any table reloads
> that change the zoned settings, for devices that already have zone plug
> resources. Specifically, if a device already has zone plug resources
> allocated, it can only switch to another zoned table that also emulates
> zone append. Also, it cannot change the device size or the zone size. A
> device can switch to an error target.
>
> Fixes: bb37d77239af2 ("dm: introduce zone append emulation")
> Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
[...]
> @@ -1873,11 +1898,17 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
> limits->features &= ~BLK_FEAT_DAX;
>
> /* For a zoned table, setup the zone related queue attributes. */
> - if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) &&
> - (limits->features & BLK_FEAT_ZONED)) {
> - r = dm_set_zones_restrictions(t, q, limits);
> - if (r)
> - return r;
> + if (IS_ENABLED(CONFIG_BLK_DEV_ZONED)) {
> + if (limits->features & BLK_FEAT_ZONED) {
> + r = dm_set_zones_restrictions(t, q, limits);
> + if (r)
> + return r;
> + } else if (dm_has_zone_plugs(t->md)) {
> + DMWARN("%s: device has zone write plug resources. "
> + "Cannot switch to non-zoned table.",
> + dm_device_name(t->md));
> + return -EINVAL;
> + }
I am confused with this one. We can only have zone write plugs if the device is
zoned. So it seems that the check for "if (dm_has_zone_plugs(t->md)" should
really be inside dm_set_zones_restrictions(). Or is this trying to detect a
table change that would switch the device from zoned to non-zoned ? If that is
the case, regardless of the zone write plug initialization state, we should
never allow such change.
> }
>
> if (dm_table_supports_atomic_writes(t))
> diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
> index 11e19281bb64..1d419734fefc 100644
> --- a/drivers/md/dm-zone.c
> +++ b/drivers/md/dm-zone.c
> @@ -158,22 +158,22 @@ int dm_revalidate_zones(struct dm_table *t, struct request_queue *q)
> {
> struct mapped_device *md = t->md;
> struct gendisk *disk = md->disk;
> + unsigned int nr_zones = disk->nr_zones;
> int ret;
>
> if (!get_capacity(disk))
> return 0;
>
> - /* Revalidate only if something changed. */
> - if (!disk->nr_zones || disk->nr_zones != md->nr_zones) {
> - DMINFO("%s using %s zone append",
> - disk->disk_name,
> - queue_emulates_zone_append(q) ? "emulated" : "native");
> - md->nr_zones = 0;
> - }
> -
> - if (md->nr_zones)
> + /*
> + * Do not revalidate if zone append emulation resources have already
> + * been allocated.
Maybe better to rephrased this (to be precise) to something like:
/*
* Do not revalidate if zone write plug resources have already
* been allocated.
*/
> + */
> + if (dm_has_zone_plugs(md))
> return 0;
>
> + DMINFO("%s using %s zone append", disk->disk_name,
> + queue_emulates_zone_append(q) ? "emulated" : "native");
> +
> /*
> * Our table is not live yet. So the call to dm_get_live_table()
> * in dm_blk_report_zones() will fail. Set a temporary pointer to
> @@ -187,6 +187,7 @@ int dm_revalidate_zones(struct dm_table *t, struct request_queue *q)
>
> if (ret) {
> DMERR("Revalidate zones failed %d", ret);
> + disk->nr_zones = nr_zones;
> return ret;
> }
>
> @@ -383,12 +384,28 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
> lim->max_open_zones = 0;
> lim->max_active_zones = 0;
> lim->max_hw_zone_append_sectors = 0;
> + lim->max_zone_append_sectors = 0;
> lim->zone_write_granularity = 0;
> lim->chunk_sectors = 0;
> lim->features &= ~BLK_FEAT_ZONED;
> return 0;
> }
>
> + if (get_capacity(disk) && dm_has_zone_plugs(t->md)) {
> + if (q->limits.chunk_sectors != lim->chunk_sectors) {
> + DMWARN("%s: device has zone write plug resources. "
> + "Cannot change zone size",
> + disk->disk_name);
> + return -EINVAL;
> + }
> + if (lim->max_hw_zone_append_sectors != 0 &&
> + !dm_table_is_wildcard(t)) {
> + DMWARN("%s: device has zone write plug resources. "
> + "New table must emulate zone append",
> + disk->disk_name);
> + return -EINVAL;
> + }
> + }
I have some concerns about this: what if the new table has the same capacity
and the same zone size but the types of zones changed ? We are not checking
this here and we cannot actually check that without doing a report zones. So I
really think we should just use the big hammer here and simply only allow the
wildcard target and no other change.
> /*
> * Warn once (when the capacity is not yet set) if the mapped device is
> * partially using zone resources of the target devices as that leads to
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 292414da871d..240f6dab8dda 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -2429,6 +2429,12 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,
> size = dm_table_get_size(t);
>
> old_size = dm_get_size(md);
> +
> + if (!dm_table_supports_size_change(t, old_size, size)) {
> + old_map = ERR_PTR(-EINVAL);
> + goto out;
> + }
And I guess we could probably move the "wildcard-only is allowed" change check
here as that would further simplify the revalidation code. No ?
> +
> set_capacity(md->disk, size);
>
> ret = dm_table_set_restrictions(t, md->queue, limits);
> diff --git a/drivers/md/dm.h b/drivers/md/dm.h
> index e5d3a9f46a91..245f52b59215 100644
> --- a/drivers/md/dm.h
> +++ b/drivers/md/dm.h
> @@ -58,6 +58,7 @@ void dm_table_event_callback(struct dm_table *t,
> void (*fn)(void *), void *context);
> struct dm_target *dm_table_find_target(struct dm_table *t, sector_t sector);
> bool dm_table_has_no_data_devices(struct dm_table *table);
> +bool dm_table_is_wildcard(struct dm_table *t);
> int dm_calculate_queue_limits(struct dm_table *table,
> struct queue_limits *limits);
> int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
> @@ -72,6 +73,8 @@ struct target_type *dm_table_get_immutable_target_type(struct dm_table *t);
> struct dm_target *dm_table_get_immutable_target(struct dm_table *t);
> struct dm_target *dm_table_get_wildcard_target(struct dm_table *t);
> bool dm_table_request_based(struct dm_table *t);
> +bool dm_table_supports_size_change(struct dm_table *t, sector_t old_size,
> + sector_t new_size);
>
> void dm_lock_md_type(struct mapped_device *md);
> void dm_unlock_md_type(struct mapped_device *md);
> @@ -111,12 +114,14 @@ bool dm_is_zone_write(struct mapped_device *md, struct bio *bio);
> int dm_zone_get_reset_bitmap(struct mapped_device *md, struct dm_table *t,
> sector_t sector, unsigned int nr_zones,
> unsigned long *need_reset);
> +#define dm_has_zone_plugs(md) ((md)->disk->zone_wplugs_hash != NULL)
> #else
> #define dm_blk_report_zones NULL
> static inline bool dm_is_zone_write(struct mapped_device *md, struct bio *bio)
> {
> return false;
> }
> +#define dm_has_zone_plugs(md) false
> #endif
>
> /*
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 15+ messages in thread* Re: [PATCH v3 5/6] dm: limit swapping tables for devices with zone write plugs
2025-04-09 6:20 ` Damien Le Moal
@ 2025-04-09 20:15 ` Benjamin Marzinski
2025-04-10 0:56 ` Damien Le Moal
0 siblings, 1 reply; 15+ messages in thread
From: Benjamin Marzinski @ 2025-04-09 20:15 UTC (permalink / raw)
To: Damien Le Moal; +Cc: Mikulas Patocka, Mike Snitzer, dm-devel, Christoph Hellwig
On Wed, Apr 09, 2025 at 03:20:31PM +0900, Damien Le Moal wrote:
> On 4/9/25 8:51 AM, Benjamin Marzinski wrote:
> > dm_revalidate_zones() only allowed new or previously unzoned devices to
> > call blk_revalidate_disk_zones(). If the device was already zoned,
> > disk->nr_zones would always equal md->nr_zones, so dm_revalidate_zones()
> > returned without doing any work. This would make the zoned settings for
> > the device not match the new table. If the device had zone write plug
> > resources, it could run into errors like bdev_zone_is_seq() reading
> > invalid memory because disk->conv_zones_bitmap was the wrong size.
> >
> > If the device doesn't have any zone write plug resources, calling
> > blk_revalidate_disk_zones() will always correctly update device. If
> > blk_revalidate_disk_zones() fails, it can still overwrite or clear the
> > current disk->nr_zones value. In this case, DM must restore the previous
> > value of disk->nr_zones, so that the zoned settings will continue to
> > match the previous that it failed back to.
>
> s/previous/previous value ?
> s/failed/fell ?
Sure.
> >
> > If the device already has zone write plug resources,
> > blk_revalidate_disk_zones() will not correctly update them, if it is
> > called for arbitrary zoned device changes. Since there is not much need
> > for this ability, the easiest solution is to disallow any table reloads
> > that change the zoned settings, for devices that already have zone plug
> > resources. Specifically, if a device already has zone plug resources
> > allocated, it can only switch to another zoned table that also emulates
> > zone append. Also, it cannot change the device size or the zone size. A
> > device can switch to an error target.
> >
> > Fixes: bb37d77239af2 ("dm: introduce zone append emulation")
> > Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
>
> [...]
>
> > @@ -1873,11 +1898,17 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
> > limits->features &= ~BLK_FEAT_DAX;
> >
> > /* For a zoned table, setup the zone related queue attributes. */
> > - if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) &&
> > - (limits->features & BLK_FEAT_ZONED)) {
> > - r = dm_set_zones_restrictions(t, q, limits);
> > - if (r)
> > - return r;
> > + if (IS_ENABLED(CONFIG_BLK_DEV_ZONED)) {
> > + if (limits->features & BLK_FEAT_ZONED) {
> > + r = dm_set_zones_restrictions(t, q, limits);
> > + if (r)
> > + return r;
> > + } else if (dm_has_zone_plugs(t->md)) {
> > + DMWARN("%s: device has zone write plug resources. "
> > + "Cannot switch to non-zoned table.",
> > + dm_device_name(t->md));
> > + return -EINVAL;
> > + }
>
> I am confused with this one. We can only have zone write plugs if the device is
> zoned. So it seems that the check for "if (dm_has_zone_plugs(t->md)" should
> really be inside dm_set_zones_restrictions(). Or is this trying to detect a
> table change that would switch the device from zoned to non-zoned ? If that is
> the case, regardless of the zone write plug initialization state, we should
> never allow such change.
I don't think that, for instance, allowing a linear device stacked on
top of a zoned device to get remapped to stack on top of a non-zoned
device runs much more risk than allowing a linear device to get remapped
in any other case? This is currently allowed, and I don't believe anyone
has complained about it. I would rather assume that the user knows what
they are doing, instead of disallowing things that aren't obviously
wrong, and won't cause system problems if they are (aside from the
problems that naturally result from putting the wrong device in your
table line). But if Mikulas and Mike think it would be better to
disallow this, then I'm fine with that.
> > }
> >
> > if (dm_table_supports_atomic_writes(t))
> > diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
> > index 11e19281bb64..1d419734fefc 100644
> > --- a/drivers/md/dm-zone.c
> > +++ b/drivers/md/dm-zone.c
> > @@ -158,22 +158,22 @@ int dm_revalidate_zones(struct dm_table *t, struct request_queue *q)
> > {
> > struct mapped_device *md = t->md;
> > struct gendisk *disk = md->disk;
> > + unsigned int nr_zones = disk->nr_zones;
> > int ret;
> >
> > if (!get_capacity(disk))
> > return 0;
> >
> > - /* Revalidate only if something changed. */
> > - if (!disk->nr_zones || disk->nr_zones != md->nr_zones) {
> > - DMINFO("%s using %s zone append",
> > - disk->disk_name,
> > - queue_emulates_zone_append(q) ? "emulated" : "native");
> > - md->nr_zones = 0;
> > - }
> > -
> > - if (md->nr_zones)
> > + /*
> > + * Do not revalidate if zone append emulation resources have already
> > + * been allocated.
>
> Maybe better to rephrased this (to be precise) to something like:
>
> /*
> * Do not revalidate if zone write plug resources have already
> * been allocated.
> */
>
> > + */
> > + if (dm_has_zone_plugs(md))
> > return 0;
> >
> > + DMINFO("%s using %s zone append", disk->disk_name,
> > + queue_emulates_zone_append(q) ? "emulated" : "native");
> > +
> > /*
> > * Our table is not live yet. So the call to dm_get_live_table()
> > * in dm_blk_report_zones() will fail. Set a temporary pointer to
> > @@ -187,6 +187,7 @@ int dm_revalidate_zones(struct dm_table *t, struct request_queue *q)
> >
> > if (ret) {
> > DMERR("Revalidate zones failed %d", ret);
> > + disk->nr_zones = nr_zones;
> > return ret;
> > }
> >
> > @@ -383,12 +384,28 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
> > lim->max_open_zones = 0;
> > lim->max_active_zones = 0;
> > lim->max_hw_zone_append_sectors = 0;
> > + lim->max_zone_append_sectors = 0;
> > lim->zone_write_granularity = 0;
> > lim->chunk_sectors = 0;
> > lim->features &= ~BLK_FEAT_ZONED;
> > return 0;
> > }
> >
> > + if (get_capacity(disk) && dm_has_zone_plugs(t->md)) {
> > + if (q->limits.chunk_sectors != lim->chunk_sectors) {
> > + DMWARN("%s: device has zone write plug resources. "
> > + "Cannot change zone size",
> > + disk->disk_name);
> > + return -EINVAL;
> > + }
> > + if (lim->max_hw_zone_append_sectors != 0 &&
> > + !dm_table_is_wildcard(t)) {
> > + DMWARN("%s: device has zone write plug resources. "
> > + "New table must emulate zone append",
> > + disk->disk_name);
> > + return -EINVAL;
> > + }
> > + }
>
> I have some concerns about this: what if the new table has the same capacity
> and the same zone size but the types of zones changed ? We are not checking
> this here and we cannot actually check that without doing a report zones. So I
> really think we should just use the big hammer here and simply only allow the
> wildcard target and no other change.
I don't see how this could lead to accessing invalid memory, which was
my big concern, with these patches. The worst that could happen in an IO
error, AFAICS. Disallowing all table loads except for the error target
would keep people from being able from things like changing options on
the dm-crypt target. Again, that is just my preference, and I'll defer
to Mike and Mikulas on how this should be handled.
> > /*
> > * Warn once (when the capacity is not yet set) if the mapped device is
> > * partially using zone resources of the target devices as that leads to
> > diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> > index 292414da871d..240f6dab8dda 100644
> > --- a/drivers/md/dm.c
> > +++ b/drivers/md/dm.c
> > @@ -2429,6 +2429,12 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,
> > size = dm_table_get_size(t);
> >
> > old_size = dm_get_size(md);
> > +
> > + if (!dm_table_supports_size_change(t, old_size, size)) {
> > + old_map = ERR_PTR(-EINVAL);
> > + goto out;
> > + }
>
> And I guess we could probably move the "wildcard-only is allowed" change check
> here as that would further simplify the revalidation code. No ?
If we are disallowing any zoned device to reload its table to something
other than an error target, then yes. It can go here. If we only care
about zoned devices that emulate zoned append, when we need to wait till
dm_set_zones_restrictions() where we determine that. Since we already
need to handle failures in dm_table_set_restrictions(), moving it
doesn't simplify the code much.
-Ben
> > +
> > set_capacity(md->disk, size);
> >
> > ret = dm_table_set_restrictions(t, md->queue, limits);
> > diff --git a/drivers/md/dm.h b/drivers/md/dm.h
> > index e5d3a9f46a91..245f52b59215 100644
> > --- a/drivers/md/dm.h
> > +++ b/drivers/md/dm.h
> > @@ -58,6 +58,7 @@ void dm_table_event_callback(struct dm_table *t,
> > void (*fn)(void *), void *context);
> > struct dm_target *dm_table_find_target(struct dm_table *t, sector_t sector);
> > bool dm_table_has_no_data_devices(struct dm_table *table);
> > +bool dm_table_is_wildcard(struct dm_table *t);
> > int dm_calculate_queue_limits(struct dm_table *table,
> > struct queue_limits *limits);
> > int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
> > @@ -72,6 +73,8 @@ struct target_type *dm_table_get_immutable_target_type(struct dm_table *t);
> > struct dm_target *dm_table_get_immutable_target(struct dm_table *t);
> > struct dm_target *dm_table_get_wildcard_target(struct dm_table *t);
> > bool dm_table_request_based(struct dm_table *t);
> > +bool dm_table_supports_size_change(struct dm_table *t, sector_t old_size,
> > + sector_t new_size);
> >
> > void dm_lock_md_type(struct mapped_device *md);
> > void dm_unlock_md_type(struct mapped_device *md);
> > @@ -111,12 +114,14 @@ bool dm_is_zone_write(struct mapped_device *md, struct bio *bio);
> > int dm_zone_get_reset_bitmap(struct mapped_device *md, struct dm_table *t,
> > sector_t sector, unsigned int nr_zones,
> > unsigned long *need_reset);
> > +#define dm_has_zone_plugs(md) ((md)->disk->zone_wplugs_hash != NULL)
> > #else
> > #define dm_blk_report_zones NULL
> > static inline bool dm_is_zone_write(struct mapped_device *md, struct bio *bio)
> > {
> > return false;
> > }
> > +#define dm_has_zone_plugs(md) false
> > #endif
> >
> > /*
>
>
> --
> Damien Le Moal
> Western Digital Research
^ permalink raw reply [flat|nested] 15+ messages in thread* Re: [PATCH v3 5/6] dm: limit swapping tables for devices with zone write plugs
2025-04-09 20:15 ` Benjamin Marzinski
@ 2025-04-10 0:56 ` Damien Le Moal
0 siblings, 0 replies; 15+ messages in thread
From: Damien Le Moal @ 2025-04-10 0:56 UTC (permalink / raw)
To: Benjamin Marzinski
Cc: Mikulas Patocka, Mike Snitzer, dm-devel, Christoph Hellwig
On 4/10/25 5:15 AM, Benjamin Marzinski wrote:
>>> @@ -1873,11 +1898,17 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
>>> limits->features &= ~BLK_FEAT_DAX;
>>>
>>> /* For a zoned table, setup the zone related queue attributes. */
>>> - if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) &&
>>> - (limits->features & BLK_FEAT_ZONED)) {
>>> - r = dm_set_zones_restrictions(t, q, limits);
>>> - if (r)
>>> - return r;
>>> + if (IS_ENABLED(CONFIG_BLK_DEV_ZONED)) {
>>> + if (limits->features & BLK_FEAT_ZONED) {
>>> + r = dm_set_zones_restrictions(t, q, limits);
>>> + if (r)
>>> + return r;
>>> + } else if (dm_has_zone_plugs(t->md)) {
>>> + DMWARN("%s: device has zone write plug resources. "
>>> + "Cannot switch to non-zoned table.",
>>> + dm_device_name(t->md));
>>> + return -EINVAL;
>>> + }
>>
>> I am confused with this one. We can only have zone write plugs if the device is
>> zoned. So it seems that the check for "if (dm_has_zone_plugs(t->md)" should
>> really be inside dm_set_zones_restrictions(). Or is this trying to detect a
>> table change that would switch the device from zoned to non-zoned ? If that is
>> the case, regardless of the zone write plug initialization state, we should
>> never allow such change.
And I was wrong on this one: using dm-linear to map only conventional zones of
an SMR HDD, the DM device will *not* be zoned but the underlying device is. So
this check is fine since the dm device will not have wplugs in that case.
> I don't think that, for instance, allowing a linear device stacked on
> top of a zoned device to get remapped to stack on top of a non-zoned
> device runs much more risk than allowing a linear device to get remapped
> in any other case? This is currently allowed, and I don't believe anyone
> has complained about it. I would rather assume that the user knows what
> they are doing, instead of disallowing things that aren't obviously
> wrong, and won't cause system problems if they are (aside from the
> problems that naturally result from putting the wrong device in your
> table line). But if Mikulas and Mike think it would be better to
> disallow this, then I'm fine with that.
OK. Let's leave things as you have now.
>>> + if (get_capacity(disk) && dm_has_zone_plugs(t->md)) {
>>> + if (q->limits.chunk_sectors != lim->chunk_sectors) {
>>> + DMWARN("%s: device has zone write plug resources. "
>>> + "Cannot change zone size",
>>> + disk->disk_name);
>>> + return -EINVAL;
>>> + }
>>> + if (lim->max_hw_zone_append_sectors != 0 &&
>>> + !dm_table_is_wildcard(t)) {
>>> + DMWARN("%s: device has zone write plug resources. "
>>> + "New table must emulate zone append",
>>> + disk->disk_name);
>>> + return -EINVAL;
>>> + }
>>> + }
>>
>> I have some concerns about this: what if the new table has the same capacity
>> and the same zone size but the types of zones changed ? We are not checking
>> this here and we cannot actually check that without doing a report zones. So I
>> really think we should just use the big hammer here and simply only allow the
>> wildcard target and no other change.
>
> I don't see how this could lead to accessing invalid memory, which was
> my big concern, with these patches. The worst that could happen in an IO
> error, AFAICS. Disallowing all table loads except for the error target
> would keep people from being able from things like changing options on
> the dm-crypt target. Again, that is just my preference, and I'll defer
> to Mike and Mikulas on how this should be handled.
Yeah, but these IO errors that can happen would be hard to debug/understand...
But as you said, given that this has never been checked/prevented before and
that no one complained, let's keep this as is. Your example for dm-crypt is
indeed a valid case.
>>> /*
>>> * Warn once (when the capacity is not yet set) if the mapped device is
>>> * partially using zone resources of the target devices as that leads to
>>> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
>>> index 292414da871d..240f6dab8dda 100644
>>> --- a/drivers/md/dm.c
>>> +++ b/drivers/md/dm.c
>>> @@ -2429,6 +2429,12 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,
>>> size = dm_table_get_size(t);
>>>
>>> old_size = dm_get_size(md);
>>> +
>>> + if (!dm_table_supports_size_change(t, old_size, size)) {
>>> + old_map = ERR_PTR(-EINVAL);
>>> + goto out;
>>> + }
>>
>> And I guess we could probably move the "wildcard-only is allowed" change check
>> here as that would further simplify the revalidation code. No ?
>
> If we are disallowing any zoned device to reload its table to something
> other than an error target, then yes. It can go here. If we only care
> about zoned devices that emulate zoned append, when we need to wait till
> dm_set_zones_restrictions() where we determine that. Since we already
> need to handle failures in dm_table_set_restrictions(), moving it
> doesn't simplify the code much.
OK. So with the commit message typos fixed, feel free to add:
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH v3 6/6] dm: fix native zone append devices on top of emulated ones
2025-04-08 23:51 [PATCH v3 0/6] [PATCH v3 0/6] dm: fix issues with swapping dm tables Benjamin Marzinski
` (4 preceding siblings ...)
2025-04-08 23:51 ` [PATCH v3 5/6] dm: limit swapping tables for devices with zone write plugs Benjamin Marzinski
@ 2025-04-08 23:51 ` Benjamin Marzinski
2025-04-09 6:21 ` Damien Le Moal
2025-04-09 6:24 ` [PATCH v3 0/6] [PATCH v3 0/6] dm: fix issues with swapping dm tables Damien Le Moal
6 siblings, 1 reply; 15+ messages in thread
From: Benjamin Marzinski @ 2025-04-08 23:51 UTC (permalink / raw)
To: Mikulas Patocka, Mike Snitzer; +Cc: dm-devel, Damien Le Moal, Christoph Hellwig
If a DM device that can pass down zone append commands is stacked on top
of a device that emulates zone append commands, it will allocate zone
append emulation resources, even though it doesn't use them. This is
because the underlying device will have max_hw_zone_append_sectors set
to 0 to request zone append emulation. When the DM device is stacked on
top of it, it will inherit that max_hw_zone_append_sectors limit,
despite being able to pass down zone append bios. Solve this by making
sure max_hw_zone_append_sectors is non-zero for DM devices that do not
need zone append emulation.
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
---
drivers/md/dm-zone.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
index 1d419734fefc..f89590f904e1 100644
--- a/drivers/md/dm-zone.c
+++ b/drivers/md/dm-zone.c
@@ -345,11 +345,15 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
/*
* Check if zone append is natively supported, and if not, set the
- * mapped device queue as needing zone append emulation.
+ * mapped device queue as needing zone append emulation. If zone
+ * append is natively supported, make sure that
+ * max_hw_zone_append_sectors is not set to 0.
*/
WARN_ON_ONCE(queue_is_mq(q));
if (!dm_table_supports_zone_append(t))
lim->max_hw_zone_append_sectors = 0;
+ else if (lim->max_hw_zone_append_sectors == 0)
+ lim->max_hw_zone_append_sectors = lim->max_zone_append_sectors;
/*
* Determine the max open and max active zone limits for the mapped
--
2.48.1
^ permalink raw reply related [flat|nested] 15+ messages in thread* Re: [PATCH v3 6/6] dm: fix native zone append devices on top of emulated ones
2025-04-08 23:51 ` [PATCH v3 6/6] dm: fix native zone append devices on top of emulated ones Benjamin Marzinski
@ 2025-04-09 6:21 ` Damien Le Moal
0 siblings, 0 replies; 15+ messages in thread
From: Damien Le Moal @ 2025-04-09 6:21 UTC (permalink / raw)
To: Benjamin Marzinski, Mikulas Patocka, Mike Snitzer
Cc: dm-devel, Christoph Hellwig
On 4/9/25 8:51 AM, Benjamin Marzinski wrote:
> If a DM device that can pass down zone append commands is stacked on top
> of a device that emulates zone append commands, it will allocate zone
> append emulation resources, even though it doesn't use them. This is
> because the underlying device will have max_hw_zone_append_sectors set
> to 0 to request zone append emulation. When the DM device is stacked on
> top of it, it will inherit that max_hw_zone_append_sectors limit,
> despite being able to pass down zone append bios. Solve this by making
> sure max_hw_zone_append_sectors is non-zero for DM devices that do not
> need zone append emulation.
>
> Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Looks good.
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH v3 0/6] [PATCH v3 0/6] dm: fix issues with swapping dm tables
2025-04-08 23:51 [PATCH v3 0/6] [PATCH v3 0/6] dm: fix issues with swapping dm tables Benjamin Marzinski
` (5 preceding siblings ...)
2025-04-08 23:51 ` [PATCH v3 6/6] dm: fix native zone append devices on top of emulated ones Benjamin Marzinski
@ 2025-04-09 6:24 ` Damien Le Moal
6 siblings, 0 replies; 15+ messages in thread
From: Damien Le Moal @ 2025-04-09 6:24 UTC (permalink / raw)
To: Benjamin Marzinski, Mikulas Patocka, Mike Snitzer
Cc: dm-devel, Christoph Hellwig
On 4/9/25 8:51 AM, Benjamin Marzinski wrote:
Thanks for all the fixes. Of note is that I am looking into the dm-delay
pre-suspend problem for zoned case (potential reordering of writes). It turns
out to not be an easy fix given how dm-delay works. Still scratching my head :)
> There were multiple places in dm's __bind() function where it could fail
> and not completely roll back, leaving the device using the the old
> table, but with device limits and resources from the new table.
> Additionally, unused mempools for request-based devices were not always
> freed immediately.
>
> Also, there were a number of issues with switching zoned tables that
> emulate zone append (in other words, dm-crypt on top of zoned devices).
> dm_blk_report_zones() could be called while the device was initially
> setting up a zoned table and creating zoned resources or could possibly
> fail to end a srcu read section. More importantly,
> blk_revalidate_disk_zones() would never get called when updating a zoned
> table. This could cause the dm device to see the wrong zone write
> offsets, not have a large enough zwplugs reserved in its mempool, or
> read invalid memory when checking the conventional zones bitmap.
>
> Finally, any DM device created on top of a device emulating zone appeads
> will automatically have zone write plug resources created for it, since
> max_hw_zone_append_sectors will always be 0 for a device stacked on top
> of a device max_hw_zone_append_sectors = 0
>
> This patchset fixes these issues. It deals with the problems around
> blk_revalidate_disk_zones() by only calling it for devices that have no
> zone write plug resources. This will always correctly update the zoned
> settings. If a device has zone write plug resources, calling
> blk_revalidate_disk_zones() will not correctly update them in many
> cases, so DM simply doesn't call it for devices with zone write plug
> resources. Instead of allowing people to load tables that can break the
> device, like currently happens, DM disallosw any table reloads that
> change the zoned setting for devices that already have zone write plug
> resources. Finally, it deals with the max_hw_zone_append_sectors issue
> by making sure that it is non-zero for zoned DM devices that do not need
> zone write append emulation.
>
> Changes in V3:
> - Use queue_limits_start_update() instead of modifying
> queue_limits_set()
> - Rewrote the commit message for patch 0004 ("dm: fix
> dm_blk_report_zones") to explain that this only happens when initially
> setting up a table with zone append resources, so disallowing table
> swaps after you set up a zoned dm-crypt device will not effect the
> issue at all. I did not implement Christoph's suggestion because I
> don't understand how it would work here. Perhaps I'm just being dense.
> I'm not wedded to this solution. Any one that keeps this
> use-after-free error from being possible is fine by me.
> - Added a final patch to deal with the issue of stacked devices always
> getting zone append resources if any underlying device needs them.
>
> Changes in V2:
> - Made queue_limits_set() optionally return the old limits (grabbed
> while holding the limits_lock), and used this in
> dm_table_set_restrictions()
> - dropped changes to disk_free_zone_resources() and the
> blk_revalidate_disk_zones() code path (removed patches 0005 & 0006)
> - Instead of always calling blk_revalidate_disk_zones(), disallow
> changes that would change zone settings if the device has
> zone write plug resources (final patch).
>
> Benjamin Marzinski (6):
> dm: don't change md if dm_table_set_restrictions() fails
> dm: free table mempools if not used in __bind
> dm: handle failures in dm_table_set_restrictions
> dm: fix dm_blk_report_zones
> dm: limit swapping tables for devices with zone write plugs
> dm: fix native zone append devices on top of emulated ones
>
> drivers/md/dm-core.h | 1 +
> drivers/md/dm-table.c | 67 +++++++++++++++++++++++++-------
> drivers/md/dm-zone.c | 90 ++++++++++++++++++++++++++++++-------------
> drivers/md/dm.c | 36 ++++++++++-------
> drivers/md/dm.h | 6 +++
> 5 files changed, 146 insertions(+), 54 deletions(-)
>
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 15+ messages in thread