[RFC PATCH 0/7] dm: fix issues with swapping dm tables

public inbox for linux-block@vger.kernel.org
 help / color / mirror / Atom feed

* [RFC PATCH 0/7] dm: fix issues with swapping dm tables
@ 2025-03-09 22:28 Benjamin Marzinski
  2025-03-09 22:28 ` [PATCH 1/7] dm: don't change md if dm_table_set_restrictions() fails Benjamin Marzinski
                   ` (7 more replies)
  0 siblings, 8 replies; 28+ messages in thread
From: Benjamin Marzinski @ 2025-03-09 22:28 UTC (permalink / raw)
  To: Mikulas Patocka, Mike Snitzer, Jens Axboe
  Cc: dm-devel, linux-block, Damien Le Moal, Christoph Hellwig

There were multiple places in dm's __bind() function where it could fail
and not completely roll back, leaving the device using the the old
table, but with device limits and resources from the new table.
Additionally, unused mempools for request-based devices were not always
freed immediately.

Finally, there were a number of issues with switching zoned tables that
emulate zone append (in other words, dm-crypt on top of zoned devices).
dm_blk_report_zones() could be called while the device was suspended and
modifying zoned resources or could possibly fail to end a srcu read
section.  More importantly, blk_revalidate_disk_zones() would never get
called when updating a zoned table. This could cause the dm device to
see the wrong zone write offsets, not have a large enough zwplugs
reserved in its mempool, or read invalid memory when checking the
conventional zones bitmap.

This patchset fixes these issues. It does not make it so that
device-mapper is able to load any zoned table from any other zoned
table. Zoned dm-crypt devices can be safely grown and shrunk, but
reloading a zoned dm-crypt device to, for instance, point at a
completely different underlying device won't work correctly. IO might
fail since the zone write offsets of the dm-crypt device will not be
updated for all the existing zones with plugs. If the new device's zone
offsets don't match the old device's offsets, IO to the zone will fail.
If the ability to switch tables from a zoned dm-crypt device to an
abritry other zoned dm-crypt device is important to people, it could be
done as long as there are no plugged zones when dm suspends.

This patchset also doesn't touch the code for switching from a zoned to
a non-zoned device. Switching from a zoned dm-crypt device to a
non-zoned device is problematic if there are plugged zones, since the
underlying device will not be prepared to handle these plugged writes.
Switching to a target that does not pass down IOs, like the dm-error
target, is fine. So is switching when there are no plugged zones, except
that we do not free the zoned resources in this case, even though we
safely could.

If people are interested in removing some of these limitations, I can
send patches for them, but I'm not sure how much extra code we want,
just to support niche zoned dm-crypt reloads.

Benjamin Marzinski (7):
  dm: don't change md if dm_table_set_restrictions() fails
  dm: free table mempools if not used in __bind
  dm: handle failures in dm_table_set_restrictions
  dm: fix dm_blk_report_zones
  blk-zoned: clean up zone settings for devices without zwplugs
  blk-zoned: modify blk_revalidate_disk_zones for bio-based drivers
  dm: allow devices to revalidate existing zones

 block/blk-zoned.c      | 78 ++++++++++++++++++++++--------------------
 block/blk.h            |  4 ---
 drivers/md/dm-core.h   |  1 +
 drivers/md/dm-table.c  | 24 ++++++++-----
 drivers/md/dm-zone.c   | 75 ++++++++++++++++++++++++++--------------
 drivers/md/dm.c        | 30 ++++++++--------
 drivers/md/dm.h        |  1 +
 include/linux/blkdev.h |  4 +++
 8 files changed, 128 insertions(+), 89 deletions(-)

-- 
2.48.1

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 1/7] dm: don't change md if dm_table_set_restrictions() fails
  2025-03-09 22:28 [RFC PATCH 0/7] dm: fix issues with swapping dm tables Benjamin Marzinski
@ 2025-03-09 22:28 ` Benjamin Marzinski
  2025-03-09 23:18   ` Damien Le Moal
  2025-03-09 22:28 ` [PATCH 2/7] dm: free table mempools if not used in __bind Benjamin Marzinski
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 28+ messages in thread
From: Benjamin Marzinski @ 2025-03-09 22:28 UTC (permalink / raw)
  To: Mikulas Patocka, Mike Snitzer, Jens Axboe
  Cc: dm-devel, linux-block, Damien Le Moal, Christoph Hellwig

__bind was changing the disk capacity, geometry and mempools of the
mapped device before calling dm_table_set_restrictions() which could
fail, forcing dm to drop the new table. Failing here would leave the
device using the old table but with the wrong capacity and mempools.

Move dm_table_set_restrictions() earlier in __bind(). Since it needs the
capacity to be set, save the old version and restore it on failure.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
---
 drivers/md/dm.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 5ab7574c0c76..f5c5ccb6f8d2 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2421,21 +2421,29 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,
 			       struct queue_limits *limits)
 {
 	struct dm_table *old_map;
-	sector_t size;
+	sector_t size, old_size;
 	int ret;
 
 	lockdep_assert_held(&md->suspend_lock);
 
 	size = dm_table_get_size(t);
 
+	old_size = dm_get_size(md);
+	set_capacity(md->disk, size);
+
+	ret = dm_table_set_restrictions(t, md->queue, limits);
+	if (ret) {
+		set_capacity(md->disk, old_size);
+		old_map = ERR_PTR(ret);
+		goto out;
+	}
+
 	/*
 	 * Wipe any geometry if the size of the table changed.
 	 */
-	if (size != dm_get_size(md))
+	if (size != old_size)
 		memset(&md->geometry, 0, sizeof(md->geometry));
 
-	set_capacity(md->disk, size);
-
 	dm_table_event_callback(t, event_callback, md);
 
 	if (dm_table_request_based(t)) {
@@ -2468,12 +2476,6 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,
 		t->mempools = NULL;
 	}
 
-	ret = dm_table_set_restrictions(t, md->queue, limits);
-	if (ret) {
-		old_map = ERR_PTR(ret);
-		goto out;
-	}
-
 	old_map = rcu_dereference_protected(md->map, lockdep_is_held(&md->suspend_lock));
 	rcu_assign_pointer(md->map, (void *)t);
 	md->immutable_target_type = dm_table_get_immutable_target_type(t);
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 2/7] dm: free table mempools if not used in __bind
  2025-03-09 22:28 [RFC PATCH 0/7] dm: fix issues with swapping dm tables Benjamin Marzinski
  2025-03-09 22:28 ` [PATCH 1/7] dm: don't change md if dm_table_set_restrictions() fails Benjamin Marzinski
@ 2025-03-09 22:28 ` Benjamin Marzinski
  2025-03-09 23:19   ` Damien Le Moal
  2025-03-09 22:28 ` [PATCH 3/7] dm: handle failures in dm_table_set_restrictions Benjamin Marzinski
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 28+ messages in thread
From: Benjamin Marzinski @ 2025-03-09 22:28 UTC (permalink / raw)
  To: Mikulas Patocka, Mike Snitzer, Jens Axboe
  Cc: dm-devel, linux-block, Damien Le Moal, Christoph Hellwig

With request-based dm, the mempools don't need reloading when switching
tables, but the unused table mempools are not freed until the active
table is finally freed. Free them immediately if they are not needed.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
---
 drivers/md/dm.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index f5c5ccb6f8d2..292414da871d 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -2461,10 +2461,10 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,
 		 * requests in the queue may refer to bio from the old bioset,
 		 * so you must walk through the queue to unprep.
 		 */
-		if (!md->mempools) {
+		if (!md->mempools)
 			md->mempools = t->mempools;
-			t->mempools = NULL;
-		}
+		else
+			dm_free_md_mempools(t->mempools);
 	} else {
 		/*
 		 * The md may already have mempools that need changing.
@@ -2473,8 +2473,8 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,
 		 */
 		dm_free_md_mempools(md->mempools);
 		md->mempools = t->mempools;
-		t->mempools = NULL;
 	}
+	t->mempools = NULL;
 
 	old_map = rcu_dereference_protected(md->map, lockdep_is_held(&md->suspend_lock));
 	rcu_assign_pointer(md->map, (void *)t);
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 3/7] dm: handle failures in dm_table_set_restrictions
  2025-03-09 22:28 [RFC PATCH 0/7] dm: fix issues with swapping dm tables Benjamin Marzinski
  2025-03-09 22:28 ` [PATCH 1/7] dm: don't change md if dm_table_set_restrictions() fails Benjamin Marzinski
  2025-03-09 22:28 ` [PATCH 2/7] dm: free table mempools if not used in __bind Benjamin Marzinski
@ 2025-03-09 22:28 ` Benjamin Marzinski
  2025-03-09 23:25   ` Damien Le Moal
  2025-03-09 22:29 ` [PATCH 4/7] dm: fix dm_blk_report_zones Benjamin Marzinski
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 28+ messages in thread
From: Benjamin Marzinski @ 2025-03-09 22:28 UTC (permalink / raw)
  To: Mikulas Patocka, Mike Snitzer, Jens Axboe
  Cc: dm-devel, linux-block, Damien Le Moal, Christoph Hellwig

If dm_table_set_restrictions() fails while swapping tables,
device-mapper will continue using the previous table. It must be sure to
leave the mapped_device in it's previous state on failure.  Otherwise
device-mapper could end up using the old table with settings from the
unused table.

Do not update the mapped device in dm_set_zones_restrictions(). Wait
till after dm_table_set_restrictions() is sure to succeed to update the
md zoned settings. Do the same with the dax settings, and if
dm_revalidate_zones() fails, restore the original queue limits.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
---
 drivers/md/dm-table.c | 24 ++++++++++++++++--------
 drivers/md/dm-zone.c  | 26 ++++++++++++++++++--------
 drivers/md/dm.h       |  1 +
 3 files changed, 35 insertions(+), 16 deletions(-)

diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index 0ef5203387b2..4003e84af11d 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -1836,6 +1836,7 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
 			      struct queue_limits *limits)
 {
 	int r;
+	struct queue_limits old_limits;
 
 	if (!dm_table_supports_nowait(t))
 		limits->features &= ~BLK_FEAT_NOWAIT;
@@ -1862,16 +1863,11 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
 	if (dm_table_supports_flush(t))
 		limits->features |= BLK_FEAT_WRITE_CACHE | BLK_FEAT_FUA;
 
-	if (dm_table_supports_dax(t, device_not_dax_capable)) {
+	if (dm_table_supports_dax(t, device_not_dax_capable))
 		limits->features |= BLK_FEAT_DAX;
-		if (dm_table_supports_dax(t, device_not_dax_synchronous_capable))
-			set_dax_synchronous(t->md->dax_dev);
-	} else
+	else
 		limits->features &= ~BLK_FEAT_DAX;
 
-	if (dm_table_any_dev_attr(t, device_dax_write_cache_enabled, NULL))
-		dax_write_cache(t->md->dax_dev, true);
-
 	/* For a zoned table, setup the zone related queue attributes. */
 	if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) &&
 	    (limits->features & BLK_FEAT_ZONED)) {
@@ -1883,6 +1879,7 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
 	if (dm_table_supports_atomic_writes(t))
 		limits->features |= BLK_FEAT_ATOMIC_WRITES;
 
+	old_limits = q->limits;
 	r = queue_limits_set(q, limits);
 	if (r)
 		return r;
@@ -1894,10 +1891,21 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
 	if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) &&
 	    (limits->features & BLK_FEAT_ZONED)) {
 		r = dm_revalidate_zones(t, q);
-		if (r)
+		if (r) {
+			queue_limits_set(q, &old_limits);
 			return r;
+		}
 	}
 
+	if (IS_ENABLED(CONFIG_BLK_DEV_ZONED))
+		dm_finalize_zone_settings(t, limits);
+
+	if (dm_table_supports_dax(t, device_not_dax_synchronous_capable))
+		set_dax_synchronous(t->md->dax_dev);
+
+	if (dm_table_any_dev_attr(t, device_dax_write_cache_enabled, NULL))
+		dax_write_cache(t->md->dax_dev, true);
+
 	dm_update_crypto_profile(q, t);
 	return 0;
 }
diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
index 20edd3fabbab..681058feb63b 100644
--- a/drivers/md/dm-zone.c
+++ b/drivers/md/dm-zone.c
@@ -340,12 +340,8 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
 	 * mapped device queue as needing zone append emulation.
 	 */
 	WARN_ON_ONCE(queue_is_mq(q));
-	if (dm_table_supports_zone_append(t)) {
-		clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
-	} else {
-		set_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
+	if (!dm_table_supports_zone_append(t))
 		lim->max_hw_zone_append_sectors = 0;
-	}
 
 	/*
 	 * Determine the max open and max active zone limits for the mapped
@@ -383,9 +379,6 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
 		lim->zone_write_granularity = 0;
 		lim->chunk_sectors = 0;
 		lim->features &= ~BLK_FEAT_ZONED;
-		clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
-		md->nr_zones = 0;
-		disk->nr_zones = 0;
 		return 0;
 	}
 
@@ -408,6 +401,23 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
 	return 0;
 }
 
+void dm_finalize_zone_settings(struct dm_table *t, struct queue_limits *lim)
+{
+	struct mapped_device *md = t->md;
+
+	if (lim->features & BLK_FEAT_ZONED) {
+		if (dm_table_supports_zone_append(t))
+			clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
+		else
+			set_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
+	} else {
+		clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
+		md->nr_zones = 0;
+		md->disk->nr_zones = 0;
+	}
+}
+
+
 /*
  * IO completion callback called from clone_endio().
  */
diff --git a/drivers/md/dm.h b/drivers/md/dm.h
index a0a8ff119815..e5d3a9f46a91 100644
--- a/drivers/md/dm.h
+++ b/drivers/md/dm.h
@@ -102,6 +102,7 @@ int dm_setup_md_queue(struct mapped_device *md, struct dm_table *t);
 int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
 		struct queue_limits *lim);
 int dm_revalidate_zones(struct dm_table *t, struct request_queue *q);
+void dm_finalize_zone_settings(struct dm_table *t, struct queue_limits *lim);
 void dm_zone_endio(struct dm_io *io, struct bio *clone);
 #ifdef CONFIG_BLK_DEV_ZONED
 int dm_blk_report_zones(struct gendisk *disk, sector_t sector,
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 4/7] dm: fix dm_blk_report_zones
  2025-03-09 22:28 [RFC PATCH 0/7] dm: fix issues with swapping dm tables Benjamin Marzinski
                   ` (2 preceding siblings ...)
  2025-03-09 22:28 ` [PATCH 3/7] dm: handle failures in dm_table_set_restrictions Benjamin Marzinski
@ 2025-03-09 22:29 ` Benjamin Marzinski
  2025-03-09 23:27   ` Damien Le Moal
  2025-03-09 22:29 ` [PATCH 5/7] blk-zoned: clean up zone settings for devices without zwplugs Benjamin Marzinski
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 28+ messages in thread
From: Benjamin Marzinski @ 2025-03-09 22:29 UTC (permalink / raw)
  To: Mikulas Patocka, Mike Snitzer, Jens Axboe
  Cc: dm-devel, linux-block, Damien Le Moal, Christoph Hellwig

If dm_get_live_table() returned NULL, dm_put_live_table() was never
called.  Also, if md->zone_revalidate_map is set, check that
dm_blk_report_zones() is being called from the process that set it in
__bind(). Otherwise the zone resources could change while accessing
them. Finally, it is possible that md->zone_revalidate_map will change
while calling this function. Only read it once, so that we are always
using the same value. Otherwise we might miss a call to
dm_put_live_table().

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
---
 drivers/md/dm-core.h |  1 +
 drivers/md/dm-zone.c | 23 +++++++++++++++--------
 2 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index 3637761f3585..f3a3f2ef6322 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -141,6 +141,7 @@ struct mapped_device {
 #ifdef CONFIG_BLK_DEV_ZONED
 	unsigned int nr_zones;
 	void *zone_revalidate_map;
+	struct task_struct *revalidate_map_task;
 #endif
 
 #ifdef CONFIG_IMA
diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
index 681058feb63b..11e19281bb64 100644
--- a/drivers/md/dm-zone.c
+++ b/drivers/md/dm-zone.c
@@ -56,24 +56,29 @@ int dm_blk_report_zones(struct gendisk *disk, sector_t sector,
 {
 	struct mapped_device *md = disk->private_data;
 	struct dm_table *map;
-	int srcu_idx, ret;
+	struct dm_table *zone_revalidate_map = md->zone_revalidate_map;
+	int srcu_idx, ret = -EIO;
 
-	if (!md->zone_revalidate_map) {
-		/* Regular user context */
+	if (!zone_revalidate_map || md->revalidate_map_task != current) {
+		/*
+		 * Regular user context or
+		 * Zone revalidation during __bind() is in progress, but this
+		 * call is from a different process
+		 */
 		if (dm_suspended_md(md))
 			return -EAGAIN;
 
 		map = dm_get_live_table(md, &srcu_idx);
-		if (!map)
-			return -EIO;
 	} else {
 		/* Zone revalidation during __bind() */
-		map = md->zone_revalidate_map;
+		map = zone_revalidate_map;
 	}
 
-	ret = dm_blk_do_report_zones(md, map, sector, nr_zones, cb, data);
+	if (map)
+		ret = dm_blk_do_report_zones(md, map, sector, nr_zones, cb,
+					     data);
 
-	if (!md->zone_revalidate_map)
+	if (!zone_revalidate_map)
 		dm_put_live_table(md, srcu_idx);
 
 	return ret;
@@ -175,7 +180,9 @@ int dm_revalidate_zones(struct dm_table *t, struct request_queue *q)
 	 * our table for dm_blk_report_zones() to use directly.
 	 */
 	md->zone_revalidate_map = t;
+	md->revalidate_map_task = current;
 	ret = blk_revalidate_disk_zones(disk);
+	md->revalidate_map_task = NULL;
 	md->zone_revalidate_map = NULL;
 
 	if (ret) {
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 5/7] blk-zoned: clean up zone settings for devices without zwplugs
  2025-03-09 22:28 [RFC PATCH 0/7] dm: fix issues with swapping dm tables Benjamin Marzinski
                   ` (3 preceding siblings ...)
  2025-03-09 22:29 ` [PATCH 4/7] dm: fix dm_blk_report_zones Benjamin Marzinski
@ 2025-03-09 22:29 ` Benjamin Marzinski
  2025-03-09 23:31   ` Damien Le Moal
  2025-03-09 22:29 ` [RFC PATCH 6/7] blk-zoned: modify blk_revalidate_disk_zones for bio-based drivers Benjamin Marzinski
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 28+ messages in thread
From: Benjamin Marzinski @ 2025-03-09 22:29 UTC (permalink / raw)
  To: Mikulas Patocka, Mike Snitzer, Jens Axboe
  Cc: dm-devel, linux-block, Damien Le Moal, Christoph Hellwig

Previously disk_free_zone_resources() would only clean up zoned settings
on a disk if the disk had write plugs allocated. Make it clean up zoned
settings on any disk, so disks that don't allocate write plugs can use
it as well.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
---
 block/blk-zoned.c | 29 ++++++++++++++---------------
 1 file changed, 14 insertions(+), 15 deletions(-)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 761ea662ddc3..d7dc89cbdccb 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -1363,24 +1363,23 @@ static unsigned int disk_set_conv_zones_bitmap(struct gendisk *disk,
 
 void disk_free_zone_resources(struct gendisk *disk)
 {
-	if (!disk->zone_wplugs_pool)
-		return;
-
-	if (disk->zone_wplugs_wq) {
-		destroy_workqueue(disk->zone_wplugs_wq);
-		disk->zone_wplugs_wq = NULL;
-	}
+	if (disk->zone_wplugs_pool) {
+		if (disk->zone_wplugs_wq) {
+			destroy_workqueue(disk->zone_wplugs_wq);
+			disk->zone_wplugs_wq = NULL;
+		}
 
-	disk_destroy_zone_wplugs_hash_table(disk);
+		disk_destroy_zone_wplugs_hash_table(disk);
 
-	/*
-	 * Wait for the zone write plugs to be RCU-freed before
-	 * destorying the mempool.
-	 */
-	rcu_barrier();
+		/*
+		 * Wait for the zone write plugs to be RCU-freed before
+		 * destorying the mempool.
+		 */
+		rcu_barrier();
 
-	mempool_destroy(disk->zone_wplugs_pool);
-	disk->zone_wplugs_pool = NULL;
+		mempool_destroy(disk->zone_wplugs_pool);
+		disk->zone_wplugs_pool = NULL;
+	}
 
 	disk_set_conv_zones_bitmap(disk, NULL);
 	disk->zone_capacity = 0;
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 6/7] blk-zoned: modify blk_revalidate_disk_zones for bio-based drivers
  2025-03-09 22:28 [RFC PATCH 0/7] dm: fix issues with swapping dm tables Benjamin Marzinski
                   ` (4 preceding siblings ...)
  2025-03-09 22:29 ` [PATCH 5/7] blk-zoned: clean up zone settings for devices without zwplugs Benjamin Marzinski
@ 2025-03-09 22:29 ` Benjamin Marzinski
  2025-03-09 22:29 ` [RFC PATCH 7/7] dm: allow devices to revalidate existing zones Benjamin Marzinski
  2025-03-09 23:16 ` [RFC PATCH 0/7] dm: fix issues with swapping dm tables Damien Le Moal
  7 siblings, 0 replies; 28+ messages in thread
From: Benjamin Marzinski @ 2025-03-09 22:29 UTC (permalink / raw)
  To: Mikulas Patocka, Mike Snitzer, Jens Axboe
  Cc: dm-devel, linux-block, Damien Le Moal, Christoph Hellwig

If device-mapper is swapping zoned tables, and
blk_revalidate_disk_zones() fails. It must retain its current zoned
resources since device-mapper will be failing back to using the previous
table and the zoned settings need to match the table. Allocating
unnecessary zwplugs is acceptable, but the zoned configuration must not
change.  Otherwise it can run into errors like bdev_zone_is_seq()
reading invalid memory because disk->conv_zones_bitmap is the wrong
size. However if device-mapper did not previously have a zoned table, it
should free up the zoned resources, instead of leaving them allocated
and unused.

To solve this, do not free the zone resources when
blk_revalidate_disk_zones() fails for bio based drivers. Additionally,
delay copying the zoned settings to the gendisk until
disk_update_zone_resources() can no longer fail, and do not freeze the
queue for bio-based drivers, since this will hang if there are any
plugged zone write bios.

Also, export disk_free_zone_resources() so that device-mapper can choose
when to free the zoned resources.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
---
 block/blk-zoned.c      | 49 +++++++++++++++++++++++-------------------
 block/blk.h            |  4 ----
 drivers/md/dm-zone.c   |  3 +++
 include/linux/blkdev.h |  4 ++++
 4 files changed, 34 insertions(+), 26 deletions(-)

diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index d7dc89cbdccb..3bec289d27db 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -1343,22 +1343,17 @@ static void disk_destroy_zone_wplugs_hash_table(struct gendisk *disk)
 	disk->zone_wplugs_hash_bits = 0;
 }
 
-static unsigned int disk_set_conv_zones_bitmap(struct gendisk *disk,
-					       unsigned long *bitmap)
+static void disk_set_conv_zones_bitmap(struct gendisk *disk,
+				       unsigned long *bitmap)
 {
-	unsigned int nr_conv_zones = 0;
 	unsigned long flags;
 
 	spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
-	if (bitmap)
-		nr_conv_zones = bitmap_weight(bitmap, disk->nr_zones);
 	bitmap = rcu_replace_pointer(disk->conv_zones_bitmap, bitmap,
 				     lockdep_is_held(&disk->zone_wplugs_lock));
 	spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
 
 	kfree_rcu_mightsleep(bitmap);
-
-	return nr_conv_zones;
 }
 
 void disk_free_zone_resources(struct gendisk *disk)
@@ -1386,6 +1381,7 @@ void disk_free_zone_resources(struct gendisk *disk)
 	disk->last_zone_capacity = 0;
 	disk->nr_zones = 0;
 }
+EXPORT_SYMBOL_GPL(disk_free_zone_resources);
 
 static inline bool disk_need_zone_resources(struct gendisk *disk)
 {
@@ -1434,24 +1430,23 @@ struct blk_revalidate_zone_args {
 
 /*
  * Update the disk zone resources information and device queue limits.
- * The disk queue is frozen when this is executed.
+ * The disk queue is frozen when this is executed on blk-mq drivers.
  */
 static int disk_update_zone_resources(struct gendisk *disk,
 				      struct blk_revalidate_zone_args *args)
 {
 	struct request_queue *q = disk->queue;
-	unsigned int nr_seq_zones, nr_conv_zones;
+	unsigned int nr_seq_zones, nr_conv_zones = 0;
 	unsigned int pool_size;
 	struct queue_limits lim;
+	int ret;
 
-	disk->nr_zones = args->nr_zones;
-	disk->zone_capacity = args->zone_capacity;
-	disk->last_zone_capacity = args->last_zone_capacity;
-	nr_conv_zones =
-		disk_set_conv_zones_bitmap(disk, args->conv_zones_bitmap);
-	if (nr_conv_zones >= disk->nr_zones) {
+	if (args->conv_zones_bitmap)
+		nr_conv_zones = bitmap_weight(args->conv_zones_bitmap,
+					      args->nr_zones);
+	if (nr_conv_zones >= args->nr_zones) {
 		pr_warn("%s: Invalid number of conventional zones %u / %u\n",
-			disk->disk_name, nr_conv_zones, disk->nr_zones);
+			disk->disk_name, nr_conv_zones, args->nr_zones);
 		return -ENODEV;
 	}
 
@@ -1463,7 +1458,7 @@ static int disk_update_zone_resources(struct gendisk *disk,
 	 * small ZNS namespace. For such case, assume that the zoned device has
 	 * no zone resource limits.
 	 */
-	nr_seq_zones = disk->nr_zones - nr_conv_zones;
+	nr_seq_zones = args->nr_zones - nr_conv_zones;
 	if (lim.max_open_zones >= nr_seq_zones)
 		lim.max_open_zones = 0;
 	if (lim.max_active_zones >= nr_seq_zones)
@@ -1493,7 +1488,19 @@ static int disk_update_zone_resources(struct gendisk *disk,
 	}
 
 commit:
-	return queue_limits_commit_update_frozen(q, &lim);
+	if (queue_is_mq(disk->queue))
+		ret = queue_limits_commit_update_frozen(q, &lim);
+	else
+		ret = queue_limits_commit_update(q, &lim);
+
+	if (!ret) {
+		disk->nr_zones = args->nr_zones;
+		disk->zone_capacity = args->zone_capacity;
+		disk->last_zone_capacity = args->last_zone_capacity;
+		disk_set_conv_zones_bitmap(disk, args->conv_zones_bitmap);
+	}
+
+	return ret;
 }
 
 static int blk_revalidate_conv_zone(struct blk_zone *zone, unsigned int idx,
@@ -1648,8 +1655,6 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
  * and when the zone configuration of the gendisk changes (e.g. after a format).
  * Before calling this function, the device driver must already have set the
  * device zone size (chunk_sector limit) and the max zone append limit.
- * BIO based drivers can also use this function as long as the device queue
- * can be safely frozen.
  */
 int blk_revalidate_disk_zones(struct gendisk *disk)
 {
@@ -1709,13 +1714,13 @@ int blk_revalidate_disk_zones(struct gendisk *disk)
 
 	/*
 	 * Set the new disk zone parameters only once the queue is frozen and
-	 * all I/Os are completed.
+	 * all I/Os are completed on blk-mq drivers.
 	 */
 	if (ret > 0)
 		ret = disk_update_zone_resources(disk, &args);
 	else
 		pr_warn("%s: failed to revalidate zones\n", disk->disk_name);
-	if (ret) {
+	if (ret && queue_is_mq(disk->queue)) {
 		unsigned int memflags = blk_mq_freeze_queue(q);
 
 		disk_free_zone_resources(disk);
diff --git a/block/blk.h b/block/blk.h
index 90fa5f28ccab..c84af503b77b 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -454,7 +454,6 @@ static inline struct bio *blk_queue_bounce(struct bio *bio,
 
 #ifdef CONFIG_BLK_DEV_ZONED
 void disk_init_zone_resources(struct gendisk *disk);
-void disk_free_zone_resources(struct gendisk *disk);
 static inline bool bio_zone_write_plugging(struct bio *bio)
 {
 	return bio_flagged(bio, BIO_ZONE_WRITE_PLUGGING);
@@ -500,9 +499,6 @@ int blkdev_zone_mgmt_ioctl(struct block_device *bdev, blk_mode_t mode,
 static inline void disk_init_zone_resources(struct gendisk *disk)
 {
 }
-static inline void disk_free_zone_resources(struct gendisk *disk)
-{
-}
 static inline bool bio_zone_write_plugging(struct bio *bio)
 {
 	return false;
diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
index 11e19281bb64..ac86011640c3 100644
--- a/drivers/md/dm-zone.c
+++ b/drivers/md/dm-zone.c
@@ -159,6 +159,7 @@ int dm_revalidate_zones(struct dm_table *t, struct request_queue *q)
 	struct mapped_device *md = t->md;
 	struct gendisk *disk = md->disk;
 	int ret;
+	bool was_zoned = disk->nr_zones != 0;
 
 	if (!get_capacity(disk))
 		return 0;
@@ -187,6 +188,8 @@ int dm_revalidate_zones(struct dm_table *t, struct request_queue *q)
 
 	if (ret) {
 		DMERR("Revalidate zones failed %d", ret);
+		if (!was_zoned)
+			disk_free_zone_resources(disk);
 		return ret;
 	}
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 248416ecd01c..51edf35ff715 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -690,12 +690,16 @@ static inline bool blk_queue_is_zoned(struct request_queue *q)
 }
 
 #ifdef CONFIG_BLK_DEV_ZONED
+void disk_free_zone_resources(struct gendisk *disk);
 static inline unsigned int disk_nr_zones(struct gendisk *disk)
 {
 	return disk->nr_zones;
 }
 bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs);
 #else /* CONFIG_BLK_DEV_ZONED */
+static inline void disk_free_zone_resources(struct gendisk *disk)
+{
+}
 static inline unsigned int disk_nr_zones(struct gendisk *disk)
 {
 	return 0;
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [RFC PATCH 7/7] dm: allow devices to revalidate existing zones
  2025-03-09 22:28 [RFC PATCH 0/7] dm: fix issues with swapping dm tables Benjamin Marzinski
                   ` (5 preceding siblings ...)
  2025-03-09 22:29 ` [RFC PATCH 6/7] blk-zoned: modify blk_revalidate_disk_zones for bio-based drivers Benjamin Marzinski
@ 2025-03-09 22:29 ` Benjamin Marzinski
  2025-03-09 23:59   ` Damien Le Moal
  2025-03-09 23:16 ` [RFC PATCH 0/7] dm: fix issues with swapping dm tables Damien Le Moal
  7 siblings, 1 reply; 28+ messages in thread
From: Benjamin Marzinski @ 2025-03-09 22:29 UTC (permalink / raw)
  To: Mikulas Patocka, Mike Snitzer, Jens Axboe
  Cc: dm-devel, linux-block, Damien Le Moal, Christoph Hellwig

dm_revalidate_zones() only allowed devices that had no zone resources
set up to call blk_revalidate_disk_zones(). If the device already had
zone resources, disk->nr_zones would always equal md->nr_zones so
dm_revalidate_zones() returned without doing any work. Instead, always
call blk_revalidate_disk_zones() if you are loading a new zoned table.

However, if the device emulates zone append operations and already has
zone append emulation resources, the table size cannot change when
loading a new table. Otherwise, all those resources will be garbage.

If emulated zone append operations are needed and the zone write pointer
offsets of the new table do not match those of the old table, writes to
the device will still fail. This patch allows users to safely grow and
shrink zone devices. But swapping arbitrary zoned tables will still not
work.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
---
 drivers/md/dm-zone.c | 23 +++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
index ac86011640c3..7e9ebeee7eac 100644
--- a/drivers/md/dm-zone.c
+++ b/drivers/md/dm-zone.c
@@ -164,16 +164,8 @@ int dm_revalidate_zones(struct dm_table *t, struct request_queue *q)
 	if (!get_capacity(disk))
 		return 0;
 
-	/* Revalidate only if something changed. */
-	if (!disk->nr_zones || disk->nr_zones != md->nr_zones) {
-		DMINFO("%s using %s zone append",
-		       disk->disk_name,
-		       queue_emulates_zone_append(q) ? "emulated" : "native");
-		md->nr_zones = 0;
-	}
-
-	if (md->nr_zones)
-		return 0;
+	DMINFO("%s using %s zone append", disk->disk_name,
+	       queue_emulates_zone_append(q) ? "emulated" : "native");
 
 	/*
 	 * Our table is not live yet. So the call to dm_get_live_table()
@@ -392,6 +384,17 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
 		return 0;
 	}
 
+	/*
+	 * If the device needs zone append emulation, and the device already has
+	 * zone append emulation resources, make sure that the chunk_sectors
+	 * hasn't changed size. Otherwise those resources will be garbage.
+	 */
+	if (!lim->max_hw_zone_append_sectors && disk->zone_wplugs_hash &&
+	    q->limits.chunk_sectors != lim->chunk_sectors) {
+		DMERR("Cannot change zone size when swapping tables");
+		return -EINVAL;
+	}
+
 	/*
 	 * Warn once (when the capacity is not yet set) if the mapped device is
 	 * partially using zone resources of the target devices as that leads to
-- 
2.48.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 0/7] dm: fix issues with swapping dm tables
  2025-03-09 22:28 [RFC PATCH 0/7] dm: fix issues with swapping dm tables Benjamin Marzinski
                   ` (6 preceding siblings ...)
  2025-03-09 22:29 ` [RFC PATCH 7/7] dm: allow devices to revalidate existing zones Benjamin Marzinski
@ 2025-03-09 23:16 ` Damien Le Moal
  2025-03-10 16:38   ` Benjamin Marzinski
  7 siblings, 1 reply; 28+ messages in thread
From: Damien Le Moal @ 2025-03-09 23:16 UTC (permalink / raw)
  To: Benjamin Marzinski, Mikulas Patocka, Mike Snitzer, Jens Axboe
  Cc: dm-devel, linux-block, Christoph Hellwig

On 3/10/25 07:28, Benjamin Marzinski wrote:
> There were multiple places in dm's __bind() function where it could fail
> and not completely roll back, leaving the device using the the old
> table, but with device limits and resources from the new table.
> Additionally, unused mempools for request-based devices were not always
> freed immediately.
> 
> Finally, there were a number of issues with switching zoned tables that
> emulate zone append (in other words, dm-crypt on top of zoned devices).
> dm_blk_report_zones() could be called while the device was suspended and
> modifying zoned resources or could possibly fail to end a srcu read
> section.  More importantly, blk_revalidate_disk_zones() would never get
> called when updating a zoned table. This could cause the dm device to
> see the wrong zone write offsets, not have a large enough zwplugs
> reserved in its mempool, or read invalid memory when checking the
> conventional zones bitmap.
> 
> This patchset fixes these issues. It does not make it so that
> device-mapper is able to load any zoned table from any other zoned
> table. Zoned dm-crypt devices can be safely grown and shrunk, but
> reloading a zoned dm-crypt device to, for instance, point at a
> completely different underlying device won't work correctly. IO might
> fail since the zone write offsets of the dm-crypt device will not be
> updated for all the existing zones with plugs. If the new device's zone
> offsets don't match the old device's offsets, IO to the zone will fail.
> If the ability to switch tables from a zoned dm-crypt device to an
> abritry other zoned dm-crypt device is important to people, it could be
> done as long as there are no plugged zones when dm suspends.

Thanks for fixing this.

Given that in the general case switching tables will always likely result in
unaligned write errors, I think we should just report a ENOTSUPP error if the
user attempts to swap tables.

> This patchset also doesn't touch the code for switching from a zoned to
> a non-zoned device. Switching from a zoned dm-crypt device to a
> non-zoned device is problematic if there are plugged zones, since the
> underlying device will not be prepared to handle these plugged writes.
> Switching to a target that does not pass down IOs, like the dm-error
> target, is fine. So is switching when there are no plugged zones, except
> that we do not free the zoned resources in this case, even though we
> safely could.

This is another case that does not make much sense in practice. So instead of
still allowing it knowing that it most likely will not work, we should return
ENOTSUPP here too I think.

> If people are interested in removing some of these limitations, I can
> send patches for them, but I'm not sure how much extra code we want,
> just to support niche zoned dm-crypt reloads.

I have never heard any complaints/bug reports from people attempting this. Which
likely means that no-one is needing/trying to do this. So as mentionned above,
we should make sure that this feature is not reported as not supported with a
ENOTSUPP error, and maybe a warning.



-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/7] dm: don't change md if dm_table_set_restrictions() fails
  2025-03-09 22:28 ` [PATCH 1/7] dm: don't change md if dm_table_set_restrictions() fails Benjamin Marzinski
@ 2025-03-09 23:18   ` Damien Le Moal
  2025-03-10 16:39     ` Benjamin Marzinski
  0 siblings, 1 reply; 28+ messages in thread
From: Damien Le Moal @ 2025-03-09 23:18 UTC (permalink / raw)
  To: Benjamin Marzinski, Mikulas Patocka, Mike Snitzer, Jens Axboe
  Cc: dm-devel, linux-block, Christoph Hellwig

On 3/10/25 07:28, Benjamin Marzinski wrote:
> __bind was changing the disk capacity, geometry and mempools of the
> mapped device before calling dm_table_set_restrictions() which could
> fail, forcing dm to drop the new table. Failing here would leave the
> device using the old table but with the wrong capacity and mempools.
> 
> Move dm_table_set_restrictions() earlier in __bind(). Since it needs the
> capacity to be set, save the old version and restore it on failure.
> 
> Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>

Does this need a "Fixes" tag maybe ?

Otherwise looks good to me.

Reviewed-by: Damien Le Moal <dlemoal@kernel.org>

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/7] dm: free table mempools if not used in __bind
  2025-03-09 22:28 ` [PATCH 2/7] dm: free table mempools if not used in __bind Benjamin Marzinski
@ 2025-03-09 23:19   ` Damien Le Moal
  0 siblings, 0 replies; 28+ messages in thread
From: Damien Le Moal @ 2025-03-09 23:19 UTC (permalink / raw)
  To: Benjamin Marzinski, Mikulas Patocka, Mike Snitzer, Jens Axboe
  Cc: dm-devel, linux-block, Christoph Hellwig

On 3/10/25 07:28, Benjamin Marzinski wrote:
> With request-based dm, the mempools don't need reloading when switching
> tables, but the unused table mempools are not freed until the active
> table is finally freed. Free them immediately if they are not needed.
> 
> Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>

Looks OK to me.

Reviewed-by: Damien Le Moal <dlemoal@kernel.org>

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/7] dm: handle failures in dm_table_set_restrictions
  2025-03-09 22:28 ` [PATCH 3/7] dm: handle failures in dm_table_set_restrictions Benjamin Marzinski
@ 2025-03-09 23:25   ` Damien Le Moal
  2025-03-10 17:37     ` Benjamin Marzinski
  0 siblings, 1 reply; 28+ messages in thread
From: Damien Le Moal @ 2025-03-09 23:25 UTC (permalink / raw)
  To: Benjamin Marzinski, Mikulas Patocka, Mike Snitzer, Jens Axboe
  Cc: dm-devel, linux-block, Christoph Hellwig

On 3/10/25 07:28, Benjamin Marzinski wrote:
> If dm_table_set_restrictions() fails while swapping tables,
> device-mapper will continue using the previous table. It must be sure to
> leave the mapped_device in it's previous state on failure.  Otherwise
> device-mapper could end up using the old table with settings from the
> unused table.
> 
> Do not update the mapped device in dm_set_zones_restrictions(). Wait
> till after dm_table_set_restrictions() is sure to succeed to update the
> md zoned settings. Do the same with the dax settings, and if
> dm_revalidate_zones() fails, restore the original queue limits.
> 
> Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
> ---
>  drivers/md/dm-table.c | 24 ++++++++++++++++--------
>  drivers/md/dm-zone.c  | 26 ++++++++++++++++++--------
>  drivers/md/dm.h       |  1 +
>  3 files changed, 35 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> index 0ef5203387b2..4003e84af11d 100644
> --- a/drivers/md/dm-table.c
> +++ b/drivers/md/dm-table.c
> @@ -1836,6 +1836,7 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
>  			      struct queue_limits *limits)
>  {
>  	int r;
> +	struct queue_limits old_limits;
>  
>  	if (!dm_table_supports_nowait(t))
>  		limits->features &= ~BLK_FEAT_NOWAIT;
> @@ -1862,16 +1863,11 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
>  	if (dm_table_supports_flush(t))
>  		limits->features |= BLK_FEAT_WRITE_CACHE | BLK_FEAT_FUA;
>  
> -	if (dm_table_supports_dax(t, device_not_dax_capable)) {
> +	if (dm_table_supports_dax(t, device_not_dax_capable))
>  		limits->features |= BLK_FEAT_DAX;
> -		if (dm_table_supports_dax(t, device_not_dax_synchronous_capable))
> -			set_dax_synchronous(t->md->dax_dev);
> -	} else
> +	else
>  		limits->features &= ~BLK_FEAT_DAX;
>  
> -	if (dm_table_any_dev_attr(t, device_dax_write_cache_enabled, NULL))
> -		dax_write_cache(t->md->dax_dev, true);
> -
>  	/* For a zoned table, setup the zone related queue attributes. */
>  	if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) &&
>  	    (limits->features & BLK_FEAT_ZONED)) {
> @@ -1883,6 +1879,7 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
>  	if (dm_table_supports_atomic_writes(t))
>  		limits->features |= BLK_FEAT_ATOMIC_WRITES;
>  
> +	old_limits = q->limits;

I am not sure this is safe to do like this since the user may be simultaneously
changing attributes, which would result in the old_limits struct being in an
inconsistent state. So shouldn't we hold q->limits_lock here ? We probably want
a queue_limits_get() helper for that though.

>  	r = queue_limits_set(q, limits);

...Or, we could modify queue_limits_set() to also return the old limit struct
under the q limits_lock. That maybe easier.

>  	if (r)
>  		return r;
> @@ -1894,10 +1891,21 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
>  	if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) &&
>  	    (limits->features & BLK_FEAT_ZONED)) {
>  		r = dm_revalidate_zones(t, q);
> -		if (r)
> +		if (r) {
> +			queue_limits_set(q, &old_limits);
>  			return r;
> +		}
>  	}
>  
> +	if (IS_ENABLED(CONFIG_BLK_DEV_ZONED))
> +		dm_finalize_zone_settings(t, limits);
> +
> +	if (dm_table_supports_dax(t, device_not_dax_synchronous_capable))
> +		set_dax_synchronous(t->md->dax_dev);
> +
> +	if (dm_table_any_dev_attr(t, device_dax_write_cache_enabled, NULL))
> +		dax_write_cache(t->md->dax_dev, true);
> +
>  	dm_update_crypto_profile(q, t);
>  	return 0;
>  }
> diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
> index 20edd3fabbab..681058feb63b 100644
> --- a/drivers/md/dm-zone.c
> +++ b/drivers/md/dm-zone.c
> @@ -340,12 +340,8 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
>  	 * mapped device queue as needing zone append emulation.
>  	 */
>  	WARN_ON_ONCE(queue_is_mq(q));
> -	if (dm_table_supports_zone_append(t)) {
> -		clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
> -	} else {
> -		set_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
> +	if (!dm_table_supports_zone_append(t))
>  		lim->max_hw_zone_append_sectors = 0;
> -	}
>  
>  	/*
>  	 * Determine the max open and max active zone limits for the mapped
> @@ -383,9 +379,6 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
>  		lim->zone_write_granularity = 0;
>  		lim->chunk_sectors = 0;
>  		lim->features &= ~BLK_FEAT_ZONED;
> -		clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
> -		md->nr_zones = 0;
> -		disk->nr_zones = 0;
>  		return 0;
>  	}
>  
> @@ -408,6 +401,23 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
>  	return 0;
>  }
>  
> +void dm_finalize_zone_settings(struct dm_table *t, struct queue_limits *lim)
> +{
> +	struct mapped_device *md = t->md;
> +
> +	if (lim->features & BLK_FEAT_ZONED) {
> +		if (dm_table_supports_zone_append(t))
> +			clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
> +		else
> +			set_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
> +	} else {
> +		clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
> +		md->nr_zones = 0;
> +		md->disk->nr_zones = 0;
> +	}
> +}
> +
> +
>  /*
>   * IO completion callback called from clone_endio().
>   */
> diff --git a/drivers/md/dm.h b/drivers/md/dm.h
> index a0a8ff119815..e5d3a9f46a91 100644
> --- a/drivers/md/dm.h
> +++ b/drivers/md/dm.h
> @@ -102,6 +102,7 @@ int dm_setup_md_queue(struct mapped_device *md, struct dm_table *t);
>  int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
>  		struct queue_limits *lim);
>  int dm_revalidate_zones(struct dm_table *t, struct request_queue *q);
> +void dm_finalize_zone_settings(struct dm_table *t, struct queue_limits *lim);
>  void dm_zone_endio(struct dm_io *io, struct bio *clone);
>  #ifdef CONFIG_BLK_DEV_ZONED
>  int dm_blk_report_zones(struct gendisk *disk, sector_t sector,


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/7] dm: fix dm_blk_report_zones
  2025-03-09 22:29 ` [PATCH 4/7] dm: fix dm_blk_report_zones Benjamin Marzinski
@ 2025-03-09 23:27   ` Damien Le Moal
  0 siblings, 0 replies; 28+ messages in thread
From: Damien Le Moal @ 2025-03-09 23:27 UTC (permalink / raw)
  To: Benjamin Marzinski, Mikulas Patocka, Mike Snitzer, Jens Axboe
  Cc: dm-devel, linux-block, Christoph Hellwig

On 3/10/25 07:29, Benjamin Marzinski wrote:
> If dm_get_live_table() returned NULL, dm_put_live_table() was never
> called.  Also, if md->zone_revalidate_map is set, check that
> dm_blk_report_zones() is being called from the process that set it in
> __bind(). Otherwise the zone resources could change while accessing
> them. Finally, it is possible that md->zone_revalidate_map will change
> while calling this function. Only read it once, so that we are always
> using the same value. Otherwise we might miss a call to
> dm_put_live_table().
> 
> Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>

Looks good to me.

Reviewed-by: Damien Le Moal <dlemoal@kernel.org>

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 5/7] blk-zoned: clean up zone settings for devices without zwplugs
  2025-03-09 22:29 ` [PATCH 5/7] blk-zoned: clean up zone settings for devices without zwplugs Benjamin Marzinski
@ 2025-03-09 23:31   ` Damien Le Moal
  0 siblings, 0 replies; 28+ messages in thread
From: Damien Le Moal @ 2025-03-09 23:31 UTC (permalink / raw)
  To: Benjamin Marzinski, Mikulas Patocka, Mike Snitzer, Jens Axboe
  Cc: dm-devel, linux-block, Christoph Hellwig

On 3/10/25 07:29, Benjamin Marzinski wrote:
> Previously disk_free_zone_resources() would only clean up zoned settings
> on a disk if the disk had write plugs allocated. Make it clean up zoned
> settings on any disk, so disks that don't allocate write plugs can use
> it as well.
> 
> Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>

Looks OK to me, but as commented in the cover letter, if we do not allow
swtiching tables for a zoned device, then I do not think this patch is needed.

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 7/7] dm: allow devices to revalidate existing zones
  2025-03-09 22:29 ` [RFC PATCH 7/7] dm: allow devices to revalidate existing zones Benjamin Marzinski
@ 2025-03-09 23:59   ` Damien Le Moal
  2025-03-10 17:43     ` Benjamin Marzinski
  0 siblings, 1 reply; 28+ messages in thread
From: Damien Le Moal @ 2025-03-09 23:59 UTC (permalink / raw)
  To: Benjamin Marzinski, Mikulas Patocka, Mike Snitzer, Jens Axboe
  Cc: dm-devel, linux-block, Christoph Hellwig

On 3/10/25 07:29, Benjamin Marzinski wrote:
> dm_revalidate_zones() only allowed devices that had no zone resources
> set up to call blk_revalidate_disk_zones(). If the device already had
> zone resources, disk->nr_zones would always equal md->nr_zones so
> dm_revalidate_zones() returned without doing any work. Instead, always
> call blk_revalidate_disk_zones() if you are loading a new zoned table.
> 
> However, if the device emulates zone append operations and already has
> zone append emulation resources, the table size cannot change when
> loading a new table. Otherwise, all those resources will be garbage.
> 
> If emulated zone append operations are needed and the zone write pointer
> offsets of the new table do not match those of the old table, writes to
> the device will still fail. This patch allows users to safely grow and
> shrink zone devices. But swapping arbitrary zoned tables will still not
> work.

I do not think that this patch correctly address the shrinking of dm zoned
device: blk_revalidate_disk_zones() will look at a smaller set of zones, which
will leave already hashed zone write plugs outside of that new zone range in the
disk zwplug hash table. disk_revalidate_zone_resources() does not cleanup and
reallocate the hash table if the number of zones shrinks. For a physical drive,
this can only happen if the drive is reformatted with some magic vendor unique
command, which is why this was never implemented as that is not a valid
production use case.

To make things simpler, I think we should allow growing/shrinking zoned device
tables, and much less swapping tables between zoned and not-zoned. I am more
inclined to avoid all these corner cases by simply not supporting table
switching for zoned device. That would be much safer I think.
No-one complained about any issue with table switching until now, which likely
means that no-one is using this. So what about simply returning an error for
table switching for a zoned device ? If someone request this support, we can
revisit this.

> 
> Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
> ---
>  drivers/md/dm-zone.c | 23 +++++++++++++----------
>  1 file changed, 13 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
> index ac86011640c3..7e9ebeee7eac 100644
> --- a/drivers/md/dm-zone.c
> +++ b/drivers/md/dm-zone.c
> @@ -164,16 +164,8 @@ int dm_revalidate_zones(struct dm_table *t, struct request_queue *q)
>  	if (!get_capacity(disk))
>  		return 0;
>  
> -	/* Revalidate only if something changed. */
> -	if (!disk->nr_zones || disk->nr_zones != md->nr_zones) {
> -		DMINFO("%s using %s zone append",
> -		       disk->disk_name,
> -		       queue_emulates_zone_append(q) ? "emulated" : "native");
> -		md->nr_zones = 0;
> -	}
> -
> -	if (md->nr_zones)
> -		return 0;
> +	DMINFO("%s using %s zone append", disk->disk_name,
> +	       queue_emulates_zone_append(q) ? "emulated" : "native");
>  
>  	/*
>  	 * Our table is not live yet. So the call to dm_get_live_table()
> @@ -392,6 +384,17 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
>  		return 0;
>  	}
>  
> +	/*
> +	 * If the device needs zone append emulation, and the device already has
> +	 * zone append emulation resources, make sure that the chunk_sectors
> +	 * hasn't changed size. Otherwise those resources will be garbage.
> +	 */
> +	if (!lim->max_hw_zone_append_sectors && disk->zone_wplugs_hash &&
> +	    q->limits.chunk_sectors != lim->chunk_sectors) {
> +		DMERR("Cannot change zone size when swapping tables");
> +		return -EINVAL;
> +	}
> +
>  	/*
>  	 * Warn once (when the capacity is not yet set) if the mapped device is
>  	 * partially using zone resources of the target devices as that leads to


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 0/7] dm: fix issues with swapping dm tables
  2025-03-09 23:16 ` [RFC PATCH 0/7] dm: fix issues with swapping dm tables Damien Le Moal
@ 2025-03-10 16:38   ` Benjamin Marzinski
  2025-03-10 23:13     ` Damien Le Moal
  0 siblings, 1 reply; 28+ messages in thread
From: Benjamin Marzinski @ 2025-03-10 16:38 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Mikulas Patocka, Mike Snitzer, Jens Axboe, dm-devel, linux-block,
	Christoph Hellwig

On Mon, Mar 10, 2025 at 08:16:43AM +0900, Damien Le Moal wrote:
> On 3/10/25 07:28, Benjamin Marzinski wrote:
> > There were multiple places in dm's __bind() function where it could fail
> > and not completely roll back, leaving the device using the the old
> > table, but with device limits and resources from the new table.
> > Additionally, unused mempools for request-based devices were not always
> > freed immediately.
> > 
> > Finally, there were a number of issues with switching zoned tables that
> > emulate zone append (in other words, dm-crypt on top of zoned devices).
> > dm_blk_report_zones() could be called while the device was suspended and
> > modifying zoned resources or could possibly fail to end a srcu read
> > section.  More importantly, blk_revalidate_disk_zones() would never get
> > called when updating a zoned table. This could cause the dm device to
> > see the wrong zone write offsets, not have a large enough zwplugs
> > reserved in its mempool, or read invalid memory when checking the
> > conventional zones bitmap.
> > 
> > This patchset fixes these issues. It does not make it so that
> > device-mapper is able to load any zoned table from any other zoned
> > table. Zoned dm-crypt devices can be safely grown and shrunk, but
> > reloading a zoned dm-crypt device to, for instance, point at a
> > completely different underlying device won't work correctly. IO might
> > fail since the zone write offsets of the dm-crypt device will not be
> > updated for all the existing zones with plugs. If the new device's zone
> > offsets don't match the old device's offsets, IO to the zone will fail.
> > If the ability to switch tables from a zoned dm-crypt device to an
> > abritry other zoned dm-crypt device is important to people, it could be
> > done as long as there are no plugged zones when dm suspends.
> 
> Thanks for fixing this.
> 
> Given that in the general case switching tables will always likely result in
> unaligned write errors, I think we should just report a ENOTSUPP error if the
> user attempts to swap tables.

If we don't think there's any interest in growing or shrinking zoned
dm-crypt devices, that's fine.  I do think we should make an exception
for switching to the dm-error target. We specifically call that out with
DM_TARGET_WILDCARD so that we can always switch to it from any table if
we just want to fail out all the IO.

-Ben
 
> > This patchset also doesn't touch the code for switching from a zoned to
> > a non-zoned device. Switching from a zoned dm-crypt device to a
> > non-zoned device is problematic if there are plugged zones, since the
> > underlying device will not be prepared to handle these plugged writes.
> > Switching to a target that does not pass down IOs, like the dm-error
> > target, is fine. So is switching when there are no plugged zones, except
> > that we do not free the zoned resources in this case, even though we
> > safely could.
> 
> This is another case that does not make much sense in practice. So instead of
> still allowing it knowing that it most likely will not work, we should return
> ENOTSUPP here too I think.
> 
> > If people are interested in removing some of these limitations, I can
> > send patches for them, but I'm not sure how much extra code we want,
> > just to support niche zoned dm-crypt reloads.
> 
> I have never heard any complaints/bug reports from people attempting this. Which
> likely means that no-one is needing/trying to do this. So as mentionned above,
> we should make sure that this feature is not reported as not supported with a
> ENOTSUPP error, and maybe a warning.
> 
> 
> 
> -- 
> Damien Le Moal
> Western Digital Research


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/7] dm: don't change md if dm_table_set_restrictions() fails
  2025-03-09 23:18   ` Damien Le Moal
@ 2025-03-10 16:39     ` Benjamin Marzinski
  0 siblings, 0 replies; 28+ messages in thread
From: Benjamin Marzinski @ 2025-03-10 16:39 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Mikulas Patocka, Mike Snitzer, Jens Axboe, dm-devel, linux-block,
	Christoph Hellwig

On Mon, Mar 10, 2025 at 08:18:31AM +0900, Damien Le Moal wrote:
> On 3/10/25 07:28, Benjamin Marzinski wrote:
> > __bind was changing the disk capacity, geometry and mempools of the
> > mapped device before calling dm_table_set_restrictions() which could
> > fail, forcing dm to drop the new table. Failing here would leave the
> > device using the old table but with the wrong capacity and mempools.
> > 
> > Move dm_table_set_restrictions() earlier in __bind(). Since it needs the
> > capacity to be set, save the old version and restore it on failure.
> > 
> > Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
> 
> Does this need a "Fixes" tag maybe ?

Yeah. I can go through and add the appropriate Fixes tags.

-Ben

> 
> Otherwise looks good to me.
> 
> Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
> 
> -- 
> Damien Le Moal
> Western Digital Research


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/7] dm: handle failures in dm_table_set_restrictions
  2025-03-09 23:25   ` Damien Le Moal
@ 2025-03-10 17:37     ` Benjamin Marzinski
  2025-03-10 18:15       ` Benjamin Marzinski
  2025-03-10 23:16       ` Damien Le Moal
  0 siblings, 2 replies; 28+ messages in thread
From: Benjamin Marzinski @ 2025-03-10 17:37 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Mikulas Patocka, Mike Snitzer, Jens Axboe, dm-devel, linux-block,
	Christoph Hellwig

On Mon, Mar 10, 2025 at 08:25:39AM +0900, Damien Le Moal wrote:
> On 3/10/25 07:28, Benjamin Marzinski wrote:
> > If dm_table_set_restrictions() fails while swapping tables,
> > device-mapper will continue using the previous table. It must be sure to
> > leave the mapped_device in it's previous state on failure.  Otherwise
> > device-mapper could end up using the old table with settings from the
> > unused table.
> > 
> > Do not update the mapped device in dm_set_zones_restrictions(). Wait
> > till after dm_table_set_restrictions() is sure to succeed to update the
> > md zoned settings. Do the same with the dax settings, and if
> > dm_revalidate_zones() fails, restore the original queue limits.
> > 
> > Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
> > ---
> >  drivers/md/dm-table.c | 24 ++++++++++++++++--------
> >  drivers/md/dm-zone.c  | 26 ++++++++++++++++++--------
> >  drivers/md/dm.h       |  1 +
> >  3 files changed, 35 insertions(+), 16 deletions(-)
> > 
> > diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> > index 0ef5203387b2..4003e84af11d 100644
> > --- a/drivers/md/dm-table.c
> > +++ b/drivers/md/dm-table.c
> > @@ -1836,6 +1836,7 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
> >  			      struct queue_limits *limits)
> >  {
> >  	int r;
> > +	struct queue_limits old_limits;
> >  
> >  	if (!dm_table_supports_nowait(t))
> >  		limits->features &= ~BLK_FEAT_NOWAIT;
> > @@ -1862,16 +1863,11 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
> >  	if (dm_table_supports_flush(t))
> >  		limits->features |= BLK_FEAT_WRITE_CACHE | BLK_FEAT_FUA;
> >  
> > -	if (dm_table_supports_dax(t, device_not_dax_capable)) {
> > +	if (dm_table_supports_dax(t, device_not_dax_capable))
> >  		limits->features |= BLK_FEAT_DAX;
> > -		if (dm_table_supports_dax(t, device_not_dax_synchronous_capable))
> > -			set_dax_synchronous(t->md->dax_dev);
> > -	} else
> > +	else
> >  		limits->features &= ~BLK_FEAT_DAX;
> >  
> > -	if (dm_table_any_dev_attr(t, device_dax_write_cache_enabled, NULL))
> > -		dax_write_cache(t->md->dax_dev, true);
> > -
> >  	/* For a zoned table, setup the zone related queue attributes. */
> >  	if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) &&
> >  	    (limits->features & BLK_FEAT_ZONED)) {
> > @@ -1883,6 +1879,7 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
> >  	if (dm_table_supports_atomic_writes(t))
> >  		limits->features |= BLK_FEAT_ATOMIC_WRITES;
> >  
> > +	old_limits = q->limits;
> 
> I am not sure this is safe to do like this since the user may be simultaneously
> changing attributes, which would result in the old_limits struct being in an
> inconsistent state. So shouldn't we hold q->limits_lock here ? We probably want
> a queue_limits_get() helper for that though.
> 
> >  	r = queue_limits_set(q, limits);
> 
> ...Or, we could modify queue_limits_set() to also return the old limit struct
> under the q limits_lock. That maybe easier.

If we disallow switching between zoned devices then this is unnecssary.
OTherwise you're right. We do want to make sure that we don't grep the
limits while something is updating the limits.

Unfortunately, thinking about this just made me realize a different
problem, that has nothing to do with this patchset. bio-based devices
can't handle freezing the queue while there are plugged zone write bios.
So, for instance, if you do something like:

# modprobe scsi_debug dev_size_mb=512 zbc=managed zone_size_mb=128 zone_nr_conv=0 delay=20
# dmsetup create test --table "0 1048576 crypt aes-cbc-essiv:sha256 deadbeefdeadbeefdeadbeefdeadbeef 0 /dev/sda 0"
# dd if=/dev/zero of=/dev/mapper/test bs=1M count=128 &
# echo 0 > /sys/block/dm-1/queue/iostats

you hang.

*** FREEZING THE QUEUE WHILE UPDATING THE LIMIT ***
root@fedora-test:~# cat /proc/1344/stack 
[<0>] blk_mq_freeze_queue_wait+0x9f/0xd0
[<0>] queue_limits_commit_update_frozen+0x38/0x60
[<0>] queue_attr_store+0xc6/0x1b0
[<0>] kernfs_fop_write_iter+0x13b/0x1f0
[<0>] vfs_write+0x28d/0x450
[<0>] ksys_write+0x6c/0xe0
[<0>] do_syscall_64+0x82/0x160
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e

**** WRITING OUT THE PLUGGED BIOS [kworker/0:1H+dm-1_zwplugs] ****
root@fedora-test:~# cat /proc/341/stack
[<0>] __bio_queue_enter+0x13c/0x270
[<0>] __submit_bio+0xf9/0x280
[<0>] submit_bio_noacct_nocheck+0x197/0x3e0
[<0>] blk_zone_wplug_bio_work+0x1ad/0x1f0
[<0>] process_one_work+0x176/0x330
[<0>] worker_thread+0x252/0x390
[<0>] kthread+0xec/0x230
[<0>] ret_from_fork+0x31/0x50
[<0>] ret_from_fork_asm+0x1a/0x30

Changing the iostats limit could be swapped with anything that freezes
the queue.

-Ben

> 
> >  	if (r)
> >  		return r;
> > @@ -1894,10 +1891,21 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
> >  	if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) &&
> >  	    (limits->features & BLK_FEAT_ZONED)) {
> >  		r = dm_revalidate_zones(t, q);
> > -		if (r)
> > +		if (r) {
> > +			queue_limits_set(q, &old_limits);
> >  			return r;
> > +		}
> >  	}
> >  
> > +	if (IS_ENABLED(CONFIG_BLK_DEV_ZONED))
> > +		dm_finalize_zone_settings(t, limits);
> > +
> > +	if (dm_table_supports_dax(t, device_not_dax_synchronous_capable))
> > +		set_dax_synchronous(t->md->dax_dev);
> > +
> > +	if (dm_table_any_dev_attr(t, device_dax_write_cache_enabled, NULL))
> > +		dax_write_cache(t->md->dax_dev, true);
> > +
> >  	dm_update_crypto_profile(q, t);
> >  	return 0;
> >  }
> > diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
> > index 20edd3fabbab..681058feb63b 100644
> > --- a/drivers/md/dm-zone.c
> > +++ b/drivers/md/dm-zone.c
> > @@ -340,12 +340,8 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
> >  	 * mapped device queue as needing zone append emulation.
> >  	 */
> >  	WARN_ON_ONCE(queue_is_mq(q));
> > -	if (dm_table_supports_zone_append(t)) {
> > -		clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
> > -	} else {
> > -		set_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
> > +	if (!dm_table_supports_zone_append(t))
> >  		lim->max_hw_zone_append_sectors = 0;
> > -	}
> >  
> >  	/*
> >  	 * Determine the max open and max active zone limits for the mapped
> > @@ -383,9 +379,6 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
> >  		lim->zone_write_granularity = 0;
> >  		lim->chunk_sectors = 0;
> >  		lim->features &= ~BLK_FEAT_ZONED;
> > -		clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
> > -		md->nr_zones = 0;
> > -		disk->nr_zones = 0;
> >  		return 0;
> >  	}
> >  
> > @@ -408,6 +401,23 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
> >  	return 0;
> >  }
> >  
> > +void dm_finalize_zone_settings(struct dm_table *t, struct queue_limits *lim)
> > +{
> > +	struct mapped_device *md = t->md;
> > +
> > +	if (lim->features & BLK_FEAT_ZONED) {
> > +		if (dm_table_supports_zone_append(t))
> > +			clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
> > +		else
> > +			set_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
> > +	} else {
> > +		clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
> > +		md->nr_zones = 0;
> > +		md->disk->nr_zones = 0;
> > +	}
> > +}
> > +
> > +
> >  /*
> >   * IO completion callback called from clone_endio().
> >   */
> > diff --git a/drivers/md/dm.h b/drivers/md/dm.h
> > index a0a8ff119815..e5d3a9f46a91 100644
> > --- a/drivers/md/dm.h
> > +++ b/drivers/md/dm.h
> > @@ -102,6 +102,7 @@ int dm_setup_md_queue(struct mapped_device *md, struct dm_table *t);
> >  int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
> >  		struct queue_limits *lim);
> >  int dm_revalidate_zones(struct dm_table *t, struct request_queue *q);
> > +void dm_finalize_zone_settings(struct dm_table *t, struct queue_limits *lim);
> >  void dm_zone_endio(struct dm_io *io, struct bio *clone);
> >  #ifdef CONFIG_BLK_DEV_ZONED
> >  int dm_blk_report_zones(struct gendisk *disk, sector_t sector,
> 
> 
> -- 
> Damien Le Moal
> Western Digital Research


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 7/7] dm: allow devices to revalidate existing zones
  2025-03-09 23:59   ` Damien Le Moal
@ 2025-03-10 17:43     ` Benjamin Marzinski
  2025-03-10 23:19       ` Damien Le Moal
  0 siblings, 1 reply; 28+ messages in thread
From: Benjamin Marzinski @ 2025-03-10 17:43 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Mikulas Patocka, Mike Snitzer, Jens Axboe, dm-devel, linux-block,
	Christoph Hellwig

On Mon, Mar 10, 2025 at 08:59:26AM +0900, Damien Le Moal wrote:
> On 3/10/25 07:29, Benjamin Marzinski wrote:
> > dm_revalidate_zones() only allowed devices that had no zone resources
> > set up to call blk_revalidate_disk_zones(). If the device already had
> > zone resources, disk->nr_zones would always equal md->nr_zones so
> > dm_revalidate_zones() returned without doing any work. Instead, always
> > call blk_revalidate_disk_zones() if you are loading a new zoned table.
> > 
> > However, if the device emulates zone append operations and already has
> > zone append emulation resources, the table size cannot change when
> > loading a new table. Otherwise, all those resources will be garbage.
> > 
> > If emulated zone append operations are needed and the zone write pointer
> > offsets of the new table do not match those of the old table, writes to
> > the device will still fail. This patch allows users to safely grow and
> > shrink zone devices. But swapping arbitrary zoned tables will still not
> > work.
> 
> I do not think that this patch correctly address the shrinking of dm zoned
> device: blk_revalidate_disk_zones() will look at a smaller set of zones, which
> will leave already hashed zone write plugs outside of that new zone range in the
> disk zwplug hash table. disk_revalidate_zone_resources() does not cleanup and
> reallocate the hash table if the number of zones shrinks.

This is necessary for DM. There could be plugged bios that are on on
these no longer in-range zones.  They will obviously fail when they get
reissued, but we need to keep the plugs around so that they *do* get
reissued. A cleaner alternative would be to add code to immediately
error out all the plugged bios on shrinks, but I was trying to avoid
adding a bunch of code to deal with these cases (of course simply
disallowing them adds even less code).

-Ben

> For a physical drive,
> this can only happen if the drive is reformatted with some magic vendor unique
> command, which is why this was never implemented as that is not a valid
> production use case.
> 
> To make things simpler, I think we should allow growing/shrinking zoned device
> tables, and much less swapping tables between zoned and not-zoned. I am more
> inclined to avoid all these corner cases by simply not supporting table
> switching for zoned device. That would be much safer I think.
> No-one complained about any issue with table switching until now, which likely
> means that no-one is using this. So what about simply returning an error for
> table switching for a zoned device ? If someone request this support, we can
> revisit this.
> 
> > 
> > Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
> > ---
> >  drivers/md/dm-zone.c | 23 +++++++++++++----------
> >  1 file changed, 13 insertions(+), 10 deletions(-)
> > 
> > diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
> > index ac86011640c3..7e9ebeee7eac 100644
> > --- a/drivers/md/dm-zone.c
> > +++ b/drivers/md/dm-zone.c
> > @@ -164,16 +164,8 @@ int dm_revalidate_zones(struct dm_table *t, struct request_queue *q)
> >  	if (!get_capacity(disk))
> >  		return 0;
> >  
> > -	/* Revalidate only if something changed. */
> > -	if (!disk->nr_zones || disk->nr_zones != md->nr_zones) {
> > -		DMINFO("%s using %s zone append",
> > -		       disk->disk_name,
> > -		       queue_emulates_zone_append(q) ? "emulated" : "native");
> > -		md->nr_zones = 0;
> > -	}
> > -
> > -	if (md->nr_zones)
> > -		return 0;
> > +	DMINFO("%s using %s zone append", disk->disk_name,
> > +	       queue_emulates_zone_append(q) ? "emulated" : "native");
> >  
> >  	/*
> >  	 * Our table is not live yet. So the call to dm_get_live_table()
> > @@ -392,6 +384,17 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
> >  		return 0;
> >  	}
> >  
> > +	/*
> > +	 * If the device needs zone append emulation, and the device already has
> > +	 * zone append emulation resources, make sure that the chunk_sectors
> > +	 * hasn't changed size. Otherwise those resources will be garbage.
> > +	 */
> > +	if (!lim->max_hw_zone_append_sectors && disk->zone_wplugs_hash &&
> > +	    q->limits.chunk_sectors != lim->chunk_sectors) {
> > +		DMERR("Cannot change zone size when swapping tables");
> > +		return -EINVAL;
> > +	}
> > +
> >  	/*
> >  	 * Warn once (when the capacity is not yet set) if the mapped device is
> >  	 * partially using zone resources of the target devices as that leads to
> 
> 
> -- 
> Damien Le Moal
> Western Digital Research


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/7] dm: handle failures in dm_table_set_restrictions
  2025-03-10 17:37     ` Benjamin Marzinski
@ 2025-03-10 18:15       ` Benjamin Marzinski
  2025-03-10 23:27         ` Damien Le Moal
  2025-03-10 23:16       ` Damien Le Moal
  1 sibling, 1 reply; 28+ messages in thread
From: Benjamin Marzinski @ 2025-03-10 18:15 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Mikulas Patocka, Mike Snitzer, Jens Axboe, dm-devel, linux-block,
	Christoph Hellwig

On Mon, Mar 10, 2025 at 01:37:19PM -0400, Benjamin Marzinski wrote:
> On Mon, Mar 10, 2025 at 08:25:39AM +0900, Damien Le Moal wrote:
> > On 3/10/25 07:28, Benjamin Marzinski wrote:
> > > If dm_table_set_restrictions() fails while swapping tables,
> > > device-mapper will continue using the previous table. It must be sure to
> > > leave the mapped_device in it's previous state on failure.  Otherwise
> > > device-mapper could end up using the old table with settings from the
> > > unused table.
> > > 
> > > Do not update the mapped device in dm_set_zones_restrictions(). Wait
> > > till after dm_table_set_restrictions() is sure to succeed to update the
> > > md zoned settings. Do the same with the dax settings, and if
> > > dm_revalidate_zones() fails, restore the original queue limits.
> > > 
> > > Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
> > > ---
> > >  drivers/md/dm-table.c | 24 ++++++++++++++++--------
> > >  drivers/md/dm-zone.c  | 26 ++++++++++++++++++--------
> > >  drivers/md/dm.h       |  1 +
> > >  3 files changed, 35 insertions(+), 16 deletions(-)
> > > 
> > > diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> > > index 0ef5203387b2..4003e84af11d 100644
> > > --- a/drivers/md/dm-table.c
> > > +++ b/drivers/md/dm-table.c
> > > @@ -1836,6 +1836,7 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
> > >  			      struct queue_limits *limits)
> > >  {
> > >  	int r;
> > > +	struct queue_limits old_limits;
> > >  
> > >  	if (!dm_table_supports_nowait(t))
> > >  		limits->features &= ~BLK_FEAT_NOWAIT;
> > > @@ -1862,16 +1863,11 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
> > >  	if (dm_table_supports_flush(t))
> > >  		limits->features |= BLK_FEAT_WRITE_CACHE | BLK_FEAT_FUA;
> > >  
> > > -	if (dm_table_supports_dax(t, device_not_dax_capable)) {
> > > +	if (dm_table_supports_dax(t, device_not_dax_capable))
> > >  		limits->features |= BLK_FEAT_DAX;
> > > -		if (dm_table_supports_dax(t, device_not_dax_synchronous_capable))
> > > -			set_dax_synchronous(t->md->dax_dev);
> > > -	} else
> > > +	else
> > >  		limits->features &= ~BLK_FEAT_DAX;
> > >  
> > > -	if (dm_table_any_dev_attr(t, device_dax_write_cache_enabled, NULL))
> > > -		dax_write_cache(t->md->dax_dev, true);
> > > -
> > >  	/* For a zoned table, setup the zone related queue attributes. */
> > >  	if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) &&
> > >  	    (limits->features & BLK_FEAT_ZONED)) {
> > > @@ -1883,6 +1879,7 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
> > >  	if (dm_table_supports_atomic_writes(t))
> > >  		limits->features |= BLK_FEAT_ATOMIC_WRITES;
> > >  
> > > +	old_limits = q->limits;
> > 
> > I am not sure this is safe to do like this since the user may be simultaneously
> > changing attributes, which would result in the old_limits struct being in an
> > inconsistent state. So shouldn't we hold q->limits_lock here ? We probably want
> > a queue_limits_get() helper for that though.
> > 
> > >  	r = queue_limits_set(q, limits);
> > 
> > ...Or, we could modify queue_limits_set() to also return the old limit struct
> > under the q limits_lock. That maybe easier.
> 
> If we disallow switching between zoned devices then this is unnecssary.

Err.. nevermind that last line. There are still multiple cases where we
could still fail here and need to fail back to the earlier limits. But
I'm less sure that it's really necessary to lock the limits before
reading them. For DM devices, I don't see a place where a bunch of
limits could be updated at the same time, while we are swapping tables.
An individual limit could get updated by things like the sysfs
interface. But since that could happen at any time, I don't see what
locking gets us. And if it's not safe to simply read a limit without
locking them, then there are lots of places where we have unsafe code.
Am I missing something here?

-Ben

> OTherwise you're right. We do want to make sure that we don't grep the
> limits while something is updating the limits.
> 
> Unfortunately, thinking about this just made me realize a different
> problem, that has nothing to do with this patchset. bio-based devices
> can't handle freezing the queue while there are plugged zone write bios.
> So, for instance, if you do something like:
> 
> # modprobe scsi_debug dev_size_mb=512 zbc=managed zone_size_mb=128 zone_nr_conv=0 delay=20
> # dmsetup create test --table "0 1048576 crypt aes-cbc-essiv:sha256 deadbeefdeadbeefdeadbeefdeadbeef 0 /dev/sda 0"
> # dd if=/dev/zero of=/dev/mapper/test bs=1M count=128 &
> # echo 0 > /sys/block/dm-1/queue/iostats
> 
> you hang.
> 
> *** FREEZING THE QUEUE WHILE UPDATING THE LIMIT ***
> root@fedora-test:~# cat /proc/1344/stack 
> [<0>] blk_mq_freeze_queue_wait+0x9f/0xd0
> [<0>] queue_limits_commit_update_frozen+0x38/0x60
> [<0>] queue_attr_store+0xc6/0x1b0
> [<0>] kernfs_fop_write_iter+0x13b/0x1f0
> [<0>] vfs_write+0x28d/0x450
> [<0>] ksys_write+0x6c/0xe0
> [<0>] do_syscall_64+0x82/0x160
> [<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> 
> **** WRITING OUT THE PLUGGED BIOS [kworker/0:1H+dm-1_zwplugs] ****
> root@fedora-test:~# cat /proc/341/stack
> [<0>] __bio_queue_enter+0x13c/0x270
> [<0>] __submit_bio+0xf9/0x280
> [<0>] submit_bio_noacct_nocheck+0x197/0x3e0
> [<0>] blk_zone_wplug_bio_work+0x1ad/0x1f0
> [<0>] process_one_work+0x176/0x330
> [<0>] worker_thread+0x252/0x390
> [<0>] kthread+0xec/0x230
> [<0>] ret_from_fork+0x31/0x50
> [<0>] ret_from_fork_asm+0x1a/0x30
> 
> Changing the iostats limit could be swapped with anything that freezes
> the queue.
> 
> -Ben
> 
> > 
> > >  	if (r)
> > >  		return r;
> > > @@ -1894,10 +1891,21 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
> > >  	if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) &&
> > >  	    (limits->features & BLK_FEAT_ZONED)) {
> > >  		r = dm_revalidate_zones(t, q);
> > > -		if (r)
> > > +		if (r) {
> > > +			queue_limits_set(q, &old_limits);
> > >  			return r;
> > > +		}
> > >  	}
> > >  
> > > +	if (IS_ENABLED(CONFIG_BLK_DEV_ZONED))
> > > +		dm_finalize_zone_settings(t, limits);
> > > +
> > > +	if (dm_table_supports_dax(t, device_not_dax_synchronous_capable))
> > > +		set_dax_synchronous(t->md->dax_dev);
> > > +
> > > +	if (dm_table_any_dev_attr(t, device_dax_write_cache_enabled, NULL))
> > > +		dax_write_cache(t->md->dax_dev, true);
> > > +
> > >  	dm_update_crypto_profile(q, t);
> > >  	return 0;
> > >  }
> > > diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
> > > index 20edd3fabbab..681058feb63b 100644
> > > --- a/drivers/md/dm-zone.c
> > > +++ b/drivers/md/dm-zone.c
> > > @@ -340,12 +340,8 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
> > >  	 * mapped device queue as needing zone append emulation.
> > >  	 */
> > >  	WARN_ON_ONCE(queue_is_mq(q));
> > > -	if (dm_table_supports_zone_append(t)) {
> > > -		clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
> > > -	} else {
> > > -		set_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
> > > +	if (!dm_table_supports_zone_append(t))
> > >  		lim->max_hw_zone_append_sectors = 0;
> > > -	}
> > >  
> > >  	/*
> > >  	 * Determine the max open and max active zone limits for the mapped
> > > @@ -383,9 +379,6 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
> > >  		lim->zone_write_granularity = 0;
> > >  		lim->chunk_sectors = 0;
> > >  		lim->features &= ~BLK_FEAT_ZONED;
> > > -		clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
> > > -		md->nr_zones = 0;
> > > -		disk->nr_zones = 0;
> > >  		return 0;
> > >  	}
> > >  
> > > @@ -408,6 +401,23 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
> > >  	return 0;
> > >  }
> > >  
> > > +void dm_finalize_zone_settings(struct dm_table *t, struct queue_limits *lim)
> > > +{
> > > +	struct mapped_device *md = t->md;
> > > +
> > > +	if (lim->features & BLK_FEAT_ZONED) {
> > > +		if (dm_table_supports_zone_append(t))
> > > +			clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
> > > +		else
> > > +			set_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
> > > +	} else {
> > > +		clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
> > > +		md->nr_zones = 0;
> > > +		md->disk->nr_zones = 0;
> > > +	}
> > > +}
> > > +
> > > +
> > >  /*
> > >   * IO completion callback called from clone_endio().
> > >   */
> > > diff --git a/drivers/md/dm.h b/drivers/md/dm.h
> > > index a0a8ff119815..e5d3a9f46a91 100644
> > > --- a/drivers/md/dm.h
> > > +++ b/drivers/md/dm.h
> > > @@ -102,6 +102,7 @@ int dm_setup_md_queue(struct mapped_device *md, struct dm_table *t);
> > >  int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
> > >  		struct queue_limits *lim);
> > >  int dm_revalidate_zones(struct dm_table *t, struct request_queue *q);
> > > +void dm_finalize_zone_settings(struct dm_table *t, struct queue_limits *lim);
> > >  void dm_zone_endio(struct dm_io *io, struct bio *clone);
> > >  #ifdef CONFIG_BLK_DEV_ZONED
> > >  int dm_blk_report_zones(struct gendisk *disk, sector_t sector,
> > 
> > 
> > -- 
> > Damien Le Moal
> > Western Digital Research
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 0/7] dm: fix issues with swapping dm tables
  2025-03-10 16:38   ` Benjamin Marzinski
@ 2025-03-10 23:13     ` Damien Le Moal
  0 siblings, 0 replies; 28+ messages in thread
From: Damien Le Moal @ 2025-03-10 23:13 UTC (permalink / raw)
  To: Benjamin Marzinski
  Cc: Mikulas Patocka, Mike Snitzer, Jens Axboe, dm-devel, linux-block,
	Christoph Hellwig

On 3/11/25 01:38, Benjamin Marzinski wrote:
> On Mon, Mar 10, 2025 at 08:16:43AM +0900, Damien Le Moal wrote:
>> On 3/10/25 07:28, Benjamin Marzinski wrote:
>>> There were multiple places in dm's __bind() function where it could fail
>>> and not completely roll back, leaving the device using the the old
>>> table, but with device limits and resources from the new table.
>>> Additionally, unused mempools for request-based devices were not always
>>> freed immediately.
>>>
>>> Finally, there were a number of issues with switching zoned tables that
>>> emulate zone append (in other words, dm-crypt on top of zoned devices).
>>> dm_blk_report_zones() could be called while the device was suspended and
>>> modifying zoned resources or could possibly fail to end a srcu read
>>> section.  More importantly, blk_revalidate_disk_zones() would never get
>>> called when updating a zoned table. This could cause the dm device to
>>> see the wrong zone write offsets, not have a large enough zwplugs
>>> reserved in its mempool, or read invalid memory when checking the
>>> conventional zones bitmap.
>>>
>>> This patchset fixes these issues. It does not make it so that
>>> device-mapper is able to load any zoned table from any other zoned
>>> table. Zoned dm-crypt devices can be safely grown and shrunk, but
>>> reloading a zoned dm-crypt device to, for instance, point at a
>>> completely different underlying device won't work correctly. IO might
>>> fail since the zone write offsets of the dm-crypt device will not be
>>> updated for all the existing zones with plugs. If the new device's zone
>>> offsets don't match the old device's offsets, IO to the zone will fail.
>>> If the ability to switch tables from a zoned dm-crypt device to an
>>> abritry other zoned dm-crypt device is important to people, it could be
>>> done as long as there are no plugged zones when dm suspends.
>>
>> Thanks for fixing this.
>>
>> Given that in the general case switching tables will always likely result in
>> unaligned write errors, I think we should just report a ENOTSUPP error if the
>> user attempts to swap tables.
> 
> If we don't think there's any interest in growing or shrinking zoned
> dm-crypt devices, that's fine.  I do think we should make an exception
> for switching to the dm-error target. We specifically call that out with
> DM_TARGET_WILDCARD so that we can always switch to it from any table if
> we just want to fail out all the IO.

Arg ! dm-error is used in xfstests so we need it (for btrfs at least since btrfs
supports zoned devices, and soon xfs as well). So I guess we should disallow
switching tables when the new table changes something to the zone configuration
(grow, shrink, zone size, zoned/non-zoned). dm-error does not change anything,
so we should still be able to allow it.



-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/7] dm: handle failures in dm_table_set_restrictions
  2025-03-10 17:37     ` Benjamin Marzinski
  2025-03-10 18:15       ` Benjamin Marzinski
@ 2025-03-10 23:16       ` Damien Le Moal
  1 sibling, 0 replies; 28+ messages in thread
From: Damien Le Moal @ 2025-03-10 23:16 UTC (permalink / raw)
  To: Benjamin Marzinski
  Cc: Mikulas Patocka, Mike Snitzer, Jens Axboe, dm-devel, linux-block,
	Christoph Hellwig

On 3/11/25 02:37, Benjamin Marzinski wrote:
> On Mon, Mar 10, 2025 at 08:25:39AM +0900, Damien Le Moal wrote:
>> On 3/10/25 07:28, Benjamin Marzinski wrote:
>>> If dm_table_set_restrictions() fails while swapping tables,
>>> device-mapper will continue using the previous table. It must be sure to
>>> leave the mapped_device in it's previous state on failure.  Otherwise
>>> device-mapper could end up using the old table with settings from the
>>> unused table.
>>>
>>> Do not update the mapped device in dm_set_zones_restrictions(). Wait
>>> till after dm_table_set_restrictions() is sure to succeed to update the
>>> md zoned settings. Do the same with the dax settings, and if
>>> dm_revalidate_zones() fails, restore the original queue limits.
>>>
>>> Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
>>> ---
>>>  drivers/md/dm-table.c | 24 ++++++++++++++++--------
>>>  drivers/md/dm-zone.c  | 26 ++++++++++++++++++--------
>>>  drivers/md/dm.h       |  1 +
>>>  3 files changed, 35 insertions(+), 16 deletions(-)
>>>
>>> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
>>> index 0ef5203387b2..4003e84af11d 100644
>>> --- a/drivers/md/dm-table.c
>>> +++ b/drivers/md/dm-table.c
>>> @@ -1836,6 +1836,7 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
>>>  			      struct queue_limits *limits)
>>>  {
>>>  	int r;
>>> +	struct queue_limits old_limits;
>>>  
>>>  	if (!dm_table_supports_nowait(t))
>>>  		limits->features &= ~BLK_FEAT_NOWAIT;
>>> @@ -1862,16 +1863,11 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
>>>  	if (dm_table_supports_flush(t))
>>>  		limits->features |= BLK_FEAT_WRITE_CACHE | BLK_FEAT_FUA;
>>>  
>>> -	if (dm_table_supports_dax(t, device_not_dax_capable)) {
>>> +	if (dm_table_supports_dax(t, device_not_dax_capable))
>>>  		limits->features |= BLK_FEAT_DAX;
>>> -		if (dm_table_supports_dax(t, device_not_dax_synchronous_capable))
>>> -			set_dax_synchronous(t->md->dax_dev);
>>> -	} else
>>> +	else
>>>  		limits->features &= ~BLK_FEAT_DAX;
>>>  
>>> -	if (dm_table_any_dev_attr(t, device_dax_write_cache_enabled, NULL))
>>> -		dax_write_cache(t->md->dax_dev, true);
>>> -
>>>  	/* For a zoned table, setup the zone related queue attributes. */
>>>  	if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) &&
>>>  	    (limits->features & BLK_FEAT_ZONED)) {
>>> @@ -1883,6 +1879,7 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
>>>  	if (dm_table_supports_atomic_writes(t))
>>>  		limits->features |= BLK_FEAT_ATOMIC_WRITES;
>>>  
>>> +	old_limits = q->limits;
>>
>> I am not sure this is safe to do like this since the user may be simultaneously
>> changing attributes, which would result in the old_limits struct being in an
>> inconsistent state. So shouldn't we hold q->limits_lock here ? We probably want
>> a queue_limits_get() helper for that though.
>>
>>>  	r = queue_limits_set(q, limits);
>>
>> ...Or, we could modify queue_limits_set() to also return the old limit struct
>> under the q limits_lock. That maybe easier.
> 
> If we disallow switching between zoned devices then this is unnecssary.
> OTherwise you're right. We do want to make sure that we don't grep the
> limits while something is updating the limits.
> 
> Unfortunately, thinking about this just made me realize a different
> problem, that has nothing to do with this patchset. bio-based devices
> can't handle freezing the queue while there are plugged zone write bios.
> So, for instance, if you do something like:
> 
> # modprobe scsi_debug dev_size_mb=512 zbc=managed zone_size_mb=128 zone_nr_conv=0 delay=20
> # dmsetup create test --table "0 1048576 crypt aes-cbc-essiv:sha256 deadbeefdeadbeefdeadbeefdeadbeef 0 /dev/sda 0"
> # dd if=/dev/zero of=/dev/mapper/test bs=1M count=128 &
> # echo 0 > /sys/block/dm-1/queue/iostats
> 
> you hang.

Jens just applied a patch series cleaning up locking around limits and queue
freeze vs locking ordering. So we should check if this issue is still there with
these patches. Will try to test that.

-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 7/7] dm: allow devices to revalidate existing zones
  2025-03-10 17:43     ` Benjamin Marzinski
@ 2025-03-10 23:19       ` Damien Le Moal
  2025-03-10 23:42         ` Benjamin Marzinski
  0 siblings, 1 reply; 28+ messages in thread
From: Damien Le Moal @ 2025-03-10 23:19 UTC (permalink / raw)
  To: Benjamin Marzinski
  Cc: Mikulas Patocka, Mike Snitzer, Jens Axboe, dm-devel, linux-block,
	Christoph Hellwig

On 3/11/25 02:43, Benjamin Marzinski wrote:
> On Mon, Mar 10, 2025 at 08:59:26AM +0900, Damien Le Moal wrote:
>> On 3/10/25 07:29, Benjamin Marzinski wrote:
>>> dm_revalidate_zones() only allowed devices that had no zone resources
>>> set up to call blk_revalidate_disk_zones(). If the device already had
>>> zone resources, disk->nr_zones would always equal md->nr_zones so
>>> dm_revalidate_zones() returned without doing any work. Instead, always
>>> call blk_revalidate_disk_zones() if you are loading a new zoned table.
>>>
>>> However, if the device emulates zone append operations and already has
>>> zone append emulation resources, the table size cannot change when
>>> loading a new table. Otherwise, all those resources will be garbage.
>>>
>>> If emulated zone append operations are needed and the zone write pointer
>>> offsets of the new table do not match those of the old table, writes to
>>> the device will still fail. This patch allows users to safely grow and
>>> shrink zone devices. But swapping arbitrary zoned tables will still not
>>> work.
>>
>> I do not think that this patch correctly address the shrinking of dm zoned
>> device: blk_revalidate_disk_zones() will look at a smaller set of zones, which
>> will leave already hashed zone write plugs outside of that new zone range in the
>> disk zwplug hash table. disk_revalidate_zone_resources() does not cleanup and
>> reallocate the hash table if the number of zones shrinks.
> 
> This is necessary for DM. There could be plugged bios that are on on
> these no longer in-range zones.  They will obviously fail when they get
> reissued, but we need to keep the plugs around so that they *do* get
> reissued. A cleaner alternative would be to add code to immediately
> error out all the plugged bios on shrinks, but I was trying to avoid
> adding a bunch of code to deal with these cases (of course simply
> disallowing them adds even less code).

I am confused now :)
Under the assumption that we do not allow switching to a new table that changes
the zone configuration (in particualr, there is no grow/shrink of the device),
then I do not think we have to do anything special for DM.


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/7] dm: handle failures in dm_table_set_restrictions
  2025-03-10 18:15       ` Benjamin Marzinski
@ 2025-03-10 23:27         ` Damien Le Moal
  2025-03-14 13:38           ` Mikulas Patocka
  0 siblings, 1 reply; 28+ messages in thread
From: Damien Le Moal @ 2025-03-10 23:27 UTC (permalink / raw)
  To: Benjamin Marzinski
  Cc: Mikulas Patocka, Mike Snitzer, Jens Axboe, dm-devel, linux-block,
	Christoph Hellwig

On 3/11/25 03:15, Benjamin Marzinski wrote:
>>>> @@ -1883,6 +1879,7 @@ int dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
>>>>  	if (dm_table_supports_atomic_writes(t))
>>>>  		limits->features |= BLK_FEAT_ATOMIC_WRITES;
>>>>  
>>>> +	old_limits = q->limits;
>>>
>>> I am not sure this is safe to do like this since the user may be simultaneously
>>> changing attributes, which would result in the old_limits struct being in an
>>> inconsistent state. So shouldn't we hold q->limits_lock here ? We probably want
>>> a queue_limits_get() helper for that though.
>>>
>>>>  	r = queue_limits_set(q, limits);
>>>
>>> ...Or, we could modify queue_limits_set() to also return the old limit struct
>>> under the q limits_lock. That maybe easier.
>>
>> If we disallow switching between zoned devices then this is unnecssary.
> 
> Err.. nevermind that last line. There are still multiple cases where we
> could still fail here and need to fail back to the earlier limits. But
> I'm less sure that it's really necessary to lock the limits before
> reading them. For DM devices, I don't see a place where a bunch of
> limits could be updated at the same time, while we are swapping tables.
> An individual limit could get updated by things like the sysfs
> interface. But since that could happen at any time, I don't see what
> locking gets us. And if it's not safe to simply read a limit without
> locking them, then there are lots of places where we have unsafe code.
> Am I missing something here?

Yes, for simple scalar limits, I do not think there is any issue. But there are
some cases where changing one limit implies a change to other limits when the
limits are committed (under the limits lock). So my concern was that if the
above runs simultaneously with a queue limits commit, we may endup with the
limits struct copy grabbing part of the new limits and thus resulting in an
inconsistent limits struct. Not entirely sure that can actually happen though.
But given that queue_limits_commit_update() does:

	q->limits = *lim;

and this code does:

	old_limits = q->limits;

we may endup depending on how the compiler handles the struct copy ?


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 7/7] dm: allow devices to revalidate existing zones
  2025-03-10 23:19       ` Damien Le Moal
@ 2025-03-10 23:42         ` Benjamin Marzinski
  2025-03-11  0:00           ` Damien Le Moal
  0 siblings, 1 reply; 28+ messages in thread
From: Benjamin Marzinski @ 2025-03-10 23:42 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Mikulas Patocka, Mike Snitzer, Jens Axboe, dm-devel, linux-block,
	Christoph Hellwig

On Tue, Mar 11, 2025 at 08:19:29AM +0900, Damien Le Moal wrote:
> On 3/11/25 02:43, Benjamin Marzinski wrote:
> > On Mon, Mar 10, 2025 at 08:59:26AM +0900, Damien Le Moal wrote:
> >> On 3/10/25 07:29, Benjamin Marzinski wrote:
> >>> dm_revalidate_zones() only allowed devices that had no zone resources
> >>> set up to call blk_revalidate_disk_zones(). If the device already had
> >>> zone resources, disk->nr_zones would always equal md->nr_zones so
> >>> dm_revalidate_zones() returned without doing any work. Instead, always
> >>> call blk_revalidate_disk_zones() if you are loading a new zoned table.
> >>>
> >>> However, if the device emulates zone append operations and already has
> >>> zone append emulation resources, the table size cannot change when
> >>> loading a new table. Otherwise, all those resources will be garbage.
> >>>
> >>> If emulated zone append operations are needed and the zone write pointer
> >>> offsets of the new table do not match those of the old table, writes to
> >>> the device will still fail. This patch allows users to safely grow and
> >>> shrink zone devices. But swapping arbitrary zoned tables will still not
> >>> work.
> >>
> >> I do not think that this patch correctly address the shrinking of dm zoned
> >> device: blk_revalidate_disk_zones() will look at a smaller set of zones, which
> >> will leave already hashed zone write plugs outside of that new zone range in the
> >> disk zwplug hash table. disk_revalidate_zone_resources() does not cleanup and
> >> reallocate the hash table if the number of zones shrinks.
> > 
> > This is necessary for DM. There could be plugged bios that are on on
> > these no longer in-range zones.  They will obviously fail when they get
> > reissued, but we need to keep the plugs around so that they *do* get
> > reissued. A cleaner alternative would be to add code to immediately
> > error out all the plugged bios on shrinks, but I was trying to avoid
> > adding a bunch of code to deal with these cases (of course simply
> > disallowing them adds even less code).
> 
> I am confused now :)
> Under the assumption that we do not allow switching to a new table that changes
> the zone configuration (in particualr, there is no grow/shrink of the device),
> then I do not think we have to do anything special for DM.

If we don't allow switching between zoned tables, then we obviously
don't need to make DM call blk_revalidate_disk_zones() on a zoned table
switch. I was just saying that I know that this patch would leave
out-of-range zone plugs behind on a shrink, but that is necessary to
allow shrinking while there could still be outstanding plugged bios
attached to those plugs.

So, if we wanted to allow shrinking, then I think this patch is correct
(although erroring out all the out-of-range bios would be a cleaner
solution). But assuming we don't allow shrinking, then you are correct.
We don't need to do anything special for DM.

-Ben

> 
> 
> -- 
> Damien Le Moal
> Western Digital Research


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC PATCH 7/7] dm: allow devices to revalidate existing zones
  2025-03-10 23:42         ` Benjamin Marzinski
@ 2025-03-11  0:00           ` Damien Le Moal
  0 siblings, 0 replies; 28+ messages in thread
From: Damien Le Moal @ 2025-03-11  0:00 UTC (permalink / raw)
  To: Benjamin Marzinski
  Cc: Mikulas Patocka, Mike Snitzer, Jens Axboe, dm-devel, linux-block,
	Christoph Hellwig

On 3/11/25 08:42, Benjamin Marzinski wrote:
> On Tue, Mar 11, 2025 at 08:19:29AM +0900, Damien Le Moal wrote:
>> On 3/11/25 02:43, Benjamin Marzinski wrote:
>>> On Mon, Mar 10, 2025 at 08:59:26AM +0900, Damien Le Moal wrote:
>>>> On 3/10/25 07:29, Benjamin Marzinski wrote:
>>>>> dm_revalidate_zones() only allowed devices that had no zone resources
>>>>> set up to call blk_revalidate_disk_zones(). If the device already had
>>>>> zone resources, disk->nr_zones would always equal md->nr_zones so
>>>>> dm_revalidate_zones() returned without doing any work. Instead, always
>>>>> call blk_revalidate_disk_zones() if you are loading a new zoned table.
>>>>>
>>>>> However, if the device emulates zone append operations and already has
>>>>> zone append emulation resources, the table size cannot change when
>>>>> loading a new table. Otherwise, all those resources will be garbage.
>>>>>
>>>>> If emulated zone append operations are needed and the zone write pointer
>>>>> offsets of the new table do not match those of the old table, writes to
>>>>> the device will still fail. This patch allows users to safely grow and
>>>>> shrink zone devices. But swapping arbitrary zoned tables will still not
>>>>> work.
>>>>
>>>> I do not think that this patch correctly address the shrinking of dm zoned
>>>> device: blk_revalidate_disk_zones() will look at a smaller set of zones, which
>>>> will leave already hashed zone write plugs outside of that new zone range in the
>>>> disk zwplug hash table. disk_revalidate_zone_resources() does not cleanup and
>>>> reallocate the hash table if the number of zones shrinks.
>>>
>>> This is necessary for DM. There could be plugged bios that are on on
>>> these no longer in-range zones.  They will obviously fail when they get
>>> reissued, but we need to keep the plugs around so that they *do* get
>>> reissued. A cleaner alternative would be to add code to immediately
>>> error out all the plugged bios on shrinks, but I was trying to avoid
>>> adding a bunch of code to deal with these cases (of course simply
>>> disallowing them adds even less code).
>>
>> I am confused now :)
>> Under the assumption that we do not allow switching to a new table that changes
>> the zone configuration (in particualr, there is no grow/shrink of the device),
>> then I do not think we have to do anything special for DM.
> 
> If we don't allow switching between zoned tables, then we obviously
> don't need to make DM call blk_revalidate_disk_zones() on a zoned table
> switch. I was just saying that I know that this patch would leave
> out-of-range zone plugs behind on a shrink, but that is necessary to
> allow shrinking while there could still be outstanding plugged bios
> attached to those plugs.
> 
> So, if we wanted to allow shrinking, then I think this patch is correct
> (although erroring out all the out-of-range bios would be a cleaner
> solution). But assuming we don't allow shrinking, then you are correct.
> We don't need to do anything special for DM.

OK. Got it.
As mentioned, I think we really still need to allow switching (or inserting ?)
dm-error. In that case, calling blk_revalidate_disk_zones() should still be fine
as long as there is no change to the zone config, which is true for dm-error. Do
you agree ?


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/7] dm: handle failures in dm_table_set_restrictions
  2025-03-10 23:27         ` Damien Le Moal
@ 2025-03-14 13:38           ` Mikulas Patocka
  2025-03-14 13:46             ` Mikulas Patocka
  0 siblings, 1 reply; 28+ messages in thread
From: Mikulas Patocka @ 2025-03-14 13:38 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Benjamin Marzinski, Mike Snitzer, Jens Axboe, dm-devel,
	linux-block, Christoph Hellwig



On Tue, 11 Mar 2025, Damien Le Moal wrote:

> Yes, for simple scalar limits, I do not think there is any issue. But there are
> some cases where changing one limit implies a change to other limits when the
> limits are committed (under the limits lock). So my concern was that if the
> above runs simultaneously with a queue limits commit, we may endup with the
> limits struct copy grabbing part of the new limits and thus resulting in an
> inconsistent limits struct. Not entirely sure that can actually happen though.
> But given that queue_limits_commit_update() does:
> 
> 	q->limits = *lim;
> 
> and this code does:
> 
> 	old_limits = q->limits;
> 
> we may endup depending on how the compiler handles the struct copy ?

There is no guarantee that struct copy will update the structure fields 
atomically.

On some CPUs, a "rep movsb" instruction may be used, which may be 
optimized by the CPU, but it may be also interrupted at any byte boundary.

I think it should be changed to the sequence of WRITE_ONCE statements, for 
example:
WRITE_ONCE(q->limits->file, lim->field);

Mikulas


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/7] dm: handle failures in dm_table_set_restrictions
  2025-03-14 13:38           ` Mikulas Patocka
@ 2025-03-14 13:46             ` Mikulas Patocka
  0 siblings, 0 replies; 28+ messages in thread
From: Mikulas Patocka @ 2025-03-14 13:46 UTC (permalink / raw)
  To: Damien Le Moal
  Cc: Benjamin Marzinski, Mike Snitzer, Jens Axboe, dm-devel,
	linux-block, Christoph Hellwig



On Fri, 14 Mar 2025, Mikulas Patocka wrote:

> 
> 
> On Tue, 11 Mar 2025, Damien Le Moal wrote:
> 
> > Yes, for simple scalar limits, I do not think there is any issue. But there are
> > some cases where changing one limit implies a change to other limits when the
> > limits are committed (under the limits lock). So my concern was that if the
> > above runs simultaneously with a queue limits commit, we may endup with the
> > limits struct copy grabbing part of the new limits and thus resulting in an
> > inconsistent limits struct. Not entirely sure that can actually happen though.
> > But given that queue_limits_commit_update() does:
> > 
> > 	q->limits = *lim;
> > 
> > and this code does:
> > 
> > 	old_limits = q->limits;
> > 
> > we may endup depending on how the compiler handles the struct copy ?
> 
> There is no guarantee that struct copy will update the structure fields 
> atomically.
> 
> On some CPUs, a "rep movsb" instruction may be used, which may be 
> optimized by the CPU, but it may be also interrupted at any byte boundary.
> 
> I think it should be changed to the sequence of WRITE_ONCE statements, for 
> example:
> WRITE_ONCE(q->limits->file, lim->field);
> 
> Mikulas

BTW. some SPARC CPUs had an instruction that would zero a cache line 
without loading it from memory. The memcpy implementation on SPARC used 
this instruction to avoid loading data that would be soon overwritten.

The result was that if you read a region of memory concurrently with 
memcpy writing to it, you could read zeros produced by this instruction.

Mikulas


^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2025-03-14 13:46 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-03-09 22:28 [RFC PATCH 0/7] dm: fix issues with swapping dm tables Benjamin Marzinski
2025-03-09 22:28 ` [PATCH 1/7] dm: don't change md if dm_table_set_restrictions() fails Benjamin Marzinski
2025-03-09 23:18   ` Damien Le Moal
2025-03-10 16:39     ` Benjamin Marzinski
2025-03-09 22:28 ` [PATCH 2/7] dm: free table mempools if not used in __bind Benjamin Marzinski
2025-03-09 23:19   ` Damien Le Moal
2025-03-09 22:28 ` [PATCH 3/7] dm: handle failures in dm_table_set_restrictions Benjamin Marzinski
2025-03-09 23:25   ` Damien Le Moal
2025-03-10 17:37     ` Benjamin Marzinski
2025-03-10 18:15       ` Benjamin Marzinski
2025-03-10 23:27         ` Damien Le Moal
2025-03-14 13:38           ` Mikulas Patocka
2025-03-14 13:46             ` Mikulas Patocka
2025-03-10 23:16       ` Damien Le Moal
2025-03-09 22:29 ` [PATCH 4/7] dm: fix dm_blk_report_zones Benjamin Marzinski
2025-03-09 23:27   ` Damien Le Moal
2025-03-09 22:29 ` [PATCH 5/7] blk-zoned: clean up zone settings for devices without zwplugs Benjamin Marzinski
2025-03-09 23:31   ` Damien Le Moal
2025-03-09 22:29 ` [RFC PATCH 6/7] blk-zoned: modify blk_revalidate_disk_zones for bio-based drivers Benjamin Marzinski
2025-03-09 22:29 ` [RFC PATCH 7/7] dm: allow devices to revalidate existing zones Benjamin Marzinski
2025-03-09 23:59   ` Damien Le Moal
2025-03-10 17:43     ` Benjamin Marzinski
2025-03-10 23:19       ` Damien Le Moal
2025-03-10 23:42         ` Benjamin Marzinski
2025-03-11  0:00           ` Damien Le Moal
2025-03-09 23:16 ` [RFC PATCH 0/7] dm: fix issues with swapping dm tables Damien Le Moal
2025-03-10 16:38   ` Benjamin Marzinski
2025-03-10 23:13     ` Damien Le Moal

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox