* [PATCH 0/4] Zone write plugging and DM zone fixes
@ 2024-05-30 5:40 Damien Le Moal
2024-05-30 5:40 ` [PATCH 1/4] null_blk: Do not allow runt zone with zone capacity smaller than zone size Damien Le Moal
` (5 more replies)
0 siblings, 6 replies; 29+ messages in thread
From: Damien Le Moal @ 2024-05-30 5:40 UTC (permalink / raw)
To: Jens Axboe, linux-block, dm-devel, Mike Snitzer, Mikulas Patocka
The first patch of this series fixes null_blk to avoid weird zone
configurations, namely, a zoned device whose smaller last zone has a
zone capacity smaller than the zone size. Related to this, the next 2
patches fix the handling by zone write plugging of zoned devices with a
last smaller zone. That was completely botched in the initial series.
Finally, the last patch addresses a long-standing issue with zoned
device-mapper devices: zone resource limits (max open and max active
zones) are not exposed to the user. This patch fixes that, allowing
the limits of the underlying target devices to be exposed, with a
warning for setups that lead to unreliable limits.
This is all based on block/block-6.10 branch and the last patch depends
on Christoph's recent DM queue limits fixes. While the last patch is
technically not really a fix for a recent bug, it would be nice to get
it in this cycle as the change in the max open zone limits introduced
with zone write plugging (i.e. exposing a max open zones limit of 128 for
devices with no open zones limit) is confusing zonefs tests, causing
failures.
Damien Le Moal (4):
null_blk: Do not allow runt zone with zone capacity smaller than zone size
block: Fix validation of zoned device with a runt zone
block: Fix zone write plugging handling of devices with a runt zone
dm: Improve zone resource limits handling
block/blk-zoned.c | 47 ++++++--
drivers/block/null_blk/zoned.c | 11 ++
drivers/md/dm-zone.c | 214 +++++++++++++++++++++++++++------
include/linux/blkdev.h | 1 +
4 files changed, 225 insertions(+), 48 deletions(-)
--
2.45.1
^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH 1/4] null_blk: Do not allow runt zone with zone capacity smaller than zone size
2024-05-30 5:40 [PATCH 0/4] Zone write plugging and DM zone fixes Damien Le Moal
@ 2024-05-30 5:40 ` Damien Le Moal
2024-05-30 7:37 ` Niklas Cassel
` (3 more replies)
2024-05-30 5:40 ` [PATCH 2/4] block: Fix validation of zoned device with a runt zone Damien Le Moal
` (4 subsequent siblings)
5 siblings, 4 replies; 29+ messages in thread
From: Damien Le Moal @ 2024-05-30 5:40 UTC (permalink / raw)
To: Jens Axboe, linux-block, dm-devel, Mike Snitzer, Mikulas Patocka
A zoned device with a smaller last zone together with a zone capacity
smaller than the zone size does not make any sense as that does not
correspond to any possible setup for a real device:
1) For ZNS and zoned UFS devices, all zones are always the same size.
2) For SMR HDDs, all zones always have the same capacity.
In other words, if a device has a smaller last (runt) zone, then that
zone's capacity should always be equal to the zone size.
Add a check in null_init_zoned_dev() to prevent configurations that have
both a smaller last zone and a zone capacity smaller than the zone size.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
drivers/block/null_blk/zoned.c | 11 +++++++++++
1 file changed, 11 insertions(+)
diff --git a/drivers/block/null_blk/zoned.c b/drivers/block/null_blk/zoned.c
index 79c8e5e99f7f..f118d304f310 100644
--- a/drivers/block/null_blk/zoned.c
+++ b/drivers/block/null_blk/zoned.c
@@ -74,6 +74,17 @@ int null_init_zoned_dev(struct nullb_device *dev,
return -EINVAL;
}
+ /*
+ * If a smaller zone capacity was requested, do not allow a smaller last
+ * zone at the same time as such zone configuration does not correspond
+ * to any real zoned device.
+ */
+ if (dev->zone_capacity != dev->zone_size &&
+ dev->size & (dev->zone_size - 1)) {
+ pr_err("A smaller last zone is not allowed with zone capacity smaller than zone size.\n");
+ return -EINVAL;
+ }
+
zone_capacity_sects = mb_to_sects(dev->zone_capacity);
dev_capacity_sects = mb_to_sects(dev->size);
dev->zone_size_sects = mb_to_sects(dev->zone_size);
--
2.45.1
* [PATCH 2/4] block: Fix validation of zoned device with a runt zone
2024-05-30 5:40 [PATCH 0/4] Zone write plugging and DM zone fixes Damien Le Moal
2024-05-30 5:40 ` [PATCH 1/4] null_blk: Do not allow runt zone with zone capacity smaller than zone size Damien Le Moal
@ 2024-05-30 5:40 ` Damien Le Moal
2024-05-30 7:37 ` Niklas Cassel
` (3 more replies)
2024-05-30 5:40 ` [PATCH 3/4] block: Fix zone write plugging handling of devices " Damien Le Moal
` (3 subsequent siblings)
5 siblings, 4 replies; 29+ messages in thread
From: Damien Le Moal @ 2024-05-30 5:40 UTC (permalink / raw)
To: Jens Axboe, linux-block, dm-devel, Mike Snitzer, Mikulas Patocka
Commit ecfe43b11b02 ("block: Remember zone capacity when revalidating
zones") introduced checks to ensure that the capacity of the zones of
a zoned device is constant for all zones. However, this check ignores
the possibility that a zoned device has a smaller last zone with a size
not equal to the capacity of the other zones. In practice, such a device
corresponds to an SMR drive with a smaller last zone and all zones having
a capacity equal to the zone size, leading to the last zone capacity being
different from the capacity of the other zones.
Correctly handle such devices by fixing the check for the constant zone
capacity in blk_revalidate_seq_zone() using the new helper function
disk_zone_is_last(). This helper function is also used in
blk_revalidate_zone_cb() when checking the zone size.
Fixes: ecfe43b11b02 ("block: Remember zone capacity when revalidating zones")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
block/blk-zoned.c | 16 +++++++++++-----
1 file changed, 11 insertions(+), 5 deletions(-)
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 03aa4eead39e..402a50a1ac4d 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -450,6 +450,11 @@ static inline bool disk_zone_is_conv(struct gendisk *disk, sector_t sector)
return test_bit(disk_zone_no(disk, sector), disk->conv_zones_bitmap);
}
+static bool disk_zone_is_last(struct gendisk *disk, struct blk_zone *zone)
+{
+ return zone->start + zone->len >= get_capacity(disk);
+}
+
static bool disk_insert_zone_wplug(struct gendisk *disk,
struct blk_zone_wplug *zwplug)
{
@@ -1693,11 +1698,13 @@ static int blk_revalidate_seq_zone(struct blk_zone *zone, unsigned int idx,
/*
* Remember the capacity of the first sequential zone and check
- * if it is constant for all zones.
+ * if it is constant for all zones, ignoring the last zone as it can be
+ * smaller.
*/
if (!args->zone_capacity)
args->zone_capacity = zone->capacity;
- if (zone->capacity != args->zone_capacity) {
+ if (!disk_zone_is_last(disk, zone) &&
+ zone->capacity != args->zone_capacity) {
pr_warn("%s: Invalid variable zone capacity\n",
disk->disk_name);
return -ENODEV;
@@ -1732,7 +1739,6 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
{
struct blk_revalidate_zone_args *args = data;
struct gendisk *disk = args->disk;
- sector_t capacity = get_capacity(disk);
sector_t zone_sectors = disk->queue->limits.chunk_sectors;
int ret;
@@ -1743,7 +1749,7 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
return -ENODEV;
}
- if (zone->start >= capacity || !zone->len) {
+ if (zone->start >= get_capacity(disk) || !zone->len) {
pr_warn("%s: Invalid zone start %llu, length %llu\n",
disk->disk_name, zone->start, zone->len);
return -ENODEV;
@@ -1753,7 +1759,7 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
* All zones must have the same size, with the exception on an eventual
* smaller last zone.
*/
- if (zone->start + zone->len < capacity) {
+ if (!disk_zone_is_last(disk, zone)) {
if (zone->len != zone_sectors) {
pr_warn("%s: Invalid zoned device with non constant zone size\n",
disk->disk_name);
--
2.45.1
* [PATCH 3/4] block: Fix zone write plugging handling of devices with a runt zone
2024-05-30 5:40 [PATCH 0/4] Zone write plugging and DM zone fixes Damien Le Moal
2024-05-30 5:40 ` [PATCH 1/4] null_blk: Do not allow runt zone with zone capacity smaller than zone size Damien Le Moal
2024-05-30 5:40 ` [PATCH 2/4] block: Fix validation of zoned device with a runt zone Damien Le Moal
@ 2024-05-30 5:40 ` Damien Le Moal
2024-05-30 7:37 ` Niklas Cassel
` (4 more replies)
2024-05-30 5:40 ` [PATCH 4/4] dm: Improve zone resource limits handling Damien Le Moal
` (2 subsequent siblings)
5 siblings, 5 replies; 29+ messages in thread
From: Damien Le Moal @ 2024-05-30 5:40 UTC (permalink / raw)
To: Jens Axboe, linux-block, dm-devel, Mike Snitzer, Mikulas Patocka
A zoned device may have a last sequential write required zone that is
smaller than other zones. However, all tests to check if a zone write
plug write offset exceeds the zone capacity use the same capacity
value stored in the gendisk zone_capacity field. This is incorrect for a
zoned device with a last runt (smaller) zone.
Add the new field last_zone_capacity to struct gendisk to store the
capacity of the last zone of the device. blk_revalidate_seq_zone() and
blk_revalidate_conv_zone() are both modified to get this value when
disk_zone_is_last() returns true. Similarly to zone_capacity, the value
is first stored using the last_zone_capacity field of struct
blk_revalidate_zone_args. Once zone revalidation of all zones is done,
this is used to set the gendisk last_zone_capacity field.
The checks to determine if a zone is full or if a sector offset in a
zone exceeds the zone capacity in disk_should_remove_zone_wplug(),
disk_zone_wplug_abort_unaligned(), blk_zone_write_plug_init_request(),
and blk_zone_wplug_prepare_bio() are modified to use the new helper
functions disk_zone_is_full() and disk_zone_wplug_is_full().
disk_zone_is_full() uses the zone index to determine if the zone being
tested is the last one of the disk and uses either the disk
zone_capacity or last_zone_capacity accordingly.
Fixes: dd291d77cc90 ("block: Introduce zone write plugging")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
block/blk-zoned.c | 35 +++++++++++++++++++++++++++--------
include/linux/blkdev.h | 1 +
2 files changed, 28 insertions(+), 8 deletions(-)
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 402a50a1ac4d..52abebf56027 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -455,6 +455,20 @@ static bool disk_zone_is_last(struct gendisk *disk, struct blk_zone *zone)
return zone->start + zone->len >= get_capacity(disk);
}
+static bool disk_zone_is_full(struct gendisk *disk,
+ unsigned int zno, unsigned int offset_in_zone)
+{
+ if (zno < disk->nr_zones - 1)
+ return offset_in_zone >= disk->zone_capacity;
+ return offset_in_zone >= disk->last_zone_capacity;
+}
+
+static bool disk_zone_wplug_is_full(struct gendisk *disk,
+ struct blk_zone_wplug *zwplug)
+{
+ return disk_zone_is_full(disk, zwplug->zone_no, zwplug->wp_offset);
+}
+
static bool disk_insert_zone_wplug(struct gendisk *disk,
struct blk_zone_wplug *zwplug)
{
@@ -548,7 +562,7 @@ static inline bool disk_should_remove_zone_wplug(struct gendisk *disk,
return false;
/* We can remove zone write plugs for zones that are empty or full. */
- return !zwplug->wp_offset || zwplug->wp_offset >= disk->zone_capacity;
+ return !zwplug->wp_offset || disk_zone_wplug_is_full(disk, zwplug);
}
static void disk_remove_zone_wplug(struct gendisk *disk,
@@ -669,13 +683,12 @@ static void disk_zone_wplug_abort(struct blk_zone_wplug *zwplug)
static void disk_zone_wplug_abort_unaligned(struct gendisk *disk,
struct blk_zone_wplug *zwplug)
{
- unsigned int zone_capacity = disk->zone_capacity;
unsigned int wp_offset = zwplug->wp_offset;
struct bio_list bl = BIO_EMPTY_LIST;
struct bio *bio;
while ((bio = bio_list_pop(&zwplug->bio_list))) {
- if (wp_offset >= zone_capacity ||
+ if (disk_zone_is_full(disk, zwplug->zone_no, wp_offset) ||
(bio_op(bio) != REQ_OP_ZONE_APPEND &&
bio_offset_from_zone_start(bio) != wp_offset)) {
blk_zone_wplug_bio_io_error(zwplug, bio);
@@ -914,7 +927,6 @@ void blk_zone_write_plug_init_request(struct request *req)
sector_t req_back_sector = blk_rq_pos(req) + blk_rq_sectors(req);
struct request_queue *q = req->q;
struct gendisk *disk = q->disk;
- unsigned int zone_capacity = disk->zone_capacity;
struct blk_zone_wplug *zwplug =
disk_get_zone_wplug(disk, blk_rq_pos(req));
unsigned long flags;
@@ -938,7 +950,7 @@ void blk_zone_write_plug_init_request(struct request *req)
* into the back of the request.
*/
spin_lock_irqsave(&zwplug->lock, flags);
- while (zwplug->wp_offset < zone_capacity) {
+ while (!disk_zone_wplug_is_full(disk, zwplug)) {
bio = bio_list_peek(&zwplug->bio_list);
if (!bio)
break;
@@ -984,7 +996,7 @@ static bool blk_zone_wplug_prepare_bio(struct blk_zone_wplug *zwplug,
* We know such BIO will fail, and that would potentially overflow our
* write pointer offset beyond the end of the zone.
*/
- if (zwplug->wp_offset >= disk->zone_capacity)
+ if (disk_zone_wplug_is_full(disk, zwplug))
goto err;
if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
@@ -1561,6 +1573,7 @@ void disk_free_zone_resources(struct gendisk *disk)
kfree(disk->conv_zones_bitmap);
disk->conv_zones_bitmap = NULL;
disk->zone_capacity = 0;
+ disk->last_zone_capacity = 0;
disk->nr_zones = 0;
}
@@ -1605,6 +1618,7 @@ struct blk_revalidate_zone_args {
unsigned long *conv_zones_bitmap;
unsigned int nr_zones;
unsigned int zone_capacity;
+ unsigned int last_zone_capacity;
sector_t sector;
};
@@ -1622,6 +1636,7 @@ static int disk_update_zone_resources(struct gendisk *disk,
disk->nr_zones = args->nr_zones;
disk->zone_capacity = args->zone_capacity;
+ disk->last_zone_capacity = args->last_zone_capacity;
swap(disk->conv_zones_bitmap, args->conv_zones_bitmap);
if (disk->conv_zones_bitmap)
nr_conv_zones = bitmap_weight(disk->conv_zones_bitmap,
@@ -1673,6 +1688,9 @@ static int blk_revalidate_conv_zone(struct blk_zone *zone, unsigned int idx,
return -ENODEV;
}
+ if (disk_zone_is_last(disk, zone))
+ args->last_zone_capacity = zone->capacity;
+
if (!disk_need_zone_resources(disk))
return 0;
@@ -1703,8 +1721,9 @@ static int blk_revalidate_seq_zone(struct blk_zone *zone, unsigned int idx,
*/
if (!args->zone_capacity)
args->zone_capacity = zone->capacity;
- if (!disk_zone_is_last(disk, zone) &&
- zone->capacity != args->zone_capacity) {
+ if (disk_zone_is_last(disk, zone)) {
+ args->last_zone_capacity = zone->capacity;
+ } else if (zone->capacity != args->zone_capacity) {
pr_warn("%s: Invalid variable zone capacity\n",
disk->disk_name);
return -ENODEV;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index aefdda9f4ec7..24c36929920b 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -186,6 +186,7 @@ struct gendisk {
*/
unsigned int nr_zones;
unsigned int zone_capacity;
+ unsigned int last_zone_capacity;
unsigned long *conv_zones_bitmap;
unsigned int zone_wplugs_hash_bits;
spinlock_t zone_wplugs_lock;
--
2.45.1
* [PATCH 4/4] dm: Improve zone resource limits handling
2024-05-30 5:40 [PATCH 0/4] Zone write plugging and DM zone fixes Damien Le Moal
` (2 preceding siblings ...)
2024-05-30 5:40 ` [PATCH 3/4] block: Fix zone write plugging handling of devices " Damien Le Moal
@ 2024-05-30 5:40 ` Damien Le Moal
2024-05-30 7:37 ` Niklas Cassel
` (3 more replies)
2024-05-30 21:03 ` [PATCH 0/4] Zone write plugging and DM zone fixes Jens Axboe
2024-05-30 21:04 ` (subset) " Jens Axboe
5 siblings, 4 replies; 29+ messages in thread
From: Damien Le Moal @ 2024-05-30 5:40 UTC (permalink / raw)
To: Jens Axboe, linux-block, dm-devel, Mike Snitzer, Mikulas Patocka
The generic stacking of limits implemented in the block layer cannot
correctly handle stacking of zone resource limits (max open zones and
max active zones) because these limits are for an entire device but the
stacking may be for a portion of that device (e.g. a dm-linear target
that does not cover an entire block device). As a result, when DM
devices are created on top of zoned block devices, the DM device never
has any zone resource limits advertised, which is only correct if all
underlying target devices also have no zone resource limits.
If at least one target device has resource limits, the user may see
either performance issues (if the max open zone limit of the device is
exceeded) or write I/O errors if the max active zone limit of one of
the underlying target devices is exceeded.
While it is very difficult to correctly and reliably stack zone resource
limits in general, cases where targets are not sharing zone resources of
the same device can be dealt with relatively easily. Such a situation
occurs when a target maps all sequential zones of a zoned block device:
for such mapping, other targets mapping other parts of the same zoned
block device can only contain conventional zones and thus will not
require any zone resource to correctly handle write operations.
For a mapped device constructed with such targets, which includes mapped
devices constructed with targets mapping entire zoned block devices, the
zone resource limits can be reliably determined using the non-zero
minimum of the zone resource limits of all targets.
For mapped devices that include targets partially mapping the set of
sequential write required zones of zoned block devices, instead of
advertising no zone resource limits, it is also better to set the mapped
device limits to the non-zero minimum of the limits of all targets,
capped by the number of sequential zones being mapped.
While still not completely reliable, this allows indicating to the user
that the underlying devices used have limits.
This commit improves zone resource limits handling as described above
using the function dm_set_zone_resource_limits(). This function is
executed from dm_set_zones_restrictions() and iterates the targets of a
mapped device to evaluate the max open and max active zone limits. This
relies on an internal "stacking" of the limits of the target devices
combined with a direct counting of the number of sequential zones
mapped by the target.
1) For a target mapping an entire zoned block device, the limits are
evaluated as the non-zero minimum of the limits of the target device
and of the mapped device.
2) For a target partially mapping a zoned block device, the number of
mapped sequential zones is compared to the total number of sequential
zones of the target device to determine the limits: if the target
maps all sequential write required zones of the device, then the
target can reliably inherit the device limits. Otherwise, the target
limits are set to the device limits capped by the number of mapped
sequential zones.
With this evaluation done for each target, the mapped device zone
resource limits are evaluated as the non-zero minimum of the limits of
all the targets.
For configurations resulting in unreliable limits, a warning message is
issued.
The counting of mapped sequential zones for the target is done using the
new function dm_device_count_zones() which performs a report zones on
the entire block device with the callback dm_device_count_zones_cb().
This count of mapped sequential zones is used to determine if the mapped
device contains only conventional zones. This allows simplifying
dm_set_zones_restrictions() so that it no longer does a report zones
itself. For mapped devices
mapping only conventional zones, dm_set_zone_resource_limits() changes
the mapped device to a regular device.
To further clean up dm_set_zones_restrictions(), the message about the
type of zone append (native or emulated) is moved inside
dm_revalidate_zones().
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
drivers/md/dm-zone.c | 214 +++++++++++++++++++++++++++++++++++--------
1 file changed, 177 insertions(+), 37 deletions(-)
diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
index 5d66d916730e..5f8b499529cf 100644
--- a/drivers/md/dm-zone.c
+++ b/drivers/md/dm-zone.c
@@ -145,17 +145,174 @@ bool dm_is_zone_write(struct mapped_device *md, struct bio *bio)
}
}
+struct dm_device_zone_count {
+ sector_t start;
+ sector_t len;
+ unsigned int total_nr_seq_zones;
+ unsigned int target_nr_seq_zones;
+};
+
/*
- * Count conventional zones of a mapped zoned device. If the device
- * only has conventional zones, do not expose it as zoned.
+ * Count the total number of and the number of mapped sequential zones of a
+ * target zoned device.
*/
-static int dm_check_zoned_cb(struct blk_zone *zone, unsigned int idx,
- void *data)
+static int dm_device_count_zones_cb(struct blk_zone *zone,
+ unsigned int idx, void *data)
{
- unsigned int *nr_conv_zones = data;
+ struct dm_device_zone_count *zc = data;
+
+ if (zone->type != BLK_ZONE_TYPE_CONVENTIONAL) {
+ zc->total_nr_seq_zones++;
+ if (zone->start >= zc->start &&
+ zone->start < zc->start + zc->len)
+ zc->target_nr_seq_zones++;
+ }
- if (zone->type == BLK_ZONE_TYPE_CONVENTIONAL)
- (*nr_conv_zones)++;
+ return 0;
+}
+
+static int dm_device_count_zones(struct dm_dev *dev,
+ struct dm_device_zone_count *zc)
+{
+ int ret;
+
+ ret = blkdev_report_zones(dev->bdev, 0, UINT_MAX,
+ dm_device_count_zones_cb, zc);
+ if (ret < 0)
+ return ret;
+ if (!ret)
+ return -EIO;
+ return 0;
+}
+
+struct dm_zone_resource_limits {
+ unsigned int mapped_nr_seq_zones;
+ unsigned int max_open_zones;
+ unsigned int max_active_zones;
+ bool reliable_limits;
+};
+
+static int device_get_zone_resource_limits(struct dm_target *ti,
+ struct dm_dev *dev, sector_t start,
+ sector_t len, void *data)
+{
+ struct dm_zone_resource_limits *zlim = data;
+ struct gendisk *disk = dev->bdev->bd_disk;
+ int ret;
+ struct dm_device_zone_count zc = {
+ .start = start,
+ .len = len,
+ };
+
+ /*
+ * If the target is not the whole device, the device zone resources may
+ * be shared between different targets. Check this by counting the
+ * number of mapped sequential zones: if this number is smaller than the
+ * total number of sequential zones of the target device, then resource
+ * sharing may happen and the zone limits will not be reliable.
+ */
+ ret = dm_device_count_zones(dev, &zc);
+ if (ret) {
+ DMERR("Count device %s zones failed",
+ disk->disk_name);
+ return ret;
+ }
+
+ zlim->mapped_nr_seq_zones += zc.target_nr_seq_zones;
+
+ /*
+ * If the target does not map any sequential zones, then we do not need
+ * any zone resource limits.
+ */
+ if (!zc.target_nr_seq_zones)
+ return 0;
+
+ /*
+ * If the target does not map all sequential zones, the limits
+ * will not be reliable.
+ */
+ if (zc.target_nr_seq_zones < zc.total_nr_seq_zones)
+ zlim->reliable_limits = false;
+
+ zlim->max_active_zones =
+ min_not_zero(disk->queue->limits.max_active_zones,
+ zlim->max_active_zones);
+ if (zlim->max_active_zones != UINT_MAX)
+ zlim->max_active_zones =
+ min(zlim->max_active_zones, zc.target_nr_seq_zones);
+
+ zlim->max_open_zones =
+ min_not_zero(disk->queue->limits.max_open_zones,
+ zlim->max_open_zones);
+ if (zlim->max_open_zones != UINT_MAX)
+ zlim->max_open_zones =
+ min3(zlim->max_open_zones, zc.target_nr_seq_zones,
+ zlim->max_active_zones);
+
+ return 0;
+}
+
+static int dm_set_zone_resource_limits(struct mapped_device *md,
+ struct dm_table *t, struct queue_limits *lim)
+{
+ struct gendisk *disk = md->disk;
+ struct dm_zone_resource_limits zlim = {
+ .max_open_zones = UINT_MAX,
+ .max_active_zones = UINT_MAX,
+ .reliable_limits = true,
+ };
+
+ /* Get the zone resource limits from the targets. */
+ for (unsigned int i = 0; i < t->num_targets; i++) {
+ struct dm_target *ti = dm_table_get_target(t, i);
+
+ if (!ti->type->iterate_devices ||
+ ti->type->iterate_devices(ti,
+ device_get_zone_resource_limits, &zlim)) {
+ DMERR("Could not determine %s zone resource limits",
+ md->disk->disk_name);
+ return -ENODEV;
+ }
+ }
+
+ /*
+ * If we only have conventional zones mapped, expose the mapped device
+	 * as a regular device.
+ */
+ if (!zlim.mapped_nr_seq_zones) {
+ lim->max_open_zones = 0;
+ lim->max_active_zones = 0;
+ lim->max_zone_append_sectors = 0;
+ lim->zone_write_granularity = 0;
+ lim->zoned = false;
+ clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
+ md->nr_zones = 0;
+ disk->nr_zones = 0;
+ return 0;
+ }
+
+ /*
+ * Warn if the mapped device is partially using zone resources of the
+ * target devices as that leads to unreliable limits, i.e. if another
+ * mapped device uses the same underlying devices, we cannot enforce
+ * zone limits to guarantee that writing will not lead to errors.
+ * Note that we really should return an error for such case but there is
+ * no easy way to find out if another mapped device uses the same
+ * underlying zoned devices.
+ */
+ if (!zlim.reliable_limits)
+ DMWARN("%s zone resource limits may be unreliable",
+ disk->disk_name);
+
+ if (zlim.max_open_zones >= zlim.mapped_nr_seq_zones)
+ lim->max_open_zones = 0;
+ else
+ lim->max_open_zones = zlim.max_open_zones;
+
+ if (zlim.max_active_zones >= zlim.mapped_nr_seq_zones)
+ lim->max_active_zones = 0;
+ else
+ lim->max_active_zones = zlim.max_active_zones;
return 0;
}
@@ -172,8 +329,13 @@ static int dm_revalidate_zones(struct mapped_device *md, struct dm_table *t)
int ret;
/* Revalidate only if something changed. */
- if (!disk->nr_zones || disk->nr_zones != md->nr_zones)
+ if (!disk->nr_zones || disk->nr_zones != md->nr_zones) {
+ DMINFO("%s using %s zone append",
+ md->disk->disk_name,
+ queue_emulates_zone_append(disk->queue) ?
+ "emulated" : "native");
md->nr_zones = 0;
+ }
if (md->nr_zones)
return 0;
@@ -224,8 +386,6 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
struct queue_limits *lim)
{
struct mapped_device *md = t->md;
- struct gendisk *disk = md->disk;
- unsigned int nr_conv_zones = 0;
int ret;
/*
@@ -244,36 +404,16 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
return 0;
/*
- * Count conventional zones to check that the mapped device will indeed
- * have sequential write required zones.
+ * Determine the max open and max active zone limits for the mapped
+ * device. For a mapped device containing only conventional zones, the
+ * mapped device is changed to be a regular block device, so exit early
+ * for such case.
*/
- md->zone_revalidate_map = t;
- ret = dm_blk_report_zones(disk, 0, UINT_MAX,
- dm_check_zoned_cb, &nr_conv_zones);
- md->zone_revalidate_map = NULL;
- if (ret < 0) {
- DMERR("Check zoned failed %d", ret);
+ ret = dm_set_zone_resource_limits(md, t, lim);
+ if (ret)
return ret;
- }
-
- /*
- * If we only have conventional zones, expose the mapped device as
- * a regular device.
- */
- if (nr_conv_zones >= ret) {
- lim->max_open_zones = 0;
- lim->max_active_zones = 0;
- lim->zoned = false;
- clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
- disk->nr_zones = 0;
+ if (!lim->zoned)
return 0;
- }
-
- if (!md->disk->nr_zones) {
- DMINFO("%s using %s zone append",
- md->disk->disk_name,
- queue_emulates_zone_append(q) ? "emulated" : "native");
- }
ret = dm_revalidate_zones(md, t);
if (ret < 0)
--
2.45.1
* Re: [PATCH 3/4] block: Fix zone write plugging handling of devices with a runt zone
2024-05-30 5:40 ` [PATCH 3/4] block: Fix zone write plugging handling of devices " Damien Le Moal
@ 2024-05-30 7:37 ` Niklas Cassel
2024-05-30 11:09 ` Damien Le Moal
2024-05-30 12:51 ` Niklas Cassel
` (3 subsequent siblings)
4 siblings, 1 reply; 29+ messages in thread
From: Niklas Cassel @ 2024-05-30 7:37 UTC (permalink / raw)
To: Damien Le Moal
Cc: Jens Axboe, linux-block, dm-devel, Mike Snitzer, Mikulas Patocka
On Thu, May 30, 2024 at 02:40:34PM +0900, Damien Le Moal wrote:
> A zoned device may have a last sequential write required zone that is
> smaller than other zones. However, all tests to check if a zone write
> plug write offset exceeds the zone capacity use the same capacity
> value stored in the gendisk zone_capacity field. This is incorrect for a
> zoned device with a last runt (smaller) zone.
>
> Add the new field last_zone_capacity to struct gendisk to store the
> capacity of the last zone of the device. blk_revalidate_seq_zone() and
> blk_revalidate_conv_zone() are both modified to get this value when
> disk_zone_is_last() returns true. Similarly to zone_capacity, the value
> is first stored using the last_zone_capacity field of struct
> blk_revalidate_zone_args. Once zone revalidation of all zones is done,
> this is used to set the gendisk last_zone_capacity field.
>
> The checks to determine if a zone is full or if a sector offset in a
> zone exceeds the zone capacity in disk_should_remove_zone_wplug(),
> disk_zone_wplug_abort_unaligned(), blk_zone_write_plug_init_request(),
> and blk_zone_wplug_prepare_bio() are modified to use the new helper
> functions disk_zone_is_full() and disk_zone_wplug_is_full().
> disk_zone_is_full() uses the zone index to determine if the zone being
> tested is the last one of the disk and uses either the disk
> zone_capacity or last_zone_capacity accordingly.
>
> Fixes: dd291d77cc90 ("block: Introduce zone write plugging")
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
> block/blk-zoned.c | 35 +++++++++++++++++++++++++++--------
> include/linux/blkdev.h | 1 +
> 2 files changed, 28 insertions(+), 8 deletions(-)
>
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index 402a50a1ac4d..52abebf56027 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -455,6 +455,20 @@ static bool disk_zone_is_last(struct gendisk *disk, struct blk_zone *zone)
> return zone->start + zone->len >= get_capacity(disk);
> }
>
> +static bool disk_zone_is_full(struct gendisk *disk,
> + unsigned int zno, unsigned int offset_in_zone)
Why not just call the third parameter wp?
> +{
> + if (zno < disk->nr_zones - 1)
> + return offset_in_zone >= disk->zone_capacity;
> + return offset_in_zone >= disk->last_zone_capacity;
> +}
> +
> +static bool disk_zone_wplug_is_full(struct gendisk *disk,
> + struct blk_zone_wplug *zwplug)
> +{
> + return disk_zone_is_full(disk, zwplug->zone_no, zwplug->wp_offset);
> +}
> +
> static bool disk_insert_zone_wplug(struct gendisk *disk,
> struct blk_zone_wplug *zwplug)
> {
> @@ -548,7 +562,7 @@ static inline bool disk_should_remove_zone_wplug(struct gendisk *disk,
> return false;
>
> /* We can remove zone write plugs for zones that are empty or full. */
> - return !zwplug->wp_offset || zwplug->wp_offset >= disk->zone_capacity;
> + return !zwplug->wp_offset || disk_zone_wplug_is_full(disk, zwplug);
> }
>
> static void disk_remove_zone_wplug(struct gendisk *disk,
> @@ -669,13 +683,12 @@ static void disk_zone_wplug_abort(struct blk_zone_wplug *zwplug)
> static void disk_zone_wplug_abort_unaligned(struct gendisk *disk,
> struct blk_zone_wplug *zwplug)
> {
> - unsigned int zone_capacity = disk->zone_capacity;
> unsigned int wp_offset = zwplug->wp_offset;
> struct bio_list bl = BIO_EMPTY_LIST;
> struct bio *bio;
>
> while ((bio = bio_list_pop(&zwplug->bio_list))) {
> - if (wp_offset >= zone_capacity ||
> + if (disk_zone_is_full(disk, zwplug->zone_no, wp_offset) ||
Why don't you use disk_zone_wplug_is_full() here?
> (bio_op(bio) != REQ_OP_ZONE_APPEND &&
> bio_offset_from_zone_start(bio) != wp_offset)) {
> blk_zone_wplug_bio_io_error(zwplug, bio);
> @@ -914,7 +927,6 @@ void blk_zone_write_plug_init_request(struct request *req)
> sector_t req_back_sector = blk_rq_pos(req) + blk_rq_sectors(req);
> struct request_queue *q = req->q;
> struct gendisk *disk = q->disk;
> - unsigned int zone_capacity = disk->zone_capacity;
> struct blk_zone_wplug *zwplug =
> disk_get_zone_wplug(disk, blk_rq_pos(req));
> unsigned long flags;
> @@ -938,7 +950,7 @@ void blk_zone_write_plug_init_request(struct request *req)
> * into the back of the request.
> */
> spin_lock_irqsave(&zwplug->lock, flags);
> - while (zwplug->wp_offset < zone_capacity) {
> + while (!disk_zone_wplug_is_full(disk, zwplug)) {
> bio = bio_list_peek(&zwplug->bio_list);
> if (!bio)
> break;
> @@ -984,7 +996,7 @@ static bool blk_zone_wplug_prepare_bio(struct blk_zone_wplug *zwplug,
> * We know such BIO will fail, and that would potentially overflow our
> * write pointer offset beyond the end of the zone.
> */
> - if (zwplug->wp_offset >= disk->zone_capacity)
> + if (disk_zone_wplug_is_full(disk, zwplug))
> goto err;
>
> if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
> @@ -1561,6 +1573,7 @@ void disk_free_zone_resources(struct gendisk *disk)
> kfree(disk->conv_zones_bitmap);
> disk->conv_zones_bitmap = NULL;
> disk->zone_capacity = 0;
> + disk->last_zone_capacity = 0;
> disk->nr_zones = 0;
> }
>
> @@ -1605,6 +1618,7 @@ struct blk_revalidate_zone_args {
> unsigned long *conv_zones_bitmap;
> unsigned int nr_zones;
> unsigned int zone_capacity;
> + unsigned int last_zone_capacity;
> sector_t sector;
> };
>
> @@ -1622,6 +1636,7 @@ static int disk_update_zone_resources(struct gendisk *disk,
>
> disk->nr_zones = args->nr_zones;
> disk->zone_capacity = args->zone_capacity;
> + disk->last_zone_capacity = args->last_zone_capacity;
> swap(disk->conv_zones_bitmap, args->conv_zones_bitmap);
> if (disk->conv_zones_bitmap)
> nr_conv_zones = bitmap_weight(disk->conv_zones_bitmap,
> @@ -1673,6 +1688,9 @@ static int blk_revalidate_conv_zone(struct blk_zone *zone, unsigned int idx,
> return -ENODEV;
> }
>
> + if (disk_zone_is_last(disk, zone))
> + args->last_zone_capacity = zone->capacity;
> +
> if (!disk_need_zone_resources(disk))
> return 0;
>
> @@ -1703,8 +1721,9 @@ static int blk_revalidate_seq_zone(struct blk_zone *zone, unsigned int idx,
> */
> if (!args->zone_capacity)
> args->zone_capacity = zone->capacity;
> - if (!disk_zone_is_last(disk, zone) &&
> - zone->capacity != args->zone_capacity) {
> + if (disk_zone_is_last(disk, zone)) {
> + args->last_zone_capacity = zone->capacity;
> + } else if (zone->capacity != args->zone_capacity) {
> pr_warn("%s: Invalid variable zone capacity\n",
> disk->disk_name);
> return -ENODEV;
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index aefdda9f4ec7..24c36929920b 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -186,6 +186,7 @@ struct gendisk {
> */
> unsigned int nr_zones;
> unsigned int zone_capacity;
> + unsigned int last_zone_capacity;
> unsigned long *conv_zones_bitmap;
> unsigned int zone_wplugs_hash_bits;
> spinlock_t zone_wplugs_lock;
> --
> 2.45.1
>
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH 1/4] null_blk: Do not allow runt zone with zone capacity smaller then zone size
2024-05-30 5:40 ` [PATCH 1/4] null_blk: Do not allow runt zone with zone capacity smaller then zone size Damien Le Moal
@ 2024-05-30 7:37 ` Niklas Cassel
2024-05-30 20:34 ` Bart Van Assche
` (2 subsequent siblings)
3 siblings, 0 replies; 29+ messages in thread
From: Niklas Cassel @ 2024-05-30 7:37 UTC (permalink / raw)
To: Damien Le Moal
Cc: Jens Axboe, linux-block, dm-devel, Mike Snitzer, Mikulas Patocka
On Thu, May 30, 2024 at 02:40:32PM +0900, Damien Le Moal wrote:
> A zoned device with a smaller last zone together with a zone capacity
> smaller than the zone size does not make any sense, as that does not
> correspond to any possible setup for a real device:
> 1) For ZNS and zoned UFS devices, all zones are always the same size.
> 2) For SMR HDDs, all zones always have the same capacity.
> In other words, if we have a smaller last runt zone, then this zone
> capacity should always be equal to the zone size.
>
> Add a check in null_init_zoned_dev() to prevent configurations that have
> both a smaller last zone and a zone capacity smaller than the zone size.
>
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
> drivers/block/null_blk/zoned.c | 11 +++++++++++
> 1 file changed, 11 insertions(+)
>
> diff --git a/drivers/block/null_blk/zoned.c b/drivers/block/null_blk/zoned.c
> index 79c8e5e99f7f..f118d304f310 100644
> --- a/drivers/block/null_blk/zoned.c
> +++ b/drivers/block/null_blk/zoned.c
> @@ -74,6 +74,17 @@ int null_init_zoned_dev(struct nullb_device *dev,
> return -EINVAL;
> }
>
> + /*
> + * If a smaller zone capacity was requested, do not allow a smaller last
> + * zone at the same time as such zone configuration does not correspond
> + * to any real zoned device.
> + */
> + if (dev->zone_capacity != dev->zone_size &&
> + dev->size & (dev->zone_size - 1)) {
> + pr_err("A smaller last zone is not allowed with zone capacity smaller than zone size.\n");
> + return -EINVAL;
> + }
> +
> zone_capacity_sects = mb_to_sects(dev->zone_capacity);
> dev_capacity_sects = mb_to_sects(dev->size);
> dev->zone_size_sects = mb_to_sects(dev->zone_size);
> --
> 2.45.1
>
Reviewed-by: Niklas Cassel <cassel@kernel.org>
* Re: [PATCH 2/4] block: Fix validation of zoned device with a runt zone
2024-05-30 5:40 ` [PATCH 2/4] block: Fix validation of zoned device with a runt zone Damien Le Moal
@ 2024-05-30 7:37 ` Niklas Cassel
2024-05-30 20:37 ` Bart Van Assche
` (2 subsequent siblings)
3 siblings, 0 replies; 29+ messages in thread
From: Niklas Cassel @ 2024-05-30 7:37 UTC (permalink / raw)
To: Damien Le Moal
Cc: Jens Axboe, linux-block, dm-devel, Mike Snitzer, Mikulas Patocka
On Thu, May 30, 2024 at 02:40:33PM +0900, Damien Le Moal wrote:
> Commit ecfe43b11b02 ("block: Remember zone capacity when revalidating
> zones") introduced checks to ensure that the capacity of the zones of
> a zoned device is constant for all zones. However, this check ignores
> the possibility that a zoned device has a smaller last zone with a size
> not equal to the capacity of other zones. Such a device corresponds in
> practice to an SMR drive with a smaller last zone and all zones with a
> capacity equal to the zone size, leading to the last zone capacity being
> different from the capacity of other zones.
>
> Correctly handle such devices by fixing the check for the constant zone
> capacity in blk_revalidate_seq_zone() using the new helper function
> disk_zone_is_last(). This helper function is also used in
> blk_revalidate_zone_cb() when checking the zone size.
>
> Fixes: ecfe43b11b02 ("block: Remember zone capacity when revalidating zones")
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
> block/blk-zoned.c | 16 +++++++++++-----
> 1 file changed, 11 insertions(+), 5 deletions(-)
>
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index 03aa4eead39e..402a50a1ac4d 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -450,6 +450,11 @@ static inline bool disk_zone_is_conv(struct gendisk *disk, sector_t sector)
> return test_bit(disk_zone_no(disk, sector), disk->conv_zones_bitmap);
> }
>
> +static bool disk_zone_is_last(struct gendisk *disk, struct blk_zone *zone)
> +{
> + return zone->start + zone->len >= get_capacity(disk);
> +}
> +
> static bool disk_insert_zone_wplug(struct gendisk *disk,
> struct blk_zone_wplug *zwplug)
> {
> @@ -1693,11 +1698,13 @@ static int blk_revalidate_seq_zone(struct blk_zone *zone, unsigned int idx,
>
> /*
> * Remember the capacity of the first sequential zone and check
> - * if it is constant for all zones.
> + * if it is constant for all zones, ignoring the last zone as it can be
> + * smaller.
> */
> if (!args->zone_capacity)
> args->zone_capacity = zone->capacity;
> - if (zone->capacity != args->zone_capacity) {
> + if (!disk_zone_is_last(disk, zone) &&
> + zone->capacity != args->zone_capacity) {
> pr_warn("%s: Invalid variable zone capacity\n",
> disk->disk_name);
> return -ENODEV;
> @@ -1732,7 +1739,6 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
> {
> struct blk_revalidate_zone_args *args = data;
> struct gendisk *disk = args->disk;
> - sector_t capacity = get_capacity(disk);
> sector_t zone_sectors = disk->queue->limits.chunk_sectors;
> int ret;
>
> @@ -1743,7 +1749,7 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
> return -ENODEV;
> }
>
> - if (zone->start >= capacity || !zone->len) {
> + if (zone->start >= get_capacity(disk) || !zone->len) {
> pr_warn("%s: Invalid zone start %llu, length %llu\n",
> disk->disk_name, zone->start, zone->len);
> return -ENODEV;
> @@ -1753,7 +1759,7 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
> * All zones must have the same size, with the exception on an eventual
> * smaller last zone.
> */
> - if (zone->start + zone->len < capacity) {
> + if (!disk_zone_is_last(disk, zone)) {
> if (zone->len != zone_sectors) {
> pr_warn("%s: Invalid zoned device with non constant zone size\n",
> disk->disk_name);
> --
> 2.45.1
>
Reviewed-by: Niklas Cassel <cassel@kernel.org>
* Re: [PATCH 4/4] dm: Improve zone resource limits handling
2024-05-30 5:40 ` [PATCH 4/4] dm: Improve zone resource limits handling Damien Le Moal
@ 2024-05-30 7:37 ` Niklas Cassel
2024-05-31 19:26 ` Benjamin Marzinski
` (2 subsequent siblings)
3 siblings, 0 replies; 29+ messages in thread
From: Niklas Cassel @ 2024-05-30 7:37 UTC (permalink / raw)
To: Damien Le Moal
Cc: Jens Axboe, linux-block, dm-devel, Mike Snitzer, Mikulas Patocka
On Thu, May 30, 2024 at 02:40:35PM +0900, Damien Le Moal wrote:
> The generic stacking of limits implemented in the block layer cannot
> correctly handle stacking of zone resource limits (max open zones and
> max active zones) because these limits are for an entire device but the
> stacking may be for a portion of that device (e.g. a dm-linear target
> that does not cover an entire block device). As a result, when DM
> devices are created on top of zoned block devices, the DM device never
> has any zone resource limits advertized, which is only correct if all
> underlying target devices also have no zone resource limits.
> If at least one target device has resource limits, the user may see
> either performance issues (if the max open zone limit of the device is
> exceeded) or write I/O errors if the max active zone limit of one of
> the underlying target devices is exceeded.
>
> While it is very difficult to correctly and reliably stack zone resource
> limits in general, cases where targets are not sharing zone resources of
> the same device can be dealt with relatively easily. Such a situation
> happens when a target maps all sequential zones of a zoned block device:
> for such mapping, other targets mapping other parts of the same zoned
> block device can only contain conventional zones and thus will not
> require any zone resource to correctly handle write operations.
>
> For a mapped device constructed with such targets, which includes mapped
> devices constructed with targets mapping entire zoned block devices, the
> zone resource limits can be reliably determined using the non-zero
> minimum of the zone resource limits of all targets.
>
> For mapped devices that include targets partially mapping the set of
> sequential write required zones of zoned block devices, instead of
> advertising no zone resource limits, it is also better to set the mapped
> device limits to the non-zero minimum of the limits of all targets,
> capped with the number of sequential zones being mapped.
> While still not completely reliable, this allows indicating to the user
> that the underlying devices used have limits.
>
> This commit improves zone resource limits handling as described above
> using the function dm_set_zone_resource_limits(). This function is
> executed from dm_set_zones_restrictions() and iterates the targets of a
> mapped device to evaluate the max open and max active zone limits. This
> relies on an internal "stacking" of the limits of the target devices
> combined with a direct counting of the number of sequential zones
> mapped by the target.
> 1) For a target mapping an entire zoned block device, the limits are
> evaluated as the non-zero minimum of the limits of the target device
> and of the mapped device.
> 2) For a target partially mapping a zoned block device, the number of
> mapped sequential zones is compared to the total number of sequential
> zones of the target device to determine the limits: if the target
> maps all sequential write required zones of the device, then the
> target can reliably inherit the device limits. Otherwise, the target
> limits are set to the device limits capped by the number of mapped
> sequential zones.
> With this evaluation done for each target, the mapped device zone
> resource limits are evaluated as the non-zero minimum of the limits of
> all the targets.
>
> For configurations resulting in unreliable limits, a warning message is
> issued.
>
> The counting of mapped sequential zones for the target is done using the
> new function dm_device_count_zones() which performs a report zones on
> the entire block device with the callback dm_device_count_zones_cb().
> This count of mapped sequential zones is used to determine if the mapped
> device contains only conventional zones. This allows simplifying
> dm_set_zones_restrictions() to not do a report zones. For mapped devices
> mapping only conventional zones, dm_set_zone_resource_limits() changes
> the mapped device to a regular device.
>
> To further cleanup dm_set_zones_restrictions(), the message about the
> type of zone append (native or emulated) is moved inside
> dm_revalidate_zones().
>
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
> drivers/md/dm-zone.c | 214 +++++++++++++++++++++++++++++++++++--------
> 1 file changed, 177 insertions(+), 37 deletions(-)
>
> diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
> index 5d66d916730e..5f8b499529cf 100644
> --- a/drivers/md/dm-zone.c
> +++ b/drivers/md/dm-zone.c
> @@ -145,17 +145,174 @@ bool dm_is_zone_write(struct mapped_device *md, struct bio *bio)
> }
> }
>
> +struct dm_device_zone_count {
> + sector_t start;
> + sector_t len;
> + unsigned int total_nr_seq_zones;
> + unsigned int target_nr_seq_zones;
> +};
> +
> /*
> - * Count conventional zones of a mapped zoned device. If the device
> - * only has conventional zones, do not expose it as zoned.
> + * Count the total number of and the number of mapped sequential zones of a
> + * target zoned device.
> */
> -static int dm_check_zoned_cb(struct blk_zone *zone, unsigned int idx,
> - void *data)
> +static int dm_device_count_zones_cb(struct blk_zone *zone,
> + unsigned int idx, void *data)
> {
> - unsigned int *nr_conv_zones = data;
> + struct dm_device_zone_count *zc = data;
> +
> + if (zone->type != BLK_ZONE_TYPE_CONVENTIONAL) {
> + zc->total_nr_seq_zones++;
> + if (zone->start >= zc->start &&
> + zone->start < zc->start + zc->len)
> + zc->target_nr_seq_zones++;
> + }
>
> - if (zone->type == BLK_ZONE_TYPE_CONVENTIONAL)
> - (*nr_conv_zones)++;
> + return 0;
> +}
> +
> +static int dm_device_count_zones(struct dm_dev *dev,
> + struct dm_device_zone_count *zc)
> +{
> + int ret;
> +
> + ret = blkdev_report_zones(dev->bdev, 0, UINT_MAX,
> + dm_device_count_zones_cb, zc);
> + if (ret < 0)
> + return ret;
> + if (!ret)
> + return -EIO;
> + return 0;
> +}
> +
> +struct dm_zone_resource_limits {
> + unsigned int mapped_nr_seq_zones;
> + unsigned int max_open_zones;
> + unsigned int max_active_zones;
> + bool reliable_limits;
> +};
> +
> +static int device_get_zone_resource_limits(struct dm_target *ti,
> + struct dm_dev *dev, sector_t start,
> + sector_t len, void *data)
> +{
> + struct dm_zone_resource_limits *zlim = data;
> + struct gendisk *disk = dev->bdev->bd_disk;
> + int ret;
> + struct dm_device_zone_count zc = {
> + .start = start,
> + .len = len,
> + };
> +
> + /*
> + * If the target is not the whole device, the device zone resources may
> + * be shared between different targets. Check this by counting the
> + * number of mapped sequential zones: if this number is smaller than the
> + * total number of sequential zones of the target device, then resource
> + * sharing may happen and the zone limits will not be reliable.
> + */
> + ret = dm_device_count_zones(dev, &zc);
> + if (ret) {
> + DMERR("Count device %s zones failed",
> + disk->disk_name);
> + return ret;
> + }
> +
> + zlim->mapped_nr_seq_zones += zc.target_nr_seq_zones;
> +
> + /*
> + * If the target does not map any sequential zones, then we do not need
> + * any zone resource limits.
> + */
> + if (!zc.target_nr_seq_zones)
> + return 0;
> +
> + /*
> + * If the target does not map all sequential zones, the limits
> + * will not be reliable.
> + */
> + if (zc.target_nr_seq_zones < zc.total_nr_seq_zones)
> + zlim->reliable_limits = false;
> +
> + zlim->max_active_zones =
> + min_not_zero(disk->queue->limits.max_active_zones,
> + zlim->max_active_zones);
> + if (zlim->max_active_zones != UINT_MAX)
> + zlim->max_active_zones =
> + min(zlim->max_active_zones, zc.target_nr_seq_zones);
> +
> + zlim->max_open_zones =
> + min_not_zero(disk->queue->limits.max_open_zones,
> + zlim->max_open_zones);
> + if (zlim->max_open_zones != UINT_MAX)
> + zlim->max_open_zones =
> + min3(zlim->max_open_zones, zc.target_nr_seq_zones,
> + zlim->max_active_zones);
> +
> + return 0;
> +}
> +
> +static int dm_set_zone_resource_limits(struct mapped_device *md,
> + struct dm_table *t, struct queue_limits *lim)
> +{
> + struct gendisk *disk = md->disk;
> + struct dm_zone_resource_limits zlim = {
> + .max_open_zones = UINT_MAX,
> + .max_active_zones = UINT_MAX,
> + .reliable_limits = true,
> + };
> +
> + /* Get the zone resource limits from the targets. */
> + for (unsigned int i = 0; i < t->num_targets; i++) {
> + struct dm_target *ti = dm_table_get_target(t, i);
> +
> + if (!ti->type->iterate_devices ||
> + ti->type->iterate_devices(ti,
> + device_get_zone_resource_limits, &zlim)) {
> + DMERR("Could not determine %s zone resource limits",
> + md->disk->disk_name);
> + return -ENODEV;
> + }
> + }
> +
> + /*
> + * If we only have conventional zones mapped, expose the mapped device
> + * as a regular device.
> + */
> + if (!zlim.mapped_nr_seq_zones) {
> + lim->max_open_zones = 0;
> + lim->max_active_zones = 0;
> + lim->max_zone_append_sectors = 0;
> + lim->zone_write_granularity = 0;
> + lim->zoned = false;
> + clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
> + md->nr_zones = 0;
> + disk->nr_zones = 0;
> + return 0;
> + }
> +
> + /*
> + * Warn if the mapped device is partially using zone resources of the
> + * target devices as that leads to unreliable limits, i.e. if another
> + * mapped device uses the same underlying devices, we cannot enforce
> + * zone limits to guarantee that writing will not lead to errors.
> + * Note that we really should return an error for such case but there is
> + * no easy way to find out if another mapped device uses the same
> + * underlying zoned devices.
> + */
> + if (!zlim.reliable_limits)
> + DMWARN("%s zone resource limits may be unreliable",
> + disk->disk_name);
> +
> + if (zlim.max_open_zones >= zlim.mapped_nr_seq_zones)
> + lim->max_open_zones = 0;
> + else
> + lim->max_open_zones = zlim.max_open_zones;
> +
> + if (zlim.max_active_zones >= zlim.mapped_nr_seq_zones)
> + lim->max_active_zones = 0;
> + else
> + lim->max_active_zones = zlim.max_active_zones;
>
> return 0;
> }
> @@ -172,8 +329,13 @@ static int dm_revalidate_zones(struct mapped_device *md, struct dm_table *t)
> int ret;
>
> /* Revalidate only if something changed. */
> - if (!disk->nr_zones || disk->nr_zones != md->nr_zones)
> + if (!disk->nr_zones || disk->nr_zones != md->nr_zones) {
> + DMINFO("%s using %s zone append",
> + md->disk->disk_name,
> + queue_emulates_zone_append(disk->queue) ?
> + "emulated" : "native");
> md->nr_zones = 0;
> + }
>
> if (md->nr_zones)
> return 0;
> @@ -224,8 +386,6 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
> struct queue_limits *lim)
> {
> struct mapped_device *md = t->md;
> - struct gendisk *disk = md->disk;
> - unsigned int nr_conv_zones = 0;
> int ret;
>
> /*
> @@ -244,36 +404,16 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
> return 0;
>
> /*
> - * Count conventional zones to check that the mapped device will indeed
> - * have sequential write required zones.
> + * Determine the max open and max active zone limits for the mapped
> + * device. For a mapped device containing only conventional zones, the
> + * mapped device is changed to be a regular block device, so exit early
> + * for such case.
> */
> - md->zone_revalidate_map = t;
> - ret = dm_blk_report_zones(disk, 0, UINT_MAX,
> - dm_check_zoned_cb, &nr_conv_zones);
> - md->zone_revalidate_map = NULL;
> - if (ret < 0) {
> - DMERR("Check zoned failed %d", ret);
> + ret = dm_set_zone_resource_limits(md, t, lim);
> + if (ret)
> return ret;
> - }
> -
> - /*
> - * If we only have conventional zones, expose the mapped device as
> - * a regular device.
> - */
> - if (nr_conv_zones >= ret) {
> - lim->max_open_zones = 0;
> - lim->max_active_zones = 0;
> - lim->zoned = false;
> - clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
> - disk->nr_zones = 0;
> + if (!lim->zoned)
> return 0;
> - }
> -
> - if (!md->disk->nr_zones) {
> - DMINFO("%s using %s zone append",
> - md->disk->disk_name,
> - queue_emulates_zone_append(q) ? "emulated" : "native");
> - }
I would have moved this print to a separate commit.
Regardless:
Reviewed-by: Niklas Cassel <cassel@kernel.org>
* Re: [PATCH 3/4] block: Fix zone write plugging handling of devices with a runt zone
2024-05-30 7:37 ` Niklas Cassel
@ 2024-05-30 11:09 ` Damien Le Moal
0 siblings, 0 replies; 29+ messages in thread
From: Damien Le Moal @ 2024-05-30 11:09 UTC (permalink / raw)
To: Niklas Cassel
Cc: Jens Axboe, linux-block, dm-devel, Mike Snitzer, Mikulas Patocka
On 5/30/24 16:37, Niklas Cassel wrote:
[...]
>> +static bool disk_zone_is_full(struct gendisk *disk,
>> + unsigned int zno, unsigned int offset_in_zone)
>
> Why not just call the third parameter wp?
Because it does not have to be a plug write pointer. And even then, zone write
plugging uses offset in a zone as write pointer values :)
[...]
>> static void disk_remove_zone_wplug(struct gendisk *disk,
>> @@ -669,13 +683,12 @@ static void disk_zone_wplug_abort(struct blk_zone_wplug *zwplug)
>> static void disk_zone_wplug_abort_unaligned(struct gendisk *disk,
>> struct blk_zone_wplug *zwplug)
>> {
>> - unsigned int zone_capacity = disk->zone_capacity;
>> unsigned int wp_offset = zwplug->wp_offset;
>> struct bio_list bl = BIO_EMPTY_LIST;
>> struct bio *bio;
>>
>> while ((bio = bio_list_pop(&zwplug->bio_list))) {
>> - if (wp_offset >= zone_capacity ||
>> + if (disk_zone_is_full(disk, zwplug->zone_no, wp_offset) ||
>
> Why don't you use disk_zone_wplug_is_full() here?
Because the loop tracks a local wp_offset that is advanced as BIOs are
processed, while disk_zone_wplug_is_full() uses the zone write plug write
pointer offset. So we cannot use it here.
--
Damien Le Moal
Western Digital Research
* Re: [PATCH 3/4] block: Fix zone write plugging handling of devices with a runt zone
2024-05-30 5:40 ` [PATCH 3/4] block: Fix zone write plugging handling of devices " Damien Le Moal
2024-05-30 7:37 ` Niklas Cassel
@ 2024-05-30 12:51 ` Niklas Cassel
2024-05-30 20:40 ` Bart Van Assche
` (2 subsequent siblings)
4 siblings, 0 replies; 29+ messages in thread
From: Niklas Cassel @ 2024-05-30 12:51 UTC (permalink / raw)
To: Damien Le Moal
Cc: Jens Axboe, linux-block, dm-devel, Mike Snitzer, Mikulas Patocka
On Thu, May 30, 2024 at 02:40:34PM +0900, Damien Le Moal wrote:
> A zoned device may have a last sequential write required zone that is
> smaller than other zones. However, all tests to check if a zone write
> plug write offset exceeds the zone capacity use the same capacity
> value stored in the gendisk zone_capacity field. This is incorrect for a
> zoned device with a last runt (smaller) zone.
>
> Add the new field last_zone_capacity to struct gendisk to store the
> capacity of the last zone of the device. blk_revalidate_seq_zone() and
> blk_revalidate_conv_zone() are both modified to get this value when
> disk_zone_is_last() returns true. Similarly to zone_capacity, the value
> is first stored using the last_zone_capacity field of struct
> blk_revalidate_zone_args. Once zone revalidation of all zones is done,
> this is used to set the gendisk last_zone_capacity field.
>
> The checks to determine if a zone is full or if a sector offset in a
> zone exceeds the zone capacity in disk_should_remove_zone_wplug(),
> disk_zone_wplug_abort_unaligned(), blk_zone_write_plug_init_request(),
> and blk_zone_wplug_prepare_bio() are modified to use the new helper
> functions disk_zone_is_full() and disk_zone_wplug_is_full().
> disk_zone_is_full() uses the zone index to determine if the zone being
> tested is the last one of the disk and uses either the disk
> zone_capacity or last_zone_capacity accordingly.
>
> Fixes: dd291d77cc90 ("block: Introduce zone write plugging")
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
Reviewed-by: Niklas Cassel <cassel@kernel.org>
* Re: [PATCH 1/4] null_blk: Do not allow runt zone with zone capacity smaller then zone size
2024-05-30 5:40 ` [PATCH 1/4] null_blk: Do not allow runt zone with zone capacity smaller then zone size Damien Le Moal
2024-05-30 7:37 ` Niklas Cassel
@ 2024-05-30 20:34 ` Bart Van Assche
2024-06-01 5:25 ` Christoph Hellwig
2024-06-03 6:53 ` Hannes Reinecke
3 siblings, 0 replies; 29+ messages in thread
From: Bart Van Assche @ 2024-05-30 20:34 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe, linux-block, dm-devel, Mike Snitzer,
Mikulas Patocka
On 5/29/24 22:40, Damien Le Moal wrote:
> A zoned device with a smaller last zone together with a zone capacity
> smaller than the zone size does not make any sense, as that does not
> correspond to any possible setup for a real device:
> 1) For ZNS and zoned UFS devices, all zones are always the same size.
> 2) For SMR HDDs, all zones always have the same capacity.
> In other words, if we have a smaller last runt zone, then this zone
> capacity should always be equal to the zone size.
>
> Add a check in null_init_zoned_dev() to prevent configurations that have
> both a smaller last zone and a zone capacity smaller than the zone size.
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
* Re: [PATCH 2/4] block: Fix validation of zoned device with a runt zone
2024-05-30 5:40 ` [PATCH 2/4] block: Fix validation of zoned device with a runt zone Damien Le Moal
2024-05-30 7:37 ` Niklas Cassel
@ 2024-05-30 20:37 ` Bart Van Assche
2024-06-01 5:26 ` Christoph Hellwig
2024-06-03 6:55 ` Hannes Reinecke
3 siblings, 0 replies; 29+ messages in thread
From: Bart Van Assche @ 2024-05-30 20:37 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe, linux-block, dm-devel, Mike Snitzer,
Mikulas Patocka
On 5/29/24 22:40, Damien Le Moal wrote:
> Commit ecfe43b11b02 ("block: Remember zone capacity when revalidating
> zones") introduced checks to ensure that the capacity of the zones of
> a zoned device is constant for all zones. However, this check ignores
> the possibility that a zoned device has a smaller last zone with a size
> not equal to the capacity of other zones. Such a device corresponds in
> practice to an SMR drive with a smaller last zone and all zones with a
> capacity equal to the zone size, leading to the last zone capacity being
> different from the capacity of other zones.
>
> Correctly handle such devices by fixing the check for the constant zone
> capacity in blk_revalidate_seq_zone() using the new helper function
> disk_zone_is_last(). This helper function is also used in
> blk_revalidate_zone_cb() when checking the zone size.
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
> @@ -1732,7 +1739,6 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
> {
> struct blk_revalidate_zone_args *args = data;
> struct gendisk *disk = args->disk;
> - sector_t capacity = get_capacity(disk);
> sector_t zone_sectors = disk->queue->limits.chunk_sectors;
> int ret;
Thank you for having removed the local variable with the name 'capacity' from
this function. Having a variable with the name 'capacity' in a function that
deals both with disk capacity and zone capacity was confusing ...
Bart.
* Re: [PATCH 3/4] block: Fix zone write plugging handling of devices with a runt zone
2024-05-30 5:40 ` [PATCH 3/4] block: Fix zone write plugging handling of devices " Damien Le Moal
2024-05-30 7:37 ` Niklas Cassel
2024-05-30 12:51 ` Niklas Cassel
@ 2024-05-30 20:40 ` Bart Van Assche
2024-06-01 5:26 ` Christoph Hellwig
2024-06-03 6:56 ` Hannes Reinecke
4 siblings, 0 replies; 29+ messages in thread
From: Bart Van Assche @ 2024-05-30 20:40 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe, linux-block, dm-devel, Mike Snitzer,
Mikulas Patocka
On 5/29/24 22:40, Damien Le Moal wrote:
> A zoned device may have a last sequential write required zone that is
> smaller than other zones. However, all tests to check if a zone write
> plug write offset exceeds the zone capacity use the same capacity
> value stored in the gendisk zone_capacity field. This is incorrect for a
> zoned device with a last runt (smaller) zone.
>
> Add the new field last_zone_capacity to struct gendisk to store the
> capacity of the last zone of the device. blk_revalidate_seq_zone() and
> blk_revalidate_conv_zone() are both modified to get this value when
> disk_zone_is_last() returns true. Similarly to zone_capacity, the value
> is first stored using the last_zone_capacity field of struct
> blk_revalidate_zone_args. Once zone revalidation of all zones is done,
> this is used to set the gendisk last_zone_capacity field.
>
> The checks to determine if a zone is full or if a sector offset in a
> zone exceeds the zone capacity in disk_should_remove_zone_wplug(),
> disk_zone_wplug_abort_unaligned(), blk_zone_write_plug_init_request(),
> and blk_zone_wplug_prepare_bio() are modified to use the new helper
> functions disk_zone_is_full() and disk_zone_wplug_is_full().
> disk_zone_is_full() uses the zone index to determine if the zone being
> tested is the last one of the disk and uses either the disk
> zone_capacity or last_zone_capacity accordingly.
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
* Re: [PATCH 0/4] Zone write plugging and DM zone fixes
2024-05-30 5:40 [PATCH 0/4] Zone write plugging and DM zone fixes Damien Le Moal
` (3 preceding siblings ...)
2024-05-30 5:40 ` [PATCH 4/4] dm: Improve zone resource limits handling Damien Le Moal
@ 2024-05-30 21:03 ` Jens Axboe
2024-05-30 23:58 ` Damien Le Moal
2024-05-30 21:04 ` (subset) " Jens Axboe
5 siblings, 1 reply; 29+ messages in thread
From: Jens Axboe @ 2024-05-30 21:03 UTC (permalink / raw)
To: Damien Le Moal, linux-block, dm-devel, Mike Snitzer,
Mikulas Patocka
On 5/29/24 11:40 PM, Damien Le Moal wrote:
> The first patch of this series fixes null_blk to avoid weird zone
> configurations, namely, a zoned device with a last smaller zone with a
> zone capacity smaller than the zone size. Related to this, the next 2
> patches fix the handling by zone write plugging of zoned devices with a
> last smaller zone. That was completely botched in the initial series.
>
> Finally, the last patch addresses a long standing issue with zoned
> device-mapper devices: no zone resource limits (max open and max active
> zones limits) are not exposed to the user. This patch fixes that,
> allowing for the limits of the underlying target devices to be exposed
> with a warning for setups that lead to unreliable limits.
Would be nice to get the dm part acked, but I guess I can queue up 1-3
for now for 6.10?
--
Jens Axboe
* Re: (subset) [PATCH 0/4] Zone write plugging and DM zone fixes
2024-05-30 5:40 [PATCH 0/4] Zone write plugging and DM zone fixes Damien Le Moal
` (4 preceding siblings ...)
2024-05-30 21:03 ` [PATCH 0/4] Zone write plugging and DM zone fixes Jens Axboe
@ 2024-05-30 21:04 ` Jens Axboe
5 siblings, 0 replies; 29+ messages in thread
From: Jens Axboe @ 2024-05-30 21:04 UTC (permalink / raw)
To: linux-block, dm-devel, Mike Snitzer, Mikulas Patocka,
Damien Le Moal
On Thu, 30 May 2024 14:40:31 +0900, Damien Le Moal wrote:
> The first patch of this series fixes null_blk to avoid weird zone
> configurations, namely, a zoned device with a last smaller zone with a
> zone capacity smaller than the zone size. Related to this, the next 2
> patches fix the handling by zone write plugging of zoned devices with a
> last smaller zone. That was completely botched in the initial series.
>
> Finally, the last patch addresses a long standing issue with zoned
> device-mapper devices: zone resource limits (max open and max active
> zones limits) are not exposed to the user. This patch fixes that,
> allowing for the limits of the underlying target devices to be exposed
> with a warning for setups that lead to unreliable limits.
>
> [...]
Applied, thanks!
[1/4] null_blk: Do not allow runt zone with zone capacity smaller then zone size
commit: b164316808ec5de391c3e7b0148ec937d32d280d
[2/4] block: Fix validation of zoned device with a runt zone
commit: cd6399936869b4a042dd1270078cbf2bb871a407
[3/4] block: Fix zone write plugging handling of devices with a runt zone
commit: 29459c3eaa5c6261fbe0dea7bdeb9b48d35d862a
Best regards,
--
Jens Axboe
* Re: [PATCH 0/4] Zone write plugging and DM zone fixes
2024-05-30 21:03 ` [PATCH 0/4] Zone write plugging and DM zone fixes Jens Axboe
@ 2024-05-30 23:58 ` Damien Le Moal
0 siblings, 0 replies; 29+ messages in thread
From: Damien Le Moal @ 2024-05-30 23:58 UTC (permalink / raw)
To: Jens Axboe, linux-block, dm-devel, Mike Snitzer, Mikulas Patocka
On 5/31/24 6:03 AM, Jens Axboe wrote:
> On 5/29/24 11:40 PM, Damien Le Moal wrote:
>> The first patch of this series fixes null_blk to avoid weird zone
>> configurations, namely, a zoned device with a last smaller zone with a
>> zone capacity smaller than the zone size. Related to this, the next 2
>> patches fix the handling by zone write plugging of zoned devices with a
>> last smaller zone. That was completely botched in the initial series.
>>
>> Finally, the last patch addresses a long standing issue with zoned
>> device-mapper devices: zone resource limits (max open and max active
>> zones limits) are not exposed to the user. This patch fixes that,
>> allowing for the limits of the underlying target devices to be exposed
>> with a warning for setups that lead to unreliable limits.
>
> Would be nice to get the dm part acked, but I guess I can queue up 1-3
> for now for 6.10?
Yes please.
Mike, Mikulas,
Could you please review patch 4? It is a long-standing issue with DM, but now
that zone write plugging fakes a smaller 128 max open zones limit for drives
with no limits, zonefs gets confused and tests fail. So it would be great to
get this patch in for 6.10 as a fix.
Thanks !
--
Damien Le Moal
Western Digital Research
* Re: [PATCH 4/4] dm: Improve zone resource limits handling
2024-05-30 5:40 ` [PATCH 4/4] dm: Improve zone resource limits handling Damien Le Moal
2024-05-30 7:37 ` Niklas Cassel
@ 2024-05-31 19:26 ` Benjamin Marzinski
2024-06-01 5:29 ` Christoph Hellwig
2024-06-01 5:29 ` Christoph Hellwig
2024-06-03 6:58 ` Hannes Reinecke
3 siblings, 1 reply; 29+ messages in thread
From: Benjamin Marzinski @ 2024-05-31 19:26 UTC (permalink / raw)
To: Damien Le Moal
Cc: Jens Axboe, linux-block, dm-devel, Mike Snitzer, Mikulas Patocka
On Thu, May 30, 2024 at 02:40:35PM +0900, Damien Le Moal wrote:
> The generic stacking of limits implemented in the block layer cannot
> correctly handle stacking of zone resource limits (max open zones and
> max active zones) because these limits are for an entire device but the
> stacking may be for a portion of that device (e.g. a dm-linear target
> that does not cover an entire block device). As a result, when DM
> devices are created on top of zoned block devices, the DM device never
> has any zone resource limits advertized, which is only correct if all
> underlying target devices also have no zone resource limits.
> If at least one target device has resource limits, the user may see
> either performance issues (if the max open zone limit of the device is
> exceeded) or write I/O errors if the max active zone limit of one of
> the underlying target devices is exceeded.
>
> While it is very difficult to correctly and reliably stack zone resource
> limits in general, cases where targets are not sharing zone resources of
> the same device can be dealt with relatively easily. Such situation
> happens when a target maps all sequential zones of a zoned block device:
> for such mapping, other targets mapping other parts of the same zoned
> block device can only contain conventional zones and thus will not
> require any zone resource to correctly handle write operations.
>
> For a mapped device constructed with such targets, which includes mapped
> devices constructed with targets mapping entire zoned block devices, the
> zone resource limits can be reliably determined using the non-zero
> minimum of the zone resource limits of all targets.
>
> For mapped devices that include targets partially mapping the set of
> sequential write required zones of zoned block devices, instead of
> advertizing no zone resource limits, it is also better to set the mapped
> device limits to the non-zero minimum of the limits of all targets,
> capped with the number of sequential zones being mapped.
> While still not completely reliable, this allows indicating to the user
> that the underlying devices used have limits.
>
> This commit improves zone resource limits handling as described above
> using the function dm_set_zone_resource_limits(). This function is
> executed from dm_set_zones_restrictions() and iterates the targets of a
> mapped device to evaluate the max open and max active zone limits. This
> relies on an internal "stacking" of the limits of the target devices
> combined with a direct counting of the number of sequential zones
> mapped by the target.
> 1) For a target mapping an entire zoned block device, the limits are
> evaluated as the non-zero minimum of the limits of the target device
> and of the mapped device.
> 2) For a target partially mapping a zoned block device, the number of
> mapped sequential zones is compared to the total number of sequential
> zones of the target device to determine the limits: if the target
> maps all sequential write required zones of the device, then the
> target can reliably inherit the device limits. Otherwise, the target
> limits are set to the device limits capped by the number of mapped
> sequential zones.
> With this evaluation done for each target, the mapped device zone
> resource limits are evaluated as the non-zero minimum of the limits of
> all the targets.
Does this mean that if a dm device was made up of two linear targets,
one of which mapped an entire zoned device, and one of which mapped a
single zone of another zoned device, the max active zone limit of the
entire dm device would be 1? That seems wrong.
-Ben
> For configurations resulting in unreliable limits, a warning message is
> issued.
>
> The counting of mapped sequential zones for the target is done using the
> new function dm_device_count_zones() which performs a report zones on
> the entire block device with the callback dm_device_count_zones_cb().
> This count of mapped sequential zones is used to determine if the mapped
> device contains only conventional zones. This allows simplifying
> dm_set_zones_restrictions() to not do a report zones. For mapped devices
> mapping only conventional zones, dm_set_zone_resource_limits() changes
> the mapped device to a regular device.
>
> To further cleanup dm_set_zones_restrictions(), the message about the
> type of zone append (native or emulated) is moved inside
> dm_revalidate_zones().
>
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
> drivers/md/dm-zone.c | 214 +++++++++++++++++++++++++++++++++++--------
> 1 file changed, 177 insertions(+), 37 deletions(-)
>
> diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
> index 5d66d916730e..5f8b499529cf 100644
> --- a/drivers/md/dm-zone.c
> +++ b/drivers/md/dm-zone.c
> @@ -145,17 +145,174 @@ bool dm_is_zone_write(struct mapped_device *md, struct bio *bio)
> }
> }
>
> +struct dm_device_zone_count {
> + sector_t start;
> + sector_t len;
> + unsigned int total_nr_seq_zones;
> + unsigned int target_nr_seq_zones;
> +};
> +
> /*
> - * Count conventional zones of a mapped zoned device. If the device
> - * only has conventional zones, do not expose it as zoned.
> + * Count the total number of and the number of mapped sequential zones of a
> + * target zoned device.
> */
> -static int dm_check_zoned_cb(struct blk_zone *zone, unsigned int idx,
> - void *data)
> +static int dm_device_count_zones_cb(struct blk_zone *zone,
> + unsigned int idx, void *data)
> {
> - unsigned int *nr_conv_zones = data;
> + struct dm_device_zone_count *zc = data;
> +
> + if (zone->type != BLK_ZONE_TYPE_CONVENTIONAL) {
> + zc->total_nr_seq_zones++;
> + if (zone->start >= zc->start &&
> + zone->start < zc->start + zc->len)
> + zc->target_nr_seq_zones++;
> + }
>
> - if (zone->type == BLK_ZONE_TYPE_CONVENTIONAL)
> - (*nr_conv_zones)++;
> + return 0;
> +}
> +
> +static int dm_device_count_zones(struct dm_dev *dev,
> + struct dm_device_zone_count *zc)
> +{
> + int ret;
> +
> + ret = blkdev_report_zones(dev->bdev, 0, UINT_MAX,
> + dm_device_count_zones_cb, zc);
> + if (ret < 0)
> + return ret;
> + if (!ret)
> + return -EIO;
> + return 0;
> +}
> +
> +struct dm_zone_resource_limits {
> + unsigned int mapped_nr_seq_zones;
> + unsigned int max_open_zones;
> + unsigned int max_active_zones;
> + bool reliable_limits;
> +};
> +
> +static int device_get_zone_resource_limits(struct dm_target *ti,
> + struct dm_dev *dev, sector_t start,
> + sector_t len, void *data)
> +{
> + struct dm_zone_resource_limits *zlim = data;
> + struct gendisk *disk = dev->bdev->bd_disk;
> + int ret;
> + struct dm_device_zone_count zc = {
> + .start = start,
> + .len = len,
> + };
> +
> + /*
> + * If the target is not the whole device, the device zone resources may
> + * be shared between different targets. Check this by counting the
> + * number of mapped sequential zones: if this number is smaller than the
> + * total number of sequential zones of the target device, then resource
> + * sharing may happen and the zone limits will not be reliable.
> + */
> + ret = dm_device_count_zones(dev, &zc);
> + if (ret) {
> + DMERR("Count device %s zones failed",
> + disk->disk_name);
> + return ret;
> + }
> +
> + zlim->mapped_nr_seq_zones += zc.target_nr_seq_zones;
> +
> + /*
> + * If the target does not map any sequential zones, then we do not need
> + * any zone resource limits.
> + */
> + if (!zc.target_nr_seq_zones)
> + return 0;
> +
> + /*
> + * If the target does not map all sequential zones, the limits
> + * will not be reliable.
> + */
> + if (zc.target_nr_seq_zones < zc.total_nr_seq_zones)
> + zlim->reliable_limits = false;
> +
> + zlim->max_active_zones =
> + min_not_zero(disk->queue->limits.max_active_zones,
> + zlim->max_active_zones);
> + if (zlim->max_active_zones != UINT_MAX)
> + zlim->max_active_zones =
> + min(zlim->max_active_zones, zc.target_nr_seq_zones);
> +
> + zlim->max_open_zones =
> + min_not_zero(disk->queue->limits.max_open_zones,
> + zlim->max_open_zones);
> + if (zlim->max_open_zones != UINT_MAX)
> + zlim->max_open_zones =
> + min3(zlim->max_open_zones, zc.target_nr_seq_zones,
> + zlim->max_active_zones);
> +
> + return 0;
> +}
> +
> +static int dm_set_zone_resource_limits(struct mapped_device *md,
> + struct dm_table *t, struct queue_limits *lim)
> +{
> + struct gendisk *disk = md->disk;
> + struct dm_zone_resource_limits zlim = {
> + .max_open_zones = UINT_MAX,
> + .max_active_zones = UINT_MAX,
> + .reliable_limits = true,
> + };
> +
> + /* Get the zone resource limits from the targets. */
> + for (unsigned int i = 0; i < t->num_targets; i++) {
> + struct dm_target *ti = dm_table_get_target(t, i);
> +
> + if (!ti->type->iterate_devices ||
> + ti->type->iterate_devices(ti,
> + device_get_zone_resource_limits, &zlim)) {
> + DMERR("Could not determine %s zone resource limits",
> + md->disk->disk_name);
> + return -ENODEV;
> + }
> + }
> +
> + /*
> + * If we only have conventional zones mapped, expose the mapped device
> + * as a regular device.
> + */
> + if (!zlim.mapped_nr_seq_zones) {
> + lim->max_open_zones = 0;
> + lim->max_active_zones = 0;
> + lim->max_zone_append_sectors = 0;
> + lim->zone_write_granularity = 0;
> + lim->zoned = false;
> + clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
> + md->nr_zones = 0;
> + disk->nr_zones = 0;
> + return 0;
> + }
> +
> + /*
> + * Warn if the mapped device is partially using zone resources of the
> + * target devices as that leads to unreliable limits, i.e. if another
> + * mapped device uses the same underlying devices, we cannot enforce
> + * zone limits to guarantee that writing will not lead to errors.
> + * Note that we really should return an error for such case but there is
> + * no easy way to find out if another mapped device uses the same
> + * underlying zoned devices.
> + */
> + if (!zlim.reliable_limits)
> + DMWARN("%s zone resource limits may be unreliable",
> + disk->disk_name);
> +
> + if (zlim.max_open_zones >= zlim.mapped_nr_seq_zones)
> + lim->max_open_zones = 0;
> + else
> + lim->max_open_zones = zlim.max_open_zones;
> +
> + if (zlim.max_active_zones >= zlim.mapped_nr_seq_zones)
> + lim->max_active_zones = 0;
> + else
> + lim->max_active_zones = zlim.max_active_zones;
>
> return 0;
> }
> @@ -172,8 +329,13 @@ static int dm_revalidate_zones(struct mapped_device *md, struct dm_table *t)
> int ret;
>
> /* Revalidate only if something changed. */
> - if (!disk->nr_zones || disk->nr_zones != md->nr_zones)
> + if (!disk->nr_zones || disk->nr_zones != md->nr_zones) {
> + DMINFO("%s using %s zone append",
> + md->disk->disk_name,
> + queue_emulates_zone_append(disk->queue) ?
> + "emulated" : "native");
> md->nr_zones = 0;
> + }
>
> if (md->nr_zones)
> return 0;
> @@ -224,8 +386,6 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
> struct queue_limits *lim)
> {
> struct mapped_device *md = t->md;
> - struct gendisk *disk = md->disk;
> - unsigned int nr_conv_zones = 0;
> int ret;
>
> /*
> @@ -244,36 +404,16 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
> return 0;
>
> /*
> - * Count conventional zones to check that the mapped device will indeed
> - * have sequential write required zones.
> + * Determine the max open and max active zone limits for the mapped
> + * device. For a mapped device containing only conventional zones, the
> + * mapped device is changed to be a regular block device, so exit early
> + * for such case.
> */
> - md->zone_revalidate_map = t;
> - ret = dm_blk_report_zones(disk, 0, UINT_MAX,
> - dm_check_zoned_cb, &nr_conv_zones);
> - md->zone_revalidate_map = NULL;
> - if (ret < 0) {
> - DMERR("Check zoned failed %d", ret);
> + ret = dm_set_zone_resource_limits(md, t, lim);
> + if (ret)
> return ret;
> - }
> -
> - /*
> - * If we only have conventional zones, expose the mapped device as
> - * a regular device.
> - */
> - if (nr_conv_zones >= ret) {
> - lim->max_open_zones = 0;
> - lim->max_active_zones = 0;
> - lim->zoned = false;
> - clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
> - disk->nr_zones = 0;
> + if (!lim->zoned)
> return 0;
> - }
> -
> - if (!md->disk->nr_zones) {
> - DMINFO("%s using %s zone append",
> - md->disk->disk_name,
> - queue_emulates_zone_append(q) ? "emulated" : "native");
> - }
>
> ret = dm_revalidate_zones(md, t);
> if (ret < 0)
> --
> 2.45.1
>
* Re: [PATCH 1/4] null_blk: Do not allow runt zone with zone capacity smaller then zone size
2024-05-30 5:40 ` [PATCH 1/4] null_blk: Do not allow runt zone with zone capacity smaller then zone size Damien Le Moal
2024-05-30 7:37 ` Niklas Cassel
2024-05-30 20:34 ` Bart Van Assche
@ 2024-06-01 5:25 ` Christoph Hellwig
2024-06-03 6:53 ` Hannes Reinecke
3 siblings, 0 replies; 29+ messages in thread
From: Christoph Hellwig @ 2024-06-01 5:25 UTC (permalink / raw)
To: Damien Le Moal
Cc: Jens Axboe, linux-block, dm-devel, Mike Snitzer, Mikulas Patocka
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCH 2/4] block: Fix validation of zoned device with a runt zone
2024-05-30 5:40 ` [PATCH 2/4] block: Fix validation of zoned device with a runt zone Damien Le Moal
2024-05-30 7:37 ` Niklas Cassel
2024-05-30 20:37 ` Bart Van Assche
@ 2024-06-01 5:26 ` Christoph Hellwig
2024-06-03 6:55 ` Hannes Reinecke
3 siblings, 0 replies; 29+ messages in thread
From: Christoph Hellwig @ 2024-06-01 5:26 UTC (permalink / raw)
To: Damien Le Moal
Cc: Jens Axboe, linux-block, dm-devel, Mike Snitzer, Mikulas Patocka
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCH 3/4] block: Fix zone write plugging handling of devices with a runt zone
2024-05-30 5:40 ` [PATCH 3/4] block: Fix zone write plugging handling of devices " Damien Le Moal
` (2 preceding siblings ...)
2024-05-30 20:40 ` Bart Van Assche
@ 2024-06-01 5:26 ` Christoph Hellwig
2024-06-03 6:56 ` Hannes Reinecke
4 siblings, 0 replies; 29+ messages in thread
From: Christoph Hellwig @ 2024-06-01 5:26 UTC (permalink / raw)
To: Damien Le Moal
Cc: Jens Axboe, linux-block, dm-devel, Mike Snitzer, Mikulas Patocka
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCH 4/4] dm: Improve zone resource limits handling
2024-05-31 19:26 ` Benjamin Marzinski
@ 2024-06-01 5:29 ` Christoph Hellwig
2024-06-01 5:33 ` Christoph Hellwig
0 siblings, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2024-06-01 5:29 UTC (permalink / raw)
To: Benjamin Marzinski
Cc: Damien Le Moal, Jens Axboe, linux-block, dm-devel, Mike Snitzer,
Mikulas Patocka
On Fri, May 31, 2024 at 03:26:06PM -0400, Benjamin Marzinski wrote:
> Does this mean that if a dm device was made up of two linear targets,
> one of which mapped an entire zoned device, and one of which mapped a
> single zone of another zoned device, the max active zone limit of the
> entire dm device would be 1? That seems wrong.
In this particular case it is a bit suboptimal as the limit could be
2, but as soon as you add more than a single zone of the second
device anything more would be wrong.
* Re: [PATCH 4/4] dm: Improve zone resource limits handling
2024-05-30 5:40 ` [PATCH 4/4] dm: Improve zone resource limits handling Damien Le Moal
2024-05-30 7:37 ` Niklas Cassel
2024-05-31 19:26 ` Benjamin Marzinski
@ 2024-06-01 5:29 ` Christoph Hellwig
2024-06-03 6:58 ` Hannes Reinecke
3 siblings, 0 replies; 29+ messages in thread
From: Christoph Hellwig @ 2024-06-01 5:29 UTC (permalink / raw)
To: Damien Le Moal
Cc: Jens Axboe, linux-block, dm-devel, Mike Snitzer, Mikulas Patocka
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCH 4/4] dm: Improve zone resource limits handling
2024-06-01 5:29 ` Christoph Hellwig
@ 2024-06-01 5:33 ` Christoph Hellwig
2024-06-03 0:44 ` Damien Le Moal
0 siblings, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2024-06-01 5:33 UTC (permalink / raw)
To: Benjamin Marzinski
Cc: Damien Le Moal, Jens Axboe, linux-block, dm-devel, Mike Snitzer,
Mikulas Patocka
On Fri, May 31, 2024 at 10:29:13PM -0700, Christoph Hellwig wrote:
> On Fri, May 31, 2024 at 03:26:06PM -0400, Benjamin Marzinski wrote:
> > Does this mean that if a dm device was made up of two linear targets,
> > one of which mapped an entire zoned device, and one of which mapped a
> > single zone of another zoned device, the max active zone limit of the
> > entire dm device would be 1? That seems wrong.
>
> In this particular case it is a bit suboptimal as the limit could be
> 2, but as soon as you add more than a single zone of the second
> device anything more would be wrong.
Actually even for this case it's the only valid one, sorry.
* Re: [PATCH 4/4] dm: Improve zone resource limits handling
2024-06-01 5:33 ` Christoph Hellwig
@ 2024-06-03 0:44 ` Damien Le Moal
0 siblings, 0 replies; 29+ messages in thread
From: Damien Le Moal @ 2024-06-03 0:44 UTC (permalink / raw)
To: Christoph Hellwig, Benjamin Marzinski
Cc: Jens Axboe, linux-block, dm-devel, Mike Snitzer, Mikulas Patocka
On 6/1/24 14:33, Christoph Hellwig wrote:
> On Fri, May 31, 2024 at 10:29:13PM -0700, Christoph Hellwig wrote:
>> On Fri, May 31, 2024 at 03:26:06PM -0400, Benjamin Marzinski wrote:
>>> Does this mean that if a dm device was made up of two linear targets,
>>> one of which mapped an entire zoned device, and one of which mapped a
>>> single zone of another zoned device, the max active zone limit of the
>>> entire dm device would be 1? That seems wrong.
>>
>> In this particular case it is a bit suboptimal as the limit could be
>> 2, but as soon as you add more than a single zone of the second
>> device anything more would be wrong.
>
> Actually even for this case it's the only valid one, sorry.
Not really. I think that Benjamin has a very valid point here.
Imagine target 1 mapping an entire SMR HDD with a MOZ1=128 max open limit and target
2 being one zone of another drive with MOZ2=whatever max open zones limit.
For such mapped device, the user can simultaneously write at most 128 zones and
these zones can be:
1) all in target 1 -> then the mapped device max open zone limit of 1 is silly.
128 would work just fine.
2) 127 zones in target 1 and the 1 zone of target 2: then again the mapped
device max open zone limit of 1 is overkill and 128 limit is still OK.
Now if MOZ2 is say 16 and we map more than 16 zones from target 2, THEN we need
to cap the mapped device max open zone limit to 16 as we can potentially have
the user trying to simultaneously write more than 16 zones belonging to target 2.
So the cap on the number of zones of the target that I have is not correct. What I
need is to not blindly do a min_not_zero() of the limits, but to first look at the
number of zones being mapped. That is, something like:
if (target_nr_zones > target_moz)
moz = target_moz;
else
moz = 0;
mapped_dev_moz = min_not_zero(mapped_dev_moz, moz);
And same for max active limit. And we still need a cap of the limits on the
number of zones mapped at the end, when setting the mapped device limits.
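The corrected stacking sketched above could look like the following standalone model (hypothetical helper names, not the actual dm-zone.c code), using the MOZ1=128 / MOZ2=16 example:

```c
/*
 * Model of the corrected per-target stacking: a target device limit
 * only constrains the mapped device when the target maps more
 * sequential zones than the device allows to be simultaneously open.
 */
static unsigned int min_not_zero(unsigned int a, unsigned int b)
{
	if (!a)
		return b;
	if (!b)
		return a;
	return a < b ? a : b;
}

static unsigned int stack_max_open(unsigned int mapped_dev_moz,
				   unsigned int target_nr_zones,
				   unsigned int target_moz)
{
	unsigned int moz = 0;

	/* The limit matters only if the target can exceed it on its own */
	if (target_moz && target_nr_zones > target_moz)
		moz = target_moz;

	return min_not_zero(mapped_dev_moz, moz);
}
```

With target 1 being a whole drive with, say, 10000 sequential zones and MOZ1=128, and target 2 mapping a single zone of a MOZ2=16 drive, the stacked limit stays 128; it would only drop to 16 if more than 16 zones of the second drive were mapped.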
Let me cook a V2 of this patch, with *lots* of comments to clarify this.
Note: I am also writing some blktests cases to test all this. I will add more
setups with challenges like this to ensure that the right limits are set.
--
Damien Le Moal
Western Digital Research
* Re: [PATCH 1/4] null_blk: Do not allow runt zone with zone capacity smaller then zone size
2024-05-30 5:40 ` [PATCH 1/4] null_blk: Do not allow runt zone with zone capacity smaller then zone size Damien Le Moal
` (2 preceding siblings ...)
2024-06-01 5:25 ` Christoph Hellwig
@ 2024-06-03 6:53 ` Hannes Reinecke
3 siblings, 0 replies; 29+ messages in thread
From: Hannes Reinecke @ 2024-06-03 6:53 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe, linux-block, dm-devel, Mike Snitzer,
Mikulas Patocka
On 5/30/24 07:40, Damien Le Moal wrote:
> A zoned device with a smaller last zone together with a zone capacity
> smaller than the zone size does make any sense as that does not
^not
> correspond to any possible setup for a real device:
> 1) For ZNS and zoned UFS devices, all zones are always the same size.
> 2) For SMR HDDs, all zones always have the same capacity.
> In other words, if we have a smaller last runt zone, then this zone
> capacity should always be equal to the zone size.
>
> Add a check in null_init_zoned_dev() to prevent a configuration to have
> both a smaller zone size and a zone capacity smaller than the zone size.
>
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
> drivers/block/null_blk/zoned.c | 11 +++++++++++
> 1 file changed, 11 insertions(+)
>
> diff --git a/drivers/block/null_blk/zoned.c b/drivers/block/null_blk/zoned.c
> index 79c8e5e99f7f..f118d304f310 100644
> --- a/drivers/block/null_blk/zoned.c
> +++ b/drivers/block/null_blk/zoned.c
> @@ -74,6 +74,17 @@ int null_init_zoned_dev(struct nullb_device *dev,
> return -EINVAL;
> }
>
> + /*
> + * If a smaller zone capacity was requested, do not allow a smaller last
> + * zone at the same time as such zone configuration does not correspond
> + * to any real zoned device.
> + */
> + if (dev->zone_capacity != dev->zone_size &&
> + dev->size & (dev->zone_size - 1)) {
> + pr_err("A smaller last zone is not allowed with zone capacity smaller than zone size.\n");
> + return -EINVAL;
> + }
> +
> zone_capacity_sects = mb_to_sects(dev->zone_capacity);
> dev_capacity_sects = mb_to_sects(dev->size);
> dev->zone_size_sects = mb_to_sects(dev->zone_size);
Otherwise:
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [PATCH 2/4] block: Fix validation of zoned device with a runt zone
2024-05-30 5:40 ` [PATCH 2/4] block: Fix validation of zoned device with a runt zone Damien Le Moal
` (2 preceding siblings ...)
2024-06-01 5:26 ` Christoph Hellwig
@ 2024-06-03 6:55 ` Hannes Reinecke
3 siblings, 0 replies; 29+ messages in thread
From: Hannes Reinecke @ 2024-06-03 6:55 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe, linux-block, dm-devel, Mike Snitzer,
Mikulas Patocka
On 5/30/24 07:40, Damien Le Moal wrote:
> Commit ecfe43b11b02 ("block: Remember zone capacity when revalidating
> zones") introduced checks to ensure that the capacity of the zones of
> a zoned device is constant for all zones. However, this check ignores
> the possibility that a zoned device has a smaller last zone with a size
> not equal to the capacity of other zones. Such a device corresponds in
> practice to an SMR drive with a smaller last zone and all zones with a
> capacity equal to the zone size, leading to the last zone capacity being
> different than the capacity of other zones.
>
> Correctly handle such device by fixing the check for the constant zone
> capacity in blk_revalidate_seq_zone() using the new helper function
> disk_zone_is_last(). This helper function is also used in
> blk_revalidate_zone_cb() when checking the zone size.
>
> Fixes: ecfe43b11b02 ("block: Remember zone capacity when revalidating zones")
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
> block/blk-zoned.c | 16 +++++++++++-----
> 1 file changed, 11 insertions(+), 5 deletions(-)
>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [PATCH 3/4] block: Fix zone write plugging handling of devices with a runt zone
2024-05-30 5:40 ` [PATCH 3/4] block: Fix zone write plugging handling of devices " Damien Le Moal
` (3 preceding siblings ...)
2024-06-01 5:26 ` Christoph Hellwig
@ 2024-06-03 6:56 ` Hannes Reinecke
4 siblings, 0 replies; 29+ messages in thread
From: Hannes Reinecke @ 2024-06-03 6:56 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe, linux-block, dm-devel, Mike Snitzer,
Mikulas Patocka
On 5/30/24 07:40, Damien Le Moal wrote:
> A zoned device may have a last sequential write required zone that is
> smaller than other zones. However, all tests to check if a zone write
> plug write offset exceeds the zone capacity use the same capacity
> value stored in the gendisk zone_capacity field. This is incorrect for a
> zoned device with a last runt (smaller) zone.
>
> Add the new field last_zone_capacity to struct gendisk to store the
> capacity of the last zone of the device. blk_revalidate_seq_zone() and
> blk_revalidate_conv_zone() are both modified to get this value when
> disk_zone_is_last() returns true. Similarly to zone_capacity, the value
> is first stored using the last_zone_capacity field of struct
> blk_revalidate_zone_args. Once zone revalidation of all zones is done,
> this is used to set the gendisk last_zone_capacity field.
>
> The checks to determine if a zone is full or if a sector offset in a
> zone exceeds the zone capacity in disk_should_remove_zone_wplug(),
> disk_zone_wplug_abort_unaligned(), blk_zone_write_plug_init_request(),
> and blk_zone_wplug_prepare_bio() are modified to use the new helper
> functions disk_zone_is_full() and disk_zone_wplug_is_full().
> disk_zone_is_full() uses the zone index to determine if the zone being
> tested is the last one of the disk and uses either the disk
> zone_capacity or last_zone_capacity accordingly.
>
> Fixes: dd291d77cc90 ("block: Introduce zone write plugging")
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
> block/blk-zoned.c | 35 +++++++++++++++++++++++++++--------
> include/linux/blkdev.h | 1 +
> 2 files changed, 28 insertions(+), 8 deletions(-)
>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
* Re: [PATCH 4/4] dm: Improve zone resource limits handling
2024-05-30 5:40 ` [PATCH 4/4] dm: Improve zone resource limits handling Damien Le Moal
` (2 preceding siblings ...)
2024-06-01 5:29 ` Christoph Hellwig
@ 2024-06-03 6:58 ` Hannes Reinecke
3 siblings, 0 replies; 29+ messages in thread
From: Hannes Reinecke @ 2024-06-03 6:58 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe, linux-block, dm-devel, Mike Snitzer,
Mikulas Patocka
On 5/30/24 07:40, Damien Le Moal wrote:
> The generic stacking of limits implemented in the block layer cannot
> correctly handle stacking of zone resource limits (max open zones and
> max active zones) because these limits are for an entire device but the
> stacking may be for a portion of that device (e.g. a dm-linear target
> that does not cover an entire block device). As a result, when DM
> devices are created on top of zoned block devices, the DM device never
> has any zone resource limits advertized, which is only correct if all
> underlying target devices also have no zone resource limits.
> If at least one target device has resource limits, the user may see
> either performance issues (if the max open zone limit of the device is
> exceeded) or write I/O errors if the max active zone limit of one of
> the underlying target devices is exceeded.
>
> While it is very difficult to correctly and reliably stack zone resource
> limits in general, cases where targets are not sharing zone resources of
> the same device can be dealt with relatively easily. Such situation
> happens when a target maps all sequential zones of a zoned block device:
> for such mapping, other targets mapping other parts of the same zoned
> block device can only contain conventional zones and thus will not
> require any zone resource to correctly handle write operations.
>
> For a mapped device constructed with such targets, which includes mapped
> devices constructed with targets mapping entire zoned block devices, the
> zone resource limits can be reliably determined using the non-zero
> minimum of the zone resource limits of all targets.
>
> For mapped devices that include targets partially mapping the set of
> sequential write required zones of zoned block devices, instead of
> advertising no zone resource limits, it is also better to set the mapped
> device limits to the non-zero minimum of the limits of all targets,
> capped with the number of sequential zones being mapped.
> While still not completely reliable, this at least indicates to the
> user that the underlying devices have limits.
>
> This commit improves zone resource limits handling as described above
> using the function dm_set_zone_resource_limits(). This function is
> executed from dm_set_zones_restrictions() and iterates the targets of a
> mapped device to evaluate the max open and max active zone limits. This
> relies on an internal "stacking" of the limits of the target devices
> combined with a direct counting of the number of sequential zones
> mapped by the target.
> 1) For a target mapping an entire zoned block device, the limits are
> evaluated as the non-zero minimum of the limits of the target device
> and of the mapped device.
> 2) For a target partially mapping a zoned block device, the number of
> mapped sequential zones is compared to the total number of sequential
> zones of the target device to determine the limits: if the target
> maps all sequential write required zones of the device, then the
> target can reliably inherit the device limits. Otherwise, the target
> limits are set to the device limits capped by the number of mapped
> sequential zones.
> With this evaluation done for each target, the mapped device zone
> resource limits are evaluated as the non-zero minimum of the limits of
> all the targets.
>
> For configurations resulting in unreliable limits, a warning message is
> issued.
>
> The counting of mapped sequential zones for the target is done using the
> new function dm_device_count_zones() which performs a report zones on
> the entire block device with the callback dm_device_count_zones_cb().
> This count of mapped sequential zones is used to determine if the mapped
> device contains only conventional zones. This allows simplifying
> dm_set_zones_restrictions() so that it no longer needs to do a report
> zones itself. For mapped devices
> mapping only conventional zones, dm_set_zone_resource_limits() changes
> the mapped device to a regular device.
>
> To further clean up dm_set_zones_restrictions(), the message about the
> type of zone append (native or emulated) is moved inside
> dm_revalidate_zones().
>
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
> drivers/md/dm-zone.c | 214 +++++++++++++++++++++++++++++++++++--------
> 1 file changed, 177 insertions(+), 37 deletions(-)
>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich