From: Niklas Cassel <cassel@kernel.org>
To: Damien Le Moal <dlemoal@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>,
linux-block@vger.kernel.org, dm-devel@lists.linux.dev,
Mike Snitzer <snitzer@kernel.org>,
Mikulas Patocka <mpatocka@redhat.com>
Subject: Re: [PATCH 4/4] dm: Improve zone resource limits handling
Date: Thu, 30 May 2024 09:37:40 +0200 [thread overview]
Message-ID: <ZlgsxKE2tpu68O3n@ryzen.lan> (raw)
In-Reply-To: <20240530054035.491497-5-dlemoal@kernel.org>
On Thu, May 30, 2024 at 02:40:35PM +0900, Damien Le Moal wrote:
> The generic stacking of limits implemented in the block layer cannot
> correctly handle stacking of zone resource limits (max open zones and
> max active zones) because these limits are for an entire device but the
> stacking may be for a portion of that device (e.g. a dm-linear target
> that does not cover an entire block device). As a result, when DM
> devices are created on top of zoned block devices, the DM device never
> has any zone resource limits advertized, which is only correct if all
> underlying target devices also have no zone resource limits.
> If at least one target device has resource limits, the user may see
> either performance issues (if the max open zone limit of the device is
> exceeded) or write I/O errors if the max active zone limit of one of
> the underlying target devices is exceeded.
>
> While it is very difficult to correctly and reliably stack zone resource
> limits in general, cases where targets are not sharing zone resources of
> the same device can be dealt with relatively easily. Such situation
> happens when a target maps all sequential zones of a zoned block device:
> for such mapping, other targets mapping other parts of the same zoned
> block device can only contain conventional zones and thus will not
> require any zone resource to correctly handle write operations.
>
> For a mapped device constructed with such targets, which includes mapped
> devices constructed with targets mapping entire zoned block devices, the
> zone resource limits can be reliably determined using the non-zero
> minimum of the zone resource limits of all targets.
>
> For mapped devices that include targets partially mapping the set of
> sequential write required zones of zoned block devices, instead of
> advertizing no zone resource limits, it is also better to set the mapped
> device limits to the non-zero minimum of the limits of all targets,
> capped with the number of sequential zones being mapped.
> While still not completely reliable, this allows indicating to the user
> that the underlying devices used have limits.
>
> This commit improves zone resource limits handling as described above
> using the function dm_set_zone_resource_limits(). This function is
> executed from dm_set_zones_restrictions() and iterates the targets of a
> mapped device to evaluate the max open and max active zone limits. This
> relies on an internal "stacking" of the limits of the target devices
> combined with a direct counting of the number of sequential zones
> mapped by the target.
> 1) For a target mapping an entire zoned block device, the limits are
> evaluated as the non-zero minimum of the limits of the target device
> and of the mapped device.
> 2) For a target partially mapping a zoned block device, the number of
> mapped sequential zones is compared to the total number of sequential
> zones of the target device to determine the limits: if the target
> maps all sequential write required zones of the device, then the
> target can reliably inherit the device limits. Otherwise, the target
> limits are set to the device limits capped by the number of mapped
> sequential zones.
> With this evaluation done for each target, the mapped device zone
> resource limits are evaluated as the non-zero minimum of the limits of
> all the targets.
>
> For configurations resulting in unreliable limits, a warning message is
> issued.
>
> The counting of mapped sequential zones for the target is done using the
> new function dm_device_count_zones() which performs a report zones on
> the entire block device with the callback dm_device_count_zones_cb().
> This count of mapped sequential zones is used to determine if the mapped
> device contains only conventional zones. This allows simplifying
> dm_set_zones_restrictions() to not do a report zones. For mapped devices
> mapping only conventional zones, dm_set_zone_resource_limits() changes
> the mapped device to a regular device.
>
> To further cleanup dm_set_zones_restrictions(), the message about the
> type of zone append (native or emulated) is moved inside
> dm_revalidate_zones().
>
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
> drivers/md/dm-zone.c | 214 +++++++++++++++++++++++++++++++++++--------
> 1 file changed, 177 insertions(+), 37 deletions(-)
>
> diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
> index 5d66d916730e..5f8b499529cf 100644
> --- a/drivers/md/dm-zone.c
> +++ b/drivers/md/dm-zone.c
> @@ -145,17 +145,174 @@ bool dm_is_zone_write(struct mapped_device *md, struct bio *bio)
> }
> }
>
> +struct dm_device_zone_count {
> + sector_t start;
> + sector_t len;
> + unsigned int total_nr_seq_zones;
> + unsigned int target_nr_seq_zones;
> +};
> +
> /*
> - * Count conventional zones of a mapped zoned device. If the device
> - * only has conventional zones, do not expose it as zoned.
> + * Count the total number of and the number of mapped sequential zones of a
> + * target zoned device.
> */
> -static int dm_check_zoned_cb(struct blk_zone *zone, unsigned int idx,
> - void *data)
> +static int dm_device_count_zones_cb(struct blk_zone *zone,
> + unsigned int idx, void *data)
> {
> - unsigned int *nr_conv_zones = data;
> + struct dm_device_zone_count *zc = data;
> +
> + if (zone->type != BLK_ZONE_TYPE_CONVENTIONAL) {
> + zc->total_nr_seq_zones++;
> + if (zone->start >= zc->start &&
> + zone->start < zc->start + zc->len)
> + zc->target_nr_seq_zones++;
> + }
>
> - if (zone->type == BLK_ZONE_TYPE_CONVENTIONAL)
> - (*nr_conv_zones)++;
> + return 0;
> +}
> +
> +static int dm_device_count_zones(struct dm_dev *dev,
> + struct dm_device_zone_count *zc)
> +{
> + int ret;
> +
> + ret = blkdev_report_zones(dev->bdev, 0, UINT_MAX,
> + dm_device_count_zones_cb, zc);
> + if (ret < 0)
> + return ret;
> + if (!ret)
> + return -EIO;
> + return 0;
> +}
> +
> +struct dm_zone_resource_limits {
> + unsigned int mapped_nr_seq_zones;
> + unsigned int max_open_zones;
> + unsigned int max_active_zones;
> + bool reliable_limits;
> +};
> +
> +static int device_get_zone_resource_limits(struct dm_target *ti,
> + struct dm_dev *dev, sector_t start,
> + sector_t len, void *data)
> +{
> + struct dm_zone_resource_limits *zlim = data;
> + struct gendisk *disk = dev->bdev->bd_disk;
> + int ret;
> + struct dm_device_zone_count zc = {
> + .start = start,
> + .len = len,
> + };
> +
> + /*
> + * If the target is not the whole device, the device zone resources may
> + * be shared between different targets. Check this by counting the
> + * number of mapped sequential zones: if this number is smaller than the
> + * total number of sequential zones of the target device, then resource
> + * sharing may happen and the zone limits will not be reliable.
> + */
> + ret = dm_device_count_zones(dev, &zc);
> + if (ret) {
> + DMERR("Count device %s zones failed",
> + disk->disk_name);
> + return ret;
> + }
> +
> + zlim->mapped_nr_seq_zones += zc.target_nr_seq_zones;
> +
> + /*
> + * If the target does not map any sequential zones, then we do not need
> + * any zone resource limits.
> + */
> + if (!zc.target_nr_seq_zones)
> + return 0;
> +
> + /*
> + * If the target does not map all sequential zones, the limits
> + * will not be reliable.
> + */
> + if (zc.target_nr_seq_zones < zc.total_nr_seq_zones)
> + zlim->reliable_limits = false;
> +
> + zlim->max_active_zones =
> + min_not_zero(disk->queue->limits.max_active_zones,
> + zlim->max_active_zones);
> + if (zlim->max_active_zones != UINT_MAX)
> + zlim->max_active_zones =
> + min(zlim->max_active_zones, zc.target_nr_seq_zones);
> +
> + zlim->max_open_zones =
> + min_not_zero(disk->queue->limits.max_open_zones,
> + zlim->max_open_zones);
> + if (zlim->max_open_zones != UINT_MAX)
> + zlim->max_open_zones =
> + min3(zlim->max_open_zones, zc.target_nr_seq_zones,
> + zlim->max_active_zones);
> +
> + return 0;
> +}
> +
> +static int dm_set_zone_resource_limits(struct mapped_device *md,
> + struct dm_table *t, struct queue_limits *lim)
> +{
> + struct gendisk *disk = md->disk;
> + struct dm_zone_resource_limits zlim = {
> + .max_open_zones = UINT_MAX,
> + .max_active_zones = UINT_MAX,
> + .reliable_limits = true,
> + };
> +
> + /* Get the zone resource limits from the targets. */
> + for (unsigned int i = 0; i < t->num_targets; i++) {
> + struct dm_target *ti = dm_table_get_target(t, i);
> +
> + if (!ti->type->iterate_devices ||
> + ti->type->iterate_devices(ti,
> + device_get_zone_resource_limits, &zlim)) {
> + DMERR("Could not determine %s zone resource limits",
> + md->disk->disk_name);
> + return -ENODEV;
> + }
> + }
> +
> + /*
> + * If we only have conventional zones mapped, expose the mapped device
> + + as a regular device.
> + */
> + if (!zlim.mapped_nr_seq_zones) {
> + lim->max_open_zones = 0;
> + lim->max_active_zones = 0;
> + lim->max_zone_append_sectors = 0;
> + lim->zone_write_granularity = 0;
> + lim->zoned = false;
> + clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
> + md->nr_zones = 0;
> + disk->nr_zones = 0;
> + return 0;
> + }
> +
> + /*
> + * Warn if the mapped device is partially using zone resources of the
> + * target devices as that leads to unreliable limits, i.e. if another
> + * mapped device uses the same underlying devices, we cannot enforce
> + * zone limits to guarantee that writing will not lead to errors.
> + * Note that we really should return an error for such case but there is
> + * no easy way to find out if another mapped device uses the same
> + * underlying zoned devices.
> + */
> + if (!zlim.reliable_limits)
> + DMWARN("%s zone resource limits may be unreliable",
> + disk->disk_name);
> +
> + if (zlim.max_open_zones >= zlim.mapped_nr_seq_zones)
> + lim->max_open_zones = 0;
> + else
> + lim->max_open_zones = zlim.max_open_zones;
> +
> + if (zlim.max_active_zones >= zlim.mapped_nr_seq_zones)
> + lim->max_active_zones = 0;
> + else
> + lim->max_active_zones = zlim.max_active_zones;
>
> return 0;
> }
> @@ -172,8 +329,13 @@ static int dm_revalidate_zones(struct mapped_device *md, struct dm_table *t)
> int ret;
>
> /* Revalidate only if something changed. */
> - if (!disk->nr_zones || disk->nr_zones != md->nr_zones)
> + if (!disk->nr_zones || disk->nr_zones != md->nr_zones) {
> + DMINFO("%s using %s zone append",
> + md->disk->disk_name,
> + queue_emulates_zone_append(disk->queue) ?
> + "emulated" : "native");
> md->nr_zones = 0;
> + }
>
> if (md->nr_zones)
> return 0;
> @@ -224,8 +386,6 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
> struct queue_limits *lim)
> {
> struct mapped_device *md = t->md;
> - struct gendisk *disk = md->disk;
> - unsigned int nr_conv_zones = 0;
> int ret;
>
> /*
> @@ -244,36 +404,16 @@ int dm_set_zones_restrictions(struct dm_table *t, struct request_queue *q,
> return 0;
>
> /*
> - * Count conventional zones to check that the mapped device will indeed
> - * have sequential write required zones.
> + * Determine the max open and max active zone limits for the mapped
> + * device. For a mapped device containing only conventional zones, the
> + * mapped device is changed to be a regular block device, so exit early
> + * for such case.
> */
> - md->zone_revalidate_map = t;
> - ret = dm_blk_report_zones(disk, 0, UINT_MAX,
> - dm_check_zoned_cb, &nr_conv_zones);
> - md->zone_revalidate_map = NULL;
> - if (ret < 0) {
> - DMERR("Check zoned failed %d", ret);
> + ret = dm_set_zone_resource_limits(md, t, lim);
> + if (ret)
> return ret;
> - }
> -
> - /*
> - * If we only have conventional zones, expose the mapped device as
> - * a regular device.
> - */
> - if (nr_conv_zones >= ret) {
> - lim->max_open_zones = 0;
> - lim->max_active_zones = 0;
> - lim->zoned = false;
> - clear_bit(DMF_EMULATE_ZONE_APPEND, &md->flags);
> - disk->nr_zones = 0;
> + if (!lim->zoned)
> return 0;
> - }
> -
> - if (!md->disk->nr_zones) {
> - DMINFO("%s using %s zone append",
> - md->disk->disk_name,
> - queue_emulates_zone_append(q) ? "emulated" : "native");
> - }
I would have moved this print in a separate commit.
Regardless:
Reviewed-by: Niklas Cassel <cassel@kernel.org>
next prev parent reply other threads:[~2024-05-30 7:37 UTC|newest]
Thread overview: 29+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-05-30 5:40 [PATCH 0/4] Zone write plugging and DM zone fixes Damien Le Moal
2024-05-30 5:40 ` [PATCH 1/4] null_blk: Do not allow runt zone with zone capacity smaller then zone size Damien Le Moal
2024-05-30 7:37 ` Niklas Cassel
2024-05-30 20:34 ` Bart Van Assche
2024-06-01 5:25 ` Christoph Hellwig
2024-06-03 6:53 ` Hannes Reinecke
2024-05-30 5:40 ` [PATCH 2/4] block: Fix validation of zoned device with a runt zone Damien Le Moal
2024-05-30 7:37 ` Niklas Cassel
2024-05-30 20:37 ` Bart Van Assche
2024-06-01 5:26 ` Christoph Hellwig
2024-06-03 6:55 ` Hannes Reinecke
2024-05-30 5:40 ` [PATCH 3/4] block: Fix zone write plugging handling of devices " Damien Le Moal
2024-05-30 7:37 ` Niklas Cassel
2024-05-30 11:09 ` Damien Le Moal
2024-05-30 12:51 ` Niklas Cassel
2024-05-30 20:40 ` Bart Van Assche
2024-06-01 5:26 ` Christoph Hellwig
2024-06-03 6:56 ` Hannes Reinecke
2024-05-30 5:40 ` [PATCH 4/4] dm: Improve zone resource limits handling Damien Le Moal
2024-05-30 7:37 ` Niklas Cassel [this message]
2024-05-31 19:26 ` Benjamin Marzinski
2024-06-01 5:29 ` Christoph Hellwig
2024-06-01 5:33 ` Christoph Hellwig
2024-06-03 0:44 ` Damien Le Moal
2024-06-01 5:29 ` Christoph Hellwig
2024-06-03 6:58 ` Hannes Reinecke
2024-05-30 21:03 ` [PATCH 0/4] Zone write plugging and DM zone fixes Jens Axboe
2024-05-30 23:58 ` Damien Le Moal
2024-05-30 21:04 ` (subset) " Jens Axboe
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZlgsxKE2tpu68O3n@ryzen.lan \
--to=cassel@kernel.org \
--cc=axboe@kernel.dk \
--cc=dlemoal@kernel.org \
--cc=dm-devel@lists.linux.dev \
--cc=linux-block@vger.kernel.org \
--cc=mpatocka@redhat.com \
--cc=snitzer@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).