* [PATCH 00/13] Introduce cached report zones
@ 2025-10-31 6:12 Damien Le Moal
2025-10-31 6:12 ` [PATCH 01/13] block: freeze queue when updating zone resources Damien Le Moal
` (12 more replies)
0 siblings, 13 replies; 63+ messages in thread
From: Damien Le Moal @ 2025-10-31 6:12 UTC (permalink / raw)
To: Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
The patch series implements a cached report zones using information from
the block layer zone write plugs and a new zone condition tracking. This
avoids having to execute slow report zones commands on the device when
for instance mounting file systems, which can significantly speed things
up, especially in setups with multiple SMR HDDs (e.g. a RAID volume).
The first patch improves zone resource updates. The following 3 patches
cleanup and improve handling of zone reports and of other zone
management operations. From patch 5 to 10, cached report zones in
implemented and made available to users with a new ioctl() command.
Finally, patches 12 and 13 introduce the use of cached report zones in
the mount operation of XFS and BTRFS.
These patches are against Jen's for-next tree.
Damien Le Moal (13):
block: freeze queue when updating zone resources
block: cleanup blkdev_report_zones()
block: handle zone management operations completions
block: introduce disk_report_zone()
block: reorganize struct blk_zone_wplug
block: use zone condition to determine conventional zones
block: track zone conditions
block: introduce blkdev_get_zone_info()
block: introduce blkdev_report_zones_cached()
block: introduce BLKREPORTZONESV2 ioctl
block: add zone write plug condition to debugfs zone_wplugs
btrfs: use blkdev_report_zones_cached()
xfs: use blkdev_report_zones_cached()
block/blk-zoned.c | 762 ++++++++++++++++++++++++------
block/blk.h | 14 +
block/ioctl.c | 1 +
drivers/block/null_blk/null_blk.h | 3 +-
drivers/block/null_blk/zoned.c | 4 +-
drivers/block/ublk_drv.c | 4 +-
drivers/block/virtio_blk.c | 11 +-
drivers/block/zloop.c | 4 +-
drivers/md/dm-zone.c | 54 ++-
drivers/md/dm.h | 3 +-
drivers/nvme/host/core.c | 5 +-
drivers/nvme/host/multipath.c | 4 +-
drivers/nvme/host/nvme.h | 2 +-
drivers/nvme/host/zns.c | 10 +-
drivers/scsi/sd.h | 2 +-
drivers/scsi/sd_zbc.c | 17 +-
fs/btrfs/zoned.c | 11 +-
fs/xfs/xfs_zone_alloc.c | 2 +-
include/linux/blkdev.h | 44 +-
include/linux/device-mapper.h | 10 +-
include/uapi/linux/blkzoned.h | 36 +-
include/uapi/linux/fs.h | 2 +-
22 files changed, 746 insertions(+), 259 deletions(-)
base-commit: ba6a8208cc205c6545c610b5863ea89466fc486a
--
2.51.0
^ permalink raw reply [flat|nested] 63+ messages in thread
* [PATCH 01/13] block: freeze queue when updating zone resources
2025-10-31 6:12 [PATCH 00/13] Introduce cached report zones Damien Le Moal
@ 2025-10-31 6:12 ` Damien Le Moal
2025-10-31 8:44 ` Christoph Hellwig
` (2 more replies)
2025-10-31 6:12 ` [PATCH 02/13] block: cleanup blkdev_report_zones() Damien Le Moal
` (11 subsequent siblings)
12 siblings, 3 replies; 63+ messages in thread
From: Damien Le Moal @ 2025-10-31 6:12 UTC (permalink / raw)
To: Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
Modify disk_update_zone_resources() to freeze the device queue before
updating the number of zones, zone capacity and other zone related
resources. The locking order resulting from the call to
queue_limits_commit_update_frozen() is preserved, that is, the queue
limits lock is first taken by calling queue_limits_start_update() before
freezing the queue, and the queue is unfrozen after executing
queue_limits_commit_update(), which replaces the call to
queue_limits_commit_update_frozen().
This change ensures that there are no in-flights I/Os when the zone
resources are updated due to a zone revalidation.
Fixes: 0b83c86b444a ("block: Prevent potential deadlock in blk_revalidate_disk_zones()")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
block/blk-zoned.c | 19 ++++++++++++++-----
1 file changed, 14 insertions(+), 5 deletions(-)
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 5e2a5788dc3b..f3b371056df4 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -1516,8 +1516,13 @@ static int disk_update_zone_resources(struct gendisk *disk,
{
struct request_queue *q = disk->queue;
unsigned int nr_seq_zones, nr_conv_zones;
- unsigned int pool_size;
+ unsigned int pool_size, memflags;
struct queue_limits lim;
+ int ret;
+
+ lim = queue_limits_start_update(q);
+
+ memflags = blk_mq_freeze_queue(q);
disk->nr_zones = args->nr_zones;
disk->zone_capacity = args->zone_capacity;
@@ -1527,11 +1532,10 @@ static int disk_update_zone_resources(struct gendisk *disk,
if (nr_conv_zones >= disk->nr_zones) {
pr_warn("%s: Invalid number of conventional zones %u / %u\n",
disk->disk_name, nr_conv_zones, disk->nr_zones);
- return -ENODEV;
+ ret = -ENODEV;
+ goto unfreeze;
}
- lim = queue_limits_start_update(q);
-
/*
* Some devices can advertize zone resource limits that are larger than
* the number of sequential zones of the zoned block device, e.g. a
@@ -1568,7 +1572,12 @@ static int disk_update_zone_resources(struct gendisk *disk,
}
commit:
- return queue_limits_commit_update_frozen(q, &lim);
+ ret = queue_limits_commit_update(q, &lim);
+
+unfreeze:
+ blk_mq_unfreeze_queue(q, memflags);
+
+ return ret;
}
static int blk_revalidate_conv_zone(struct blk_zone *zone, unsigned int idx,
--
2.51.0
^ permalink raw reply related [flat|nested] 63+ messages in thread
* [PATCH 02/13] block: cleanup blkdev_report_zones()
2025-10-31 6:12 [PATCH 00/13] Introduce cached report zones Damien Le Moal
2025-10-31 6:12 ` [PATCH 01/13] block: freeze queue when updating zone resources Damien Le Moal
@ 2025-10-31 6:12 ` Damien Le Moal
2025-10-31 8:45 ` Christoph Hellwig
` (2 more replies)
2025-10-31 6:12 ` [PATCH 03/13] block: handle zone management operations completions Damien Le Moal
` (10 subsequent siblings)
12 siblings, 3 replies; 63+ messages in thread
From: Damien Le Moal @ 2025-10-31 6:12 UTC (permalink / raw)
To: Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
The variable capacity is used only in one place and so can be removed
and get_capacity(disk) used directly instead.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
block/blk-zoned.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index f3b371056df4..8d2111879328 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -156,7 +156,6 @@ int blkdev_report_zones(struct block_device *bdev, sector_t sector,
unsigned int nr_zones, report_zones_cb cb, void *data)
{
struct gendisk *disk = bdev->bd_disk;
- sector_t capacity = get_capacity(disk);
struct disk_report_zones_cb_args args = {
.disk = disk,
.user_cb = cb,
@@ -166,7 +165,7 @@ int blkdev_report_zones(struct block_device *bdev, sector_t sector,
if (!bdev_is_zoned(bdev) || WARN_ON_ONCE(!disk->fops->report_zones))
return -EOPNOTSUPP;
- if (!nr_zones || sector >= capacity)
+ if (!nr_zones || sector >= get_capacity(disk))
return 0;
return disk->fops->report_zones(disk, sector, nr_zones,
--
2.51.0
^ permalink raw reply related [flat|nested] 63+ messages in thread
* [PATCH 03/13] block: handle zone management operations completions
2025-10-31 6:12 [PATCH 00/13] Introduce cached report zones Damien Le Moal
2025-10-31 6:12 ` [PATCH 01/13] block: freeze queue when updating zone resources Damien Le Moal
2025-10-31 6:12 ` [PATCH 02/13] block: cleanup blkdev_report_zones() Damien Le Moal
@ 2025-10-31 6:12 ` Damien Le Moal
2025-10-31 8:46 ` Christoph Hellwig
` (2 more replies)
2025-10-31 6:12 ` [PATCH 04/13] block: introduce disk_report_zone() Damien Le Moal
` (9 subsequent siblings)
12 siblings, 3 replies; 63+ messages in thread
From: Damien Le Moal @ 2025-10-31 6:12 UTC (permalink / raw)
To: Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
The functions blk_zone_wplug_handle_reset_or_finish() and
blk_zone_wplug_handle_reset_all() both modify the zone write pointer
offset of zone write plugs that are the target of a reset, reset all or
finish zone management operation. However, these functions do this
modification before the BIO is executed. So if the zone operation fails,
the modified zone write pointer offsets become invalid.
Avoid this by modifying the zone write pointer offset of a zone write
plug that is the target of a zone management operation when the
operation completes. To do so, modify blk_zone_bio_endio() to call the
new function blk_zone_mgmt_bio_endio() which in turn calls the functions
blk_zone_reset_all_bio_endio(), blk_zone_reset_bio_endio() or
blk_zone_finish_bio_endio() depending on the operation of the completed
BIO, to modify a zone write plug write pointer offset accordingly.
These functions are called only if the BIO execution was successful.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
block/blk-zoned.c | 139 ++++++++++++++++++++++++++++++----------------
block/blk.h | 14 +++++
2 files changed, 104 insertions(+), 49 deletions(-)
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 8d2111879328..b6332b3124fc 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -71,6 +71,11 @@ struct blk_zone_wplug {
struct gendisk *disk;
};
+static inline unsigned int disk_zone_wplugs_hash_size(struct gendisk *disk)
+{
+ return 1U << disk->zone_wplugs_hash_bits;
+}
+
/*
* Zone write plug flags bits:
* - BLK_ZONE_WPLUG_PLUGGED: Indicates that the zone write plug is plugged,
@@ -697,71 +702,91 @@ static int disk_zone_sync_wp_offset(struct gendisk *disk, sector_t sector)
disk_report_zones_cb, &args);
}
-static bool blk_zone_wplug_handle_reset_or_finish(struct bio *bio,
- unsigned int wp_offset)
+static void blk_zone_reset_bio_endio(struct bio *bio)
{
struct gendisk *disk = bio->bi_bdev->bd_disk;
- sector_t sector = bio->bi_iter.bi_sector;
struct blk_zone_wplug *zwplug;
- unsigned long flags;
-
- /* Conventional zones cannot be reset nor finished. */
- if (!bdev_zone_is_seq(bio->bi_bdev, sector)) {
- bio_io_error(bio);
- return true;
- }
-
- /*
- * No-wait reset or finish BIOs do not make much sense as the callers
- * issue these as blocking operations in most cases. To avoid issues
- * the BIO execution potentially failing with BLK_STS_AGAIN, warn about
- * REQ_NOWAIT being set and ignore that flag.
- */
- if (WARN_ON_ONCE(bio->bi_opf & REQ_NOWAIT))
- bio->bi_opf &= ~REQ_NOWAIT;
/*
- * If we have a zone write plug, set its write pointer offset to 0
- * (reset case) or to the zone size (finish case). This will abort all
- * BIOs plugged for the target zone. It is fine as resetting or
- * finishing zones while writes are still in-flight will result in the
+ * If we have a zone write plug, set its write pointer offset to 0.
+ * This will abort all BIOs plugged for the target zone. It is fine as
+ * resetting zones while writes are still in-flight will result in the
* writes failing anyway.
*/
- zwplug = disk_get_zone_wplug(disk, sector);
+ zwplug = disk_get_zone_wplug(disk, bio->bi_iter.bi_sector);
if (zwplug) {
+ unsigned long flags;
+
spin_lock_irqsave(&zwplug->lock, flags);
- disk_zone_wplug_set_wp_offset(disk, zwplug, wp_offset);
+ disk_zone_wplug_set_wp_offset(disk, zwplug, 0);
spin_unlock_irqrestore(&zwplug->lock, flags);
disk_put_zone_wplug(zwplug);
}
-
- return false;
}
-static bool blk_zone_wplug_handle_reset_all(struct bio *bio)
+static void blk_zone_reset_all_bio_endio(struct bio *bio)
{
struct gendisk *disk = bio->bi_bdev->bd_disk;
struct blk_zone_wplug *zwplug;
unsigned long flags;
- sector_t sector;
+ unsigned int i;
- /*
- * Set the write pointer offset of all zone write plugs to 0. This will
- * abort all plugged BIOs. It is fine as resetting zones while writes
- * are still in-flight will result in the writes failing anyway.
- */
- for (sector = 0; sector < get_capacity(disk);
- sector += disk->queue->limits.chunk_sectors) {
- zwplug = disk_get_zone_wplug(disk, sector);
- if (zwplug) {
+ /* Update the condition of all zone write plugs. */
+ rcu_read_lock();
+ for (i = 0; i < disk_zone_wplugs_hash_size(disk); i++) {
+ hlist_for_each_entry_rcu(zwplug, &disk->zone_wplugs_hash[i],
+ node) {
spin_lock_irqsave(&zwplug->lock, flags);
disk_zone_wplug_set_wp_offset(disk, zwplug, 0);
spin_unlock_irqrestore(&zwplug->lock, flags);
- disk_put_zone_wplug(zwplug);
}
}
+ rcu_read_unlock();
+}
- return false;
+static void blk_zone_finish_bio_endio(struct bio *bio)
+{
+ struct block_device *bdev = bio->bi_bdev;
+ struct gendisk *disk = bdev->bd_disk;
+ struct blk_zone_wplug *zwplug;
+
+ /*
+ * If we have a zone write plug, set its write pointer offset to the
+ * zone size. This will abort all BIOs plugged for the target zone. It
+ * is fine as resetting zones while writes are still in-flight will
+ * result in the writes failing anyway.
+ */
+ zwplug = disk_get_zone_wplug(disk, bio->bi_iter.bi_sector);
+ if (zwplug) {
+ unsigned long flags;
+
+ spin_lock_irqsave(&zwplug->lock, flags);
+ disk_zone_wplug_set_wp_offset(disk, zwplug,
+ bdev_zone_sectors(bdev));
+ spin_unlock_irqrestore(&zwplug->lock, flags);
+ disk_put_zone_wplug(zwplug);
+ }
+}
+
+void blk_zone_mgmt_bio_endio(struct bio *bio)
+{
+ /* If the BIO failed, we have nothing to do. */
+ if (bio->bi_status != BLK_STS_OK)
+ return;
+
+ switch (bio_op(bio)) {
+ case REQ_OP_ZONE_RESET:
+ blk_zone_reset_bio_endio(bio);
+ return;
+ case REQ_OP_ZONE_RESET_ALL:
+ blk_zone_reset_all_bio_endio(bio);
+ return;
+ case REQ_OP_ZONE_FINISH:
+ blk_zone_finish_bio_endio(bio);
+ return;
+ default:
+ return;
+ }
}
static void disk_zone_wplug_schedule_bio_work(struct gendisk *disk,
@@ -1105,6 +1130,30 @@ static void blk_zone_wplug_handle_native_zone_append(struct bio *bio)
disk_put_zone_wplug(zwplug);
}
+static bool blk_zone_wplug_handle_zone_mgmt(struct bio *bio)
+{
+ if (bio_op(bio) != REQ_OP_ZONE_RESET_ALL &&
+ !bdev_zone_is_seq(bio->bi_bdev, bio->bi_iter.bi_sector)) {
+ /*
+ * Zone reset and zone finish operations do not apply to
+ * conventional zones.
+ */
+ bio_io_error(bio);
+ return true;
+ }
+
+ /*
+ * No-wait zone management BIOs do not make much sense as the callers
+ * issue these as blocking operations in most cases. To avoid issues
+ * with the BIO execution potentially failing with BLK_STS_AGAIN, warn
+ * about REQ_NOWAIT being set and ignore that flag.
+ */
+ if (WARN_ON_ONCE(bio->bi_opf & REQ_NOWAIT))
+ bio->bi_opf &= ~REQ_NOWAIT;
+
+ return false;
+}
+
/**
* blk_zone_plug_bio - Handle a zone write BIO with zone write plugging
* @bio: The BIO being submitted
@@ -1152,12 +1201,9 @@ bool blk_zone_plug_bio(struct bio *bio, unsigned int nr_segs)
case REQ_OP_WRITE_ZEROES:
return blk_zone_wplug_handle_write(bio, nr_segs);
case REQ_OP_ZONE_RESET:
- return blk_zone_wplug_handle_reset_or_finish(bio, 0);
case REQ_OP_ZONE_FINISH:
- return blk_zone_wplug_handle_reset_or_finish(bio,
- bdev_zone_sectors(bdev));
case REQ_OP_ZONE_RESET_ALL:
- return blk_zone_wplug_handle_reset_all(bio);
+ return blk_zone_wplug_handle_zone_mgmt(bio);
default:
return false;
}
@@ -1331,11 +1377,6 @@ static void blk_zone_wplug_bio_work(struct work_struct *work)
disk_put_zone_wplug(zwplug);
}
-static inline unsigned int disk_zone_wplugs_hash_size(struct gendisk *disk)
-{
- return 1U << disk->zone_wplugs_hash_bits;
-}
-
void disk_init_zone_resources(struct gendisk *disk)
{
spin_lock_init(&disk->zone_wplugs_lock);
diff --git a/block/blk.h b/block/blk.h
index 32a10024efba..687b15e8bbd3 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -489,9 +489,23 @@ static inline bool blk_req_bio_is_zone_append(struct request *rq,
void blk_zone_write_plug_bio_merged(struct bio *bio);
void blk_zone_write_plug_init_request(struct request *rq);
void blk_zone_append_update_request_bio(struct request *rq, struct bio *bio);
+void blk_zone_mgmt_bio_endio(struct bio *bio);
void blk_zone_write_plug_bio_endio(struct bio *bio);
static inline void blk_zone_bio_endio(struct bio *bio)
{
+ /*
+ * Zone mamnagement BIOs may impact zone write plugs (e.g. a zone reset
+ * changes a zone write plug zone write pointer offset), but these
+ * operation do not go through zone write plugging as they may operate
+ * on zones that do not have a zone write
+ * plug. blk_zone_mgmt_bio_endio() handles the potential changes to zone
+ * write plugs that are present.
+ */
+ if (op_is_zone_mgmt(bio_op(bio))) {
+ blk_zone_mgmt_bio_endio(bio);
+ return;
+ }
+
/*
* For write BIOs to zoned devices, signal the completion of the BIO so
* that the next write BIO can be submitted by zone write plugging.
--
2.51.0
^ permalink raw reply related [flat|nested] 63+ messages in thread
* [PATCH 04/13] block: introduce disk_report_zone()
2025-10-31 6:12 [PATCH 00/13] Introduce cached report zones Damien Le Moal
` (2 preceding siblings ...)
2025-10-31 6:12 ` [PATCH 03/13] block: handle zone management operations completions Damien Le Moal
@ 2025-10-31 6:12 ` Damien Le Moal
2025-10-31 8:47 ` Christoph Hellwig
2025-10-31 20:54 ` Bart Van Assche
2025-10-31 6:12 ` [PATCH 05/13] block: reorganize struct blk_zone_wplug Damien Le Moal
` (8 subsequent siblings)
12 siblings, 2 replies; 63+ messages in thread
From: Damien Le Moal @ 2025-10-31 6:12 UTC (permalink / raw)
To: Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
Commit b76b840fd933 ("dm: Fix dm-zoned-reclaim zone write pointer
alignment") introduced an indirect call for the callback function of a
report zones executed with blkdev_report_zones(). This is necessary so
that the function disk_zone_wplug_sync_wp_offset() can be called to
refresh a zone write plug zone write pointer offset after a write error.
However, this solution makes following the path of a zone information
harder to understand.
Clean this up by introducing the new blk_report_zones_args structure to
define a zone report callback and its private data and introduce the
helper function disk_report_zone() which calls both
disk_zone_wplug_sync_wp_offset() and the zone report user callback
function for all zones of a zone report. This helper function must be
called by all block device drivers that implement the report zones
block operation in order to correctly report a zone information.
All block device drivers supporting the report_zones block operation are
updated to use this new scheme.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
block/blk-zoned.c | 79 ++++++++++++++++---------------
drivers/block/null_blk/null_blk.h | 3 +-
drivers/block/null_blk/zoned.c | 4 +-
drivers/block/ublk_drv.c | 4 +-
drivers/block/virtio_blk.c | 11 +++--
drivers/block/zloop.c | 4 +-
drivers/md/dm-zone.c | 54 +++++++++++----------
drivers/md/dm.h | 3 +-
drivers/nvme/host/core.c | 5 +-
drivers/nvme/host/multipath.c | 4 +-
drivers/nvme/host/nvme.h | 2 +-
drivers/nvme/host/zns.c | 10 ++--
drivers/scsi/sd.h | 2 +-
drivers/scsi/sd_zbc.c | 17 +++----
include/linux/blkdev.h | 7 ++-
include/linux/device-mapper.h | 10 +++-
16 files changed, 119 insertions(+), 100 deletions(-)
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index b6332b3124fc..603ed8dc472c 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -114,30 +114,16 @@ const char *blk_zone_cond_str(enum blk_zone_cond zone_cond)
}
EXPORT_SYMBOL_GPL(blk_zone_cond_str);
-struct disk_report_zones_cb_args {
- struct gendisk *disk;
- report_zones_cb user_cb;
- void *user_data;
+/*
+ * Zone report arguments for block device drivers report_zones operation.
+ * @cb: report_zones_cb callback for each reported zone.
+ * @data: Private data passed to report_zones_cb.
+ */
+struct blk_report_zones_args {
+ report_zones_cb cb;
+ void *data;
};
-static void disk_zone_wplug_sync_wp_offset(struct gendisk *disk,
- struct blk_zone *zone);
-
-static int disk_report_zones_cb(struct blk_zone *zone, unsigned int idx,
- void *data)
-{
- struct disk_report_zones_cb_args *args = data;
- struct gendisk *disk = args->disk;
-
- if (disk->zone_wplugs_hash)
- disk_zone_wplug_sync_wp_offset(disk, zone);
-
- if (!args->user_cb)
- return 0;
-
- return args->user_cb(zone, idx, args->user_data);
-}
-
/**
* blkdev_report_zones - Get zones information
* @bdev: Target block device
@@ -161,10 +147,9 @@ int blkdev_report_zones(struct block_device *bdev, sector_t sector,
unsigned int nr_zones, report_zones_cb cb, void *data)
{
struct gendisk *disk = bdev->bd_disk;
- struct disk_report_zones_cb_args args = {
- .disk = disk,
- .user_cb = cb,
- .user_data = data,
+ struct blk_report_zones_args args = {
+ .cb = cb,
+ .data = data,
};
if (!bdev_is_zoned(bdev) || WARN_ON_ONCE(!disk->fops->report_zones))
@@ -173,8 +158,7 @@ int blkdev_report_zones(struct block_device *bdev, sector_t sector,
if (!nr_zones || sector >= get_capacity(disk))
return 0;
- return disk->fops->report_zones(disk, sector, nr_zones,
- disk_report_zones_cb, &args);
+ return disk->fops->report_zones(disk, sector, nr_zones, &args);
}
EXPORT_SYMBOL_GPL(blkdev_report_zones);
@@ -692,15 +676,32 @@ static void disk_zone_wplug_sync_wp_offset(struct gendisk *disk,
disk_put_zone_wplug(zwplug);
}
-static int disk_zone_sync_wp_offset(struct gendisk *disk, sector_t sector)
+/**
+ * disk_report_zone - Report one zone
+ * @disk: Target disk
+ * @zone: The zone to report
+ * @idx: The index of the zone in the overall zone report
+ * @args: report zones callback and data
+ *
+ * Description:
+ * Helper function for block device drivers to report one zone of a zone
+ * report initiated with blkdev_report_zones(). The zone being reported is
+ * specified by @zone and used to update, if necessary, the zone write plug
+ * information for the zone. If @args specifies a user callback function,
+ * this callback is executed.
+ */
+int disk_report_zone(struct gendisk *disk, struct blk_zone *zone,
+ unsigned int idx, struct blk_report_zones_args *args)
{
- struct disk_report_zones_cb_args args = {
- .disk = disk,
- };
+ if (disk->zone_wplugs_hash)
+ disk_zone_wplug_sync_wp_offset(disk, zone);
+
+ if (args && args->cb)
+ return args->cb(zone, idx, args->data);
- return disk->fops->report_zones(disk, sector, 1,
- disk_report_zones_cb, &args);
+ return 0;
}
+EXPORT_SYMBOL_GPL(disk_report_zone);
static void blk_zone_reset_bio_endio(struct bio *bio)
{
@@ -1783,6 +1784,10 @@ int blk_revalidate_disk_zones(struct gendisk *disk)
sector_t capacity = get_capacity(disk);
struct blk_revalidate_zone_args args = { };
unsigned int noio_flag;
+ struct blk_report_zones_args rep_args = {
+ .cb = blk_revalidate_zone_cb,
+ .data = &args,
+ };
int ret = -ENOMEM;
if (WARN_ON_ONCE(!blk_queue_is_zoned(q)))
@@ -1814,8 +1819,7 @@ int blk_revalidate_disk_zones(struct gendisk *disk)
return ret;
}
- ret = disk->fops->report_zones(disk, 0, UINT_MAX,
- blk_revalidate_zone_cb, &args);
+ ret = disk->fops->report_zones(disk, 0, UINT_MAX, &rep_args);
if (!ret) {
pr_warn("%s: No zones reported\n", disk->disk_name);
ret = -ENODEV;
@@ -1866,6 +1870,7 @@ EXPORT_SYMBOL_GPL(blk_revalidate_disk_zones);
int blk_zone_issue_zeroout(struct block_device *bdev, sector_t sector,
sector_t nr_sects, gfp_t gfp_mask)
{
+ struct gendisk *disk = bdev->bd_disk;
int ret;
if (WARN_ON_ONCE(!bdev_is_zoned(bdev)))
@@ -1881,7 +1886,7 @@ int blk_zone_issue_zeroout(struct block_device *bdev, sector_t sector,
* pointer. Undo this using a report zone to update the zone write
* pointer to the correct current value.
*/
- ret = disk_zone_sync_wp_offset(bdev->bd_disk, sector);
+ ret = disk->fops->report_zones(disk, sector, 1, NULL);
if (ret != 1)
return ret < 0 ? ret : -EIO;
diff --git a/drivers/block/null_blk/null_blk.h b/drivers/block/null_blk/null_blk.h
index 7bb6128dbaaf..6c4c4bbe7dad 100644
--- a/drivers/block/null_blk/null_blk.h
+++ b/drivers/block/null_blk/null_blk.h
@@ -143,7 +143,8 @@ int null_init_zoned_dev(struct nullb_device *dev, struct queue_limits *lim);
int null_register_zoned_dev(struct nullb *nullb);
void null_free_zoned_dev(struct nullb_device *dev);
int null_report_zones(struct gendisk *disk, sector_t sector,
- unsigned int nr_zones, report_zones_cb cb, void *data);
+ unsigned int nr_zones,
+ struct blk_report_zones_args *args);
blk_status_t null_process_zoned_cmd(struct nullb_cmd *cmd, enum req_op op,
sector_t sector, sector_t nr_sectors);
size_t null_zone_valid_read_len(struct nullb *nullb,
diff --git a/drivers/block/null_blk/zoned.c b/drivers/block/null_blk/zoned.c
index 4e5728f45989..6a93b12a06ff 100644
--- a/drivers/block/null_blk/zoned.c
+++ b/drivers/block/null_blk/zoned.c
@@ -191,7 +191,7 @@ void null_free_zoned_dev(struct nullb_device *dev)
}
int null_report_zones(struct gendisk *disk, sector_t sector,
- unsigned int nr_zones, report_zones_cb cb, void *data)
+ unsigned int nr_zones, struct blk_report_zones_args *args)
{
struct nullb *nullb = disk->private_data;
struct nullb_device *dev = nullb->dev;
@@ -225,7 +225,7 @@ int null_report_zones(struct gendisk *disk, sector_t sector,
blkz.capacity = zone->capacity;
null_unlock_zone(dev, zone);
- error = cb(&blkz, i, data);
+ error = disk_report_zone(disk, &blkz, i, args);
if (error)
return error;
}
diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 0c74a41a6753..b1549e56c69b 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -368,7 +368,7 @@ static void *ublk_alloc_report_buffer(struct ublk_device *ublk,
}
static int ublk_report_zones(struct gendisk *disk, sector_t sector,
- unsigned int nr_zones, report_zones_cb cb, void *data)
+ unsigned int nr_zones, struct blk_report_zones_args *args)
{
struct ublk_device *ub = disk->private_data;
unsigned int zone_size_sectors = disk->queue->limits.chunk_sectors;
@@ -431,7 +431,7 @@ static int ublk_report_zones(struct gendisk *disk, sector_t sector,
if (!zone->len)
break;
- ret = cb(zone, i, data);
+ ret = disk_report_zone(disk, zone, i, args);
if (ret)
goto out;
diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index f061420dfb10..a5e97f03dbf0 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -584,7 +584,8 @@ static int virtblk_submit_zone_report(struct virtio_blk *vblk,
static int virtblk_parse_zone(struct virtio_blk *vblk,
struct virtio_blk_zone_descriptor *entry,
- unsigned int idx, report_zones_cb cb, void *data)
+ unsigned int idx,
+ struct blk_report_zones_args *args)
{
struct blk_zone zone = { };
@@ -650,12 +651,12 @@ static int virtblk_parse_zone(struct virtio_blk *vblk,
* The callback below checks the validity of the reported
* entry data, no need to further validate it here.
*/
- return cb(&zone, idx, data);
+ return disk_report_zone(vblk->disk, &zone, idx, args);
}
static int virtblk_report_zones(struct gendisk *disk, sector_t sector,
- unsigned int nr_zones, report_zones_cb cb,
- void *data)
+ unsigned int nr_zones,
+ struct blk_report_zones_args *args)
{
struct virtio_blk *vblk = disk->private_data;
struct virtio_blk_zone_report *report;
@@ -693,7 +694,7 @@ static int virtblk_report_zones(struct gendisk *disk, sector_t sector,
for (i = 0; i < nz && zone_idx < nr_zones; i++) {
ret = virtblk_parse_zone(vblk, &report->zones[i],
- zone_idx, cb, data);
+ zone_idx, args);
if (ret)
goto fail_report;
diff --git a/drivers/block/zloop.c b/drivers/block/zloop.c
index a423228e201b..92be9f0af00a 100644
--- a/drivers/block/zloop.c
+++ b/drivers/block/zloop.c
@@ -647,7 +647,7 @@ static int zloop_open(struct gendisk *disk, blk_mode_t mode)
}
static int zloop_report_zones(struct gendisk *disk, sector_t sector,
- unsigned int nr_zones, report_zones_cb cb, void *data)
+ unsigned int nr_zones, struct blk_report_zones_args *args)
{
struct zloop_device *zlo = disk->private_data;
struct blk_zone blkz = {};
@@ -687,7 +687,7 @@ static int zloop_report_zones(struct gendisk *disk, sector_t sector,
mutex_unlock(&zone->lock);
- ret = cb(&blkz, i, data);
+ ret = disk_report_zone(disk, &blkz, i, args);
if (ret)
return ret;
}
diff --git a/drivers/md/dm-zone.c b/drivers/md/dm-zone.c
index 78e17dd4d01b..984fb621b0e9 100644
--- a/drivers/md/dm-zone.c
+++ b/drivers/md/dm-zone.c
@@ -17,33 +17,26 @@
* For internal zone reports bypassing the top BIO submission path.
*/
static int dm_blk_do_report_zones(struct mapped_device *md, struct dm_table *t,
- sector_t sector, unsigned int nr_zones,
- report_zones_cb cb, void *data)
+ unsigned int nr_zones,
+ struct dm_report_zones_args *args)
{
- struct gendisk *disk = md->disk;
- int ret;
- struct dm_report_zones_args args = {
- .next_sector = sector,
- .orig_data = data,
- .orig_cb = cb,
- };
-
do {
struct dm_target *tgt;
+ int ret;
- tgt = dm_table_find_target(t, args.next_sector);
+ tgt = dm_table_find_target(t, args->next_sector);
if (WARN_ON_ONCE(!tgt->type->report_zones))
return -EIO;
- args.tgt = tgt;
- ret = tgt->type->report_zones(tgt, &args,
- nr_zones - args.zone_idx);
+ args->tgt = tgt;
+ ret = tgt->type->report_zones(tgt, args,
+ nr_zones - args->zone_idx);
if (ret < 0)
return ret;
- } while (args.zone_idx < nr_zones &&
- args.next_sector < get_capacity(disk));
+ } while (args->zone_idx < nr_zones &&
+ args->next_sector < get_capacity(md->disk));
- return args.zone_idx;
+ return args->zone_idx;
}
/*
@@ -52,7 +45,8 @@ static int dm_blk_do_report_zones(struct mapped_device *md, struct dm_table *t,
* generally implemented by targets using dm_report_zones().
*/
int dm_blk_report_zones(struct gendisk *disk, sector_t sector,
- unsigned int nr_zones, report_zones_cb cb, void *data)
+ unsigned int nr_zones,
+ struct blk_report_zones_args *args)
{
struct mapped_device *md = disk->private_data;
struct dm_table *map;
@@ -76,9 +70,14 @@ int dm_blk_report_zones(struct gendisk *disk, sector_t sector,
map = zone_revalidate_map;
}
- if (map)
- ret = dm_blk_do_report_zones(md, map, sector, nr_zones, cb,
- data);
+ if (map) {
+ struct dm_report_zones_args dm_args = {
+ .disk = md->disk,
+ .next_sector = sector,
+ .rep_args = args,
+ };
+ ret = dm_blk_do_report_zones(md, map, nr_zones, &dm_args);
+ }
if (put_table)
dm_put_live_table(md, srcu_idx);
@@ -113,7 +112,9 @@ static int dm_report_zones_cb(struct blk_zone *zone, unsigned int idx,
}
args->next_sector = zone->start + zone->len;
- return args->orig_cb(zone, args->zone_idx++, args->orig_data);
+
+ return disk_report_zone(args->disk, zone, args->zone_idx++,
+ args->rep_args);
}
/*
@@ -492,10 +493,15 @@ int dm_zone_get_reset_bitmap(struct mapped_device *md, struct dm_table *t,
sector_t sector, unsigned int nr_zones,
unsigned long *need_reset)
{
+ struct dm_report_zones_args args = {
+ .disk = md->disk,
+ .next_sector = sector,
+ .cb = dm_zone_need_reset_cb,
+ .data = need_reset,
+ };
int ret;
- ret = dm_blk_do_report_zones(md, t, sector, nr_zones,
- dm_zone_need_reset_cb, need_reset);
+ ret = dm_blk_do_report_zones(md, t, nr_zones, &args);
if (ret != nr_zones) {
DMERR("Get %s zone reset bitmap failed\n",
md->disk->disk_name);
diff --git a/drivers/md/dm.h b/drivers/md/dm.h
index 245f52b59215..7a795979ec72 100644
--- a/drivers/md/dm.h
+++ b/drivers/md/dm.h
@@ -109,7 +109,8 @@ void dm_finalize_zone_settings(struct dm_table *t, struct queue_limits *lim);
void dm_zone_endio(struct dm_io *io, struct bio *clone);
#ifdef CONFIG_BLK_DEV_ZONED
int dm_blk_report_zones(struct gendisk *disk, sector_t sector,
- unsigned int nr_zones, report_zones_cb cb, void *data);
+ unsigned int nr_zones,
+ struct blk_report_zones_args *args);
bool dm_is_zone_write(struct mapped_device *md, struct bio *bio);
int dm_zone_get_reset_bitmap(struct mapped_device *md, struct dm_table *t,
sector_t sector, unsigned int nr_zones,
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index fa4181d7de73..c0fe50fb7b08 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2599,10 +2599,9 @@ static void nvme_configure_opal(struct nvme_ctrl *ctrl, bool was_suspended)
#ifdef CONFIG_BLK_DEV_ZONED
static int nvme_report_zones(struct gendisk *disk, sector_t sector,
- unsigned int nr_zones, report_zones_cb cb, void *data)
+ unsigned int nr_zones, struct blk_report_zones_args *args)
{
- return nvme_ns_report_zones(disk->private_data, sector, nr_zones, cb,
- data);
+ return nvme_ns_report_zones(disk->private_data, sector, nr_zones, args);
}
#else
#define nvme_report_zones NULL
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 543e17aead12..0b7ac0735bd0 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -576,7 +576,7 @@ static int nvme_ns_head_get_unique_id(struct gendisk *disk, u8 id[16],
#ifdef CONFIG_BLK_DEV_ZONED
static int nvme_ns_head_report_zones(struct gendisk *disk, sector_t sector,
- unsigned int nr_zones, report_zones_cb cb, void *data)
+ unsigned int nr_zones, struct blk_report_zones_args *args)
{
struct nvme_ns_head *head = disk->private_data;
struct nvme_ns *ns;
@@ -585,7 +585,7 @@ static int nvme_ns_head_report_zones(struct gendisk *disk, sector_t sector,
srcu_idx = srcu_read_lock(&head->srcu);
ns = nvme_find_path(head);
if (ns)
- ret = nvme_ns_report_zones(ns, sector, nr_zones, cb, data);
+ ret = nvme_ns_report_zones(ns, sector, nr_zones, args);
srcu_read_unlock(&head->srcu, srcu_idx);
return ret;
}
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 102fae6a231c..928c748ccbd1 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -1108,7 +1108,7 @@ struct nvme_zone_info {
};
int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
- unsigned int nr_zones, report_zones_cb cb, void *data);
+ unsigned int nr_zones, struct blk_report_zones_args *args);
int nvme_query_zone_info(struct nvme_ns *ns, unsigned lbaf,
struct nvme_zone_info *zi);
void nvme_update_zone_info(struct nvme_ns *ns, struct queue_limits *lim,
diff --git a/drivers/nvme/host/zns.c b/drivers/nvme/host/zns.c
index cce4c5b55aa9..deea2dbef5b8 100644
--- a/drivers/nvme/host/zns.c
+++ b/drivers/nvme/host/zns.c
@@ -148,8 +148,8 @@ static void *nvme_zns_alloc_report_buffer(struct nvme_ns *ns,
static int nvme_zone_parse_entry(struct nvme_ns *ns,
struct nvme_zone_descriptor *entry,
- unsigned int idx, report_zones_cb cb,
- void *data)
+ unsigned int idx,
+ struct blk_report_zones_args *args)
{
struct nvme_ns_head *head = ns->head;
struct blk_zone zone = { };
@@ -169,11 +169,11 @@ static int nvme_zone_parse_entry(struct nvme_ns *ns,
else
zone.wp = nvme_lba_to_sect(head, le64_to_cpu(entry->wp));
- return cb(&zone, idx, data);
+ return disk_report_zone(ns->disk, &zone, idx, args);
}
int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
- unsigned int nr_zones, report_zones_cb cb, void *data)
+ unsigned int nr_zones, struct blk_report_zones_args *args)
{
struct nvme_zone_report *report;
struct nvme_command c = { };
@@ -213,7 +213,7 @@ int nvme_ns_report_zones(struct nvme_ns *ns, sector_t sector,
for (i = 0; i < nz && zone_idx < nr_zones; i++) {
ret = nvme_zone_parse_entry(ns, &report->entries[i],
- zone_idx, cb, data);
+ zone_idx, args);
if (ret)
goto out_free;
zone_idx++;
diff --git a/drivers/scsi/sd.h b/drivers/scsi/sd.h
index 36382eca941c..574af8243016 100644
--- a/drivers/scsi/sd.h
+++ b/drivers/scsi/sd.h
@@ -240,7 +240,7 @@ blk_status_t sd_zbc_setup_zone_mgmt_cmnd(struct scsi_cmnd *cmd,
unsigned int sd_zbc_complete(struct scsi_cmnd *cmd, unsigned int good_bytes,
struct scsi_sense_hdr *sshdr);
int sd_zbc_report_zones(struct gendisk *disk, sector_t sector,
- unsigned int nr_zones, report_zones_cb cb, void *data);
+ unsigned int nr_zones, struct blk_report_zones_args *args);
#else /* CONFIG_BLK_DEV_ZONED */
diff --git a/drivers/scsi/sd_zbc.c b/drivers/scsi/sd_zbc.c
index a8db66428f80..e57595a36dc5 100644
--- a/drivers/scsi/sd_zbc.c
+++ b/drivers/scsi/sd_zbc.c
@@ -44,12 +44,11 @@ static bool sd_zbc_is_gap_zone(const u8 buf[64])
* call @cb(blk_zone, @data).
*/
static int sd_zbc_parse_report(struct scsi_disk *sdkp, const u8 buf[64],
- unsigned int idx, report_zones_cb cb, void *data)
+ unsigned int idx, struct blk_report_zones_args *args)
{
struct scsi_device *sdp = sdkp->device;
struct blk_zone zone = { 0 };
sector_t start_lba, gran;
- int ret;
if (WARN_ON_ONCE(sd_zbc_is_gap_zone(buf)))
return -EINVAL;
@@ -87,11 +86,7 @@ static int sd_zbc_parse_report(struct scsi_disk *sdkp, const u8 buf[64],
else
zone.wp = logical_to_sectors(sdp, get_unaligned_be64(&buf[24]));
- ret = cb(&zone, idx, data);
- if (ret)
- return ret;
-
- return 0;
+ return disk_report_zone(sdkp->disk, &zone, idx, args);
}
/**
@@ -217,14 +212,14 @@ static inline sector_t sd_zbc_zone_sectors(struct scsi_disk *sdkp)
* @disk: Disk to report zones for.
* @sector: Start sector.
* @nr_zones: Maximum number of zones to report.
- * @cb: Callback function called to report zone information.
- * @data: Second argument passed to @cb.
+ * @args: Callback arguments.
*
* Called by the block layer to iterate over zone information. See also the
* disk->fops->report_zones() calls in block/blk-zoned.c.
*/
int sd_zbc_report_zones(struct gendisk *disk, sector_t sector,
- unsigned int nr_zones, report_zones_cb cb, void *data)
+ unsigned int nr_zones,
+ struct blk_report_zones_args *args)
{
struct scsi_disk *sdkp = scsi_disk(disk);
sector_t lba = sectors_to_logical(sdkp->device, sector);
@@ -283,7 +278,7 @@ int sd_zbc_report_zones(struct gendisk *disk, sector_t sector,
}
ret = sd_zbc_parse_report(sdkp, buf + offset, zone_idx,
- cb, data);
+ args);
if (ret)
goto out;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 99be263b31ab..2f75fb15f55f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -38,6 +38,7 @@ struct blk_flush_queue;
struct kiocb;
struct pr_ops;
struct rq_qos;
+struct blk_report_zones_args;
struct blk_queue_stats;
struct blk_stat_callback;
struct blk_crypto_profile;
@@ -432,6 +433,9 @@ struct queue_limits {
typedef int (*report_zones_cb)(struct blk_zone *zone, unsigned int idx,
void *data);
+int disk_report_zone(struct gendisk *disk, struct blk_zone *zone,
+ unsigned int idx, struct blk_report_zones_args *args);
+
#define BLK_ALL_ZONES ((unsigned int)-1)
int blkdev_report_zones(struct block_device *bdev, sector_t sector,
unsigned int nr_zones, report_zones_cb cb, void *data);
@@ -1662,7 +1666,8 @@ struct block_device_operations {
/* this callback is with swap_lock and sometimes page table lock held */
void (*swap_slot_free_notify) (struct block_device *, unsigned long);
int (*report_zones)(struct gendisk *, sector_t sector,
- unsigned int nr_zones, report_zones_cb cb, void *data);
+ unsigned int nr_zones,
+ struct blk_report_zones_args *args);
char *(*devnode)(struct gendisk *disk, umode_t *mode);
/* returns the length of the identifier or a negative errno: */
int (*get_unique_id)(struct gendisk *disk, u8 id[16],
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 84fdc3a6a19a..38f625af6ab4 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -538,12 +538,18 @@ void dm_submit_bio_remap(struct bio *clone, struct bio *tgt_clone);
#ifdef CONFIG_BLK_DEV_ZONED
struct dm_report_zones_args {
struct dm_target *tgt;
+ struct gendisk *disk;
sector_t next_sector;
- void *orig_data;
- report_zones_cb orig_cb;
unsigned int zone_idx;
+ /* for block layer ->report_zones */
+ struct blk_report_zones_args *rep_args;
+
+ /* for internal users */
+ report_zones_cb cb;
+ void *data;
+
/* must be filled by ->report_zones before calling dm_report_zones_cb */
sector_t start;
};
--
2.51.0
^ permalink raw reply related [flat|nested] 63+ messages in thread
* [PATCH 05/13] block: reorganize struct blk_zone_wplug
2025-10-31 6:12 [PATCH 00/13] Introduce cached report zones Damien Le Moal
` (3 preceding siblings ...)
2025-10-31 6:12 ` [PATCH 04/13] block: introduce disk_report_zone() Damien Le Moal
@ 2025-10-31 6:12 ` Damien Le Moal
2025-10-31 8:47 ` Christoph Hellwig
2025-10-31 20:55 ` Bart Van Assche
2025-10-31 6:13 ` [PATCH 06/13] block: use zone condition to determine conventional zones Damien Le Moal
` (7 subsequent siblings)
12 siblings, 2 replies; 63+ messages in thread
From: Damien Le Moal @ 2025-10-31 6:12 UTC (permalink / raw)
To: Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
Reorganize the fields of struct blk_zone_wplug to remove a hole after
the wp_offset field and avoid having the bio_work structure split
between 2 cache lines.
No functional changes.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
block/blk-zoned.c | 20 ++++++++++----------
1 file changed, 10 insertions(+), 10 deletions(-)
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 603ed8dc472c..29891253a920 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -39,6 +39,11 @@ static const char *const zone_cond_name[] = {
/*
* Per-zone write plug.
* @node: hlist_node structure for managing the plug using a hash table.
+ * @bio_list: The list of BIOs that are currently plugged.
+ * @bio_work: Work struct to handle issuing of plugged BIOs
+ * @rcu_head: RCU head to free zone write plugs with an RCU grace period.
+ * @disk: The gendisk the plug belongs to.
+ * @lock: Spinlock to atomically manipulate the plug.
* @ref: Zone write plug reference counter. A zone write plug reference is
* always at least 1 when the plug is hashed in the disk plug hash table.
* The reference is incremented whenever a new BIO needing plugging is
@@ -48,27 +53,22 @@ static const char *const zone_cond_name[] = {
* reference is dropped whenever the zone of the zone write plug is reset,
* finished and when the zone becomes full (last write BIO to the zone
* completes).
- * @lock: Spinlock to atomically manipulate the plug.
* @flags: Flags indicating the plug state.
* @zone_no: The number of the zone the plug is managing.
* @wp_offset: The zone write pointer location relative to the start of the zone
* as a number of 512B sectors.
- * @bio_list: The list of BIOs that are currently plugged.
- * @bio_work: Work struct to handle issuing of plugged BIOs
- * @rcu_head: RCU head to free zone write plugs with an RCU grace period.
- * @disk: The gendisk the plug belongs to.
*/
struct blk_zone_wplug {
struct hlist_node node;
- refcount_t ref;
- spinlock_t lock;
- unsigned int flags;
- unsigned int zone_no;
- unsigned int wp_offset;
struct bio_list bio_list;
struct work_struct bio_work;
struct rcu_head rcu_head;
struct gendisk *disk;
+ spinlock_t lock;
+ refcount_t ref;
+ unsigned int flags;
+ unsigned int zone_no;
+ unsigned int wp_offset;
};
static inline unsigned int disk_zone_wplugs_hash_size(struct gendisk *disk)
--
2.51.0
^ permalink raw reply related [flat|nested] 63+ messages in thread
* [PATCH 06/13] block: use zone condition to determine conventional zones
2025-10-31 6:12 [PATCH 00/13] Introduce cached report zones Damien Le Moal
` (4 preceding siblings ...)
2025-10-31 6:12 ` [PATCH 05/13] block: reorganize struct blk_zone_wplug Damien Le Moal
@ 2025-10-31 6:13 ` Damien Le Moal
2025-10-31 8:48 ` Christoph Hellwig
2025-10-31 21:04 ` Bart Van Assche
2025-10-31 6:13 ` [PATCH 07/13] block: track zone conditions Damien Le Moal
` (6 subsequent siblings)
12 siblings, 2 replies; 63+ messages in thread
From: Damien Le Moal @ 2025-10-31 6:13 UTC (permalink / raw)
To: Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
The conv_zones_bitmap field of struct gendisk is used to define a bitmap
to identify the conventional zones of a zoned block device. The bit for
a zone is set in this bitmap if the zone is a conventional one, that is,
if the zone type is BLK_ZONE_TYPE_CONVENTIONAL. For such zone, this
always corresponds to the zone condition BLK_ZONE_COND_NOT_WP.
In other words, conv_zones_bitmap tracks a single condition of the
zones of a zoned block device.
In preparation for tracking more zone conditions, change
conv_zones_bitmap into an array of zone conditions, using 1 byte per
zone. This increases the memory usage from 1 bit per zone to 1 bytes per
zone, that is, from 16 KiB to about 100 KiB for a 30 TB SMR HDD with 256
MiB zones. This is a trade-off to allow fast cached report zones later
on top of this change.
Rename the conv_zones_bitmap field of struct gendisk to zones_cond. Add
a blk_revalidate_zone_cond() function to initialize the zones_cond array
of a disk during device scan and to update it on device revalidation.
Move the allocation of the zones_cond array to
disk_revalidate_zone_resources(), making sure that this array is always
allocated, even for devices that do not need zone write plugs (zone
resources), to ensure that bdev_zone_is_seq() can be re-implemented to
use the zone condition array in place of the conv zones bitmap.
Finally, the function bdev_zone_is_seq() is rewritten to use a test on
the condition of the target zone.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
block/blk-zoned.c | 151 +++++++++++++++++++++++++++++------------
include/linux/blkdev.h | 32 ++-------
2 files changed, 111 insertions(+), 72 deletions(-)
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 29891253a920..4997f11caa0c 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -114,6 +114,33 @@ const char *blk_zone_cond_str(enum blk_zone_cond zone_cond)
}
EXPORT_SYMBOL_GPL(blk_zone_cond_str);
+/**
+ * bdev_zone_is_seq - check if a sector belongs to a sequential write zone
+ * @bdev: block device to check
+ * @sector: sector number
+ *
+ * Check if @sector on @bdev is contained in a sequential write required zone.
+ */
+bool bdev_zone_is_seq(struct block_device *bdev, sector_t sector)
+{
+ struct gendisk *disk = bdev->bd_disk;
+ unsigned int zno = disk_zone_no(disk, sector);
+ bool is_seq = false;
+ u8 *zones_cond;
+
+ if (!bdev_is_zoned(bdev))
+ return false;
+
+ rcu_read_lock();
+ zones_cond = rcu_dereference(disk->zones_cond);
+ if (zones_cond)
+ is_seq = zones_cond[zno] != BLK_ZONE_COND_NOT_WP;
+ rcu_read_unlock();
+
+ return is_seq;
+}
+EXPORT_SYMBOL_GPL(bdev_zone_is_seq);
+
/*
* Zone report arguments for block device drivers report_zones operation.
* @cb: report_zones_cb callback for each reported zone.
@@ -1458,22 +1485,16 @@ static void disk_destroy_zone_wplugs_hash_table(struct gendisk *disk)
disk->zone_wplugs_hash_bits = 0;
}
-static unsigned int disk_set_conv_zones_bitmap(struct gendisk *disk,
- unsigned long *bitmap)
+static void disk_set_zones_cond_array(struct gendisk *disk, u8 *zones_cond)
{
- unsigned int nr_conv_zones = 0;
unsigned long flags;
spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
- if (bitmap)
- nr_conv_zones = bitmap_weight(bitmap, disk->nr_zones);
- bitmap = rcu_replace_pointer(disk->conv_zones_bitmap, bitmap,
- lockdep_is_held(&disk->zone_wplugs_lock));
+ zones_cond = rcu_replace_pointer(disk->zones_cond, zones_cond,
+ lockdep_is_held(&disk->zone_wplugs_lock));
spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
- kfree_rcu_mightsleep(bitmap);
-
- return nr_conv_zones;
+ kfree_rcu_mightsleep(zones_cond);
}
void disk_free_zone_resources(struct gendisk *disk)
@@ -1497,7 +1518,7 @@ void disk_free_zone_resources(struct gendisk *disk)
mempool_destroy(disk->zone_wplugs_pool);
disk->zone_wplugs_pool = NULL;
- disk_set_conv_zones_bitmap(disk, NULL);
+ disk_set_zones_cond_array(disk, NULL);
disk->zone_capacity = 0;
disk->last_zone_capacity = 0;
disk->nr_zones = 0;
@@ -1516,12 +1537,31 @@ static inline bool disk_need_zone_resources(struct gendisk *disk)
queue_emulates_zone_append(disk->queue);
}
+struct blk_revalidate_zone_args {
+ struct gendisk *disk;
+ u8 *zones_cond;
+ unsigned int nr_zones;
+ unsigned int nr_conv_zones;
+ unsigned int zone_capacity;
+ unsigned int last_zone_capacity;
+ sector_t sector;
+};
+
static int disk_revalidate_zone_resources(struct gendisk *disk,
- unsigned int nr_zones)
+ struct blk_revalidate_zone_args *args)
{
struct queue_limits *lim = &disk->queue->limits;
unsigned int pool_size;
+ args->disk = disk;
+ args->nr_zones =
+ DIV_ROUND_UP_ULL(get_capacity(disk), lim->chunk_sectors);
+
+ /* Cached zone conditions: 1 byte per zone */
+ args->zones_cond = kzalloc(args->nr_zones, GFP_NOIO);
+ if (!args->zones_cond)
+ return -ENOMEM;
+
if (!disk_need_zone_resources(disk))
return 0;
@@ -1531,7 +1571,8 @@ static int disk_revalidate_zone_resources(struct gendisk *disk,
*/
pool_size = max(lim->max_open_zones, lim->max_active_zones);
if (!pool_size)
- pool_size = min(BLK_ZONE_WPLUG_DEFAULT_POOL_SIZE, nr_zones);
+ pool_size =
+ min(BLK_ZONE_WPLUG_DEFAULT_POOL_SIZE, args->nr_zones);
if (!disk->zone_wplugs_hash)
return disk_alloc_zone_resources(disk, pool_size);
@@ -1539,15 +1580,6 @@ static int disk_revalidate_zone_resources(struct gendisk *disk,
return 0;
}
-struct blk_revalidate_zone_args {
- struct gendisk *disk;
- unsigned long *conv_zones_bitmap;
- unsigned int nr_zones;
- unsigned int zone_capacity;
- unsigned int last_zone_capacity;
- sector_t sector;
-};
-
/*
* Update the disk zone resources information and device queue limits.
* The disk queue is frozen when this is executed.
@@ -1556,7 +1588,7 @@ static int disk_update_zone_resources(struct gendisk *disk,
struct blk_revalidate_zone_args *args)
{
struct request_queue *q = disk->queue;
- unsigned int nr_seq_zones, nr_conv_zones;
+ unsigned int nr_seq_zones;
unsigned int pool_size, memflags;
struct queue_limits lim;
int ret;
@@ -1566,24 +1598,24 @@ static int disk_update_zone_resources(struct gendisk *disk,
memflags = blk_mq_freeze_queue(q);
disk->nr_zones = args->nr_zones;
- disk->zone_capacity = args->zone_capacity;
- disk->last_zone_capacity = args->last_zone_capacity;
- nr_conv_zones =
- disk_set_conv_zones_bitmap(disk, args->conv_zones_bitmap);
- if (nr_conv_zones >= disk->nr_zones) {
+ if (args->nr_conv_zones >= disk->nr_zones) {
pr_warn("%s: Invalid number of conventional zones %u / %u\n",
- disk->disk_name, nr_conv_zones, disk->nr_zones);
+ disk->disk_name, args->nr_conv_zones, disk->nr_zones);
ret = -ENODEV;
goto unfreeze;
}
+ disk->zone_capacity = args->zone_capacity;
+ disk->last_zone_capacity = args->last_zone_capacity;
+ disk_set_zones_cond_array(disk, args->zones_cond);
+
/*
* Some devices can advertize zone resource limits that are larger than
* the number of sequential zones of the zoned block device, e.g. a
* small ZNS namespace. For such case, assume that the zoned device has
* no zone resource limits.
*/
- nr_seq_zones = disk->nr_zones - nr_conv_zones;
+ nr_seq_zones = disk->nr_zones - args->nr_conv_zones;
if (lim.max_open_zones >= nr_seq_zones)
lim.max_open_zones = 0;
if (lim.max_active_zones >= nr_seq_zones)
@@ -1621,6 +1653,44 @@ static int disk_update_zone_resources(struct gendisk *disk,
return ret;
}
+static int blk_revalidate_zone_cond(struct blk_zone *zone, unsigned int idx,
+ struct blk_revalidate_zone_args *args)
+{
+ enum blk_zone_cond cond = zone->cond;
+
+ /* Check that the zone condition is consistent with the zone type. */
+ switch (cond) {
+ case BLK_ZONE_COND_NOT_WP:
+ if (zone->type != BLK_ZONE_TYPE_CONVENTIONAL)
+ goto invalid_condition;
+ break;
+ case BLK_ZONE_COND_IMP_OPEN:
+ case BLK_ZONE_COND_EXP_OPEN:
+ case BLK_ZONE_COND_CLOSED:
+ case BLK_ZONE_COND_EMPTY:
+ case BLK_ZONE_COND_FULL:
+ case BLK_ZONE_COND_OFFLINE:
+ case BLK_ZONE_COND_READONLY:
+ if (zone->type != BLK_ZONE_TYPE_SEQWRITE_REQ)
+ goto invalid_condition;
+ break;
+ default:
+ pr_warn("%s: Invalid zone condition 0x%X\n",
+ args->disk->disk_name, cond);
+ return -ENODEV;
+ }
+
+ args->zones_cond[idx] = cond;
+
+ return 0;
+
+invalid_condition:
+ pr_warn("%s: Invalid zone condition 0x%x for type 0x%x\n",
+ args->disk->disk_name, cond, zone->type);
+
+ return -ENODEV;
+}
+
static int blk_revalidate_conv_zone(struct blk_zone *zone, unsigned int idx,
struct blk_revalidate_zone_args *args)
{
@@ -1635,17 +1705,7 @@ static int blk_revalidate_conv_zone(struct blk_zone *zone, unsigned int idx,
if (disk_zone_is_last(disk, zone))
args->last_zone_capacity = zone->capacity;
- if (!disk_need_zone_resources(disk))
- return 0;
-
- if (!args->conv_zones_bitmap) {
- args->conv_zones_bitmap =
- bitmap_zalloc(args->nr_zones, GFP_NOIO);
- if (!args->conv_zones_bitmap)
- return -ENOMEM;
- }
-
- set_bit(idx, args->conv_zones_bitmap);
+ args->nr_conv_zones++;
return 0;
}
@@ -1743,6 +1803,11 @@ static int blk_revalidate_zone_cb(struct blk_zone *zone, unsigned int idx,
return -ENODEV;
}
+ /* Check zone condition */
+ ret = blk_revalidate_zone_cond(zone, idx, args);
+ if (ret)
+ return ret;
+
/* Check zone type */
switch (zone->type) {
case BLK_ZONE_TYPE_CONVENTIONAL:
@@ -1810,10 +1875,8 @@ int blk_revalidate_disk_zones(struct gendisk *disk)
* Ensure that all memory allocations in this context are done as if
* GFP_NOIO was specified.
*/
- args.disk = disk;
- args.nr_zones = (capacity + zone_sectors - 1) >> ilog2(zone_sectors);
noio_flag = memalloc_noio_save();
- ret = disk_revalidate_zone_resources(disk, args.nr_zones);
+ ret = disk_revalidate_zone_resources(disk, &args);
if (ret) {
memalloc_noio_restore(noio_flag);
return ret;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 2f75fb15f55f..15cc13006d06 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -196,7 +196,7 @@ struct gendisk {
unsigned int nr_zones;
unsigned int zone_capacity;
unsigned int last_zone_capacity;
- unsigned long __rcu *conv_zones_bitmap;
+ u8 __rcu *zones_cond;
unsigned int zone_wplugs_hash_bits;
atomic_t nr_zone_wplugs;
spinlock_t zone_wplugs_lock;
@@ -925,6 +925,9 @@ static inline unsigned int bdev_zone_capacity(struct block_device *bdev,
{
return disk_zone_capacity(bdev->bd_disk, pos);
}
+
+bool bdev_zone_is_seq(struct block_device *bdev, sector_t sector);
+
#else /* CONFIG_BLK_DEV_ZONED */
static inline unsigned int disk_nr_zones(struct gendisk *disk)
{
@@ -1533,33 +1536,6 @@ static inline bool bdev_is_zone_aligned(struct block_device *bdev,
return bdev_is_zone_start(bdev, sector);
}
-/**
- * bdev_zone_is_seq - check if a sector belongs to a sequential write zone
- * @bdev: block device to check
- * @sector: sector number
- *
- * Check if @sector on @bdev is contained in a sequential write required zone.
- */
-static inline bool bdev_zone_is_seq(struct block_device *bdev, sector_t sector)
-{
- bool is_seq = false;
-
-#if IS_ENABLED(CONFIG_BLK_DEV_ZONED)
- if (bdev_is_zoned(bdev)) {
- struct gendisk *disk = bdev->bd_disk;
- unsigned long *bitmap;
-
- rcu_read_lock();
- bitmap = rcu_dereference(disk->conv_zones_bitmap);
- is_seq = !bitmap ||
- !test_bit(disk_zone_no(disk, sector), bitmap);
- rcu_read_unlock();
- }
-#endif
-
- return is_seq;
-}
-
int blk_zone_issue_zeroout(struct block_device *bdev, sector_t sector,
sector_t nr_sects, gfp_t gfp_mask);
--
2.51.0
^ permalink raw reply related [flat|nested] 63+ messages in thread
* [PATCH 07/13] block: track zone conditions
2025-10-31 6:12 [PATCH 00/13] Introduce cached report zones Damien Le Moal
` (5 preceding siblings ...)
2025-10-31 6:13 ` [PATCH 06/13] block: use zone condition to determine conventional zones Damien Le Moal
@ 2025-10-31 6:13 ` Damien Le Moal
2025-10-31 8:51 ` Christoph Hellwig
2025-10-31 21:17 ` Bart Van Assche
2025-10-31 6:13 ` [PATCH 08/13] block: introduce blkdev_get_zone_info() Damien Le Moal
` (5 subsequent siblings)
12 siblings, 2 replies; 63+ messages in thread
From: Damien Le Moal @ 2025-10-31 6:13 UTC (permalink / raw)
To: Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
The function blk_revalidate_zone_cond() already cache the condition of
all zones of a zoned block device in the zones_cond array of a gendisk.
However, the zone conditions are updated only when the device is scanned
or revalidated.
Implement tracking of the runtime changes to zone conditions using
the new cond field in struct blk_zone_wplug. The size of this structure
remains 112 Bytes as the new field replaces the 4 Bytes padding at the
end of the structure. For zones that do not have a zone write plug, the
zones_cond array of a disk is used to track changes to zone conditions,
e.g. when a zone reset, reset all or finish operation is executed.
Since a device may automatically close an implicitly open zone when
writing to an empty or closed zone, if the total number of open zones
has reached the device limit, the BLK_ZONE_COND_IMP_OPEN and
BLK_ZONE_COND_CLOSED zone conditions cannot be precisely tracked. To
overcome this, the zone condition BLK_ZONE_COND_ACTIVE is introduced to
represent a zone that has the condition BLK_ZONE_COND_IMP_OPEN,
BLK_ZONE_COND_EXP_OPEN or BLK_ZONE_COND_CLOSED. This follows the
definition of an active zone as defined in the NVMe Zoned Namespace
specifications. As such, for a zoned device that has a limit on the
maximum number of open zones, we will never have more zones in the
BLK_ZONE_COND_ACTIVE condition than the device limit. This is compatible
with the SCSI ZBC and ATA ZAC specifications for SMR HDDs as these
devices do not have a limit on the number of active zones.
The function disk_zone_wplug_set_wp_offset() is modified to use the new
helper disk_zone_wplug_update_cond() to update a zone write plug
condition whenever a zone write plug write offset is updated on
submission or merging of write BIOs to a zone.
The functions blk_zone_reset_bio_endio(), blk_zone_reset_all_bio_endio()
and blk_zone_finish_bio_endio() are modified to update the condition of
the zones targeted by reset, reset_all and finish operations, either
using though disk_zone_wplug_set_wp_offset() for zones that have a
zone write plug, or using the disk_zone_set_cond() helper to update the
zones_cond array of the disk for zones that do not have a zone write
plug.
When a zone write plug is removed from the disk hash table (when the
zone becomes empty or full), the condition of struct blk_zone_wplug is
used to update the disk zones_cond array. Conversely, when a zone write
plug is added to the disk hash table, the zones_cond array is used to
initialize the zone write plug condition.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
block/blk-zoned.c | 110 ++++++++++++++++++++++++++++++++--
include/uapi/linux/blkzoned.h | 6 ++
2 files changed, 110 insertions(+), 6 deletions(-)
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 4997f11caa0c..00cfd9431c3e 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -57,6 +57,7 @@ static const char *const zone_cond_name[] = {
* @zone_no: The number of the zone the plug is managing.
* @wp_offset: The zone write pointer location relative to the start of the zone
* as a number of 512B sectors.
+ * @cond: Condition of the zone
*/
struct blk_zone_wplug {
struct hlist_node node;
@@ -69,6 +70,7 @@ struct blk_zone_wplug {
unsigned int flags;
unsigned int zone_no;
unsigned int wp_offset;
+ enum blk_zone_cond cond;
};
static inline unsigned int disk_zone_wplugs_hash_size(struct gendisk *disk)
@@ -114,6 +116,57 @@ const char *blk_zone_cond_str(enum blk_zone_cond zone_cond)
}
EXPORT_SYMBOL_GPL(blk_zone_cond_str);
+static void blk_zone_set_cond(u8 *zones_cond, unsigned int zno,
+ enum blk_zone_cond cond)
+{
+ if (!zones_cond)
+ return;
+
+ switch (cond) {
+ case BLK_ZONE_COND_IMP_OPEN:
+ case BLK_ZONE_COND_EXP_OPEN:
+ case BLK_ZONE_COND_CLOSED:
+ zones_cond[zno] = BLK_ZONE_COND_ACTIVE;
+ return;
+ case BLK_ZONE_COND_NOT_WP:
+ case BLK_ZONE_COND_EMPTY:
+ case BLK_ZONE_COND_FULL:
+ case BLK_ZONE_COND_OFFLINE:
+ case BLK_ZONE_COND_READONLY:
+ default:
+ zones_cond[zno] = cond;
+ return;
+ }
+}
+
+static void disk_zone_set_cond(struct gendisk *disk, sector_t sector,
+ enum blk_zone_cond cond)
+{
+ u8 *zones_cond;
+
+ rcu_read_lock();
+ zones_cond = rcu_dereference(disk->zones_cond);
+ if (zones_cond) {
+ unsigned int zno = disk_zone_no(disk, sector);
+
+ /*
+ * The condition of a conventional, readonly and offline zones
+ * never changes, so do nothing if the target zone is in one of
+ * these conditions.
+ */
+ switch (zones_cond[zno]) {
+ case BLK_ZONE_COND_NOT_WP:
+ case BLK_ZONE_COND_READONLY:
+ case BLK_ZONE_COND_OFFLINE:
+ break;
+ default:
+ blk_zone_set_cond(zones_cond, zno, cond);
+ break;
+ }
+ }
+ rcu_read_unlock();
+}
+
/**
* bdev_zone_is_seq - check if a sector belongs to a sequential write zone
* @bdev: block device to check
@@ -416,6 +469,7 @@ static bool disk_insert_zone_wplug(struct gendisk *disk,
{
struct blk_zone_wplug *zwplg;
unsigned long flags;
+ u8 *zones_cond;
unsigned int idx =
hash_32(zwplug->zone_no, disk->zone_wplugs_hash_bits);
@@ -431,6 +485,12 @@ static bool disk_insert_zone_wplug(struct gendisk *disk,
return false;
}
}
+ zones_cond = rcu_dereference_check(disk->zones_cond,
+ lockdep_is_held(&disk->zone_wplugs_lock));
+ if (zones_cond)
+ zwplug->cond = zones_cond[zwplug->zone_no];
+ else
+ zwplug->cond = BLK_ZONE_COND_NOT_WP;
hlist_add_head_rcu(&zwplug->node, &disk->zone_wplugs_hash[idx]);
atomic_inc(&disk->nr_zone_wplugs);
spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
@@ -530,10 +590,15 @@ static void disk_remove_zone_wplug(struct gendisk *disk,
/*
* Mark the zone write plug as unhashed and drop the extra reference we
- * took when the plug was inserted in the hash table.
+ * took when the plug was inserted in the hash table. Also update the
+ * disk zone condition array with the current condition of the zone
+ * write plug.
*/
zwplug->flags |= BLK_ZONE_WPLUG_UNHASHED;
spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
+ blk_zone_set_cond(rcu_dereference_check(disk->zones_cond,
+ lockdep_is_held(&disk->zone_wplugs_lock)),
+ zwplug->zone_no, zwplug->cond);
hlist_del_init_rcu(&zwplug->node);
atomic_dec(&disk->nr_zone_wplugs);
spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
@@ -635,6 +700,22 @@ static void disk_zone_wplug_abort(struct blk_zone_wplug *zwplug)
blk_zone_wplug_bio_io_error(zwplug, bio);
}
+/*
+ * Update a zone write plug condition based on the write pointer offset.
+ */
+static void disk_zone_wplug_update_cond(struct gendisk *disk,
+ struct blk_zone_wplug *zwplug)
+{
+ lockdep_assert_held(&zwplug->lock);
+
+ if (disk_zone_wplug_is_full(disk, zwplug))
+ zwplug->cond = BLK_ZONE_COND_FULL;
+ else if (!zwplug->wp_offset)
+ zwplug->cond = BLK_ZONE_COND_EMPTY;
+ else
+ zwplug->cond = BLK_ZONE_COND_ACTIVE;
+}
+
/*
* Set a zone write plug write pointer offset to the specified value.
* This aborts all plugged BIOs, which is fine as this function is called for
@@ -650,6 +731,8 @@ static void disk_zone_wplug_set_wp_offset(struct gendisk *disk,
/* Update the zone write pointer and abort all plugged BIOs. */
zwplug->flags &= ~BLK_ZONE_WPLUG_NEED_WP_UPDATE;
zwplug->wp_offset = wp_offset;
+ disk_zone_wplug_update_cond(disk, zwplug);
+
disk_zone_wplug_abort(zwplug);
/*
@@ -733,6 +816,7 @@ EXPORT_SYMBOL_GPL(disk_report_zone);
static void blk_zone_reset_bio_endio(struct bio *bio)
{
struct gendisk *disk = bio->bi_bdev->bd_disk;
+ sector_t sector = bio->bi_iter.bi_sector;
struct blk_zone_wplug *zwplug;
/*
@@ -741,7 +825,7 @@ static void blk_zone_reset_bio_endio(struct bio *bio)
* resetting zones while writes are still in-flight will result in the
* writes failing anyway.
*/
- zwplug = disk_get_zone_wplug(disk, bio->bi_iter.bi_sector);
+ zwplug = disk_get_zone_wplug(disk, sector);
if (zwplug) {
unsigned long flags;
@@ -749,6 +833,8 @@ static void blk_zone_reset_bio_endio(struct bio *bio)
disk_zone_wplug_set_wp_offset(disk, zwplug, 0);
spin_unlock_irqrestore(&zwplug->lock, flags);
disk_put_zone_wplug(zwplug);
+ } else {
+ disk_zone_set_cond(disk, sector, BLK_ZONE_COND_EMPTY);
}
}
@@ -757,6 +843,7 @@ static void blk_zone_reset_all_bio_endio(struct bio *bio)
struct gendisk *disk = bio->bi_bdev->bd_disk;
struct blk_zone_wplug *zwplug;
unsigned long flags;
+ sector_t sector;
unsigned int i;
/* Update the condition of all zone write plugs. */
@@ -770,12 +857,18 @@ static void blk_zone_reset_all_bio_endio(struct bio *bio)
}
}
rcu_read_unlock();
+
+ /* Update the cached zone conditions. */
+ for (sector = 0; sector < get_capacity(disk);
+ sector += bdev_zone_sectors(bio->bi_bdev))
+ disk_zone_set_cond(disk, sector, BLK_ZONE_COND_EMPTY);
}
static void blk_zone_finish_bio_endio(struct bio *bio)
{
struct block_device *bdev = bio->bi_bdev;
struct gendisk *disk = bdev->bd_disk;
+ sector_t sector = bio->bi_iter.bi_sector;
struct blk_zone_wplug *zwplug;
/*
@@ -784,7 +877,7 @@ static void blk_zone_finish_bio_endio(struct bio *bio)
* is fine as resetting zones while writes are still in-flight will
* result in the writes failing anyway.
*/
- zwplug = disk_get_zone_wplug(disk, bio->bi_iter.bi_sector);
+ zwplug = disk_get_zone_wplug(disk, sector);
if (zwplug) {
unsigned long flags;
@@ -793,6 +886,8 @@ static void blk_zone_finish_bio_endio(struct bio *bio)
bdev_zone_sectors(bdev));
spin_unlock_irqrestore(&zwplug->lock, flags);
disk_put_zone_wplug(zwplug);
+ } else {
+ disk_zone_set_cond(disk, sector, BLK_ZONE_COND_FULL);
}
}
@@ -888,6 +983,7 @@ static inline void disk_zone_wplug_add_bio(struct gendisk *disk,
*/
void blk_zone_write_plug_bio_merged(struct bio *bio)
{
+ struct gendisk *disk = bio->bi_bdev->bd_disk;
struct blk_zone_wplug *zwplug;
unsigned long flags;
@@ -909,13 +1005,13 @@ void blk_zone_write_plug_bio_merged(struct bio *bio)
* have at least one request and one BIO referencing the zone write
* plug. So this should not fail.
*/
- zwplug = disk_get_zone_wplug(bio->bi_bdev->bd_disk,
- bio->bi_iter.bi_sector);
+ zwplug = disk_get_zone_wplug(disk, bio->bi_iter.bi_sector);
if (WARN_ON_ONCE(!zwplug))
return;
spin_lock_irqsave(&zwplug->lock, flags);
zwplug->wp_offset += bio_sectors(bio);
+ disk_zone_wplug_update_cond(disk, zwplug);
spin_unlock_irqrestore(&zwplug->lock, flags);
}
@@ -974,6 +1070,7 @@ void blk_zone_write_plug_init_request(struct request *req)
/* Drop the reference taken by disk_zone_wplug_add_bio(). */
blk_queue_exit(q);
zwplug->wp_offset += bio_sectors(bio);
+ disk_zone_wplug_update_cond(disk, zwplug);
req_back_sector += bio_sectors(bio);
}
@@ -1037,6 +1134,7 @@ static bool blk_zone_wplug_prepare_bio(struct blk_zone_wplug *zwplug,
/* Advance the zone write pointer offset. */
zwplug->wp_offset += bio_sectors(bio);
+ disk_zone_wplug_update_cond(disk, zwplug);
return true;
}
@@ -1680,7 +1778,7 @@ static int blk_revalidate_zone_cond(struct blk_zone *zone, unsigned int idx,
return -ENODEV;
}
- args->zones_cond[idx] = cond;
+ blk_zone_set_cond(args->zones_cond, idx, cond);
return 0;
diff --git a/include/uapi/linux/blkzoned.h b/include/uapi/linux/blkzoned.h
index f85743ef6e7d..dab5d9700898 100644
--- a/include/uapi/linux/blkzoned.h
+++ b/include/uapi/linux/blkzoned.h
@@ -61,6 +61,10 @@ enum blk_zone_type {
*
* Conditions 0x5 to 0xC are reserved by the current ZBC/ZAC spec and should
* be considered invalid.
+ *
+ * The condition BLK_ZONE_COND_ACTIVE is used to represent any of the
+ * BLK_ZONE_COND_IMP_OPEN, BLK_ZONE_COND_EXP_OPEN and BLK_ZONE_COND_CLOSED
+ * conditions.
*/
enum blk_zone_cond {
BLK_ZONE_COND_NOT_WP = 0x0,
@@ -71,6 +75,8 @@ enum blk_zone_cond {
BLK_ZONE_COND_READONLY = 0xD,
BLK_ZONE_COND_FULL = 0xE,
BLK_ZONE_COND_OFFLINE = 0xF,
+
+ BLK_ZONE_COND_ACTIVE = 0xFF,
};
/**
--
2.51.0
^ permalink raw reply related [flat|nested] 63+ messages in thread
* [PATCH 08/13] block: introduce blkdev_get_zone_info()
2025-10-31 6:12 [PATCH 00/13] Introduce cached report zones Damien Le Moal
` (6 preceding siblings ...)
2025-10-31 6:13 ` [PATCH 07/13] block: track zone conditions Damien Le Moal
@ 2025-10-31 6:13 ` Damien Le Moal
2025-10-31 8:52 ` Christoph Hellwig
2025-10-31 21:40 ` Bart Van Assche
2025-10-31 6:13 ` [PATCH 09/13] block: introduce blkdev_report_zones_cached() Damien Le Moal
` (4 subsequent siblings)
12 siblings, 2 replies; 63+ messages in thread
From: Damien Le Moal @ 2025-10-31 6:13 UTC (permalink / raw)
To: Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
Introduce the function blkdev_get_zone_info() to obtain a single zone
information from cached zone data, that is, either from the zone write
plug for the target zone if it exists and from the disk zones_cond
array otherwise.
Since sequential zones that do not have a zone write plug are either
full, empty or in a bad state (read-only or offline), the zone write
pointer can be inferred from the zone condition cached in the disk
zones_cond array. For sequential zones that have a zone write plug, the
zone condition and zone write pointer are obtained from the condition
and write pointer offset managed with the zone write plug. This allows
obtaining the information for a zone much more quickly than having to
execute a report zones command on the device.
blkdev_get_zone_info() falls back to using a regular zone report if the
target zone is flagged as needing an update with the
BLK_ZONE_WPLUG_NEED_WP_UPDATE flag, or if the target device does not
use zone write plugs (i.e. a device mapper device). In this case, the
new function blkdev_report_zone_fallback() is used and the zone
condition is reported consistantly with the cahced report, that is, the
BLK_ZONE_COND_ACTIVE condition is used in place of the implicit open,
explicit open and closed conditions. This is achieved by adding the
.report_active field to struct blk_report_zones_args and by having
disk_report_zone() sets the correct zone condition if .report_active is
true.
In preparation for using blkdev_get_zone_info() in upcoming file systems
changes, also export this function as a GPL symbol.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
block/blk-zoned.c | 165 +++++++++++++++++++++++++++++++++++++++--
include/linux/blkdev.h | 3 +
2 files changed, 160 insertions(+), 8 deletions(-)
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 00cfd9431c3e..03394e38645f 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -202,8 +202,24 @@ EXPORT_SYMBOL_GPL(bdev_zone_is_seq);
struct blk_report_zones_args {
report_zones_cb cb;
void *data;
+ bool report_active;
};
+static int blkdev_do_report_zones(struct block_device *bdev, sector_t sector,
+ unsigned int nr_zones,
+ struct blk_report_zones_args *args)
+{
+ struct gendisk *disk = bdev->bd_disk;
+
+ if (!bdev_is_zoned(bdev) || WARN_ON_ONCE(!disk->fops->report_zones))
+ return -EOPNOTSUPP;
+
+ if (!nr_zones || sector >= get_capacity(disk))
+ return 0;
+
+ return disk->fops->report_zones(disk, sector, nr_zones, args);
+}
+
/**
* blkdev_report_zones - Get zones information
* @bdev: Target block device
@@ -226,19 +242,12 @@ struct blk_report_zones_args {
int blkdev_report_zones(struct block_device *bdev, sector_t sector,
unsigned int nr_zones, report_zones_cb cb, void *data)
{
- struct gendisk *disk = bdev->bd_disk;
struct blk_report_zones_args args = {
.cb = cb,
.data = data,
};
- if (!bdev_is_zoned(bdev) || WARN_ON_ONCE(!disk->fops->report_zones))
- return -EOPNOTSUPP;
-
- if (!nr_zones || sector >= get_capacity(disk))
- return 0;
-
- return disk->fops->report_zones(disk, sector, nr_zones, &args);
+ return blkdev_do_report_zones(bdev, sector, nr_zones, &args);
}
EXPORT_SYMBOL_GPL(blkdev_report_zones);
@@ -803,6 +812,23 @@ static void disk_zone_wplug_sync_wp_offset(struct gendisk *disk,
int disk_report_zone(struct gendisk *disk, struct blk_zone *zone,
unsigned int idx, struct blk_report_zones_args *args)
{
+ if (args->report_active) {
+ /*
+ * If we come here, then this is a report zones as a fallback
+ * for a cached report. So collapse the implicit open, explicit
+ * open and closed conditions into the active zone condition.
+ */
+ switch (zone->cond) {
+ case BLK_ZONE_COND_IMP_OPEN:
+ case BLK_ZONE_COND_EXP_OPEN:
+ case BLK_ZONE_COND_CLOSED:
+ zone->cond = BLK_ZONE_COND_ACTIVE;
+ break;
+ default:
+ break;
+ }
+ }
+
if (disk->zone_wplugs_hash)
disk_zone_wplug_sync_wp_offset(disk, zone);
@@ -813,6 +839,129 @@ int disk_report_zone(struct gendisk *disk, struct blk_zone *zone,
}
EXPORT_SYMBOL_GPL(disk_report_zone);
+static int blkdev_report_zone_cb(struct blk_zone *zone, unsigned int idx,
+ void *data)
+{
+ memcpy(data, zone, sizeof(struct blk_zone));
+ return 0;
+}
+
+static int blkdev_report_zone_fallback(struct block_device *bdev,
+ sector_t sector, struct blk_zone *zone)
+{
+ struct blk_report_zones_args args = {
+ .cb = blkdev_report_zone_cb,
+ .data = zone,
+ .report_active = true,
+ };
+
+ return blkdev_do_report_zones(bdev, sector, 1, &args);
+}
+
+/**
+ * blkdev_get_zone_info - Get a zone information from cached data
+ * @bdev: Target block device
+ * @sector: Sector contained by the target zone
+ * @zone: zone structure to return the zone information
+ *
+ * Description:
+ * Get the zone information for the zone containing @sector using the zone
+ * write plug of the target zone, if one exist, or the disk zone condition
+ * array otherwise. The zone condition may be reported as being
+ * the BLK_ZONE_COND_ACTIVE condition for a zone that is in the implicit
+ * open, explicit open or closed condition.
+ *
+ * Returns 0 on success and a negative error code on failure.
+ */
+int blkdev_get_zone_info(struct block_device *bdev, sector_t sector,
+ struct blk_zone *zone)
+{
+ struct gendisk *disk = bdev->bd_disk;
+ sector_t zone_sectors = bdev_zone_sectors(bdev);
+ struct blk_zone_wplug *zwplug;
+ unsigned long flags;
+ u8 *zones_cond;
+
+ if (!bdev_is_zoned(bdev))
+ return -EOPNOTSUPP;
+
+ if (sector >= get_capacity(disk))
+ return -EINVAL;
+
+ memset(zone, 0, sizeof(*zone));
+ sector = sector & (~(zone_sectors - 1));
+
+ rcu_read_lock();
+ zones_cond = rcu_dereference(disk->zones_cond);
+ if (!disk->zone_wplugs_hash || !zones_cond) {
+ rcu_read_unlock();
+ return blkdev_report_zone_fallback(bdev, sector, zone);
+ }
+ zone->cond = zones_cond[disk_zone_no(disk, sector)];
+ rcu_read_unlock();
+
+ zone->start = sector;
+ zone->len = zone_sectors;
+
+ /*
+ * If this is a conventional zone, we do not have a zone write plug and
+ * can report the zone immediately.
+ */
+ if (zone->cond == BLK_ZONE_COND_NOT_WP) {
+ zone->type = BLK_ZONE_TYPE_CONVENTIONAL;
+ zone->capacity = zone_sectors;
+ zone->wp = ULLONG_MAX;
+ return 0;
+ }
+
+ /*
+ * This is a sequential write required zone. If the zone is read-only or
+ * offline, only set the zone write pointer to an invalid value and
+ * report the zone.
+ */
+ zone->type = BLK_ZONE_TYPE_SEQWRITE_REQ;
+ if (disk_zone_is_last(disk, zone))
+ zone->capacity = disk->last_zone_capacity;
+ else
+ zone->capacity = disk->zone_capacity;
+
+ if (zone->cond == BLK_ZONE_COND_READONLY ||
+ zone->cond == BLK_ZONE_COND_OFFLINE) {
+ zone->wp = ULLONG_MAX;
+ return 0;
+ }
+
+ /*
+ * If the zone does not have a zone write plug, it is either full or
+ * empty, as we otherwise would have a zone write plug for it. In this
+ * case, set the write pointer accordingly and report the zone.
+ * Otherwise, if we have a zone write plug, use it.
+ */
+ zwplug = disk_get_zone_wplug(disk, sector);
+ if (!zwplug) {
+ if (zone->cond == BLK_ZONE_COND_FULL)
+ zone->wp = sector + zone_sectors;
+ else
+ zone->wp = sector;
+ return 0;
+ }
+
+ spin_lock_irqsave(&zwplug->lock, flags);
+ if (zwplug->flags & BLK_ZONE_WPLUG_NEED_WP_UPDATE) {
+ spin_unlock_irqrestore(&zwplug->lock, flags);
+ disk_put_zone_wplug(zwplug);
+ return blkdev_report_zone_fallback(bdev, sector, zone);
+ }
+ zone->cond = zwplug->cond;
+ zone->wp = sector + zwplug->wp_offset;
+ spin_unlock_irqrestore(&zwplug->lock, flags);
+
+ disk_put_zone_wplug(zwplug);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(blkdev_get_zone_info);
+
static void blk_zone_reset_bio_endio(struct bio *bio)
{
struct gendisk *disk = bio->bi_bdev->bd_disk;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 15cc13006d06..98a0ed989d21 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -436,6 +436,9 @@ typedef int (*report_zones_cb)(struct blk_zone *zone, unsigned int idx,
int disk_report_zone(struct gendisk *disk, struct blk_zone *zone,
unsigned int idx, struct blk_report_zones_args *args);
+int blkdev_get_zone_info(struct block_device *bdev, sector_t sector,
+ struct blk_zone *zone);
+
#define BLK_ALL_ZONES ((unsigned int)-1)
int blkdev_report_zones(struct block_device *bdev, sector_t sector,
unsigned int nr_zones, report_zones_cb cb, void *data);
--
2.51.0
^ permalink raw reply related [flat|nested] 63+ messages in thread
* [PATCH 09/13] block: introduce blkdev_report_zones_cached()
2025-10-31 6:12 [PATCH 00/13] Introduce cached report zones Damien Le Moal
` (7 preceding siblings ...)
2025-10-31 6:13 ` [PATCH 08/13] block: introduce blkdev_get_zone_info() Damien Le Moal
@ 2025-10-31 6:13 ` Damien Le Moal
2025-10-31 8:53 ` Christoph Hellwig
2025-10-31 21:53 ` Bart Van Assche
2025-10-31 6:13 ` [PATCH 10/13] block: introduce BLKREPORTZONESV2 ioctl Damien Le Moal
` (3 subsequent siblings)
12 siblings, 2 replies; 63+ messages in thread
From: Damien Le Moal @ 2025-10-31 6:13 UTC (permalink / raw)
To: Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
Introduce the function blkdev_report_zones_cached() to provide a fast
report zone built using the blkdev_get_zone_info() function, which gets
zone information from a disk zones_cond array or zone write plugs.
For a large capacity SMR drive, such fast report zone can be completed
in a few millioseconds compared to several seconds completion times
when the report zone is obtained from the device.
The zone report is built in the same manner as with the regular
blkdev_report_zones() function, that is, the first zone reported is the
one containing the specified start sector and the report is limited to
the specified number of zones (nr_zones argument). The information for
each zone in the report is obtained using blkdev_get_zone_info().
For zoned device that do not use zone write plug resources,
using blkdev_get_zone_info() is inefficient as the zone report would
be very slow, generated one zone at a time. To avoid this,
blkdev_report_zones_cached() falls back to calling
blkdev_do_report_zones() to execute a regular zone report. In this case,
the .report_active field of struct blk_report_zones_args is set to true
to report zone conditions using the BLK_ZONE_COND_ACTIVE condition in
place of the implicit open, explicit open and closed conditions.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
block/blk-zoned.c | 88 +++++++++++++++++++++++++++++++++++-------
include/linux/blkdev.h | 2 +
2 files changed, 77 insertions(+), 13 deletions(-)
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 03394e38645f..0234bb7f41b3 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -73,6 +73,19 @@ struct blk_zone_wplug {
enum blk_zone_cond cond;
};
+static inline bool disk_need_zone_resources(struct gendisk *disk)
+{
+ /*
+ * All mq zoned devices need zone resources so that the block layer
+ * can automatically handle write BIO plugging. BIO-based device drivers
+ * (e.g. DM devices) are normally responsible for handling zone write
+ * ordering and do not need zone resources, unless the driver requires
+ * zone append emulation.
+ */
+ return queue_is_mq(disk->queue) ||
+ queue_emulates_zone_append(disk->queue);
+}
+
static inline unsigned int disk_zone_wplugs_hash_size(struct gendisk *disk)
{
return 1U << disk->zone_wplugs_hash_bits;
@@ -962,6 +975,68 @@ int blkdev_get_zone_info(struct block_device *bdev, sector_t sector,
}
EXPORT_SYMBOL_GPL(blkdev_get_zone_info);
+/**
+ * blkdev_report_zones_cached - Get cached zones information
+ * @bdev: Target block device
+ * @sector: Sector from which to report zones
+ * @nr_zones: Maximum number of zones to report
+ * @cb: Callback function called for each reported zone
+ * @data: Private data for the callback function
+ *
+ * Description:
+ * Similar to blkdev_report_zones() but instead of calling into the low level
+ * device driver to get the zone report from the device, use
+ * blkdev_get_zone_info() to generate the report from the disk zone write
+ * plugs and zones condition array. Since calling this function without a
+ * callback does not make sense, @cb must be specified.
+ */
+int blkdev_report_zones_cached(struct block_device *bdev, sector_t sector,
+ unsigned int nr_zones, report_zones_cb cb, void *data)
+{
+ struct gendisk *disk = bdev->bd_disk;
+ sector_t capacity = get_capacity(disk);
+ sector_t zone_sectors = bdev_zone_sectors(bdev);
+ unsigned int idx = 0;
+ struct blk_zone zone;
+ int ret;
+
+ if (!cb || !bdev_is_zoned(bdev) ||
+ WARN_ON_ONCE(!disk->fops->report_zones))
+ return -EOPNOTSUPP;
+
+ if (!nr_zones || sector >= capacity)
+ return 0;
+
+ /*
+ * If we do not have any zone write plug resources, fallback to using
+ * the regular zone report.
+ */
+ if (!disk_need_zone_resources(disk)) {
+ struct blk_report_zones_args args = {
+ .cb = cb,
+ .data = data,
+ .report_active = true,
+ };
+
+ return blkdev_do_report_zones(bdev, sector, nr_zones, &args);
+ }
+
+ for (sector = ALIGN(sector, zone_sectors);
+ sector < capacity && idx < nr_zones;
+ sector += zone_sectors, idx++) {
+ ret = blkdev_get_zone_info(bdev, sector, &zone);
+ if (ret)
+ return ret;
+
+ ret = cb(&zone, idx, data);
+ if (ret)
+ return ret;
+ }
+
+ return idx;
+}
+EXPORT_SYMBOL_GPL(blkdev_report_zones_cached);
+
static void blk_zone_reset_bio_endio(struct bio *bio)
{
struct gendisk *disk = bio->bi_bdev->bd_disk;
@@ -1771,19 +1846,6 @@ void disk_free_zone_resources(struct gendisk *disk)
disk->nr_zones = 0;
}
-static inline bool disk_need_zone_resources(struct gendisk *disk)
-{
- /*
- * All mq zoned devices need zone resources so that the block layer
- * can automatically handle write BIO plugging. BIO-based device drivers
- * (e.g. DM devices) are normally responsible for handling zone write
- * ordering and do not need zone resources, unless the driver requires
- * zone append emulation.
- */
- return queue_is_mq(disk->queue) ||
- queue_emulates_zone_append(disk->queue);
-}
-
struct blk_revalidate_zone_args {
struct gendisk *disk;
u8 *zones_cond;
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 98a0ed989d21..787eae461797 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -442,6 +442,8 @@ int blkdev_get_zone_info(struct block_device *bdev, sector_t sector,
#define BLK_ALL_ZONES ((unsigned int)-1)
int blkdev_report_zones(struct block_device *bdev, sector_t sector,
unsigned int nr_zones, report_zones_cb cb, void *data);
+int blkdev_report_zones_cached(struct block_device *bdev, sector_t sector,
+ unsigned int nr_zones, report_zones_cb cb, void *data);
int blkdev_zone_mgmt(struct block_device *bdev, enum req_op op,
sector_t sectors, sector_t nr_sectors);
int blk_revalidate_disk_zones(struct gendisk *disk);
--
2.51.0
^ permalink raw reply related [flat|nested] 63+ messages in thread
* [PATCH 10/13] block: introduce BLKREPORTZONESV2 ioctl
2025-10-31 6:12 [PATCH 00/13] Introduce cached report zones Damien Le Moal
` (8 preceding siblings ...)
2025-10-31 6:13 ` [PATCH 09/13] block: introduce blkdev_report_zones_cached() Damien Le Moal
@ 2025-10-31 6:13 ` Damien Le Moal
2025-10-31 8:54 ` Christoph Hellwig
2025-10-31 16:52 ` Bart Van Assche
2025-10-31 6:13 ` [PATCH 11/13] block: add zone write plug condition to debugfs zone_wplugs Damien Le Moal
` (2 subsequent siblings)
12 siblings, 2 replies; 63+ messages in thread
From: Damien Le Moal @ 2025-10-31 6:13 UTC (permalink / raw)
To: Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
Introduce the new BLKREPORTZONESV2 ioctl command to allow user
applications access to the fast zone report implemented by
blkdev_report_zones_cached(). This new ioctl is defined as number 142
and is documented in include/uapi/linux/fs.h.
Unlike the existing BLKREPORTZONES ioctl, this new ioctl uses the flags
field of struct blk_zone_report also as an inpiut. If as an input, the
user sets the flag BLK_ZONE_REP_CACHED, then
blkdev_report_zones_cached() is used to generate the zone report using
cached zone information. If this flag is not set, then BLKREPORTZONESV2
behaves in the same manner as BLKREPORTZONES and the zone report is
generated by accessing the zoned device.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
block/blk-zoned.c | 25 ++++++++++++++++++++++---
block/ioctl.c | 1 +
include/uapi/linux/blkzoned.h | 30 ++++++++++++++++++++++++++----
include/uapi/linux/fs.h | 2 +-
4 files changed, 50 insertions(+), 8 deletions(-)
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index 0234bb7f41b3..c8335654b1cd 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -356,7 +356,12 @@ static int blkdev_copy_zone_to_user(struct blk_zone *zone, unsigned int idx,
}
/*
- * BLKREPORTZONE ioctl processing.
+ * Mask of valid input flags for BLKREPORTZONEV2 ioctl.
+ */
+#define BLK_ZONE_REPV2_INPUT_FLAGS (BLK_ZONE_REP_CACHED)
+
+/*
+ * BLKREPORTZONE and BLKREPORTZONEV2 ioctl processing.
* Called from blkdev_ioctl.
*/
int blkdev_report_zones_ioctl(struct block_device *bdev, unsigned int cmd,
@@ -380,8 +385,22 @@ int blkdev_report_zones_ioctl(struct block_device *bdev, unsigned int cmd,
return -EINVAL;
args.zones = argp + sizeof(struct blk_zone_report);
- ret = blkdev_report_zones(bdev, rep.sector, rep.nr_zones,
- blkdev_copy_zone_to_user, &args);
+
+ switch (cmd) {
+ case BLKREPORTZONE:
+ ret = blkdev_report_zones(bdev, rep.sector, rep.nr_zones,
+ blkdev_copy_zone_to_user, &args);
+ break;
+ case BLKREPORTZONEV2:
+ if (rep.flags & ~BLK_ZONE_REPV2_INPUT_FLAGS)
+ return -EINVAL;
+ ret = blkdev_report_zones_cached(bdev, rep.sector, rep.nr_zones,
+ blkdev_copy_zone_to_user, &args);
+ break;
+ default:
+ return -EINVAL;
+ }
+
if (ret < 0)
return ret;
diff --git a/block/ioctl.c b/block/ioctl.c
index 3927ca4707d0..698629e4c619 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -581,6 +581,7 @@ static int blkdev_common_ioctl(struct block_device *bdev, blk_mode_t mode,
case BLKGETDISKSEQ:
return put_u64(argp, bdev->bd_disk->diskseq);
case BLKREPORTZONE:
+ case BLKREPORTZONEV2:
return blkdev_report_zones_ioctl(bdev, cmd, arg);
case BLKRESETZONE:
case BLKOPENZONE:
diff --git a/include/uapi/linux/blkzoned.h b/include/uapi/linux/blkzoned.h
index dab5d9700898..1441d79a6173 100644
--- a/include/uapi/linux/blkzoned.h
+++ b/include/uapi/linux/blkzoned.h
@@ -82,10 +82,20 @@ enum blk_zone_cond {
/**
* enum blk_zone_report_flags - Feature flags of reported zone descriptors.
*
- * @BLK_ZONE_REP_CAPACITY: Zone descriptor has capacity field.
+ * @BLK_ZONE_REP_CAPACITY: Output only. Indicates that zone descriptors in a
+ * zone report have a valid capacity field.
+ * @BLK_ZONE_REP_CACHED: Input only. Indicates that the zone report should be
+ * generated using cached zone information. In this case,
+ * the implicit open, explicit open and closed zone
+ * conditions are all reported with the
+ * BLK_ZONE_COND_ACTIVE condition.
*/
enum blk_zone_report_flags {
+ /* Output flags */
BLK_ZONE_REP_CAPACITY = (1 << 0),
+
+ /* Input flags */
+ BLK_ZONE_REP_CACHED = (1 << 31),
};
/**
@@ -128,6 +138,10 @@ struct blk_zone {
* @sector: starting sector of report
* @nr_zones: IN maximum / OUT actual
* @flags: one or more flags as defined by enum blk_zone_report_flags.
+ * @flags: one or more flags as defined by enum blk_zone_report_flags.
+ * With BLKREPORTZONE, this field is ignored as an input and is valid
+ * only as an output. Using BLKREPORTZONEV2, this field is used as both
+ * input and output.
* @zones: Space to hold @nr_zones @zones entries on reply.
*
* The array of at most @nr_zones must follow this structure in memory.
@@ -154,9 +168,16 @@ struct blk_zone_range {
/**
* Zoned block device ioctl's:
*
- * @BLKREPORTZONE: Get zone information. Takes a zone report as argument.
- * The zone report will start from the zone containing the
- * sector specified in the report request structure.
+ * @BLKREPORTZONE: Get zone information from a zoned device. Takes a zone report
+ * as argument. The zone report will start from the zone
+ * containing the sector specified in struct blk_zone_report.
+ * The flags field of struct blk_zone_report is used as an
+ * output only and ignored as an input.
+ * DEPRECATED, use BLKREPORTZONEV2 instead.
+ * @BLKREPORTZONEV2: Same as @BLKREPORTZONE but uses the flags field of
+ * struct blk_zone_report as an input, allowing to get a zone
+ * report using cached zone information if BLK_ZONE_REP_CACHED
+ * is set.
* @BLKRESETZONE: Reset the write pointer of the zones in the specified
* sector range. The sector range must be zone aligned.
* @BLKGETZONESZ: Get the device zone size in number of 512 B sectors.
@@ -175,5 +196,6 @@ struct blk_zone_range {
#define BLKOPENZONE _IOW(0x12, 134, struct blk_zone_range)
#define BLKCLOSEZONE _IOW(0x12, 135, struct blk_zone_range)
#define BLKFINISHZONE _IOW(0x12, 136, struct blk_zone_range)
+#define BLKREPORTZONEV2 _IOWR(0x12, 142, struct blk_zone_report)
#endif /* _UAPI_BLKZONED_H */
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 957ce3343a4f..66ca526cf786 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -298,7 +298,7 @@ struct file_attr {
#define BLKROTATIONAL _IO(0x12,126)
#define BLKZEROOUT _IO(0x12,127)
#define BLKGETDISKSEQ _IOR(0x12,128,__u64)
-/* 130-136 are used by zoned block device ioctls (uapi/linux/blkzoned.h) */
+/* 130-136 and 142 are used by zoned block device ioctls (uapi/linux/blkzoned.h) */
/* 137-141 are used by blk-crypto ioctls (uapi/linux/blk-crypto.h) */
#define BLKTRACESETUP2 _IOWR(0x12, 142, struct blk_user_trace_setup2)
--
2.51.0
^ permalink raw reply related [flat|nested] 63+ messages in thread
* [PATCH 11/13] block: add zone write plug condition to debugfs zone_wplugs
2025-10-31 6:12 [PATCH 00/13] Introduce cached report zones Damien Le Moal
` (9 preceding siblings ...)
2025-10-31 6:13 ` [PATCH 10/13] block: introduce BLKREPORTZONESV2 ioctl Damien Le Moal
@ 2025-10-31 6:13 ` Damien Le Moal
2025-10-31 8:54 ` Christoph Hellwig
2025-10-31 21:55 ` Bart Van Assche
2025-10-31 6:13 ` [PATCH 12/13] btrfs: use blkdev_report_zones_cached() Damien Le Moal
2025-10-31 6:13 ` [PATCH 13/13] xfs: " Damien Le Moal
12 siblings, 2 replies; 63+ messages in thread
From: Damien Le Moal @ 2025-10-31 6:13 UTC (permalink / raw)
To: Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
Modify queue_zone_wplug_show() to include the condition of a zone write
plug to the zone_wplugs debugfs attribute of a zoned block device.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
block/blk-zoned.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/block/blk-zoned.c b/block/blk-zoned.c
index c8335654b1cd..1ccd78beb538 100644
--- a/block/blk-zoned.c
+++ b/block/blk-zoned.c
@@ -2296,18 +2296,21 @@ static void queue_zone_wplug_show(struct blk_zone_wplug *zwplug,
unsigned int zwp_wp_offset, zwp_flags;
unsigned int zwp_zone_no, zwp_ref;
unsigned int zwp_bio_list_size;
+ unsigned int zwp_cond;
unsigned long flags;
spin_lock_irqsave(&zwplug->lock, flags);
zwp_zone_no = zwplug->zone_no;
zwp_flags = zwplug->flags;
zwp_ref = refcount_read(&zwplug->ref);
+ zwp_cond = zwplug->cond;
zwp_wp_offset = zwplug->wp_offset;
zwp_bio_list_size = bio_list_size(&zwplug->bio_list);
spin_unlock_irqrestore(&zwplug->lock, flags);
- seq_printf(m, "%u 0x%x %u %u %u\n", zwp_zone_no, zwp_flags, zwp_ref,
- zwp_wp_offset, zwp_bio_list_size);
+ seq_printf(m, "%u 0x%x %u 0x%x %u %u\n",
+ zwp_zone_no, zwp_flags, zwp_ref,
+ zwp_cond, zwp_wp_offset, zwp_bio_list_size);
}
int queue_zone_wplugs_show(void *data, struct seq_file *m)
--
2.51.0
^ permalink raw reply related [flat|nested] 63+ messages in thread
* [PATCH 12/13] btrfs: use blkdev_report_zones_cached()
2025-10-31 6:12 [PATCH 00/13] Introduce cached report zones Damien Le Moal
` (10 preceding siblings ...)
2025-10-31 6:13 ` [PATCH 11/13] block: add zone write plug condition to debugfs zone_wplugs Damien Le Moal
@ 2025-10-31 6:13 ` Damien Le Moal
2025-10-31 19:01 ` David Sterba
2025-10-31 6:13 ` [PATCH 13/13] xfs: " Damien Le Moal
12 siblings, 1 reply; 63+ messages in thread
From: Damien Le Moal @ 2025-10-31 6:13 UTC (permalink / raw)
To: Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
Modify btrfs_get_dev_zones() and btrfs_sb_log_location_bdev() to replace
the call to blkdev_report_zones() with blkdev_report_zones_cached() to
speed-up mount operations. btrfs_get_dev_zone_info() is also modified to
take into account the BLK_ZONE_COND_ACTIVE condition, which is
equivalent to either BLK_ZONE_COND_IMP_OPEN, BLK_ZONE_COND_EXP_OPEN or
BLK_ZONE_COND_CLOSED.
With this change, mounting a freshly formatted large capacity (30 TB)
SMR HDD completes under 100ms compared to over 1.8s before.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
fs/btrfs/zoned.c | 11 ++++++-----
1 file changed, 6 insertions(+), 5 deletions(-)
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 0ea0df18a8e4..a16b1a896c78 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -264,8 +264,8 @@ static int btrfs_get_dev_zones(struct btrfs_device *device, u64 pos,
}
}
- ret = blkdev_report_zones(device->bdev, pos >> SECTOR_SHIFT, *nr_zones,
- copy_zone_info_cb, zones);
+ ret = blkdev_report_zones_cached(device->bdev, pos >> SECTOR_SHIFT,
+ *nr_zones, copy_zone_info_cb, zones);
if (ret < 0) {
btrfs_err(device->fs_info,
"zoned: failed to read zone %llu on %s (devid %llu)",
@@ -494,6 +494,7 @@ int btrfs_get_dev_zone_info(struct btrfs_device *device, bool populate_cache)
case BLK_ZONE_COND_IMP_OPEN:
case BLK_ZONE_COND_EXP_OPEN:
case BLK_ZONE_COND_CLOSED:
+ case BLK_ZONE_COND_ACTIVE:
__set_bit(nreported, zone_info->active_zones);
nactive++;
break;
@@ -896,9 +897,9 @@ int btrfs_sb_log_location_bdev(struct block_device *bdev, int mirror, int rw,
if (sb_zone + 1 >= nr_zones)
return -ENOENT;
- ret = blkdev_report_zones(bdev, zone_start_sector(sb_zone, bdev),
- BTRFS_NR_SB_LOG_ZONES, copy_zone_info_cb,
- zones);
+ ret = blkdev_report_zones_cached(bdev, zone_start_sector(sb_zone, bdev),
+ BTRFS_NR_SB_LOG_ZONES,
+ copy_zone_info_cb, zones);
if (ret < 0)
return ret;
if (unlikely(ret != BTRFS_NR_SB_LOG_ZONES))
--
2.51.0
^ permalink raw reply related [flat|nested] 63+ messages in thread
* [PATCH 13/13] xfs: use blkdev_report_zones_cached()
2025-10-31 6:12 [PATCH 00/13] Introduce cached report zones Damien Le Moal
` (11 preceding siblings ...)
2025-10-31 6:13 ` [PATCH 12/13] btrfs: use blkdev_report_zones_cached() Damien Le Moal
@ 2025-10-31 6:13 ` Damien Le Moal
2025-10-31 8:55 ` Christoph Hellwig
12 siblings, 1 reply; 63+ messages in thread
From: Damien Le Moal @ 2025-10-31 6:13 UTC (permalink / raw)
To: Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
Modify xfs_mount_zones() to replace the call to blkdev_report_zones()
with blkdev_report_zones_cached() to speed-up mount operations.
With this change, mounting a freshly formatted large capacity (30 TB)
SMR HDD completes under 2s compared to over 4.7s before.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
fs/xfs/xfs_zone_alloc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_zone_alloc.c b/fs/xfs/xfs_zone_alloc.c
index 23cdab4515bb..8d819bc134cd 100644
--- a/fs/xfs/xfs_zone_alloc.c
+++ b/fs/xfs/xfs_zone_alloc.c
@@ -1231,7 +1231,7 @@ xfs_mount_zones(
trace_xfs_zones_mount(mp);
if (bdev_is_zoned(bt->bt_bdev)) {
- error = blkdev_report_zones(bt->bt_bdev,
+ error = blkdev_report_zones_cached(bt->bt_bdev,
XFS_FSB_TO_BB(mp, mp->m_sb.sb_rtstart),
mp->m_sb.sb_rgcount, xfs_get_zone_info_cb, &iz);
if (error < 0)
--
2.51.0
^ permalink raw reply related [flat|nested] 63+ messages in thread
* Re: [PATCH 01/13] block: freeze queue when updating zone resources
2025-10-31 6:12 ` [PATCH 01/13] block: freeze queue when updating zone resources Damien Le Moal
@ 2025-10-31 8:44 ` Christoph Hellwig
2025-10-31 17:48 ` Bart Van Assche
2025-11-03 11:17 ` Hannes Reinecke
2 siblings, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2025-10-31 8:44 UTC (permalink / raw)
To: Damien Le Moal
Cc: Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 02/13] block: cleanup blkdev_report_zones()
2025-10-31 6:12 ` [PATCH 02/13] block: cleanup blkdev_report_zones() Damien Le Moal
@ 2025-10-31 8:45 ` Christoph Hellwig
2025-10-31 17:55 ` Bart Van Assche
2025-11-03 11:15 ` Hannes Reinecke
2 siblings, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2025-10-31 8:45 UTC (permalink / raw)
To: Damien Le Moal
Cc: Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 03/13] block: handle zone management operations completions
2025-10-31 6:12 ` [PATCH 03/13] block: handle zone management operations completions Damien Le Moal
@ 2025-10-31 8:46 ` Christoph Hellwig
2025-10-31 18:01 ` Bart Van Assche
2025-11-03 11:41 ` Hannes Reinecke
2 siblings, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2025-10-31 8:46 UTC (permalink / raw)
To: Damien Le Moal
Cc: Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
On Fri, Oct 31, 2025 at 03:12:57PM +0900, Damien Le Moal wrote:
> The functions blk_zone_wplug_handle_reset_or_finish() and
> blk_zone_wplug_handle_reset_all() both modify the zone write pointer
> offset of zone write plugs that are the target of a reset, reset all or
> finish zone management operation. However, these functions do this
> modification before the BIO is executed. So if the zone operation fails,
> the modified zone write pointer offsets become invalid.
>
> Avoid this by modifying the zone write pointer offset of a zone write
> plug that is the target of a zone management operation when the
> operation completes. To do so, modify blk_zone_bio_endio() to call the
> new function blk_zone_mgmt_bio_endio() which in turn calls the functions
> blk_zone_reset_all_bio_endio(), blk_zone_reset_bio_endio() or
> blk_zone_finish_bio_endio() depending on the operation of the completed
> BIO, to modify a zone write plug write pointer offset accordingly.
> These functions are called only if the BIO execution was successful.
>
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Should this have a fixes tag and move to the start of the series?
Otherwise looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 04/13] block: introduce disk_report_zone()
2025-10-31 6:12 ` [PATCH 04/13] block: introduce disk_report_zone() Damien Le Moal
@ 2025-10-31 8:47 ` Christoph Hellwig
2025-10-31 20:54 ` Bart Van Assche
1 sibling, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2025-10-31 8:47 UTC (permalink / raw)
To: Damien Le Moal
Cc: Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 05/13] block: reorganize struct blk_zone_wplug
2025-10-31 6:12 ` [PATCH 05/13] block: reorganize struct blk_zone_wplug Damien Le Moal
@ 2025-10-31 8:47 ` Christoph Hellwig
2025-10-31 20:55 ` Bart Van Assche
1 sibling, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2025-10-31 8:47 UTC (permalink / raw)
To: Damien Le Moal
Cc: Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
On Fri, Oct 31, 2025 at 03:12:59PM +0900, Damien Le Moal wrote:
> Reorganize the fields of struct blk_zone_wplug to remove a hole after
> the wp_offset field and avoid having the bio_work structure split
> between 2 cache lines.
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 06/13] block: use zone condition to determine conventional zones
2025-10-31 6:13 ` [PATCH 06/13] block: use zone condition to determine conventional zones Damien Le Moal
@ 2025-10-31 8:48 ` Christoph Hellwig
2025-10-31 21:04 ` Bart Van Assche
1 sibling, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2025-10-31 8:48 UTC (permalink / raw)
To: Damien Le Moal
Cc: Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 07/13] block: track zone conditions
2025-10-31 6:13 ` [PATCH 07/13] block: track zone conditions Damien Le Moal
@ 2025-10-31 8:51 ` Christoph Hellwig
2025-10-31 21:17 ` Bart Van Assche
1 sibling, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2025-10-31 8:51 UTC (permalink / raw)
To: Damien Le Moal
Cc: Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
On Fri, Oct 31, 2025 at 03:13:01PM +0900, Damien Le Moal wrote:
> The function blk_revalidate_zone_cond() already cache the condition of
s/cache/caches/ ?
> diff --git a/include/uapi/linux/blkzoned.h b/include/uapi/linux/blkzoned.h
> index f85743ef6e7d..dab5d9700898 100644
> --- a/include/uapi/linux/blkzoned.h
> +++ b/include/uapi/linux/blkzoned.h
> @@ -61,6 +61,10 @@ enum blk_zone_type {
> *
> * Conditions 0x5 to 0xC are reserved by the current ZBC/ZAC spec and should
> * be considered invalid.
> + *
> + * The condition BLK_ZONE_COND_ACTIVE is used to represent any of the
> + * BLK_ZONE_COND_IMP_OPEN, BLK_ZONE_COND_EXP_OPEN and BLK_ZONE_COND_CLOSED
> + * conditions.
Maybe explain a bit more that this is only seen for cached reports, and
in that case the two open states and closed won't ever be seen, i.e.
that they are mutually exclusive?
Otherwise looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 08/13] block: introduce blkdev_get_zone_info()
2025-10-31 6:13 ` [PATCH 08/13] block: introduce blkdev_get_zone_info() Damien Le Moal
@ 2025-10-31 8:52 ` Christoph Hellwig
2025-10-31 21:40 ` Bart Van Assche
1 sibling, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2025-10-31 8:52 UTC (permalink / raw)
To: Damien Le Moal
Cc: Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
On Fri, Oct 31, 2025 at 03:13:02PM +0900, Damien Le Moal wrote:
> In preparation for using blkdev_get_zone_info() in upcoming file systems
> changes, also export this function as a GPL symbol.
As FYI I asked for this an plan to make use of this in XFS to unwind the
callbacks. But I'd rather do that a merge window later to minimize
cross-tree dependencies.
The patch looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 09/13] block: introduce blkdev_report_zones_cached()
2025-10-31 6:13 ` [PATCH 09/13] block: introduce blkdev_report_zones_cached() Damien Le Moal
@ 2025-10-31 8:53 ` Christoph Hellwig
2025-10-31 21:53 ` Bart Van Assche
1 sibling, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2025-10-31 8:53 UTC (permalink / raw)
To: Damien Le Moal
Cc: Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 10/13] block: introduce BLKREPORTZONESV2 ioctl
2025-10-31 6:13 ` [PATCH 10/13] block: introduce BLKREPORTZONESV2 ioctl Damien Le Moal
@ 2025-10-31 8:54 ` Christoph Hellwig
2025-10-31 16:52 ` Bart Van Assche
1 sibling, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2025-10-31 8:54 UTC (permalink / raw)
To: Damien Le Moal
Cc: Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 11/13] block: add zone write plug condition to debugfs zone_wplugs
2025-10-31 6:13 ` [PATCH 11/13] block: add zone write plug condition to debugfs zone_wplugs Damien Le Moal
@ 2025-10-31 8:54 ` Christoph Hellwig
2025-10-31 21:55 ` Bart Van Assche
1 sibling, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2025-10-31 8:54 UTC (permalink / raw)
To: Damien Le Moal
Cc: Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 13/13] xfs: use blkdev_report_zones_cached()
2025-10-31 6:13 ` [PATCH 13/13] xfs: " Damien Le Moal
@ 2025-10-31 8:55 ` Christoph Hellwig
0 siblings, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2025-10-31 8:55 UTC (permalink / raw)
To: Damien Le Moal
Cc: Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
(and with a more exhaustive change to remove the callbacks this should
get even better soon).
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 10/13] block: introduce BLKREPORTZONESV2 ioctl
2025-10-31 6:13 ` [PATCH 10/13] block: introduce BLKREPORTZONESV2 ioctl Damien Le Moal
2025-10-31 8:54 ` Christoph Hellwig
@ 2025-10-31 16:52 ` Bart Van Assche
2025-11-03 5:51 ` Damien Le Moal
1 sibling, 1 reply; 63+ messages in thread
From: Bart Van Assche @ 2025-10-31 16:52 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
On 10/30/25 11:13 PM, Damien Le Moal wrote:
> Unlike the existing BLKREPORTZONES ioctl, this new ioctl uses the flags
> field of struct blk_zone_report also as an inpiut. If as an input, the
inpiut -> input
> /*
> - * BLKREPORTZONE ioctl processing.
> + * Mask of valid input flags for BLKREPORTZONEV2 ioctl.
> + */
> +#define BLK_ZONE_REPV2_INPUT_FLAGS (BLK_ZONE_REP_CACHED)
Parentheses are not needed here - the definition of BLK_ZONE_REP_CACHED
should include parentheses if necessary.
> + case BLKREPORTZONEV2:
> + if (rep.flags & ~BLK_ZONE_REPV2_INPUT_FLAGS)
> + return -EINVAL;
-EINVAL probably should be changed into something that indicates "not
supported" rather than "invalid argument"?
> index dab5d9700898..1441d79a6173 100644
> --- a/include/uapi/linux/blkzoned.h
> +++ b/include/uapi/linux/blkzoned.h
> @@ -82,10 +82,20 @@ enum blk_zone_cond {
> /**
> * enum blk_zone_report_flags - Feature flags of reported zone descriptors.
> *
> - * @BLK_ZONE_REP_CAPACITY: Zone descriptor has capacity field.
> + * @BLK_ZONE_REP_CAPACITY: Output only. Indicates that zone descriptors in a
> + * zone report have a valid capacity field.
> + * @BLK_ZONE_REP_CACHED: Input only. Indicates that the zone report should be
> + * generated using cached zone information. In this case,
> + * the implicit open, explicit open and closed zone
> + * conditions are all reported with the
> + * BLK_ZONE_COND_ACTIVE condition.
> */
> enum blk_zone_report_flags {
> + /* Output flags */
> BLK_ZONE_REP_CAPACITY = (1 << 0),
> +
> + /* Input flags */
> + BLK_ZONE_REP_CACHED = (1 << 31),
> };
Why 1 << 31 instead of 1 << 1?
Otherwise this patch looks good to me.
Thanks,
Bart.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 01/13] block: freeze queue when updating zone resources
2025-10-31 6:12 ` [PATCH 01/13] block: freeze queue when updating zone resources Damien Le Moal
2025-10-31 8:44 ` Christoph Hellwig
@ 2025-10-31 17:48 ` Bart Van Assche
2025-11-03 5:55 ` Damien Le Moal
2025-11-03 11:17 ` Hannes Reinecke
2 siblings, 1 reply; 63+ messages in thread
From: Bart Van Assche @ 2025-10-31 17:48 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
On 10/30/25 11:12 PM, Damien Le Moal wrote:
> Modify disk_update_zone_resources() to freeze the device queue before
> updating the number of zones, zone capacity and other zone related
> resources. The locking order resulting from the call to
> queue_limits_commit_update_frozen() is preserved, that is, the queue
> limits lock is first taken by calling queue_limits_start_update() before
> freezing the queue, and the queue is unfrozen after executing
> queue_limits_commit_update(), which replaces the call to
> queue_limits_commit_update_frozen().
>
> This change ensures that there are no in-flights I/Os when the zone
> resources are updated due to a zone revalidation.
>
> Fixes: 0b83c86b444a ("block: Prevent potential deadlock in blk_revalidate_disk_zones()")
> Cc: stable@vger.kernel.org
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
> block/blk-zoned.c | 19 ++++++++++++++-----
> 1 file changed, 14 insertions(+), 5 deletions(-)
>
> diff --git a/block/blk-zoned.c b/block/blk-zoned.c
> index 5e2a5788dc3b..f3b371056df4 100644
> --- a/block/blk-zoned.c
> +++ b/block/blk-zoned.c
> @@ -1516,8 +1516,13 @@ static int disk_update_zone_resources(struct gendisk *disk,
> {
> struct request_queue *q = disk->queue;
> unsigned int nr_seq_zones, nr_conv_zones;
> - unsigned int pool_size;
> + unsigned int pool_size, memflags;
> struct queue_limits lim;
> + int ret;
> +
> + lim = queue_limits_start_update(q);
> +
> + memflags = blk_mq_freeze_queue(q);
>
> disk->nr_zones = args->nr_zones;
> disk->zone_capacity = args->zone_capacity;
> @@ -1527,11 +1532,10 @@ static int disk_update_zone_resources(struct gendisk *disk,
> if (nr_conv_zones >= disk->nr_zones) {
> pr_warn("%s: Invalid number of conventional zones %u / %u\n",
> disk->disk_name, nr_conv_zones, disk->nr_zones);
> - return -ENODEV;
> + ret = -ENODEV;
> + goto unfreeze;
> }
>
> - lim = queue_limits_start_update(q);
> -
> /*
> * Some devices can advertize zone resource limits that are larger than
> * the number of sequential zones of the zoned block device, e.g. a
> @@ -1568,7 +1572,12 @@ static int disk_update_zone_resources(struct gendisk *disk,
> }
>
> commit:
> - return queue_limits_commit_update_frozen(q, &lim);
> + ret = queue_limits_commit_update(q, &lim);
> +
> +unfreeze:
> + blk_mq_unfreeze_queue(q, memflags);
> +
> + return ret;
> }
>
> static int blk_revalidate_conv_zone(struct blk_zone *zone, unsigned int idx,
Hi Damien,
disk_update_zone_resources() only has a single caller and just below the
only call of this function the following code is present:
if (ret) {
unsigned int memflags = blk_mq_freeze_queue(q);
disk_free_zone_resources(disk);
blk_mq_unfreeze_queue(q, memflags);
}
Shouldn't this code be moved into disk_update_zone_resources() such that
error handling happens without unfreezing and refreezing the request
queue?
Thanks,
Bart.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 02/13] block: cleanup blkdev_report_zones()
2025-10-31 6:12 ` [PATCH 02/13] block: cleanup blkdev_report_zones() Damien Le Moal
2025-10-31 8:45 ` Christoph Hellwig
@ 2025-10-31 17:55 ` Bart Van Assche
2025-11-03 11:15 ` Hannes Reinecke
2 siblings, 0 replies; 63+ messages in thread
From: Bart Van Assche @ 2025-10-31 17:55 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
On 10/30/25 11:12 PM, Damien Le Moal wrote:
> The variable capacity is used only in one place and so can be removed
> and get_capacity(disk) used directly instead.
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 03/13] block: handle zone management operations completions
2025-10-31 6:12 ` [PATCH 03/13] block: handle zone management operations completions Damien Le Moal
2025-10-31 8:46 ` Christoph Hellwig
@ 2025-10-31 18:01 ` Bart Van Assche
2025-11-03 6:25 ` Damien Le Moal
2025-11-03 11:41 ` Hannes Reinecke
2 siblings, 1 reply; 63+ messages in thread
From: Bart Van Assche @ 2025-10-31 18:01 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
On 10/30/25 11:12 PM, Damien Le Moal wrote:
> +void blk_zone_mgmt_bio_endio(struct bio *bio)
> +{
> + /* If the BIO failed, we have nothing to do. */
> + if (bio->bi_status != BLK_STS_OK)
> + return;
> +
> + switch (bio_op(bio)) {
> + case REQ_OP_ZONE_RESET:
> + blk_zone_reset_bio_endio(bio);
> + return;
> + case REQ_OP_ZONE_RESET_ALL:
> + blk_zone_reset_all_bio_endio(bio);
> + return;
> + case REQ_OP_ZONE_FINISH:
> + blk_zone_finish_bio_endio(bio);
> + return;
> + default:
> + return;
> + }
> }
"default: return;" is superfluous and can be left out.
> + /*
> + * Zone mamnagement BIOs may impact zone write plugs (e.g. a zone reset
mamnagement -> management
Otherwise this patch looks good to me.
Thanks,
Bart.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 12/13] btrfs: use blkdev_report_zones_cached()
2025-10-31 6:13 ` [PATCH 12/13] btrfs: use blkdev_report_zones_cached() Damien Le Moal
@ 2025-10-31 19:01 ` David Sterba
0 siblings, 0 replies; 63+ messages in thread
From: David Sterba @ 2025-10-31 19:01 UTC (permalink / raw)
To: Damien Le Moal
Cc: Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
On Fri, Oct 31, 2025 at 03:13:06PM +0900, Damien Le Moal wrote:
> Modify btrfs_get_dev_zones() and btrfs_sb_log_location_bdev() to replace
> the call to blkdev_report_zones() with blkdev_report_zones_cached() to
> speed-up mount operations. btrfs_get_dev_zone_info() is also modified to
> take into account the BLK_ZONE_COND_ACTIVE condition, which is
> equivalent to either BLK_ZONE_COND_IMP_OPEN, BLK_ZONE_COND_EXP_OPEN or
> BLK_ZONE_COND_CLOSED.
>
> With this change, mounting a freshly formatted large capacity (30 TB)
> SMR HDD completes under 100ms compared to over 1.8s before.
>
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Acked-by: David Sterba <dsterba@suse.com>
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 04/13] block: introduce disk_report_zone()
2025-10-31 6:12 ` [PATCH 04/13] block: introduce disk_report_zone() Damien Le Moal
2025-10-31 8:47 ` Christoph Hellwig
@ 2025-10-31 20:54 ` Bart Van Assche
2025-11-03 5:56 ` Damien Le Moal
1 sibling, 1 reply; 63+ messages in thread
From: Bart Van Assche @ 2025-10-31 20:54 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
On 10/30/25 11:12 PM, Damien Le Moal wrote:
> -struct disk_report_zones_cb_args {
> - struct gendisk *disk;
> - report_zones_cb user_cb;
> - void *user_data;
> +/*
> + * Zone report arguments for block device drivers report_zones operation.
> + * @cb: report_zones_cb callback for each reported zone.
> + * @data: Private data passed to report_zones_cb.
> + */
> +struct blk_report_zones_args {
> + report_zones_cb cb;
> + void *data;
> };
The suffix "_args" seems confusing to me because this data structure
includes a callback pointer. Please consider changing "_args" into "_cb"
to make it clear that the data structure includes a callback pointer.
Another data structure that follows this convention is struct
blk_plug_cb:
struct blk_plug_cb;
typedef void (*blk_plug_cb_fn)(struct blk_plug_cb *, bool);
struct blk_plug_cb {
struct list_head list;
blk_plug_cb_fn callback;
void *data;
};
Since struct blk_report_zones_args is passed as an argument to
disk_report_zone(), how about renaming this data structure into
struct disk_report_zone_cb?'
Otherwise this patch looks good to me.
Thanks,
Bart.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 05/13] block: reorganize struct blk_zone_wplug
2025-10-31 6:12 ` [PATCH 05/13] block: reorganize struct blk_zone_wplug Damien Le Moal
2025-10-31 8:47 ` Christoph Hellwig
@ 2025-10-31 20:55 ` Bart Van Assche
1 sibling, 0 replies; 63+ messages in thread
From: Bart Van Assche @ 2025-10-31 20:55 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
On 10/30/25 11:12 PM, Damien Le Moal wrote:
> Reorganize the fields of struct blk_zone_wplug to remove a hole after
> the wp_offset field and avoid having the bio_work structure split
> between 2 cache lines.
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 06/13] block: use zone condition to determine conventional zones
2025-10-31 6:13 ` [PATCH 06/13] block: use zone condition to determine conventional zones Damien Le Moal
2025-10-31 8:48 ` Christoph Hellwig
@ 2025-10-31 21:04 ` Bart Van Assche
2025-11-03 6:00 ` Damien Le Moal
1 sibling, 1 reply; 63+ messages in thread
From: Bart Van Assche @ 2025-10-31 21:04 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
On 10/30/25 11:13 PM, Damien Le Moal wrote:
> zone. This increases the memory usage from 1 bit per zone to 1 bytes per
1 bytes -> 1 byte
> +bool bdev_zone_is_seq(struct block_device *bdev, sector_t sector)
> +{
> + struct gendisk *disk = bdev->bd_disk;
> + unsigned int zno = disk_zone_no(disk, sector);
> + bool is_seq = false;
> + u8 *zones_cond;
> +
> + if (!bdev_is_zoned(bdev))
> + return false;
> +
> + rcu_read_lock();
> + zones_cond = rcu_dereference(disk->zones_cond);
> + if (zones_cond)
> + is_seq = zones_cond[zno] != BLK_ZONE_COND_NOT_WP;
> + rcu_read_unlock();
> +
> + return is_seq;
> +}
> +EXPORT_SYMBOL_GPL(bdev_zone_is_seq);
'zno' should be compared to the size of the zones_cond[] array before
using 'zno' as an array index.
> static int disk_revalidate_zone_resources(struct gendisk *disk,
> - unsigned int nr_zones)
> + struct blk_revalidate_zone_args *args)
> {
> struct queue_limits *lim = &disk->queue->limits;
> unsigned int pool_size;
>
> + args->disk = disk;
> + args->nr_zones =
> + DIV_ROUND_UP_ULL(get_capacity(disk), lim->chunk_sectors);
> +
> + /* Cached zone conditions: 1 byte per zone */
> + args->zones_cond = kzalloc(args->nr_zones, GFP_NOIO);
> + if (!args->zones_cond)
> + return -ENOMEM;
Why args->nr_zones as array size instead of args->nr_conv_zones? The
patch description says that this array is only used for conventional
zones.
> /*
> * Some devices can advertize zone resource limits that are larger than
advertize -> advertise
Thanks,
Bart.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 07/13] block: track zone conditions
2025-10-31 6:13 ` [PATCH 07/13] block: track zone conditions Damien Le Moal
2025-10-31 8:51 ` Christoph Hellwig
@ 2025-10-31 21:17 ` Bart Van Assche
2025-11-03 6:05 ` Damien Le Moal
1 sibling, 1 reply; 63+ messages in thread
From: Bart Van Assche @ 2025-10-31 21:17 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
On 10/30/25 11:13 PM, Damien Le Moal wrote:
> Implement tracking of the runtime changes to zone conditions using
> the new cond field in struct blk_zone_wplug. The size of this structure
> remains 112 Bytes as the new field replaces the 4 Bytes padding at the
> end of the structure. For zones that do not have a zone write plug, the
> zones_cond array of a disk is used to track changes to zone conditions,
> e.g. when a zone reset, reset all or finish operation is executed.
Why is it necessary to track the condition of sequential zones that do
not have a zone write plug? Please explain what the use cases are.
The zoned UFS device on my desk has 3420 sequential zones and zero
conventional zones. If the condition of zones that do not have a zone
write plug wouldn't be tracked that would save some memory.
> +static void blk_zone_set_cond(u8 *zones_cond, unsigned int zno,
> + enum blk_zone_cond cond)
> +{
> + if (!zones_cond)
> + return;
> +
> + switch (cond) {
> + case BLK_ZONE_COND_IMP_OPEN:
> + case BLK_ZONE_COND_EXP_OPEN:
> + case BLK_ZONE_COND_CLOSED:
> + zones_cond[zno] = BLK_ZONE_COND_ACTIVE;
> + return;
> + case BLK_ZONE_COND_NOT_WP:
> + case BLK_ZONE_COND_EMPTY:
> + case BLK_ZONE_COND_FULL:
> + case BLK_ZONE_COND_OFFLINE:
> + case BLK_ZONE_COND_READONLY:
> + default:
> + zones_cond[zno] = cond;
> + return;
> + }
> +}
> +
> +static void disk_zone_set_cond(struct gendisk *disk, sector_t sector,
> + enum blk_zone_cond cond)
> +{
> + u8 *zones_cond;
> +
> + rcu_read_lock();
> + zones_cond = rcu_dereference(disk->zones_cond);
> + if (zones_cond) {
> + unsigned int zno = disk_zone_no(disk, sector);
> +
> + /*
> + * The condition of a conventional, readonly and offline zones
> + * never changes, so do nothing if the target zone is in one of
> + * these conditions.
> + */
> + switch (zones_cond[zno]) {
> + case BLK_ZONE_COND_NOT_WP:
> + case BLK_ZONE_COND_READONLY:
> + case BLK_ZONE_COND_OFFLINE:
> + break;
> + default:
> + blk_zone_set_cond(zones_cond, zno, cond);
> + break;
> + }
> + }
> + rcu_read_unlock();
> +}
Why does blk_zone_set_cond() accept a zone number as second argument and
why does disk_zone_set_cond() accept a sector number as second argument?
The callers of disk_zone_set_cond() can be optimized if its second
argument would be changed from a sector number into a zone number.
Thanks,
Bart.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 08/13] block: introduce blkdev_get_zone_info()
2025-10-31 6:13 ` [PATCH 08/13] block: introduce blkdev_get_zone_info() Damien Le Moal
2025-10-31 8:52 ` Christoph Hellwig
@ 2025-10-31 21:40 ` Bart Van Assche
2025-11-03 6:08 ` Damien Le Moal
1 sibling, 1 reply; 63+ messages in thread
From: Bart Van Assche @ 2025-10-31 21:40 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
On 10/30/25 11:13 PM, Damien Le Moal wrote:
> +static int blkdev_do_report_zones(struct block_device *bdev, sector_t sector,
> + unsigned int nr_zones,
> + struct blk_report_zones_args *args)
> +{
> + struct gendisk *disk = bdev->bd_disk;
> +
> + if (!bdev_is_zoned(bdev) || WARN_ON_ONCE(!disk->fops->report_zones))
> + return -EOPNOTSUPP;
> +
> + if (!nr_zones || sector >= get_capacity(disk))
> + return 0;
> +
> + return disk->fops->report_zones(disk, sector, nr_zones, args);
> +}
> +
> /**
> * blkdev_report_zones - Get zones information
> * @bdev: Target block device
> @@ -226,19 +242,12 @@ struct blk_report_zones_args {
> int blkdev_report_zones(struct block_device *bdev, sector_t sector,
> unsigned int nr_zones, report_zones_cb cb, void *data)
> {
> - struct gendisk *disk = bdev->bd_disk;
> struct blk_report_zones_args args = {
> .cb = cb,
> .data = data,
> };
>
> - if (!bdev_is_zoned(bdev) || WARN_ON_ONCE(!disk->fops->report_zones))
> - return -EOPNOTSUPP;
> -
> - if (!nr_zones || sector >= get_capacity(disk))
> - return 0;
> -
> - return disk->fops->report_zones(disk, sector, nr_zones, &args);
> + return blkdev_do_report_zones(bdev, sector, nr_zones, &args);
> }
> EXPORT_SYMBOL_GPL(blkdev_report_zones);
One change per patch please. Please split this patch into one patch that
introduces the blkdev_do_report_zones() function and another patch with
the remaining changes from this patch.
> +/**
> + * blkdev_get_zone_info - Get a zone information from cached data
"Get a zone information" -> "Get zone information"
> + sector = sector & (~(zone_sectors - 1));
Please consider changing the above assignment into:
sector -= bdev_offset_from_zone_start(bdev, sector);
> + /*
> + * If the zone does not have a zone write plug, it is either full or
> + * empty, as we otherwise would have a zone write plug for it. In this
> + * case, set the write pointer accordingly and report the zone.
> + * Otherwise, if we have a zone write plug, use it.
> + */
> + zwplug = disk_get_zone_wplug(disk, sector);
> + if (!zwplug) {
> + if (zone->cond == BLK_ZONE_COND_FULL)
> + zone->wp = sector + zone_sectors;
> + else
> + zone->wp = sector;
> + return 0;
> + }
According to ZBC-2 the write pointer is invalid for full zones.
Thanks,
Bart.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 09/13] block: introduce blkdev_report_zones_cached()
2025-10-31 6:13 ` [PATCH 09/13] block: introduce blkdev_report_zones_cached() Damien Le Moal
2025-10-31 8:53 ` Christoph Hellwig
@ 2025-10-31 21:53 ` Bart Van Assche
2025-11-03 6:12 ` Damien Le Moal
2025-11-03 7:18 ` Damien Le Moal
1 sibling, 2 replies; 63+ messages in thread
From: Bart Van Assche @ 2025-10-31 21:53 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
On 10/30/25 11:13 PM, Damien Le Moal wrote:
> Introduce the function blkdev_report_zones_cached() to provide a fast
> report zone built using the blkdev_get_zone_info() function, which gets
> zone information from a disk zones_cond array or zone write plugs.
> For a large capacity SMR drive, such fast report zone can be completed
> in a few millioseconds compared to several seconds completion times
> when the report zone is obtained from the device.
millioseconds -> milliseconds
Does retrieving the cached zone information really require multiple
milliseconds instead of only a few microseconds?
> For zoned device that do not use zone write plug resources,
zoned device -> zoned devices
> +static inline bool disk_need_zone_resources(struct gendisk *disk)
> +{
> + /*
> + * All mq zoned devices need zone resources so that the block layer
> + * can automatically handle write BIO plugging. BIO-based device drivers
> + * (e.g. DM devices) are normally responsible for handling zone write
> + * ordering and do not need zone resources, unless the driver requires
> + * zone append emulation.
> + */
> + return queue_is_mq(disk->queue) ||
> + queue_emulates_zone_append(disk->queue);
> +}
Today queue_is_mq() returns true for request-based queues only. Since
this is the terminology used elsewhere in the block layer, maybe change
"mq zoned devices" into "request-based zoned block devices"?
> static inline unsigned int disk_zone_wplugs_hash_size(struct gendisk *disk)
> {
> return 1U << disk->zone_wplugs_hash_bits;
> @@ -962,6 +975,68 @@ int blkdev_get_zone_info(struct block_device *bdev, sector_t sector,
> }
> EXPORT_SYMBOL_GPL(blkdev_get_zone_info);
>
> +/**
> + * blkdev_report_zones_cached - Get cached zones information
> + * @bdev: Target block device
> + * @sector: Sector from which to report zones
> + * @nr_zones: Maximum number of zones to report
> + * @cb: Callback function called for each reported zone
> + * @data: Private data for the callback function
> + *
> + * Description:
> + * Similar to blkdev_report_zones() but instead of calling into the low level
> + * device driver to get the zone report from the device, use
> + * blkdev_get_zone_info() to generate the report from the disk zone write
> + * plugs and zones condition array. Since calling this function without a
> + * callback does not make sense, @cb must be specified.
> + */
> +int blkdev_report_zones_cached(struct block_device *bdev, sector_t sector,
> + unsigned int nr_zones, report_zones_cb cb, void *data)
> +{
> + struct gendisk *disk = bdev->bd_disk;
> + sector_t capacity = get_capacity(disk);
> + sector_t zone_sectors = bdev_zone_sectors(bdev);
> + unsigned int idx = 0;
> + struct blk_zone zone;
> + int ret;
> +
> + if (!cb || !bdev_is_zoned(bdev) ||
> + WARN_ON_ONCE(!disk->fops->report_zones))
> + return -EOPNOTSUPP;
> +
> + if (!nr_zones || sector >= capacity)
> + return 0;
> +
> + /*
> + * If we do not have any zone write plug resources, fallback to using
> + * the regular zone report.
> + */
> + if (!disk_need_zone_resources(disk)) {
> + struct blk_report_zones_args args = {
> + .cb = cb,
> + .data = data,
> + .report_active = true,
> + };
> +
> + return blkdev_do_report_zones(bdev, sector, nr_zones, &args);
> + }
> +
> + for (sector = ALIGN(sector, zone_sectors);
> + sector < capacity && idx < nr_zones;
> + sector += zone_sectors, idx++) {
Please change "sector = ALIGN(sector, zone_sectors)" into an something
based on bdev_offset_from_zone_start(), e.g. the following code:
sector += zone_sectors - 1;
sector -= bdev_offset_from_zone_start(bdev, sector);
Thanks,
Bart.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 11/13] block: add zone write plug condition to debugfs zone_wplugs
2025-10-31 6:13 ` [PATCH 11/13] block: add zone write plug condition to debugfs zone_wplugs Damien Le Moal
2025-10-31 8:54 ` Christoph Hellwig
@ 2025-10-31 21:55 ` Bart Van Assche
1 sibling, 0 replies; 63+ messages in thread
From: Bart Van Assche @ 2025-10-31 21:55 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
On 10/30/25 11:13 PM, Damien Le Moal wrote:
> - seq_printf(m, "%u 0x%x %u %u %u\n", zwp_zone_no, zwp_flags, zwp_ref,
> - zwp_wp_offset, zwp_bio_list_size);
> + seq_printf(m, "%u 0x%x %u 0x%x %u %u\n",
> + zwp_zone_no, zwp_flags, zwp_ref,
> + zwp_cond, zwp_wp_offset, zwp_bio_list_size);
The debugfs output is incomprehensible for anyone who has not memorized
the source code of queue_zone_wplug_show(). Please expand the debugfs
output such that it becomes easier to comprehend. See e.g. the
queue_zone_wplug_show() changes in
https://lore.kernel.org/linux-block/20251014215428.3686084-15-bvanassche@acm.org/.
Thanks,
Bart.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 10/13] block: introduce BLKREPORTZONESV2 ioctl
2025-10-31 16:52 ` Bart Van Assche
@ 2025-11-03 5:51 ` Damien Le Moal
2025-11-03 10:23 ` Christoph Hellwig
0 siblings, 1 reply; 63+ messages in thread
From: Damien Le Moal @ 2025-11-03 5:51 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
On 11/1/25 01:52, Bart Van Assche wrote:
>> + case BLKREPORTZONEV2:
>> + if (rep.flags & ~BLK_ZONE_REPV2_INPUT_FLAGS)
>> + return -EINVAL;
>
> -EINVAL probably should be changed into something that indicates "not
> supported" rather than "invalid argument"?
Not supported could be confused with the cached report zones not being
supported. It is, but the user cannot specify input flags that are not defined.
This is to ensure that undefined flags are always 0 and that we can use these
going forward when we need to define a new flag.
So EINVAL seems appropriate to me.
>
>> index dab5d9700898..1441d79a6173 100644
>> --- a/include/uapi/linux/blkzoned.h
>> +++ b/include/uapi/linux/blkzoned.h
>> @@ -82,10 +82,20 @@ enum blk_zone_cond {
>> /**
>> * enum blk_zone_report_flags - Feature flags of reported zone descriptors.
>> *
>> - * @BLK_ZONE_REP_CAPACITY: Zone descriptor has capacity field.
>> + * @BLK_ZONE_REP_CAPACITY: Output only. Indicates that zone descriptors in a
>> + * zone report have a valid capacity field.
>> + * @BLK_ZONE_REP_CACHED: Input only. Indicates that the zone report should be
>> + * generated using cached zone information. In this case,
>> + * the implicit open, explicit open and closed zone
>> + * conditions are all reported with the
>> + * BLK_ZONE_COND_ACTIVE condition.
>> */
>> enum blk_zone_report_flags {
>> + /* Output flags */
>> BLK_ZONE_REP_CAPACITY = (1 << 0),
>> +
>> + /* Input flags */
>> + BLK_ZONE_REP_CACHED = (1 << 31),
>> };
>
> Why 1 << 31 instead of 1 << 1?
Why not ? That separates input and output flags instead of mixing them in
tighter definition.
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 01/13] block: freeze queue when updating zone resources
2025-10-31 17:48 ` Bart Van Assche
@ 2025-11-03 5:55 ` Damien Le Moal
2025-11-03 7:18 ` Daniel Vacek
0 siblings, 1 reply; 63+ messages in thread
From: Damien Le Moal @ 2025-11-03 5:55 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
On 11/1/25 02:48, Bart Van Assche wrote:
> Hi Damien,
>
> disk_update_zone_resources() only has a single caller and just below the
> only call of this function the following code is present:
>
> if (ret) {
> unsigned int memflags = blk_mq_freeze_queue(q);
>
> disk_free_zone_resources(disk);
> blk_mq_unfreeze_queue(q, memflags);
> }
>
> Shouldn't this code be moved into disk_update_zone_resources() such that
> error handling happens without unfreezing and refreezing the request
> queue?
Check the code again. disk_free_zone_resources() if the report zones callbacks
return an error, and in that case disk_update_zone_resources() is not called.
So having this call as it is cover all cases.
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 04/13] block: introduce disk_report_zone()
2025-10-31 20:54 ` Bart Van Assche
@ 2025-11-03 5:56 ` Damien Le Moal
0 siblings, 0 replies; 63+ messages in thread
From: Damien Le Moal @ 2025-11-03 5:56 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
On 11/1/25 05:54, Bart Van Assche wrote:
> On 10/30/25 11:12 PM, Damien Le Moal wrote:
>> -struct disk_report_zones_cb_args {
>> - struct gendisk *disk;
>> - report_zones_cb user_cb;
>> - void *user_data;
>> +/*
>> + * Zone report arguments for block device drivers report_zones operation.
>> + * @cb: report_zones_cb callback for each reported zone.
>> + * @data: Private data passed to report_zones_cb.
>> + */
>> +struct blk_report_zones_args {
>> + report_zones_cb cb;
>> + void *data;
>> };
>
> The suffix "_args" seems confusing to me because this data structure
> includes a callback pointer. Please consider changing "_args" into "_cb"
> to make it clear that the data structure includes a callback pointer.
> Another data structure that follows this convention is struct
> blk_plug_cb:
>
> struct blk_plug_cb;
> typedef void (*blk_plug_cb_fn)(struct blk_plug_cb *, bool);
> struct blk_plug_cb {
> struct list_head list;
> blk_plug_cb_fn callback;
> void *data;
> };
>
> Since struct blk_report_zones_args is passed as an argument to
> disk_report_zone(), how about renaming this data structure into
> struct disk_report_zone_cb?'
See patch 8. That structure grows beyond just the callback. So I prefer to keep
the name as it is more generic.
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 06/13] block: use zone condition to determine conventional zones
2025-10-31 21:04 ` Bart Van Assche
@ 2025-11-03 6:00 ` Damien Le Moal
0 siblings, 0 replies; 63+ messages in thread
From: Damien Le Moal @ 2025-11-03 6:00 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
On 11/1/25 06:04, Bart Van Assche wrote:
>> static int disk_revalidate_zone_resources(struct gendisk *disk,
>> - unsigned int nr_zones)
>> + struct blk_revalidate_zone_args *args)
>> {
>> struct queue_limits *lim = &disk->queue->limits;
>> unsigned int pool_size;
>>
>> + args->disk = disk;
>> + args->nr_zones =
>> + DIV_ROUND_UP_ULL(get_capacity(disk), lim->chunk_sectors);
>> +
>> + /* Cached zone conditions: 1 byte per zone */
>> + args->zones_cond = kzalloc(args->nr_zones, GFP_NOIO);
>> + if (!args->zones_cond)
>> + return -ENOMEM;
>
> Why args->nr_zones as array size instead of args->nr_conv_zones? The
> patch description says that this array is only used for conventional
> zones.
The bitmap before was of nr_zones bits, because conventional zones can be
anywhere in the LBA space. The same is still true using zone conditions. We need
one condition per zone for all zones.
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 07/13] block: track zone conditions
2025-10-31 21:17 ` Bart Van Assche
@ 2025-11-03 6:05 ` Damien Le Moal
2025-11-03 15:48 ` Bart Van Assche
0 siblings, 1 reply; 63+ messages in thread
From: Damien Le Moal @ 2025-11-03 6:05 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
On 11/1/25 06:17, Bart Van Assche wrote:
> On 10/30/25 11:13 PM, Damien Le Moal wrote:
>> Implement tracking of the runtime changes to zone conditions using
>> the new cond field in struct blk_zone_wplug. The size of this structure
>> remains 112 Bytes as the new field replaces the 4 Bytes padding at the
>> end of the structure. For zones that do not have a zone write plug, the
>> zones_cond array of a disk is used to track changes to zone conditions,
>> e.g. when a zone reset, reset all or finish operation is executed.
>
> Why is it necessary to track the condition of sequential zones that do
> not have a zone write plug? Please explain what the use cases are.
Because zones that do not have a zone write plug can be empty OR full.
>
> The zoned UFS device on my desk has 3420 sequential zones and zero
> conventional zones. If the condition of zones that do not have a zone
> write plug wouldn't be tracked that would save some memory.
That would really be "some"... Not a lot. Your memory usage will be less than a
mem page...
>> +static void blk_zone_set_cond(u8 *zones_cond, unsigned int zno,
>> + enum blk_zone_cond cond)
>> +{
>> + if (!zones_cond)
>> + return;
>> +
>> + switch (cond) {
>> + case BLK_ZONE_COND_IMP_OPEN:
>> + case BLK_ZONE_COND_EXP_OPEN:
>> + case BLK_ZONE_COND_CLOSED:
>> + zones_cond[zno] = BLK_ZONE_COND_ACTIVE;
>> + return;
>> + case BLK_ZONE_COND_NOT_WP:
>> + case BLK_ZONE_COND_EMPTY:
>> + case BLK_ZONE_COND_FULL:
>> + case BLK_ZONE_COND_OFFLINE:
>> + case BLK_ZONE_COND_READONLY:
>> + default:
>> + zones_cond[zno] = cond;
>> + return;
>> + }
>> +}
>> +
>> +static void disk_zone_set_cond(struct gendisk *disk, sector_t sector,
>> + enum blk_zone_cond cond)
>> +{
>> + u8 *zones_cond;
>> +
>> + rcu_read_lock();
>> + zones_cond = rcu_dereference(disk->zones_cond);
>> + if (zones_cond) {
>> + unsigned int zno = disk_zone_no(disk, sector);
>> +
>> + /*
>> + * The condition of a conventional, readonly and offline zones
>> + * never changes, so do nothing if the target zone is in one of
>> + * these conditions.
>> + */
>> + switch (zones_cond[zno]) {
>> + case BLK_ZONE_COND_NOT_WP:
>> + case BLK_ZONE_COND_READONLY:
>> + case BLK_ZONE_COND_OFFLINE:
>> + break;
>> + default:
>> + blk_zone_set_cond(zones_cond, zno, cond);
>> + break;
>> + }
>> + }
>> + rcu_read_unlock();
>> +}
>
> Why does blk_zone_set_cond() accept a zone number as second argument and
> why does disk_zone_set_cond() accept a sector number as second argument?
> The callers of disk_zone_set_cond() can be optimized if its second
> argument would be changed from a sector number into a zone number.
How so ? all the callers have a BIO sector or a zone start sector on hand, not a
zone number. On the other hand, blk_zone_set_cond() is always used in places
where the zone number is already available.
So this calling convention makes sense to me as it is.
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 08/13] block: introduce blkdev_get_zone_info()
2025-10-31 21:40 ` Bart Van Assche
@ 2025-11-03 6:08 ` Damien Le Moal
2025-11-03 10:29 ` Christoph Hellwig
0 siblings, 1 reply; 63+ messages in thread
From: Damien Le Moal @ 2025-11-03 6:08 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
On 11/1/25 06:40, Bart Van Assche wrote:
>> +/**
>> + * blkdev_get_zone_info - Get a zone information from cached data
>
> "Get a zone information" -> "Get zone information"
>
>> + sector = sector & (~(zone_sectors - 1));
>
> Please consider changing the above assignment into:
>
> sector -= bdev_offset_from_zone_start(bdev, sector);
That is a lot more arithmetic for the same thing.
If anything, I think this should be:
sector = ALIGN(sector, zone_sectors);
>> + /*
>> + * If the zone does not have a zone write plug, it is either full or
>> + * empty, as we otherwise would have a zone write plug for it. In this
>> + * case, set the write pointer accordingly and report the zone.
>> + * Otherwise, if we have a zone write plug, use it.
>> + */
>> + zwplug = disk_get_zone_wplug(disk, sector);
>> + if (!zwplug) {
>> + if (zone->cond == BLK_ZONE_COND_FULL)
>> + zone->wp = sector + zone_sectors;
>> + else
>> + zone->wp = sector;
>> + return 0;
>> + }
>
> According to ZBC-2 the write pointer is invalid for full zones.
Sure, meaning that the user should not look at the value. That does not make the
above wrong in any way. But sure, I can chnage that the ULLONG_MAX if you really
prefer.
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 09/13] block: introduce blkdev_report_zones_cached()
2025-10-31 21:53 ` Bart Van Assche
@ 2025-11-03 6:12 ` Damien Le Moal
2025-11-03 7:18 ` Damien Le Moal
1 sibling, 0 replies; 63+ messages in thread
From: Damien Le Moal @ 2025-11-03 6:12 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
On 11/1/25 06:53, Bart Van Assche wrote:
>> + for (sector = ALIGN(sector, zone_sectors);
>> + sector < capacity && idx < nr_zones;
>> + sector += zone_sectors, idx++) {
>
> Please change "sector = ALIGN(sector, zone_sectors)" into an something
> based on bdev_offset_from_zone_start(), e.g. the following code:
>
> sector += zone_sectors - 1;
> sector -= bdev_offset_from_zone_start(bdev, sector);
Why ? That is a lot harder to understand that what we are doing is align to the
zone start. And your proposed code is not correct anyway as we would miss one
zone (the zone containing sector) if sector is not already aligned to the zone
start.
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 03/13] block: handle zone management operations completions
2025-10-31 18:01 ` Bart Van Assche
@ 2025-11-03 6:25 ` Damien Le Moal
0 siblings, 0 replies; 63+ messages in thread
From: Damien Le Moal @ 2025-11-03 6:25 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
On 11/1/25 03:01, Bart Van Assche wrote:
> On 10/30/25 11:12 PM, Damien Le Moal wrote:
>> +void blk_zone_mgmt_bio_endio(struct bio *bio)
>> +{
>> + /* If the BIO failed, we have nothing to do. */
>> + if (bio->bi_status != BLK_STS_OK)
>> + return;
>> +
>> + switch (bio_op(bio)) {
>> + case REQ_OP_ZONE_RESET:
>> + blk_zone_reset_bio_endio(bio);
>> + return;
>> + case REQ_OP_ZONE_RESET_ALL:
>> + blk_zone_reset_all_bio_endio(bio);
>> + return;
>> + case REQ_OP_ZONE_FINISH:
>> + blk_zone_finish_bio_endio(bio);
>> + return;
>> + default:
>> + return;
>> + }
>> }
>
> "default: return;" is superfluous and can be left out.
No, it is needed. Otherwise, with GCC 15.2.1, I get:
block/blk-zoned.c: In function ‘blk_zone_mgmt_bio_endio’:
block/blk-zoned.c:778:9: error: enumeration value ‘REQ_OP_READ’ not handled in
switch [-Werror=switch]
778 | switch (bio_op(bio)) {
| ^~~~~~
block/blk-zoned.c:778:9: error: enumeration value ‘REQ_OP_WRITE’ not handled in
switch [-Werror=switch]
block/blk-zoned.c:778:9: error: enumeration value ‘REQ_OP_FLUSH’ not handled in
switch [-Werror=switch]
block/blk-zoned.c:778:9: error: enumeration value ‘REQ_OP_DISCARD’ not handled
in switch [-Werror=switch]
block/blk-zoned.c:778:9: error: enumeration value ‘REQ_OP_SECURE_ERASE’ not
handled in switch [-Werror=switch]
block/blk-zoned.c:778:9: error: enumeration value ‘REQ_OP_ZONE_APPEND’ not
handled in switch [-Werror=switch]
block/blk-zoned.c:778:9: error: enumeration value ‘REQ_OP_WRITE_ZEROES’ not
handled in switch [-Werror=switch]
block/blk-zoned.c:778:9: error: enumeration value ‘REQ_OP_ZONE_OPEN’ not handled
in switch [-Werror=switch]
block/blk-zoned.c:778:9: error: enumeration value ‘REQ_OP_ZONE_CLOSE’ not
handled in switch [-Werror=switch]
block/blk-zoned.c:778:9: error: enumeration value ‘REQ_OP_DRV_IN’ not handled in
switch [-Werror=switch]
block/blk-zoned.c:778:9: error: enumeration value ‘REQ_OP_DRV_OUT’ not handled
in switch [-Werror=switch]
block/blk-zoned.c:778:9: error: enumeration value ‘REQ_OP_LAST’ not handled in
switch [-Werror=switch]
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 01/13] block: freeze queue when updating zone resources
2025-11-03 5:55 ` Damien Le Moal
@ 2025-11-03 7:18 ` Daniel Vacek
2025-11-03 7:23 ` Damien Le Moal
2025-11-03 7:30 ` Damien Le Moal
0 siblings, 2 replies; 63+ messages in thread
From: Daniel Vacek @ 2025-11-03 7:18 UTC (permalink / raw)
To: Damien Le Moal
Cc: Bart Van Assche, Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
On Mon, 3 Nov 2025 at 06:55, Damien Le Moal <dlemoal@kernel.org> wrote:
>
> On 11/1/25 02:48, Bart Van Assche wrote:
> > Hi Damien,
> >
> > disk_update_zone_resources() only has a single caller and just below the
> > only call of this function the following code is present:
> >
> > if (ret) {
> > unsigned int memflags = blk_mq_freeze_queue(q);
> >
> > disk_free_zone_resources(disk);
> > blk_mq_unfreeze_queue(q, memflags);
> > }
> >
> > Shouldn't this code be moved into disk_update_zone_resources() such that
> > error handling happens without unfreezing and refreezing the request
> > queue?
>
> Check the code again. disk_free_zone_resources() if the report zones callbacks
> return an error, and in that case disk_update_zone_resources() is not called.
> So having this call as it is cover all cases.
I understand Bart's idea was more like below:
> @@ -1568,7 +1572,12 @@ static int disk_update_zone_resources(str
uct gendisk *disk,
> }
>
> commit:
> - return queue_limits_commit_update_frozen(q, &lim);
> + ret = queue_limits_commit_update(q, &lim);
> +
> +unfreeze:
+ if (ret)
+ disk_free_zone_resources(disk);
> + blk_mq_unfreeze_queue(q, memflags);
> +
> + return ret;
> }
>
> static int blk_revalidate_conv_zone(struct blk_zone *zone, unsigned int idx,
And then in blk_revalidate_disk_zones() do this:
if (ret > 0) {
ret = disk_update_zone_resources(disk, &args);
} else if (ret) {
unsigned int memflags;
pr_warn("%s: failed to revalidate zones\n", disk->disk_name);
memflags = blk_mq_freeze_queue(q);
disk_free_zone_resources(disk);
blk_mq_unfreeze_queue(q, memflags);
}
The question remains if this looks better?
> --
> Damien Le Moal
> Western Digital Research
>
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 09/13] block: introduce blkdev_report_zones_cached()
2025-10-31 21:53 ` Bart Van Assche
2025-11-03 6:12 ` Damien Le Moal
@ 2025-11-03 7:18 ` Damien Le Moal
1 sibling, 0 replies; 63+ messages in thread
From: Damien Le Moal @ 2025-11-03 7:18 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
On 11/1/25 06:53, Bart Van Assche wrote:
> On 10/30/25 11:13 PM, Damien Le Moal wrote:
>> Introduce the function blkdev_report_zones_cached() to provide a fast
>> report zone built using the blkdev_get_zone_info() function, which gets
>> zone information from a disk zones_cond array or zone write plugs.
>> For a large capacity SMR drive, such fast report zone can be completed
>> in a few millioseconds compared to several seconds completion times
>> when the report zone is obtained from the device.
>
> millioseconds -> milliseconds
>
> Does retrieving the cached zone information really require multiple
> milliseconds instead of only a few microseconds?
There are over 100,000 zones on large capacity SMR HDDs. And I have models with
smaller zone size that have over 200,000 zones. So yes, a few milliseconds are
needed on normal (read not super fast) CPUs.
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 01/13] block: freeze queue when updating zone resources
2025-11-03 7:18 ` Daniel Vacek
@ 2025-11-03 7:23 ` Damien Le Moal
2025-11-03 7:30 ` Damien Le Moal
1 sibling, 0 replies; 63+ messages in thread
From: Damien Le Moal @ 2025-11-03 7:23 UTC (permalink / raw)
To: Daniel Vacek
Cc: Bart Van Assche, Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
On 11/3/25 16:18, Daniel Vacek wrote:
> On Mon, 3 Nov 2025 at 06:55, Damien Le Moal <dlemoal@kernel.org> wrote:
>>
>> On 11/1/25 02:48, Bart Van Assche wrote:
>>> Hi Damien,
>>>
>>> disk_update_zone_resources() only has a single caller and just below the
>>> only call of this function the following code is present:
>>>
>>> if (ret) {
>>> unsigned int memflags = blk_mq_freeze_queue(q);
>>>
>>> disk_free_zone_resources(disk);
>>> blk_mq_unfreeze_queue(q, memflags);
>>> }
>>>
>>> Shouldn't this code be moved into disk_update_zone_resources() such that
>>> error handling happens without unfreezing and refreezing the request
>>> queue?
>>
>> Check the code again. disk_free_zone_resources() if the report zones callbacks
>> return an error, and in that case disk_update_zone_resources() is not called.
>> So having this call as it is cover all cases.
>
> I understand Bart's idea was more like below:
>
>> @@ -1568,7 +1572,12 @@ static int disk_update_zone_resources(str
> uct gendisk *disk,
>> }
>>
>> commit:
>> - return queue_limits_commit_update_frozen(q, &lim);
>> + ret = queue_limits_commit_update(q, &lim);
>> +
>> +unfreeze:
>
> + if (ret)
> + disk_free_zone_resources(disk);
>
>> + blk_mq_unfreeze_queue(q, memflags);
>> +
>> + return ret;
>> }
>>
>> static int blk_revalidate_conv_zone(struct blk_zone *zone, unsigned int idx,
>
> And then in blk_revalidate_disk_zones() do this:
>
> if (ret > 0) {
> ret = disk_update_zone_resources(disk, &args);
> } else if (ret) {
> unsigned int memflags;
>
> pr_warn("%s: failed to revalidate zones\n", disk->disk_name);
>
> memflags = blk_mq_freeze_queue(q);
> disk_free_zone_resources(disk);
> blk_mq_unfreeze_queue(q, memflags);
> }
>
> The question remains if this looks better?
Hmmm... Matter of taste maybe ?
Personally, I do like better having a single error path that cover all errors in
blk_revalidate_disk_zones().
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 01/13] block: freeze queue when updating zone resources
2025-11-03 7:18 ` Daniel Vacek
2025-11-03 7:23 ` Damien Le Moal
@ 2025-11-03 7:30 ` Damien Le Moal
1 sibling, 0 replies; 63+ messages in thread
From: Damien Le Moal @ 2025-11-03 7:30 UTC (permalink / raw)
To: Daniel Vacek
Cc: Bart Van Assche, Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
On 11/3/25 16:18, Daniel Vacek wrote:
> On Mon, 3 Nov 2025 at 06:55, Damien Le Moal <dlemoal@kernel.org> wrote:
>>
>> On 11/1/25 02:48, Bart Van Assche wrote:
>>> Hi Damien,
>>>
>>> disk_update_zone_resources() only has a single caller and just below the
>>> only call of this function the following code is present:
>>>
>>> if (ret) {
>>> unsigned int memflags = blk_mq_freeze_queue(q);
>>>
>>> disk_free_zone_resources(disk);
>>> blk_mq_unfreeze_queue(q, memflags);
>>> }
>>>
>>> Shouldn't this code be moved into disk_update_zone_resources() such that
>>> error handling happens without unfreezing and refreezing the request
>>> queue?
>>
>> Check the code again. disk_free_zone_resources() if the report zones callbacks
>> return an error, and in that case disk_update_zone_resources() is not called.
>> So having this call as it is cover all cases.
>
> I understand Bart's idea was more like below:
>
>> @@ -1568,7 +1572,12 @@ static int disk_update_zone_resources(str
> uct gendisk *disk,
>> }
>>
>> commit:
>> - return queue_limits_commit_update_frozen(q, &lim);
>> + ret = queue_limits_commit_update(q, &lim);
>> +
>> +unfreeze:
>
> + if (ret)
> + disk_free_zone_resources(disk);
>
>> + blk_mq_unfreeze_queue(q, memflags);
>> +
>> + return ret;
>> }
>>
>> static int blk_revalidate_conv_zone(struct blk_zone *zone, unsigned int idx,
>
> And then in blk_revalidate_disk_zones() do this:
>
> if (ret > 0) {
> ret = disk_update_zone_resources(disk, &args);
> } else if (ret) {
> unsigned int memflags;
>
> pr_warn("%s: failed to revalidate zones\n", disk->disk_name);
>
> memflags = blk_mq_freeze_queue(q);
> disk_free_zone_resources(disk);
> blk_mq_unfreeze_queue(q, memflags);
> }
>
> The question remains if this looks better?
Rereading everything, I think that Bart has a good point: moving the call to
disk_free_zone_resources() in the error path of disk_update_zone_resources()
allows doing the error handling without unfreezing and re-freezing the queue.
That's better.
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 10/13] block: introduce BLKREPORTZONESV2 ioctl
2025-11-03 5:51 ` Damien Le Moal
@ 2025-11-03 10:23 ` Christoph Hellwig
0 siblings, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2025-11-03 10:23 UTC (permalink / raw)
To: Damien Le Moal
Cc: Bart Van Assche, Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
On Mon, Nov 03, 2025 at 02:51:57PM +0900, Damien Le Moal wrote:
> On 11/1/25 01:52, Bart Van Assche wrote:
> >> + case BLKREPORTZONEV2:
> >> + if (rep.flags & ~BLK_ZONE_REPV2_INPUT_FLAGS)
> >> + return -EINVAL;
> >
> > -EINVAL probably should be changed into something that indicates "not
> > supported" rather than "invalid argument"?
>
> Not supported could be confused with the cached report zones not being
> supported. It is, but the user cannot specify input flags that are not defined.
Yes.
> This is to ensure that undefined flags are always 0 and that we can use these
> going forward when we need to define a new flag.
> So EINVAL seems appropriate to me.
Using EINVAL here is consistent with other APIs, but a bit of an
anti-feature. Not sure what another good more specific error code
would be, but given that we don't have other major EINVAL conditions
we might as well stick to it.
> >> + /* Input flags */
> >> + BLK_ZONE_REP_CACHED = (1 << 31),
> >> };
> >
> > Why 1 << 31 instead of 1 << 1?
>
> Why not ? That separates input and output flags instead of mixing them in
> tighter definition.
Agreed. Thinking about it, once you go up to bit 31 all of them
should be marked unsigned, though, i.e.,
.... = (1U << bitnr),
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 08/13] block: introduce blkdev_get_zone_info()
2025-11-03 6:08 ` Damien Le Moal
@ 2025-11-03 10:29 ` Christoph Hellwig
0 siblings, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2025-11-03 10:29 UTC (permalink / raw)
To: Damien Le Moal
Cc: Bart Van Assche, Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
On Mon, Nov 03, 2025 at 03:08:30PM +0900, Damien Le Moal wrote:
> On 11/1/25 06:40, Bart Van Assche wrote:
> >> +/**
> >> + * blkdev_get_zone_info - Get a zone information from cached data
> >
> > "Get a zone information" -> "Get zone information"
> >
> >> + sector = sector & (~(zone_sectors - 1));
> >
> > Please consider changing the above assignment into:
> >
> > sector -= bdev_offset_from_zone_start(bdev, sector);
>
> That is a lot more arithmetic for the same thing.
> If anything, I think this should be:
>
> sector = ALIGN(sector, zone_sectors);
That would have to be ALIGN_DOWN I think. Which sounds useful to
clean things up to me.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 02/13] block: cleanup blkdev_report_zones()
2025-10-31 6:12 ` [PATCH 02/13] block: cleanup blkdev_report_zones() Damien Le Moal
2025-10-31 8:45 ` Christoph Hellwig
2025-10-31 17:55 ` Bart Van Assche
@ 2025-11-03 11:15 ` Hannes Reinecke
2 siblings, 0 replies; 63+ messages in thread
From: Hannes Reinecke @ 2025-11-03 11:15 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
On 10/31/25 07:12, Damien Le Moal wrote:
> The variable capacity is used only in one place and so can be removed
> and get_capacity(disk) used directly instead.
>
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
> block/blk-zoned.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 01/13] block: freeze queue when updating zone resources
2025-10-31 6:12 ` [PATCH 01/13] block: freeze queue when updating zone resources Damien Le Moal
2025-10-31 8:44 ` Christoph Hellwig
2025-10-31 17:48 ` Bart Van Assche
@ 2025-11-03 11:17 ` Hannes Reinecke
2 siblings, 0 replies; 63+ messages in thread
From: Hannes Reinecke @ 2025-11-03 11:17 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
On 10/31/25 07:12, Damien Le Moal wrote:
> Modify disk_update_zone_resources() to freeze the device queue before
> updating the number of zones, zone capacity and other zone related
> resources. The locking order resulting from the call to
> queue_limits_commit_update_frozen() is preserved, that is, the queue
> limits lock is first taken by calling queue_limits_start_update() before
> freezing the queue, and the queue is unfrozen after executing
> queue_limits_commit_update(), which replaces the call to
> queue_limits_commit_update_frozen().
>
> This change ensures that there are no in-flights I/Os when the zone
> resources are updated due to a zone revalidation.
>
> Fixes: 0b83c86b444a ("block: Prevent potential deadlock in blk_revalidate_disk_zones()")
> Cc: stable@vger.kernel.org
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
> block/blk-zoned.c | 19 ++++++++++++++-----
> 1 file changed, 14 insertions(+), 5 deletions(-)
>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 03/13] block: handle zone management operations completions
2025-10-31 6:12 ` [PATCH 03/13] block: handle zone management operations completions Damien Le Moal
2025-10-31 8:46 ` Christoph Hellwig
2025-10-31 18:01 ` Bart Van Assche
@ 2025-11-03 11:41 ` Hannes Reinecke
2025-11-03 12:59 ` Damien Le Moal
2 siblings, 1 reply; 63+ messages in thread
From: Hannes Reinecke @ 2025-11-03 11:41 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
On 10/31/25 07:12, Damien Le Moal wrote:
> The functions blk_zone_wplug_handle_reset_or_finish() and
> blk_zone_wplug_handle_reset_all() both modify the zone write pointer
> offset of zone write plugs that are the target of a reset, reset all or
> finish zone management operation. However, these functions do this
> modification before the BIO is executed. So if the zone operation fails,
> the modified zone write pointer offsets become invalid.
>
> Avoid this by modifying the zone write pointer offset of a zone write
> plug that is the target of a zone management operation when the
> operation completes. To do so, modify blk_zone_bio_endio() to call the
> new function blk_zone_mgmt_bio_endio() which in turn calls the functions
> blk_zone_reset_all_bio_endio(), blk_zone_reset_bio_endio() or
> blk_zone_finish_bio_endio() depending on the operation of the completed
> BIO, to modify a zone write plug write pointer offset accordingly.
> These functions are called only if the BIO execution was successful.
>
Hmm.
Question remains: what _is_ the status of a write pointer once a
zone management operation is in flight?
I would argue it's turning into a Schroedinger state, and so we
cannot make any assumptions here.
In particular we cannot issue any other write I/O to that zone once
the operation is in flight, and so it becomes meaningless if we set
the write pointer before or after the zone operation.
Once the operation fails we have to issue a 'report write pointer'
command anyway as I'd be surprised if we could assume that a failure
in a zone management command would leave the write pointer unmodified.
So I would assume that zone write plugging already blocks the zone
while an zone management command is in flight.
But if it does, why do we need this patch?
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 03/13] block: handle zone management operations completions
2025-11-03 11:41 ` Hannes Reinecke
@ 2025-11-03 12:59 ` Damien Le Moal
0 siblings, 0 replies; 63+ messages in thread
From: Damien Le Moal @ 2025-11-03 12:59 UTC (permalink / raw)
To: Hannes Reinecke, Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
On 11/3/25 20:41, Hannes Reinecke wrote:
> On 10/31/25 07:12, Damien Le Moal wrote:
>> The functions blk_zone_wplug_handle_reset_or_finish() and
>> blk_zone_wplug_handle_reset_all() both modify the zone write pointer
>> offset of zone write plugs that are the target of a reset, reset all or
>> finish zone management operation. However, these functions do this
>> modification before the BIO is executed. So if the zone operation fails,
>> the modified zone write pointer offsets become invalid.
>>
>> Avoid this by modifying the zone write pointer offset of a zone write
>> plug that is the target of a zone management operation when the
>> operation completes. To do so, modify blk_zone_bio_endio() to call the
>> new function blk_zone_mgmt_bio_endio() which in turn calls the functions
>> blk_zone_reset_all_bio_endio(), blk_zone_reset_bio_endio() or
>> blk_zone_finish_bio_endio() depending on the operation of the completed
>> BIO, to modify a zone write plug write pointer offset accordingly.
>> These functions are called only if the BIO execution was successful.
>>
> Hmm.
> Question remains: what _is_ the status of a write pointer once a
> zone management operation is in flight?
On the device, it will be unchanged until the command completes, or rather, one
can only see it that way since the drive will serialize such command with report
zones.
> I would argue it's turning into a Schroedinger state, and so we
> cannot make any assumptions here.
Let me try to skin that cat below :)
> In particular we cannot issue any other write I/O to that zone once
> the operation is in flight, and so it becomes meaningless if we set
> the write pointer before or after the zone operation.
> Once the operation fails we have to issue a 'report write pointer'
> command anyway as I'd be surprised if we could assume that a failure
> in a zone management command would leave the write pointer unmodified.
> So I would assume that zone write plugging already blocks the zone
> while an zone management command is in flight.
> But if it does, why do we need this patch?
There is no such "blocking" done, the user is free to issue a zone reset while
writes are n flight, and most likely get write errors as a result such bad practice.
For this patch, the assumption is that a failed zone reset or zone finish leaves
the zone write pointer untouched. All the drives I know do that. So it is better
to not modify the zone write plug write pointer offset until we complete the
command.
But granted, that is not always true since the failure may happen *after* the
drive completed the command (e.g. the HBA loses the connection with the drive
before signaling the completion or something like that). In such case, it would
not matter when the update is done. And for zone reset all commands, all bets
are off since the command may fail half-way through all the zones that need a reset.
But in the end, logically speaking, it makes more sense to update things when we
get a success result instead of assuming we will always succeed. This has also
the benefit of leaving the zone write plugs in place for eventual error recovery
if needed.
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 07/13] block: track zone conditions
2025-11-03 6:05 ` Damien Le Moal
@ 2025-11-03 15:48 ` Bart Van Assche
2025-11-03 16:34 ` Chaitanya Kulkarni
` (2 more replies)
0 siblings, 3 replies; 63+ messages in thread
From: Bart Van Assche @ 2025-11-03 15:48 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
On 11/2/25 10:05 PM, Damien Le Moal wrote:
> On 11/1/25 06:17, Bart Van Assche wrote:
>> On 10/30/25 11:13 PM, Damien Le Moal wrote:
>>> Implement tracking of the runtime changes to zone conditions using
>>> the new cond field in struct blk_zone_wplug. The size of this structure
>>> remains 112 Bytes as the new field replaces the 4 Bytes padding at the
>>> end of the structure. For zones that do not have a zone write plug, the
>>> zones_cond array of a disk is used to track changes to zone conditions,
>>> e.g. when a zone reset, reset all or finish operation is executed.
>>
>> Why is it necessary to track the condition of sequential zones that do
>> not have a zone write plug? Please explain what the use cases are.
>
> Because zones that do not have a zone write plug can be empty OR full.
Why does the block layer have to track this information? Filesystems can
easily derive this information from the filesystem metadata information,
isn't it?
Thanks,
Bart.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 07/13] block: track zone conditions
2025-11-03 15:48 ` Bart Van Assche
@ 2025-11-03 16:34 ` Chaitanya Kulkarni
2025-11-03 22:53 ` Damien Le Moal
2025-11-03 18:31 ` Bart Van Assche
2025-11-03 22:40 ` Damien Le Moal
2 siblings, 1 reply; 63+ messages in thread
From: Chaitanya Kulkarni @ 2025-11-03 16:34 UTC (permalink / raw)
To: Bart Van Assche
Cc: Damien Le Moal, Jens Axboe, linux-block@vger.kernel.org,
linux-nvme@lists.infradead.org, Christoph Hellwig,
dm-devel@lists.linux.dev, Mike Snitzer, Martin K . Petersen,
Mikulas Patocka, linux-xfs@vger.kernel.org, Carlos Maiolino,
linux-btrfs@vger.kernel.org, David Sterba,
linux-scsi@vger.kernel.org, Keith Busch
Adding Keith's current email address :
's/Keith Busch <keith.busch@wdc.com>/kbusch@kernel.org/g'
On 11/3/25 7:48 AM, Bart Van Assche wrote:
> On 11/2/25 10:05 PM, Damien Le Moal wrote:
>> On 11/1/25 06:17, Bart Van Assche wrote:
>>> On 10/30/25 11:13 PM, Damien Le Moal wrote:
>>>> Implement tracking of the runtime changes to zone conditions using
>>>> the new cond field in struct blk_zone_wplug. The size of this
>>>> structure
>>>> remains 112 Bytes as the new field replaces the 4 Bytes padding at the
>>>> end of the structure. For zones that do not have a zone write plug,
>>>> the
>>>> zones_cond array of a disk is used to track changes to zone
>>>> conditions,
>>>> e.g. when a zone reset, reset all or finish operation is executed.
>>>
>>> Why is it necessary to track the condition of sequential zones that do
>>> not have a zone write plug? Please explain what the use cases are.
>>
>> Because zones that do not have a zone write plug can be empty OR full.
>
> Why does the block layer have to track this information? Filesystems can
> easily derive this information from the filesystem metadata information,
> isn't it?
>
> Thanks,
>
> Bart.
>
In case current file systems store this, isn't that a code duplication for each
fs ? perhaps having a central interface at block layer should help remove the
code duplication ?
-ck
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 07/13] block: track zone conditions
2025-11-03 15:48 ` Bart Van Assche
2025-11-03 16:34 ` Chaitanya Kulkarni
@ 2025-11-03 18:31 ` Bart Van Assche
2025-11-03 22:34 ` Damien Le Moal
2025-11-03 22:40 ` Damien Le Moal
2 siblings, 1 reply; 63+ messages in thread
From: Bart Van Assche @ 2025-11-03 18:31 UTC (permalink / raw)
To: Damien Le Moal, Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
On 11/3/25 7:48 AM, Bart Van Assche wrote:
> On 11/2/25 10:05 PM, Damien Le Moal wrote:
>> On 11/1/25 06:17, Bart Van Assche wrote:
>>> On 10/30/25 11:13 PM, Damien Le Moal wrote:
>>>> Implement tracking of the runtime changes to zone conditions using
>>>> the new cond field in struct blk_zone_wplug. The size of this structure
>>>> remains 112 Bytes as the new field replaces the 4 Bytes padding at the
>>>> end of the structure. For zones that do not have a zone write plug, the
>>>> zones_cond array of a disk is used to track changes to zone conditions,
>>>> e.g. when a zone reset, reset all or finish operation is executed.
>>>
>>> Why is it necessary to track the condition of sequential zones that do
>>> not have a zone write plug? Please explain what the use cases are.
>>
>> Because zones that do not have a zone write plug can be empty OR full.
>
> Why does the block layer have to track this information? Filesystems can
> easily derive this information from the filesystem metadata information,
> isn't it?
(replying to my own email)
Is this a good way to check what zone type information filesystems need?
$ git grep -nH BLK_ZONE_TYPE_ fs
fs/btrfs/zoned.c:96: ASSERT(zones[i].type != BLK_ZONE_TYPE_CONVENTIONAL);
fs/btrfs/zoned.c:211: zones[i].type = BLK_ZONE_TYPE_CONVENTIONAL;
fs/btrfs/zoned.c:488: if (zones[i].type == BLK_ZONE_TYPE_SEQWRITE_REQ)
fs/btrfs/zoned.c:566: BLK_ZONE_TYPE_CONVENTIONAL)
fs/btrfs/zoned.c:815: if (zones[0].type == BLK_ZONE_TYPE_CONVENTIONAL) {
fs/btrfs/zoned.c:1360: if (unlikely(zone.type ==
BLK_ZONE_TYPE_CONVENTIONAL)) {
fs/f2fs/segment.c:5295: if (zone->type != BLK_ZONE_TYPE_SEQWRITE_REQ)
fs/f2fs/segment.c:5417: if (zone.type != BLK_ZONE_TYPE_SEQWRITE_REQ)
fs/f2fs/segment.c:5473: if (zone.type != BLK_ZONE_TYPE_SEQWRITE_REQ)
fs/f2fs/super.c:4332: if (zone->type == BLK_ZONE_TYPE_CONVENTIONAL)
fs/xfs/libxfs/xfs_zones.c:177: case BLK_ZONE_TYPE_CONVENTIONAL:
fs/xfs/libxfs/xfs_zones.c:179: case BLK_ZONE_TYPE_SEQWRITE_REQ:
fs/zonefs/super.c:385: zone.type = BLK_ZONE_TYPE_CONVENTIONAL;
fs/zonefs/super.c:874: case BLK_ZONE_TYPE_CONVENTIONAL:
fs/zonefs/super.c:886: case BLK_ZONE_TYPE_SEQWRITE_REQ:
fs/zonefs/super.c:887: case BLK_ZONE_TYPE_SEQWRITE_PREF:
fs/zonefs/zonefs.h:26: * defined in linux/blkzoned.h, that is,
BLK_ZONE_TYPE_SEQWRITE_REQ and
fs/zonefs/zonefs.h:27: * BLK_ZONE_TYPE_SEQWRITE_PREF.
fs/zonefs/zonefs.h:37: if (zone->type == BLK_ZONE_TYPE_CONVENTIONAL)
In the above I see that all filesystems check for the following zone
types and don't check whether a zone is empty or full:
* CONVENTIONAL
* SEQWRITE_REQ
* SEQWRITE_PREF
Do you agree with this conclusion?
Thanks,
Bart.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 07/13] block: track zone conditions
2025-11-03 18:31 ` Bart Van Assche
@ 2025-11-03 22:34 ` Damien Le Moal
0 siblings, 0 replies; 63+ messages in thread
From: Damien Le Moal @ 2025-11-03 22:34 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
On 11/4/25 03:31, Bart Van Assche wrote:
> On 11/3/25 7:48 AM, Bart Van Assche wrote:
>> On 11/2/25 10:05 PM, Damien Le Moal wrote:
>>> On 11/1/25 06:17, Bart Van Assche wrote:
>>>> On 10/30/25 11:13 PM, Damien Le Moal wrote:
>>>>> Implement tracking of the runtime changes to zone conditions using
>>>>> the new cond field in struct blk_zone_wplug. The size of this structure
>>>>> remains 112 Bytes as the new field replaces the 4 Bytes padding at the
>>>>> end of the structure. For zones that do not have a zone write plug, the
>>>>> zones_cond array of a disk is used to track changes to zone conditions,
>>>>> e.g. when a zone reset, reset all or finish operation is executed.
>>>>
>>>> Why is it necessary to track the condition of sequential zones that do
>>>> not have a zone write plug? Please explain what the use cases are.
>>>
>>> Because zones that do not have a zone write plug can be empty OR full.
>>
>> Why does the block layer have to track this information? Filesystems can
>> easily derive this information from the filesystem metadata information,
>> isn't it?
>
> (replying to my own email)
>
> Is this a good way to check what zone type information filesystems need?
>
> $ git grep -nH BLK_ZONE_TYPE_ fs
> fs/btrfs/zoned.c:96: ASSERT(zones[i].type != BLK_ZONE_TYPE_CONVENTIONAL);
> fs/btrfs/zoned.c:211: zones[i].type = BLK_ZONE_TYPE_CONVENTIONAL;
> fs/btrfs/zoned.c:488: if (zones[i].type == BLK_ZONE_TYPE_SEQWRITE_REQ)
> fs/btrfs/zoned.c:566: BLK_ZONE_TYPE_CONVENTIONAL)
> fs/btrfs/zoned.c:815: if (zones[0].type == BLK_ZONE_TYPE_CONVENTIONAL) {
> fs/btrfs/zoned.c:1360: if (unlikely(zone.type ==
> BLK_ZONE_TYPE_CONVENTIONAL)) {
> fs/f2fs/segment.c:5295: if (zone->type != BLK_ZONE_TYPE_SEQWRITE_REQ)
> fs/f2fs/segment.c:5417: if (zone.type != BLK_ZONE_TYPE_SEQWRITE_REQ)
> fs/f2fs/segment.c:5473: if (zone.type != BLK_ZONE_TYPE_SEQWRITE_REQ)
> fs/f2fs/super.c:4332: if (zone->type == BLK_ZONE_TYPE_CONVENTIONAL)
> fs/xfs/libxfs/xfs_zones.c:177: case BLK_ZONE_TYPE_CONVENTIONAL:
> fs/xfs/libxfs/xfs_zones.c:179: case BLK_ZONE_TYPE_SEQWRITE_REQ:
> fs/zonefs/super.c:385: zone.type = BLK_ZONE_TYPE_CONVENTIONAL;
> fs/zonefs/super.c:874: case BLK_ZONE_TYPE_CONVENTIONAL:
> fs/zonefs/super.c:886: case BLK_ZONE_TYPE_SEQWRITE_REQ:
> fs/zonefs/super.c:887: case BLK_ZONE_TYPE_SEQWRITE_PREF:
> fs/zonefs/zonefs.h:26: * defined in linux/blkzoned.h, that is,
> BLK_ZONE_TYPE_SEQWRITE_REQ and
> fs/zonefs/zonefs.h:27: * BLK_ZONE_TYPE_SEQWRITE_PREF.
> fs/zonefs/zonefs.h:37: if (zone->type == BLK_ZONE_TYPE_CONVENTIONAL)
>
> In the above I see that all filesystems check for the following zone
> types and don't check whether a zone is empty or full:
> * CONVENTIONAL
> * SEQWRITE_REQ
> * SEQWRITE_PREF
>
> Do you agree with this conclusion?
Absolutely not.
git grep -nH BLK_ZONE_COND_ fs
fs/btrfs/zoned.c:75: return (zone->cond == BLK_ZONE_COND_FULL) ||
fs/btrfs/zoned.c:97: empty[i] = (zones[i].cond == BLK_ZONE_COND_EMPTY);
fs/btrfs/zoned.c:212: zones[i].cond = BLK_ZONE_COND_NOT_WP;
fs/btrfs/zoned.c:491: case BLK_ZONE_COND_EMPTY:
fs/btrfs/zoned.c:494: case BLK_ZONE_COND_IMP_OPEN:
fs/btrfs/zoned.c:495: case BLK_ZONE_COND_EXP_OPEN:
fs/btrfs/zoned.c:496: case BLK_ZONE_COND_CLOSED:
fs/btrfs/zoned.c:497: case BLK_ZONE_COND_ACTIVE:
fs/btrfs/zoned.c:833: if (reset && reset->cond != BLK_ZONE_COND_EMPTY) {
fs/btrfs/zoned.c:845: reset->cond = BLK_ZONE_COND_EMPTY;
fs/btrfs/zoned.c:967: if (zone->cond == BLK_ZONE_COND_FULL) {
fs/btrfs/zoned.c:972: if (zone->cond == BLK_ZONE_COND_EMPTY)
fs/btrfs/zoned.c:973: zone->cond = BLK_ZONE_COND_IMP_OPEN;
fs/btrfs/zoned.c:1000: zone->cond = BLK_ZONE_COND_FULL;
fs/btrfs/zoned.c:1373: case BLK_ZONE_COND_OFFLINE:
fs/btrfs/zoned.c:1374: case BLK_ZONE_COND_READONLY:
fs/btrfs/zoned.c:1381: case BLK_ZONE_COND_EMPTY:
fs/btrfs/zoned.c:1384: case BLK_ZONE_COND_FULL:
fs/f2fs/segment.c:5319: if ((!valid_block_cnt && zone->cond ==
BLK_ZONE_COND_EMPTY) ||
fs/f2fs/segment.c:5320: (valid_block_cnt && zone->cond == BLK_ZONE_COND_FULL))
fs/xfs/libxfs/xfs_zones.c:93: case BLK_ZONE_COND_EMPTY:
fs/xfs/libxfs/xfs_zones.c:95: case BLK_ZONE_COND_IMP_OPEN:
fs/xfs/libxfs/xfs_zones.c:96: case BLK_ZONE_COND_EXP_OPEN:
fs/xfs/libxfs/xfs_zones.c:97: case BLK_ZONE_COND_CLOSED:
fs/xfs/libxfs/xfs_zones.c:99: case BLK_ZONE_COND_FULL:
fs/xfs/libxfs/xfs_zones.c:101: case BLK_ZONE_COND_NOT_WP:
fs/xfs/libxfs/xfs_zones.c:102: case BLK_ZONE_COND_OFFLINE:
fs/xfs/libxfs/xfs_zones.c:103: case BLK_ZONE_COND_READONLY:
fs/xfs/libxfs/xfs_zones.c:122: case BLK_ZONE_COND_NOT_WP:
fs/xfs/xfs_zone_alloc.c:985: if (!zone || zone->cond == BLK_ZONE_COND_NOT_WP) {
fs/zonefs/super.c:195: case BLK_ZONE_COND_OFFLINE:
fs/zonefs/super.c:200: case BLK_ZONE_COND_READONLY:
fs/zonefs/super.c:215: case BLK_ZONE_COND_FULL:
fs/zonefs/super.c:386: zone.cond = BLK_ZONE_COND_NOT_WP;
fs/zonefs/super.c:986: if (next->cond ==
BLK_ZONE_COND_READONLY &&
fs/zonefs/super.c:987: zone->cond !=
BLK_ZONE_COND_OFFLINE)
fs/zonefs/super.c:988: zone->cond =
BLK_ZONE_COND_READONLY;
fs/zonefs/super.c:989: else if (next->cond ==
BLK_ZONE_COND_OFFLINE)
fs/zonefs/super.c:990: zone->cond =
BLK_ZONE_COND_OFFLINE;
fs/zonefs/super.c:1034: (zone->cond == BLK_ZONE_COND_IMP_OPEN ||
fs/zonefs/super.c:1035: zone->cond == BLK_ZONE_COND_EXP_OPEN)) {
And if you are still not convinced, read the mount code for XFS and BTRFS.
You'll see the point of having a fast cached report zones to speed that up.
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 07/13] block: track zone conditions
2025-11-03 15:48 ` Bart Van Assche
2025-11-03 16:34 ` Chaitanya Kulkarni
2025-11-03 18:31 ` Bart Van Assche
@ 2025-11-03 22:40 ` Damien Le Moal
2 siblings, 0 replies; 63+ messages in thread
From: Damien Le Moal @ 2025-11-03 22:40 UTC (permalink / raw)
To: Bart Van Assche, Jens Axboe, linux-block, linux-nvme, Keith Busch,
Christoph Hellwig, dm-devel, Mike Snitzer, Mikulas Patocka,
Martin K . Petersen, linux-scsi, linux-xfs, Carlos Maiolino,
linux-btrfs, David Sterba
On 11/4/25 00:48, Bart Van Assche wrote:
> On 11/2/25 10:05 PM, Damien Le Moal wrote:
>> On 11/1/25 06:17, Bart Van Assche wrote:
>>> On 10/30/25 11:13 PM, Damien Le Moal wrote:
>>>> Implement tracking of the runtime changes to zone conditions using
>>>> the new cond field in struct blk_zone_wplug. The size of this structure
>>>> remains 112 Bytes as the new field replaces the 4 Bytes padding at the
>>>> end of the structure. For zones that do not have a zone write plug, the
>>>> zones_cond array of a disk is used to track changes to zone conditions,
>>>> e.g. when a zone reset, reset all or finish operation is executed.
>>>
>>> Why is it necessary to track the condition of sequential zones that do
>>> not have a zone write plug? Please explain what the use cases are.
>>
>> Because zones that do not have a zone write plug can be empty OR full.
>
> Why does the block layer have to track this information? Filesystems can
> easily derive this information from the filesystem metadata information,
> isn't it?
The title of this patch series is: Introduce cached report zones.
For that, the answer to your question is I think obvious. Otherwise, how do you
exactly propose to report correctly the condition of a zone without have that
condition tracked and cached in memory ?
The point here is to provide as much information (almost) as a device generated
report zone, but much fasterer, using cached information only, thus avoiding any
(much slower) command execution on the device. And as these patches show, that
is not that hard to do, with the trade-off of a slightly higher memory usage per
device.
See the numbers for XFS and BTRFS mount speedup in patch 14 and 15. The gains
are even higher when mounting a large BTRFS RAID volume with many SMR drives.
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 07/13] block: track zone conditions
2025-11-03 16:34 ` Chaitanya Kulkarni
@ 2025-11-03 22:53 ` Damien Le Moal
2025-11-04 12:03 ` Christoph Hellwig
0 siblings, 1 reply; 63+ messages in thread
From: Damien Le Moal @ 2025-11-03 22:53 UTC (permalink / raw)
To: Chaitanya Kulkarni, Bart Van Assche
Cc: Jens Axboe, linux-block@vger.kernel.org,
linux-nvme@lists.infradead.org, Christoph Hellwig,
dm-devel@lists.linux.dev, Mike Snitzer, Martin K . Petersen,
Mikulas Patocka, linux-xfs@vger.kernel.org, Carlos Maiolino,
linux-btrfs@vger.kernel.org, David Sterba,
linux-scsi@vger.kernel.org, Keith Busch
On 11/4/25 01:34, Chaitanya Kulkarni wrote:
> Adding Keith's current email address :
> 's/Keith Busch <keith.busch@wdc.com>/kbusch@kernel.org/g'
>
> On 11/3/25 7:48 AM, Bart Van Assche wrote:
>> On 11/2/25 10:05 PM, Damien Le Moal wrote:
>>> On 11/1/25 06:17, Bart Van Assche wrote:
>>>> On 10/30/25 11:13 PM, Damien Le Moal wrote:
>>>>> Implement tracking of the runtime changes to zone conditions using
>>>>> the new cond field in struct blk_zone_wplug. The size of this
>>>>> structure
>>>>> remains 112 Bytes as the new field replaces the 4 Bytes padding at the
>>>>> end of the structure. For zones that do not have a zone write plug,
>>>>> the
>>>>> zones_cond array of a disk is used to track changes to zone
>>>>> conditions,
>>>>> e.g. when a zone reset, reset all or finish operation is executed.
>>>>
>>>> Why is it necessary to track the condition of sequential zones that do
>>>> not have a zone write plug? Please explain what the use cases are.
>>>
>>> Because zones that do not have a zone write plug can be empty OR full.
>>
>> Why does the block layer have to track this information? Filesystems can
>> easily derive this information from the filesystem metadata information,
>> isn't it?
>>
>> Thanks,
>>
>> Bart.
>>
>
> In case current file systems store this, isn't that a code duplication for each
> fs ? perhaps having a central interface at block layer should help remove the
> code duplication ?
catch 22: You cannot ask the file system without first knowing the zone layout
and conditions of zone to check the file system metadata.
--
Damien Le Moal
Western Digital Research
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [PATCH 07/13] block: track zone conditions
2025-11-03 22:53 ` Damien Le Moal
@ 2025-11-04 12:03 ` Christoph Hellwig
0 siblings, 0 replies; 63+ messages in thread
From: Christoph Hellwig @ 2025-11-04 12:03 UTC (permalink / raw)
To: Damien Le Moal
Cc: Chaitanya Kulkarni, Bart Van Assche, Jens Axboe,
linux-block@vger.kernel.org, linux-nvme@lists.infradead.org,
Christoph Hellwig, dm-devel@lists.linux.dev, Mike Snitzer,
Martin K . Petersen, Mikulas Patocka, linux-xfs@vger.kernel.org,
Carlos Maiolino, linux-btrfs@vger.kernel.org, David Sterba,
linux-scsi@vger.kernel.org, Keith Busch
On Tue, Nov 04, 2025 at 07:53:07AM +0900, Damien Le Moal wrote:
> > In case current file systems store this, isn't that a code duplication for each
> > fs ? perhaps having a central interface at block layer should help remove the
> > code duplication ?
>
> catch 22: You cannot ask the file system without first knowing the zone layout
> and conditions of zone to check the file system metadata.
The file system also really needs to verify that it's view matches the
hardware view at mount time, otherwise you'll run into really nasty
cases later when they are out of sync due to a bug or corruption.
^ permalink raw reply [flat|nested] 63+ messages in thread
end of thread, other threads:[~2025-11-04 12:03 UTC | newest]
Thread overview: 63+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-10-31 6:12 [PATCH 00/13] Introduce cached report zones Damien Le Moal
2025-10-31 6:12 ` [PATCH 01/13] block: freeze queue when updating zone resources Damien Le Moal
2025-10-31 8:44 ` Christoph Hellwig
2025-10-31 17:48 ` Bart Van Assche
2025-11-03 5:55 ` Damien Le Moal
2025-11-03 7:18 ` Daniel Vacek
2025-11-03 7:23 ` Damien Le Moal
2025-11-03 7:30 ` Damien Le Moal
2025-11-03 11:17 ` Hannes Reinecke
2025-10-31 6:12 ` [PATCH 02/13] block: cleanup blkdev_report_zones() Damien Le Moal
2025-10-31 8:45 ` Christoph Hellwig
2025-10-31 17:55 ` Bart Van Assche
2025-11-03 11:15 ` Hannes Reinecke
2025-10-31 6:12 ` [PATCH 03/13] block: handle zone management operations completions Damien Le Moal
2025-10-31 8:46 ` Christoph Hellwig
2025-10-31 18:01 ` Bart Van Assche
2025-11-03 6:25 ` Damien Le Moal
2025-11-03 11:41 ` Hannes Reinecke
2025-11-03 12:59 ` Damien Le Moal
2025-10-31 6:12 ` [PATCH 04/13] block: introduce disk_report_zone() Damien Le Moal
2025-10-31 8:47 ` Christoph Hellwig
2025-10-31 20:54 ` Bart Van Assche
2025-11-03 5:56 ` Damien Le Moal
2025-10-31 6:12 ` [PATCH 05/13] block: reorganize struct blk_zone_wplug Damien Le Moal
2025-10-31 8:47 ` Christoph Hellwig
2025-10-31 20:55 ` Bart Van Assche
2025-10-31 6:13 ` [PATCH 06/13] block: use zone condition to determine conventional zones Damien Le Moal
2025-10-31 8:48 ` Christoph Hellwig
2025-10-31 21:04 ` Bart Van Assche
2025-11-03 6:00 ` Damien Le Moal
2025-10-31 6:13 ` [PATCH 07/13] block: track zone conditions Damien Le Moal
2025-10-31 8:51 ` Christoph Hellwig
2025-10-31 21:17 ` Bart Van Assche
2025-11-03 6:05 ` Damien Le Moal
2025-11-03 15:48 ` Bart Van Assche
2025-11-03 16:34 ` Chaitanya Kulkarni
2025-11-03 22:53 ` Damien Le Moal
2025-11-04 12:03 ` Christoph Hellwig
2025-11-03 18:31 ` Bart Van Assche
2025-11-03 22:34 ` Damien Le Moal
2025-11-03 22:40 ` Damien Le Moal
2025-10-31 6:13 ` [PATCH 08/13] block: introduce blkdev_get_zone_info() Damien Le Moal
2025-10-31 8:52 ` Christoph Hellwig
2025-10-31 21:40 ` Bart Van Assche
2025-11-03 6:08 ` Damien Le Moal
2025-11-03 10:29 ` Christoph Hellwig
2025-10-31 6:13 ` [PATCH 09/13] block: introduce blkdev_report_zones_cached() Damien Le Moal
2025-10-31 8:53 ` Christoph Hellwig
2025-10-31 21:53 ` Bart Van Assche
2025-11-03 6:12 ` Damien Le Moal
2025-11-03 7:18 ` Damien Le Moal
2025-10-31 6:13 ` [PATCH 10/13] block: introduce BLKREPORTZONESV2 ioctl Damien Le Moal
2025-10-31 8:54 ` Christoph Hellwig
2025-10-31 16:52 ` Bart Van Assche
2025-11-03 5:51 ` Damien Le Moal
2025-11-03 10:23 ` Christoph Hellwig
2025-10-31 6:13 ` [PATCH 11/13] block: add zone write plug condition to debugfs zone_wplugs Damien Le Moal
2025-10-31 8:54 ` Christoph Hellwig
2025-10-31 21:55 ` Bart Van Assche
2025-10-31 6:13 ` [PATCH 12/13] btrfs: use blkdev_report_zones_cached() Damien Le Moal
2025-10-31 19:01 ` David Sterba
2025-10-31 6:13 ` [PATCH 13/13] xfs: " Damien Le Moal
2025-10-31 8:55 ` Christoph Hellwig
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).