[PATCH 00/12] zbd: fix problems of random write with unaligned block size

public inbox for fio@vger.kernel.org

* [PATCH 00/12] zbd: fix problems of random write with unaligned block size
@ 2026-01-09  2:35 Shin'ichiro Kawasaki
  2026-01-09  2:35 ` [PATCH 01/12] zbd: fix zone selection of random writes Shin'ichiro Kawasaki
                   ` (13 more replies)
  0 siblings, 14 replies; 19+ messages in thread
From: Shin'ichiro Kawasaki @ 2026-01-09  2:35 UTC
  To: fio, Jens Axboe, Vincent Fu; +Cc: Damien Le Moal, Shin'ichiro Kawasaki

When a random write workload runs with zonemode=zbd and a block size that
is not aligned to the zone size or the initial write pointer position,
two problems are observed. The first one is write target zone selection.
When one zone is filled by the write workload, the same zone is selected
as the next write target. This results in writes concentrating on
certain zones even though the workload specifies random writes.

The second problem is write performance. Writes with an unaligned block
size leave small remainder areas at the end of write target zones. To
free up the zone resource, current fio performs zone finish operations
on the zones with small remainders. Fio also calls io_u_quiesce() to
prepare for the zone finish operation and the write target zone
switch. These zone finish operations and io_u_quiesce() calls
significantly degrade random write performance.
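
To make the size of these remainder areas concrete, here is a minimal
sketch of the arithmetic. It is not fio code: the function name
toy_zone_tail() and its parameters are hypothetical, and it assumes a
single fixed block size with no other writers to the zone.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes left unwritten at the end of a zone after filling it with as many
 * writes of 'bs' bytes as fit, starting from write pointer offset 'wp'
 * (relative to the zone start). Illustrative helper, not part of fio.
 */
static uint64_t toy_zone_tail(uint64_t capacity, uint64_t wp, uint64_t bs)
{
	assert(bs > 0 && wp <= capacity);
	return (capacity - wp) % bs;
}
```

For example, a 1 MiB zone written from its start with 12 KiB blocks ends
with a 4 KiB tail, and the same tail appears with 8 KiB blocks when the
write pointer starts 4 KiB into the zone.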

This series addresses these problems. The first patch addresses the write
target zone selection problem. The following four patches address the
performance problem. The next three patches clean up the code by removing
the zone finish operation. The last four patches adjust the documentation
and the test set for the changes in the series.

Shin'ichiro Kawasaki (12):
  zbd: fix zone selection of random writes
  zbd: set norandommap=1 when zonemode=zbd is specified
  zbd: write zone remainders smaller than minimum block size
  zbd: fix write zone accounting
  zbd: remove io_u_quiesce() at write target zone switch
  zbd: remove zbd_finish_zone()
  oslib: remove blkzoned_finish_zone()
  ioengine: remove finish_zone()
  doc: explain norandommap restriction and small remainder of
    zonemode=zbd
  t/zbd: avoid test case 14 failure due to no random map
  t/zbd: avoid test case 33 failure due to zone end remainder
  t/zbd: avoid test case 71 failure due to zone end remainder

 HOWTO.rst              |   9 ++-
 engines/libzbc.c       |  34 ---------
 fio.1                  |   7 +-
 init.c                 |  13 ++++
 ioengines.h            |   4 +-
 oslib/blkzoned.h       |   8 --
 oslib/linux-blkzoned.c |  34 ---------
 t/zbd/test-zbd-support |  12 +--
 zbd.c                  | 163 ++++++++---------------------------------
 9 files changed, 65 insertions(+), 219 deletions(-)

-- 
2.49.0


* [PATCH 01/12] zbd: fix zone selection of random writes
  2026-01-09  2:35 [PATCH 00/12] zbd: fix problems of random write with unaligned block size Shin'ichiro Kawasaki
@ 2026-01-09  2:35 ` Shin'ichiro Kawasaki
  2026-01-09  2:35 ` [PATCH 02/12] zbd: set norandommap=1 when zonemode=zbd is specified Shin'ichiro Kawasaki
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 19+ messages in thread
From: Shin'ichiro Kawasaki @ 2026-01-09  2:35 UTC
  To: fio, Jens Axboe, Vincent Fu; +Cc: Damien Le Moal, Shin'ichiro Kawasaki

In zonemode=zbd, random write workloads targeting zoned block devices
with max write zones limits such as max_open_zones cannot write to
randomly chosen offsets, because zoned block devices require writes to
land at write pointers. To adjust the offsets to valid positions, fio
calls the function zbd_convert_to_write_zone(). This function checks the
current write target zones as the next offset candidates, but may fail
depending on the conditions of those zones. In such cases, the function
waits for zone condition changes before retrying.

However, the retry logic begins with the zone where the previous attempt
ended, and selects zones that were previously write targets.
Consequently, the same zones are repeatedly chosen for writing,
resulting in writes concentrating on certain zones despite the workload
specifying random write.

To ensure proper zone selection for random writes, modify
zbd_convert_to_write_zone() to retry the zone selection based on the
original offset provided to the function. The local variable 'zb' keeps
the reference to the zone corresponding to the original offset. Use 'zb'
as the starting point of each retry attempt.
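
As a rough illustration of why the starting point of the retry matters,
the toy model below scans for a usable zone from a given index. It is
not fio code: struct toy_zone, pick_from() and the 'writable' flag are
hypothetical simplifications, and locking is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified zone state; not fio's struct fio_zone_info. */
struct toy_zone {
	bool writable;	/* true if the zone can accept a write right now */
};

/*
 * Scan at most nr_zones zones starting at index 'start', wrapping around,
 * and return the index of the first writable zone, or -1 if none exists.
 */
static int pick_from(const struct toy_zone *zones, size_t nr_zones,
		     size_t start)
{
	size_t i;

	assert(zones && nr_zones > 0 && start < nr_zones);
	for (i = 0; i < nr_zones; i++) {
		size_t idx = (start + i) % nr_zones;

		if (zones[idx].writable)
			return (int)idx;
	}
	return -1;
}
```

In this model, passing the index derived from the random offset as
'start' on every retry makes the result follow the random offset. Passing
the index where the previous attempt ended instead keeps returning zones
near that point, which corresponds to the concentration described above.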

Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
---
 zbd.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/zbd.c b/zbd.c
index 7a66b665..b71f842c 100644
--- a/zbd.c
+++ b/zbd.c
@@ -1643,6 +1643,19 @@ choose_other_zone:
 	}

 retry:
+	/*
+	 * For random writes, retry from the zone chosen at the beginning using
+	 * the initial io_u random offset.
+	 */
+	if (td_random(td)) {
+		zone_unlock(z);
+		zone_lock(td, f, zb);
+		if (zbd_write_zone_get(td, f, zb))
+			return zb;
+		z = zb;
+		zone_idx = zbd_zone_idx(f, z);
+	}
+
 	/* Zone 'z' is full, so try to choose a new zone. */
 	for (i = f->io_size / zbdi->zone_size; i > 0; i--) {
 		zone_idx++;
-- 
2.49.0


* [PATCH 02/12] zbd: set norandommap=1 when zonemode=zbd is specified
  2026-01-09  2:35 [PATCH 00/12] zbd: fix problems of random write with unaligned block size Shin'ichiro Kawasaki
  2026-01-09  2:35 ` [PATCH 01/12] zbd: fix zone selection of random writes Shin'ichiro Kawasaki
@ 2026-01-09  2:35 ` Shin'ichiro Kawasaki
  2026-01-27  1:39   ` Vincent Fu
  2026-01-09  2:35 ` [PATCH 03/12] zbd: write zone remainders smaller than minimum block size Shin'ichiro Kawasaki
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 19+ messages in thread
From: Shin'ichiro Kawasaki @ 2026-01-09  2:35 UTC
  To: fio, Jens Axboe, Vincent Fu; +Cc: Damien Le Moal, Shin'ichiro Kawasaki

The random map functionality relies on the minimum block size to manage
the map of written blocks, in order to ensure that all blocks are
written with a random workload. This is normally fine, but in the case
of a zoned block device being written with a minimum block size that is
not aligned with the zone size or the write pointer position at workload
start, the last blocks of a zone can only be written using a write
operation with a size smaller than the minimum block size. This
conflicts with the random map operation.

In preparation for supporting writing a zone remainder smaller than the
minimum block size without using a zone finish operation, disable the
random map feature by setting norandommap=1 when zonemode=zbd is
specified.

Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
---
 init.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/init.c b/init.c
index 76e1a86d..9320bca8 100644
--- a/init.c
+++ b/init.c
@@ -665,6 +665,19 @@ static int fixup_options(struct thread_data *td)
 		ret |= 1;
 	}

+	if (o->zone_mode == ZONE_MODE_ZBD) {
+		if (fio_option_is_set(o, norandommap)) {
+			if (o->norandommap == 0) {
+				log_err("fio: zonemode=zbd requires norandommap=1\n");
+				ret |= 1;
+			}
+			/* if == 1, OK */
+		} else {
+			dprint(FD_ZBD, "fio: zonemode=zbd sets norandommap=1\n");
+			o->norandommap = 1;
+		}
+	}
+
 	if (o->zone_mode == ZONE_MODE_STRIDED && !o->zone_size) {
 		log_err("fio: --zonesize must be specified when using --zonemode=strided.\n");
 		ret |= 1;
-- 
2.49.0


* [PATCH 03/12] zbd: write zone remainders smaller than minimum block size
  2026-01-09  2:35 [PATCH 00/12] zbd: fix problems of random write with unaligned block size Shin'ichiro Kawasaki
  2026-01-09  2:35 ` [PATCH 01/12] zbd: fix zone selection of random writes Shin'ichiro Kawasaki
  2026-01-09  2:35 ` [PATCH 02/12] zbd: set norandommap=1 when zonemode=zbd is specified Shin'ichiro Kawasaki
@ 2026-01-09  2:35 ` Shin'ichiro Kawasaki
  2026-01-09  2:35 ` [PATCH 04/12] zbd: fix write zone accounting Shin'ichiro Kawasaki
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 19+ messages in thread
From: Shin'ichiro Kawasaki @ 2026-01-09  2:35 UTC
  To: fio, Jens Axboe, Vincent Fu; +Cc: Damien Le Moal, Shin'ichiro Kawasaki

When the specified block size is not aligned with the zone size or the
write pointer position at workload start, write workloads create
unwritten remainder areas at the ends of zones. These remainder areas
leave zones in an open condition. This disrupts the intended write
target zone selection.

Previous commits e1a1b59b0b9b ("zbd: finish zones with remainder smaller
than minimum write block size") and e2e29bf6f830 ("zbd: finish zone when
all random write target zones have small remainder") attempted to solve
this problem by issuing zone finish operations for zones with small
remainders. However, this approach degraded performance for two reasons.
First, the zone finish operation takes substantial execution time.
Second, the zone finish operation requires waiting for in-flight writes
from other jobs to complete, which is done by calling io_u_quiesce()
before the zone finish operation.

To avoid the performance degradation, just write the small remainder
areas instead of issuing the zone finish operation. Writing these small
remainder areas shifts the target zone into the "full" condition,
freeing up the zone resource of the device and enabling writing to other
zones. Particularly in asynchronous I/O workloads, these write
operations are managed as part of queued I/Os, eliminating the need for
waiting on in-flight writes.
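
The length selection this implies can be sketched as a small pure
function. This is a simplified model of the adjusted write length logic
in the diff below, not the fio code itself: toy_write_len() and its
parameter names are hypothetical, and 'remainder' stands for the bytes
between the write pointer and the end of the zone capacity.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Pick the length of the next write to a zone. The write is capped at the
 * zone remainder. Lengths above min_bs are rounded down to a multiple of
 * min_bs, while a remainder at or below min_bs is written as it is.
 */
static uint64_t toy_write_len(uint64_t buflen, uint64_t remainder,
			      uint64_t min_bs)
{
	uint64_t len;

	assert(buflen > 0 && remainder > 0 && min_bs > 0);
	len = buflen < remainder ? buflen : remainder;
	if (len > min_bs)
		len = len / min_bs * min_bs;
	return len;
}
```

With a 12 KiB block size and a 4 KiB remainder, the model returns 4 KiB,
so a single short write brings the zone to the full condition.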

The drawback of this approach is that writing these remainders requires
write sizes smaller than the minimum block size. As a result, when using
zonemode=zbd, the random map feature must be disabled using the
norandommap=1 option, which is automatically done for zonemode=zbd.

Fixes: e1a1b59b0b9b ("zbd: finish zones with remainder smaller than minimum write block size")
Fixes: e2e29bf6f830 ("zbd: finish zone when all random write target zones have small remainder")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
---
 zbd.c | 93 ++++++++++-------------------------------------------------
 1 file changed, 15 insertions(+), 78 deletions(-)

diff --git a/zbd.c b/zbd.c
index b71f842c..9d5a3bc3 100644
--- a/zbd.c
+++ b/zbd.c
@@ -92,12 +92,9 @@ static inline uint64_t zbd_zone_remainder(struct fio_zone_info *z)
  *
  * The caller must hold z->mutex.
  */
-static bool zbd_zone_full(const struct fio_file *f, struct fio_zone_info *z,
-			  uint64_t required)
+static bool zbd_zone_full(const struct fio_file *f, struct fio_zone_info *z)
 {
-	assert((required & 511) == 0);
-
-	return z->has_wp && required > zbd_zone_remainder(z);
+	return z->has_wp && zbd_zone_remainder(z) == 0;
 }
 
 static void zone_lock(struct thread_data *td, const struct fio_file *f,
@@ -623,13 +620,11 @@ out:
 static bool zbd_write_zone_get(struct thread_data *td, const struct fio_file *f,
 			       struct fio_zone_info *z)
 {
-	const uint64_t min_bs = td->o.min_bs[DDIR_WRITE];
-
 	/*
 	 * Skip full zones with data verification enabled because resetting a
 	 * zone causes data loss and hence causes verification to fail.
 	 */
-	if (td->o.verify != VERIFY_NONE && zbd_zone_full(f, z, min_bs))
+	if (td->o.verify != VERIFY_NONE && zbd_zone_full(f, z))
 		return false;
 
 	return __zbd_write_zone_get(td, f, z);
@@ -1507,7 +1502,6 @@ static struct fio_zone_info *zbd_convert_to_write_zone(struct thread_data *td,
 						       struct io_u *io_u,
 						       struct fio_zone_info *zb)
 {
-	const uint64_t min_bs = td->o.min_bs[io_u->ddir];
 	struct fio_file *f = io_u->file;
 	struct zoned_block_device_info *zbdi = f->zbd_info;
 	struct fio_zone_info *z;
@@ -1516,33 +1510,9 @@ static struct fio_zone_info *zbd_convert_to_write_zone(struct thread_data *td,
 	bool wait_zone_write;
 	bool in_flight;
 	bool should_retry = true;
-	bool need_zone_finish;
 
 	assert(is_valid_offset(f, io_u->offset));
 
-	if (zbd_zone_remainder(zb) > 0 && zbd_zone_remainder(zb) < min_bs) {
-		pthread_mutex_lock(&f->zbd_info->mutex);
-		zbd_write_zone_put(td, f, zb);
-		pthread_mutex_unlock(&f->zbd_info->mutex);
-		dprint(FD_ZBD, "%s: finish zone %d\n",
-		       f->file_name, zbd_zone_idx(f, zb));
-		io_u_quiesce(td);
-		zbd_finish_zone(td, f, zb);
-		zone_unlock(zb);
-
-		if (zbd_zone_idx(f, zb) + 1 >= f->max_zone && !td_random(td))
-			return NULL;
-
-		/* Find the next write pointer zone */
-		do {
-			zb++;
-			if (zbd_zone_idx(f, zb) >= f->max_zone)
-				zb = zbd_get_zone(f, f->min_zone);
-		} while (!zb->has_wp);
-
-		zone_lock(td, f, zb);
-	}
-
 	if (zbd_write_zone_get(td, f, zb))
 		return zb;
 
@@ -1613,7 +1583,7 @@ static struct fio_zone_info *zbd_convert_to_write_zone(struct thread_data *td,
 	/* Both z->mutex and zbdi->mutex are held. */
 
 examine_zone:
-	if (zbd_zone_remainder(z) >= min_bs) {
+	if (zbd_zone_remainder(z) > 0) {
 		pthread_mutex_unlock(&zbdi->mutex);
 		goto out;
 	}
@@ -1679,9 +1649,8 @@ retry:
 
 	/* Only z->mutex is held. */
 
-	/* Check whether the write fits in any of the write target zones. */
+	/* Check write target zones. */
 	pthread_mutex_lock(&zbdi->mutex);
-	need_zone_finish = true;
 	for (i = 0; i < zbdi->num_write_zones; i++) {
 		zone_idx = zbdi->write_zones[i];
 		if (zone_idx < f->min_zone || zone_idx >= f->max_zone)
@@ -1692,10 +1661,8 @@ retry:
 		z = zbd_get_zone(f, zone_idx);
 
 		zone_lock(td, f, z);
-		if (zbd_zone_remainder(z) >= min_bs) {
-			need_zone_finish = false;
+		if (zbd_zone_remainder(z) > 0)
 			goto out;
-		}
 		pthread_mutex_lock(&zbdi->mutex);
 	}
 
@@ -1718,26 +1685,6 @@ retry:
 		goto retry;
 	}
 
-	if (td_random(td) && td->o.verify == VERIFY_NONE && need_zone_finish)
-		/*
-		 * If all open zones have remainder smaller than the block size
-		 * for random write jobs, choose one of the write target zones
-		 * and finish it. When verify is enabled, skip this zone finish
-		 * operation to avoid verify data corruption by overwrite to the
-		 * zone.
-		 */
-		if (zbd_pick_write_zone(f, io_u, &zone_idx)) {
-			pthread_mutex_unlock(&zbdi->mutex);
-			zone_unlock(z);
-			z = zbd_get_zone(f, zone_idx);
-			zone_lock(td, f, z);
-			io_u_quiesce(td);
-			dprint(FD_ZBD, "%s(%s): All write target zones have remainder smaller than block size. Choose zone %d and finish.\n",
-			       __func__, f->file_name, zone_idx);
-			zbd_finish_zone(td, f, z);
-			goto out;
-		}
-
 	pthread_mutex_unlock(&zbdi->mutex);
 
 	zone_unlock(z);
@@ -2213,7 +2160,6 @@ retry_lock:
 			goto eof;
 		}
 
-retry:
 		zb = zbd_convert_to_write_zone(td, io_u, zb);
 		if (!zb) {
 			dprint(FD_IO, "%s: can't convert to write target zone",
@@ -2221,10 +2167,6 @@ retry:
 			goto eof;
 		}
 
-		if (zbd_zone_remainder(zb) > 0 &&
-		    zbd_zone_remainder(zb) < min_bs)
-			goto retry;
-
 		/* Check whether the zone reset threshold has been exceeded */
 		if (td->o.zrf.u.f) {
 			if (zbdi->wp_valid_data_bytes >=
@@ -2234,7 +2176,7 @@ retry:
 		}
 
 		/* Reset the zone pointer if necessary */
-		if (zb->reset_zone || zbd_zone_full(f, zb, min_bs)) {
+		if (zb->reset_zone || zbd_zone_full(f, zb)) {
 			if (td->o.verify != VERIFY_NONE) {
 				/*
 				 * Unset io-u->file to tell get_next_verify()
@@ -2269,7 +2211,7 @@ retry:
 		}
 
 		/* Make writes occur at the write pointer */
-		assert(!zbd_zone_full(f, zb, min_bs));
+		assert(!zbd_zone_full(f, zb));
 		io_u->offset = zb->wp;
 		if (!is_valid_offset(f, io_u->offset)) {
 			td_verror(td, EINVAL, "invalid WP value");
@@ -2285,21 +2227,16 @@ retry:
 		 */
 		new_len = min((unsigned long long)io_u->buflen,
 			      zbd_zone_capacity_end(zb) - io_u->offset);
-		new_len = new_len / min_bs * min_bs;
+		assert(new_len > 0);
+		if (new_len > min_bs)
+			new_len = new_len / min_bs * min_bs;
 		if (new_len == io_u->buflen)
 			goto accept;
-		if (new_len >= min_bs) {
-			io_u->buflen = new_len;
-			dprint(FD_IO, "Changed length from %u into %llu\n",
-			       orig_len, io_u->buflen);
-			goto accept;
-		}
 
-		td_verror(td, EIO, "zone remainder too small");
-		log_err("zone remainder %lld smaller than min block size %"PRIu64"\n",
-			(zbd_zone_capacity_end(zb) - io_u->offset), min_bs);
-
-		goto eof;
+		io_u->buflen = new_len;
+		dprint(FD_IO, "Changed length from %u into %llu\n",
+		       orig_len, io_u->buflen);
+		goto accept;
 
 	case DDIR_TRIM:
 		/* Check random trim targets a non-empty zone */
-- 
2.49.0



* [PATCH 04/12] zbd: fix write zone accounting
  2026-01-09  2:35 [PATCH 00/12] zbd: fix problems of random write with unaligned block size Shin'ichiro Kawasaki
                   ` (2 preceding siblings ...)
  2026-01-09  2:35 ` [PATCH 03/12] zbd: write zone remainders smaller than minimum block size Shin'ichiro Kawasaki
@ 2026-01-09  2:35 ` Shin'ichiro Kawasaki
  2026-01-09  2:35 ` [PATCH 05/12] zbd: remove io_u_quiesce() at write target zone switch Shin'ichiro Kawasaki
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 19+ messages in thread
From: Shin'ichiro Kawasaki @ 2026-01-09  2:35 UTC
  To: fio, Jens Axboe, Vincent Fu; +Cc: Damien Le Moal, Shin'ichiro Kawasaki

Currently, zbd_convert_to_write_zone() calls io_u_quiesce() when the
number of write target zones hits one of the limits of write zones. This
wait by io_u_quiesce() significantly degrades the performance. However,
when the io_u_quiesce() call is removed, test case 58 of
t/zbd/test-zbd-support fails with null_blk devices that have a
max_active_zones limit set.

The failure is caused by incorrect write target zone accounting in
zbd_convert_to_write_zone(). This function checks the current write
target zones, and selects one of them as the next write target zone.
After the zone selection, it locks the zone. But by the time the zone is
locked, another job might have removed the zone from the write target
zones array. This causes incorrect zone accounting and the test case
failure.

To avoid the incorrect zone accounting, call zbd_write_zone_get() after
the selected zone gets locked. If the zone has been removed from the
write target zones array, the function adds the zone back to the array.
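
The re-check can be pictured with the single-threaded model below. It is
a sketch under stated assumptions, not fio code: struct toy_write_set,
toy_contains(), toy_write_zone_get() and the fixed limit of 4 are
hypothetical, and the mutexes that protect the real array are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_WRITE_ZONES 4

/* Hypothetical bookkeeping; not fio's struct zoned_block_device_info. */
struct toy_write_set {
	unsigned int idx[TOY_MAX_WRITE_ZONES];
	size_t nr;
};

/* Return true if zone_idx is currently accounted as a write target. */
static bool toy_contains(const struct toy_write_set *s, unsigned int zone_idx)
{
	size_t i;

	for (i = 0; i < s->nr; i++)
		if (s->idx[i] == zone_idx)
			return true;
	return false;
}

/*
 * Make sure zone_idx is accounted as a write target, adding it back if
 * another party removed it in the meantime. Returns false when the set is
 * full and the zone cannot be added.
 */
static bool toy_write_zone_get(struct toy_write_set *s, unsigned int zone_idx)
{
	assert(s && s->nr <= TOY_MAX_WRITE_ZONES);
	if (toy_contains(s, zone_idx))
		return true;
	if (s->nr == TOY_MAX_WRITE_ZONES)
		return false;
	s->idx[s->nr++] = zone_idx;
	return true;
}
```

In the model, calling toy_write_zone_get() again after the entry has
disappeared restores it, so the count of write target zones matches the
zones that are actually being written.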

Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
---
 zbd.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/zbd.c b/zbd.c
index 9d5a3bc3..e5f4c8f6 100644
--- a/zbd.c
+++ b/zbd.c
@@ -1661,7 +1661,12 @@ retry:
 		z = zbd_get_zone(f, zone_idx);

 		zone_lock(td, f, z);
-		if (zbd_zone_remainder(z) > 0)
+		/*
+		 * The zone might be already removed from zbdi->write_zones[] by
+		 * other jobs at this moment. Even if the zone has remainder,
+		 * call zbd_write_zone_get() to ensure that it is in the array.
+		 */
+		if (zbd_zone_remainder(z) > 0 && zbd_write_zone_get(td, f, z))
 			goto out;
 		pthread_mutex_lock(&zbdi->mutex);
 	}
-- 
2.49.0


* [PATCH 05/12] zbd: remove io_u_quiesce() at write target zone switch
  2026-01-09  2:35 [PATCH 00/12] zbd: fix problems of random write with unaligned block size Shin'ichiro Kawasaki
                   ` (3 preceding siblings ...)
  2026-01-09  2:35 ` [PATCH 04/12] zbd: fix write zone accounting Shin'ichiro Kawasaki
@ 2026-01-09  2:35 ` Shin'ichiro Kawasaki
  2026-01-09  2:35 ` [PATCH 06/12] zbd: remove zbd_finish_zone() Shin'ichiro Kawasaki
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 19+ messages in thread
From: Shin'ichiro Kawasaki @ 2026-01-09  2:35 UTC
  To: fio, Jens Axboe, Vincent Fu; +Cc: Damien Le Moal, Shin'ichiro Kawasaki

zbd_convert_to_write_zone() calls io_u_quiesce() when the write target
zone changes. This call was hiding the bug fixed by the previous commit.
Now it is safe to remove the io_u_quiesce() call. Remove it. This avoids
the performance drop at write target zone switch.

Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
---
 zbd.c | 20 --------------------
 1 file changed, 20 deletions(-)

diff --git a/zbd.c b/zbd.c
index e5f4c8f6..2682babb 100644
--- a/zbd.c
+++ b/zbd.c
@@ -1507,7 +1507,6 @@ static struct fio_zone_info *zbd_convert_to_write_zone(struct thread_data *td,
 	struct fio_zone_info *z;
 	uint32_t zone_idx, new_zone_idx;
 	int i;
-	bool wait_zone_write;
 	bool in_flight;
 	bool should_retry = true;
 
@@ -1589,29 +1588,10 @@ examine_zone:
 	}
 
 choose_other_zone:
-	/* Check if number of write target zones reaches one of limits. */
-	wait_zone_write =
-		zbdi->num_write_zones == f->max_zone - f->min_zone ||
-		(zbdi->max_write_zones &&
-		 zbdi->num_write_zones == zbdi->max_write_zones) ||
-		(td->o.job_max_open_zones &&
-		 td->num_write_zones == td->o.job_max_open_zones);
-
 	pthread_mutex_unlock(&zbdi->mutex);
 
 	/* Only z->mutex is held. */
 
-	/*
-	 * When number of write target zones reaches to one of limits, wait for
-	 * zone write completion to one of them before trying a new zone.
-	 */
-	if (wait_zone_write) {
-		dprint(FD_ZBD,
-		       "%s(%s): quiesce to remove a zone from write target zones array\n",
-		       __func__, f->file_name);
-		io_u_quiesce(td);
-	}
-
 retry:
 	/*
 	 * For random writes, retry from the zone chosen at the beginning using
-- 
2.49.0



* [PATCH 06/12] zbd: remove zbd_finish_zone()
  2026-01-09  2:35 [PATCH 00/12] zbd: fix problems of random write with unaligned block size Shin'ichiro Kawasaki
                   ` (4 preceding siblings ...)
  2026-01-09  2:35 ` [PATCH 05/12] zbd: remove io_u_quiesce() at write target zone switch Shin'ichiro Kawasaki
@ 2026-01-09  2:35 ` Shin'ichiro Kawasaki
  2026-01-09  2:35 ` [PATCH 07/12] oslib: remove blkzoned_finish_zone() Shin'ichiro Kawasaki
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 19+ messages in thread
From: Shin'ichiro Kawasaki @ 2026-01-09  2:35 UTC
  To: fio, Jens Axboe, Vincent Fu; +Cc: Damien Le Moal, Shin'ichiro Kawasaki

After the commit "zbd: write zone remainders smaller than minimum block
size", zbd_finish_zone(), which performs the zone finish operation, is no
longer required. Remove it.

Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
---
 zbd.c | 38 --------------------------------------
 1 file changed, 38 deletions(-)

diff --git a/zbd.c b/zbd.c
index 2682babb..25da87da 100644
--- a/zbd.c
+++ b/zbd.c
@@ -359,44 +359,6 @@ static int zbd_reset_zone(struct thread_data *td, struct fio_file *f,
 	return 0;
 }
 
-/**
- * zbd_finish_zone - finish the specified zone
- * @td: FIO thread data.
- * @f: FIO file for which to finish a zone
- * @z: Zone to finish.
- *
- * Finish the zone at @offset with open or close status.
- */
-static int zbd_finish_zone(struct thread_data *td, struct fio_file *f,
-			   struct fio_zone_info *z)
-{
-	uint64_t offset = z->start;
-	uint64_t length = f->zbd_info->zone_size;
-	int ret = 0;
-
-	switch (f->zbd_info->model) {
-	case ZBD_HOST_AWARE:
-	case ZBD_HOST_MANAGED:
-		if (td->io_ops && td->io_ops->finish_zone)
-			ret = td->io_ops->finish_zone(td, f, offset, length);
-		else
-			ret = blkzoned_finish_zone(td, f, offset, length);
-		break;
-	default:
-		break;
-	}
-
-	if (ret < 0) {
-		td_verror(td, errno, "finish zone failed");
-		log_err("%s: finish zone at sector %"PRIu64" failed (%d).\n",
-			f->file_name, offset >> 9, errno);
-	} else {
-		z->wp = (z+1)->start;
-	}
-
-	return ret;
-}
-
 /**
  * zbd_reset_zones - Reset a range of zones.
  * @td: fio thread data.
-- 
2.49.0



* [PATCH 07/12] oslib: remove blkzoned_finish_zone()
  2026-01-09  2:35 [PATCH 00/12] zbd: fix problems of random write with unaligned block size Shin'ichiro Kawasaki
                   ` (5 preceding siblings ...)
  2026-01-09  2:35 ` [PATCH 06/12] zbd: remove zbd_finish_zone() Shin'ichiro Kawasaki
@ 2026-01-09  2:35 ` Shin'ichiro Kawasaki
  2026-01-09  2:35 ` [PATCH 08/12] ioengine: remove finish_zone() Shin'ichiro Kawasaki
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 19+ messages in thread
From: Shin'ichiro Kawasaki @ 2026-01-09  2:35 UTC
  To: fio, Jens Axboe, Vincent Fu; +Cc: Damien Le Moal, Shin'ichiro Kawasaki

The zone finish operation is no longer required, so the function
blkzoned_finish_zone() is not required either. Remove it.

Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
---
 oslib/blkzoned.h       |  8 --------
 oslib/linux-blkzoned.c | 34 ----------------------------------
 2 files changed, 42 deletions(-)

diff --git a/oslib/blkzoned.h b/oslib/blkzoned.h
index a8e4a948..c6549ecb 100644
--- a/oslib/blkzoned.h
+++ b/oslib/blkzoned.h
@@ -24,8 +24,6 @@ extern int blkzoned_get_max_open_zones(struct thread_data *td, struct fio_file *
 extern int blkzoned_get_max_active_zones(struct thread_data *td,
 					 struct fio_file *f,
 					 unsigned int *max_active_zones);
-extern int blkzoned_finish_zone(struct thread_data *td, struct fio_file *f,
-				uint64_t offset, uint64_t length);
 #else
 /*
  * Define stubs for systems that do not have zoned block device support.
@@ -71,12 +69,6 @@ static inline int blkzoned_get_max_active_zones(struct thread_data *td,
 {
 	return -EIO;
 }
-static inline int blkzoned_finish_zone(struct thread_data *td,
-				       struct fio_file *f,
-				       uint64_t offset, uint64_t length)
-{
-	return -EIO;
-}
 #endif
 
 #endif /* FIO_BLKZONED_H */
diff --git a/oslib/linux-blkzoned.c b/oslib/linux-blkzoned.c
index c45ef623..34ac66ce 100644
--- a/oslib/linux-blkzoned.c
+++ b/oslib/linux-blkzoned.c
@@ -338,40 +338,6 @@ int blkzoned_reset_wp(struct thread_data *td, struct fio_file *f,
 	return ret;
 }
 
-int blkzoned_finish_zone(struct thread_data *td, struct fio_file *f,
-			 uint64_t offset, uint64_t length)
-{
-	struct blk_zone_range zr = {
-		.sector         = offset >> 9,
-		.nr_sectors     = length >> 9,
-	};
-	int fd, ret = 0;
-
-	/* If the file is not yet opened, open it for this function. */
-	fd = f->fd;
-	if (fd < 0) {
-		fd = open(f->file_name, O_RDWR | O_LARGEFILE);
-		if (fd < 0)
-			return -errno;
-	}
-
-	if (ioctl(fd, BLKFINISHZONE, &zr) < 0) {
-		ret = -errno;
-		/*
-		 * Kernel versions older than 5.5 do not support BLKFINISHZONE
-		 * and return the ENOTTY error code. These old kernels only
-		 * support block devices that close zones automatically.
-		 */
-		if (ret == ENOTTY)
-			ret = 0;
-	}
-
-	if (f->fd < 0)
-		close(fd);
-
-	return ret;
-}
-
 int blkzoned_move_zone_wp(struct thread_data *td, struct fio_file *f,
 			  struct zbd_zone *z, uint64_t length, const char *buf)
 {
-- 
2.49.0



* [PATCH 08/12] ioengine: remove finish_zone()
  2026-01-09  2:35 [PATCH 00/12] zbd: fix problems of random write with unaligned block size Shin'ichiro Kawasaki
                   ` (6 preceding siblings ...)
  2026-01-09  2:35 ` [PATCH 07/12] oslib: remove blkzoned_finish_zone() Shin'ichiro Kawasaki
@ 2026-01-09  2:35 ` Shin'ichiro Kawasaki
  2026-01-09  2:36 ` [PATCH 09/12] doc: explain norandommap restriction and small remainder of zonemode=zbd Shin'ichiro Kawasaki
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 19+ messages in thread
From: Shin'ichiro Kawasaki @ 2026-01-09  2:35 UTC
  To: fio, Jens Axboe, Vincent Fu; +Cc: Damien Le Moal, Shin'ichiro Kawasaki

The zone finish operation is no longer required, so the callback
function finish_zone() of I/O engines is not required either.
Remove it, as well as its implementation libzbc_finish_zone()
for the libzbc engine. Also increment FIO_IOOPS_VERSION to note
that struct ioengine_ops has changed.

Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
---
 engines/libzbc.c | 34 ----------------------------------
 ioengines.h      |  4 +---
 2 files changed, 1 insertion(+), 37 deletions(-)

diff --git a/engines/libzbc.c b/engines/libzbc.c
index 0fa6bfd1..1918bed7 100644
--- a/engines/libzbc.c
+++ b/engines/libzbc.c
@@ -350,39 +350,6 @@ static int libzbc_move_zone_wp(struct thread_data *td, struct fio_file *f,
 	return -ret;
 }
 
-static int libzbc_finish_zone(struct thread_data *td, struct fio_file *f,
-			      uint64_t offset, uint64_t length)
-{
-	struct libzbc_data *ld = td->io_ops_data;
-	uint64_t sector = offset >> 9;
-	unsigned int nr_zones;
-	struct zbc_errno err;
-	int i, ret;
-
-	assert(ld);
-	assert(ld->zdev);
-
-	nr_zones = (length + td->o.zone_size - 1) / td->o.zone_size;
-	assert(nr_zones > 0);
-
-	for (i = 0; i < nr_zones; i++, sector += td->o.zone_size >> 9) {
-		ret = zbc_finish_zone(ld->zdev, sector, 0);
-		if (ret)
-			goto err;
-	}
-
-	return 0;
-
-err:
-	zbc_errno(ld->zdev, &err);
-	td_verror(td, errno, "zbc_finish_zone failed");
-	if (err.sk)
-		log_err("%s: finish zone failed %s:%s\n",
-			f->file_name,
-			zbc_sk_str(err.sk), zbc_asc_ascq_str(err.asc_ascq));
-	return -ret;
-}
-
 static int libzbc_get_max_open_zones(struct thread_data *td, struct fio_file *f,
 				     unsigned int *max_open_zones)
 {
@@ -486,7 +453,6 @@ FIO_STATIC struct ioengine_ops ioengine = {
 	.reset_wp		= libzbc_reset_wp,
 	.move_zone_wp		= libzbc_move_zone_wp,
 	.get_max_open_zones	= libzbc_get_max_open_zones,
-	.finish_zone		= libzbc_finish_zone,
 	.queue			= libzbc_queue,
 	.flags			= FIO_SYNCIO | FIO_NOEXTEND | FIO_RAWIO,
 };
diff --git a/ioengines.h b/ioengines.h
index 3d220a73..fd2c1242 100644
--- a/ioengines.h
+++ b/ioengines.h
@@ -9,7 +9,7 @@
 #include "zbd_types.h"
 #include "dataplacement.h"
 
-#define FIO_IOOPS_VERSION	39
+#define FIO_IOOPS_VERSION	40
 
 #ifndef CONFIG_DYNAMIC_ENGINES
 #define FIO_STATIC	static
@@ -65,8 +65,6 @@ struct ioengine_ops {
 				  unsigned int *);
 	int (*get_max_active_zones)(struct thread_data *, struct fio_file *,
 				    unsigned int *);
-	int (*finish_zone)(struct thread_data *, struct fio_file *,
-			   uint64_t, uint64_t);
 	int (*fdp_fetch_ruhs)(struct thread_data *, struct fio_file *,
 			      struct fio_ruhs_info *);
 	int option_struct_size;
-- 
2.49.0



* [PATCH 09/12] doc: explain norandommap restriction and small remainder of zonemode=zbd
  2026-01-09  2:35 [PATCH 00/12] zbd: fix problems of random write with unaligned block size Shin'ichiro Kawasaki
                   ` (7 preceding siblings ...)
  2026-01-09  2:35 ` [PATCH 08/12] ioengine: remove finish_zone() Shin'ichiro Kawasaki
@ 2026-01-09  2:36 ` Shin'ichiro Kawasaki
  2026-01-09  2:36 ` [PATCH 10/12] t/zbd: avoid test case 14 failure due to no random map Shin'ichiro Kawasaki
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 19+ messages in thread
From: Shin'ichiro Kawasaki @ 2026-01-09  2:36 UTC
  To: fio, Jens Axboe, Vincent Fu; +Cc: Damien Le Moal, Shin'ichiro Kawasaki

The recent commit "zbd: set norandommap=1 when zonemode=zbd is
specified" introduced the norandommap=1 restriction for the zbd zonemode.
Also, the recent commit "zbd: write zone remainders smaller than minimum
block size" changed the handling of the small remainder at the zone end.
Describe both changes in the documentation.

Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
---
 HOWTO.rst | 9 ++++++++-
 fio.1     | 7 ++++++-
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/HOWTO.rst b/HOWTO.rst
index b1642cf0..cd5edd01 100644
--- a/HOWTO.rst
+++ b/HOWTO.rst
@@ -1034,7 +1034,14 @@ Target file/device
 				all zones instead of being restricted to a
 				single zone. The :option:`zoneskip` parameter
 				is ignored. :option:`zonerange` and
-				:option:`zonesize` must be identical.
+				:option:`zonesize` must be identical. The option
+				:option:`norandommap` must be set to 1.
+				Otherwise it is automatically set. When the
+				block size is not aligned to the zone size,
+				write operations leave small areas at zone ends.
+				Fio truncates the specified block size to fill
+				the small left areas with write size smaller
+				than the block size.
 				Trim is handled using a zone reset operation.
 				Trim only considers non-empty sequential write
 				required and sequential write preferred zones.
diff --git a/fio.1 b/fio.1
index 3ee154ed..f7552eb5 100644
--- a/fio.1
+++ b/fio.1
@@ -809,7 +809,12 @@ starts. The \fBzonecapacity\fR parameter is ignored.
 .B zbd
 Zoned block device mode. I/O happens sequentially in each zone, even if random
 I/O has been selected. Random I/O happens across all zones instead of being
-restricted to a single zone.
+restricted to a single zone. The \fBzoneskip\fR parameter is ignored.
+\fBzonerange\fR and \fBzonesize\fR must be identical. The option
+\fBnorandommap\fR must be set to 1. Otherwise it is automatically set. When the
+block size is not aligned to the zone size, write operations leave small areas
+at zone ends. Fio truncates the specified block size to fill the small left
+areas with write size smaller than the block size.
 Trim is handled using a zone reset operation. Trim only considers non-empty
 sequential write required and sequential write preferred zones.
 .RE
-- 
2.49.0


* [PATCH 10/12] t/zbd: avoid test case 14 failure due to no randam map
  2026-01-09  2:35 [PATCH 00/12] zbd: fix problems of random write with unaligned block size Shin'ichiro Kawasaki
                   ` (8 preceding siblings ...)
  2026-01-09  2:36 ` [PATCH 09/12] doc: explain norandommap restriction and small remainder of zonemode=zbd Shin'ichiro Kawasaki
@ 2026-01-09  2:36 ` Shin'ichiro Kawasaki
  2026-01-09  2:36 ` [PATCH 11/12] t/zbd: avoid test case 33 failure due to zone end remainder Shin'ichiro Kawasaki
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 19+ messages in thread
From: Shin'ichiro Kawasaki @ 2026-01-09  2:36 UTC (permalink / raw)
  To: fio, Jens Axboe, Vincent Fu; +Cc: Damien Le Moal, Shin'ichiro Kawasaki

The recent change disabled the random map feature in zonemode=zbd. As a
result, random writes to conventional zones may now overlap, which made
test case 14 fail. To avoid the failure, modify the test case to count
the number of overlaps and subtract them from the expected read size.
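
A minimal sketch of the adjusted accounting, for illustration only. It
follows the patch in subtracting one block per "iolog: overlap" line found
in the --debug=io output. The helper names count_overlaps and
expected_read_bytes are hypothetical and are not part of
t/zbd/test-zbd-support:

```shell
# Hypothetical helpers, for illustration only.

# Count "iolog: overlap" lines in a fio log file. grep exits non-zero when
# nothing matches, so mask that status while still printing the count (0).
count_overlaps() {
    grep -c "iolog: overlap" "$1" || true
}

# Expected verify-read bytes: bytes written minus one block per overlap.
# $1 = size in bytes, $2 = number of overlaps, $3 = block size in bytes
expected_read_bytes() {
    echo $(( $1 - $2 * $3 ))
}
```

For example, expected_read_bytes 1048576 3 16384 prints 999424.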

Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
---
 t/zbd/test-zbd-support | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/t/zbd/test-zbd-support b/t/zbd/test-zbd-support
index 40f1de90..e7a711b9 100755
--- a/t/zbd/test-zbd-support
+++ b/t/zbd/test-zbd-support
@@ -637,7 +637,7 @@ test13() {
 
 # Random write to conventional zones.
 test14() {
-    local off size
+    local off size nr_overlaps
 
     if ! result=($(first_online_zone "$dev")); then
 	echo "Failed to determine first online zone"
@@ -650,10 +650,11 @@ test14() {
 
     run_one_fio_job "$(ioengine "libaio")" --iodepth=64 --rw=randwrite --bs=16K \
 		    --zonemode=zbd --zonesize="${zone_size}" --do_verify=1 \
-		    --verify=md5 --offset=$off --size=$size\
+		    --verify=md5 --offset=$off --size=$size --debug=io \
 		    >>"${logfile}.${test_number}" 2>&1 || return $?
+    nr_overlaps=$(grep "iolog: overlap" --count "${logfile}.${test_number}")
     check_written $((size)) || return $?
-    check_read $((size)) || return $?
+    check_read $((size - nr_overlaps * 16 * 1024)) || return $?
 }
 
 # Sequential read on a mix of empty and full zones.
-- 
2.49.0


* [PATCH 11/12] t/zbd: avoid test case 33 failure due to zone end remainder
  2026-01-09  2:35 [PATCH 00/12] zbd: fix problems of random write with unaligned block size Shin'ichiro Kawasaki
                   ` (9 preceding siblings ...)
  2026-01-09  2:36 ` [PATCH 10/12] t/zbd: avoid test case 14 failure due to no randam map Shin'ichiro Kawasaki
@ 2026-01-09  2:36 ` Shin'ichiro Kawasaki
  2026-01-09  2:36 ` [PATCH 12/12] t/zbd: avoid test case 71 " Shin'ichiro Kawasaki
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 19+ messages in thread
From: Shin'ichiro Kawasaki @ 2026-01-09  2:36 UTC (permalink / raw)
  To: fio, Jens Axboe, Vincent Fu; +Cc: Damien Le Moal, Shin'ichiro Kawasaki

The recent change introduced writes to the small remainder areas at zone
ends. This changed the amount of data written by test case 33 and made
the test case fail. To avoid the failure, modify the test condition of
the test case.
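
To illustrate the change in the expected byte count: the old condition
rounds io_size down to a multiple of bs, while the new one expects the
whole io_size to be written. This is only a sketch with made-up numbers
(the actual size, io_size and bs of test case 33 are not shown in this
patch), and both function names are hypothetical:

```shell
# Hypothetical helpers, for illustration only.
# $1 = io_size in bytes, $2 = bs in bytes

# Old expectation: only whole blocks are counted as written.
written_old_expectation() {
    echo $(( $1 / $2 * $2 ))
}

# New expectation: the remainder is written too, so all of io_size counts.
written_new_expectation() {
    echo $(( $1 ))
}
```

With io_size=1000000 and bs=12288, the old expectation is 995328 bytes and
the new one is 1000000 bytes.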

Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
---
 t/zbd/test-zbd-support | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/t/zbd/test-zbd-support b/t/zbd/test-zbd-support
index e7a711b9..c3ccb559 100755
--- a/t/zbd/test-zbd-support
+++ b/t/zbd/test-zbd-support
@@ -977,7 +977,7 @@ test33() {
     run_fio_on_seq "$(ioengine "psync")" --iodepth=1 --rw=write	\
 		   --size=$size --io_size=$io_size --bs=$bs	\
 		   >> "${logfile}.${test_number}" 2>&1 || return $?
-    check_written $((io_size / bs * bs)) || return $?
+    check_written $((io_size)) || return $?
 }
 
 # Test repeated async write job with verify using two unaligned block sizes.
-- 
2.49.0


* [PATCH 12/12] t/zbd: avoid test case 71 failure due to zone end remainder
  2026-01-09  2:35 [PATCH 00/12] zbd: fix problems of random write with unaligned block size Shin'ichiro Kawasaki
                   ` (10 preceding siblings ...)
  2026-01-09  2:36 ` [PATCH 11/12] t/zbd: avoid test case 33 failure due to zone end remainder Shin'ichiro Kawasaki
@ 2026-01-09  2:36 ` Shin'ichiro Kawasaki
  2026-01-09  9:19 ` [PATCH 00/12] zbd: fix problems of random write with unaligned block size fiotestbot
  2026-01-26  6:50 ` Damien Le Moal
  13 siblings, 0 replies; 19+ messages in thread
From: Shin'ichiro Kawasaki @ 2026-01-09  2:36 UTC (permalink / raw)
  To: fio, Jens Axboe, Vincent Fu; +Cc: Damien Le Moal, Shin'ichiro Kawasaki

The recent change introduced writes to the small remainder areas at zone
ends. This changed the amount of data written by test case 71 and made
the test case fail. To avoid the failure, modify the test condition
of the test case.

Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
---
 t/zbd/test-zbd-support | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/t/zbd/test-zbd-support b/t/zbd/test-zbd-support
index c3ccb559..ae731d5a 100755
--- a/t/zbd/test-zbd-support
+++ b/t/zbd/test-zbd-support
@@ -1737,7 +1737,8 @@ test71() {
 			--max_open_zones=1 --debug=zbd \
 		       >> "${logfile}.${test_number}" 2>&1 || return $?
 
-	check_written $((zone_size * 8)) || return $?
+	check_written $((zone_size * 8)) ||
+		check_written $((zone_size *8 + 4096)) || return $?
 }
 
 set_nullb_badblocks() {
-- 
2.49.0


* Re: [PATCH 00/12] zbd: fix problems of random write with unaligned block size
  2026-01-09  2:35 [PATCH 00/12] zbd: fix problems of random write with unaligned block size Shin'ichiro Kawasaki
                   ` (11 preceding siblings ...)
  2026-01-09  2:36 ` [PATCH 12/12] t/zbd: avoid test case 71 " Shin'ichiro Kawasaki
@ 2026-01-09  9:19 ` fiotestbot
  2026-01-26  6:50 ` Damien Le Moal
  13 siblings, 0 replies; 19+ messages in thread
From: fiotestbot @ 2026-01-09  9:19 UTC (permalink / raw)
  To: fio


The result of fio's continuous integration tests was: cancelled

For more details see https://github.com/fiotestbot/fio/actions/runs/20839657240

* Re: [PATCH 00/12] zbd: fix problems of random write with unaligned block size
  2026-01-09  2:35 [PATCH 00/12] zbd: fix problems of random write with unaligned block size Shin'ichiro Kawasaki
                   ` (12 preceding siblings ...)
  2026-01-09  9:19 ` [PATCH 00/12] zbd: fix problems of random write with unaligned block size fiotestbot
@ 2026-01-26  6:50 ` Damien Le Moal
  13 siblings, 0 replies; 19+ messages in thread
From: Damien Le Moal @ 2026-01-26  6:50 UTC (permalink / raw)
  To: Shin'ichiro Kawasaki, fio, Jens Axboe, Vincent Fu

On 1/9/26 11:35 AM, Shin'ichiro Kawasaki wrote:
> When random write workload runs with zonemode=zbd and block size
> unaligned to the zone size or the initial write pointer position, two
> problems are observed. The first one is write target zone selection.
> When one zone is filled by the write workload, the same zone is selected
> as the next write target. This results in writes concentrating on
> certain zones despite the workload specifies random write.
> 
> The second problem is write performance. The writes with unaligned block
> size leaves small remainder areas at the end of write target zones. To
> free up the zone resource, current fio does zone finish operations to
> the zones with the small remainder. Fio also calls io_u_quiesce() to
> prepare for the zone finish operation and the write target zone
> switching. These zone finish operation and io_u_quiesce() calls
> significantly degrade the random write performance.
> 
> This series address these problems. The first patch addresses the write
> target zone selection problem. The following four patches address the
> performance problem. Next three patches clean up the code by removing
> the zone finish operation. The last four patches adjust documentation
> and the test set for the changes in the series.
> 
> Shin'ichiro Kawasaki (12):
>   zbd: fix zone selection of random writes
>   zbd: set norandommap=1 when zonemode=zbd is specified
>   zbd: write zone remainders smaller than minimum block size
>   zbd: fix write zone accounting
>   zbd: remove io_u_quiesce() at write target zone switch
>   zbd: remove zbd_finish_zone()
>   oslib: remove blkzoned_finish_zone()
>   ioengine: remove finish_zone()
>   doc: explain norandommap restriction and small remainder of
>     zonemode=zbd
>   t/zbd: avoid test case 14 failure due to no randam map
>   t/zbd: avoid test case 33 failure due to zone end remainder
>   t/zbd: avoid test case 71 failure due to zone end remainder

For the series:

Tested-by: Damien Le Moal <dlemoal@kernel.org>


-- 
Damien Le Moal
Western Digital Research

* Re: [PATCH 02/12] zbd: set norandommap=1 when zonemode=zbd is specified
  2026-01-09  2:35 ` [PATCH 02/12] zbd: set norandommap=1 when zonemode=zbd is specified Shin'ichiro Kawasaki
@ 2026-01-27  1:39   ` Vincent Fu
  2026-01-27  5:05     ` Shinichiro Kawasaki
  0 siblings, 1 reply; 19+ messages in thread
From: Vincent Fu @ 2026-01-27  1:39 UTC (permalink / raw)
  To: Shin'ichiro Kawasaki; +Cc: fio, Jens Axboe, Damien Le Moal

On Thu, Jan 8, 2026 at 9:36 PM Shin'ichiro Kawasaki
<shinichiro.kawasaki@wdc.com> wrote:
>
> The random map functionality relies on the minimum block size to manage
> the map of written blocks, in order to ensure that all blocks are
> written with a random workload. This is normally fine, but in the case
> of a zoned block devices being written with a minimum block size that is
> not aligned with the zone size or the write pointer position at workload
> start, the last blocks of a zone can only be written using a write
> operation with a size smaller than the minimum block size. This
> conflicts with the random map operation.
>
> In preparation for supporting writing a zone remainder smaller than the
> minimum block size without using a zone finish operation, disable the
> random map feature by setting norandommap=1 when zonemode=zbd is
> specified.
>
> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> ---
>  init.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
>
> diff --git a/init.c b/init.c
> index 76e1a86d..9320bca8 100644
> --- a/init.c
> +++ b/init.c
> @@ -665,6 +665,19 @@ static int fixup_options(struct thread_data *td)
>                 ret |= 1;
>         }
>
> +       if (o->zone_mode == ZONE_MODE_ZBD) {
> +               if (fio_option_is_set(o, norandommap)) {
> +                       if (o->norandommap == 0) {
> +                               log_err("fio: zonemode=zbd requires norandommap=1\n");
> +                               ret |= 1;
> +                       }
> +                       /* if == 1, OK */
> +               } else {
> +                       dprint(FD_ZBD, "fio: zonemode=zbd sets norandommap=1\n");
> +                       o->norandommap = 1;
> +               }
> +       }
> +
>         if (o->zone_mode == ZONE_MODE_STRIDED && !o->zone_size) {
>                 log_err("fio: --zonesize must be specified when using --zonemode=strided.\n");
>                 ret |= 1;
> --
> 2.49.0
>

As you know, we try to avoid breaking existing job files. This change
breaks 3 test cases.
Fixing a genuine bug or clarifying previously undefined behavior can
justify this sort of breaking change.

It seems heavy handed to suddenly prohibit running a test on a zoned
block device with a random map. Can you provide a stronger
justification for this change?

Perhaps you could instead emit a warning when a random job is run with
a random map and then condition the relevant changes in later patches
on the absence of a random map.

Vincent

* Re: [PATCH 02/12] zbd: set norandommap=1 when zonemode=zbd is specified
  2026-01-27  1:39   ` Vincent Fu
@ 2026-01-27  5:05     ` Shinichiro Kawasaki
  2026-01-30 20:01       ` Vincent Fu
  0 siblings, 1 reply; 19+ messages in thread
From: Shinichiro Kawasaki @ 2026-01-27  5:05 UTC (permalink / raw)
  To: Vincent Fu; +Cc: fio@vger.kernel.org, Jens Axboe, Damien Le Moal

Vincent, thanks for the comments.

On Jan 26, 2026 / 20:39, Vincent Fu wrote:
[...]
> As you know, we try to avoid breaking existing job files. This change
> breaks 3 test cases.

To be precise, only 1 of the 3 test cases in t/zbd was broken due to the
norandommap=1 restriction (test case 14). The other two were broken due to the
change from zone finish operation to simple writes. (Still your point is valid
for the single test case.)

> Fixing a genuine bug or clarifying previously undefined behavior can
> justify this sort of breaking change.
> 
> It seems heavy handed to suddenly prohibit running a test on a zoned
> block device with a random map. Can you provide a stronger
> justification for this change?

There are two justifications to prohibit random map:

1) Random map is not accurate for zoned block devices, since they require
   writes to be issued at the write pointer positions. The random map feature
   manages write offsets, but the offsets are moved to the write pointer
   positions before the write io_u is issued. Random map still helps to balance
   the amount of writes across zones, but it cannot ensure that each sector is
   written without overlap.

2) Enforcing norandommap=1 is required to write to zone ends that have
   remainders smaller than min_bs. Before this change, fio handled such
   remainders with zone finish operations, but it turned out that simple writes
   to the remainders are much faster than zone finish operations. This
   indicates that users will not use zone finish, and will use writes in their
   systems to fill the remainders. Based on this understanding, the current
   implementation with zone finish cannot show the performance that users
   expect (this is the reason I started working on this series). The major use
   case of fio is performance measurement, so in that sense I would say the
   current implementation has a performance measurement bug. I thought that the
   norandommap=1 restriction is acceptable in order to fix this bug.
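
For reference, the remainder handling described in 2) can be sketched as
below. This is only an illustration in shell, not the fio code. The name
next_write_size is hypothetical, and all values are assumed to be byte
counts, with the write pointer given as an offset from the zone start:

```shell
# Hypothetical helper, for illustration only.
# $1 = writable zone length, $2 = write pointer offset in the zone, $3 = bs
# Prints the size of the next write: bs, or the smaller remainder left
# before the zone end.
next_write_size() {
    left=$(( $1 - $2 ))
    if [ "$left" -lt "$3" ]; then
        echo "$left"
    else
        echo "$3"
    fi
}
```

With a 1 MiB zone and bs=12K, 85 full writes leave 4096 bytes at the zone
end, so next_write_size 1048576 1044480 12288 prints 4096. That short write
fills the zone without a zone finish operation.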

> 
> Perhaps you could instead emit a warning when a random job is run with
> a random map and then condition the relevant changes in later patches
> on the absence of a random map.

Actually, I thought about other options as follows to seek for a better
solution:

1) Leave the current remainder handling with zone finish operation, and add the
   new handling with simple writes. Choose one of the two handlings by a new
   option, and ensure norandommap=1 only for the handling with simple writes.
    -> This can keep the current behavior with norandommap=1 workloads, but it
       comes with the zone finish operation that shows bad performance. I
       thought this leaves complexity for users and in the code.

2) Do not always set norandommap=1, and set it only when it is required. To be
   precise, set norandommap=1 when,
     - bs is not aligned to zone size, or,
     - the initial write pointer position of a zone is not aligned to bs
    -> I guess this can be implemented, and it can keep the behavior with
       norandommap=0 for some workloads. However, the norandommap=1 override is
       still required when the alignment requirements are not met. I thought
       this complexity could confuse users (users may need to move write
       pointers to run their workloads with norandommap=0).

3) Modify axmap and randmap handling to support block size smaller than min_bs.
    -> This will be a fundamental change in fio and axmap designs: min_bs is
       referred to in many places. I don't think this approach is feasible.
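
The condition in option 2) could look like the following sketch. It is not
taken from any patch in this series. The name needs_norandommap is
hypothetical, and the write pointer is assumed to be a byte offset from
the zone start:

```shell
# Hypothetical helper, for illustration only.
# $1 = zone size, $2 = initial write pointer offset in the zone, $3 = bs
# Returns 0 (true) when the random map would have to be disabled, that is,
# when bs does not divide the zone size or the write pointer offset.
needs_norandommap() {
    [ $(( $1 % $3 )) -ne 0 ] || [ $(( $2 % $3 )) -ne 0 ]
}
```

For a 1 MiB zone, bs=16K with the write pointer at the zone start keeps the
random map usable, while bs=12K, or a write pointer at offset 4096 with
bs=16K, does not.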

I think this patch is better than the other solutions above, but I'm open to
other options. Let me know your thoughts on them. I guess your opinion is
similar to option 2) above.

* Re: [PATCH 02/12] zbd: set norandommap=1 when zonemode=zbd is specified
  2026-01-27  5:05     ` Shinichiro Kawasaki
@ 2026-01-30 20:01       ` Vincent Fu
  2026-02-09 23:57         ` Shinichiro Kawasaki
  0 siblings, 1 reply; 19+ messages in thread
From: Vincent Fu @ 2026-01-30 20:01 UTC (permalink / raw)
  To: Shinichiro Kawasaki; +Cc: fio@vger.kernel.org, Jens Axboe, Damien Le Moal

On Tue, Jan 27, 2026 at 12:05 AM Shinichiro Kawasaki
<shinichiro.kawasaki@wdc.com> wrote:
>
> Vincent, thanks for the comments.
>
> On Jan 26, 2026 / 20:39, Vincent Fu wrote:
 [...]
> >
> > Perhaps you could instead emit a warning when a random job is run with
> > a random map and then condition the relevant changes in later patches
> > on the absence of a random map.
>
> Actually, I thought about other options as follows to seek for a better
> solution:
>
> 1) Leave the current remainder handling with zone finish operation, and add the
>    new handling with simple writes. Choose one of the two handlings by a new
>    option, and ensure norandommap=1 only for the handling with simple writes.
>     -> This can keep the current behavior with norandommap=1 workloads, but it
>        comes with the zone finish operation that shows bad performance. I
>        thought this leaves complexity for users and in the code.

The above is my preferred solution. It preserves backward
compatibility albeit at the cost of additional complexity.

We do have many other situations where some technical knowledge and
complexity is required to maximize performance.

Vincent

* Re: [PATCH 02/12] zbd: set norandommap=1 when zonemode=zbd is specified
  2026-01-30 20:01       ` Vincent Fu
@ 2026-02-09 23:57         ` Shinichiro Kawasaki
  0 siblings, 0 replies; 19+ messages in thread
From: Shinichiro Kawasaki @ 2026-02-09 23:57 UTC (permalink / raw)
  To: Vincent Fu; +Cc: fio@vger.kernel.org, Jens Axboe, Damien Le Moal

On Jan 30, 2026 / 15:01, Vincent Fu wrote:
> On Tue, Jan 27, 2026 at 12:05 AM Shinichiro Kawasaki
> <shinichiro.kawasaki@wdc.com> wrote:
> >
> > Vincent, thanks for the comments.
> >
> > On Jan 26, 2026 / 20:39, Vincent Fu wrote:
>  [...]
> > >
> > > Perhaps you could instead emit a warning when a random job is run with
> > > a random map and then condition the relevant changes in later patches
> > > on the absence of a random map.
> >
> > Actually, I thought about other options as follows to seek for a better
> > solution:
> >
> > 1) Leave the current remainder handling with zone finish operation, and add the
> >    new handling with simple writes. Choose one of the two handlings by a new
> >    option, and ensure norandommap=1 only for the handling with simple writes.
> >     -> This can keep the current behavior with norandommap=1 workloads, but it
> >        comes with the zone finish operation that shows bad performance. I
> >        thought this leaves complexity for users and in the code.
> 
> The above is my preferred solution. It preserves backward
> compatibility albeit at the cost of additional complexity.
> 
> We do have many other situations where some technical knowledge and
> complexity is required to maximize performance.

Okay, thank you for the comment. I will prepare v2 series based on this option.
