Re: [PATCH v12 4/5] qcow2: add zoned emulation capability

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Stefan Hajnoczi <stefanha@redhat.com>
To: Sam Li <faithilikerun@gmail.com>
Cc: qemu-devel@nongnu.org, "Michael S. Tsirkin" <mst@redhat.com>,
	Hanna Reitz <hreitz@redhat.com>, Kevin Wolf <kwolf@redhat.com>,
	cassel@kernel.org, dlemoal@kernel.org, qemu-block@nongnu.org,
	dmitry.fomichev@wdc.com, Markus Armbruster <armbru@redhat.com>,
	Pierrick Bouvier <pierrick.bouvier@oss.qualcomm.com>,
	Eric Blake <eblake@redhat.com>
Subject: Re: [PATCH v12 4/5] qcow2: add zoned emulation capability
Date: Mon, 29 Jun 2026 19:40:06 +0200	[thread overview]
Message-ID: <20260629174006.GA376767@fedora> (raw)
In-Reply-To: <20260623184830.373232-5-faithilikerun@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 9074 bytes --]

On Tue, Jun 23, 2026 at 08:48:29PM +0200, Sam Li wrote:
> +/*
> + * Read/Write the new wp value for zone `index` through write pointe

pointe -> pointer

> +static int coroutine_fn
> +qcow2_co_zone_report(BlockDriverState *bs, int64_t offset,
> +                     unsigned int *nr_zones, BlockZoneDescriptor *zones)
> +{
> +    BDRVQcow2State *s = bs->opaque;
> +    uint64_t zone_size = s->zoned_header.zone_size;
> +    uint64_t zone_capacity = s->zoned_header.zone_capacity;
> +    int64_t capacity = bs->total_sectors << BDRV_SECTOR_BITS;
> +    int64_t size = bs->bl.nr_zones * zone_size;
> +    unsigned int nrz;
> +    int i = 0;
> +    int si;

Should i and si be unsigned int since nr_zones is an unsigned int?

> +static int coroutine_fn GRAPH_RDLOCK
> +qcow2_reset_zone(BlockDriverState *bs, uint32_t index,
> +                            int64_t len)
> +{
> +    BDRVQcow2State *s = bs->opaque;
> +    int nrz = bs->bl.nr_zones;
> +    int64_t zone_size = bs->bl.zone_size;
> +    int n, ret = 0;
> +    bool any_dirtied = false;

nr_zones is an unsigned int. Variables holding numbers of zones like
nrz, i, and n should be unsigned int, too. That way they behave
correctly over their full range instead of behaving unexpectedly when
they become negative (e.g. in a for loop with i < n).

> +
> +    qemu_co_mutex_lock(&bs->wps->colock);
> +    uint64_t *wp = &bs->wps->wp[index];
> +    if (len == bs->total_sectors << BDRV_SECTOR_BITS) {
> +        n = nrz;
> +        index = 0;
> +        wp = &bs->wps->wp[0];
> +    } else {
> +        n = len / zone_size;
> +    }
> +
> +    for (int i = 0; i < n; ++i) {
> +        uint64_t *wp_i = (uint64_t *)(wp + i);
> +        uint64_t wpi_v = *wp_i;
> +        if (QCOW2_ZT_IS_CONV(wpi_v)) {
> +            continue;
> +        }
> +
> +        /* Reject if any write is in flight for this zone. */
> +        if (s->zone_wp_state &&
> +            !QTAILQ_EMPTY(&s->zone_wp_state[index + i].in_flight)) {
> +            ret = -EBUSY;
> +            goto unlock;
> +        }

This is a coroutine function and it should yield to wait instead of
returning -EBUSY. If -EBUSY is propagated to the guest that would result
in unexpected behavior - I think real devices don't do this and the
guest will not be prepared to handle -EBUSY.

> +
> +        qemu_co_mutex_lock(&s->lock);
> +        BlockZoneState zs = qcow2_get_zone_state(bs, index + i);
> +        switch (zs) {
> +        case BLK_ZS_EMPTY:
> +            break;
> +        case BLK_ZS_IOPEN:
> +            qcow2_rm_imp_open_zone(s, index + i);
> +            trace_qcow2_imp_open_zones(BLK_ZO_RESET, s->nr_zones_imp_open);
> +            break;
> +        case BLK_ZS_EOPEN:
> +            qcow2_rm_exp_open_zone(s, index + i);
> +            break;
> +        case BLK_ZS_CLOSED:
> +            qcow2_rm_closed_zone(s, index + i);
> +            break;
> +        case BLK_ZS_FULL:
> +            break;
> +        default:
> +            ret = -EINVAL;
> +            qemu_co_mutex_unlock(&s->lock);
> +            goto unlock;
> +        }
> +        qemu_co_mutex_unlock(&s->lock);
> +
> +        if (zs == BLK_ZS_EMPTY) {
> +            continue;
> +        }
> +
> +        /*
> +         * Zero the data extent first. Date write fires before the WP cluster

Date -> Data

> +         * hits disk. So the wp advance cannot become durable while stale data
> +         * is still readable.
> +         */
> +        ret = qcow2_co_pwrite_zeroes(bs, (uint64_t)(index + i) * zone_size,
> +                                     zone_size, 0);
> +        if (ret < 0) {
> +            error_report("Failed to clear zone data at zone %u",
> +                         index + i);
> +            goto unlock;
> +        }
> +
> +        qemu_co_mutex_lock(&s->lock);
> +        qcow2_cache_depends_on_flush(s->wp_cache);
> +        qemu_co_mutex_unlock(&s->lock);
> +
> +        *wp_i = (uint64_t)(index + i) * zone_size;
> +        ret = qcow2_rw_wp_at(bs, wp_i, index + i, true);
> +        if (ret < 0) {
> +            goto unlock;

*wp_i has been modified but the operation will fail. The write pointers
in memory and the write pointers in the qcow2 cache and on disk are now
inconsistent. I expect the failure semantics to be that the write
pointers in memory are consistent with the on-disk write pointers,
although I haven't checked specs to see if they say anything about this.

> +        }
> +        any_dirtied = true;
> +    }
> +
> +    if (any_dirtied) {
> +        /* Single flush at the end. */
> +        qemu_co_mutex_lock(&s->lock);
> +        ret = qcow2_cache_flush(bs, s->wp_cache);
> +        qemu_co_mutex_unlock(&s->lock);
> +    }
> +
> +unlock:
> +    qemu_co_mutex_unlock(&bs->wps->colock);
> +    return ret;
> +}
> +
> +static int coroutine_fn GRAPH_RDLOCK
> +qcow2_co_zone_mgmt(BlockDriverState *bs, BlockZoneOp op,
> +                                           int64_t offset, int64_t len)

Indentation is off here.

> +{
> +    BDRVQcow2State *s = bs->opaque;
> +    int ret = 0;
> +    int64_t capacity = bs->total_sectors << BDRV_SECTOR_BITS;
> +    int64_t zone_size = s->zoned_header.zone_size;
> +    int64_t zone_size_mask = zone_size - 1;
> +    uint32_t index = offset / zone_size;

It would be safer to check that this qcow2 image supports zone emulation
before getting deeper into this function. zone_size could be 0 here if
zone emulation is not enabled and it would cause a divide-by-zero
exception.

All block driver zone callbacks in qcow2 should perform this check just
in case an emulated device exposes a code path where the guest can
perform a zoned operation on a non-zoned BlockDriverState.
Alternatively, you could add a generic check in bdrv_co_zone_mgmt() and
similar functions.

> +    BlockZoneWps *wps = bs->wps;
> +
> +    if (offset >= capacity) {
> +        error_report("offset %" PRId64 " is equal to or greater than the"
> +                     "device capacity %" PRId64 "", offset, capacity);
> +        return -EINVAL;
> +    }
> +
> +    if (offset & zone_size_mask) {
> +        error_report("sector offset %" PRId64 " is not aligned to zone size"
> +                     " %" PRId64 "", offset / 512, zone_size / 512);
> +        return -EINVAL;
> +    }
> +
> +    if (((offset + len) < capacity && len & zone_size_mask) ||

Here is how the capacity check is expressed without an integer overflow
in block/block-backend.c:blk_check_byte_request():

  if (offset > len || len - offset < bytes) {
                      ^^^^^^^^^^^^^^^^^^^^
      return -EIO;
  }

It would be preferrable to use that form to avoid integer overflow.

> +        offset + len > capacity) {
> +        error_report("number of sectors %" PRId64 " is not aligned to zone"
> +                     " size %" PRId64 "", len / 512, zone_size / 512);
> +        return -EINVAL;
> +    }
> +
> +    qemu_co_mutex_lock(&wps->colock);
> +    uint64_t wpv = wps->wp[index];
> +    qemu_co_mutex_unlock(&wps->colock);
> +
> +    if (QCOW2_ZT_IS_CONV(wpv)) {
> +        /*
> +         * ZONE_RESET_ALL is a global operation that is allowed when the
> +         * starting zone is conventional; the zone reset path itself skips
> +         * conventional zones.
> +         */
> +        if (op != BLK_ZO_RESET || len != capacity) {
> +            error_report("zone mgmt operation 0x%x is not allowed on "
> +                         "a conventional zone", op);
> +            return -EIO;
> +        }
> +    }
> +
> +    switch (op) {
> +    case BLK_ZO_OPEN:
> +        ret = qcow2_open_zone(bs, index);
> +        break;
> +    case BLK_ZO_CLOSE:
> +        ret = qcow2_close_zone(bs, index);
> +        break;
> +    case BLK_ZO_FINISH:
> +        ret = qcow2_finish_zone(bs, index);
> +        break;
> +    case BLK_ZO_RESET:
> +        ret = qcow2_reset_zone(bs, index, len);
> +        break;
> +    default:
> +        error_report("Unsupported zone op: 0x%x", op);
> +        ret = -ENOTSUP;
> +        break;
> +    }
> +    return ret;
> +}
> +
> +static int coroutine_fn GRAPH_RDLOCK
> +qcow2_co_zone_append(BlockDriverState *bs, int64_t *offset, QEMUIOVector *qiov,
> +                     BdrvRequestFlags flags)
> +{
> +    assert(flags == 0);
> +    int64_t capacity = bs->total_sectors << BDRV_SECTOR_BITS;
> +    int64_t zone_size_mask = bs->bl.zone_size - 1;
> +    int64_t iov_len = 0;
> +    int64_t len = 0;
> +
> +    if (*offset >= capacity) {
> +        error_report("*offset %" PRId64 " is equal to or greater than the"
> +                     "device capacity %" PRId64 "", *offset, capacity);
> +        return -EINVAL;

I'm not sure whether this should be -ENOSPC or -EINVAL. Did you look
into what is most appropriate? pwrite(2) on Linux returns ENOSPC when
writing beyond the end of the block device.

This also applies to other places in this patch that check if offset >=
capacity.

> +    }
> +
> +    /* offset + len should not pass the end of that zone starting from offset */
> +    if (*offset & zone_size_mask) {

The comment doesn't match the check that is performed here.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

next prev parent reply	other threads:[~2026-06-29 17:41 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-23 18:48 [PATCH v12 0/5] Add full zoned storage emulation to the qcow2 driver Sam Li
2026-06-23 18:48 ` [PATCH v12 1/5] docs/qcow2: add the zoned format feature Sam Li
2026-06-23 18:48 ` [PATCH v12 2/5] qcow2: add configurations for zoned format extension Sam Li
2026-06-26 12:26   ` Niklas Cassel
2026-06-23 18:48 ` [PATCH v12 3/5] virtio-blk: do not merge writes across a zone boundary Sam Li
2026-06-24 19:21   ` Stefan Hajnoczi
2026-06-26 11:44   ` Niklas Cassel
2026-06-26 11:50     ` Sam Li
2026-06-23 18:48 ` [PATCH v12 4/5] qcow2: add zoned emulation capability Sam Li
2026-06-29 17:40   ` Stefan Hajnoczi [this message]
2026-06-29 18:20     ` Sam Li
2026-06-29 18:23   ` Stefan Hajnoczi
2026-06-23 18:48 ` [PATCH v12 5/5] iotests: test the zoned format feature for qcow2 file Sam Li
2026-06-26 11:33 ` [PATCH v12 0/5] Add full zoned storage emulation to the qcow2 driver Niklas Cassel
2026-06-26 11:41   ` Sam Li

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260629174006.GA376767@fedora \
    --to=stefanha@redhat.com \
    --cc=armbru@redhat.com \
    --cc=cassel@kernel.org \
    --cc=dlemoal@kernel.org \
    --cc=dmitry.fomichev@wdc.com \
    --cc=eblake@redhat.com \
    --cc=faithilikerun@gmail.com \
    --cc=hreitz@redhat.com \
    --cc=kwolf@redhat.com \
    --cc=mst@redhat.com \
    --cc=pierrick.bouvier@oss.qualcomm.com \
    --cc=qemu-block@nongnu.org \
    --cc=qemu-devel@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.