All of lore.kernel.org
 help / color / mirror / Atom feed
From: Stefan Hajnoczi <stefanha@redhat.com>
To: Sam Li <faithilikerun@gmail.com>
Cc: qemu-devel@nongnu.org, Markus Armbruster <armbru@redhat.com>,
	qemu-block@nongnu.org, Eric Blake <eblake@redhat.com>,
	Pierrick Bouvier <pierrick.bouvier@oss.qualcomm.com>,
	dmitry.fomichev@wdc.com, Hanna Reitz <hreitz@redhat.com>,
	hare@suse.de, "Michael S. Tsirkin" <mst@redhat.com>,
	Kevin Wolf <kwolf@redhat.com>,
	cassel@kernel.org, dlemoal@kernel.org
Subject: Re: [PATCH v11 4/5] qcow2: add zoned emulation capability
Date: Thu, 4 Jun 2026 16:51:53 -0400	[thread overview]
Message-ID: <20260604205153.GA250536@fedora> (raw)
In-Reply-To: <20260601214405.252818-5-faithilikerun@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 10650 bytes --]

On Mon, Jun 01, 2026 at 11:44:04PM +0200, Sam Li wrote:
> By adding zone operations and zoned metadata, the zoned emulation
> capability enables full emulation support of zoned device using
> a qcow2 file. The zoned device metadata includes zone type,
> zoned device state and write pointer (WP) of each zone, which is
> stored to an array of unsigned integers.
> 
> WP accessor (qcow2_rw_wp_at) routes reads and writes of an 8-byte
> WP slot through the write pointer cache. The write pointer cache is
> written to disk after the qcow2 metadata is written, thus guaranteeing
> that the write pointer is updated after the corresponding data is
> written. Per-completion cache flush is deferred. The WP cluster reaches
> disk on the next flush.
> 
> Each zone of a zoned device makes state transitions following
> the zone state machine. The zone state machine mainly describes
> five states, IMPLICIT OPEN, EXPLICIT OPEN, FULL, EMPTY and CLOSED.
> READ ONLY and OFFLINE states will generally be affected by device
> internal events. The operations on zones cause corresponding state
> changing.
> 
> Zoned devices have limits on zone resources, which put constraints on
> write operations on zones. It is managed by active zone queues
> following LRU policy.
> 
> Signed-off-by: Sam Li <faithilikerun@gmail.com>
> ---
>  block/qcow2-cache.c    |    8 +
>  block/qcow2-refcount.c |    7 +
>  block/qcow2.c          | 1137 +++++++++++++++++++++++++++++++++++++++-
>  block/trace-events     |    2 +
>  4 files changed, 1149 insertions(+), 5 deletions(-)
> 
> diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c
> index 23d9588b08..bdfb11ce88 100644
> --- a/block/qcow2-cache.c
> +++ b/block/qcow2-cache.c
> @@ -275,6 +275,14 @@ int qcow2_cache_set_dependency(BlockDriverState *bs, Qcow2Cache *c,
>  {
>      int ret;
>  
> +    /*
> +     * If the dependency graph is unchanged, nothing to do. This avoids
> +     * a synchronous flush on every call below.
> +     */
> +    if (c->depends == dependency) {
> +        return 0;
> +    }

This makes part of the expression below tautologous:

    if (c->depends && (c->depends != dependency)) {
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^
        ret = qcow2_cache_flush_dependency(bs, c);
        if (ret < 0) {
            return ret;
        }
    }

That sub-expression could be dropped, but it makes me worry that the
earlier if (dependency->depends) statement is needed even when
c->depends == dependency.

Kevin: Any thoughts on this?

> +
>      if (dependency->depends) {
>          ret = qcow2_cache_flush_dependency(bs, dependency);
>          if (ret < 0) {
> diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
> index 6512cda407..f551726609 100644
> --- a/block/qcow2-refcount.c
> +++ b/block/qcow2-refcount.c
> @@ -1239,6 +1239,13 @@ int qcow2_write_caches(BlockDriverState *bs)
>          }
>      }
>  
> +    if (s->wp_cache) {
> +        ret = qcow2_cache_write(bs, s->wp_cache);
> +        if (ret < 0) {
> +            return ret;
> +        }
> +    }
> +
>      return 0;
>  }
>  
> diff --git a/block/qcow2.c b/block/qcow2.c
> index 29eec33e34..bdc8923b71 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c
> @@ -195,6 +195,300 @@ qcow2_extract_crypto_opts(QemuOpts *opts, const char *fmt, Error **errp)
>      return cryptoopts_qdict;
>  }
>  
> +#define QCOW2_ZT_IS_CONV(wp)    (wp & 1ULL << 59)
> +#define QCOW2_GET_WP(wp)        ((wp << 5) >> 5)
> +
> +/*
> + * To emulate a real zoned device, closed, empty and full states are
> + * preserved after a power cycle. The open states are in-memory and will
> + * be lost after closing the device. Read-only and offline states are
> + * device-internal events, which are not considered for simplicity.
> + */
> +static inline BlockZoneState qcow2_get_zone_state(BlockDriverState *bs,
> +                                                  uint32_t index)

I guess this function requires bs->wps->colock or s->lock, otherwise the
TAILQ accesses could race? Please check and document the locking
requirements.

I/O requests may be processed in multiple threads simultaneously.
s->lock protects qcow2 state.

> +{
> +    BDRVQcow2State *s = bs->opaque;
> +    Qcow2ZoneListEntry *zone_entry = &s->zone_list_entries[index];
> +    uint64_t zone_wp = bs->wps->wp[index];
> +    uint64_t zone_start;
> +
> +    if (QCOW2_ZT_IS_CONV(zone_wp)) {
> +        return BLK_ZS_NOT_WP;
> +    }
> +
> +    if (QTAILQ_IN_USE(zone_entry, exp_open_zone_entry)) {
> +        return BLK_ZS_EOPEN;
> +    }
> +    if (QTAILQ_IN_USE(zone_entry, imp_open_zone_entry)) {
> +        return BLK_ZS_IOPEN;
> +    }
> +
> +    zone_start = index * bs->bl.zone_size;

This is a uint32_t * uint32_t multiplication that can overflow. Avoid
that with:

  zone_start = (uint64_t)index * bs->bl.zone_size;

> +    if (zone_wp == zone_start) {
> +        return BLK_ZS_EMPTY;
> +    }
> +    if (zone_wp >= zone_start + bs->bl.zone_capacity) {
> +        return BLK_ZS_FULL;
> +    }
> +    if (zone_wp > zone_start) {
> +        if (!QTAILQ_IN_USE(zone_entry, closed_zone_entry)) {
> +            /*
> +             * The number of closed zones is not always updated in time when
> +             * the device is closed. However, it only matters when doing
> +             * zone report. Refresh the count and list of closed zones to
> +             * provide correct zone states for zone report.
> +             */
> +            QTAILQ_INSERT_HEAD(&s->closed_zones, zone_entry, closed_zone_entry);
> +            s->nr_zones_closed++;
> +        }
> +        return BLK_ZS_CLOSED;
> +    }
> +    return BLK_ZS_NOT_WP;
> +}
> +
> +static void qcow2_rm_exp_open_zone(BDRVQcow2State *s,
> +                                   uint32_t index)

Locking requirements here and in the functions that follow?

> +{
> +    Qcow2ZoneListEntry *zone_entry = &s->zone_list_entries[index];
> +
> +    QTAILQ_REMOVE(&s->exp_open_zones, zone_entry, exp_open_zone_entry);
> +    s->nr_zones_exp_open--;
> +}
> +
> +static void qcow2_rm_imp_open_zone(BDRVQcow2State *s,
> +                                   int32_t index)
> +{
> +    Qcow2ZoneListEntry *zone_entry;
> +    if (index < 0) {
> +        /* Apply LRU when the index is not specified. */
> +        zone_entry = QTAILQ_LAST(&s->imp_open_zones);
> +    } else {
> +        zone_entry = &s->zone_list_entries[index];
> +    }
> +
> +    QTAILQ_REMOVE(&s->imp_open_zones, zone_entry, imp_open_zone_entry);
> +    s->nr_zones_imp_open--;
> +}
> +
> +static void qcow2_rm_open_zone(BDRVQcow2State *s,
> +                               uint32_t index)
> +{
> +    Qcow2ZoneListEntry *zone_entry = &s->zone_list_entries[index];
> +
> +    if (QTAILQ_IN_USE(zone_entry, exp_open_zone_entry)) {
> +        qcow2_rm_exp_open_zone(s, index);
> +    } else if (QTAILQ_IN_USE(zone_entry, imp_open_zone_entry)) {
> +        qcow2_rm_imp_open_zone(s, index);
> +    }
> +}
> +
> +static void qcow2_rm_closed_zone(BDRVQcow2State *s,
> +                                 uint32_t index)
> +{
> +    Qcow2ZoneListEntry *zone_entry = &s->zone_list_entries[index];
> +
> +    QTAILQ_REMOVE(&s->closed_zones, zone_entry, closed_zone_entry);
> +    s->nr_zones_closed--;
> +}
> +
> +static void qcow2_do_imp_open_zone(BDRVQcow2State *s,
> +                                   uint32_t index,
> +                                   BlockZoneState zs)
> +{
> +    Qcow2ZoneListEntry *zone_entry = &s->zone_list_entries[index];
> +
> +    switch (zs) {
> +    case BLK_ZS_EMPTY:
> +        break;
> +    case BLK_ZS_CLOSED:
> +        qcow2_rm_closed_zone(s, index);
> +        break;
> +    case BLK_ZS_IOPEN:
> +        /*
> +         * The LRU policy: update the zone that is most recently
> +         * used to the head of the zone list
> +         */
> +        if (zone_entry == QTAILQ_FIRST(&s->imp_open_zones)) {
> +            return;
> +        }
> +        QTAILQ_REMOVE(&s->imp_open_zones, zone_entry, imp_open_zone_entry);
> +        s->nr_zones_imp_open--;
> +        break;
> +    default:
> +        return;
> +    }
> +
> +    QTAILQ_INSERT_HEAD(&s->imp_open_zones, zone_entry, imp_open_zone_entry);
> +    s->nr_zones_imp_open++;
> +}
> +
> +static void qcow2_do_exp_open_zone(BDRVQcow2State *s,
> +                                   uint32_t index)
> +{
> +    Qcow2ZoneListEntry *zone_entry = &s->zone_list_entries[index];
> +
> +    QTAILQ_INSERT_HEAD(&s->exp_open_zones, zone_entry, exp_open_zone_entry);
> +    s->nr_zones_exp_open++;
> +}
> +
> +/*
> + * The list of zones is managed using an LRU policy: the last
> + * zone of the list is always the one that was least recently used
> + * for writing and is chosen as the zone to close to be able to
> + * implicitly open another zone.
> + *
> + * We can only close the open zones. The index is not specified
> + * when it is less than 0.
> + */
> +static void qcow2_do_close_zone(BlockDriverState *bs,
> +                                int32_t index,
> +                                BlockZoneState zs)
> +{
> +    BDRVQcow2State *s = bs->opaque;
> +    Qcow2ZoneListEntry *zone_entry;
> +
> +    if (index >= 0) {
> +        zone_entry = &s->zone_list_entries[index];
> +    } else {
> +        /* before removal of the last implicitly open zone */
> +        zone_entry = QTAILQ_LAST(&s->imp_open_zones);

There is an assumption that zone_entry is no NULL when zs ==
BLK_ZS_IOPEN? I think that make sense and it means NULL dereferences
cannot happen, but I wanted to check. You could add an assert(zone_entry
!= NULL) here to make that explicit.

> +    }
> +
> +    if (zs == BLK_ZS_IOPEN) {
> +        qcow2_rm_imp_open_zone(s, index);
> +        goto close_zone;
> +    }
> +
> +    if (index >= 0 && zs == BLK_ZS_EOPEN) {
> +        qcow2_rm_exp_open_zone(s, index);
> +        /*
> +         * The zone state changes when the zone is removed from the list of
> +         * open zones (explicitly open -> empty). The closed zone list is
> +         * refreshed during get_zone_state().
> +         */
> +        qcow2_get_zone_state(bs, index);
> +    }
> +    return;
> +
> +close_zone:
> +    QTAILQ_INSERT_HEAD(&s->closed_zones, zone_entry, closed_zone_entry);
> +    s->nr_zones_closed++;

Is the goto and label necessary? Maybe move this inside the if
statement instead to simplify the function.

> +}

I've reviewed up to here for now.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

  reply	other threads:[~2026-06-04 20:53 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-01 21:44 [PATCH v11 0/5] Add full zoned storage emulation to the qcow2 driver Sam Li
2026-06-01 21:44 ` [PATCH v11 1/5] docs/qcow2: add the zoned format feature Sam Li
2026-06-01 21:44 ` [PATCH v11 2/5] qcow2: add configurations for zoned format extension Sam Li
2026-06-02 20:03   ` Stefan Hajnoczi
2026-06-03  7:37   ` Markus Armbruster
2026-06-01 21:44 ` [PATCH v11 3/5] virtio-blk: do not merge writes across a zone boundary Sam Li
2026-06-03 20:41   ` Stefan Hajnoczi
2026-06-01 21:44 ` [PATCH v11 4/5] qcow2: add zoned emulation capability Sam Li
2026-06-04 20:51   ` Stefan Hajnoczi [this message]
2026-06-01 21:44 ` [PATCH v11 5/5] iotests: test the zoned format feature for qcow2 file Sam Li
2026-06-02 20:06 ` [PATCH v11 0/5] Add full zoned storage emulation to the qcow2 driver Stefan Hajnoczi
2026-06-02 20:07   ` Stefan Hajnoczi
2026-06-02 20:10   ` Sam Li

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260604205153.GA250536@fedora \
    --to=stefanha@redhat.com \
    --cc=armbru@redhat.com \
    --cc=cassel@kernel.org \
    --cc=dlemoal@kernel.org \
    --cc=dmitry.fomichev@wdc.com \
    --cc=eblake@redhat.com \
    --cc=faithilikerun@gmail.com \
    --cc=hare@suse.de \
    --cc=hreitz@redhat.com \
    --cc=kwolf@redhat.com \
    --cc=mst@redhat.com \
    --cc=pierrick.bouvier@oss.qualcomm.com \
    --cc=qemu-block@nongnu.org \
    --cc=qemu-devel@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.