From: Stefan Hajnoczi <stefanha@redhat.com>
To: Sam Li <faithilikerun@gmail.com>
Cc: dlemoal@kernel.org, qemu-devel@nongnu.org,
Pierrick Bouvier <pierrick.bouvier@oss.qualcomm.com>,
dmitry.fomichev@wdc.com, Kevin Wolf <kwolf@redhat.com>,
cassel@kernel.org, Markus Armbruster <armbru@redhat.com>,
qemu-block@nongnu.org, Eric Blake <eblake@redhat.com>,
"Michael S. Tsirkin" <mst@redhat.com>,
hare@suse.de, Hanna Reitz <hreitz@redhat.com>
Subject: Re: [PATCH v10 2/4] qcow2: add configurations for zoned format extension
Date: Thu, 21 May 2026 11:18:13 -0400
Message-ID: <20260521151813.GE464582@fedora>
In-Reply-To: <CAAAx-8KMLBn1Fb1pMNHwsPsM2FKbJAhFFRFbjhvTvtbZ5FKfpA@mail.gmail.com>
On Wed, May 20, 2026 at 08:23:41PM +0200, Sam Li wrote:
> On Wed, May 20, 2026 at 7:59 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > On Tue, May 19, 2026 at 11:20:18PM +0200, Sam Li wrote:
> > > On Tue, May 19, 2026 at 5:49 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > >
> > > > On Mon, May 18, 2026 at 12:21:55AM +0200, Sam Li wrote:
> > > > > On Thu, May 14, 2026 at 9:49 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > > On Sun, May 10, 2026 at 07:50:57PM +0200, Sam Li wrote:
> > > > > > > + 48 - 55: zonedmeta_offset
> > > > > > > + The offset of zoned metadata structure in the contained
> > > > > > > + image, in bytes.
> > > > > >
> > > > > > Do you want to say anything about the order in which metadata is
> > > > > > > persisted to disk when zones are used? I guess the data is written into the
> > > > > > image file first, then the non-zoned qcow2 L1/L2/refcount metadata is
> > > > > > updated, and finally the write pointer is written. Write pointers are
> > > > > > not guaranteed to be updated on disk until the write request followed by
> > > > > > a flush request are both completed.
> > > > >
> > > > > The current ordering is not like that. The write pointer is written
> > > > > persistently first, then the data writes and the non-zoned qcow2
> > > > > L1/L2/refcount metadata updates. On IO failure, the corresponding
> > > > > write pointer is re-read from disk. As noted in the previous comment,
> > > > > the wp must be updated when issuing the IO, under the assumption that
> > > > > the write IO will succeed.
> > > > >
> > > > > The ordering has been settled this way since v7 to deal with
> > > > > concurrent zone append writes. If the wp was only updated after data
> > > > > I/O, two concurrent appends would both have read the same wp and tried
> > > > > to write to the same position.
> > > > >
> > > > > >
> > > > > > (The idea is that the data must be visible in the qcow2 file before it
> > > > > > is safe to update the write pointer. Otherwise a power failure would
> > > > > > leave the file in an inconsistent state where the write pointer has
> > > > > > advanced but the data was not written.)
> > > > >
> > > > > The crash-consistency is a concern...
> > > >
> > > > Yes, I'm thinking about crash-consistency. The ordering you described
> > > > can result in qcow2 images where the write pointer is ahead of the
> > > > actually written data after a power failure or maybe a QEMU crash.
> > > >
> > > > QEMU's block layer must follow the same data integrity behavior that
> > > > real devices guarantee.
> > >
> > > I may have found a solution to deal with both cases. The fix is to
> > > update wp in memory instead of flushing it before qcow2 metadata and
> > > data writes. The zone append write path would become:
> > >
> > > On submission:
> > >
> > > 1) wp_lock()
> > > 2) Check write alignment
> > > 3) wp_update (in memory)
> > > 4) wp_unlock()
> > > 5) Issue write
> > >
> > > And on completion:
> > > 1) If no error: wp_flush with locks and return success
> >
> > The data may not be visible in the qcow2 file yet because qcow2's L1/L2/refcount
> > cache is not written back to the file until a flush request. I think the
> > write pointer updates should have a dependency on the qcow2 metadata so
> > that write pointers are only written after qcow2 metadata.
>
> Indeed. The qcow2 cache was also my concern. Since wp should be
> persisted after corresponding data is flushed, the cache dependency
> would be qcow2 metadata -> data -> wp. Can we set wp's dependency on
> the data so that wp is written after data is persisted? I might be
> missing something here.

The cached metadata is written after the data, so you don't need to do
anything special to ensure data -> qcow2 metadata -> wp ordering.

One thing to consider is when to increment the write pointer in the
cache. When there are concurrent requests, the wp written to the file
should reflect the last _completed_ data write and not in-flight data
writes. It might be necessary to use additional state rather than
incrementing the wp cache immediately when submitting a write request.
For example, the next wp could be calculated by iterating over the
in-flight write requests and taking the maximum offset + length, falling
back to the wp cache only when there are no in-flight append requests in
this zone.
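
For illustration, here is a minimal sketch of that calculation in C. The
ZoneAppendReq and ZonedState types and the next_append_offset() helper are
hypothetical names made up for this example; they are not existing QEMU
structures or functions. The sketch assumes the wp cache only ever covers
completed writes and that in-flight appends are kept on a simple linked list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-flight zone append request (not a QEMU structure). */
typedef struct ZoneAppendReq {
    uint64_t offset;            /* start offset assigned at submission, bytes */
    uint64_t len;               /* request length, bytes */
    uint32_t zone;              /* index of the target zone */
    struct ZoneAppendReq *next; /* next in-flight request, or NULL */
} ZoneAppendReq;

/* Hypothetical per-device state (assumption for this sketch). */
typedef struct ZonedState {
    uint64_t *wp_cache;         /* per zone: wp covering completed writes only */
    ZoneAppendReq *inflight;    /* list of in-flight append requests */
} ZonedState;

/*
 * Return the offset at which the next append to @zone should land.
 * Starts from the cached wp and raises it to the highest end offset
 * (offset + len) of any in-flight append to the same zone, so the cached
 * wp is the result only when the zone has no in-flight appends.
 */
static uint64_t next_append_offset(const ZonedState *s, uint32_t zone)
{
    assert(s != NULL && s->wp_cache != NULL);

    uint64_t next = s->wp_cache[zone];

    for (const ZoneAppendReq *req = s->inflight; req; req = req->next) {
        if (req->zone != zone) {
            continue;
        }
        uint64_t end = req->offset + req->len;
        if (end > next) {
            next = end;
        }
    }
    return next;
}
```

The completion side is deliberately left out: it would still need to decide
when the wp cache itself may advance, for instance when requests complete
out of order.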
>
> >
> > See block/qcow2-cache.c and qcow2_cache_set_dependency(). The idea is
> > that one type of cached metadata can set a dependency on another type of
> > cached metadata so that ordering is guaranteed.
>
> Thanks, I'll check it out.

By the way, I think this will require making the wp metadata a qcow2
cache object that is created with qcow2_cache_create().
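
To make the dependency idea concrete, here is a toy model in C. It is only
a sketch of the ordering concept: ToyCache, FlushLog,
toy_cache_set_dependency() and toy_cache_flush() are hypothetical and do
not use or reproduce QEMU's Qcow2Cache API. The assumption being modeled is
that flushing a cache first flushes whatever it depends on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a metadata cache; NOT QEMU's Qcow2Cache. */
typedef struct ToyCache {
    const char *name;
    bool dirty;                 /* true if there is data to write back */
    struct ToyCache *depends;   /* cache that must be flushed first */
} ToyCache;

/* Records the order in which caches were written back. */
typedef struct FlushLog {
    const char *order[8];
    size_t count;
} FlushLog;

/* Require @dependency to be flushed before @c is written back. */
static void toy_cache_set_dependency(ToyCache *c, ToyCache *dependency)
{
    assert(c != dependency);
    c->depends = dependency;
}

/* Flush the dependency (if any) first, then write back @c itself. */
static void toy_cache_flush(ToyCache *c, FlushLog *log)
{
    if (c->depends) {
        toy_cache_flush(c->depends, log);
        c->depends = NULL;      /* dependency satisfied */
    }
    if (c->dirty) {
        assert(log->count < 8);
        log->order[log->count++] = c->name;
        c->dirty = false;
    }
}
```

With a dirty "wp" cache that depends on a dirty "l2" cache, flushing "wp"
in this model records "l2" before "wp", which is the ordering wanted for
the write pointers relative to the qcow2 metadata.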

Stefan
>
> >
> > > 2) else, wp_lock()
> > > 3) read_wp (from disk) and use the read wp value as the current wp
> > > 4) wp_unlock()
> > > 5) return IO error
> > >
> > > Sam
> > >
> > > >
> > > > Damien: Do real zoned block devices guarantee that the updated write
> > > > pointer is persisted only after appended data has been
> > > > persisted?
> > > >
> > > > Stefan
> > >
>