Re: [PATCH 1/2] block/io: serialise discard and write-zeroes against in-flight writes

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Kevin Wolf <kwolf@redhat.com>
To: "Denis V. Lunev" <den@openvz.org>
Cc: qemu-devel@nongnu.org, qemu-block@nongnu.org,
	qemu-stable@nongnu.org, Stefan Hajnoczi <stefanha@redhat.com>,
	Hanna Reitz <hreitz@redhat.com>
Subject: Re: [PATCH 1/2] block/io: serialise discard and write-zeroes against in-flight writes
Date: Mon, 27 Apr 2026 19:04:28 +0200	[thread overview]
Message-ID: <ae-XHByosS2Q4vHC@redhat.com> (raw)
In-Reply-To: <20260421155628.3600671-2-den@openvz.org>

Am 21.04.2026 um 17:56 hat Denis V. Lunev geschrieben:
> qcow2's write path drops s->lock around the data I/O of an allocating
> write. A concurrent discard (or MAY_UNMAP write-zeroes) on the same
> guest offset lands the cluster-free operation in that window. The
> original writer then reacquires the lock and unconditionally writes
> L2[G] = alloc_offset | OFLAG_COPIED on its now-stale l2meta, binding
> the L2 entry to a freed cluster:
> 
>   WRITE coroutine                       DISCARD coroutine
>   ---------------                       -----------------
>   qcow2_co_pwritev_part:
>     lock(s->lock)
>     qcow2_alloc_host_offset:
>       handle_copied reads L2[G] = C | OFLAG_COPIED
>       builds l2meta { alloc=C, keep_old_clusters=true }
>     unlock(s->lock)                 -->
>     bdrv_co_pwritev_part (data I/O)    lock(s->lock)
>                                        qcow2_co_pdiscard on G:
>                                          discard_in_l2_slice
>                                            set_l2_entry(G, 0)
>                                            free_any_cluster(C):
>                                              rc(C) 1 -> 0
>                                        unlock(s->lock)
>     lock(s->lock)
>     qcow2_handle_l2meta(link_l2=true):
>       qcow2_alloc_cluster_link_l2:
>         set_l2_entry(G, C | OFLAG_COPIED)  <- stale alloc onto
>                                               freed cluster
> 
> The next allocator pass re-hands C out on rc=0, so we end up with two
> L2 entries aliasing one host cluster. On disk this shows up in
> qemu-img check as refcount=0 with a live OFLAG_COPIED reference or as
> refcount < reference; at runtime the next discard on either alias
> prints "qcow2_free_clusters failed: Invalid argument" on stderr with
> no guest-visible error.

I mostly agree with the problem description, except that I don't think
it's qcow2_alloc_host_offset(), but rather handle_copied(). If it were a
completely new cluster, it wouldn't be mapped in the L2 table yet and
discard wouldn't even see it. But when an existing cluster needs to have
its L2 entry modified (e.g. pre-allocated zero cluster, or with
subclusters), then we get the problem.

> Mark both discards and all write-zeroes (with or without MAY_UNMAP)
> as BDRV_REQ_SERIALISING in the generic block layer. Their
> tracked_request then waits for overlapping in-flight writes,
> including non-serialising ones, to finish their format-driver commit
> before any L2/refcount mutation happens.

I don't think this is a good fix.

On the one hand it's a very big hammer, serialising all requests with
discards/write zeroes while only a small subset of writes really
constitute a problem.

On the other hand, it's not enough to ensure consistency. Writes also
have to be serialised against conflicting other writes, so this kind of
synchronisation is already the block driver's problem. And in fact,
qcow2 does have the mechanism for it (s->cluster_allocs). Discards and
write_zeroes requests just don't look at it yet. This is easy enough to
fix in qcow2.

If another driver drops the lock for writes, it has to build a similar
synchronisation mechanism either way.

So just serialising some requests that we suspect might exhibit bugs in
the block driver feels like it's just papering over the problem instead
of solving it.

I'll send an alternative proposal in a moment.

Kevin

next prev parent reply	other threads:[~2026-04-27 17:05 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-04-21 15:56 [PATCH 0/2] block/io: fix reproducible silent data corruption in write-vs-discard race Denis V. Lunev via qemu development
2026-04-21 15:56 ` [PATCH 1/2] block/io: serialise discard and write-zeroes against in-flight writes Denis V. Lunev via qemu development
2026-04-27 17:04   ` Kevin Wolf [this message]
2026-04-21 15:56 ` [PATCH 2/2] iotests: regression test for discard/write-zeroes vs in-flight write race Denis V. Lunev via qemu development

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ae-XHByosS2Q4vHC@redhat.com \
    --to=kwolf@redhat.com \
    --cc=den@openvz.org \
    --cc=hreitz@redhat.com \
    --cc=qemu-block@nongnu.org \
    --cc=qemu-devel@nongnu.org \
    --cc=qemu-stable@nongnu.org \
    --cc=stefanha@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.