* [PATCH 0/2] block/io: fix reproducible silent data corruption in write-vs-discard race
@ 2026-04-21 15:56 Denis V. Lunev via qemu development
2026-04-21 15:56 ` [PATCH 1/2] block/io: serialise discard and write-zeroes against in-flight writes Denis V. Lunev via qemu development
2026-04-21 15:56 ` [PATCH 2/2] iotests: regression test for discard/write-zeroes vs in-flight write race Denis V. Lunev via qemu development
0 siblings, 2 replies; 4+ messages in thread
From: Denis V. Lunev via qemu development @ 2026-04-21 15:56 UTC (permalink / raw)
To: qemu-devel, qemu-block, qemu-stable
Cc: den, Stefan Hajnoczi, Kevin Wolf, Hanna Reitz
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Hanna Reitz <hreitz@redhat.com>
This series fixes a qemu block-layer race between an in-flight
format-driver write and a concurrent discard or MAY_UNMAP
write-zeroes on the same guest range. The race has been latent
in upstream since v1.0 (commit 68d100e905 "qcow2: Use coroutines",
2011-06-30) and has been producing silent qcow2 metadata
corruption in production.
Mechanism
---------
qcow2's write path drops s->lock around the data I/O of an
allocating write. If a discard / pwrite_zeroes(MAY_UNMAP) on the
same guest offset lands in that window, it clears the L2 entry
and decrements the cluster's refcount to zero; the writer then
reacquires the lock and unconditionally writes L2[G] =
alloc_offset | OFLAG_COPIED onto the now-freed cluster. The next
allocation re-hands the cluster out and we end up with two L2
entries aliasing one host cluster. Patch 1/2 carries the
per-frame diagram of the interleaving.
On-disk signature: qemu-img check reports refcount=0 with a live
OFLAG_COPIED reference, or refcount < reference. Runtime
signature: "qcow2_free_clusters failed: Invalid argument" on
stderr with no guest-visible error.
Production consequences
-----------------------
* Silent data drift. Once two guest offsets share one host
cluster, writes through either alias overwrite bytes the
other alias owns. The guest reads back bytes it never
wrote, with no I/O error hit anywhere in the stack.
* Guest-filesystem corruption. ext4 discovers the resulting
inconsistency and remounts read-only. Because the backing
qemu returned success for every request, nothing in the
guest's own block layer logs anything; kernels have been
observed stopping all FS writes silently for hours until
userspace tries to write.
* Latent poisoning with multi-day incubation. A block-job
(commit, stream, active mirror, legacy block-migrate +
commit on a destination) running concurrently with guest
discard traffic plants aliased clusters that may not
produce a symptom until a later guest discard walks one
of them. Cases in the wild have surfaced 8 to 17 days
after the originating block-job window.
* Recovery requires both fsck inside the guest AND
qemu-img check -r all on the host -- the former repairs
the ext4 level, the latter repairs the qcow2 refcount/L2
aliasing; fsck alone leaves the image to re-corrupt the
moment writes to an aliased cluster resume.
Why it surfaces only under block-jobs
-------------------------------------
Guest-only I/O rarely opens the race window: the guest's own
block layer serialises DISCARD and WRITE to the same LBA range,
so at any moment a cluster is either "being written" or "being
discarded" from the guest, not both. The race requires a second
I/O producer on the same BDS that does not observe guest-side
ordering -- i.e. a block job. Every migration / commit / backup
/ mirror cycle is an exposure window; steady-state VMs are
essentially immune until the next image-management operation
runs.
Fix
---
Patch 1/2 marks both pdiscard and all pwrite_zeroes (with or
without MAY_UNMAP) as BDRV_REQ_SERIALISING in the generic block
layer. Their tracked_request then waits for overlapping
in-flight writes -- including non-serialising ones -- to finish
their format-driver commit before any L2/refcount mutation
happens.
The gate lives in block/io.c rather than in qcow2 so that:
* every format driver that drops an internal mutex during
the data I/O of an allocating write is covered, not just
qcow2;
* the NBD WRITE_ZEROES path (blk_co_pwrite_zeroes ->
blk_co_pwritev -> bdrv_co_pwritev_part ->
bdrv_aligned_pwritev, which bypasses the
bdrv_co_pwrite_zeroes wrapper entirely) is still caught
-- the gate is placed where BDRV_REQ_ZERO_WRITE is
observed on the way down to the driver.
Perf impact is limited to the overlap window. The serialising
request only waits when a conflict actually exists, which is
exactly the corruption surface. Steady-state non-overlapping
traffic pays nothing.
Test
----
Patch 2/2 adds a deterministic iotest
(tests/qemu-iotests/tests/discard-write-serialisation) that
drives a single qemu-io process with a fixed-seed 5000-command
sequence of interleaved aio_write and aio_write -z -u at random
cluster-aligned offsets in a small contention region, then runs
qemu-img check and asserts zero corruptions. Results across 8
runs each:
fixed tree: 8/8 clean
unfixed tree: 8/8 detect (2-4 corruptions per run)
100% detection on the unfixed tree, zero false positives on the
fixed tree, under 30 seconds per run. The test is scoped to
qcow2 because qcow2 is the format whose qemu-img check validates
the fingerprint; the underlying race is format-agnostic.
Denis V. Lunev (2):
block/io: serialise discard and write-zeroes against in-flight writes
iotests: regression test for discard/write-zeroes vs in-flight write
race
block/io.c | 25 ++++-
.../tests/discard-write-serialisation | 97 +++++++++++++++++++
.../tests/discard-write-serialisation.out | 1 +
3 files changed, 122 insertions(+), 1 deletion(-)
create mode 100755 tests/qemu-iotests/tests/discard-write-serialisation
create mode 100644 tests/qemu-iotests/tests/discard-write-serialisation.out
--
2.51.0
^ permalink raw reply [flat|nested] 4+ messages in thread
* [PATCH 1/2] block/io: serialise discard and write-zeroes against in-flight writes
2026-04-21 15:56 [PATCH 0/2] block/io: fix reproducible silent data corruption in write-vs-discard race Denis V. Lunev via qemu development
@ 2026-04-21 15:56 ` Denis V. Lunev via qemu development
2026-04-27 17:04 ` Kevin Wolf
2026-04-21 15:56 ` [PATCH 2/2] iotests: regression test for discard/write-zeroes vs in-flight write race Denis V. Lunev via qemu development
1 sibling, 1 reply; 4+ messages in thread
From: Denis V. Lunev via qemu development @ 2026-04-21 15:56 UTC (permalink / raw)
To: qemu-devel, qemu-block, qemu-stable
Cc: den, Stefan Hajnoczi, Kevin Wolf, Hanna Reitz
qcow2's write path drops s->lock around the data I/O of an allocating
write. A concurrent discard (or MAY_UNMAP write-zeroes) on the same
guest offset lands the cluster-free operation in that window. The
original writer then reacquires the lock and unconditionally writes
L2[G] = alloc_offset | OFLAG_COPIED on its now-stale l2meta, binding
the L2 entry to a freed cluster:
WRITE coroutine DISCARD coroutine
--------------- -----------------
qcow2_co_pwritev_part:
lock(s->lock)
qcow2_alloc_host_offset:
handle_copied reads L2[G] = C | OFLAG_COPIED
builds l2meta { alloc=C, keep_old_clusters=true }
unlock(s->lock) -->
bdrv_co_pwritev_part (data I/O) lock(s->lock)
qcow2_co_pdiscard on G:
discard_in_l2_slice
set_l2_entry(G, 0)
free_any_cluster(C):
rc(C) 1 -> 0
unlock(s->lock)
lock(s->lock)
qcow2_handle_l2meta(link_l2=true):
qcow2_alloc_cluster_link_l2:
set_l2_entry(G, C | OFLAG_COPIED) <- stale alloc onto
freed cluster
The next allocator pass re-hands C out on rc=0, so we end up with two
L2 entries aliasing one host cluster. On disk this shows up in
qemu-img check as refcount=0 with a live OFLAG_COPIED reference or as
refcount < reference; at runtime the next discard on either alias
prints "qcow2_free_clusters failed: Invalid argument" on stderr with
no guest-visible error.
Mark both discards and all write-zeroes (with or without MAY_UNMAP)
as BDRV_REQ_SERIALISING in the generic block layer. Their
tracked_request then waits for overlapping in-flight writes,
including non-serialising ones, to finish their format-driver commit
before any L2/refcount mutation happens.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Hanna Reitz <hreitz@redhat.com>
---
block/io.c | 25 ++++++++++++++++++++++++-
1 file changed, 24 insertions(+), 1 deletion(-)
diff --git a/block/io.c b/block/io.c
index dd5f13c694..9f23029b95 100644
--- a/block/io.c
+++ b/block/io.c
@@ -2097,6 +2097,16 @@ bdrv_aligned_pwritev(BdrvChild *child, BdrvTrackedRequest *req,
max_transfer = QEMU_ALIGN_DOWN(MIN_NON_ZERO(bs->bl.max_transfer, INT_MAX),
align);
+ /*
+ * Zero-writes (with or without MAY_UNMAP) mutate L2 entries / refcounts
+ * in the format driver and therefore race with concurrent in-flight
+ * regular writes that have dropped their internal mutex for the data
+ * I/O. See the comment in bdrv_co_pdiscard(). Serialise them.
+ */
+ if (flags & BDRV_REQ_ZERO_WRITE) {
+ flags |= BDRV_REQ_SERIALISING;
+ }
+
ret = bdrv_co_write_req_prepare(child, offset, bytes, req, flags);
if (!ret && bs->detect_zeroes != BLOCKDEV_DETECT_ZEROES_OPTIONS_OFF &&
@@ -3192,7 +3202,20 @@ int coroutine_fn bdrv_co_pdiscard(BdrvChild *child, int64_t offset,
bdrv_inc_in_flight(bs);
tracked_request_begin(&req, bs, offset, bytes, BDRV_TRACKED_DISCARD);
- ret = bdrv_co_write_req_prepare(child, offset, bytes, &req, 0);
+ /*
+ * Discards must serialise against overlapping in-flight writes.
+ * A format driver's write path may drop its internal mutex around
+ * the data I/O while still holding a pending cluster-allocation
+ * commit (see qcow2's handle_copied / qcow2_alloc_cluster_link_l2
+ * sequence). A concurrent discard that clears L2 and drops the
+ * refcount during that window leaves the writer pointing at a
+ * freed cluster - the root of the refcount/reference aliasing
+ * corruption family. Marking the discard serialising makes it wait
+ * for the in-flight write's tracked_request to complete before any
+ * L2/refcount mutation happens.
+ */
+ ret = bdrv_co_write_req_prepare(child, offset, bytes, &req,
+ BDRV_REQ_SERIALISING);
if (ret < 0) {
goto out;
}
--
2.51.0
^ permalink raw reply related [flat|nested] 4+ messages in thread
* [PATCH 2/2] iotests: regression test for discard/write-zeroes vs in-flight write race
2026-04-21 15:56 [PATCH 0/2] block/io: fix reproducible silent data corruption in write-vs-discard race Denis V. Lunev via qemu development
2026-04-21 15:56 ` [PATCH 1/2] block/io: serialise discard and write-zeroes against in-flight writes Denis V. Lunev via qemu development
@ 2026-04-21 15:56 ` Denis V. Lunev via qemu development
1 sibling, 0 replies; 4+ messages in thread
From: Denis V. Lunev via qemu development @ 2026-04-21 15:56 UTC (permalink / raw)
To: qemu-devel, qemu-block, qemu-stable
Cc: den, Stefan Hajnoczi, Kevin Wolf, Hanna Reitz
Add tests/qemu-iotests/tests/discard-write-serialisation, a deterministic
regression test for the race fixed in the previous commit.
Drive a single qemu-io process with a fixed-seed interleaved sequence of
async aio_write and aio_write -z -u commands at random cluster-aligned
offsets in a small contention region, then run qemu-img check and assert
zero corruptions. On an unpatched tree the same workload reproduces the
refcount-aliasing fingerprint deterministically; on the fixed tree the
image comes back clean.
The test is scoped to qcow2 because qcow2 is the format whose qemu-img
check validates refcount/reference consistency and therefore actually
detects the fingerprint. The underlying race is in the generic block
layer and not format-specific, but a test that asserts "qemu-img check
returns zero corruptions" only has signal on formats that run such a
check.
Signed-off-by: Denis V. Lunev <den@openvz.org>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>
Cc: Hanna Reitz <hreitz@redhat.com>
---
.../tests/discard-write-serialisation | 97 +++++++++++++++++++
.../tests/discard-write-serialisation.out | 1 +
2 files changed, 98 insertions(+)
create mode 100755 tests/qemu-iotests/tests/discard-write-serialisation
create mode 100644 tests/qemu-iotests/tests/discard-write-serialisation.out
diff --git a/tests/qemu-iotests/tests/discard-write-serialisation b/tests/qemu-iotests/tests/discard-write-serialisation
new file mode 100755
index 0000000000..45a3f7f043
--- /dev/null
+++ b/tests/qemu-iotests/tests/discard-write-serialisation
@@ -0,0 +1,97 @@
+#!/usr/bin/env python3
+# group: rw quick auto
+#
+# Regression test for the block-layer race fixed in
+# block/io: serialise discard and write-zeroes against in-flight writes.
+#
+# A format driver's write path may drop its internal mutex around the
+# data I/O of an allocating write (qcow2 does so between
+# qcow2_alloc_host_offset and qcow2_alloc_cluster_link_l2). A
+# concurrent discard or MAY_UNMAP write-zeroes on the same guest range,
+# running in that window, can clear the L2 entry and drop the cluster's
+# refcount to zero; the writer's subsequent link then binds the L2
+# entry to a freed cluster. qemu-img check reports this as refcount=0
+# with a live OFLAG_COPIED reference, or refcount < reference when the
+# allocator re-hands the cluster out.
+#
+# The bug is in the generic block layer, not format-specific; qcow2 is
+# the detection vehicle because its refcount validation in qemu-img
+# check catches the fingerprint. The test drives a single qemu-io
+# process with interleaved async aio_write and aio_write -z -u commands
+# at random cluster-aligned offsets in a small contention region, then
+# runs qemu-img check and asserts zero corruptions. On an unpatched
+# tree the same workload reproduces the fingerprint deterministically
+# (seed is fixed).
+#
+# SPDX-License-Identifier: GPL-2.0-or-later
+
+import random
+import subprocess
+
+import iotests
+from iotests import qemu_img_create, qemu_img_check, qemu_io_wrap_args
+
+
+iotests.script_initialize(supported_fmts=['qcow2'],
+ supported_platforms=['linux'])
+
+IMG_SIZE = 256 * 1024 * 1024 # 256 MiB
+REGION = 64 * 1024 * 1024 # contention region: 64 MiB
+CLUSTER = 1024 * 1024 # 1 MiB
+SUBCLUSTER = 32 * 1024 # 32 KiB
+OPS = 5000
+SEED = 7
+
+def build_commands() -> bytes:
+ rng = random.Random(SEED)
+ max_cluster = REGION // CLUSTER - 1
+ lines = []
+ for _ in range(OPS):
+ cl = rng.randint(0, max_cluster)
+ off = cl * CLUSTER
+ if rng.random() < 0.5:
+ # Small sub-cluster write at an unaligned position inside
+ # the cluster -- exercises the handle_copied path and the
+ # s->lock drop around the data I/O.
+ sub = rng.randrange(0, CLUSTER, SUBCLUSTER)
+ lines.append(f'aio_write -q {off + sub} 32k')
+ else:
+ # MAY_UNMAP write-zeroes aligned to the cluster -- frees
+ # clusters at the format driver level and is the concurrent
+ # cluster-free source that races with the in-flight writes.
+ lines.append(f'aio_write -q -z -u {off} 1M')
+ lines.append('aio_flush')
+ return ('\n'.join(lines) + '\n').encode()
+
+
+def main() -> None:
+ with iotests.FilePath('disk.img') as img:
+ qemu_img_create('-f', 'qcow2',
+ '-o', 'cluster_size=1M,extended_l2=on,'
+ 'lazy_refcounts=on,refcount_bits=16',
+ img, str(IMG_SIZE))
+
+ # Run qemu-io with async AIO. --cache=none and --aio=native ensure
+ # the writer coroutine actually yields around its data I/O (which
+ # is what opens the race window). Swallow stdout/stderr: the
+ # result we care about is the on-disk state, checked below.
+ args = qemu_io_wrap_args(['-f', 'qcow2', '-n',
+ '--cache=none', '--aio=native', img])
+ subprocess.run(args, input=build_commands(),
+ stdout=subprocess.DEVNULL,
+ stderr=subprocess.DEVNULL,
+ check=True)
+
+ result = qemu_img_check(img)
+ corruptions = result.get('corruptions', 0)
+ check_errors = result.get('check-errors', 0)
+ if corruptions or check_errors:
+ iotests.log(f'FAIL: qemu-img check reports '
+ f'corruptions={corruptions} '
+ f'check-errors={check_errors}')
+ else:
+ iotests.log('OK')
+
+
+if __name__ == '__main__':
+ main()
diff --git a/tests/qemu-iotests/tests/discard-write-serialisation.out b/tests/qemu-iotests/tests/discard-write-serialisation.out
new file mode 100644
index 0000000000..d86bac9de5
--- /dev/null
+++ b/tests/qemu-iotests/tests/discard-write-serialisation.out
@@ -0,0 +1 @@
+OK
--
2.51.0
^ permalink raw reply related [flat|nested] 4+ messages in thread
* Re: [PATCH 1/2] block/io: serialise discard and write-zeroes against in-flight writes
2026-04-21 15:56 ` [PATCH 1/2] block/io: serialise discard and write-zeroes against in-flight writes Denis V. Lunev via qemu development
@ 2026-04-27 17:04 ` Kevin Wolf
0 siblings, 0 replies; 4+ messages in thread
From: Kevin Wolf @ 2026-04-27 17:04 UTC (permalink / raw)
To: Denis V. Lunev
Cc: qemu-devel, qemu-block, qemu-stable, Stefan Hajnoczi, Hanna Reitz
Am 21.04.2026 um 17:56 hat Denis V. Lunev geschrieben:
> qcow2's write path drops s->lock around the data I/O of an allocating
> write. A concurrent discard (or MAY_UNMAP write-zeroes) on the same
> guest offset lands the cluster-free operation in that window. The
> original writer then reacquires the lock and unconditionally writes
> L2[G] = alloc_offset | OFLAG_COPIED on its now-stale l2meta, binding
> the L2 entry to a freed cluster:
>
> WRITE coroutine DISCARD coroutine
> --------------- -----------------
> qcow2_co_pwritev_part:
> lock(s->lock)
> qcow2_alloc_host_offset:
> handle_copied reads L2[G] = C | OFLAG_COPIED
> builds l2meta { alloc=C, keep_old_clusters=true }
> unlock(s->lock) -->
> bdrv_co_pwritev_part (data I/O) lock(s->lock)
> qcow2_co_pdiscard on G:
> discard_in_l2_slice
> set_l2_entry(G, 0)
> free_any_cluster(C):
> rc(C) 1 -> 0
> unlock(s->lock)
> lock(s->lock)
> qcow2_handle_l2meta(link_l2=true):
> qcow2_alloc_cluster_link_l2:
> set_l2_entry(G, C | OFLAG_COPIED) <- stale alloc onto
> freed cluster
>
> The next allocator pass re-hands C out on rc=0, so we end up with two
> L2 entries aliasing one host cluster. On disk this shows up in
> qemu-img check as refcount=0 with a live OFLAG_COPIED reference or as
> refcount < reference; at runtime the next discard on either alias
> prints "qcow2_free_clusters failed: Invalid argument" on stderr with
> no guest-visible error.
I mostly agree with the problem description, except that I don't think
it's qcow2_alloc_host_offset(), but rather handle_copied(). If it were a
completely new cluster, it wouldn't be mapped in the L2 table yet and
discard wouldn't even see it. But when an existing cluster needs to have
its L2 entry modified (e.g. pre-allocated zero cluster, or with
subclusters), then we get the problem.
> Mark both discards and all write-zeroes (with or without MAY_UNMAP)
> as BDRV_REQ_SERIALISING in the generic block layer. Their
> tracked_request then waits for overlapping in-flight writes,
> including non-serialising ones, to finish their format-driver commit
> before any L2/refcount mutation happens.
I don't think this is a good fix.
On the one hand it's a very big hammer, serialising all requests with
discards/write zeroes while only a small subset of writes really
constitute a problem.
On the other hand, it's not enough to ensure consistency. Writes also
have to be serialised against conflicting other writes, so this kind of
synchronisation is already the block driver's problem. And in fact,
qcow2 does have the mechanism for it (s->cluster_allocs). Discards and
write_zeroes requests just don't look at it yet. This is easy enough to
fix in qcow2.
If another driver drops the lock for writes, it has to build a similar
synchronisation mechanism either way.
So just serialising some requests that we suspect might exhibit bugs in
the block driver feels like it's just papering over the problem instead
of solving it.
I'll send an alternative proposal in a moment.
Kevin
^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2026-04-27 17:05 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-04-21 15:56 [PATCH 0/2] block/io: fix reproducible silent data corruption in write-vs-discard race Denis V. Lunev via qemu development
2026-04-21 15:56 ` [PATCH 1/2] block/io: serialise discard and write-zeroes against in-flight writes Denis V. Lunev via qemu development
2026-04-27 17:04 ` Kevin Wolf
2026-04-21 15:56 ` [PATCH 2/2] iotests: regression test for discard/write-zeroes vs in-flight write race Denis V. Lunev via qemu development
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.