[PATCH 0/4] qcow2: Fix corruption on discard during write with COW

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/4] qcow2: Fix corruption on discard during write with COW
@ 2026-04-27 17:05 Kevin Wolf
  2026-04-27 17:05 ` [PATCH 1/4] commit: Drain nodes across all of bdrv_commit() Kevin Wolf
                   ` (4 more replies)
  0 siblings, 5 replies; 19+ messages in thread
From: Kevin Wolf @ 2026-04-27 17:05 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, hreitz, den, stefanha, qemu-stable, qemu-devel

This is an alternative fix for the corruption problem reported by Denis:
https://patchew.org/QEMU/20260421155628.3600671-1-den@openvz.org/

I think it really is a qcow2 level bug and should be fixed on the qcow2
level. The test is also more targeted than just doing random I/O for a
while, and therefore runs faster and documents the problem better.

Kevin Wolf (4):
  commit: Drain nodes across all of bdrv_commit()
  qemu-io: Add 'aio_discard' command
  qcow2: Fix corruption on discard during write with COW
  iotests/046: Test that discard/write_zeroes wait for dependencies

 block/commit.c             |  10 +++-
 block/qcow2-cluster.c      |  52 ++++++++++++++++-
 qemu-io-cmds.c             | 113 +++++++++++++++++++++++++++++++++++++
 tests/qemu-iotests/046     |  46 +++++++++++++++
 tests/qemu-iotests/046.out |  36 ++++++++++++
 5 files changed, 252 insertions(+), 5 deletions(-)

-- 
2.53.0



^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 1/4] commit: Drain nodes across all of bdrv_commit()
  2026-04-27 17:05 [PATCH 0/4] qcow2: Fix corruption on discard during write with COW Kevin Wolf
@ 2026-04-27 17:05 ` Kevin Wolf
  2026-04-29 15:06   ` Fiona Ebner
  2026-04-27 17:05 ` [PATCH 2/4] qemu-io: Add 'aio_discard' command Kevin Wolf
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 19+ messages in thread
From: Kevin Wolf @ 2026-04-27 17:05 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, hreitz, den, stefanha, qemu-stable, qemu-devel

The whole implementation of bdrv_commit() is only correct if no new
writes come in while it's running: It has only a single loop checking
the allocation status for each block and finally calls bdrv_make_empty()
without checking if that throws away any new changes.

We already have to drain while taking the graph write lock. Just extend
the drained section to all of bdrv_commit() to make sure that we don't
get any inconsistencies.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/commit.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/block/commit.c b/block/commit.c
index 0d9e1a16d7a..c5e3ef03a21 100644
--- a/block/commit.c
+++ b/block/commit.c
@@ -518,6 +518,7 @@ int bdrv_commit(BlockDriverState *bs)
     if (!drv)
         return -ENOMEDIUM;
 
+    bdrv_drain_all_begin();
     bdrv_graph_rdlock_main_loop();
 
     backing_file_bs = bdrv_cow_bs(bs);
@@ -549,6 +550,10 @@ int bdrv_commit(BlockDriverState *bs)
                   BLK_PERM_ALL);
     backing = blk_new(ctx, BLK_PERM_WRITE | BLK_PERM_RESIZE, BLK_PERM_ALL);
 
+    /* We drained all nodes, but still make requests through BlockBackends */
+    blk_set_disable_request_queuing(src, true);
+    blk_set_disable_request_queuing(backing, true);
+
     ret = blk_insert_bs(src, bs, &local_err);
     if (ret < 0) {
         error_report_err(local_err);
@@ -565,7 +570,7 @@ int bdrv_commit(BlockDriverState *bs)
 
     bdrv_graph_rdunlock_main_loop();
 
-    bdrv_graph_wrlock_drained();
+    bdrv_graph_wrlock();
     bdrv_set_backing_hd(commit_top_bs, backing_file_bs, &error_abort);
     bdrv_set_backing_hd(bs, commit_top_bs, &error_abort);
     bdrv_graph_wrunlock();
@@ -647,7 +652,7 @@ ro_cleanup:
     blk_unref(backing);
 
     bdrv_graph_rdunlock_main_loop();
-    bdrv_graph_wrlock_drained();
+    bdrv_graph_wrlock();
     if (bdrv_cow_bs(bs) != backing_file_bs) {
         bdrv_set_backing_hd(bs, backing_file_bs, &error_abort);
     }
@@ -663,6 +668,7 @@ ro_cleanup:
 
 out:
     bdrv_graph_rdunlock_main_loop();
+    bdrv_drain_all_end();
 
     return ret;
 }
-- 
2.53.0



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 2/4] qemu-io: Add 'aio_discard' command
  2026-04-27 17:05 [PATCH 0/4] qcow2: Fix corruption on discard during write with COW Kevin Wolf
  2026-04-27 17:05 ` [PATCH 1/4] commit: Drain nodes across all of bdrv_commit() Kevin Wolf
@ 2026-04-27 17:05 ` Kevin Wolf
  2026-04-28 10:59   ` Denis V. Lunev
  2026-04-27 17:05 ` [PATCH 3/4] qcow2: Fix corruption on discard during write with COW Kevin Wolf
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 19+ messages in thread
From: Kevin Wolf @ 2026-04-27 17:05 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, hreitz, den, stefanha, qemu-stable, qemu-devel

Testing interactions between multiple requests that include discard
requests require that qemu-io can do the discard asynchronously, like it
already does for reads and writes. To this effect, add an 'aio_discard'
command.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 qemu-io-cmds.c | 113 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 113 insertions(+)

diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c
index 13e03301624..0f5baa70f94 100644
--- a/qemu-io-cmds.c
+++ b/qemu-io-cmds.c
@@ -2218,6 +2218,118 @@ static int discard_f(BlockBackend *blk, int argc, char **argv)
     return 0;
 }
 
+static void aio_discard_help(void)
+{
+    printf(
+"\n"
+" asynchronously discards a range of bytes from the given offset\n"
+"\n"
+" Example:\n"
+" 'aio_discard 512 1k' - discards 1 kilobyte from 512 bytes into the file\n"
+"\n"
+" Discards a segment of the currently open file.\n"
+" -C, -- report statistics in a machine parsable format\n"
+" -q, -- quiet mode, do not show I/O statistics\n"
+" The discard is performed asynchronously and the aio_flush command must be\n"
+" used to ensure all outstanding aio requests have been completed.\n"
+" Note that due to its asynchronous nature, this command will be\n"
+" considered successful once the request is submitted, independently\n"
+" of potential I/O errors.\n"
+"\n");
+}
+
+static int aio_discard_f(BlockBackend *blk, int argc, char **argv);
+
+static const cmdinfo_t aio_discard_cmd = {
+    .name       = "aio_discard",
+    .cfunc      = aio_discard_f,
+    .perm       = BLK_PERM_WRITE,
+    .argmin     = 2,
+    .argmax     = -1,
+    .args       = "[-Cq] off len",
+    .oneline    = "asynchronously discards a number of bytes",
+    .help       = aio_discard_help,
+};
+
+static void aio_discard_done(void *opaque, int ret)
+{
+    struct aio_ctx *ctx = opaque;
+    struct timespec t2;
+
+    clock_gettime(CLOCK_MONOTONIC, &t2);
+
+    if (ret < 0) {
+        printf("aio_discard failed: %s\n", strerror(-ret));
+        block_acct_failed(blk_get_stats(ctx->blk), &ctx->acct);
+        goto out;
+    }
+
+    block_acct_done(blk_get_stats(ctx->blk), &ctx->acct);
+
+    if (ctx->qflag) {
+        goto out;
+    }
+
+    /* Finally, report back -- -C gives a parsable format */
+    t2 = tsub(t2, ctx->t1);
+    print_report("discarded ", &t2, ctx->offset, ctx->qiov.size,
+                 ctx->qiov.size, 1, ctx->Cflag);
+out:
+    g_free(ctx);
+}
+
+static int aio_discard_f(BlockBackend *blk, int argc, char **argv)
+{
+    int c, ret;
+    int64_t count;
+    struct aio_ctx *ctx = g_new0(struct aio_ctx, 1);
+
+    ctx->blk = blk;
+
+    while ((c = getopt(argc, argv, "Cq")) != -1) {
+        switch (c) {
+        case 'C':
+            ctx->Cflag = true;
+            break;
+        case 'q':
+            ctx->qflag = true;
+            break;
+        default:
+            qemuio_command_usage(&aio_discard_cmd);
+            return -EINVAL;
+        }
+    }
+
+    if (optind != argc - 2) {
+        qemuio_command_usage(&aio_discard_cmd);
+        return -EINVAL;
+    }
+
+    ctx->offset = cvtnum(argv[optind]);
+    if (ctx->offset < 0) {
+        ret = ctx->offset;
+        print_cvtnum_err(ret, argv[optind]);
+        g_free(ctx);
+        return ret;
+    }
+    optind++;
+
+    count = cvtnum(argv[optind]);
+    if (count < 0) {
+        print_cvtnum_err(count, argv[optind]);
+        g_free(ctx);
+        return count;
+    }
+
+    clock_gettime(CLOCK_MONOTONIC, &ctx->t1);
+    ctx->qiov.size = count;
+    block_acct_start(blk_get_stats(blk), &ctx->acct, ctx->qiov.size,
+                     BLOCK_ACCT_UNMAP);
+    blk_aio_pdiscard(blk, ctx->offset, count, aio_discard_done, ctx);
+
+    return 0;
+}
+
 static int alloc_f(BlockBackend *blk, int argc, char **argv)
 {
     BlockDriverState *bs = blk_bs(blk);
@@ -2800,6 +2912,7 @@ static void __attribute((constructor)) init_qemuio_commands(void)
     qemuio_add_command(&length_cmd);
     qemuio_add_command(&info_cmd);
     qemuio_add_command(&discard_cmd);
+    qemuio_add_command(&aio_discard_cmd);
     qemuio_add_command(&alloc_cmd);
     qemuio_add_command(&map_cmd);
     qemuio_add_command(&reopen_cmd);
-- 
2.53.0



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 3/4] qcow2: Fix corruption on discard during write with COW
  2026-04-27 17:05 [PATCH 0/4] qcow2: Fix corruption on discard during write with COW Kevin Wolf
  2026-04-27 17:05 ` [PATCH 1/4] commit: Drain nodes across all of bdrv_commit() Kevin Wolf
  2026-04-27 17:05 ` [PATCH 2/4] qemu-io: Add 'aio_discard' command Kevin Wolf
@ 2026-04-27 17:05 ` Kevin Wolf
  2026-04-29 15:28   ` Fiona Ebner
  2026-05-21 12:12   ` Fiona Ebner
  2026-04-27 17:05 ` [PATCH 4/4] iotests/046: Test that discard/write_zeroes wait for dependencies Kevin Wolf
  2026-04-28 11:00 ` [PATCH 0/4] qcow2: Fix corruption on discard during write with COW Denis V. Lunev
  4 siblings, 2 replies; 19+ messages in thread
From: Kevin Wolf @ 2026-04-27 17:05 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, hreitz, den, stefanha, qemu-stable, qemu-devel

Most code in qcow2 that accesses (and potentially modifies) L2 tables
does so while holding s->lock.

There is one exception, which is allocating writes. They hold the lock
initially while allocating clusters, but drop it for writing the guest
payload before taking the lock again for updating the L2 tables. This
allows concurrent requests that touch other parts of the image file to
continue in parallel and is an important performance optimisation.

However, this means that other requests that run while the lock is
dropped for writing guest data must synchronise with the list of
allocating requests in s->cluster_allocs and wait if they would overlap.
For writes, this is done in handle_dependencies(), but discard and write
zeros operations neglect to synchronise with s->cluster_allocs.

This means that discard can free a cluster whose L2 entry will already
be modified in qcow2_alloc_cluster_link_l2() by a previously started
write. In the case of a pre-allocated zero cluster that is in the
process of being overwritten, this means that discard can lead to a
situation where the cluster is still mapped (because the write will
restore the L2 entry just without the zero flag), but its refcount has
been decreased, resulting in a corrupted image.

Add the missing synchronisation to qcow2_cluster_discard() and
qcow2_subcluster_zeroize() to fix the problem.

Cc: qemu-stable@nongnu.org
Reported-by: Denis V. Lunev <den@openvz.org>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 block/qcow2-cluster.c | 52 ++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 49 insertions(+), 3 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index c655bf6df42..8b1e80bd0b3 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1392,6 +1392,9 @@ count_single_write_clusters(BlockDriverState *bs, int nb_clusters,
  * the same cluster. In this case we need to wait until the previous
  * request has completed and updated the L2 table accordingly.
  *
+ * If allow_shortening == true, instead of waiting for a dependency, *cur_bytes
+ * can be shortened so that the cluster allocations don't overlap.
+ *
  * Returns:
  *   0       if there was no dependency. *cur_bytes indicates the number of
  *           bytes from guest_offset that can be read before the next
@@ -1403,7 +1406,9 @@ count_single_write_clusters(BlockDriverState *bs, int nb_clusters,
  */
 static int coroutine_fn handle_dependencies(BlockDriverState *bs,
                                             uint64_t guest_offset,
-                                            uint64_t *cur_bytes, QCowL2Meta **m)
+                                            uint64_t *cur_bytes,
+                                            bool allow_shortening,
+                                            QCowL2Meta **m)
 {
     BDRVQcow2State *s = bs->opaque;
     QCowL2Meta *old_alloc;
@@ -1434,7 +1439,7 @@ static int coroutine_fn handle_dependencies(BlockDriverState *bs,
 
         /* Conflict */
 
-        if (start < old_start) {
+        if (start < old_start && allow_shortening) {
             /* Stop at the start of a running allocation */
             bytes = old_start - start;
         } else {
@@ -1469,6 +1474,29 @@ static int coroutine_fn handle_dependencies(BlockDriverState *bs,
     return 0;
 }
 
+static void coroutine_mixed_fn wait_for_dependencies(BlockDriverState *bs,
+                                                     uint64_t guest_offset,
+                                                     uint64_t bytes)
+{
+    BDRVQcow2State *s = bs->opaque;
+    QCowL2Meta *m = NULL;
+    int ret;
+
+    /*
+     * Discard has some non-coroutine callers (creating internal snapshots and
+     * make empty). They are calling from qemu-img or in a drained section, so
+     * we know that no writes can be in progress.
+     */
+    if (!qemu_in_coroutine()) {
+        assert(QLIST_EMPTY(&s->cluster_allocs));
+        return;
+    }
+
+    do {
+        ret = handle_dependencies(bs, guest_offset, &bytes, false, &m);
+    } while (ret == -EAGAIN);
+}
+
 /*
  * Checks how many already allocated clusters that don't require a new
  * allocation there are at the given guest_offset (up to *bytes).
@@ -1840,7 +1868,7 @@ again:
          *         the right synchronisation between the in-flight request and
          *         the new one.
          */
-        ret = handle_dependencies(bs, start, &cur_bytes, m);
+        ret = handle_dependencies(bs, start, &cur_bytes, true, m);
         if (ret == -EAGAIN) {
             /* Currently handle_dependencies() doesn't yield if we already had
              * an allocation. If it did, we would have to clean up the L2Meta
@@ -2000,6 +2028,15 @@ int qcow2_cluster_discard(BlockDriverState *bs, uint64_t offset,
     int64_t cleared;
     int ret;
 
+    /*
+     * If we're touching a cluster for which allocating writes are in flight,
+     * wait for them to complete to avoid conflicting metadata updates.
+     *
+     * We don't need to allocate a QCowL2Meta for the discard operation because
+     * s->lock is held for the duration of the whole operation.
+     */
+    wait_for_dependencies(bs, offset, bytes);
+
     /* Caller must pass aligned values, except at image end */
     assert(QEMU_IS_ALIGNED(offset, s->cluster_size));
     assert(QEMU_IS_ALIGNED(end_offset, s->cluster_size) ||
@@ -2160,6 +2197,15 @@ int coroutine_fn qcow2_subcluster_zeroize(BlockDriverState *bs, uint64_t offset,
     int64_t cleared;
     int ret;
 
+    /*
+     * If we're touching a cluster for which allocating writes are in flight,
+     * wait for them to complete to avoid conflicting metadata updates.
+     *
+     * We don't need to allocate a QCowL2Meta for the zeroize operation because
+     * s->lock is held for the duration of the whole operation.
+     */
+    wait_for_dependencies(bs, offset, bytes);
+
     /* If we have to stay in sync with an external data file, zero out
      * s->data_file first. */
     if (data_file_is_raw(bs)) {
-- 
2.53.0



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 4/4] iotests/046: Test that discard/write_zeroes wait for dependencies
  2026-04-27 17:05 [PATCH 0/4] qcow2: Fix corruption on discard during write with COW Kevin Wolf
                   ` (2 preceding siblings ...)
  2026-04-27 17:05 ` [PATCH 3/4] qcow2: Fix corruption on discard during write with COW Kevin Wolf
@ 2026-04-27 17:05 ` Kevin Wolf
  2026-04-28 11:00 ` [PATCH 0/4] qcow2: Fix corruption on discard during write with COW Denis V. Lunev
  4 siblings, 0 replies; 19+ messages in thread
From: Kevin Wolf @ 2026-04-27 17:05 UTC (permalink / raw)
  To: qemu-block; +Cc: kwolf, hreitz, den, stefanha, qemu-stable, qemu-devel

This is a regression test for the bug fixed in the previous commit where
discard and write_zeroes operations wouldn't consider their dependencies
in s->cluster_allocs. Without the fix, this results in a corrupted
image.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
 tests/qemu-iotests/046     | 46 ++++++++++++++++++++++++++++++++++++++
 tests/qemu-iotests/046.out | 36 +++++++++++++++++++++++++++++
 2 files changed, 82 insertions(+)

diff --git a/tests/qemu-iotests/046 b/tests/qemu-iotests/046
index 4c9ed4d26e1..e03dd401479 100755
--- a/tests/qemu-iotests/046
+++ b/tests/qemu-iotests/046
@@ -184,6 +184,48 @@ aio_write -P 160 0x104000 0x18000
 resume A
 aio_flush
 EOF
+
+# Create a pre-allocated zero cluster, then start a write on it and discard it
+# before the L2 update is made
+cat  <<EOF
+write -P 181 0x120000 0x10000
+write -z 0x120000 0x10000
+
+break write_aio A
+aio_write -P 180 0x120000 0x10000
+wait_break A
+aio_discard 0x120000 0x10000
+resume A
+aio_flush
+EOF
+
+# Create a pre-allocated zero cluster, then start a write on it and a
+# concurrent zero write with MAY_UNMAP before the L2 update is made
+cat  <<EOF
+write -P 181 0x130000 0x10000
+write -z 0x130000 0x10000
+
+break write_aio A
+aio_write -P 180 0x130000 0x10000
+wait_break A
+aio_write -z -u 0x130000 0x10000
+resume A
+aio_flush
+EOF
+
+# Create a pre-allocated zero cluster, then start a write on it and a
+# concurrent zero write without MAY_UNMAP before the L2 update is made
+cat  <<EOF
+write -P 181 0x140000 0x10000
+write -z 0x140000 0x10000
+
+break write_aio A
+aio_write -P 180 0x140000 0x10000
+wait_break A
+aio_write -z 0x140000 0x10000
+resume A
+aio_flush
+EOF
 }
 
 overlay_io | $QEMU_IO blkdebug::"$TEST_IMG" | _filter_qemu_io |\
@@ -264,6 +306,10 @@ verify_io()
     # Undefined content for 0x10c000 0x8000
     echo read -P 160 0x114000 0x8000
     echo read -P 17  0x11c000 0x4000
+
+    echo read -P 0   0x120000 0x10000
+    echo read -P 0   0x130000 0x10000
+    echo read -P 0   0x140000 0x10000
 }
 
 verify_io | $QEMU_IO "$TEST_IMG" | _filter_qemu_io
diff --git a/tests/qemu-iotests/046.out b/tests/qemu-iotests/046.out
index b1a03f40414..6341df335c9 100644
--- a/tests/qemu-iotests/046.out
+++ b/tests/qemu-iotests/046.out
@@ -139,6 +139,36 @@ wrote XXX/XXX bytes at offset XXX
 XXX KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 wrote XXX/XXX bytes at offset XXX
 XXX KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote XXX/XXX bytes at offset XXX
+XXX KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote XXX/XXX bytes at offset XXX
+XXX KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+blkdebug: Suspended request 'A'
+blkdebug: Resuming request 'A'
+wrote XXX/XXX bytes at offset XXX
+XXX KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+discarded  XXX/XXX bytes at offset XXX
+XXX KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote XXX/XXX bytes at offset XXX
+XXX KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote XXX/XXX bytes at offset XXX
+XXX KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+blkdebug: Suspended request 'A'
+blkdebug: Resuming request 'A'
+wrote XXX/XXX bytes at offset XXX
+XXX KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote XXX/XXX bytes at offset XXX
+XXX KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote XXX/XXX bytes at offset XXX
+XXX KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote XXX/XXX bytes at offset XXX
+XXX KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+blkdebug: Suspended request 'A'
+blkdebug: Resuming request 'A'
+wrote XXX/XXX bytes at offset XXX
+XXX KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote XXX/XXX bytes at offset XXX
+XXX KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
 == Verify image content ==
 read 65536/65536 bytes at offset 0
@@ -239,5 +269,11 @@ read 32768/32768 bytes at offset 1130496
 32 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 read 16384/16384 bytes at offset 1163264
 16 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 65536/65536 bytes at offset 1179648
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 65536/65536 bytes at offset 1245184
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 65536/65536 bytes at offset 1310720
+64 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 No errors were found on the image.
 *** done
-- 
2.53.0



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/4] qemu-io: Add 'aio_discard' command
  2026-04-27 17:05 ` [PATCH 2/4] qemu-io: Add 'aio_discard' command Kevin Wolf
@ 2026-04-28 10:59   ` Denis V. Lunev
  0 siblings, 0 replies; 19+ messages in thread
From: Denis V. Lunev @ 2026-04-28 10:59 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: hreitz, den, stefanha, qemu-stable, qemu-devel

On 4/27/26 19:05, Kevin Wolf wrote:
> Testing interactions between multiple requests that include discard
> requests require that qemu-io can do the discard asynchronously, like it
> already does for reads and writes. To this effect, add an 'aio_discard'
> command.
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>  qemu-io-cmds.c | 113 +++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 113 insertions(+)
>
> diff --git a/qemu-io-cmds.c b/qemu-io-cmds.c
> index 13e03301624..0f5baa70f94 100644
> --- a/qemu-io-cmds.c
> +++ b/qemu-io-cmds.c
> @@ -2218,6 +2218,118 @@ static int discard_f(BlockBackend *blk, int argc, char **argv)
>      return 0;
>  }
>  
> +static void aio_discard_help(void)
> +{
> +    printf(
> +"\n"
> +" asynchronously discards a range of bytes from the given offset\n"
> +"\n"
> +" Example:\n"
> +" 'aio_discard 512 1k' - discards 1 kilobyte from 512 bytes into the file\n"
> +"\n"
> +" Discards a segment of the currently open file.\n"
> +" -C, -- report statistics in a machine parsable format\n"
> +" -q, -- quiet mode, do not show I/O statistics\n"
> +" The discard is performed asynchronously and the aio_flush command must be\n"
> +" used to ensure all outstanding aio requests have been completed.\n"
> +" Note that due to its asynchronous nature, this command will be\n"
> +" considered successful once the request is submitted, independently\n"
> +" of potential I/O errors.\n"
> +"\n");
> +}
> +
> +static int aio_discard_f(BlockBackend *blk, int argc, char **argv);
> +
> +static const cmdinfo_t aio_discard_cmd = {
> +    .name       = "aio_discard",
> +    .cfunc      = aio_discard_f,
> +    .perm       = BLK_PERM_WRITE,
> +    .argmin     = 2,
> +    .argmax     = -1,
> +    .args       = "[-Cq] off len",
> +    .oneline    = "asynchronously discards a number of bytes",
> +    .help       = aio_discard_help,
> +};
> +
> +static void aio_discard_done(void *opaque, int ret)
> +{
> +    struct aio_ctx *ctx = opaque;
> +    struct timespec t2;
> +
> +    clock_gettime(CLOCK_MONOTONIC, &t2);
> +
> +    if (ret < 0) {
> +        printf("aio_discard failed: %s\n", strerror(-ret));
> +        block_acct_failed(blk_get_stats(ctx->blk), &ctx->acct);
> +        goto out;
> +    }
> +
> +    block_acct_done(blk_get_stats(ctx->blk), &ctx->acct);
> +
> +    if (ctx->qflag) {
> +        goto out;
> +    }
> +
> +    /* Finally, report back -- -C gives a parsable format */
> +    t2 = tsub(t2, ctx->t1);
> +    print_report("discarded ", &t2, ctx->offset, ctx->qiov.size,
> +                 ctx->qiov.size, 1, ctx->Cflag);
> +out:
> +    g_free(ctx);
> +}
> +
> +static int aio_discard_f(BlockBackend *blk, int argc, char **argv)
> +{
> +    int c, ret;
> +    int64_t count;
> +    struct aio_ctx *ctx = g_new0(struct aio_ctx, 1);
> +
> +    ctx->blk = blk;
> +
> +    while ((c = getopt(argc, argv, "Cq")) != -1) {
> +        switch (c) {
> +        case 'C':
> +            ctx->Cflag = true;
> +            break;
> +        case 'q':
> +            ctx->qflag = true;
> +            break;
> +        default:
> +            qemuio_command_usage(&aio_discard_cmd);
leaked ctx

> +            return -EINVAL;
> +        }
> +    }
> +
> +    if (optind != argc - 2) {
> +        qemuio_command_usage(&aio_discard_cmd);
leaked ctx

> +        return -EINVAL;
> +    }
> +
> +    ctx->offset = cvtnum(argv[optind]);
> +    if (ctx->offset < 0) {
> +        ret = ctx->offset;
> +        print_cvtnum_err(ret, argv[optind]);
> +        g_free(ctx);
> +        return ret;
> +    }
> +    optind++;
> +
> +    count = cvtnum(argv[optind]);
> +    if (count < 0) {
> +        print_cvtnum_err(count, argv[optind]);
> +        g_free(ctx);
> +        return count;
> +    }
> +
> +    clock_gettime(CLOCK_MONOTONIC, &ctx->t1);
> +    ctx->qiov.size = count;
> +    block_acct_start(blk_get_stats(blk), &ctx->acct, ctx->qiov.size,
> +                     BLOCK_ACCT_UNMAP);
> +    blk_aio_pdiscard(blk, ctx->offset, count, aio_discard_done, ctx);
> +
> +    return 0;
> +}
> +
>  static int alloc_f(BlockBackend *blk, int argc, char **argv)
>  {
>      BlockDriverState *bs = blk_bs(blk);
> @@ -2800,6 +2912,7 @@ static void __attribute((constructor)) init_qemuio_commands(void)
>      qemuio_add_command(&length_cmd);
>      qemuio_add_command(&info_cmd);
>      qemuio_add_command(&discard_cmd);
> +    qemuio_add_command(&aio_discard_cmd);
>      qemuio_add_command(&alloc_cmd);
>      qemuio_add_command(&map_cmd);
>      qemuio_add_command(&reopen_cmd);



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/4] qcow2: Fix corruption on discard during write with COW
  2026-04-27 17:05 [PATCH 0/4] qcow2: Fix corruption on discard during write with COW Kevin Wolf
                   ` (3 preceding siblings ...)
  2026-04-27 17:05 ` [PATCH 4/4] iotests/046: Test that discard/write_zeroes wait for dependencies Kevin Wolf
@ 2026-04-28 11:00 ` Denis V. Lunev
  4 siblings, 0 replies; 19+ messages in thread
From: Denis V. Lunev @ 2026-04-28 11:00 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: hreitz, den, stefanha, qemu-stable, qemu-devel

On 4/27/26 19:05, Kevin Wolf wrote:
> This is an alternative fix for the corruption problem reported by Denis:
> https://patchew.org/QEMU/20260421155628.3600671-1-den@openvz.org/
>
> I think it really is a qcow2 level bug and should be fixed on the qcow2
> level. The test is also more targeted than just doing random I/O for a
> while, and therefore runs faster and documents the problem better.
>
> Kevin Wolf (4):
>   commit: Drain nodes across all of bdrv_commit()
>   qemu-io: Add 'aio_discard' command
>   qcow2: Fix corruption on discard during write with COW
>   iotests/046: Test that discard/write_zeroes wait for dependencies
>
>  block/commit.c             |  10 +++-
>  block/qcow2-cluster.c      |  52 ++++++++++++++++-
>  qemu-io-cmds.c             | 113 +++++++++++++++++++++++++++++++++++++
>  tests/qemu-iotests/046     |  46 +++++++++++++++
>  tests/qemu-iotests/046.out |  36 ++++++++++++
>  5 files changed, 252 insertions(+), 5 deletions(-)
>
with a 2 small nickpicks fixed

Reviewed-by: Denis V. Lunev <den@openvz.org>
Tested-by: Denis V. Lunev <den@openvz.org>


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/4] commit: Drain nodes across all of bdrv_commit()
  2026-04-27 17:05 ` [PATCH 1/4] commit: Drain nodes across all of bdrv_commit() Kevin Wolf
@ 2026-04-29 15:06   ` Fiona Ebner
  2026-05-12 12:16     ` Kevin Wolf
  0 siblings, 1 reply; 19+ messages in thread
From: Fiona Ebner @ 2026-04-29 15:06 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: hreitz, den, stefanha, qemu-stable, qemu-devel

Am 27.04.26 um 7:43 PM schrieb Kevin Wolf:
> The whole implementation of bdrv_commit() is only correct if no new
> writes come in while it's running: It has only a single loop checking
> the allocation status for each block and finally calls bdrv_make_empty()
> without checking if that throws away any new changes.
> 
> We already have to drain while taking the graph write lock. Just extend
> the drained section to all of bdrv_commit() to make sure that we don't
> get any inconsistencies.
> 
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
>  block/commit.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/block/commit.c b/block/commit.c
> index 0d9e1a16d7a..c5e3ef03a21 100644
> --- a/block/commit.c
> +++ b/block/commit.c
> @@ -518,6 +518,7 @@ int bdrv_commit(BlockDriverState *bs)
>      if (!drv)
>          return -ENOMEDIUM;
>  
> +    bdrv_drain_all_begin();

I suppose we could instead drain only the affected BDSs? Blocking all
for the whole duration of the blk_pread+blk_pwrite loop seems a bit much.

Independently of that, I wonder if blk_commit_all() should drain all?
I'm not familiar with it, but it seems that the intent is to have a
point-in-time state which is consistent between different BDSs? That
intent could be made explicit by draining all.

>      bdrv_graph_rdlock_main_loop();
>  
>      backing_file_bs = bdrv_cow_bs(bs);
> @@ -549,6 +550,10 @@ int bdrv_commit(BlockDriverState *bs)
>                    BLK_PERM_ALL);
>      backing = blk_new(ctx, BLK_PERM_WRITE | BLK_PERM_RESIZE, BLK_PERM_ALL);
>  
> +    /* We drained all nodes, but still make requests through BlockBackends */
> +    blk_set_disable_request_queuing(src, true);
> +    blk_set_disable_request_queuing(backing, true);
> +
>      ret = blk_insert_bs(src, bs, &local_err);
>      if (ret < 0) {
>          error_report_err(local_err);
> @@ -565,7 +570,7 @@ int bdrv_commit(BlockDriverState *bs)
>  
>      bdrv_graph_rdunlock_main_loop();
>  
> -    bdrv_graph_wrlock_drained();
> +    bdrv_graph_wrlock();
>      bdrv_set_backing_hd(commit_top_bs, backing_file_bs, &error_abort);
>      bdrv_set_backing_hd(bs, commit_top_bs, &error_abort);
>      bdrv_graph_wrunlock();
> @@ -647,7 +652,7 @@ ro_cleanup:
>      blk_unref(backing);
>  
>      bdrv_graph_rdunlock_main_loop();
> -    bdrv_graph_wrlock_drained();
> +    bdrv_graph_wrlock();
>      if (bdrv_cow_bs(bs) != backing_file_bs) {
>          bdrv_set_backing_hd(bs, backing_file_bs, &error_abort);
>      }
> @@ -663,6 +668,7 @@ ro_cleanup:
>  
>  out:
>      bdrv_graph_rdunlock_main_loop();
> +    bdrv_drain_all_end();
>  
>      return ret;
>  }




^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/4] qcow2: Fix corruption on discard during write with COW
  2026-04-27 17:05 ` [PATCH 3/4] qcow2: Fix corruption on discard during write with COW Kevin Wolf
@ 2026-04-29 15:28   ` Fiona Ebner
  2026-05-12 12:22     ` Kevin Wolf
  2026-05-21 12:12   ` Fiona Ebner
  1 sibling, 1 reply; 19+ messages in thread
From: Fiona Ebner @ 2026-04-29 15:28 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block; +Cc: hreitz, den, stefanha, qemu-stable, qemu-devel

Am 27.04.26 um 7:04 PM schrieb Kevin Wolf:
> @@ -2000,6 +2028,15 @@ int qcow2_cluster_discard(BlockDriverState *bs, uint64_t offset,
>      int64_t cleared;
>      int ret;
>  
> +    /*
> +     * If we're touching a cluster for which allocating writes are in flight,
> +     * wait for them to complete to avoid conflicting metadata updates.
> +     *
> +     * We don't need to allocate a QCowL2Meta for the discard operation because
> +     * s->lock is held for the duration of the whole operation.

The caller in qcow2_make_empty() does not acquire s->lock AFAICS. Can
that be a problem?

> +     */
> +    wait_for_dependencies(bs, offset, bytes);
> +
>      /* Caller must pass aligned values, except at image end */
>      assert(QEMU_IS_ALIGNED(offset, s->cluster_size));
>      assert(QEMU_IS_ALIGNED(end_offset, s->cluster_size) ||



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/4] commit: Drain nodes across all of bdrv_commit()
  2026-04-29 15:06   ` Fiona Ebner
@ 2026-05-12 12:16     ` Kevin Wolf
  2026-05-13 15:28       ` Fiona Ebner
  0 siblings, 1 reply; 19+ messages in thread
From: Kevin Wolf @ 2026-05-12 12:16 UTC (permalink / raw)
  To: Fiona Ebner; +Cc: qemu-block, hreitz, den, stefanha, qemu-stable, qemu-devel

Am 29.04.2026 um 17:06 hat Fiona Ebner geschrieben:
> Am 27.04.26 um 7:43 PM schrieb Kevin Wolf:
> > The whole implementation of bdrv_commit() is only correct if no new
> > writes come in while it's running: It has only a single loop checking
> > the allocation status for each block and finally calls bdrv_make_empty()
> > without checking if that throws away any new changes.
> > 
> > We already have to drain while taking the graph write lock. Just extend
> > the drained section to all of bdrv_commit() to make sure that we don't
> > get any inconsistencies.
> > 
> > Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> > ---
> >  block/commit.c | 10 ++++++++--
> >  1 file changed, 8 insertions(+), 2 deletions(-)
> > 
> > diff --git a/block/commit.c b/block/commit.c
> > index 0d9e1a16d7a..c5e3ef03a21 100644
> > --- a/block/commit.c
> > +++ b/block/commit.c
> > @@ -518,6 +518,7 @@ int bdrv_commit(BlockDriverState *bs)
> >      if (!drv)
> >          return -ENOMEDIUM;
> >  
> > +    bdrv_drain_all_begin();
> 
> I suppose we could instead drain only the affected BDSs? Blocking all
> for the whole duration of the blk_pread+blk_pwrite loop seems a bit much.

Possible, but I'm not completely sure. Basically what I did here is just
moving the drain part of bdrv_graph_wrlock_drained() earlier, which also
drains all nodes.

I think when you introduced it, the idea was that just draining
everything is acceptable and easier to verify. Which means that it might
not be strictly necessary, but I don't want to prove that now either.

Ultimately this drain_all goes back to commit 91ba0e1, in which you
stated:

    More granular draining is not trivially possible, because
    bdrv_change_aio_context() can recursively call itself e.g. via
    bdrv_child_change_aio_context().

> Independently of that, I wonder if blk_commit_all() should drain all?
> I'm not familiar with it, but it seems that the intent is to have a
> point-in-time state which is consistent between different BDSs? That
> intent could be made explicit by draining all.

Hm, I suppose that would make sense, yes. Do you want to send a patch on
top of this? Bonus points for a test case that shows the inconsistency.

Kevin



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/4] qcow2: Fix corruption on discard during write with COW
  2026-04-29 15:28   ` Fiona Ebner
@ 2026-05-12 12:22     ` Kevin Wolf
  2026-05-13 15:34       ` Fiona Ebner
  0 siblings, 1 reply; 19+ messages in thread
From: Kevin Wolf @ 2026-05-12 12:22 UTC (permalink / raw)
  To: Fiona Ebner; +Cc: qemu-block, hreitz, den, stefanha, qemu-stable, qemu-devel

Am 29.04.2026 um 17:28 hat Fiona Ebner geschrieben:
> Am 27.04.26 um 7:04 PM schrieb Kevin Wolf:
> > @@ -2000,6 +2028,15 @@ int qcow2_cluster_discard(BlockDriverState *bs, uint64_t offset,
> >      int64_t cleared;
> >      int ret;
> >  
> > +    /*
> > +     * If we're touching a cluster for which allocating writes are in flight,
> > +     * wait for them to complete to avoid conflicting metadata updates.
> > +     *
> > +     * We don't need to allocate a QCowL2Meta for the discard operation because
> > +     * s->lock is held for the duration of the whole operation.
> 
> The caller in qcow2_make_empty() does not acquire s->lock AFAICS. Can
> that be a problem?

This is the case the comment inside wait_for_dependencies() is about:

    /*
     * Discard has some non-coroutine callers (creating internal snapshots and
     * make empty). They are calling from qemu-img or in a drained section, so
     * we know that no writes can be in progress.
     */

So it should be fine in practice because we know a bigger hammer is
already in use for synchronisation.

Kevin

> > +     */
> > +    wait_for_dependencies(bs, offset, bytes);
> > +
> >      /* Caller must pass aligned values, except at image end */
> >      assert(QEMU_IS_ALIGNED(offset, s->cluster_size));
> >      assert(QEMU_IS_ALIGNED(end_offset, s->cluster_size) ||



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/4] commit: Drain nodes across all of bdrv_commit()
  2026-05-12 12:16     ` Kevin Wolf
@ 2026-05-13 15:28       ` Fiona Ebner
  0 siblings, 0 replies; 19+ messages in thread
From: Fiona Ebner @ 2026-05-13 15:28 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-block, hreitz, den, stefanha, qemu-stable, qemu-devel

Hi Kevin,

Am 12.05.26 um 2:14 PM schrieb Kevin Wolf:
> Am 29.04.2026 um 17:06 hat Fiona Ebner geschrieben:
>> Am 27.04.26 um 7:43 PM schrieb Kevin Wolf:
>>> The whole implementation of bdrv_commit() is only correct if no new
>>> writes come in while it's running: It has only a single loop checking
>>> the allocation status for each block and finally calls bdrv_make_empty()
>>> without checking if that throws away any new changes.
>>>
>>> We already have to drain while taking the graph write lock. Just extend
>>> the drained section to all of bdrv_commit() to make sure that we don't
>>> get any inconsistencies.
>>>
>>> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
>>> ---
>>>  block/commit.c | 10 ++++++++--
>>>  1 file changed, 8 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/block/commit.c b/block/commit.c
>>> index 0d9e1a16d7a..c5e3ef03a21 100644
>>> --- a/block/commit.c
>>> +++ b/block/commit.c
>>> @@ -518,6 +518,7 @@ int bdrv_commit(BlockDriverState *bs)
>>>      if (!drv)
>>>          return -ENOMEDIUM;
>>>  
>>> +    bdrv_drain_all_begin();
>>
>> I suppose we could instead drain only the affected BDSs? Blocking all
>> for the whole duration of the blk_pread+blk_pwrite loop seems a bit much.
> 
> Possible, but I'm not completely sure. Basically what I did here is just
> moving the drain part of bdrv_graph_wrlock_drained() earlier, which also
> drains all nodes.
> 
> I think when you introduced it, the idea was that just draining
> everything is acceptable and easier to verify. Which means that it might
> not be strictly necessary, but I don't want to prove that now either.
> 
> Ultimately this drain_all goes back to commit 91ba0e1, in which you
> stated:
> 
>     More granular draining is not trivially possible, because
>     bdrv_change_aio_context() can recursively call itself e.g. via
>     bdrv_child_change_aio_context().
> 

oh, right. I had forgotten about that recursion.

>> Independently of that, I wonder if blk_commit_all() should drain all?
>> I'm not familiar with it, but it seems that the intent is to have a
>> point-in-time state which is consistent between different BDSs? That
>> intent could be made explicit by draining all.
> 
> Hm, I suppose that would make sense, yes. Do you want to send a patch on
> top of this? Bonus points for a test case that shows the inconsistency.

I'll add it to my TODO pile, but it'll be a few weeks, as I'm quite busy
with other work at the moment.

Best Regards,
Fiona



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/4] qcow2: Fix corruption on discard during write with COW
  2026-05-12 12:22     ` Kevin Wolf
@ 2026-05-13 15:34       ` Fiona Ebner
  0 siblings, 0 replies; 19+ messages in thread
From: Fiona Ebner @ 2026-05-13 15:34 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-block, hreitz, den, stefanha, qemu-stable, qemu-devel

Am 12.05.26 um 2:21 PM schrieb Kevin Wolf:
> Am 29.04.2026 um 17:28 hat Fiona Ebner geschrieben:
>> Am 27.04.26 um 7:04 PM schrieb Kevin Wolf:
>>> @@ -2000,6 +2028,15 @@ int qcow2_cluster_discard(BlockDriverState *bs, uint64_t offset,
>>>      int64_t cleared;
>>>      int ret;
>>>  
>>> +    /*
>>> +     * If we're touching a cluster for which allocating writes are in flight,
>>> +     * wait for them to complete to avoid conflicting metadata updates.
>>> +     *
>>> +     * We don't need to allocate a QCowL2Meta for the discard operation because
>>> +     * s->lock is held for the duration of the whole operation.
>>
>> The caller in qcow2_make_empty() does not acquire s->lock AFAICS. Can
>> that be a problem?
> 
> This is the case the comment inside wait_for_dependencies() is about:
> 
>     /*
>      * Discard has some non-coroutine callers (creating internal snapshots and
>      * make empty). They are calling from qemu-img or in a drained section, so
>      * we know that no writes can be in progress.
>      */
> 
> So it should be fine in practice because we know a bigger hammer is
> already in use for synchronisation.

Ah sorry, I missed that. Ignore me then.

Best Regards,
Fiona



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/4] qcow2: Fix corruption on discard during write with COW
  2026-04-27 17:05 ` [PATCH 3/4] qcow2: Fix corruption on discard during write with COW Kevin Wolf
  2026-04-29 15:28   ` Fiona Ebner
@ 2026-05-21 12:12   ` Fiona Ebner
  2026-05-21 13:46     ` Kevin Wolf
  1 sibling, 1 reply; 19+ messages in thread
From: Fiona Ebner @ 2026-05-21 12:12 UTC (permalink / raw)
  To: Kevin Wolf, qemu-block, Michael Tokarev
  Cc: hreitz, den, stefanha, qemu-stable, qemu-devel, Thomas Lamprecht

Dear maintainers,

Am 27.04.26 um 7:04 PM schrieb Kevin Wolf:
> Most code in qcow2 that accesses (and potentially modifies) L2 tables
> does so while holding s->lock.
> 
> There is one exception, which is allocating writes. They hold the lock
> initially while allocating clusters, but drop it for writing the guest
> payload before taking the lock again for updating the L2 tables. This
> allows concurrent requests that touch other parts of the image file to
> continue in parallel and is an important performance optimisation.
> 
> However, this means that other requests that run while the lock is
> dropped for writing guest data must synchronise with the list of
> allocating requests in s->cluster_allocs and wait if they would overlap.
> For writes, this is done in handle_dependencies(), but discard and write
> zeros operations neglect to synchronise with s->cluster_allocs.
> 
> This means that discard can free a cluster whose L2 entry will already
> be modified in qcow2_alloc_cluster_link_l2() by a previously started
> write. In the case of a pre-allocated zero cluster that is in the
> process of being overwritten, this means that discard can lead to a
> situation where the cluster is still mapped (because the write will
> restore the L2 entry just without the zero flag), but its refcount has
> been decreased, resulting in a corrupted image.
> 
> Add the missing synchronisation to qcow2_cluster_discard() and
> qcow2_subcluster_zeroize() to fix the problem.
> 
> Cc: qemu-stable@nongnu.org
> Reported-by: Denis V. Lunev <den@openvz.org>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>

we had started rolling out a build of QEMU 11 with this patch already
included. However, some of our users reported issues with VMs using
qcow2 disks soon after [0][1]. I was able to reproduce the in-guest
segfaults from [1] in a memory-constrained Debian 12 guest when using a
swap partition on the same disk. Thanks to Thomas for the hunch with
swap! After reverting this patch, I wasn't able to reproduce the issue
anymore. I do not have a better reproducer yet and am not sure about the
exact pattern causing the issue. It's related to the
wait_for_dependencies() call in qcow2_subcluster_zeroize(), because if I
revert just the one in qcow2_cluster_discard(), the issue still reproduces.

Commandline for my reproducer VM [2]. The issue does not happen if I
drop "detect-zeroes":"unmap". Note that I don't have discard-no-unref
for the qcow2 image, so in zero_in_l2_slice(), the branch with
qcow2_free_any_cluster() is taken. Could the conflict be related to that?

I'm still trying to figure things out and come up with a better
reproducer, but wanted to let you know early, also because of the
upcoming stable releases. Of course, I'd also be happy for hints/hunches
and am happy to test suggestions!

Best Regards,
Fiona

[0]: https://forum.proxmox.com/threads/183679/
[1]: https://forum.proxmox.com/threads/183639/

[2]:

> ./qemu-system-x86_64 \
>   -accel kvm \
>   -chardev 'socket,id=qmp,path=/var/run/qemu-server/300.qmp,server=on,wait=off' \
>   -mon 'chardev=qmp,mode=control' \
>   -chardev 'socket,id=qmp-event,path=/var/run/qmeventd.sock,reconnect-ms=5000' \
>   -mon 'chardev=qmp-event,mode=control' \
>   -pidfile /var/run/qemu-server/300.pid \
>   -smp '4,sockets=2,cores=2,maxcpus=4' \
>   -nodefaults \
>   -boot 'menu=on,strict=on,reboot-timeout=1000,splash=/usr/share/qemu-server/bootsplash.jpg' \
>   -vnc 'unix:/var/run/qemu-server/300.vnc,password=on' \
>   -cpu host,+kvm_pv_eoi,+kvm_pv_unhalt \
>   -m 256 \
>   -device 'pci-bridge,id=pci.1,chassis_nr=1,bus=pci.0,addr=0x1e' \
>   -device 'pci-bridge,id=pci.2,chassis_nr=2,bus=pci.0,addr=0x1f' \
>   -device 'pci-bridge,id=pci.3,chassis_nr=3,bus=pci.0,addr=0x5' \
>   -device 'VGA,id=vga,bus=pci.0,addr=0x2' \
>   -device 'virtio-scsi-pci,id=virtioscsi0,bus=pci.3,addr=0x1' \
>   -blockdev '{"detect-zeroes":"unmap","discard":"unmap","driver":"qcow2","file":{"detect-zeroes":"unmap","discard":"unmap","driver":"file","filename":"/mnt/pve/dir/images/300/vm-300-disk-0.qcow2","node-name":"e377549e25f53abd39f9ba01c03653e"},"node-name":"drive-scsi0"}' \
>   -device 'scsi-hd,bus=virtioscsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0,id=scsi0,device_id=drive-scsi0,bootindex=100' \
>   -netdev 'type=tap,id=net1,ifname=tap300i1,script=/usr/libexec/qemu-server/pve-bridge,downscript=/usr/libexec/qemu-server/pve-bridgedown,vhost=on' \
>   -device 'virtio-net-pci,mac=BC:24:11:CA:B4:EF,netdev=net1,bus=pci.0,addr=0x13,id=net1,rx_queue_size=1024,tx_queue_size=256,host_mtu=1500'





^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/4] qcow2: Fix corruption on discard during write with COW
  2026-05-21 12:12   ` Fiona Ebner
@ 2026-05-21 13:46     ` Kevin Wolf
  2026-05-21 14:18       ` Fiona Ebner
  0 siblings, 1 reply; 19+ messages in thread
From: Kevin Wolf @ 2026-05-21 13:46 UTC (permalink / raw)
  To: Fiona Ebner
  Cc: qemu-block, Michael Tokarev, hreitz, den, stefanha, qemu-stable,
	qemu-devel, Thomas Lamprecht

Hi Fiona,

Am 21.05.2026 um 14:12 hat Fiona Ebner geschrieben:
> Am 27.04.26 um 7:04 PM schrieb Kevin Wolf:
> > Most code in qcow2 that accesses (and potentially modifies) L2 tables
> > does so while holding s->lock.
> > 
> > There is one exception, which is allocating writes. They hold the lock
> > initially while allocating clusters, but drop it for writing the guest
> > payload before taking the lock again for updating the L2 tables. This
> > allows concurrent requests that touch other parts of the image file to
> > continue in parallel and is an important performance optimisation.
> > 
> > However, this means that other requests that run while the lock is
> > dropped for writing guest data must synchronise with the list of
> > allocating requests in s->cluster_allocs and wait if they would overlap.
> > For writes, this is done in handle_dependencies(), but discard and write
> > zeros operations neglect to synchronise with s->cluster_allocs.
> > 
> > This means that discard can free a cluster whose L2 entry will already
> > be modified in qcow2_alloc_cluster_link_l2() by a previously started
> > write. In the case of a pre-allocated zero cluster that is in the
> > process of being overwritten, this means that discard can lead to a
> > situation where the cluster is still mapped (because the write will
> > restore the L2 entry just without the zero flag), but its refcount has
> > been decreased, resulting in a corrupted image.
> > 
> > Add the missing synchronisation to qcow2_cluster_discard() and
> > qcow2_subcluster_zeroize() to fix the problem.
> > 
> > Cc: qemu-stable@nongnu.org
> > Reported-by: Denis V. Lunev <den@openvz.org>
> > Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> 
> we had started rolling out a build of QEMU 11 with this patch already
> included. However, some of our users reported issues with VMs using
> qcow2 disks soon after [0][1]. I was able to reproduce the in-guest
> segfaults from [1] in a memory-constrained Debian 12 guest when using a
> swap partition on the same disk. Thanks to Thomas for the hunch with
> swap! After reverting this patch, I wasn't able to reproduce the issue
> anymore. I do not have a better reproducer yet and am not sure about the
> exact pattern causing the issue. It's related to the
> wait_for_dependencies() call in qcow2_subcluster_zeroize(), because if I
> revert just the one in qcow2_cluster_discard(), the issue still reproduces.
> 
> Commandline for my reproducer VM [2]. The issue does not happen if I
> drop "detect-zeroes":"unmap". Note that I don't have discard-no-unref
> for the qcow2 image, so in zero_in_l2_slice(), the branch with
> qcow2_free_any_cluster() is taken. Could the conflict be related to that?
> 
> I'm still trying to figure things out and come up with a better
> reproducer, but wanted to let you know early, also because of the
> upcoming stable releases. Of course, I'd also be happy for hints/hunches
> and am happy to test suggestions!

Do you have any information about the options used with the image file?
In particular, is it using subclusters? Maybe just the 'qemu-img info'
output would already give a bit more context.

Could you already locate the actual corruption and check what the
pattern looks like? Something like zeros where we would expect data or
the other way around? Or something less clear? (If you don't know,
that's a good answer too. I know well that this kind of things is hard
to debug.)

Kevin



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/4] qcow2: Fix corruption on discard during write with COW
  2026-05-21 13:46     ` Kevin Wolf
@ 2026-05-21 14:18       ` Fiona Ebner
  2026-05-21 14:35         ` Kevin Wolf
  0 siblings, 1 reply; 19+ messages in thread
From: Fiona Ebner @ 2026-05-21 14:18 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: qemu-block, Michael Tokarev, hreitz, den, stefanha, qemu-stable,
	qemu-devel, Thomas Lamprecht

Hi Kevin,

Am 21.05.26 um 3:46 PM schrieb Kevin Wolf:
> Am 21.05.2026 um 14:12 hat Fiona Ebner geschrieben:
>> Am 27.04.26 um 7:04 PM schrieb Kevin Wolf:
>>> Most code in qcow2 that accesses (and potentially modifies) L2 tables
>>> does so while holding s->lock.
>>>
>>> There is one exception, which is allocating writes. They hold the lock
>>> initially while allocating clusters, but drop it for writing the guest
>>> payload before taking the lock again for updating the L2 tables. This
>>> allows concurrent requests that touch other parts of the image file to
>>> continue in parallel and is an important performance optimisation.
>>>
>>> However, this means that other requests that run while the lock is
>>> dropped for writing guest data must synchronise with the list of
>>> allocating requests in s->cluster_allocs and wait if they would overlap.
>>> For writes, this is done in handle_dependencies(), but discard and write
>>> zeros operations neglect to synchronise with s->cluster_allocs.
>>>
>>> This means that discard can free a cluster whose L2 entry will already
>>> be modified in qcow2_alloc_cluster_link_l2() by a previously started
>>> write. In the case of a pre-allocated zero cluster that is in the
>>> process of being overwritten, this means that discard can lead to a
>>> situation where the cluster is still mapped (because the write will
>>> restore the L2 entry just without the zero flag), but its refcount has
>>> been decreased, resulting in a corrupted image.
>>>
>>> Add the missing synchronisation to qcow2_cluster_discard() and
>>> qcow2_subcluster_zeroize() to fix the problem.
>>>
>>> Cc: qemu-stable@nongnu.org
>>> Reported-by: Denis V. Lunev <den@openvz.org>
>>> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
>>
>> we had started rolling out a build of QEMU 11 with this patch already
>> included. However, some of our users reported issues with VMs using
>> qcow2 disks soon after [0][1]. I was able to reproduce the in-guest
>> segfaults from [1] in a memory-constrained Debian 12 guest when using a
>> swap partition on the same disk. Thanks to Thomas for the hunch with
>> swap! After reverting this patch, I wasn't able to reproduce the issue
>> anymore. I do not have a better reproducer yet and am not sure about the
>> exact pattern causing the issue. It's related to the
>> wait_for_dependencies() call in qcow2_subcluster_zeroize(), because if I
>> revert just the one in qcow2_cluster_discard(), the issue still reproduces.
>>
>> Commandline for my reproducer VM [2]. The issue does not happen if I
>> drop "detect-zeroes":"unmap". Note that I don't have discard-no-unref
>> for the qcow2 image, so in zero_in_l2_slice(), the branch with
>> qcow2_free_any_cluster() is taken. Could the conflict be related to that?
>>
>> I'm still trying to figure things out and come up with a better
>> reproducer, but wanted to let you know early, also because of the
>> upcoming stable releases. Of course, I'd also be happy for hints/hunches
>> and am happy to test suggestions!
> 
> Do you have any information about the options used with the image file?
> In particular, is it using subclusters? Maybe just the 'qemu-img info'
> output would already give a bit more context.

No subclusters if I'm not missing anything. When I created the image the
output was:

Formatting '/mnt/pve/dir/images/300/vm-300-disk-0.qcow2', fmt=qcow2
cluster_size=65536 extended_l2=off preallocation=metadata
compression_type=zlib size=4510973952 lazy_refcounts=off refcount_bits=16

Our management layer doesn't log the command itself, but doing the same
operation with logging added (and 301 instead of 300):

/usr/bin/qemu-img create -o preallocation=metadata -f qcow2
/mnt/pve/dir/images/301/vm-301-disk-0.qcow2 4405248K

qemu-img info gives:

{
    "children": [
        {
            "name": "file",
            "info": {
                "children": [
                ],
                "virtual-size": 6018105344,
                "filename": "/mnt/pve/dir/images/300/vm-300-disk-0.qcow2",
                "format": "file",
                "actual-size": 3794231296,
                "format-specific": {
                    "type": "file",
                    "data": {
                        "extent-size-hint": 1048576
                    }
                },
                "dirty-flag": false
            }
        }
    ],
    "snapshots": [
        {
            "icount": 0,
            "vm-clock-nsec": 0,
            "name": "s0",
            "date-sec": 1779354489,
            "date-nsec": 415278000,
            "vm-clock-sec": 0,
            "id": "1",
            "vm-state-size": 0
        }
    ],
    "virtual-size": 4510973952,
    "filename": "/mnt/pve/dir/images/300/vm-300-disk-0.qcow2",
    "cluster-size": 65536,
    "format": "qcow2",
    "actual-size": 3794231296,
    "format-specific": {
        "type": "qcow2",
        "data": {
            "compat": "1.1",
            "compression-type": "zlib",
            "lazy-refcounts": false,
            "refcount-bits": 16,
            "corrupt": false,
            "extended-l2": false
        }
    },
    "dirty-flag": false
}

> Could you already locate the actual corruption and check what the
> pattern looks like? Something like zeros where we would expect data or
> the other way around? Or something less clear? (If you don't know,
> that's a good answer too. I know well that this kind of things is hard
> to debug.)

Unfortunately not. I can only see the symptom of memory swapped back in
being corrupt (at least that's what happens AFAIU), leading to segfaults
in various processes as well as issues with heap allocations, e.g.:
corrupted double-linked list
free(): invalid pointer

I'll write a small program which allocates memory with a fixed pattern
and regularly dumps it, maybe that works to get an idea about the
corruption.

Best Regards,
Fiona



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/4] qcow2: Fix corruption on discard during write with COW
  2026-05-21 14:18       ` Fiona Ebner
@ 2026-05-21 14:35         ` Kevin Wolf
  2026-05-21 15:14           ` Fiona Ebner
  0 siblings, 1 reply; 19+ messages in thread
From: Kevin Wolf @ 2026-05-21 14:35 UTC (permalink / raw)
  To: Fiona Ebner
  Cc: qemu-block, Michael Tokarev, hreitz, den, stefanha, qemu-stable,
	qemu-devel, Thomas Lamprecht

Am 21.05.2026 um 16:18 hat Fiona Ebner geschrieben:
> Am 21.05.26 um 3:46 PM schrieb Kevin Wolf:
> > Am 21.05.2026 um 14:12 hat Fiona Ebner geschrieben:
> >> Am 27.04.26 um 7:04 PM schrieb Kevin Wolf:
> >> I'm still trying to figure things out and come up with a better
> >> reproducer, but wanted to let you know early, also because of the
> >> upcoming stable releases. Of course, I'd also be happy for hints/hunches
> >> and am happy to test suggestions!
> > 
> > Do you have any information about the options used with the image file?
> > In particular, is it using subclusters? Maybe just the 'qemu-img info'
> > output would already give a bit more context.
> 
> No subclusters if I'm not missing anything. When I created the image the
> output was:
> 
> Formatting '/mnt/pve/dir/images/300/vm-300-disk-0.qcow2', fmt=qcow2
> cluster_size=65536 extended_l2=off preallocation=metadata
> compression_type=zlib size=4510973952 lazy_refcounts=off refcount_bits=16
> 
> Our management layer doesn't log the command itself, but doing the same
> operation with logging added (and 301 instead of 300):
> 
> /usr/bin/qemu-img create -o preallocation=metadata -f qcow2
> /mnt/pve/dir/images/301/vm-301-disk-0.qcow2 4405248K
> 
> qemu-img info gives:
> [...]

Ok, looks like all default options.

> > Could you already locate the actual corruption and check what the
> > pattern looks like? Something like zeros where we would expect data or
> > the other way around? Or something less clear? (If you don't know,
> > that's a good answer too. I know well that this kind of things is hard
> > to debug.)
> 
> Unfortunately not. I can only see the symptom of memory swapped back in
> being corrupt (at least that's what happens AFAIU), leading to segfaults
> in various processes as well as issues with heap allocations, e.g.:
> corrupted double-linked list
> free(): invalid pointer
> 
> I'll write a small program which allocates memory with a fixed pattern
> and regularly dumps it, maybe that works to get an idea about the
> corruption.

AI suggests a scenario that looks like a real bug to me, though I'm not
sure if it's yours. See the reproducer below.

Basically it boils down to a non-allocating write being in flight to a
cluster that is concurrently discarded, turning the write essentially
into a host-cluster use-after-free. If you then allocate a new cluster
at the same time, the host cluster will be reused and the write that was
for a different guest cluster still writes to it.

I'm not completely sure yet what the right synchronisation mechanism
would be for this.

Anyway, as it depends on a specific pattern of discard and cluster
allocation happening while a write request is in flight, it should be
possible to use tracing to find out if anything like that is happening
in your case.

Kevin


blkdebug.conf:

[set-state]
state = "1"
event = "write_aio"
new_state = "2"

[set-state]
state = "2"
event = "cluster_alloc"
new_state = "3"


race_test.sh:

#!/bin/bash
#
# Reproducer for the wait_for_dependencies / skip_cow race in
# qcow2_subcluster_zeroize — demonstrating data corruption at an
# UNRELATED guest offset through host cluster reuse.
#
# The scenario:
#   1. Write A to a ZERO_ALLOC cluster creates l2meta. Data I/O suspended.
#   2. Write B to same cluster waits for A. Zero-write also waits for A.
#   3. A completes (cluster → NORMAL). B wakes first (FIFO), gets
#      skip_cow=true (no l2meta), starts data I/O — suspended by blkdebug.
#      Zero-write wakes, finds no deps (B invisible), frees cluster.
#   4. Write D to a DIFFERENT guest offset allocates the freed cluster.
#      D writes its data. D completes.
#   5. B resumes and writes to the same physical cluster, overwriting D.
#   6. Reading D's guest offset returns B's data. CORRUPTION.

set -e

DIR="$(cd "$(dirname "$0")" && pwd)"
QEMU_IO="${DIR}/../build/qemu-io"
QEMU_IMG="${DIR}/../build/qemu-img"
TEST_IMG="/tmp/race_test_$$.qcow2"
BLKDEBUG_CONF="${DIR}/blkdebug.conf"
LOG="/home/cursor/qemu/debug-8a8071.log"

cleanup() {
    rm -f "$TEST_IMG"
}
trap cleanup EXIT

echo "=== Creating test image ==="
"$QEMU_IMG" create -f qcow2 "$TEST_IMG" 1M

echo ""
echo "=== Preparing ZERO_ALLOC cluster at guest offset 0 ==="
"$QEMU_IO" -c "write -P 0x11 0 64k" \
            -c "write -z 0 64k" \
            "$TEST_IMG"

echo ""
echo "=== Running race reproducer ==="
#
# blkdebug.conf state machine:
#   State 1 --(write_aio)--> State 2 --(cluster_alloc)--> State 3
#
# - State 1: tagA breakpoint catches write A
# - State 2: tagB breakpoint catches write B (skip_cow write)
# - State 2→3 transition on cluster_alloc: D's allocation transitions
#   state to 3 BEFORE D fires write_aio, so D is NOT caught by tagB
#
# Sequence:
#   break write_aio tagA          -- breakpoint for state 1
#   aio_write A 0xAA 0 64k       -- suspended at tagA (state 1→2)
#   wait_break tagA
#   break write_aio tagB          -- breakpoint for state 2
#   aio_write B 0xBB 0 64k       -- waits for A (handle_dependencies)
#   aio_write -z -u 0 64k        -- waits for A (wait_for_dependencies)
#   resume tagA                   -- A completes. B wakes (skip_cow),
#                                    caught by tagB. Zero-write frees
#                                    cluster.
#   wait_break tagB               -- B suspended, cluster freed
#   write D 0xDD 64k 64k         -- D allocates the freed cluster
#                                    (cluster_alloc transitions to
#                                    state 3). D's write_aio fires at
#                                    state 3 — no breakpoint. D writes
#                                    its data and completes.
#   resume tagB                   -- B writes to the SAME physical
#                                    cluster, overwriting D's data
#   aio_flush
#
#   read -P 0xDD 64k 64k         -- EXPECTS D's data (0xDD)
#                                    GETS B's data (0xBB) → CORRUPTION

QEMU_IO_OUTPUT=$("$QEMU_IO" \
    -c "break write_aio tagA" \
    -c "aio_write -P 0xAA 0 64k" \
    -c "wait_break tagA" \
    -c "break write_aio tagB" \
    -c "aio_write -P 0xBB 0 64k" \
    -c "aio_write -z -u 0 64k" \
    -c "resume tagA" \
    -c "wait_break tagB" \
    -c "write -P 0xDD 64k 64k" \
    -c "resume tagB" \
    -c "aio_flush" \
    -c "read -vP 0xDD 64k 512" \
    -c "read -vP 0 0 512" \
    "blkdebug:${BLKDEBUG_CONF}:${TEST_IMG}" 2>&1) || true

echo "$QEMU_IO_OUTPUT"

PATTERN_FAIL=$(echo "$QEMU_IO_OUTPUT" | grep -c "Pattern verification failed" || true)
if [ "$PATTERN_FAIL" -gt 0 ]; then
    echo ""
    echo "*** DATA CORRUPTION DETECTED at guest offset 64K ***"
    echo "*** D wrote 0xDD, but reading returns B's data (0xBB)."
    echo "*** B's write to the freed+reallocated cluster corrupted"
    echo "*** an UNRELATED guest address."
fi

echo ""
echo "=== Checking image integrity (metadata) ==="
"$QEMU_IMG" check "$TEST_IMG" || true

echo ""
echo "=== Allocation map ==="
"$QEMU_IMG" map --output=json "$TEST_IMG"

echo ""
echo "=== Checking log for race evidence ==="
if [ -f "$LOG" ]; then
    echo "--- Log entries (chronological) ---"
    cat "$LOG"
    echo ""
    echo "--- Race analysis ---"
    # Extract host offsets for the skip_cow write (B) and D's write
    B_HOST=$(grep '"has_l2meta":0' "$LOG" | grep -o '"host_offset":[0-9]*' | head -1 | grep -o '[0-9]*')
    D_HOST=$(grep '"offset":65536' "$LOG" | grep -o '"host_offset":[0-9]*' | head -1 | grep -o '[0-9]*')

    echo "Write B (skip_cow, no l2meta) host_offset: $B_HOST"
    echo "Write D (different guest offset)  host_offset: $D_HOST"

    if [ -n "$B_HOST" ] && [ -n "$D_HOST" ] && [ "$B_HOST" = "$D_HOST" ]; then
        echo ""
        echo "*** CLUSTER REUSE CONFIRMED: B and D write to the same"
        echo "*** physical cluster ($B_HOST) for different guest offsets."
        echo "*** B (guest offset 0) overwrites D (guest offset 64K)."
        echo "*** Reading guest offset 64K returns B's data → CORRUPTION"
        echo "*** at an unrelated guest address."
    fi
else
    echo "No log file found at $LOG"
fi



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/4] qcow2: Fix corruption on discard during write with COW
  2026-05-21 14:35         ` Kevin Wolf
@ 2026-05-21 15:14           ` Fiona Ebner
  2026-05-21 16:14             ` Thomas Lamprecht
  0 siblings, 1 reply; 19+ messages in thread
From: Fiona Ebner @ 2026-05-21 15:14 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: qemu-block, Michael Tokarev, hreitz, den, stefanha, qemu-stable,
	qemu-devel, Thomas Lamprecht

Am 21.05.26 um 4:35 PM schrieb Kevin Wolf:
> Am 21.05.2026 um 16:18 hat Fiona Ebner geschrieben:
>> Am 21.05.26 um 3:46 PM schrieb Kevin Wolf:
>>> Am 21.05.2026 um 14:12 hat Fiona Ebner geschrieben:
>>>> Am 27.04.26 um 7:04 PM schrieb Kevin Wolf:
>>>> I'm still trying to figure things out and come up with a better
>>>> reproducer, but wanted to let you know early, also because of the
>>>> upcoming stable releases. Of course, I'd also be happy for hints/hunches
>>>> and am happy to test suggestions!
>>>
>>> Do you have any information about the options used with the image file?
>>> In particular, is it using subclusters? Maybe just the 'qemu-img info'
>>> output would already give a bit more context.
>>
>> No subclusters if I'm not missing anything. When I created the image the
>> output was:
>>
>> Formatting '/mnt/pve/dir/images/300/vm-300-disk-0.qcow2', fmt=qcow2
>> cluster_size=65536 extended_l2=off preallocation=metadata
>> compression_type=zlib size=4510973952 lazy_refcounts=off refcount_bits=16
>>
>> Our management layer doesn't log the command itself, but doing the same
>> operation with logging added (and 301 instead of 300):
>>
>> /usr/bin/qemu-img create -o preallocation=metadata -f qcow2
>> /mnt/pve/dir/images/301/vm-301-disk-0.qcow2 4405248K
>>
>> qemu-img info gives:
>> [...]
> 
> Ok, looks like all default options.
> 
>>> Could you already locate the actual corruption and check what the
>>> pattern looks like? Something like zeros where we would expect data or
>>> the other way around? Or something less clear? (If you don't know,
>>> that's a good answer too. I know well that this kind of things is hard
>>> to debug.)
>>
>> Unfortunately not. I can only see the symptom of memory swapped back in
>> being corrupt (at least that's what happens AFAIU), leading to segfaults
>> in various processes as well as issues with heap allocations, e.g.:
>> corrupted double-linked list
>> free(): invalid pointer
>>
>> I'll write a small program which allocates memory with a fixed pattern
>> and regularly dumps it, maybe that works to get an idea about the
>> corruption.
> 
> AI suggests a scenario that looks like a real bug to me, though I'm not
> sure if it's yours. See the reproducer below.
> 
> Basically it boils down to a non-allocating write being in flight to a
> cluster that is concurrently discarded, turning the write essentially
> into a host-cluster use-after-free. If you then allocate a new cluster
> at the same time, the host cluster will be reused and the write that was
> for a different guest cluster still writes to it.
> 
> I'm not completely sure yet what the right synchronisation mechanism
> would be for this.
> 
> Anyway, as it depends on a specific pattern of discard and cluster
> allocation happening while a write request is in flight, it should be
> possible to use tracing to find out if anything like that is happening
> in your case.

I will try to do tracing. With the following program, I see that the
corrupt memory reads back as zeroes:

> #include <stdbool.h>
> #include <stdint.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
> 
> #define SIZE 1024
> #define COUNT 100 * 1024
> 
> uint8_t data[COUNT * SIZE];
> 
> int main(void) {
>     while (true) {
>         for (uint32_t i = 0; i < COUNT; i++) {
>             memset(data + i * SIZE, (i % 100) + 1, SIZE);
>         }
>         sleep(10);
>         for (uint32_t i = 0; i < COUNT; i++) {
>             for (uint32_t j = 0; j < SIZE; j++) {
>                 if (data[i * SIZE + j] != (i % 100) + 1) {
>                     uint32_t start = j;
>                     uint32_t corrupt_val = data[i * SIZE + j];
>                     while (j + 1 < SIZE && data[i * SIZE + j + 1] == corrupt_val) {
>                         j++;
>                     }
>                     printf("corrupt block %u byte range %u - %u: val %u expected %u\n",
>                            i, start, j, corrupt_val, (i % 100) + 1);
>                 }
>             }
>         }
>     }
>     return 0;
> }

Example output:

> corrupt block 21495 byte range 928 - 1023: val 0 expected 96
> corrupt block 21496 byte range 0 - 1023: val 0 expected 97
> corrupt block 21497 byte range 0 - 1023: val 0 expected 98
> corrupt block 21498 byte range 0 - 1023: val 0 expected 99
> corrupt block 21499 byte range 0 - 1023: val 0 expected 100
> corrupt block 21500 byte range 0 - 1023: val 0 expected 1
> corrupt block 21501 byte range 0 - 1023: val 0 expected 2
> corrupt block 21502 byte range 0 - 1023: val 0 expected 3
> corrupt block 21503 byte range 0 - 1023: val 0 expected 4
> corrupt block 21504 byte range 0 - 1023: val 0 expected 5
> corrupt block 21505 byte range 0 - 1023: val 0 expected 6
> corrupt block 21506 byte range 0 - 1023: val 0 expected 7
> corrupt block 21507 byte range 0 - 1023: val 0 expected 8
> corrupt block 21508 byte range 0 - 1023: val 0 expected 9
> corrupt block 21509 byte range 0 - 1023: val 0 expected 10
> corrupt block 21510 byte range 0 - 1023: val 0 expected 11
> corrupt block 21511 byte range 0 - 927: val 0 expected 12
> corrupt block 22727 byte range 928 - 1023: val 0 expected 28
> corrupt block 22728 byte range 0 - 1023: val 0 expected 29
> corrupt block 22729 byte range 0 - 1023: val 0 expected 30
> corrupt block 22730 byte range 0 - 1023: val 0 expected 31
> corrupt block 22731 byte range 0 - 1023: val 0 expected 32
> corrupt block 22732 byte range 0 - 1023: val 0 expected 33
> corrupt block 22733 byte range 0 - 1023: val 0 expected 34
> corrupt block 22734 byte range 0 - 1023: val 0 expected 35
> corrupt block 22735 byte range 0 - 1023: val 0 expected 36
> corrupt block 22736 byte range 0 - 1023: val 0 expected 37
> corrupt block 22737 byte range 0 - 1023: val 0 expected 38
> corrupt block 22738 byte range 0 - 1023: val 0 expected 39
> corrupt block 22739 byte range 0 - 1023: val 0 expected 40
> corrupt block 22740 byte range 0 - 1023: val 0 expected 41
> corrupt block 22741 byte range 0 - 1023: val 0 expected 42
> corrupt block 22742 byte range 0 - 1023: val 0 expected 43
> corrupt block 22743 byte range 0 - 1023: val 0 expected 44
> corrupt block 22744 byte range 0 - 1023: val 0 expected 45
> corrupt block 22745 byte range 0 - 1023: val 0 expected 46
> corrupt block 22746 byte range 0 - 1023: val 0 expected 47
> corrupt block 22747 byte range 0 - 1023: val 0 expected 48
> corrupt block 22748 byte range 0 - 1023: val 0 expected 49
> corrupt block 22749 byte range 0 - 1023: val 0 expected 50
> corrupt block 22750 byte range 0 - 1023: val 0 expected 51
> corrupt block 22751 byte range 0 - 927: val 0 expected 52
> corrupt block 23451 byte range 928 - 1023: val 0 expected 52
> corrupt block 23452 byte range 0 - 1023: val 0 expected 53
> corrupt block 23453 byte range 0 - 1023: val 0 expected 54
> corrupt block 23454 byte range 0 - 1023: val 0 expected 55
> corrupt block 23455 byte range 0 - 1023: val 0 expected 56
> corrupt block 23456 byte range 0 - 1023: val 0 expected 57
> corrupt block 23457 byte range 0 - 1023: val 0 expected 58
> corrupt block 23458 byte range 0 - 1023: val 0 expected 59
> corrupt block 23459 byte range 0 - 1023: val 0 expected 60
> corrupt block 23460 byte range 0 - 1023: val 0 expected 61
> corrupt block 23461 byte range 0 - 1023: val 0 expected 62
> corrupt block 23462 byte range 0 - 1023: val 0 expected 63
> corrupt block 23463 byte range 0 - 1023: val 0 expected 64
> corrupt block 23464 byte range 0 - 1023: val 0 expected 65
> corrupt block 23465 byte range 0 - 1023: val 0 expected 66
> corrupt block 23466 byte range 0 - 1023: val 0 expected 67
> corrupt block 23467 byte range 0 - 1023: val 0 expected 68
> corrupt block 23468 byte range 0 - 1023: val 0 expected 69
> corrupt block 23469 byte range 0 - 1023: val 0 expected 70
> corrupt block 23470 byte range 0 - 1023: val 0 expected 71
> corrupt block 23471 byte range 0 - 1023: val 0 expected 72
> corrupt block 23472 byte range 0 - 1023: val 0 expected 73
> corrupt block 23473 byte range 0 - 1023: val 0 expected 74
> corrupt block 23474 byte range 0 - 1023: val 0 expected 75
> corrupt block 23475 byte range 0 - 1023: val 0 expected 76
> corrupt block 23476 byte range 0 - 1023: val 0 expected 77
> corrupt block 23477 byte range 0 - 1023: val 0 expected 78
> corrupt block 23478 byte range 0 - 1023: val 0 expected 79
> corrupt block 23479 byte range 0 - 1023: val 0 expected 80
> corrupt block 23480 byte range 0 - 1023: val 0 expected 81
> corrupt block 23481 byte range 0 - 1023: val 0 expected 82
> corrupt block 23482 byte range 0 - 1023: val 0 expected 83
> corrupt block 23483 byte range 0 - 1023: val 0 expected 84
> corrupt block 23484 byte range 0 - 1023: val 0 expected 85
> corrupt block 23485 byte range 0 - 1023: val 0 expected 86
> corrupt block 23486 byte range 0 - 1023: val 0 expected 87
> corrupt block 23487 byte range 0 - 927: val 0 expected 88




^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/4] qcow2: Fix corruption on discard during write with COW
  2026-05-21 15:14           ` Fiona Ebner
@ 2026-05-21 16:14             ` Thomas Lamprecht
  0 siblings, 0 replies; 19+ messages in thread
From: Thomas Lamprecht @ 2026-05-21 16:14 UTC (permalink / raw)
  To: Fiona Ebner, Kevin Wolf
  Cc: qemu-block, Michael Tokarev, hreitz, den, stefanha, qemu-stable,
	qemu-devel

Am 21.05.26 um 17:14 schrieb Fiona Ebner:
> Am 21.05.26 um 4:35 PM schrieb Kevin Wolf:
>> Anyway, as it depends on a specific pattern of discard and cluster
>> allocation happening while a write request is in flight, it should be
>> possible to use tracing to find out if anything like that is happening
>> in your case.
> 
> I will try to do tracing. With the following program, I see that the
> corrupt memory reads back as zeroes:
Hmm, looking at qcow2_co_pwrite_zeroes(): could the cluster state
observed before wait_for_dependencies() drops s->lock differ from what
qcow2_subcluster_zeroize() then operates on?



^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2026-05-21 16:16 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-04-27 17:05 [PATCH 0/4] qcow2: Fix corruption on discard during write with COW Kevin Wolf
2026-04-27 17:05 ` [PATCH 1/4] commit: Drain nodes across all of bdrv_commit() Kevin Wolf
2026-04-29 15:06   ` Fiona Ebner
2026-05-12 12:16     ` Kevin Wolf
2026-05-13 15:28       ` Fiona Ebner
2026-04-27 17:05 ` [PATCH 2/4] qemu-io: Add 'aio_discard' command Kevin Wolf
2026-04-28 10:59   ` Denis V. Lunev
2026-04-27 17:05 ` [PATCH 3/4] qcow2: Fix corruption on discard during write with COW Kevin Wolf
2026-04-29 15:28   ` Fiona Ebner
2026-05-12 12:22     ` Kevin Wolf
2026-05-13 15:34       ` Fiona Ebner
2026-05-21 12:12   ` Fiona Ebner
2026-05-21 13:46     ` Kevin Wolf
2026-05-21 14:18       ` Fiona Ebner
2026-05-21 14:35         ` Kevin Wolf
2026-05-21 15:14           ` Fiona Ebner
2026-05-21 16:14             ` Thomas Lamprecht
2026-04-27 17:05 ` [PATCH 4/4] iotests/046: Test that discard/write_zeroes wait for dependencies Kevin Wolf
2026-04-28 11:00 ` [PATCH 0/4] qcow2: Fix corruption on discard during write with COW Denis V. Lunev

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.