Re: [PATCH 3/4] qcow2: Fix corruption on discard during write with COW

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Kevin Wolf <kwolf@redhat.com>
To: Fiona Ebner <f.ebner@proxmox.com>
Cc: qemu-block@nongnu.org, Michael Tokarev <mjt@tls.msk.ru>,
	hreitz@redhat.com, den@openvz.org, stefanha@redhat.com,
	qemu-stable@nongnu.org, qemu-devel@nongnu.org,
	Thomas Lamprecht <t.lamprecht@proxmox.com>
Subject: Re: [PATCH 3/4] qcow2: Fix corruption on discard during write with COW
Date: Thu, 21 May 2026 16:35:05 +0200	[thread overview]
Message-ID: <ag8YGeNoYB-sanMh@redhat.com> (raw)
In-Reply-To: <2fa73e56-f4b5-436e-ab25-5654e0837bce@proxmox.com>

Am 21.05.2026 um 16:18 hat Fiona Ebner geschrieben:
> Am 21.05.26 um 3:46 PM schrieb Kevin Wolf:
> > Am 21.05.2026 um 14:12 hat Fiona Ebner geschrieben:
> >> Am 27.04.26 um 7:04 PM schrieb Kevin Wolf:
> >> I'm still trying to figure things out and come up with a better
> >> reproducer, but wanted to let you know early, also because of the
> >> upcoming stable releases. Of course, I'd also be happy for hints/hunches
> >> and am happy to test suggestions!
> > 
> > Do you have any information about the options used with the image file?
> > In particular, is it using subclusters? Maybe just the 'qemu-img info'
> > output would already give a bit more context.
> 
> No subclusters if I'm not missing anything. When I created the image the
> output was:
> 
> Formatting '/mnt/pve/dir/images/300/vm-300-disk-0.qcow2', fmt=qcow2
> cluster_size=65536 extended_l2=off preallocation=metadata
> compression_type=zlib size=4510973952 lazy_refcounts=off refcount_bits=16
> 
> Our management layer doesn't log the command itself, but doing the same
> operation with logging added (and 301 instead of 300):
> 
> /usr/bin/qemu-img create -o preallocation=metadata -f qcow2
> /mnt/pve/dir/images/301/vm-301-disk-0.qcow2 4405248K
> 
> qemu-img info gives:
> [...]

Ok, looks like all default options.

> > Could you already locate the actual corruption and check what the
> > pattern looks like? Something like zeros where we would expect data or
> > the other way around? Or something less clear? (If you don't know,
> > that's a good answer too. I know well that this kind of things is hard
> > to debug.)
> 
> Unfortunately not. I can only see the symptom of memory swapped back in
> being corrupt (at least that's what happens AFAIU), leading to segfaults
> in various processes as well as issues with heap allocations, e.g.:
> corrupted double-linked list
> free(): invalid pointer
> 
> I'll write a small program which allocates memory with a fixed pattern
> and regularly dumps it, maybe that works to get an idea about the
> corruption.

AI suggests a scenario that looks like a real bug to me, though I'm not
sure if it's yours. See the reproducer below.

Basically it boils down to a non-allocating write being in flight to a
cluster that is concurrently discarded, turning the write essentially
into a host-cluster use-after-free. If you then allocate a new cluster
at the same time, the host cluster will be reused and the write that was
for a different guest cluster still writes to it.

I'm not completely sure yet what the right synchronisation mechanism
would be for this.

Anyway, as it depends on a specific pattern of discard and cluster
allocation happening while a write request is in flight, it should be
possible to use tracing to find out if anything like that is happening
in your case.

Kevin


blkdebug.conf:

[set-state]
state = "1"
event = "write_aio"
new_state = "2"

[set-state]
state = "2"
event = "cluster_alloc"
new_state = "3"


race_test.sh:

#!/bin/bash
#
# Reproducer for the wait_for_dependencies / skip_cow race in
# qcow2_subcluster_zeroize — demonstrating data corruption at an
# UNRELATED guest offset through host cluster reuse.
#
# The scenario:
#   1. Write A to a ZERO_ALLOC cluster creates l2meta. Data I/O suspended.
#   2. Write B to same cluster waits for A. Zero-write also waits for A.
#   3. A completes (cluster → NORMAL). B wakes first (FIFO), gets
#      skip_cow=true (no l2meta), starts data I/O — suspended by blkdebug.
#      Zero-write wakes, finds no deps (B invisible), frees cluster.
#   4. Write D to a DIFFERENT guest offset allocates the freed cluster.
#      D writes its data. D completes.
#   5. B resumes and writes to the same physical cluster, overwriting D.
#   6. Reading D's guest offset returns B's data. CORRUPTION.

set -e

DIR="$(cd "$(dirname "$0")" && pwd)"
QEMU_IO="${DIR}/../build/qemu-io"
QEMU_IMG="${DIR}/../build/qemu-img"
TEST_IMG="/tmp/race_test_$$.qcow2"
BLKDEBUG_CONF="${DIR}/blkdebug.conf"
LOG="/home/cursor/qemu/debug-8a8071.log"

cleanup() {
    rm -f "$TEST_IMG"
}
trap cleanup EXIT

echo "=== Creating test image ==="
"$QEMU_IMG" create -f qcow2 "$TEST_IMG" 1M

echo ""
echo "=== Preparing ZERO_ALLOC cluster at guest offset 0 ==="
"$QEMU_IO" -c "write -P 0x11 0 64k" \
            -c "write -z 0 64k" \
            "$TEST_IMG"

echo ""
echo "=== Running race reproducer ==="
#
# blkdebug.conf state machine:
#   State 1 --(write_aio)--> State 2 --(cluster_alloc)--> State 3
#
# - State 1: tagA breakpoint catches write A
# - State 2: tagB breakpoint catches write B (skip_cow write)
# - State 2→3 transition on cluster_alloc: D's allocation transitions
#   state to 3 BEFORE D fires write_aio, so D is NOT caught by tagB
#
# Sequence:
#   break write_aio tagA          -- breakpoint for state 1
#   aio_write A 0xAA 0 64k       -- suspended at tagA (state 1→2)
#   wait_break tagA
#   break write_aio tagB          -- breakpoint for state 2
#   aio_write B 0xBB 0 64k       -- waits for A (handle_dependencies)
#   aio_write -z -u 0 64k        -- waits for A (wait_for_dependencies)
#   resume tagA                   -- A completes. B wakes (skip_cow),
#                                    caught by tagB. Zero-write frees
#                                    cluster.
#   wait_break tagB               -- B suspended, cluster freed
#   write D 0xDD 64k 64k         -- D allocates the freed cluster
#                                    (cluster_alloc transitions to
#                                    state 3). D's write_aio fires at
#                                    state 3 — no breakpoint. D writes
#                                    its data and completes.
#   resume tagB                   -- B writes to the SAME physical
#                                    cluster, overwriting D's data
#   aio_flush
#
#   read -P 0xDD 64k 64k         -- EXPECTS D's data (0xDD)
#                                    GETS B's data (0xBB) → CORRUPTION

QEMU_IO_OUTPUT=$("$QEMU_IO" \
    -c "break write_aio tagA" \
    -c "aio_write -P 0xAA 0 64k" \
    -c "wait_break tagA" \
    -c "break write_aio tagB" \
    -c "aio_write -P 0xBB 0 64k" \
    -c "aio_write -z -u 0 64k" \
    -c "resume tagA" \
    -c "wait_break tagB" \
    -c "write -P 0xDD 64k 64k" \
    -c "resume tagB" \
    -c "aio_flush" \
    -c "read -vP 0xDD 64k 512" \
    -c "read -vP 0 0 512" \
    "blkdebug:${BLKDEBUG_CONF}:${TEST_IMG}" 2>&1) || true

echo "$QEMU_IO_OUTPUT"

PATTERN_FAIL=$(echo "$QEMU_IO_OUTPUT" | grep -c "Pattern verification failed" || true)
if [ "$PATTERN_FAIL" -gt 0 ]; then
    echo ""
    echo "*** DATA CORRUPTION DETECTED at guest offset 64K ***"
    echo "*** D wrote 0xDD, but reading returns B's data (0xBB)."
    echo "*** B's write to the freed+reallocated cluster corrupted"
    echo "*** an UNRELATED guest address."
fi

echo ""
echo "=== Checking image integrity (metadata) ==="
"$QEMU_IMG" check "$TEST_IMG" || true

echo ""
echo "=== Allocation map ==="
"$QEMU_IMG" map --output=json "$TEST_IMG"

echo ""
echo "=== Checking log for race evidence ==="
if [ -f "$LOG" ]; then
    echo "--- Log entries (chronological) ---"
    cat "$LOG"
    echo ""
    echo "--- Race analysis ---"
    # Extract host offsets for the skip_cow write (B) and D's write
    B_HOST=$(grep '"has_l2meta":0' "$LOG" | grep -o '"host_offset":[0-9]*' | head -1 | grep -o '[0-9]*')
    D_HOST=$(grep '"offset":65536' "$LOG" | grep -o '"host_offset":[0-9]*' | head -1 | grep -o '[0-9]*')

    echo "Write B (skip_cow, no l2meta) host_offset: $B_HOST"
    echo "Write D (different guest offset)  host_offset: $D_HOST"

    if [ -n "$B_HOST" ] && [ -n "$D_HOST" ] && [ "$B_HOST" = "$D_HOST" ]; then
        echo ""
        echo "*** CLUSTER REUSE CONFIRMED: B and D write to the same"
        echo "*** physical cluster ($B_HOST) for different guest offsets."
        echo "*** B (guest offset 0) overwrites D (guest offset 64K)."
        echo "*** Reading guest offset 64K returns B's data → CORRUPTION"
        echo "*** at an unrelated guest address."
    fi
else
    echo "No log file found at $LOG"
fi

next prev parent reply	other threads:[~2026-05-21 14:36 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-04-27 17:05 [PATCH 0/4] qcow2: Fix corruption on discard during write with COW Kevin Wolf
2026-04-27 17:05 ` [PATCH 1/4] commit: Drain nodes across all of bdrv_commit() Kevin Wolf
2026-04-29 15:06   ` Fiona Ebner
2026-05-12 12:16     ` Kevin Wolf
2026-05-13 15:28       ` Fiona Ebner
2026-04-27 17:05 ` [PATCH 2/4] qemu-io: Add 'aio_discard' command Kevin Wolf
2026-04-28 10:59   ` Denis V. Lunev
2026-04-27 17:05 ` [PATCH 3/4] qcow2: Fix corruption on discard during write with COW Kevin Wolf
2026-04-29 15:28   ` Fiona Ebner
2026-05-12 12:22     ` Kevin Wolf
2026-05-13 15:34       ` Fiona Ebner
2026-05-21 12:12   ` Fiona Ebner
2026-05-21 13:46     ` Kevin Wolf
2026-05-21 14:18       ` Fiona Ebner
2026-05-21 14:35         ` Kevin Wolf [this message]
2026-05-21 15:14           ` Fiona Ebner
2026-05-21 16:14             ` Thomas Lamprecht
2026-04-27 17:05 ` [PATCH 4/4] iotests/046: Test that discard/write_zeroes wait for dependencies Kevin Wolf
2026-04-28 11:00 ` [PATCH 0/4] qcow2: Fix corruption on discard during write with COW Denis V. Lunev

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ag8YGeNoYB-sanMh@redhat.com \
    --to=kwolf@redhat.com \
    --cc=den@openvz.org \
    --cc=f.ebner@proxmox.com \
    --cc=hreitz@redhat.com \
    --cc=mjt@tls.msk.ru \
    --cc=qemu-block@nongnu.org \
    --cc=qemu-devel@nongnu.org \
    --cc=qemu-stable@nongnu.org \
    --cc=stefanha@redhat.com \
    --cc=t.lamprecht@proxmox.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.