From: Hanna Reitz <hreitz@redhat.com>
To: Stefan Hajnoczi <stefanha@redhat.com>, qemu-devel@nongnu.org
Cc: Kevin Wolf <kwolf@redhat.com>,
Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>,
qemu-block@nongnu.org
Subject: Re: [RFC] block-backend: prevent dangling BDS pointer in blk_drain()
Date: Thu, 9 Dec 2021 16:45:13 +0100 [thread overview]
Message-ID: <404e8b66-b64a-fded-db53-81eef675fc33@redhat.com> (raw)
In-Reply-To: <20211209142304.381253-1-stefanha@redhat.com>
On 09.12.21 15:23, Stefan Hajnoczi wrote:
> The BlockBackend root child can change during bdrv_drained_begin() when
> aio_poll() is invoked. In fact the BlockDriverState can reach refcnt 0
> and blk_drain() is left with a dangling BDS pointer.
>
> One example is scsi_device_purge_requests(), which calls blk_drain() to
> wait for in-flight requests to cancel. If the backup blockjob is active,
> then the BlockBackend root child is a temporary filter BDS owned by the
> blockjob. The blockjob can complete during bdrv_drained_begin() and the
> last reference to the BDS is released when the temporary filter node is
> removed. This results in a use-after-free when blk_drain() calls
> bdrv_drained_end(bs) on the dangling pointer.
>
> The general problem is that a function and its callers must not assume
> that bs is still valid across aio_poll(). Explicitly hold a reference to
> bs in blk_drain() to avoid the dangling pointer.
>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
> I found that BDS nodes are sometimes deleted with bs->quiesce_counter >
> 0 (at least when running "make check"), so it is currently not possible
> to put the bdrv_ref/unref() calls in bdrv_do_drained_begin() and
> bdrv_do_drained_end() because they will be unbalanced. That would have
> been a more general solution than only fixing blk_drain().
Deleting nodes that have a `quiesce_counter > 0` doesn’t seem wrong to
me – deleting only depends on strong references, and so I’d expect that
anything that increases the quiesce_counter also has a strong reference
to the node if the former wants the latter to stay around.
I suppose we could make it so that both the quiesce_counter and the
refcnt need to be 0 before a BDS is deleted (and then deletion can
happen both from bdrv_unref() and drained_end), but I don’t know whether
that’s really necessary. I’d rather leave it to the caller to ensure
they keep a strong reference throughout the drain.
The question is, how often do we have a situation like this, where we
take a weak reference for draining, because we assume there’s a strong
reference backing us up (namely the one through blk->root), but that
strong reference then can go away due to draining...
> Any suggestions for a better fix?
The fix makes sense to me.
One alternative that comes to my mind is to instead re-fetch `bs =
blk_bs(blk);` after the AIO_WAIT_WHILE() loop. But that might be wrong,
because if the node attached to the BB changed (i.e. isn’t `bs`, and
isn’t `NULL`), then we’d end the drain on the wrong node.
So I think your fix is the right one.
Hanna
> I think it's likely that more "dangling pointer across aio_poll()"
> problems exist :(.
>
> Here is the (hacky) reproducer:
>
> build/qemu-system-x86_64 \
> -name 'avocado-vt-vm1' \
> -sandbox on \
> -machine q35,memory-backend=mem-machine_mem \
> -device pcie-root-port,id=pcie-root-port-0,multifunction=on,bus=pcie.0,addr=0x1,chassis=1 \
> -device pcie-pci-bridge,id=pcie-pci-bridge-0,addr=0x0,bus=pcie-root-port-0 \
> -nodefaults \
> -device VGA,bus=pcie.0,addr=0x2 \
> -m 1024 \
> -object memory-backend-ram,size=1024M,id=mem-machine_mem \
> -smp 10,maxcpus=10,cores=5,threads=1,dies=1,sockets=2 \
> -cpu 'Cascadelake-Server-noTSX',+kvm_pv_unhalt \
> -chardev socket,wait=off,server=on,id=qmp_id_qmpmonitor1,path=/tmp/qmp.sock \
> -mon chardev=qmp_id_qmpmonitor1,mode=control \
> -chardev socket,wait=off,server=on,id=qmp_id_catch_monitor,path=/tmp/catch_monitor.sock \
> -mon chardev=qmp_id_catch_monitor,mode=control \
> -device pvpanic,ioport=0x505,id=idgKHYrQ \
> -chardev socket,wait=off,server=on,id=chardev_serial0,path=/tmp/serial.sock \
> -device isa-serial,id=serial0,chardev=chardev_serial0 \
> -chardev socket,id=seabioslog_id_20211110-012521-TNCkxDmn,path=/tmp/seabios.sock,server=on,wait=off \
> -device isa-debugcon,chardev=seabioslog_id_20211110-012521-TNCkxDmn,iobase=0x402 \
> -device pcie-root-port,id=pcie-root-port-2,port=0x2,addr=0x1.0x2,bus=pcie.0,chassis=3 \
> -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pcie-root-port-2,addr=0x0 \
> -blockdev node-name=file_image1,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=test.img,cache.direct=on,cache.no-flush=off \
> -blockdev node-name=drive_image1,driver=raw,read-only=off,cache.direct=on,cache.no-flush=off,file=file_image1 \
> -device scsi-hd,id=image1,drive=drive_image1,write-cache=on \
> -blockdev node-name=file_src1,driver=file,auto-read-only=on,discard=unmap,aio=threads,filename=sr1.qcow2,cache.direct=on,cache.no-flush=off \
> -blockdev node-name=drive_src1,driver=qcow2,read-only=off,cache.direct=on,cache.no-flush=off,file=file_src1 \
> -device scsi-hd,id=src1,drive=drive_src1,write-cache=on \
> -device pcie-root-port,id=pcie-root-port-3,port=0x3,addr=0x1.0x3,bus=pcie.0,chassis=4 \
> -device virtio-net-pci,mac=9a:11:64:b0:5d:a8,id=idxnEEYY,netdev=idBjpylo,bus=pcie-root-port-3,addr=0x0 \
> -netdev user,id=idBjpylo \
> -vnc :0 \
> -rtc base=utc,clock=host,driftfix=slew \
> -boot menu=off,order=cdn,once=c,strict=off \
> -enable-kvm \
> -device pcie-root-port,id=pcie_extra_root_port_0,multifunction=on,bus=pcie.0,addr=0x3,chassis=5 &
>
> sleep 8 # delay for VM startup and socket creation
>
> nc -U /tmp/qmp.sock <<EOF
> {'execute': 'qmp_capabilities'}
> {'execute': 'blockdev-create', 'arguments': {'options': {'driver': 'file', 'filename': 'dst1.qcow2', 'size': 209715200}, 'job-id': 'file_dst1'}, 'id': 'Fk1bF3FV'}
> EOF
> sleep 1 # wait for blockdev-create completion
> nc -U /tmp/qmp.sock <<EOF
> {'execute': 'qmp_capabilities'}
> {'execute': 'job-dismiss', 'arguments': {'id': 'file_dst1'}, 'id': '13R5TDSj'}
> {'execute': 'blockdev-add', 'arguments': {'node-name': 'file_dst1', 'driver': 'file', 'filename': 'dst1.qcow2', 'aio': 'threads', 'auto-read-only': true, 'discard': 'unmap'}, 'id': 'VIzrN0zy'}
> {'execute': 'blockdev-create', 'arguments': {'options': {'driver': 'qcow2', 'file': 'file_dst1', 'size': 209715200}, 'job-id': 'drive_dst1'}, 'id': 'YX8t8hBs'}
> EOF
> sleep 1 # wait for blockdev-create completion
> nc -U /tmp/qmp.sock <<EOF
> {'execute': 'qmp_capabilities'}
> {'execute': 'job-dismiss', 'arguments': {'id': 'drive_dst1'}, 'id': 'OTZwYb7J'}
> {'execute': 'blockdev-add', 'arguments': {'node-name': 'drive_dst1', 'driver': 'qcow2', 'file': 'file_dst1', 'read-only': false}, 'id': 'QHyUxtql'}
> {'execute': 'system_reset', 'id': 'OREutgnz'}
> {'execute': 'blockdev-backup', 'arguments': {'device': 'drive_src1', 'target': 'drive_dst1', 'job-id': 'drive_src1_qnFF', 'sync': 'full', 'speed': 0, 'compress': false, 'auto-finalize': true, 'auto-dismiss': true, 'on-source-error': 'report', 'on-target-error': 'report'}, 'id': 'WbDARa8c'}
> EOF
> ---
> block/block-backend.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/block/block-backend.c b/block/block-backend.c
> index 12ef80ea17..5608c0451b 100644
> --- a/block/block-backend.c
> +++ b/block/block-backend.c
> @@ -1705,6 +1705,7 @@ void blk_drain(BlockBackend *blk)
> BlockDriverState *bs = blk_bs(blk);
>
> if (bs) {
> + bdrv_ref(bs);
> bdrv_drained_begin(bs);
> }
>
> @@ -1714,6 +1715,7 @@ void blk_drain(BlockBackend *blk)
>
> if (bs) {
> bdrv_drained_end(bs);
> + bdrv_unref(bs);
> }
> }
>
next prev parent reply other threads:[~2021-12-09 15:46 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-12-09 14:23 [RFC] block-backend: prevent dangling BDS pointer in blk_drain() Stefan Hajnoczi
2021-12-09 15:45 ` Hanna Reitz [this message]
2021-12-09 16:09 ` Vladimir Sementsov-Ogievskiy
2021-12-09 16:32 ` Stefan Hajnoczi
2021-12-09 16:51 ` Vladimir Sementsov-Ogievskiy
2021-12-13 9:39 ` Stefan Hajnoczi
2021-12-10 14:00 ` Kevin Wolf
2021-12-13 10:06 ` Stefan Hajnoczi
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=404e8b66-b64a-fded-db53-81eef675fc33@redhat.com \
--to=hreitz@redhat.com \
--cc=kwolf@redhat.com \
--cc=qemu-block@nongnu.org \
--cc=qemu-devel@nongnu.org \
--cc=stefanha@redhat.com \
--cc=vsementsov@virtuozzo.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).