* [PATCH v3 0/4] fix two edge cases related to stream block jobs
@ 2024-03-22 9:50 Fiona Ebner
2024-03-22 9:50 ` [PATCH v3 1/4] block/io: accept NULL qiov in bdrv_pad_request Fiona Ebner
` (4 more replies)
0 siblings, 5 replies; 17+ messages in thread
From: Fiona Ebner @ 2024-03-22 9:50 UTC (permalink / raw)
To: qemu-devel
Cc: qemu-block, qemu-stable, hreitz, kwolf, fam, stefanha,
t.lamprecht, w.bumiller
Changes in v3:
* Also deal with edge case in bdrv_next_cleanup(). Haven't run
into an actual issue there, but at least the caller in
migration/block.c uses bdrv_nb_sectors() which, while not a
coroutine wrapper itself (it's written manually), may call
bdrv_refresh_total_sectors(), which is a generated coroutine
wrapper, so AFAIU, the block graph can change during that call.
And even without that, it's just better to be more consistent
with bdrv_next().
Changes in v2:
* Ran into another issue while writing the IO test Stefan wanted
to have (good call :)), so include a fix for that and add the
test. I didn't notice during manual testing, because I hadn't
used a scripted QMP 'quit', so there was no race.
Fiona Ebner (3):
block-backend: fix edge case in bdrv_next() where BDS associated to BB
changes
block-backend: fix edge case in bdrv_next_cleanup() where BDS
associated to BB changes
iotests: add test for stream job with an unaligned prefetch read
Stefan Reiter (1):
block/io: accept NULL qiov in bdrv_pad_request
block/block-backend.c | 18 ++--
block/io.c | 31 ++++---
.../tests/stream-unaligned-prefetch | 86 +++++++++++++++++++
.../tests/stream-unaligned-prefetch.out | 5 ++
4 files changed, 117 insertions(+), 23 deletions(-)
create mode 100755 tests/qemu-iotests/tests/stream-unaligned-prefetch
create mode 100644 tests/qemu-iotests/tests/stream-unaligned-prefetch.out
--
2.39.2
* [PATCH v3 1/4] block/io: accept NULL qiov in bdrv_pad_request
2024-03-22 9:50 [PATCH v3 0/4] fix two edge cases related to stream block jobs Fiona Ebner
@ 2024-03-22 9:50 ` Fiona Ebner
2024-03-25 19:56 ` Stefan Hajnoczi
2024-03-22 9:50 ` [PATCH v3 2/4] block-backend: fix edge case in bdrv_next() where BDS associated to BB changes Fiona Ebner
` (3 subsequent siblings)
4 siblings, 1 reply; 17+ messages in thread
From: Fiona Ebner @ 2024-03-22 9:50 UTC (permalink / raw)
To: qemu-devel
Cc: qemu-block, qemu-stable, hreitz, kwolf, fam, stefanha,
t.lamprecht, w.bumiller
From: Stefan Reiter <s.reiter@proxmox.com>
Some operations, e.g. block-stream, perform reads while discarding the
results (only copy-on-read matters). In this case, they will pass NULL
as the target QEMUIOVector, which will however trip bdrv_pad_request,
since it wants to extend its passed vector. In particular, this is the
case for the blk_co_preadv() call in stream_populate().
If there is no qiov, no operation can be done with it, but the bytes
and offset still need to be updated, so the subsequent aligned read
will actually be aligned and not run into an assertion failure.
In particular, this can happen when the request alignment of the top
node is larger than the allocated part of the bottom node, in which
case padding becomes necessary. For example:
> ./qemu-img create /tmp/backing.qcow2 -f qcow2 64M -o cluster_size=32768
> ./qemu-io -c "write -P42 0x0 0x1" /tmp/backing.qcow2
> ./qemu-img create /tmp/top.qcow2 -f qcow2 64M -b /tmp/backing.qcow2 -F qcow2
> ./qemu-system-x86_64 --qmp stdio \
> --blockdev qcow2,node-name=node0,file.driver=file,file.filename=/tmp/top.qcow2 \
> <<EOF
> {"execute": "qmp_capabilities"}
> {"execute": "blockdev-add", "arguments": { "driver": "compress", "file": "node0", "node-name": "node1" } }
> {"execute": "block-stream", "arguments": { "job-id": "stream0", "device": "node1" } }
> EOF
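For illustration, the numbers in this reproducer work out roughly as follows
(values inferred from the image settings above, shown only as a sketch):

    offset = 0, bytes = 32768  /* allocated area of the backing image */
    align  = 65536             /* request alignment of the compress node,
                                  i.e. the top image's 64k cluster size */
    pad->head = 0, pad->tail = 32768
    /* after padding: *offset == 0, *bytes == 65536, i.e. aligned again */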
Originally-by: Stefan Reiter <s.reiter@proxmox.com>
Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
[FE: do update bytes and offset in any case
add reproducer to commit message]
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---
No changes in v3.
No changes in v2.
block/io.c | 31 +++++++++++++++++++------------
1 file changed, 19 insertions(+), 12 deletions(-)
diff --git a/block/io.c b/block/io.c
index 33150c0359..395bea3bac 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1726,22 +1726,29 @@ static int bdrv_pad_request(BlockDriverState *bs,
return 0;
}
- sliced_iov = qemu_iovec_slice(*qiov, *qiov_offset, *bytes,
- &sliced_head, &sliced_tail,
- &sliced_niov);
+ /*
+ * For prefetching in stream_populate(), no qiov is passed along, because
+ * only copy-on-read matters.
+ */
+ if (qiov && *qiov) {
+ sliced_iov = qemu_iovec_slice(*qiov, *qiov_offset, *bytes,
+ &sliced_head, &sliced_tail,
+ &sliced_niov);
- /* Guaranteed by bdrv_check_request32() */
- assert(*bytes <= SIZE_MAX);
- ret = bdrv_create_padded_qiov(bs, pad, sliced_iov, sliced_niov,
- sliced_head, *bytes);
- if (ret < 0) {
- bdrv_padding_finalize(pad);
- return ret;
+ /* Guaranteed by bdrv_check_request32() */
+ assert(*bytes <= SIZE_MAX);
+ ret = bdrv_create_padded_qiov(bs, pad, sliced_iov, sliced_niov,
+ sliced_head, *bytes);
+ if (ret < 0) {
+ bdrv_padding_finalize(pad);
+ return ret;
+ }
+ *qiov = &pad->local_qiov;
+ *qiov_offset = 0;
}
+
*bytes += pad->head + pad->tail;
*offset -= pad->head;
- *qiov = &pad->local_qiov;
- *qiov_offset = 0;
if (padded) {
*padded = true;
}
--
2.39.2
* [PATCH v3 2/4] block-backend: fix edge case in bdrv_next() where BDS associated to BB changes
2024-03-22 9:50 [PATCH v3 0/4] fix two edge cases related to stream block jobs Fiona Ebner
2024-03-22 9:50 ` [PATCH v3 1/4] block/io: accept NULL qiov in bdrv_pad_request Fiona Ebner
@ 2024-03-22 9:50 ` Fiona Ebner
2024-03-25 20:06 ` Stefan Hajnoczi
2024-03-26 12:44 ` Kevin Wolf
2024-03-22 9:50 ` [PATCH v3 3/4] block-backend: fix edge case in bdrv_next_cleanup() " Fiona Ebner
` (2 subsequent siblings)
4 siblings, 2 replies; 17+ messages in thread
From: Fiona Ebner @ 2024-03-22 9:50 UTC (permalink / raw)
To: qemu-devel
Cc: qemu-block, qemu-stable, hreitz, kwolf, fam, stefanha,
t.lamprecht, w.bumiller
The old_bs variable in bdrv_next() is currently determined by looking
at the old block backend. However, if the block graph changes before
the next bdrv_next() call, it might be that the associated BDS is not
the same that was referenced previously. In that case, the wrong BDS
is unreferenced, leading to an assertion failure later:
> bdrv_unref: Assertion `bs->refcnt > 0' failed.
In particular, this can happen in the context of bdrv_flush_all(),
when polling for bdrv_co_flush() in the generated co-wrapper leads to
a graph change (for example with a stream block job [0]).
A racy reproducer:
> #!/bin/bash
> rm -f /tmp/backing.qcow2
> rm -f /tmp/top.qcow2
> ./qemu-img create /tmp/backing.qcow2 -f qcow2 64M
> ./qemu-io -c "write -P42 0x0 0x1" /tmp/backing.qcow2
> ./qemu-img create /tmp/top.qcow2 -f qcow2 64M -b /tmp/backing.qcow2 -F qcow2
> ./qemu-system-x86_64 --qmp stdio \
> --blockdev qcow2,node-name=node0,file.driver=file,file.filename=/tmp/top.qcow2 \
> <<EOF
> {"execute": "qmp_capabilities"}
> {"execute": "block-stream", "arguments": { "job-id": "stream0", "device": "node0" } }
> {"execute": "quit"}
> EOF
[0]:
> #0 bdrv_replace_child_tran (child=..., new_bs=..., tran=...)
> #1 bdrv_replace_node_noperm (from=..., to=..., auto_skip=..., tran=..., errp=...)
> #2 bdrv_replace_node_common (from=..., to=..., auto_skip=..., detach_subchain=..., errp=...)
> #3 bdrv_drop_filter (bs=..., errp=...)
> #4 bdrv_cor_filter_drop (cor_filter_bs=...)
> #5 stream_prepare (job=...)
> #6 job_prepare_locked (job=...)
> #7 job_txn_apply_locked (fn=..., job=...)
> #8 job_do_finalize_locked (job=...)
> #9 job_exit (opaque=...)
> #10 aio_bh_poll (ctx=...)
> #11 aio_poll (ctx=..., blocking=...)
> #12 bdrv_poll_co (s=...)
> #13 bdrv_flush (bs=...)
> #14 bdrv_flush_all ()
> #15 do_vm_stop (state=..., send_stop=...)
> #16 vm_shutdown ()
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---
No changes in v3.
New in v2.
block/block-backend.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
diff --git a/block/block-backend.c b/block/block-backend.c
index 9c4de79e6b..28af1eb17a 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -599,14 +599,14 @@ BlockDriverState *bdrv_next(BdrvNextIterator *it)
/* Must be called from the main loop */
assert(qemu_get_current_aio_context() == qemu_get_aio_context());
+ old_bs = it->bs;
+
/* First, return all root nodes of BlockBackends. In order to avoid
* returning a BDS twice when multiple BBs refer to it, we only return it
* if the BB is the first one in the parent list of the BDS. */
if (it->phase == BDRV_NEXT_BACKEND_ROOTS) {
BlockBackend *old_blk = it->blk;
- old_bs = old_blk ? blk_bs(old_blk) : NULL;
-
do {
it->blk = blk_all_next(it->blk);
bs = it->blk ? blk_bs(it->blk) : NULL;
@@ -620,11 +620,10 @@ BlockDriverState *bdrv_next(BdrvNextIterator *it)
if (bs) {
bdrv_ref(bs);
bdrv_unref(old_bs);
+ it->bs = bs;
return bs;
}
it->phase = BDRV_NEXT_MONITOR_OWNED;
- } else {
- old_bs = it->bs;
}
/* Then return the monitor-owned BDSes without a BB attached. Ignore all
--
2.39.2
* [PATCH v3 3/4] block-backend: fix edge case in bdrv_next_cleanup() where BDS associated to BB changes
2024-03-22 9:50 [PATCH v3 0/4] fix two edge cases related to stream block jobs Fiona Ebner
2024-03-22 9:50 ` [PATCH v3 1/4] block/io: accept NULL qiov in bdrv_pad_request Fiona Ebner
2024-03-22 9:50 ` [PATCH v3 2/4] block-backend: fix edge case in bdrv_next() where BDS associated to BB changes Fiona Ebner
@ 2024-03-22 9:50 ` Fiona Ebner
2024-03-25 20:07 ` Stefan Hajnoczi
2024-03-22 9:50 ` [PATCH v3 4/4] iotests: add test for stream job with an unaligned prefetch read Fiona Ebner
2024-03-25 20:11 ` [PATCH v3 0/4] fix two edge cases related to stream block jobs Stefan Hajnoczi
4 siblings, 1 reply; 17+ messages in thread
From: Fiona Ebner @ 2024-03-22 9:50 UTC (permalink / raw)
To: qemu-devel
Cc: qemu-block, qemu-stable, hreitz, kwolf, fam, stefanha,
t.lamprecht, w.bumiller
Same rationale as for commit "block-backend: fix edge case in
bdrv_next() where BDS associated to BB changes". The block graph might
change between the bdrv_next() call and the bdrv_next_cleanup() call,
so it could be that the associated BDS is no longer the same one that
was referenced previously. Instead, rely on bdrv_next() to set
it->bs to the BDS it referenced and unreference that one in any case.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---
New in v3.
block/block-backend.c | 11 ++++-------
1 file changed, 4 insertions(+), 7 deletions(-)
diff --git a/block/block-backend.c b/block/block-backend.c
index 28af1eb17a..db6f9b92a3 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -663,13 +663,10 @@ void bdrv_next_cleanup(BdrvNextIterator *it)
/* Must be called from the main loop */
assert(qemu_get_current_aio_context() == qemu_get_aio_context());
- if (it->phase == BDRV_NEXT_BACKEND_ROOTS) {
- if (it->blk) {
- bdrv_unref(blk_bs(it->blk));
- blk_unref(it->blk);
- }
- } else {
- bdrv_unref(it->bs);
+ bdrv_unref(it->bs);
+
+ if (it->phase == BDRV_NEXT_BACKEND_ROOTS && it->blk) {
+ blk_unref(it->blk);
}
bdrv_next_reset(it);
--
2.39.2
* [PATCH v3 4/4] iotests: add test for stream job with an unaligned prefetch read
2024-03-22 9:50 [PATCH v3 0/4] fix two edge cases related to stream block jobs Fiona Ebner
` (2 preceding siblings ...)
2024-03-22 9:50 ` [PATCH v3 3/4] block-backend: fix edge case in bdrv_next_cleanup() " Fiona Ebner
@ 2024-03-22 9:50 ` Fiona Ebner
2024-03-25 20:09 ` Stefan Hajnoczi
2024-03-25 20:11 ` [PATCH v3 0/4] fix two edge cases related to stream block jobs Stefan Hajnoczi
4 siblings, 1 reply; 17+ messages in thread
From: Fiona Ebner @ 2024-03-22 9:50 UTC (permalink / raw)
To: qemu-devel
Cc: qemu-block, qemu-stable, hreitz, kwolf, fam, stefanha,
t.lamprecht, w.bumiller
Previously, bdrv_pad_request() could not deal with a NULL qiov when
a read needed to be aligned. During prefetch, a stream job will pass a
NULL qiov. Add a test case to cover this scenario.
By accident, also covers a previous race during shutdown, where block
graph changes during iteration in bdrv_flush_all() could lead to
unreferencing the wrong block driver state and an assertion failure
later.
Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
---
No changes in v3.
New in v2.
.../tests/stream-unaligned-prefetch | 86 +++++++++++++++++++
.../tests/stream-unaligned-prefetch.out | 5 ++
2 files changed, 91 insertions(+)
create mode 100755 tests/qemu-iotests/tests/stream-unaligned-prefetch
create mode 100644 tests/qemu-iotests/tests/stream-unaligned-prefetch.out
diff --git a/tests/qemu-iotests/tests/stream-unaligned-prefetch b/tests/qemu-iotests/tests/stream-unaligned-prefetch
new file mode 100755
index 0000000000..546db1d369
--- /dev/null
+++ b/tests/qemu-iotests/tests/stream-unaligned-prefetch
@@ -0,0 +1,86 @@
+#!/usr/bin/env python3
+# group: rw quick
+#
+# Test what happens when a stream job does an unaligned prefetch read
+# which requires padding while having a NULL qiov.
+#
+# Copyright (C) Proxmox Server Solutions GmbH
+#
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 2 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program. If not, see <http://www.gnu.org/licenses/>.
+#
+
+import os
+import iotests
+from iotests import imgfmt, qemu_img_create, qemu_io, QMPTestCase
+
+image_size = 1 * 1024 * 1024
+cluster_size = 64 * 1024
+base = os.path.join(iotests.test_dir, 'base.img')
+top = os.path.join(iotests.test_dir, 'top.img')
+
+class TestStreamUnalignedPrefetch(QMPTestCase):
+ def setUp(self) -> None:
+ """
+ Create two images:
+ - base image {base} with {cluster_size // 2} bytes allocated
+ - top image {top} without any data allocated and coarser
+ cluster size
+
+ Attach a compress filter for the top image, because that
+ requires that the request alignment is the top image's cluster
+ size.
+ """
+ qemu_img_create('-f', imgfmt,
+ '-o', 'cluster_size={}'.format(cluster_size // 2),
+ base, str(image_size))
+ qemu_io('-c', f'write 0 {cluster_size // 2}', base)
+ qemu_img_create('-f', imgfmt,
+ '-o', 'cluster_size={}'.format(cluster_size),
+ top, str(image_size))
+
+ self.vm = iotests.VM()
+ self.vm.add_blockdev(self.vm.qmp_to_opts({
+ 'driver': imgfmt,
+ 'node-name': 'base',
+ 'file': {
+ 'driver': 'file',
+ 'filename': base
+ }
+ }))
+ self.vm.add_blockdev(self.vm.qmp_to_opts({
+ 'driver': 'compress',
+ 'node-name': 'compress-top',
+ 'file': {
+ 'driver': imgfmt,
+ 'node-name': 'top',
+ 'file': {
+ 'driver': 'file',
+ 'filename': top
+ },
+ 'backing': 'base'
+ }
+ }))
+ self.vm.launch()
+
+ def tearDown(self) -> None:
+ self.vm.shutdown()
+ os.remove(top)
+ os.remove(base)
+
+ def test_stream_unaligned_prefetch(self) -> None:
+ self.vm.cmd('block-stream', job_id='stream', device='compress-top')
+
+
+if __name__ == '__main__':
+ iotests.main(supported_fmts=['qcow2'], supported_protocols=['file'])
diff --git a/tests/qemu-iotests/tests/stream-unaligned-prefetch.out b/tests/qemu-iotests/tests/stream-unaligned-prefetch.out
new file mode 100644
index 0000000000..ae1213e6f8
--- /dev/null
+++ b/tests/qemu-iotests/tests/stream-unaligned-prefetch.out
@@ -0,0 +1,5 @@
+.
+----------------------------------------------------------------------
+Ran 1 tests
+
+OK
--
2.39.2
* Re: [PATCH v3 1/4] block/io: accept NULL qiov in bdrv_pad_request
2024-03-22 9:50 ` [PATCH v3 1/4] block/io: accept NULL qiov in bdrv_pad_request Fiona Ebner
@ 2024-03-25 19:56 ` Stefan Hajnoczi
0 siblings, 0 replies; 17+ messages in thread
From: Stefan Hajnoczi @ 2024-03-25 19:56 UTC (permalink / raw)
To: Fiona Ebner
Cc: qemu-devel, qemu-block, qemu-stable, hreitz, kwolf, fam,
t.lamprecht, w.bumiller
On Fri, Mar 22, 2024 at 10:50:06AM +0100, Fiona Ebner wrote:
> From: Stefan Reiter <s.reiter@proxmox.com>
>
> Some operations, e.g. block-stream, perform reads while discarding the
> results (only copy-on-read matters). In this case, they will pass NULL
> as the target QEMUIOVector, which will however trip bdrv_pad_request,
> since it wants to extend its passed vector. In particular, this is the
> case for the blk_co_preadv() call in stream_populate().
>
> If there is no qiov, no operation can be done with it, but the bytes
> and offset still need to be updated, so the subsequent aligned read
> will actually be aligned and not run into an assertion failure.
>
> In particular, this can happen when the request alignment of the top
> node is larger than the allocated part of the bottom node, in which
> case padding becomes necessary. For example:
>
> > ./qemu-img create /tmp/backing.qcow2 -f qcow2 64M -o cluster_size=32768
> > ./qemu-io -c "write -P42 0x0 0x1" /tmp/backing.qcow2
> > ./qemu-img create /tmp/top.qcow2 -f qcow2 64M -b /tmp/backing.qcow2 -F qcow2
> > ./qemu-system-x86_64 --qmp stdio \
> > --blockdev qcow2,node-name=node0,file.driver=file,file.filename=/tmp/top.qcow2 \
> > <<EOF
> > {"execute": "qmp_capabilities"}
> > {"execute": "blockdev-add", "arguments": { "driver": "compress", "file": "node0", "node-name": "node1" } }
> > {"execute": "block-stream", "arguments": { "job-id": "stream0", "device": "node1" } }
> > EOF
>
> Originally-by: Stefan Reiter <s.reiter@proxmox.com>
> Signed-off-by: Thomas Lamprecht <t.lamprecht@proxmox.com>
> [FE: do update bytes and offset in any case
> add reproducer to commit message]
> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
> ---
>
> No changes in v3.
> No changes in v2.
>
> block/io.c | 31 +++++++++++++++++++------------
> 1 file changed, 19 insertions(+), 12 deletions(-)
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
* Re: [PATCH v3 2/4] block-backend: fix edge case in bdrv_next() where BDS associated to BB changes
2024-03-22 9:50 ` [PATCH v3 2/4] block-backend: fix edge case in bdrv_next() where BDS associated to BB changes Fiona Ebner
@ 2024-03-25 20:06 ` Stefan Hajnoczi
2024-03-26 12:44 ` Kevin Wolf
1 sibling, 0 replies; 17+ messages in thread
From: Stefan Hajnoczi @ 2024-03-25 20:06 UTC (permalink / raw)
To: Fiona Ebner
Cc: qemu-devel, qemu-block, qemu-stable, hreitz, kwolf, fam,
t.lamprecht, w.bumiller
On Fri, Mar 22, 2024 at 10:50:07AM +0100, Fiona Ebner wrote:
> The old_bs variable in bdrv_next() is currently determined by looking
> at the old block backend. However, if the block graph changes before
> the next bdrv_next() call, it might be that the associated BDS is not
> the same that was referenced previously. In that case, the wrong BDS
> is unreferenced, leading to an assertion failure later:
>
> > bdrv_unref: Assertion `bs->refcnt > 0' failed.
>
> In particular, this can happen in the context of bdrv_flush_all(),
> when polling for bdrv_co_flush() in the generated co-wrapper leads to
> a graph change (for example with a stream block job [0]).
>
> A racy reproducer:
>
> > #!/bin/bash
> > rm -f /tmp/backing.qcow2
> > rm -f /tmp/top.qcow2
> > ./qemu-img create /tmp/backing.qcow2 -f qcow2 64M
> > ./qemu-io -c "write -P42 0x0 0x1" /tmp/backing.qcow2
> > ./qemu-img create /tmp/top.qcow2 -f qcow2 64M -b /tmp/backing.qcow2 -F qcow2
> > ./qemu-system-x86_64 --qmp stdio \
> > --blockdev qcow2,node-name=node0,file.driver=file,file.filename=/tmp/top.qcow2 \
> > <<EOF
> > {"execute": "qmp_capabilities"}
> > {"execute": "block-stream", "arguments": { "job-id": "stream0", "device": "node0" } }
> > {"execute": "quit"}
> > EOF
>
> [0]:
>
> > #0 bdrv_replace_child_tran (child=..., new_bs=..., tran=...)
> > #1 bdrv_replace_node_noperm (from=..., to=..., auto_skip=..., tran=..., errp=...)
> > #2 bdrv_replace_node_common (from=..., to=..., auto_skip=..., detach_subchain=..., errp=...)
> > #3 bdrv_drop_filter (bs=..., errp=...)
> > #4 bdrv_cor_filter_drop (cor_filter_bs=...)
> > #5 stream_prepare (job=...)
> > #6 job_prepare_locked (job=...)
> > #7 job_txn_apply_locked (fn=..., job=...)
> > #8 job_do_finalize_locked (job=...)
> > #9 job_exit (opaque=...)
> > #10 aio_bh_poll (ctx=...)
> > #11 aio_poll (ctx=..., blocking=...)
> > #12 bdrv_poll_co (s=...)
> > #13 bdrv_flush (bs=...)
> > #14 bdrv_flush_all ()
> > #15 do_vm_stop (state=..., send_stop=...)
> > #16 vm_shutdown ()
>
> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
> ---
>
> No changes in v3.
> New in v2.
>
> block/block-backend.c | 7 +++----
> 1 file changed, 3 insertions(+), 4 deletions(-)
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
* Re: [PATCH v3 3/4] block-backend: fix edge case in bdrv_next_cleanup() where BDS associated to BB changes
2024-03-22 9:50 ` [PATCH v3 3/4] block-backend: fix edge case in bdrv_next_cleanup() " Fiona Ebner
@ 2024-03-25 20:07 ` Stefan Hajnoczi
0 siblings, 0 replies; 17+ messages in thread
From: Stefan Hajnoczi @ 2024-03-25 20:07 UTC (permalink / raw)
To: Fiona Ebner
Cc: qemu-devel, qemu-block, qemu-stable, hreitz, kwolf, fam,
t.lamprecht, w.bumiller
On Fri, Mar 22, 2024 at 10:50:08AM +0100, Fiona Ebner wrote:
> Same rationale as for commit "block-backend: fix edge case in
> bdrv_next() where BDS associated to BB changes". The block graph might
> change between the bdrv_next() call and the bdrv_next_cleanup() call,
> so it could be that the associated BDS is no longer the same one that
> was referenced previously. Instead, rely on bdrv_next() to set
> it->bs to the BDS it referenced and unreference that one in any case.
>
> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
> ---
>
> New in v3.
>
> block/block-backend.c | 11 ++++-------
> 1 file changed, 4 insertions(+), 7 deletions(-)
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
* Re: [PATCH v3 4/4] iotests: add test for stream job with an unaligned prefetch read
2024-03-22 9:50 ` [PATCH v3 4/4] iotests: add test for stream job with an unaligned prefetch read Fiona Ebner
@ 2024-03-25 20:09 ` Stefan Hajnoczi
0 siblings, 0 replies; 17+ messages in thread
From: Stefan Hajnoczi @ 2024-03-25 20:09 UTC (permalink / raw)
To: Fiona Ebner
Cc: qemu-devel, qemu-block, qemu-stable, hreitz, kwolf, fam,
t.lamprecht, w.bumiller
On Fri, Mar 22, 2024 at 10:50:09AM +0100, Fiona Ebner wrote:
> Previously, bdrv_pad_request() could not deal with a NULL qiov when
> a read needed to be aligned. During prefetch, a stream job will pass a
> NULL qiov. Add a test case to cover this scenario.
>
> By accident, also covers a previous race during shutdown, where block
> graph changes during iteration in bdrv_flush_all() could lead to
> unreferencing the wrong block driver state and an assertion failure
> later.
>
> Signed-off-by: Fiona Ebner <f.ebner@proxmox.com>
> ---
>
> No changes in v3.
> New in v2.
>
> .../tests/stream-unaligned-prefetch | 86 +++++++++++++++++++
> .../tests/stream-unaligned-prefetch.out | 5 ++
> 2 files changed, 91 insertions(+)
> create mode 100755 tests/qemu-iotests/tests/stream-unaligned-prefetch
> create mode 100644 tests/qemu-iotests/tests/stream-unaligned-prefetch.out
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
* Re: [PATCH v3 0/4] fix two edge cases related to stream block jobs
2024-03-22 9:50 [PATCH v3 0/4] fix two edge cases related to stream block jobs Fiona Ebner
` (3 preceding siblings ...)
2024-03-22 9:50 ` [PATCH v3 4/4] iotests: add test for stream job with an unaligned prefetch read Fiona Ebner
@ 2024-03-25 20:11 ` Stefan Hajnoczi
2024-03-26 12:53 ` Kevin Wolf
4 siblings, 1 reply; 17+ messages in thread
From: Stefan Hajnoczi @ 2024-03-25 20:11 UTC (permalink / raw)
To: Fiona Ebner
Cc: qemu-devel, qemu-block, qemu-stable, hreitz, kwolf, fam,
t.lamprecht, w.bumiller, Vladimir Sementsov-Ogievskiy
On Fri, Mar 22, 2024 at 10:50:05AM +0100, Fiona Ebner wrote:
> Changes in v3:
> * Also deal with edge case in bdrv_next_cleanup(). Haven't run
> into an actual issue there, but at least the caller in
> migration/block.c uses bdrv_nb_sectors() which, while not a
> coroutine wrapper itself (it's written manually), may call
> bdrv_refresh_total_sectors(), which is a generated coroutine
> wrapper, so AFAIU, the block graph can change during that call.
> And even without that, it's just better to be more consistent
> with bdrv_next().
>
> Changes in v2:
> * Ran into another issue while writing the IO test Stefan wanted
> to have (good call :)), so include a fix for that and add the
> test. I didn't notice during manual testing, because I hadn't
> used a scripted QMP 'quit', so there was no race.
>
> Fiona Ebner (3):
> block-backend: fix edge case in bdrv_next() where BDS associated to BB
> changes
> block-backend: fix edge case in bdrv_next_cleanup() where BDS
> associated to BB changes
> iotests: add test for stream job with an unaligned prefetch read
>
> Stefan Reiter (1):
> block/io: accept NULL qiov in bdrv_pad_request
>
> block/block-backend.c | 18 ++--
> block/io.c | 31 ++++---
> .../tests/stream-unaligned-prefetch | 86 +++++++++++++++++++
> .../tests/stream-unaligned-prefetch.out | 5 ++
> 4 files changed, 117 insertions(+), 23 deletions(-)
> create mode 100755 tests/qemu-iotests/tests/stream-unaligned-prefetch
> create mode 100644 tests/qemu-iotests/tests/stream-unaligned-prefetch.out
Looks good to me. I will wait until Thursday before merging in case
Hanna, Vladimir, or Kevin have comments. Thanks!
Stefan
* Re: [PATCH v3 2/4] block-backend: fix edge case in bdrv_next() where BDS associated to BB changes
2024-03-22 9:50 ` [PATCH v3 2/4] block-backend: fix edge case in bdrv_next() where BDS associated to BB changes Fiona Ebner
2024-03-25 20:06 ` Stefan Hajnoczi
@ 2024-03-26 12:44 ` Kevin Wolf
2024-06-03 14:17 ` Fiona Ebner
1 sibling, 1 reply; 17+ messages in thread
From: Kevin Wolf @ 2024-03-26 12:44 UTC (permalink / raw)
To: Fiona Ebner
Cc: qemu-devel, qemu-block, qemu-stable, hreitz, fam, stefanha,
t.lamprecht, w.bumiller
Am 22.03.2024 um 10:50 hat Fiona Ebner geschrieben:
> The old_bs variable in bdrv_next() is currently determined by looking
> at the old block backend. However, if the block graph changes before
> the next bdrv_next() call, it might be that the associated BDS is not
> the same that was referenced previously. In that case, the wrong BDS
> is unreferenced, leading to an assertion failure later:
>
> > bdrv_unref: Assertion `bs->refcnt > 0' failed.
Your change makes sense, but in theory it shouldn't make a difference.
The real bug is in the caller, you can't allow graph modifications while
iterating the list of nodes. Even if it doesn't crash (like after your
patch), you don't have any guarantee that you will have seen every node
that exists at the end - and maybe not even that you don't get the
same node twice.
> In particular, this can happen in the context of bdrv_flush_all(),
> when polling for bdrv_co_flush() in the generated co-wrapper leads to
> a graph change (for example with a stream block job [0]).
The whole locking around this case is a bit tricky and would deserve
some cleanup.
The basic rule for bdrv_next() callers is that they need to hold the
graph reader lock as long as they are iterating the graph, otherwise
it's not safe. This implies that the ref/unref pairs in it should never
make a difference either - which is important, because at least
releasing the last reference is forbidden while holding the graph lock.
I intended to remove the ref/unref for bdrv_next(), but I didn't because
I realised that the callers need to be audited first that they really
obey the rules. You found one that would be problematic.
The thing that bdrv_flush_all() gets wrong is that it promises to follow
the graph lock rules with GRAPH_RDLOCK_GUARD_MAINLOOP(), but then calls
something that polls. The compiler can't catch this because bdrv_flush()
is a co_wrapper_mixed_bdrv_rdlock. The behaviour for these functions is:
- If called outside of coroutine context, they are GRAPH_UNLOCKED
- If called in coroutine context, they are GRAPH_RDLOCK
We should probably try harder to get rid of the mixed functions, because
a synchronous co_wrapper_bdrv_rdlock could actually be marked
GRAPH_UNLOCKED in the code and then the compiler could catch this case.
The fix for bdrv_flush_all() is probably to make it bdrv_co_flush_all()
with a coroutine wrapper so that the graph lock is held for the whole
function. Then calling bdrv_co_flush() while iterating the list is safe
and doesn't allow concurrent graph modifications.
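A rough sketch of that direction (illustration only, not code from this
series; the exact declarations and annotations are assumptions):

/* header (sketch): declare the coroutine version GRAPH_RDLOCK and let
 * block-coroutine-wrapper.py generate the synchronous wrapper, which then
 * takes the graph reader lock around the whole call */
int coroutine_fn GRAPH_RDLOCK bdrv_co_flush_all(void);
int co_wrapper_bdrv_rdlock bdrv_flush_all(void);

/* block/io.c (sketch) */
int coroutine_fn bdrv_co_flush_all(void)
{
    BdrvNextIterator it;
    BlockDriverState *bs = NULL;
    int result = 0;

    /*
     * The reader lock is held by the generated wrapper for the whole loop,
     * so bdrv_co_flush() cannot lead to graph changes between bdrv_next()
     * calls.
     */
    for (bs = bdrv_first(&it); bs; bs = bdrv_next(&it)) {
        int ret = bdrv_co_flush(bs);

        if (ret < 0 && !result) {
            result = ret;
        }
    }

    return result;
}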
Kevin
* Re: [PATCH v3 0/4] fix two edge cases related to stream block jobs
2024-03-25 20:11 ` [PATCH v3 0/4] fix two edge cases related to stream block jobs Stefan Hajnoczi
@ 2024-03-26 12:53 ` Kevin Wolf
0 siblings, 0 replies; 17+ messages in thread
From: Kevin Wolf @ 2024-03-26 12:53 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: Fiona Ebner, qemu-devel, qemu-block, qemu-stable, hreitz, fam,
t.lamprecht, w.bumiller, Vladimir Sementsov-Ogievskiy
Am 25.03.2024 um 21:11 hat Stefan Hajnoczi geschrieben:
> On Fri, Mar 22, 2024 at 10:50:05AM +0100, Fiona Ebner wrote:
> > Changes in v3:
> > * Also deal with edge case in bdrv_next_cleanup(). Haven't run
> > into an actual issue there, but at least the caller in
> > migration/block.c uses bdrv_nb_sectors() which, while not a
> > coroutine wrapper itself (it's written manually), may call
> > bdrv_refresh_total_sectors(), which is a generated coroutine
> > wrapper, so AFAIU, the block graph can change during that call.
> > And even without that, it's just better to be more consistent
> > with bdrv_next().
> >
> > Changes in v2:
> > * Ran into another issue while writing the IO test Stefan wanted
> > to have (good call :)), so include a fix for that and add the
> > test. I didn't notice during manual testing, because I hadn't
> > used a scripted QMP 'quit', so there was no race.
> >
> > Fiona Ebner (3):
> > block-backend: fix edge case in bdrv_next() where BDS associated to BB
> > changes
> > block-backend: fix edge case in bdrv_next_cleanup() where BDS
> > associated to BB changes
> > iotests: add test for stream job with an unaligned prefetch read
> >
> > Stefan Reiter (1):
> > block/io: accept NULL qiov in bdrv_pad_request
> >
> > block/block-backend.c | 18 ++--
> > block/io.c | 31 ++++---
> > .../tests/stream-unaligned-prefetch | 86 +++++++++++++++++++
> > .../tests/stream-unaligned-prefetch.out | 5 ++
> > 4 files changed, 117 insertions(+), 23 deletions(-)
> > create mode 100755 tests/qemu-iotests/tests/stream-unaligned-prefetch
> > create mode 100644 tests/qemu-iotests/tests/stream-unaligned-prefetch.out
>
> Looks good to me. I will wait until Thursday before merging in case
> Hanna, Vladimir, or Kevin have comments. Thanks!
Let's not delay it to -rc2. If something turns out to be wrong with it,
we can still revert it, but I think getting fixes in earlier is better
during freeze.
Thanks, applied to the block branch.
Kevin
* Re: [PATCH v3 2/4] block-backend: fix edge case in bdrv_next() where BDS associated to BB changes
2024-03-26 12:44 ` Kevin Wolf
@ 2024-06-03 14:17 ` Fiona Ebner
2024-06-03 16:21 ` Kevin Wolf
0 siblings, 1 reply; 17+ messages in thread
From: Fiona Ebner @ 2024-06-03 14:17 UTC (permalink / raw)
To: Kevin Wolf
Cc: qemu-devel, qemu-block, qemu-stable, hreitz, fam, stefanha,
t.lamprecht, w.bumiller
Hi Kevin,
Am 26.03.24 um 13:44 schrieb Kevin Wolf:
> Am 22.03.2024 um 10:50 hat Fiona Ebner geschrieben:
>> The old_bs variable in bdrv_next() is currently determined by looking
>> at the old block backend. However, if the block graph changes before
>> the next bdrv_next() call, it might be that the associated BDS is not
>> the same that was referenced previously. In that case, the wrong BDS
>> is unreferenced, leading to an assertion failure later:
>>
>>> bdrv_unref: Assertion `bs->refcnt > 0' failed.
>
> Your change makes sense, but in theory it shouldn't make a difference.
> The real bug is in the caller, you can't allow graph modifications while
> iterating the list of nodes. Even if it doesn't crash (like after your
> patch), you don't have any guarantee that you will have seen every node
> that exists that the end - and maybe not even that you don't get the
> same node twice.
>
>> In particular, this can happen in the context of bdrv_flush_all(),
>> when polling for bdrv_co_flush() in the generated co-wrapper leads to
>> a graph change (for example with a stream block job [0]).
>
> The whole locking around this case is a bit tricky and would deserve
> some cleanup.
>
> The basic rule for bdrv_next() callers is that they need to hold the
> graph reader lock as long as they are iterating the graph, otherwise
> it's not safe. This implies that the ref/unref pairs in it should never
> make a difference either - which is important, because at least
> releasing the last reference is forbidden while holding the graph lock.
> I intended to remove the ref/unref for bdrv_next(), but I didn't because
> I realised that the callers need to be audited first that they really
> obey the rules. You found one that would be problematic.
>
> The thing that bdrv_flush_all() gets wrong is that it promises to follow
> the graph lock rules with GRAPH_RDLOCK_GUARD_MAINLOOP(), but then calls
> something that polls. The compiler can't catch this because bdrv_flush()
> is a co_wrapper_mixed_bdrv_rdlock. The behaviour for these functions is:
>
> - If called outside of coroutine context, they are GRAPH_UNLOCKED
> - If called in coroutine context, they are GRAPH_RDLOCK
>
> We should probably try harder to get rid of the mixed functions, because
> a synchronous co_wrapper_bdrv_rdlock could actually be marked
> GRAPH_UNLOCKED in the code and then the compiler could catch this case.
>
> The fix for bdrv_flush_all() is probably to make it bdrv_co_flush_all()
> with a coroutine wrapper so that the graph lock is held for the whole
> function. Then calling bdrv_co_flush() while iterating the list is safe
> and doesn't allow concurrent graph modifications.
I attempted this now, but ran into two issues:
The first is that I had to add support for a function without arguments
to the block-coroutine-wrapper.py script. I can send this as an RFC in
any case if desired.
The second is that iotest 255 ran into an assertion failure upon QMP 'quit':
> ../block/graph-lock.c:113: bdrv_graph_wrlock: Assertion `!qemu_in_coroutine()' failed.
Looking at the backtrace:
> #5 0x0000762a90cc3eb2 in __GI___assert_fail
> (assertion=0x5afb07991e7d "!qemu_in_coroutine()", file=0x5afb07991e00 "../block/graph-lock.c", line=113, function=0x5afb07991f20 <__PRETTY_FUNCTION__.4> "bdrv_graph_wrlock")
> at ./assert/assert.c:101
> #6 0x00005afb07585311 in bdrv_graph_wrlock () at ../block/graph-lock.c:113
> #7 0x00005afb07573a36 in blk_remove_bs (blk=0x5afb0af99420) at ../block/block-backend.c:901
> #8 0x00005afb075729a7 in blk_delete (blk=0x5afb0af99420) at ../block/block-backend.c:487
> #9 0x00005afb07572d88 in blk_unref (blk=0x5afb0af99420) at ../block/block-backend.c:547
> #10 0x00005afb07572fe8 in bdrv_next (it=0x762a852fef00) at ../block/block-backend.c:618
> #11 0x00005afb0758cd65 in bdrv_co_flush_all () at ../block/io.c:2347
> #12 0x00005afb0753ba37 in bdrv_co_flush_all_entry (opaque=0x7ffff12c6050) at block/block-gen.c:1391
> #13 0x00005afb0773bf41 in coroutine_trampoline (i0=168365184, i1=23291)
So I guess calling bdrv_next() is not safe from a coroutine, because the
function doing the iteration could end up being the last thing to have a
reference for the BB.
Best Regards,
Fiona
* Re: [PATCH v3 2/4] block-backend: fix edge case in bdrv_next() where BDS associated to BB changes
2024-06-03 14:17 ` Fiona Ebner
@ 2024-06-03 16:21 ` Kevin Wolf
2024-06-04 7:58 ` Fiona Ebner
0 siblings, 1 reply; 17+ messages in thread
From: Kevin Wolf @ 2024-06-03 16:21 UTC (permalink / raw)
To: Fiona Ebner
Cc: qemu-devel, qemu-block, qemu-stable, hreitz, fam, stefanha,
t.lamprecht, w.bumiller
Am 03.06.2024 um 16:17 hat Fiona Ebner geschrieben:
> Hi Kevin,
>
> Am 26.03.24 um 13:44 schrieb Kevin Wolf:
> > Am 22.03.2024 um 10:50 hat Fiona Ebner geschrieben:
> >> The old_bs variable in bdrv_next() is currently determined by looking
> >> at the old block backend. However, if the block graph changes before
> >> the next bdrv_next() call, it might be that the associated BDS is not
> >> the same that was referenced previously. In that case, the wrong BDS
> >> is unreferenced, leading to an assertion failure later:
> >>
> >>> bdrv_unref: Assertion `bs->refcnt > 0' failed.
> >
> > Your change makes sense, but in theory it shouldn't make a difference.
> > The real bug is in the caller, you can't allow graph modifications while
> > iterating the list of nodes. Even if it doesn't crash (like after your
> > patch), you don't have any guarantee that you will have seen every node
> > that exists at the end - and maybe not even that you don't get the
> > same node twice.
> >
> >> In particular, this can happen in the context of bdrv_flush_all(),
> >> when polling for bdrv_co_flush() in the generated co-wrapper leads to
> >> a graph change (for example with a stream block job [0]).
> >
> > The whole locking around this case is a bit tricky and would deserve
> > some cleanup.
> >
> > The basic rule for bdrv_next() callers is that they need to hold the
> > graph reader lock as long as they are iterating the graph, otherwise
> > it's not safe. This implies that the ref/unref pairs in it should never
> > make a difference either - which is important, because at least
> > releasing the last reference is forbidden while holding the graph lock.
> > I intended to remove the ref/unref for bdrv_next(), but I didn't because
> > I realised that the callers need to be audited first that they really
> > obey the rules. You found one that would be problematic.
> >
> > The thing that bdrv_flush_all() gets wrong is that it promises to follow
> > the graph lock rules with GRAPH_RDLOCK_GUARD_MAINLOOP(), but then calls
> > something that polls. The compiler can't catch this because bdrv_flush()
> > is a co_wrapper_mixed_bdrv_rdlock. The behaviour for these functions is:
> >
> > - If called outside of coroutine context, they are GRAPH_UNLOCKED
> > - If called in coroutine context, they are GRAPH_RDLOCK
> >
> > We should probably try harder to get rid of the mixed functions, because
> > a synchronous co_wrapper_bdrv_rdlock could actually be marked
> > GRAPH_UNLOCKED in the code and then the compiler could catch this case.
> >
> > The fix for bdrv_flush_all() is probably to make it bdrv_co_flush_all()
> > with a coroutine wrapper so that the graph lock is held for the whole
> > function. Then calling bdrv_co_flush() while iterating the list is safe
> > and doesn't allow concurrent graph modifications.
>
> I attempted this now, but ran into two issues:
>
> The first is that I had to add support for a function without arguments
> to the block-coroutine-wrapper.py script. I can send this as an RFC in
> any case if desired.
I think I wouldn't necessarily merge it if we don't have code that makes
use of it, but having it on the list as an RFC can't hurt either way.
Then we can come back to it once we do need it for something.
> The second is that iotest 255 ran into an assertion failure upon QMP 'quit':
>
> > ../block/graph-lock.c:113: bdrv_graph_wrlock: Assertion `!qemu_in_coroutine()' failed.
>
> Looking at the backtrace:
>
> > #5 0x0000762a90cc3eb2 in __GI___assert_fail
> > (assertion=0x5afb07991e7d "!qemu_in_coroutine()", file=0x5afb07991e00 "../block/graph-lock.c", line=113, function=0x5afb07991f20 <__PRETTY_FUNCTION__.4> "bdrv_graph_wrlock")
> > at ./assert/assert.c:101
> > #6 0x00005afb07585311 in bdrv_graph_wrlock () at ../block/graph-lock.c:113
> > #7 0x00005afb07573a36 in blk_remove_bs (blk=0x5afb0af99420) at ../block/block-backend.c:901
> > #8 0x00005afb075729a7 in blk_delete (blk=0x5afb0af99420) at ../block/block-backend.c:487
> > #9 0x00005afb07572d88 in blk_unref (blk=0x5afb0af99420) at ../block/block-backend.c:547
> > #10 0x00005afb07572fe8 in bdrv_next (it=0x762a852fef00) at ../block/block-backend.c:618
> > #11 0x00005afb0758cd65 in bdrv_co_flush_all () at ../block/io.c:2347
> > #12 0x00005afb0753ba37 in bdrv_co_flush_all_entry (opaque=0x7ffff12c6050) at block/block-gen.c:1391
> > #13 0x00005afb0773bf41 in coroutine_trampoline (i0=168365184, i1=23291)
>
> So I guess calling bdrv_next() is not safe from a coroutine, because
> the function doing the iteration could end up being the last thing to
> have a reference for the BB.
Does your bdrv_co_flush_all() take the graph (reader) lock? If so, this
is surprising, because while we hold the graph lock, no reference should
be able to go away - you need the writer lock for that and you won't get
it as long as bdrv_co_flush_all() locks the graph. So whatever had a
reference before the bdrv_next() loop must still have it now. Do you
know where it gets dropped?
Kevin
* Re: [PATCH v3 2/4] block-backend: fix edge case in bdrv_next() where BDS associated to BB changes
2024-06-03 16:21 ` Kevin Wolf
@ 2024-06-04 7:58 ` Fiona Ebner
2024-06-04 15:28 ` Kevin Wolf
0 siblings, 1 reply; 17+ messages in thread
From: Fiona Ebner @ 2024-06-04 7:58 UTC (permalink / raw)
To: Kevin Wolf
Cc: qemu-devel, qemu-block, qemu-stable, hreitz, fam, stefanha,
t.lamprecht, w.bumiller
Am 03.06.24 um 18:21 schrieb Kevin Wolf:
> Am 03.06.2024 um 16:17 hat Fiona Ebner geschrieben:
>> Am 26.03.24 um 13:44 schrieb Kevin Wolf:
>>>
>>> The fix for bdrv_flush_all() is probably to make it bdrv_co_flush_all()
>>> with a coroutine wrapper so that the graph lock is held for the whole
>>> function. Then calling bdrv_co_flush() while iterating the list is safe
>>> and doesn't allow concurrent graph modifications.
>>
>> The second is that iotest 255 ran into an assertion failure upon QMP 'quit':
>>
>>> ../block/graph-lock.c:113: bdrv_graph_wrlock: Assertion `!qemu_in_coroutine()' failed.
>>
>> Looking at the backtrace:
>>
>>> #5 0x0000762a90cc3eb2 in __GI___assert_fail
>>> (assertion=0x5afb07991e7d "!qemu_in_coroutine()", file=0x5afb07991e00 "../block/graph-lock.c", line=113, function=0x5afb07991f20 <__PRETTY_FUNCTION__.4> "bdrv_graph_wrlock")
>>> at ./assert/assert.c:101
>>> #6 0x00005afb07585311 in bdrv_graph_wrlock () at ../block/graph-lock.c:113
>>> #7 0x00005afb07573a36 in blk_remove_bs (blk=0x5afb0af99420) at ../block/block-backend.c:901
>>> #8 0x00005afb075729a7 in blk_delete (blk=0x5afb0af99420) at ../block/block-backend.c:487
>>> #9 0x00005afb07572d88 in blk_unref (blk=0x5afb0af99420) at ../block/block-backend.c:547
>>> #10 0x00005afb07572fe8 in bdrv_next (it=0x762a852fef00) at ../block/block-backend.c:618
>>> #11 0x00005afb0758cd65 in bdrv_co_flush_all () at ../block/io.c:2347
>>> #12 0x00005afb0753ba37 in bdrv_co_flush_all_entry (opaque=0x7ffff12c6050) at block/block-gen.c:1391
>>> #13 0x00005afb0773bf41 in coroutine_trampoline (i0=168365184, i1=23291)
>>
>> So I guess calling bdrv_next() is not safe from a coroutine, because
>> the function doing the iteration could end up being the last thing to
>> have a reference for the BB.
>
> Does your bdrv_co_flush_all() take the graph (reader) lock? If so, this
> is surprising, because while we hold the graph lock, no reference should
> be able to go away - you need the writer lock for that and you won't get
> it as long as bdrv_co_flush_all() locks the graph. So whatever had a
> reference before the bdrv_next() loop must still have it now. Do you
> know where it gets dropped?
>
AFAICT, yes, it does hold the graph reader lock. The generated code is:
> static void coroutine_fn bdrv_co_flush_all_entry(void *opaque)
> {
> BdrvFlushAll *s = opaque;
>
> bdrv_graph_co_rdlock();
> s->ret = bdrv_co_flush_all();
> bdrv_graph_co_rdunlock();
> s->poll_state.in_progress = false;
>
> aio_wait_kick();
> }
Apparently when the mirror job is aborted/exits, which can happen during
the polling for bdrv_co_flush_all_entry(), a reference can go away
without the write lock (at least my breakpoints didn't trigger) being held:
> #0 blk_unref (blk=0x5cdefe943d20) at ../block/block-backend.c:537
> #1 0x00005cdefb26697e in mirror_exit_common (job=0x5cdefeb53000) at ../block/mirror.c:710
> #2 0x00005cdefb263575 in mirror_abort (job=0x5cdefeb53000) at ../block/mirror.c:823
> #3 0x00005cdefb2248a6 in job_abort (job=0x5cdefeb53000) at ../job.c:825
> #4 0x00005cdefb2245f2 in job_finalize_single_locked (job=0x5cdefeb53000) at ../job.c:855
> #5 0x00005cdefb223852 in job_completed_txn_abort_locked (job=0x5cdefeb53000) at ../job.c:958
> #6 0x00005cdefb223714 in job_completed_locked (job=0x5cdefeb53000) at ../job.c:1065
> #7 0x00005cdefb224a8b in job_exit (opaque=0x5cdefeb53000) at ../job.c:1088
> #8 0x00005cdefb4134fc in aio_bh_call (bh=0x5cdefe7487c0) at ../util/async.c:171
> #9 0x00005cdefb4136ce in aio_bh_poll (ctx=0x5cdefd9cd750) at ../util/async.c:218
> #10 0x00005cdefb3efdfd in aio_poll (ctx=0x5cdefd9cd750, blocking=true) at ../util/aio-posix.c:722
> #11 0x00005cdefb20435e in bdrv_poll_co (s=0x7ffe491621d8) at ../block/block-gen.h:43
> #12 0x00005cdefb206a33 in bdrv_flush_all () at block/block-gen.c:1410
> #13 0x00005cdefae5c8ed in do_vm_stop (state=RUN_STATE_SHUTDOWN, send_stop=false)
> at ../system/cpus.c:297
> #14 0x00005cdefae5c850 in vm_shutdown () at ../system/cpus.c:308
> #15 0x00005cdefae6d892 in qemu_cleanup (status=0) at ../system/runstate.c:871
> #16 0x00005cdefb1a7e78 in qemu_default_main () at ../system/main.c:38
> #17 0x00005cdefb1a7eb8 in main (argc=34, argv=0x7ffe491623a8) at ../system/main.c:48
Looking at the code in mirror_exit_common(), it doesn't seem to acquire
a write lock:
> bdrv_graph_rdunlock_main_loop();
>
> /*
> * Remove target parent that still uses BLK_PERM_WRITE/RESIZE before
> * inserting target_bs at s->to_replace, where we might not be able to get
> * these permissions.
> */
> blk_unref(s->target);
> s->target = NULL;
The write lock is taken in blk_remove_bs() when the refcount drops to 0
and the BB is actually removed:
> bdrv_graph_wrlock();
> bdrv_root_unref_child(root);
> bdrv_graph_wrunlock();
Best Regards,
Fiona
* Re: [PATCH v3 2/4] block-backend: fix edge case in bdrv_next() where BDS associated to BB changes
2024-06-04 7:58 ` Fiona Ebner
@ 2024-06-04 15:28 ` Kevin Wolf
2024-06-05 14:14 ` Fiona Ebner
0 siblings, 1 reply; 17+ messages in thread
From: Kevin Wolf @ 2024-06-04 15:28 UTC (permalink / raw)
To: Fiona Ebner
Cc: qemu-devel, qemu-block, qemu-stable, hreitz, fam, stefanha,
t.lamprecht, w.bumiller
Am 04.06.2024 um 09:58 hat Fiona Ebner geschrieben:
> Am 03.06.24 um 18:21 schrieb Kevin Wolf:
> > Am 03.06.2024 um 16:17 hat Fiona Ebner geschrieben:
> >> Am 26.03.24 um 13:44 schrieb Kevin Wolf:
> >>>
> >>> The fix for bdrv_flush_all() is probably to make it bdrv_co_flush_all()
> >>> with a coroutine wrapper so that the graph lock is held for the whole
> >>> function. Then calling bdrv_co_flush() while iterating the list is safe
> >>> and doesn't allow concurrent graph modifications.
> >>
> >> The second is that iotest 255 ran into an assertion failure upon QMP 'quit':
> >>
> >>> ../block/graph-lock.c:113: bdrv_graph_wrlock: Assertion `!qemu_in_coroutine()' failed.
> >>
> >> Looking at the backtrace:
> >>
> >>> #5 0x0000762a90cc3eb2 in __GI___assert_fail
> >>> (assertion=0x5afb07991e7d "!qemu_in_coroutine()", file=0x5afb07991e00 "../block/graph-lock.c", line=113, function=0x5afb07991f20 <__PRETTY_FUNCTION__.4> "bdrv_graph_wrlock")
> >>> at ./assert/assert.c:101
> >>> #6 0x00005afb07585311 in bdrv_graph_wrlock () at ../block/graph-lock.c:113
> >>> #7 0x00005afb07573a36 in blk_remove_bs (blk=0x5afb0af99420) at ../block/block-backend.c:901
> >>> #8 0x00005afb075729a7 in blk_delete (blk=0x5afb0af99420) at ../block/block-backend.c:487
> >>> #9 0x00005afb07572d88 in blk_unref (blk=0x5afb0af99420) at ../block/block-backend.c:547
> >>> #10 0x00005afb07572fe8 in bdrv_next (it=0x762a852fef00) at ../block/block-backend.c:618
> >>> #11 0x00005afb0758cd65 in bdrv_co_flush_all () at ../block/io.c:2347
> >>> #12 0x00005afb0753ba37 in bdrv_co_flush_all_entry (opaque=0x7ffff12c6050) at block/block-gen.c:1391
> >>> #13 0x00005afb0773bf41 in coroutine_trampoline (i0=168365184, i1=23291)
> >>
> >> So I guess calling bdrv_next() is not safe from a coroutine, because
> >> the function doing the iteration could end up being the last thing to
> >> have a reference for the BB.
> >
> > Does your bdrv_co_flush_all() take the graph (reader) lock? If so, this
> > is surprising, because while we hold the graph lock, no reference should
> > be able to go away - you need the writer lock for that and you won't get
> > it as long as bdrv_co_flush_all() locks the graph. So whatever had a
> > reference before the bdrv_next() loop must still have it now. Do you
> > know where it gets dropped?
> >
>
> AFAICT, yes, it does hold the graph reader lock. The generated code is:
>
> > static void coroutine_fn bdrv_co_flush_all_entry(void *opaque)
> > {
> > BdrvFlushAll *s = opaque;
> >
> > bdrv_graph_co_rdlock();
> > s->ret = bdrv_co_flush_all();
> > bdrv_graph_co_rdunlock();
> > s->poll_state.in_progress = false;
> >
> > aio_wait_kick();
> > }
>
> Apparently when the mirror job is aborted/exits, which can happen during
> the polling for bdrv_co_flush_all_entry(), a reference can go away
> without the write lock (at least my breakpoints didn't trigger) being held:
>
> > #0 blk_unref (blk=0x5cdefe943d20) at ../block/block-backend.c:537
> > #1 0x00005cdefb26697e in mirror_exit_common (job=0x5cdefeb53000) at ../block/mirror.c:710
> > #2 0x00005cdefb263575 in mirror_abort (job=0x5cdefeb53000) at ../block/mirror.c:823
> > #3 0x00005cdefb2248a6 in job_abort (job=0x5cdefeb53000) at ../job.c:825
> > #4 0x00005cdefb2245f2 in job_finalize_single_locked (job=0x5cdefeb53000) at ../job.c:855
> > #5 0x00005cdefb223852 in job_completed_txn_abort_locked (job=0x5cdefeb53000) at ../job.c:958
> > #6 0x00005cdefb223714 in job_completed_locked (job=0x5cdefeb53000) at ../job.c:1065
> > #7 0x00005cdefb224a8b in job_exit (opaque=0x5cdefeb53000) at ../job.c:1088
> > #8 0x00005cdefb4134fc in aio_bh_call (bh=0x5cdefe7487c0) at ../util/async.c:171
> > #9 0x00005cdefb4136ce in aio_bh_poll (ctx=0x5cdefd9cd750) at ../util/async.c:218
> > #10 0x00005cdefb3efdfd in aio_poll (ctx=0x5cdefd9cd750, blocking=true) at ../util/aio-posix.c:722
> > #11 0x00005cdefb20435e in bdrv_poll_co (s=0x7ffe491621d8) at ../block/block-gen.h:43
> > #12 0x00005cdefb206a33 in bdrv_flush_all () at block/block-gen.c:1410
> > #13 0x00005cdefae5c8ed in do_vm_stop (state=RUN_STATE_SHUTDOWN, send_stop=false)
> > at ../system/cpus.c:297
> > #14 0x00005cdefae5c850 in vm_shutdown () at ../system/cpus.c:308
> > #15 0x00005cdefae6d892 in qemu_cleanup (status=0) at ../system/runstate.c:871
> > #16 0x00005cdefb1a7e78 in qemu_default_main () at ../system/main.c:38
> > #17 0x00005cdefb1a7eb8 in main (argc=34, argv=0x7ffe491623a8) at ../system/main.c:48
>
> Looking at the code in mirror_exit_common(), it doesn't seem to acquire
> a write lock:
>
> > bdrv_graph_rdunlock_main_loop();
> >
> > /*
> > * Remove target parent that still uses BLK_PERM_WRITE/RESIZE before
> > * inserting target_bs at s->to_replace, where we might not be able to get
> > * these permissions.
> > */
> > blk_unref(s->target);
> > s->target = NULL;
>
> The write lock is taken in blk_remove_bs() when the refcount drops to 0
> and the BB is actually removed:
>
> > bdrv_graph_wrlock();
> > bdrv_root_unref_child(root);
> > bdrv_graph_wrunlock();
Ah, so it _would_ take the writer lock (and wait for bdrv_flush_all() to
finish) if it were the last reference, but because bdrv_next() took
another reference, it can just decrease the refcount without modifying
the graph.
Does this mean that if we just remove the whole blk/bdrv_ref/unref()
code from bdrv_next(), which is what the final state should look like
anyway, it would actually work?
Of course, we then need to audit the callers of bdrv_next() to make sure
that all of them really keep the graph lock and don't poll during the
whole iteration like they are supposed to (i.e. nobody else violates the
rules the same way bdrv_flush_all() does).
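As an illustration, an audited caller would have to look roughly like this
(sketch only, the function name is made up):

static void audit_example_iterate_nodes(void)
{
    BdrvNextIterator it;
    BlockDriverState *bs;

    /* hold the reader lock for the entire iteration ... */
    GRAPH_RDLOCK_GUARD_MAINLOOP();

    for (bs = bdrv_first(&it); bs; bs = bdrv_next(&it)) {
        /*
         * ... and only do things here that neither poll nor take the
         * writer lock, so the graph cannot change under the iterator.
         */
    }
}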
Kevin
* Re: [PATCH v3 2/4] block-backend: fix edge case in bdrv_next() where BDS associated to BB changes
2024-06-04 15:28 ` Kevin Wolf
@ 2024-06-05 14:14 ` Fiona Ebner
0 siblings, 0 replies; 17+ messages in thread
From: Fiona Ebner @ 2024-06-05 14:14 UTC (permalink / raw)
To: Kevin Wolf
Cc: qemu-devel, qemu-block, qemu-stable, hreitz, fam, stefanha,
t.lamprecht, w.bumiller
Am 04.06.24 um 17:28 schrieb Kevin Wolf:
> Am 04.06.2024 um 09:58 hat Fiona Ebner geschrieben:
>> Am 03.06.24 um 18:21 schrieb Kevin Wolf:
>>> Am 03.06.2024 um 16:17 hat Fiona Ebner geschrieben:
>>>> Am 26.03.24 um 13:44 schrieb Kevin Wolf:
>>>>>
>>>>> The fix for bdrv_flush_all() is probably to make it bdrv_co_flush_all()
>>>>> with a coroutine wrapper so that the graph lock is held for the whole
>>>>> function. Then calling bdrv_co_flush() while iterating the list is safe
>>>>> and doesn't allow concurrent graph modifications.
>>>>
>>>> The second is that iotest 255 ran into an assertion failure upon QMP 'quit':
>>>>
>>>>> ../block/graph-lock.c:113: bdrv_graph_wrlock: Assertion `!qemu_in_coroutine()' failed.
>>>>
>>>> Looking at the backtrace:
>>>>
>>>>> #5 0x0000762a90cc3eb2 in __GI___assert_fail
>>>>> (assertion=0x5afb07991e7d "!qemu_in_coroutine()", file=0x5afb07991e00 "../block/graph-lock.c", line=113, function=0x5afb07991f20 <__PRETTY_FUNCTION__.4> "bdrv_graph_wrlock")
>>>>> at ./assert/assert.c:101
>>>>> #6 0x00005afb07585311 in bdrv_graph_wrlock () at ../block/graph-lock.c:113
>>>>> #7 0x00005afb07573a36 in blk_remove_bs (blk=0x5afb0af99420) at ../block/block-backend.c:901
>>>>> #8 0x00005afb075729a7 in blk_delete (blk=0x5afb0af99420) at ../block/block-backend.c:487
>>>>> #9 0x00005afb07572d88 in blk_unref (blk=0x5afb0af99420) at ../block/block-backend.c:547
>>>>> #10 0x00005afb07572fe8 in bdrv_next (it=0x762a852fef00) at ../block/block-backend.c:618
>>>>> #11 0x00005afb0758cd65 in bdrv_co_flush_all () at ../block/io.c:2347
>>>>> #12 0x00005afb0753ba37 in bdrv_co_flush_all_entry (opaque=0x7ffff12c6050) at block/block-gen.c:1391
>>>>> #13 0x00005afb0773bf41 in coroutine_trampoline (i0=168365184, i1=23291)
>>>>
>>>> So I guess calling bdrv_next() is not safe from a coroutine, because
>>>> the function doing the iteration could end up being the last thing to
>>>> have a reference for the BB.
>>>
>>> Does your bdrv_co_flush_all() take the graph (reader) lock? If so, this
>>> is surprising, because while we hold the graph lock, no reference should
>>> be able to go away - you need the writer lock for that and you won't get
>>> it as long as bdrv_co_flush_all() locks the graph. So whatever had a
>>> reference before the bdrv_next() loop must still have it now. Do you
>>> know where it gets dropped?
>>>
>>
>> AFAICT, yes, it does hold the graph reader lock. The generated code is:
>>
>>> static void coroutine_fn bdrv_co_flush_all_entry(void *opaque)
>>> {
>>> BdrvFlushAll *s = opaque;
>>>
>>> bdrv_graph_co_rdlock();
>>> s->ret = bdrv_co_flush_all();
>>> bdrv_graph_co_rdunlock();
>>> s->poll_state.in_progress = false;
>>>
>>> aio_wait_kick();
>>> }
>>
>> Apparently when the mirror job is aborted/exits, which can happen during
>> the polling for bdrv_co_flush_all_entry(), a reference can go away
>> without the write lock (at least my breakpoints didn't trigger) being held:
>>
>>> #0 blk_unref (blk=0x5cdefe943d20) at ../block/block-backend.c:537
>>> #1 0x00005cdefb26697e in mirror_exit_common (job=0x5cdefeb53000) at ../block/mirror.c:710
>>> #2 0x00005cdefb263575 in mirror_abort (job=0x5cdefeb53000) at ../block/mirror.c:823
>>> #3 0x00005cdefb2248a6 in job_abort (job=0x5cdefeb53000) at ../job.c:825
>>> #4 0x00005cdefb2245f2 in job_finalize_single_locked (job=0x5cdefeb53000) at ../job.c:855
>>> #5 0x00005cdefb223852 in job_completed_txn_abort_locked (job=0x5cdefeb53000) at ../job.c:958
>>> #6 0x00005cdefb223714 in job_completed_locked (job=0x5cdefeb53000) at ../job.c:1065
>>> #7 0x00005cdefb224a8b in job_exit (opaque=0x5cdefeb53000) at ../job.c:1088
>>> #8 0x00005cdefb4134fc in aio_bh_call (bh=0x5cdefe7487c0) at ../util/async.c:171
>>> #9 0x00005cdefb4136ce in aio_bh_poll (ctx=0x5cdefd9cd750) at ../util/async.c:218
>>> #10 0x00005cdefb3efdfd in aio_poll (ctx=0x5cdefd9cd750, blocking=true) at ../util/aio-posix.c:722
>>> #11 0x00005cdefb20435e in bdrv_poll_co (s=0x7ffe491621d8) at ../block/block-gen.h:43
>>> #12 0x00005cdefb206a33 in bdrv_flush_all () at block/block-gen.c:1410
>>> #13 0x00005cdefae5c8ed in do_vm_stop (state=RUN_STATE_SHUTDOWN, send_stop=false)
>>> at ../system/cpus.c:297
>>> #14 0x00005cdefae5c850 in vm_shutdown () at ../system/cpus.c:308
>>> #15 0x00005cdefae6d892 in qemu_cleanup (status=0) at ../system/runstate.c:871
>>> #16 0x00005cdefb1a7e78 in qemu_default_main () at ../system/main.c:38
>>> #17 0x00005cdefb1a7eb8 in main (argc=34, argv=0x7ffe491623a8) at ../system/main.c:48
>>
>> Looking at the code in mirror_exit_common(), it doesn't seem to acquire
>> a write lock:
>>
>>> bdrv_graph_rdunlock_main_loop();
>>>
>>> /*
>>> * Remove target parent that still uses BLK_PERM_WRITE/RESIZE before
>>> * inserting target_bs at s->to_replace, where we might not be able to get
>>> * these permissions.
>>> */
>>> blk_unref(s->target);
>>> s->target = NULL;
>>
>> The write lock is taken in blk_remove_bs() when the refcount drops to 0
>> and the BB is actually removed:
>>
>>> bdrv_graph_wrlock();
>>> bdrv_root_unref_child(root);
>>> bdrv_graph_wrunlock();
>
> Ah, so it _would_ take the writer lock (and wait for bdrv_flush_all() to
> finish) if it were the last reference, but because bdrv_next() took
> another reference, it can just decrease the refcount without modifying
> the graph.
>
> Does this mean that if we just remove the whole blk/bdrv_ref/unref()
> code from bdrv_next(), which is what the final state should look like
> anyway, it would actually work?
I tried this and removed those calls. Unfortunately, I wasn't able to
run into the case where blk_unref() via job_exit() waits in
bdrv_graph_wrlock(), because there is a blk_drain() call before
blk_delete() and in my testing, it always waited there, in
bdrv_do_drained_begin()'s
> BDRV_POLL_WHILE(bs, bdrv_drain_poll_top_level(bs, parent));
> Of course, we then need to audit the callers of bdrv_next() to make sure
> that all of them really keep the graph lock and don't poll during the
> whole iteration like they are supposed to (i.e. nobody else violates the
> rules the same way bdrv_flush_all() does).
Note that below, none of the functions I list at the top level (i.e.
after a number) are coroutines.
1. hmp_info_snapshots() in block/monitor/block-hmp-cmds.c
2. vm_completion() in migration/migration-hmp-cmds.c
both call bdrv_can_snapshot() which calls the coroutine wrapper
bdrv_is_inserted() and thus polls
3. bdrv_flush_all() in block/io.c:
calls the coroutine wrapper bdrv_flush() and thus polls, the one that
led us here ;)
4. bdrv_all_get_snapshot_devices() in block/snapshot.c:
marked as GRAPH_RDLOCK, only calls g_list_append() in the loop body, so
is fine :)
5. bdrv_activate_all() in block.c:
calls bdrv_activate() which can (e.g. after outgoing migration) call the
(manual) coroutine wrappers bdrv_invalidate_cache() and
bdrv_refresh_total_sectors() and thus poll
6. bdrv_inactivate_all() in block.c:
calls bdrv_inactivate_recurse() which for qcow2, calls
qcow2_inactivate() which calls qcow2_cache_flush() which calls the
coroutine wrapper bdrv_flush() and thus polls
My list of code paths leading to polling here is probably not even
complete, but it's already clear that there is some work left before no
callers of bdrv_next() violate the rules.
Best Regards,
Fiona