* [PATCH 0/3] block/graph-lock: Disable locking for now
@ 2023-05-17 15:28 Kevin Wolf
2023-05-17 15:28 ` [PATCH 1/3] graph-lock: " Kevin Wolf
` (2 more replies)
0 siblings, 3 replies; 7+ messages in thread
From: Kevin Wolf @ 2023-05-17 15:28 UTC (permalink / raw)
To: qemu-block
Cc: kwolf, hreitz, stefanha, mjt, eblake, pbonzini, qemu-devel,
qemu-stable
tl;dr is that graph locking introduced deadlocks in 8.0, and disabling
it for now fixes them again. See patch 1 for the details.
I still intend to fix this properly before we remove the AioContext
lock (which is when the deadlock would be automatically solved), but
it's not trivial enough for something that would be ready now and
backportable to stable versions. Let's try the real thing again in 8.1
and fix 8.0 with this stopgap solution.
Patch 2 is a prerequisite for the test case. Without it, instead of
reproducing the deadlock problem (which the test unfortunately doesn't do
reliably anyway; the timing seems hard to get right), I got NBD server
crashes. I actually made some more NBD changes to fix the crashes before
this one, but things seem to be stable with only this patch. Maybe the
rest only fixed symptoms of the same root cause; I'll have another look
at them.
Kevin Wolf (3):
graph-lock: Disable locking for now
nbd/server: Fix drained_poll to wake coroutine in right AioContext
iotests: Test commit with iothreads and ongoing I/O
include/io/channel.h | 10 ++++
block/graph-lock.c | 21 +++++++
io/channel.c | 33 +++++++++--
nbd/server.c | 3 +-
tests/qemu-iotests/iotests.py | 4 ++
.../qemu-iotests/tests/graph-changes-while-io | 56 +++++++++++++++++--
.../tests/graph-changes-while-io.out | 4 +-
7 files changed, 117 insertions(+), 14 deletions(-)
--
2.40.1
* [PATCH 1/3] graph-lock: Disable locking for now
2023-05-17 15:28 [PATCH 0/3] block/graph-lock: Disable locking for now Kevin Wolf
@ 2023-05-17 15:28 ` Kevin Wolf
2023-05-18 12:39 ` Eric Blake
2023-05-17 15:28 ` [PATCH 2/3] nbd/server: Fix drained_poll to wake coroutine in right AioContext Kevin Wolf
2023-05-17 15:28 ` [PATCH 3/3] iotests: Test commit with iothreads and ongoing I/O Kevin Wolf
2 siblings, 1 reply; 7+ messages in thread
From: Kevin Wolf @ 2023-05-17 15:28 UTC (permalink / raw)
To: qemu-block
Cc: kwolf, hreitz, stefanha, mjt, eblake, pbonzini, qemu-devel,
qemu-stable
In QEMU 8.0, we've been seeing deadlocks in bdrv_graph_wrlock(). They
come from callers that hold an AioContext lock, which is not allowed
during polling. In theory, we could temporarily release the lock, but
callers are inconsistent about whether they hold a lock, and if they do,
some are also confused about which one they hold. While all of this is
fixable, it's not trivial, and the best course of action for 8.0.1 is
probably just disabling the graph locking code temporarily.
We don't currently rely on graph locking yet. It is supposed to replace
the AioContext lock eventually to enable multiqueue support, but as long
as we still have the AioContext lock, it is sufficient without the graph
lock. Once the AioContext lock goes away, the deadlock doesn't exist any
more either and this commit can be reverted. (Of course, it can also be
reverted while the AioContext lock still exists if the callers have been
fixed.)
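For illustration, a rough sketch of the problematic pattern (the caller
function and its argument names are made up; the real callers are
various graph-modifying operations):

    #include "qemu/osdep.h"
    #include "block/aio.h"
    #include "block/graph-lock.h"

    /*
     * Hypothetical caller, simplified: the main thread takes an
     * iothread's AioContext lock and then polls inside
     * bdrv_graph_wrlock(). One way the deadlock can play out is that,
     * with the lock held across the polling loop, the iothread cannot
     * make progress, so reader_count() never drops to 0 and the writer
     * polls forever.
     */
    static void example_graph_change(AioContext *iothread_ctx)
    {
        aio_context_acquire(iothread_ctx);  /* lock held while polling */
        bdrv_graph_wrlock();                /* polls for readers */

        /* ... modify the block graph here ... */

        bdrv_graph_wrunlock();
        aio_context_release(iothread_ctx);
    }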
Cc: qemu-stable@nongnu.org
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
block/graph-lock.c | 21 +++++++++++++++++++++
1 file changed, 21 insertions(+)
diff --git a/block/graph-lock.c b/block/graph-lock.c
index 9c42bd9799..9806fd4ecb 100644
--- a/block/graph-lock.c
+++ b/block/graph-lock.c
@@ -30,8 +30,10 @@ BdrvGraphLock graph_lock;
/* Protects the list of aiocontext and orphaned_reader_count */
static QemuMutex aio_context_list_lock;
+#if 0
/* Written and read with atomic operations. */
static int has_writer;
+#endif
/*
* A reader coroutine could move from an AioContext to another.
@@ -88,6 +90,7 @@ void unregister_aiocontext(AioContext *ctx)
g_free(ctx->bdrv_graph);
}
+#if 0
static uint32_t reader_count(void)
{
BdrvGraphRWlock *brdv_graph;
@@ -105,10 +108,17 @@ static uint32_t reader_count(void)
assert((int32_t)rd >= 0);
return rd;
}
+#endif
void bdrv_graph_wrlock(void)
{
GLOBAL_STATE_CODE();
+ /*
+ * TODO Some callers hold an AioContext lock when this is called, which
+ * causes deadlocks. Reenable once the AioContext locking is cleaned up (or
+ * AioContext locks are gone).
+ */
+#if 0
assert(!qatomic_read(&has_writer));
/* Make sure that constantly arriving new I/O doesn't cause starvation */
@@ -139,11 +149,13 @@ void bdrv_graph_wrlock(void)
} while (reader_count() >= 1);
bdrv_drain_all_end();
+#endif
}
void bdrv_graph_wrunlock(void)
{
GLOBAL_STATE_CODE();
+#if 0
QEMU_LOCK_GUARD(&aio_context_list_lock);
assert(qatomic_read(&has_writer));
@@ -155,10 +167,13 @@ void bdrv_graph_wrunlock(void)
/* Wake up all coroutine that are waiting to read the graph */
qemu_co_enter_all(&reader_queue, &aio_context_list_lock);
+#endif
}
void coroutine_fn bdrv_graph_co_rdlock(void)
{
+ /* TODO Reenable when wrlock is reenabled */
+#if 0
BdrvGraphRWlock *bdrv_graph;
bdrv_graph = qemu_get_current_aio_context()->bdrv_graph;
@@ -218,10 +233,12 @@ void coroutine_fn bdrv_graph_co_rdlock(void)
qemu_co_queue_wait(&reader_queue, &aio_context_list_lock);
}
}
+#endif
}
void coroutine_fn bdrv_graph_co_rdunlock(void)
{
+#if 0
BdrvGraphRWlock *bdrv_graph;
bdrv_graph = qemu_get_current_aio_context()->bdrv_graph;
@@ -239,6 +256,7 @@ void coroutine_fn bdrv_graph_co_rdunlock(void)
if (qatomic_read(&has_writer)) {
aio_wait_kick();
}
+#endif
}
void bdrv_graph_rdlock_main_loop(void)
@@ -264,5 +282,8 @@ void assert_bdrv_graph_readable(void)
void assert_bdrv_graph_writable(void)
{
assert(qemu_in_main_thread());
+ /* TODO Reenable when wrlock is reenabled */
+#if 0
assert(qatomic_read(&has_writer));
+#endif
}
--
2.40.1
* [PATCH 2/3] nbd/server: Fix drained_poll to wake coroutine in right AioContext
2023-05-17 15:28 [PATCH 0/3] block/graph-lock: Disable locking for now Kevin Wolf
2023-05-17 15:28 ` [PATCH 1/3] graph-lock: " Kevin Wolf
@ 2023-05-17 15:28 ` Kevin Wolf
2023-05-18 12:43 ` Eric Blake
2023-05-17 15:28 ` [PATCH 3/3] iotests: Test commit with iothreads and ongoing I/O Kevin Wolf
2 siblings, 1 reply; 7+ messages in thread
From: Kevin Wolf @ 2023-05-17 15:28 UTC (permalink / raw)
To: qemu-block
Cc: kwolf, hreitz, stefanha, mjt, eblake, pbonzini, qemu-devel,
qemu-stable
nbd_drained_poll() generally runs in the main thread, not whatever
iothread the NBD server coroutine is meant to run in, so it can't
directly reenter the coroutines to wake them up.
The code seems to have the right intention: it specifies the correct
AioContext when it calls qemu_aio_coroutine_enter(). However, this
function doesn't schedule the coroutine to run in that AioContext; it
assumes that it is already being called in the home thread of that
AioContext.
To fix this, add a new thread-safe qio_channel_wake_read() that can be
called in the main thread to wake up the coroutine in its AioContext,
and use this in nbd_drained_poll().
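For clarity, a sketch of the difference (simplified from the code below,
not a complete function):

    /*
     * Old code: qemu_aio_coroutine_enter() runs the coroutine in the
     * current thread. When the main thread calls it for a coroutine
     * whose home is an iothread, that is the wrong thread.
     */
    qemu_aio_coroutine_enter(exp->common.ctx, client->recv_coroutine);

    /*
     * New helper: atomically claim the coroutine that yielded in
     * qio_channel_yield() and let aio_co_wake() either enter it
     * directly (if we are already in its home thread) or schedule it
     * onto its own AioContext.
     */
    Coroutine *co = qatomic_xchg(&ioc->read_coroutine, NULL);
    if (co) {
        aio_co_wake(co);
    }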
Cc: qemu-stable@nongnu.org
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
include/io/channel.h | 10 ++++++++++
io/channel.c | 33 +++++++++++++++++++++++++++------
nbd/server.c | 3 +--
3 files changed, 38 insertions(+), 8 deletions(-)
diff --git a/include/io/channel.h b/include/io/channel.h
index 446a566e5e..229bf36910 100644
--- a/include/io/channel.h
+++ b/include/io/channel.h
@@ -757,6 +757,16 @@ void qio_channel_detach_aio_context(QIOChannel *ioc);
void coroutine_fn qio_channel_yield(QIOChannel *ioc,
GIOCondition condition);
+/**
+ * qio_channel_wake_read:
+ * @ioc: the channel object
+ *
+ * If qio_channel_yield() is currently waiting for the channel to become
+ * readable, interrupt it and reenter immediately. This function is safe to call
+ * from any thread.
+ */
+void qio_channel_wake_read(QIOChannel *ioc);
+
/**
* qio_channel_wait:
* @ioc: the channel object
diff --git a/io/channel.c b/io/channel.c
index 375a130a39..72f0066af5 100644
--- a/io/channel.c
+++ b/io/channel.c
@@ -19,6 +19,7 @@
*/
#include "qemu/osdep.h"
+#include "block/aio-wait.h"
#include "io/channel.h"
#include "qapi/error.h"
#include "qemu/main-loop.h"
@@ -514,7 +515,11 @@ int qio_channel_flush(QIOChannel *ioc,
static void qio_channel_restart_read(void *opaque)
{
QIOChannel *ioc = opaque;
- Coroutine *co = ioc->read_coroutine;
+ Coroutine *co = qatomic_xchg(&ioc->read_coroutine, NULL);
+
+ if (!co) {
+ return;
+ }
/* Assert that aio_co_wake() reenters the coroutine directly */
assert(qemu_get_current_aio_context() ==
@@ -525,7 +530,11 @@ static void qio_channel_restart_read(void *opaque)
static void qio_channel_restart_write(void *opaque)
{
QIOChannel *ioc = opaque;
- Coroutine *co = ioc->write_coroutine;
+ Coroutine *co = qatomic_xchg(&ioc->write_coroutine, NULL);
+
+ if (!co) {
+ return;
+ }
/* Assert that aio_co_wake() reenters the coroutine directly */
assert(qemu_get_current_aio_context() ==
@@ -568,7 +577,11 @@ void qio_channel_detach_aio_context(QIOChannel *ioc)
void coroutine_fn qio_channel_yield(QIOChannel *ioc,
GIOCondition condition)
{
+ AioContext *ioc_ctx = ioc->ctx ?: qemu_get_aio_context();
+
assert(qemu_in_coroutine());
+ assert(in_aio_context_home_thread(ioc_ctx));
+
if (condition == G_IO_IN) {
assert(!ioc->read_coroutine);
ioc->read_coroutine = qemu_coroutine_self();
@@ -580,18 +593,26 @@ void coroutine_fn qio_channel_yield(QIOChannel *ioc,
}
qio_channel_set_aio_fd_handlers(ioc);
qemu_coroutine_yield();
+ assert(in_aio_context_home_thread(ioc_ctx));
/* Allow interrupting the operation by reentering the coroutine other than
* through the aio_fd_handlers. */
- if (condition == G_IO_IN && ioc->read_coroutine) {
- ioc->read_coroutine = NULL;
+ if (condition == G_IO_IN) {
+ assert(ioc->read_coroutine == NULL);
qio_channel_set_aio_fd_handlers(ioc);
- } else if (condition == G_IO_OUT && ioc->write_coroutine) {
- ioc->write_coroutine = NULL;
+ } else if (condition == G_IO_OUT) {
+ assert(ioc->write_coroutine == NULL);
qio_channel_set_aio_fd_handlers(ioc);
}
}
+void qio_channel_wake_read(QIOChannel *ioc)
+{
+ Coroutine *co = qatomic_xchg(&ioc->read_coroutine, NULL);
+ if (co) {
+ aio_co_wake(co);
+ }
+}
static gboolean qio_channel_wait_complete(QIOChannel *ioc,
GIOCondition condition,
diff --git a/nbd/server.c b/nbd/server.c
index e239c2890f..2664d43bff 100644
--- a/nbd/server.c
+++ b/nbd/server.c
@@ -1599,8 +1599,7 @@ static bool nbd_drained_poll(void *opaque)
* enter it here so we don't depend on the client to wake it up.
*/
if (client->recv_coroutine != NULL && client->read_yielding) {
- qemu_aio_coroutine_enter(exp->common.ctx,
- client->recv_coroutine);
+ qio_channel_wake_read(client->ioc);
}
return true;
--
2.40.1
* [PATCH 3/3] iotests: Test commit with iothreads and ongoing I/O
2023-05-17 15:28 [PATCH 0/3] block/graph-lock: Disable locking for now Kevin Wolf
2023-05-17 15:28 ` [PATCH 1/3] graph-lock: " Kevin Wolf
2023-05-17 15:28 ` [PATCH 2/3] nbd/server: Fix drained_poll to wake coroutine in right AioContext Kevin Wolf
@ 2023-05-17 15:28 ` Kevin Wolf
2023-05-18 21:00 ` Eric Blake
2 siblings, 1 reply; 7+ messages in thread
From: Kevin Wolf @ 2023-05-17 15:28 UTC (permalink / raw)
To: qemu-block
Cc: kwolf, hreitz, stefanha, mjt, eblake, pbonzini, qemu-devel,
qemu-stable
This test heavily exercises graph locking, draining, and graph
modifications with AioContext switches. Among other things, it serves as
a regression test for bdrv_graph_wrlock() deadlocking when it is called
with a locked AioContext, and for AioContext handling in the NBD server.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
---
tests/qemu-iotests/iotests.py | 4 ++
.../qemu-iotests/tests/graph-changes-while-io | 56 +++++++++++++++++--
.../tests/graph-changes-while-io.out | 4 +-
3 files changed, 58 insertions(+), 6 deletions(-)
diff --git a/tests/qemu-iotests/iotests.py b/tests/qemu-iotests/iotests.py
index 3e82c634cf..7073579a7d 100644
--- a/tests/qemu-iotests/iotests.py
+++ b/tests/qemu-iotests/iotests.py
@@ -462,6 +462,10 @@ def qmp(self, cmd: str, args: Optional[Dict[str, object]] = None) \
assert self._qmp is not None
return self._qmp.cmd(cmd, args)
+ def get_qmp(self) -> QEMUMonitorProtocol:
+ assert self._qmp is not None
+ return self._qmp
+
def stop(self, kill_signal=15):
self._p.send_signal(kill_signal)
self._p.wait()
diff --git a/tests/qemu-iotests/tests/graph-changes-while-io b/tests/qemu-iotests/tests/graph-changes-while-io
index 7664f33689..750e7d4d38 100755
--- a/tests/qemu-iotests/tests/graph-changes-while-io
+++ b/tests/qemu-iotests/tests/graph-changes-while-io
@@ -22,19 +22,19 @@
import os
from threading import Thread
import iotests
-from iotests import imgfmt, qemu_img, qemu_img_create, QMPTestCase, \
- QemuStorageDaemon
+from iotests import imgfmt, qemu_img, qemu_img_create, qemu_io, \
+ QMPTestCase, QemuStorageDaemon
top = os.path.join(iotests.test_dir, 'top.img')
nbd_sock = os.path.join(iotests.sock_dir, 'nbd.sock')
-def do_qemu_img_bench() -> None:
+def do_qemu_img_bench(count: int = 2000000) -> None:
"""
Do some I/O requests on `nbd_sock`.
"""
- qemu_img('bench', '-f', 'raw', '-c', '2000000',
+ qemu_img('bench', '-f', 'raw', '-c', str(count),
f'nbd+unix:///node0?socket={nbd_sock}')
@@ -84,6 +84,54 @@ class TestGraphChangesWhileIO(QMPTestCase):
bench_thr.join()
+ def test_commit_while_io(self) -> None:
+ # Run qemu-img bench in the background
+ bench_thr = Thread(target=do_qemu_img_bench, args=(200000, ))
+ bench_thr.start()
+
+ qemu_io('-c', 'write 0 64k', top)
+ qemu_io('-c', 'write 128k 64k', top)
+
+ result = self.qsd.qmp('blockdev-add', {
+ 'driver': imgfmt,
+ 'node-name': 'overlay',
+ 'backing': None,
+ 'file': {
+ 'driver': 'file',
+ 'filename': top
+ }
+ })
+ self.assert_qmp(result, 'return', {})
+
+ result = self.qsd.qmp('blockdev-snapshot', {
+ 'node': 'node0',
+ 'overlay': 'overlay',
+ })
+ self.assert_qmp(result, 'return', {})
+
+ # While qemu-img bench is running, repeatedly commit overlay to node0
+ while bench_thr.is_alive():
+ result = self.qsd.qmp('block-commit', {
+ 'job-id': 'job0',
+ 'device': 'overlay',
+ })
+ self.assert_qmp(result, 'return', {})
+
+ result = self.qsd.qmp('block-job-cancel', {
+ 'device': 'job0',
+ })
+ self.assert_qmp(result, 'return', {})
+
+ cancelled = False
+ while not cancelled:
+ for event in self.qsd.get_qmp().get_events(wait=10.0):
+ if event['event'] != 'JOB_STATUS_CHANGE':
+ continue
+ if event['data']['status'] == 'null':
+ cancelled = True
+
+ bench_thr.join()
+
if __name__ == '__main__':
# Format must support raw backing files
iotests.main(supported_fmts=['qcow', 'qcow2', 'qed'],
diff --git a/tests/qemu-iotests/tests/graph-changes-while-io.out b/tests/qemu-iotests/tests/graph-changes-while-io.out
index ae1213e6f8..fbc63e62f8 100644
--- a/tests/qemu-iotests/tests/graph-changes-while-io.out
+++ b/tests/qemu-iotests/tests/graph-changes-while-io.out
@@ -1,5 +1,5 @@
-.
+..
----------------------------------------------------------------------
-Ran 1 tests
+Ran 2 tests
OK
--
2.40.1
* Re: [PATCH 1/3] graph-lock: Disable locking for now
2023-05-17 15:28 ` [PATCH 1/3] graph-lock: " Kevin Wolf
@ 2023-05-18 12:39 ` Eric Blake
0 siblings, 0 replies; 7+ messages in thread
From: Eric Blake @ 2023-05-18 12:39 UTC (permalink / raw)
To: Kevin Wolf
Cc: qemu-block, hreitz, stefanha, mjt, pbonzini, qemu-devel,
qemu-stable
On Wed, May 17, 2023 at 05:28:32PM +0200, Kevin Wolf wrote:
> In QEMU 8.0, we've been seeing deadlocks in bdrv_graph_wrlock(). They
> come from callers that hold an AioContext lock, which is not allowed
> during polling. In theory, we could temporarily release the lock, but
> callers are inconsistent about whether they hold a lock, and if they do,
> some are also confused about which one they hold. While all of this is
> fixable, it's not trivial, and the best course of action for 8.0.1 is
> probably just disabling the graph locking code temporarily.
>
> We don't currently rely on graph locking yet. It is supposed to replace
> the AioContext lock eventually to enable multiqueue support, but as long
> as we still have the AioContext lock, it is sufficient without the graph
> lock. Once the AioContext lock goes away, the deadlock doesn't exist any
> more either and this commit can be reverted. (Of course, it can also be
> reverted while the AioContext lock still exists if the callers have been
> fixed.)
Makes sense - certainly easier than finding all the lurking bugs,
while we still have the AioContext lock protection.
>
> Cc: qemu-stable@nongnu.org
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
> block/graph-lock.c | 21 +++++++++++++++++++++
> 1 file changed, 21 insertions(+)
Reviewed-by: Eric Blake <eblake@redhat.com>
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org
* Re: [PATCH 2/3] nbd/server: Fix drained_poll to wake coroutine in right AioContext
2023-05-17 15:28 ` [PATCH 2/3] nbd/server: Fix drained_poll to wake coroutine in right AioContext Kevin Wolf
@ 2023-05-18 12:43 ` Eric Blake
0 siblings, 0 replies; 7+ messages in thread
From: Eric Blake @ 2023-05-18 12:43 UTC (permalink / raw)
To: Kevin Wolf
Cc: qemu-block, hreitz, stefanha, mjt, pbonzini, qemu-devel,
qemu-stable
On Wed, May 17, 2023 at 05:28:33PM +0200, Kevin Wolf wrote:
> nbd_drained_poll() generally runs in the main thread, not whatever
> iothread the NBD server coroutine is meant to run in, so it can't
> directly reenter the coroutines to wake them up.
>
> The code seems to have the right intention: it specifies the correct
> AioContext when it calls qemu_aio_coroutine_enter(). However, this
> function doesn't schedule the coroutine to run in that AioContext; it
> assumes that it is already being called in the home thread of that
> AioContext.
>
> To fix this, add a new thread-safe qio_channel_wake_read() that can be
> called in the main thread to wake up the coroutine in its AioContext,
> and use this in nbd_drained_poll().
>
> Cc: qemu-stable@nongnu.org
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
> ---
> include/io/channel.h | 10 ++++++++++
> io/channel.c | 33 +++++++++++++++++++++++++++------
> nbd/server.c | 3 +--
> 3 files changed, 38 insertions(+), 8 deletions(-)
>
Lots of support code...
> +++ b/nbd/server.c
> @@ -1599,8 +1599,7 @@ static bool nbd_drained_poll(void *opaque)
> * enter it here so we don't depend on the client to wake it up.
> */
> if (client->recv_coroutine != NULL && client->read_yielding) {
> - qemu_aio_coroutine_enter(exp->common.ctx,
> - client->recv_coroutine);
> + qio_channel_wake_read(client->ioc);
> }
...for what boils down to a deceptively simple alternative call. But
your reasoning made sense to me, and I can see how your new function
solves the issue.
Reviewed-by: Eric Blake <eblake@redhat.com>
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org
* Re: [PATCH 3/3] iotests: Test commit with iothreads and ongoing I/O
2023-05-17 15:28 ` [PATCH 3/3] iotests: Test commit with iothreads and ongoing I/O Kevin Wolf
@ 2023-05-18 21:00 ` Eric Blake
0 siblings, 0 replies; 7+ messages in thread
From: Eric Blake @ 2023-05-18 21:00 UTC (permalink / raw)
To: Kevin Wolf
Cc: qemu-block, hreitz, stefanha, mjt, pbonzini, qemu-devel,
qemu-stable
On Wed, May 17, 2023 at 05:28:34PM +0200, Kevin Wolf wrote:
> This test heavily exercises graph locking, draining, and graph
> modifications with AioContext switches. Among other things, it serves as
> a regression test for bdrv_graph_wrlock() deadlocking when it is called
> with a locked AioContext, and for AioContext handling in the NBD server.
>
> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
I've now confirmed the following setups with just './check
graph-changes-while-io -qcow2':
patch 3 alone => test fails with wrlock assertion (good, we're
catching the 8.0 regression where the new assertion failure is
tripping)
patch 1 and 3 => test fails differently than patch 3 alone (good,
we're exposing the fact that NBD had a pre-existing bug, regardless of
whether the added rwlock made it easier to spot)
patch 2 and 3 => test passes (good, patch 2 appears to have fixed this
particular bug, and when we are ready to revert patch 1 because we get
rid of AioContext locking we'll still be okay)
patch 1, 2, and 3 => test passes (good, we fixed the NBD bug, and have
regression testing in place for a scenario that previously wasn't
getting good testing)
As such, I'm happy to supply:
Tested-by: Eric Blake <eblake@redhat.com>
Now on to the patch itself...
> ---
> tests/qemu-iotests/iotests.py | 4 ++
> .../qemu-iotests/tests/graph-changes-while-io | 56 +++++++++++++++++--
> .../tests/graph-changes-while-io.out | 4 +-
> 3 files changed, 58 insertions(+), 6 deletions(-)
>
> @@ -84,6 +84,54 @@ class TestGraphChangesWhileIO(QMPTestCase):
>
> bench_thr.join()
>
> + def test_commit_while_io(self) -> None:
> + # Run qemu-img bench in the background
> + bench_thr = Thread(target=do_qemu_img_bench, args=(200000, ))
TIL - you can create a 1-item tuple in Python. It caught me
off-guard, but makes sense now that I've re-read it.
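For the record, a tiny illustration (hypothetical snippet, not from the
test) of why the trailing comma matters:

    args = (200000,)        # one-element tuple
    not_a_tuple = (200000)  # just the int 200000 in parentheses
    assert isinstance(args, tuple) and isinstance(not_a_tuple, int)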
> +
> + # While qemu-img bench is running, repeatedly commit overlay to node0
> + while bench_thr.is_alive():
> + result = self.qsd.qmp('block-commit', {
> + 'job-id': 'job0',
> + 'device': 'overlay',
> + })
> + self.assert_qmp(result, 'return', {})
> +
> + result = self.qsd.qmp('block-job-cancel', {
> + 'device': 'job0',
> + })
> + self.assert_qmp(result, 'return', {})
> +
> + cancelled = False
> + while not cancelled:
> + for event in self.qsd.get_qmp().get_events(wait=10.0):
The updated test took about 34 seconds on my machine; long enough that
it is rightfully not part of './check -g quick', but still a reasonable
runtime, and I had no problems reproducing the issue while the test was
running.
> + if event['event'] != 'JOB_STATUS_CHANGE':
> + continue
> + if event['data']['status'] == 'null':
> + cancelled = True
> +
> + bench_thr.join()
> +
It feels a bit odd that the test is skipped during './check -nbd', yet
it IS utilizing nbd, and the fix in patch 2 is indeed in NBD code.
But that's not a flaw in the test itself, just a limitation of what
images we need in order to set up the NBD service in a way to trigger
the problem.
Reviewed-by: Eric Blake <eblake@redhat.com>
I'm in a spot where I can quickly queue this through my NBD tree so we
can get it backported to 8.0.1; pull request coming up, provided the
full series passes a few more tests on my end.
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org